1
Gjelsvik EL, Tøndel K. Increased interpretation of deep learning models using hierarchical cluster-based modelling. PLoS One 2023; 18:e0295251. [PMID: 38060472 PMCID: PMC10703235 DOI: 10.1371/journal.pone.0295251] [Received: 08/18/2023] [Accepted: 11/20/2023]
Abstract
Linear prediction models based on data with large inhomogeneity or abrupt non-linearities often perform poorly because relationships between groups in the data dominate the model. Given that the data is locally linear, this can be overcome by splitting the data into smaller clusters and creating a local model within each cluster. In this study, the previously published Hierarchical Cluster-based Partial Least Squares Regression (HC-PLSR) procedure was extended to deep learning, in order to increase the interpretability of the deep learning models through local modelling. Hierarchical Cluster-based Convolutional Neural Networks (HC-CNNs), Hierarchical Cluster-based Recurrent Neural Networks (HC-RNNs) and Hierarchical Cluster-based Support Vector Regression models (HC-SVRs) were implemented and tested on spectroscopic data consisting of Fourier Transform Infrared (FT-IR) measurements of raw material dry films, for prediction of average molecular weight during hydrolysis, and on a simulated data set constructed to contain three clusters of observations with different non-linear relationships between the independent variables and the response. HC-CNN, HC-RNN and HC-SVR outperformed HC-PLSR for the simulated data set, showing the disadvantage of PLSR for highly non-linear data, but for the FT-IR data set there was little to gain in prediction ability from using more complex models than HC-PLSR. Local modelling can ease the interpretation of deep learning models by highlighting differences in feature importance between different regions of the input or output space. Our results showed clear differences between the feature importance for the various local models, demonstrating the advantages of a local modelling approach with regard to interpretation of deep learning models.
Affiliation(s)
- Elise Lunde Gjelsvik
- Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway
- Kristin Tøndel
- Faculty of Science and Technology, Norwegian University of Life Sciences, Aas, Norway
2
Translating unusual computational methods to drug discovery: taking advantage of work in other fields. Future Med Chem 2019; 11:157-160. [PMID: 30762434 DOI: 10.4155/fmc-2018-0287]
3
Liu J, Ning X. Differential Compound Prioritization via Bidirectional Selectivity Push with Power. J Chem Inf Model 2017; 57:2958-2975. [PMID: 29178784 DOI: 10.1021/acs.jcim.7b00552]
Affiliation(s)
- Junfeng Liu
- Indiana University - Purdue University Indianapolis, 723 West Michigan Street, SL 280, Indianapolis, Indiana 46202, United States
- Xia Ning
- Indiana University - Purdue University Indianapolis, 723 West Michigan Street, SL 280, Indianapolis, Indiana 46202, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 West 10th Street, HITS 5000, Indianapolis, Indiana 46202, United States
4
Liu J, Ning X. Multi-Assay-Based Compound Prioritization via Assistance Utilization: A Machine Learning Framework. J Chem Inf Model 2017; 57:484-498. [PMID: 28234477 DOI: 10.1021/acs.jcim.6b00737]
Affiliation(s)
- Junfeng Liu
- Indiana University-Purdue University, Indianapolis, 723 West Michigan St., SL 280, Indianapolis, Indiana 46202, United States
- Xia Ning
- Indiana University-Purdue University, Indianapolis, 723 West Michigan St., SL 280, Indianapolis, Indiana 46202, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 West 10th St., HITS 5000, Indianapolis, Indiana 46202, United States
5
Torell F, Bennett K, Cereghini S, Rännar S, Lundstedt-Enkel K, Moritz T, Haumaitre C, Trygg J, Lundstedt T. Multi-Organ Contribution to the Metabolic Plasma Profile Using Hierarchical Modelling. PLoS One 2015; 10:e0129260. [PMID: 26086868 PMCID: PMC4472231 DOI: 10.1371/journal.pone.0129260] [Received: 08/20/2014] [Accepted: 05/06/2015]
Abstract
Hierarchical modelling was applied in order to identify the organs that contribute to the levels of metabolites in plasma. Plasma and organ samples from gut, kidney, liver, muscle and pancreas were obtained from mice. The samples were analysed using gas chromatography time-of-flight mass spectrometry (GC TOF-MS) at the Swedish Metabolomics centre, Umeå University, Sweden. The multivariate analysis was performed by means of principal component analysis (PCA) and orthogonal projections to latent structures (OPLS). The main goal of this study was to investigate how each organ contributes to the metabolic plasma profile. This was performed using hierarchical modelling. Each organ was found to have a unique metabolic profile. The hierarchical modelling showed that the gut, kidney and liver demonstrated the greatest contribution to the metabolic pattern of plasma. For example, we found that metabolites were absorbed in the gut and transported to the plasma. The kidneys excrete branched chain amino acids (BCAAs) and fatty acids are transported in the plasma to the muscles and liver. Lactic acid was also found to be transported from the pancreas to plasma. The results indicated that hierarchical modelling can be utilized to identify the organ contribution of unknown metabolites to the metabolic profile of plasma.
Affiliation(s)
- Frida Torell
- Computational Life Science Cluster (CLiC), Department of Chemistry, Umeå University, Umeå, Sweden
- Karlsruhe Institute of Technology, Karlsruhe, Germany
- Silvia Cereghini
- CNRS, UMR7622, 75005, Paris, France
- Sorbonne Universités, UPMC, UMR7622, 75005, Paris, France
- Inserm U-1156, Paris, France
- Cecile Haumaitre
- CNRS, UMR7622, 75005, Paris, France
- Sorbonne Universités, UPMC, UMR7622, 75005, Paris, France
- Inserm U-1156, Paris, France
- Johan Trygg
- Computational Life Science Cluster (CLiC), Department of Chemistry, Umeå University, Umeå, Sweden
6
Wang Y, Guo Y, Kuang Q, Pu X, Ji Y, Zhang Z, Li M. A comparative study of family-specific protein-ligand complex affinity prediction based on random forest approach. J Comput Aided Mol Des 2014; 29:349-60. [PMID: 25527073 DOI: 10.1007/s10822-014-9827-y] [Received: 09/10/2014] [Accepted: 12/16/2014]
Abstract
The assessment of binding affinity between ligands and the target proteins plays an essential role in the drug discovery and design process. As an alternative to widely used scoring approaches, machine learning methods have also been proposed for fast prediction of the binding affinity with promising results, but most of them were developed as all-purpose models despite the specific functions of different protein families, even though proteins from different functional families always have different structures and physicochemical features. In this study, we proposed a random forest method to predict the protein-ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were implemented separately for the different protein family datasets, which indicated that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. As a comparison, two generic models including diverse protein families were also built. The evaluation results show that models on family-specific datasets outperform those on the generic datasets; the Pearson and Spearman correlation coefficients (Rp and Rs) on the test sets are 0.740, 0.874, 0.735 and 0.697, 0.853, 0.723 for HIV-1 protease, trypsin and carbonic anhydrase, respectively. Comparisons with the other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way of predicting the affinity for one particular protein family.
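The two coefficients reported in this abstract answer slightly different questions: Pearson measures linear agreement between measured and predicted affinities, while Spearman measures agreement in rank order only. A minimal sketch of both follows, using only the Python standard library. The affinity values are hypothetical toy numbers chosen for illustration, not data from the paper, and the helper names (`pearson`, `ranks`, `spearman`) are assumptions of this sketch.

```python
def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ranks(vals):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    result = [0.0] * len(vals)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical measured and predicted affinities (illustration only).
measured = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.0, 5.5, 6.1, 7.8, 20.0]  # correct ordering, one extreme value

r_p = pearson(measured, predicted)
r_s = spearman(measured, predicted)
print(f"Rp = {r_p:.3f}, Rs = {r_s:.3f}")
```

In this toy case the predictions preserve the ordering exactly, so Rs reaches its maximum, while the single extreme prediction pulls Rp below it. Reporting both, as the abstract does, separates ranking quality from linear calibration.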
Affiliation(s)
- Yu Wang
- College of Chemistry, Sichuan University, Chengdu, 610064, Sichuan, People's Republic of China
7
Hasegawa K, Funatsu K. Evolution of PLS for Modeling SAR and omics Data. Mol Inform 2012; 31:766-75. [DOI: 10.1002/minf.201200090] [Received: 08/13/2012] [Accepted: 10/15/2012]
8
Fan H, Schneidman-Duhovny D, Irwin JJ, Dong G, Shoichet BK, Sali A. Statistical potential for modeling and ranking of protein-ligand interactions. J Chem Inf Model 2011; 51:3078-92. [PMID: 22014038 DOI: 10.1021/ci200377u]
Abstract
Applications in structural biology and medicinal chemistry require protein-ligand scoring functions for two distinct tasks: (i) ranking different poses of a small molecule in a protein binding site and (ii) ranking different small molecules by their complementarity to a protein site. Using probability theory, we developed two atomic distance-dependent statistical scoring functions: PoseScore was optimized for recognizing native binding geometries of ligands from other poses and RankScore was optimized for distinguishing ligands from nonbinding molecules. Both scores are based on a set of 8,885 crystallographic structures of protein-ligand complexes but differ in the values of three key parameters. Factors influencing the accuracy of scoring were investigated, including the maximal atomic distance and non-native ligand geometries used for scoring, as well as the use of protein models instead of crystallographic structures for training and testing the scoring function. For the test set of 19 targets, RankScore improved the ligand enrichment (logAUC) and early enrichment (EF(1)) scores computed by DOCK 3.6 for 13 and 14 targets, respectively. In addition, RankScore performed better at rescoring than each of seven other scoring functions tested. Accepting both the crystal structure and decoy geometries with all-atom root-mean-square errors of up to 2 Å from the crystal structure as correct binding poses, PoseScore gave the best score to a correct binding pose among 100 decoys for 88% of all cases in a benchmark set containing 100 protein-ligand complexes. PoseScore accuracy is comparable to that of DrugScore(CSD) and ITScore/SE and superior to 12 other tested scoring functions. Therefore, RankScore can facilitate ligand discovery, by ranking complexes of the target with different small molecules; PoseScore can be used for protein-ligand complex structure prediction, by ranking different conformations of a given protein-ligand pair. 
The statistical potentials are available through the Integrative Modeling Platform (IMP) software package (http://salilab.org/imp) and the LigScore Web server (http://salilab.org/ligscore/).
Affiliation(s)
- Hao Fan
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, USA
9
Tøndel K, Indahl UG, Gjuvsland AB, Vik JO, Hunter P, Omholt SW, Martens H. Hierarchical cluster-based partial least squares regression (HC-PLSR) is an efficient tool for metamodelling of nonlinear dynamic models. BMC Systems Biology 2011; 5:90. [PMID: 21627852 PMCID: PMC3127793 DOI: 10.1186/1752-0509-5-90] [Received: 12/22/2010] [Accepted: 06/01/2011]
Abstract
Background Deterministic dynamic models of complex biological systems contain a large number of parameters and state variables, related through nonlinear differential equations with various types of feedback. A metamodel of such a dynamic model is a statistical approximation model that maps variation in parameters and initial conditions (inputs) to variation in features of the trajectories of the state variables (outputs) throughout the entire biologically relevant input space. A sufficiently accurate mapping can be exploited both instrumentally and epistemically. Multivariate regression methodology is a commonly used approach for emulating dynamic models. However, when the input-output relations are highly nonlinear or non-monotone, a standard linear regression approach is prone to give suboptimal results. We therefore hypothesised that a more accurate mapping can be obtained by locally linear or locally polynomial regression. We present here a new method for local regression modelling, Hierarchical Cluster-based PLS regression (HC-PLSR), where fuzzy C-means clustering is used to separate the data set into parts according to the structure of the response surface. We compare the metamodelling performance of HC-PLSR with polynomial partial least squares regression (PLSR) and ordinary least squares (OLS) regression on various systems: six different gene regulatory network models with various types of feedback, a deterministic mathematical model of the mammalian circadian clock and a model of the mouse ventricular myocyte function. Results Our results indicate that multivariate regression is well suited for emulating dynamic models in systems biology. The hierarchical approach turned out to be superior to both polynomial PLSR and OLS regression in all three test cases. The advantage, in terms of explained variance and prediction accuracy, was largest in systems with highly nonlinear functional relationships and in systems with positive feedback loops. 
Conclusions HC-PLSR is a promising approach for metamodelling in systems biology, especially for highly nonlinear or non-monotone parameter to phenotype maps. The algorithm can be flexibly adjusted to suit the complexity of the dynamic model behaviour, inviting automation in the metamodelling of complex systems.
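The local-modelling idea in this abstract (cluster the observations, then fit one regression per cluster) can be illustrated with a minimal sketch. This is not HC-PLSR itself. It assumes hard 2-means clustering on a single input variable in place of fuzzy C-means, and ordinary least squares in place of PLSR. The toy data and the helper names (`fit_line`, `two_means_1d`, `rmse`) are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def two_means_1d(xs, iters=20):
    """Hard 2-means clustering in one dimension (stand-in for fuzzy C-means)."""
    centres = [min(xs), max(xs)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1 for x in xs]
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:
                centres[k] = sum(members) / len(members)
    return labels

def rmse(ys, preds):
    """Root-mean-square error between observed and predicted values."""
    return (sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)) ** 0.5

# Hypothetical toy data: two groups of observations with opposite linear trends.
random.seed(0)
xs = [random.uniform(-1.0, -0.2) for _ in range(100)]
xs += [random.uniform(0.2, 1.0) for _ in range(100)]
ys = [abs(x) + random.gauss(0.0, 0.01) for x in xs]

# Global model: a single line through all observations.
a, b = fit_line(xs, ys)
global_rmse = rmse(ys, [a + b * x for x in xs])

# Local models: one line per cluster; each point is predicted by its own cluster's model.
labels = two_means_1d(xs)
models = {}
for k in (0, 1):
    cx = [x for x, lab in zip(xs, labels) if lab == k]
    cy = [y for y, lab in zip(ys, labels) if lab == k]
    models[k] = fit_line(cx, cy)
local_preds = [models[lab][0] + models[lab][1] * x for x, lab in zip(xs, labels)]
local_rmse = rmse(ys, local_preds)

print(f"global RMSE: {global_rmse:.3f}, local RMSE: {local_rmse:.3f}")
```

Both errors are computed on the training data, so the comparison shows goodness of fit only, not predictive ability. The two local slopes have opposite signs, which is the kind of difference between regions that a single global model averages away.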
Affiliation(s)
- Kristin Tøndel
- Centre for Integrative Genetics, Dept. of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1432 Ås, Norway.
10
Ranu S, Singh AK. Novel Method for Pharmacophore Analysis by Examining the Joint Pharmacophore Space. J Chem Inf Model 2011; 51:1106-21. [DOI: 10.1021/ci100503y]
Affiliation(s)
- Sayan Ranu
- Department of Computer Science, University of California, Santa Barbara, Santa Barbara, California, United States
- Ambuj K. Singh
- Department of Computer Science, University of California, Santa Barbara, Santa Barbara, California, United States
11
van Westen GJP, Wegner JK, IJzerman AP, van Vlijmen HWT, Bender A. Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets. MedChemComm 2011. [DOI: 10.1039/c0md00165a]
Abstract
Proteochemometric modeling is founded on the principles of QSAR but is able to benefit from additional information in model training due to the inclusion of target information.
Affiliation(s)
- Gerard J. P. van Westen
- Division of Medicinal Chemistry
- Leiden/Amsterdam Center for Drug Research
- Leiden
- The Netherlands
- Adriaan P. IJzerman
- Division of Medicinal Chemistry
- Leiden/Amsterdam Center for Drug Research
- Leiden
- The Netherlands
- Herman W. T. van Vlijmen
- Division of Medicinal Chemistry
- Leiden/Amsterdam Center for Drug Research
- Leiden
- The Netherlands
- Tibotec BVBA
- A. Bender
- Division of Medicinal Chemistry
- Leiden/Amsterdam Center for Drug Research
- Leiden
- The Netherlands
- Unilever Centre for Molecular Science Informatics
12
Ning X, Karypis G. In silico structure-activity-relationship (SAR) models from machine learning: a review. Drug Dev Res 2010. [DOI: 10.1002/ddr.20410]
13
Strömbergsson H, Lapins M, Kleywegt GJ, Wikberg JES. Towards Proteome-Wide Interaction Models Using the Proteochemometrics Approach. Mol Inform 2010; 29:499-508. [PMID: 27463328 DOI: 10.1002/minf.201000052] [Received: 05/05/2010] [Accepted: 05/25/2010]
Abstract
A proteochemometrics model was induced from all interaction data in the BindingDB database, comprising 7078 protein-ligand complexes in all, with representatives from all major drug target categories. Proteins were represented by alignment-independent sequence descriptors holding information on properties such as hydrophobicity, charge, and secondary structure. Ligands were represented by commonly used QSAR descriptors. The inhibition constant (pKi) values of protein-ligand complexes were discretized into "high" and "low" interaction activity. Different machine-learning techniques were used to induce models relating protein and ligand properties to the interaction activity. The best was decision trees, which gave an accuracy of 80% and an area under the ROC curve of 0.81. The tree pointed to the protein and ligand properties that are relevant for the interaction. As the approach requires neither alignments nor knowledge of protein 3D structures, virtually all available protein-ligand interaction data could be utilized, thus opening a way to completely general interaction models that may span entire proteomes.
Affiliation(s)
- Helena Strömbergsson
- The Linnaeus Centre for Bioinformatics, Department of Cell and Molecular Biology, Biomedical Centre, Box 598, SE-751 24, Uppsala, Sweden.
- Maris Lapins
- Department of Pharmaceutical Pharmacology, Biomedical Centre, Box 591, SE-751 24 Uppsala, Sweden
- Gerard J Kleywegt
- Department of Cell and Molecular Biology, Biomedical Centre, Box 596, SE-751 24, Uppsala, Sweden
- Jarl E S Wikberg
- Department of Pharmaceutical Pharmacology, Biomedical Centre, Box 591, SE-751 24 Uppsala, Sweden
14
Das S, Krein MP, Breneman CM. Binding affinity prediction with property-encoded shape distribution signatures. J Chem Inf Model 2010; 50:298-308. [PMID: 20095526 PMCID: PMC2846646 DOI: 10.1021/ci9004139]
Abstract
We report the use of the molecular signatures known as "property-encoded shape distributions" (PESD) together with standard support vector machine (SVM) techniques to produce validated models that can predict the binding affinity of a large number of protein ligand complexes. This "PESD-SVM" method uses PESD signatures that encode molecular shapes and property distributions on protein and ligand surfaces as features to build SVM models that require no subjective feature selection. A simple protocol was employed for tuning the SVM models during their development, and the results were compared to SFCscore, a regression-based method that was previously shown to perform better than 14 other scoring functions. Although the PESD-SVM method is based on only two surface property maps, the overall results were comparable. For most complexes with a dominant enthalpic contribution to binding (ΔH/−TΔS > 3), a good correlation between true and predicted affinities was observed. Entropy and solvent were not considered in the present approach, and further improvement in accuracy would require accounting for these components rigorously.
Affiliation(s)
- Sourav Das
- Department of Chemistry & Chemical Biology, Rensselaer Polytechnic Institute, 110-8th Street, Troy, NY 12180
- Michael P. Krein
- Department of Chemistry & Chemical Biology, Rensselaer Polytechnic Institute, 110-8th Street, Troy, NY 12180
- Curt M. Breneman
- Department of Chemistry & Chemical Biology / RECCR Center Rensselaer Polytechnic Institute, 110-8th Street, Center for Biotechnology and Interdisciplinary Studies, Troy, NY 12180, Phone Number: 518-276-2678, Fax Number: 518-276-4887
15
Bian X, Cai W, Shao X, Chen D, Grant ER. Detecting influential observations by cluster analysis and Monte Carlo cross-validation. Analyst 2010; 135:2841-7. [DOI: 10.1039/c0an00345j]
16
Ning X, Rangwala H, Karypis G. Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets. J Chem Inf Model 2009; 49:2444-56. [DOI: 10.1021/ci900182q]
Affiliation(s)
- Xia Ning
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
- Huzefa Rangwala
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
- George Karypis
- Department of Computer Science and Computer Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, Minnesota 55455 and Department of Computer Science, George Mason University, 4400 University Drive MSN 4A5, Fairfax, Virginia 22030
17
Abstract
BACKGROUND Chemogenomics is an emerging inter-disciplinary approach to drug discovery that combines traditional ligand-based approaches with biological information on drug targets and lies at the interface of chemistry, biology and informatics. The ultimate goal in chemogenomics is to understand molecular recognition between all possible ligands and all possible drug targets. Protein and ligand space have previously been studied as separate entities, but chemogenomics studies deal with large datasets that cover parts of the joint protein-ligand space. Since drug discovery has traditionally focused on ligand optimization, the chemical space has been studied extensively. The protein space has been studied to some extent, typically for the purpose of classification of proteins into functional and structural classes. Since chemogenomics deals not only with ligands but also with the macromolecules the ligands interact with, it is of interest to find means to explore, compare and visualize protein-ligand subspaces. RESULTS Two chemogenomics protein-ligand interaction datasets were prepared for this study. The first dataset covers the known structural protein-ligand space, and includes all non-redundant protein-ligand interactions found in the worldwide Protein Data Bank (PDB). The second dataset contains all approved drugs and drug targets stored in the DrugBank database, and represents the approved drug-drug target space. To capture biological and physicochemical features of the chemogenomics datasets, sequence-based descriptors were computed for the proteins, and 0, 1 and 2 dimensional descriptors for the ligands. Principal component analysis (PCA) was used to analyze the multidimensional data and to create global models of protein-ligand space. The nearest neighbour method, computed using the principal components, was used to obtain a measure of overlap between the datasets. 
CONCLUSION In this study, we present an approach to visualize protein-ligand spaces from a chemogenomics perspective, where both ligand and protein features are taken into account. The method can be applied to any protein-ligand interaction dataset. Here, the approach is applied to analyze the structural protein-ligand space and the protein-ligand space of all approved drugs and their targets. We show that this approach can be used to visualize and compare chemogenomics datasets, and possibly to identify cross-interaction complexes in protein-ligand space.
Affiliation(s)
- Helena Strömbergsson
- Department of Cell and Molecular Biology/The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden
- Gerard J Kleywegt
- Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
18
Li S, Xi L, Wang C, Li J, Lei B, Liu H, Yao X. A novel method for protein-ligand binding affinity prediction and the related descriptors exploration. J Comput Chem 2009; 30:900-9. [DOI: 10.1002/jcc.21078]
19
Strömbergsson H, Daniluk P, Kryshtafovych A, Fidelis K, Wikberg JES, Kleywegt GJ, Hvidsten TR. Interaction model based on local protein substructures generalizes to the entire structural enzyme-ligand space. J Chem Inf Model 2008; 48:2278-88. [PMID: 18937438 DOI: 10.1021/ci800200e]
Abstract
Chemogenomics is a new strategy in in silico drug discovery, where the ultimate goal is to understand molecular recognition for all molecules interacting with all proteins in the proteome. To study such cross interactions, methods that can generalize over proteins that vary greatly in sequence, structure, and function are needed. We present a general quantitative approach to protein-ligand binding affinity prediction that spans the entire structural enzyme-ligand space. The model was trained on a data set composed of all available enzymes cocrystallized with druglike ligands, taken from four publicly available interaction databases, for which a crystal structure is available. Each enzyme was characterized by a set of local descriptors of protein structure that describe the binding site of the cocrystallized ligand. The ligands in the training set were described by traditional QSAR descriptors. To evaluate the model, a comprehensive test set consisting of enzyme structures and ligands was manually curated. The test set contained enzyme-ligand complexes for which no crystal structures were available, and thus the binding modes were unknown. The test set enzymes were therefore characterized by matching their entire structures to the local descriptor library constructed from the training set. Both the training and the test set contained enzyme-ligand complexes from all major enzyme classes, and the enzymes spanned a large range of sequences and folds. The experimental binding affinities (pKi) ranged from 0.5 to 11.9 (0.7-11.0 in the test set). The induced model predicted the binding affinities of the external test set enzyme-ligand complexes with an r² of 0.53 and an RMSEP of 1.5. This demonstrates that the use of local descriptors makes it possible to create rough predictive models that can generalize over a wide range of protein targets.
Affiliation(s)
- Helena Strömbergsson
- The Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden, Department of Biophysics, Faculty of Physics, University of Warsaw, Warsaw, Poland
20
Andersson CD, Thysell E, Lindström A, Bylesjö M, Raubacher F, Linusson A. A Multivariate Approach to Investigate Docking Parameters' Effects on Docking Performance. J Chem Inf Model 2007; 47:1673-87. [PMID: 17559207 DOI: 10.1021/ci6005596]
Abstract
Increasingly powerful docking programs for analyzing and estimating the strength of protein-ligand interactions have been developed in recent decades, and they are now valuable tools in drug discovery. Software used to perform dockings relies on a number of parameters that affect various steps in the docking procedure. However, identifying the best choices of the settings for these parameters is often challenging. Therefore, the settings of the parameters are quite often left at their default values, even though scientists with long experience with a specific docking tool know that modifying certain parameters can improve the results. In the study presented here, we have used statistical experimental design and subsequent regression based on root-mean-square deviation values using partial least-square projections to latent structures (PLS) to scrutinize the effects of different parameters on the docking performance of two software packages: FRED and GOLD. Protein-ligand complexes with a high level of ligand diversity were selected from the PDBbind database for the study, using principal component analysis based on 1D and 2D descriptors, and space-filling design. The PLS models showed quantitative relationships between the docking parameters and the ability of the programs to reproduce the ligand crystallographic conformation. The PLS models also revealed which of the parameters and what parameter settings were important for the docking performance of the two programs. Furthermore, the variation in docking results obtained with specific parameter settings for different protein-ligand complexes in the diverse set examined indicates that there is great potential for optimizing the parameter settings for selected sets of proteins.
|
21
|
Rosania GR, Crippen G, Woolf P, States D, Shedden K. A Cheminformatic Toolkit for Mining Biomedical Knowledge. Pharm Res 2007; 24:1791-802. [PMID: 17385012 DOI: 10.1007/s11095-007-9285-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Received: 01/11/2007] [Accepted: 02/27/2007] [Indexed: 01/31/2023]
Abstract
PURPOSE: Cheminformatics can be broadly defined to encompass any activity related to the application of information technology to the study of properties, effects and uses of chemical agents. One of the most important current challenges in cheminformatics is to allow researchers to search databases of biomedical knowledge, using chemical structures as input.
MATERIALS AND METHODS: An important step towards this goal was the establishment of PubChem, an open, centralized database of small molecules accessible through the World Wide Web. While PubChem is primarily intended to serve as a repository for high throughput screening data from federally-funded screening centers and academic research laboratories, the major impact of PubChem could also reside in its ability to serve as a chemical gateway to biomedical databases such as PubMed.
CONCLUSION: This article will review cheminformatic tools that can be applied to facilitate annotation of PubChem through links to the scientific literature; to integrate PubChem with transcriptomic, proteomic, and metabolomic datasets; to incorporate results of numerical simulations of physiological systems into PubChem annotation; and ultimately, to translate data of chemical genomics screening efforts into information that will benefit biomedical researchers and physician scientists across all therapeutic areas.
Affiliation(s)
- Gus R Rosania
- Department of Pharmaceutical Sciences, University of Michigan College of Pharmacy, 428 Church Street, Ann Arbor, MI 48109, USA.
|
22
|
Yuan H, Wang Y, Cheng Y. Local and Global Quantitative Structure-Activity Relationship Modeling and Prediction for the Baseline Toxicity. J Chem Inf Model 2006; 47:159-69. [PMID: 17238261 DOI: 10.1021/ci600299j] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.0] [Indexed: 11/28/2022]
Abstract
The predictive accuracy of the model is of greatest concern to computational chemists in quantitative structure-activity relationship (QSAR) investigations. It is hypothesized that a model based on analogous chemicals will exhibit better predictive performance than one derived from diverse compounds. This paper develops a novel scheme called "clustering first, and then modeling" to build local QSAR models for the subsets resulting from clustering of the training set according to structural similarity. For validation and prediction, the validation set and test set were first classified into the corresponding subsets, just as for the training set, and then the prediction was performed by the relevant local model for each subset. This approach was validated on two independent data sets by local modeling and prediction of the baseline toxicity for the fathead minnow. In this process, hierarchical clustering was employed for cluster analysis, k-nearest neighbor for classification, and partial least squares for the model generation. The statistical results indicated that the predictive performances of the local models based on the subsets were much superior to those of the global model based on the whole training set, which was consistent with the hypothesis. The approach proposed here is promising for extension to QSAR modeling for various physicochemical properties, biological activities, and toxicities.
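The "clustering first, and then modeling" workflow described in this abstract can be illustrated with a minimal standard-library Python sketch. It is not the authors' implementation. The synthetic two-descriptor data, the average-linkage criterion, the choice of k = 3, the use of a single latent variable in the PLS step, and all function names (`hierarchical_clusters`, `knn_label`, `pls1_fit`, `pls1_predict`) are assumptions made purely for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def hierarchical_clusters(X, n_clusters):
    """Naive average-linkage agglomerative clustering; returns one label per row."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist(X[i], X[j]) for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the two closest clusters
    labels = [0] * len(X)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels

def knn_label(X_train, labels, x, k=3):
    """Assign x to the majority cluster among its k nearest training points."""
    nearest = sorted(range(len(X_train)), key=lambda i: dist(X_train[i], x))[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

def pls1_fit(X, y):
    """PLS1 with one latent variable (assumption); returns (x_mean, y_mean, w, c)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]  # X^T y
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(r[j] * w[j] for j in range(p)) for r in Xc]  # latent scores
    c = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return x_mean, y_mean, w, c

def pls1_predict(model, x):
    x_mean, y_mean, w, c = model
    return y_mean + c * sum((xj - mj) * wj for xj, mj, wj in zip(x, x_mean, w))

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def make(x0_values, slope, intercept):
    """Synthetic, noise-free data (assumption): two collinear descriptors."""
    X = [[x0, 0.5 * x0] for x0 in x0_values]
    y = [intercept + slope * x0 for x0 in x0_values]
    return X, y

XA, yA = make([i * 0.1 for i in range(10)], 2.0, 0.0)
XB, yB = make([5 + i * 0.1 for i in range(10)], -3.0, 25.0)
X_train, y_train = XA + XB, yA + yB
XtA, ytA = make([0.05, 0.45, 0.85], 2.0, 0.0)
XtB, ytB = make([5.05, 5.45, 5.85], -3.0, 25.0)
X_test, y_test = XtA + XtB, ytA + ytB

# Step 1: cluster the training set. Step 2: fit one local model per cluster.
train_labels = hierarchical_clusters(X_train, n_clusters=2)
local_models = {}
for k in set(train_labels):
    rows = [i for i, lab in enumerate(train_labels) if lab == k]
    local_models[k] = pls1_fit([X_train[i] for i in rows], [y_train[i] for i in rows])

# Step 3: route each test sample to a cluster with kNN, then predict locally.
test_labels = [knn_label(X_train, train_labels, x) for x in X_test]
local_pred = [pls1_predict(local_models[lab], x) for lab, x in zip(test_labels, X_test)]

# Baseline: a single global model fitted to the whole training set.
global_model = pls1_fit(X_train, y_train)
global_pred = [pls1_predict(global_model, x) for x in X_test]

local_rmse, global_rmse = rmse(local_pred, y_test), rmse(global_pred, y_test)
print(f"local RMSE = {local_rmse:.3g}, global RMSE = {global_rmse:.3g}")
```

On this toy data the local models fit almost exactly only because each group is noise-free and linear by construction. The example shows the routing logic (cluster, classify, predict locally) and says nothing about the performance reported in the paper.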
Affiliation(s)
- Hua Yuan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310027, China
|
23
|
Wahlstrom JL, Rock DA, Slatter JG, Wienkers LC. Advances in predicting CYP-mediated drug interactions in the drug discovery setting. Expert Opin Drug Discov 2006; 1:677-91. [DOI: 10.1517/17460441.1.7.677] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
|