1
Barradas-Bautista D, Almajed A, Oliva R, Kalnis P, Cavallo L. Improving classification of correct and incorrect protein-protein docking models by augmenting the training set. Bioinformatics Advances 2023; 3:vbad012. [PMID: 36789292 PMCID: PMC9923443 DOI: 10.1093/bioadv/vbad012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/27/2022] [Revised: 01/20/2023] [Accepted: 02/01/2023] [Indexed: 02/04/2023]
Abstract
Motivation: Protein-protein interactions drive many relevant biological events, such as infection, replication and recognition. To control or engineer such events, we need access to the molecular details of the interaction provided by experimental 3D structures. However, such experiments are slow and expensive; moreover, current technology cannot keep up with the high discovery rate of new interactions. Computational modeling, such as protein-protein docking, can help fill this gap by generating docking poses. Protein-protein docking generally consists of two parts, sampling and scoring. Sampling is an exhaustive search of three-dimensional space. Its caveat is that it generates a large number of incorrect poses, producing a highly unbalanced dataset, which limits the utility of the data for training machine learning classifiers. Results: Using weak supervision, we developed a data augmentation method that we named hAIkal. Using hAIkal, we increased the amount of labeled training data and trained several algorithms. The best of the resulting classifiers has 81% accuracy and a Matthews' correlation coefficient of 0.51 on the test set, surpassing the state-of-the-art scoring functions. Availability and implementation: Docking models from Benchmark 5 are available at https://doi.org/10.5281/zenodo.4012018. Processed tabular data are available at https://repository.kaust.edu.sa/handle/10754/666961. A Google Colab notebook is available at https://colab.research.google.com/drive/1vbVrJcQSf6_C3jOAmZzgQbTpuJ5zC1RP?usp=sharing. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Ali Almajed
- Computer, Electrical and Mathematical Science and Engineering Division, Kaust Extreme Computing Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Romina Oliva
- Department of Sciences and Technologies, University of Naples “Parthenope”, I-80143 Naples, Italy
- Panos Kalnis
- Computer, Electrical and Mathematical Science and Engineering Division, Kaust Extreme Computing Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Luigi Cavallo
- Physical Sciences and Engineering Division, Kaust Catalysis Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
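
The abstract of this entry reports both accuracy (81%) and Matthews' correlation coefficient (0.51), which matters because docking sets are highly unbalanced. The sketch below is not the authors' code. It computes both metrics from confusion-matrix counts, and the counts are hypothetical, chosen only to show why accuracy alone can mislead when incorrect poses dominate.

```python
import math


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all models that were labeled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical unbalanced set: 50 correct poses among 1000 docking models.
# A trivial classifier that labels every model "incorrect" gives:
trivial = dict(tp=0, fp=0, tn=950, fn=50)
print("accuracy:", accuracy(**trivial))
print("MCC:", matthews_corrcoef(**trivial))
```

With these made-up counts the trivial classifier reaches high accuracy while its MCC is zero, which is why a balanced metric is usually reported alongside accuracy for this kind of data.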
2
Yang YX, Wang P, Zhu BT. Relative importance of interface and surface areas in protein-protein binding affinity prediction: A machine learning analysis based on linear regression and artificial neural network. Biophys Chem 2022; 283:106762. [DOI: 10.1016/j.bpc.2022.106762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2021] [Revised: 01/11/2022] [Accepted: 01/14/2022] [Indexed: 11/02/2022]
3
Jankauskaite J, Jiménez-García B, Dapkunas J, Fernández-Recio J, Moal IH. SKEMPI 2.0: an updated benchmark of changes in protein-protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 2019; 35:462-469. [PMID: 30020414 PMCID: PMC6361233 DOI: 10.1093/bioinformatics/bty635] [Citation(s) in RCA: 189] [Impact Index Per Article: 31.5] [Received: 05/16/2018] [Accepted: 07/17/2018] [Indexed: 11/18/2022]
Abstract
Motivation: Understanding the relationship between the sequence, structure, binding energy, binding kinetics and binding thermodynamics of protein–protein interactions is crucial to understanding cellular signaling, the assembly and regulation of molecular complexes, the mechanisms through which mutations lead to disease, and protein engineering. Results: We present SKEMPI 2.0, a major update to our database of binding free energy changes upon mutation for structurally resolved protein–protein interactions. This version now contains manually curated binding data for 7085 mutations, an increase of 133%, including changes in kinetics for 1844 mutations, enthalpy and entropy changes for 443 mutations, and 440 mutations that abolish detectable binding. Availability and implementation: The database is available as supplementary data and at https://life.bsc.es/pid/skempi2/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Justina Jankauskaite
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Brian Jiménez-García
- Barcelona Supercomputing Center (BSC), Barcelona, Spain; Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht University, Utrecht, the Netherlands
- Justas Dapkunas
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
- Juan Fernández-Recio
- Barcelona Supercomputing Center (BSC), Barcelona, Spain; Institut de Biologia Molecular de Barcelona (IBMB), CSIC, Barcelona, Spain
- Iain H Moal
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK
4
Smith JK, Jiang S, Pfaendtner J. Redefining the Protein-Protein Interface: Coarse Graining and Combinatorics for an Improved Understanding of Amino Acid Contributions to the Protein-Protein Binding Affinity. Langmuir: The ACS Journal of Surfaces and Colloids 2017; 33:11511-11517. [PMID: 28850233 DOI: 10.1021/acs.langmuir.7b02438] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
The ability to intervene in biological pathways has for decades been limited by the lack of a quantitative description of protein-protein interactions (PPIs). Herein we generate and compare millions of simple PPI models for insight into the mechanisms of specific recognition and binding. We use a coarse-grained approach whereby amino acids are counted in the interface, and these counts are used as binding affinity predictors. We perform lasso regression, a modern regression technique aimed at interpretability, with every possible amino acid combination (over 10^6 unique feature sets) to select only those amino acid predictors that provide more information than noise. This approach circumvents arbitrary binning and assumptions about the binding environment that obscure other binding affinity models. Aggregated analysis of these models trained at various interfacial cutoff distances informs the roles of specific amino acids in different binding contexts. We find that a simple amino acid count model outperforms detailed intermolecular contact and binned residue type models. We identify the prevalence of serine, glycine, and tryptophan in the interface as particularly important for predicting binding affinity across a range of distance cutoffs. Although current sample size limitations prevent a robust consensus model for binding affinity prediction, our approach underscores the relevance of a residue-based description of the protein-protein interface to increase our understanding of specific interactions.
Affiliation(s)
- Josh K Smith
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Shaoyi Jiang
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Jim Pfaendtner
- Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
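
The abstract of this entry describes counting amino acids in the interface and fitting a model for every possible amino acid combination. The sketch below is an illustration only, not the authors' pipeline. It assumes the feature sets are the non-empty subsets of the 20 standard amino acids (an assumption the abstract does not state explicitly) and shows how a per-residue count vector could be built. The interface string is made up.

```python
from math import comb

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_feature_sets(n_features: int) -> int:
    """Number of non-empty subsets of n_features candidate predictors."""
    return sum(comb(n_features, k) for k in range(1, n_features + 1))


def interface_counts(interface_residues: str) -> dict:
    """Count each standard amino acid in a string of interface residues."""
    return {aa: interface_residues.count(aa) for aa in AMINO_ACIDS}


total = count_feature_sets(len(AMINO_ACIDS))
print("feature sets:", total)  # 2**20 - 1, slightly over one million

# Hypothetical interface sequence, used only to show the count features.
counts = interface_counts("SGWSLKGE")
print({aa: n for aa, n in counts.items() if n > 0})
```

Under that assumption the number of candidate models is 2^20 - 1, which is consistent with the "over 10^6" figure quoted in the abstract.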
5
Moal IH, Barradas-Bautista D, Jiménez-García B, Torchala M, van der Velde A, Vreven T, Weng Z, Bates PA, Fernández-Recio J. IRaPPA: information retrieval based integration of biophysical models for protein assembly selection. Bioinformatics 2017; 33:1806-1813. [PMID: 28200016 PMCID: PMC5783285 DOI: 10.1093/bioinformatics/btx068] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 09/20/2016] [Revised: 01/26/2017] [Accepted: 02/12/2017] [Indexed: 01/23/2023]
Abstract
MOTIVATION: In order to function, proteins frequently bind to one another and form 3D assemblies. Knowledge of the atomic details of these structures helps our understanding of how proteins work together and how mutations can lead to disease, and it facilitates the design of drugs that prevent or mimic the interaction. RESULTS: Atomic modeling of protein-protein interactions requires the selection of near-native structures from a set of docked poses based on their calculable properties. By considering this as an information retrieval problem, we have adapted methods developed for Internet search ranking and electoral voting into IRaPPA, a pipeline integrating biophysical properties. The approach enhances the identification of near-native structures when applied to four docking methods, resulting in a near-native appearing in the top 10 solutions for up to 50% of complexes benchmarked, and up to 70% in the top 100. AVAILABILITY AND IMPLEMENTATION: IRaPPA has been implemented in the SwarmDock server (http://bmm.crick.ac.uk/~SwarmDock/), pyDock server (http://life.bsc.es/pid/pydockrescoring/) and ZDOCK server (http://zdock.umassmed.edu/), with code available on request. CONTACT: moal@ebi.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Iain H Moal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Life Science Department, Joint BSC-IRB Research Program in Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain
- Didier Barradas-Bautista
- Life Science Department, Joint BSC-IRB Research Program in Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain
- Brian Jiménez-García
- Life Science Department, Joint BSC-IRB Research Program in Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain
- Arjan van der Velde
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
- Bioinformatics Program, Boston University, Boston, MA, USA
- Thom Vreven
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
- Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
- Paul A Bates
- Biomolecular Modelling Laboratory, The Francis Crick Institute, London, UK
- Juan Fernández-Recio
- Life Science Department, Joint BSC-IRB Research Program in Computational Biology, Barcelona Supercomputing Center, Barcelona, Spain
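
The abstract of this entry summarizes performance as the share of complexes with a near-native pose in the top 10 or top 100 solutions. The sketch below shows one way such a top-N success rate could be computed. It is not IRaPPA itself, and the complex names and ranked labels are hypothetical.

```python
def top_n_success_rate(rankings: dict, n: int) -> float:
    """Fraction of complexes with at least one near-native pose in the top n.

    rankings maps a complex ID to a list of booleans ordered by rank,
    best-scored pose first, where True marks a near-native pose.
    """
    if not rankings:
        return 0.0
    hits = sum(1 for labels in rankings.values() if any(labels[:n]))
    return hits / len(rankings)


# Hypothetical ranked outputs for four complexes.
toy_rankings = {
    "complex_A": [False, True, False],     # near-native at rank 2
    "complex_B": [False] * 12 + [True],    # near-native at rank 13
    "complex_C": [False] * 20,             # no near-native pose at all
    "complex_D": [True],                   # near-native at rank 1
}

print("top 10:", top_n_success_rate(toy_rankings, 10))
print("top 100:", top_n_success_rate(toy_rankings, 100))
```

Because a complex counts as a success as soon as any near-native pose enters the top N, the rate can only stay the same or grow as N increases.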
6
Chéron JB, Zacharias M, Antonczak S, Fiorucci S. Update of the ATTRACT force field for the prediction of protein-protein binding affinity. J Comput Chem 2017; 38:1887-1890. [PMID: 28580613 DOI: 10.1002/jcc.24836] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 03/01/2017] [Revised: 04/20/2017] [Accepted: 04/22/2017] [Indexed: 12/13/2022]
Abstract
Determining protein-protein interactions is still a major challenge for molecular biology. Docking protocols have come of age in predicting the structure of macromolecular complexes. However, they still lack the accuracy to estimate binding affinities, the thermodynamic quantity that drives the formation of a complex. Here, an updated version of the protein-protein ATTRACT force field aimed at predicting experimental binding affinities is reported. It has been designed on a dataset of 218 protein-protein complexes. The correlation between the experimental and predicted affinities reaches 0.6, outperforming most of the available protocols. Focusing on a subset of rigid and flexible complexes, the performance rises to 0.76 and 0.69, respectively. © 2017 Wiley Periodicals, Inc.
Affiliation(s)
- Jean-Baptiste Chéron
- Université Côte d'Azur, CNRS, Institut de Chimie de Nice UMR7272, 06108 Nice, France
- Martin Zacharias
- Physik-Department T38, Technische Universität München, Garching, Germany; Center for Integrated Protein Science, Munich, 81377, Germany
- Serge Antonczak
- Université Côte d'Azur, CNRS, Institut de Chimie de Nice UMR7272, 06108 Nice, France
- Sébastien Fiorucci
- Université Côte d'Azur, CNRS, Institut de Chimie de Nice UMR7272, 06108 Nice, France
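
The abstract of this entry evaluates the force field by the correlation between experimental and predicted affinities. The abstract does not say which correlation coefficient is used; the sketch below assumes Pearson's r, and the affinity values are invented for illustration.

```python
import math


def pearson_r(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical experimental and predicted binding free energies (kcal/mol).
experimental = [-12.1, -9.8, -10.5, -7.9, -11.3]
predicted = [-11.0, -9.1, -11.2, -8.4, -10.1]
print("r =", round(pearson_r(experimental, predicted), 2))
```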
7
Pfeiffenberger E, Chaleil RA, Moal IH, Bates PA. A machine learning approach for ranking clusters of docked protein-protein complexes by pairwise cluster comparison. Proteins 2017; 85:528-543. [PMID: 27935158 PMCID: PMC5396268 DOI: 10.1002/prot.25218] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 07/25/2016] [Revised: 11/14/2016] [Accepted: 11/21/2016] [Indexed: 01/28/2023]
Abstract
Reliable identification of near-native poses of docked protein-protein complexes is still an unsolved problem. The intrinsic heterogeneity of protein-protein interactions is challenging for traditional biophysical or knowledge-based potentials, and the identification of many false positive binding sites is not unusual. Often, ranking protocols are based on initial clustering of docked poses followed by the application of an energy function to rank each cluster according to its lowest energy member. Here, we present an approach to cluster ranking that relies not on a single molecular descriptor (e.g., an energy function) but on a large number of descriptors integrated in a machine learning model: an extremely randomized tree classifier trained on 109 molecular descriptors. The protocol first locally enriches clusters with additional poses; the clusters are then characterized using features describing the distribution of molecular descriptors within the cluster, which are combined into a pairwise cluster comparison model to discriminate near-native from incorrect clusters. The results show that our approach is able to identify clusters containing near-native protein-protein complexes. In addition, we present an analysis of the descriptors with respect to their power to discriminate near-native from incorrect clusters and how data transformations and recursive feature elimination can improve the ranking performance. © 2016 Wiley Periodicals, Inc.
Affiliation(s)
- Iain H. Moal
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Paul A. Bates
- Biomolecular Modelling Laboratory, The Francis Crick Institute, London NW1 1AT, UK
8
Hamzeh-Mivehroud M, Sokouti B, Dastmalchi S. Molecular Docking at a Glance. Oncology 2017. [DOI: 10.4018/978-1-5225-0549-5.ch030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
This chapter introduces different aspects of the molecular docking technique to give readers an overview of the topics dealt with throughout this volume. Like many other fields of science, molecular docking went through a lag period in which attention from the scientific community and frequency of application increased slowly and steadily, followed by a pronounced era of exponential expansion in theory, methodology, areas of application and performance, driven by developments in related technologies such as computational resources and theoretical as well as experimental biophysical methods. In the following sections the evolution of molecular docking is reviewed, and its different components, including methods, search algorithms, scoring functions, validation of the methods, and areas of application, plus a few case studies, are touched on briefly.
Affiliation(s)
- Siavoush Dastmalchi
- Biotechnology Research Center, Tabriz University of Medical Sciences, Iran & School of Pharmacy, Tabriz University of Medical Sciences, Iran
9
Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, Chaleil R, Jiménez-García B, Bates PA, Fernandez-Recio J, Bonvin AMJJ, Weng Z. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol 2015; 427:3031-41. [PMID: 26231283 PMCID: PMC4677049 DOI: 10.1016/j.jmb.2015.07.016] [Citation(s) in RCA: 275] [Impact Index Per Article: 27.5] [Received: 05/11/2015] [Revised: 07/17/2015] [Accepted: 07/17/2015] [Indexed: 01/31/2023]
Abstract
We present an updated and integrated version of our widely used protein-protein docking and binding affinity benchmarks. The benchmarks consist of non-redundant, high-quality structures of protein-protein complexes along with the unbound structures of their components. Fifty-five new complexes were added to the docking benchmark, 35 of which have experimentally measured binding affinities. These updated docking and affinity benchmarks now contain 230 and 179 entries, respectively. In particular, the number of antibody-antigen complexes has increased significantly, by 67% and 74% in the docking and affinity benchmarks, respectively. We tested previously developed docking and affinity prediction algorithms on the new cases. Considering only the top 10 docking predictions per benchmark case, a prediction accuracy of 38% is achieved on all 55 cases and up to 50% for the 32 rigid-body cases only. Predicted affinity scores are found to correlate with experimental binding energies up to r=0.52 overall and r=0.72 for the rigid complexes.
Affiliation(s)
- Thom Vreven
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
- Iain H Moal
- Joint BSC-CRG-IRB Research Program in Computational Biology, Life Sciences Department, Barcelona Supercomputing Center, C/Jordi Girona 29, 08034 Barcelona, Spain
- Anna Vangone
- Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht University, 3584CH Utrecht, The Netherlands
- Brian G Pierce
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA
- Panagiotis L Kastritis
- Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht University, 3584CH Utrecht, The Netherlands
- Mieczyslaw Torchala
- Biomolecular Modelling Laboratory, The Francis Crick Institute, Lincoln's Inn Fields Laboratory, London WC2A 3LY, United Kingdom
- Raphael Chaleil
- Biomolecular Modelling Laboratory, The Francis Crick Institute, Lincoln's Inn Fields Laboratory, London WC2A 3LY, United Kingdom
- Brian Jiménez-García
- Joint BSC-CRG-IRB Research Program in Computational Biology, Life Sciences Department, Barcelona Supercomputing Center, C/Jordi Girona 29, 08034 Barcelona, Spain
- Paul A Bates
- Biomolecular Modelling Laboratory, The Francis Crick Institute, Lincoln's Inn Fields Laboratory, London WC2A 3LY, United Kingdom
- Juan Fernandez-Recio
- Joint BSC-CRG-IRB Research Program in Computational Biology, Life Sciences Department, Barcelona Supercomputing Center, C/Jordi Girona 29, 08034 Barcelona, Spain
- Alexandre M J J Bonvin
- Bijvoet Center for Biomolecular Research, Faculty of Science, Utrecht University, 3584CH Utrecht, The Netherlands
- Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA