1
Hashemi N, Hao B, Ignatov M, Paschalidis IC, Vakili P, Vajda S, Kozakov D. Improved prediction of MHC-peptide binding using protein language models. Frontiers in Bioinformatics 2023; 3:1207380. [PMID: 37663788; PMCID: PMC10469926; DOI: 10.3389/fbinf.2023.1207380] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/17/2023] [Accepted: 08/04/2023] [Indexed: 09/05/2023] Open
Abstract
Major histocompatibility complex Class I (MHC-I) molecules bind to peptides derived from intracellular antigens and present them on the surface of cells, allowing the immune system (T cells) to detect them. Elucidating the process of this presentation is essential for regulation and potential manipulation of the cellular immune system. Predicting whether a given peptide binds to an MHC molecule is an important step in the above process and has motivated the introduction of many computational approaches to address this problem. NetMHCpan, a pan-specific model for predicting binding of peptides to any MHC molecule, is one of the most widely used methods and focuses on solving this binary classification problem using shallow neural networks. The recent successful results of Deep Learning (DL) methods, especially Natural Language Processing (NLP)-based pretrained models in various applications, including protein structure determination, motivated us to explore their use in this problem. Specifically, we consider the application of deep learning models pretrained on large datasets of protein sequences to predict MHC Class I-peptide binding. Using the standard performance metrics in this area, and the same training and test sets, we show that our models outperform NetMHCpan4.1, currently considered the state of the art.
Affiliation(s)
- Nasser Hashemi
  - Division of Systems Engineering, Boston University, Boston, MA, United States
- Boran Hao
  - Department of Electrical and Computer Engineering, Boston University, Boston, MA, United States
- Mikhail Ignatov
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, United States
- Ioannis Ch. Paschalidis
  - Division of Systems Engineering, Boston University, Boston, MA, United States
  - Department of Electrical and Computer Engineering, Boston University, Boston, MA, United States
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Pirooz Vakili
  - Division of Systems Engineering, Boston University, Boston, MA, United States
- Sandor Vajda
  - Division of Systems Engineering, Boston University, Boston, MA, United States
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
  - Department of Chemistry, Boston University, Boston, MA, United States
- Dima Kozakov
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY, United States
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
2
Plonski AP, Reed SM. Assessing protein homology models with docking reproducibility. J Mol Graph Model 2023; 121:108430. [PMID: 36812741; DOI: 10.1016/j.jmgm.2023.108430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 02/08/2023] [Accepted: 02/10/2023] [Indexed: 02/12/2023]
Abstract
Results of the recent Critical Assessment of Protein Structure Prediction (CASP) competitions demonstrate that protein backbones can be predicted with very high accuracy. In particular, the artificial intelligence methods of AlphaFold 2 from DeepMind were able to produce structures that were similar enough to experimental structures that many described the problem of protein prediction as solved. However, using such structures for drug docking studies requires precision in the placement of side chain atoms as well. Here we built a library of 1334 small molecules and examined how reproducibly they bound to the same site on a protein using QuickVina-W, a branch of the program AutoDock that is optimized for blind searches. We discovered that the higher the backbone quality of the homology model, the greater the similarity between the small molecule docking to the experimental and modeled structures. Furthermore, we found that specific subsets of this library were particularly useful for identifying small differences between the best of the best modeled structures. Specifically, when the number of rotatable bonds in the small molecule increased, differences in binding sites became more apparent.
3
Egbert M, Jones G, Collins MR, Kozakov D, Vajda S. FTMove: A Web Server for Detection and Analysis of Cryptic and Allosteric Binding Sites by Mapping Multiple Protein Structures. J Mol Biol 2022; 434:167587. [PMID: 35662465; PMCID: PMC9789685; DOI: 10.1016/j.jmb.2022.167587] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/11/2021] [Revised: 02/25/2022] [Accepted: 04/07/2022] [Indexed: 12/27/2022]
Abstract
Protein mapping distributes many copies of different molecular probes on the surface of a target protein in order to determine binding hot spots, regions that are highly preferable for ligand binding. While mapping of X-ray structures by the FTMap server is inherently static, this limitation can be overcome by the simultaneous analysis of multiple structures of the protein. FTMove is an automated web server that implements this approach. From the input of a target protein, by PDB code, the server identifies all structures of the protein available in the PDB, runs mapping on them, and combines the results to form binding hot spots and binding sites. The user may also upload their own protein structures, bypassing the PDB search for similar structures. Output of the server consists of the consensus binding sites and the individual mapping results for each structure - including the number of probes located in each binding site, for each structure. This level of detail allows the users to investigate how the strength of a binding site relates to the protein conformation, other binding sites, and the presence of ligands or mutations. In addition, the structures are clustered on the basis of their binding properties. The use of FTMove is demonstrated by application to 22 proteins with known allosteric binding sites; the orthosteric and allosteric binding sites were identified in all but one case, and the sites were typically ranked among the top five. The FTMove server is publicly available at https://ftmove.bu.edu.
Affiliation(s)
- Megan Egbert
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- George Jones
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
- Matthew R Collins
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- Dima Kozakov
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
- Sandor Vajda
  - Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
  - Department of Chemistry, Boston University, Boston, MA 02215, USA
4
Jones G, Jindal A, Ghani U, Kotelnikov S, Egbert M, Hashemi N, Vajda S, Padhorny D, Kozakov D. Elucidation of protein function using computational docking and hotspot analysis by ClusPro and FTMap. Acta Crystallogr D Struct Biol 2022; 78:690-697. [PMID: 35647916; PMCID: PMC9159284; DOI: 10.1107/s2059798322002741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2021] [Accepted: 03/10/2022] [Indexed: 08/30/2024] Open
Abstract
Starting with a crystal structure of a macromolecule, computational structural modeling can help to understand the associated biological processes, structure and function, as well as to reduce the number of further experiments required to characterize a given molecular entity. In the past decade, two classes of powerful automated tools for investigating the binding properties of proteins have been developed: the protein-protein docking program ClusPro and the FTMap and FTSite programs for protein hotspot identification. These methods have been widely used by the research community by means of publicly available online servers, and models built using these automated tools have been reported in a large number of publications. Importantly, additional experimental information can be leveraged to further improve the predictive power of these approaches. Here, an overview of the methods and their biological applications is provided together with a brief interpretation of the results.
Affiliation(s)
- George Jones
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
- Akhil Jindal
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Usman Ghani
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Sergei Kotelnikov
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
- Megan Egbert
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Nasser Hashemi
  - Department of Systems Engineering, Boston University, Boston, Massachusetts, USA
- Sandor Vajda
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Dzmitry Padhorny
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
- Dima Kozakov
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, USA
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
5
Holland J, Grigoryan G. Structure‐conditioned amino‐acid couplings: how contact geometry affects pairwise sequence preferences. Protein Sci 2022; 31:900-917. [PMID: 35060221; PMCID: PMC8927866; DOI: 10.1002/pro.4280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2021] [Revised: 01/06/2022] [Accepted: 01/12/2022] [Indexed: 11/11/2022]
Abstract
Relating a protein's sequence to its conformation is a central challenge for both structure prediction and sequence design. Statistical contact potentials, as well as their more descriptive versions that account for side‐chain orientation and other geometric descriptors, have served as simplistic but useful means of representing second‐order contributions in sequence–structure relationships. Here we ask what happens when a pairwise potential is conditioned on the fully defined geometry of interacting backbone fragments. We show that the resulting structure‐conditioned coupling energies more accurately reflect pair preferences as a function of structural contexts. These structure‐conditioned energies more reliably encode native sequence information and more highly correlate with experimentally determined coupling energies. Clustering a database of interaction motifs by structure results in ensembles of similar energies, and clustering them by energy results in ensembles of similar structures. By comparing many pairs of interaction motifs and showing that structural similarity and energetic similarity go hand‐in‐hand, we provide a tangible link between modular sequence and structure elements. This link is applicable to structural modeling, and we show that scoring CASP models with structure‐conditioned energies results in substantially higher correlation with structural quality than scoring the same models with a contact potential. We conclude that structure‐conditioned coupling energies are a good way to model the impact of interaction geometry on second‐order sequence preferences.
Affiliation(s)
- Jack Holland
  - Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA
- Gevorg Grigoryan
  - Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA
6
Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J. Critical assessment of methods of protein structure prediction (CASP)-Round XIV. Proteins 2021; 89:1607-1617. [PMID: 34533838; PMCID: PMC8726744; DOI: 10.1002/prot.26237] [Citation(s) in RCA: 273] [Impact Index Per Article: 68.3] [Received: 07/23/2021] [Accepted: 07/28/2021] [Indexed: 01/14/2023]
Abstract
Critical assessment of structure prediction (CASP) is a community experiment to advance methods of computing three-dimensional protein structure from amino acid sequence. Core components are rigorous blind testing of methods and evaluation of the results by independent assessors. In the most recent experiment (CASP14), deep-learning methods from one research group consistently delivered computed structures rivaling the corresponding experimental ones in accuracy. In this sense, the results represent a solution to the classical protein-folding problem, at least for single proteins. The models have already been shown to be capable of providing solutions for problematic crystal structures, and there are broad implications for the rest of structural biology. Other research groups also substantially improved performance. Here, we describe these results and outline some of the many implications. Other related areas of CASP, including modeling of protein complexes, structure refinement, estimation of model accuracy, and prediction of inter-residue contacts and distances, are also described.
Affiliation(s)
- Andriy Kryshtafovych
  - Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, USA
- Torsten Schwede
  - University of Basel, Biozentrum & SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Maya Topf
  - Centre for Structural Systems Biology, Leibniz-Institut für Experimentelle Virologie and Universitätsklinikum Hamburg-Eppendorf (UKE), Hamburg, Germany
- Krzysztof Fidelis
  - Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, USA
- John Moult
  - Institute for Bioscience and Biotechnology Research, 9600 Gudelsky Drive, Rockville, MD 20850, USA, Department of Cell Biology and Molecular Genetics, University of Maryland
7
Simpkin AJ, Rodríguez FS, Mesdaghi S, Kryshtafovych A, Rigden DJ. Evaluation of model refinement in CASP14. Proteins 2021; 89:1852-1869. [PMID: 34288138; PMCID: PMC8616799; DOI: 10.1002/prot.26185] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/30/2021] [Revised: 06/19/2021] [Accepted: 07/11/2021] [Indexed: 12/15/2022]
Abstract
We report here an assessment of the model refinement category of the 14th round of Critical Assessment of Structure Prediction (CASP14). As before, predictors submitted up to five ranked refinements, along with associated residue-level error estimates, for targets that had a wide range of starting quality. The ability of groups to accurately rank their submissions and to predict coordinate error varied widely. Overall, only four groups out-performed a "naïve predictor" corresponding to the resubmission of the starting model. Among the top groups, there are interesting differences of approach and in the spread of improvements seen: some methods are more conservative, others more adventurous. Some targets were "double-barreled" for which predictors were offered a high-quality AlphaFold 2 (AF2)-derived prediction alongside another of lower quality. The AF2-derived models were largely unimprovable, many of their apparent errors being found to reside at domain and, especially, crystal lattice contacts. Refinement is shown to have a mixed impact overall on structure-based function annotation methods to predict nucleic acid binding, spot catalytic sites, and dock protein structures.
Affiliation(s)
- Adam J. Simpkin
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Filomeno Sánchez Rodríguez
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
  - Life Science, Diamond Light Source, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0DE, England
- Shahram Mesdaghi
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Daniel J. Rigden
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
8
Kinch LN, Pei J, Kryshtafovych A, Schaeffer RD, Grishin NV. Topology evaluation of models for difficult targets in the 14th round of the critical assessment of protein structure prediction. Proteins 2021; 89:1673-1686. [PMID: 34240477; DOI: 10.1002/prot.26172] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 05/01/2021] [Revised: 06/28/2021] [Accepted: 07/01/2021] [Indexed: 12/25/2022]
Abstract
This report describes the tertiary structure prediction assessment of difficult modeling targets in the 14th round of the Critical Assessment of Structure Prediction (CASP14). We implemented an official ranking scheme that used the same scores as the previous CASP topology-based assessment, but combined these scores with one that emphasized physically realistic models. The top performing AlphaFold2 group outperformed the rest of the prediction community on all but two of the difficult targets considered in this assessment. They provided high quality models for most of the targets (86% over GDT_TS 70), including larger targets above 150 residues, and they correctly predicted the topology of almost all the rest. AlphaFold2 performance was followed by two manual Baker methods, a Feig method that refined Zhang-server models, two notable automated Zhang server methods (QUARK and Zhang-server), and a Zhang manual group. Despite the remarkable progress in protein structure prediction of difficult targets, both the prediction community and AlphaFold2, to a lesser extent, faced challenges with flexible regions and obligate oligomeric assemblies. The official ranking of top-performing methods was supported by performance generated PCA and heatmap clusters that gave insight into target difficulties and the most successful state-of-the-art structure prediction methodologies.
Affiliation(s)
- Lisa N Kinch
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Jimin Pei
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- R Dustin Schaeffer
  - Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V Grishin
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA