1
Ali S, Chourasia P, Patterson M. From PDB files to protein features: a comparative analysis of PDB bind and STCRDAB datasets. Med Biol Eng Comput 2024; 62:2449-2483. [PMID: 38622438 DOI: 10.1007/s11517-024-03074-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 03/13/2024] [Indexed: 04/17/2024]
Abstract
Understanding protein structures is crucial for many areas of bioinformatics research, including drug discovery, disease diagnosis, and evolutionary studies. Protein structure classification is a critical aspect of structural biology, where supervised machine learning algorithms classify structures based on data from databases such as Protein Data Bank (PDB). However, the challenge lies in designing numerical embeddings for protein structures without losing essential information. Although some effort has been made in the literature, researchers have not effectively and rigorously combined the structural and sequence-based features for efficient protein classification to the best of our knowledge. To this end, we propose numerical embeddings that extract relevant features for protein sequences fetched from PDB structures from popular datasets such as PDB Bind and STCRDAB. The features are physicochemical properties such as aromaticity, instability index, flexibility, Grand Average of Hydropathy (GRAVY), isoelectric point, charge at pH, secondary structure fraction, molar extinction coefficient, and molecular weight. We also incorporate scaling features for the sliding windows (e.g., k-mers), which include Kyte and Doolittle (KD) hydropathy scale, Eisenberg hydrophobicity scale, Hydrophilicity scale, Flexibility of the amino acids, and Hydropathy scale. Multiple-feature selection aims to improve the accuracy of protein classification models. The results showed that the selected features significantly improved the predictive performance of existing embeddings.
Affiliation(s)
- Sarwan Ali
- Georgia State University, Atlanta, GA, USA.
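The abstract of entry 1 names several sequence-derived features without showing how any of them is computed. The following is a minimal sketch, not the authors' implementation, of three of them: GRAVY (the mean Kyte-Doolittle hydropathy), aromaticity (taken here as the fraction of F, W, and Y residues), and a sliding-window hydropathy profile. The function names, the window size `k=3`, and the choice to summarize the profile by its minimum and maximum are illustrative assumptions.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = frozenset("FWY")


def gravy(seq: str) -> float:
    """Grand Average of Hydropathy: mean KD value over all residues."""
    return sum(KD_SCALE[aa] for aa in seq) / len(seq)


def aromaticity(seq: str) -> float:
    """Fraction of residues that are Phe, Trp, or Tyr."""
    return sum(aa in AROMATIC for aa in seq) / len(seq)


def window_hydropathy(seq: str, k: int = 3) -> list[float]:
    """Mean KD hydropathy of every length-k window (k-mer) in the sequence."""
    return [
        sum(KD_SCALE[aa] for aa in seq[i:i + k]) / k
        for i in range(len(seq) - k + 1)
    ]


def sequence_features(seq: str, k: int = 3) -> list[float]:
    """Hypothetical fixed-length feature vector; assumes len(seq) >= k."""
    windows = window_hydropathy(seq, k)
    return [gravy(seq), aromaticity(seq), min(windows), max(windows)]
```

Calling `sequence_features("MKTAYIAKQR")` returns four floats regardless of sequence length, which is what lets such features be concatenated with other embeddings before classification.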
2
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
Affiliation(s)
- Murray Patterson
- Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA; (S.A.); (P.C.)
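The abstract of entry 2 builds its structure embedding on contact maps. As a rough illustration of that concept only, the sketch below marks two residues as in contact when their representative atoms lie within a distance cutoff. The use of one coordinate per residue (for example the C-alpha atom), the 8 Å cutoff, and the function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

Coord = tuple[float, float, float]


def contact_map(coords: list[Coord], threshold: float = 8.0) -> list[list[int]]:
    """Binary residue-residue contact matrix.

    coords holds one 3D point per residue (assumed here to be C-alpha atoms).
    Entry [i][j] is 1 when the two points are within `threshold` angstroms.
    """
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= threshold else 0 for j in range(n)]
        for i in range(n)
    ]


def upper_triangle(cmap: list[list[int]]) -> list[int]:
    """Flatten the strict upper triangle; the matrix is symmetric, so the
    lower triangle and the diagonal carry no extra information."""
    n = len(cmap)
    return [cmap[i][j] for i in range(n) for j in range(i + 1, n)]


# Three hypothetical residues placed on a line, in angstroms.
example = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
vector = upper_triangle(contact_map(example))
```

The flattened vector grows quadratically with chain length, so a real pipeline would need padding, pooling, or some other reduction before it could be combined with fixed-length sequence features.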
3
Kilian M, Bischofs IB. Co-evolution at protein-protein interfaces guides inference of stoichiometry of oligomeric protein complexes by de novo structure prediction. Mol Microbiol 2023; 120:763-782. [PMID: 37777474 DOI: 10.1111/mmi.15169] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/31/2023] [Revised: 09/10/2023] [Accepted: 09/11/2023] [Indexed: 10/02/2023]
Abstract
The quaternary structure with specific stoichiometry is pivotal to the specific function of protein complexes. However, determining the structure of many protein complexes experimentally remains a major bottleneck. Structural bioinformatics approaches, such as the deep learning algorithm Alphafold2-multimer (AF2-multimer), leverage the co-evolution of amino acids and sequence-structure relationships for accurate de novo structure and contact prediction. Pseudo-likelihood maximization direct coupling analysis (plmDCA) has been used to detect co-evolving residue pairs by statistical modeling. Here, we provide evidence that combining both methods can be used for de novo prediction of the quaternary structure and stoichiometry of a protein complex. We achieve this by augmenting the existing AF2-multimer confidence metrics with an interpretable score to identify the complex with an optimal fraction of native contacts of co-evolving residue pairs at intermolecular interfaces. We use this strategy to predict the quaternary structure and non-trivial stoichiometries of Bacillus subtilis spore germination protein complexes with unknown structures. Co-evolution at intermolecular interfaces may therefore synergize with AI-based de novo quaternary structure prediction of structurally uncharacterized bacterial protein complexes.
Affiliation(s)
- Max Kilian
- Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
- BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
- Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
- Ilka B Bischofs
- Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
- BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
- Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
4
Gaudreault F, Corbeil CR, Sulea T. Enhanced antibody-antigen structure prediction from molecular docking using AlphaFold2. Sci Rep 2023; 13:15107. [PMID: 37704686 PMCID: PMC10499836 DOI: 10.1038/s41598-023-42090-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Accepted: 09/05/2023] [Indexed: 09/15/2023] Open
Abstract
Predicting the structure of antibody-antigen complexes has tremendous value in biomedical research but unfortunately suffers from a poor performance in real-life applications. AlphaFold2 (AF2) has provided renewed hope for improvements in the field of protein-protein docking but has shown limited success against antibody-antigen complexes due to the lack of co-evolutionary constraints. In this study, we used physics-based protein docking methods for building decoy sets consisting of low-energy docking solutions that were either geometrically close to the native structure (positives) or not (negatives). The docking models were then fed into AF2 to assess their confidence with a novel composite score based on normalized pLDDT and pTMscore metrics after AF2 structural refinement. We show benefits of the AF2 composite score for rescoring docking poses both in terms of (1) classification of positives/negatives and of (2) success rates with particular emphasis on early enrichment. Docking models of at least medium quality present in the decoy set, but not necessarily highly ranked by docking methods, benefitted most from AF2 rescoring by experiencing large advances towards the top of the reranked list of models. These improvements, obtained without any calibration or novel methodologies, led to a notable level of performance in antibody-antigen unbound docking that was never achieved previously.
Affiliation(s)
- Francis Gaudreault
- Human Health Therapeutics Research Centre, National Research Council Canada, 6100 Royalmount Avenue, Montreal, QC, H4P 2R2, Canada
- Christopher R Corbeil
- Human Health Therapeutics Research Centre, National Research Council Canada, 6100 Royalmount Avenue, Montreal, QC, H4P 2R2, Canada
- Traian Sulea
- Human Health Therapeutics Research Centre, National Research Council Canada, 6100 Royalmount Avenue, Montreal, QC, H4P 2R2, Canada.
- Institute of Parasitology, McGill University, 21111 Lakeshore Road, Sainte-Anne-de-Bellevue, QC, H9X 3V9, Canada.
5
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. ARXIV 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Affiliation(s)
- Nazim Bouatta
- Laboratory of Systems Pharmacology, Harvard Medical School
- Daniel Berenberg
- Prescient Design, Genentech & Department of Computer Science, New York University
6
Wu C, Guo D. Computational Docking Reveals Co-Evolution of C4 Carbon Delivery Enzymes in Diverse Plants. Int J Mol Sci 2022; 23:12688. [PMID: 36293547 PMCID: PMC9604239 DOI: 10.3390/ijms232012688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 10/14/2022] [Accepted: 10/19/2022] [Indexed: 11/16/2022] Open
Abstract
Proteins are modular functionalities regulating multiple cellular activities in prokaryotes and eukaryotes. As a consequence of higher plants adapting to arid and thermal conditions, C4 photosynthesis is the carbon fixation process involving multi-enzymes working in a coordinated fashion. However, how these enzymes interact with each other and whether they co-evolve in parallel to maintain interactions in different plants remain elusive to date. Here, we report our findings on the global protein co-evolution relationship and local dynamics of co-varying site shifts in key C4 photosynthetic enzymes. We found that in most of the selected key C4 photosynthetic enzymes, global pairwise co-evolution events exist to form functional couplings. Besides, protein-protein interactions between these enzymes may suggest their unknown functionalities in the carbon delivery process. For PEPC and PPCK regulation pairs, pocket formation at the interactive interface is not necessary for their function. This feature is distinct from another well-known regulation pair in C4 photosynthesis, namely, PPDK and PPDK-RP, where the pockets are necessary. Our findings facilitate the discovery of novel protein regulation types and contribute to expanding our knowledge about C4 photosynthesis.
Affiliation(s)
- Dianjing Guo
- State Key Laboratory of Agrobiotechnology, School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
7
Rodríguez FS, Mesdaghi S, Simpkin AJ, Burgos-Mármol JJ, Murphy DL, Uski V, Keegan RM, Rigden DJ. ConPlot: Web-based application for the visualisation of protein contact maps integrated with other data. Bioinformatics 2021; 37:2763-2765. [PMID: 34499718 PMCID: PMC8428603 DOI: 10.1093/bioinformatics/btab049] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/15/2020] [Revised: 12/18/2020] [Accepted: 01/21/2021] [Indexed: 12/15/2022] Open
Abstract
Summary: Covariance-based predictions of residue contacts and inter-residue distances are an increasingly popular data type in protein bioinformatics. Here we present ConPlot, a web-based application for convenient display and analysis of contact maps and distograms. Integration of predicted contact data with other predictions is often required to facilitate inference of structural features. ConPlot can therefore use the empty space near the contact map diagonal to display multiple coloured tracks representing other sequence-based predictions. Popular file formats are natively read and bespoke data can also be flexibly displayed. This novel visualization will enable easier interpretation of predicted contact maps. Availability and implementation: Available online at www.conplot.org, along with documentation and examples. Alternatively, ConPlot can be installed and used locally using the docker image from the project's Docker Hub repository. ConPlot is licensed under the BSD 3-Clause. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Filomeno Sánchez Rodríguez
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England; Life Science, Diamond Light Source, Harwell Science and Innovation Campus, Oxfordshire OX11 0DE, Didcot, England
- Shahram Mesdaghi
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Adam J Simpkin
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- J Javier Burgos-Mármol
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- David L Murphy
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ville Uski
- UKRI-STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, England
- Ronan M Keegan
- UKRI-STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, England
- Daniel J Rigden
- Institute of Structural, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
8
Correa Marrero M, Immink RGH, de Ridder D, van Dijk ADJ. Improved inference of intermolecular contacts through protein-protein interaction prediction using coevolutionary analysis. Bioinformatics 2020; 35:2036-2042. [PMID: 30398547 DOI: 10.1093/bioinformatics/bty924] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/04/2018] [Revised: 10/11/2018] [Accepted: 11/05/2018] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION: Predicting residue-residue contacts between interacting proteins is an important problem in bioinformatics. The growing wealth of sequence data can be used to infer these contacts through correlated mutation analysis on multiple sequence alignments of interacting homologs of the proteins of interest. This requires correct identification of pairs of interacting proteins for many species, in order to avoid introducing noise (i.e. non-interacting sequences) in the analysis that will decrease predictive performance. RESULTS: We have designed Ouroboros, a novel algorithm to reduce such noise in intermolecular contact prediction. Our method iterates between weighting proteins according to how likely they are to interact based on the correlated mutations signal, and predicting correlated mutations based on the weighted sequence alignment. We show that this approach accurately discriminates between protein interaction versus non-interaction and simultaneously improves the prediction of intermolecular contact residues compared to a naive application of correlated mutation analysis. This requires no training labels concerning interactions or contacts. Furthermore, the method relaxes the assumption of one-to-one interaction of previous approaches, allowing for the study of many-to-many interactions. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available at www.bif.wur.nl/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Richard G H Immink
- Laboratory of Molecular Biology, Department of Plant Sciences; Bioscience, Wageningen Plant Research
- Aalt D J van Dijk
- Bioinformatics Group, Department of Plant Sciences; Bioscience, Wageningen Plant Research; Biometris, Department of Plant Sciences, Wageningen University & Research, Wageningen PB, The Netherlands
9
Chen J, Siu SWI. Machine Learning Approaches for Quality Assessment of Protein Structures. Biomolecules 2020; 10:626. [PMID: 32316682 PMCID: PMC7226485 DOI: 10.3390/biom10040626] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/03/2020] [Revised: 04/07/2020] [Accepted: 04/09/2020] [Indexed: 11/16/2022] Open
Abstract
Protein structures play a very important role in biomedical research, especially in drug discovery and design, which require accurate protein structures in advance. However, experimental determinations of protein structure are prohibitively costly and time-consuming, and computational predictions of protein structures have not been perfected. Methods that assess the quality of protein models can help in selecting the most accurate candidates for further work. Driven by this demand, many structural bioinformatics laboratories have developed methods for estimating model accuracy (EMA). In recent years, EMA methods based on machine learning (ML) have consistently ranked among the top-performing methods in the community-wide CASP challenge. Accordingly, we systematically review all the major ML-based EMA methods developed within the past ten years. The methods are grouped by their employed ML approach (support vector machine, artificial neural networks, ensemble learning, or Bayesian learning), and their significance is discussed from a methodology viewpoint. To orient the reader, we also briefly describe the background of EMA, including the CASP challenge and its evaluation metrics, and introduce the major ML/DL techniques. Overall, this review provides an introductory guide to modern research on protein quality assessment and directions for future research in this area.
10
Geng C, Jung Y, Renaud N, Honavar V, Bonvin AMJJ, Xue LC. iScore: a novel graph kernel-based function for scoring protein-protein docking models. Bioinformatics 2020; 36:112-121. [PMID: 31199455 PMCID: PMC6956772 DOI: 10.1093/bioinformatics/btz496] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 12/18/2018] [Revised: 05/08/2019] [Accepted: 06/11/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Protein complexes play critical roles in many aspects of biological functions. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into structural bases of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determinations of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. RESULTS: Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein-protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. AVAILABILITY AND IMPLEMENTATION: The iScore code is freely available from Github: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567). The docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cunliang Geng
- Bijvoet Center for Biomolecular Research, Faculty of Science – Chemistry, Utrecht University, Utrecht 3584 CH, The Netherlands
- Yong Jung
- Bioinformatics & Genomics Graduate Program, Pennsylvania State University, University Park, PA 16802, USA
- Artificial Intelligence Research Laboratory, Pennsylvania State University, University Park, PA 16823, USA
- Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Nicolas Renaud
- Netherlands eScience Center, Amsterdam 1098 XG, The Netherlands
- Vasant Honavar
- Bioinformatics & Genomics Graduate Program, Pennsylvania State University, University Park, PA 16802, USA
- Artificial Intelligence Research Laboratory, Pennsylvania State University, University Park, PA 16823, USA
- Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA 16802, USA
- Center for Big Data Analytics and Discovery Informatics, Pennsylvania State University, University Park, PA 16823, USA
- Institute for Cyberscience, University Park, PA 16802, USA
- Clinical and Translational Sciences Institute, University Park, PA 16802, USA
- College of Information Sciences & Technology, Pennsylvania State University, University Park, PA 16802, USA
- Alexandre M J J Bonvin
- Bijvoet Center for Biomolecular Research, Faculty of Science – Chemistry, Utrecht University, Utrecht 3584 CH, The Netherlands
- Li C Xue
- Bijvoet Center for Biomolecular Research, Faculty of Science – Chemistry, Utrecht University, Utrecht 3584 CH, The Netherlands
11
Bittrich S, Schroeder M, Labudde D. StructureDistiller: Structural relevance scoring identifies the most informative entries of a contact map. Sci Rep 2019; 9:18517. [PMID: 31811259 PMCID: PMC6898053 DOI: 10.1038/s41598-019-55047-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/19/2019] [Accepted: 11/21/2019] [Indexed: 12/17/2022] Open
Abstract
Protein folding and structure prediction are two sides of the same coin. Contact maps and the related techniques of constraint-based structure reconstruction can be considered as unifying aspects of both processes. We present the Structural Relevance (SR) score which quantifies the information content of individual contacts and residues in the context of the whole native structure. The physical process of protein folding is commonly characterized with spatial and temporal resolution: some residues are Early Folding while others are Highly Stable with respect to unfolding events. We employ the proposed SR score to demonstrate that folding initiation and structure stabilization are subprocesses realized by distinct sets of residues. The example of cytochrome c is used to demonstrate how StructureDistiller identifies the most important contacts needed for correct protein folding. This shows that entries of a contact map are not equally relevant for structural integrity. The proposed StructureDistiller algorithm identifies contacts with the highest information content; these entries convey unique constraints not captured by other contacts. Identification of the most informative contacts effectively doubles resilience toward contacts which are not observed in the native contact map. Furthermore, this knowledge increases reconstruction fidelity on sparse contact maps significantly by 0.4 Å.
Affiliation(s)
- Sebastian Bittrich
- University of Applied Sciences Mittweida, Mittweida, 09648, Germany; Biotechnology Center (BIOTEC), TU Dresden, Dresden, 01307, Germany; Research Collaboratory for Structural Bioinformatics Protein Data Bank, University of California, San Diego, La Jolla, CA, 92093, USA.
- Dirk Labudde
- University of Applied Sciences Mittweida, Mittweida, 09648, Germany
12
Simpkin AJ, Thomas JMH, Simkovic F, Keegan RM, Rigden DJ. Molecular replacement using structure predictions from databases. Acta Crystallogr D Struct Biol 2019; 75:1051-1062. [PMID: 31793899 PMCID: PMC6889911 DOI: 10.1107/s2059798319013962] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 07/22/2019] [Accepted: 10/12/2019] [Indexed: 01/19/2023] Open
Abstract
Molecular replacement (MR) is the predominant route to solution of the phase problem in macromolecular crystallography. Where the lack of a suitable homologue precludes conventional MR, one option is to predict the target structure using bioinformatics. Such modelling, in the absence of homologous templates, is called ab initio or de novo modelling. Recently, the accuracy of such models has improved significantly as a result of the availability, in many cases, of residue-contact predictions derived from evolutionary covariance analysis. Covariance-assisted ab initio models representing structurally uncharacterized Pfam families are now available on a large scale in databases, potentially representing a valuable and easily accessible supplement to the PDB as a source of search models. Here, the unconventional MR pipeline AMPLE is employed to explore the value of structure predictions in the GREMLIN and PconsFam databases. It was tested whether these deposited predictions, processed in various ways, could solve the structures of PDB entries that were subsequently deposited. The results were encouraging: nine of 27 GREMLIN cases were solved, covering target lengths of 109-355 residues and a resolution range of 1.4-2.9 Å, and with target-model shared sequence identity as low as 20%. The cluster-and-truncate approach in AMPLE proved to be essential for most successes. For the overall lower quality structure predictions in the PconsFam database, remodelling with Rosetta within the AMPLE pipeline proved to be the best approach, generating ensemble search models from single-structure deposits. Finally, it is shown that the AMPLE-obtained search models deriving from GREMLIN deposits are of sufficiently high quality to be selected by the sequence-independent MR pipeline SIMBAD. Overall, the results help to point the way towards the optimal use of the expanding databases of ab initio structure predictions.
Affiliation(s)
- Adam J. Simpkin
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Jens M. H. Thomas
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Felix Simkovic
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ronan M. Keegan
- STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, England
- Daniel J. Rigden
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
13
AlQuraishi M. AlphaFold at CASP13. Bioinformatics 2019; 35:4862-4865. [PMID: 31116374 PMCID: PMC6907002 DOI: 10.1093/bioinformatics/btz422] [Citation(s) in RCA: 164] [Impact Index Per Article: 27.3] [Received: 02/15/2019] [Revised: 03/26/2019] [Accepted: 05/15/2019] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: Computational prediction of protein structure from sequence is broadly viewed as a foundational problem of biochemistry and one of the most difficult challenges in bioinformatics. Once every two years the Critical Assessment of protein Structure Prediction (CASP) experiments are held to assess the state of the art in the field in a blind fashion, by presenting predictor groups with protein sequences whose structures have been solved but have not yet been made publicly available. The first CASP was organized in 1994, and the latest, CASP13, took place last December, when for the first time the industrial laboratory DeepMind entered the competition. DeepMind's entry, AlphaFold, placed first in the Free Modeling (FM) category, which assesses methods on their ability to predict novel protein folds (the Zhang group placed first in the Template-Based Modeling (TBM) category, which assesses methods on predicting proteins whose folds are related to ones already in the Protein Data Bank). DeepMind's success generated significant public interest. Their approach builds on two ideas developed in the academic community during the preceding decade: (i) the use of co-evolutionary analysis to map residue co-variation in protein sequence to physical contact in protein structure, and (ii) the application of deep neural networks to robustly identify patterns in protein sequence and co-evolutionary couplings and convert them into contact maps. In this Letter, we contextualize the significance of DeepMind's entry within the broader history of CASP, relate AlphaFold's methodological advances to prior work, and speculate on the future of this important problem.
Affiliation(s)
- Mohammed AlQuraishi
- Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
- Lab of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
14
AlQuraishi M. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics 2019; 20:311. [PMID: 31185886 PMCID: PMC6560865 DOI: 10.1186/s12859-019-2932-0] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 03/18/2019] [Accepted: 06/05/2019] [Indexed: 02/01/2023] Open
Abstract
Background: Rapid progress in deep learning has spurred its application to bioinformatics problems including protein structure prediction and design. In classic machine learning problems like computer vision, progress has been driven by standardized data sets that facilitate fair assessment of new methods and lower the barrier to entry for non-domain experts. While data sets of protein sequence and structure exist, they lack certain components critical for machine learning, including high-quality multiple sequence alignments and insulated training/validation splits that account for deep but only weakly detectable homology across protein space. Results: We created the ProteinNet series of data sets to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships. ProteinNet integrates sequence, structure, and evolutionary information in programmatically accessible file formats tailored for machine learning frameworks. Multiple sequence alignments of all structurally characterized proteins were created using substantial high-performance computing resources. Standardized data splits were also generated to emulate the difficulty of past CASP (Critical Assessment of protein Structure Prediction) experiments by resetting protein sequence and structure space to the historical states that preceded six prior CASPs. Utilizing sensitive evolution-based distance metrics to segregate distantly related proteins, we have additionally created validation sets distinct from the official CASP sets that faithfully mimic their difficulty. Conclusion: ProteinNet represents a comprehensive and accessible resource for training and assessing machine-learned models of protein structure.
Affiliation(s)
- Mohammed AlQuraishi
- Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA.
15
Zlobin A, Suplatov D, Kopylov K, Švedas V. CASBench: A Benchmarking Set of Proteins with Annotated Catalytic and Allosteric Sites in Their Structures. Acta Naturae 2019; 11:74-80. [PMID: 31024751 PMCID: PMC6475866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2017] [Indexed: 10/24/2022] Open
Abstract
In recent years, the phenomenon of allostery has witnessed growing attention driven by a fundamental interest in new ways to regulate the functional properties of proteins, as well as the prospects of using allosteric sites as targets to design novel drugs with lower toxicity due to a higher selectivity of binding and specificity of the mechanism of action. The currently available bioinformatic methods can sometimes correctly detect previously unknown ligand binding sites in protein structures. However, the development of universal and more efficient approaches requires a deeper understanding of the common and distinctive features of the structural organization of both functional (catalytic) and allosteric sites, the evolution of their amino acid sequences in respective protein families, and allosteric communication pathways. The CASBench benchmark set contains 91 entries related to enzymes with both catalytic and allosteric sites within their structures annotated based on the experimental information from the Allosteric Database, Catalytic Site Atlas, and Protein Data Bank. The obtained dataset can be used to benchmark the performance of existing computational approaches and develop/train promising algorithms to search for new catalytic and regulatory sites, as well as to study the mechanisms of protein regulation on a large collection of allosteric enzymes. Establishing a relationship between the structure, function, and regulation is expected to improve our understanding of the mechanisms of action of enzymes and open up new prospects for discovering new drugs and designing more efficient biocatalysts. The CASBench can be operated offline on a local computer or online using built-in interactive tools at https://biokinet.belozersky.msu.ru/casbench.
Affiliation(s)
- A. Zlobin
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology and Faculty of Bioengineering and Bioinformatics, Lenin hills 1, bldg. 73, 119991, Moscow, Russia
- D. Suplatov
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology and Faculty of Bioengineering and Bioinformatics, Lenin hills 1, bldg. 73, 119991, Moscow, Russia
- K. Kopylov
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology and Faculty of Bioengineering and Bioinformatics, Lenin hills 1, bldg. 73, 119991, Moscow, Russia
- V. Švedas
- Lomonosov Moscow State University, Belozersky Institute of Physicochemical Biology and Faculty of Bioengineering and Bioinformatics, Lenin hills 1, bldg. 73, 119991, Moscow, Russia
16
Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018; 34:3308-3315. [PMID: 29718112 PMCID: PMC6157083 DOI: 10.1093/bioinformatics/bty341] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 10/12/2017] [Revised: 03/06/2018] [Accepted: 04/25/2018] [Indexed: 12/22/2022] Open
Abstract
Motivation: In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Results: Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. Availability and implementation: DeepCov is freely available at https://github.com/psipred/DeepCov. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David T Jones
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
- Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
17
Hu J, Liu HF, Sun J, Wang J, Liu R. Integrating co-evolutionary signals and other properties of residue pairs to distinguish biological interfaces from crystal contacts. Protein Sci 2018; 27:1723-1735. [PMID: 29931702 DOI: 10.1002/pro.3448] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 01/22/2018] [Revised: 04/21/2018] [Accepted: 05/16/2018] [Indexed: 12/25/2022]
Abstract
It remains challenging to accurately discriminate between biological and crystal interfaces. Most existing analyses and algorithms focused on the features derived from a single side of the interface. However, less attention has been paid to the properties of residue pairs across protein interfaces. To address this problem, we defined a novel co-evolutionary feature for homodimers through integrating direct coupling analysis and image processing techniques. The residue pairs across biological homodimeric interfaces were significantly enriched in co-evolving residues compared to those across crystal contacts, resulting in a promising classification accuracy with areas under the curve (AUCs) of >0.85. Considering the availability of the co-evolutionary feature, we also designed other residue pair based features that were useful for both homodimers and heterodimers. The most informative residue pairs were identified to reflect the interaction preferences across protein interfaces. Regarding the other extant properties, we designed the new descriptors at the interface residue level as well as at the pairwise contact level. Extensive validation showed that these single properties can be used to identify biological interfaces with AUCs ranging from 0.60 to 0.88. By integrating the co-evolutionary feature with other residue pair based properties, our final prediction model achieved excellent performance with AUCs of >0.91 on different datasets. Compared to existing methods, our algorithm not only yielded better or comparable results but also provided complementary information. An easy-to-use web server is freely accessible at http://liulab.hzau.edu.cn/RPAIAnalyst.
Affiliation(s)
- Jian Hu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China; College of Biomedical Engineering, South-Central University for Nationalities, Wuhan, 430074, P. R. China
- Hui-Fang Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Jun Sun
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Jia Wang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Rong Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
18
Delarue M, Koehl P. Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)? F1000Res 2018; 7. [PMID: 30079234 PMCID: PMC6058471 DOI: 10.12688/f1000research.14870.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Accepted: 07/19/2018] [Indexed: 11/20/2022] Open
Abstract
Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.
Affiliation(s)
- Marc Delarue
- Unité Dynamique Structurale des Macromolécules, Institut Pasteur, and UMR 3528 du CNRS, Paris, France
- Patrice Koehl
- Department of Computer Science, Genome Center, University of California, Davis, Davis, California, USA
19
de Oliveira SHP, Law EC, Shi J, Deane CM. Sequential search leads to faster, more efficient fragment-based de novo protein structure prediction. Bioinformatics 2018; 34:1132-1140. [PMID: 29136098 PMCID: PMC6030820 DOI: 10.1093/bioinformatics/btx722] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 02/17/2017] [Revised: 09/22/2017] [Accepted: 11/04/2017] [Indexed: 01/12/2023] Open
Abstract
Motivation: Most current de novo structure prediction methods randomly sample protein conformations and thus require large amounts of computational resource. Here, we consider a sequential sampling strategy, building on ideas from recent experimental work which shows that many proteins fold cotranslationally. Results: We have investigated whether a pseudo-greedy search approach, which begins sequentially from one of the termini, can improve the performance and accuracy of de novo protein structure prediction. We observed that our sequential approach converges when fewer than 20 000 decoys have been produced, fewer than commonly expected. Using our software, SAINT2, we also compared the run time and quality of models produced in a sequential fashion against a standard, non-sequential approach. Sequential prediction produces an individual decoy 1.5-2.5 times faster than non-sequential prediction. When considering the quality of the best model, sequential prediction led to a better model being produced for 31 out of 41 soluble protein validation cases and for 18 out of 24 transmembrane protein cases. Correct models (TM-Score > 0.5) were produced for 29 of these cases by the sequential mode and for only 22 by the non-sequential mode. Our comparison reveals that a sequential search strategy can be used to drastically reduce computational time of de novo protein structure prediction and improve accuracy. Availability and implementation: Data are available for download from: http://opig.stats.ox.ac.uk/resources. SAINT2 is available for download from: https://github.com/sauloho/SAINT2. Contact: saulo.deoliveira@dtc.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eleanor C Law
- Department of Statistics, University of Oxford, Oxford, UK
- Jiye Shi
- Department of Informatics, UCB Pharma, Slough, UK
- Division of Physical Biology, Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai, China