251
|
Wuyun Q, Zheng W, Peng Z, Yang J. A large-scale comparative assessment of methods for residue-residue contact prediction. Brief Bioinform 2019; 19:219-230. [PMID: 27802931 DOI: 10.1093/bib/bbw106] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2016] [Indexed: 11/14/2022] Open
Abstract
Sequence-based prediction of residue-residue contact in proteins becomes increasingly more important for improving protein structure prediction in the big data era. In this study, we performed a large-scale comparative assessment of 15 locally installed contact predictors. To assess these methods, we collected a big data set consisting of 680 nonredundant proteins covering different structural classes and target difficulties. We investigated a wide range of factors that may influence the precision of contact prediction, including target difficulty, structural class, the alignment depth and distribution of contact pairs in a protein structure. We found that: (1) the machine learning-based methods outperform the direct-coupling-based methods for short-range contact prediction, while the latter are significantly better for long-range contact prediction. The consensus-based methods, which combine machine learning and direct-coupling methods, perform the best. (2) The target difficulty does not have clear influence on the machine learning-based methods, while it does affect the direct-coupling and consensus-based methods significantly. (3) The alignment depth has relatively weak effect on the machine learning-based methods. However, for the direct-coupling-based methods and consensus-based methods, the predicted contacts for targets with deeper alignment tend to be more accurate. (4) All methods perform relatively better on β and α + β proteins than on α proteins. (5) Residues buried in the core of protein structure are more prone to be in contact than residues on the surface (22 versus 6%). We believe these are useful results for guiding future development of new approach to contact prediction.
Collapse
Affiliation(s)
- Qiqige Wuyun
- School of Mathematical Sciences, Nankai University, Tianjin, China
| | - Wei Zheng
- School of Mathematical Sciences, Nankai University, Tianjin, China
| | - Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, China
| | - Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
| |
Collapse
|
252
|
Téletchéa S, Santuz H, Léonard S, Etchebest C. Repository of Enriched Structures of Proteins Involved in the Red Blood Cell Environment (RESPIRE). PLoS One 2019; 14:e0211043. [PMID: 30794542 PMCID: PMC6386447 DOI: 10.1371/journal.pone.0211043] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Accepted: 01/07/2019] [Indexed: 12/25/2022] Open
Abstract
The Red Blood Cell (RBC) is a metabolically-driven cell vital for processes such a gas transport and homeostasis. RBC possesses at its surface exposing antigens proteins that are critical in blood transfusion. Due to their importance, numerous studies address the cell function as a whole but more and more details of RBC structure and protein content are now studied using massive state-of-the art characterisation techniques. Yet, the resulting information is frequently scattered in many scientific articles, in many databases and specialized web servers. To provide a more compendious view of erythrocytes and of their protein content, we developed a dedicated database called RESPIRE that aims at gathering a comprehensive and coherent ensemble of information and data about proteins in RBC. This cell-driven database lists proteins found in erythrocytes. For a given protein entry, initial data are processed from external portals and enriched by using state-of-the-art bioinformatics methods. As structural information is extremely useful to understand protein function and predict the impact of mutations, a strong effort has been put on the prediction of protein structures with a special treatment for membrane proteins. Browsing the database is available through text search for reference gene names or protein identifiers, through pre-defined queries or via hyperlinks. The RESPIRE database provides valuable information and unique annotations that should be useful to a wide audience of biologists, clinicians and structural biologists. Database URL:http://www.dsimb.inserm.fr/respire
Collapse
Affiliation(s)
- S. Téletchéa
- Institut National de la Transfusion Sanguine, Paris, France
- Inserm, UMR_S 1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Laboratory of Excellence GR-Ex., Paris, France
- UFIP, University of Nantes, CNRS UMR 6286, Nantes, France
| | - H. Santuz
- Institut National de la Transfusion Sanguine, Paris, France
- Inserm, UMR_S 1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Laboratory of Excellence GR-Ex., Paris, France
| | - S. Léonard
- Institut National de la Transfusion Sanguine, Paris, France
- Inserm, UMR_S 1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Laboratory of Excellence GR-Ex., Paris, France
| | - C. Etchebest
- Institut National de la Transfusion Sanguine, Paris, France
- Inserm, UMR_S 1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, Paris, France
- Laboratory of Excellence GR-Ex., Paris, France
- * E-mail:
| |
Collapse
|
253
|
Figliuzzi M, Barrat-Charlaix P, Weigt M. How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins? Mol Biol Evol 2019; 35:1018-1027. [PMID: 29351669 DOI: 10.1093/molbev/msy007] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Global coevolutionary models of homologous protein families, as constructed by direct coupling analysis (DCA), have recently gained popularity in particular due to their capacity to accurately predict residue-residue contacts from sequence information alone, and thereby to facilitate tertiary and quaternary protein structure prediction. More recently, they have also been used to predict fitness effects of amino-acid substitutions in proteins, and to predict evolutionary conserved protein-protein interactions. These models are based on two currently unjustified hypotheses: 1) correlations in the amino-acid usage of different positions are resulting collectively from networks of direct couplings; and 2) pairwise couplings are sufficient to capture the amino-acid variability. Here, we propose a highly precise inference scheme based on Boltzmann-machine learning, which allows us to systematically address these hypotheses. We show how correlations are built up in a highly collective way by a large number of coupling paths, which are based on the proteins three-dimensional structure. We further find that pairwise coevolutionary models capture the collective residue variability across homologous proteins even for quantities which are not imposed by the inference procedure, like three-residue correlations, the clustered structure of protein families in sequence space or the sequence distances between homologs. These findings strongly suggest that pairwise coevolutionary models are actually sufficient to accurately capture the residue variability in homologous protein families.
Collapse
Affiliation(s)
- Matteo Figliuzzi
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology - UMR7238, 75005 Paris, France
| | - Pierre Barrat-Charlaix
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology - UMR7238, 75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology - UMR7238, 75005 Paris, France
| |
Collapse
|
254
|
Ji S, Oruç T, Mead L, Rehman MF, Thomas CM, Butterworth S, Winn PJ. DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure. PLoS One 2019; 14:e0205214. [PMID: 30620738 PMCID: PMC6324825 DOI: 10.1371/journal.pone.0205214] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Accepted: 12/13/2018] [Indexed: 11/28/2022] Open
Abstract
Rapid, accurate prediction of protein structure from amino acid sequence would accelerate fields as diverse as drug discovery, synthetic biology and disease diagnosis. Massively improved prediction of protein structures has been driven by improving the prediction of the amino acid residues that contact in their 3D structure. For an average globular protein, around 92% of all residue pairs are non-contacting, therefore accurate prediction of only a small percentage of inter-amino acid distances could increase the number of constraints to guide structure determination. We have trained deep neural networks to predict inter-residue contacts and distances. Distances are predicted with an accuracy better than most contact prediction techniques. Addition of distance constraints improved de novo structure predictions for test sets of 158 protein structures, as compared to using the best contact prediction methods alone. Importantly, usage of distance predictions allows the selection of better models from the structure pool without a need for an external model assessment tool. The results also indicate how the accuracy of distance prediction methods might be improved further.
Collapse
Affiliation(s)
- Shuangxi Ji
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Tuğçe Oruç
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Liam Mead
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Muhammad Fayyaz Rehman
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | | | - Sam Butterworth
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Division of Pharmacy and Optometry, School of Health Sciences, Manchester Academic Health Sciences Centre, University of Manchester, Manchester, M13 9PL, United Kingdom
| | - Peter James Winn
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- * E-mail:
| |
Collapse
|
255
|
Abstract
Thanks to the explosion of genomic sequencing, coevolutionary analysis of protein sequences has gained great and ever-increasing popularity in the last decade, and it is currently an important and well-established tool in structural bioinformatics and computational biology. This chapter concisely introduces the theoretical foundation and the practical aspects of coevolutionary analysis, as well as discusses the molecular modeling strategies to exploit its results in the study of protein structure, dynamics, and interactions. We present here a complete pipeline from sequence extraction to contact prediction through two examples, focusing on the predictions of inter-residue contacts in a single protein domain and on the analysis of a multi-domain protein that undergoes functional, large-scale conformational transitions.
Collapse
Affiliation(s)
- Duccio Malinverni
- Laboratory of Statistical Biophysics, Institute of Physics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland.
| | - Alessandro Barducci
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, Montpellier, France.
| |
Collapse
|
256
|
Dehghani T, Naghibzadeh M, Eghdami M. BetaDL: A protein beta-sheet predictor utilizing a deep learning model and independent set solution. Comput Biol Med 2019; 104:241-249. [PMID: 30530227 DOI: 10.1016/j.compbiomed.2018.11.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2018] [Revised: 11/23/2018] [Accepted: 11/27/2018] [Indexed: 10/27/2022]
Abstract
The sequence-based prediction of beta-residue contacts and beta-sheet structures contain key information for protein structure prediction. However, the determination of beta-sheet structures poses numerous challenges due to long-range beta-residue interactions and the huge number of possible beta-sheet structures. Recently gaining attention has been the prediction of residue contacts based on deep learning models whose results have led to improvement in protein structure prediction. In addition, to reduce the computational complexity of determining beta-sheet structures, it has been suggested that this problem be transformed into graph-based solutions. Consequently, the current work proposes BetaDL, a combination of a deep learning and a graph-based beta-sheet structure predictor. BetaDL adopts deep learning models to capture beta-residue contacts and improve beta-sheet structure predictions. In addition, a graph-based approach is presented to model the beta-sheets conformational space and a new score function is introduced to evaluate beta-sheets. Furthermore, the present study demonstrates that the beta-sheet structure can be predicted within an acceptable computational time by the utilization of a heuristic maximum weight independent set solution. When compared to state-of-the-art methods, experimental results from BetaSheet916 and BetaSheet1452 datasets indicate that BetaDL improves the accuracy of beta-residue contact and beta-sheet structure prediction. Using BetaDL, beta-sheet structures are predicted with a 4% and 6% improvement in the F1-score at the residue and strand levels, respectively. BetaDL's source code and data are available at http://kerg.um.ac.ir/index.php/datasets/#BetaDL.
Collapse
Affiliation(s)
- Toktam Dehghani
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran.
| | - Mahdie Eghdami
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| |
Collapse
|
257
|
MacCarthy E, Perry D, Kc DB. Advances in Protein Super-Secondary Structure Prediction and Application to Protein Structure Prediction. Methods Mol Biol 2019; 1958:15-45. [PMID: 30945212 DOI: 10.1007/978-1-4939-9161-7_2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Due to the advancement in various sequencing technologies, the gap between the number of protein sequences and the number of experimental protein structures is ever increasing. Community-wide initiatives like CASP have resulted in considerable efforts in the development of computational methods to accurately model protein structures from sequences. Sequence-based prediction of super-secondary structure has direct application in protein structure prediction, and there have been significant efforts in the prediction of super-secondary structure in the last decade. In this chapter, we first introduce the protein structure prediction problem and highlight some of the important progress in the field of protein structure prediction. Next, we discuss recent methods for the prediction of super-secondary structures. Finally, we discuss applications of super-secondary structure prediction in structure prediction/analysis of proteins. We also discuss prediction of protein structures that are composed of simple super-secondary structure repeats and protein structures that are composed of complex super-secondary structure repeats. Finally, we also discuss the recent trends in the field.
Collapse
Affiliation(s)
- Elijah MacCarthy
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Derrick Perry
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA
| | - Dukka B Kc
- Department of Computational Science and Engineering, North Carolina A&T State University, Greensboro, NC, USA.
| |
Collapse
|
258
|
Wan C. Background on Biology of Ageing and Bioinformatics. ADVANCED INFORMATION AND KNOWLEDGE PROCESSING 2019:25-43. [DOI: 10.1007/978-3-319-97919-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/03/2025]
|
259
|
Schmidt M, Hamacher K. hoDCA: higher order direct-coupling analysis. BMC Bioinformatics 2018; 19:546. [PMID: 30594145 PMCID: PMC6311078 DOI: 10.1186/s12859-018-2583-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2018] [Accepted: 12/12/2018] [Indexed: 11/30/2022] Open
Abstract
Background Direct-coupling analysis (DCA) is a method for protein contact prediction from sequence information alone. Its underlying principle is parameter estimation for a Hamiltonian interaction function stemming from a maximum entropy model with one- and two-point interactions. Vastly growing sequence databases enable the construction of large multiple sequence alignments (MSA). Thus, enough data exists to include higher order terms, such as three-body correlations. Results We present an implementation of hoDCA, which is an extension of DCA by including three-body interactions into the inverse Ising problem posed by parameter estimation. In a previous study, these three-body-interactions improved contact prediction accuracy for the PSICOV benchmark dataset. Our implementation can be executed in parallel, which results in fast runtimes and makes it suitable for large-scale application. Conclusion Our hoDCA software allows improved contact prediction using the Julia language, leveraging power of multi-core machines in an automated fashion.
Collapse
Affiliation(s)
- Michael Schmidt
- Department of Physics, TU Darmstadt, Karolinenpl. 5, Darmstadt, 64287, Germany.
| | - Kay Hamacher
- Department of Physics, TU Darmstadt, Karolinenpl. 5, Darmstadt, 64287, Germany.,Department of Biology, TU Darmstadt, Schnittspahnstr. 10, Darmstadt, 64287, Germany.,Department of Computer Science, TU Darmstadt, Karolinenpl. 5, Darmstadt, 64287, Germany
| |
Collapse
|
260
|
Koehl P, Orland H, Delarue M. Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments. Molecules 2018; 24:E104. [PMID: 30597916 PMCID: PMC6337344 DOI: 10.3390/molecules24010104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2018] [Revised: 12/21/2018] [Accepted: 12/24/2018] [Indexed: 11/17/2022] Open
Abstract
Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
Collapse
Affiliation(s)
- Patrice Koehl
- Department of Computer Science, University of California, Davis, CA 95211, USA.
| | - Henri Orland
- Institut de Physique Théorique, CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France.
| | - Marc Delarue
- Department of Structural Biology and Chemistry and UMR 3528 du CNRS, Institut Pasteur, 75015 Paris, France.
| |
Collapse
|
261
|
Huang YJ, Brock KP, Ishida Y, Swapna GVT, Inouye M, Marks DS, Sander C, Montelione GT. Combining Evolutionary Covariance and NMR Data for Protein Structure Determination. Methods Enzymol 2018; 614:363-392. [PMID: 30611430 PMCID: PMC6640129 DOI: 10.1016/bs.mie.2018.11.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
Accurate protein structure determination by solution-state NMR is challenging for proteins greater than about 20kDa, for which extensive perdeuteration is generally required, providing experimental data that are incomplete (sparse) and ambiguous. However, the massive increase in evolutionary sequence information coupled with advances in methods for sequence covariance analysis can provide reliable residue-residue contact information for a protein from sequence data alone. These "evolutionary couplings (ECs)" can be combined with sparse NMR data to determine accurate 3D protein structures. This hybrid "EC-NMR" method has been developed using NMR data for several soluble proteins and validated by comparison with corresponding reference structures determined by X-ray crystallography and/or conventional NMR methods. For small proteins, only backbone resonance assignments are utilized, while for larger proteins both backbone and some sidechain methyl resonance assignments are generally required. ECs can be combined with sparse NMR data obtained on deuterated, selectively protonated protein samples to provide structures that are more accurate and complete than those obtained using such sparse NMR data alone. EC-NMR also has significant potential for analysis of protein structures from solid-state NMR data and for studies of integral membrane proteins. The requirement that ECs are consistent with NMR data recorded on a specific member of a protein family, under specific conditions, also allows identification of ECs that reflect alternative allosteric or excited states of the protein structure.
Collapse
Affiliation(s)
- Yuanpeng Janet Huang
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Kelly P Brock
- Department of Systems Biology, Harvard Medical School, Boston, MA, United States
| | - Yojiro Ishida
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Gurla V T Swapna
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Masayori Inouye
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, United States
| | - Chris Sander
- Department of Cell Biology, Harvard Medical School and cBio Center, Dana-Farber Cancer Institute, Boston, MA, United States
| | - Gaetano T Montelione
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey, Piscataway, NJ, United States; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, Rutgers, The State University of New Jersey, Piscataway, NJ, United States.
| |
Collapse
|
262
|
Neuwald AF, Altschul SF. Statistical investigations of protein residue direct couplings. PLoS Comput Biol 2018; 14:e1006237. [PMID: 30596639 PMCID: PMC6329532 DOI: 10.1371/journal.pcbi.1006237] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Revised: 01/11/2019] [Accepted: 11/23/2018] [Indexed: 12/12/2022] Open
Abstract
Protein Direct Coupling Analysis (DCA), which predicts residue-residue contacts based on covarying positions within a multiple sequence alignment, has been remarkably effective. This suggests that there is more to learn from sequence correlations than is generally assumed, and calls for deeper investigations into DCA and perhaps into other types of correlations. Here we describe an approach that enables such investigations by measuring, as an estimated p-value, the statistical significance of the association between residue-residue covariance and structural interactions, either internal or homodimeric. Its application to thirty protein superfamilies confirms that direct coupling (DC) scores correlate with 3D pairwise contacts with very high significance. This method also permits quantitative assessment of the relative performance of alternative DCA methods, and of the degree to which they detect direct versus indirect couplings. We illustrate its use to assess, for a given protein, the biological relevance of alternative conformational states, to investigate the possible mechanistic implications of differences between these states, and to characterize subtle aspects of direct couplings. Our analysis indicates that direct pairwise correlations may be largely distinct from correlated patterns associated with functional specialization, and that the joint analysis of both types of correlations can yield greater power. Data, programs, and source code are freely available at http://evaldca.igs.umaryland.edu.
Collapse
Affiliation(s)
- Andrew F. Neuwald
- Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, Maryland, United States of America
| | - Stephen F. Altschul
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| |
Collapse
|
263
|
Butler BM, Kazan IC, Kumar A, Ozkan SB. Coevolving residues inform protein dynamics profiles and disease susceptibility of nSNVs. PLoS Comput Biol 2018; 14:e1006626. [PMID: 30496278 PMCID: PMC6289467 DOI: 10.1371/journal.pcbi.1006626] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2018] [Revised: 12/11/2018] [Accepted: 11/09/2018] [Indexed: 11/18/2022] Open
Abstract
The conformational dynamics of proteins is rarely used in methodologies used to predict the impact of genetic mutations due to the paucity of three-dimensional protein structures as compared to the vast number of available sequences. Until now a three-dimensional (3D) structure has been required to predict the conformational dynamics of a protein. We introduce an approach that estimates the conformational dynamics of a protein, without relying on structural information. This de novo approach utilizes coevolving residues identified from a multiple sequence alignment (MSA) using Potts models. These coevolving residues are used as contacts in a Gaussian network model (GNM) to obtain protein dynamics. B-factors calculated using sequence-based GNM (Seq-GNM) are in agreement with crystallographic B-factors as well as theoretical B-factors from the original GNM that utilizes the 3D structure. Moreover, we demonstrate the ability of the calculated B-factors from the Seq-GNM approach to discriminate genomic variants according to their phenotypes for a wide range of proteins. These results suggest that protein dynamics can be approximated based on sequence information alone, making it possible to assess the phenotypes of nSNVs in cases where a 3D structure is unknown. We hope this work will promote the use of dynamics information in genetic disease prediction at scale by circumventing the need for 3D structures. Proteins are dynamic machines that undergo atomic fluctuations, side chain rotations, and collective domain movements that are required for biological function. There is, therefore, a need for quantitative metrics that capture the dynamic fluctuations per position to understand the critical role of protein dynamics in shaping biological functions. A limiting factor in incorporating structural dynamics information in the classification of non-synonymous single nucleotide variants (nSNVs) is the limited number of known 3D structures compared to the vast number of available sequences. We have developed a new sequence-based GNM method, termed Seq-GNM, which uses co-evolving amino acid positions based on the multiple sequence alignment of a given query sequence to estimate the thermal motions of C-alpha atoms. In this paper, we have demonstrated that the predicted thermal motions using Seq-GNM are in reasonable agreement with experimental B-factors as well as B-factors computed using 3D crystal structures. We also provide evidence that B-factors predicted by Seq-GNM are capable of distinguishing between disease-associated and neutral nSNVs.
Collapse
Affiliation(s)
- Brandon M. Butler
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
| | - I. Can Kazan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
| | - Avishek Kumar
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
- Harris School of Public Policy and Center for Data Science and Public Policy, University of Chicago, Chicago, IL, United States of America
| | - S. Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, AZ, United States of America
- * E-mail:
| |
Collapse
|
264
|
Cheung NJ, Yu W. De novo protein structure prediction using ultra-fast molecular dynamics simulation. PLoS One 2018; 13:e0205819. [PMID: 30458007 PMCID: PMC6245515 DOI: 10.1371/journal.pone.0205819] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2018] [Accepted: 10/02/2018] [Indexed: 11/19/2022] Open
Abstract
Modern genomics sequencing techniques have provided a massive amount of protein sequences, but experimental endeavor in determining protein structures is largely lagging far behind the vast and unexplored sequences. Apparently, computational biology is playing a more important role in protein structure prediction than ever. Here, we present a system of de novo predictor, termed NiDelta, building on a deep convolutional neural network and statistical potential enabling molecular dynamics simulation for modeling protein tertiary structure. Combining with evolutionary-based residue-contacts, the presented predictor can predict the tertiary structures of a number of target proteins with remarkable accuracy. The proposed approach is demonstrated by calculations on a set of eighteen large proteins from different fold classes. The results show that the ultra-fast molecular dynamics simulation could dramatically reduce the gap between the sequence and its structure at atom level, and it could also present high efficiency in protein structure determination if sparse experimental data is available.
Collapse
Affiliation(s)
- Ngaam J. Cheung
- Department of Brain and Cognitive Science, DGIST, Daegu, South Korea
- Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge, United Kingdom
| | - Wookyung Yu
- Department of Brain and Cognitive Science, DGIST, Daegu, South Korea
- Core Protein Resources Center, DGIST, Daegu, South Korea
- * E-mail:
| |
Collapse
|
265
|
Ding W, Mao W, Shao D, Zhang W, Gong H. DeepConPred2: An Improved Method for the Prediction of Protein Residue Contacts. Comput Struct Biotechnol J 2018; 16:503-510. [PMID: 30505403 PMCID: PMC6247404 DOI: 10.1016/j.csbj.2018.10.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2018] [Revised: 10/16/2018] [Accepted: 10/18/2018] [Indexed: 12/18/2022] Open
Abstract
Information of residue-residue contacts is essential for understanding the mechanism of protein folding, and has been successfully applied as special topological restraints to simplify the conformational sampling in de novo protein structure prediction. Prediction of protein residue contacts has experienced amazingly rapid progresses recently, with prediction accuracy approaching impressively high levels in the past two years. In this work, we introduce a second version of our residue contact predictor, DeepConPred2, which exhibits substantially improved performance and sufficiently reduced running time after model re-optimization and feature updates. When testing on the CASP12 free modeling targets, our program reaches at least the same level of prediction accuracy as the best contact predictors so far and provides information complementary to other state-of-the-art methods in contact-assisted folding.
Collapse
Affiliation(s)
- Wenze Ding
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China.,Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
| | - Wenzhi Mao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China.,Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
| | - Di Shao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China.,Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
| | - Wenxuan Zhang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China.,Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
| | - Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China.,Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
| |
Collapse
|
266
|
Vorberg S, Seemayer S, Söding J. Synthetic protein alignments by CCMgen quantify noise in residue-residue contact prediction. PLoS Comput Biol 2018; 14:e1006526. [PMID: 30395601 PMCID: PMC6237422 DOI: 10.1371/journal.pcbi.1006526] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2018] [Revised: 11/15/2018] [Accepted: 09/24/2018] [Indexed: 12/01/2022] Open
Abstract
Compensatory mutations between protein residues in physical contact can manifest themselves as statistical couplings between the corresponding columns in a multiple sequence alignment (MSA) of the protein family. Conversely, large coupling coefficients predict residue contacts. Methods for de-novo protein structure prediction based on this approach are becoming increasingly reliable. Their main limitation is the strong systematic and statistical noise in the estimation of coupling coefficients, which has so far limited their application to very large protein families. While most research has focused on improving predictions by adding external information, little progress has been made to improve the statistical procedure at the core, because our lack of understanding of the sources of noise poses a major obstacle. First, we show theoretically that the expectation value of the coupling score assuming no coupling is proportional to the product of the square roots of the column entropies, and we propose a simple entropy bias correction (EntC) that subtracts out this expectation value. Second, we show that the average product correction (APC) includes the correction of the entropy bias, partly explaining its success. Third, we have developed CCMgen, the first method for simulating protein evolution and generating realistic synthetic MSAs with pairwise statistical residue couplings. Fourth, to learn exact statistical models that reliably reproduce observed alignment statistics, we developed CCMpredPy, an implementation of the persistent contrastive divergence (PCD) method for exact inference. Fifth, we demonstrate how CCMgen and CCMpredPy can facilitate the development of contact prediction methods by analysing the systematic noise contributions from phylogeny and entropy. Using the entropy bias correction, we can disentangle both sources of noise and find that entropy contributes roughly twice as much noise as phylogeny. Knowledge about the three-dimensional structure of proteins is key to understanding their function and role in biological processes and diseases. The experimental structure determination techniques, such as X-ray crystallography or electron cryo-microscopy, are labour intensive, time-consuming and expensive. Therefore, complementary computational methods to predict a protein’s structure have become indispensable. Over the last years, immense progress has been made in predicting protein structures from their amino acid sequence by utilizing highly accurate predictions of spatial contacts between amino acid residues as constraints in folding simulations. However, contact prediction methods require large numbers of homologous protein sequences in order to discriminate between signal and noise. A major obstacle preventing progress on the statistical methodology is our limited understanding of the different components of noise that are known to affect the predictions. We provide two tools, CCMpredPy and CCMgen, that can be used to learn highly accurate statistical models for contact prediction and to simulate protein evolution according to the statistical constraints between positions of residues as specified by these models, respectively. We showcase their usefulness by quantifying the relative contribution of noise arising from entropy and phylogeny on the predicted contacts, which will facilitate the improvement of the statistical methodology.
Collapse
Affiliation(s)
- Susann Vorberg
- Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Stefan Seemayer
- Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany
| | - Johannes Söding
- Quantitative and Computational Biology Group, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany
| |
Collapse
|
267
|
Bitbol AF. Inferring interaction partners from protein sequences using mutual information. PLoS Comput Biol 2018; 14:e1006401. [PMID: 30422978 PMCID: PMC6258550 DOI: 10.1371/journal.pcbi.1006401] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Revised: 11/27/2018] [Accepted: 10/27/2018] [Indexed: 11/30/2022] Open
Abstract
Functional protein-protein interactions are crucial in most cellular processes. They enable multi-protein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are functional interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. Our mutual information-based method also provides signatures of the existence of interactions between protein families. These results stand in contrast with structure prediction of proteins and of multi-protein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.
Collapse
Affiliation(s)
- Anne-Florence Bitbol
- Sorbonne Université, CNRS, Laboratoire Jean Perrin (UMR 8237), F-75005 Paris, France
| |
Collapse
|
268
|
Schelling M, Hopf TA, Rost B. Evolutionary couplings and sequence variation effect predict protein binding sites. Proteins 2018; 86:1064-1074. [DOI: 10.1002/prot.25585] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2017] [Revised: 06/14/2018] [Accepted: 07/04/2018] [Indexed: 01/16/2023]
Affiliation(s)
- Maria Schelling
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics, & Computational Biology - i12; Garching/Munich Germany
| | - Thomas A. Hopf
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics, & Computational Biology - i12; Garching/Munich Germany
- Department of Systems Biology & Department of Cell Biology; Harvard Medical School; Boston Massachusetts
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics, & Computational Biology - i12; Garching/Munich Germany
- Institute for Advanced Study (TUM-IAS); Garching/Munich Germany
- TUM School of Life Sciences Weihenstephan (WZW); Freising Germany
- Department of Biochemistry and Molecular Biophysics & New York Consortium on Membrane Protein Structure (NYCOMPS); Columbia University; New York New York
| |
Collapse
|
269
|
Dhar A, Davidsen K, Matsen FA, Minin VN. Predicting B cell receptor substitution profiles using public repertoire data. PLoS Comput Biol 2018; 14:e1006388. [PMID: 30332400 PMCID: PMC6205660 DOI: 10.1371/journal.pcbi.1006388] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2018] [Revised: 10/29/2018] [Accepted: 07/22/2018] [Indexed: 12/31/2022] Open
Abstract
B cells develop high affinity receptors during the course of affinity maturation, a cyclic process of mutation and selection. At the end of affinity maturation, a number of cells sharing the same ancestor (i.e. in the same “clonal family”) are released from the germinal center; their amino acid frequency profile reflects the allowed and disallowed substitutions at each position. These clonal-family-specific frequency profiles, called “substitution profiles”, are useful for studying the course of affinity maturation as well as for antibody engineering purposes. However, most often only a single sequence is recovered from each clonal family in a sequencing experiment, making it impossible to construct a clonal-family-specific substitution profile. Given the public release of many high-quality large B cell receptor datasets, one may ask whether it is possible to use such data in a prediction model for clonal-family-specific substitution profiles. In this paper, we present the method “Substitution Profiles Using Related Families” (SPURF), a penalized tensor regression framework that integrates information from a rich assemblage of datasets to predict the clonal-family-specific substitution profile for any single input sequence. Using this framework, we show that substitution profiles from similar clonal families can be leveraged together with simulated substitution profiles and germline gene sequence information to improve prediction. We fit this model on a large public dataset and validate the robustness of our approach on two external datasets. Furthermore, we provide a command-line tool in an open-source software package (https://github.com/krdav/SPURF) implementing these ideas and providing easy prediction using our pre-fit models. Antibody engineering can be greatly informed by knowledge about the underlying affinity maturation process. As such this can be probed by sequencing, but unfortunately, in practice often only one member of the clonal family is sequenced, making it difficult to determine a set of possible amino acid mutations that would retain the original antibody antigen binding affinity. We overcome this data sparsity by developing a statistical learning approach that leverages vast information about amino acid preferences available in public immune system repertoire data. We use a penalized regression approach to devise a flexible statistical model that integrates multiple sources of information into a coherent prediction framework and validate our prediction algorithm using subsampling and held out data.
Collapse
Affiliation(s)
- Amrit Dhar
- Department of Statistics, University of Washington, Seattle, Washington, United States of America
- Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Kristian Davidsen
- Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Frederick A. Matsen
- Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- * E-mail: (FAM); (VNM)
| | - Vladimir N. Minin
- Department of Statistics, University of California, Irvine, California, United States of America
- * E-mail: (FAM); (VNM)
| |
Collapse
|
270
|
Rouse SL, Matthews SJ, Dueholm MS. Ecology and Biogenesis of Functional Amyloids in Pseudomonas. J Mol Biol 2018; 430:3685-3695. [PMID: 29753779 PMCID: PMC6173800 DOI: 10.1016/j.jmb.2018.05.004] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2018] [Revised: 05/03/2018] [Accepted: 05/04/2018] [Indexed: 12/02/2022]
Abstract
Functional amyloids can be found in the extracellular matrix produced by many bacteria during biofilm growth. They mediate the initial attachment of bacteria to surfaces and provide stability and functionality to mature biofilms. Efficient amyloid biogenesis requires a highly coordinated system of amyloid subunits, molecular chaperones and transport systems. The functional amyloid of Pseudomonas (Fap) represents such a system. Here, we review the phylogenetic diversification of the Fap system, its potential ecological role and the dedicated machinery required for Fap biogenesis, with a particular focus on the amyloid exporter FapF, the structure of which has been recently resolved. We also present a sequence covariance-based in silico model of the FapC fiber-forming subunit. Finally, we highlight key questions that remain unanswered and we believe deserve further attention by the scientific community.
Collapse
Affiliation(s)
- Sarah L Rouse
- Department of Life Sciences, Imperial College London, South Kensington Campus, London, SW72AZ, UK
| | - Stephen J Matthews
- Department of Life Sciences, Imperial College London, South Kensington Campus, London, SW72AZ, UK
| | - Morten S Dueholm
- Center for Microbial Communities, Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark.
| |
Collapse
|
271
|
How is structural divergence related to evolutionary information? Mol Phylogenet Evol 2018; 127:859-866. [DOI: 10.1016/j.ympev.2018.06.033] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2017] [Revised: 06/01/2018] [Accepted: 06/19/2018] [Indexed: 12/15/2022]
|
272
|
Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018; 34:3308-3315. [PMID: 29718112 PMCID: PMC6157083 DOI: 10.1093/bioinformatics/bty341] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2017] [Revised: 03/06/2018] [Accepted: 04/25/2018] [Indexed: 12/22/2022] Open
Abstract
Motivation In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Results Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. Availability and implementation DeepCov is freely available at https://github.com/psipred/DeepCov. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- David T Jones
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
| |
Collapse
|
273
|
Abstract
Chemical Shift-Rosetta (CS-Rosetta) is an automated method that employs NMR chemical shifts to model protein structures de novo. In this chapter, we introduce the terminology and central concepts of CS-Rosetta. We describe the architecture and functionality of automatic NOESY assignment (AutoNOE) and structure determination protocols (Abrelax and RASREC) within the CS-Rosetta framework. We further demonstrate how CS-Rosetta can discriminate near-native structures against a large conformational search space using restraints obtained from NMR data, and/or sequence and structure homology. We highlight how CS-Rosetta can be combined with alternative automated approaches to (i) model oligomeric systems and (ii) create NMR-based structure determination pipeline. To show its practical applicability, we emphasize on the computational requirements and performance of CS-Rosetta for protein targets of varying molecular weight and complexity. Finally, we discuss the current Python interface, which enables easy execution of protocols for rapid and accurate high-resolution structure determination.
Collapse
Affiliation(s)
- Santrupti Nerli
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA, United States; Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA, United States
| | - Nikolaos G Sgourakis
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA, United States.
| |
Collapse
|
274
|
Wu H, Cao C, Xia X, Lu Q. Unified Deep Learning Architecture for Modeling Biology Sequence. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1445-1452. [PMID: 28991751 DOI: 10.1109/tcbb.2017.2760832] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Prediction of the spatial structure or function of biological macromolecules based on their sequences remains an important challenge in bioinformatics. When modeling biological sequences using traditional sequencing models, long-range interaction, complicated and variable output of labeled structures, and variable length of biological sequences usually lead to different solutions on a case-by-case basis. This study proposed a unified deep learning architecture based on long short-term memory or a gated recurrent unit to capture long-range interactions. The architecture designs the optional reshape operator to adapt to the diversity of the output labels and implements a training algorithm to support the training of sequence models capable of processing variable-length sequences. The merging and pooling operators enhances the ability of capturing short-range interactions between basic units of biological sequences. The proposed deep-learning architecture and its training algorithm might be capable of solving currently variable biological sequence-modeling problems under a unified framework. We validated the model on one of the most difficult biological sequence-modeling problems, protein residue interaction prediction. The results indicate that the accuracy of obtaining the residue interactions of the model exceeded popular approaches by 10 percent on multiple widely-used benchmarks.
Collapse
|
275
|
Jakubec D, Kratochvíl M, Vymĕtal J, Vondrášek J. Widespread evolutionary crosstalk among protein domains in the context of multi-domain proteins. PLoS One 2018; 13:e0203085. [PMID: 30169546 PMCID: PMC6118372 DOI: 10.1371/journal.pone.0203085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2018] [Accepted: 08/14/2018] [Indexed: 11/20/2022] Open
Abstract
Domains are distinct units within proteins that typically can fold independently into recognizable three-dimensional structures to facilitate their functions. The structural and functional independence of protein domains is reflected by their apparent modularity in the context of multi-domain proteins. In this work, we examined the coupling of evolution of domain sequences co-occurring within multi-domain proteins to see if it proceeds independently, or in a coordinated manner. We used continuous information theory measures to assess the extent of correlated mutations among domains in multi-domain proteins from organisms across the tree of life. In all multi-domain architectures we examined, domains co-occurring within protein sequences had to some degree undergone concerted evolution. This finding challenges the notion of complete modularity and independence of protein domains, providing new perspective on the evolution of protein sequence and function.
Collapse
Affiliation(s)
- David Jakubec
- Department of Bioinformatics, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 166 10 Prague 6, Czech Republic
- Department of Physical and Macromolecular Chemistry, Faculty of Science, Charles University, 128 43 Prague 2, Czech Republic
| | - Miroslav Kratochvíl
- Department of Bioinformatics, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 166 10 Prague 6, Czech Republic
- Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, 118 00 Prague 1, Czech Republic
| | - Jiří Vymĕtal
- Department of Bioinformatics, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 166 10 Prague 6, Czech Republic
| | - Jiří Vondrášek
- Department of Bioinformatics, Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 166 10 Prague 6, Czech Republic
| |
Collapse
|
276
|
Kc DB. Recent advances in sequence-based protein structure prediction. Brief Bioinform 2018; 18:1021-1032. [PMID: 27562963 DOI: 10.1093/bib/bbw070] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2016] [Indexed: 11/13/2022] Open
Abstract
The most accurate characterizations of the structure of proteins are provided by structural biology experiments. However, because of the high cost and labor-intensive nature of the structural experiments, the gap between the number of protein sequences and solved structures is widening rapidly. Development of computational methods to accurately model protein structures from sequences is becoming increasingly important to the biological community. In this article, we highlight some important progress in the field of protein structure prediction, especially those related to free modeling (FM) methods that generate structure models without using homologous templates. We also provide a short synopsis of some of the recent advances in FM approaches as demonstrated in the recent Computational Assessment of Structure Prediction competition as well as recent trends and outlook for FM approaches in protein structure prediction.
Collapse
|
277
|
Khan S, Guo TW, Misra S. A coevolution-guided model for the rotor of the bacterial flagellar motor. Sci Rep 2018; 8:11754. [PMID: 30082903 PMCID: PMC6079021 DOI: 10.1038/s41598-018-30293-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2018] [Accepted: 07/19/2018] [Indexed: 01/17/2023] Open
Abstract
The Salmonella typhimurium trans-membrane FliF MS ring templates assembly of the rotary bacterial flagellar motor, which also contains a cytoplasmic C-ring. A full-frame fusion of FliF with the rotor protein FliG assembles rings in non-motile expression hosts. 3D electron microscopy reconstructions of these FliFFliG rings show three high electron-density sub-volumes. 3D-classification revealed heterogeneity of the assigned cytoplasmic volume consistent with FliG lability. We used residue coevolution to construct homodimer building blocks for ring assembly, with X-ray crystal structures from other species and injectisome analogs. The coevolution signal validates folds and, importantly, indicates strong homodimer contacts for three ring building motifs (RBMs), initially identified in injectisome structures. It also indicates that the cofolded domains of the FliG N-terminal domain (FliG_N) with embedded α-helical FliF carboxy-terminal tail homo-oligomerize. The FliG middle and C-terminal domains (FliG_MC) have a weak signal for homo-dimerization but have coevolved to conserve their stacking contact. The homodimers and their ring models fit well into the 3D reconstruction. We hypothesize that a stable FliF periplasmic hub provides a platform for FliG ring self-assembly, but the FliG_MC ring has only limited stability without the C-ring. We also present a mechanical model for torque transmission in the FliFFliG ring.
Collapse
Affiliation(s)
- Shahid Khan
- Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, 20892, USA.
- Molecular Biology Consortium, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.
| | - Tai Wei Guo
- Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, 20892, USA
| | - Saurav Misra
- Laboratory of Cell Biology, Center for Cancer Research, National Cancer Institute, NIH, Bethesda, MD, 20892, USA
- Department of Biochemistry & Molecular Biophysics, Kansas State University, Manhattan, KS, 66506, USA
| |
Collapse
|
278
|
Kassem MM, Christoffersen LB, Cavalli A, Lindorff-Larsen K. Enhancing coevolution-based contact prediction by imposing structural self-consistency of the contacts. Sci Rep 2018; 8:11112. [PMID: 30042380 PMCID: PMC6057941 DOI: 10.1038/s41598-018-29357-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2018] [Accepted: 07/10/2018] [Indexed: 11/29/2022] Open
Abstract
Based on the development of new algorithms and growth of sequence databases, it has recently become possible to build robust higher-order sequence models based on sets of aligned protein sequences. Such models have proven useful in de novo structure prediction, where the sequence models are used to find pairs of residues that co-vary during evolution, and hence are likely to be in spatial proximity in the native protein. The accuracy of these algorithms, however, drop dramatically when the number of sequences in the alignment is small. We have developed a method that we termed CE-YAPP (CoEvolution-YAPP), that is based on YAPP (Yet Another Peak Processor), which has been shown to solve a similar problem in NMR spectroscopy. By simultaneously performing structure prediction and contact assignment, CE-YAPP uses structural self-consistency as a filter to remove false positive contacts. Furthermore, CE-YAPP solves another problem, namely how many contacts to choose from the ordered list of covarying amino acid pairs. We show that CE-YAPP consistently improves contact prediction from multiple sequence alignments, in particular for proteins that are difficult targets. We further show that the structures determined from CE-YAPP are also in better agreement with those determined using traditional methods in structural biology.
Collapse
Affiliation(s)
- Maher M Kassem
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark
| | - Lars B Christoffersen
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark
| | - Andrea Cavalli
- Institute for Research in Biomedicine, Università della Svizzera italiana (USI), Via Vincenzo Vela 6, 6500, Bellinzona, Switzerland.
| | - Kresten Lindorff-Larsen
- Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, Copenhagen, DK, 2200, Denmark.
| |
Collapse
|
279
|
Delarue M, Koehl P. Combined approaches from physics, statistics, and computer science for ab initio protein structure prediction: ex unitate vires (unity is strength)? F1000Res 2018; 7. [PMID: 30079234 PMCID: PMC6058471 DOI: 10.12688/f1000research.14870.1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 07/19/2018] [Indexed: 11/20/2022] Open
Abstract
Connecting the dots among the amino acid sequence of a protein, its structure, and its function remains a central theme in molecular biology, as it would have many applications in the treatment of illnesses related to misfolding or protein instability. As a result of high-throughput sequencing methods, biologists currently live in a protein sequence-rich world. However, our knowledge of protein structure based on experimental data remains comparatively limited. As a consequence, protein structure prediction has established itself as a very active field of research to fill in this gap. This field, once thought to be reserved for theoretical biophysicists, is constantly reinventing itself, borrowing ideas informed by an ever-increasing assembly of scientific domains, from biology, chemistry, (statistical) physics, mathematics, computer science, statistics, bioinformatics, and more recently data sciences. We review the recent progress arising from this integration of knowledge, from the development of specific computer architecture to allow for longer timescales in physics-based simulations of protein folding to the recent advances in predicting contacts in proteins based on detection of coevolution using very large data sets of aligned protein sequences.
Collapse
Affiliation(s)
- Marc Delarue
- Unité Dynamique Structurale des Macromolécules, Institut Pasteur, and UMR 3528 du CNRS, Paris, France
| | - Patrice Koehl
- Department of Computer Science, Genome Center, University of California, Davis, Davis, California, USA
| |
Collapse
|
280
|
de Oliveira SHP, Shi J, Deane CM. Comparing co-evolution methods and their application to template-free protein structure prediction. Bioinformatics 2018; 33:373-381. [PMID: 28171606 PMCID: PMC5860252 DOI: 10.1093/bioinformatics/btw618] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2016] [Revised: 09/19/2016] [Accepted: 09/22/2016] [Indexed: 02/01/2023] Open
Abstract
Motivation Co-evolution methods have been used as contact predictors to identify pairs of residues that share spatial proximity. Such contact predictors have been compared in terms of the precision of their predictions, but there is no study that compares their usefulness to model generation. Results We compared eight different co-evolution methods for a set of ∼3500 proteins and found that metaPSICOV stage 2 produces, on average, the most precise predictions. Precision of all the methods is dependent on SCOP class, with most methods predicting contacts in all α and membrane proteins poorly. The contact predictions were then used to assist in de novo model generation. We found that it was not the method with the highest average precision, but rather metaPSICOV stage 1 predictions that consistently led to the best models being produced. Our modelling results show a correlation between the proportion of predicted long range contacts that are satisfied on a model and its quality. We used this proportion to effectively classify models as correct/incorrect; discarding decoys classified as incorrect led to an enrichment in the proportion of good decoys in our final ensemble by a factor of seven. For 17 out of the 18 cases where correct answers were generated, the best models were not discarded by this approach. We were also able to identify eight cases where no correct decoy had been generated. Availability and Implementation Data is available for download from: http://opig.stats.ox.ac.uk/resources. Contact saulo.deoliveira@dtc.ox.ac.uk Supplimentary Information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Jiye Shi
- Department of Informatics, UCB Pharma, Slough SL1 3WE, UK,Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai 201800, China
| | | |
Collapse
|
281
|
Abstract
Since the 1980s, deep learning and biomedical data have been coevolving and feeding each other. The breadth, complexity, and rapidly expanding size of biomedical data have stimulated the development of novel deep learning methods, and application of these methods to biomedical data have led to scientific discoveries and practical solutions. This overview provides technical and historical pointers to the field, and surveys current applications of deep learning to biomedical data organized around five subareas, roughly of increasing spatial scale: chemoinformatics, proteomics, genomics and transcriptomics, biomedical imaging, and health care. The black box problem of deep learning methods is also briefly discussed.
Collapse
Affiliation(s)
- Pierre Baldi
- Department of Computer Science, Institute for Genomics and Bioinformatics, and Center for Machine Learning and Intelligent Systems, University of California, Irvine, California 92697, USA
| |
Collapse
|
282
|
Holland J, Pan Q, Grigoryan G. Contact prediction is hardest for the most informative contacts, but improves with the incorporation of contact potentials. PLoS One 2018; 13:e0199585. [PMID: 29953468 PMCID: PMC6023208 DOI: 10.1371/journal.pone.0199585] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2017] [Accepted: 06/11/2018] [Indexed: 11/18/2022] Open
Abstract
Co-evolution between pairs of residues in a multiple sequence alignment (MSA) of homologous proteins has long been proposed as an indicator of structural contacts. Recently, several methods, such as direct-coupling analysis (DCA) and MetaPSICOV, have been shown to achieve impressive rates of contact prediction by taking advantage of considerable sequence data. In this paper, we show that prediction success rates are highly sensitive to the structural definition of a contact, with more permissive definitions (i.e., those classifying more pairs as true contacts) naturally leading to higher positive predictive rates, but at the expense of the amount of structural information contributed by each contact. Thus, the remaining limitations of contact prediction algorithms are most noticeable in conjunction with geometrically restrictive contacts—precisely those that contribute more information in structure prediction. We suggest that to improve prediction rates for such “informative” contacts one could combine co-evolution scores with additional indicators of contact likelihood. Specifically, we find that when a pair of co-varying positions in an MSA is occupied by residue pairs with favorable statistical contact energies, that pair is more likely to represent a true contact. We show that combining a contact potential metric with DCA or MetaPSICOV performs considerably better than DCA or MetaPSICOV alone, respectively. This is true regardless of contact definition, but especially true for stricter and more informative contact definitions. In summary, this work outlines some remaining challenges to be addressed in contact prediction and proposes and validates a promising direction towards improvement.
Collapse
Affiliation(s)
- Jack Holland
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
| | - Qinxin Pan
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
| | - Gevorg Grigoryan
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States of America
- Department of Biological Sciences, Dartmouth College, Hanover, NH 03755, United States of America
- * E-mail:
| |
Collapse
|
283
|
Nerli S, McShan AC, Sgourakis NG. Chemical shift-based methods in NMR structure determination. PROGRESS IN NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY 2018; 106-107:1-25. [PMID: 31047599 PMCID: PMC6788782 DOI: 10.1016/j.pnmrs.2018.03.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/25/2018] [Revised: 03/09/2018] [Accepted: 03/09/2018] [Indexed: 05/08/2023]
Abstract
Chemical shifts are highly sensitive probes harnessed by NMR spectroscopists and structural biologists as conformational parameters to characterize a range of biological molecules. Traditionally, assignment of chemical shifts has been a labor-intensive process requiring numerous samples and a suite of multidimensional experiments. Over the past two decades, the development of complementary computational approaches has bolstered the analysis, interpretation and utilization of chemical shifts for elucidation of high resolution protein and nucleic acid structures. Here, we review the development and application of chemical shift-based methods for structure determination with a focus on ab initio fragment assembly, comparative modeling, oligomeric systems, and automated assignment methods. Throughout our discussion, we point out practical uses, as well as advantages and caveats, of using chemical shifts in structure modeling. We additionally highlight (i) hybrid methods that employ chemical shifts with other types of NMR restraints (residual dipolar couplings, paramagnetic relaxation enhancements and pseudocontact shifts) that allow for improved accuracy and resolution of generated 3D structures, (ii) the utilization of chemical shifts to model the structures of sparsely populated excited states, and (iii) modeling of sidechain conformations. Finally, we briefly discuss the advantages of contemporary methods that employ sparse NMR data recorded using site-specific isotope labeling schemes for chemical shift-driven structure determination of larger molecules. With this review, we aim to emphasize the accessibility and versatility of chemical shifts for structure determination of challenging biological systems, and to point out emerging areas of development that lead us towards the next generation of tools.
Collapse
Affiliation(s)
- Santrupti Nerli
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States; Department of Computer Science, University of California Santa Cruz, Santa Cruz, CA 95064, United States
| | - Andrew C McShan
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States
| | - Nikolaos G Sgourakis
- Department of Chemistry and Biochemistry, University of California Santa Cruz, Santa Cruz, CA 95064, United States.
| |
Collapse
|
284
|
Szurmant H, Weigt M. Inter-residue, inter-protein and inter-family coevolution: bridging the scales. Curr Opin Struct Biol 2018; 50:26-32. [PMID: 29101847 PMCID: PMC5940578 DOI: 10.1016/j.sbi.2017.10.014] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Revised: 10/12/2017] [Accepted: 10/13/2017] [Indexed: 10/18/2022]
Abstract
Interacting proteins coevolve at multiple but interconnected scales, from the residue-residue over the protein-protein up to the family-family level. The recent accumulation of enormous amounts of sequence data allows for the development of novel, data-driven computational approaches. Notably, these approaches can bridge scales within a single statistical framework. Although being currently applied mostly to isolated problems on single scales, their immense potential for an evolutionary informed, structural systems biology is steadily emerging.
Collapse
Affiliation(s)
- Hendrik Szurmant
- Department of Basic Medical Sciences, College of Osteopathic Medicine of the Pacific, Western University of Health Sciences, Pomona, CA 91766, USA.
| | - Martin Weigt
- Sorbonne Universités, UPMC Université Paris 06, CNRS, Biologie Computationnelle et Quantitative - Institut de Biologie Paris Seine, 75005 Paris, France.
| |
Collapse
|
285
|
Dutta S, Eckmann JP, Libchaber A, Tlusty T. Green function of correlated genes in a minimal mechanical model of protein evolution. Proc Natl Acad Sci U S A 2018; 115:E4559-E4568. [PMID: 29712824 PMCID: PMC5960285 DOI: 10.1073/pnas.1716215115] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The function of proteins arises from cooperative interactions and rearrangements of their amino acids, which exhibit large-scale dynamical modes. Long-range correlations have also been revealed in protein sequences, and this has motivated the search for physical links between the observed genetic and dynamic cooperativity. We outline here a simplified theory of protein, which relates sequence correlations to physical interactions and to the emergence of mechanical function. Our protein is modeled as a strongly coupled amino acid network with interactions and motions that are captured by the mechanical propagator, the Green function. The propagator describes how the gene determines the connectivity of the amino acids and thereby, the transmission of forces. Mutations introduce localized perturbations to the propagator that scatter the force field. The emergence of function is manifested by a topological transition when a band of such perturbations divides the protein into subdomains. We find that epistasis-the interaction among mutations in the gene-is related to the nonlinearity of the Green function, which can be interpreted as a sum over multiple scattering paths. We apply this mechanical framework to simulations of protein evolution and observe long-range epistasis, which facilitates collective functional modes.
Collapse
Affiliation(s)
- Sandipan Dutta
- Center for Soft and Living Matter, Institute for Basic Science, Ulsan 44919, Korea
| | - Jean-Pierre Eckmann
- Département de Physique Théorique and Section de Mathématiques, Université de Genève, CH-1211 Geneva 4, Switzerland
| | - Albert Libchaber
- Center for Studies in Physics and Biology, The Rockefeller University, New York, NY 10021;
| | - Tsvi Tlusty
- Center for Soft and Living Matter, Institute for Basic Science, Ulsan 44919, Korea;
- Department of Physics, Ulsan National Institute of Science and Technology, Ulsan 44919, Korea
| |
Collapse
|
286
|
Simkovic F, Thomas JMH, Rigden DJ. ConKit: a python interface to contact predictions. Bioinformatics 2018; 33:2209-2211. [PMID: 28369168 PMCID: PMC5870551 DOI: 10.1093/bioinformatics/btx148] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2016] [Accepted: 03/14/2017] [Indexed: 11/19/2022] Open
Abstract
Summary Recent advances in protein residue contact prediction algorithms have led to the emergence of many new methods and a variety of file formats. We present ConKit, an open source, modular and extensible Python interface which allows facile conversion between formats and provides an interface to analyses of sequence alignments and sets of contact predictions. Availability and Implementation ConKit is available via the Python Package Index. The documentation can be found at http://www.conkit.org. ConKit is licensed under the BSD 3-Clause. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Felix Simkovic
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
| | - Jens M H Thomas
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
| | - Daniel J Rigden
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
| |
Collapse
|
287
|
Fonseca-Júnior NJ, Afonso MQ, Oliveira LC, Bleicher L. PFstats: A Network-Based Open Tool for Protein Family Analysis. J Comput Biol 2018; 25:480-486. [DOI: 10.1089/cmb.2017.0181] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Néli J. Fonseca-Júnior
- Departamento de Bioquimica e Imunologia, Instituto de Ciências Biologicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| | - Marcelo Q.L. Afonso
- Departamento de Bioquimica e Imunologia, Instituto de Ciências Biologicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| | - Lucas C. Oliveira
- Departamento de Bioquimica e Imunologia, Instituto de Ciências Biologicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| | - Lucas Bleicher
- Departamento de Bioquimica e Imunologia, Instituto de Ciências Biologicas (ICB), Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| |
Collapse
|
288
|
Raimondi D, Orlando G, Moreau Y, Vranken WF. Ultra-fast global homology detection with Discrete Cosine Transform and Dynamic Time Warping. Bioinformatics 2018; 34:3118-3125. [DOI: 10.1093/bioinformatics/bty309] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2017] [Accepted: 04/18/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Daniele Raimondi
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- Machine Learning Group, Université Libre De Bruxelles, Brussels, Belgium
| | - Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Machine Learning Group, Université Libre De Bruxelles, Brussels, Belgium
| | - Yves Moreau
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- Imec, Leuven, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
- Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
| |
Collapse
|
289
|
Mao W, Wang T, Zhang W, Gong H. Identification of residue pairing in interacting β-strands from a predicted residue contact map. BMC Bioinformatics 2018; 19:146. [PMID: 29673311 PMCID: PMC5907701 DOI: 10.1186/s12859-018-2150-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2017] [Accepted: 04/09/2018] [Indexed: 12/04/2022] Open
Abstract
Background Despite the rapid progress of protein residue contact prediction, predicted residue contact maps frequently contain many errors. However, information of residue pairing in β strands could be extracted from a noisy contact map, due to the presence of characteristic contact patterns in β-β interactions. This information may benefit the tertiary structure prediction of mainly β proteins. In this work, we propose a novel ridge-detection-based β-β contact predictor to identify residue pairing in β strands from any predicted residue contact map. Results Our algorithm RDb2C adopts ridge detection, a well-developed technique in computer image processing, to capture consecutive residue contacts, and then utilizes a novel multi-stage random forest framework to integrate the ridge information and additional features for prediction. Starting from the predicted contact map of CCMpred, RDb2C remarkably outperforms all state-of-the-art methods on two conventional test sets of β proteins (BetaSheet916 and BetaSheet1452), and achieves F1-scores of ~ 62% and ~ 76% at the residue level and strand level, respectively. Taking the prediction of the more advanced RaptorX-Contact as input, RDb2C achieves impressively higher performance, with F1-scores reaching ~ 76% and ~ 86% at the residue level and strand level, respectively. In a test of structural modeling using the top 1 L predicted contacts as constraints, for 61 mainly β proteins, the average TM-score achieves 0.442 when using the raw RaptorX-Contact prediction, but increases to 0.506 when using the improved prediction by RDb2C. Conclusion Our method can significantly improve the prediction of β-β contacts from any predicted residue contact maps. Prediction results of our algorithm could be directly applied to effectively facilitate the practical structure prediction of mainly β proteins. Availability All source data and codes are available at http://166.111.152.91/Downloads.html or the GitHub address of https://github.com/wzmao/RDb2C. Electronic supplementary material The online version of this article (10.1186/s12859-018-2150-1) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Wenzhi Mao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China.,Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
| | - Tong Wang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China.,Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
| | - Wenxuan Zhang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China.,Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China
| | - Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China. .,Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing, China.
| |
Collapse
|
290
|
Gil N, Fiser A. Identifying functionally informative evolutionary sequence profiles. Bioinformatics 2018; 34:1278-1286. [PMID: 29211823 PMCID: PMC5905606 DOI: 10.1093/bioinformatics/btx779] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2017] [Accepted: 11/29/2017] [Indexed: 01/06/2023] Open
Abstract
Motivation Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. Results We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact andras.fiser@einstein.yu.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
| | - Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
| |
Collapse
|
291
|
de Oliveira SHP, Law EC, Shi J, Deane CM. Sequential search leads to faster, more efficient fragment-based de novo protein structure prediction. Bioinformatics 2018; 34:1132-1140. [PMID: 29136098 PMCID: PMC6030820 DOI: 10.1093/bioinformatics/btx722] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2017] [Revised: 09/22/2017] [Accepted: 11/04/2017] [Indexed: 01/12/2023] Open
Abstract
Motivation Most current de novo structure prediction methods randomly sample protein conformations and thus require large amounts of computational resource. Here, we consider a sequential sampling strategy, building on ideas from recent experimental work which shows that many proteins fold cotranslationally. Results We have investigated whether a pseudo-greedy search approach, which begins sequentially from one of the termini, can improve the performance and accuracy of de novo protein structure prediction. We observed that our sequential approach converges when fewer than 20 000 decoys have been produced, fewer than commonly expected. Using our software, SAINT2, we also compared the run time and quality of models produced in a sequential fashion against a standard, non-sequential approach. Sequential prediction produces an individual decoy 1.5-2.5 times faster than non-sequential prediction. When considering the quality of the best model, sequential prediction led to a better model being produced for 31 out of 41 soluble protein validation cases and for 18 out of 24 transmembrane protein cases. Correct models (TM-Score > 0.5) were produced for 29 of these cases by the sequential mode and for only 22 by the non-sequential mode. Our comparison reveals that a sequential search strategy can be used to drastically reduce computational time of de novo protein structure prediction and improve accuracy. Availability and implementation Data are available for download from: http://opig.stats.ox.ac.uk/resources. SAINT2 is available for download from: https://github.com/sauloho/SAINT2. Contact saulo.deoliveira@dtc.ox.ac.uk. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Eleanor C Law
- Department of Statistics, University of Oxford, Oxford, UK
| | - Jiye Shi
- Department of Informatics, UCB Pharma, Slough, UK
- Division of Physical Biology, Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai, China
| | | |
Collapse
|
292
|
Nicoludis JM, Gaudet R. Applications of sequence coevolution in membrane protein biochemistry. BIOCHIMICA ET BIOPHYSICA ACTA. BIOMEMBRANES 2018; 1860:895-908. [PMID: 28993150 PMCID: PMC5807202 DOI: 10.1016/j.bbamem.2017.10.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/19/2017] [Revised: 09/28/2017] [Accepted: 10/02/2017] [Indexed: 12/22/2022]
Abstract
Recently, protein sequence coevolution analysis has matured into a predictive powerhouse for protein structure and function. Direct methods, which use global statistical models of sequence coevolution, have enabled the prediction of membrane and disordered protein structures, protein complex architectures, and the functional effects of mutations in proteins. The field of membrane protein biochemistry and structural biology has embraced these computational techniques, which provide functional and structural information in an otherwise experimentally-challenging field. Here we review recent applications of protein sequence coevolution analysis to membrane protein structure and function and highlight the promising directions and future obstacles in these fields. We provide insights and guidelines for membrane protein biochemists who wish to apply sequence coevolution analysis to a given experimental system.
Collapse
Affiliation(s)
- John M Nicoludis
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, United States
| | - Rachelle Gaudet
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, United States.
| |
Collapse
|
293
|
He B, Mortuza SM, Wang Y, Shen HB, Zhang Y. NeBcon: protein contact map prediction using neural network training coupled with naïve Bayes classifiers. Bioinformatics 2018; 33:2296-2306. [PMID: 28369334 DOI: 10.1093/bioinformatics/btx164] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2016] [Accepted: 03/21/2017] [Indexed: 12/12/2022] Open
Abstract
Motivation Recent CASP experiments have witnessed exciting progress on folding large-size non-humongous proteins with the assistance of co-evolution based contact predictions. The success is however anecdotal due to the requirement of the contact prediction methods for the high volume of sequence homologs that are not available to most of the non-humongous protein targets. Development of efficient methods that can generate balanced and reliable contact maps for different type of protein targets is essential to enhance the success rate of the ab initio protein structure prediction. Results We developed a new pipeline, NeBcon, which uses the naïve Bayes classifier (NBC) theorem to combine eight state of the art contact methods that are built from co-evolution and machine learning approaches. The posterior probabilities of the NBC model are then trained with intrinsic structural features through neural network learning for the final contact map prediction. NeBcon was tested on 98 non-redundant proteins, which improves the accuracy of the best co-evolution based meta-server predictor by 22%; the magnitude of the improvement increases to 45% for the hard targets that lack sequence and structural homologs in the databases. Detailed data analysis showed that the major contribution to the improvement is due to the optimized NBC combination of the complementary information from both co-evolution and machine learning predictions. The neural network training also helps to improve the coupling of the NBC posterior probability and the intrinsic structural features, which were found particularly important for the proteins that do not have sufficient number of homologous sequences to derive reliable co-evolution profiles. Availiablity and Implementation On-line server and standalone package of the program are available at http://zhanglab.ccmb.med.umich.edu/NeBcon/ . Contact zhng@umich.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Baoji He
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China.,School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.,Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yanting Wang
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China.,School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hong-Bin Shen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
294
|
Douam F, Fusil F, Enguehard M, Dib L, Nadalin F, Schwaller L, Hrebikova G, Mancip J, Mailly L, Montserret R, Ding Q, Maisse C, Carlot E, Xu K, Verhoeyen E, Baumert TF, Ploss A, Carbone A, Cosset FL, Lavillette D. A protein coevolution method uncovers critical features of the Hepatitis C Virus fusion mechanism. PLoS Pathog 2018; 14:e1006908. [PMID: 29505618 PMCID: PMC5854445 DOI: 10.1371/journal.ppat.1006908] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2017] [Revised: 03/15/2018] [Accepted: 01/26/2018] [Indexed: 12/15/2022] Open
Abstract
Amino-acid coevolution can be referred to mutational compensatory patterns preserving the function of a protein. Viral envelope glycoproteins, which mediate entry of enveloped viruses into their host cells, are shaped by coevolution signals that confer to viruses the plasticity to evade neutralizing antibodies without altering viral entry mechanisms. The functions and structures of the two envelope glycoproteins of the Hepatitis C Virus (HCV), E1 and E2, are poorly described. Especially, how these two proteins mediate the HCV fusion process between the viral and the cell membrane remains elusive. Here, as a proof of concept, we aimed to take advantage of an original coevolution method recently developed to shed light on the HCV fusion mechanism. When first applied to the well-characterized Dengue Virus (DENV) envelope glycoproteins, coevolution analysis was able to predict important structural features and rearrangements of these viral protein complexes. When applied to HCV E1E2, computational coevolution analysis predicted that E1 and E2 refold interdependently during fusion through rearrangements of the E2 Back Layer (BL). Consistently, a soluble BL-derived polypeptide inhibited HCV infection of hepatoma cell lines, primary human hepatocytes and humanized liver mice. We showed that this polypeptide specifically inhibited HCV fusogenic rearrangements, hence supporting the critical role of this domain during HCV fusion. By combining coevolution analysis and in vitro assays, we also uncovered functionally-significant coevolving signals between E1 and E2 BL/Stem regions that govern HCV fusion, demonstrating the accuracy of our coevolution predictions. Altogether, our work shed light on important structural features of the HCV fusion mechanism and contributes to advance our functional understanding of this process. This study also provides an important proof of concept that coevolution can be employed to explore viral protein mediated-processes, and can guide the development of innovative translational strategies against challenging human-tropic viruses. Several virus-mediated molecular processes remain poorly described, which dampen the development of potent anti-viral therapies. Hence, new experimental strategies need to be undertaken to improve and accelerate our understanding of these processes. Here, as a proof of concept, we employ amino-acid coevolution as a tool to gain insights into the structural rearrangements of Hepatitis C Virus (HCV) envelope glycoproteins E1 and E2 during virus fusion with the cell membrane, and provide a basis for the inhibition of this process. Our coevolution analysis predicted that a specific domain of E2, the Back Layer (BL) is involved into significant conformational changes with E1 during the fusion of the HCV membrane with the cellular membrane. Consistently, a recombinant, soluble form of the BL was able to inhibit E1E2 fusogenic rearrangements and HCV infection. Moreover, predicted coevolution networks involving E1 and BL residues, as well as E1 and BL-adjacent residues, were found to modulate virus fusion. Our data shows that coevolution analysis is a powerful and underused approach that can provide significant insights into the functions and structural rearrangements of viral proteins. Importantly, this approach can also provide structural and molecular basis for the design of effective anti-viral drugs, and opens new perspectives to rapidly identify effective antiviral strategies against emerging and re-emerging viral pathogens.
Collapse
Affiliation(s)
- Florian Douam
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
- CNRS UMR5557 Microbial ecology, Université Claude Bernard Lyon 1, INRA, UMR1418, Villeurbanne, France
- Department of Molecular Biology, Princeton University, Princeton NJ, United States of America
| | - Floriane Fusil
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
| | - Margot Enguehard
- CNRS UMR5557 Microbial ecology, Université Claude Bernard Lyon 1, INRA, UMR1418, Villeurbanne, France
- University of Lyon, Université Claude Bernard Lyon1, INRA, EPHE, IVPC, Viral Infections and Comparative Pathology, UMR754, Lyon, France
- Institut Hospitalo-Universitaire, Pôle Hépato-digestif, Hôpitaux Universitaires de Strasbourg, Strasbourg, France
| | - Linda Dib
- Molecular Phylogenetics and Speciation, Département d’écologie et évolution, Université de Lausanne, Lausanne, Suisse
| | - Francesca Nadalin
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
| | - Loïc Schwaller
- Mathematical Institute, Leiden University, Leiden, The Netherlands
| | - Gabriela Hrebikova
- Department of Molecular Biology, Princeton University, Princeton NJ, United States of America
| | - Jimmy Mancip
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
| | - Laurent Mailly
- Inserm, U1110, Institut de Recherche sur les Maladies Virales et Hépatiques, Strasbourg, France
- Université de Strasbourg, Strasbourg, France
| | - Roland Montserret
- Institut de Biologie et Chimie des Protéines, Bases Moléculaires et Structurales des Systèmes Infectieux, Labex Ecofect, UMR 5086 CNRS, Université de Lyon, Lyon, France
| | - Qiang Ding
- Department of Molecular Biology, Princeton University, Princeton NJ, United States of America
| | - Carine Maisse
- University of Lyon, Université Claude Bernard Lyon1, INRA, EPHE, IVPC, Viral Infections and Comparative Pathology, UMR754, Lyon, France
| | - Emilie Carlot
- CAS Key Laboratory of Molecular Virology and Immunology, Unit of interspecies transmission of arboviruses and antivirals, Institut Pasteur of Shanghai, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Ke Xu
- CAS Key Laboratory of Molecular Virology and Immunology, Unit of interspecies transmission of arboviruses and antivirals, Institut Pasteur of Shanghai, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
| | - Els Verhoeyen
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
| | - Thomas F. Baumert
- Institut Hospitalo-Universitaire, Pôle Hépato-digestif, Hôpitaux Universitaires de Strasbourg, Strasbourg, France
- Inserm, U1110, Institut de Recherche sur les Maladies Virales et Hépatiques, Strasbourg, France
- Université de Strasbourg, Strasbourg, France
| | - Alexander Ploss
- Department of Molecular Biology, Princeton University, Princeton NJ, United States of America
| | - Alessandra Carbone
- Sorbonne Université, CNRS, IBPS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France
- Institut Universitaire de France, Paris, France
- * E-mail: (FLC); (AC); (DL)
| | - François-Loïc Cosset
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
- * E-mail: (FLC); (AC); (DL)
| | - Dimitri Lavillette
- CIRI–International Center for Infectiology Research, Team EVIR, Inserm, U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, Ecole Normale Supérieure de Lyon, Univ Lyon, Lyon, France
- CNRS UMR5557 Microbial ecology, Université Claude Bernard Lyon 1, INRA, UMR1418, Villeurbanne, France
- University of Lyon, Université Claude Bernard Lyon1, INRA, EPHE, IVPC, Viral Infections and Comparative Pathology, UMR754, Lyon, France
- CAS Key Laboratory of Molecular Virology and Immunology, Unit of interspecies transmission of arboviruses and antivirals, Institut Pasteur of Shanghai, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (FLC); (AC); (DL)
| |
Collapse
|
295
|
Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. REPORTS ON PROGRESS IN PHYSICS. PHYSICAL SOCIETY (GREAT BRITAIN) 2018; 81:032601. [PMID: 29120346 DOI: 10.1088/1361-6633/aa9965] [Citation(s) in RCA: 126] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview over some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Collapse
Affiliation(s)
- Simona Cocco
- Laboratoire de Physique Statistique de l'Ecole Normale Supérieure-UMR 8549, CNRS and PSL Research, Sorbonne Universités UPMC, Paris, France
| | | | | | | | | |
Collapse
|
296
|
Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin AM. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins 2018; 86 Suppl 1:51-66. [PMID: 29071738 PMCID: PMC5820169 DOI: 10.1002/prot.25407] [Citation(s) in RCA: 126] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2017] [Revised: 10/06/2017] [Accepted: 10/24/2017] [Indexed: 12/20/2022]
Abstract
Following up on the encouraging results of residue-residue contact prediction in the CASP11 experiment, we present the analysis of predictions submitted for CASP12. The submissions include predictions of 34 groups for 38 domains classified as free modeling targets which are not accessible to homology-based modeling due to a lack of structural templates. CASP11 saw a rise of coevolution-based methods outperforming other approaches. The improvement of these methods coupled to machine learning and sequence database growth are most likely the main driver for a significant improvement in average precision from 27% in CASP11 to 47% in CASP12. In more than half of the targets, especially those with many homologous sequences accessible, precisions above 90% were achieved with the best predictors reaching a precision of 100% in some cases. We furthermore tested the impact of using these contacts as restraints in ab initio modeling of 14 single-domain free modeling targets using Rosetta. Adding contacts to the Rosetta calculations resulted in improvements of up to 26% in GDT_TS within the top five structures.
Collapse
Affiliation(s)
- Joerg Schaarschmidt
- Faculty of Science ‐ ChemistryComputational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht UniversityUtrechtThe Netherlands
| | | | | | - Alexandre M.J.J. Bonvin
- Faculty of Science ‐ ChemistryComputational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht UniversityUtrechtThe Netherlands
| |
Collapse
|
297
|
Garrido-Martín D, Pazos F. Effect of the sequence data deluge on the performance of methods for detecting protein functional residues. BMC Bioinformatics 2018; 19:67. [PMID: 29482506 PMCID: PMC5827975 DOI: 10.1186/s12859-018-2084-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2017] [Accepted: 02/21/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The exponential accumulation of new sequences in public databases is expected to improve the performance of all the approaches for predicting protein structural and functional features. Nevertheless, this was never assessed or quantified for some widely used methodologies, such as those aimed at detecting functional sites and functional subfamilies in protein multiple sequence alignments. Using raw protein sequences as only input, these approaches can detect fully conserved positions, as well as those with a family-dependent conservation pattern. Both types of residues are routinely used as predictors of functional sites and, consequently, understanding how the sequence content of the databases affects them is relevant and timely. RESULTS In this work we evaluate how the growth and change with time in the content of sequence databases affect five sequence-based approaches for detecting functional sites and subfamilies. We do that by recreating historical versions of the multiple sequence alignments that would have been obtained in the past based on the database contents at different time points, covering a period of 20 years. Applying the methods to these historical alignments allows quantifying the temporal variation in their performance. Our results show that the number of families to which these methods can be applied sharply increases with time, while their ability to detect potentially functional residues remains almost constant. CONCLUSIONS These results are informative for the methods' developers and final users, and may have implications in the design of new sequencing initiatives.
Collapse
Affiliation(s)
- Diego Garrido-Martín
- Present address: Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, c/ Dr. Aiguader, 88, 08003, Barcelona, Spain.,Present address: Universitat Pompeu Fabra (UPF), Plaça de la Mercè, 10-12, 08002, Barcelona, Spain
| | - Florencio Pazos
- Computational Systems Biology Group, Systems Biology Program, National Centre for Biotechnology (CNB-CSIC), c/ Darwin, 3, 28049, Madrid, Spain.
| |
Collapse
|
298
|
Le Q, Sievers F, Higgins DG. Protein multiple sequence alignment benchmarking through secondary structure prediction. Bioinformatics 2018; 33:1331-1337. [PMID: 28093407 PMCID: PMC5408826 DOI: 10.1093/bioinformatics/btw840] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2016] [Accepted: 01/10/2017] [Indexed: 12/26/2022] Open
Abstract
Motivation Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has lead to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of ‘true’ alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA. Results In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA measures the quality of an entire alignment however, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA alignment options and by including different levels of mis-alignment into MSA, and examining the effects on the scores. Availability and Implementation QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Quan Le
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| | - Fabian Sievers
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| | - Desmond G Higgins
- Conway Institute, UCD School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Dublin 4, Ireland
| |
Collapse
|
299
|
Barrat-Charlaix P, Weigt M. [From sequence variability to structural and functional prediction: modeling of homologous protein families]. Biol Aujourdhui 2018; 211:239-244. [PMID: 29412135 DOI: 10.1051/jbio/2017030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Indexed: 06/08/2023]
Abstract
Thanks to next-generation sequencing, the number of sequenced genomes grows rapidly, providing in particular ample examples for the sequence variability between homologous proteins. This article discusses data-driven probabilistic sequence models, which are able to extract a multitude of information from sequence data alone, including (i) structural features like residue-residue contacts, which are formed in the folded protein, (ii) protein-protein interaction interfaces and (iii) phenotypic effects of amino-acid substitutions in proteins.
Collapse
Affiliation(s)
- Pierre Barrat-Charlaix
- Sorbonne Universités, UPMC Université Paris 06, CNRS, Biologie Computationnelle et Quantitative, Institut de Biologie Paris Seine, 75005 Paris, France
| | - Martin Weigt
- Sorbonne Universités, UPMC Université Paris 06, CNRS, Biologie Computationnelle et Quantitative, Institut de Biologie Paris Seine, 75005 Paris, France
| |
Collapse
|
300
|
Cortés Cabrera Á, Sánchez-Murcia PA, Gago F. Making sense of the past: hyperstability of ancestral thioredoxins explained by free energy simulations. Phys Chem Chem Phys 2018; 19:23239-23246. [PMID: 28825743 DOI: 10.1039/c7cp03659k] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Thioredoxin (Trx), a small and globular protein, is present in all kinds of organisms, from Archea to higher mammals. Throughout evolution, the Trx sequence has undergone subtle modifications to adapt to varying environmental conditions. The high degree of sequence conservation makes Trx very amenable to ancestral protein reconstruction techniques. In this work, we address the study of the structural and energetic determinants of thermostability in E. coli Trx using a dataset of mutations inspired by ancestral reconstruction. We compute, from first principles, the expected contribution of 19 different amino acid substitutions to the stability (ΔΔG) and the melting temperature (ΔTm) of the protein. We also describe the specific changes in structure and protein dynamics responsible for the stabilizing or destabilizing effects of these mutations. Our results point to local and independent changes for most of the variants. Our predictions are accurate enough to substantiate the proposal of new hypotheses regarding evolutionary relationships between mutations, as in the case of T89R, P68A and G74S or K90L and F102A, and reach beyond the initial set to suggest improved variants, such as K90I or K90Y.
Collapse
Affiliation(s)
- Álvaro Cortés Cabrera
- Área de Farmacología, Departamento de Ciencias Biomédicas, Facultad de Medicina y Ciencias de la Salud, Universidad de Alcalá, E-28805 Alcalá de Henares, Madrid, Spain.
| | - Pedro A Sánchez-Murcia
- Área de Farmacología, Departamento de Ciencias Biomédicas, Facultad de Medicina y Ciencias de la Salud, Universidad de Alcalá, E-28805 Alcalá de Henares, Madrid, Spain.
| | - Federico Gago
- Área de Farmacología, Departamento de Ciencias Biomédicas, Facultad de Medicina y Ciencias de la Salud, Universidad de Alcalá, E-28805 Alcalá de Henares, Madrid, Spain.
| |
Collapse
|