451
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol 2015; 11:e1004226. [PMID: 25950956 PMCID: PMC4423992 DOI: 10.1371/journal.pcbi.1004226] [Citation(s) in RCA: 882] [Impact Index Per Article: 88.2] [Received: 08/19/2014] [Accepted: 03/02/2015] [Indexed: 11/19/2022] Open
Abstract
16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions (from metabolic and immunological health in mammals to ecological stability in soils and oceans), identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from amplicon-based datasets are compositional: counts are normalized to the total number of counts in the sample. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation) for the detection of OTU-OTU relationships can lead to spurious results. Second, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU association networks is severely under-powered, and additional information (or assumptions) is required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological networks from amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark in the absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a set of computational tools to generate OTU count data from a set of diverse underlying network topologies.
SPIEC-EASI outperforms state-of-the-art methods to recover edges and network properties on synthetic data under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbial associations using data from the American Gut project.
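The compositional constraint described in this abstract can be illustrated with a centered log-ratio (clr) transform, one standard transformation from compositional data analysis. The abstract does not say which transformation is applied, so the choice of clr, the function name `clr`, the pseudocount, and the example counts are all assumptions made for this sketch; it is not a description of the SPIEC-EASI implementation.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's OTU counts.

    A pseudocount (an assumption of this sketch) is added so that
    zero counts have a finite logarithm.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is equivalent to dividing each value
    # by the geometric mean of the sample before taking the log.
    return [value - mean_log for value in logs]


# Hypothetical counts for four OTUs in a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
```

Within each sample the transformed values sum to zero (up to floating-point error), which removes the dependence on the total read count that makes raw relative abundances hard to correlate directly.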
Affiliation(s)
- Zachary D. Kurtz
  - Departments of Microbiology and Medicine, New York University School of Medicine, New York, New York, United States of America
- Christian L. Müller
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
  - Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
- Emily R. Miraldi
  - Departments of Microbiology and Medicine, New York University School of Medicine, New York, New York, United States of America
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
  - Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
- Dan R. Littman
  - Departments of Microbiology and Medicine, New York University School of Medicine, New York, New York, United States of America
- Martin J. Blaser
  - Departments of Microbiology and Medicine, New York University School of Medicine, New York, New York, United States of America
- Richard A. Bonneau
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, New York, United States of America
  - Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
  - Simons Center for Data Analysis, Simons Foundation, New York, New York, United States of America
452
Banach M, Prudhomme N, Carpentier M, Duprat E, Papandreou N, Kalinowska B, Chomilier J, Roterman I. Contribution to the prediction of the fold code: application to immunoglobulin and flavodoxin cases. PLoS One 2015; 10:e0125098. [PMID: 25915049 PMCID: PMC4411048 DOI: 10.1371/journal.pone.0125098] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 12/16/2014] [Accepted: 03/20/2015] [Indexed: 12/19/2022] Open
Abstract
Background: The formation of the folding nucleus of globular proteins starts with the mutual interaction of a group of hydrophobic amino acids whose close contacts allow subsequent formation and stability of the 3D structure. These early steps can be predicted by simulating the folding process with a coarse-grained Monte Carlo (MC) model in a discrete space. We previously defined MIRs (Most Interacting Residues) as the set of residues presenting a large number of non-covalent neighbour interactions during such a simulation. MIRs are good candidates to define the minimal number of residues giving rise to a given fold instead of another one, although their proportion is rather high, typically 15-20% of the sequences. With experiments in mind that involve two sequences of very high sequence identity (up to 90%) but different folds, we combined the MIR method, which takes the sequence as its single input, with the “fuzzy oil drop” (FOD) model, which requires a 3D structure, in order to estimate the residues coding for the fold. FOD assumes that a globular protein follows an idealised 3D Gaussian distribution of hydrophobicity density, with the maximum in the centre and minima at the surface of the “drop”. If the actual local density of hydrophobicity around a given amino acid is as high as the ideal one, then this amino acid is assigned to the core of the globular protein, and it is assumed to follow the FOD model. One therefore obtains a classification of the amino acids of a protein according to whether they agree with or deviate from the FOD model. Results: We compared and combined the MIR and FOD methods to define the minimal nucleus, or keystone, of two populated folds: immunoglobulin-like (Ig) and flavodoxins (Flav). The combination of these two approaches identifies positions that are both predicted as MIRs and assigned as accordant with the FOD model. It is shown here that, for these two folds, the intersection of the predicted sets of residues differs significantly from random selection. It reduces the number of residues selected by each individual method and allows a reasonable agreement with experimentally determined key residues coding for the particular fold. In addition, the intersection of the two methods significantly increases the specificity of the prediction, providing a robust set of residues that constitute the folding nucleus.
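The claim that the intersection of two predicted residue sets differs from random selection can be made concrete with a hypergeometric tail probability. The abstract does not state which test the authors used, so the hypergeometric model, the function name `overlap_p_value`, and the example set sizes below are assumptions chosen only to illustrate the idea.

```python
from math import comb


def overlap_p_value(n_total, n_a, n_b, n_overlap):
    """Probability of an overlap of at least n_overlap positions when
    two sets of sizes n_a and n_b are drawn at random from n_total
    positions (hypergeometric upper tail)."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_total, n_b)


# Hypothetical example: a 100-residue protein in which one method
# selects 18 positions, the other selects 25, and 10 are shared.
p_value = overlap_p_value(n_total=100, n_a=18, n_b=25, n_overlap=10)
```

A small tail probability indicates that the shared positions are unlikely to arise from two independent random selections of the same sizes.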
Affiliation(s)
- Mateusz Banach
  - Department of Bioinformatics and Telemedicine, Medical College, Jagiellonian University, Krakow, Poland
- Nicolas Prudhomme
  - Protein Structure Prediction group, IMPMC, UPMC & CNRS, Paris, France
- Mathilde Carpentier
  - Protein Structure Prediction group, IMPMC, UPMC & CNRS, Paris, France
  - RPBS, 35 rue Hélène Brion, 75013, Paris, France
- Elodie Duprat
  - Protein Structure Prediction group, IMPMC, UPMC & CNRS, Paris, France
  - RPBS, 35 rue Hélène Brion, 75013, Paris, France
- Nikolaos Papandreou
  - Genetics Department, Agricultural University of Athens, Iera Odos 75, Athens, Greece
- Barbara Kalinowska
  - Department of Bioinformatics and Telemedicine, Medical College, Jagiellonian University, Krakow, Poland
- Jacques Chomilier
  - Protein Structure Prediction group, IMPMC, UPMC & CNRS, Paris, France
  - RPBS, 35 rue Hélène Brion, 75013, Paris, France
  - * E-mail: (JC); (IR)
- Irena Roterman
  - Department of Bioinformatics and Telemedicine, Medical College, Jagiellonian University, Krakow, Poland
  - * E-mail: (JC); (IR)
453
de Oliveira SHP, Shi J, Deane CM. Building a better fragment library for de novo protein structure prediction. PLoS One 2015; 10:e0123998. [PMID: 25901595 PMCID: PMC4406757 DOI: 10.1371/journal.pone.0123998] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 11/17/2014] [Accepted: 02/25/2015] [Indexed: 01/11/2023] Open
Abstract
Fragment-based approaches are the current standard for de novo protein structure prediction. These approaches rely on accurate and reliable fragment libraries to generate good structural models. In this work, we describe a novel method for structure fragment library generation and its application in fragment-based de novo protein structure prediction. The importance of correct testing procedures in assessing the quality of fragment libraries is demonstrated, in particular the exclusion of homologs to the target from the libraries to correctly simulate a de novo protein structure prediction scenario, something which surprisingly is not always done. We demonstrate that fragments presenting different predominant predicted secondary structures should be treated differently during the fragment library generation step and that exhaustive and random search strategies should both be used. This information was used to develop a novel method, Flib. On a validation set of 41 structurally diverse proteins, Flib libraries present both a higher precision and coverage than two of the state-of-the-art methods, NNMake and HHFrag. Flib also achieves better precision and coverage on the set of 275 protein domains used in the two previous experiments of the Critical Assessment of Structure Prediction (CASP9 and CASP10). We compared Flib libraries against NNMake libraries in a structure prediction context. Of the 13 cases in which a correct answer was generated, Flib models were more accurate than NNMake models for 10. Flib is available for download at: http://www.stats.ox.ac.uk/research/proteins/resources.
Affiliation(s)
- Jiye Shi
  - Department of Informatics, UCB Pharma, Slough, United Kingdom
  - Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai, China
- Charlotte M. Deane
  - Department of Statistics, Oxford University, Oxford, Oxfordshire, United Kingdom
454
Li F, Liu J, Garavito RM, Ferguson-Miller S. Evolving understanding of translocator protein 18 kDa (TSPO). Pharmacol Res 2015; 99:404-9. [PMID: 25882248 DOI: 10.1016/j.phrs.2015.03.022] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 03/02/2015] [Revised: 03/25/2015] [Accepted: 03/27/2015] [Indexed: 02/01/2023]
Abstract
The translocator protein 18 kDa (TSPO) has been the focus of intense research by the biomedical community and the pharmaceutical industry because of its apparent involvement in many disease-related processes. These include steroidogenesis, apoptosis, inflammation, neurological disease and cancer, resulting in the use of TSPO as a biomarker and its potential as a drug target. Despite more than 30 years of study, the precise function of TSPO remains elusive. A recent breakthrough in determining the high-resolution crystal structures of bacterial homologs of mitochondrial TSPO provides new insight into the structural and functional properties at a molecular level and new opportunities for investigating the significance of this ancient and highly conserved protein family. The availability of atomic level structural information from different species also provides a platform for structure-based drug development. Here we briefly review current knowledge regarding TSPO and the implications of the new structures with respect to hypotheses and controversies in the field.
Affiliation(s)
- Fei Li
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Jian Liu
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- R Michael Garavito
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Shelagh Ferguson-Miller
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
455
Currin A, Swainston N, Day PJ, Kell DB. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev 2015; 44:1172-239. [PMID: 25503938 PMCID: PMC4349129 DOI: 10.1039/c4cs00351a] [Citation(s) in RCA: 258] [Impact Index Per Article: 25.8] [Received: 10/20/2014] [Indexed: 12/21/2022]
Abstract
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the 'search space' of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the 'best' amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. 
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust.
Affiliation(s)
- Andrew Currin
  - Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
  - School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
  - Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Neil Swainston
  - Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
  - Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
  - School of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Philip J. Day
  - Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
  - Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
  - Faculty of Medical and Human Sciences, The University of Manchester, Manchester M13 9PT, UK
- Douglas B. Kell
  - Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
  - School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
  - Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
456
Ochoa D, Juan D, Valencia A, Pazos F. Detection of significant protein coevolution. Bioinformatics 2015; 31:2166-73. [PMID: 25717190 DOI: 10.1093/bioinformatics/btv102] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 06/18/2014] [Accepted: 02/11/2015] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The evolution of proteins cannot be fully understood without taking into account the coevolutionary linkages entangling them. From a practical point of view, coevolution between protein families has been used as a way of detecting protein interactions and functional relationships from genomic information. The most common approach to inferring protein coevolution involves the quantification of phylogenetic tree similarity using a family of methodologies termed mirrortree. In spite of their success, a fundamental problem of these approaches is the lack of an adequate statistical framework to assess the significance of a given coevolutionary score (tree similarity). As a consequence, a number of ad hoc filters and arbitrary thresholds are required in an attempt to obtain a final set of confident coevolutionary signals. RESULTS: In this work, we developed a method for associating confidence estimators (P values) to the tree-similarity scores, using a null model specifically designed for the tree comparison problem. We show how this approach greatly improves the quality and coverage (number of pairs that can be evaluated) of the detected coevolution in all the stages of the mirrortree workflow, independently of the starting genomic information. This leads not only to a better understanding of protein coevolution and its biological implications, but also to a highly reliable and comprehensive network of predicted interactions, as well as information on the substructure of macromolecular complexes, using only genomic information. AVAILABILITY AND IMPLEMENTATION: The software and datasets used in this work are freely available at: http://csbg.cnb.csic.es/pMT/. CONTACT: pazos@cnb.csic.es SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
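The idea of attaching a P value to a tree-similarity score can be sketched as follows. This is a generic illustration only: the similarity here is a Pearson correlation between the inter-species distance matrices of two families, and the null distribution comes from shuffling species labels. Both choices, and all function names, are assumptions of this sketch; the abstract states that the authors use a null model designed specifically for tree comparison, which this simple permutation scheme does not reproduce.

```python
import random
from statistics import mean


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def tree_similarity(dist_a, dist_b):
    """Similarity score: correlation between two distance matrices."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


def permutation_p_value(dist_a, dist_b, n_perm=999, seed=0):
    """Empirical P value from shuffling the species labels of dist_b."""
    rng = random.Random(seed)
    observed = tree_similarity(dist_a, dist_b)
    n = len(dist_b)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        shuffled = [[dist_b[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if tree_similarity(dist_a, shuffled) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed arrangement itself.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical distances between six species, identical for both families.
positions = [0, 1, 3, 7, 12, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
score, p_value = permutation_p_value(dist, dist)
```

Replacing a fixed score threshold with an empirical P value of this kind is what allows pairs to be compared on a common scale, which is the motivation given in the abstract.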
Affiliation(s)
- David Ochoa
  - Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid and Structural Bioinformatics Group, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro 3, 28029 Madrid, Spain
- David Juan
  - Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid and Structural Bioinformatics Group, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Alfonso Valencia
  - Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid and Structural Bioinformatics Group, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Florencio Pazos
  - Computational Systems Biology Group, National Centre for Biotechnology (CNB-CSIC), C/ Darwin 3, 28049 Madrid and Structural Bioinformatics Group, Spanish National Cancer Research Centre (CNIO), C/ Melchor Fernández Almagro 3, 28029 Madrid, Spain
457
Mao W, Kaya C, Dutta A, Horovitz A, Bahar I. Comparative study of the effectiveness and limitations of current methods for detecting sequence coevolution. Bioinformatics 2015; 31:1929-37. [PMID: 25697822 PMCID: PMC4481699 DOI: 10.1093/bioinformatics/btv103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 08/24/2014] [Accepted: 02/02/2015] [Indexed: 01/02/2023] Open
Abstract
Motivation: With rapid accumulation of sequence data on several species, extracting rational and systematic information from multiple sequence alignments (MSAs) is becoming increasingly important. Currently, there is a plethora of computational methods for investigating coupled evolutionary changes in pairs of positions along the amino acid sequence, and making inferences on structure and function. Yet, the significance of coevolution signals remains to be established. Also, a large number of false positives (FPs) arise from insufficient MSA size, phylogenetic background and indirect couplings. Results: Here, a set of 16 pairs of non-interacting proteins is thoroughly examined to assess the effectiveness and limitations of different methods. The analysis shows that recent computationally expensive methods designed to remove biases from indirect couplings outperform others in detecting tertiary structural contacts as well as eliminating intermolecular FPs; whereas traditional methods such as mutual information benefit from refinements such as shuffling, while being highly efficient. Computations repeated with 2,330 pairs of protein families from the Negatome database corroborated these results. Finally, using a training dataset of 162 families of proteins, we propose a combined method that outperforms existing individual methods. Overall, the study provides simple guidelines towards the choice of suitable methods and strategies based on available MSA size and computing resources. Availability and implementation: Software is freely available through the Evol component of ProDy API. Contact: bahar@pitt.edu Supplementary information: Supplementary data are available at Bioinformatics online.
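Mutual information, named in this abstract as a traditional coevolution measure, can be computed for a pair of alignment columns as sketched below. This is a plain, uncorrected version written for illustration; it does not include the shuffling refinement the abstract mentions and does not use the ProDy API. The function name `mutual_information` and the toy alignment are assumptions.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two MSA columns."""
    n = len(col_a)
    if n == 0 or n != len(col_b):
        raise ValueError("columns must be non-empty and of equal length")
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        p_a = freq_a[a] / n
        p_b = freq_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical four-sequence alignment; each string is one sequence.
msa = ["AKLG", "AKIG", "CELG", "CEIG"]
columns = list(zip(*msa))

coupled = mutual_information(columns[0], columns[1])    # A/C tracks K/E
uncoupled = mutual_information(columns[0], columns[2])  # A/C vs L/I
```

In this toy alignment the first two columns always change together, while the first and third vary independently, so the first pair carries the higher score. With so few sequences such estimates are unreliable, which is the MSA-size limitation the abstract points out.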
Affiliation(s)
- Wenzhi Mao
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Cihan Kaya
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Anindita Dutta
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Amnon Horovitz
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Ivet Bahar
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
458
Soltan Ghoraie L, Burkowski F, Zhu M. Sparse networks of directly coupled, polymorphic, and functional side chains in allosteric proteins. Proteins 2015; 83:497-516. [DOI: 10.1002/prot.24752] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/16/2014] [Revised: 12/05/2014] [Accepted: 12/13/2014] [Indexed: 02/05/2023]
Affiliation(s)
- Forbes Burkowski
  - School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Mu Zhu
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
459
Sun HP, Huang Y, Wang XF, Zhang Y, Shen HB. Improving accuracy of protein contact prediction using balanced network deconvolution. Proteins 2015; 83:485-96. [PMID: 25524593 DOI: 10.1002/prot.24744] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 08/31/2014] [Revised: 11/20/2014] [Accepted: 12/02/2014] [Indexed: 12/28/2022]
Abstract
A residue contact map is essential for protein three-dimensional structure determination. However, most current contact prediction methods based on residue co-evolution suffer from high false-positive rates introduced by indirect and transitive contacts (i.e., residues A-B and B-C are in contact, but A-C are not). Building on the work by Feizi et al. (Nat Biotechnol 2013; 31:726-733), which demonstrated a general network model to distinguish direct dependencies by network deconvolution, this study presents a new balanced network deconvolution (BND) algorithm to identify an optimized dependency matrix without a limit on the eigenvalue range in the applied network systems. The algorithm was used to filter the contact predictions of five widely used co-evolution methods. In tests on proteins from three benchmark datasets (the 9th Critical Assessment of protein Structure Prediction (CASP9), CASP10, and the PSICOV (precise structural contact prediction using sparse inverse covariance estimation) database), BND can improve the medium- and long-range contact predictions at the L/5 cutoff by 55.59% and 47.68%, respectively, without additional central processing unit cost. The improvement is statistically significant, with a P-value < 5.93 × 10⁻³ in the Student's t-test. A further comparison with the ab initio structure predictions in CASPs showed that the usefulness of current co-evolution-based contact prediction for three-dimensional structure modeling relies on the number of homologous sequences existing in the sequence databases. BND can be used as a general contact refinement method and is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/BND/.
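The "L/5 cutoff" used to report these improvements refers to scoring only the top L/5 predicted contacts, where L is the sequence length. A minimal sketch of that evaluation follows. The separation ranges used in the example (12 to 23 residues for medium range, 24 or more for long range) follow a common convention and are an assumption here, since the abstract does not define them; the function name and the toy scores are likewise hypothetical.

```python
def top_l5_precision(scores, true_contacts, length, min_sep, max_sep=None):
    """Precision of the top L/5 predicted contacts in a separation range.

    scores: dict mapping residue pairs (i, j), i < j, to predicted scores.
    true_contacts: set of pairs (i, j), i < j, that are real contacts.
    length: sequence length L.
    """
    candidates = []
    for (i, j), score in scores.items():
        separation = j - i
        if separation < min_sep:
            continue
        if max_sep is not None and separation > max_sep:
            continue
        candidates.append((score, (i, j)))
    # Highest-scoring pairs first, then keep at most L/5 of them.
    candidates.sort(reverse=True)
    selected = candidates[: max(1, length // 5)]
    if not selected:
        return 0.0
    hits = sum(1 for _, pair in selected if pair in true_contacts)
    return hits / len(selected)


# Hypothetical predictions for a 50-residue protein.
scores = {(0, 30): 0.9, (2, 40): 0.7, (5, 45): 0.4,
          (1, 15): 0.8, (3, 20): 0.2}
true_contacts = {(0, 30), (1, 15)}

long_range = top_l5_precision(scores, true_contacts, 50, min_sep=24)
medium_range = top_l5_precision(scores, true_contacts, 50,
                                min_sep=12, max_sep=23)
```

When fewer than L/5 pairs fall in the requested range, as in this toy example, the sketch scores whatever candidates exist rather than padding the list.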
Affiliation(s)
- Hai-Ping Sun
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China
460
Andreani J, Söding J. bbcontacts: prediction of β-strand pairing from direct coupling patterns. Bioinformatics 2015; 31:1729-37. [PMID: 25618863 DOI: 10.1093/bioinformatics/btv041] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 10/07/2014] [Accepted: 01/17/2015] [Indexed: 01/08/2023]
Abstract
MOTIVATION It has recently become possible to build reliable de novo models of proteins if a multiple sequence alignment (MSA) of at least 1000 homologous sequences can be built. Methods of global statistical network analysis can explain the observed correlations between columns in the MSA by a small set of directly coupled pairs of columns. Strong couplings are indicative of residue-residue contacts, and from the predicted contacts a structure can be computed. Here, we exploit the structural regularity of paired β-strands that leads to characteristic patterns in the noisy matrices of couplings. The β-β contacts should be detected more reliably than single contacts, reducing the required number of sequences in the MSAs. RESULTS bbcontacts predicts β-β contacts by detecting these characteristic patterns in the 2D map of coupling scores using two hidden Markov models (HMMs), one for parallel and one for antiparallel contacts. β-bulges are modelled as indel states. In contrast to existing methods, bbcontacts uses predicted instead of true secondary structure. On a standard set of 916 test proteins, 34% of which have MSAs with < 1000 sequences, bbcontacts achieves 50% precision for contacting β-β residue pairs at 50% recall using predicted secondary structure and 64% precision at 64% recall using true secondary structure, while existing tools achieve around 45% precision at 45% recall using true secondary structure. AVAILABILITY AND IMPLEMENTATION bbcontacts is open source software (GNU Affero GPL v3) available at https://bitbucket.org/soedinglab/bbcontacts .
Affiliation(s)
- Jessica Andreani
  - Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377 Munich, Germany and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Johannes Söding
  - Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377 Munich, Germany and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
461
Li G, Theys K, Verheyen J, Pineda-Peña AC, Khouri R, Piampongsant S, Eusébio M, Ramon J, Vandamme AM. A new ensemble coevolution system for detecting HIV-1 protein coevolution. Biol Direct 2015; 10:1. [PMID: 25564011 PMCID: PMC4332441 DOI: 10.1186/s13062-014-0031-8] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 11/28/2014] [Accepted: 12/02/2014] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND A key challenge in the field of HIV-1 protein evolution is the identification of coevolving amino acids at the molecular level. In the past decades, many sequence-based methods have been designed to detect position-specific coevolution within and between different proteins. However, an ensemble coevolution system that integrates different methods to improve the detection of HIV-1 protein coevolution has not been developed. RESULTS We integrated 27 sequence-based prediction methods published between 2004 and 2013 into an ensemble coevolution system. This system allowed combinations of different sequence-based methods for coevolution predictions. Using HIV-1 protein structures and experimental data, we evaluated the performance of individual and combined sequence-based methods in the prediction of HIV-1 intra- and inter-protein coevolution. We showed that sequence-based methods clustered according to their methodology, and a combination of four methods outperformed any of the 27 individual methods. This four-method combination estimated that HIV-1 intra-protein coevolving positions were mainly located in functional domains and physically contacted with each other in the protein tertiary structures. In the analysis of HIV-1 inter-protein coevolving positions between Gag and protease, protease drug resistance positions near the active site mostly coevolved with Gag cleavage positions (V128, S373-T375, A431, F448-P453) and Gag C-terminal positions (S489-Q500) under selective pressure of protease inhibitors. CONCLUSIONS This study presents a new ensemble coevolution system which detects position-specific coevolution using combinations of 27 different sequence-based methods. Our findings highlight key coevolving residues within HIV-1 structural proteins and between Gag and protease, shedding light on HIV-1 intra- and inter-protein coevolution.
Affiliation(s)
- Guangdi Li
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
- Kristof Theys
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
- Jens Verheyen
  - Institute of Virology, University Hospital, University Duisburg-Essen, Essen, Germany.
- Andrea-Clemencia Pineda-Peña
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
  - Clinical and Molecular Infectious Disease Group, Faculty of Sciences and Mathematics, Universidad del Rosario, Bogotá, Colombia.
- Ricardo Khouri
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
- Supinya Piampongsant
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
- Mónica Eusébio
  - Centro de Malária e Outras Doenças Tropicais and Unidade de Microbiologia, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Lisboa, Portugal.
- Jan Ramon
  - Department of Computer Science, KU Leuven - University of Leuven, Leuven, Belgium.
- Anne-Mieke Vandamme
  - KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium.
  - Centro de Malária e Outras Doenças Tropicais and Unidade de Microbiologia, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Lisboa, Portugal.
|
462
|
Abstract
Recent advances in identifying residue-residue contacts from large multiple sequence alignments have enabled impressive gains to be made in the field of protein structure prediction. In this chapter, we discuss these advances and provide a step-by-step guide to applying the latest tools to the de novo modelling of alpha-helical transmembrane proteins. As a practical example, we demonstrate the process of building an accurate 3D model of a G protein-coupled receptor, correctly orientated in the membrane, using only its primary protein sequence.
Affiliation(s)
- Timothy Nugent
  - Bioinformatics Group, Department of Computer Science, University College London, Office: 8.11, Desk: 206, Gower Street, London, WC1E 6BT, UK
|
463
|
Tian P, Boomsma W, Wang Y, Otzen DE, Jensen MH, Lindorff-Larsen K. Structure of a Functional Amyloid Protein Subunit Computed Using Sequence Variation. J Am Chem Soc 2014; 137:22-5. [DOI: 10.1021/ja5093634] [Citation(s) in RCA: 82] [Impact Index Per Article: 7.5] [Indexed: 11/30/2022]
Affiliation(s)
- Pengfei Tian
  - Niels Bohr Institute, University of Copenhagen, Blegdamsvej 17, 2100 Copenhagen, Denmark
- Wouter Boomsma
  - Structural Biology and NMR Laboratory, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5 DK-2200 Copenhagen N, Denmark
- Yong Wang
  - Structural Biology and NMR Laboratory, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5 DK-2200 Copenhagen N, Denmark
- Daniel E. Otzen
  - Interdisciplinary Nanoscience Center (iNANO), Centre for Insoluble Protein Structures (inSPIN), Department of Molecular Biology and Genetics, Aarhus University, Gustav Wieds Vej 14, 8000 Aarhus C, Denmark
- Mogens H. Jensen
  - Niels Bohr Institute, University of Copenhagen, Blegdamsvej 17, 2100 Copenhagen, Denmark
- Kresten Lindorff-Larsen
  - Structural Biology and NMR Laboratory, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5 DK-2200 Copenhagen N, Denmark
|
464
|
Raimondi D, Orlando G, Vranken WF. Clustering-based model of cysteine co-evolution improves disulfide bond connectivity prediction and reduces homologous sequence requirements. Bioinformatics 2014; 31:1219-25. [DOI: 10.1093/bioinformatics/btu794] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/18/2014] [Accepted: 11/18/2014] [Indexed: 12/23/2022] Open
|
465
|
Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 2014; 31:999-1006. [PMID: 25431331 PMCID: PMC4382908 DOI: 10.1093/bioinformatics/btu791] [Citation(s) in RCA: 237] [Impact Index Per Article: 21.5] [Received: 09/25/2014] [Accepted: 11/22/2014] [Indexed: 12/13/2022]
Abstract
Motivation: Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Results: Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts, around 60% higher than PSICOV and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. Availability and implementation: MetaPSICOV is freely available as a web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Contact: d.t.jones@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
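The "top-L long range precision" figure quoted in this abstract can be illustrated with a small sketch. It is not taken from MetaPSICOV itself: the function name, the sequence-separation cutoff of 24 residues for "long range", and the toy arrays are all assumptions chosen for illustration, since the abstract does not state the exact definitions used.

```python
import numpy as np

def top_l_long_range_precision(scores, true_contacts, min_separation=24, fraction=1.0):
    """Precision of the top (fraction * L) long-range predicted contacts.

    scores         -- (L, L) array of predicted contact scores
    true_contacts  -- (L, L) boolean array, True where two residues are in contact
    min_separation -- assumed cutoff: pairs with j - i >= this count as long range
    fraction       -- 1.0 gives top-L, 0.1 gives top-L/10, and so on
    """
    scores = np.asarray(scores, dtype=float)
    true_contacts = np.asarray(true_contacts, dtype=bool)
    length = scores.shape[0]

    # Visit each residue pair once (i < j), keeping only long-range pairs.
    i_idx, j_idx = np.triu_indices(length, k=min_separation)
    if i_idx.size == 0:
        return 0.0

    # Rank long-range pairs by descending score and keep the top n.
    n_top = max(1, int(length * fraction))
    order = np.argsort(-scores[i_idx, j_idx], kind="stable")[:n_top]

    # Precision = fraction of the selected pairs that are true contacts.
    return float(true_contacts[i_idx[order], j_idx[order]].mean())

# Toy data (hypothetical): a 6-residue "protein" with a cutoff of 3.
L = 6
scores = np.full((L, L), 0.1)
truth = np.zeros((L, L), dtype=bool)
for (i, j), s, t in [((0, 5), 0.9, True), ((1, 4), 0.8, True), ((2, 5), 0.7, False)]:
    scores[i, j] = scores[j, i] = s
    truth[i, j] = truth[j, i] = t
# A high-scoring short-range pair that the metric must ignore.
scores[0, 1] = scores[1, 0] = 1.0
truth[0, 1] = truth[1, 0] = True

precision = top_l_long_range_precision(scores, truth, min_separation=3, fraction=0.5)
```

With `fraction=0.5` the three best long-range pairs are scored, two of which are true contacts in the toy data; the short-range pair is excluded before ranking, which is why the separation filter is applied first.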
Affiliation(s)
- David T Jones
  - Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tanya Singh
  - Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tomasz Kosciolek
  - Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Stuart Tetchner
  - Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
|
466
|
Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 2014; 10:e1003889. [PMID: 25375897 PMCID: PMC4222596 DOI: 10.1371/journal.pcbi.1003889] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Received: 05/01/2014] [Accepted: 09/03/2014] [Indexed: 11/23/2022] Open
Abstract
Given sufficiently large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts independently of the number of aligned sequences, residue separation or secondary structure type, but is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, in comparison to state-of-the-art methods using machine learning, PconsC2 is superior for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.

Here, we introduce a novel protein contact prediction method PconsC2 that, to the best of our knowledge, outperforms earlier methods. PconsC2 is based on our earlier method, PconsC, as it utilizes the same set of contact predictions from plmDCA and PSICOV. However, in contrast to PconsC, where each residue pair is analysed independently, the initial predictions are analysed in the context of neighbouring residue pairs using a deep learning approach, inspired by earlier work. We find that for each layer the deep learning procedure improves the predictions. In the end, five layers of deep learning and the inclusion of a few extra features provide the best performance. An improvement can be seen for all types of proteins, independent of length, number of homologous sequences and structural class. However, the improvement is largest for β-sheet containing proteins. Most importantly, the improvement brings for the first time sufficiently accurate predictions to some protein families with fewer than 1000 homologous sequences. PconsC2 also outperforms state-of-the-art machine-learning-based predictors for protein families larger than 100 effective sequences. PconsC2 is licensed under the GNU General Public License v3 and freely available from http://c2.pcons.net/.
|
467
|
Touw WG, Baakman C, Black J, te Beek TAH, Krieger E, Joosten RP, Vriend G. A series of PDB-related databanks for everyday needs. Nucleic Acids Res 2014; 43:D364-8. [PMID: 25352545 PMCID: PMC4383885 DOI: 10.1093/nar/gku1028] [Citation(s) in RCA: 676] [Impact Index Per Article: 61.5] [Indexed: 12/02/2022] Open
Abstract
We present a series of databanks (http://swift.cmbi.ru.nl/gv/facilities/) that hold information that is computationally derived from Protein Data Bank (PDB) entries and that might augment macromolecular structure studies. These derived databanks run parallel to the PDB, i.e. they have one entry per PDB entry. Several of the well-established databanks such as HSSP, PDBREPORT and PDB_REDO have been updated and/or improved. The software that creates the DSSP databank, for example, has been rewritten to better cope with π-helices. A large number of databanks have been added to aid computational structural biology; some examples are lists of residues that make crystal contacts, lists of contacting residues using a series of contact definitions or lists of residue accessibilities. PDB files are not the optimal presentation of the underlying data for many studies. We therefore made a series of databanks that hold PDB files in an easier to use or more consistent representation. The BDB databank holds X-ray PDB files with consistently represented B-factors. We also added several visualization tools to aid the users of our databanks.
Affiliation(s)
- Wouter G Touw
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Coos Baakman
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Jon Black
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Tim A H te Beek
  - Bio-Prodict BV, Nieuwe Marktstraat 54E, 6511 AA Nijmegen, The Netherlands
- E Krieger
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
- Robbie P Joosten
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
  - Department of Biochemistry, Netherlands Cancer Institute, Plesmanlaan 121, Amsterdam 1066 CX, The Netherlands
- Gert Vriend
  - Centre for Molecular and Biomolecular Informatics, CMBI, Radboud university medical center, Geert Grooteplein Zuid 26-28 6525 GA Nijmegen, The Netherlands
|
468
|
Hinsen K, Vaitinadapoule A, Ostuni MA, Etchebest C, Lacapere JJ. Construction and validation of an atomic model for bacterial TSPO from electron microscopy density, evolutionary constraints, and biochemical and biophysical data. Biochimica et Biophysica Acta - Biomembranes 2014; 1848:568-80. [PMID: 25450341 DOI: 10.1016/j.bbamem.2014.10.028] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 08/01/2014] [Revised: 10/01/2014] [Accepted: 10/20/2014] [Indexed: 11/30/2022]
Abstract
The 18 kDa protein TSPO is a highly conserved transmembrane protein found in bacteria, yeast, animals and plants. TSPO is involved in a wide range of physiological functions, among which the transport of several molecules. The atomic structure of monomeric ligand-bound mouse TSPO in detergent has been published recently. A previously published low-resolution structure of Rhodobacter sphaeroides TSPO, obtained from tubular crystals with lipids and observed in cryo-electron microscopy, revealed an oligomeric structure without any ligand. We analyze this electron microscopy density in view of available biochemical and biophysical data, building a matching atomic model for the monomer and then the entire crystal. We compare its intra- and inter-molecular contacts with those predicted by amino acid covariation in TSPO proteins from evolutionary sequence analysis. The arrangement of the five transmembrane helices in a monomer of our model is different from that observed for the mouse TSPO. We analyze possible ligand binding sites for protoporphyrin, for the high-affinity ligand PK 11195, and for cholesterol in TSPO monomers and/or oligomers, and we discuss possible functional implications.
Affiliation(s)
- Konrad Hinsen
  - Centre de Biophysique Moléculaire (CNRS), Rue Charles Sadron, 45071 Orléans Cedex, France; Synchrotron SOLEIL, Division Expériences, Saint Aubin, B.P. 48, 91192 Gif-sur-Yvette Cedex, France.
- Aurore Vaitinadapoule
  - INSERM, UMR-S1134, 6 rue Alexandre Cabanel, Université Paris 7 Denis Diderot, F-75015 Paris, France; Université Paris Diderot, Sorbonne Paris Cité, Paris, France; Institut National de la Transfusion Sanguine (INTS), Paris, France; GR-Ex, Laboratoire d'Excellence, Paris, France; National Centre for Biological Sciences (NCBS), Tata Institute for Fundamental Research, GKVK Campus, Bangalore, Karnataka, India; Dynamique des Structures et des Interactions des Macromolécules Biologiques, France.
- Mariano A Ostuni
  - INSERM, UMR-S1134, 6 rue Alexandre Cabanel, Université Paris 7 Denis Diderot, F-75015 Paris, France; Université Paris Diderot, Sorbonne Paris Cité, Paris, France; Institut National de la Transfusion Sanguine (INTS), Paris, France; GR-Ex, Laboratoire d'Excellence, Paris, France.
- Catherine Etchebest
  - INSERM, UMR-S1134, 6 rue Alexandre Cabanel, Université Paris 7 Denis Diderot, F-75015 Paris, France; Université Paris Diderot, Sorbonne Paris Cité, Paris, France; Institut National de la Transfusion Sanguine (INTS), Paris, France; GR-Ex, Laboratoire d'Excellence, Paris, France; Dynamique des Structures et des Interactions des Macromolécules Biologiques, France.
- Jean-Jacques Lacapere
  - Sorbonne Universités, UPMC Univ Paris 06, Laboratoire de Biomolécules (LBM), 4 Place Jussieu, F-75005 Paris, France; Ecole Normale Supérieure - PSL Research University, Département de Chimie, 24, rue Lhomond, 75005 Paris, France; CNRS, UMR 7203 LBM, F-75005 Paris, France.
|
469
|
Schneider M, Brock O. Combining physicochemical and evolutionary information for protein contact prediction. PLoS One 2014; 9:e108438. [PMID: 25338092 PMCID: PMC4206277 DOI: 10.1371/journal.pone.0108438] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 04/11/2014] [Accepted: 07/28/2014] [Indexed: 11/18/2022] Open
Abstract
We introduce a novel contact prediction method that achieves high prediction accuracy by combining evolutionary and physicochemical information about native contacts. We obtain evolutionary information from multiple-sequence alignments and physicochemical information from predicted ab initio protein structures. These structures represent low-energy states in an energy landscape and thus capture the physicochemical information encoded in the energy function. Such low-energy structures are likely to contain native contacts, even if their overall fold is not native. To differentiate native from non-native contacts in those structures, we develop a graph-based representation of the structural context of contacts. We then use this representation to train a support vector machine classifier to identify the most likely native contacts in otherwise non-native structures. The resulting contact predictions are highly accurate. As a result of combining two sources of information (evolutionary and physicochemical), we maintain prediction accuracy even when only a few sequence homologs are present. We show that the predicted contacts help to improve ab initio structure prediction. A web service is available at http://compbio.robotics.tu-berlin.de/epc-map/.
Affiliation(s)
- Michael Schneider
  - Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
- Oliver Brock
  - Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Berlin, Germany
|
470
|
Campeotto I, Percy MG, MacDonald JT, Förster A, Freemont PS, Gründling A. Structural and mechanistic insight into the Listeria monocytogenes two-enzyme lipoteichoic acid synthesis system. J Biol Chem 2014; 289:28054-69. [PMID: 25128528 PMCID: PMC4192460 DOI: 10.1074/jbc.m114.590570] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/23/2014] [Revised: 08/12/2014] [Indexed: 11/07/2022] Open
Abstract
Lipoteichoic acid (LTA) is an important cell wall component required for proper cell growth in many Gram-positive bacteria. In Listeria monocytogenes, two enzymes are required for the synthesis of this polyglycerolphosphate polymer. The LTA primase LtaP(Lm) initiates LTA synthesis by transferring the first glycerolphosphate (GroP) subunit onto the glycolipid anchor and the LTA synthase LtaS(Lm) extends the polymer by the repeated addition of GroP subunits to the tip of the growing chain. Here, we present the crystal structures of the enzymatic domains of LtaP(Lm) and LtaS(Lm). Although the enzymes share the same fold, substantial differences in the cavity of the catalytic site and surface charge distribution contribute to enzyme specialization. The eLtaS(Lm) structure was also determined in complex with GroP revealing a second GroP binding site. Mutational analysis confirmed an essential function for this binding site and allowed us to propose a model for the binding of the growing chain.
Affiliation(s)
- Ivan Campeotto
  - Section of Microbiology and MRC Centre for Molecular Bacteriology and Infection
- Matthew G Percy
  - Section of Microbiology and MRC Centre for Molecular Bacteriology and Infection
- James T MacDonald
  - Centre for Structural Biology, Imperial College London, London SW7 2AZ, United Kingdom
- Andreas Förster
  - Centre for Structural Biology, Imperial College London, London SW7 2AZ, United Kingdom
- Paul S Freemont
  - Centre for Structural Biology, Imperial College London, London SW7 2AZ, United Kingdom
- Angelika Gründling
  - Section of Microbiology and MRC Centre for Molecular Bacteriology and Infection
|
471
|
Feinauer C, Skwark MJ, Pagnani A, Aurell E. Improving contact prediction along three dimensions. PLoS Comput Biol 2014; 10:e1003847. [PMID: 25299132 PMCID: PMC4191875 DOI: 10.1371/journal.pcbi.1003847] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 02/28/2014] [Accepted: 08/07/2014] [Indexed: 11/18/2022] Open
Abstract
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date. Proteins are large molecules that living cells make by stringing together building blocks called amino acids or peptides, following their blue-prints in the DNA. Freshly made proteins are typically long, structure-less chains of peptides, but shortly afterwards most of them fold into characteristic structures. Proteins execute many functions in the cell, for which they need to have the right structure, which is therefore very important in determining what the proteins can do. The structure of a protein can be determined by X-ray diffraction and other experimental approaches which are all, to this day, somewhat labor-intensive and difficult. 
On the other hand, the order of the peptides in a protein can be read off from the DNA blue-print, and such protein sequences are today routinely produced in large numbers. In this paper we show that many similar protein sequences can be used to find information about the structure. The basic approach is to construct a probabilistic model for sequence variability, and then to use the parameters of that model to predict structure in three-dimensional space. The main technical novelty compared to previous contributions in the same general direction is that we use models more directly matched to the data.
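For orientation, the pairwise Potts model that this abstract takes as its baseline is conventionally written as below. The notation is the standard one from the direct-coupling literature and is an assumption here, since the abstract itself gives no formula: \(\sigma_i\) is the amino acid (or gap) at alignment column \(i\), \(h_i\) are single-site fields, \(J_{ij}\) are pairwise couplings, and \(Z\) is the normalizing constant.

```latex
P(\sigma_1, \dots, \sigma_N)
  = \frac{1}{Z} \exp\!\Big( \sum_{i=1}^{N} h_i(\sigma_i)
  + \sum_{1 \le i < j \le N} J_{ij}(\sigma_i, \sigma_j) \Big)
```

Contacts are then usually predicted from the strength of the inferred couplings \(J_{ij}\). The extensions described in the abstract modify this model to account for long stretches of gaps, which independent samples from the form above would rarely produce.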
Affiliation(s)
- Christoph Feinauer
  - DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy
- Marcin J. Skwark
  - Department of Information and Computer Science, Aalto University, Aalto, Finland
  - Aalto Science Institute (AScI), Aalto University, Aalto, Finland
- Andrea Pagnani
  - DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy
  - Human Genetics Foundation-Torino, Molecular Biotechnology Center, Torino, Italy
- Erik Aurell
  - Department of Information and Computer Science, Aalto University, Aalto, Finland
  - Aalto Science Institute (AScI), Aalto University, Aalto, Finland
  - Department of Computational Biology, Royal Institute of Technology, AlbaNova University Centre, Stockholm, Sweden
|
472
|
Shahmoradi A, Sydykova DK, Spielman SJ, Jackson EL, Dawson ET, Meyer AG, Wilke CO. Predicting evolutionary site variability from structure in viral proteins: buriedness, packing, flexibility, and design. J Mol Evol 2014; 79:130-42. [PMID: 25217382 PMCID: PMC4216736 DOI: 10.1007/s00239-014-9644-x] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 04/23/2014] [Accepted: 08/31/2014] [Indexed: 12/27/2022]
Abstract
Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The quantities we considered include buriedness (as measured by relative solvent accessibility), packing density (as measured by contact number), structural flexibility (as measured by B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on nine non-homologous viral protein structures and from variation in homologous variants of those proteins, where they were available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1-0.4). Moreover, we found that buriedness and packing density were better predictors of evolutionary variation than structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than buriedness or packing density, but it was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness and packing density are better predictors of evolutionary variation than the more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.
Affiliation(s)
- Amir Shahmoradi
  - Department of Physics, The University of Texas at Austin, Austin, TX 78712, USA
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Dariya K. Sydykova
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Stephanie J. Spielman
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Eleisha L. Jackson
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Eric T. Dawson
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Austin G. Meyer
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
- Claus O. Wilke
  - Department of Integrative Biology, Center for Computational Biology and Bioinformatics, and Institute for Cellular and Molecular Biology, The University of Texas at Austin, Austin, TX 78712, USA
|
473
|
Hopf TA, Schärfe CPI, Rodrigues JPGLM, Green AG, Kohlbacher O, Sander C, Bonvin AMJJ, Marks DS. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 2014; 3. [PMID: 25255213 PMCID: PMC4360534 DOI: 10.7554/elife.03430] [Citation(s) in RCA: 351] [Impact Index Per Article: 31.9] [Received: 05/21/2014] [Accepted: 09/23/2014] [Indexed: 12/24/2022] Open
Abstract
Protein-protein interactions are fundamental to many biological processes. Experimental screens have identified tens of thousands of interactions, and structural biology has provided detailed functional insight for select 3D protein complexes. An alternative rich source of information about protein interactions is the evolutionary sequence record. Building on earlier work, we show that analysis of correlated evolutionary sequence changes across proteins identifies residues that are close in space with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We evaluate prediction performance in blinded tests on 76 complexes of known 3D structure, predict protein-protein contacts in 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be used to distinguish between interacting and non-interacting protein pairs in a large complex. With the current growth of sequences, we expect that the method can be generalized to genome-wide elucidation of protein-protein interaction networks and used for interaction predictions at residue resolution.
Affiliation(s)
- Thomas A Hopf
  - Department of Systems Biology, Harvard University, Boston, United States
- João P G L M Rodrigues
  - Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, Netherlands
- Anna G Green
  - Department of Systems Biology, Harvard University, Boston, United States
- Oliver Kohlbacher
  - Applied Bioinformatics, Quantitative Biology Center, University of Tübingen, Tübingen, Germany
- Chris Sander
  - Computational Biology Center, Memorial Sloan Kettering Cancer Center, New York, United States
- Alexandre M J J Bonvin
  - Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, Netherlands
- Debora S Marks
  - Department of Systems Biology, Harvard University, Boston, United States
|
474
|
Michel M, Hayat S, Skwark MJ, Sander C, Marks DS, Elofsson A. PconsFold: improved contact predictions improve protein models. Bioinformatics 2014; 30:i482-8. [PMID: 25161237 PMCID: PMC4147911 DOI: 10.1093/bioinformatics/btu458] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.7] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, contact prediction methods have improved since the first studies. Here, we ask how much the final models are improved if improved contact predictions are used. RESULTS In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by 33% on average using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved by 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. AVAILABILITY PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at https://www.github.com/ElofssonLab/pcons-fold under the MIT license. PconsC is available from http://c.pcons.net/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mirco Michel, Sikander Hayat, Marcin J Skwark, Chris Sander, Debora S Marks, Arne Elofsson
  - Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden
  - Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland
  - Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
|
475
|
HCV E2 core structures and mAbs: something is still missing. Drug Discov Today 2014; 19:1964-70. [PMID: 25172800 DOI: 10.1016/j.drudis.2014.08.011] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 06/04/2014] [Revised: 07/17/2014] [Accepted: 08/21/2014] [Indexed: 02/07/2023]
Abstract
The lack of structural information on hepatitis C virus (HCV) surface proteins has so far hampered the development of effective vaccines. Recently, two crystallographic structures have described the core portion (E2c) of E2 surface glycoprotein, the primary mediator of HCV entry. Despite the importance of these studies, the E2 overall structure is still unknown and, most importantly, several biochemical and functional studies are in disagreement with E2c structures. Here, the main literature will be discussed and an alternative disulfide bridge pattern will be proposed, based on unpublished human monoclonal antibody reactivity. A modeling strategy aiming at recapitulating the available structural and functional studies of E2 will also be proposed.
|
476
|
Seemayer S, Gruber M, Söding J. CCMpred--fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 2014; 30:3128-30. [PMID: 25064567 PMCID: PMC4201158 DOI: 10.1093/bioinformatics/btu500] [Citation(s) in RCA: 281] [Impact Index Per Article: 25.5] [Indexed: 12/18/2022]
Abstract
Motivation: Recent breakthroughs in protein residue–residue contact prediction have made reliable de novo prediction of protein structures possible. The key was to apply statistical methods that can distinguish direct couplings between pairs of columns in a multiple sequence alignment from merely correlated pairs, i.e. to separate direct from indirect effects. Two classes of such methods exist, either relying on regularized inversion of the covariance matrix or on pseudo-likelihood maximization (PLM). Although PLM-based methods offer clearly higher precision, available tools are not sufficiently optimized and are written in interpreted languages that introduce additional overheads. This impedes the runtime and large-scale contact prediction for larger protein families, multi-domain proteins and protein–protein interactions. Results: Here we introduce CCMpred, our performance-optimized PLM implementation in C and CUDA C. Using graphics cards in the price range of current six-core processors, CCMpred can predict contacts for typical alignments 35–113 times faster and with the same precision as the most accurate published methods. For users without a CUDA-capable graphics card, CCMpred can also run in a CPU mode that is still 4–14 times faster. Thanks to our speed-ups, contacts for typical protein families can be predicted in 15–60 s on a consumer-grade GPU and 1–6 min on a six-core CPU. Availability and implementation: CCMpred is free and open-source software under the GNU Affero General Public License v3 (or later) available at https://bitbucket.org/soedinglab/ccmpred Contact: johannes.soeding@mpibpc.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Stefan Seemayer
  - Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377, Munich and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Markus Gruber
  - Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377, Munich and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Johannes Söding
  - Gene Center, LMU Munich, Feodor-Lynen-Strasse 25, 81377, Munich and Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
477
|
Andreani J, Guerois R. Evolution of protein interactions: From interactomes to interfaces. Arch Biochem Biophys 2014; 554:65-75. [DOI: 10.1016/j.abb.2014.05.010] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 02/10/2014] [Revised: 04/28/2014] [Accepted: 05/12/2014] [Indexed: 12/16/2022]
|
478
|
Ivankov DN, Finkelstein AV, Kondrashov FA. A structural perspective of compensatory evolution. Curr Opin Struct Biol 2014; 26:104-12. [PMID: 24981969 PMCID: PMC4141909 DOI: 10.1016/j.sbi.2014.05.004] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 02/03/2014] [Revised: 04/11/2014] [Accepted: 05/16/2014] [Indexed: 11/25/2022]
Abstract
The study of molecular evolution is important because it reveals how protein functions emerge and evolve. Recently, several types of studies indicated that substitutions in molecular evolution occur in a compensatory manner, whereby the occurrence of a substitution depends on the amino acid residues at other sites. However, a molecular or structural basis behind the compensation often remains obscure. Here, we review studies on the interface of structural biology and molecular evolution that revealed novel aspects of compensatory evolution. In many cases structural studies benefit from evolutionary data while structural data often add a functional dimension to the study of molecular evolution.
Affiliation(s)
- Dmitry N Ivankov
  - Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 88 Dr. Aiguader, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; Laboratory of Protein Physics, Institute of Protein Research of the Russian Academy of Sciences, 4 Institutskaya str., Pushchino, Moscow Region, 142290, Russia
- Alexei V Finkelstein
  - Laboratory of Protein Physics, Institute of Protein Research of the Russian Academy of Sciences, 4 Institutskaya str., Pushchino, Moscow Region, 142290, Russia
- Fyodor A Kondrashov
  - Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 88 Dr. Aiguader, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), 23 Pg. Lluís Companys, 08010 Barcelona, Spain.
|
479
|
Gotoh O, Morita M, Nelson DR. Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC Bioinformatics 2014; 15:189. [PMID: 24927652 PMCID: PMC4065584 DOI: 10.1186/1471-2105-15-189] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 03/19/2014] [Accepted: 06/09/2014] [Indexed: 03/29/2024] Open
Abstract
Background Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. Results We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50-60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Conclusions Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
Affiliation(s)
- Osamu Gotoh
  - Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo 135-0064, Japan.
|
480
|
Clark GW, Ackerman SH, Tillier ER, Gatti DL. Multidimensional mutual information methods for the analysis of covariation in multiple sequence alignments. BMC Bioinformatics 2014; 15:157. [PMID: 24886131 PMCID: PMC4046016 DOI: 10.1186/1471-2105-15-157] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 12/02/2013] [Accepted: 05/06/2014] [Indexed: 11/10/2022] Open
Abstract
Background Several methods are available for the detection of covarying positions from a multiple sequence alignment (MSA). If the MSA contains a large number of sequences, information about the proximities between residues derived from covariation maps can be sufficient to predict a protein fold. However, in many cases the structure is already known, and information on the covarying positions can be valuable to understand the protein mechanism and dynamic properties. Results In this study we have sought to determine whether a multivariate (multidimensional) extension of traditional mutual information (MI) can be an additional tool to study covariation. The performance of two multidimensional MI (mdMI) methods, designed to remove the effect of ternary/quaternary interdependencies, was tested with a set of 9 MSAs each containing <400 sequences, and was shown to be comparable to that of the newest methods based on maximum entropy/pseudolikelihood statistical models of protein sequences. However, while all the methods tested detected a similar number of covarying pairs among the residues separated by < 8 Å in the reference X-ray structures, there was on average less than 65% overlap between the top scoring pairs detected by methods that are based on different principles. Conclusions Given the large variety of structure and evolutionary history of different proteins, it is possible that a single best method to detect covariation in all proteins does not exist, and that for each protein family the best information can be derived by merging/comparing results obtained with different methods. This approach may be particularly valuable in those cases in which the size of the MSA is small or the quality of the alignment is low, leading to significant differences in the pairs detected by different methods.
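As a point of reference for the abstract above, the traditional pairwise mutual information that the multidimensional methods extend can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the toy alignment, the function names `column` and `mutual_information`, and the choice to ignore gaps and sequence weighting are all assumptions made here.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Shannon mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i, col_j = column(msa, i), column(msa, j)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 1 covary perfectly, column 2 is constant.
toy_msa = ["AKG", "AKG", "DEG", "DEG"]
print(mutual_information(toy_msa, 0, 1))  # 1.0
print(mutual_information(toy_msa, 0, 2))  # 0.0
```

A fully conserved column carries no mutual information with any other column, which is one reason covariation scores are usually read alongside conservation.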
Affiliation(s)
- Elisabeth R Tillier
  - Department of Medical Biophysics, University of Toronto, Campbell Family Institute for Cancer Research, Ontario Cancer Institute, University Health Network, Toronto, Ontario, Canada.
|
481
|
General IJ, Liu Y, Blackburn ME, Mao W, Gierasch LM, Bahar I. ATPase subdomain IA is a mediator of interdomain allostery in Hsp70 molecular chaperones. PLoS Comput Biol 2014; 10:e1003624. [PMID: 24831085 PMCID: PMC4022485 DOI: 10.1371/journal.pcbi.1003624] [Citation(s) in RCA: 79] [Impact Index Per Article: 7.2] [Received: 10/03/2013] [Accepted: 03/31/2014] [Indexed: 11/18/2022] Open
Abstract
The versatile functions of the heat shock protein 70 (Hsp70) family of molecular chaperones rely on allosteric interactions between their nucleotide-binding and substrate-binding domains, NBD and SBD. Understanding the mechanism of interdomain allostery is essential to rational design of Hsp70 modulators. Yet, despite significant progress in recent years, how the two Hsp70 domains regulate each other's activity remains elusive. Covariance data from experiments and computations emerged in recent years as valuable sources of information towards gaining insights into the molecular events that mediate allostery. In the present study, conservation and covariance properties derived from both sequence and structural dynamics data are integrated with results from Perturbation Response Scanning and in vivo functional assays, so as to establish the dynamical basis of interdomain signal transduction in Hsp70s. Our study highlights the critical roles of SBD residues D481 and T417 in mediating the coupled motions of the two domains, as well as that of G506 in enabling the movements of the α-helical lid with respect to the β-sandwich. It also draws attention to the distinctive role of the NBD subdomains: Subdomain IA acts as a key mediator of signal transduction between the ATP- and substrate-binding sites, this function being achieved by a cascade of interactions predominantly involving conserved residues such as V139, D148, R167 and K155. Subdomain IIA, on the other hand, is distinguished by strong coevolutionary signals (with the SBD) exhibited by a series of residues (D211, E217, L219, T383) implicated in DnaJ recognition. The occurrence of coevolving residues at the DnaJ recognition region parallels the behavior recently observed at the nucleotide-exchange-factor recognition region of subdomain IIB. 
These findings suggest that Hsp70 tends to adapt to co-chaperone recognition and activity via coevolving residues, whereas interdomain allostery, critical to chaperoning, is robustly enabled by conserved interactions.
Affiliation(s)
- Ignacio J. General
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Ying Liu
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Mandy E. Blackburn
  - Department of Biochemistry & Molecular Biology, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
- Wenzhi Mao
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
  - Department of Pharmacology, Tsinghua University, Beijing, China
- Lila M. Gierasch
  - Department of Biochemistry & Molecular Biology, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
  - Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
- Ivet Bahar
  - Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
|
482
|
Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 2014; 3:e02030. [PMID: 24842992 PMCID: PMC4034769 DOI: 10.7554/elife.02030] [Citation(s) in RCA: 461] [Impact Index Per Article: 41.9] [Indexed: 12/24/2022] Open
Abstract
Do the amino acid sequence identities of residues that make contact across protein interfaces covary during evolution? If so, such covariance could be used to predict contacts across interfaces and assemble models of biological complexes. We find that residue pairs identified using a pseudo-likelihood-based method to covary across protein–protein interfaces in the 50S ribosomal unit and 28 additional bacterial protein complexes with known structure are almost always in contact in the complex, provided that the number of aligned sequences is greater than the average length of the two proteins. We use this method to make subunit contact predictions for an additional 36 protein complexes with unknown structures, and present models based on these predictions for the tripartite ATP-independent periplasmic (TRAP) transporter, the tripartite efflux system, the pyruvate formate lyase-activating enzyme complex, and the methionine ABC transporter. DOI:http://dx.doi.org/10.7554/eLife.02030.001 Proteins are considered the ‘workhorse molecules’ of life and they are involved in virtually everything that cells do. Proteins are strings of amino acids that have folded into a specific three-dimensional shape. Proteins must have the correct shape to function properly, as they often work by binding to other proteins or molecules—much like a key fitting into a lock. Working out the structure of a protein can, therefore, provide major insights into how the protein does its job. Two or more proteins can bind together and form a complex to perform various tasks; and solving the structures of these complexes can be challenging, even if the structures of the protein subunits are known. Now, Ovchinnikov, Kamisetty, and Baker have developed a method for predicting which parts of the proteins make contact with each other in a two-protein complex. 
Different species can have copies of the same proteins; but a copy from one species might have different amino acids at certain positions when compared to a related copy from another species. As such, when pairs of interacting proteins from different species are compared, there will be many positions in the two proteins that vary. However, if the amino acid at a position in one protein (let's call it ‘X’) varies, and the amino acid at, say, position ‘Y’ in the other protein also varies such that for any given amino acid at position Y there is often a specific amino acid at position X; positions X and Y are said to ‘co-vary’. Ovchinnikov et al. noticed that when a pair of amino acids (one from each protein in a two-protein complex) co-varied, these two amino acids tended to make contact with each other at the protein–protein interface. Ovchinnikov et al. used the new method to make predictions about the protein–protein interfaces in 28 protein complexes found in bacteria, and also to make a prediction about the interface between protein subunits in the bacterial ribosome. When these predictions were checked against the actual structures, which were all known beforehand, they were found to be accurate if the number of copies of each protein being compared is greater than the average length of the two proteins. Ovchinnikov et al. went on to predict the amino acids on the protein–protein interfaces for another 36 bacterial protein complexes with unknown structures, and to present models for four larger complexes. The next challenge is to extend the method to protein complexes that are found only in eukaryotes (i.e., not in bacteria). Since the number of related copies for eukaryotic proteins tends to be smaller, there are fewer proteins to compare and it is therefore harder to detect ‘covariation’ when it occurs. DOI:http://dx.doi.org/10.7554/eLife.02030.002
Affiliation(s)
- Sergey Ovchinnikov
  - Department of Biochemistry, Howard Hughes Medical Institute, University of Washington, Seattle, United States
  - Molecular and Cellular Biology Program, University of Washington, Seattle, United States
- Hetunandan Kamisetty
  - Department of Biochemistry, Howard Hughes Medical Institute, University of Washington, Seattle, United States
  - Facebook Inc., Seattle, United States
- David Baker
  - Department of Biochemistry, Howard Hughes Medical Institute, University of Washington, Seattle, United States
|
483
|
Janda JO, Popal A, Bauer J, Busch M, Klocke M, Spitzer W, Keller J, Merkl R. H2rs: deducing evolutionary and functionally important residue positions by means of an entropy and similarity based analysis of multiple sequence alignments. BMC Bioinformatics 2014; 15:118. [PMID: 24766829 PMCID: PMC4021312 DOI: 10.1186/1471-2105-15-118] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 01/13/2014] [Accepted: 04/17/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of functionally important residue positions is an important task of computational biology. Methods of correlation analysis allow for the identification of pairs of residue positions, whose occupancy is mutually dependent due to constraints imposed by protein structure or function. A common measure assessing these dependencies is the mutual information, which is based on Shannon's information theory that utilizes probabilities only. Consequently, such approaches do not consider the similarity of residue pairs, which may degrade the algorithm's performance. One typical algorithm is H2r, which characterizes each individual residue position k by the conn(k)-value, which is the number of significantly correlated pairs it belongs to. RESULTS To improve specificity of H2r, we developed a revised algorithm, named H2rs, which is based on the von Neumann entropy (vNE). To compute the corresponding mutual information, a matrix A is required, which assesses the similarity of residue pairs. We determined A by deducing substitution frequencies from contacting residue pairs observed in the homologs of 35 809 proteins, whose structure is known. In analogy to H2r, the enhanced algorithm computes a normalized conn(k)-value. Within the framework of H2rs, only statistically significant vNE values were considered. To decide on significance, the algorithm calculates a p-value by performing a randomization test for each individual pair of residue positions. The analysis of a large in silico testbed demonstrated that specificity and precision were higher for H2rs than for H2r and two other methods of correlation analysis. The gain in prediction quality is further confirmed by a detailed assessment of five well-studied enzymes. The outcome of H2rs and of a method that predicts contacting residue positions (PSICOV) overlapped only marginally. H2rs can be downloaded from http://www-bioinf.uni-regensburg.de. 
CONCLUSIONS Considering substitution frequencies for residue pairs by means of the von Neumann entropy and a p-value improved the success rate in identifying important residue positions. The integration of proven statistical concepts and normalization allows for an easier comparison of results obtained with different proteins. Comparing the outcome of the local method H2rs and of the global method PSICOV indicates that such methods supplement each other and have different scopes of application.
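The conn(k)-value described in the abstract above, the number of significantly correlated pairs that residue position k belongs to, reduces to a simple count once the significant pairs are known. The sketch below assumes the pairs have already been selected (for example by the p-value test the abstract mentions); the function name `conn_values` and the example positions are hypothetical, and the normalization used by H2rs is not reproduced because the abstract does not specify it.

```python
from collections import Counter


def conn_values(significant_pairs):
    """Count, for each residue position k, the significant pairs that contain k."""
    conn = Counter()
    for k, l in significant_pairs:
        conn[k] += 1
        conn[l] += 1
    return conn


# Hypothetical significant pairs of alignment positions.
pairs = [(3, 17), (3, 42), (17, 42), (3, 88)]
conn = conn_values(pairs)
print(conn.most_common(1))  # [(3, 3)]
```

Positions with high counts, such as position 3 in this toy input, would be the candidates flagged as important under this kind of scoring.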
Affiliation(s)
- Rainer Merkl
  - Institute of Biophysics and Physical Biochemistry, University of Regensburg, D-93040 Regensburg, Germany.
|
484
|
Gültas M, Düzgün G, Herzog S, Jäger SJ, Meckbach C, Wingender E, Waack S. Quantum coupled mutation finder: predicting functionally or structurally important sites in proteins using quantum Jensen-Shannon divergence and CUDA programming. BMC Bioinformatics 2014; 15:96. [PMID: 24694117 PMCID: PMC4098773 DOI: 10.1186/1471-2105-15-96] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 06/10/2013] [Accepted: 03/26/2014] [Indexed: 11/29/2022] Open
Abstract
Background The identification of functionally or structurally important non-conserved residue sites in protein MSAs is an important challenge for understanding the structural basis and molecular mechanism of protein functions. Despite the rich literature on compensatory mutations as well as sequence conservation analysis for the detection of those important residues, previous methods often rely on classical information-theoretic measures. However, these measures usually do not take into account dis/similarities of amino acids which are likely to be crucial for those residues. In this study, we present a new method, the Quantum Coupled Mutation Finder (QCMF) that incorporates significant dis/similar amino acid pair signals in the prediction of functionally or structurally important sites. Results The result of this study is twofold. First, using the essential sites of two human proteins, namely epidermal growth factor receptor (EGFR) and glucokinase (GCK), we tested the QCMF-method. The QCMF includes two metrics based on quantum Jensen-Shannon divergence to measure both sequence conservation and compensatory mutations. We found that the QCMF reaches an improved performance in identifying essential sites from MSAs of both proteins with a significantly higher Matthews correlation coefficient (MCC) value in comparison to previous methods. Second, using a data set of 153 proteins, we made a pairwise comparison between QCMF and three conventional methods. This comparison study strongly suggests that QCMF complements the conventional methods for the identification of correlated mutations in MSAs. Conclusions QCMF utilizes the notion of entanglement, which is a major resource of quantum information, to model significant dissimilar and similar amino acid pair signals in the detection of functionally or structurally important sites. 
Our results suggest that on the one hand QCMF significantly outperforms the previous method, which mainly focuses on dissimilar amino acid signals, to detect essential sites in proteins. On the other hand, it is complementary to the existing methods for the identification of correlated mutations. The method of QCMF is computationally intensive. To ensure a feasible computation time of the QCMF’s algorithm, we leveraged Compute Unified Device Architecture (CUDA). The QCMF server is freely accessible at http://qcmf.informatik.uni-goettingen.de/.
Collapse
Affiliation(s)
- Mehmet Gültas
- Institute of Computer Science, University of Göttingen, Goldschmidtstr, 7, 37077 Göttingen, Germany.
|
485
|
Konopka BM, Ciombor M, Kurczynska M, Kotulska M. Automated procedure for contact-map-based protein structure reconstruction. J Membr Biol 2014; 247:409-20. [PMID: 24682239 PMCID: PMC3983884 DOI: 10.1007/s00232-014-9648-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 01/10/2014] [Accepted: 03/04/2014] [Indexed: 11/25/2022]
Abstract
Knowledge of the three-dimensional structures of ion channels allows for modeling their conductivity characteristics using biophysical models and can lead to discovering their cellular functionality. Recent studies show that the quality of structure predictions can be significantly improved using protein contact site information. Therefore, a number of procedures for protein structure prediction based on contact maps have been proposed. Their comparison is difficult due to the different methodologies used for validation. In this work, a Contact Map-to-Structure pipeline (C2S_pipeline) for contact-based protein structure reconstruction is designed and validated. The C2S_pipeline can be used to reconstruct monomeric and multimeric proteins. The median RMSD of structures obtained during validation on a representative set of protein structures equaled 5.27 Å, and the best structure was reconstructed with an RMSD of 1.59 Å. The validation is followed by a detailed case study on the KcsA ion channel. Models of KcsA are reconstructed based on different portions of contact site information. Structural feature analysis of the acquired KcsA models is supported by a thorough analysis of electrostatic potential distributions inside the channels. The study shows that electrostatic parameters are correlated with the structural quality of models. Therefore, they can be used to discriminate between high and low quality structures. We show that 30% of contact information is needed to obtain accurate structures of KcsA if contacts are selected randomly. This number increases to 70% in the case of erroneous maps in which the remaining contacts or non-contacts are changed to the opposite. Furthermore, the study reveals that local reconstruction accuracy is correlated with the number of contacts in which amino acids are involved. This results in higher reconstruction accuracy in the structure core than in peripheral regions.
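The idea of reconstructing from only a portion of the contact information can be illustrated with a small sketch. It is not the C2S_pipeline itself: the 8 Å cutoff, the function names `contact_map` and `subsample_contacts`, and the toy coordinates are all assumptions made for illustration, since the abstract does not state how contacts were defined.

```python
import math
import random

def contact_map(coords, cutoff=8.0):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff` (in Å).

    The 8 Å cutoff is an assumed, commonly used value, not one taken from the abstract.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts

def subsample_contacts(contacts, fraction, seed=0):
    """Keep a randomly chosen `fraction` of the contacts (count rounded to nearest)."""
    rng = random.Random(seed)
    k = round(fraction * len(contacts))
    return set(rng.sample(sorted(contacts), k))

# Toy chain: five points spaced 3.8 Å apart along a straight line.
coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
full = contact_map(coords)                 # pairs separated by 1 or 2 positions
partial = subsample_contacts(full, 0.30)   # roughly 30% of the contacts
```

A reconstruction procedure would then be given only `partial` as restraints, which is the kind of experiment the abstract describes for KcsA.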
Affiliation(s)
- Bogumil M Konopka
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370, Wrocław, Poland
|
486
|
Ma J, Wang S, Wang Z, Xu J. MRFalign: protein homology detection through alignment of Markov random fields. PLoS Comput Biol 2014; 10:e1003500. [PMID: 24675572 PMCID: PMC3967925 DOI: 10.1371/journal.pcbi.1003500] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Received: 10/27/2013] [Accepted: 01/08/2014] [Indexed: 11/24/2022] Open
Abstract
Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5. 
Sequence-based protein homology detection has been extensively studied, but it remains very challenging for remote homologs with divergent sequences. So far the most sensitive methods employ HMM-HMM comparison, which models a protein family using an HMM (Hidden Markov Model) and then detects homologs using HMM-HMM alignment. An HMM cannot model long-range residue interaction patterns and thus carries very little information regarding the global 3D structure of a protein family. As such, HMM comparison is not sensitive enough for distantly related homologs. In this paper, we present an MRF-MRF comparison method for homology detection. In particular, we model a protein family using Markov Random Fields (MRF) and then detect homologs by MRF-MRF alignment. Compared to HMMs, MRFs are able to model long-range residue interaction patterns and thus contain information about the overall 3D structure of a protein family. Consequently, MRF-MRF comparison is much more sensitive than HMM-HMM comparison. To implement MRF-MRF comparison, we have developed a new scoring function to measure the similarity of two MRFs and also an efficient ADMM algorithm to optimize the scoring function. Experiments confirm that MRF-MRF comparison indeed outperforms HMM-HMM comparison in terms of both alignment accuracy and remote homology detection, especially for mainly beta proteins.
Affiliation(s)
- Jianzhu Ma
- Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Sheng Wang
- Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Zhiyong Wang
- Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
|
487
|
Kaján L, Hopf TA, Kalaš M, Marks DS, Rost B. FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics 2014; 15:85. [PMID: 24669753 PMCID: PMC3987048 DOI: 10.1186/1471-2105-15-85] [Citation(s) in RCA: 128] [Impact Index Per Article: 11.6] [Received: 09/30/2013] [Accepted: 03/18/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND 20 years of improved technology and growing sequences now render residue-residue contact constraints in large protein families through correlated mutations accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. In addition, EVfold-mfDCA depends on proprietary software. RESULTS Here, we present FreeContact, a fast, open source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library "libfreecontact", complete with command line tool "freecontact", as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability. CONCLUSIONS FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud).
Affiliation(s)
- Burkhard Rost
- Department for Bioinformatics and Computational Biology, TU Munich, Boltzmannstraße 3, Garching 85748, Germany.
|
488
|
Baldassi C, Zamparo M, Feinauer C, Procaccini A, Zecchina R, Weigt M, Pagnani A. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners. PLoS One 2014; 9:e92721. [PMID: 24663061 PMCID: PMC3963956 DOI: 10.1371/journal.pone.0092721] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.1] [Received: 12/05/2013] [Accepted: 02/24/2014] [Indexed: 11/18/2022] Open
Abstract
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partners in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
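Why a Gaussian model makes the inference exactly solvable can be seen in a toy case: for Gaussian variables the direct couplings are read off the inverse of the covariance matrix. The sketch below is an illustration under stated assumptions, not the authors' implementation. The three-variable chain, the helper `invert`, and the sign convention (coupling = negated off-diagonal precision entry) are chosen here for the example.

```python
from fractions import Fraction

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination using exact fractions."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row to 1, then clear this column in every other row.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Covariance of a hypothetical chain X -> Y -> Z (Y = X + noise, Z = Y + noise,
# all noise terms with unit variance). X and Z are correlated only through Y.
covariance = [[1, 1, 1],
              [1, 2, 2],
              [1, 2, 3]]
precision = invert(covariance)

# Assumed convention: direct coupling = negated off-diagonal precision entry.
couplings = [[-precision[i][j] if i != j else Fraction(0) for j in range(3)]
             for i in range(3)]
```

Here `covariance[0][2]` is non-zero although X and Z never interact directly, while `couplings[0][2]` is zero. A real application would first encode the alignment numerically and regularize the covariance estimate, which this sketch omits.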
Affiliation(s)
- Carlo Baldassi
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Marco Zamparo
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Christoph Feinauer
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Riccardo Zecchina
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
- Martin Weigt
- Sorbonne Universités, Université Pierre et Marie Curie Paris 06, UMR 7238, Computational and Quantitative Biology, Paris, France
- Centre National de la Recherche Scientifique, UMR 7238, Computational and Quantitative Biology, Paris, France
- Andrea Pagnani
- Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
- Human Genetics Foundation-Torino, Torino, Italy
|
489
|
Kosciolek T, Jones DT. De novo structure prediction of globular proteins aided by sequence variation-derived contacts. PLoS One 2014; 9:e92197. [PMID: 24637808 PMCID: PMC3956894 DOI: 10.1371/journal.pone.0092197] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Received: 11/06/2013] [Accepted: 02/19/2014] [Indexed: 12/21/2022] Open
Abstract
The advent of high accuracy residue-residue intra-protein contact prediction methods enabled a significant boost in the quality of de novo structure predictions. Here, we investigate the potential benefits of combining a well-established fragment-based folding algorithm--FRAGFOLD, with PSICOV, a contact prediction method which uses sparse inverse covariance estimation to identify co-varying sites in multiple sequence alignments. Using a comprehensive set of 150 diverse globular target proteins, up to 266 amino acids in length, we are able to address the effectiveness and some limitations of such approaches to globular proteins in practice. Overall we find that using fragment assembly with both statistical potentials and predicted contacts is significantly better than either statistical potentials or contacts alone. Results show up to nearly 80% of correct predictions (TM-score ≥0.5) within analysed dataset and a mean TM-score of 0.54. Unsuccessful modelling cases emerged either from conformational sampling problems, or insufficient contact prediction accuracy. Nevertheless, a strong dependency of the quality of final models on the fraction of satisfied predicted long-range contacts was observed. This not only highlights the importance of these contacts on determining the protein fold, but also (combined with other ensemble-derived qualities) provides a powerful guide as to the choice of correct models and the global quality of the selected model. A proposed quality assessment scoring function achieves 0.93 precision and 0.77 recall for the discrimination of correct folds on our dataset of decoys. These findings suggest the approach is well-suited for blind predictions on a variety of globular proteins of unknown 3D structure, provided that enough homologous sequences are available to construct a large and accurate multiple sequence alignment for the initial contact prediction step.
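The TM-score ≥ 0.5 criterion used above can be made concrete with a short sketch. It assumes the commonly cited TM-score definition (a length-dependent scale d0 and a sum of 1/(1 + (d/d0)²) terms normalized by target length) and assumes the model has already been superimposed on the target. The real score is a maximum over superpositions, which this sketch does not search for. The function name `tm_score` and the example length are illustrative only.

```python
def tm_score(distances, target_length):
    """TM-score for aligned residue pairs, given their distances (Å) after superposition.

    Assumes the standard scale d0 = 1.24 * (L - 15)^(1/3) - 1.8, which is only
    meaningful for targets comfortably longer than 15 residues.
    """
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

L = 150                                       # illustrative target length
d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8     # about 4.6 Å for this length
perfect = tm_score([0.0] * L, L)              # every residue exactly on target
borderline = tm_score([d0] * L, L)            # every residue off by exactly d0
```

With every residue displaced by exactly d0, each term contributes one half, so the score sits at the 0.5 threshold that the abstract uses to call a prediction correct.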
Affiliation(s)
- Tomasz Kosciolek
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
- David T. Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
|
490
|
Jana B, Morcos F, Onuchic JN. From structure to function: the convergence of structure based models and co-evolutionary information. Phys Chem Chem Phys 2014; 16:6496-507. [PMID: 24603809 DOI: 10.1039/c3cp55275f] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Indexed: 11/21/2022]
Abstract
Understanding protein folding and function is one of the most important problems in biological research. Energy landscape theory and the folding funnel concept have provided a framework to investigate the mechanisms associated with these processes. Since protein energy landscapes are in most cases minimally frustrated, structure based models (SBMs) have successfully determined the geometrical features associated with folding and functional transitions. However, structural information is limited, particularly with respect to different functional configurations. This is a major limitation for SBMs. Alternatively, statistical methods to study amino acid co-evolution provide information on residue-residue interactions useful for the study of structure and function. Here, we show how the combination of these two methods gives rise to a novel way to investigate the mechanisms associated with folding and function. We use this methodology to explore the mechanistic aspects of protein translocation in the integral membrane protease FtsH. Dual basin-SBM simulations using the open and closed state of this hexameric motor reveal a functionally important paddling motion in the catalytic cycle. We also find that Direct Coupling Analysis (DCA) predicts physical contacts between AAA and peptidase domains of the motor, which are crucial for the open to close transition. Our combined method, which uses structural information from the open state experimental structure and co-evolutionary couplings, suggests that this methodology can be used to explore the functional landscape of complex biological macromolecules previously inaccessible to methods dependent on experimental structural information. This efficient way to sample the conformational space of large systems creates a theoretical/computational framework capable of better characterizing the functional landscape in large biomolecular assemblies.
Affiliation(s)
- Biman Jana
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005-1827, USA.
|
491
|
Reconstructing protein structures by neural network pairwise interaction fields and iterative decoy set construction. Biomolecules 2014; 4:160-80. [PMID: 24970210 PMCID: PMC4030983 DOI: 10.3390/biom4010160] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/24/2013] [Revised: 01/22/2014] [Accepted: 01/30/2014] [Indexed: 11/17/2022] Open
Abstract
Predicting the fold of a protein from its amino acid sequence is one of the grand problems in computational biology. While there has been progress towards a solution, especially when a protein can be modelled based on one or more known structures (templates), in the absence of templates, even the best predictions are generally much less reliable. In this paper, we present an approach for predicting the three-dimensional structure of a protein from the sequence alone, when templates of known structure are not available. This approach relies on a simple reconstruction procedure guided by a novel knowledge-based evaluation function implemented as a class of artificial neural networks that we have designed: Neural Network Pairwise Interaction Fields (NNPIF). This evaluation function takes into account the contextual information for each residue and is trained to identify native-like conformations from non-native-like ones by using large sets of decoys as a training set. The training set is generated and then iteratively expanded during successive folding simulations. As NNPIF are fast at evaluating conformations, thousands of models can be processed in a short amount of time, and clustering techniques can be adopted for model selection. Although the results we present here are very preliminary, we consider them to be promising, with predictions being generated at state-of-the-art levels in some of the cases.
|
492
|
Kell DB, Goodacre R. Metabolomics and systems pharmacology: why and how to model the human metabolic network for drug discovery. Drug Discov Today 2014; 19:171-82. [PMID: 23892182 PMCID: PMC3989035 DOI: 10.1016/j.drudis.2013.07.014] [Citation(s) in RCA: 111] [Impact Index Per Article: 10.1] [Received: 03/25/2013] [Revised: 07/03/2013] [Accepted: 07/16/2013] [Indexed: 02/06/2023]
Abstract
Metabolism represents the 'sharp end' of systems biology, because changes in metabolite concentrations are necessarily amplified relative to changes in the transcriptome, proteome and enzyme activities, which can be modulated by drugs. To understand such behaviour, we therefore need (and increasingly have) reliable consensus (community) models of the human metabolic network that include the important transporters. Small molecule 'drug' transporters are in fact metabolite transporters, because drugs bear structural similarities to metabolites known from the network reconstructions and from measurements of the metabolome. Recon2 represents the present state-of-the-art human metabolic network reconstruction; it can predict inter alia: (i) the effects of inborn errors of metabolism; (ii) which metabolites are exometabolites, and (iii) how metabolism varies between tissues and cellular compartments. However, even these qualitative network models are not yet complete. As our understanding improves so do we recognise more clearly the need for a systems (poly)pharmacology.
Affiliation(s)
- Douglas B Kell
- School of Chemistry and Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK.
- Royston Goodacre
- School of Chemistry and Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
|
493
|
Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci U S A 2014; 111:E563-71. [PMID: 24449878 DOI: 10.1073/pnas.1323734111] [Citation(s) in RCA: 94] [Impact Index Per Article: 8.5] [Indexed: 01/09/2023] Open
Abstract
A challenge in molecular biology is to distinguish the key subset of residues that allow two-component signaling (TCS) proteins to recognize their correct signaling partner such that they can transiently bind and transfer signal, i.e., phosphoryl group. Detailed knowledge of this information would allow one to search sequence space for mutations that can be used to systematically tune the signal transmission between TCS partners as well as potentially encode a TCS protein to preferentially transfer signals to a nonpartner. Motivated by the notion that this detailed information is found in sequence data, we explore the sequence coevolution between signaling partners to better understand how mutations can positively or negatively alter their ability to transfer signal. Using direct coupling analysis for determining evolutionarily conserved protein-protein interactions, we apply a metric called the direct information score to quantify mutational changes in the interaction between TCS proteins and demonstrate that it accurately correlates with experimental mutagenesis studies probing the mutational change in measured in vitro phosphotransfer. Furthermore, by subtracting from our metric an appropriate null model corresponding to generic, conserved features in TCS signaling pairs, we can isolate the determinants that give rise to interaction specificity and recognition, which are variable among different TCS partners. Our methodology forms a potential framework for the rational design of TCS systems by allowing one to quickly search sequence space for mutations or even entirely new sequences that can increase or decrease our metric, as a proxy for increasing or decreasing phosphotransfer ability between TCS proteins.
|
494
|
Tetchner S, Kosciolek T, Jones DT. Opportunities and limitations in applying coevolution-derived contacts to protein structure prediction. Bio-Algorithms and Med-Systems 2014. [DOI: 10.1515/bams-2014-0013] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022]
Abstract
The prospect of identifying contacts in protein structures purely from aligned protein sequences has lured researchers for a long time, but progress has been modest until recently. Here, we reviewed the most successful methods for identifying structural contacts from sequence and how these methods differ and made an initial assessment of the overlap of predicted contacts by alternative approaches. We then discussed the limitations of these methods and possibilities for future development and highlighted the recent applications of contacts in tertiary structure prediction, identifying the residues at the interfaces of protein-protein interactions, and the use of these methods in disentangling alternative conformational states. Finally, we identified the current challenges in the field of contact prediction, concentrating on the limitations imposed by available data, dependencies on the sequence alignments, and possible future developments.
|
495
|
Kukic P, Mirabello C, Tradigo G, Walsh I, Veltri P, Pollastri G. Toward an accurate prediction of inter-residue distances in proteins using 2D recursive neural networks. BMC Bioinformatics 2014; 15:6. [PMID: 24410833 PMCID: PMC3893389 DOI: 10.1186/1471-2105-15-6] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Received: 01/28/2013] [Accepted: 12/20/2013] [Indexed: 11/21/2022] Open
Abstract
Background Protein inter-residue contact maps provide a translation and rotation invariant topological representation of a protein. They can be used as an intermediary step in protein structure predictions. However, the prediction of contact maps represents an unbalanced problem as far fewer examples of contacts than non-contacts exist in a protein structure. In this study we explore the possibility of completely eliminating the unbalanced nature of the contact map prediction problem by predicting real-value distances between residues. Predicting full inter-residue distance maps and applying them in protein structure predictions has been relatively unexplored in the past. Results We initially demonstrate that the use of native-like distance maps is able to reproduce 3D structures almost identical to the targets, giving an average RMSD of 0.5 Å. In addition, the corrupted physical maps with an introduced random error of ±6 Å are able to reconstruct the targets within an average RMSD of 2 Å. After demonstrating the reconstruction potential of distance maps, we develop two classes of predictors using two-dimensional recursive neural networks: an ab initio predictor that relies only on the protein sequence and evolutionary information, and a template-based predictor in which additional structural homology information is provided. We find that the ab initio predictor is able to reproduce distances with an RMSD of 6 Å, regardless of the evolutionary content provided. Furthermore, we show that the template-based predictor exploits both sequence and structure information even in cases of dubious homology and outperforms the best template hit by a clear margin of up to 3.7 Å. Lastly, we demonstrate the ability of the two predictors to reconstruct the CASP9 targets shorter than 200 residues, producing results similar to the state-of-the-art machine learning approach implemented in the Distill server.
Conclusions The methodology presented here, if complemented by more complex reconstruction protocols, can represent a possible path to improve machine learning algorithms for 3D protein structure prediction. Moreover, it can be used as an intermediary step in protein structure predictions either on its own or complemented by NMR restraints.
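The distance-map representation and the distance RMSD reported above can be sketched in a few lines. This is an illustration only, not the predictors from the paper. The helper names `distance_map` and `distance_rmsd` and the three-point toy structures are hypothetical.

```python
import math

def distance_map(coords):
    """Full matrix of pairwise Euclidean distances between residue coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def distance_rmsd(map_a, map_b):
    """Root-mean-square difference over the upper triangle (i < j) of two maps."""
    n = len(map_a)
    sq = [(map_a[i][j] - map_b[i][j]) ** 2
          for i in range(n) for j in range(i + 1, n)]
    return math.sqrt(sum(sq) / len(sq))

# Toy "native" structure: a 3-4-5 right triangle of residues.
native = [(0, 0, 0), (3, 0, 0), (3, 4, 0)]
# The same structure translated in space: its distance map is unchanged.
shifted = [(x + 10, y - 2, z + 5) for x, y, z in native]
# A model stretched by a factor of two: every distance is doubled.
stretched = [(2 * x, 2 * y, 2 * z) for x, y, z in native]

same = distance_rmsd(distance_map(native), distance_map(shifted))
off = distance_rmsd(distance_map(native), distance_map(stretched))
```

The translated copy gives a distance RMSD of zero, which is the invariance property the abstract relies on, while the stretched copy gives a non-zero error that a predictor would be trained to reduce.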
Affiliation(s)
- Predrag Kukic
- School of Computer Science and Informatics, Complex and Adaptive Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland.
|
496
|
Morcos F, Hwa T, Onuchic JN, Weigt M. Direct coupling analysis for protein contact prediction. Methods Mol Biol 2014; 1137:55-70. [PMID: 24573474 DOI: 10.1007/978-1-4939-0366-5_5] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 12/12/2022]
Abstract
During evolution, the structure and function of proteins are remarkably conserved, whereas amino-acid sequences vary strongly between homologous proteins. Structural conservation constrains sequence variability and forces different residues to coevolve, i.e., to show correlated patterns of amino-acid occurrences. However, residue correlation may result from direct coupling, e.g., by a contact in the folded protein, or be induced indirectly via intermediate residues. To use empirically observed correlations for predicting residue-residue contacts, direct and indirect effects have to be disentangled. Here we present mechanistic details on how to achieve this using a methodology called Direct Coupling Analysis (DCA). DCA has been shown to produce highly accurate estimates of amino-acid pairs that have direct reciprocal constraints in evolution. Specifically, we provide instructions and protocols on how to use the algorithmic implementations of DCA starting from data extraction to predicted-contact visualization in contact maps or representative protein structures.
Affiliation(s)
- Faruck Morcos
- Center for Theoretical Biological Physics, Rice University, Houston, TX, USA
|
497
|
Kamisetty H, Ghosh B, Langmead CJ, Bailey-Kellogg C. Learning Sequence Determinants of Protein:protein Interaction Specificity with Sparse Graphical Models. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2014; 8394:129-143. [PMID: 25414914 DOI: 10.1007/978-3-319-05269-4_10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/05/2023]
Abstract
In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (Data-driven Graphical models of Specificity in Protein:protein Interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules, against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, as a generative model, DgSpi readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.
|
498
|
Terashi G, Nakamura Y, Shimoyama H, Takeda-Shitaka M. Quality Assessment Methods for 3D Protein Structure Models Based on a Residue–Residue Distance Matrix Prediction. Chem Pharm Bull (Tokyo) 2014; 62:744-53. [PMID: 25087626 DOI: 10.1248/cpb.c13-00973] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
|
499
|
Probabilistic grammatical model for helix-helix contact site classification. Algorithms Mol Biol 2013; 8:31. [PMID: 24350601 PMCID: PMC3892132 DOI: 10.1186/1748-7188-8-31] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/09/2013] [Accepted: 11/28/2013] [Indexed: 11/25/2022] Open
Abstract
Background: Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions. Capturing such interactions requires the expressive power of at least context-free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. Results: In this work, we present a probabilistic grammatical framework for problem-specific protein languages and apply it to the classification of transmembrane helix-helix pair configurations. The core of the model consists of a probabilistic context-free grammar, automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training samples. The model was applied to produce sequence-based descriptors of four classes of transmembrane helix-helix contact site configurations. The highest performance of the classifiers reached an AUCROC of 0.70. The analysis of grammar parse trees revealed their ability to represent structural features of helix-helix contact sites. Conclusions: We demonstrated that our probabilistic context-free framework for analysis of protein sequences outperforms the state of the art in the task of helix-helix contact site classification. However, this is achieved without necessarily requiring the modeling of long-range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human-readable. Thus, they could provide biologically meaningful information for molecular biologists.
|
500
|
Wang Z, Xu J. Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics 2013; 29:i266-73. [PMID: 23812992 PMCID: PMC3694661 DOI: 10.1093/bioinformatics/btt211] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Indexed: 11/14/2022] Open
Abstract
Motivation: A protein contact map describes the pairwise spatial and functional relationships of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains challenging to predict contact maps using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and the physical feasibility of the whole contact map. A couple of recent methods predict contact maps by using mutual information, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods demand a very large number of sequence homologs for the protein under consideration, and the resultant contact map may still be physically infeasible. Results: This article presents a novel method, PhyCMAP, for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming. The evolutionary restraints are much more informative than mutual information, and the physical restraints specify more concrete relationships among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and, thus, significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. Availability: http://raptorx.uchicago.edu. Contact: jinboxu@gmail.com
Affiliation(s)
- Zhiyong Wang
- Toyota Technological Institute at Chicago, 6045 S Kenwood, IL 60637, USA
|