1
|
Harteveld Z, Van Hall-Beauvais A, Morozova I, Southern J, Goverde C, Georgeon S, Rosset S, Defferrard M, Loukas A, Vandergheynst P, Bronstein MM, Correia BE. Exploring "dark-matter" protein folds using deep learning. Cell Syst 2024; 15:898-910.e5. [PMID: 39383860 DOI: 10.1016/j.cels.2024.09.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2023] [Revised: 06/13/2024] [Accepted: 09/16/2024] [Indexed: 10/11/2024]
Abstract
De novo protein design explores uncharted sequence and structure space to generate novel proteins not sampled by evolution. A main challenge in de novo design involves crafting "designable" structural templates to guide the sequence searches toward adopting target structures. We present a convolutional variational autoencoder that learns patterns of protein structure, dubbed Genesis. We coupled Genesis with trRosetta to design sequences for a set of protein folds and found that Genesis is capable of reconstructing native-like distance and angle distributions for five native folds and three novel, the so-called "dark-matter" folds as a demonstration of generalizability. We used a high-throughput assay to characterize the stability of the designs through protease resistance, obtaining encouraging success rates for folded proteins. Genesis enables exploration of the protein fold space within minutes, unrestricted by protein topologies. Our approach addresses the backbone designability problem, showing that small neural networks can efficiently learn structural patterns in proteins. A record of this paper's transparent peer review process is included in the supplemental information.
Collapse
Affiliation(s)
- Zander Harteveld
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Alexandra Van Hall-Beauvais
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Irina Morozova
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | | | - Casper Goverde
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | | | - Stéphane Rosset
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | | | - Andreas Loukas
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Prescient Design, gRED, Roche, Basel, Switzerland
| | | | | | - Bruno E Correia
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
| |
Collapse
|
2
|
Middendorf L, Ravi Iyengar B, Eicholt LA. Sequence, Structure, and Functional Space of Drosophila De Novo Proteins. Genome Biol Evol 2024; 16:evae176. [PMID: 39212966 PMCID: PMC11363682 DOI: 10.1093/gbe/evae176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/29/2024] [Indexed: 09/04/2024] Open
Abstract
During de novo emergence, new protein coding genes emerge from previously nongenic sequences. The de novo proteins they encode are dissimilar in composition and predicted biochemical properties to conserved proteins. However, functional de novo proteins indeed exist. Both identification of functional de novo proteins and their structural characterization are experimentally laborious. To identify functional and structured de novo proteins in silico, we applied recently developed machine learning based tools and found that most de novo proteins are indeed different from conserved proteins both in their structure and sequence. However, some de novo proteins are predicted to adopt known protein folds, participate in cellular reactions, and to form biomolecular condensates. Apart from broadening our understanding of de novo protein evolution, our study also provides a large set of testable hypotheses for focused experimental studies on structure and function of de novo proteins in Drosophila.
Collapse
Affiliation(s)
- Lasse Middendorf
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| | - Bharat Ravi Iyengar
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| | - Lars A Eicholt
- Institute for Evolution and Biodiversity, University of Muenster, Huefferstrasse 1, 48149 Muenster, Germany
| |
Collapse
|
3
|
Listov D, Goverde CA, Correia BE, Fleishman SJ. Opportunities and challenges in design and optimization of protein function. Nat Rev Mol Cell Biol 2024; 25:639-653. [PMID: 38565617 PMCID: PMC7616297 DOI: 10.1038/s41580-024-00718-y] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/27/2024] [Indexed: 04/04/2024]
Abstract
The field of protein design has made remarkable progress over the past decade. Historically, the low reliability of purely structure-based design methods limited their application, but recent strategies that combine structure-based and sequence-based calculations, as well as machine learning tools, have dramatically improved protein engineering and design. In this Review, we discuss how these methods have enabled the design of increasingly complex structures and therapeutically relevant activities. Additionally, protein optimization methods have improved the stability and activity of complex eukaryotic proteins. Thanks to their increased reliability, computational design methods have been applied to improve therapeutics and enzymes for green chemistry and have generated vaccine antigens, antivirals and drug-delivery nano-vehicles. Moreover, the high success of design methods reflects an increased understanding of basic rules that govern the relationships among protein sequence, structure and function. However, de novo design is still limited mostly to α-helix bundles, restricting its potential to generate sophisticated enzymes and diverse protein and small-molecule binders. Designing complex protein structures is a challenging but necessary next step if we are to realize our objective of generating new-to-nature activities.
Collapse
Affiliation(s)
- Dina Listov
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
| | - Casper A Goverde
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Bruno E Correia
- Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
| | - Sarel Jacob Fleishman
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel.
| |
Collapse
|
4
|
Porter LL. Fluid protein fold space and its implications. Bioessays 2023; 45:e2300057. [PMID: 37431685 PMCID: PMC10529699 DOI: 10.1002/bies.202300057] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Revised: 06/21/2023] [Accepted: 06/23/2023] [Indexed: 07/12/2023]
Abstract
Fold-switching proteins, which remodel their secondary and tertiary structures in response to cellular stimuli, suggest a new view of protein fold space. For decades, experimental evidence has indicated that protein fold space is discrete: dissimilar folds are encoded by dissimilar amino acid sequences. Challenging this assumption, fold-switching proteins interconnect discrete groups of dissimilar protein folds, making protein fold space fluid. Three recent observations support the concept of fluid fold space: (1) some amino acid sequences interconvert between folds with distinct secondary structures, (2) some naturally occurring sequences have switched folds by stepwise mutation, and (3) fold switching is evolutionarily selected and likely confers advantage. These observations indicate that minor amino acid sequence modifications can transform protein structure and function. Consequently, proteomic structural and functional diversity may be expanded by alternative splicing, small nucleotide polymorphisms, post-translational modifications, and modified translation rates.
Collapse
Affiliation(s)
- Lauren L. Porter
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
- National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD
| |
Collapse
|
5
|
Burroughs A, Aravind L. New biochemistry in the Rhodanese-phosphatase superfamily: emerging roles in diverse metabolic processes, nucleic acid modifications, and biological conflicts. NAR Genom Bioinform 2023; 5:lqad029. [PMID: 36968430 PMCID: PMC10034599 DOI: 10.1093/nargab/lqad029] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2022] [Revised: 02/10/2023] [Accepted: 03/09/2023] [Indexed: 03/25/2023] Open
Abstract
The protein-tyrosine/dual-specificity phosphatases and rhodanese domains constitute a sprawling superfamily of Rossmannoid domains that use a conserved active site with a cysteine to catalyze a range of phosphate-transfer, thiotransfer, selenotransfer and redox activities. While these enzymes have been extensively studied in the context of protein/lipid head group dephosphorylation and various thiotransfer reactions, their overall diversity and catalytic potential remain poorly understood. Using comparative genomics and sequence/structure analysis, we comprehensively investigate and develop a natural classification for this superfamily. As a result, we identified several novel clades, both those which retain the catalytic cysteine and those where a distinct active site has emerged in the same location (e.g. diphthine synthase-like methylases and RNA 2' OH ribosyl phosphate transferases). We also present evidence that the superfamily has a wider range of catalytic capabilities than previously known, including a set of parallel activities operating on various sugar/sugar alcohol groups in the context of NAD+-derivatives and RNA termini, and potential phosphate transfer activities involving sugars and nucleotides. We show that such activities are particularly expanded in the RapZ-C-DUF488-DUF4326 clade, defined here for the first time. Some enzymes from this clade are predicted to catalyze novel DNA-end processing activities as part of nucleic-acid-modifying systems that are likely to function in biological conflicts between viruses and their hosts.
Collapse
Affiliation(s)
- A Maxwell Burroughs
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - L Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|
6
|
Tajana M, Trovato A, Tiana G. Key interaction patterns in proteins revealed by cluster expansion of the partition function. THE EUROPEAN PHYSICAL JOURNAL. E, SOFT MATTER 2022; 45:95. [PMID: 36447074 DOI: 10.1140/epje/s10189-022-00250-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/29/2022] [Accepted: 11/19/2022] [Indexed: 06/16/2023]
Abstract
The native conformation of structured proteins is stabilized by a complex network of interactions. We analyzed the elementary patterns that constitute such network and ranked them according to their importance in shaping protein sequence design. To achieve this goal, we employed a cluster expansion of the partition function in the space of sequences and evaluated numerically the statistical importance of each cluster. An important feature of this procedure is that it is applied to a dense finite system. We found that patterns that contribute most to the partition function are cycles with even numbers of nodes, while cliques are typically detrimental. Each cluster also gives a contribute to the sequence entropy, which is a measure of the evolutionary designability of a fold. We compared the entropies associated with different interaction patterns to their abundances in the native structures of real proteins.
Collapse
Affiliation(s)
- Matteo Tajana
- Department of Physics, Università degli Studi di Milano, Via Celoria 16, 20133, Milan, Italy
| | - Antonio Trovato
- Department of Physics and Astronomy "G. Galilei", Università degli Studi di Padova and INFN, Via Marzolo 8, 35121, Padova, Italy
| | - Guido Tiana
- Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, Via Celoria 16, 20133, Milan, Italy.
| |
Collapse
|
7
|
Abstract
De novo protein design enables the exploration of novel sequences and structures absent from the natural protein universe. De novo design also stands as a stringent test for our understanding of the underlying physical principles of protein folding and may lead to the development of proteins with unmatched functional characteristics. The first fundamental challenge of de novo design is to devise "designable" structural templates leading to sequences that will adopt the predicted fold. Here, we built on the TopoBuilder (TB) de novo design method, to automatically assemble structural templates with native-like features starting from string descriptors that capture the overall topology of proteins. Our framework eliminates the dependency of hand-crafted and fold-specific rules through an iterative, data-driven approach that extracts geometrical parameters from structural tertiary motifs. We evaluated the TopoBuilder framework by designing sequences for a set of five protein folds and experimental characterization revealed that several sequences were folded and stable in solution. The TopoBuilder de novo design framework will be broadly useful to guide the generation of artificial proteins with customized geometries, enabling the exploration of the protein universe.
Collapse
|
8
|
Youssef N, Susko E, Roger AJ, Bielawski JP. Evolution of amino acid propensities under stability-mediated epistasis. Mol Biol Evol 2022; 39:6522130. [PMID: 35134997 PMCID: PMC8896634 DOI: 10.1093/molbev/msac030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Site-specific amino acid preferences are influenced by the genetic background of the protein. The preferences for resident amino acids are expected to, on average, increase over time because of replacements at other sites - a nonadaptive phenomenon referred to as the 'evolutionary Stokes shift'. Alternatively, decreases in resident amino acid propensity have recently been viewed as evidence of adaptations to external environmental changes. Using population genetics theory and thermodynamic stability-constraints, we show that nonadaptive evolution can lead to both positive and negative shifts in propensities following the fixation of an amino acid, emphasizing that the detection of negative shifts is not conclusive evidence of adaptation. Considering shifts in propensities over windows between substitutions at a focal site, we find that following ≈ 50% of substitutions the propensity for the new resident amino acid decreases over time, and both positive and negative shifts were comparable in magnitude. Preferences were often conserved via a significant negative autocorrelation in propensity changes-increases in propensities often followed by decreases, and vice versa. Lastly, we explore the underlying mechanisms that lead propensities to fluctuate. We observe that stabilizing replacements increase the mutational tolerance at a site and in doing so decrease the propensity for the resident amino acid. In contrast, destabilizing substitutions result in more rugged fitness landscapes that tend to favor the resident amino acid. In summary, our results characterize propensity trajectories under nonadaptive stability-constrained evolution against which evidence of adaptations should be calibrated.
Collapse
Affiliation(s)
- Noor Youssef
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Edward Susko
- Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada
| | - Andrew J Roger
- Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS, Canada
| | - Joseph P Bielawski
- Department of Biology, Dalhousie University, Halifax, Nova Scotia, Canada Department of Mathematics and Statistics, Dalhousie University, Halifax, Nova Scotia, Canada
| |
Collapse
|
9
|
Negri M, Tiana G, Zecchina R. Native state of natural proteins optimizes local entropy. Phys Rev E 2021; 104:064117. [PMID: 35030941 DOI: 10.1103/physreve.104.064117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2021] [Accepted: 11/24/2021] [Indexed: 06/14/2023]
Abstract
The differing ability of polypeptide conformations to act as the native state of proteins has long been rationalized in terms of differing kinetic accessibility or thermodynamic stability. Building on the successful applications of physical concepts and sampling algorithms recently introduced in the study of disordered systems, in particular artificial neural networks, we quantitatively explore how well a quantity known as the local entropy describes the native state of model proteins. In lattice models and all-atom representations of proteins, we are able to efficiently sample high local entropy states and to provide a proof of concept of enhanced stability and folding rate. Our methods are based on simple and general statistical-mechanics arguments, and thus we expect that they are of very general use.
Collapse
Affiliation(s)
- M Negri
- Department Applied Science and Technology, Politecnico di Torino, CorsoDuca degli Abruzzi 24, I-10129 Turin, Italy
| | - G Tiana
- Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133 Milan, Italy
| | - R Zecchina
- Artificial Intelligence Lab, Bocconi University, Via Sarfatti, 25, 20136 Milan, Italy
| |
Collapse
|
10
|
Kinjo AR. Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families. J Theor Biol 2018; 443:18-27. [PMID: 29355538 DOI: 10.1016/j.jtbi.2018.01.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2017] [Revised: 01/16/2018] [Accepted: 01/17/2018] [Indexed: 12/23/2022]
Abstract
In the protein sequence space, natural proteins form clusters of families which are characterized by their unique native folds whereas the great majority of random polypeptides are neither clustered nor foldable to unique structures. Since a given polypeptide can be either foldable or unfoldable, a kind of "folding transition" is expected at the boundary of a protein family in the sequence space. By Monte Carlo simulations of a statistical mechanical model of protein sequence alignment that coherently incorporates both short-range and long-range interactions as well as variable-length insertions to reproduce the statistics of the multiple sequence alignment of a given protein family, we demonstrate the existence of such transition between natural-like sequences and random sequences in the sequence subspaces for 15 domain families of various folds. The transition was found to be highly cooperative and two-state-like. Furthermore, enforcing or suppressing consensus residues on a few of the well-conserved sites enhanced or diminished, respectively, the natural-like pattern formation over the entire sequence. In most families, the key sites included ligand binding sites. These results suggest some selective pressure on the key residues, such as ligand binding activity, may cooperatively facilitate the emergence of a protein family during evolution. From a more practical aspect, the present results highlight an essential role of long-range effects in precisely defining protein families, which are absent in conventional sequence models.
Collapse
Affiliation(s)
- Akira R Kinjo
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan.
| |
Collapse
|
11
|
Williams PD, Pollock DD, Goldstein RA. Functionality and the Evolution of Marginal Stability in Proteins: Inferences from Lattice Simulations. Evol Bioinform Online 2017. [DOI: 10.1177/117693430600200013] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
It has been known for some time that many proteins are marginally stable. This has inspired several explanations. Having noted that the functionality of many enzymes is correlated with subunit motion, flexibility, or general disorder, some have suggested that marginally stable proteins should have an evolutionary advantage over proteins of differing stability. Others have suggested that stability and functionality are contradictory qualities, and that selection for both criteria results in marginally stable proteins, optimised to satisfy the competing design pressures. While these explanations are plausible, recent research simulating the evolution of model proteins has shown that selection for stability, ignoring any aspects of functionality, can result in marginally stable proteins because of the underlying makeup of protein sequence-space. We extend this research by simulating the evolution of proteins, using a computational protein model that equates functionality with binding and catalysis. In the model, marginal stability is not required for ligand-binding functionality and we observe no competing design pressures. The resulting proteins are marginally stable, again demonstrating that neutral evolution is sufficient for explaining marginal stability in observed proteins.
Collapse
Affiliation(s)
- Paul D. Williams
- Department of Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
| | - David D. Pollock
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
| | - Richard A. Goldstein
- Mathematical Biology, National Institute for Medical Sciences, The Ridgeway, Mill Hill, London MW7 1AA, UK
| |
Collapse
|
12
|
Tian P, Best RB. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophys J 2017; 113:1719-1730. [PMID: 29045866 PMCID: PMC5647607 DOI: 10.1016/j.bpj.2017.08.039] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2017] [Revised: 08/03/2017] [Accepted: 08/08/2017] [Indexed: 12/23/2022] Open
Abstract
Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance.
Collapse
Affiliation(s)
- Pengfei Tian
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland
| | - Robert B Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland.
| |
Collapse
|
13
|
Abstract
Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, more rare geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence-a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure.
Collapse
|
14
|
The universal statistical distributions of the affinity, equilibrium constants, kinetics and specificity in biomolecular recognition. PLoS Comput Biol 2015; 11:e1004212. [PMID: 25885453 PMCID: PMC4401658 DOI: 10.1371/journal.pcbi.1004212] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2014] [Accepted: 02/24/2015] [Indexed: 01/01/2023] Open
Abstract
We uncovered the universal statistical laws for the biomolecular recognition/binding process. We quantified the statistical energy landscapes for binding, from which we can characterize the distributions of the binding free energy (affinity), the equilibrium constants, the kinetics and the specificity by exploring the different ligands binding with a particular receptor. The results of the analytical studies are confirmed by the microscopic flexible docking simulations. The distribution of binding affinity is Gaussian around the mean and becomes exponential near the tail. The equilibrium constants of the binding follow a log-normal distribution around the mean and a power law distribution in the tail. The intrinsic specificity for biomolecular recognition measures the degree of discrimination of native versus non-native binding and the optimization of which becomes the maximization of the ratio of the free energy gap between the native state and the average of non-native states versus the roughness measured by the variance of the free energy landscape around its mean. The intrinsic specificity obeys a Gaussian distribution near the mean and an exponential distribution near the tail. Furthermore, the kinetics of binding follows a log-normal distribution near the mean and a power law distribution at the tail. Our study provides new insights into the statistical nature of thermodynamics, kinetics and function from different ligands binding with a specific receptor or equivalently specific ligand binding with different receptors. The elucidation of distributions of the kinetics and free energy has guiding roles in studying biomolecular recognition and function through small-molecule evolution and chemical genetics. Uncovering the principles and underlying mechanisms of biomolecular recognition and molecular binding process is crucial for understanding the function and evolution, yet challenging. We meet the challenge by quantifying the statistical natures of the relevant physical variables of biomolecular recognition using the analytical model combined with microscopic flexible docking simulation methods. We uncovered the universal statistical laws obeyed by the affinity, equilibrium constant, intrinsic specificity and kinetics for biomolecular recognition. The general statistical laws based on energy landscape theory can serve as a conceptual framework for molecular recognition in biological repertoires. They can be applied to molecular selection, in vitro evolution process, high throughput screening and virtual screening for drug discovery. The statistical laws in combinations with experiments provide quantitative signatures of a specific ligand binding to a specific receptor, these resultant laws as a guideline will contribute to drug design against a specific target. Our developed statistical methodology is general and applicable for all other biomolecular recognitions.
Collapse
|
15
|
Ferrada E. The amino acid alphabet and the architecture of the protein sequence-structure map. I. Binary alphabets. PLoS Comput Biol 2014; 10:e1003946. [PMID: 25473967 PMCID: PMC4256021 DOI: 10.1371/journal.pcbi.1003946] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2014] [Accepted: 09/26/2014] [Indexed: 11/19/2022] Open
Abstract
The correspondence between protein sequences and structures, or sequence-structure map, relates to fundamental aspects of structural, evolutionary and synthetic biology. The specifics of the mapping, such as the fraction of accessible sequences and structures, or the sequences' ability to fold fast, are dictated by the type of interactions between the monomers that compose the sequences. The set of possible interactions between monomers is encapsulated by the potential energy function. In this study, I explore the impact of the relative forces of the potential on the architecture of the sequence-structure map. My observations rely on simple exact models of proteins and random samples of the space of potential energy functions of binary alphabets. I adopt a graph perspective and study the distribution of viable sequences and the structures they produce, as networks of sequences connected by point mutations. I observe that the relative proportion of attractive, neutral and repulsive forces defines types of potentials, that induce sequence-structure maps of vastly different architectures. I characterize the properties underlying these differences and relate them to the structure of the potential. Among these properties are the expected number and relative distribution of sequences associated to specific structures and the diversity of structures as a function of sequence divergence. I study the types of binary potentials observed in natural amino acids and show that there is a strong bias towards only some types of potentials, a bias that seems to characterize the folding code of natural proteins. I discuss implications of these observations for the architecture of the sequence-structure map of natural proteins, the construction of random libraries of peptides, and the early evolution of the natural amino acid alphabet.
Collapse
Affiliation(s)
- Evandro Ferrada
- Santa Fe Institute, Santa Fe, New Mexico, United States of America
| |
Collapse
|
16
|
‘Come into the fold’: A comparative analysis of bacterial redox enzyme maturation protein members of the NarJ subfamily. BIOCHIMICA ET BIOPHYSICA ACTA-BIOMEMBRANES 2014; 1838:2971-2984. [DOI: 10.1016/j.bbamem.2014.08.020] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Revised: 07/24/2014] [Accepted: 08/15/2014] [Indexed: 11/19/2022]
|
17
|
Mannige RV. Two modes of protein sequence evolution and their compositional dependencies. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2013; 87:062714. [PMID: 23848722 DOI: 10.1103/physreve.87.062714] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/14/2013] [Revised: 05/10/2013] [Indexed: 06/02/2023]
Abstract
Protein sequence evolution has resulted in a vast repertoire of molecular functionality crucial to life. Despite the central importance of sequence evolution to biology, our fundamental understanding of how sequence composition affects evolution is incomplete. This report describes the utilization of lattice model simulations of directed evolution, which indicate that, on average, peptide and protein evolvability is strongly dependent on initial sequence composition. The report also discusses two distinct regimes of sequence evolution by point mutation: (a) the "classical" mode where sequences "crawl" over free energy barriers towards acquiring a target fold, and (b) the "quantum" mode where sequences appear to "tunnel" through large energy barriers generally insurmountable by means of a crawl. Finally, the simulations indicate that oily and charged peptides are the most efficient substrates for evolution at the "classical" and "quantum" regimes, respectively, and that their respective response to temperature is commensurate with analogies made to barrier crossing in classical and quantum systems. On the whole, these results show that sequence composition can tune both the evolvability and the optimal mode of evolution of peptides and proteins.
Collapse
Affiliation(s)
- Ranjan V Mannige
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA.
| |
Collapse
|
18
|
|
19
|
Abstract
The observation of a limited secondary-structural alphabet in native proteins, with significant sequence preferences, has profoundly influenced the fields of protein design and structure prediction (Simons, Kooperberg, Huang, & Baker, 1997; Verschueren et al., 2011). In the era of structural genomics, as the size of the structural dataset continues to grow rapidly, it is becoming possible to extend this analysis to tertiary structural motifs and their sequences. For a hypothetical tertiary motif, the rate of its utilization in natural proteins may be used to assess its designability-the ease with which the motif can be realized with natural amino acids. This requires a structural similarity search methodology, which rather than looking for global topological agreement (more appropriate for categorization of full proteins or domains), identifies detailed geometric matches. In this chapter, we introduce such a method, called MaDCaT, and demonstrate its use by assessing the designability landscapes of two tertiary structural motifs. We also show that such analysis can establish structure/sequence links by providing the sequence constraints necessary to encode designable motifs. As logical extension of their secondary-structure counterparts, tertiary structural preferences will likely prove extremely useful in de novo protein design and structure prediction.
Collapse
Affiliation(s)
- Jian Zhang
- Department of Computer Science, Dartmouth College, Fax: 603-646-1672, 6211 Sudikoff Lab, Room 210, Hanover, NH 03755-3510, USA
| | - Gevorg Grigoryan
- Adjunct Professor of Biology, Dartmouth College, Phone: 603-646-3173, Fax: 603-646-1672, 6211 Sudikoff Lab, Room 113, Hanover, NH 03755-3510, USA
| |
Collapse
|
20
|
Narasimhan SL, Rajarajan AK, Vardharaj L. HP-sequence design for lattice proteins—An exact enumeration study on diamond as well as square lattice. J Chem Phys 2012; 137:115102. [DOI: 10.1063/1.4752479] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
|
21
|
Sikosek T, Bornberg-Bauer E, Chan HS. Evolutionary dynamics on protein bi-stability landscapes can potentially resolve adaptive conflicts. PLoS Comput Biol 2012; 8:e1002659. [PMID: 23028272 PMCID: PMC3441461 DOI: 10.1371/journal.pcbi.1002659] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2012] [Accepted: 07/12/2012] [Indexed: 11/18/2022] Open
Abstract
Experimental studies have shown that some proteins exist in two alternative native-state conformations. It has been proposed that such bi-stable proteins can potentially function as evolutionary bridges at the interface between two neutral networks of protein sequences that fold uniquely into the two different native conformations. Under adaptive conflict scenarios, bi-stable proteins may be of particular advantage if they simultaneously provide two beneficial biological functions. However, computational models that simulate protein structure evolution do not yet recognize the importance of bi-stability. Here we use a biophysical model to analyze sequence space to identify bi-stable or multi-stable proteins with two or more equally stable native-state structures. The inclusion of such proteins enhances phenotype connectivity between neutral networks in sequence space. Consideration of the sequence space neighborhood of bridge proteins revealed that bi-stability decreases gradually with each mutation that takes the sequence further away from an exactly bi-stable protein. With relaxed selection pressures, we found that bi-stable proteins in our model are highly successful under simulated adaptive conflict. Inspired by these model predictions, we developed a method to identify real proteins in the PDB with bridge-like properties, and have verified a clear bi-stability gradient for a series of mutants studied by Alexander et al. (Proc Nat Acad Sci USA 2009, 106:21149–21154) that connect two sequences that fold uniquely into two different native structures via a bridge-like intermediate mutant sequence. Based on these findings, new testable predictions for future studies on protein bi-stability and evolution are discussed. Proteins are essential molecules for performing a majority of functions in all biological systems. These functions often depend on the three-dimensional structures of proteins. Here, we investigate a fundamental question in molecular evolution: how can proteins acquire new advantageous structures via mutations while not sacrificing their existing structures that are still needed? Some authors have suggested that the same protein may adopt two or more alternative structures, switch between them and thus perform different functions with each of the alternative structures. Intuitively, such a protein could provide an evolutionary compromise between conflicting demands for existing and new protein structures. Yet no theoretical study has systematically tackled the biophysical basis of such compromises during evolutionary processes. Here we devise a model of evolution that specifically recognizes protein molecules that can exist in several different stable structures. Our model demonstrates that proteins can indeed utilize multiple structures to satisfy conflicting evolutionary requirements. In light of these results, we identify data from known protein structures that are consistent with our predictions and suggest novel directions for future investigation.
Collapse
Affiliation(s)
- Tobias Sikosek
- Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Münster, Münster, Germany.
| | | | | |
Collapse
|
22
|
Burke S, Elber R. Super folds, networks, and barriers. Proteins 2012; 80:463-70. [PMID: 22095563 PMCID: PMC3290721 DOI: 10.1002/prot.23212] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2011] [Revised: 08/31/2011] [Accepted: 09/22/2011] [Indexed: 11/06/2022]
Abstract
Exhaustive enumeration of sequences and folds is conducted for a simple lattice model of conformations, sequences, and energies. Examination of all foldable sequences and their nearest connected neighbors (sequences that differ by no more than a point mutation) illustrates the following: (i) There exist unusually large number of sequences that fold into a few structures (super-folds). The same observation was made experimentally and computationally using stochastic sampling and exhaustive enumeration of related models. (ii) There exist only a few large networks of connected sequences that are not restricted to one fold. These networks cover a significant fraction of fold spaces (super-networks). (iii) There exist barriers in sequence space that prevent foldable sequences of the same structure to "connect" through a series of single point mutations (super-barrier), even in the presence of the sequence connection between folds. While there is ample experimental evidence for the existence of super-folds, evidence for a super-network is just starting to emerge. The prediction of a sequence barrier is an intriguing characteristic of sequence space, suggesting that the overall sequence space may be disconnected. The implications and limitations of these observations for evolution of protein structures are discussed.
Collapse
Affiliation(s)
- Sean Burke
- Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin TX 78712
| | - Ron Elber
- Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin TX 78712
- Department of Chemistry and Biochemistry, University of Texas at Austin, Austin TX 78712
| |
Collapse
|
23
|
Szilágyi A, Zhang Y, Závodszky P. Intra-chain 3D segment swapping spawns the evolution of new multidomain protein architectures. J Mol Biol 2011; 415:221-35. [PMID: 22079367 DOI: 10.1016/j.jmb.2011.10.045] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2011] [Revised: 10/07/2011] [Accepted: 10/27/2011] [Indexed: 10/15/2022]
Abstract
Multidomain proteins form in evolution through the concatenation of domains, but structural domains may comprise multiple segments of the chain. In this work, we demonstrate that new multidomain architectures can evolve by an apparent three-dimensional swap of segments between structurally similar domains within a single-chain monomer. By a comprehensive structural search of the current Protein Data Bank (PDB), we identified 32 well-defined segment-swapped proteins (SSPs) belonging to 18 structural families. Nearly 13% of all multidomain proteins in the PDB may have a segment-swapped evolutionary precursor as estimated by more permissive searching criteria. The formation of SSPs can be explained by two principal evolutionary mechanisms: (i) domain swapping and fusion (DSF) and (ii) circular permutation (CP). By large-scale comparative analyses using structural alignment and hidden Markov model methods, it was found that the majority of SSPs have evolved via the DSF mechanism, and a much smaller fraction, via CP. Functional analyses further revealed that segment swapping, which results in two linkers connecting the domains, may impart directed flexibility to multidomain proteins and contributes to the development of new functions. Thus, inter-domain segment swapping represents a novel general mechanism by which new protein folds and multidomain architectures arise in evolution, and SSPs have structural and functional properties that make them worth defining as a separate group.
Collapse
Affiliation(s)
- András Szilágyi
- Institute of Enzymology, Hungarian Academy of Sciences, Karolina út 29, H-1113 Budapest, Hungary
| | | | | |
Collapse
|
24
|
Dai L, Zhou Y. Characterizing the existing and potential structural space of proteins by large-scale multiple loop permutations. J Mol Biol 2011; 408:585-95. [PMID: 21376059 PMCID: PMC3075335 DOI: 10.1016/j.jmb.2011.02.056] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2010] [Revised: 02/22/2011] [Accepted: 02/24/2011] [Indexed: 10/18/2022]
Abstract
Worldwide structural genomics projects are increasing structure coverage of sequence space but have not significantly expanded the protein structure space itself (i.e., number of unique structural folds) since 2007. Discovering new structural folds experimentally by directed evolution and random recombination of secondary-structure blocks is also proved rarely successful. Meanwhile, previous computational efforts for large-scale mapping of protein structure space are limited to simple model proteins and led to an inconclusive answer on the completeness of the existing observed protein structure space. Here, we build novel protein structures by extending naturally occurring circular (single-loop) permutation to multiple loop permutations (MLPs). These structures are clustered by structural similarity measure called TM-score. The computational technique allows us to produce different structural clusters on the same naturally occurring, packed, stable core but with alternatively connected secondary-structure segments. A large-scale MLP of 2936 domains from structural classification of protein domains reproduces those existing structural clusters (63%) mostly as hubs for many nonredundant sequences and illustrates newly discovered novel clusters as islands adopted by a few sequences only. Results further show that there exist a significant number of novel potentially stable clusters for medium-size or large-size single-domain proteins, in particular, >100 amino acid residues, that are either not yet adopted by nature or adopted only by a few sequences. This study suggests that MLP provides a simple yet highly effective tool for engineering and design of novel protein structures (including naturally knotted proteins). The implication of recovering new-fold targets from critical assessment of structure prediction techniques (CASP) by MLP on template-based structure prediction is also discussed. Our MLP structures are available for download at the publication page of the Web site http://sparks.informatics.iupui.edu.
Collapse
Affiliation(s)
- Liang Dai
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
| | - Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, Indiana 46202, USA
| |
Collapse
|
25
|
Grigoryan G, DeGrado WF. Probing designability via a generalized model of helical bundle geometry. J Mol Biol 2011; 405:1079-100. [PMID: 20932976 PMCID: PMC3052747 DOI: 10.1016/j.jmb.2010.08.058] [Citation(s) in RCA: 189] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2010] [Revised: 08/26/2010] [Accepted: 08/31/2010] [Indexed: 10/19/2022]
Abstract
Because the space of folded protein structures is highly degenerate, with recurring secondary and tertiary motifs, methods for representing protein structure in terms of collective physically relevant coordinates are of great interest. By collapsing structural diversity to a handful of parameters, such methods can be used to delineate the space of designable structures (i.e., conformations that can be stabilized with a large number of sequences)-a crucial task for de novo protein design. We first demonstrate this on natural α-helical coiled coils using the Crick parameterization. We show that over 95% of known coiled-coil structures are within 1-Å C(α) root mean square deviation of a Crick-ideal backbone. Derived parameters show that natural geometric space of coiled coils is highly restricted and can be represented by "allowed" conformations amidst a potential continuum of conformers. Allowed structures have (1) restricted axial offsets between helices, which differ starkly between parallel and anti-parallel structures; (2) preferred superhelical radii, which depend linearly on the oligomerization state; (3) pronounced radius-dependent a- and d-position amino acid propensities; and (4) discrete angles of rotation of helices about their axes, which are surprisingly independent of oligomerization state or orientation. In all, we estimate the space of designable coiled-coil structures to be reduced at least 160-fold relative to the space of geometrically feasible structures. To extend the benefits of structural parameterization to other systems, we developed a general mathematical framework for parameterizing arbitrary helical structures, which reduces to the Crick parameterization as a special case. The method is successfully validated on a set of non-coiled-coil helical bundles, frequent in channels and transporter proteins, which show significant helix bending but not supercoiling. Programs for coiled-coil parameter fitting and structure generation are provided via a web interface at http://www.gevorggrigoryan.com/cccp/, and code for generalized helical parameterization is available upon request.
Collapse
Affiliation(s)
- Gevorg Grigoryan
- Department of Biochemistry and Biophysics, University of Pennsylvania, School of Medicine, Philadelphia, PA, USA
| | - William F. DeGrado
- Department of Biochemistry and Biophysics, University of Pennsylvania, School of Medicine, Philadelphia, PA, USA
| |
Collapse
|
26
|
Shukla P. Thermodynamics of protein folding: a random matrix formulation. JOURNAL OF PHYSICS. CONDENSED MATTER : AN INSTITUTE OF PHYSICS JOURNAL 2010; 22:415106. [PMID: 21386596 DOI: 10.1088/0953-8984/22/41/415106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
The process of protein folding from an unfolded state to a biologically active, folded conformation is governed by many parameters, e.g. the sequence of amino acids, intermolecular interactions, the solvent, temperature and chaperon molecules. Our study, based on random matrix modeling of the interactions, shows, however, that the evolution of the statistical measures, e.g. Gibbs free energy, heat capacity, and entropy, is single parametric. The information can explain the selection of specific folding pathways from an infinite number of possible ways as well as other folding characteristics observed in computer simulation studies.
Collapse
Affiliation(s)
- Pragya Shukla
- Department of Physics, Indian Institute of Technology, Kharagpur, India
| |
Collapse
|
27
|
Bhattacherjee A, Biswas P. Combinatorial design of protein sequences with applications to lattice and real proteins. J Chem Phys 2009; 131:125101. [DOI: 10.1063/1.3236519] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
28
|
Liu X, Zhao YP. Donut-shaped fingerprint in homologous polypeptide relationships--a topological feature related to pathogenic structural changes in conformational disease. J Theor Biol 2009; 258:294-301. [PMID: 19248793 PMCID: PMC7094133 DOI: 10.1016/j.jtbi.2009.02.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2008] [Revised: 01/06/2009] [Accepted: 02/11/2009] [Indexed: 02/05/2023]
Abstract
Features of homologous relationship of proteins can provide us a general picture of protein universe, assist protein design and analysis, and further our comprehension of the evolution of organisms. Here we carried out a study of the evolution of protein molecules by investigating homologous relationships among residue segments. The motive was to identify detailed topological features of homologous relationships for short residue segments in the whole protein universe. Based on the data of a large number of non-redundant proteins, the universe of non-membrane polypeptide was analyzed by considering both residue mutations and structural conservation. By connecting homologous segments with edges, we obtained a homologous relationship network of the whole universe of short residue segments, which we named the graph of polypeptide relationships (GPR). Since the network is extremely complicated for topological transitions, to obtain an in-depth understanding, only subgraphs composed of vital nodes of the GPR were analyzed. Such analysis of vital subgraphs of the GPR revealed a donut-shaped fingerprint. Utilization of this topological feature revealed the switch sites (where the beginning of exposure of previously hidden "hot spots" of fibril-forming happens, in consequence a further opportunity for protein aggregation is provided; 188-202) of the conformational conversion of the normal alpha-helix-rich prion protein PrP(C) to the beta-sheet-rich PrP(Sc) that is thought to be responsible for a group of fatal neurodegenerative diseases, transmissible spongiform encephalopathies. Efforts in analyzing other proteins related to various conformational diseases are also introduced.
Collapse
Affiliation(s)
- Xin Liu
- Institute of Mechanics, Chinese Academy of Sciences, Beijing 100080, China
| | - Ya-Pu Zhao
- The State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics, Chinese Academy of Sciences. Beijing 100190, China
| |
Collapse
|
29
|
Zeldovich KB, Shakhnovich EI. Understanding protein evolution: from protein physics to Darwinian selection. Annu Rev Phys Chem 2008; 59:105-27. [PMID: 17937598 DOI: 10.1146/annurev.physchem.58.032806.104449] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Efforts in whole-genome sequencing and structural proteomics start to provide a global view of the protein universe, the set of existing protein structures and sequences. However, approaches based on the selection of individual sequences have not been entirely successful at the quantitative description of the distribution of structures and sequences in the protein universe because evolutionary pressure acts on the entire organism, rather than on a particular molecule. In parallel to this line of study, studies in population genetics and phenomenological molecular evolution established a mathematical framework to describe the changes in genome sequences in populations of organisms over time. Here, we review both microscopic (physics-based) and macroscopic (organism-level) models of protein-sequence evolution and demonstrate that bridging the two scales provides the most complete description of the protein universe starting from clearly defined, testable, and physiologically relevant assumptions.
Collapse
Affiliation(s)
- Konstantin B Zeldovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, USA.
| | | |
Collapse
|
30
|
Perskie LL, Street TO, Rose GD. Structures, basins, and energies: a deconstruction of the Protein Coil Library. Protein Sci 2008; 17:1151-61. [PMID: 18434497 DOI: 10.1110/ps.035055.108] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Abstract
Globular proteins adopt complex folds, composed of organized assemblies of alpha-helix and beta-sheet together with irregular regions that interconnect these scaffold elements. Here, we seek to parse the irregular regions into their structural constituents and to rationalize their formative energetics. Toward this end, we dissected the Protein Coil Library, a structural database of protein segments that are neither alpha-helix nor beta-strand, extracted from high-resolution protein structures. The backbone dihedral angles of residues from coil library segments are distributed indiscriminately across the phi,psi map, but when contoured, seven distinct basins emerge clearly. The structures and energetics associated with the two least-studied basins are the primary focus of this article. Specifically, the structural motifs associated with these basins were characterized in detail and then assessed in simple simulations designed to capture their energetic determinants. It is found that conformational constraints imposed by excluded volume and hydrogen bonding are sufficient to reproduce the observed ,psi distributions of these motifs; no additional energy terms are required. These three motifs in conjunction with alpha-helices, strands of beta-sheet, canonical beta-turns, and polyproline II conformers comprise approximately 90% of all protein structure.
Collapse
Affiliation(s)
- Lauren L Perskie
- TC Jenkins Department of Biophysics, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | | | | |
Collapse
|
31
|
Goldstein RA. The structure of protein evolution and the evolution of protein structure. Curr Opin Struct Biol 2008; 18:170-7. [PMID: 18328690 DOI: 10.1016/j.sbi.2008.01.006] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2007] [Revised: 12/20/2007] [Accepted: 01/09/2008] [Indexed: 11/29/2022]
Abstract
The observed distribution of protein structures can give us important clues about the underlying evolutionary process, imposing important constraints on possible models. The availability of results from an increasing number of genome projects has made the development of these models an active area of research. Models explaining the observed distribution of structures have focused on the inherent functional capabilities and structural properties of different folds and on the evolutionary dynamics. Increasingly, these elements are being combined.
Collapse
Affiliation(s)
- Richard A Goldstein
- Division of Mathematical Biology, National Institute for Medical Research, Mill Hill, London NW7 1AA, UK.
| |
Collapse
|
32
|
Stayton CT. Is convergence surprising? An examination of the frequency of convergence in simulated datasets. J Theor Biol 2008; 252:1-14. [PMID: 18321532 DOI: 10.1016/j.jtbi.2008.01.008] [Citation(s) in RCA: 83] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2007] [Revised: 12/19/2007] [Accepted: 01/11/2008] [Indexed: 11/29/2022]
Abstract
Although convergence is recognized as a central concept in evolutionary biology, very few tools are available for the quantitative study of this phenomenon. Moreover, although many evolutionary assertions assume that convergence should be rare in the absence of influences on organismal phenotypes such as natural selection or constraint, no studies have tested whether this is the case. I simulate random evolution (Brownian motion model) of quantitative characters along phylogenies with varying numbers of terminal taxa, numbers of traits, variance structure, and tree balance, and quantify the amount of convergence observed in these datasets using four metrics. The amount of convergence observed in a dataset increases with increasing number of taxa and decreasing number of traits, approaching the maximum possible amount of convergence under certain circumstances. Some convergence is expected in almost all datasets. Comparison of empirical datasets to those produced by random evolution provides a test of whether empirical datasets actually show elevated levels of convergence. Out of three test datasets, two show more convergence than expected. Given that high levels of convergence can be produced simply by random evolution, no explanation may be necessary for instances of convergence discovered in an evolutionary investigation.
Collapse
Affiliation(s)
- C Tristan Stayton
- Department of Biology, Bucknell University, Lewisburg, PA 17837, USA.
| |
Collapse
|
33
|
Amatori A, Tiana G, Ferkinghoff-Borg J, Broglia RA. Denatured state is critical in determining the properties of model proteins designed on different folds. Proteins 2008; 70:1047-55. [PMID: 17847099 DOI: 10.1002/prot.21599] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
The thermodynamics of proteins designed on three common folds (SH3, chymotrypsin inhibitor 2 [CI2], and protein G) is studied with a simplified C(alpha) model and compared with the thermodynamics of proteins designed on random-generated folds. The model allows to design sequences to fold within a dRMSD ranging from 1.2 to 4.2 A from the crystallographic native conformation and to study properties that are hard to be measured experimentally. It is found that the denatured state of all of them is not random but is, to different extents, partially structured. The degree of structure is more abundant for SH3 and protein G, giving rise to a weaker stability but a more efficient folding kinetics than CI2 and, even more, than the random-generated folds. Consequently, the features of the unfolded state seem to be as important in the determination of the thermodynamic properties of these proteins as the features of the native state.
Collapse
Affiliation(s)
- A Amatori
- Department of Physics, University of Milano and INFN, 20133 Milano, Italy
| | | | | | | |
Collapse
|
34
|
A historical perspective of template-based protein structure prediction. METHODS IN MOLECULAR BIOLOGY (CLIFTON, N.J.) 2008; 413:3-42. [PMID: 18075160 DOI: 10.1007/978-1-59745-574-9_1] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
This chapter presents a broad and a historical overview of the problem of protein structure prediction. Different structure prediction methods, including homology modeling, fold recognition (FR)/protein threading, ab initio/de novo approaches, and hybrid techniques involving multiple types of approaches, are introduced in a historical context. The progress of the field as a whole, especially in the threading/FR area, as reflected by the CASP/CAFASP contests, is reviewed. At the end of the chapter, we discuss the challenging issues ahead in the field of protein structure prediction.
Collapse
|
35
|
Meyerguz L, Kleinberg J, Elber R. The network of sequence flow between protein structures. Proc Natl Acad Sci U S A 2007; 104:11627-32. [PMID: 17596339 PMCID: PMC1913895 DOI: 10.1073/pnas.0701393104] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2007] [Indexed: 12/24/2022] Open
Abstract
Sequence-structure relationships in proteins are highly asymmetric because many sequences fold into relatively few structures. What is the number of sequences that fold into a particular protein structure? Is it possible to switch between stable protein folds by point mutations? To address these questions, we compute a directed graph of sequences and structures of proteins, which is based on 2,060 experimentally determined protein shapes from the Protein Data Bank. The directed graph is highly connected at native energies with "sinks" that attract many sequences from other folds. The sinks are rich in beta-sheets. The number of sequences that transition between folds is significantly smaller than the number of sequences retained by their fold. The sequence flow into a particular protein shape from other proteins correlates with the number of sequences that matches this shape in empirically determined genomes. Properties of strongly connected components of the graph are correlated with protein length and secondary structure.
Collapse
Affiliation(s)
- Leonid Meyerguz
- Department of Computer Science, Cornell University, Ithaca, NY 14853
| | - Jon Kleinberg
- Department of Computer Science, Cornell University, Ithaca, NY 14853
| | - Ron Elber
- Department of Computer Science, Cornell University, Ithaca, NY 14853
| |
Collapse
|
36
|
Yang JY, Yu ZG, Anh V. Correlations between designability and various structural characteristics of protein lattice models. J Chem Phys 2007; 126:195101. [PMID: 17523837 DOI: 10.1063/1.2737042] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Using six kinds of lattice types (4 x 4, 5 x 5, and 6 x 6 square lattices; 3 x 3 x 3 cubic lattice; and 2+3+4+3+2 and 4+5+6+5+4 triangular lattices), three different size alphabets (HP, HNUP, and 20 letters), and two energy functions, the designability of protein structures is calculated based on random samplings of structures and common biased sampling (CBS) of protein sequence space. Then three quantities stability (average energy gap), foldability, and partnum of the structure, which are defined to elucidate the designability, are calculated. The authors find that whatever the type of lattice, alphabet size, and energy function used, there will be an emergence of highly designable (preferred) structure. For all cases considered, the local interactions reduce degeneracy and make the designability higher. The designability is sensitive to the lattice type, alphabet size, energy function, and sampling method of the sequence space. Compared with the random sampling method, both the CBS and the Metropolis Monte Carlo sampling methods make the designability higher. The correlation coefficients between the designability, stability, and foldability are mostly larger than 0.5, which demonstrate that they have strong correlation relationship. But the correlation relationship between the designability and the partnum is not so strong because the partnum is independent of the energy. The results are useful in practical use of the designability principle, such as to predict the protein tertiary structure.
Collapse
Affiliation(s)
- Jian-Yi Yang
- School of Mathematics and Computing Science, Xiangtan University, Hunan 411105, China
| | | | | |
Collapse
|
37
|
Liu T, Whitten ST, Hilser VJ. Functional residues serve a dominant role in mediating the cooperativity of the protein ensemble. Proc Natl Acad Sci U S A 2007; 104:4347-52. [PMID: 17360527 PMCID: PMC1838605 DOI: 10.1073/pnas.0607132104] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2006] [Indexed: 11/18/2022] Open
Abstract
Conformational fluctuations in proteins have emerged as a potentially important aspect of biological function, although the precise relationship and the implications have yet to be fully explored. Numerous studies have reported that the binding of ligand can influence fluctuations. However, the role of the binding site in mediating these fluctuations is not known. Of particular interest is whether in addition to serving as structural scaffolds for recognition and catalysis, active-site residues may also play a role in modulating the cooperative network. To address this question, we employ an experimentally validated ensemble-based description of proteins to elucidate the extent to which perturbations at different sites can influence the cooperative network in the protein. Applying this method to a database of test proteins, it is found statistically that binding sites are located in regions most able to affect the cooperative network, even for cooperative interactions between residues distant to the binding sites. This indicates that the conformational manifold under native conditions is determined by the network of cooperative interactions within the protein and suggests that proteins have evolved to use these conformational fluctuations in carrying out their functions. Furthermore, because the energetic coupling pattern calculated for each protein is robust and relatively insensitive to sequence, these studies further suggest that binding sites evolved in regions of the protein that are inherently poised to take advantage of the fluctuations in the native structure.
Collapse
Affiliation(s)
- Tong Liu
- Department of Biochemistry and Molecular Biology, and Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, TX 77555-1068
| | - Steven T. Whitten
- Department of Biochemistry and Molecular Biology, and Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, TX 77555-1068
| | - Vincent J. Hilser
- Department of Biochemistry and Molecular Biology, and Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, TX 77555-1068
| |
Collapse
|
38
|
Berezovsky IN, Zeldovich KB, Shakhnovich EI. Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput Biol 2007; 3:e52. [PMID: 17381236 PMCID: PMC1829478 DOI: 10.1371/journal.pcbi.0030052] [Citation(s) in RCA: 95] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2006] [Accepted: 01/31/2007] [Indexed: 11/18/2022] Open
Abstract
The aim of this work is to elucidate how physical principles of protein design are reflected in natural sequences that evolved in response to the thermal conditions of the environment. Using an exactly solvable lattice model, we design sequences with selected thermal properties. Compositional analysis of designed model sequences and natural proteomes reveals a specific trend in amino acid compositions in response to the requirement of stability at elevated environmental temperature: the increase of fractions of hydrophobic and charged amino acid residues at the expense of polar ones. We show that this “from both ends of the hydrophobicity scale” trend is due to positive (to stabilize the native state) and negative (to destabilize misfolded states) components of protein design. Negative design strengthens specific repulsive non-native interactions that appear in misfolded structures. A pressure to preserve specific repulsive interactions in non-native conformations may result in correlated mutations between amino acids that are far apart in the native state but may be in contact in misfolded conformations. Such correlated mutations are indeed found in TIM barrel and other proteins. What mechanisms does Nature use in her quest for thermophilic proteins? It is known that stability of a protein is mainly determined by the energy gap, or the difference in energy, between native state and a set of incorrectly folded (misfolded) conformations. Here we show that Nature makes thermophilic proteins by widening this gap from both ends. The energy of the native state of a protein is decreased by selecting strongly attractive amino acids at positions that are in contact in the native state (positive design). Simultaneously, energies of the misfolded conformations are increased by selection of strongly repulsive amino acids at positions that are distant in native structure; however, these amino acids will interact repulsively in the misfolded conformations (negative design). These fundamental principles of protein design are manifested in the “from both ends of the hydrophobicity scale” trend observed in thermophilic adaptation, whereby proteomes of thermophilic proteins are enriched in extreme amino acids—hydrophobic and charged—at the expense of polar ones. Hydrophobic amino acids contribute mostly to the positive design, while charged amino acids that repel each other in non-native conformations of proteins contribute to negative design. Our results provide guidance in rational design of proteins with selected thermal properties.
Collapse
Affiliation(s)
- Igor N Berezovsky
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
| | - Konstantin B Zeldovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
| | - Eugene I Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts, United States of America
- * To whom correspondence should be addressed. E-mail:
| |
Collapse
|
39
|
Williams PD, Pollock DD, Goldstein RA. Functionality and the evolution of marginal stability in proteins: inferences from lattice simulations. Evol Bioinform Online 2007; 2:91-101. [PMID: 19455204 PMCID: PMC2674661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/29/2022] Open
Abstract
It has been known for some time that many proteins are marginally stable. This has inspired several explanations. Having noted that the functionality of many enzymes is correlated with subunit motion, flexibility, or general disorder, some have suggested that marginally stable proteins should have an evolutionary advantage over proteins of differing stability. Others have suggested that stability and functionality are contradictory qualities, and that selection for both criteria results in marginally stable proteins, optimised to satisfy the competing design pressures. While these explanations are plausible, recent research simulating the evolution of model proteins has shown that selection for stability, ignoring any aspects of functionality, can result in marginally stable proteins because of the underlying makeup of protein sequence-space. We extend this research by simulating the evolution of proteins, using a computational protein model that equates functionality with binding and catalysis. In the model, marginal stability is not required for ligand-binding functionality and we observe no competing design pressures. The resulting proteins are marginally stable, again demonstrating that neutral evolution is sufficient for explaining marginal stability in observed proteins.
Collapse
Affiliation(s)
- Paul D. Williams
- Department of Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
| | - David D. Pollock
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
| | - Richard A. Goldstein
- Mathematical Biology, National Institute for Medical Sciences, The Ridgeway, Mill Hill, London MW7 1AA, UK
| |
Collapse
|
40
|
Knott M, Chan HS. Criteria for downhill protein folding: Calorimetry, chevron plot, kinetic relaxation, and single-molecule radius of gyration in chain models with subdued degrees of cooperativity. Proteins 2006; 65:373-91. [PMID: 16909416 DOI: 10.1002/prot.21066] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Recent investigations of possible downhill folding of small proteins such as BBL have focused on the thermodynamics of non-two-state, "barrierless" folding/denaturation transitions. Downhill folding is noncooperative and thermodynamically "one-state," a phenomenon underpinned by a unimodal conformational distribution over chain properties such as enthalpy, hydrophobic exposure, and conformational dimension. In contrast, corresponding distributions for cooperative two-state folding are bimodal with well-separated population peaks. Using simplified atomic modeling of a three-helix bundle-in a scheme that accounts for hydrophobic interactions and hydrogen bonding-and coarse-grained C(alpha) models of four real proteins with various degrees of cooperativity, we evaluate the effectiveness of several observables at defining the underlying distribution. Bimodal distributions generally lead to sharper transitions, with a higher heat capacity peak at the transition midpoint, compared with unimodal distributions. However, the observation of a sigmoidal transition is not a reliable criterion for two-state behavior, and the heat capacity baselines, used to determine the van't Hoff and calorimetric enthalpies of the transition, can introduce ambiguity. Interestingly we find that, if the distribution of the single-molecule radius of gyration were available, it would permit discrimination between unimodal and bimodal underlying distributions. We investigate kinetic implications of thermodynamic noncooperativity using Langevin dynamics. Despite substantial chevron rollovers, the relaxation of the models considered is essentially single-exponential over an extended range of native stabilities. Consistent with experiments, significant deviations from single-exponential behavior occur only under strongly folding conditions.
Collapse
Affiliation(s)
- Michael Knott
- Department of Biochemistry, and of Medical Genetics and Microbiology, Protein Engineering Network of Centres of Excellence, Faculty of Medicine, University of Toronto, Toronto, Ontario M5S 1A8, Canada
| | | |
Collapse
|
41
|
Bloom JD, Drummond DA, Arnold FH, Wilke CO. Structural determinants of the rate of protein evolution in yeast. Mol Biol Evol 2006; 23:1751-61. [PMID: 16782762 DOI: 10.1093/molbev/msl040] [Citation(s) in RCA: 132] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We investigate how a protein's structure influences the rate at which its sequence evolves. Our basic hypothesis is that proteins with highly designable structures (structures that are encoded by many sequences) will evolve more rapidly. Recent theoretical advances argue that structures with a higher density of interresidue contacts are more designable, and we show that high contact density is correlated with an increased rate of sequence evolution in yeast. In addition, we investigate the correlations between the rate of sequence evolution and several other structural descriptors, carefully controlling for the strong effect of expression level on evolutionary rate. Overall, we find that the structural descriptors that we consider appear to explain roughly 10% of the variation in rates of protein evolution in yeast. We also show that despite the well-known trend for buried residues to be more conserved, proteins with a higher fraction of buried residues, nonetheless, tend to evolve their sequences more rapidly. We suggest that this effect is due to the increased designability of structures with more buried residues. Our results provide evidence that protein structure plays an important role in shaping the rate of sequence evolution and provide evidence to support recent theoretical advances linking structural designability to contact density.
Collapse
Affiliation(s)
- Jesse D Bloom
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA
| | | | | | | |
Collapse
|
42
|
Ding F, Dokholyan NV. Emergence of protein fold families through rational design. PLoS Comput Biol 2006; 2:e85. [PMID: 16839198 PMCID: PMC1487181 DOI: 10.1371/journal.pcbi.0020085] [Citation(s) in RCA: 163] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2006] [Accepted: 05/26/2006] [Indexed: 11/26/2022] Open
Abstract
Diverse proteins with similar structures are grouped into families of homologs and analogs, if their sequence similarity is higher or lower, respectively, than 20%–30%. It was suggested that protein homologs and analogs originate from a common ancestor and diverge in their distinct evolutionary time scales, emerging as a consequence of the physical properties of the protein sequence space. Although a number of studies have determined key signatures of protein family organization, the sequence-structure factors that differentiate the two evolution-related protein families remain unknown. Here, we stipulate that subtle structural changes, which appear due to accumulating mutations in the homologous families, lead to distinct packing of the protein core and, thus, novel compositions of core residues. The latter process leads to the formation of distinct families of homologs. We propose that such differentiation results in the formation of analogous families. To test our postulate, we developed a molecular modeling and design toolkit, Medusa, to computationally design protein sequences that correspond to the same fold family. We find that analogous proteins emerge when a backbone structure deviates only 1–2 Å root-mean-square deviation from the original structure. For close homologs, core residues are highly conserved. However, when the overall sequence similarity drops to ~25%–30%, the composition of core residues starts to diverge, thereby forming novel families of protein homologs. This direct observation of the formation of protein homologs within a specific fold family supports our hypothesis. The conservation of amino acids in designed sequences recapitulates that of the naturally occurring sequences, thereby validating our computational design methodology. Studies of known proteins have revealed intriguing co-organization of their sequences and structures. Proteins with sequence similarity higher than 25%–30% usually adopt a similar structure and are called homologs, whereas those with low sequence similarity (<20%) can share the same structure and are referred as analogs. The origin of such co-organization has been a topic of extensive discussions among protein folding, design, and evolution research communities, because understanding of the emergence of homologs and analogs in the protein universe has broad implications for our ability to rationally manipulate proteins. In this study, the authors developed a molecular modeling and design method, Medusa, to computationally design diversified protein sequences that correspond to similar backbone structures, which determine a protein fold family. Using Medusa, the authors directly demonstrated the formation of distinct protein homologs within a specific fold family when the structure deviates only 1–2 Å away from the original structure. The study suggests that subtle structural changes, which appear due to accumulating mutations in the families of homologs, lead to a distinct packing of the protein core and, thus, novel compositions of core residues. The latter process leads to the formation of distinct families of homologs.
Collapse
Affiliation(s)
- Feng Ding
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
| | - Nikolay V Dokholyan
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- * To whom correspondence should be addressed. E-mail:
| |
Collapse
|
43
|
Shakhnovich E. Protein folding thermodynamics and dynamics: where physics, chemistry, and biology meet. Chem Rev 2006; 106:1559-88. [PMID: 16683745 PMCID: PMC2735084 DOI: 10.1021/cr040425u] [Citation(s) in RCA: 258] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Eugene Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, USA.
| |
Collapse
|
44
|
Marsh L. Evolution of Structural Shape in Bacterial Globin-Related Proteins. J Mol Evol 2006; 62:575-87. [PMID: 16612536 DOI: 10.1007/s00239-005-0025-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2005] [Accepted: 12/31/2005] [Indexed: 10/24/2022]
Abstract
The globin family of proteins has a characteristic structural pattern of helix interactions that nonetheless exhibits some variation. A simplified model for globin structural evolution was developed in which protein shape evolved by random change of contacts between helices. A conserved globin domain of 15 bacterial proteins representing four structural families was studied. Using a parsimony approach ancestral structural states could be reconstructed. The distribution of number of contact changes per site for a fixed topology tree fit a gamma distribution. Homoplasy was high, with multiple changes per site and no support for an invariant class of residue-residue contacts. Contacts changed more slowly than sequence. A phylogenetic reconstruction using a distance measure based on the proportion of shared contacts was generally consistent with a sequence-based phylogeny but not highly resolved. Contact pattern convergence between members of different globin family proteins could not be detected. Simulation studies indicated the convergence test was sensitive enough to have detected convergence involving only 10% of the contacts, suggesting a limit on the extent of selection for a specific contact pattern. Contact site methods may provide additional approaches to study the relationship between protein structure and sequence evolution.
Collapse
Affiliation(s)
- Lorraine Marsh
- Department of Biology, Long Island University, 1 University Plaza, Brooklyn, NY 11201, USA.
| |
Collapse
|
45
|
Zeldovich KB, Berezovsky IN, Shakhnovich EI. Physical origins of protein superfamilies. J Mol Biol 2006; 357:1335-43. [PMID: 16483605 DOI: 10.1016/j.jmb.2006.01.081] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2005] [Accepted: 01/23/2006] [Indexed: 10/25/2022]
Abstract
In this work, we discovered a fundamental connection between selection for protein stability and emergence of preferred structures of proteins. Using a standard exact three-dimensional lattice model we evolve sequences starting from random ones and determine the exact native structure after each mutation. Acceptance of mutations is biased to select for stable proteins. We found that certain structures, "wonderfolds", are independently discovered numerous times as native states of stable proteins in many unrelated runs of selection. The strong dependence of lattice fold usage on the structural determinant of designability quantitatively reproduces uneven fold usage in natural proteins. Diversity of sequences that fold into wonderfold structures gives rise to superfamilies, i.e. sets of dissimilar sequences that fold into the same or very similar structures. The present work establishes a model of pre-biotic structure selection, which identifies dominant structural patterns emerging upon optimization of proteins for survival in a hot environment. Convergently discovered pre-biotic initial superfamilies with wonderfold structures could have served as a seed for subsequent biological evolution involving gene duplications and divergence.
Collapse
Affiliation(s)
- Konstantin B Zeldovich
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, MA 02138, USA
| | | | | |
Collapse
|
46
|
Dokholyan NV. The architecture of the protein domain universe. Gene 2005; 347:199-206. [PMID: 15777630 DOI: 10.1016/j.gene.2004.12.020] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2004] [Revised: 11/08/2004] [Accepted: 12/16/2004] [Indexed: 11/23/2022]
Abstract
Understanding the design of the universe of protein structures may provide insights into protein evolution. We study the architecture of the protein domain universe, which has been found to poses peculiar scale-free properties. We examine the origin of these scale-free properties of the graph of protein domain structures (PDUG) and determine that that the PDUG is not modular, i.e. it does not consist of modules with uniform properties. Instead, we find the PDUG to be self-similar at all scales. We further characterize the PDUG architecture by studying the properties of the hub nodes that are responsible for the scale-free connectivity of the PDUG. We introduce a measure of the betweenness centrality of protein domains in the PDUG and find a power-law distribution of the betweenness centrality values. The scale-free distribution of hubs in the protein universe suggests that a set of specific statistical mechanics models, such as the self-organized criticality model, can potentially identify the principal driving forces of protein evolution. We also find a gatekeeper protein domain, removal of which partitions the largest cluster into two large sub-clusters. We suggest that the loss of such gatekeeper protein domains in the course of evolution is responsible for the creation of new fold families.
Collapse
Affiliation(s)
- Nikolay V Dokholyan
- Department of Biochemistry and Biophysics, The University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC 27599, USA.
| |
Collapse
|
47
|
Shakhnovich BE, Deeds E, Delisi C, Shakhnovich E. Protein structure and evolutionary history determine sequence space topology. Genome Res 2005; 15:385-92. [PMID: 15741509 PMCID: PMC551565 DOI: 10.1101/gr.3133605] [Citation(s) in RCA: 75] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2004] [Accepted: 11/23/2004] [Indexed: 11/24/2022]
Abstract
Understanding the observed variability in the number of homologs of a gene is a very important unsolved problem that has broad implications for research into coevolution of structure and function, gene duplication, pseudogene formation, and possibly for emerging diseases. Here, we attempt to define and elucidate some possible causes behind the observed irregularity in sequence space. We present evidence that sequence variability and functional diversity of a gene or fold family is influenced by quantifiable characteristics of the protein structure. These characteristics reflect the structural potential for sequence plasticity, i.e., the ability to accept mutation without losing thermodynamic stability. We identify a structural feature of a protein domain-contact density-that serves as a determinant of entropy in sequence space, i.e., the ability of a protein to accept mutations without destroying the fold (also known as fold designability). We show that (log) of average gene family size exhibits statistical correlation (R(2) > 0.9.) with contact density of its three-dimensional structure. We present evidence that the size of individual gene families are influenced not only by the designability of the structure, but also by evolutionary history, e.g., the amount of time the gene family was in existence. We further show that our observed statistical correlation between gene family size and contact density of the structure is valid on many levels of evolutionary divergence, i.e., not only for closely related sequence, but also for less-related fold and superfamily levels of homology.
Collapse
|
48
|
Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH. Thermodynamic prediction of protein neutrality. Proc Natl Acad Sci U S A 2005; 102:606-11. [PMID: 15644440 PMCID: PMC545518 DOI: 10.1073/pnas.0406744102] [Citation(s) in RCA: 270] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
We present a simple theory that uses thermodynamic parameters to predict the probability that a protein retains the wild-type structure after one or more random amino acid substitutions. Our theory predicts that for large numbers of substitutions the probability that a protein retains its structure will decline exponentially with the number of substitutions, with the severity of this decline determined by properties of the structure. Our theory also predicts that a protein can gain extra robustness to the first few substitutions by increasing its thermodynamic stability. We validate our theory with simulations on lattice protein models and by showing that it quantitatively predicts previously published experimental measurements on subtilisin and our own measurements on variants of TEM1 beta-lactamase. Our work unifies observations about the clustering of functional proteins in sequence space, and provides a basis for interpreting the response of proteins to substitutions in protein engineering applications.
Collapse
Affiliation(s)
- Jesse D Bloom
- Division of Chemistry and Chemical Engineering 210-41, California Institute of Technology, Pasadena, CA 91125, USA.
| | | | | | | | | | | |
Collapse
|
49
|
Wroe R, Bornberg-Bauer E, Chan HS. Comparing folding codes in simple heteropolymer models of protein evolutionary landscape: robustness of the superfunnel paradigm. Biophys J 2005; 88:118-31. [PMID: 15501948 PMCID: PMC1304991 DOI: 10.1529/biophysj.104.050369] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2004] [Accepted: 10/13/2004] [Indexed: 11/18/2022] Open
Abstract
Understanding the evolution of biopolymers is a key element in rationalizing their structures and functions. Simple exact models (SEMs) are well-positioned to address general principles of evolution as they permit the exhaustive enumeration of both sequence and structure (conformational) spaces. The physics-based models of the complete mapping between genotypes and phenotypes afforded by SEMs have proven valuable for gaining insight into how adaptation and selection operate among large collections of sequences and structures. This study compares the properties of evolutionary landscapes of a variety of SEMs to delineate robust predictions and possible model-specific artifacts. Among the models studied, the ruggedness of evolutionary landscape is significantly model-dependent; those derived from more protein-like models appear to be smoother. We found that a common practice of restricting protein structure space to maximally compact lattice conformations results in (i.e., "designs in") many encodable (designable) structures that are not otherwise encodable in the corresponding unrestrained structure space. This discrepancy is especially severe for model potentials that seek to mimic the major role of hydrophobic interactions in protein folding. In general, restricting conformations to be maximally compact leads to larger changes in the model genotype-phenotype mapping than a moderate shifting of reference state energy of the model potential function to allow for more specific encoding via the "designing out" effects of repulsive interactions. Despite these variations, the superfunnel paradigm applies to all SEMs we have tested: For a majority of neutral nets across different models, there exists a funnel-like organization of native stabilities for the sequences in a neutral net encoding for the same structure, and the thermodynamically most stable sequence is also the most robust against mutation.
Collapse
Affiliation(s)
- Richard Wroe
- Faculty of Life Sciences, University of Manchester, United Kingdom; Bioinformatics Division, School of Biological Sciences, University of Münster, Münster, Germany; and Protein Engineering Network of Centres of Excellence, Department of Biochemistry, and Department of Medical Genetics and Microbiology, Faculty of Medicine, University of Toronto, Ontario, Canada
| | - Erich Bornberg-Bauer
- Faculty of Life Sciences, University of Manchester, United Kingdom; Bioinformatics Division, School of Biological Sciences, University of Münster, Münster, Germany; and Protein Engineering Network of Centres of Excellence, Department of Biochemistry, and Department of Medical Genetics and Microbiology, Faculty of Medicine, University of Toronto, Ontario, Canada
| | - Hue Sun Chan
- Faculty of Life Sciences, University of Manchester, United Kingdom; Bioinformatics Division, School of Biological Sciences, University of Münster, Münster, Germany; and Protein Engineering Network of Centres of Excellence, Department of Biochemistry, and Department of Medical Genetics and Microbiology, Faculty of Medicine, University of Toronto, Ontario, Canada
| |
Collapse
|
50
|
Abstract
How well can the current set of known protein families and folds be used to estimate the total number of folds in nature, and will structural genomics initiatives yield representatives for all the major protein families within a reasonable time scale? Although the precise aims differ between the various international structural genomics initiatives currently aiming to illuminate the universe of protein folds, many selectively target protein families for which the fold is unknown. How well can the current set of known protein families and folds be used to estimate the total number of folds in nature, and will structural genomics initiatives yield representatives for all the major protein families within a reasonable time scale?
Collapse
Affiliation(s)
- Alastair Grant
- Department of Biochemistry and Molecular Biology, University College, Gower Street, London WC1E 6BT, UK.
| | | | | |
Collapse
|