1
Currin A, Swainston N, Day PJ, Kell DB. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev 2015; 44:1172-239. [PMID: 25503938 PMCID: PMC4349129 DOI: 10.1039/c4cs00351a] [Citation(s) in RCA: 258] [Impact Index Per Article: 25.8] [Received: 10/20/2014] [Indexed: 12/21/2022]
Abstract
The amino acid sequence of a protein affects both its structure and its function. Thus, the ability to modify the sequence, and hence the structure and activity, of individual proteins in a systematic way, opens up many opportunities, both scientifically and (as we focus on here) for exploitation in biocatalysis. Modern methods of synthetic biology, whereby increasingly large sequences of DNA can be synthesised de novo, allow an unprecedented ability to engineer proteins with novel functions. However, the number of possible proteins is far too large to test individually, so we need means for navigating the 'search space' of possible protein sequences efficiently and reliably in order to find desirable activities and other properties. Enzymologists distinguish binding (Kd) and catalytic (kcat) steps. In a similar way, judicious strategies have blended design (for binding, specificity and active site modelling) with the more empirical methods of classical directed evolution (DE) for improving kcat (where natural evolution rarely seeks the highest values), especially with regard to residues distant from the active site and where the functional linkages underpinning enzyme dynamics are both unknown and hard to predict. Epistasis (where the 'best' amino acid at one site depends on that or those at others) is a notable feature of directed evolution. The aim of this review is to highlight some of the approaches that are being developed to allow us to use directed evolution to improve enzyme properties, often dramatically. We note that directed evolution differs in a number of ways from natural evolution, including in particular the available mechanisms and the likely selection pressures. Thus, we stress the opportunities afforded by techniques that enable one to map sequence to (structure and) activity in silico, as an effective means of modelling and exploring protein landscapes. 
Because known landscapes may be assessed and reasoned about as a whole, simultaneously, this offers opportunities for protein improvement not readily available to natural evolution on rapid timescales. Intelligent landscape navigation, informed by sequence-activity relationships and coupled to the emerging methods of synthetic biology, offers scope for the development of novel biocatalysts that are both highly active and robust.
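The claim that the number of possible proteins is far too large to test individually can be made concrete with a little arithmetic. The sketch below assumes the 20 canonical amino acids and a hypothetical protein length of 100 residues; neither figure is taken from the review itself.

```python
from math import comb

AMINO_ACIDS = 20  # assumption: canonical amino acids only


def sequence_space(length):
    """Number of distinct sequences of the given length."""
    return AMINO_ACIDS ** length


def k_point_variants(length, k):
    """Number of variants differing from one parent at exactly k positions."""
    return comb(length, k) * (AMINO_ACIDS - 1) ** k


n = 100  # hypothetical protein length, chosen for illustration
exponent = len(str(sequence_space(n))) - 1
print(f"all sequences of length {n}: on the order of 10^{exponent}")
print(f"single mutants of one parent: {k_point_variants(n, 1)}")
print(f"double mutants of one parent: {k_point_variants(n, 2)}")
```

Even the double-mutant neighbourhood of a single parent already runs to millions of variants, which is why a model that maps sequence to activity is attractive for deciding what to build.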
Affiliation(s)
- Andrew Currin
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Neil Swainston
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- School of Computer Science, The University of Manchester, Manchester M13 9PL, UK
- Philip J. Day
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
- Faculty of Medical and Human Sciences, The University of Manchester, Manchester M13 9PT, UK
- Douglas B. Kell
- Manchester Institute of Biotechnology, The University of Manchester, 131, Princess St, Manchester M1 7DN, UK; http://dbkgroup.org/; @dbkell; Tel: +44 (0)161 306 4492
- School of Chemistry, The University of Manchester, Manchester M13 9PL, UK
- Centre for Synthetic Biology of Fine and Speciality Chemicals (SYNBIOCHEM), The University of Manchester, 131, Princess St, Manchester M1 7DN, UK
2
Abstract
BACKGROUND DNA shuffling generates combinatorial libraries of chimeric genes by stochastically recombining parent genes. The resulting libraries are subjected to large-scale genetic selection or screening to identify those chimeras with favorable properties (e.g., enhanced stability or enzymatic activity). While DNA shuffling has been applied quite successfully, it is limited by its homology-dependent, stochastic nature. Consequently, it is used only with parents of sufficient overall sequence identity, and provides no control over the resulting chimeric library. RESULTS This paper presents efficient methods to extend the scope of DNA shuffling to handle significantly more diverse parents and to generate more predictable, optimized libraries. Our CODNS (cross-over optimization for DNA shuffling) approach employs polynomial-time dynamic programming algorithms to select codons for the parental amino acids, allowing for zero or a fixed number of conservative substitutions. We first present efficient algorithms to optimize the local sequence identity or the nearest-neighbor approximation of the change in free energy upon annealing, objectives that were previously optimized by computationally-expensive integer programming methods. We then present efficient algorithms for more powerful objectives that seek to localize and enhance the frequency of recombination by producing "runs" of common nucleotides either overall or according to the sequence diversity of the resulting chimeras. We demonstrate the effectiveness of CODNS in choosing codons and allocating substitutions to promote recombination between parents targeted in earlier studies: two GAR transformylases (41% amino acid sequence identity), two very distantly related DNA polymerases, Pol X and β (15%), and beta-lactamases of varying identity (26-47%). 
CONCLUSIONS Our methods provide the protein engineer with a new approach to DNA shuffling that supports substantially more diverse parents, is more deterministic, and generates more predictable and more diverse chimeric libraries.
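The simplest objective named in the abstract, local sequence identity, can be illustrated with a short sketch. This is not the CODNS implementation: it assumes two already-aligned, gap-free parents of equal length, allows no amino acid substitutions, and scores each aligned position independently, so no dynamic programming is needed. The run-based objectives described above couple neighbouring codons and are where dynamic programming becomes necessary. All function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered with the first base varying slowest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to its synonymous codons.
CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(bases))


def best_codon_pair(aa1, aa2):
    """Return (matches, codon1, codon2) with the most identical nucleotides."""
    return max(
        (sum(x == y for x, y in zip(c1, c2)), c1, c2)
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


def choose_codons(parent1, parent2):
    """Choose codons for two aligned parents to maximize nucleotide identity."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must be aligned to the same length")
    picks = [best_codon_pair(a, b) for a, b in zip(parent1, parent2)]
    dna1 = "".join(p[1] for p in picks)
    dna2 = "".join(p[2] for p in picks)
    return dna1, dna2, sum(p[0] for p in picks)


dna1, dna2, identical = choose_codons("MFS", "MLR")
print(dna1, dna2, f"{identical}/{len(dna1)} identical nucleotides")
```

Because both genes still encode their original proteins, the gain in identity comes entirely from synonymous codon choice.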
Affiliation(s)
- Lu He
- Dept of Computer Science, Dartmouth College, 6211 Sudikoff Laboratory, Hanover, NH 03755, USA
- Alan M Friedman
- Dept of Biological Sciences, Markey Center for Structural Biology, Purdue Cancer Center, and Bindley Bioscience Center, Purdue University, West Lafayette, IN 47907, USA
- Chris Bailey-Kellogg
- Dept of Computer Science, Dartmouth College, 6211 Sudikoff Laboratory, Hanover, NH 03755, USA
3
Xiong F, Friedman AM, Bailey-Kellogg C. Planning combinatorial disulfide cross-links for protein fold determination. BMC Bioinformatics 2011; 12 Suppl 12:S5. [PMID: 22168447 PMCID: PMC3247086 DOI: 10.1186/1471-2105-12-s12-s5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND Fold recognition techniques take advantage of the limited number of overall structural organizations, and have become increasingly effective at identifying the fold of a given target sequence. However, in the absence of sufficient sequence identity, it remains difficult for fold recognition methods to always select the correct model. While a native-like model is often among a pool of highly ranked models, it is not necessarily the highest-ranked one, and the model rankings depend sensitively on the scoring function used. Structure elucidation methods can then be employed to decide among the models based on relatively rapid biochemical/biophysical experiments. RESULTS This paper presents an integrated computational-experimental method to determine the fold of a target protein by probing it with a set of planned disulfide cross-links. We start with predicted structural models obtained by standard fold recognition techniques. In a first stage, we characterize the fold-level differences between the models in terms of topological (contact) patterns of secondary structure elements (SSEs), and select a small set of SSE pairs that differentiate the folds. In a second stage, we determine a set of residue-level cross-links to probe the selected SSE pairs. Each stage employs an information-theoretic planning algorithm to maximize information gain while minimizing experimental complexity, along with a Bayes error plan assessment framework to characterize the probability of making a correct decision once data for the plan are collected. By focusing on overall topological differences and planning cross-linking experiments to probe them, our fold determination approach is robust to noise and uncertainty in the models (e.g., threading misalignment) and in the actual structure (e.g., flexibility). 
We demonstrate the effectiveness of our approach in case studies for a number of CASP targets, showing that the optimized plans have low risk of error while testing only a small portion of the quadratic number of possible cross-link candidates. Simulation studies with these plans further show that they do a very good job of selecting the correct model, according to cross-links simulated from the actual crystal structures. CONCLUSIONS Fold determination can overcome scoring limitations in purely computational fold recognition methods, while requiring less experimental effort than traditional protein structure determination approaches.
Affiliation(s)
- Fei Xiong
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
4
Zheng W, Griswold KE, Bailey-Kellogg C. Protein fragment swapping: a method for asymmetric, selective site-directed recombination. J Comput Biol 2010; 17:459-75. [PMID: 20377457 DOI: 10.1089/cmb.2009.0189] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
This article presents a new approach to site-directed recombination, swapping combinations of selected discontiguous fragments from a source protein in place of corresponding fragments of a target protein. By being both asymmetric (differentiating source and target) and selective (swapping discontiguous fragments), our method focuses experimental effort on a more restricted portion of sequence space, constructing hybrids that are more likely to have the properties that are the objective of the experiment. Furthermore, since the source and target need to be structurally homologous only locally (rather than overall), our method supports swapping fragments from functionally important regions of a source into a target "scaffold" (for example, to humanize an exogenous therapeutic protein). A protein fragment swapping plan is defined by the residue position boundaries of the fragments to be swapped; it is assessed by an average potential score over the resulting hybrid library, with singleton and pairwise terms evaluating the importance and fit of the swapped residues. While we prove that it is NP-hard to choose an optimal set of fragments under such a potential score, we develop an integer programming approach, which we call Swagmer, that works very well in practice. We demonstrate the effectiveness of our method in three swapping problems: selective recombination between beta-lactamases, activity swapping between glutathione transferases, and activity swapping between carboxylases and mutases in the purE family. We show that the selective recombination approach generates better plans (in terms of resulting potential score) than traditional site-directed recombination approaches. We also show that in all cases the optimized experiments are significantly better than ones that would result from stochastic methods.
Affiliation(s)
- Wei Zheng
- Department of Computer Science, Dartmouth College, Hanover, New Hampshire 03755, USA
5
Densmore D, Hsiau THC, Kittleson JT, DeLoache W, Batten C, Anderson JC. Algorithms for automated DNA assembly. Nucleic Acids Res 2010; 38:2607-16. [PMID: 20335162 PMCID: PMC2860133 DOI: 10.1093/nar/gkq165] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022]
Abstract
Generating a defined set of genetic constructs within a large combinatorial space provides a powerful method for engineering novel biological functions. However, the process of assembling more than a few specific DNA sequences can be costly, time-consuming and error-prone. Even if a correct theoretical construction scheme is developed manually, it is likely to be suboptimal by any number of cost metrics. Modular, robust and formal approaches are needed for exploring these vast design spaces. By automating the design of DNA fabrication schemes using computational algorithms, we can eliminate human error while reducing redundant operations, thus minimizing the time and cost required for conducting biological engineering experiments. Here, we provide algorithms that optimize the simultaneous assembly of a collection of related DNA sequences. We compare our algorithms to an exhaustive search on a small synthetic dataset and our results show that our algorithms can quickly find an optimal solution. Comparison with random search approaches on two real-world datasets shows that our algorithms can also quickly find lower-cost solutions for large datasets.
Affiliation(s)
- Douglas Densmore
- Department of Fuel Synthesis, Joint BioEnergy Institute, 5885 Hollis St., Fourth Floor, Emeryville CA 94608, USA.
6
Zheng W, Friedman AM, Bailey-Kellogg C. Algorithms for joint optimization of stability and diversity in planning combinatorial libraries of chimeric proteins. J Comput Biol 2009; 16:1151-68. [PMID: 19645597 DOI: 10.1089/cmb.2009.0090] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
In engineering protein variants by constructing and screening combinatorial libraries of chimeric proteins, two complementary and competing goals are desired: the new proteins must be similar enough to the evolutionarily-selected wild-type proteins to be stably folded, and they must be different enough to display functional variation. We present here the first method, Staversity, to simultaneously optimize stability and diversity in selecting sets of breakpoint locations for site-directed recombination. Our goal is to uncover all "undominated" breakpoint sets, for which no other breakpoint set is better in both factors. Our first algorithm finds the undominated sets serving as the vertices of the lower envelope of the two-dimensional (stability and diversity) convex hull containing all possible breakpoint sets. Our second algorithm identifies additional breakpoint sets in the concavities that are either undominated or dominated only by undiscovered breakpoint sets within a distance bound computed by the algorithm. Both algorithms are efficient, requiring only time polynomial in the numbers of residues and breakpoints, while characterizing a space defined by an exponential number of possible breakpoint sets. We applied Staversity to identify 2-10 breakpoint plans for different sets of parent proteins taken from the purE family, as well as for parent proteins TEM-1 and PSE-4 from the beta-lactamase family. The average normalized distance between our plans and the lower bound for optimal plans is around 2%. Our plans dominate most (60-90% on average for each parent set) of the plans found by other possible approaches, random sampling or explicit optimization for stability with implicit optimization for diversity. The identified breakpoint sets provide a compact representation of good plans, enabling a protein engineer to understand and account for the trade-offs between two key considerations in combinatorial chimeragenesis.
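The notion of an "undominated" breakpoint set can be illustrated with a generic Pareto filter. The sketch below is not the Staversity algorithm, which avoids enumerating the exponentially many breakpoint sets; it assumes a small, already-scored list of candidate plans and treats both scores as costs where lower is better. The plan names and score values are made up for illustration.

```python
def undominated(plans):
    """Return the plans that no other plan beats or ties on both costs.

    plans: iterable of (name, stability_cost, diversity_cost) tuples,
    where lower is better for both costs (an assumption of this sketch).
    Exact duplicates of a kept plan are dropped.
    """
    front = []
    best_diversity = float("inf")
    # After sorting by the first cost, a plan is undominated exactly when
    # it improves on the best second cost seen so far.
    for name, stab, div in sorted(plans, key=lambda p: (p[1], p[2])):
        if div < best_diversity:
            front.append((name, stab, div))
            best_diversity = div
    return front


# Hypothetical scored breakpoint plans.
candidates = [
    ("plan_a", 1.0, 5.0),
    ("plan_b", 2.0, 3.0),
    ("plan_c", 3.0, 4.0),  # dominated by plan_b
    ("plan_d", 4.0, 1.0),
    ("plan_e", 2.5, 3.0),  # dominated by plan_b
]
for plan in undominated(candidates):
    print(plan)
```

The surviving plans trace the trade-off curve: moving along it, any gain in one score is paid for in the other.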
Affiliation(s)
- Wei Zheng
- Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA
7
Thomas J, Ramakrishnan N, Bailey-Kellogg C. Protein design by sampling an undirected graphical model of residue constraints. IEEE/ACM Trans Comput Biol Bioinform 2009; 6:506-516. [PMID: 19644177 DOI: 10.1109/tcbb.2008.124] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 05/28/2023]
Abstract
This paper develops an approach for designing protein variants by sampling sequences that satisfy residue constraints encoded in an undirected probabilistic graphical model. Due to evolutionary pressures on proteins to maintain structure and function, the sequence record of a protein family contains valuable information regarding position-specific residue conservation and coupling (or covariation) constraints. Representing these constraints with a graphical model provides two key benefits for protein design: a probabilistic semantics enabling evaluation of possible sequences for consistency with the constraints, and an explicit factorization of residue dependence and independence supporting efficient exploration of the constrained sequence space. We leverage these benefits in developing two complementary MCMC algorithms for protein design: constrained shuffling mixes wild-type sequences positionwise and evaluates graphical model likelihood, while component sampling directly generates sequences by sampling clique values and propagating to other cliques. We apply our methods to design WW domains. We demonstrate that likelihood under a model of wild-type WWs is highly predictive of foldedness of new WWs. We then show both theoretical and rapid empirical convergence of our algorithms in generating high-likelihood, diverse new sequences. We further show that these sequences capture the original sequence constraints, yielding a model as predictive of foldedness as the original one.
Affiliation(s)
- John Thomas
- Department of Computer Science, Dartmouth College, 6211 Sudikoff Laboratory, Hanover, NH 03755, USA.
8
Thomas J, Ramakrishnan N, Bailey-Kellogg C. Graphical models of residue coupling in protein families. IEEE/ACM Trans Comput Biol Bioinform 2008; 5:183-197. [PMID: 18451428 DOI: 10.1109/tcbb.2007.70225] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Indexed: 05/26/2023]
Abstract
Many statistical measures and algorithmic techniques have been proposed for studying residue coupling in protein families. Generally speaking, two residue positions are considered coupled if, in the sequence record, some of their amino acid type combinations are significantly more common than others. While the proposed approaches have proven useful in finding and describing coupling, a significant missing component is a formal probabilistic model that explicates and compactly represents the coupling, integrates information about sequence, structure, and function, and supports inferential procedures for analysis, diagnosis, and prediction. We present an approach to learning and using probabilistic graphical models of residue coupling. These models capture significant conservation and coupling constraints observable in a multiply-aligned set of sequences. Our approach can place a structural prior on considered couplings, so that all identified relationships have direct mechanistic explanations. It can also incorporate information about functional classes, and thereby learn a differential graphical model that distinguishes constraints common to all classes from those unique to individual classes. Such differential models separately account for class-specific conservation and family-wide coupling, two different sources of sequence covariation. They are then able to perform interpretable functional classification of new sequences, explaining classification decisions in terms of the underlying conservation and coupling constraints. We apply our approach in studies of both G protein-coupled receptors and PDZ domains, identifying and analyzing family-wide and class-specific constraints, and performing functional classification. The results demonstrate that graphical models of residue coupling provide a powerful tool for uncovering, representing, and utilizing significant sequence-structure-function relationships in protein families.
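The informal definition of coupling given above (some amino acid combinations at two positions being more common than expected) can be made concrete with a simple covariation score. The sketch below uses mutual information between two alignment columns; this is one common choice and an assumption here, since the abstract does not name the statistic the authors use. The toy alignment is invented for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    if n == 0 or n != len(col_j):
        raise ValueError("columns must be non-empty and of equal length")
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    # Sum p(a,b) * log2(p(a,b) / (p(a) * p(b))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_i[a] * count_j[b]))
        for (a, b), c in count_ij.items()
    )


# Toy alignment: positions 0 and 1 covary, position 2 is fully conserved.
alignment = ["ADG", "ADG", "KEG", "KEG"]
columns = list(zip(*alignment))
print(mutual_information(columns[0], columns[1]))  # coupled pair
print(mutual_information(columns[0], columns[2]))  # conserved column
```

A pairwise score like this only flags candidate couplings; the graphical model described in the abstract goes further by representing the constraints jointly.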
Affiliation(s)
- John Thomas
- Department of Computer Science, Dartmouth College, Sudikoff Laboratory, Hanover, NH 03755, USA.
9
Avramova LV, Desai J, Weaver S, Friedman AM, Bailey-Kellogg C. Robotic hierarchical mixing for the production of combinatorial libraries of proteins and small molecules. J Comb Chem 2007; 10:63-8. [PMID: 18072752 DOI: 10.1021/cc700106e] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
We present a method to automatically plan a robotic process to mix individual combinations of reactants in individual reaction vessels (vials or wells in a multiwell plate), mixing any number of reactants in any desired stoichiometry, and ordering the mixing steps according to an arbitrarily complex treelike assembly protocol. This process enables the combinatorial generation of complete or partial product libraries in individual reaction vessels from intermediates formed in the presence of different sets of reactants. It can produce either libraries of chimeric genes constructed by ligation of fragments from different parent genes or libraries of chemical compounds constructed by convergent synthesis. Given concentrations of the input reactants and desired amounts or volumes of the products, our algorithm, RoboMix, computes the required reactant volumes and the resulting product concentrations, along with volumes and concentrations for all intermediate combinations. It outputs a sequence of robotic liquid transfer steps that ensures that each combination is correctly mixed even when individualized stoichiometries are employed and with any fractional yield for a product. It can also account for waste in robotic liquid handling and residual volume needed to ensure accurate aspiration. We demonstrate the effectiveness of the method in a test mixing dyes with different UV-vis absorption spectra, verifying the desired combinations spectroscopically.
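The volume calculation at the heart of such a plan can be sketched for the simplest case. The code below is not RoboMix: it handles a single mixing step in one vessel and ignores the tree-structured protocol, fractional yields, waste and residual volume described above. It assumes stock and target concentrations share the same units; the function and parameter names are illustrative.

```python
def mixing_volumes(stocks, targets, total_volume):
    """Volumes of each stock, plus diluent, for one mixed vessel.

    stocks: reactant name -> stock concentration
    targets: reactant name -> desired final concentration (same units)
    total_volume: desired final volume of the mixture
    """
    # Conservation of mass: c_stock * v_stock = c_final * v_total.
    volumes = {
        name: targets[name] * total_volume / stocks[name] for name in targets
    }
    diluent = total_volume - sum(volumes.values())
    if diluent < -1e-9:
        raise ValueError("stocks are too dilute for the requested targets")
    volumes["diluent"] = max(diluent, 0.0)
    return volumes


# Hypothetical dye stocks and targets, volumes in microlitres.
plan = mixing_volumes(
    stocks={"dye_a": 100.0, "dye_b": 50.0},
    targets={"dye_a": 10.0, "dye_b": 5.0},
    total_volume=200.0,
)
print(plan)
```

A hierarchical protocol would apply the same balance at every node of the assembly tree, with each intermediate serving as a stock for the next step.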
Affiliation(s)
- Larisa V Avramova
- Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907, USA
10
Chaparro-Riggers JF, Polizzi KM, Bommarius AS. Better library design: data-driven protein engineering. Biotechnol J 2007; 2:180-91. [PMID: 17183506 DOI: 10.1002/biot.200600170] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/11/2022]
Abstract
Data-driven protein engineering is increasingly used as an alternative to rational design and combinatorial engineering because it uses available knowledge to limit library size, while still allowing for the identification of unpredictable substitutions that lead to large effects. Recent advances in computational modeling and bioinformatics, as well as an increasing databank of experiments on functional variants, have led to new strategies to choose particular amino acid residues to vary in order to increase the chances of obtaining a variant protein with the desired property. Strategies for limiting diversity at each position, design of small sub-libraries, and the performance of scouting experiments, have also been developed or even automated, further reducing the library size.
Affiliation(s)
- Javier F Chaparro-Riggers
- School of Chemical and Biomolecular Engineering, Parker H. Petit Institute of Bioengineering and Bioscience, Atlanta, GA, USA
11
Ye X, Friedman AM, Bailey-Kellogg C. Hypergraph model of multi-residue interactions in proteins: sequentially-constrained partitioning algorithms for optimization of site-directed protein recombination. J Comput Biol 2007; 14:777-90. [PMID: 17691894 DOI: 10.1089/cmb.2007.r016] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
Relationships among amino acids determine stability and function and are also constrained by evolutionary history. We develop a probabilistic hypergraph model of residue relationships that generalizes traditional pairwise contact potentials to account for the statistics of multi-residue interactions. Using this model, we detected non-random associations in protein families and in the protein database. We also use this model in optimizing site-directed recombination experiments to preserve significant interactions and thereby increase the frequency of generating useful recombinants. We formulate the optimization as a sequentially-constrained hypergraph partitioning problem; the quality of recombinant libraries with respect to a set of breakpoints is characterized by the total perturbation to edge weights. We prove this problem to be NP-hard in general, but develop exact and heuristic polynomial-time algorithms for a number of important cases. Application to the beta-lactamase family demonstrates the utility of our algorithms in planning site-directed recombination.
Affiliation(s)
- Xiaoduan Ye
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA