51
Fahlberg SA, Freschlin CR, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. bioRxiv 2023:2023.11.08.566287. [PMID: 37987009 PMCID: PMC10659313 DOI: 10.1101/2023.11.08.566287] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/22/2023]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.
Affiliation(s)
- Sarah A Fahlberg
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Chase R Freschlin
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Pete Heinzelman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
52
Skwara A, Gowda K, Yousef M, Diaz-Colunga J, Raman AS, Sanchez A, Tikhonov M, Kuehn S. Statistically learning the functional landscape of microbial communities. Nat Ecol Evol 2023; 7:1823-1833. [PMID: 37783827 PMCID: PMC11088814 DOI: 10.1038/s41559-023-02197-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 03/24/2023] [Accepted: 08/11/2023] [Indexed: 10/04/2023]
Abstract
Microbial consortia exhibit complex functional properties in contexts ranging from soils to bioreactors to human hosts. Understanding how community composition determines function is a major goal of microbial ecology. Here we address this challenge using the concept of community-function landscapes (analogues to fitness landscapes) that capture how changes in community composition alter collective function. Using datasets that represent a broad set of community functions, from production/degradation of specific compounds to biomass generation, we show that statistically inferred landscapes quantitatively predict community functions from knowledge of species presence or absence. Crucially, community-function landscapes allow prediction without explicit knowledge of abundance dynamics or interactions between species and can be accurately trained using measurements from a small subset of all possible community compositions. The success of our approach arises from the fact that empirical community-function landscapes appear to be not rugged, meaning that they largely lack high-order epistatic contributions that would be difficult to fit with limited data. Finally, we show that this observation holds across a wide class of ecological models, suggesting community-function landscapes can be efficiently inferred across a broad range of ecological regimes. Our results open the door to the rational design of consortia without detailed knowledge of abundance dynamics or interactions.
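To make the idea of a statistically inferred landscape concrete, the following is a minimal sketch, not the authors' method: it assumes a regression with an intercept, one term per species, and one term per species pair, fitted by ordinary least squares on presence/absence vectors. The function names, the four-species system, and the synthetic coefficients are all hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def design_matrix(presence):
    """Columns: intercept, one per species, one per species pair (assumed model)."""
    presence = np.asarray(presence, dtype=float)
    n_samples, n_species = presence.shape
    cols = [np.ones(n_samples)]
    cols += [presence[:, i] for i in range(n_species)]
    cols += [presence[:, i] * presence[:, j]
             for i, j in itertools.combinations(range(n_species), 2)]
    return np.column_stack(cols)

def fit_landscape(presence, function):
    """Least-squares fit of the second-order landscape coefficients."""
    coef, *_ = np.linalg.lstsq(design_matrix(presence),
                               np.asarray(function, dtype=float), rcond=None)
    return coef

def predict(presence, coef):
    return design_matrix(presence) @ coef

# Synthetic, noiseless example with 4 hypothetical species (16 possible communities).
rng = np.random.default_rng(0)
n_species = 4
communities = np.array(list(itertools.product([0, 1], repeat=n_species)))
true_coef = rng.normal(size=1 + n_species + n_species * (n_species - 1) // 2)
true_function = design_matrix(communities) @ true_coef

# Train only on communities with at most two species present (11 of 16),
# then predict the function of the richer communities that were held out.
train = communities.sum(axis=1) <= 2
coef = fit_landscape(communities[train], true_function[train])
held_out_prediction = predict(communities[~train], coef)
held_out_truth = true_function[~train]
```

Because the synthetic landscape here contains no terms above second order, the held-out communities are predicted exactly; real data with noise or higher-order contributions would not behave this cleanly.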
Affiliation(s)
- Abigail Skwara
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Karna Gowda
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL, USA
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
- Mahmoud Yousef
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL, USA
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
- Juan Diaz-Colunga
  - Department of Microbial Biotechnology, National Center for Biotechnology (CNB-CSIC), Madrid, Spain
- Arjun S Raman
  - Department of Pathology, University of Chicago, Chicago, IL, USA
  - Duchossois Family Institute, University of Chicago, Chicago, IL, USA
- Alvaro Sanchez
  - Department of Microbial Biotechnology, National Center for Biotechnology (CNB-CSIC), Madrid, Spain
- Mikhail Tikhonov
  - Department of Physics, Washington University in St. Louis, St. Louis, MO, USA
- Seppe Kuehn
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL, USA
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
53
Affiliation(s)
- Daniel R Amor
  - Laboratoire de Physique, Ecole normale supérieure, Université PSL, CNRS, Paris, France
  - IAME, Université de Paris Cité, Université Sorbonne Paris Nord, INSERM, Paris, France
54
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
  - Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
55
Yang J, Ducharme J, Johnston KE, Li FZ, Yue Y, Arnold FH. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering. ACS Synth Biol 2023; 12:2444-2454. [PMID: 37524064 DOI: 10.1021/acssynbio.3c00301] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 08/02/2023]
Abstract
With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to design libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with the potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software (https://github.com/jsunn-y/DeCOIL), DeCOIL can be readily implemented to generate desired informed libraries.
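As background for what a degenerate codon encodes, here is a minimal sketch that expands one degenerate codon into its concrete codons and tallies the amino acids they translate to. It is not part of DeCOIL and does not reproduce its optimization; the helper names are hypothetical, and it assumes the standard genetic code and the IUPAC nucleotide ambiguity codes.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def expand_degenerate_codon(degenerate):
    """List every concrete codon matched by a 3-letter degenerate codon."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in degenerate.upper()))]

def amino_acid_counts(degenerate):
    """Count how many concrete codons encode each amino acid ('*' marks a stop)."""
    return Counter(CODON_TABLE[c] for c in expand_degenerate_codon(degenerate))

nnk_codons = expand_degenerate_codon("NNK")
nnk_counts = amino_acid_counts("NNK")
```

For a combinatorial library with one degenerate codon per targeted site, the number of DNA variants is the product of the per-site codon counts, which is why the choice of codon at each site controls both library size and amino acid coverage.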
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Julie Ducharme
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Kadina E Johnston
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Yisong Yue
  - Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, California 91125, United States
- Frances H Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
56
Chen L, Zhang Z, Li Z, Li R, Huo R, Chen L, Wang D, Luo X, Chen K, Liao C, Zheng M. Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst 2023; 14:706-721.e5. [PMID: 37591206 DOI: 10.1016/j.cels.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/15/2023] [Revised: 05/30/2023] [Accepted: 07/18/2023] [Indexed: 08/19/2023]
Abstract
One of the key points of machine learning-assisted directed evolution (MLDE) is the accurate learning of the fitness landscape, a conceptual mapping from sequence variants to the desired function. Here, we describe a multi-protein training scheme that leverages the existing deep mutational scanning data from diverse proteins to aid in understanding the fitness landscape of a new protein. Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects. Moreover, our study identified previously overlooked strong baselines, and their unexpectedly good performance brings our attention to the pitfalls of MLDE. Overall, these results may improve our understanding of the association between different protein fitness profiles and shed light on developing better machine learning-assisted approaches to the directed evolution of proteins. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Lin Chen
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zehong Zhang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhenghao Li
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; Shanghai Institute for Advanced Immunochemical Studies, School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Rui Li
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Ruifeng Huo
  - School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Lifan Chen
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Xiaomin Luo
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Kaixian Chen
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Cangsong Liao
  - University of Chinese Academy of Sciences, Beijing 100049, China; Chemical Biology Research Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Mingyue Zheng
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China; School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
57
Parkinson J, Wang W. Linear-Scaling Kernels for Protein Sequences and Small Molecules Outperform Deep Learning While Providing Uncertainty Quantitation and Improved Interpretability. J Chem Inf Model 2023; 63:4589-4601. [PMID: 37498239 DOI: 10.1021/acs.jcim.3c00601] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 07/28/2023]
Abstract
Gaussian processes (GPs) are Bayesian models that provide several advantages for regression tasks in machine learning, such as reliable quantitation of uncertainty and improved interpretability. Their adoption has been precluded by their excessive computational cost and by the difficulty in adapting them for analyzing sequences (e.g., amino acid sequences) and graphs (e.g., small molecules). In this study, we introduce a group of random feature-approximated kernels for sequences and graphs that exhibit linear scaling with both the size of the training set and the size of the sequences or graphs. We incorporate these new kernels into our new Python library for GP regression, xGPR, and develop an efficient and scalable algorithm for fitting GPs equipped with these kernels to large datasets. We compare the performance of xGPR on 17 different benchmarks with both standard and state-of-the-art deep learning models and find that GP regression achieves highly competitive accuracy for these tasks while providing well-calibrated uncertainty quantitation and improved interpretability. Finally, in a simple experiment, we illustrate how xGPR may be used as part of an active learning strategy to engineer a protein with a desired property in an automated way without human intervention.
58
Mauri E, Cocco S, Monasson R. Transition paths in Potts-like energy landscapes: General properties and application to protein sequence models. Phys Rev E 2023; 108:024141. [PMID: 37723761 DOI: 10.1103/physreve.108.024141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 08/07/2023] [Indexed: 09/20/2023]
Abstract
We study transition paths in energy landscapes over multicategorical Potts configurations using the mean-field approach introduced by Mauri et al. [Phys. Rev. Lett. 130, 158402 (2023), doi:10.1103/PhysRevLett.130.158402]. Paths interpolate between two fixed configurations or are anchored at one extremity only. We characterize the properties of "good" transition paths realizing a trade-off between exploring low-energy regions in the landscape and being not too long, such as their entropy or the probability of escape from a region of the landscape. We unveil the existence of a phase transition separating a regime in which paths are stretched in between their anchors from another regime where paths can explore the energy landscape more globally to minimize the energy. This phase transition is first illustrated and studied in detail on a mathematically tractable Hopfield-Potts toy model, then studied in energy landscapes inferred from protein sequence data.
Affiliation(s)
- Eugenio Mauri
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
- Simona Cocco
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
- Rémi Monasson
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
59
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. arXiv 2023:arXiv:2307.14587v1. [PMID: 37547662 PMCID: PMC10402185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
  - Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824, MI, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824, MI, USA
60
Wagner A. Evolvability-enhancing mutations in the fitness landscapes of an RNA and a protein. Nat Commun 2023; 14:3624. [PMID: 37336901 PMCID: PMC10279741 DOI: 10.1038/s41467-023-39321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 06/05/2023] [Indexed: 06/21/2023] Open
Abstract
Can evolvability (the ability to produce adaptive heritable variation) itself evolve through adaptive Darwinian evolution? If so, then Darwinian evolution may help create the conditions that enable Darwinian evolution. Here I propose a framework that is suitable to address this question with available experimental data on adaptive landscapes. I introduce the notion of an evolvability-enhancing mutation, which increases the likelihood that subsequent mutations in an evolving organism, protein, or RNA molecule are adaptive. I search for such mutations in the experimentally characterized and combinatorially complete fitness landscapes of a protein and an RNA molecule. I find that such evolvability-enhancing mutations indeed exist. They constitute a small fraction of all mutations, which shift the distribution of fitness effects of subsequent mutations towards less deleterious mutations, and increase the incidence of beneficial mutations. Evolving populations which experience such mutations can evolve significantly higher fitness. The study of evolvability-enhancing mutations opens many avenues of investigation into the evolution of evolvability.
Affiliation(s)
- Andreas Wagner
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland
  - The Santa Fe Institute, Santa Fe, NM, USA
61
Johnston KE, Fannjiang C, Wittmann BJ, Hie BL, Yang KK, Wu Z. Machine Learning for Protein Engineering. arXiv 2023:arXiv:2305.16634v1. [PMID: 37292483 PMCID: PMC10246115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.
Affiliation(s)
- Bruce J Wittmann
  - work done while at California Institute of Technology, now at Microsoft
62
Cano AV, Gitschlag BL, Rozhoňová H, Stoltzfus A, McCandlish DM, Payne JL. Mutation bias and the predictability of evolution. Philos Trans R Soc Lond B Biol Sci 2023; 378:20220055. [PMID: 37004719 PMCID: PMC10067271 DOI: 10.1098/rstb.2022.0055] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/01/2022] [Accepted: 02/16/2023] [Indexed: 04/04/2023] Open
Abstract
Predicting evolutionary outcomes is an important research goal in a diversity of contexts. The focus of evolutionary forecasting is usually on adaptive processes, and efforts to improve prediction typically focus on selection. However, adaptive processes often rely on new mutations, which can be strongly influenced by predictable biases in mutation. Here, we provide an overview of existing theory and evidence for such mutation-biased adaptation and consider the implications of these results for the problem of prediction, in regard to topics such as the evolution of infectious diseases, resistance to biochemical agents, as well as cancer and other kinds of somatic evolution. We argue that empirical knowledge of mutational biases is likely to improve in the near future, and that this knowledge is readily applicable to the challenges of short-term prediction. This article is part of the theme issue 'Interdisciplinary approaches to predicting evolutionary biology'.
Affiliation(s)
- Alejandro V. Cano
  - Institute of Integrative Biology, ETH Zurich, 8092 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Bryan L. Gitschlag
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Hana Rozhoňová
  - Institute of Integrative Biology, ETH Zurich, 8092 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Arlin Stoltzfus
  - Office of Data and Informatics, Material Measurement Laboratory, National Institute of Standards and Technology, Rockville, MD 20899, USA
  - Institute for Bioscience and Biotechnology Research, Rockville, MD 20850, USA
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Joshua L. Payne
  - Institute of Integrative Biology, ETH Zurich, 8092 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
63
Gantz M, Neun S, Medcalf EJ, van Vliet LD, Hollfelder F. Ultrahigh-Throughput Enzyme Engineering and Discovery in In Vitro Compartments. Chem Rev 2023; 123:5571-5611. [PMID: 37126602 PMCID: PMC10176489 DOI: 10.1021/acs.chemrev.2c00910] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 12/30/2022] [Indexed: 05/03/2023]
Abstract
Novel and improved biocatalysts are increasingly sourced from libraries via experimental screening. The success of such campaigns is crucially dependent on the number of candidates tested. Water-in-oil emulsion droplets can replace the classical test tube, to provide in vitro compartments as an alternative screening format, containing genotype and phenotype and enabling a readout of function. The scale-down to micrometer droplet diameters and picoliter volumes brings about a >10^7-fold volume reduction compared to 96-well-plate screening. Droplets made in automated microfluidic devices can be integrated into modular workflows to set up multistep screening protocols involving various detection modes to sort >10^7 variants a day with kHz frequencies. The repertoire of assays available for droplet screening covers all seven enzyme commission (EC) number classes, setting the stage for widespread use of droplet microfluidics in everyday biochemical experiments. We review the practicalities of adapting droplet screening for enzyme discovery and for detailed kinetic characterization. These new ways of working will not just accelerate discovery experiments currently limited by screening capacity but profoundly change the paradigms we can probe. By interfacing the results of ultrahigh-throughput droplet screening with next-generation sequencing and deep learning, strategies for directed evolution can be implemented, examined, and evaluated.
Affiliation(s)
- Florian Hollfelder
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K.
64
Rabitz H, Russell B, Ho TS. The Surprising Ease of Finding Optimal Solutions for Controlling Nonlinear Phenomena in Quantum and Classical Complex Systems. J Phys Chem A 2023; 127:4224-4236. [PMID: 37142303 DOI: 10.1021/acs.jpca.3c01896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
This Perspective addresses the often observed surprising ease of achieving optimal control of nonlinear phenomena in quantum and classical complex systems. The circumstances involved are wide-ranging, with scenarios including manipulation of atomic scale processes, maximization of chemical and material properties or synthesis yields, Nature's optimization of species' populations by natural selection, and directed evolution. Natural evolution will mainly be discussed in terms of laboratory experiments with microorganisms, and the field is also distinct from the other domains where a scientist specifies the goal(s) and oversees the control process. We use the word "control" in reference to all of the available variables, regardless of the circumstance. The empirical observations on the ease of achieving at least good, if not excellent, control in diverse domains of science raise the question of why this occurs despite the generally inherent complexity of the systems in each scenario. The key to addressing the question lies in examining the associated control landscape, which is defined as the optimization objective as a function of the control variables that can be as diverse as the phenomena under consideration. Controls may range from laser pulses, chemical reagents, chemical processing conditions, out to nucleic acids in the genome and more. This Perspective presents a conjecture, based on present findings, that the systematics of readily finding good outcomes from controlled phenomena may be unified through consideration of control landscapes with the same common set of three underlying assumptions (the existence of an optimal solution, the ability for local movement on the landscape, and the availability of sufficient control resources) whose validity needs assessment in each scenario.
In practice, many cases permit using myopic gradient-like algorithms while other circumstances utilize algorithms having some elements of stochasticity or introduced noise, depending on whether the landscape is locally smooth or rough. The overarching observation is that only relatively short searches are required despite the common high dimensionality of the available controls in typical scenarios.
Affiliation(s)
- Herschel Rabitz
  - Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Benjamin Russell
  - Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Tak-San Ho
  - Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
65
Crona K, Krug J, Srivastava M. Geometry of fitness landscapes: peaks, shapes and universal positive epistasis. J Math Biol 2023; 86:62. [PMID: 36976406 DOI: 10.1007/s00285-023-01889-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 02/03/2023] [Accepted: 02/15/2023] [Indexed: 03/29/2023]
Abstract
Darwinian evolution is driven by random mutations, genetic recombination (gene shuffling) and selection that favors genotypes with high fitness. For systems where each genotype can be represented as a bitstring of length L, an overview of possible evolutionary trajectories is provided by the L-cube graph with nodes labeled by genotypes and edges directed toward the genotype with higher fitness. Peaks (sinks in the graphs) are important since a population can get stranded at a suboptimal peak. The fitness landscape is defined by the fitness values of all genotypes in the system. Some notion of curvature is necessary for a more complete analysis of the landscapes, including the effect of recombination. The shape approach uses triangulations (shapes) induced by fitness landscapes. The main topic for this work is the interplay between peak patterns and shapes. Because of constraints on the shapes for [Formula: see text] imposed by peaks, there are in total 25 possible combinations of peak patterns and shapes. Similar constraints exist for higher L. Specifically, we show that the constraints induced by the staircase triangulation can be formulated as a condition of universal positive epistasis, an order relation on the fitness effects of arbitrary sets of mutations that respects the inclusion relation between the corresponding genetic backgrounds. We apply the concept to a large protein fitness landscape for an immunoglobulin-binding protein expressed in Streptococcal bacteria.
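The fitness graph described in this abstract can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: genotypes are bitstrings of length L, edges point from a genotype to any Hamming-distance-1 neighbour with strictly higher fitness, and peaks are the sinks of that graph. The function names and the L = 3 fitness values are hypothetical and not taken from the paper.

```python
def neighbors(genotype):
    """All bitstrings at Hamming distance 1 from a genotype such as '010'."""
    flip = {"0": "1", "1": "0"}
    return [genotype[:i] + flip[bit] + genotype[i + 1:]
            for i, bit in enumerate(genotype)]

def fitness_graph(fitness):
    """Directed edges (lower, higher) between neighbouring genotypes."""
    return [(g, h)
            for g, w in fitness.items()
            for h in neighbors(g)
            if fitness[h] > w]

def find_peaks(fitness):
    """Peaks are sinks: genotypes with no outgoing edge in the fitness graph."""
    has_fitter_neighbor = {g for g, _ in fitness_graph(fitness)}
    return sorted(g for g in fitness if g not in has_fitter_neighbor)

# Hypothetical complete landscape on the 3-cube (values are illustrative only).
example_fitness = {
    "000": 1.00, "100": 0.70, "010": 0.80, "001": 0.90,
    "110": 0.60, "101": 0.50, "011": 0.95, "111": 1.30,
}
example_edges = fitness_graph(example_fitness)
example_peaks = find_peaks(example_fitness)
```

In this hypothetical landscape every single mutant of `000` is less fit than `000`, so a population starting there is stranded at a suboptimal peak even though `111` has the highest fitness.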
Affiliation(s)
- Kristina Crona
  - Department of Mathematics and Statistics, American University, Washington, DC, USA
- Joachim Krug
  - Institute for Biological Physics, University of Cologne, Cologne, Germany
- Malvika Srivastava
  - Department of Environmental Systems Science, ETH Zürich, Zurich, Switzerland
|
66
|
Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput Biol 2023; 19:e1010956. [PMID: 36857380 PMCID: PMC10010530 DOI: 10.1371/journal.pcbi.1010956] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/05/2022] [Revised: 03/13/2023] [Accepted: 02/16/2023] [Indexed: 03/02/2023] Open
Abstract
Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.
|
67
|
Novelty Search Promotes Antigenic Diversity in Microbial Pathogens. Pathogens 2023; 12:pathogens12030388. [PMID: 36986310 PMCID: PMC10053453 DOI: 10.3390/pathogens12030388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 02/12/2023] [Accepted: 02/21/2023] [Indexed: 03/05/2023] Open
Abstract
Driven by host–pathogen coevolution, cell surface antigens are often the fastest evolving parts of a microbial pathogen. The persistent evolutionary impetus for novel antigen variants suggests the utility of novelty-seeking algorithms in predicting antigen diversification in microbial pathogens. In contrast to traditional genetic algorithms maximizing variant fitness, novelty-seeking algorithms optimize variant novelty. Here, we designed and implemented three evolutionary algorithms (fitness-seeking, novelty-seeking, and hybrid) and evaluated their performances in 10 simulated and 2 empirically derived antigen fitness landscapes. The hybrid walks combining fitness- and novelty-seeking strategies overcame the limitations of each algorithm alone, and consistently reached global fitness peaks. Thus, hybrid walks provide a model for microbial pathogens escaping host immunity without compromising variant fitness. Biological processes facilitating novelty-seeking evolution in natural pathogen populations include hypermutability, recombination, wide dispersal, and immune-compromised hosts. The high efficiency of the hybrid algorithm improves the evolutionary predictability of novel antigen variants. We propose the design of escape-proof vaccines based on high-fitness variants covering a majority of the basins of attraction on the fitness landscape representing all potential variants of a microbial antigen.
|
68
|
Yang KK, Zanichelli N, Yeh H. Masked inverse folding with sequence transfer for protein representation learning. Protein Eng Des Sel 2023; 36:gzad015. [PMID: 37883472 DOI: 10.1093/protein/gzad015] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/31/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023] Open
Abstract
Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
Affiliation(s)
- Kevin K Yang
  - Microsoft Research, 1 Memorial Drive, Cambridge, MA, USA
- Hugh Yeh
  - Pritzker School of Medicine, University of Chicago, 924 E 57th Street, Chicago, IL, USA
|
69
|
Hu R, Fu L, Chen Y, Chen J, Qiao Y, Si T. Protein engineering via Bayesian optimization-guided evolutionary algorithm and robotic experiments. Brief Bioinform 2023; 24:6958505. [PMID: 36562723 DOI: 10.1093/bib/bbac570] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 09/30/2022] [Revised: 11/14/2022] [Accepted: 11/22/2022] [Indexed: 12/24/2022] Open
Abstract
Directed protein evolution applies repeated rounds of genetic mutagenesis and phenotypic screening and is often limited by experimental throughput. Through in silico prioritization of mutant sequences, machine learning has been applied to reduce wet lab burden to a level practical for human researchers. On the other hand, robotics permits large batches and rapid iterations for protein engineering cycles, but such capacities have not been well exploited in existing machine learning-assisted directed evolution approaches. Here, we report a scalable and batched method, Bayesian Optimization-guided EVOlutionary (BO-EVO) algorithm, to guide multiple rounds of robotic experiments to explore protein fitness landscapes of combinatorial mutagenesis libraries. We first examined various design specifications based on an empirical landscape of protein G domain B1. Then, BO-EVO was successfully generalized to another empirical landscape of an Escherichia coli kinase PhoQ, as well as simulated NK landscapes with up to moderate epistasis. This approach was then applied to guide robotic library creation and screening to engineer enzyme specificity of RhlA, a key biosynthetic enzyme for rhamnolipid biosurfactants. A 4.8-fold improvement in producing a target rhamnolipid congener was achieved after examining less than 1% of all possible mutants after four iterations. Overall, BO-EVO proves to be an efficient and general approach to guide combinatorial protein engineering without prior knowledge.
Affiliation(s)
- Ruyun Hu
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Lihao Fu
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen 518055, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Yongcan Chen
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen 518055, China
- Junyu Chen
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Yu Qiao
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Tong Si
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
  - CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen 518055, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
|
70
|
Schmiegelt B, Krug J. Accessibility percolation on Cartesian power graphs. J Math Biol 2023; 86:46. [PMID: 36790641 PMCID: PMC9931871 DOI: 10.1007/s00285-023-01882-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2021] [Revised: 01/12/2023] [Accepted: 01/31/2023] [Indexed: 02/16/2023]
Abstract
A fitness landscape is a mapping from a space of discrete genotypes to the real numbers. A path in a fitness landscape is a sequence of genotypes connected by single mutational steps. Such a path is said to be accessible if the fitness values of the genotypes encountered along the path increase monotonically. We study accessible paths on random fitness landscapes of the House-of-Cards type, on which fitness values are independent, identically and continuously distributed random variables. The genotype space is taken to be a Cartesian power graph [Formula: see text], where [Formula: see text] is the number of genetic loci and the allele graph [Formula: see text] encodes the possible allelic states and mutational transitions on one locus. The probability of existence of accessible paths between two genotypes at a distance linear in [Formula: see text] displays a transition from 0 to a positive value at a threshold [Formula: see text] for the fitness difference between the initial and final genotype. We derive a lower bound on [Formula: see text] for general [Formula: see text] and show that this bound is tight for a large class of allele graphs. Our results generalize previous results for accessibility percolation on the biallelic hypercube, and compare favorably to published numerical results for multiallelic Hamming graphs.
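The definitions in this abstract (a House-of-Cards landscape and an accessible path) can be sketched in a few lines. This is a simplified illustration, not the paper's analysis: it assumes the biallelic hypercube rather than a general Cartesian power graph, considers only shortest (direct) paths from the all-0 to the all-1 genotype, and uses hypothetical function names.

```python
import math
import random
from itertools import permutations, product


def is_accessible(path, fitness):
    """True if fitness increases strictly at every step along the path."""
    values = [fitness[g] for g in path]
    return all(a < b for a, b in zip(values, values[1:]))


def direct_paths(n_loci):
    """Yield every shortest path from 00...0 to 11...1 on the biallelic hypercube."""
    for order in permutations(range(n_loci)):
        genotype = [0] * n_loci
        path = [tuple(genotype)]
        for locus in order:
            genotype[locus] = 1  # one single-mutation step
            path.append(tuple(genotype))
        yield path


def house_of_cards(n_loci, rng):
    """Assign independent, identically distributed uniform fitness values."""
    return {g: rng.random() for g in product((0, 1), repeat=n_loci)}


rng = random.Random(0)
n_loci = 4
landscape = house_of_cards(n_loci, rng)
n_accessible = sum(is_accessible(p, landscape) for p in direct_paths(n_loci))
print(f"{n_accessible} of {math.factorial(n_loci)} direct paths are accessible")
```

Repeating the draw over many random landscapes and recording how often at least one accessible path exists gives a numerical estimate of the existence probability that the abstract describes.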
Affiliation(s)
- Joachim Krug
  - Institute for Biological Physics, University of Cologne, Köln, Germany
|
71
|
Minot M, Reddy ST. Nucleotide augmentation for machine learning-guided protein engineering. Bioinformatics Advances 2022; 3:vbac094. [PMID: 36698759 PMCID: PMC9843584 DOI: 10.1093/bioadv/vbac094] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/20/2022] [Revised: 10/24/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022]
Abstract
Summary: Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing; however, there is a lack of such augmentation techniques for biological sequence data. Towards this end, we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models with benchmark datasets of protein genotype and phenotype, revealing performance gains on par with and surpassing benchmark models using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance. Availability and implementation: The code used in this study is publicly available at https://github.com/minotm/NTA.
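The idea of augmenting protein data through synonymous codon substitution can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the NTA implementation from the linked repository: it assumes the standard genetic code, samples synonymous codons uniformly at random, and uses hypothetical function names and a hypothetical example peptide.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of synonymous codons.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def augment(protein, n_copies, rng):
    """Return n_copies nucleotide encodings of the same protein sequence."""
    return [
        "".join(rng.choice(SYNONYMS[aa]) for aa in protein)
        for _ in range(n_copies)
    ]


rng = random.Random(0)
protein = "MKLV"  # hypothetical example peptide
for dna in augment(protein, 3, rng):
    print(dna, "->", translate(dna))
```

Every augmented nucleotide sequence translates back to the same protein, so each copy can carry the original phenotype label while presenting the model with a different input.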
Affiliation(s)
- Mason Minot
  - Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
- Sai T Reddy
  - Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
|
72
|
Azbukina N, Zharikova A, Ramensky V. Intragenic compensation through the lens of deep mutational scanning. Biophys Rev 2022; 14:1161-1182. [PMID: 36345285 PMCID: PMC9636336 DOI: 10.1007/s12551-022-01005-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/14/2022] [Accepted: 09/26/2022] [Indexed: 12/20/2022] Open
Abstract
A significant fraction of mutations in proteins are deleterious and result in adverse consequences for protein function, stability, or interaction with other molecules. Intragenic compensation is a specific case of positive epistasis in which a neutral missense mutation cancels the effect of a deleterious mutation in the same protein. Permissive compensatory mutations facilitate protein evolution, since without them all sequences would be extremely conserved. Understanding compensatory mechanisms is an important scientific challenge at the intersection of protein biophysics and evolution. In human genetics, intragenic compensatory interactions are important since they may result in variable penetrance of pathogenic mutations or fixation of pathogenic human alleles in orthologous proteins from related species. The latter phenomenon complicates computational and clinical inference of an allele's pathogenicity. Deep mutational scanning is a relatively new technique that enables experimental studies of functional effects of thousands of mutations in proteins. We review the important aspects of the field and discuss existing limitations of current datasets. We reviewed ten published DMS datasets with quantified functional effects of single and double mutations and described rates and patterns of intragenic compensation in eight of them.
Affiliation(s)
- Nadezhda Azbukina
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 1-73, Leninskie Gory, 119991 Moscow, Russia
- Anastasia Zharikova
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 1-73, Leninskie Gory, 119991 Moscow, Russia
  - National Medical Research Center for Therapy and Preventive Medicine, Petroverigsky per., 10, Bld.3, 101000 Moscow, Russia
- Vasily Ramensky
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 1-73, Leninskie Gory, 119991 Moscow, Russia
  - National Medical Research Center for Therapy and Preventive Medicine, Petroverigsky per., 10, Bld.3, 101000 Moscow, Russia
|
73
|
Abstract
One core goal of genetics is to systematically understand the mapping between the DNA sequence of an organism (genotype) and its measurable characteristics (phenotype). Understanding this mapping is often challenging because of interactions between mutations, where the result of combining several different mutations can be very different than the sum of their individual effects. Here we provide a statistical framework for modeling complex genetic interactions of this type. The key idea is to ask how fast the effects of mutations change when introducing the same mutation in increasingly distant genetic backgrounds. We then propose a model for phenotypic prediction that takes into account this tendency for the effects of mutations to be more similar in nearby genetic backgrounds. Contemporary high-throughput mutagenesis experiments are providing an increasingly detailed view of the complex patterns of genetic interaction that occur between multiple mutations within a single protein or regulatory element. By simultaneously measuring the effects of thousands of combinations of mutations, these experiments have revealed that the genotype–phenotype relationship typically reflects not only genetic interactions between pairs of sites but also higher-order interactions among larger numbers of sites. However, modeling and understanding these higher-order interactions remains challenging. Here we present a method for reconstructing sequence-to-function mappings from partially observed data that can accommodate all orders of genetic interaction. The main idea is to make predictions for unobserved genotypes that match the type and extent of epistasis found in the observed data. 
This information on the type and extent of epistasis can be extracted by considering how phenotypic correlations change as a function of mutational distance, which is equivalent to estimating the fraction of phenotypic variance due to each order of genetic interaction (additive, pairwise, three-way, etc.). Using these estimated variance components, we then define an empirical Bayes prior that in expectation matches the observed pattern of epistasis and reconstruct the genotype–phenotype mapping by conducting Gaussian process regression under this prior. To demonstrate the power of this approach, we present an application to the antibody-binding domain GB1 and also provide a detailed exploration of a dataset consisting of high-throughput measurements for the splicing efficiency of human pre-mRNA 5′ splice sites, for which we also validate our model predictions via additional low-throughput experiments.
|
74
|
Castro E, Godavarthi A, Rubinfien J, Givechian K, Bhaskar D, Krishnaswamy S. Transformer-based protein generation with regularized latent space optimization. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00532-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
75
|
Qiu Y, Wei GW. CLADE 2.0: Evolution-Driven Cluster Learning-Assisted Directed Evolution. J Chem Inf Model 2022; 62:4629-4641. [PMID: 36154171 DOI: 10.1021/acs.jcim.2c01046] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
Abstract
Directed evolution, a revolutionary biotechnology in protein engineering, optimizes protein fitness by searching an astronomical mutational space via expensive experiments. The cluster learning-assisted directed evolution (CLADE) efficiently explores the mutational space via a combination of unsupervised hierarchical clustering and supervised learning. However, the initial-stage sampling in CLADE treats all clusters equally despite many clusters containing a large portion of non-functional mutations. Recent statistical and deep learning tools enable evolutionary density modeling to assess protein fitness in an unsupervised manner. In this work, we construct an ensemble of multiple evolutionary scores to guide the initial sampling in CLADE. The resulting evolutionary score-enhanced CLADE, called CLADE 2.0, efficiently selects a training set within a small informative space using the evolution-driven clustering sampling. CLADE 2.0 is validated by using two benchmark libraries both having 160,000 sequences from four-site mutational combinations. Extensive computational experiments and comparisons with existing cutting-edge methods indicate that CLADE 2.0 is a new state-of-the-art tool for machine learning-assisted directed evolution.
Affiliation(s)
- Yuchi Qiu
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
|
76
|
Srivastava M, Payne JL. On the incongruence of genotype-phenotype and fitness landscapes. PLoS Comput Biol 2022; 18:e1010524. [PMID: 36121840 PMCID: PMC9521842 DOI: 10.1371/journal.pcbi.1010524] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/05/2022] [Revised: 09/29/2022] [Accepted: 08/30/2022] [Indexed: 11/22/2022] Open
Abstract
The mapping from genotype to phenotype to fitness typically involves multiple nonlinearities that can transform the effects of mutations. For example, mutations may contribute additively to a phenotype, but their effects on fitness may combine non-additively because selection favors a low or intermediate value of that phenotype. This can cause incongruence between the topographical properties of a fitness landscape and its underlying genotype-phenotype landscape. Yet, genotype-phenotype landscapes are often used as a proxy for fitness landscapes to study the dynamics and predictability of evolution. Here, we use theoretical models and empirical data on transcription factor-DNA interactions to systematically study the incongruence of genotype-phenotype and fitness landscapes when selection favors a low or intermediate phenotypic value. Using the theoretical models, we prove a number of fundamental results. For example, selection for low or intermediate phenotypic values does not change simple sign epistasis into reciprocal sign epistasis, implying that genotype-phenotype landscapes with only simple sign epistasis motifs will always give rise to single-peaked fitness landscapes under such selection. More broadly, we show that such selection tends to create fitness landscapes that are more rugged than the underlying genotype-phenotype landscape, but this increased ruggedness typically does not frustrate adaptive evolution because the local adaptive peaks in the fitness landscape tend to be nearly as tall as the global peak. Many of these results carry forward to the empirical genotype-phenotype landscapes, which may help to explain why low- and intermediate-affinity transcription factor-DNA interactions are so prevalent in eukaryotic gene regulation.
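The abstract's opening example, in which mutations that are additive for a phenotype combine non-additively for fitness, can be checked with a small worked sketch. The phenotype values, the quadratic fitness function, and the optimum below are all assumptions chosen for illustration; they are not the models used in the paper.

```python
def pairwise_epistasis(values):
    """Two-locus epistasis: deviation of the double mutant from additivity."""
    return values["11"] - values["10"] - values["01"] + values["00"]


def fitness_from_phenotype(p, optimum=1.0):
    """Assumed fitness function: quadratic penalty for distance from the optimum."""
    return -(p - optimum) ** 2


# Hypothetical additive phenotype: each mutation adds one unit.
phenotype = {"00": 0.0, "10": 1.0, "01": 1.0, "11": 2.0}

# Selection favors an intermediate phenotypic value (optimum = 1.0).
fitness = {g: fitness_from_phenotype(p) for g, p in phenotype.items()}

print(pairwise_epistasis(phenotype))  # 0.0  (no epistasis for the phenotype)
print(pairwise_epistasis(fitness))    # -2.0 (epistasis for fitness)
```

In this toy case both single mutants sit exactly at the optimum while the wild type and the double mutant are equally unfit, so each mutation is beneficial alone and deleterious in the presence of the other.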
Affiliation(s)
- Malvika Srivastava
  - Institute of Integrative Biology, ETH Zurich, Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Joshua L. Payne
  - Institute of Integrative Biology, ETH Zurich, Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
77
|
Zhang L, King E, Black WB, Heckmann CM, Wolder A, Cui Y, Nicklen F, Siegel JB, Luo R, Paul CE, Li H. Directed evolution of phosphite dehydrogenase to cycle noncanonical redox cofactors via universal growth selection platform. Nat Commun 2022; 13:5021. [PMID: 36028482 PMCID: PMC9418148 DOI: 10.1038/s41467-022-32727-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 02/11/2022] [Accepted: 08/13/2022] [Indexed: 11/09/2022] Open
Abstract
Noncanonical redox cofactors are attractive low-cost alternatives to nicotinamide adenine dinucleotide (phosphate) (NAD(P)+) in biotransformation. However, engineering enzymes to utilize them is challenging. Here, we present a high-throughput directed evolution platform which couples cell growth to the in vivo cycling of a noncanonical cofactor, nicotinamide mononucleotide (NMN+). We achieve this by engineering the life-essential glutathione reductase in Escherichia coli to exclusively rely on the reduced NMN+ (NMNH). Using this system, we develop a phosphite dehydrogenase (PTDH) to cycle NMN+ with ~147-fold improved catalytic efficiency, which translates to an industrially viable total turnover number of ~45,000 in cell-free biotransformation without requiring high cofactor concentrations. Moreover, the PTDH variants also exhibit improved activity with another structurally deviant noncanonical cofactor, 1-benzylnicotinamide (BNA+), showcasing their broad applications. Structural modeling prediction reveals a general design principle where the mutations and the smaller, noncanonical cofactors together mimic the steric interactions of the larger, natural cofactors NAD(P)+.
Affiliation(s)
- Linyue Zhang
  - Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, CA, 92697, USA
- Edward King
  - Department of Molecular Biology and Biochemistry, University of California Irvine, Irvine, CA, 92697, USA
- William B Black
  - Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, CA, 92697, USA
- Christian M Heckmann
  - Biocatalysis, Department of Biotechnology, Delft University of Technology, 2629 HZ, Delft, Netherlands
- Allison Wolder
  - Biocatalysis, Department of Biotechnology, Delft University of Technology, 2629 HZ, Delft, Netherlands
- Youtian Cui
  - Department of Chemistry, University of California, Davis, One Shields Avenue, Davis, CA, 95616, USA
- Francis Nicklen
  - Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, CA, 92697, USA
- Justin B Siegel
  - Department of Chemistry, University of California, Davis, One Shields Avenue, Davis, CA, 95616, USA
  - Department of Biochemistry and Molecular Medicine, University of California, Davis, 2700 Stockton Boulevard, Suite 2102, Sacramento, CA, 95817, USA
  - Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA, 95616, USA
- Ray Luo
  - Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, CA, 92697, USA
  - Department of Molecular Biology and Biochemistry, University of California Irvine, Irvine, CA, 92697, USA
  - Department of Biomedical Engineering, University of California Irvine, Irvine, CA, 92697, USA
  - Department of Materials Science and Engineering, University of California Irvine, Irvine, CA, 92697, USA
- Caroline E Paul
  - Biocatalysis, Department of Biotechnology, Delft University of Technology, 2629 HZ, Delft, Netherlands
- Han Li
  - Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, CA, 92697, USA
  - Department of Biomedical Engineering, University of California Irvine, Irvine, CA, 92697, USA
|
78
|
Rotrattanadumrong R, Yokobayashi Y. Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning. Nat Commun 2022; 13:4847. [PMID: 35977956 PMCID: PMC9385714 DOI: 10.1038/s41467-022-32538-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/15/2022] [Accepted: 08/03/2022] [Indexed: 11/18/2022] Open
Abstract
A neutral network connects all genotypes with equivalent phenotypes in a fitness landscape and plays an important role in the mutational robustness and evolvability of biomolecules. In contrast to earlier theoretical works, evidence of large neutral networks has been lacking in recent experimental studies of fitness landscapes. This suggests that evolution could be constrained globally. Here, we demonstrate that a deep learning-guided evolutionary algorithm can efficiently identify neutral genotypes within the sequence space of an RNA ligase ribozyme. Furthermore, we measure the activities of all 2^16 variants connecting two active ribozymes that differ by 16 mutations and analyze mutational interactions (epistasis) up to the 16th order. We discover an extensive network of neutral paths linking the two genotypes and reveal that these paths might be predicted using only information from lower-order interactions. Our experimental evaluation of over 120,000 ribozyme sequences provides important empirical evidence that neutral networks can increase the accessibility and predictability of the fitness landscape. Neutral networks, which are sets of genotypes connected via single mutations that share the same phenotype, are important for evolvability. Here, the authors provide experimental evidence of a neutral network in an RNA enzyme using a high-throughput assay and deep learning.
Affiliation(s)
- Rachapun Rotrattanadumrong
  - Nucleic Acid Chemistry and Engineering Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa, 9040495, Japan
- Yohei Yokobayashi
  - Nucleic Acid Chemistry and Engineering Unit, Okinawa Institute of Science and Technology Graduate University, Onna, Okinawa, 9040495, Japan
|
79
|
Yang T, Ye Z, Lynch MD. "Multiagent" Screening Improves Directed Enzyme Evolution by Identifying Epistatic Mutations. ACS Synth Biol 2022; 11:1971-1983. [PMID: 35507897 DOI: 10.1021/acssynbio.2c00136] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Enzyme evolution has enabled numerous advances in biotechnology and synthetic biology, yet still requires many iterative rounds of screening to identify optimal mutant sequences. This is due to the sparsity of the fitness landscape, which is caused by epistatic mutations that only offer improvements when combined with other mutations. We report an approach that incorporates diverse substrate analogues in the screening process, where multiple substrates act like multiple agents navigating the fitness landscape, identifying epistatic mutant residues without a need for testing the entire combinatorial search space. We initially validate this approach by engineering a malonyl-CoA synthetase and identify numerous epistatic mutations improving activity for several diverse substrates. The majority of these mutations would have been missed upon screening for a single substrate alone. We expect that this approach can accelerate a wide array of enzyme engineering programs.
Affiliation(s)
- Tian Yang
  - Department of Biomedical Engineering, Duke University, Durham, North Carolina 27701, United States
- Zhixia Ye
  - Department of Biomedical Engineering, Duke University, Durham, North Carolina 27701, United States
- Michael D. Lynch
  - Department of Biomedical Engineering, Duke University, Durham, North Carolina 27701, United States
|
80
|
Park Y, Metzger BPH, Thornton JW. Epistatic drift causes gradual decay of predictability in protein evolution. Science 2022; 376:823-830. [PMID: 35587978 DOI: 10.1126/science.abn6895] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Indexed: 12/11/2022]
Abstract
Epistatic interactions can make the outcomes of evolution unpredictable, but no comprehensive data are available on the extent and temporal dynamics of changes in the effects of mutations as protein sequences evolve. Here, we use phylogenetic deep mutational scanning to measure the functional effect of every possible amino acid mutation in a series of ancestral and extant steroid receptor DNA binding domains. Across 700 million years of evolution, epistatic interactions caused the effects of most mutations to become decorrelated from their initial effects and their windows of evolutionary accessibility to open and close transiently. Most effects changed gradually and without bias at rates that were largely constant across time, indicating a neutral process caused by many weak epistatic interactions. Our findings show that protein sequences drift inexorably into contingency and unpredictability, but that the process is statistically predictable, given sufficient phylogenetic and experimental data.
Affiliation(s)
- Yeonwoo Park
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Brian P H Metzger
- Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
| | - Joseph W Thornton
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA.,Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA.,Department of Human Genetics, University of Chicago, Chicago, IL, USA
| |
|
81
|
Bakerlee CW, Nguyen Ba AN, Shulgina Y, Rojas Echenique JI, Desai MM. Idiosyncratic epistasis leads to global fitness-correlated trends. Science 2022; 376:630-635. [PMID: 35511982 PMCID: PMC10124986 DOI: 10.1126/science.abm4774] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 12/15/2022]
Abstract
Epistasis can markedly affect evolutionary trajectories. In recent decades, protein-level fitness landscapes have revealed extensive idiosyncratic epistasis among specific mutations. By contrast, other work has found ubiquitous and apparently nonspecific patterns of global diminishing-returns and increasing-costs epistasis among mutations across the genome. Here, we used a hierarchical CRISPR gene drive system to construct all combinations of 10 missense mutations from across the genome in budding yeast and measured their fitness in six environments. We show that the resulting fitness landscapes exhibit global fitness-correlated trends but that these trends emerge from specific idiosyncratic interactions. We thus provide experimental validation of recent theoretical work arguing that fitness-correlated trends can emerge as the generic consequence of idiosyncratic epistasis.
Affiliation(s)
- Christopher W Bakerlee
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.,Quantitative Biology Initiative, Harvard University, Cambridge, MA, USA.,Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
| | - Alex N Nguyen Ba
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.,Quantitative Biology Initiative, Harvard University, Cambridge, MA, USA.,Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada.,Department of Biology, University of Toronto Mississauga, Mississauga, Ontario, Canada
| | - Yekaterina Shulgina
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
| | - Jose I Rojas Echenique
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.,Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Michael M Desai
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.,Quantitative Biology Initiative, Harvard University, Cambridge, MA, USA.,NSF-Simons Center for Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, MA, USA.,Department of Physics, Harvard University, Cambridge, MA, USA
| |
|
82
|
Yang CH, Scarpino SV. A Family of Fitness Landscapes Modeled through Gene Regulatory Networks. Entropy (Basel, Switzerland) 2022; 24:622. [PMID: 35626507 PMCID: PMC9141513 DOI: 10.3390/e24050622] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/02/2021] [Revised: 04/11/2022] [Accepted: 04/26/2022] [Indexed: 02/01/2023]
Abstract
Fitness landscapes are a powerful metaphor for understanding the evolution of biological systems. These landscapes describe how genotypes are connected to each other through mutation and related through fitness. Empirical studies of fitness landscapes have increasingly revealed conserved topographical features across diverse taxa, e.g., the accessibility of genotypes and "ruggedness". As a result, theoretical studies are needed to investigate how evolution proceeds on fitness landscapes with such conserved features. Here, we develop and study a model of evolution on fitness landscapes using the lens of Gene Regulatory Networks (GRNs), where the regulatory products are computed from multiple genes and collectively treated as phenotypes. With the assumption that regulation is a binary process, we prove the existence of empirically observed, topographical features such as accessibility and connectivity. We further show that these results hold across arbitrary fitness functions and that a trade-off between accessibility and ruggedness need not exist. Then, using graph theory and a coarse-graining approach, we deduce a mesoscopic structure underlying GRN fitness landscapes where the information necessary to predict a population's evolutionary trajectory is retained with minimal complexity. Using this coarse-graining, we develop a bottom-up algorithm to construct such mesoscopic backbones, which does not require computing the genotype network and is therefore far more efficient than brute-force approaches. Altogether, this work provides mathematical results of high-dimensional fitness landscapes and a path toward connecting theory to empirical studies.
Affiliation(s)
- Chia-Hung Yang
- Network Science Institute, Northeastern University, Boston, MA 02115, USA
| | - Samuel V. Scarpino
- Network Science Institute, Northeastern University, Boston, MA 02115, USA
- Physics Department, Northeastern University, Boston, MA 02115, USA
- Roux Institute, Northeastern University, Boston, MA 02115, USA
- Institute for Experiential AI, Northeastern University, Boston, MA 02115, USA
- Santa Fe Institute, Santa Fe, NM 87501, USA
- Vermont Complex Systems Center, University of Vermont, Burlington, VT 05405, USA
| |
|
83
|
Hayes RL, Vilseck JZ, Brooks CL. Addressing Intersite Coupling Unlocks Large Combinatorial Chemical Spaces for Alchemical Free Energy Methods. J Chem Theory Comput 2022; 18:2114-2123. [PMID: 35255214 PMCID: PMC9700482 DOI: 10.1021/acs.jctc.1c00948] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 11/30/2022]
Abstract
Alchemical free energy methods are playing a growing role in molecular design, both for computer-aided drug design of small molecules and for computational protein design. Multisite λ dynamics (MSλD) is a uniquely scalable alchemical free energy method that enables more efficient exploration of combinatorial alchemical spaces encountered in molecular design, but simulations have typically been limited to a few hundred ligands or sequences. Here, we focus on coupling between sites to enable scaling to larger alchemical spaces. We first discuss updates to the biasing potentials that facilitate MSλD sampling to include coupling terms and show that this can provide more thorough sampling of alchemical states. We then harness coupling between sites by developing a new free energy estimator based on the Potts models underlying direct coupling analysis, a method for predicting contacts from sequence coevolution, and find it yields more accurate free energies than previous estimators. The sampling requirements of the Potts model estimator scale with the square of the number of sites, a substantial improvement over the exponential scaling of the standard estimator. This opens up exploration of much larger alchemical spaces with MSλD for molecular design.
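For orientation, the scaling contrast stated in this abstract can be made concrete with the generic Potts form used in direct coupling analysis. This is a sketch only: the abstract does not give the estimator's exact parameterization, and the symbols N (number of sites), q (substituents per site, assumed equal across sites), h_i, and J_ij are illustrative assumptions.

```latex
\[
E(s_1,\dots,s_N) \;=\; \sum_{i=1}^{N} h_i(s_i) \;+\; \sum_{1 \le i < j \le N} J_{ij}(s_i, s_j)
\]
\[
\#\text{parameters} \;=\; Nq + \binom{N}{2} q^{2} \;=\; O(N^{2} q^{2}),
\qquad
\#\text{states} \;=\; q^{N}
\]
```

Under these assumptions, a model with only single-site and pairwise terms has a parameter count that grows quadratically in N, whereas an estimator that treats every combined state separately faces q^N states.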
Affiliation(s)
- Ryan L Hayes
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Biophysics Program, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Jonah Z Vilseck
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Charles L Brooks
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Biophysics Program, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
84
|
Wittmann BJ, Johnston KE, Almhjell PJ, Arnold FH. evSeq: Cost-Effective Amplicon Sequencing of Every Variant in a Protein Library. ACS Synth Biol 2022; 11:1313-1324. [PMID: 35172576 DOI: 10.1021/acssynbio.1c00592] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 12/12/2022]
Abstract
Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing are thus unjustified. It also results from the fact that, even though many lower-cost sequencing strategies have been developed, they often require at least some access to and experience with sequencing or computational resources, both of which can be barriers to access. Here, we present every variant sequencing (evSeq), a method and collection of tools/standardized components for sequencing a variable region within every variant gene produced during a protein engineering campaign at a cost of cents per variant. evSeq was designed to democratize low-cost sequencing for protein engineers and, indeed, anyone interested in engineering biological systems. Execution of its wet-lab component is simple, requires no sequencing experience to perform, relies only on resources and services typically available to biology labs, and slots neatly into existing protein engineering workflows. Analysis of evSeq data is likewise made simple by its accompanying software (found at github.com/fhalab/evSeq, documentation at fhalab.github.io/evSeq), which can be run on a personal laptop and was designed to be accessible to users with no computational experience. Low-cost and easy-to-use, evSeq makes the collection of extensive protein variant sequence-fitness data practical.
Affiliation(s)
- Bruce J. Wittmann
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Kadina E. Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Patrick J. Almhjell
- Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
- Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| |
|
85
|
Scheele RA, Lindenburg LH, Petek M, Schober M, Dalby KN, Hollfelder F. Droplet-based screening of phosphate transfer catalysis reveals how epistasis shapes MAP kinase interactions with substrates. Nat Commun 2022; 13:844. [PMID: 35149678 PMCID: PMC8837617 DOI: 10.1038/s41467-022-28396-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/07/2021] [Accepted: 01/10/2022] [Indexed: 11/20/2022]
Abstract
The combination of ultrahigh-throughput screening and sequencing informs on function and intragenic epistasis within combinatorial protein mutant libraries. Establishing a droplet-based, in vitro compartmentalised approach for robust expression and screening of protein kinase cascades (>10^7 variants/day) allowed us to dissect the intrinsic molecular features of the MKK-ERK signalling pathway, without interference from endogenous cellular components. In a six-residue combinatorial library of the MKK1 docking domain, we identified 29,563 sequence permutations that allow MKK1 to efficiently phosphorylate and activate its downstream target kinase ERK2. A flexibly placed hydrophobic sequence motif emerges which is defined by higher order epistatic interactions between six residues, suggesting synergy that enables high connectivity in the sequence landscape. Through positive epistasis, MKK1 maintains function during mutagenesis, establishing the importance of co-dependent residues in mammalian protein kinase-substrate interactions, and creating a scenario for the evolution of diverse human signalling networks. Here, the authors use a droplet-based screen for phosphate transfer catalysis, testing variants of the human protein kinase MKK1 for its ability to activate its downstream target ERK2. Data reveal a flexible motif in the MKK1 docking domain that promotes efficient activation of ERK2, and suggest epistasis between the residues within that sequence.
Affiliation(s)
- Remkes A Scheele
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
| | | | - Maya Petek
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK.,Faculty of Medicine, University of Maribor, SI-2000, Maribor, Slovenia
| | - Markus Schober
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
| | - Kevin N Dalby
- Division of Chemical Biology and Medicinal Chemistry, The University of Texas at Austin, Austin, TX, 78712, USA
| | - Florian Hollfelder
- Department of Biochemistry, University of Cambridge, Cambridge, CB2 1GA, UK
| |
|
86
|
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022; 40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 82] [Impact Index Per Article: 27.3] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]
Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
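The combination described in this abstract (ridge regression on site-specific amino acid features plus one evolutionary density feature) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the one-hot encoding, the uniform penalty `alpha` on all features, the absence of an intercept, and the toy density scores are all assumptions made for the example. In practice the density score would come from a separately trained evolutionary model.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Site-specific features: one indicator per (position, amino acid)."""
    x = np.zeros(len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        x[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return x


def build_features(seqs, density_scores):
    """Append one density feature (hypothetical scores) to the one-hot block."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge solution w = (X^T X + alpha I)^{-1} X^T y.

    Assumption: one shared penalty for every feature and no intercept.
    """
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ np.asarray(y, dtype=float))


def predict(X, w):
    return X @ w
```

A usage sketch: build `X = build_features(train_seqs, train_density)`, fit `w = fit_ridge(X, train_fitness)`, then score new variants with `predict(build_features(new_seqs, new_density), w)`.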
Affiliation(s)
- Chloe Hsu
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
| | - Hunter Nisonoff
- Center for Computational Biology, University of California, Berkeley, USA
| | - Clara Fannjiang
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA.,Center for Computational Biology, University of California, Berkeley, USA
| |
|
87
|
Peri G, Gibard C, Shults NH, Crossin K, Hayden EJ. Dynamic RNA fitness landscapes of a group I ribozyme during changes to the experimental environment. Mol Biol Evol 2022; 39:6502289. [PMID: 35020916 PMCID: PMC8890501 DOI: 10.1093/molbev/msab373] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
Fitness landscapes of protein and RNA molecules can be studied experimentally using high-throughput techniques to measure the functional effects of numerous combinations of mutations. The rugged topography of these molecular fitness landscapes is important for understanding and predicting natural and experimental evolution. Mutational effects are also dependent upon environmental conditions, but the effects of environmental changes on fitness landscapes remain poorly understood. Here, we investigate the changes to the fitness landscape of a catalytic RNA molecule while changing a single environmental variable that is critical for RNA structure and function. Using high-throughput sequencing of in vitro selections, we mapped a fitness landscape of the Azoarcus group I ribozyme under eight different concentrations of magnesium ions (1–48 mM MgCl2). The data revealed the magnesium dependence of 16,384 mutational neighbors, and from this, we investigated the magnesium-induced changes to the topography of the fitness landscape. The results showed that increasing magnesium concentration improved the relative fitness of sequences at higher mutational distances while also reducing the ruggedness of the mutational trajectories on the landscape. As a result, as magnesium concentration was increased, simulated populations evolved toward higher fitness faster. Curve-fitting of the magnesium dependence of individual ribozymes demonstrated that deep sequencing of in vitro reactions can be used to evaluate the structural stability of thousands of sequences in parallel. Overall, the results highlight how environmental changes that stabilize structures can also alter the ruggedness of fitness landscapes and alter evolutionary processes.
Affiliation(s)
- Gianluca Peri
- Biomolecular Sciences Graduate Programs, Boise State University, Boise, ID, USA
| | - Clémentine Gibard
- Department of Biological Science, Boise State University, Boise, ID, USA
| | - Nicholas H Shults
- Department of Biological Science, Boise State University, Boise, ID, USA
| | - Kent Crossin
- Department of Biological Science, Boise State University, Boise, ID, USA
| | - Eric J Hayden
- Biomolecular Sciences Graduate Programs, Boise State University, Boise, ID, USA.,Department of Biological Science, Boise State University, Boise, ID, USA
| |
|
88
|
On the sparsity of fitness functions and implications for learning. Proc Natl Acad Sci U S A 2022; 119:2109649118. [PMID: 34937698 PMCID: PMC8740588 DOI: 10.1073/pnas.2109649118] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Accepted: 11/11/2021] [Indexed: 01/05/2023]
Abstract
The properties of proteins and other biological molecules are encoded in large part in the sequence of amino acids or nucleotides that defines them. Increasingly, researchers estimate functions that map sequences to a particular property using machine learning and related statistical approaches. However, an important question remains unanswered: How many experimental measurements are needed in order to accurately learn these “fitness” functions? We leverage perspectives from the fields of biophysics, evolutionary biology, and signal processing to develop a theoretical framework that enables us to make progress on answering this question. We demonstrate that this framework can be used to make useful calculations on real-world data and suggest how these calculations may be used to guide experiments. Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. 
In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model’s interpretable parameters—sequence length, alphabet size, and assumed interactions between sequence positions—on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
|
89
|
Adaptive machine learning for protein engineering. Curr Opin Struct Biol 2021; 72:145-152. [PMID: 34896756 DOI: 10.1016/j.sbi.2021.11.002] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 06/09/2021] [Revised: 09/14/2021] [Accepted: 11/08/2021] [Indexed: 11/22/2022]
Abstract
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
|
90
|
Schweizer G, Wagner A. Both Binding Strength and Evolutionary Accessibility Affect the Population Frequency of Transcription Factor Binding Sequences in Arabidopsis thaliana. Genome Biol Evol 2021; 13:6459646. [PMID: 34894231 PMCID: PMC8712246 DOI: 10.1093/gbe/evab273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/06/2021] [Indexed: 11/22/2022]
Abstract
Mutations in DNA sequences that bind transcription factors and thus modulate gene expression are a source of adaptive variation in gene expression. To understand how transcription factor binding sequences evolve in natural populations of the thale cress Arabidopsis thaliana, we integrated genomic polymorphism data for loci bound by transcription factors with in vitro data on binding affinity for these transcription factors. Specifically, we studied 19 different transcription factors, and the allele frequencies of 8,333 genomic loci bound in vivo by these transcription factors in 1,135 A. thaliana accessions. We find that transcription factor binding sequences show very low genetic diversity, suggesting that they are subject to purifying selection. High frequency alleles of such binding sequences tend to bind transcription factors strongly. Conversely, alleles that are absent from the population tend to bind them weakly. In addition, alleles with high frequencies also tend to be the endpoints of many accessible evolutionary paths leading to these alleles. We show that both high affinity and high evolutionary accessibility contribute to high allele frequency for at least some transcription factors. Although binding sequences with stronger affinity are more frequent, we did not find them to be associated with higher gene expression levels. Epistatic interactions among individual mutations that alter binding affinity are pervasive and can help explain variation in accessibility among binding sequences. In summary, combining in vitro binding affinity data with in vivo binding sequence data can help understand the forces that affect the evolution of transcription factor binding sequences in natural populations.
Affiliation(s)
- Gabriel Schweizer
- Department of Evolutionary Biology and Environmental Studies, University of Zürich, Switzerland.,Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland
| | - Andreas Wagner
- Department of Evolutionary Biology and Environmental Studies, University of Zürich, Switzerland.,Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, Lausanne, Switzerland.,Santa Fe Institute, Santa Fe, New Mexico, USA.,Stellenbosch Institute for Advanced Study (STIAS), Wallenberg Research Centre at Stellenbosch University, South Africa
| |
|
91
|
Abstract
Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) by expensive and time-consuming screening or selection of a large mutational sequence space. Machine learning-assisted directed evolution (MLDE), which screens sequence properties in silico, can accelerate the optimization and reduce the experimental burden. This work introduces an MLDE framework, cluster learning-assisted directed evolution (CLADE), that combines hierarchical unsupervised clustering sampling and supervised learning to guide protein engineering. The clustering sampling selectively picks and screens variants in targeted subspaces, which guides the subsequent generation of diverse training sets. In the last stage, accurate predictions via supervised learning models improve final outcomes. By sequentially screening 480 sequences out of 160,000 in a four-site combinatorial library with five equal experimental batches, CLADE achieves a global maximal fitness hit rate of up to 91.0% and 34.0% for the GB1 and PhoQ datasets, respectively, improved from 18.6% and 7.2% obtained by random-sampling-based MLDE.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Jian Hu
- Department of Chemistry, Michigan State University, MI, 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
| |
|
92
|
Chu HY, Wong ASL. Facilitating Machine Learning-Guided Protein Engineering with Smart Library Design and Massively Parallel Assays. Advanced Genetics (Hoboken, N.J.) 2021; 2:2100038. [PMID: 36619853 PMCID: PMC9744531 DOI: 10.1002/ggn2.202100038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Revised: 11/08/2021] [Indexed: 01/11/2023]
Abstract
Protein design plays an important role in recent medical advances from antibody therapy to vaccine design. Typically, exhaustive mutational screens or directed evolution experiments are used to identify the best design or to improve on the wild-type variant. Even with high-throughput screening on pooled libraries and Next-Generation Sequencing to boost the scale of read-outs, surveying all the variants with combinatorial mutations for their empirical fitness scores is still orders of magnitude beyond the capacity of existing experimental settings. To tackle this challenge, in silico approaches using machine learning to predict the fitness of novel variants based on a subset of empirical measurements are now employed. These machine learning models turn out to be useful in many cases, with the premise that the experimentally determined fitness scores and the amino-acid descriptors of the models are informative. The machine learning models can guide the search for the highest fitness variants, resolve complex epistatic relationships, and highlight bio-physical rules for protein folding. Using machine learning-guided approaches, researchers can build more focused libraries, thus relieving themselves from labor-intensive screens and fast-tracking the optimization process. Here, we describe the current advances in massive-scale variant screens, and how machine learning and mutagenesis strategies can be integrated to accelerate protein engineering. More specifically, we examine strategies to make screens more economical, informative, and effective in the discovery of useful variants.
Affiliation(s)
- Hoi Yee Chu
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
| | - Alan S. L. Wong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
- Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong 852, China
| |
|
93
|
Saito Y, Oikawa M, Sato T, Nakazawa H, Ito T, Kameda T, Tsuda K, Umetsu M. Machine-Learning-Guided Library Design Cycle for Directed Evolution of Enzymes: The Effects of Training Data Composition on Sequence Space Exploration. ACS Catal 2021. [DOI: 10.1021/acscatal.1c03753] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/02/2023]
Affiliation(s)
- Yutaka Saito
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
  - Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo 103-0027, Japan
- Misaki Oikawa
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, 6-6-11 Aoba, Aramaki, Aoba-ku, Sendai 980-8579, Japan
- Takumi Sato
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, 6-6-11 Aoba, Aramaki, Aoba-ku, Sendai 980-8579, Japan
- Hikaru Nakazawa
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, 6-6-11 Aoba, Aramaki, Aoba-ku, Sendai 980-8579, Japan
- Tomoyuki Ito
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, 6-6-11 Aoba, Aramaki, Aoba-ku, Sendai 980-8579, Japan
- Tomoshi Kameda
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo 103-0027, Japan
- Koji Tsuda
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan
  - Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo 103-0027, Japan
  - Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki 305-0047, Japan
- Mitsuo Umetsu
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, 6-6-11 Aoba, Aramaki, Aoba-ku, Sendai 980-8579, Japan
  - Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo 103-0027, Japan
94
Wittmann BJ, Yue Y, Arnold FH. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst 2021; 12:1026-1045.e7. [PMID: 34416172 DOI: 10.1016/j.cels.2021.07.008] [Citation(s) in RCA: 93] [Impact Index Per Article: 23.3] [Received: 12/15/2020] [Revised: 05/06/2021] [Accepted: 07/26/2021] [Indexed: 11/17/2022]
Abstract
Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified; that is, the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative "holes" (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper's transparent peer review process is included in the supplemental information.
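The path dependence of a single-step greedy walk can be illustrated with a minimal sketch. Everything here is a hypothetical toy: the two-letter alphabet, the two-site landscape, and the fitness values in `FITNESS` are invented for illustration and are not taken from the paper. The sketch only shows how fixing the best mutation site by site can end at different variants depending on the order in which sites are visited, whereas scoring the full combinatorial library does not depend on order.

```python
from itertools import product

# Hypothetical toy landscape (assumption, not data from the paper):
# two sites, two possible residues per site, with epistasis between sites.
ALPHABET = "AB"
FITNESS = {"AA": 1.0, "BA": 0.5, "AB": 1.5, "BB": 3.0}


def greedy_walk(start, site_order):
    """Single-step greedy walk: visit sites in the given order and fix
    the residue that gives the highest fitness at each visited site."""
    current = start
    for site in site_order:
        candidates = [current[:site] + aa + current[site + 1:] for aa in ALPHABET]
        current = max(candidates, key=FITNESS.__getitem__)
    return current


def global_optimum():
    """Exhaustively score the full combinatorial library (order independent)."""
    library = ("".join(p) for p in product(ALPHABET, repeat=2))
    return max(library, key=FITNESS.__getitem__)


if __name__ == "__main__":
    print(greedy_walk("AA", (0, 1)))  # walk that mutates site 0 first
    print(greedy_walk("AA", (1, 0)))  # walk that mutates site 1 first
    print(global_optimum())
```

On this toy landscape the walk that starts at site 0 stalls at a local optimum, because the mutation at site 0 is only beneficial once site 1 has changed. A model that scores every combination at once is not constrained by the order of the walk.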
Affiliation(s)
- Bruce J Wittmann
  - Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA
- Yisong Yue
  - Department of Computing and Mathematical Sciences, California Institute of Technology, MC 305-16, 1200 E. California Blvd., Pasadena, CA 91125, USA
- Frances H Arnold
  - Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA
95
Thomas N, Colwell LJ. Minding the gaps: The importance of navigating holes in protein fitness landscapes. Cell Syst 2021; 12:1019-1020. [PMID: 34793698 DOI: 10.1016/j.cels.2021.10.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Machine-learning-guided protein design is rapidly emerging as a strategy to find high-fitness multi-mutant variants. In this issue of Cell Systems, Wittmann et al. analyze the impact of design decisions for machine-learning-assisted directed evolution (MLDE) on its ability to navigate a fitness landscape and reliably find global optima.
Affiliation(s)
- Neil Thomas
  - Computer Science Division, University of California, Berkeley, Berkeley, CA 94720, USA
- Lucy J Colwell
  - Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
  - Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
96
Sesta L, Uguzzoni G, Fernandez-de-Cossio-Diaz J, Pagnani A. AMaLa: Analysis of Directed Evolution Experiments via Annealed Mutational Approximated Landscape. Int J Mol Sci 2021; 22:10908. [PMID: 34681569 PMCID: PMC8535593 DOI: 10.3390/ijms222010908] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/28/2021] [Revised: 09/24/2021] [Accepted: 09/27/2021] [Indexed: 01/12/2023]
Abstract
We present Annealed Mutational approximated Landscape (AMaLa), a new method to infer fitness landscapes from the sequencing data of Directed Evolution experiments. Such experiments typically start from a single wild-type sequence, which undergoes Darwinian in vitro evolution via multiple rounds of mutation and selection for a target phenotype. In recent years, Directed Evolution has emerged as a powerful instrument to probe fitness landscapes under controlled experimental conditions and, thanks to high-throughput screening and sequencing, as a relevant testing ground for developing accurate statistical models and inference algorithms. Fitness landscape modeling either uses the enrichment of variant abundances as input, thus requiring the observation of the same variants at different rounds, or assumes that the last sequenced round is sampled from an equilibrium distribution. AMaLa aims to leverage the information encoded in the whole time evolution. To do so, while assuming statistical sampling independence between sequenced rounds, the possible trajectories in sequence space are gauged with a time-dependent statistical weight consisting of two contributions: (i) an energy term accounting for the selection process and (ii) a generalized Jukes-Cantor model for the purely mutational step. This simple scheme accurately describes the Directed Evolution dynamics and infers a fitness landscape that correctly reproduces the measures of the phenotype under selection (e.g., antibiotic drug resistance), notably outperforming widely used inference strategies. In addition, we assess the reliability of AMaLa by showing how the inferred statistical model could be used to predict relevant structural properties of the wild-type sequence.
Affiliation(s)
- Luca Sesta
  - Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
- Guido Uguzzoni
  - Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
- Jorge Fernandez-de-Cossio-Diaz
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 & PSL Research, Sorbonne Université, 24 rue Lhomond, 75005 Paris, France
  - Center of Molecular Immunology, Systems Biology Department, Playa, Havana CP 11600, Cuba
- Andrea Pagnani
  - Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo, Italy
  - INFN, Sezione di Torino, I-10125 Torino, Italy
97
Epistasis shapes the fitness landscape of an allosteric specificity switch. Nat Commun 2021; 12:5562. [PMID: 34548494 PMCID: PMC8455584 DOI: 10.1038/s41467-021-25826-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/29/2020] [Accepted: 09/03/2021] [Indexed: 11/08/2022]
Abstract
Epistasis is a major determinant in the emergence of novel protein function. In allosteric proteins, direct interactions between inducer-binding mutations propagate through the allosteric network, manifesting as epistasis at the level of biological function. Elucidating this relationship between local interactions and their global effects is essential to understanding the evolution of allosteric proteins. We integrate computational design with structural and biophysical analysis to characterize the emergence of novel inducer specificity in an allosteric transcription factor. Adaptive landscapes of different inducers of the designed mutant show that a few strong epistatic interactions constrain the number of viable sequence pathways, revealing ridges in the fitness landscape leading to new specificity. The structure of the designed mutant shows that a striking change in inducer orientation still retains allosteric function. Comparing biophysical and functional properties suggests a nonlinear relationship between inducer binding affinity and allostery. Our results highlight the functional and evolutionary complexity of allosteric proteins. Epistasis plays an important role in the evolution of novel protein functions because it determines the mutational path a protein takes. Here, the authors combine functional, structural and biophysical analyses to characterize epistasis in a computationally redesigned ligand-inducible allosteric transcription factor and find that epistasis creates distinct biophysical and biological functional landscapes.
98
Evolution-aided engineering of plant specialized metabolism. ABIOTECH 2021; 2:240-263. [PMID: 36303885 PMCID: PMC9590541 DOI: 10.1007/s42994-021-00052-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/05/2021] [Accepted: 06/04/2021] [Indexed: 02/07/2023]
Abstract
The evolution of new traits in living organisms occurs via the processes of mutation, recombination, genetic drift, and selection. These processes that have resulted in the immense biological diversity on our planet are also being employed in metabolic engineering to optimize enzymes and pathways, create new-to-nature reactions, and synthesize complex natural products in heterologous systems. In this review, we discuss two evolution-aided strategies for metabolic engineering: directed evolution, which improves upon existing genetic templates using the evolutionary process, and combinatorial pathway reconstruction, which brings together genes evolved in different organisms into a single heterologous host. We discuss the general principles of these strategies, describe the technologies involved and the molecular traits they influence, provide examples of their use, and discuss the roadblocks that need to be addressed for their wider adoption. A better understanding of these strategies can provide an impetus to research on gene function discovery and biochemical evolution, which is foundational for improved metabolic engineering. These evolution-aided approaches thus have a substantial potential for improving our understanding of plant metabolism in general, for enhancing the production of plant metabolites, and in sustainable agriculture.
99
Aghazadeh A, Nisonoff H, Ocal O, Brookes DH, Huang Y, Koyluoglu OO, Listgarten J, Ramchandran K. Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat Commun 2021; 12:5225. [PMID: 34471113 PMCID: PMC8410946 DOI: 10.1038/s41467-021-25371-3] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 12/28/2020] [Accepted: 07/27/2021] [Indexed: 11/18/2022]
Abstract
Despite recent advances in high-throughput combinatorial mutagenesis assays, the number of labeled sequences available to predict molecular functions has remained small for the vastness of the sequence space combined with the ruggedness of many fitness functions. While deep neural networks (DNNs) can capture high-order epistatic interactions among the mutational sites, they tend to overfit to the small number of labeled sequences available for training. Here, we developed Epistatic Net (EN), a method for spectral regularization of DNNs that exploits evidence that epistatic interactions in many fitness functions are sparse. We built a scalable extension of EN, usable for larger sequences, which enables spectral regularization using fast sparse recovery algorithms informed by coding theory. Results on several biological landscapes show that EN consistently improves the prediction accuracy of DNNs and enables them to outperform competing models which assume other priors. EN estimates the higher-order epistatic interactions of DNNs trained on massive sequence spaces, a computational problem that otherwise takes years to solve.
Affiliation(s)
- Amirali Aghazadeh
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- Orhan Ocal
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- David H Brookes
  - Biophysics Graduate Group, University of California, Berkeley, CA, USA
- Yijie Huang
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- O Ozan Koyluoglu
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
  - Center for Computational Biology, Berkeley, CA, USA
- Kannan Ramchandran
  - Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA
100
Morrison AJ, Wonderlick DR, Harms MJ. Ensemble epistasis: thermodynamic origins of nonadditivity between mutations. Genetics 2021; 219:iyab105. [PMID: 34849909 PMCID: PMC8633102 DOI: 10.1093/genetics/iyab105] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/06/2021] [Accepted: 06/19/2021] [Indexed: 01/02/2023]
Abstract
Epistasis, in which mutations combine nonadditively, is a profoundly important aspect of biology. It is often difficult to understand its mechanistic origins. Here, we show that epistasis can arise from the thermodynamic ensemble, or the set of interchanging conformations a protein adopts. Ensemble epistasis occurs because mutations can have different effects on different conformations of the same protein, leading to nonadditive effects on its average, observable properties. Using a simple analytical model, we found that ensemble epistasis arises when two conditions are met: (1) a protein populates at least three conformations and (2) mutations have differential effects on at least two conformations. To explore the relative magnitude of ensemble epistasis, we performed a virtual deep-mutational scan of the allosteric Ca²⁺ signaling protein S100A4. We found that 47% of mutation pairs exhibited ensemble epistasis with a magnitude on the order of thermal fluctuations. We observed many forms of epistasis: magnitude, sign, and reciprocal sign epistasis. The same mutation pair could even exhibit different forms of epistasis under different environmental conditions. The ubiquity of thermodynamic ensembles in biology and the pervasiveness of ensemble epistasis in our dataset suggest that it may be a common mechanism of epistasis in proteins and other macromolecules.
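The two conditions stated in this abstract can be checked numerically with a minimal sketch. This is an illustrative toy, not the authors' model: the function names, the choice of observable (free energy of one target conformation relative to the Boltzmann-weighted rest of the ensemble), the energy values, and the value of `RT` are all assumptions. Each mutation is assumed to add a fixed energy to each conformation, so any nonadditivity in the observable comes only from ensemble averaging.

```python
import math

RT = 0.593  # kcal/mol at roughly 298 K (assumed value for illustration)


def observable_dG(G_target, G_others):
    """Free energy of the target conformation relative to the
    Boltzmann-weighted ensemble of all other conformations."""
    return G_target + RT * math.log(sum(math.exp(-g / RT) for g in G_others))


def epistasis(G_wt, dG_a, dG_b):
    """Epistasis in the observable for mutations a and b.

    G_wt lists conformation energies (index 0 is the target conformation);
    dG_a and dG_b list each mutation's additive effect on every conformation.
    """
    def obs(G):
        return observable_dG(G[0], G[1:])

    G_a = [g + a for g, a in zip(G_wt, dG_a)]
    G_b = [g + b for g, b in zip(G_wt, dG_b)]
    G_ab = [g + a + b for g, a, b in zip(G_wt, dG_a, dG_b)]
    # Effect of a in the b background minus effect of a in the wild-type background.
    return (obs(G_ab) - obs(G_b)) - (obs(G_a) - obs(G_wt))


if __name__ == "__main__":
    # Two conformations: the observable is linear in the energies.
    print(epistasis([0.0, 1.0], [0.0, 2.0], [0.0, 1.0]))
    # Three conformations, mutations act differently on the two non-target ones.
    print(epistasis([0.0, 1.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]))
    # Three conformations, but each mutation shifts both non-target ones equally.
    print(epistasis([0.0, 1.0, 1.0], [0.0, 2.0, 2.0], [0.0, 1.0, 1.0]))
```

Under these assumptions, epistasis vanishes (up to floating-point error) with two conformations or when a mutation shifts the non-target conformations equally, and it is clearly nonzero only when three conformations are populated and the mutations act on them differently.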
Affiliation(s)
- Anneliese J Morrison
  - Institute of Molecular Biology, University of Oregon, Eugene, OR 97403, USA
  - Department of Chemistry and Biochemistry, University of Oregon, Eugene, OR 97403, USA
- Daria R Wonderlick
  - Institute of Molecular Biology, University of Oregon, Eugene, OR 97403, USA
  - Department of Chemistry and Biochemistry, University of Oregon, Eugene, OR 97403, USA
- Michael J Harms
  - Institute of Molecular Biology, University of Oregon, Eugene, OR 97403, USA
  - Department of Chemistry and Biochemistry, University of Oregon, Eugene, OR 97403, USA