101
Di Gioacchino A, Procyk J, Molari M, Schreck JS, Zhou Y, Liu Y, Monasson R, Cocco S, Šulc P. Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection. PLoS Comput Biol 2022; 18:e1010561. [PMID: 36174101 PMCID: PMC9553063 DOI: 10.1371/journal.pcbi.1010561] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/23/2022] [Revised: 10/11/2022] [Accepted: 09/12/2022] [Indexed: 12/03/2022] Open
Abstract
Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use RBM models to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM’s performance with different supervised learning approaches that include random forests and several deep neural network architectures.
We show that two-layer neural networks, Restricted Boltzmann Machines (RBMs), can be successfully trained on sequence ensemble datasets from selection-amplification experiments. We train the RBM using datasets from aptamer selection experiments on thrombin protein, and show that the model can successfully generalize to the test set to predict binders and non-binders. The log-likelihood assigned to a sequence by the RBM is correlated with the sequence fitness as quantified by the amplification between different rounds of selection. We further show that the model is interpretable: by inspecting its weights, we can identify structural motifs that are characteristic of the good binders. We explore the usage of the RBMs to identify which of the possible protein exosites the aptamers bind to. We show that the RBM can also be used for unsupervised clustering. Finally, we use RBMs to generate novel aptamers, and we experimentally verify predicted binding and non-binding sequences generated from the RBM.
Affiliation(s)
- Andrea Di Gioacchino
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- Jonah Procyk
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
- Marco Molari
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- John S. Schreck
  - National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, Colorado, United States of America
- Yu Zhou
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
- Yan Liu
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
- Rémi Monasson
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
  - * E-mail: (RM); (SC); (PŠ)
- Simona Cocco
  - Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
  - * E-mail: (RM); (SC); (PŠ)
- Petr Šulc
  - School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
  - * E-mail: (RM); (SC); (PŠ)
102
Colberg M, Schofield J. Configurational entropy, transition rates, and optimal interactions for rapid folding in coarse-grained model proteins. J Chem Phys 2022; 157:125101. [PMID: 36182418 DOI: 10.1063/5.0098612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Under certain conditions, the dynamics of coarse-grained models of solvated proteins can be described using a Markov state model, which tracks the evolution of populations of configurations. The transition rates among states that appear in the Markov model can be determined by computing the relative entropy of states and their mean first passage times. In this paper, we present an adaptive method to evaluate the configurational entropy and the mean first passage times for linear chain models with discontinuous potentials. The approach is based on event-driven dynamical sampling in a massively parallel architecture. Using the fact that the transition rate matrix can be calculated for any choice of interaction energies at any temperature, it is demonstrated how each state's energy can be chosen such that the average time to transition between any two states is minimized. The methods are used to analyze the optimization of the folding process of two protein systems: the crambin protein and a model with frustration and misfolding. It is shown that the folding pathways for both systems are comprised of two regimes: first, the rapid establishment of local bonds, followed by the subsequent formation of more distant contacts. The state energies that lead to the most rapid folding encourage multiple pathways, and they either penalize folding pathways through kinetic traps by raising the energies of trapping states or establish an escape route from the trapping states by lowering free energy barriers to other states that rapidly reach the native state.
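As a rough illustration of how mean first passage times relate to a transition rate matrix in a Markov state model, the following minimal sketch solves the standard first-step equations by fixed-point iteration. It is not the event-driven sampling method of the paper; the function name, the `rates[i][j]` layout, and the three-state example are assumptions made for illustration only.

```python
def mean_first_passage_times(rates, target, tol=1e-12, max_iter=100_000):
    """Mean first passage time from every state to `target`.

    rates[i][j] is the (hypothetical) transition rate from state i to
    state j; diagonal entries are ignored.
    """
    n = len(rates)
    tau = [0.0] * n
    for _ in range(max_iter):
        new_tau = [0.0] * n
        for i in range(n):
            if i == target:
                continue  # tau[target] is zero by definition
            escape = sum(rates[i][j] for j in range(n) if j != i)
            if escape == 0.0:
                raise ValueError(f"state {i} has no outgoing transitions")
            # First-step analysis: wait 1/escape on average, then jump to
            # state j with probability rates[i][j] / escape.
            new_tau[i] = (1.0 + sum(rates[i][j] * tau[j]
                                    for j in range(n) if j != i)) / escape
        if max(abs(a - b) for a, b in zip(new_tau, tau)) < tol:
            return new_tau
        tau = new_tau
    raise RuntimeError("fixed-point iteration did not converge")


# Toy linear chain 0 <-> 1 -> 2 with unit rates; state 2 plays the role
# of the folded (absorbing) state.
toy_rates = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
toy_tau = mean_first_passage_times(toy_rates, target=2)
```

For this toy chain the first-step equations read tau_0 = 1 + tau_1 and tau_1 = (1 + tau_0) / 2, so the iteration should approach tau_0 = 3 and tau_1 = 2.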
Affiliation(s)
- Margarita Colberg
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
- Jeremy Schofield
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
103
Kim D, Noh MH, Park M, Kim I, Ahn H, Ye DY, Jung GY, Kim S. Enzyme activity engineering based on sequence co-evolution analysis. Metab Eng 2022; 74:49-60. [PMID: 36113751 DOI: 10.1016/j.ymben.2022.09.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/17/2022] [Revised: 08/31/2022] [Accepted: 09/05/2022] [Indexed: 11/17/2022]
Abstract
The utility of engineering enzyme activity is expanding with the development of biotechnology. Conventional methods have limited applicability as they require high-throughput screening or three-dimensional structures to direct target residues of activity control. An alternative approach exploits sequence evolution under natural selection. During evolution, a repertoire of mutations was selected to fine-tune enzyme activities for adaptation to varying environments. Here, we devised a strategy called sequence co-evolutionary analysis to control the efficiency of enzyme reactions (SCANEER), which scans the evolution of protein sequences and directs the mutation strategy to improve enzyme activity. We hypothesized that amino acid pairs for various enzyme activities were encoded in the evolutionary history of protein sequences, whereas loss-of-function mutations were avoided since those are depleted during evolution. SCANEER successfully predicted the enzyme activities of beta-lactamase and aminoglycoside 3'-phosphotransferase. SCANEER was further experimentally validated to control the activities of three different enzymes of great interest in chemical production: cis-aconitate decarboxylase, α-ketoglutaric semialdehyde dehydrogenase, and inositol oxygenase. Activity-enhancing mutations that improve substrate-binding affinity or turnover rate were found at sites distal from known active sites or ligand-binding pockets. We provide SCANEER to control desired enzyme activity through a user-friendly webserver.
Affiliation(s)
- Donghyo Kim
  - Department of Life Sciences, Pohang University of Science and Technology, Pohang, South Korea
- Myung Hyun Noh
  - Department of Chemical Engineering, Pohang University of Science and Technology, Pohang, South Korea
- Minhyuk Park
  - Department of Life Sciences, Pohang University of Science and Technology, Pohang, South Korea
- Inhae Kim
  - ImmunoBiome Inc., Pohang, South Korea
- Hyunsoo Ahn
  - Graduate School of Artificial Intelligence, Pohang University of Science and Technology, Pohang, South Korea
- Dae-Yeol Ye
  - Department of Chemical Engineering, Pohang University of Science and Technology, Pohang, South Korea
- Gyoo Yeol Jung
  - Department of Chemical Engineering, Pohang University of Science and Technology, Pohang, South Korea; Institute of Convergence Research and Education in Advanced Technology, Yonsei University, Seoul, South Korea; School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology, Pohang, South Korea.
- Sanguk Kim
  - Department of Life Sciences, Pohang University of Science and Technology, Pohang, South Korea; Graduate School of Artificial Intelligence, Pohang University of Science and Technology, Pohang, South Korea; Institute of Convergence Research and Education in Advanced Technology, Yonsei University, Seoul, South Korea; School of Interdisciplinary Bioscience and Bioengineering, Pohang University of Science and Technology, Pohang, South Korea.
104
Lynn CW, Holmes CM, Bialek W, Schwab DJ. Decomposing the Local Arrow of Time in Interacting Systems. Phys Rev Lett 2022; 129:118101. [PMID: 36154397 PMCID: PMC9751844 DOI: 10.1103/physrevlett.129.118101] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/08/2022] [Revised: 06/03/2022] [Accepted: 06/24/2022] [Indexed: 05/30/2023]
Abstract
We show that the evidence for a local arrow of time, which is equivalent to the entropy production in thermodynamic systems, can be decomposed. In a system with many degrees of freedom, there is a term that arises from the irreversible dynamics of the individual variables, and then a series of non-negative terms contributed by correlations among pairs, triplets, and higher-order combinations of variables. We illustrate this decomposition on simple models of noisy logical computations, and then apply it to the analysis of patterns of neural activity in the retina as it responds to complex dynamic visual scenes. We find that neural activity breaks detailed balance even when the visual inputs do not, and that this irreversibility arises primarily from interactions between pairs of neurons.
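To make the quantity being decomposed more concrete, here is a minimal sketch of the total irreversibility of a stationary Markov chain, computed as the Kullback-Leibler divergence between forward and time-reversed transition statistics. It does not implement the decomposition into single-variable, pairwise, and higher-order terms described in the abstract; the function names and the toy transition matrices are assumptions, and the power iteration assumes an ergodic, aperiodic chain.

```python
import math


def stationary_distribution(P, n_iter=10_000):
    """Stationary distribution of row-stochastic matrix P by power iteration
    (assumes the chain is ergodic and aperiodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def entropy_production(P):
    """KL divergence (nats per step) between the forward joint statistics
    pi[i] * P[i][j] and their time reverse pi[j] * P[j][i]."""
    n = len(P)
    pi = stationary_distribution(P)
    sigma = 0.0
    for i in range(n):
        for j in range(n):
            forward = pi[i] * P[i][j]
            backward = pi[j] * P[j][i]
            if forward > 0.0:
                if backward == 0.0:
                    return math.inf  # a strictly one-way transition
                sigma += forward * math.log(forward / backward)
    return sigma


# A biased three-state cycle breaks detailed balance ...
biased_cycle = [
    [0.0, 0.8, 0.2],
    [0.2, 0.0, 0.8],
    [0.8, 0.2, 0.0],
]
# ... whereas any two-state chain satisfies it.
two_state = [
    [0.9, 0.1],
    [0.3, 0.7],
]
sigma_cycle = entropy_production(biased_cycle)
sigma_two_state = entropy_production(two_state)
```

For the biased cycle the stationary distribution is uniform, so the result should equal 0.6 ln 4, while the two-state chain should give zero up to rounding.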
Affiliation(s)
- Christopher W Lynn
  - Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, New York, New York 10016, USA
  - Joseph Henry Laboratories of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
- Caroline M Holmes
  - Joseph Henry Laboratories of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
- William Bialek
  - Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, New York, New York 10016, USA
  - Joseph Henry Laboratories of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
- David J Schwab
  - Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, New York, New York 10016, USA
105
Ruff KM, Choi YH, Cox D, Ormsby AR, Myung Y, Ascher DB, Radford SE, Pappu RV, Hatters DM. Sequence grammar underlying the unfolding and phase separation of globular proteins. Mol Cell 2022; 82:3193-3208.e8. [PMID: 35853451 PMCID: PMC10846692 DOI: 10.1016/j.molcel.2022.06.024] [Citation(s) in RCA: 59] [Impact Index Per Article: 19.7] [Received: 08/19/2021] [Revised: 05/05/2022] [Accepted: 06/15/2022] [Indexed: 12/23/2022]
Abstract
Aberrant phase separation of globular proteins is associated with many diseases. Here, we use a model protein system to understand how the unfolded states of globular proteins drive phase separation and the formation of unfolded protein deposits (UPODs). We find that for UPODs to form, the concentrations of unfolded molecules must be above a threshold value. Additionally, unfolded molecules must possess appropriate sequence grammars to drive phase separation. While UPODs recruit molecular chaperones, their compositional profiles are also influenced by synergistic physicochemical interactions governed by the sequence grammars of unfolded proteins and cellular proteins. Overall, the driving forces for phase separation and the compositional profiles of UPODs are governed by the sequence grammars of unfolded proteins. Our studies highlight the need for uncovering the sequence grammars of unfolded proteins that drive UPOD formation and cause gain-of-function interactions whereby proteins are aberrantly recruited into UPODs.
Affiliation(s)
- Kiersten M Ruff
  - Department of Biomedical Engineering, Center for Science & Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO 63130, USA
- Yoon Hee Choi
  - Department of Biochemistry and Pharmacology and Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC 3010, Australia
- Dezerae Cox
  - Department of Biochemistry and Pharmacology and Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC 3010, Australia
- Angelique R Ormsby
  - Department of Biochemistry and Pharmacology and Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC 3010, Australia
- Yoochan Myung
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; Structural Biology and Bioinformatics, Department of Biochemistry and Pharmacology, The University of Melbourne, Melbourne, VIC 3010, Australia; Systems and Computational Biology, Bio21 Institute, The University of Melbourne, Melbourne, VIC 3010, Australia
- David B Ascher
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia; Structural Biology and Bioinformatics, Department of Biochemistry and Pharmacology, The University of Melbourne, Melbourne, VIC 3010, Australia; Systems and Computational Biology, Bio21 Institute, The University of Melbourne, Melbourne, VIC 3010, Australia
- Sheena E Radford
  - Astbury Centre for Structural and Molecular Biology, School of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, UK
- Rohit V Pappu
  - Department of Biomedical Engineering, Center for Science & Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO 63130, USA.
- Danny M Hatters
  - Department of Biochemistry and Pharmacology and Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC 3010, Australia.
106
Abstract
Repeat proteins are made with tandem copies of similar amino acid stretches that fold into elongated architectures. These proteins constitute excellent model systems to investigate how evolution relates to structure, folding, and function. Here, we propose a scheme to map evolutionary information at the sequence level to a coarse-grained model for repeat-protein folding and use it to investigate the folding of thousands of repeat proteins. We model the energetics by a combination of an inverse Potts-model scheme with an explicit mechanistic model of duplications and deletions of repeats to calculate the evolutionary parameters of the system at the single-residue level. These parameters are used to inform an Ising-like model that allows for the generation of folding curves, apparent domain emergence, and occupation of intermediate states that are highly compatible with experimental data in specific case studies. We analyzed the folding of thousands of natural Ankyrin repeat proteins and found that a multiplicity of folding mechanisms are possible. Fully cooperative all-or-none transitions are obtained for arrays with enough sequence-similar elements and strong interactions between them, while noncooperative element-by-element intermittent folding arose if the elements are dissimilar and the interactions between them are energetically weak. Additionally, we characterized nucleation-propagation and multidomain folding mechanisms. We show that the global stability and cooperativity of the repeating arrays can be predicted from simple sequence scores.
107
Vigué L, Croce G, Petitjean M, Ruppé E, Tenaillon O, Weigt M. Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes. Nat Commun 2022; 13:4030. [PMID: 35821377 PMCID: PMC9276797 DOI: 10.1038/s41467-022-31643-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/19/2021] [Accepted: 06/27/2022] [Indexed: 12/05/2022] Open
Abstract
Characterizing the effect of mutations is key to understanding the evolution of protein sequences and to separating neutral amino-acid changes from deleterious ones. Epistatic interactions between residues can lead to a context dependence of mutation effects. Context dependence constrains the amino-acid changes that can contribute to polymorphism in the short term, and the ones that can accumulate between species in the long term. We use computational approaches to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes from the analysis of distant homologues. By comparing a context-aware Direct-Coupling Analysis model to a non-epistatic approach, we show that the genetic context strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. The study of more distant species suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.
Affiliation(s)
- Lucile Vigué
  - Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
- Giancarlo Croce
  - Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics-SIB, Lausanne, Switzerland
- Marie Petitjean
  - Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
- Etienne Ruppé
  - Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
  - Laboratoire de Bactériologie, Hôpital Bichat, APHP, Paris, France
- Olivier Tenaillon
  - Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France.
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology-LCQB, Paris, France.
108
Feinauer C, Meynard-Piganeau B, Lucibello C. Interpretable pairwise distillations for generative protein sequence models. PLoS Comput Biol 2022; 18:e1010219. [PMID: 35737722 PMCID: PMC9258900 DOI: 10.1371/journal.pcbi.1010219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 07/06/2022] [Accepted: 05/17/2022] [Indexed: 11/25/2022] Open
Abstract
Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown great performance, commonly attributed to the capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze two different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction. In addition, we show that even simpler, factorized models often come close in performance to the original models.
Complex neural networks trained on large biological datasets have recently shown powerful capabilities in tasks like the prediction of protein structure, assessing the effect of mutations on the fitness of proteins and even designing completely novel proteins with desired characteristics. The enthralling prospect of leveraging these advances in fields like medicine and synthetic biology has created a large amount of interest in academic research and industry. The connected question of what biological insights these methods actually gain during training has, however, received less attention. In this work, we systematically investigate to what extent neural networks capture information that could not be captured by simpler models. To this end, we develop a method to train simpler models to imitate more complex models, and compare their performance to the original neural network models. Surprisingly, we find that the simpler models thus trained often perform on par with the neural networks, while having a considerably simpler structure. This highlights the importance of finding ways to interpret the predictions of neural networks in these fields, which could inform the creation of better models, improve methods for their assessment and ultimately also increase our understanding of the underlying biology.
Affiliation(s)
- Christoph Feinauer
  - Department of Computing Sciences, Bocconi University, Milan, Italy
  - Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy
  - * E-mail:
- Barthelemy Meynard-Piganeau
  - Laboratory of Computational and Quantitative Biology (LCQB) UMR 7238 CNRS, Sorbonne Université, Paris, France
  - Department of Applied Science and Technologies (DISAT), Politecnico di Torino, Turin, Italy
- Carlo Lucibello
  - Department of Computing Sciences, Bocconi University, Milan, Italy
  - Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy
109
Saona R, Kondrashov FA, Khudiakova KA. Relation Between the Number of Peaks and the Number of Reciprocal Sign Epistatic Interactions. Bull Math Biol 2022; 84:74. [PMID: 35713756 PMCID: PMC9205815 DOI: 10.1007/s11538-022-01029-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2022] [Accepted: 05/15/2022] [Indexed: 01/25/2023]
Abstract
Empirical assays of fitness landscapes suggest that they may be rugged, that is, that they have multiple fitness peaks. Such landscapes necessarily have special local structures, called reciprocal sign epistasis (Poelwijk et al. in J Theor Biol 272:141-144, 2011). Here, we investigate the quantitative relationship between the number of fitness peaks and the number of reciprocal sign epistatic interactions. Previously, it has been shown (Poelwijk et al. in J Theor Biol 272:141-144, 2011) that pairwise reciprocal sign epistasis is a necessary but not sufficient condition for the existence of multiple peaks. Applying discrete Morse theory, which to our knowledge has never been used in this context, we extend this result by giving the minimal number of reciprocal sign epistatic interactions required to create a given number of peaks.
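The two notions related in this abstract can be illustrated on the smallest possible case, a two-locus biallelic landscape. The sketch below is not taken from the paper: the function names, the genotype strings, and the fitness values are all hypothetical. It checks the usual definition of reciprocal sign epistasis (each mutation changes the sign of its effect depending on the allele at the other locus) and counts peaks as genotypes fitter than all of their single-mutant neighbours.

```python
def has_reciprocal_sign_epistasis(f):
    """f maps the genotypes "00", "01", "10", "11" to fitness values."""
    first_on_0 = f["10"] - f["00"]   # effect of mutating locus 1, locus 2 = 0
    first_on_1 = f["11"] - f["01"]   # effect of mutating locus 1, locus 2 = 1
    second_on_0 = f["01"] - f["00"]  # effect of mutating locus 2, locus 1 = 0
    second_on_1 = f["11"] - f["10"]  # effect of mutating locus 2, locus 1 = 1
    # Both mutations must flip sign across the two backgrounds.
    return first_on_0 * first_on_1 < 0 and second_on_0 * second_on_1 < 0


def count_peaks(f):
    """Count genotypes strictly fitter than every single-mutant neighbour."""
    peaks = 0
    for g in f:
        neighbours = [
            g[:i] + ("1" if g[i] == "0" else "0") + g[i + 1:]
            for i in range(len(g))
        ]
        if all(f[g] > f[n] for n in neighbours):
            peaks += 1
    return peaks


# Hypothetical fitness values: a rugged landscape and an additive one.
rugged = {"00": 1.0, "01": 0.5, "10": 0.5, "11": 1.2}
additive = {"00": 1.0, "01": 1.1, "10": 1.2, "11": 1.3}
```

With these values, the rugged landscape shows reciprocal sign epistasis and has two peaks ("00" and "11"), while the additive landscape shows none and has a single peak ("11").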
Affiliation(s)
- Raimundo Saona
  - Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Lower Austria, Austria
- Fyodor A. Kondrashov
  - Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Lower Austria, Austria
- Ksenia A. Khudiakova
  - Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Lower Austria, Austria
110
Abstract
The ability to design efficient enzymes from scratch would have a profound effect on chemistry, biotechnology and medicine. Rapid progress in protein engineering over the past decade makes us optimistic that this ambition is within reach. The development of artificial enzymes containing metal cofactors and noncanonical organocatalytic groups shows how protein structure can be optimized to harness the reactivity of nonproteinogenic elements. In parallel, computational methods have been used to design protein catalysts for diverse reactions on the basis of fundamental principles of transition state stabilization. Although the activities of designed catalysts have been quite low, extensive laboratory evolution has been used to generate efficient enzymes. Structural analysis of these systems has revealed the high degree of precision that will be needed to design catalysts with greater activity. To this end, emerging protein design methods, including deep learning, hold particular promise for improving model accuracy. Here we take stock of key developments in the field and highlight new opportunities for innovation that should allow us to transition beyond the current state of the art and enable the robust design of biocatalysts to address societal needs.
111
Patel R, Carnevale V, Kumar S. Epistasis Creates Invariant Sites and Modulates the Rate of Molecular Evolution. Mol Biol Evol 2022; 39:msac106. [PMID: 35575390 PMCID: PMC9156017 DOI: 10.1093/molbev/msac106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Invariant sites are a common feature of amino acid sequence evolution. The presence of invariant sites is frequently attributed to the need to preserve function through site-specific conservation of amino acid residues. Amino acid substitution models without a provision for invariant sites often fit the data significantly worse than those that allow for an excess of invariant sites beyond those predicted by models that only incorporate rate variation among sites (e.g., a Gamma distribution). An alternative is epistasis between sites to preserve residue interactions that can create invariant sites. Through computer-simulated sequence evolution, we evaluated the relative effects of site-specific preferences and site-site couplings in the generation of invariant sites and the modulation of the rate of molecular evolution. In an analysis of ten major families of protein domains with diverse sequence and functional properties, we find that the negative selection imposed by epistasis creates many more invariant sites than site-specific residue preferences alone. Further, epistasis plays an increasingly larger role in creating invariant sites over longer evolutionary periods. Epistasis also dictates rates of domain evolution over time by exerting significant additional purifying selection to preserve site couplings. These patterns illuminate the mechanistic role of epistasis in the processes underlying observed site invariance and evolutionary rates.
Affiliation(s)
- Ravi Patel
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA
  - Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Vincenzo Carnevale
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA
  - Department of Biology, Temple University, Philadelphia, PA 19122, USA
- Sudhir Kumar
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA 19122, USA
  - Department of Biology, Temple University, Philadelphia, PA 19122, USA
  - Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
112
Ding D, Green AG, Wang B, Lite TLV, Weinstein EN, Marks DS, Laub MT. Co-evolution of interacting proteins through non-contacting and non-specific mutations. Nat Ecol Evol 2022; 6:590-603. [PMID: 35361892 PMCID: PMC9090974 DOI: 10.1038/s41559-022-01688-0] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 09/25/2021] [Accepted: 01/31/2022] [Indexed: 01/08/2023]
Abstract
Proteins often accumulate neutral mutations that do not affect current functions but can profoundly influence future mutational possibilities and functions. Understanding such hidden potential has major implications for protein design and evolutionary forecasting but has been limited by a lack of systematic efforts to identify potentiating mutations. Here, through the comprehensive analysis of a bacterial toxin-antitoxin system, we identified all possible single substitutions in the toxin that enable it to tolerate otherwise interface-disrupting mutations in its antitoxin. Strikingly, the majority of enabling mutations in the toxin do not contact and promote tolerance non-specifically to many different antitoxin mutations, despite covariation in homologues occurring primarily between specific pairs of contacting residues across the interface. In addition, the enabling mutations we identified expand future mutational paths that both maintain old toxin-antitoxin interactions and form new ones. These non-specific mutations are missed by widely used covariation and machine learning methods. Identifying such enabling mutations will be critical for ensuring continued binding of therapeutically relevant proteins, such as antibodies, aimed at evolving targets.
Affiliation(s)
- David Ding
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Anna G Green
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Boyuan Wang
  - Department of Pharmacology, UT Southwestern Medical Center, Dallas, TX, USA
- Thuy-Lan Vo Lite
  - Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA, USA
- Debora S Marks
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Michael T Laub
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA.
  - Howard Hughes Medical Institute, Massachusetts Institute of Technology, Cambridge, MA, USA.
| |
Collapse
|
113
|
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) allow to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
114
|
Chi H, Zhou Q, Tutol JN, Phelps SM, Lee J, Kapadia P, Morcos F, Dodani SC. Coupling a Live Cell Directed Evolution Assay with Coevolutionary Landscapes to Engineer an Improved Fluorescent Rhodopsin Chloride Sensor. ACS Synth Biol 2022; 11:1627-1638. [PMID: 35389621 PMCID: PMC9184236 DOI: 10.1021/acssynbio.2c00033] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
Our understanding of chloride in biology has been accelerated through the application of fluorescent protein-based sensors in living cells. These sensors can be generated and diversified to have a range of properties using laboratory-guided evolution. Recently, we established that the fluorescent proton-pumping rhodopsin wtGR from Gloeobacter violaceus can be converted into a fluorescent sensor for chloride. To unlock this non-natural function, a single point mutation at the Schiff counterion position (D121V) was introduced into wtGR fused to cyan fluorescent protein (CFP) resulting in GR1-CFP. Here, we have integrated coevolutionary analysis with directed evolution to understand how the rhodopsin sequence space can be explored and engineered to improve this starting point. We first show how evolutionary couplings are predictive of functional sites in the rhodopsin family and how a fitness metric based on a sequence can be used to quantify the known proton-pumping activities of GR-CFP variants. Then, we couple this ability to predict potential functional outcomes with a screening and selection assay in live Escherichia coli to reduce the mutational search space of five residues along the proton-pumping pathway in GR1-CFP. This iterative selection process results in GR2-CFP with four additional mutations: E132K, A84K, T125C, and V245I. Finally, bulk and single fluorescence measurements in live E. coli reveal that GR2-CFP is a reversible, ratiometric fluorescent sensor for extracellular chloride with an improved dynamic range. We anticipate that our framework will be applicable to other systems, providing a more efficient methodology to engineer fluorescent protein-based sensors with desired properties.
|
115
|
Singer JM, Novotney S, Strickland D, Haddox HK, Leiby N, Rocklin GJ, Chow CM, Roy A, Bera AK, Motta FC, Cao L, Strauch EM, Chidyausiku TM, Ford A, Ho E, Zaitzeff A, Mackenzie CO, Eramian H, DiMaio F, Grigoryan G, Vaughn M, Stewart LJ, Baker D, Klavins E. Large-scale design and refinement of stable proteins using sequence-only models. PLoS One 2022; 17:e0265020. [PMID: 35286324 PMCID: PMC8920274 DOI: 10.1371/journal.pone.0265020] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/18/2021] [Accepted: 02/18/2022] [Indexed: 12/25/2022] Open
Abstract
Engineered proteins generally must possess a stable structure in order to achieve their designed function. Stable designs, however, are astronomically rare within the space of all possible amino acid sequences. As a consequence, many designs must be tested computationally and experimentally in order to find stable ones, which is expensive in terms of time and resources. Here we use a high-throughput, low-fidelity assay to experimentally evaluate the stability of approximately 200,000 novel proteins. These include a wide range of sequence perturbations, providing a baseline for future work in the field. We build a neural network model that predicts protein stability given only sequences of amino acids, and compare its performance to the assayed values. We also report another network model that is able to generate the amino acid sequences of novel stable proteins given requested secondary sequences. Finally, we show that the predictive model, despite weaknesses including a noisy data set, can be used to substantially increase the stability of both expert-designed and model-generated proteins.
Affiliation(s)
- Scott Novotney
  - Two Six Technologies, Arlington, Virginia, United States of America
- Devin Strickland
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, United States of America
- Hugh K. Haddox
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Nicholas Leiby
  - Two Six Technologies, Arlington, Virginia, United States of America
- Gabriel J. Rocklin
  - Department of Pharmacology and Center for Synthetic Biology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Cameron M. Chow
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Anindya Roy
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Asim K. Bera
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Francis C. Motta
  - Department of Mathematical Sciences, Florida Atlantic University, Boca Raton, Florida, United States of America
- Longxing Cao
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Eva-Maria Strauch
  - Department of Pharmaceutical and Biomedical Sciences, University of Georgia, Athens, Georgia, United States of America
- Tamuka M. Chidyausiku
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Alex Ford
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Ethan Ho
  - Texas Advanced Computing Center, Austin, Texas, United States of America
- Craig O. Mackenzie
  - Quantitative Biomedical Sciences Graduate Program, Dartmouth College, Hanover, New Hampshire, United States of America
- Hamed Eramian
  - Netrias, Cambridge, Massachusetts, United States of America
- Frank DiMaio
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Gevorg Grigoryan
  - Departments of Computer Science and Biological Sciences, Dartmouth College, Hanover, New Hampshire, United States of America
- Matthew Vaughn
  - Texas Advanced Computing Center, Austin, Texas, United States of America
- Lance J. Stewart
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- David Baker
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, Washington, United States of America
- Eric Klavins
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, United States of America
|
116
|
Computational enzyme redesign: large jumps in function. Trends in Chemistry 2022. [DOI: 10.1016/j.trechm.2022.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
117
|
Gopalakrishnappa C, Gowda K, Prabhakara KH, Kuehn S. An ensemble approach to the structure-function problem in microbial communities. iScience 2022; 25:103761. [PMID: 35141504 PMCID: PMC8810406 DOI: 10.1016/j.isci.2022.103761] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 11/30/2022] Open
Abstract
The metabolic activity of microbial communities plays a primary role in the flow of essential nutrients throughout the biosphere. Molecular genetics has revealed the metabolic pathways that model organisms utilize to generate energy and biomass, but we understand little about how the metabolism of diverse, natural communities emerges from the collective action of its constituents. We propose that quantifying and mapping metabolic fluxes to sequencing measurements of genomic, taxonomic, or transcriptional variation across an ensemble of diverse communities, either in the laboratory or in the wild, can reveal low-dimensional descriptions of community structure that can explain or predict their emergent metabolic activity. We survey the types of communities for which this approach might be best suited, review the analytical techniques available for quantifying metabolite fluxes in communities, and discuss what types of data analysis approaches might be lucrative for learning the structure-function mapping in communities from these data.
Affiliation(s)
- Karna Gowda
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL 60637, USA
- Kaumudi H. Prabhakara
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL 60637, USA
- Seppe Kuehn
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637, USA
  - Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL 60637, USA
|
118
|
Enhancing computational enzyme design by a maximum entropy strategy. Proc Natl Acad Sci U S A 2022; 119:2122355119. [PMID: 35135886 PMCID: PMC8851541 DOI: 10.1073/pnas.2122355119] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Accepted: 01/03/2022] [Indexed: 01/16/2023] Open
Abstract
Although computational enzyme design is of great importance, the advances utilizing physics-based approaches have been slow, and further progress is urgently needed. One promising direction is using machine learning, but such strategies have not been established as effective tools for predicting the catalytic power of enzymes. Here, we show that the statistical energy inferred from homologous sequences with the maximum entropy (MaxEnt) principle significantly correlates with enzyme catalysis and stability at the active site region and the more distant region, respectively. This finding decodes enzyme architecture and offers a connection between enzyme evolution and the physical chemistry of enzyme catalysis, and it deepens our understanding of the stability-activity trade-off hypothesis for enzymes. Overall, the strong correlations found here provide a powerful way of guiding enzyme design.
|
119
|
Tubiana J, Xiang Y, Fan L, Wolfson HJ, Chen K, Schneidman-Duhovny D, Shi Y. Reduced antigenicity of Omicron lowers host serologic response. bioRxiv: the preprint server for biology 2022:2022.02.15.480546. [PMID: 35194608 PMCID: PMC8863144 DOI: 10.1101/2022.02.15.480546] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022]
Abstract
SARS-CoV-2 Omicron variant of concern (VOC) contains fifteen mutations on the receptor binding domain (RBD), evading most neutralizing antibodies from vaccinated sera. Emerging evidence suggests that Omicron breakthrough cases are associated with substantially lower antibody titers than other VOC cases. However, the mechanism remains unclear. Here, using a novel geometric deep-learning model, we discovered that the antigenic profile of Omicron RBD is distinct from the prior VOCs, featuring reduced antigenicity in its remodeled receptor binding sites (RBS). To substantiate our deep-learning prediction, we immunized mice with different recombinant RBD variants and found that the Omicron's extensive mutations can lead to a drastically attenuated serologic response with limited neutralizing activity in vivo, while the T cell response remains potent. Analyses of serum cross-reactivity and competitive ELISA with epitope-specific nanobodies revealed that the antibody response to Omicron was reduced across RBD epitopes, including both the variable RBS and epitopes without any known VOC mutations. Moreover, computational modeling confirmed that the RBS is highly versatile with a capacity to further decrease antigenicity while retaining efficient receptor binding. Longitudinal analysis showed that this evolutionary trend of decrease in antigenicity was also found in hCoV229E, a common cold coronavirus that has been circulating in humans for decades. Thus, our study provided unprecedented insights into the reduced antibody titers associated with Omicron infection, revealed a possible trajectory of future viral evolution and may inform the vaccine development against future outbreaks.
Affiliation(s)
- Jérôme Tubiana
  - Blavatnik School of Computer Science, Tel Aviv University, Israel
  - School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
- Yufei Xiang
  - Department of Cell Biology, University of Pittsburgh, USA
  - Current address: Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, USA
- Li Fan
  - Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, USA
- Haim J. Wolfson
  - Blavatnik School of Computer Science, Tel Aviv University, Israel
- Kong Chen
  - Division of Pulmonary, Allergy, and Critical Care Medicine, Department of Medicine, University of Pittsburgh, USA
- Yi Shi
  - Department of Cell Biology, University of Pittsburgh, USA
  - Current address: Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, USA
|
120
|
Engineering synthetic auxotrophs for growth-coupled directed protein evolution. Trends Biotechnol 2022; 40:773-776. [DOI: 10.1016/j.tibtech.2022.01.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2021] [Revised: 01/19/2022] [Accepted: 01/19/2022] [Indexed: 11/19/2022]
|
121
|
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022; 40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 82] [Impact Index Per Article: 27.3] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]
Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
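The combination approach summarized in this abstract (ridge regression on site-specific amino acid features plus one probability density feature) can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`one_hot`, `augmented_features`, `fit_ridge`), the toy sequences, the fitness values, the density scores, and the penalty `alpha` are all assumptions made for illustration. The density scores stand in for log-likelihoods from any evolutionary density model, and the sketch simplifies by using no intercept and a single uniform penalty on all weights.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seq):
    """Flattened site-specific one-hot encoding of an amino acid sequence."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()


def augmented_features(seqs, density_scores):
    """Stack one-hot site features with one extra density-score column."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge solution w = (X^T X + alpha I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


# Hypothetical toy data: four length-3 variants with assay labels and
# made-up density scores (placeholders for an evolutionary model's output).
seqs = ["ACD", "ACE", "GCD", "GCE"]
density = [-1.0, -2.0, -1.5, -3.0]
fitness = np.array([1.0, 0.4, 0.8, 0.1])

X = augmented_features(seqs, density)
w = fit_ridge(X, fitness, alpha=0.1)
pred = X @ w  # fitted values on the training variants
```

In this sketch the last entry of `w` is the weight on the density feature, so the evolutionary model contributes a single learned coefficient while the remaining weights capture site-specific effects from the labeled data.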
Affiliation(s)
- Chloe Hsu
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, USA
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
  - Center for Computational Biology, University of California, Berkeley, USA
|
122
|
Gonzalez Somermeyer L, Fleiss A, Mishin AS, Bozhanova NG, Igolkina AA, Meiler J, Alaball Pujol ME, Putintseva EV, Sarkisyan KS, Kondrashov FA. Heterogeneity of the GFP fitness landscape and data-driven protein design. eLife 2022; 11:75842. [PMID: 35510622 PMCID: PMC9119679 DOI: 10.7554/elife.75842] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 11/25/2021] [Accepted: 03/25/2022] [Indexed: 11/24/2022] Open
Abstract
Studies of protein fitness landscapes reveal biophysical constraints guiding protein evolution and empower prediction of functional proteins. However, generalisation of these findings is limited due to scarceness of systematic data on fitness landscapes of proteins with a defined evolutionary relationship. We characterized the fitness peaks of four orthologous fluorescent proteins with a broad range of sequence divergence. While two of the four studied fitness peaks were sharp, the other two were considerably flatter, being almost entirely free of epistatic interactions. Mutationally robust proteins, characterized by a flat fitness peak, were not optimal templates for machine-learning-driven protein design - instead, predictions were more accurate for fragile proteins with epistatic landscapes. Our work paves insights for practical application of fitness landscape heterogeneity in protein engineering.
Affiliation(s)
- Aubin Fleiss
  - Synthetic Biology Group, MRC London Institute of Medical Sciences, London, United Kingdom
  - Institute of Clinical Sciences, Faculty of Medicine and Imperial College Centre for Synthetic Biology, Imperial College London, London, United Kingdom
- Alexander S Mishin
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russian Federation
- Nina G Bozhanova
  - Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, United States
- Anna A Igolkina
  - Gregor Mendel Institute, Austrian Academy of Sciences, Vienna BioCenter, Vienna, Austria
- Jens Meiler
  - Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, United States
  - Institute for Drug Discovery, Medical School, Leipzig University, Leipzig, Germany
- Maria-Elisenda Alaball Pujol
  - Synthetic Biology Group, MRC London Institute of Medical Sciences, London, United Kingdom
  - Institute of Clinical Sciences, Faculty of Medicine and Imperial College Centre for Synthetic Biology, Imperial College London, London, United Kingdom
- Karen S Sarkisyan
  - Synthetic Biology Group, MRC London Institute of Medical Sciences, London, United Kingdom
  - Institute of Clinical Sciences, Faculty of Medicine and Imperial College Centre for Synthetic Biology, Imperial College London, London, United Kingdom
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russian Federation
- Fyodor A Kondrashov
  - Institute of Science and Technology Austria, Klosterneuburg, Austria
  - Evolutionary and Synthetic Biology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
|
123
|
Adaptive machine learning for protein engineering. Curr Opin Struct Biol 2021; 72:145-152. [PMID: 34896756 DOI: 10.1016/j.sbi.2021.11.002] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 06/09/2021] [Revised: 09/14/2021] [Accepted: 11/08/2021] [Indexed: 11/22/2022]
Abstract
Machine-learning models that learn from data to predict how protein sequence encodes function are emerging as a useful protein engineering tool. However, when using these models to suggest new protein designs, one must deal with the vast combinatorial complexity of protein sequences. Here, we review how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
|
124
|
Mokhtari DA, Appel MJ, Fordyce PM, Herschlag D. High throughput and quantitative enzymology in the genomic era. Curr Opin Struct Biol 2021; 71:259-273. [PMID: 34592682 PMCID: PMC8648990 DOI: 10.1016/j.sbi.2021.07.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/07/2021] [Accepted: 07/23/2021] [Indexed: 12/28/2022]
Abstract
Accurate predictions from models based on physical principles are the ultimate metric of our biophysical understanding. Although there has been stunning progress toward structure prediction, quantitative prediction of enzyme function has remained challenging. Realizing this goal will require large numbers of quantitative measurements of rate and binding constants and the use of these ground-truth data sets to guide the development and testing of these quantitative models. Ground truth data more closely linked to the underlying physical forces are also desired. Here, we describe technological advances that enable both types of ground truth measurements. These advances allow classic models to be tested, provide novel mechanistic insights, and place us on the path toward a predictive understanding of enzyme structure and function.
Affiliation(s)
- D A Mokhtari
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
- M J Appel
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
- P M Fordyce
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA, 94305, USA
  - Department of Genetics, Stanford University, Stanford, CA, 94305, USA
  - Chan Zuckerberg Biohub San Francisco, CA, 94110, USA
- D Herschlag
  - Department of Biochemistry, Stanford University, Stanford, CA, 94305, USA
  - Department of Chemical Engineering, Stanford University, Stanford, CA, 94305, USA
  - ChEM-H Institute, Stanford University, Stanford, CA, 94305, USA
|
125
|
Malbranke C, Bikard D, Cocco S, Monasson R. Improving sequence-based modeling of protein families using secondary-structure quality assessment. Bioinformatics 2021; 37:4083-4090. [PMID: 34117879 PMCID: PMC9502231 DOI: 10.1093/bioinformatics/btab442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2021] [Revised: 06/03/2021] [Accepted: 06/16/2021] [Indexed: 12/03/2022] Open
Abstract
MOTIVATION: Modeling of protein family sequence distribution from homologous sequence data recently received considerable attention, in particular for structure and function predictions, as well as for protein design. In particular, direct coupling analysis, a method to infer effective pairwise interactions between residues, was shown to capture important structural constraints and to successfully generate functional protein sequences. Building on this and other graphical models, we introduce a new framework to assess the quality of the secondary structures of the generated sequences with respect to reference structures for the family.
RESULTS: We introduce two scoring functions characterizing the likeliness of the secondary structure of a protein sequence to match a reference structure, called Dot Product and Pattern Matching. We test these scores on published experimental protein mutagenesis and design datasets, and show improvement in the detection of nonfunctional sequences. We also show that using these scores helps reject nonfunctional sequences generated by graphical models (Restricted Boltzmann Machines) learned from homologous sequence alignments.
AVAILABILITY AND IMPLEMENTATION: Data and code are available at https://github.com/CyrilMa/ssqa.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cyril Malbranke
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
  - Synthetic Biology, Microbiology Department, Institut Pasteur, Paris, France
- David Bikard
  - Synthetic Biology, Microbiology Department, Institut Pasteur, Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
|
126
|
Bisardi M, Rodriguez-Rivas J, Zamponi F, Weigt M. Modeling sequence-space exploration and emergence of epistatic signals in protein evolution. Mol Biol Evol 2021; 39:6424001. [PMID: 34751386 PMCID: PMC8789065 DOI: 10.1093/molbev/msab321] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 02/07/2023] Open
Abstract
During their evolution, proteins explore sequence space via an interplay between random mutations and phenotypic selection. Here, we build upon recent progress in reconstructing data-driven fitness landscapes for families of homologous proteins, to propose stochastic models of experimental protein evolution. These models predict quantitatively important features of experimentally evolved sequence libraries, like fitness distributions and position-specific mutational spectra. They also allow us to efficiently simulate sequence libraries for a vast array of combinations of experimental parameters like sequence divergence, selection strength, and library size. We showcase the potential of the approach in reanalyzing two recent experiments to determine protein structure from signals of epistasis emerging in experimental sequence libraries. To be detectable, these signals require sufficiently large and sufficiently diverged libraries. Our modeling framework offers a quantitative explanation for different outcomes of recently published experiments. Furthermore, we can forecast the outcome of time- and resource-intensive evolution experiments, opening thereby a way to computationally optimize experimental protocols.
Affiliation(s)
- M Bisardi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, F-75005, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, Paris, F-75005, France
- J Rodriguez-Rivas
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, Paris, F-75005, France
- F Zamponi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, F-75005, France
- M Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, Paris, F-75005, France
|
127
|
McGee F, Hauri S, Novinger Q, Vucetic S, Levy RM, Carnevale V, Haldane A. The generative capacity of probabilistic protein sequence models. Nat Commun 2021; 12:6302. [PMID: 34728624 PMCID: PMC8563988 DOI: 10.1038/s41467-021-26529-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/11/2021] [Accepted: 09/23/2021] [Indexed: 01/10/2023] Open
Abstract
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
Affiliation(s)
- Francisco McGee
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
  - Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
  - Department of Biology, Temple University, Philadelphia, 19122, USA
- Sandro Hauri
  - Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
  - Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Quentin Novinger
  - Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
  - Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Slobodan Vucetic
  - Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
  - Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Ronald M Levy
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
  - Department of Biology, Temple University, Philadelphia, 19122, USA
  - Department of Physics, Temple University, Philadelphia, 19122, USA
  - Department of Chemistry, Temple University, Philadelphia, 19122, USA
- Vincenzo Carnevale
  - Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
  - Department of Biology, Temple University, Philadelphia, 19122, USA
- Allan Haldane
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
  - Department of Chemistry, Temple University, Philadelphia, 19122, USA
|
128
|
Modulating Glycoside Hydrolase Activity between Hydrolysis and Transfer Reactions Using an Evolutionary Approach. Molecules 2021; 26:6586. [PMID: 34770995 PMCID: PMC8587830 DOI: 10.3390/molecules26216586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/23/2021] [Revised: 10/27/2021] [Accepted: 10/28/2021] [Indexed: 01/02/2023] Open
Abstract
The proteins within the CAZy glycoside hydrolase family GH13 catalyze the hydrolysis of polysaccharides such as glycogen and starch. Many of these enzymes also perform transglycosylation to varying degrees, ranging from a secondary to the predominant reaction. Identifying structural determinants associated with GH13 family reaction specificity is key to modifying and designing enzymes with increased specificity towards individual reactions for further applications in industrial, chemical, or biomedical fields. This work proposes a computational approach for decoding the structural determinants that define the reaction specificity. The method is based on the conservation of coevolving residues in spatial contacts associated with reaction specificity. To evaluate the algorithm, mutants of α-amylase (TmAmyA) and glucanotransferase (TmGTase) from Thermotoga maritima were constructed to modify the reaction specificity. The K98P/D99A/H222Q variant of TmAmyA doubled the transglycosylation/hydrolysis (T/H) ratio, while the M279N variant of TmGTase increased the hydrolysis/transglycosylation ratio five-fold. Molecular dynamics simulations of the variants indicated changes in flexibility that can account for the modified T/H ratio. An essential contribution of the presented computational approach is its capacity to identify residues outside of the active center that affect the reaction specificity.
|
129
|
adabmDCA: adaptive Boltzmann machine learning for biological sequences. BMC Bioinformatics 2021; 22:528. [PMID: 34715775 PMCID: PMC8555268 DOI: 10.1186/s12859-021-04441-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/14/2021] [Accepted: 10/12/2021] [Indexed: 11/30/2022] Open
Abstract
Background: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionarily related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has also been assessed in terms of their ability to predict mutational effects and to generate functional sequences in silico. Results: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and supports several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA. As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and the TPP-riboswitch RNA domain. Conclusions: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.
|
130
|
Defresne M, Barbe S, Schiex T. Protein Design with Deep Learning. Int J Mol Sci 2021; 22:11741. [PMID: 34769173 PMCID: PMC8584038 DOI: 10.3390/ijms222111741] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 09/30/2021] [Revised: 10/23/2021] [Accepted: 10/26/2021] [Indexed: 12/21/2022] Open
Abstract
Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.
Affiliation(s)
- Marianne Defresne
  - Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Sophie Barbe
  - Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
|
131
|
Trinquier J, Uguzzoni G, Pagnani A, Zamponi F, Weigt M. Efficient generative modeling of protein sequences using simple autoregressive models. Nat Commun 2021; 12:5800. [PMID: 34608136 PMCID: PMC8490405 DOI: 10.1038/s41467-021-25756-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/24/2021] [Accepted: 08/23/2021] [Indexed: 02/08/2023] Open
Abstract
Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.
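The two quantities mentioned in this abstract, the probability of a given sequence and an entropy-based estimate of the size of sequence space, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the two-letter alphabet, the length of 3, and the hand-written `conditional` function are hypothetical and are not the authors' model. The sketch only shows the chain-rule factorization P(a_1..a_L) = prod_i P(a_i | a_1..a_{i-1}) and the effective size exp(S) of the sampled space.

```python
import itertools
import math

ALPHABET = "AB"  # toy alphabet (assumption; real models use 20 amino acids plus gap)
L = 3            # toy sequence length (assumption)

def conditional(prefix):
    """Hypothetical P(a_i | a_1..a_{i-1}); it favours repeating the previous letter."""
    if not prefix:
        return {"A": 0.5, "B": 0.5}
    last = prefix[-1]
    return {s: (0.8 if s == last else 0.2) for s in ALPHABET}

def log_prob(seq):
    """Chain rule: log P(seq) is the sum of the log conditionals, one per position."""
    return sum(math.log(conditional(seq[:i])[a]) for i, a in enumerate(seq))

def entropy():
    """Exact entropy S = -sum_seq P(seq) log P(seq); enumeration is feasible only for toy sizes."""
    total = 0.0
    for letters in itertools.product(ALPHABET, repeat=L):
        lp = log_prob("".join(letters))
        total -= math.exp(lp) * lp
    return total

S = entropy()
effective_size = math.exp(S)  # effective number of sequences the model spreads its mass over
```

Because each conditional is normalized, the sequence probabilities sum to one without computing a partition function, which is what makes exact sequence probabilities cheap in this model class. For realistic lengths the entropy cannot be enumerated as done here and would have to be estimated, for example from samples.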
Affiliation(s)
- Jeanne Trinquier
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France; Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Guido Uguzzoni
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy; Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy
- Andrea Pagnani
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy; Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy; INFN Sezione di Torino, Via P. Giuria 1, I-10125 Torino, Italy
- Francesco Zamponi
  - Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
|
132
|
Barrat-Charlaix P, Muntoni AP, Shimagaki K, Weigt M, Zamponi F. Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families. Phys Rev E 2021; 104:024407. [PMID: 34525554 DOI: 10.1103/physreve.104.024407] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/17/2021] [Accepted: 07/19/2021] [Indexed: 11/07/2022]
Abstract
Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
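As a rough illustration of the model class described in this abstract, the sketch below evaluates a pairwise Potts energy built from local fields and two-site couplings, and normalizes it by brute force. All numbers, the two-state alphabet, and the sign convention E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j) are assumptions chosen for illustration. A sparse model is represented simply by leaving a pair out of the coupling dictionary; the information-based decimation criterion of the paper is not implemented here.

```python
import itertools
import math

q = 2  # toy number of states per site (assumption; proteins use 21)
L = 3  # toy number of sites (assumption)

# Local fields h[i][a]: site-specific preferences (made-up values).
h = [[0.5, -0.5], [0.0, 0.0], [-0.2, 0.2]]

# Two-site couplings J[(i, j)][a][b] for i < j (made-up values).
# The pair (0, 2) is absent: a sparse model omits that coupling entirely.
J = {
    (0, 1): [[1.0, -1.0], [-1.0, 1.0]],
    (1, 2): [[0.3, 0.0], [0.0, 0.3]],
}

def energy(seq):
    """Potts energy: minus the sum of fields, minus the sum of retained couplings."""
    e = -sum(h[i][a] for i, a in enumerate(seq))
    e -= sum(Jij[seq[i]][seq[j]] for (i, j), Jij in J.items())
    return e

# Brute-force normalization; feasible only because q**L is tiny here.
states = list(itertools.product(range(q), repeat=L))
Z = sum(math.exp(-energy(s)) for s in states)
prob = {s: math.exp(-energy(s)) / Z for s in states}
```

In this representation, removing a coupling is just deleting a dictionary entry, so the cost of evaluating the energy scales with the number of retained pairs rather than with all L(L-1)/2 pairs.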
Affiliation(s)
- Pierre Barrat-Charlaix
  - Biozentrum, Universität Basel, Switzerland, Swiss Institute of Bioinformatics, Basel 4056, Switzerland
- Anna Paola Muntoni
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy; Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy; Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France; Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Kai Shimagaki
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
- Francesco Zamponi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
|
133
|
Bitard-Feildel T. Navigating the amino acid sequence space between functional proteins using a deep learning framework. PeerJ Comput Sci 2021; 7:e684. [PMID: 34616884 PMCID: PMC8459775 DOI: 10.7717/peerj-cs.684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/30/2021] [Indexed: 06/13/2023]
Abstract
MOTIVATION: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, disease understanding, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution. RESULTS: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP, and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes, for the first time, two sampling strategies based on latent space interpolation and latent space arithmetic to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of the latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
Affiliation(s)
- Tristan Bitard-Feildel
  - IBPS, CNRS, Laboratoire de Biologie Computationnelle et Quantitative, Sorbonne Université, Paris, France
  - Institut des Sciences du Calcul et des Données (ISCD), Sorbonne Université, Paris, France
|
134
|
Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
- Stephan Eismann
  - Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
- Arne Elofsson
  - Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
- Sergei Grudinin
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
|
135
|
Chang HJ, Zúñiga A, Conejero I, Voyvodic PL, Gracy J, Fajardo-Ruiz E, Cohen-Gonsaud M, Cambray G, Pageaux GP, Meszaros M, Meunier L, Bonnet J. Programmable receptors enable bacterial biosensors to detect pathological biomarkers in clinical samples. Nat Commun 2021; 12:5216. [PMID: 34471137 PMCID: PMC8410942 DOI: 10.1038/s41467-021-25538-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 06/07/2021] [Accepted: 08/12/2021] [Indexed: 12/17/2022] Open
Abstract
Bacterial biosensors, or bactosensors, are promising agents for medical and environmental diagnostics. However, the lack of scalable frameworks to systematically program ligand detection limits their applications. Here we show how novel, clinically relevant sensing modalities can be introduced into bactosensors in a modular fashion. To do so, we have leveraged a synthetic receptor platform, termed EMeRALD (Engineered Modularized Receptors Activated via Ligand-induced Dimerization) which supports the modular assembly of sensing modules onto a high-performance, generic signaling scaffold controlling gene expression in E. coli. We apply EMeRALD to detect bile salts, a biomarker of liver dysfunction, by repurposing sensing modules from enteropathogenic Vibrio species. We improve the sensitivity and lower the limit-of-detection of the sensing module by directed evolution. We then engineer a colorimetric bactosensor detecting pathological bile salt levels in serum from patients having undergone liver transplant, providing an output detectable by the naked-eye. The EMeRALD technology enables functional exploration of natural sensing modules and rapid engineering of synthetic receptors for diagnostics, environmental monitoring, and control of therapeutic microbes.
Affiliation(s)
- Hung-Ju Chang
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Ana Zúñiga
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Ismael Conejero
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
  - Neuropsychiatry: Epidemiological and Clinical Research, Inserm Unit 1061, Montpellier, France
  - Department of Psychiatry, CHU Nimes, University of Montpellier, Montpellier, France
- Peter L Voyvodic
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Jerome Gracy
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Elena Fajardo-Ruiz
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Martin Cohen-Gonsaud
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Guillaume Cambray
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
- Georges-Philippe Pageaux
  - Department of Hepatogastroenterology, Hepatology and Liver Transplantation Unit, Saint Eloi Hospital, University of Montpellier, Montpellier, France
- Magdalena Meszaros
  - Department of Hepatogastroenterology, Hepatology and Liver Transplantation Unit, Saint Eloi Hospital, University of Montpellier, Montpellier, France
- Lucy Meunier
  - Department of Hepatogastroenterology, Hepatology and Liver Transplantation Unit, Saint Eloi Hospital, University of Montpellier, Montpellier, France
- Jerome Bonnet
  - Centre de Biologie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
|
136
|
Cheung NJ, John Peter AT, Kornmann B. Leri: A web-server for identifying protein functional networks from evolutionary couplings. Comput Struct Biotechnol J 2021; 19:3556-3563. [PMID: 34257835 PMCID: PMC8239741 DOI: 10.1016/j.csbj.2021.06.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/12/2021] [Revised: 05/30/2021] [Accepted: 06/02/2021] [Indexed: 12/12/2022] Open
Abstract
Identify the evolutionary signatures (termed “residue communities”) from protein sequences. The identified residue communities specify the signatures of protein evolution and function sites. Guide the engineering of functional proteins with altered (bio) chemical activities.
Information on the co-evolution of amino acid pairs in a protein can be used for endeavors such as protein engineering, mutation design, and structure prediction. Here we report a method that captures significant determinants of proteins using estimated co-evolution information to identify networks of residues, termed “residue communities”, relevant to protein function. On the benchmark dataset (67 proteins with both catalytic and allosteric residues), the Pearson’s correlation between the identified residues in the communities at functional sites is 0.53, and it is higher than 0.8 when conserved residues derived from the method are taken into account. On the endoplasmic reticulum-mitochondria encounter structure complex, the results indicate three distinguishable residue communities that are relevant to functional roles in the protein family, suggesting that the residue communities could be general evolutionary signatures in proteins. Based on the method, we provide a web-server for the scientific community to explore the signatures in protein families, which establishes a powerful tool to analyze residue-level profiling for the discovery of functional sites and biological pathway identification. This web-server is freely available for non-commercial users at https://kornmann.bioch.ox.ac.uk/leri/services/ecs.html; neither login nor e-mail is required.
Affiliation(s)
- Ngaam J Cheung
  - Department of Biochemistry, University of Oxford, Oxford OX1 3QU, UK; Leri Ltd, Oxford, UK
- Benoit Kornmann
  - Department of Biochemistry, University of Oxford, Oxford OX1 3QU, UK
|
137
|
Miton CM, Buda K, Tokuriki N. Epistasis and intramolecular networks in protein evolution. Curr Opin Struct Biol 2021; 69:160-168. [PMID: 34077895 DOI: 10.1016/j.sbi.2021.04.007] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 01/18/2021] [Revised: 04/01/2021] [Accepted: 04/21/2021] [Indexed: 12/01/2022]
Abstract
Proteins are molecular machines composed of complex, highly connected amino acid networks. Their functional optimization requires the reorganization of these intramolecular networks by evolution. In this review, we discuss the mechanisms by which epistasis, that is, the dependence of the effect of a mutation on the genetic background, rewires intramolecular interactions to alter protein function. Deciphering the biophysical basis of epistasis is crucial to our understanding of evolutionary dynamics and the elucidation of sequence-structure-function relationships. We feature recent studies that provide insights into the molecular mechanisms giving rise to epistasis, particularly at the structural level. These studies illustrate the convoluted and fascinating nature of the intramolecular networks co-opted by epistasis during the evolution of protein function.
Affiliation(s)
- Charlotte M Miton
  - Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, BC, Canada
- Karol Buda
  - Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, BC, Canada
- Nobuhiko Tokuriki
  - Michael Smith Laboratories, University of British Columbia, Vancouver, V6T 1Z4, BC, Canada
|
138
|
On the effect of phylogenetic correlations in coevolution-based contact prediction in proteins. PLoS Comput Biol 2021; 17:e1008957. [PMID: 34029316 PMCID: PMC8177639 DOI: 10.1371/journal.pcbi.1008957] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 07/15/2020] [Revised: 06/04/2021] [Accepted: 04/09/2021] [Indexed: 12/04/2022] Open
Abstract
Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. A comparison between the results of Direct Coupling Analysis applied to real and to resampled data shows that the largest coevolutionary couplings, i.e. those used for contact prediction, are only weakly influenced by phylogeny. However, the phylogeny-induced spurious couplings in the resampled data are compatible in size with the first false-positive contact predictions from real data. Dissecting functional from phylogeny-induced couplings might therefore extend accurate contact predictions to the range of intermediate-size couplings. Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold into close-to-identical three-dimensional structures and fulfill almost identical biological tasks. 
Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches also have to assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain the patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, which are frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.
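The idea of resampling an alignment so that conservation survives while couplings are destroyed can be illustrated with the simplest possible baseline: shuffling each alignment column independently. This is an assumption-laden sketch, not the method of the paper. Column shuffling keeps every per-column amino-acid frequency exactly, but it also destroys the phylogenetic relations that the null models described above are designed to preserve. The function name `shuffle_columns` and the toy alignment are hypothetical.

```python
import random
from collections import Counter

def shuffle_columns(msa, seed=0):
    """Permute the entries of each alignment column independently.

    Per-column frequencies (conservation) are preserved exactly, while
    correlations between columns are randomized. Unlike the null models
    described in the abstract, phylogeny is NOT preserved.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*msa)]  # transpose: rows -> columns
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]  # transpose back to rows

# Toy alignment with perfectly correlated columns (made-up data).
msa = ["ACD", "ACD", "GTE", "GTE"]
null = shuffle_columns(msa)
```

Comparing couplings inferred on `msa` with those inferred on many such shuffled copies gives a baseline for what arises from conservation and finite sampling alone; a phylogeny-preserving scheme would be needed to isolate the effect studied in the paper.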
|
139
|
Bouchiba Y, Cortés J, Schiex T, Barbe S. Molecular flexibility in computational protein design: an algorithmic perspective. Protein Eng Des Sel 2021; 34:6271252. [PMID: 33959778 DOI: 10.1093/protein/gzab011] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/21/2020] [Revised: 03/12/2021] [Accepted: 03/29/2021] [Indexed: 12/19/2022] Open
Abstract
Computational protein design (CPD) is a powerful technique for engineering new proteins, with both great fundamental implications and diverse practical interests. However, the approximations usually made for computational efficiency, using a single fixed backbone and a discrete set of side chain rotamers, tend to produce rigid and hyper-stable folds that may lack functionality. These approximations contrast with the demonstrated importance of molecular flexibility and motions in a wide range of protein functions. The integration of backbone flexibility and multiple conformational states in CPD, in order to relieve the inaccuracies resulting from these simplifications and to improve design reliability, is attracting increased attention. However, the greatly increased search space that needs to be explored in these extensions poses extremely challenging computational problems. In this review, we outline the principles of CPD and discuss recent efforts in algorithmic development for incorporating molecular flexibility in the design process.
Affiliation(s)
- Younes Bouchiba
  - Toulouse Biotechnology Institute, TBI, CNRS, INRAE, INSA, ANITI, Toulouse 31400, France
  - Laboratoire d'Analyse et d'Architecture des Systèmes, LAAS CNRS, Université de Toulouse, CNRS, Toulouse 31400, France
- Juan Cortés
  - Laboratoire d'Analyse et d'Architecture des Systèmes, LAAS CNRS, Université de Toulouse, CNRS, Toulouse 31400, France
- Thomas Schiex
  - Université de Toulouse, ANITI, INRAE, UR MIAT, F-31320, Castanet-Tolosan, France
- Sophie Barbe
  - Toulouse Biotechnology Institute, TBI, CNRS, INRAE, INSA, ANITI, Toulouse 31400, France
|
140
|
Chen Z, Elowitz MB. Programmable protein circuit design. Cell 2021; 184:2284-2301. [PMID: 33848464 PMCID: PMC8087657 DOI: 10.1016/j.cell.2021.03.007] [Citation(s) in RCA: 64] [Impact Index Per Article: 16.0] [Received: 11/14/2020] [Revised: 02/22/2021] [Accepted: 03/02/2021] [Indexed: 12/11/2022]
Abstract
A fundamental challenge in synthetic biology is to create molecular circuits that can program complex cellular functions. Because proteins can bind, cleave, and chemically modify one another and interface directly and rapidly with endogenous pathways, they could extend the capabilities of synthetic circuits beyond what is possible with gene regulation alone. However, the very diversity that makes proteins so powerful also complicates efforts to harness them as well-controlled synthetic circuit components. Recent work has begun to address this challenge, focusing on principles such as orthogonality and composability that permit construction of diverse circuit-level functions from a limited set of engineered protein components. These approaches are now enabling the engineering of circuits that can sense, transmit, and process information; dynamically control cellular behaviors; and enable new therapeutic strategies, establishing a powerful paradigm for programming biology.
Affiliation(s)
- Zibo Chen
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
- Michael B Elowitz
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, USA
  - Howard Hughes Medical Institute, California Institute of Technology, Pasadena, CA 91125, USA
|
141
|
Frappier V, Keating AE. Data-driven computational protein design. Curr Opin Struct Biol 2021; 69:63-69. [PMID: 33910104 DOI: 10.1016/j.sbi.2021.03.009] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 12/06/2020] [Revised: 03/18/2021] [Accepted: 03/19/2021] [Indexed: 01/28/2023]
Abstract
Computational protein design can generate proteins not found in nature that adopt desired structures and perform novel functions. Although proteins could, in theory, be designed with ab initio methods, practical success has come from using large amounts of data that describe the sequences, structures, and functions of existing proteins and their variants. We present recent creative uses of multiple-sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Approaches range from enhancing structure-based design with experimental data to building regression models to training deep neural nets that generate novel sequences. Looking ahead, deep learning will be increasingly important for maximizing the value of data for protein design.
Affiliation(s)
- Vincent Frappier
  - Generate Biomedicines, 26 Landsdowne Street, Cambridge, MA, 02139, USA
- Amy E Keating
  - MIT Departments of Biology and Biological Engineering, 77 Massachusetts Ave., Cambridge, MA, 02139, USA
|
142
|
Biswas S, Khimulya G, Alley EC, Esvelt KM, Church GM. Low-N protein engineering with data-efficient deep learning. Nat Methods 2021; 18:389-396. [PMID: 33828272 DOI: 10.1038/s41592-021-01100-y] [Citation(s) in RCA: 191] [Impact Index Per Article: 47.8] [Received: 08/21/2020] [Accepted: 02/22/2021] [Indexed: 11/09/2022]
Abstract
Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two dissimilar proteins, GFP from Aequorea victoria (avGFP) and E. coli strain TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous high-throughput efforts. By distilling information from natural protein sequence landscapes, our model learns a latent representation of 'unnaturalness', which helps to guide search away from nonfunctional sequence neighborhoods. Subsequent low-N supervision then identifies improvements to the activity of interest. In sum, our approach enables efficient use of resource-intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field and clinic.
Affiliation(s)
- Surojit Biswas
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
  - Nabla Bio, Inc., Boston, MA, USA
- Ethan C Alley
  - MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kevin M Esvelt
  - MIT Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- George M Church
  - Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
|
143
|
Ferguson AL, Ranganathan R. 100th Anniversary of Macromolecular Science Viewpoint: Data-Driven Protein Design. ACS Macro Lett 2021; 10:327-340. [PMID: 35549066 DOI: 10.1021/acsmacrolett.0c00885] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/13/2022]
Abstract
The design of synthetic proteins with the desired function is a long-standing goal in biomolecular science, with broad applications in biochemical engineering, agriculture, medicine, and public health. Rational de novo design and experimental directed evolution have achieved remarkable successes but are challenged by the requirement to find functional "needles" in the vast "haystack" of protein sequence space. Data-driven models for fitness landscapes provide a predictive map between protein sequence and function and can prospectively identify functional candidates for experimental testing to greatly improve the efficiency of this search. This Viewpoint reviews the applications of machine learning and, in particular, deep learning as part of data-driven protein engineering platforms. We highlight recent successes, review promising computational methodologies, and provide an outlook on future challenges and opportunities. The article is written for a broad audience comprising both polymer and protein scientists and computer and data scientists interested in an up-to-date review of recent innovations and opportunities in this rapidly evolving field.
Affiliation(s)
- Andrew L. Ferguson
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
  - Center for Physics of Evolving Systems, University of Chicago, Chicago, Illinois 60637, United States
  - Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
|
144
|
Feehan R, Montezano D, Slusky JSG. Machine learning for enzyme engineering, selection and design. Protein Eng Des Sel 2021; 34:gzab019. [PMID: 34296736 PMCID: PMC8299298 DOI: 10.1093/protein/gzab019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/14/2020] [Revised: 06/18/2021] [Accepted: 06/23/2021] [Indexed: 11/15/2022] Open
Abstract
Machine learning is a useful computational tool for large and complex tasks such as those in the field of enzyme engineering, selection and design. In this review, we examine enzyme-related applications of machine learning. We start by comparing tools that can identify the function of an enzyme and the site responsible for that function. Then we detail methods for optimizing important experimental properties, such as the enzyme environment and enzyme reactants. We describe recent advances in enzyme systems design and enzyme design itself. Throughout we compare and contrast the data and algorithms used for these tasks to illustrate how the algorithms and data can be best used by future designers.
Affiliation(s)
- Ryan Feehan
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Daniel Montezano
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
- Joanna S G Slusky
  - Center for Computational Biology, The University of Kansas, 2030 Becker Dr., Lawrence, KS 66047-1620, USA
  - Department of Molecular Biosciences, The University of Kansas, 1200 Sunnyside Ave. Lawrence, KS 66045-7600, USA
|
145
|
Crippa M, Andreghetti D, Capelli R, Tiana G. Evolution of frustrated and stabilising contacts in reconstructed ancient proteins. Eur Biophys J 2021; 50:699-712. [PMID: 33569610 PMCID: PMC8260555 DOI: 10.1007/s00249-021-01500-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2020] [Revised: 12/14/2020] [Accepted: 01/13/2021] [Indexed: 11/30/2022]
Abstract
Energetic properties of a protein are a major determinant of its evolutionary fitness. Using a reconstruction algorithm, dating the reconstructed proteins and calculating the interaction network between their amino acids through a coevolutionary approach, we studied how the interactions that stabilise 890 proteins, belonging to five families, evolved over billions of years. In particular, we focused our attention on the network of most strongly attractive contacts and on that of poorly optimised, frustrated contacts. Our results support the idea that the cluster of most attractive interactions grows in size over evolutionary time, but from the data we cannot conclude that protein stability or the degree of frustration always tends to decrease.
Affiliation(s)
- Martina Crippa
  - Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133, Milan, Italy
  - Department of Applied Science and Technology, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129, Turin, Italy
- Damiano Andreghetti
  - Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133, Milan, Italy
- Riccardo Capelli
  - Department of Applied Science and Technology, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129, Turin, Italy
- Guido Tiana
  - Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133, Milan, Italy
|
146
|
Hawkins-Hooker A, Depardieu F, Baur S, Couairon G, Chen A, Bikard D. Generating functional protein variants with variational autoencoders. PLoS Comput Biol 2021; 17:e1008736. [PMID: 33635868 PMCID: PMC7946179 DOI: 10.1371/journal.pcbi.1008736] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Received: 08/19/2020] [Revised: 03/10/2021] [Accepted: 01/25/2021] [Indexed: 11/20/2022] Open
Abstract
The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.
Affiliation(s)
- Alex Hawkins-Hooker
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
- Florence Depardieu
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
- Sebastien Baur
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
- Guillaume Couairon
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
- Arthur Chen
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
- David Bikard
  - Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France
|
147
|
Fahrig-Kamarauskaitė J, Würth-Roderer K, Thorbjørnsrud HV, Mailand S, Krengel U, Kast P. Evolving the naturally compromised chorismate mutase from Mycobacterium tuberculosis to top performance. J Biol Chem 2020; 295:17514-17534. [PMID: 33453995 PMCID: PMC7762937 DOI: 10.1074/jbc.ra120.014924] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/22/2020] [Revised: 10/08/2020] [Indexed: 11/06/2022] Open
Abstract
Chorismate mutase (CM), an essential enzyme at the branch-point of the shikimate pathway, is required for the biosynthesis of phenylalanine and tyrosine in bacteria, archaea, plants, and fungi. MtCM, the CM from Mycobacterium tuberculosis, has less than 1% of the catalytic efficiency of a typical natural CM and requires complex formation with 3-deoxy-D-arabino-heptulosonate 7-phosphate synthase for high activity. To explore the full potential of MtCM for catalyzing its native reaction, we applied diverse iterative cycles of mutagenesis and selection, thereby raising kcat/Km 270-fold to 5 × 10^5 M^-1 s^-1, which is even higher than for the complex. Moreover, the evolutionarily optimized autonomous MtCM, which had 11 of its 90 amino acids exchanged, was stabilized compared with its progenitor, as indicated by a 9 °C increase in melting temperature. The 1.5 Å crystal structure of the top-evolved MtCM variant reveals the molecular underpinnings of this activity boost. Some acquired residues (e.g. Pro52 and Asp55) are conserved in naturally efficient CMs, but most of them lie beyond the active site. Our evolutionary trajectories reached a plateau at the level of the best natural enzymes, suggesting that we have exhausted the potential of MtCM. Taken together, these findings show that the scaffold of MtCM, which naturally evolved for mediocrity to enable inter-enzyme allosteric regulation of the shikimate pathway, is inherently capable of high activity.
Affiliation(s)
- Susanne Mailand
  - Laboratory of Organic Chemistry, ETH Zurich, Zurich, Switzerland
- Ute Krengel
  - Department of Chemistry, University of Oslo, Oslo, Norway
- Peter Kast
  - Laboratory of Organic Chemistry, ETH Zurich, Zurich, Switzerland
|
148
|
Bravi B, Tubiana J, Cocco S, Monasson R, Mora T, Walczak AM. RBM-MHC: A Semi-Supervised Machine-Learning Method for Sample-Specific Prediction of Antigen Presentation by HLA-I Alleles. Cell Syst 2020; 12:195-202.e9. [PMID: 33338400 PMCID: PMC7895905 DOI: 10.1016/j.cels.2020.11.005] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 05/19/2020] [Revised: 09/18/2020] [Accepted: 11/17/2020] [Indexed: 12/22/2022]
Abstract
The recent increase of immunopeptidomics data, obtained by mass spectrometry or binding assays, opens up possibilities for investigating endogenous antigen presentation by the highly polymorphic human leukocyte antigen class I (HLA-I) protein. State-of-the-art methods predict with high accuracy presentation by HLA alleles that are well represented in databases at the time of release but perform worse for rarer and less characterized alleles. Here, we introduce a method based on Restricted Boltzmann Machines (RBMs) for prediction of antigens presented on the Major Histocompatibility Complex (MHC) encoded by HLA genes: RBM-MHC. RBM-MHC can be trained on custom and newly available samples with no or a small amount of HLA annotations. RBM-MHC ensures improved predictions for rare alleles and matches state-of-the-art performance for well-characterized alleles while being less data demanding. RBM-MHC is shown to be a flexible and easily interpretable method that can be used as a predictor of cancer neoantigens and viral epitopes, as a tool for feature discovery, and to reconstruct peptide motifs presented on specific HLA molecules.
Affiliation(s)
- Barbara Bravi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Jérôme Tubiana
  - Blavatnik School of Computer Science, Tel Aviv University, 6139601 Tel Aviv, Israel
- Simona Cocco
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Rémi Monasson
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Thierry Mora
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
- Aleksandra M Walczak
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
|
149
|
Muntoni AP, Pagnani A, Weigt M, Zamponi F. Aligning biological sequences by exploiting residue conservation and coevolution. Phys Rev E 2020; 102:062409. [PMID: 33465950 DOI: 10.1103/physreve.102.062409] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/02/2020] [Accepted: 11/12/2020] [Indexed: 11/07/2022]
Abstract
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e., arranging sequences from different organisms in such a way as to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position specificities like conservation in sequences but assume an independent evolution of different positions. Over recent years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can capture the coevolution signal in sequence ensembles, and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and therefore to be universally applicable to protein- and RNA-sequence alignment without the need for complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.
Affiliation(s)
- Anna Paola Muntoni
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
- Andrea Pagnani
  - Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy
  - INFN, Sezione di Torino, Via Giuria 1, I-10125 Torino, Italy
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France
- Francesco Zamponi
  - Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
|
150
|
Mignon D, Druart K, Michael E, Opuu V, Polydorides S, Villa F, Gaillard T, Panel N, Archontis G, Simonson T. Physics-Based Computational Protein Design: An Update. J Phys Chem A 2020; 124:10637-10648. [DOI: 10.1021/acs.jpca.0c07605] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 02/06/2023]
Affiliation(s)
- David Mignon
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Karen Druart
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Eleni Michael
  - Department of Physics, University of Cyprus, PO20537, CY1678 Nicosia, Cyprus
- Vaitea Opuu
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Savvas Polydorides
  - Department of Physics, University of Cyprus, PO20537, CY1678 Nicosia, Cyprus
- Francesco Villa
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Thomas Gaillard
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Nicolas Panel
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
- Georgios Archontis
  - Department of Physics, University of Cyprus, PO20537, CY1678 Nicosia, Cyprus
- Thomas Simonson
  - Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128 Palaiseau, France
|