1
Nowak JS, Kruuse N, Rasmussen HØ, Tian P, Astono J, Schultz‐Nielsen S, Thøgersen MS, Stougaard P, Pedersen JS, Otzen DE. Quaternary stabilization of a GH2 β-galactosidase from the psychrophile A. ikkensis, a flexible and unstable dimeric enzyme. Protein Sci 2025; 34:e70141. [PMID: 40277444 PMCID: PMC12023411 DOI: 10.1002/pro.70141] [Received: 12/29/2024] [Revised: 04/09/2025] [Accepted: 04/11/2025] [Indexed: 04/26/2025]
Abstract
Studies of cold-active enzymes may elucidate the basis for low-temperature activity and contribute to their wider application in energy-efficient processes. Here we investigate the cold-active GH2 β-galactosidase from the psychrophilic bacterium Alkalilactibacillus ikkensis (AiLac). AiLac has a specific activity twice as high as its closest structural homolog (the mesophilic Escherichia coli GH2 β-galactosidase) toward the lactose analog ONPG at room temperature and neutral pH, and shows biphasic behavior in Michaelis-Menten plots. AiLac is activated by Mg2+ and Na+ and is most effective at pH 7.0 and 30°C. However, early unfolding events are observed already at room temperature. Stability studies using intrinsic fluorescence, circular dichroism, and small-angle x-ray scattering (SAXS), combined with activity assays, showed AiLac to be highly sensitive to heat and urea and to be stabilized, but also inhibited, by loss of structural flexibility induced by the osmolyte trehalose. AlphaFold structure prediction combined with SAXS and flow-induced dispersion analysis support a reversible monomer-dimer model, suggesting structural adaptation to cold temperatures on a quaternary level. The low amount of dimeric buried surface area, high flexibility, and remarkably low chemical and thermal stability present an extreme example of cold adaptation promoted by high levels of solvent interactions. To investigate the relationship between evolution and oligomerization, we trained a generative deep learning model to successfully engineer functional variants that form stabilized dimers and tetramers by introducing high evolutionary fitness mutations at the interface, demonstrating an efficient way to explore the local sequence fitness landscape to modulate the equilibrium of oligomerization.
Affiliation(s)
- Jan S. Nowak
  - Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
- Nikoline Kruuse
  - Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
- Julie Astono
  - Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
- Mariane S. Thøgersen
  - Department of Environmental Science, Aarhus University, Roskilde, Denmark
  - Present address: Zealand Academy of Technologies and Business, Roskilde, Denmark
- Peter Stougaard
  - Department of Environmental Science, Aarhus University, Roskilde, Denmark
- Jan Skov Pedersen
  - Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
  - Department of Chemistry, Aarhus University, Aarhus, Denmark
- Daniel E. Otzen
  - Interdisciplinary Nanoscience Center (iNANO), Aarhus University, Aarhus, Denmark
  - Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark
2
De Leonardis M, Pagnani A, Barrat-Charlaix P. Reconstruction of Ancestral Protein Sequences Using Autoregressive Generative Models. Mol Biol Evol 2025; 42:msaf070. [PMID: 40139916 PMCID: PMC12006719 DOI: 10.1093/molbev/msaf070] [Received: 09/16/2024] [Revised: 01/21/2025] [Accepted: 02/14/2025] [Indexed: 03/29/2025]
Abstract
Ancestral sequence reconstruction (ASR) is an important tool to understand how protein structure and function changed over the course of evolution. It essentially relies on models of sequence evolution that can quantitatively describe changes in a sequence over time. Such models usually consider that sequence positions evolve independently from each other and neglect epistasis: the context-dependence of the effect of mutations. On the other hand, the last years have seen major developments in the field of generative protein models, which learn constraints associated with structure and function from large ensembles of evolutionarily related proteins. Here, we show that it is possible to extend a specific type of generative model to describe the evolution of sequences in time while taking epistasis into account. We apply the developed technique to the problem of ASR: given a protein family and its evolutionary tree, we try to infer the sequences of extinct ancestors. Using both simulations and data coming from experimental evolution we show that our method outperforms state-of-the-art ones. Moreover, it allows for sampling a greater diversity of potential ancestors, allowing for a less biased characterization of ancestral sequences.
Affiliation(s)
- Matteo De Leonardis
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy
- Andrea Pagnani
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, Candiolo 10060, Italy
  - INFN, Sezione di Torino, Via Pietro Giuria 1, Torino 10125, Italy
3
Xu M, Dantu SC, Garnett JA, Bonomo RA, Pandini A, Haider S. Functionally important residues from graph analysis of coevolved dynamic couplings. eLife 2025; 14:RP105005. [PMID: 40153310 PMCID: PMC11952748 DOI: 10.7554/elife.105005] [Indexed: 03/30/2025]
Abstract
The relationship between protein dynamics and function is essential for understanding biological processes and developing effective therapeutics. Functional sites within proteins are critical for activities such as substrate binding, catalysis, and structural changes. Existing computational methods for the predictions of functional residues are trained on sequence, structural, and experimental data, but they do not explicitly model the influence of evolution on protein dynamics. This overlooked contribution is essential as it is known that evolution can fine-tune protein dynamics through compensatory mutations either to improve the proteins' performance or diversify its function while maintaining the same structural scaffold. To model this critical contribution, we introduce DyNoPy, a computational method that combines residue coevolution analysis with molecular dynamics simulations, revealing hidden correlations between functional sites. DyNoPy constructs a graph model of residue-residue interactions, identifies communities of key residue groups, and annotates critical sites based on their roles. By leveraging the concept of coevolved dynamical couplings-residue pairs with critical dynamical interactions that have been preserved during evolution-DyNoPy offers a powerful method for predicting and analysing protein evolution and dynamics. We demonstrate the effectiveness of DyNoPy on SHV-1 and PDC-3, chromosomally encoded β-lactamases linked to antibiotic resistance, highlighting its potential to inform drug design and address pressing healthcare challenges.
Affiliation(s)
- Manming Xu
  - UCL School of Pharmacy, London, United Kingdom
- James A Garnett
  - Centre for Host-Microbiome Interactions, Faculty of Dentistry, Oral & Craniofacial Sciences, King’s College London, London, United Kingdom
- Robert A Bonomo
  - Research Service, Louis Stokes Cleveland Department of Veterans Affairs Medical Center, Cleveland, United States
  - Department of Molecular Biology and Microbiology, Case Western Reserve University School of Medicine, Cleveland, United States
  - Department of Medicine, Case Western Reserve University School of Medicine, Cleveland, United States
  - Departments of Pharmacology, Biochemistry, and Proteomics and Bioinformatics, Case Western Reserve University School of Medicine, Cleveland, United States
  - CWRU-Cleveland VAMC Center for Antimicrobial Resistance and Epidemiology (Case VA CARES), Cleveland, United States
- Alessandro Pandini
  - Department of Computer Science, Brunel University London, Uxbridge, United Kingdom
- Shozeb Haider
  - UCL School of Pharmacy, London, United Kingdom
  - University of Tabuk (PFSCBR), Tabuk, Saudi Arabia
  - UCL Center for Advanced Research Computing, University College London, London, United Kingdom
4
Shukla D, Martin J, Morcos F, Potoyan DA. Thermal Adaptation of Cytosolic Malate Dehydrogenase Revealed by Deep Learning and Coevolutionary Analysis. J Chem Theory Comput 2025; 21:3277-3287. [PMID: 40079215 PMCID: PMC11948321 DOI: 10.1021/acs.jctc.4c01774] [Received: 12/31/2024] [Revised: 03/06/2025] [Accepted: 03/07/2025] [Indexed: 03/14/2025]
Abstract
Protein evolution has shaped enzymes that maintain stability and function across diverse thermal environments. While sequence variation, thermal stability and conformational dynamics are known to influence an enzyme's thermal adaptation, how these factors collectively govern stability and function across diverse temperatures remains unresolved. Cytosolic malate dehydrogenase (cMDH), a citric acid cycle enzyme, is an ideal model for studying these mechanisms due to its temperature-sensitive flexibility and broad presence in species from diverse thermal environments. In this study, we employ techniques inspired by deep learning and statistical mechanics to uncover how sequence variation and conformational dynamics shape patterns of cMDH's thermal adaptation. By integrating coevolutionary models with variational autoencoders (VAE), we generate a latent generative landscape (LGL) of the cMDH sequence space, enabling us to explore mutational pathways and predict fitness using direct coupling analysis (DCA). Structure predictions via AlphaFold and molecular dynamics simulations further illuminate how variations in hydrophobic interactions and conformational flexibility contribute to the thermal stability of warm- and cold-adapted cMDH orthologs. Notably, we identify the ratio of hydrophobic contacts between two regions as a predictive order parameter for thermal stability features, providing a quantitative metric for understanding cMDH dynamics across temperatures. The integrative computational framework employed in this study provides mechanistic insights into protein adaptation at both sequence and structural levels, offering unique perspectives on the evolution of thermal stability and creating avenues for the rational design of proteins with optimized thermal properties.
Affiliation(s)
- Divyanshu Shukla
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
- Jonathan Martin
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
- Faruck Morcos
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
  - Departments of Bioengineering and Physics, UT Dallas, Richardson, TX 75080, United States
  - Center for Systems Biology, UT Dallas, Richardson, TX 75080, United States
- Davit A. Potoyan
  - Department of Chemistry, Iowa State University, Ames, Iowa 50011, United States
  - Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
5
Thomas N, Belanger D, Xu C, Lee H, Hirano K, Iwai K, Polic V, Nyberg KD, Hoff KG, Frenz L, Emrich CA, Kim JW, Chavarha M, Ramanan A, Agresti JJ, Colwell LJ. Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Syst 2025; 16:101236. [PMID: 40081373 DOI: 10.1016/j.cels.2025.101236] [Received: 05/21/2024] [Revised: 09/17/2024] [Accepted: 02/19/2025] [Indexed: 03/16/2025]
Abstract
Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged fitness landscape and costly experiments. In this work, we present TeleProt, a machine learning (ML) framework that blends evolutionary and experimental data to design diverse protein libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments, TeleProt found a significantly better top-performing enzyme than directed evolution (DE), had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55,000 nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Neil Thomas
  - X, the Moonshot Factory, Mountain View, CA 94043, USA
- Jun W Kim
  - X, the Moonshot Factory, Mountain View, CA 94043, USA
- Abi Ramanan
  - X, the Moonshot Factory, Mountain View, CA 94043, USA
- Lucy J Colwell
  - Google DeepMind, Cambridge, MA 02142, USA; Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
6
Martí-Gómez C, Zhou J, Chen WC, Kinney JB, McCandlish DM. Inference and visualization of complex genotype-phenotype maps with gpmap-tools. bioRxiv 2025:2025.03.09.642267. [PMID: 40161830 PMCID: PMC11952336 DOI: 10.1101/2025.03.09.642267] [Indexed: 04/02/2025]
Abstract
Multiplex assays of variant effect (MAVEs) allow the functional characterization of an unprecedented number of sequence variants in both gene regulatory regions and protein coding sequences. This has enabled the study of nearly complete combinatorial libraries of mutational variants and revealed the widespread influence of higher-order genetic interactions that arise when multiple mutations are combined. However, the lack of appropriate tools for exploratory analysis of this high-dimensional data limits our overall understanding of the main qualitative properties of complex genotype-phenotype maps. To fill this gap, we have developed gpmap-tools (https://github.com/cmarti/gpmap-tools), a python library that integrates Gaussian process models for inference, phenotypic imputation, and error estimation from incomplete and noisy MAVE data and collections of natural sequences, together with methods for summarizing patterns of higher-order epistasis and non-linear dimensionality reduction techniques that allow visualization of genotype-phenotype maps containing up to millions of genotypes. Here, we used gpmap-tools to study the genotype-phenotype map of the Shine-Dalgarno sequence, a motif that modulates binding of the 16S rRNA to the 5' untranslated region (UTR) of mRNAs through base pair complementarity during translation initiation in prokaryotes. We inferred full combinatorial landscapes containing 262,144 different sequences from the sequences of 5,311 5'UTRs in the E. coli genome and from experimental MAVE data. Visualizations of the inferred landscapes were largely consistent with each other, and unveiled a simple molecular mechanism underlying the highly epistatic genotype-phenotype map of the Shine-Dalgarno sequence.
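The landscape size quoted in this abstract, 262,144 sequences, equals 4^9, which is what a complete combinatorial library over four nucleotides at nine positions would contain. The nine-position length is an inference from that count and is not stated in the abstract. A minimal sketch of such a genotype space in plain Python follows; it does not use the gpmap-tools API, and the function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"  # RNA nucleotides
LENGTH = 9         # assumption: inferred from 4**9 == 262,144, not stated in the abstract


def enumerate_genotypes(alphabet=ALPHABET, length=LENGTH):
    """Yield every sequence of the given length over the alphabet."""
    return ("".join(letters) for letters in product(alphabet, repeat=length))


def hamming_neighbors(seq, alphabet=ALPHABET):
    """Return all sequences that differ from seq at exactly one position."""
    neighbors = []
    for pos, current in enumerate(seq):
        for letter in alphabet:
            if letter != current:
                neighbors.append(seq[:pos] + letter + seq[pos + 1:])
    return neighbors


# Size of the full combinatorial landscape under the stated assumption.
n_genotypes = sum(1 for _ in enumerate_genotypes())
# Each genotype has length * (alphabet size - 1) single-mutation neighbors.
n_neighbors = len(hamming_neighbors("A" * LENGTH))
```

Under these assumptions each genotype has 27 single-nucleotide neighbors, which gives a sense of how sparse 5,311 observed sequences are relative to the full space.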
Affiliation(s)
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, FL, 32611
- Wei-Chia Chen
  - Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, Republic of China
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
7
Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol 2025; 43:396-405. [PMID: 38653796 PMCID: PMC11919684 DOI: 10.1038/s41587-024-02214-2] [Received: 03/08/2023] [Accepted: 03/20/2024] [Indexed: 04/25/2024]
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.
Affiliation(s)
- Xiaozhi Fu
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Sandra Viknander
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Clara Goldin
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Aleksej Zelezniak
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
  - Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania
  - Randall Centre for Cell & Molecular Biophysics, King's College London, Guy's Campus, London, UK
8
Nelson MG, Talavera D. Identification of coevolving positions by ancestral reconstruction. Commun Biol 2025; 8:329. [PMID: 40021815 PMCID: PMC11871020 DOI: 10.1038/s42003-025-07676-x] [Received: 04/03/2024] [Accepted: 02/05/2025] [Indexed: 03/03/2025]
Abstract
Coevolution within proteins occurs when changes in one position affect the selective pressure in another position to preserve the protein structure or function. The identification of coevolving positions within proteins remains contentious, with most methods disregarding the phylogenetic information. Here, we present a time-efficient approach for detecting coevolving pairs, which is almost perfect in terms of precision and specificity. It is based on maximum parsimony-based ancestral reconstruction followed by the identification of pairs with a depletion on separate changes when compared to their number of concurrent changes. Our analysis of a previously characterised biological dataset shows that the coevolving pairs that we identified tend to be close in the protein sequence and structure, slightly less solvent exposed and have a higher mutation rate. We also show how the ancestral reconstruction can be used to detect favourable and unfavourable amino acid combinations. Altogether, we demonstrate how this approach is essential for identifying pairs of positions with weak covariation patterns.
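The counting idea described in this abstract, comparing how often two positions change on the same branch with how often only one of them changes, can be illustrated with a short sketch. It assumes the ancestral states have already been reconstructed and are given as (parent, child) sequence pairs per branch; the toy branches, the function name, and the bare counts are hypothetical and do not reproduce the authors' actual statistic or thresholds.

```python
def count_changes(branches, i, j):
    """Count branches where positions i and j change together or separately.

    branches: iterable of (parent_sequence, child_sequence) pairs, one per
    tree branch, with ancestral sequences assumed to be reconstructed already.
    Returns (concurrent, separate).
    """
    concurrent = 0
    separate = 0
    for parent, child in branches:
        changed_i = parent[i] != child[i]
        changed_j = parent[j] != child[j]
        if changed_i and changed_j:
            concurrent += 1   # both positions substituted on this branch
        elif changed_i or changed_j:
            separate += 1     # only one of the two positions substituted
    return concurrent, separate


# Toy tree flattened into branches (hypothetical data).
toy_branches = [
    ("ACD", "ACD"),
    ("ACD", "GCE"),
    ("GCE", "GCE"),
    ("ACD", "GKE"),
    ("GKE", "GRE"),
]

pair_0_2 = count_changes(toy_branches, 0, 2)  # positions that always change together
pair_0_1 = count_changes(toy_branches, 0, 1)  # positions that mostly change apart
```

In this toy data, positions 0 and 2 show only concurrent changes, the pattern the abstract associates with coevolving pairs, whereas positions 0 and 1 mostly change separately.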
Affiliation(s)
- Michael G Nelson
  - Division of Cardiovascular Sciences, School of Medical Sciences, The University of Manchester, Oxford Road, Manchester, UK
- David Talavera
  - Division of Cardiovascular Sciences, School of Medical Sciences, The University of Manchester, Oxford Road, Manchester, UK
9
Prywes N, Phillips NR, Oltrogge LM, Lindner S, Taylor-Kearney LJ, Tsai YCC, de Pins B, Cowan AE, Chang HA, Wang RZ, Hall LN, Bellieny-Rabelo D, Nisonoff HM, Weissman RF, Flamholz AI, Ding D, Bhatt AY, Mueller-Cajar O, Shih PM, Milo R, Savage DF. A map of the rubisco biochemical landscape. Nature 2025; 638:823-828. [PMID: 39843747 PMCID: PMC11839469 DOI: 10.1038/s41586-024-08455-0] [Received: 04/22/2024] [Accepted: 11/26/2024] [Indexed: 01/24/2025]
Abstract
Rubisco is the primary CO2-fixing enzyme of the biosphere, yet it has slow kinetics. The roles of evolution and chemical mechanism in constraining its biochemical function remain debated. Engineering efforts aimed at adjusting the biochemical parameters of rubisco have largely failed, although recent results indicate that the functional potential of rubisco has a wider scope than previously known. Here we developed a massively parallel assay, using an engineered Escherichia coli in which enzyme activity is coupled to growth, to systematically map the sequence-function landscape of rubisco. Composite assay of more than 99% of single-amino acid mutants versus CO2 concentration enabled inference of enzyme velocity and apparent CO2 affinity parameters for thousands of substitutions. This approach identified many highly conserved positions that tolerate mutation and rare mutations that improve CO2 affinity. These data indicate that non-trivial biochemical changes are readily accessible and that the functional distance between rubiscos from diverse organisms can be traversed, laying the groundwork for further enzyme engineering efforts.
Affiliation(s)
- Noam Prywes
  - Innovative Genomics Institute, University of California Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA
- Naiya R Phillips
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
- Luke M Oltrogge
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
- Leah J Taylor-Kearney
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
- Yi-Chin Candace Tsai
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Benoit de Pins
  - Department of Biology, University of Naples Federico II, Naples, Italy
- Aidan E Cowan
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
  - Joint BioEnergy Institute, Lawrence Berkeley National Laboratory, Emeryville, CA, USA
- Hana A Chang
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
- Renée Z Wang
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
- Laina N Hall
  - Biophysics, University of California Berkeley, Berkeley, CA, USA
- Daniel Bellieny-Rabelo
  - Innovative Genomics Institute, University of California Berkeley, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences (QB3), University of California Berkeley, Berkeley, CA, USA
- Hunter M Nisonoff
  - Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Rachel F Weissman
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
- Avi I Flamholz
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- David Ding
  - Innovative Genomics Institute, University of California Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA
- Abhishek Y Bhatt
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
  - School of Medicine, University of California San Diego, La Jolla, CA, USA
- Oliver Mueller-Cajar
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Patrick M Shih
  - Innovative Genomics Institute, University of California Berkeley, Berkeley, CA, USA
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Feedstocks Division, Joint BioEnergy Institute, Emeryville, CA, USA
- Ron Milo
  - Department of Plant and Environmental Sciences, Weizmann Institute of Science, Rehovot, Israel
- David F Savage
  - Innovative Genomics Institute, University of California Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, CA, USA
10
Gelfand N, Orel V, Cui W, Damborský J, Li C, Prokop Z, Xie WJ, Warshel A. Biochemical and Computational Characterization of Haloalkane Dehalogenase Variants Designed by Generative AI: Accelerating the SN2 Step. J Am Chem Soc 2025; 147:2747-2755. [PMID: 39792627 DOI: 10.1021/jacs.4c15551] [Indexed: 01/12/2025]
Abstract
Generative artificial intelligence (AI) models trained on natural protein sequences have been used to design functional enzymes. However, their ability to predict individual reaction steps in enzyme catalysis remains unclear, limiting the potential use of sequence information for enzyme engineering. In this study, we demonstrated that sequence information can predict the rate of the SN2 step of a haloalkane dehalogenase using a generative maximum-entropy (MaxEnt) model. We then designed lower-order protein variants of haloalkane dehalogenase using the model. Kinetic measurements confirmed the successful design of protein variants that enhance catalytic activity, above that of the wild type, in the overall reaction and in particular in the SN2 step. On the simulation side, we provided molecular insights into these designs for the SN2 step using the empirical valence bond (EVB) and metadynamics simulations. The EVB calculations showed activation barriers consistent with experimental reaction rates, while examining the effect of amino acid replacements on the electrostatic effect on the activation barrier and the consequence of water penetration, as well as the extent of ground state destabilization/stabilization. Metadynamics simulations emphasize the importance of the substrate positioning in enzyme catalysis. Overall, our AI-guided approach successfully enabled the design of a variant with a faster rate for the SN2 step than the wild-type enzyme, despite haloalkane dehalogenase being extensively optimized through natural evolution.
Affiliation(s)
- Natalia Gelfand
  - Department of Chemistry, University of Southern California, Los Angeles, California 90089, United States
- Vojtech Orel
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, Brno 625 00, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, Brno 656 91, Czech Republic
- Wenqiang Cui
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32610, United States
- Jiří Damborský
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, Brno 625 00, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, Brno 656 91, Czech Republic
- Chenglong Li
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32610, United States
- Zbyněk Prokop
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, Brno 625 00, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, Brno 656 91, Czech Republic
- Wen Jun Xie
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32610, United States
- Arieh Warshel
  - Department of Chemistry, University of Southern California, Los Angeles, California 90089, United States
11
Schmelkin L, Carnevale V, Haldane A, Townsend JP, Chung S, Levy RM, Kumar S. Entrenchment and contingency in neutral protein evolution with epistasis. bioRxiv 2025:2025.01.09.632266. [PMID: 39868204 PMCID: PMC11761135 DOI: 10.1101/2025.01.09.632266] [Indexed: 01/28/2025]
Abstract
Protein sequence evolution in the presence of epistasis makes many previously acceptable amino acid residues at a site unfavorable over time. This phenomenon of entrenchment has also been observed with neutral substitutions using Potts Hamiltonian models. Here, we show that simulations using these models often evolve non-neutral proteins. We introduce a Neutral-with-Epistasis (N×E) model that incorporates purifying selection to conserve fitness, a requirement of neutral evolution. N×E protein evolution revealed a surprising lack of entrenchment, with site-specific amino-acid preferences remaining remarkably conserved, in biologically realistic time frames despite extensive residue coupling. Moreover, we found that the overdispersion of the molecular clock is caused by rate variation across sites introduced by epistasis in individual lineages, rather than by historical contingency. Therefore, substitutional entrenchment and rate contingency may indicate that adaptive and other non-neutral evolutionary processes were at play during protein evolution.
Collapse
Affiliation(s)
- Lisa Schmelkin
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
| | - Vincenzo Carnevale
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
| | - Allan Haldane
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
- Department of Chemistry, Temple University; Philadelphia, Pennsylvania 19122, USA
- Center for Biophysics and Computational Biology, Temple University; Philadelphia, Pennsylvania 19122, USA
| | | | - Sarah Chung
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
| | - Ronald M. Levy
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
- Department of Chemistry, Temple University; Philadelphia, Pennsylvania 19122, USA
- Center for Biophysics and Computational Biology, Temple University; Philadelphia, Pennsylvania 19122, USA
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
|
12
|
Sternke M, Tripp KW, Barrick D. Protein stability is determined by single-site bias rather than pairwise covariance. bioRxiv 2025:2025.01.09.632118. [PMID: 39868188 PMCID: PMC11760396 DOI: 10.1101/2025.01.09.632118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
The biases revealed in protein sequence alignments have been shown to provide information related to protein structure, stability, and function. For example, sequence biases at individual positions can be used to design consensus proteins that are often more stable than naturally occurring counterparts. Likewise, correlations between pairs of residues can be used to predict protein structures. Recent work using Potts models shows that together, single-site biases and pair correlations lead to improved predictions of protein fitness, activity, and stability. Here we use a Potts model to design groups of protein sequences with different amounts of single-site biases and pair correlations, and determine the thermodynamic stabilities of a representative set of sequences from each group. Surprisingly, sequences excluding pair correlations maximize stability, whereas sequences that maximize pair correlations are less stable, suggesting that pair correlations contribute to another aspect of protein fitness. Consistent with this interpretation, we find that for adenylate kinase, enzyme activity is greatly increased by maximizing pair correlations. The finding that elimination of covariant residue pairs increases protein stability suggests a route to enhance stability of designed proteins; indeed, this strategy produces hyperstable homeodomain and adenylate kinase proteins that retain significant activity.
Affiliation(s)
- Matt Sternke
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA
- Current address: Protein Design and Informatics, GSK, 1250 South Collegeville Rd, Collegeville, PA 19426 USA
| | - Katherine W. Tripp
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA
| | - Doug Barrick
- T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA
|
13
|
Lee B, White KI, Socolich M, Klureza MA, Henning R, Srajer V, Ranganathan R, Hekstra DR. Direct visualization of electric-field-stimulated ion conduction in a potassium channel. Cell 2025; 188:77-88.e15. [PMID: 39793560 PMCID: PMC11924917 DOI: 10.1016/j.cell.2024.12.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2023] [Revised: 02/22/2024] [Accepted: 12/08/2024] [Indexed: 01/13/2025]
Abstract
Understanding protein function would be facilitated by direct, real-time observation of chemical kinetics in the atomic structure. The selectivity filter (SF) of the K+ channel provides an ideal model, catalyzing the dehydration and transport of K+ ions across the cell membrane through a narrow pore. We used a "pump-probe" method called electric-field-stimulated time-resolved X-ray crystallography (EFX) to initiate and observe K+ conduction in the NaK2K channel in both directions on the timescale of the transport process. We observe both known and potentially new features in the high-energy conformations visited along the conduction pathway, including the associated dynamics of protein residues that control selectivity and conduction rate. A single time series of one channel in action shows the orderly appearance of features observed in diverse homologs with diverse methods, arguing for deep conservation of the dynamics underlying the reaction coordinate in this protein family.
Affiliation(s)
- BoRam Lee
- Center for Physics of Evolving Systems, Biochemistry & Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA; Modeling and Informatics, Discovery Chemistry, Merck & Co., Inc., South San Francisco, CA, USA
| | - K Ian White
- Department of Molecular and Cellular Physiology and HHMI, Stanford University, Stanford, CA, USA
| | - Michael Socolich
- Center for Physics of Evolving Systems, Biochemistry & Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA
| | - Margaret A Klureza
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Robert Henning
- Center for Advanced Radiation Sources, University of Chicago, Chicago, IL, USA
| | - Vukica Srajer
- Center for Advanced Radiation Sources, University of Chicago, Chicago, IL, USA
| | - Rama Ranganathan
- Center for Physics of Evolving Systems, Biochemistry & Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA; Center for Advanced Radiation Sources, University of Chicago, Chicago, IL, USA.
| | - Doeke R Hekstra
- Department of Molecular and Cell Biology and School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA.
|
14
|
Fernandez-de-Cossio-Diaz J. Generative Modeling of RNA Sequence Families with Restricted Boltzmann Machines. Methods Mol Biol 2025; 2847:163-175. [PMID: 39312143 DOI: 10.1007/978-1-0716-4079-1_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
In this chapter, we discuss the potential application of Restricted Boltzmann machines (RBM) to model sequence families of structured RNA molecules. RBMs are a simple two-layer machine learning model able to capture intricate sequence dependencies induced by secondary and tertiary structure, as well as mechanisms of structural flexibility, resulting in a model that can be successfully used for the design of allosteric RNA such as riboswitches. They have recently been experimentally validated as generative models for the SAM-I riboswitch aptamer domain sequence family. We introduce RBM mathematically and practically, providing self-contained code examples to download the necessary training sequence data, train the RBM, and sample novel sequences. We present in detail the implementation of algorithms necessary to use RBMs, focusing on applications in biological sequence modeling.
|
15
|
da Rocha W, Liberti L, Mucherino A, Malliavin TE. Influence of Stereochemistry in a Local Approach for Calculating Protein Conformations. J Chem Inf Model 2024; 64:8999-9008. [PMID: 39560315 DOI: 10.1021/acs.jcim.4c01232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2024]
Abstract
Protein structure prediction is generally based on the use of local conformational information coupled with long-range distance restraints. Such restraints can be derived from the knowledge of a template structure or the analysis of protein sequence alignment in the framework of models arising from the physics of disordered systems. The accuracy of approaches based on sequence alignment, however, is limited in the case where the number of aligned sequences is small. Here, we derive protein conformations using only knowledge of local conformations by means of the interval Branch-and-Prune algorithm. The computational efficiency is directly related to the knowledge of stereochemistry (bond angle and ω values) along the protein sequence and, in particular, to the variations of the torsion angle ω. The impact of stereochemistry variations is particularly strong in the case of protein topologies defined from numerous long-range restraints, as in the case of proteins with β secondary structure. The systematic enumeration of the conformations improves the efficiency of the calculations. Analysis of DNA codons makes it possible to connect the variations of the torsion angle ω to the positions of rare DNA codons.
Affiliation(s)
- Wagner da Rocha
- LIX CNRS, École Polytechnique, Institut Polytechnique de Paris, Palaiseau 91128, France
| | - Leo Liberti
- LIX CNRS, École Polytechnique, Institut Polytechnique de Paris, Palaiseau 91128, France
| | | | - Thérèse E Malliavin
- LPCT, UMR 7019 Université de Lorraine CNRS, Vandoeuvre-lès-Nancy 54500, France
|
16
|
Liao M, Feng S, Liu X, Xu G, Li S, Bai Y, Luo H, Yao B, Wang H, Tu T. Novel Insights into Enzymatic Thermostability: The "Short Board" Theory and Zero-Shot Hamiltonian Model. Adv Sci (Weinh) 2024; 11:e2402441. [PMID: 39308285 PMCID: PMC11615740 DOI: 10.1002/advs.202402441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 08/23/2024] [Indexed: 12/06/2024]
Abstract
Understanding the mechanism underlying thermostabilization in naturally stable enzymes and enhancing the thermostability of unstable enzymes are crucial aspects of enzyme engineering. Despite the development of various engineering methods, there remains substantial scope for improvement. In this study, a novel concept termed the "short board" theory is proposed, which conceptualizes proteins as barrels with each component representing a jagged board. Notably, optimizing modifications to the shortest board yields optimal enhancements in terms of thermostability performance. To validate this theory, α-amylase, an industrial bulk enzyme with multiple domains, is employed as a model enzyme. The existence of "short boards" and their impact on thermostability modification are demonstrated at the domain, residue, and atomic levels through experimental confirmation using domain substitution. Furthermore, a novel thermostable design and prediction model called Zero-Shot Hamiltonian (ZSH) is established and evaluated on α-amylase. This coevolutionary approach based on thermostability and deep learning exhibits remarkable success exclusively when applied to enzymes with fixed short boards. The integration of the "short board" theory with the ZSH model presents an innovative tool for enhancing enzymatic thermostability.
Affiliation(s)
- Min Liao
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | | | - Xiaoqing Liu
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Guoshun Xu
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Sicong Li
- Hangzhou Levinthal Biotech Ltd., Zhejiang 311200, China
| | - Yingguo Bai
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Huiying Luo
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Bin Yao
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Haobo Wang
- Hangzhou Levinthal Biotech Ltd., Zhejiang 311200, China
| | - Tao Tu
- State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China
|
17
|
Praljak N, Yeh H, Moore M, Socolich M, Ranganathan R, Ferguson AL. Natural Language Prompts Guide the Design of Novel Functional Protein Sequences. bioRxiv 2024:2024.11.11.622734. [PMID: 39605414 PMCID: PMC11601239 DOI: 10.1101/2024.11.11.622734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The advent of natural language interaction with machines has ushered in new innovations in text-guided generation of images, audio, video, and more. In this arena, we introduce Biological Multi-Modal Model (BioM3), as a novel framework for designing functional proteins via natural language prompts. This framework integrates natural language with protein design through a three-stage process: aligning protein and text representations in a joint embedding space learned using contrastive learning, refinement of the text embeddings, and conditional generation of protein sequences via a discrete autoregressive diffusion model. BioM3 synthesizes protein sequences with detailed descriptions of the protein structure, lineage, and function from text annotations to enable the conditional generation of novel sequences with desired attributes through natural language prompts. We present in silico validation of the model predictions for subcellular localization prediction, reaction classification, remote homology detection, scaffold in-painting, and structural plausibility, and in vivo and in vitro experimental tests of natural language prompt-designed synthetic analogs of Src-homology 3 (SH3) domain proteins that mediate signaling in the Sho1 osmotic stress response pathway in baker's yeast. BioM3 possesses state-of-the-art performance in zero-shot prediction and homology detection tasks, and generates proteins with native-like tertiary folds and wild-type levels of experimentally assayed function.
|
18
|
Faure AJ, Martí-Aranda A, Hidalgo-Carcedo C, Beltran A, Schmiedel JM, Lehner B. The genetic architecture of protein stability. Nature 2024; 634:995-1003. [PMID: 39322666 PMCID: PMC11499273 DOI: 10.1038/s41586-024-07966-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 08/20/2024] [Indexed: 09/27/2024]
Abstract
There are more ways to synthesize a 100-amino acid (aa) protein (20^100) than there are atoms in the universe. Only a very small fraction of such a vast sequence space can ever be experimentally or computationally surveyed. Deep neural networks are increasingly being used to navigate high-dimensional sequence spaces. However, these models are extremely complicated. Here, by experimentally sampling from sequence spaces larger than 10^10, we show that the genetic architecture of at least some proteins is remarkably simple, allowing accurate genetic prediction in high-dimensional sequence spaces with fully interpretable energy models. These models capture the nonlinear relationships between free energies and phenotypes but otherwise consist of additive free energy changes with a small contribution from pairwise energetic couplings. These energetic couplings are sparse and associated with structural contacts and backbone proximity. Our results indicate that protein genetics is actually both rather simple and intelligible.
Affiliation(s)
- Andre J Faure
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.
- ALLOX, Barcelona, Spain.
| | - Aina Martí-Aranda
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
| | - Cristina Hidalgo-Carcedo
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Antoni Beltran
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Jörn M Schmiedel
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- factorize.bio, Berlin, Germany
| | - Ben Lehner
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain.
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.
- Universitat Pompeu Fabra (UPF), Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
|
19
|
Di Bari L, Bisardi M, Cotogno S, Weigt M, Zamponi F. Emergent time scales of epistasis in protein evolution. Proc Natl Acad Sci U S A 2024; 121:e2406807121. [PMID: 39325427 PMCID: PMC11459137 DOI: 10.1073/pnas.2406807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2024] [Accepted: 08/17/2024] [Indexed: 09/27/2024]
Abstract
We introduce a data-driven epistatic model of protein evolution, capable of generating evolutionary trajectories spanning very different time scales reaching from individual mutations to diverged homologs. Our in silico evolution encompasses random nucleotide mutations, insertions and deletions, and models selection using a fitness landscape, which is inferred via a generative probabilistic model for protein families. We show that the proposed framework accurately reproduces the sequence statistics of both short-time (experimental) and long-time (natural) protein evolution, suggesting applicability also to relatively data-poor intermediate evolutionary time scales, which are currently inaccessible to evolution experiments. Our model uncovers a highly collective nature of epistasis, gradually changing the fitness effect of mutations in a diverging sequence context, rather than acting via strong interactions between individual mutations. This collective nature triggers the emergence of a long evolutionary time scale, separating fast mutational processes inside a given sequence context, from the slow evolution of the context itself. The model quantitatively reproduces epistatic phenomena such as contingency and entrenchment, as well as the loss of predictability in protein evolution observed in deep mutational scanning experiments of distant homologs. It thereby deepens our understanding of the interplay between mutation and selection in shaping protein diversity and functions, allows one to statistically forecast evolution, and challenges the prevailing independent-site models of protein evolution, which are unable to capture the fundamental importance of epistasis.
Affiliation(s)
- Leonardo Di Bari
- Dipartimento Scienza Applicata e Tecnologia, Politecnico di Torino, I-10129 Torino, Italy
| | - Matteo Bisardi
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, F-75005 Paris, France
| | - Sabrina Cotogno
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, F-75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, F-75005 Paris, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, 00185 Rome, Italy
|
20
|
Chen JZ, Bisardi M, Lee D, Cotogno S, Zamponi F, Weigt M, Tokuriki N. Understanding epistatic networks in the B1 β-lactamases through coevolutionary statistical modeling and deep mutational scanning. Nat Commun 2024; 15:8441. [PMID: 39349467 PMCID: PMC11442494 DOI: 10.1038/s41467-024-52614-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 09/16/2024] [Indexed: 10/02/2024]
Abstract
Throughout evolution, protein families undergo substantial sequence divergence while preserving structure and function. Although most mutations are deleterious, evolution can explore sequence space via epistatic networks of intramolecular interactions that alleviate the harmful mutations. However, comprehensive analysis of such epistatic networks across protein families remains limited. Thus, we conduct a family-wide analysis of the B1 metallo-β-lactamases, combining experiments (deep mutational scanning, DMS) on two distant homologs (NDM-1 and VIM-2) and computational analyses (in silico DMS based on Direct Coupling Analysis, DCA) of 100 homologs. The methods jointly reveal and quantify prevalent epistasis, as ~1/3 of equivalent mutations are epistatic in DMS. From DCA, half of the positions have a >6.5-fold difference in effective number of tolerated mutations across the entire family. Notably, both methods locate residues with the strongest epistasis in regions of intermediate residue burial, suggesting a balance of residue packing and mutational freedom in forming epistatic networks. We identify entrenched WT residues between NDM-1 and VIM-2 in DMS, which display statistically distinct behaviors in DCA from non-entrenched residues. Entrenched residues are not easily compensated by changes in single nearby interactions, reinforcing existing findings where a complex epistatic network compounds smaller effects from many interacting residues.
Affiliation(s)
- J Z Chen
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - M Bisardi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - D Lee
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - S Cotogno
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - F Zamponi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Dipartimento di Fisica, Sapienza Università di Roma, I-00185, Rome, Italy
| | - M Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - N Tokuriki
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada.
|
21
|
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024]
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Affiliation(s)
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Alia Abbara
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Subham Choudhury
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
22
|
Illig AM, Siedhoff NE, Davari MD, Schwaneberg U. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort. J Chem Inf Model 2024; 64:6350-6360. [PMID: 39088689 DOI: 10.1021/acs.jcim.4c00704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/03/2024]
Abstract
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
Affiliation(s)
| | - Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
| | - Mehdi D Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| | - Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
|
23
|
Lian X, Praljak N, Subramanian SK, Wasinger S, Ranganathan R, Ferguson AL. Deep-learning-based design of synthetic orthologs of SH3 signaling domains. Cell Syst 2024; 15:725-737.e7. [PMID: 39106868 PMCID: PMC11879475 DOI: 10.1016/j.cels.2024.07.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 11/12/2023] [Accepted: 07/22/2024] [Indexed: 08/09/2024]
Abstract
Evolution-based deep generative models represent an exciting direction in understanding and designing proteins. An open question is whether such models can learn specialized functional constraints that control fitness in specific biological contexts. Here, we examine the ability of generative models to produce synthetic versions of Src-homology 3 (SH3) domains that mediate signaling in the Sho1 osmotic stress response pathway of yeast. We show that a variational autoencoder (VAE) model produces artificial sequences that experimentally recapitulate the function of natural SH3 domains. More generally, the model organizes all fungal SH3 domains such that locality in the model latent space (but not simply locality in sequence space) enriches the design of synthetic orthologs and exposes non-obvious amino acid constraints distributed near and far from the SH3 ligand-binding site. The ability of generative models to design ortholog-like functions in vivo opens new avenues for engineering protein function in specific cellular contexts and environments.
Affiliation(s)
- Xinran Lian
- Department of Chemistry, University of Chicago, Chicago, IL 60637, USA
| | - Nikša Praljak
- Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL 60637, USA
| | - Subu K Subramanian
- Department of Molecular and Cell Biology, California Institute for Quantitative Biosciences (QB3), and Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Sarah Wasinger
- Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
| | - Rama Ranganathan
- Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA; Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA.
| | - Andrew L Ferguson
- Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA.
|
24
|
Lipsh-Sokolik R, Fleishman SJ. Addressing epistasis in the design of protein function. Proc Natl Acad Sci U S A 2024; 121:e2314999121. [PMID: 39133844 PMCID: PMC11348311 DOI: 10.1073/pnas.2314999121] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/29/2024]
Abstract
Mutations in protein active sites can dramatically improve function. The active site, however, is densely packed and extremely sensitive to mutations. Therefore, some mutations may only be tolerated in combination with others in a phenomenon known as epistasis. Epistasis reduces the likelihood of obtaining improved functional variants and dramatically slows natural and lab evolutionary processes. Research has shed light on the molecular origins of epistasis and its role in shaping evolutionary trajectories and outcomes. In addition, sequence- and AI-based strategies that infer epistatic relationships from mutational patterns in natural or experimental evolution data have been used to design functional protein variants. In recent years, combinations of such approaches and atomistic design calculations have successfully predicted highly functional combinatorial mutations in active sites. These were used to design thousands of functional active-site variants, demonstrating that, while our understanding of epistasis remains incomplete, some of the determinants that are critical for accurate design are now sufficiently understood. We conclude that the space of active-site variants that has been explored by evolution may be expanded dramatically to enhance natural activities or discover new ones. Furthermore, design opens the way to systematically exploring sequence and structure space and mutational impacts on function, deepening our understanding and control over protein activity.
Affiliation(s)
- Rosalie Lipsh-Sokolik
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
| | - Sarel J Fleishman
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
|
25
|
Kinshuk S, Li L, Meckes B, Chan CTY. Sequence-Based Protein Design: A Review of Using Statistical Models to Characterize Coevolutionary Traits for Developing Hybrid Proteins as Genetic Sensors. Int J Mol Sci 2024; 25:8320. [PMID: 39125888 PMCID: PMC11312098 DOI: 10.3390/ijms25158320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 07/23/2024] [Accepted: 07/26/2024] [Indexed: 08/12/2024] Open
Abstract
Statistical analyses of homologous protein sequences can identify amino acid residue positions that co-evolve to generate family members with different properties. Based on the hypothesis that the coevolution of residue positions is necessary for maintaining protein structure, coevolutionary traits revealed by statistical models provide insight into residue-residue interactions that are important for understanding protein mechanisms at the molecular level. With the rapid expansion of genome sequencing databases that facilitate statistical analyses, this sequence-based approach has been used to study a broad range of protein families. An emerging application of this approach is to design hybrid transcriptional regulators as modular genetic sensors for novel wiring between input signals and genetic elements to control outputs. Among many allosterically regulated regulator families, the members contain structurally conserved and functionally independent protein domains, including a DNA-binding module (DBM) for interacting with a specific genetic element and a ligand-binding module (LBM) for sensing an input signal. By hybridizing a DBM and an LBM from two different family members, a hybrid regulator can be created with a new combination of signal-detection and DNA-recognition properties not present in natural systems. In this review, we present recent advances in the development of hybrid regulators and their applications in cellular engineering, especially focusing on the use of statistical analyses for characterizing DBM-LBM interactions and hybrid regulator design. Based on these studies, we then discuss the current limitations and potential directions for enhancing the impact of this sequence-based design approach.
Affiliation(s)
- Sahaj Kinshuk
- Department of Biomedical Engineering, College of Engineering, University of North Texas, 3940 N Elm Street, Denton, TX 76207, USA; (S.K.); (L.L.); (B.M.)
| | - Lin Li
- Department of Biomedical Engineering, College of Engineering, University of North Texas, 3940 N Elm Street, Denton, TX 76207, USA; (S.K.); (L.L.); (B.M.)
| | - Brian Meckes
- Department of Biomedical Engineering, College of Engineering, University of North Texas, 3940 N Elm Street, Denton, TX 76207, USA; (S.K.); (L.L.); (B.M.)
- BioDiscovery Institute, University of North Texas, 1155 Union Circle #305220, Denton, TX 76203, USA
| | - Clement T. Y. Chan
- Department of Biomedical Engineering, College of Engineering, University of North Texas, 3940 N Elm Street, Denton, TX 76207, USA; (S.K.); (L.L.); (B.M.)
- BioDiscovery Institute, University of North Texas, 1155 Union Circle #305220, Denton, TX 76203, USA
|
26
|
Yan H, Tan X, Zou S, Sun Y, Ke A, Tang W. Assessing and engineering the IscB-ωRNA system for programmed genome editing. Nat Chem Biol 2024:10.1038/s41589-024-01669-3. [PMID: 38977787 DOI: 10.1038/s41589-024-01669-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Accepted: 06/07/2024] [Indexed: 07/10/2024]
Abstract
OMEGA RNA (ωRNA)-guided endonuclease IscB, the evolutionary ancestor of Cas9, is an attractive system for in vivo genome editing because of its compact size and mechanistic resemblance to Cas9. However, wild-type IscB-ωRNA systems show limited activity in human cells. Here we report enhanced OgeuIscB, which, with eight amino acid substitutions, displayed a fourfold increase in in vitro DNA-binding affinity and a 30.4-fold improvement in insertion-deletion (indel) formation efficiency in human cells. Paired with structure-guided ωRNA engineering, the enhanced OgeuIscB-ωRNA systems efficiently edited the human genome across 26 target sites, attaining up to 87.3% indel and 62.2% base-editing frequencies. Both wild-type and engineered OgeuIscB-ωRNA showed moderate fidelity in editing the human genome, with off-target profiles revealing key determinants of target selection including an NARR target-adjacent motif (TAM) and the TAM-proximal 14 nucleotides in the R-loop. Collectively, our engineered OgeuIscB-ωRNA systems are programmable, potent and sufficiently specific for human genome editing.
Affiliation(s)
- Hao Yan
- Department of Chemistry, The University of Chicago, Chicago, IL, USA
- Institute for Biophysical Dynamics, The University of Chicago, Chicago, IL, USA
| | - Xiaoqing Tan
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Siyuan Zou
- Department of Chemistry, The University of Chicago, Chicago, IL, USA
- Institute for Biophysical Dynamics, The University of Chicago, Chicago, IL, USA
| | - Yihong Sun
- Department of Chemistry, The University of Chicago, Chicago, IL, USA
- Institute for Biophysical Dynamics, The University of Chicago, Chicago, IL, USA
| | - Ailong Ke
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA.
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA.
| | - Weixin Tang
- Department of Chemistry, The University of Chicago, Chicago, IL, USA.
- Institute for Biophysical Dynamics, The University of Chicago, Chicago, IL, USA.
|
27
|
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 PMCID: PMC11228481 DOI: 10.1073/pnas.2311807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from the physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations, integrating physical modeling and machine learning, has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in the computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins, leading to improved evolutionary modeling; and, finally, how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Affiliation(s)
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | - Marcos Lequerica Mateos
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
| | - José N. Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Department of Physics and Astronomy, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Department of BioSciences, Rice University, Houston, TX 77005
| | - Ivan Coluzza
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
- Basque Foundation for Science, Ikerbasque, Bilbao 48940, Spain
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
|
28
|
Guan A, He Z, Wang X, Jia ZJ, Qin J. Engineering the next-generation synthetic cell factory driven by protein engineering. Biotechnol Adv 2024; 73:108366. [PMID: 38663492 DOI: 10.1016/j.biotechadv.2024.108366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2023] [Revised: 03/21/2024] [Accepted: 04/22/2024] [Indexed: 05/09/2024]
Abstract
Synthetic cell factories offer substantial advantages in the economically efficient production of biofuels, chemicals, and pharmaceutical compounds. However, to create a high-performance synthetic cell factory, precise regulation of cellular material and energy flux is essential. In this context, protein components including enzymes, transcription factor-based biosensors and transporters play pivotal roles. Protein engineering aims to create novel protein variants with desired properties by modifying or designing protein sequences. This review summarizes the latest advancements of protein engineering in optimizing various aspects of synthetic cell factories, including: enhancing enzyme activity to eliminate production bottlenecks, altering enzyme selectivity to steer metabolic pathways towards desired products, modifying enzyme promiscuity to explore innovative routes, and improving the efficiency of transporters. Furthermore, the use of protein engineering to modify protein-based biosensors accelerates evolutionary processes and optimizes the regulation of metabolic pathways. The remaining challenges and future opportunities in this field are also discussed.
Affiliation(s)
- Ailin Guan
- College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China
| | - Zixi He
- College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China
| | - Xin Wang
- West China School of Pharmacy, Sichuan University, Chengdu 610041, China
| | - Zhi-Jun Jia
- West China School of Pharmacy, Sichuan University, Chengdu 610041, China
| | - Jiufu Qin
- College of Biomass Science and Engineering, Sichuan University, Chengdu 610065, China.
|
29
|
Zhou L, Tao C, Shen X, Sun X, Wang J, Yuan Q. Unlocking the potential of enzyme engineering via rational computational design strategies. Biotechnol Adv 2024; 73:108376. [PMID: 38740355 DOI: 10.1016/j.biotechadv.2024.108376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 04/27/2024] [Accepted: 05/08/2024] [Indexed: 05/16/2024]
Abstract
Enzymes play a pivotal role in various industries by enabling efficient, eco-friendly, and sustainable chemical processes. However, the low turnover rates and poor substrate selectivity of enzymes limit their large-scale applications. Rational computational enzyme design, facilitated by computational algorithms, offers a more targeted and less labor-intensive approach. Although there has been notable advancement in employing rational computational protein engineering strategies to overcome these issues, this progress has not been comprehensively reviewed so far. This article reviews recent developments in rational computational enzyme design, categorizing them into three types: structure-based, sequence-based, and data-driven machine learning computational design. Case studies are presented to demonstrate successful enhancements in catalytic activity, stability, and substrate selectivity. Lastly, the article provides a thorough analysis of these approaches, highlights existing challenges and potential solutions, and offers insights into future development directions.
Affiliation(s)
- Lei Zhou
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Chunmeng Tao
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Xiaolin Shen
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Xinxiao Sun
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China
| | - Jia Wang
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China.
| | - Qipeng Yuan
- State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, Beijing 100029, China.
|
30
|
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with limited mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline achieves performance comparable to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
| | - Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
|
31
|
Fram B, Su Y, Truebridge I, Riesselman AJ, Ingraham JB, Passera A, Napier E, Thadani NN, Lim S, Roberts K, Kaur G, Stiffler MA, Marks DS, Bahl CD, Khan AR, Sander C, Gauthier NP. Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Nat Commun 2024; 15:5141. [PMID: 38902262 PMCID: PMC11190266 DOI: 10.1038/s41467-024-49119-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 05/24/2024] [Indexed: 06/22/2024] Open
Abstract
A major challenge in protein design is to augment existing functional proteins with multiple property enhancements. Altering several properties likely necessitates numerous primary sequence changes, and novel methods are needed to accurately predict combinations of mutations that maintain or enhance function. Models of sequence co-variation (e.g., EVcouplings), which leverage extensive information about various protein properties and activities from homologous protein sequences, have proven effective for many applications including structure determination and mutation effect prediction. We apply EVcouplings to computationally design variants of the model protein TEM-1 β-lactamase. Nearly all the 14 experimentally characterized designs were functional, including one with 84 mutations from the nearest natural homolog. The designs also had large increases in thermostability, increased activity on multiple substrates, and nearly identical structure to the wild type enzyme. This study highlights the efficacy of evolutionary models in guiding large sequence alterations to generate functional diversity for protein design applications.
Affiliation(s)
- Benjamin Fram
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
| | - Yang Su
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Ian Truebridge
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Adam J Riesselman
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Program in Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - John B Ingraham
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Alessandro Passera
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, 1030, Vienna, Austria
| | - Eve Napier
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
| | - Nicole N Thadani
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Apriori Bio, Cambridge, MA, USA
| | - Samuel Lim
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Kristen Roberts
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Gurleen Kaur
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Michael A Stiffler
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Dyno Therapeutics, 343 Arsenal Street, Watertown, MA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Christopher D Bahl
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Amir R Khan
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
- Division of Newborn Medicine, Boston Children's Hospital, Boston, MA, USA
| | - Chris Sander
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Nicholas P Gauthier
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
|
32
|
Calvanese F, Lambert CN, Nghe P, Zamponi F, Weigt M. Towards parsimonious generative modeling of RNA families. Nucleic Acids Res 2024; 52:5465-5477. [PMID: 38661206 PMCID: PMC11162787 DOI: 10.1093/nar/gkae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2023] [Revised: 03/05/2024] [Accepted: 04/05/2024] [Indexed: 04/26/2024] Open
Abstract
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
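Editorial note: the size of the sequence space quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below is not from the cited paper; the function name `log10_sequence_space` is hypothetical, and it assumes a four-letter nucleotide alphabet (A, C, G, U) with no gaps. The exponent 39 for functional sequences is taken from the abstract, not computed here.

```python
import math


def log10_sequence_space(length: int, alphabet_size: int = 4) -> float:
    """Return log10 of the number of sequences of the given length.

    Assumes every position can independently hold any of `alphabet_size`
    symbols, so the count is alphabet_size ** length.
    """
    return length * math.log10(alphabet_size)


# 136-nucleotide sequences over a 4-letter alphabet (assumption: no gap symbol).
total_exponent = log10_sequence_space(136)

# Exponent of the functional-sequence estimate as quoted in the abstract.
functional_exponent = 39.0

print(f"log10(possible sequences)  = {total_exponent:.1f}")
print(f"log10(functional fraction) = {functional_exponent - total_exponent:.1f}")
```

Under these assumptions the total comes out just below 10^82, and the quoted functional set is roughly a 10^-43 fraction of it, which matches the abstract's description of "a tiny fraction".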
Affiliation(s)
- Francesco Calvanese
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Camille N Lambert
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Philippe Nghe
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
|
33
|
Han Y, Zhang H, Zeng Z, Liu Z, Lu D, Liu Z. Descriptor-augmented machine learning for enzyme-chemical interaction predictions. Synth Syst Biotechnol 2024; 9:259-268. [PMID: 38450325 PMCID: PMC10915406 DOI: 10.1016/j.synbio.2024.02.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2023] [Revised: 02/21/2024] [Accepted: 02/22/2024] [Indexed: 03/08/2024] Open
Abstract
Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals, as they can characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined the effects of various descriptors on the performance of a Random Forest model used for predicting enzyme-chemical relationships. We curated activity data of seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relations between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). The results showed that protein descriptors significantly enhanced the classification performance of the model in the new enzyme evaluation in three out of the seven datasets with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the esm-2 descriptor achieved the best results. Using enzyme families as labels showed that descriptors could cluster proteins well, which could explain the contributions of descriptors to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors made a significant improvement in four out of the seven datasets, while protein descriptors appeared to have no effect. We attempted to evaluate the generalization ability of the model by correlating the statistics of the datasets with the performance of the models. The results showed that datasets with higher sequence similarity were more likely to get better results in the new enzyme evaluation and datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.
Affiliation(s)
- Yilei Han
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
| | - Haoye Zhang
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Zheni Zeng
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Zhiyuan Liu
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Diannan Lu
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
| | - Zheng Liu
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
|
34
|
Wang D, Frechette LB, Best RB. On the role of native contact cooperativity in protein folding. Proc Natl Acad Sci U S A 2024; 121:e2319249121. [PMID: 38776371 PMCID: PMC11145220 DOI: 10.1073/pnas.2319249121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2023] [Accepted: 04/11/2024] [Indexed: 05/25/2024] Open
Abstract
The consistency of energy landscape theory predictions with available experimental data, as well as direct evidence from molecular simulations, have shown that protein folding mechanisms are largely determined by the contacts present in the native structure. As expected, native contacts are generally energetically favorable. However, there are usually at least as many energetically favorable nonnative pairs owing to the greater number of possible nonnative interactions. This apparent frustration must therefore be reduced by the greater cooperativity of native interactions. In this work, we analyze the statistics of contacts in the unbiased all-atom folding trajectories obtained by Shaw and coworkers, focusing on the unfolded state. By computing mutual cooperativities between contacts formed in the unfolded state, we show that native contacts form the most cooperative pairs, while cooperativities among nonnative or between native and nonnative contacts are typically much less favorable or even anticooperative. Furthermore, we show that the largest network of cooperative interactions observed in the unfolded state consists mainly of native contacts, suggesting that this set of mutually reinforcing interactions has evolved to stabilize the native state.
Affiliation(s)
- David Wang
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218
| | - Layne B. Frechette
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
- Martin A. Fisher School of Physics, Brandeis University, Waltham, MA 02453
| | - Robert B. Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
|
35
|
Syrén PO. Ancestral terpene cyclases: From fundamental science to applications in biosynthesis. Methods Enzymol 2024; 699:311-341. [PMID: 38942509 DOI: 10.1016/bs.mie.2024.04.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/30/2024]
Abstract
Terpenes constitute one of the largest families of natural products with potent applications as renewable platform chemicals and medicines. The low activity, selectivity and stability displayed by terpene biosynthetic machineries can constitute an obstacle towards achieving expedient biosynthesis of terpenoids in processes that adhere to the 12 principles of green chemistry. Accordingly, engineering of terpene synthase enzymes is a prerequisite for industrial biotechnology applications, but is obstructed by their complex catalysis, which depends on reactive carbocationic intermediates that are prone to undergo bifurcation mechanisms. Rational redesign of terpene synthases can be tedious and requires high-resolution structural information, which is not always available. Furthermore, it has proven difficult to link sequence space of terpene synthase enzymes to specific product profiles. Herein, the author shows how ancestral sequence reconstruction (ASR) can favorably be used as a protein engineering tool in the redesign of terpene synthases without the need for a structure, and without excessive screening. A detailed workflow of ASR is presented along with associated limitations, with a focus on applying this methodology to terpene synthases. From selected examples of both class I and II enzymes, the author advocates that ancestral terpene cyclases constitute valuable assets to shed light on terpene-synthase catalysis and in enabling accelerated biosynthesis.
Affiliation(s)
- Per-Olof Syrén
- School of Chemistry, Biotechnology and Health, Science for Life Laboratory, KTH Royal Institute of Technology, Solna, Sweden; School of Engineering Sciences in Chemistry, Biotechnology and Health, Department of Fibre and Polymer Technology, KTH Royal Institute of Technology, Stockholm, Sweden.
36
Jansen S, Mayer C. A Robust Growth-Based Selection Platform to Evolve an Enzyme via Dependency on Noncanonical Tyrosine Analogues. JACS AU 2024; 4:1583-1590. [PMID: 38665651 PMCID: PMC11040555 DOI: 10.1021/jacsau.4c00070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Revised: 02/21/2024] [Accepted: 02/22/2024] [Indexed: 04/28/2024]
Abstract
Growth-based selections evaluate the fitness of individual organisms at a population level. In enzyme engineering, such growth selections allow for the rapid and straightforward identification of highly efficient biocatalysts from extensive libraries. However, selection-based improvement of (synthetically useful) biocatalysts is challenging, as they require highly dependable strategies that artificially link their activities to host survival. Here, we showcase a robust and scalable growth-based selection platform centered around the complementation of noncanonical amino acid-dependent bacteria. Specifically, we demonstrate how serial passaging of populations featuring millions of carbamoylase variants autonomously selects biocatalysts with up to 90,000-fold higher initial rates. Notably, selection of replicate populations enriched diverse biocatalysts, which feature distinct amino acid motifs that drastically boost carbamoylase activity. As beneficial substitutions also originated from unintended copying errors during library preparation or cell division, we anticipate that our growth-based selection platform will be applicable to the continuous, autonomous evolution of diverse biocatalysts in the future.
Affiliation(s)
- Suzanne C. Jansen
- Stratingh Institute for Chemistry, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands
- Clemens Mayer
- Stratingh Institute for Chemistry, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands
37
Nguyen TN, Ingle C, Thompson S, Reynolds KA. The genetic landscape of a metabolic interaction. Nat Commun 2024; 15:3351. [PMID: 38637543 PMCID: PMC11026382 DOI: 10.1038/s41467-024-47671-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 04/09/2024] [Indexed: 04/20/2024] Open
Abstract
While much prior work has explored the constraints on protein sequence and evolution induced by physical protein-protein interactions, the sequence-level constraints emerging from non-binding functional interactions in metabolism remain unclear. To quantify how variation in the activity of one enzyme constrains the biochemical parameters and sequence of another, we focus on dihydrofolate reductase (DHFR) and thymidylate synthase (TYMS), a pair of enzymes catalyzing consecutive reactions in folate metabolism. We use deep mutational scanning to quantify the growth rate effect of 2696 DHFR single mutations in 3 TYMS backgrounds under conditions selected to emphasize biochemical epistasis. Our data are well-described by a relatively simple enzyme velocity to growth rate model that quantifies how metabolic context tunes enzyme mutational tolerance. Together our results reveal the structural distribution of epistasis in a metabolic enzyme and establish a foundation for the design of multi-enzyme systems.
Affiliation(s)
- Thuy N Nguyen
- The Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- The Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- The Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Form Bio, Dallas, TX, 75226, USA
- Christine Ingle
- The Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- The Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- The Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Samuel Thompson
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA, 94158, USA
- Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
- Kimberly A Reynolds
- The Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- The Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- The Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
38
Prywes N, Philips NR, Oltrogge LM, Lindner S, Candace Tsai YC, de Pins B, Cowan AE, Taylor-Kearney LJ, Chang HA, Hall LN, Bellieny-Rabelo D, Nisonoff HM, Weissman RF, Flamholz AI, Ding D, Bhatt AY, Shih PM, Mueller-Cajar O, Milo R, Savage DF. A map of the rubisco biochemical landscape. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.09.27.559826. [PMID: 38645011 PMCID: PMC11030240 DOI: 10.1101/2023.09.27.559826] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 04/23/2024]
Abstract
Rubisco is the primary CO2 fixing enzyme of the biosphere yet has slow kinetics. The roles of evolution and chemical mechanism in constraining the sequence landscape of rubisco remain debated. In order to map sequence to function, we developed a massively parallel assay for rubisco using an engineered E. coli where enzyme function is coupled to growth. By assaying >99% of single amino acid mutants across CO2 concentrations, we inferred enzyme velocity and CO2 affinity for thousands of substitutions. We identified many highly conserved positions that tolerate mutation and rare mutations that improve CO2 affinity. These data suggest that non-trivial kinetic improvements are readily accessible and provide a comprehensive sequence-to-function mapping for enzyme engineering efforts.
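Inferring a velocity and an affinity from measurements taken across substrate concentrations amounts to fitting a saturation curve. The abstract does not state the fitting procedure, so the following is only a minimal Python sketch, assuming Michaelis-Menten kinetics and a Hanes-Woolf linearization; the function names and the toy CO2 concentrations and rates are hypothetical and are not taken from the study.

```python
from statistics import mean


def michaelis_menten(s, vmax, km):
    """Velocity at substrate concentration s under Michaelis-Menten kinetics."""
    return vmax * s / (km + s)


def fit_hanes_woolf(concs, rates):
    """Estimate (vmax, km) from paired concentrations and rates.

    Uses the Hanes-Woolf form s/v = s/vmax + km/vmax and an ordinary
    least-squares line through the points (s, s/v).
    """
    xs = list(concs)
    ys = [s / v for s, v in zip(concs, rates)]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    # slope = 1/vmax and intercept = km/vmax
    return 1.0 / slope, intercept / slope


# Hypothetical, noise-free measurements for a single variant (arbitrary units).
co2 = [0.5, 1.0, 2.0, 5.0, 10.0]
rates = [michaelis_menten(s, vmax=3.0, km=1.5) for s in co2]
vmax_hat, km_hat = fit_hanes_woolf(co2, rates)
```

With real, noisy data a nonlinear least-squares fit is usually preferred, because linearized forms distort the error structure; the linear version is used here only to keep the sketch dependency-free.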
Affiliation(s)
- Noam Prywes
- Innovative Genomics Institute, University of California; Berkeley, California 94720, USA
- Howard Hughes Medical Institute, University of California; Berkeley, California 94720, USA
- Naiya R Philips
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
- Luke M Oltrogge
- Howard Hughes Medical Institute, University of California; Berkeley, California 94720, USA
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
- Yi-Chin Candace Tsai
- School of Biological Sciences, Nanyang Technological University; Singapore 637551, Singapore
- Benoit de Pins
- Department of Plant and Environmental Sciences, Weizmann Institute of Science; Rehovot 76100, Israel
- Aidan E Cowan
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
- Joint BioEnergy Institute, Lawrence Berkeley National Laboratory; Emeryville, CA 94608, USA
- Leah J Taylor-Kearney
- Department of Plant and Microbial Biology, University of California, Berkeley; Berkeley, CA 94720, USA
- Hana A Chang
- Department of Plant and Microbial Biology, University of California, Berkeley; Berkeley, CA 94720, USA
- Laina N Hall
- Biophysics, University of California, Berkeley; Berkeley, CA 94720, USA
- Daniel Bellieny-Rabelo
- Innovative Genomics Institute, University of California; Berkeley, California 94720, USA
- California Institute for Quantitative Biosciences (QB3), University of California; Berkeley, CA 94720, USA
- Hunter M Nisonoff
- Center for Computational Biology, University of California, Berkeley; Berkeley, CA, USA
- Rachel F Weissman
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
- Avi I Flamholz
- Division of Biology and Biological Engineering, California Institute of Technology; Pasadena, CA 91125
- David Ding
- Innovative Genomics Institute, University of California; Berkeley, California 94720, USA
- Howard Hughes Medical Institute, University of California; Berkeley, California 94720, USA
- Abhishek Y Bhatt
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
- School of Medicine, University of California, San Diego; La Jolla, CA 92092, USA
- Patrick M Shih
- Innovative Genomics Institute, University of California; Berkeley, California 94720, USA
- Department of Plant and Microbial Biology, University of California, Berkeley; Berkeley, CA 94720, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory; Berkeley, CA 94720, USA
- Feedstocks Division, Joint BioEnergy Institute; Emeryville, CA 94608, USA
- Oliver Mueller-Cajar
- School of Biological Sciences, Nanyang Technological University; Singapore 637551, Singapore
- Ron Milo
- Department of Plant and Environmental Sciences, Weizmann Institute of Science; Rehovot 76100, Israel
- David F Savage
- Innovative Genomics Institute, University of California; Berkeley, California 94720, USA
- Howard Hughes Medical Institute, University of California; Berkeley, California 94720, USA
- Department of Molecular and Cell Biology, University of California; Berkeley, California 94720, USA
39
Bibik P, Alibai S, Pandini A, Dantu SC. PyCoM: a python library for large-scale analysis of residue-residue coevolution data. Bioinformatics 2024; 40:btae166. [PMID: 38532297 PMCID: PMC11009027 DOI: 10.1093/bioinformatics/btae166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/02/2024] [Accepted: 03/25/2024] [Indexed: 03/28/2024] Open
Abstract
MOTIVATION Computational methods to detect correlated amino acid positions in proteins have become a valuable tool to predict intra- and inter-residue protein contacts, protein structures, and effects of mutation on protein stability and function. While there are many tools and webservers to compute coevolution scoring matrices, there is no central repository of alignments and coevolution matrices for large-scale studies and pattern detection leveraging on biological and structural annotations already available in UniProt. RESULTS We present a Python library, PyCoM, which enables users to query and analyze coevolution matrices and sequence alignments of 457 622 proteins, selected from UniProtKB/Swiss-Prot database (length ≤ 500 residues), from a precompiled coevolution matrix database (PyCoMdb). PyCoM facilitates the development of statistical analyses of residue coevolution patterns using filters on biological and structural annotations from UniProtKB/Swiss-Prot, with simple access to PyCoMdb for both novice and advanced users, supporting Jupyter Notebooks, Python scripts, and a web API access. The resource is open source and will help in generating data-driven computational models and methods to study and understand protein structures, stability, function, and design. AVAILABILITY AND IMPLEMENTATION PyCoM code is freely available from https://github.com/scdantu/pycom and PyCoMdb and the Jupyter Notebook tutorials are freely available from https://pycom.brunel.ac.uk.
Affiliation(s)
- Philipp Bibik
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Sabriyeh Alibai
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Alessandro Pandini
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Sarath Chandra Dantu
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
40
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS CENTRAL SCIENCE 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
41
Nam K, Shao Y, Major DT, Wolf-Watz M. Perspectives on Computational Enzyme Modeling: From Mechanisms to Design and Drug Development. ACS OMEGA 2024; 9:7393-7412. [PMID: 38405524 PMCID: PMC10883025 DOI: 10.1021/acsomega.3c09084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 01/15/2024] [Accepted: 01/19/2024] [Indexed: 02/27/2024]
Abstract
Understanding enzyme mechanisms is essential for unraveling the complex molecular machinery of life. In this review, we survey the field of computational enzymology, highlighting key principles governing enzyme mechanisms and discussing ongoing challenges and promising advances. Over the years, computer simulations have become indispensable in the study of enzyme mechanisms, with the integration of experimental and computational exploration now established as a holistic approach to gain deep insights into enzymatic catalysis. Numerous studies have demonstrated the power of computer simulations in characterizing reaction pathways, transition states, substrate selectivity, product distribution, and dynamic conformational changes for various enzymes. Nevertheless, significant challenges remain in investigating the mechanisms of complex multistep reactions, large-scale conformational changes, and allosteric regulation. Beyond mechanistic studies, computational enzyme modeling has emerged as an essential tool for computer-aided enzyme design and the rational discovery of covalent drugs for targeted therapies. Overall, enzyme design/engineering and covalent drug development can greatly benefit from our understanding of the detailed mechanisms of enzymes, such as protein dynamics, entropy contributions, and allostery, as revealed by computational studies. Such a convergence of different research approaches is expected to continue, creating synergies in enzyme research. This review, by outlining the ever-expanding field of enzyme research, aims to provide guidance for future research directions and facilitate new developments in this important and evolving field.
Affiliation(s)
- Kwangho Nam
- Department of Chemistry and Biochemistry, University of Texas at Arlington, Arlington, Texas 76019, United States
- Yihan Shao
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma 73019-5251, United States
- Dan T. Major
- Department of Chemistry and Institute for Nanotechnology & Advanced Materials, Bar-Ilan University, Ramat-Gan 52900, Israel
42
Dramé-Maigné A, Espada R, McCallum G, Sieskind R, Gines G, Rondelez Y. In Vitro Enzyme Self-Selection Using Molecular Programs. ACS Synth Biol 2024; 13:474-484. [PMID: 38206581 DOI: 10.1021/acssynbio.3c00385] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 01/12/2024]
Abstract
Directed evolution provides a powerful route for in vitro enzyme engineering. State-of-the-art techniques functionally screen up to millions of enzyme variants using high throughput microfluidic sorters, whose operation remains technically challenging. Alternatively, in vitro self-selection methods, analogous to in vivo complementation strategies, open the way to even higher throughputs, but have been demonstrated only for a few specific activities. Here, we leverage synthetic molecular networks to generalize in vitro compartmentalized self-selection processes. We introduce a programmable circuit architecture that can link an arbitrary target enzymatic activity to the replication of its encoding gene. Microencapsulation of a bacterial expression library with this autonomous selection circuit results in the single-step and screening-free enrichment of genetic sequences coding for programmed enzymatic phenotypes. We demonstrate the potential of this approach for the nicking enzyme Nt.BstNBI (NBI). We applied autonomous selection conditions to enrich for thermostability or catalytic efficiency, manipulating up to 10^7 microcompartments and 5 × 10^5 variants at once. Full gene reads of the libraries using nanopore sequencing revealed detailed mutational activity landscapes, suggesting a key role of electrostatic interactions with DNA in the enzyme's turnover. The most beneficial mutations, identified after a single round of self-selection, provided variants with, respectively, 20 times and 3 °C increased activity and thermostability. Based on a modular molecular programming architecture, this approach does not require complex instrumentation and can be repurposed for other enzymes, including those that are not related to DNA chemistry.
Affiliation(s)
- Adèle Dramé-Maigné
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Rocío Espada
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Giselle McCallum
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Rémi Sieskind
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Guillaume Gines
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Yannick Rondelez
- Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
43
Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Tea Huseinbegovic
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
44
Notin P, Rollins N, Gal Y, Sander C, Marks D. Machine learning for functional protein design. Nat Biotechnol 2024; 42:216-228. [PMID: 38361074 DOI: 10.1038/s41587-024-02127-0] [Citation(s) in RCA: 50] [Impact Index Per Article: 50.0] [Received: 07/01/2023] [Accepted: 01/05/2024] [Indexed: 02/17/2024]
Abstract
Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.
Affiliation(s)
- Pascal Notin
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Department of Computer Science, University of Oxford, Oxford, UK.
- Yarin Gal
- Department of Computer Science, University of Oxford, Oxford, UK
- Chris Sander
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Debora Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Broad Institute of Harvard and MIT, Cambridge, MA, USA.
45
Pucci F, Zerihun MB, Rooman M, Schug A. pycofitness-Evaluating the fitness landscape of RNA and protein sequences. Bioinformatics 2024; 40:btae074. [PMID: 38335928 PMCID: PMC10881095 DOI: 10.1093/bioinformatics/btae074] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/29/2023] [Revised: 01/25/2024] [Accepted: 02/06/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION The accurate prediction of how mutations change biophysical properties of proteins or RNA is a major goal in computational biology with tremendous impacts on protein design and genetic variant interpretation. Evolutionary approaches such as coevolution can help solve this issue. RESULTS We present pycofitness, a standalone Python-based software package for the in silico mutagenesis of protein and RNA sequences. It is based on coevolution and, more specifically, on a popular inverse statistical approach, namely direct coupling analysis by pseudo-likelihood maximization. Its efficient implementation and user-friendly command line interface make it an easy-to-use tool even for researchers with no bioinformatics background. To illustrate its strengths, we present three applications in which pycofitness efficiently predicts the deleteriousness of genetic variants and the effect of mutations on protein fitness and thermodynamic stability. AVAILABILITY AND IMPLEMENTATION https://github.com/KIT-MBS/pycofitness.
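Coevolution-based in silico mutagenesis is commonly framed as a change in the statistical energy of a Potts model, E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j), between a mutant and the wild type. The Python sketch below illustrates only that scoring step. It does not use the pycofitness interface, and its fields and couplings are random placeholders rather than parameters inferred from an alignment by pseudo-likelihood maximization; every name in it is hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
INDEX = {aa: k for k, aa in enumerate(ALPHABET)}


def toy_parameters(length, seed=0):
    """Random stand-ins for DCA fields h_i(a) and couplings J_ij(a, b)."""
    rng = random.Random(seed)
    q = len(ALPHABET)
    fields = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(length)]
    couplings = {
        (i, j): [[rng.gauss(0.0, 0.1) for _ in range(q)] for _ in range(q)]
        for i in range(length)
        for j in range(i + 1, length)
    }
    return fields, couplings


def energy(seq, fields, couplings):
    """Potts statistical energy; lower values mean more family-like."""
    idx = [INDEX[aa] for aa in seq]
    site_term = sum(fields[i][a] for i, a in enumerate(idx))
    pair_term = sum(J[idx[i]][idx[j]] for (i, j), J in couplings.items())
    return -(site_term + pair_term)


def delta_energy(seq, pos, new_aa, fields, couplings):
    """Energy change for substituting new_aa at the 0-based position pos."""
    mutant = seq[:pos] + new_aa + seq[pos + 1:]
    return energy(mutant, fields, couplings) - energy(seq, fields, couplings)


# Score all 20 substitutions at one position of a made-up 8-residue sequence.
wild_type = "MKTAYIAK"
fields, couplings = toy_parameters(len(wild_type))
scores = {aa: delta_energy(wild_type, 3, aa, fields, couplings) for aa in ALPHABET}
```

Under this sign convention a positive score marks a substitution the model considers less compatible with the family, and the wild-type residue scores zero by construction.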
Affiliation(s)
- Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Mehari B Zerihun
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Alexander Schug
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Department of Biology, University of Duisburg-Essen, D-45141 Essen, Germany
46
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Affiliation(s)
- Clara Fannjiang
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Jennifer Listgarten
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
47
Stock M, Gorochowski TE. Open-endedness in synthetic biology: A route to continual innovation for biological design. SCIENCE ADVANCES 2024; 10:eadi3621. [PMID: 38241375 PMCID: PMC11809665 DOI: 10.1126/sciadv.adi3621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Accepted: 12/20/2023] [Indexed: 01/21/2024]
Abstract
Design in synthetic biology is typically goal oriented, aiming to repurpose or optimize existing biological functions, augmenting biology with new-to-nature capabilities, or creating life-like systems from scratch. While the field has seen many advances, bottlenecks in the complexity of the systems built are emerging and designs that function in the lab often fail when used in real-world contexts. Here, we propose an open-ended approach to biological design, with the novelty of designed biology being at least as important as how well it fulfils its goal. Rather than solely focusing on optimization toward a single best design, designing with novelty in mind may allow us to move beyond the diminishing returns we see in performance for most engineered biology. Research from the artificial life community has demonstrated that embracing novelty can automatically generate innovative and unexpected solutions to challenging problems beyond local optima. Synthetic biology offers the ideal playground to explore more creative approaches to biological design.
Affiliation(s)
- Michiel Stock
- KERMIT & Biobix, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Thomas E. Gorochowski
- School of Biological Sciences, University of Bristol, Life Sciences Building, 24 Tyndall Avenue, Bristol BS8 1TQ, UK
- BrisEngBio, School of Chemistry, University of Bristol, Cantock’s Close, Bristol BS8 1TS, UK
48
Xu B, Chen Y, Xue W. Computational Protein Design - Where it goes? Curr Med Chem 2024; 31:2841-2854. [PMID: 37272467 DOI: 10.2174/0929867330666230602143700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2022] [Revised: 02/18/2023] [Accepted: 03/15/2023] [Indexed: 06/06/2023]
Abstract
Proteins have been playing a critical role in the regulation of diverse biological processes related to human life. With the increasing demand, functional proteins are sparse in this immense sequence space. Therefore, protein design has become an important task in various fields, including medicine, food, energy, materials, etc. Directed evolution has recently led to significant achievements. Molecular modification of proteins through directed evolution technology has significantly advanced the fields of enzyme engineering, metabolic engineering, medicine, and beyond. However, it is impossible to identify desirable sequences from a large number of synthetic sequences alone. As a result, computational methods, including data-driven machine learning and physics-based molecular modeling, have been introduced to protein engineering to produce more functional proteins. This review focuses on recent advances in computational protein design, highlighting the applicability of different approaches as well as their limitations.
Collapse
Affiliation(s)
- Binbin Xu
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Yingjun Chen
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Weiwei Xue
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
|
49
|
Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023; 12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enable the design of variable length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
Affiliation(s)
- Nikša Praljak
- Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States
- Xinran Lian
- Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
- Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Andrew L Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
|
50
|
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (eg., alignment-based, inverse folding) into a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, model predictions and develop a user-friendly website that facilitates data access and analysis.
Affiliation(s)
- Ada Shaw
- Applied Mathematics, Harvard University
- Mafalda Dias
- Centre for Genomic Regulation, Universitat Pompeu Fabra
- Yarin Gal
- Computer Science, University of Oxford
|