1
Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, Kaczmarski JA, Nichols J, Tokuriki N, Jackson CJ. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry 2025; 64:1673-1684. [PMID: 40132127 DOI: 10.1021/acs.biochem.4c00673] [Indexed: 03/27/2025]
Abstract
Proteins evolve through complex sequence spaces, with fitness landscapes serving as a conceptual framework that links sequence to function. Fitness landscapes can be smooth, where multiple similarly accessible evolutionary paths are available, or rugged, where the presence of multiple local fitness optima complicates evolution and prediction. Indeed, many proteins, especially those with complex functions or under multiple selection pressures, exist on rugged fitness landscapes. Here we discuss the theoretical framework that underpins our understanding of fitness landscapes, alongside recent work that has advanced our understanding, particularly the biophysical basis for smoothness versus ruggedness. Finally, we address the rapid advances that have been made in computational and experimental exploration and exploitation of fitness landscapes, and how these can identify efficient routes to protein optimization.
Affiliation(s)
- Mahakaran Sandhu
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
| | - John Z Chen
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
| | - Dana S Matthews
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
| | - Matthew A Spence
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
| | - Sacha B Pulsford
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
| | - Barnabas Gall
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
| | - Joe A Kaczmarski
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
| | - James Nichols
- Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
| | - Nobuhiko Tokuriki
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
| | - Colin J Jackson
- Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
- ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
2
Li Y, Zhang J. On the Probability of Reaching High Peaks in Fitness Landscapes by Adaptive Walks. Mol Biol Evol 2025; 42:msaf066. [PMID: 40120127 PMCID: PMC11975533 DOI: 10.1093/molbev/msaf066] [Received: 10/31/2024] [Revised: 02/25/2025] [Accepted: 03/10/2025] [Indexed: 03/25/2025] Open
Abstract
Adaptive evolution can be described by an uphill walk in a fitness landscape. However, climbing the global peak in a multipeak landscape is improbable because of the high chance of being trapped at a local peak. Nonetheless, over three-quarters of simulated adaptive walks in the fitness landscape of the Escherichia coli dihydrofolate reductase (DHFR) gene were reported to end at the highest 14% of peaks, suggesting that biological systems may be substantially more evolvable than commonly thought. To investigate the cause and generality of this observation, we estimate in empirical and theoretical fitness landscapes the probability of reaching high peaks by adaptive walks (PHP), where high peaks refer to the highest 1, 5, 14, or 25% of all peaks. We find that (i) PHP varies substantially among landscapes, (ii) PHP in empirical landscapes is generally comparable to or smaller than that in same-size Rough Mount Fuji landscapes of similar ruggedness, and (iii) lowering landscape ruggedness boosts PHP. As observed in DHFR, we find in every examined landscape a positive correlation between the fitness of a peak and its basin size, which is the number of genotypes that can reach the peak through adaptive walks. Yet, this correlation does not guarantee a large PHP because of the influences of other factors. We conclude that evolvability depends on the specific fitness landscape and that the large PHP in the DHFR landscape is not a general property of empirical or theoretical fitness landscapes.
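The quantity PHP described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes binary genotypes, a toy "house-of-cards" landscape with independent random fitness values, and walks that move to a uniformly chosen fitter single-mutant neighbor. All function names are hypothetical.

```python
import random


def neighbors(g, L):
    """All genotypes one bit-flip away from g (genotypes are ints of L bits)."""
    return [g ^ (1 << i) for i in range(L)]


def make_landscape(L, rng):
    """Toy assumption: independent uniform fitness per genotype (very rugged)."""
    return [rng.random() for _ in range(1 << L)]


def find_peaks(fit, L):
    """A peak is a genotype with no strictly fitter single-mutant neighbor."""
    return [g for g in range(len(fit))
            if all(fit[n] <= fit[g] for n in neighbors(g, L))]


def adaptive_walk(start, fit, L, rng):
    """Step to a random fitter neighbor until none exists; return the end point."""
    g = start
    while True:
        fitter = [n for n in neighbors(g, L) if fit[n] > fit[g]]
        if not fitter:
            return g
        g = rng.choice(fitter)


def estimate_php(fit, L, top_fraction, n_walks, rng):
    """Fraction of walks from random starts that end in the top share of peaks."""
    peaks = sorted(find_peaks(fit, L), key=lambda g: fit[g], reverse=True)
    n_high = max(1, round(top_fraction * len(peaks)))
    high = set(peaks[:n_high])
    hits = sum(adaptive_walk(rng.randrange(len(fit)), fit, L, rng) in high
               for _ in range(n_walks))
    return hits / n_walks


rng = random.Random(0)
L = 10
fit = make_landscape(L, rng)
for frac in (0.01, 0.05, 0.14, 0.25):
    print(f"top {frac:.0%} of peaks: PHP ~ {estimate_php(fit, L, frac, 1000, rng):.3f}")
```

The thresholds 1, 5, 14, and 25% are the ones named in the abstract; the landscape model, walk rule, and sample sizes are illustrative choices only.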
Affiliation(s)
- Yang Li
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
| | - Jianzhi Zhang
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
3
Luo J, Ding K, Luo Y. Pareto-optimal sampling for multi-objective protein sequence design. iScience 2025; 28:112119. [PMID: 40160427 PMCID: PMC11952807 DOI: 10.1016/j.isci.2025.112119] [Received: 10/28/2024] [Revised: 01/03/2025] [Accepted: 02/24/2025] [Indexed: 04/02/2025] Open
Abstract
Supervised machine learning (ML) has significantly advanced sequence-based protein property prediction. However, its inverse application, designing protein sequences with desired properties, remains under-explored. The challenges in sequence design stem from the vast search space and the rugged protein fitness landscape. In this work, we present MosPro, an efficient ML algorithm for property-guided protein sequence design. We frame sequence design as a discrete sampling problem. Utilizing a pre-trained differentiable ML model that predicts properties of sequences, MosPro shapes a distribution that assigns high probability mass to regions for high-property sequences. To generate designs, MosPro efficiently samples sequences from this constructed distribution. We further develop a Pareto optimization algorithm to propose sequences that are simultaneously optimized for multiple properties. Evaluations on experimental fitness landscapes demonstrated that MosPro generates sequences that optimally trade off multiple desiderata. Our results suggested an unparalleled potential of generative ML for efficient and controllable design for functional proteins.
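The notion of sequences that "optimally trade off multiple desiderata" can be made concrete with a non-dominated filter. This is a minimal sketch of Pareto optimality in general, not the MosPro sampling algorithm; the candidate names and scores are invented for illustration, and every objective is assumed to be maximized.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scored):
    """Return the names whose score tuples are not dominated by any other entry."""
    return [name for name, s in scored.items()
            if not any(dominates(t, s)
                       for other, t in scored.items() if other != name)]


# Hypothetical candidates scored on two predicted properties.
candidates = {
    "seqA": (0.9, 0.2),
    "seqB": (0.5, 0.5),
    "seqC": (0.4, 0.4),  # worse than seqB on both properties
    "seqD": (0.2, 0.9),
}
print(pareto_front(candidates))
```

Here seqC is excluded because seqB beats it on both properties, while the remaining three each win on at least one property against every rival.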
Affiliation(s)
- Jiaqi Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA
| | - Kerr Ding
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA
4
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. PLoS Comput Biol 2025; 21:e1012818. [PMID: 40111986 PMCID: PMC11957564 DOI: 10.1371/journal.pcbi.1012818] [Received: 09/16/2024] [Accepted: 01/22/2025] [Indexed: 03/22/2025] Open
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
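The idea that parameters can change without changing predictions is easiest to see in an additive model. The sketch below is an illustration under stated assumptions, not the family of gauges derived in the paper: it uses a toy additive model with a constant term plus one parameter per position and character, and imposes the zero-sum gauge by moving each position's mean into the constant. All names and numbers are hypothetical.

```python
def predict(theta0, theta, seq, alphabet):
    """Additive model: constant plus one parameter per (position, character)."""
    return theta0 + sum(theta[pos][alphabet.index(ch)]
                        for pos, ch in enumerate(seq))


def zero_sum_gauge(theta0, theta):
    """Shift each position's parameters to sum to zero; absorb the shift in theta0."""
    new_theta0 = theta0
    new_theta = []
    for row in theta:
        mean = sum(row) / len(row)
        new_theta0 += mean
        new_theta.append([value - mean for value in row])
    return new_theta0, new_theta


alphabet = "AB"
theta0 = 1.0
theta = [[0.5, 1.5], [2.0, -1.0]]  # two positions, two characters each

g0, g = zero_sum_gauge(theta0, theta)
for seq in ("AA", "AB", "BA", "BB"):
    # Both parameter sets give the same prediction for every sequence.
    print(seq, predict(theta0, theta, seq, alphabet), predict(g0, g, seq, alphabet))
```

After fixing the gauge, each parameter can be read as an effect relative to the position's average, which is one reason a gauge must be chosen before parameter values are interpreted.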
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Juannan Zhou
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Department of Biology, University of Florida, Gainesville, Florida, United States of America
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
5
Thomas N, Belanger D, Xu C, Lee H, Hirano K, Iwai K, Polic V, Nyberg KD, Hoff KG, Frenz L, Emrich CA, Kim JW, Chavarha M, Ramanan A, Agresti JJ, Colwell LJ. Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Syst 2025; 16:101236. [PMID: 40081373 DOI: 10.1016/j.cels.2025.101236] [Received: 05/21/2024] [Revised: 09/17/2024] [Accepted: 02/19/2025] [Indexed: 03/16/2025]
Abstract
Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged fitness landscape and costly experiments. In this work, we present TeleProt, a machine learning (ML) framework that blends evolutionary and experimental data to design diverse protein libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments, TeleProt found a significantly better top-performing enzyme than directed evolution (DE), had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55,000 nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Neil Thomas
- X, the Moonshot Factory, Mountain View, CA 94043, USA
| | - Jun W Kim
- X, the Moonshot Factory, Mountain View, CA 94043, USA
| | - Abi Ramanan
- X, the Moonshot Factory, Mountain View, CA 94043, USA
| | - Lucy J Colwell
- Google DeepMind, Cambridge, MA 02142, USA; Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
6
Ouyang WO, Lv H, Liu W, Lei R, Mou Z, Pholcharee T, Talmage L, Tong M, Wang Y, Dailey KE, Gopal AB, Choi D, Ardagh MR, Rodriguez LA, Dai X, Wu NC. High-throughput synthesis and specificity characterization of natively paired antibodies using oPool+ display. bioRxiv 2025:2024.08.30.610421. [PMID: 39257766 PMCID: PMC11383711 DOI: 10.1101/2024.08.30.610421] [Indexed: 09/12/2024]
Abstract
Antibody discovery is crucial for developing therapeutics and vaccines as well as understanding adaptive immunity. However, the lack of approaches to synthesize antibodies with defined sequences in a high-throughput manner represents a major bottleneck in antibody discovery. Here, we presented oPool+ display, a high-throughput cell-free platform that combined oligo pool synthesis and mRNA display to rapidly construct and characterize many natively paired antibodies in parallel. As a proof-of-concept, we applied oPool+ display to probe the binding specificity of >300 uncommon influenza hemagglutinin (HA) antibodies against 9 HA variants through 16 different screens. Over 5,000 binding tests were performed in 3-5 days with further scaling potential. Follow-up structural analysis of two HA stem antibodies revealed the previously unknown versatility of the IGHD3-3 gene segment in recognizing the HA stem. Overall, this study established an experimental platform that not only accelerates antibody characterization, but also enables unbiased discovery of antibody molecular signatures.
Affiliation(s)
- Wenhao O Ouyang
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Huibin Lv
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Wenkan Liu
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Ruipeng Lei
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Zongjun Mou
- Department of Physiology and Biophysics, Case Western Reserve University, School of Medicine, Cleveland, OH 44106, USA
| | - Tossapol Pholcharee
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Logan Talmage
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Meixuan Tong
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Yiquan Wang
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Katrine E Dailey
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Akshita B Gopal
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Danbi Choi
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Madison R Ardagh
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Lucia A Rodriguez
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Xinghong Dai
- Department of Physiology and Biophysics, Case Western Reserve University, School of Medicine, Cleveland, OH 44106, USA
| | - Nicholas C Wu
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
7
Chitra U, Arnold B, Raphael BJ. Resolving discrepancies between chimeric and multiplicative measures of higher-order epistasis. Nat Commun 2025; 16:1711. [PMID: 39962081 PMCID: PMC11833126 DOI: 10.1038/s41467-025-56986-5] [Received: 05/23/2024] [Accepted: 02/06/2025] [Indexed: 02/20/2025] Open
Abstract
Epistasis - the interaction between alleles at different genetic loci - plays a fundamental role in biology. However, several recent approaches quantify epistasis using a chimeric formula that measures deviations from a multiplicative fitness model on an additive scale, thus mixing two scales. Here, we show that for pairwise interactions, the chimeric formula yields a different magnitude but the same sign of epistasis compared to the multiplicative formula that measures both fitness and deviations on a multiplicative scale. However, for higher-order interactions, we show that the chimeric formula can have both different magnitude and sign compared to the multiplicative formula. We resolve these inconsistencies by deriving mathematical relationships between the different epistasis formulae and different parametrizations of the multivariate Bernoulli distribution. We argue that the chimeric formula does not appropriately model interactions between the Bernoulli random variables. In simulations, we show that the chimeric formula is less accurate than the classical multiplicative/additive epistasis formulae and may falsely detect higher-order epistasis. Analyzing multi-gene knockouts in yeast, multi-way drug interactions in E. coli, and deep mutational scanning of several proteins, we find that approximately 10% to 60% of inferred higher-order interactions change sign using the multiplicative/additive formula compared to the chimeric formula.
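The pairwise claim in this abstract (different magnitude, same sign) can be checked numerically. The sketch below follows the abstract's verbal description rather than the paper's notation, so the exact forms are assumptions: fitness is normalized to the wild type, the chimeric value is the double mutant's fitness minus the product of the single-mutant fitnesses, and the multiplicative value is reported as a log ratio so that zero means no epistasis. The fitness numbers are invented.

```python
import math


def chimeric_pairwise(f00, f10, f01, f11):
    """Deviation from the multiplicative expectation, measured on an additive scale."""
    w10, w01, w11 = f10 / f00, f01 / f00, f11 / f00  # normalize to wild type
    return w11 - w10 * w01


def multiplicative_pairwise(f00, f10, f01, f11):
    """Deviation measured on a multiplicative scale, reported as a log ratio."""
    return math.log((f11 * f00) / (f10 * f01))


# Hypothetical fitness values: wild type, two single mutants, double mutant.
f00, f10, f01, f11 = 1.0, 0.8, 0.5, 0.6

chim = chimeric_pairwise(f00, f10, f01, f11)
mult = multiplicative_pairwise(f00, f10, f01, f11)
print(f"chimeric: {chim:.3f}  multiplicative (log): {mult:.3f}")
```

For positive fitness values the two pairwise measures always share a sign, because both are positive exactly when f11 * f00 exceeds f10 * f01. The sign disagreements reported in the abstract arise only for interactions among three or more loci, which this sketch does not cover.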
Affiliation(s)
- Uthsav Chitra
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| | - Brian Arnold
- Department of Computer Science, Princeton University, Princeton, NJ, USA
- Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, USA
8
Ertelt M, Moretti R, Meiler J, Schoeder CT. Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants. Sci Adv 2025; 11:eadr7338. [PMID: 39937901 PMCID: PMC11817935 DOI: 10.1126/sciadv.adr7338] [Received: 07/15/2024] [Accepted: 01/10/2025] [Indexed: 02/14/2025]
Abstract
Machine learning (ML) is changing the world of computational protein design, with data-driven methods surpassing biophysics-based methods in experimental success. However, they are most often reported as case studies, lack integration and standardization, and are therefore hard to objectively compare. In this study, we established a streamlined and diverse toolbox for methods that predict amino acid probabilities inside the Rosetta software framework that allows for the side-by-side comparison of these models. Subsequently, existing protein fitness landscapes were used to benchmark novel ML methods in realistic protein design settings. We focused on the traditional problems of protein design: sampling and scoring. A major finding of our study is that ML approaches are better at purging the sampling space of deleterious mutations. Nevertheless, scoring resulting mutations without model fine-tuning showed no clear improvement over scoring with Rosetta. We conclude that ML now complements, rather than replaces, biophysical methods in protein design.
Affiliation(s)
- Moritz Ertelt
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
| | - Rocco Moretti
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
| | - Jens Meiler
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
| | - Clara T. Schoeder
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
9
Zhang Q, Chen W, Qin M, Wang Y, Pu Z, Ding K, Liu Y, Zhang Q, Li D, Li X, Zhao Y, Yao J, Huang L, Wu J, Yang L, Chen H, Yu H. Integrating protein language models and automatic biofoundry for enhanced protein evolution. Nat Commun 2025; 16:1553. [PMID: 39934638 PMCID: PMC11814318 DOI: 10.1038/s41467-025-56751-8] [Received: 09/14/2024] [Accepted: 01/24/2025] [Indexed: 02/13/2025] Open
Abstract
Traditional protein engineering methods, such as directed evolution, while effective, are often slow and labor-intensive. Advances in machine learning and automated biofoundries present new opportunities for optimizing these processes. This study devises a protein language model-enabled automatic evolution platform, a closed-loop system for automated protein engineering within the Design-Build-Test-Learn cycle. The protein language model ESM-2 makes zero-shot predictions of 96 variants to initiate the cycle. The biofoundry constructs and evaluates these variants, and feeds the results back to a multi-layer perceptron to train a fitness predictor, which then predicts a second round of 96 variants with improved fitness. With a tRNA synthetase as a model enzyme, four rounds of evolution carried out within 10 days lead to mutants with enzyme activity improved by up to 2.4-fold. Our system significantly enhances the speed and accuracy of protein evolution, driving faster advancements in protein engineering for industrial applications.
Affiliation(s)
- Qiang Zhang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-UIUC Institute, International Campus, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Wanyi Chen
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Ming Qin
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- School of Software Technology, Zhejiang University, Hangzhou, 315103, China
| | - Yuhao Wang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Polytechnic Institute, Zhejiang University, Hangzhou, 310015, China
| | - Zhongji Pu
- Xianghu Laboratory, Hangzhou, 311231, China
| | - Keyan Ding
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Yuyue Liu
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Qunfeng Zhang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Dongfang Li
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Xinjia Li
- Xianghu Laboratory, Hangzhou, 311231, China
| | - Yu Zhao
- AI Lab, Tencent, Shenzhen, Guangdong, 518000, China
| | - Jianhua Yao
- AI Lab, Tencent, Shenzhen, Guangdong, 518000, China
| | - Lei Huang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Jianping Wu
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
- Zhejiang Key Laboratory of Intelligent Manufacturing for Functional Chemicals, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311215, China
| | - Lirong Yang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
- Zhejiang Key Laboratory of Intelligent Manufacturing for Functional Chemicals, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311215, China
| | - Huajun Chen
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
- College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, 310027, China
| | - Haoran Yu
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
10
Long Y, Mora A, Li FZ, Gürsoy E, Johnston KE, Arnold FH. LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning. ACS Synth Biol 2025; 14:230-238. [PMID: 39719062 DOI: 10.1021/acssynbio.4c00625] [Indexed: 12/26/2024]
Abstract
Sequence-function data provides valuable information about the protein functional landscape but is rarely obtained during directed evolution campaigns. Here, we present Long-read every variant Sequencing (LevSeq), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes. LevSeq integrates into existing protein engineering workflows and comes with open-source software for data analysis and visualization. The pipeline facilitates data-driven protein engineering by consolidating sequence-function data to inform directed evolution and provide the requisite data for machine learning-guided protein engineering (MLPE). LevSeq enables quality control of mutagenesis libraries prior to screening, which reduces time and resource costs. Simulation studies demonstrate LevSeq's ability to accurately detect variants under various experimental conditions. Finally, we show LevSeq's utility in engineering protoglobins for new-to-nature chemistry. Widespread adoption of LevSeq and sharing of the data will enhance our understanding of protein sequence-function landscapes and empower data-driven directed evolution.
Affiliation(s)
- Yueming Long
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Ariane Mora
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Emre Gürsoy
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Kadina E Johnston
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California 91125, United States
11
Yang J, Lal RG, Bowden JC, Astudillo R, Hameedi MA, Kaur S, Hill M, Yue Y, Arnold FH. Active learning-assisted directed evolution. Nat Commun 2025; 16:714. [PMID: 39821082 PMCID: PMC11739421 DOI: 10.1038/s41467-025-55987-8] [Received: 08/14/2024] [Accepted: 01/02/2025] [Indexed: 01/19/2025] Open
Abstract
Directed evolution (DE) is a powerful tool to optimize protein fitness for a specific application. However, DE can be inefficient when mutations exhibit non-additive, or epistatic, behavior. Here, we present Active Learning-assisted Directed Evolution (ALDE), an iterative machine learning-assisted DE workflow that leverages uncertainty quantification to explore the search space of proteins more efficiently than current DE methods. We apply ALDE to an engineering landscape that is challenging for DE: optimization of five epistatic residues in the active site of an enzyme. In three rounds of wet-lab experimentation, we improve the yield of a desired product of a non-native cyclopropanation reaction from 12% to 93%. We also perform computational simulations on existing protein sequence-fitness datasets to support our argument that ALDE can be more effective than DE. Overall, ALDE is a practical and broadly applicable strategy to unlock improved protein engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Ravi G Lal
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - James C Bowden
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Computer Science, University of California-Berkeley, Berkeley, CA, USA
| | - Raul Astudillo
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
| | - Mikhail A Hameedi
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | | | - Matthew Hill
- Elegen Corp, 1300 Industrial Road #16, San Carlos, CA, USA
| | - Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA.
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
| |
|
12
|
Schulz S, Tan TJC, Wu NC, Wang S. Epistatic hotspots organize antibody fitness landscape and boost evolvability. Proc Natl Acad Sci U S A 2025; 122:e2413884122. [PMID: 39773024 PMCID: PMC11745389 DOI: 10.1073/pnas.2413884122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2024] [Accepted: 12/06/2024] [Indexed: 01/11/2025] Open
Abstract
The course of evolution is strongly shaped by interaction between mutations. Such epistasis can yield rugged sequence-function maps and constrain the availability of adaptive paths. While theoretical intuition is often built on global statistics of large, homogeneous model landscapes, mutagenesis measurements necessarily probe a limited neighborhood of a reference genotype. It is unclear to what extent local topography of a real epistatic landscape represents its global shape. Here, we demonstrate that epistatic landscapes can be heterogeneously rugged and this heterogeneity may render biomolecules more evolvable. By characterizing a multipeaked fitness landscape of a SARS-CoV-2 antibody mutant library, we show that heterogeneous ruggedness arises from sparse epistatic hotspots, whose mutation impacts the fitness effect of numerous sequence sites. Surprisingly, mutating an epistatic hotspot may enhance, rather than reduce, the accessibility of the fittest genotype, while increasing the overall ruggedness. Further, migratory constraints in real space alleviate mutational constraints in sequence space, which not only diversify direct paths taken but may also turn a road-blocking fitness peak into a stepping stone leading toward the global optimum. Our results suggest that a hierarchy of epistatic hotspots may organize the fitness landscape in such a way that path-orienting ruggedness confers global smoothness.
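Two quantities behind the terms "epistasis" and "multipeaked" in this abstract can be made concrete with a small sketch: the pairwise epistasis of a double mutant and the set of local fitness peaks. The four-genotype landscape below is hypothetical and chosen only to show reciprocal sign epistasis; it is not data from the study.

```python
def neighbors(genotype, alphabet="01"):
    """All genotypes one mutation away from `genotype`."""
    for i, a in enumerate(genotype):
        for b in alphabet:
            if b != a:
                yield genotype[:i] + b + genotype[i + 1:]


def local_peaks(fitness):
    """Genotypes fitter than every single-mutant neighbor."""
    return sorted(
        g for g, f in fitness.items()
        if all(f > fitness[n] for n in neighbors(g))
    )


def pairwise_epistasis(fitness):
    """Deviation of the double mutant from the sum of single-mutant effects."""
    return fitness["11"] - fitness["10"] - fitness["01"] + fitness["00"]


# Hypothetical two-site landscape: each single mutation is deleterious alone,
# but the double mutant is the fittest genotype.
landscape = {"00": 1.0, "01": 0.5, "10": 0.5, "11": 2.0}

print(local_peaks(landscape))         # two peaks separated by a fitness valley
print(pairwise_epistasis(landscape))  # nonzero, so the effects are non-additive
```

On an additive landscape `pairwise_epistasis` returns zero and only one peak exists; ruggedness in the sense used above appears once interaction terms are large enough to create additional peaks.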
Affiliation(s)
- Steven Schulz
- Department of Physics and Astronomy, University of California, Los Angeles, CA 90095
| | - Timothy J. C. Tan
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801
| | - Nicholas C. Wu
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801
| | - Shenshen Wang
- Department of Physics and Astronomy, University of California, Los Angeles, CA 90095
| |
|
13
|
Yang W, Ji J, Fang G. A metric and its derived protein network for evaluation of ortholog database inconsistency. BMC Bioinformatics 2025; 26:6. [PMID: 39773281 PMCID: PMC11707888 DOI: 10.1186/s12859-024-06023-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 12/24/2024] [Indexed: 01/11/2025] Open
Abstract
BACKGROUND Ortholog prediction, essential for various genomic research areas, faces growing inconsistencies amidst the expanding array of ortholog databases. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors. RESULTS We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets. CONCLUSIONS We present an objective, unsupervised SJI-based network encompassing all proteins, in which its topological features elucidate ortholog prediction inconsistencies. The degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well-suited for ortholog benchmarking. This approach transcends the limitations of universal thresholds, offering a robust and quantitative framework to explore protein evolution and functional relationships.
Affiliation(s)
- Weijie Yang
- NYU-Shanghai, Shanghai, 200120, China
- Software Engineering Institute, East China Normal University, Shanghai, 200062, China
| | - Jingsi Ji
- NYU-Shanghai, Shanghai, 200120, China
- Software Engineering Institute, East China Normal University, Shanghai, 200062, China
| | - Gang Fang
- NYU-Shanghai, Shanghai, 200120, China.
- Department of Biology, New York University, New York, NY, 10003, USA.
- Software Engineering Institute, East China Normal University, Shanghai, 200062, China.
| |
|
14
|
James J, Towers S, Foerster J, Steel H. Optimisation strategies for directed evolution without sequencing. PLoS Comput Biol 2024; 20:e1012695. [PMID: 39700257 PMCID: PMC11698521 DOI: 10.1371/journal.pcbi.1012695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 01/03/2025] [Accepted: 12/04/2024] [Indexed: 12/21/2024] Open
Abstract
Directed evolution can enable engineering of biological systems with minimal knowledge of their underlying sequence-to-function relationships. A typical directed evolution process consists of iterative rounds of mutagenesis and selection that are designed to steer changes in a biological system (e.g. a protein) towards some functional goal. Much work has been done, particularly leveraging advancements in machine learning, to optimise the process of directed evolution. Many of these methods, however, require DNA sequencing and synthesis, making them resource-intensive and incompatible with developments in targeted in vivo mutagenesis. Operating within the experimental constraints of established sorting-based directed evolution techniques (e.g. Fluorescence-Activated Cell Sorting, FACS), we explore approaches for optimisation of directed evolution that could in future be implemented without sequencing information. We then expand our methods to the context of emerging experimental techniques in directed evolution, which allow for single-cell selection based on fitness objectives defined from any combination of measurable traits. Finally, we explore these alternative strategies on the GB1 and TrpB empirical landscapes, demonstrating that they could lead to up to 19-fold and 7-fold increases respectively in the probability of attaining the global fitness peak.
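The "iterative rounds of mutagenesis and selection" described here can be sketched as a short simulation in which selection sees only a measured fitness value and never the sequence, mirroring a sorting-based experiment. This is a toy illustration under stated assumptions, not the optimisation strategies proposed in the paper: `TARGET`, `toy_fitness`, the mutation `rate`, and the `top_frac` sorting gate are all hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKVLAT"  # hypothetical optimum, used only to define a toy fitness


def toy_fitness(seq):
    """Hypothetical fitness: number of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET))


def mutate(seq, rate, rng):
    """Random mutagenesis: each position is resampled with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else a for a in seq)


def directed_evolution(fitness, parent, rounds=5, pop_size=200,
                       top_frac=0.1, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [parent] * pop_size
    best_per_round = []
    for _ in range(rounds):
        variants = [mutate(s, rate, rng) for s in population]
        # Selection uses the measured fitness only (a stand-in for a sorting
        # gate such as FACS); no sequence information enters the decision.
        variants.sort(key=fitness, reverse=True)
        survivors = variants[: max(1, int(pop_size * top_frac))]
        best_per_round.append(fitness(survivors[0]))
        # Regrow the population from the survivors for the next round.
        population = [rng.choice(survivors) for _ in range(pop_size)]
    return best_per_round


history = directed_evolution(toy_fitness, parent="AAAAAA")
print(history)
```

In this sketch the tunable quantities are `rate` and `top_frac`; both can be set without sequencing, which is the kind of experimental constraint the abstract emphasises.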
Affiliation(s)
- Jessica James
- Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| | - Sebastian Towers
- Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| | - Jakob Foerster
- Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| | - Harrison Steel
- Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| |
|
15
|
van der Flier F, Estell D, Pricelius S, Dankmeyer L, van Stigt Thans S, Mulder H, Otsuka R, Goedegebuur F, Lammerts L, Staphorst D, van Dijk AD, de Ridder D, Redestig H. Enzyme structure correlates with variant effect predictability. Comput Struct Biotechnol J 2024; 23:3489-3497. [PMID: 39435338 PMCID: PMC11491678 DOI: 10.1016/j.csbj.2024.09.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 09/03/2024] [Accepted: 09/12/2024] [Indexed: 10/23/2024] Open
Abstract
Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, which can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested: buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These dependencies were also found in several single-mutation enzyme variant datasets, albeit with dataset-specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning-guided protein engineering.
Affiliation(s)
- Floris van der Flier
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dave Estell
- Health & Biosciences, International Flavors and Fragrances, Palo Alto, 94304 CA, USA
| | - Sina Pricelius
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Lydia Dankmeyer
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Sander van Stigt Thans
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Harm Mulder
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Rei Otsuka
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Frits Goedegebuur
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Laurens Lammerts
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Diego Staphorst
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Aalt D.J. van Dijk
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dick de Ridder
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Henning Redestig
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| |
|
16
|
Crandall JG, Zhou X, Rokas A, Hittinger CT. Specialization Restricts the Evolutionary Paths Available to Yeast Sugar Transporters. Mol Biol Evol 2024; 41:msae228. [PMID: 39492761 PMCID: PMC11571961 DOI: 10.1093/molbev/msae228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 10/22/2024] [Accepted: 10/25/2024] [Indexed: 11/05/2024] Open
Abstract
Functional innovation at the protein level is a key source of evolutionary novelties. The constraints on functional innovations are likely to be highly specific in different proteins, which are shaped by their unique histories and the extent of global epistasis that arises from their structures and biochemistries. These contextual nuances in the sequence-function relationship have implications both for a basic understanding of the evolutionary process and for engineering proteins with desirable properties. Here, we have investigated the molecular basis of novel function in a model member of an ancient, conserved, and biotechnologically relevant protein family. These Major Facilitator Superfamily sugar porters are a functionally diverse group of proteins that are thought to be highly plastic and evolvable. By dissecting a recent evolutionary innovation in an α-glucoside transporter from the yeast Saccharomyces eubayanus, we show that the ability to transport a novel substrate requires high-order interactions between many protein regions and numerous specific residues proximal to the transport channel. To reconcile the functional diversity of this family with the constrained evolution of this model protein, we generated new, state-of-the-art genome annotations for 332 Saccharomycotina yeast species spanning ∼400 My of evolution. By integrating phylogenetic and phenotypic analyses across these species, we show that the model yeast α-glucoside transporters likely evolved from a multifunctional ancestor and became subfunctionalized. The accumulation of additive and epistatic substitutions likely entrenched this subfunction, which made the simultaneous acquisition of multiple interacting substitutions the only reasonably accessible path to novelty.
Affiliation(s)
- Johnathan G Crandall
- Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI 53726, USA
| | - Xiaofan Zhou
- Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Center, South China Agricultural University, Guangzhou 510642, China
- Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
| | - Antonis Rokas
- Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
| | - Chris Todd Hittinger
- Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI 53726, USA
| |
|
17
|
Lisanza SL, Gershon JM, Tipps SWK, Sims JN, Arnoldt L, Hendel SJ, Simma MK, Liu G, Yase M, Wu H, Tharp CD, Li X, Kang A, Brackenbrough E, Bera AK, Gerben S, Wittmann BJ, McShan AC, Baker D. Multistate and functional protein design using RoseTTAFold sequence space diffusion. Nat Biotechnol 2024:10.1038/s41587-024-02395-w. [PMID: 39322764 DOI: 10.1038/s41587-024-02395-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 08/21/2024] [Indexed: 09/27/2024]
Abstract
Protein denoising diffusion probabilistic models are used for the de novo generation of protein backbones but are limited in their ability to guide generation of proteins with sequence-specific attributes and functional properties. To overcome this limitation, we developed ProteinGenerator (PG), a sequence space diffusion model based on RoseTTAFold that simultaneously generates protein sequences and structures. Beginning from a noised sequence representation, PG generates sequence and structure pairs by iterative denoising, guided by desired sequence and structural protein attributes. We designed thermostable proteins with varying amino acid compositions and internal sequence repeats and cage bioactive peptides, such as melittin. By averaging sequence logits between diffusion trajectories with distinct structural constraints, we designed multistate parent-child protein triples in which the same sequence folds to different supersecondary structures when intact in the parent versus split into two child domains. PG design trajectories can be guided by experimental sequence-activity data, providing a general approach for integrated computational and experimental optimization of protein function.
Affiliation(s)
- Sidney Lyayuga Lisanza
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, WA, USA
| | - Jacob Merle Gershon
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Department of Molecular Engineering, University of Washington, Seattle, WA, USA
| | - Samuel W K Tipps
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Jeremiah Nelson Sims
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Molecular & Cellular Biology, Medical Scientist Training Program, University of Washington, Seattle, WA, USA
| | - Lucas Arnoldt
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Faculty of Engineering Sciences, Heidelberg University, Heidelberg, Germany
| | - Samuel J Hendel
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Miriam K Simma
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA
| | - Ge Liu
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Muna Yase
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Department of Molecular Engineering, University of Washington, Seattle, WA, USA
| | - Hongwei Wu
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA
| | - Claire D Tharp
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA
| | - Xinting Li
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Alex Kang
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | | | - Asim K Bera
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Stacey Gerben
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Bruce J Wittmann
- Office of the Chief Scientific Officer, Microsoft, Redmond, WA, USA
| | - Andrew C McShan
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA.
- Institute for Protein Design, University of Washington, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
| |
|
18
|
Gallup O, Sechkar K, Towers S, Steel H. Computational Synthetic Biology Enabled through JAX: A Showcase. ACS Synth Biol 2024; 13:3046-3050. [PMID: 39230510 PMCID: PMC11421211 DOI: 10.1021/acssynbio.4c00307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Mathematical modeling is indispensable in synthetic biology but remains underutilized. Tackling problems, from optimizing gene networks to simulating intracellular dynamics, can be facilitated by the ever-growing body of modeling approaches, be they mechanistic, stochastic, data-driven, or AI-enabled. Thanks to progress in the AI community, robust frameworks have emerged to enable researchers to access complex computational hardware and compilation. Previously, these frameworks focused solely on deep learning, but they have been developed to the point where running different forms of computation is relatively simple, as made possible, notably, by the JAX library. Running simulations at scale on GPUs speeds up research, and these gains compound to enable larger-scale experiments and greater usability of code. As JAX remains underexplored in computational biology, we demonstrate its utility in three example projects ranging from synthetic biology to directed evolution, each with an accompanying demonstrative Jupyter notebook. We hope that these tutorials serve to democratize the flexible scaling, faster run-times, easy GPU portability, and mathematical enhancements (such as automatic differentiation) that JAX brings, all with only minor restructuring of code.
Affiliation(s)
- Olivia Gallup
- University of Oxford, Department of Engineering Science, OX1 3PJ Oxford, U.K.
| | - Kirill Sechkar
- University of Oxford, Department of Engineering Science, OX1 3PJ Oxford, U.K.
| | - Sebastian Towers
- University of Oxford, Department of Engineering Science, OX1 3PJ Oxford, U.K.
| | - Harrison Steel
- University of Oxford, Department of Engineering Science, OX1 3PJ Oxford, U.K.
| |
|
19
|
Park Y, Metzger BPH, Thornton JW. The simplicity of protein sequence-function relationships. Nat Commun 2024; 15:7953. [PMID: 39261454 PMCID: PMC11390738 DOI: 10.1038/s41467-024-51895-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 08/20/2024] [Indexed: 09/13/2024] Open
Abstract
How complex are the rules by which a protein's sequence determines its function? High-order epistatic interactions among residues are thought to be pervasive, suggesting an idiosyncratic and unpredictable sequence-function relationship. But many prior studies may have overestimated epistasis, because they analyzed sequence-function relationships relative to a single reference sequence (which causes measurement noise and local idiosyncrasies to snowball into high-order epistasis) or they did not fully account for global nonlinearities. Here we present a reference-free method that jointly infers specific epistatic interactions and global nonlinearity using a bird's-eye view of sequence space. This technique yields the simplest explanation of sequence-function relationships and is more robust than existing methods to measurement noise, missing data, and model misspecification. We reanalyze 20 experimental datasets and find that context-independent amino acid effects and pairwise interactions, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of phenotypic variance and over 92% in every case. Only a tiny fraction of genotypes are strongly affected by higher-order epistasis. Sequence-function relationships are also sparse: a minuscule fraction of amino acids and interactions account for 90% of phenotypic variance. Sequence-function causality across these datasets is therefore simple, opening the way for tractable approaches to characterize proteins' genetic architecture.
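The contrast between reference-based and reference-free analysis can be illustrated at first order with a short sketch. It assumes a complete, hypothetical two-site binary landscape and computes only first-order terms: reference-based effects are differences from one chosen genotype, whereas reference-free effects are averages over all genotypes carrying a state, taken relative to the global mean. The paper's pairwise terms, global nonlinearity, and regression-based inference are not reproduced here.

```python
from statistics import fmean


def reference_based_effects(pheno, ref="00"):
    """Effect of each single mutation, measured against one reference genotype."""
    effects = {}
    for i in range(len(ref)):
        flipped = "1" if ref[i] == "0" else "0"
        mutant = ref[:i] + flipped + ref[i + 1:]
        effects[(i, flipped)] = pheno[mutant] - pheno[ref]
    return effects


def reference_free_effects(pheno):
    """First-order effect of each state: mean phenotype of all genotypes
    carrying that state, minus the global mean phenotype."""
    grand_mean = fmean(pheno.values())
    n_sites = len(next(iter(pheno)))
    effects = {}
    for i in range(n_sites):
        for state in "01":
            with_state = [v for g, v in pheno.items() if g[i] == state]
            effects[(i, state)] = fmean(with_state) - grand_mean
    return effects


# Hypothetical, purely additive landscape (illustrative values only).
pheno = {"00": 1.0, "01": 2.0, "10": 3.0, "11": 4.0}

print(reference_based_effects(pheno))  # depends on the chosen reference
print(reference_free_effects(pheno))   # averages over all backgrounds
```

Because each reference-free term averages over many genotypes, noise in any single measurement is diluted, whereas every reference-based term inherits the noise of the one reference measurement; this is the intuition behind the robustness claim in the abstract.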
Affiliation(s)
- Yeonwoo Park
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
- Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea
| | - Brian P H Metzger
- Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Joseph W Thornton
- Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA.
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
| |
|
20
|
Schmirler R, Heinzinger M, Rost B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat Commun 2024; 15:7407. [PMID: 39198457 PMCID: PMC11358375 DOI: 10.1038/s41467-024-51844-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 08/15/2024] [Indexed: 09/01/2024] Open
Abstract
Prediction methods inputting embeddings from protein language models have reached or even surpassed state-of-the-art performance on many protein prediction tasks. In natural language processing, fine-tuning large language models has become the de facto standard. In contrast, most protein language model-based protein predictions do not back-propagate to the language model. Here, we compare the fine-tuning of three state-of-the-art models (ESM2, ProtT5, Ankh) on eight different tasks. Two results stand out. Firstly, task-specific supervised fine-tuning almost always improves downstream predictions. Secondly, parameter-efficient fine-tuning can reach similar improvements consuming substantially fewer resources at up to 4.5-fold acceleration of training over fine-tuning full models. Our results suggest always trying fine-tuning, in particular for problems with small datasets, such as for fitness landscape predictions of a single protein. For ease of adaptability, we provide easy-to-use notebooks to fine-tune all models used during this work for per-protein (pooling) and per-residue prediction tasks.
Affiliation(s)
- Robert Schmirler
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany.
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany.
- AbbVie Deutschland GmbH & Co. KG, Innovation Center, BTS IR LU, Ludwigshafen, Germany.
| | - Michael Heinzinger
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany
| | - Burkhard Rost
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
| |
|
21
|
Son A, Park J, Kim W, Lee W, Yoon Y, Ji J, Kim H. Integrating Computational Design and Experimental Approaches for Next-Generation Biologics. Biomolecules 2024; 14:1073. [PMID: 39334841 PMCID: PMC11430650 DOI: 10.3390/biom14091073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Revised: 08/13/2024] [Accepted: 08/26/2024] [Indexed: 09/30/2024] Open
Abstract
Therapeutic protein engineering has revolutionized medicine by enabling the development of highly specific and potent treatments for a wide range of diseases. This review examines recent advances in computational and experimental approaches for engineering improved protein therapeutics. Key areas of focus include antibody engineering, enzyme replacement therapies, and cytokine-based drugs. Computational methods like structure-based design, machine learning integration, and protein language models have dramatically enhanced our ability to predict protein properties and guide engineering efforts. Experimental techniques such as directed evolution and rational design approaches continue to evolve, with high-throughput methods accelerating the discovery process. Applications of these methods have led to breakthroughs in affinity maturation, bispecific antibodies, enzyme stability enhancement, and the development of conditionally active cytokines. Emerging approaches like intracellular protein delivery, stimulus-responsive proteins, and de novo designed therapeutic proteins offer exciting new possibilities. However, challenges remain in predicting in vivo behavior, scalable manufacturing, immunogenicity mitigation, and targeted delivery. Addressing these challenges will require continued integration of computational and experimental methods, as well as a deeper understanding of protein behavior in complex physiological environments. As the field advances, we can anticipate increasingly sophisticated and effective protein therapeutics for treating human diseases.
Affiliation(s)
- Ahrum Son
- Department of Molecular Medicine, Scripps Research, La Jolla, CA 92037, USA
| | - Jongham Park
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| | - Woojin Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| | - Wonseok Lee
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| | - Yoonki Yoon
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| | - Jaeho Ji
- Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| | - Hyunsoo Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Protein AI Design Institute, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- SCICS (Sciences for Panomics), 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
| |
|
22
|
Guclu TF, Atilgan AR, Atilgan C. Deciphering GB1's Single Mutational Landscape: Insights from MuMi Analysis. J Phys Chem B 2024; 128:7987-7996. [PMID: 39115184 PMCID: PMC11671028 DOI: 10.1021/acs.jpcb.4c04916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 08/02/2024] [Accepted: 08/02/2024] [Indexed: 08/23/2024]
Abstract
Mutational changes that affect the binding of the C2 fragment of Streptococcal protein G (GB1) to the Fc domain of human IgG (IgG-Fc) have been extensively studied using deep mutational scanning (DMS), and the binding affinity of all single mutations has been measured experimentally in the literature. To investigate the underlying molecular basis, we perform in silico mutational scanning for all possible single mutations, along with 2 μs-long molecular dynamics (WT-MD) of the wild-type (WT) GB1 in both unbound and IgG-Fc bound forms. We compute the hydrogen bonds between GB1 and IgG-Fc in WT-MD to identify the dominant hydrogen bonds for binding, which we then assess in conformations produced by Mutation and Minimization (MuMi) to explain the fitness landscape of GB1 and IgG-Fc binding. Furthermore, we analyze MuMi and WT-MD to investigate the dynamics of binding, focusing on the relative solvent accessibility of residues and the probability of residues being located at the binding interface. With these analyses, we explain the interactions between GB1 and IgG-Fc and display the structural features of binding. In sum, our findings highlight the potential of MuMi as a reliable and computationally efficient tool for predicting protein fitness landscapes, offering significant advantages over traditional methods. The methodologies and results presented in this study pave the way for improved predictive accuracy in protein stability and interaction studies, which are crucial for advancements in drug design and synthetic biology.
Affiliation(s)
- Tandac F. Guclu
- Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
| | - Ali Rana Atilgan
- Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
| | - Canan Atilgan
- Faculty of Natural Sciences and Engineering, Sabanci University, Tuzla, Istanbul 34956, Turkey
| |
|
23
|
Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, Arnold FH. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proc Natl Acad Sci U S A 2024; 121:e2400439121. [PMID: 39074291 PMCID: PMC11317637 DOI: 10.1073/pnas.2400439121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2024] [Accepted: 06/17/2024] [Indexed: 07/31/2024] Open
Abstract
Protein engineering often targets binding pockets or active sites, which are enriched in epistasis (nonadditive interactions between amino acid substitutions) and where the combined effects of multiple single substitutions are difficult to predict. Few existing sequence-fitness datasets capture epistasis at large scale, especially for enzyme catalysis, limiting the development and assessment of model-guided enzyme engineering approaches. We present here a combinatorially complete, 160,000-variant fitness landscape across four residues in the active site of an enzyme. Assaying the native reaction of a thermostable β-subunit of tryptophan synthase (TrpB) in a nonnative environment yielded a landscape characterized by significant epistasis and many local optima. These effects prevent simulated directed evolution approaches from efficiently reaching the global optimum. There is nonetheless wide variability in the effectiveness of different directed evolution approaches, which together provide experimental benchmarks for computational and machine learning workflows. The most-fit TrpB variants contain a substitution that is nearly absent in natural TrpB sequences, a result that conservation-based predictions would not capture. Thus, although fitness prediction using evolutionary data can enrich in more-active variants, these approaches struggle to identify and differentiate among the most-active variants, even for this near-native function. Overall, this work presents a large-scale testing ground for model-guided enzyme engineering and suggests that efficient navigation of epistatic fitness landscapes can be improved by advances in both machine learning and physical modeling.
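To make the terms "combinatorially complete" and "local optima" in the abstract above concrete, the following minimal sketch counts the variants in a four-site, 20-letter library and then finds local optima on a tiny made-up landscape. The two-letter toy landscape, its fitness values, and the function names (`neighbors`, `local_optima`, `greedy_walk`) are assumptions for illustration only; they are not the authors' data or code, and the greedy walk is just one simple stand-in for a simulated directed evolution strategy.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids
N_SITES = 4

# Size of a combinatorially complete library over four sites.
print(len(ALPHABET) ** N_SITES)  # 160000


def neighbors(variant, alphabet):
    """Yield every variant that differs from `variant` at exactly one site."""
    for i, current in enumerate(variant):
        for aa in alphabet:
            if aa != current:
                yield variant[:i] + aa + variant[i + 1:]


def local_optima(fitness, alphabet):
    """Variants whose fitness no single-substitution neighbor exceeds."""
    return [v for v, f in fitness.items()
            if all(fitness[n] <= f for n in neighbors(v, alphabet))]


def greedy_walk(start, fitness, alphabet):
    """Hypothetical greedy strategy: always take the best single step."""
    current = start
    while True:
        best = max(neighbors(current, alphabet), key=fitness.get)
        if fitness[best] <= fitness[current]:
            return current  # no neighbor improves: stuck on an optimum
        current = best


# Made-up two-site, two-letter landscape with reciprocal sign epistasis:
# each single substitution from "AA" is harmful, but the double is best.
toy = {"AA": 1.0, "AB": 0.5, "BA": 0.5, "BB": 2.0}

print(local_optima(toy, "AB"))       # ['AA', 'BB']
print(greedy_walk("AA", toy, "AB"))  # AA  (trapped on the lower peak)
print(greedy_walk("AB", toy, "AB"))  # BB
```

In this toy case a walk starting at "AA" never reaches the global optimum "BB", which is the kind of trapping that many local optima produce on a larger landscape.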
Affiliation(s)
- Kadina E. Johnston
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
| | - Patrick J. Almhjell
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
| | - Ella J. Watkins-Dulaney
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
| | - Grace Liu
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
| | - Nicholas J. Porter
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
| | - Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
| | - Frances H. Arnold
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA 91125
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125
| |
|
24
|
Tan Y, Li M, Zhou Z, Tan P, Yu H, Fan G, Hong L. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications. J Cheminform 2024; 16:92. [PMID: 39095917 PMCID: PMC11297785 DOI: 10.1186/s13321-024-00884-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Accepted: 07/13/2024] [Indexed: 08/04/2024] Open
Abstract
Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining. SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
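The idea of treating frequently co-occurring residues as single tokens can be illustrated with a from-scratch sketch of byte-pair-encoding merges on protein strings. This is not the PETA code (which is at the repository linked above); the function names `learn_bpe` and `merge_pair` and the two toy sequences are hypothetical, and a real pipeline would more likely rely on an established tokenizer library.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) pair in `tokens` with one token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules, starting from single residues."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the whole corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), _count = pairs.most_common(1)[0]
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges, corpus


# Toy sequences, made up for illustration.
merges, tokenized = learn_bpe(["GSGSA", "GSAGS"], num_merges=2)
print(merges)     # [('G', 'S'), ('GS', 'A')]
print(tokenized)  # [['GS', 'GSA'], ['GSA', 'GS']]
```

Each merge adds one entry to the vocabulary, so the vocabulary size is the 20 residues plus the number of merges performed.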
Affiliation(s)
- Yang Tan
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
- Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
| | - Mingchen Li
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
- Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
| | - Ziyi Zhou
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Pan Tan
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
| | - Huiqun Yu
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
| | - Guisheng Fan
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
| | - Liang Hong
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China.
- Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China.
| |
|
25
|
Freschlin CR, Fahlberg SA, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. Nat Commun 2024; 15:6405. [PMID: 39080282 PMCID: PMC11289474 DOI: 10.1038/s41467-024-50712-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 07/13/2024] [Indexed: 08/02/2024] Open
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
Affiliation(s)
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA.
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA.
| |
|
26
|
Ding K, Chin M, Zhao Y, Huang W, Mai BK, Wang H, Liu P, Yang Y, Luo Y. Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nat Commun 2024; 15:6392. [PMID: 39080249 PMCID: PMC11289365 DOI: 10.1038/s41467-024-50698-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Accepted: 07/19/2024] [Indexed: 08/02/2024] Open
Abstract
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
Affiliation(s)
- Kerr Ding
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
| | - Michael Chin
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Yunlong Zhao
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Wei Huang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Binh Khanh Mai
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Huanan Wang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Peng Liu
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA.
| | - Yang Yang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA.
- Biomolecular Science and Engineering (BMSE) Program, University of California, Santa Barbara, CA, 93106, USA.
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
| |
|
27
|
Hayes RL, Cervantes LF, Abad Santos JC, Samadi A, Vilseck JZ, Brooks CL. How to Sample Dozens of Substitutions per Site with λ Dynamics. J Chem Theory Comput 2024; 20:6098-6110. [PMID: 38976796 PMCID: PMC11270746 DOI: 10.1021/acs.jctc.4c00514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 06/18/2024] [Accepted: 06/18/2024] [Indexed: 07/10/2024]
Abstract
Alchemical free energy methods are useful in computer-aided drug design and computational protein design because they provide rigorous statistical mechanics-based estimates of free energy differences from molecular dynamics simulations. λ dynamics is a free energy method with the ability to characterize combinatorial chemical spaces spanning thousands of related systems within a single simulation, which gives it a distinct advantage over other alchemical free energy methods that are mostly limited to pairwise comparisons. Recently developed methods have improved the scalability of λ dynamics to perturbations at many sites; however, the size of chemical space that can be explored at each individual site has previously been limited to fewer than ten substituents. As the number of substituents increases, the volume of alchemical space corresponding to nonphysical alchemical intermediates grows exponentially relative to the size corresponding to the physical states of interest. Beyond nine substituents, λ dynamics simulations become lost in an alchemical morass of intermediate states. In this work, we introduce new biasing potentials that circumvent excessive sampling of intermediate states by favoring sampling of physical end points relative to alchemical intermediates. Additionally, we present a more scalable adaptive landscape flattening algorithm for these larger alchemical spaces. Finally, we show that this potential enables more efficient sampling in both protein and drug design test systems with up to 24 substituents per site, enabling, for the first time, simultaneous simulation of all 20 amino acids.
Affiliation(s)
- Ryan L. Hayes
- Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, California 92697, United States
- Department of Pharmaceutical Sciences, University of California Irvine, Irvine, California 92697, United States
| | - Luis F. Cervantes
- Department of Medicinal Chemistry, College of Pharmacy, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Justin Cruz Abad Santos
- Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, California 92697, United States
| | - Amirmasoud Samadi
- Department of Chemical and Biomolecular Engineering, University of California Irvine, Irvine, California 92697, United States
| | - Jonah Z. Vilseck
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Charles L. Brooks
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Biophysics Program, University of Michigan, Ann Arbor, Michigan 48109, United States
| |
|
28
|
Crandall JG, Zhou X, Rokas A, Hittinger CT. Specialization restricts the evolutionary paths available to yeast sugar transporters. bioRxiv 2024:2024.07.22.604696. [PMID: 39091816 PMCID: PMC11291069 DOI: 10.1101/2024.07.22.604696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024]
Abstract
Functional innovation at the protein level is a key source of evolutionary novelties. The constraints on functional innovations are likely to be highly specific in different proteins, which are shaped by their unique histories and the extent of global epistasis that arises from their structures and biochemistries. These contextual nuances in the sequence-function relationship have implications both for a basic understanding of the evolutionary process and for engineering proteins with desirable properties. Here, we have investigated the molecular basis of novel function in a model member of an ancient, conserved, and biotechnologically relevant protein family. These Major Facilitator Superfamily sugar porters are a functionally diverse group of proteins that are thought to be highly plastic and evolvable. By dissecting a recent evolutionary innovation in an α-glucoside transporter from the yeast Saccharomyces eubayanus, we show that the ability to transport a novel substrate requires high-order interactions between many protein regions and numerous specific residues proximal to the transport channel. To reconcile the functional diversity of this family with the constrained evolution of this model protein, we generated new, state-of-the-art genome annotations for 332 Saccharomycotina yeast species spanning approximately 400 million years of evolution. By integrating phylogenetic and phenotypic analyses across these species, we show that the model yeast α-glucoside transporters likely evolved from a multifunctional ancestor and became subfunctionalized. The accumulation of additive and epistatic substitutions likely entrenched this subfunction, which made the simultaneous acquisition of multiple interacting substitutions the only reasonably accessible path to novelty.
Affiliation(s)
- Johnathan G. Crandall
- Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI 53726, USA
| | - Xiaofan Zhou
- Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Center, South China Agricultural University, Guangzhou 510642, China
- Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
| | - Antonis Rokas
- Department of Biological Sciences and Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
| | - Chris Todd Hittinger
- Laboratory of Genetics, J. F. Crow Institute for the Study of Evolution, Center for Genomic Science Innovation, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, University of Wisconsin-Madison, Madison, WI 53726, USA
| |
|
29
|
Chitra U, Arnold BJ, Raphael BJ. Quantifying higher-order epistasis: beware the chimera. bioRxiv 2024:2024.07.17.603976. [PMID: 39071303 PMCID: PMC11275791 DOI: 10.1101/2024.07.17.603976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Epistasis, or interactions in which alleles at one locus modify the fitness effects of alleles at other loci, plays a fundamental role in genetics, protein evolution, and many other areas of biology. Epistasis is typically quantified by computing the deviation from the expected fitness under an additive or multiplicative model using one of several formulae. However, these formulae are not all equivalent. Importantly, one widely used formula - which we call the chimeric formula - measures deviations from a multiplicative fitness model on an additive scale, thus mixing two measurement scales. We show that for pairwise interactions, the chimeric formula yields a different magnitude, but the same sign (synergistic vs. antagonistic) of epistasis compared to the multiplicative formula that measures both fitness and deviations on a multiplicative scale. However, for higher-order interactions, we show that the chimeric formula can have both different magnitude and sign compared to the multiplicative formula - thus confusing negative epistatic interactions with positive interactions, and vice versa. We resolve these inconsistencies by deriving fundamental connections between the different epistasis formulae and the parameters of the multivariate Bernoulli distribution. Our results demonstrate that the additive and multiplicative epistasis formulae are more mathematically sound than the chimeric formula. Moreover, we demonstrate that the mathematical issues with the chimeric epistasis formula lead to markedly different biological interpretations of real data. Analyzing multi-gene knockout data in yeast, multi-way drug interactions in E. coli, and deep mutational scanning (DMS) of several proteins, we find that 10-60% of higher-order interactions have a change in sign with the multiplicative or additive epistasis formula. These sign changes result in qualitatively different findings on functional divergence in the yeast genome, synergistic vs. antagonistic drug interactions, and epistasis between protein mutations. In particular, in the yeast data, the more appropriate multiplicative formula identifies nearly 500 additional negative three-way interactions, thus extending the trigenic interaction network by 25%.
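The pairwise claim in this abstract can be illustrated numerically. The sketch below writes out one common form of each pairwise formula; the exact expressions, the function names, and the fitness values are assumptions based on the abstract's verbal description (the preprint itself should be consulted for the authors' definitions). Here f00 is wild-type fitness, f10 and f01 are the single mutants, and f11 is the double mutant.

```python
import math


def additive_epistasis(f00, f10, f01, f11):
    """Deviation from an additive model, measured as a difference."""
    return f11 - f10 - f01 + f00


def multiplicative_epistasis(f00, f10, f01, f11):
    """Deviation from a multiplicative model, measured as a log ratio."""
    return math.log((f11 * f00) / (f10 * f01))


def chimeric_epistasis(f00, f10, f01, f11):
    """Multiplicative expectation, but deviation measured as a difference.

    Fitness values are first normalized to the wild type (assumed form).
    """
    w10, w01, w11 = f10 / f00, f01 / f00, f11 / f00
    return w11 - w10 * w01


# Made-up fitness values: (f00, f10, f01, f11).
antagonistic = (1.0, 0.8, 0.5, 0.3)  # double mutant below 0.8 * 0.5 = 0.4
synergistic = (1.0, 0.8, 0.5, 0.6)   # double mutant above 0.4

for values in (antagonistic, synergistic):
    chim = chimeric_epistasis(*values)
    mult = multiplicative_epistasis(*values)
    # For positive fitness values the two pairwise measures share a sign,
    # although their magnitudes differ.
    print(f"chimeric={chim:+.3f}  multiplicative={mult:+.3f}")
```

For two loci the chimeric value is positive exactly when the ratio inside the logarithm exceeds 1, which is why the signs agree; the abstract's point is that this guarantee is lost for three or more loci.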
|
30
|
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performance comparable to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
| | - Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
| |
|
31
|
Chen SK, Liu J, Van Nynatten A, Tudor-Price BM, Chang BSW. Sampling Strategies for Experimentally Mapping Molecular Fitness Landscapes Using High-Throughput Methods. J Mol Evol 2024:10.1007/s00239-024-10179-8. [PMID: 38886207 DOI: 10.1007/s00239-024-10179-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Accepted: 05/20/2024] [Indexed: 06/20/2024]
Abstract
Empirical studies of genotype-phenotype-fitness maps of proteins are fundamental to understanding the evolutionary process, in elucidating the space of possible genotypes accessible through mutations in a landscape of phenotypes and fitness effects. Yet, comprehensively mapping molecular fitness landscapes remains challenging since all possible combinations of amino acid substitutions for even a few protein sites are encoded by an enormous genotype space. High-throughput mapping of genotype space can be achieved using large-scale screening experiments known as multiplexed assays of variant effect (MAVEs). However, to accommodate such multi-mutational studies, the size of MAVEs has grown to the point where a priori determination of sampling requirements is needed. To address this problem, we propose calculations and simulation methods to approximate minimum sampling requirements for multi-mutational MAVEs, which we combine with a new library construction protocol to experimentally validate our approximation approaches. Analysis of our simulated data reveals how sampling trajectories differ between simulations of nucleotide versus amino acid variants and among mutagenesis schemes. For this, we show quantitatively that marginal gains in sampling efficiency demand increasingly greater sampling effort when sampling for nucleotide sequences over their encoded amino acid equivalents. We present a new library construction protocol that efficiently maximizes sequence variation, and demonstrate using ultradeep sequencing that the library encodes virtually all possible combinations of mutations within the experimental design. Insights learned from our analyses together with the methodological advances reported herein are immediately applicable toward pooled experimental screens of arbitrary design, enabling further assay upscaling and expanded testing of genotype space.
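A back-of-the-envelope version of an a priori sampling estimate can be written under the simplest possible assumption: every variant in the library is equally likely to be drawn, and draws are independent. This is a generic coverage calculation, not the authors' calculation or simulation method; the function names and the 8,000-variant example (three fully randomized amino acid sites) are hypothetical.

```python
import math


def expected_coverage(n_variants, n_reads):
    """Expected fraction of variants seen at least once in `n_reads` draws."""
    return 1.0 - (1.0 - 1.0 / n_variants) ** n_reads


def reads_for_coverage(n_variants, target):
    """Smallest number of uniform draws whose expected coverage >= target."""
    # Solve 1 - (1 - 1/N)^n >= target for n; log1p keeps precision for large N.
    return math.ceil(math.log1p(-target) / math.log1p(-1.0 / n_variants))


n_variants = 20 ** 3  # all amino acid combinations at three sites
reads = reads_for_coverage(n_variants, 0.95)
print(reads, round(reads / n_variants, 2))  # roughly 3-fold oversampling
print(expected_coverage(n_variants, reads))
```

Real libraries are not uniform. Codon degeneracy, for example, makes some amino acid variants more likely than others, so this estimate should be read as a lower bound on the effort needed, consistent with the abstract's observation that nucleotide-level and amino-acid-level sampling behave differently.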
Affiliation(s)
- Steven K Chen
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Jing Liu
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada
| | - Alexander Van Nynatten
- Department of Biological Science, University of Toronto Scarborough, Toronto, ON, Canada
| | | | - Belinda S W Chang
- Department of Cell & Systems Biology, University of Toronto, Toronto, ON, Canada.
- Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, ON, Canada.
- Centre for the Analysis of Genome Evolution & Function, University of Toronto, Toronto, ON, Canada.
| |
|
32
|
Metzger BPH, Park Y, Starr TN, Thornton JW. Epistasis facilitates functional evolution in an ancient transcription factor. eLife 2024; 12:RP88737. [PMID: 38767330 PMCID: PMC11105156 DOI: 10.7554/elife.88737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024] Open
Abstract
A protein's genetic architecture - the set of causal rules by which its sequence produces its functions - also determines its possible evolutionary trajectories. Prior research has proposed that the genetic architecture of proteins is very complex, with pervasive epistatic interactions that constrain evolution and make function difficult to predict from sequence. Most of this work has analyzed only the direct paths between two proteins of interest - excluding the vast majority of possible genotypes and evolutionary trajectories - and has considered only a single protein function, leaving unaddressed the genetic architecture of functional specificity and its impact on the evolution of new functions. Here, we develop a new method based on ordinal logistic regression to directly characterize the global genetic determinants of multiple protein functions from 20-state combinatorial deep mutational scanning (DMS) experiments. We use it to dissect the genetic architecture and evolution of a transcription factor's specificity for DNA, using data from a combinatorial DMS of an ancient steroid hormone receptor's capacity to activate transcription from two biologically relevant DNA elements. We show that the genetic architecture of DNA recognition consists of a dense set of main and pairwise effects that involve virtually every possible amino acid state in the protein-DNA interface, but higher-order epistasis plays only a tiny role. Pairwise interactions enlarge the set of functional sequences and are the primary determinants of specificity for different DNA elements. They also massively expand the number of opportunities for single-residue mutations to switch specificity from one DNA target to another. By bringing variants with different functions close together in sequence space, pairwise epistasis therefore facilitates rather than constrains the evolution of new functions.
Affiliation(s)
- Brian PH Metzger
  - Department of Ecology and Evolution, University of Chicago, Chicago, United States
- Yeonwoo Park
  - Program in Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, United States
- Tyler N Starr
  - Department of Biochemistry and Molecular Biophysics, University of Chicago, Chicago, United States
- Joseph W Thornton
  - Department of Ecology and Evolution, University of Chicago, Chicago, United States
  - Department of Human Genetics, University of Chicago, Chicago, United States
|
33
|
Wagner A. Genotype sampling for deep-learning assisted experimental mapping of a combinatorially complete fitness landscape. Bioinformatics 2024; 40:btae317. [PMID: 38745436 PMCID: PMC11132821 DOI: 10.1093/bioinformatics/btae317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 05/16/2024]
Abstract
MOTIVATION: Experimental characterization of fitness landscapes, which map genotypes onto fitness, is important for both evolutionary biology and protein engineering. It faces a fundamental obstacle in the astronomical number of genotypes whose fitness needs to be measured for any one protein. Deep learning may help to predict the fitness of many genotypes from a smaller neural network training sample of genotypes with experimentally measured fitness. Here I use a recently published experimentally mapped fitness landscape of more than 260 000 protein genotypes to ask how such sampling is best performed. RESULTS: I show that multilayer perceptrons, recurrent neural networks, convolutional networks, and transformers can explain more than 90% of fitness variance in the data. In addition, 90% of this performance is reached with a training sample comprising merely ≈10³ sequences. Generalization to unseen test data is best when training data is sampled randomly and uniformly, or sampled to minimize the number of synonymous sequences. In contrast, sampling to maximize sequence diversity or codon usage bias reduces performance substantially. These observations hold for more than one network architecture. Simple sampling strategies may perform best when training deep learning neural networks to map fitness landscapes from experimental data. AVAILABILITY AND IMPLEMENTATION: The fitness landscape data analyzed here is publicly available as described previously (Papkou et al. 2023). All code used to analyze this landscape is publicly available at https://github.com/andreas-wagner-uzh/fitness_landscape_sampling.
Affiliation(s)
- Andreas Wagner
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, 8057 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, 1015 Lausanne, Switzerland
  - The Santa Fe Institute, Santa Fe, 87501 NM, United States
|
34
|
Rozhoňová H, Martí-Gómez C, McCandlish DM, Payne JL. Robust genetic codes enhance protein evolvability. PLoS Biol 2024; 22:e3002594. [PMID: 38754362 PMCID: PMC11098591 DOI: 10.1371/journal.pbio.3002594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 03/19/2024] [Indexed: 05/18/2024]
Abstract
The standard genetic code defines the rules of translation for nearly every life form on Earth. It also determines the amino acid changes accessible via single-nucleotide mutations, thus influencing protein evolvability, the ability of mutation to bring forth adaptive variation in protein function. One of the most striking features of the standard genetic code is its robustness to mutation, yet it remains an open question whether such robustness facilitates or frustrates protein evolvability. To answer this question, we use data from massively parallel sequence-to-function assays to construct and analyze 6 empirical adaptive landscapes under hundreds of thousands of rewired genetic codes, including those of codon compression schemes relevant to protein engineering and synthetic biology. We find that robust genetic codes tend to enhance protein evolvability by rendering smooth adaptive landscapes with few peaks, which are readily accessible from throughout sequence space. However, the standard genetic code is rarely exceptional in this regard, because many alternative codes render smoother landscapes than the standard code. By constructing low-dimensional visualizations of these landscapes, which each comprise more than 16 million mRNA sequences, we show that such alternative codes radically alter the topological features of the network of high-fitness genotypes. Whereas the genetic codes that optimize evolvability depend to some extent on the detailed relationship between amino acid sequence and protein function, we also uncover general design principles for engineering nonstandard genetic codes for enhanced and diminished evolvability, which may facilitate directed protein evolution experiments and the bio-containment of synthetic organisms, respectively.
Affiliation(s)
- Hana Rozhoňová
  - Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Joshua L. Payne
  - Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
35
|
Meger AT, Spence MA, Sandhu M, Matthews D, Chen J, Jackson CJ, Raman S. Rugged fitness landscapes minimize promiscuity in the evolution of transcriptional repressors. Cell Syst 2024; 15:374-387.e6. [PMID: 38537640 PMCID: PMC11299162 DOI: 10.1016/j.cels.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 09/08/2023] [Accepted: 03/05/2024] [Indexed: 04/20/2024]
Abstract
How a protein's function influences the shape of its fitness landscape, smooth or rugged, is a fundamental question in evolutionary biochemistry. Smooth landscapes arise when incremental mutational steps lead to a progressive change in function, as commonly seen in enzymes and binding proteins. On the other hand, rugged landscapes are poorly understood because of the inherent unpredictability of how sequence changes affect function. Here, we experimentally characterize the entire sequence phylogeny, comprising 1,158 extant and ancestral sequences, of the DNA-binding domain (DBD) of the LacI/GalR transcriptional repressor family. Our analysis revealed an extremely rugged landscape with rapid switching of specificity, even between adjacent nodes. Further, the ruggedness arises due to the necessity of the repressor to simultaneously evolve specificity for asymmetric operators and disfavors potentially adverse regulatory crosstalk. Our study provides fundamental insight into evolutionary, molecular, and biophysical rules of genetic regulation through the lens of fitness landscapes.
Affiliation(s)
- Anthony T Meger
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- Matthew A Spence
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Mahakaran Sandhu
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Dana Matthews
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Jackie Chen
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- Colin J Jackson
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Synthetic Biology, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Srivatsan Raman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
|
36
|
Yang KK, Fusi N, Lu AX. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst 2024; 15:286-294.e2. [PMID: 38428432 DOI: 10.1016/j.cels.2024.01.008] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 02/22/2023] [Revised: 11/08/2023] [Accepted: 01/24/2024] [Indexed: 03/03/2024]
Abstract
Pretrained protein sequence language models have been shown to improve the performance of many prediction tasks and are now routinely integrated into bioinformatics tools. However, these models largely rely on the transformer architecture, which scales quadratically with sequence length in both run-time and memory. Therefore, state-of-the-art models have limitations on sequence length. To address this limitation, we investigated whether convolutional neural network (CNN) architectures, which scale linearly with sequence length, could be as effective as transformers in protein language models. With masked language model pretraining, CNNs are competitive with, and occasionally superior to, transformers across downstream applications while maintaining strong performance on sequences longer than those allowed in the current state-of-the-art transformer models. Our work suggests that computational efficiency can be improved without sacrificing performance, simply by using a CNN architecture instead of a transformer, and emphasizes the importance of disentangling pretraining task and model architecture. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Kevin K Yang
  - Microsoft Research New England, Cambridge, MA 02139, USA
- Nicolo Fusi
  - Microsoft Research New England, Cambridge, MA 02139, USA
- Alex X Lu
  - Microsoft Research New England, Cambridge, MA 02139, USA
|
37
|
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent Sci 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
|
38
|
Park Y, Metzger BP, Thornton JW. The simplicity of protein sequence-function relationships. bioRxiv 2024:2023.09.02.556057. [PMID: 37732229 PMCID: PMC10508729 DOI: 10.1101/2023.09.02.556057] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Indexed: 09/22/2023]
Abstract
How complicated is the genetic architecture of proteins - the set of causal effects by which sequence determines function? High-order epistatic interactions among residues are thought to be pervasive, making a protein's function difficult to predict or understand from its sequence. Most studies, however, used methods that overestimate epistasis, because they analyze genetic architecture relative to a designated reference sequence - causing measurement noise and small local idiosyncrasies to propagate into pervasive high-order interactions - or have not effectively accounted for global nonlinearity in the sequence-function relationship. Here we present a new reference-free method that jointly estimates global nonlinearity and specific epistatic interactions across a protein's entire genotype-phenotype map. This method yields a maximally efficient explanation of a protein's genetic architecture and is more robust than existing methods to measurement noise, partial sampling, and model misspecification. We reanalyze 20 combinatorial mutagenesis experiments from a diverse set of proteins and find that additive and pairwise effects, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of total variance in measured phenotypes (and >92% in every case). Only a tiny fraction of genotypes are strongly affected by third- or higher-order epistasis. Genetic architecture is also sparse: the number of terms required to explain the vast majority of variance is smaller than the number of genotypes by many orders of magnitude. The sequence-function relationship in most proteins is therefore far simpler than previously thought, opening the way for new and tractable approaches to characterize it.
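The general model form named in this abstract (additive and pairwise effects passed through a simple nonlinearity that bounds the measurable range) can be sketched in a few lines. This is only an illustration of that form, not the authors' reference-free estimation procedure: the logistic link, the function names, and every effect value below are hypothetical choices made for the example.

```python
import math
from itertools import combinations


def latent_phenotype(genotype, intercept, additive, pairwise):
    """Sum the intercept, per-site effects, and pairwise interaction effects.

    genotype: tuple of amino acid states, one per site.
    additive: {(site, state): effect}
    pairwise: {((site_i, state_i), (site_j, state_j)): effect}, with site_i < site_j.
    Terms missing from the dictionaries contribute zero.
    """
    keyed = list(enumerate(genotype))  # [(site, state), ...]
    total = intercept
    for key in keyed:
        total += additive.get(key, 0.0)
    for first, second in combinations(keyed, 2):
        total += pairwise.get((first, second), 0.0)
    return total


def observed_phenotype(latent, lower, upper):
    """Map the latent value onto a bounded scale with a logistic link (assumed form)."""
    return lower + (upper - lower) / (1.0 + math.exp(-latent))


if __name__ == "__main__":
    # Hypothetical two-site example with one interaction term.
    additive = {(0, "A"): 1.0, (1, "V"): -0.5}
    pairwise = {((0, "A"), (1, "V")): 0.5}
    z = latent_phenotype(("A", "V"), 0.0, additive, pairwise)
    print(z, observed_phenotype(z, lower=0.0, upper=1.0))
```

The bounded link is what lets a purely additive latent scale produce apparent interactions near the floor or ceiling of an assay, which is why the abstract treats global nonlinearity separately from specific epistatic terms.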
Affiliation(s)
- Yeonwoo Park
  - Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL 60637
  - Current affiliation: Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea 08826
- Brian P.H. Metzger
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637
  - Current affiliation: Department of Biological Sciences, Purdue University, West Lafayette, IN 47907
- Joseph W. Thornton
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL 60637
  - Department of Human Genetics, University of Chicago, Chicago, IL 60637
|
39
|
Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G. Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 2024; 20:e1011812. [PMID: 38377054 PMCID: PMC10878520 DOI: 10.1371/journal.pcbi.1011812] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/06/2023] [Accepted: 01/08/2024] [Indexed: 02/22/2024]
Abstract
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in-silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variants enrichment ratios, and thus can be used even in cases of disjoint sequence samples.
Affiliation(s)
- Luca Sesta
  - Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
- Andrea Pagnani
  - Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
  - Italian Institute for Genomic Medicine, Torino, Italy
  - INFN, Sezione di Torino, Torino, Italy
|
40
|
Ieremie I, Ewing RM, Niranjan M. Protein language models meet reduced amino acid alphabets. Bioinformatics 2024; 40:btae061. [PMID: 38310333 PMCID: PMC10872054 DOI: 10.1093/bioinformatics/btae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 12/14/2023] [Accepted: 01/30/2024] [Indexed: 02/05/2024]
Abstract
MOTIVATION: Protein language models (PLMs), which borrowed ideas for modelling and inference from natural language processing, have demonstrated the ability to extract meaningful representations in an unsupervised way. This led to significant performance improvement in several downstream tasks. Clustering amino acids based on their physical-chemical properties to achieve reduced alphabets has been of interest in past research, but their application to PLMs or folding models is unexplored. RESULTS: Here, we investigate the efficacy of PLMs trained on reduced amino acid alphabets in capturing evolutionary information, and we explore how the loss of protein sequence information impacts learned representations and downstream task performance. Our empirical work shows that PLMs trained on the full alphabet and a large number of sequences capture fine details that are lost in alphabet reduction methods. We further show the ability of a structure prediction model (ESMFold) to fold CASP14 protein sequences translated using a reduced alphabet. For 10 proteins out of the 50 targets, reduced alphabets improve structural predictions with LDDT-Cα differences of up to 19%. AVAILABILITY AND IMPLEMENTATION: Trained models and code are available at github.com/Ieremie/reduced-alph-PLM.
Affiliation(s)
- Ioan Ieremie
  - Vision, Learning & Control Group, University of Southampton, Southampton SO17 1BJ, United Kingdom
- Rob M Ewing
  - Biological Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom
- Mahesan Niranjan
  - Vision, Learning & Control Group, University of Southampton, Southampton SO17 1BJ, United Kingdom
|
41
|
Dupic T, Phillips AM, Desai MM. Protein sequence landscapes are not so simple: on reference-free versus reference-based inference. bioRxiv 2024:2024.01.29.577800. [PMID: 38352387 PMCID: PMC10862727 DOI: 10.1101/2024.01.29.577800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/23/2024]
Abstract
In a recent preprint, Park, Metzger, and Thornton reanalyze 20 empirical protein sequence-function landscapes using a "reference-free analysis" (RFA) method they recently developed. They argue that these empirical landscapes are simpler and less epistatic than earlier work suggested, and attribute the difference to limitations of the methods used in the original analyses of these landscapes, which they claim are more sensitive to measurement noise, missing data, and other artifacts. Here, we show that these claims are incorrect. Instead, we find that the RFA method introduced by Park et al. is exactly equivalent to the reference-based least-squares methods used in the original analysis of many of these empirical landscapes (and also equivalent to a Hadamard-based approach they implement). Because the reanalyzed and original landscapes are in fact identical, the different conclusions drawn by Park et al. instead reflect different interpretations of the parameters describing the inferred landscapes; we argue that these do not support the conclusion that epistasis plays only a small role in protein sequence-function landscapes.
Affiliation(s)
- Thomas Dupic
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA
- Angela M Phillips
  - Department of Microbiology and Immunology, University of California San Francisco, San Francisco, CA
- Michael M Desai
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA
|
42
|
Hayes RL, Nixon CF, Marqusee S, Brooks CL. Selection pressures on evolution of ribonuclease H explored with rigorous free-energy-based design. Proc Natl Acad Sci U S A 2024; 121:e2312029121. [PMID: 38194446 PMCID: PMC10801872 DOI: 10.1073/pnas.2312029121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 11/22/2023] [Indexed: 01/11/2024]
Abstract
Understanding natural protein evolution and designing novel proteins are motivating interest in development of high-throughput methods to explore large sequence spaces. In this work, we demonstrate the application of multisite λ dynamics (MSλD), a rigorous free energy simulation method, and chemical denaturation experiments to quantify evolutionary selection pressure from sequence-stability relationships and to address questions of design. This study examines a mesophilic phylogenetic clade of ribonuclease H (RNase H), furthering its extensive characterization in earlier studies, focusing on E. coli RNase H (ecRNH) and a more stable consensus sequence (AncCcons) differing at 15 positions. The stabilities of 32,768 chimeras between these two sequences were computed using the MSλD framework. The most stable and least stable chimeras were predicted and tested along with several other sequences, revealing a designed chimera with approximately the same stability increase as AncCcons, but requiring only half the mutations. Comparing the computed stabilities with experiment for 12 sequences reveals a Pearson correlation of 0.86 and root mean squared error of 1.18 kcal/mol, an unprecedented level of accuracy well beyond less rigorous computational design methods. We then quantified selection pressure using a simple evolutionary model in which sequences are selected according to the Boltzmann factor of their stability. Selection temperatures from 110 to 168 K are estimated in three ways by comparing experimental and computational results to evolutionary models. These estimates indicate selection pressure is high, which has implications for evolutionary dynamics and for the accuracy required for design, and suggests accurate high-throughput computational methods like MSλD may enable more effective protein design.
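Two quantities in this abstract lend themselves to a short worked example. Parents that differ at 15 positions give 2^15 = 32,768 chimeras when each position takes either parent's residue, and a Boltzmann-type selection model weights each sequence by the exponential of its stability divided by the thermal energy at a selection temperature. The sketch below is not the authors' MSλD workflow or their evolutionary model; the function names, the sign convention (a larger value means a more stable sequence), and the example stabilities are assumptions made for illustration.

```python
import math

# Molar gas constant in kcal/(mol*K), used as k_B for per-mole energies.
K_B = 0.0019872041


def count_chimeras(n_positions: int) -> int:
    """Number of two-parent chimeras when each differing position has two choices."""
    return 2 ** n_positions


def boltzmann_weights(stabilities_kcal, selection_temperature_k):
    """Normalized weights proportional to exp(stability / (k_B * T_sel)).

    Assumed convention: larger stability values are favored by selection.
    The maximum is subtracted before exponentiating for numerical safety.
    """
    beta = 1.0 / (K_B * selection_temperature_k)
    top = max(stabilities_kcal)
    unnormalized = [math.exp(beta * (s - top)) for s in stabilities_kcal]
    partition = sum(unnormalized)
    return [u / partition for u in unnormalized]


if __name__ == "__main__":
    print(count_chimeras(15))
    # Hypothetical stabilities (kcal/mol) at the two ends of the reported range.
    for t_sel in (110.0, 168.0):
        print(t_sel, boltzmann_weights([0.0, 1.0, 2.0], t_sel))
```

Under this convention, lowering the selection temperature concentrates the weight on the most stable sequence, which is the sense in which a low selection temperature corresponds to strong selection pressure.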
Affiliation(s)
- Ryan L. Hayes
  - Department of Chemical and Biomolecular Engineering, University of California, Irvine, CA 92697
  - Department of Chemistry, University of Michigan, Ann Arbor, MI 48109
- Charlotte F. Nixon
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720
- Susan Marqusee
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA 94720
  - Department of Chemistry, University of California, Berkeley, CA 94720
- Charles L. Brooks
  - Department of Chemistry, University of Michigan, Ann Arbor, MI 48109
  - Biophysics Program, University of Michigan, Ann Arbor, MI 48109
|
43
|
Eble H, Joswig M, Lamberti L, Ludington WB. Master regulators of biological systems in higher dimensions. Proc Natl Acad Sci U S A 2023; 120:e2300634120. [PMID: 38096409 PMCID: PMC10743376 DOI: 10.1073/pnas.2300634120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 10/23/2023] [Indexed: 12/18/2023]
Abstract
A longstanding goal of biology is to identify the key genes and species that critically impact evolution, ecology, and health. Network analysis has revealed keystone species that regulate ecosystems and master regulators that regulate cellular genetic networks. Yet these studies have focused on pairwise biological interactions, which can be affected by the context of genetic background and other species present, generating higher-order interactions. The important regulators of higher-order interactions are unstudied. To address this, we applied a high-dimensional geometry approach that quantifies epistasis in a fitness landscape to ask how individual genes and species influence the interactions in the rest of the biological network. We then generated and also reanalyzed 5-dimensional datasets (two genetic, two microbiome). We identified key genes (e.g., the rbs locus and pykF) and species (e.g., Lactobacilli) that control the interactions of many other genes and species. These higher-order master regulators can induce or suppress evolutionary and ecological diversification by controlling the topography of the fitness landscape. Thus, we provide a method and mathematical justification for exploration of biological networks in higher dimensions.
Affiliation(s)
- Holger Eble
  - Chair of Discrete Mathematics/Geometry, Technical University Berlin, Berlin 10623, Germany
- Michael Joswig
  - Chair of Discrete Mathematics/Geometry, Technical University Berlin, Berlin 10623, Germany
  - Max Planck Institute for Mathematics in the Sciences, Leipzig 04103, Germany
- Lisa Lamberti
  - Department of Biosystems Science and Engineering, Federal Institute of Technology (ETH Zürich), Basel 4058, Switzerland
  - Swiss Institute of Bioinformatics, Basel 4058, Switzerland
- William B. Ludington
  - Department of Biosphere Sciences and Engineering, Carnegie Institution for Science, Baltimore, MD 21218
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218
|
44
|
Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023; 12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enable the design of variable length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
Affiliation(s)
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Andrew L Ferguson
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
|
45
|
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) in a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, model predictions and develop a user-friendly website that facilitates data access and analysis.
Affiliation(s)
- Ada Shaw
- Applied Mathematics, Harvard University
- Mafalda Dias
- Centre for Genomic Regulation, Universitat Pompeu Fabra
- Yarin Gal
- Computer Science, University of Oxford
46
Notin P, Marks DS, Weitzman R, Gal Y. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. bioRxiv 2023:2023.12.06.570473. [PMID: 38106034 PMCID: PMC10723423 DOI: 10.1101/2023.12.06.570473] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/19/2023]
Abstract
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact that most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Affiliation(s)
- Yarin Gal
- Computer Science, University of Oxford
47
Li Y, Arcos S, Sabsay KR, te Velthuis AJW, Lauring AS. Deep mutational scanning reveals the functional constraints and evolutionary potential of the influenza A virus PB1 protein. J Virol 2023; 97:e0132923. [PMID: 37882522 PMCID: PMC10688322 DOI: 10.1128/jvi.01329-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 10/08/2023] [Indexed: 10/27/2023]
Abstract
IMPORTANCE The influenza virus polymerase is important for adaptation to new hosts and, as a determinant of mutation rate, for the process of adaptation itself. We performed a deep mutational scan of the polymerase basic 1 (PB1) protein to gain insights into the structural and functional constraints on the influenza RNA-dependent RNA polymerase. We find that PB1 is highly constrained at specific sites that are only moderately predicted by the global structure or larger domain. We identified a number of beneficial mutations, many of which have been shown to be functionally important or observed in influenza virus' natural evolution. Overall, our atlas of PB1 mutations and their fitness impacts serves as an important resource for future studies of influenza replication and evolution.
Affiliation(s)
- Yuan Li
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
- Sarah Arcos
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
- Kimberly R. Sabsay
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
- Lewis-Sigler Institute, Princeton University, Princeton, New Jersey, USA
- Adam S. Lauring
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, USA
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, USA
48
Yang L, Liang X, Zhang N, Lu L. STAR: A Web Server for Assisting Directed Protein Evolution with Machine Learning. ACS Omega 2023; 8:44751-44756. [PMID: 38046324 PMCID: PMC10688154 DOI: 10.1021/acsomega.3c04832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Revised: 10/10/2023] [Accepted: 10/12/2023] [Indexed: 12/05/2023]
Abstract
Protein engineering has made significant contributions to industries such as agriculture, food, and pharmaceuticals. In recent years, directed evolution combined with artificial intelligence has emerged as a cutting-edge R&D approach. However, the application of machine learning techniques can be challenging for those without relevant experience and coding skills. To address this issue, we have developed a web-based protein sequence recommendation system: STAR (Sequence recommendaTion via ARtificial intelligence). Our system utilizes Bayesian optimization as its backbone and includes a filtering step using a regression model to enhance the success rate of recommended sequences. Additionally, we have incorporated an in silico-directed evolution approach to expand the exploration of the protein space. The Web site can be accessed at https://www.FindProteinStar.com/.
Affiliation(s)
- Likun Yang
- Asymchem Life Science (Tianjin) Co., Ltd, Tianjin 300457, P. R. China
- Xiaoli Liang
- Asymchem Life Science (Tianjin) Co., Ltd, Tianjin 300457, P. R. China
- Na Zhang
- Asymchem Life Science (Tianjin) Co., Ltd, Tianjin 300457, P. R. China
- Lu Lu
- Asymchem Life Science (Tianjin) Co., Ltd, Tianjin 300457, P. R. China
49
Papkou A, Garcia-Pastor L, Escudero JA, Wagner A. A rugged yet easily navigable fitness landscape. Science 2023; 382:eadh3860. [PMID: 37995212 DOI: 10.1126/science.adh3860] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 03/01/2023] [Accepted: 09/29/2023] [Indexed: 11/25/2023]
Abstract
Fitness landscape theory predicts that rugged landscapes with multiple peaks impair Darwinian evolution, but experimental evidence is limited. In this study, we used genome editing to map the fitness of >260,000 genotypes of the key metabolic enzyme dihydrofolate reductase in the presence of the antibiotic trimethoprim, which targets this enzyme. The resulting landscape is highly rugged and harbors 514 fitness peaks. However, its highest peaks are accessible to evolving populations via abundant fitness-increasing paths. Different peaks share large basins of attraction that render the outcome of adaptive evolution highly contingent on chance events. Our work shows that ruggedness need not be an obstacle to Darwinian evolution but can reduce its predictability. If true in general, the complexity of optimization problems on realistic landscapes may require reappraisal.
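To make the notion of a "fitness peak" concrete, the sketch below counts local peaks on a toy landscape. It is an illustration only, not the authors' pipeline: the nucleotide alphabet, the helper names `neighbors` and `find_peaks`, the strict-inequality peak definition, and all fitness values are assumptions chosen for the example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed alphabet for this toy example


def neighbors(genotype, alphabet=ALPHABET):
    """Yield every genotype that differs from `genotype` at exactly one position."""
    for i, current in enumerate(genotype):
        for letter in alphabet:
            if letter != current:
                yield genotype[:i] + letter + genotype[i + 1:]


def find_peaks(fitness):
    """Return genotypes that are strictly fitter than all measured one-mutant neighbors."""
    peaks = []
    for genotype, value in fitness.items():
        if all(value > fitness[n] for n in neighbors(genotype) if n in fitness):
            peaks.append(genotype)
    return sorted(peaks)


# Hypothetical two-site landscape: a flat background with two raised genotypes
# and one intermediate genotype that lies between them.
toy = {a + b: 0.1 for a, b in product(ALPHABET, repeat=2)}
toy["AA"] = 1.0
toy["CC"] = 0.8
toy["AC"] = 0.5

print(find_peaks(toy))
```

Under this definition a genotype on a flat plateau is not counted as a peak, because none of its neighbors is strictly less fit; other definitions handle ties differently.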
Affiliation(s)
- Andrei Papkou
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Lucia Garcia-Pastor
- Departamento de Sanidad Animal and VISAVET Health Surveillance Centre, Universidad Complutense de Madrid, Madrid, Spain
- José Antonio Escudero
- Departamento de Sanidad Animal and VISAVET Health Surveillance Centre, Universidad Complutense de Madrid, Madrid, Spain
- Andreas Wagner
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- The Santa Fe Institute, Santa Fe, NM, USA
50
Nisonoff H, Wang Y, Listgarten J. Coherent Blending of Biophysics-Based Knowledge with Bayesian Neural Networks for Robust Protein Property Prediction. ACS Synth Biol 2023; 12:3242-3251. [PMID: 37888887 DOI: 10.1021/acssynbio.3c00217] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 10/28/2023]
Abstract
Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall performance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty (uncertainty arising from a lack of training data rather than sensor noise) is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.
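The qualitative behavior described in this abstract (lean on the biophysical prior when the network is uncertain, and on the network when it is confident) can be illustrated with a simple precision-weighted average of two Gaussian estimates. This is a swapped-in textbook technique used purely for illustration and is not the method developed in the paper; the function name `blend` and all numeric values are hypothetical.

```python
def blend(prior_mean, prior_var, nn_mean, nn_var):
    """Combine two Gaussian estimates by weighting each with its precision (1 / variance)."""
    w_prior = 1.0 / prior_var
    w_nn = 1.0 / nn_var
    mean = (w_prior * prior_mean + w_nn * nn_mean) / (w_prior + w_nn)
    var = 1.0 / (w_prior + w_nn)
    return mean, var


# Hypothetical numbers: the biophysical model predicts 0.0, the network predicts 1.0.
far_mean, _ = blend(0.0, 1.0, 1.0, 100.0)   # network very uncertain (far from training data)
near_mean, _ = blend(0.0, 1.0, 1.0, 0.01)   # network very confident (close to training data)

print(far_mean, near_mean)
```

With a large network variance the blended mean stays close to the prior's prediction, and with a small network variance it moves close to the network's prediction.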
Affiliation(s)
- Hunter Nisonoff
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
- Yixin Wang
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109-1107, United States
- Jennifer Listgarten
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, California 94720-1776, United States