1
Paz M, Moratorio G. Deep mutational scanning and CRISPR-engineered viruses: tools for evolutionary and functional genomics studies. mSphere 2025; 10:e0050824. [PMID: 40272173 DOI: 10.1128/msphere.00508-24] [Indexed: 04/25/2025] [Open access]
Abstract
Recent advancements in synthetic biology and sequencing technologies have revolutionized the ability to manipulate viral genomes with unparalleled precision. This review focuses on two powerful methodologies, deep mutational scanning and CRISPR-based genome editing, that enable comprehensive mutagenesis and detailed functional characterization of viral proteins. These approaches have significantly deepened our understanding of the molecular determinants driving viral evolution and adaptation. Furthermore, we discuss how these advances provide transformative insights for future vaccine development and therapeutic strategies.
Affiliation(s)
- Mercedes Paz
  - Laboratory of Experimental Virus Evolution, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Molecular Virology Laboratory, Faculty of Sciences, University of the Republic, Montevideo, Uruguay
- Gonzalo Moratorio
  - Laboratory of Experimental Virus Evolution, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Molecular Virology Laboratory, Faculty of Sciences, University of the Republic, Montevideo, Uruguay
  - Center for Innovation in Epidemiological Surveillance, Institut Pasteur de Montevideo, Montevideo, Uruguay
2
Miller ST, Macdonald CB, Raman S. Understanding, inhibiting, and engineering membrane transporters with high-throughput mutational screens. Cell Chem Biol 2025; 32:529-541. [PMID: 40168989 DOI: 10.1016/j.chembiol.2025.03.003] [Received: 09/07/2024] [Revised: 01/20/2025] [Accepted: 03/10/2025] [Indexed: 04/03/2025]
Abstract
Promiscuous membrane transporters play vital roles across domains of life, mediating the uptake and efflux of structurally and chemically diverse substrates. Although many transporter structures have been solved, the fundamental rules of polyspecific transport remain inscrutable. In recent years, high-throughput genetic screens have solidified as powerful tools for comprehensive, unbiased measurements of variant function and hypothesis generation, but have had infrequent application and limited impact in the transporter field. In this primer, we describe the principles of high-throughput screening methods available for studying polyspecific transporters and comment on the necessity and potential of high-throughput methods for deciphering these transporters in particular. We present several screening approaches which could provide a fundamental understanding of the molecular basis of function and promiscuity in transporters. We further posit how this knowledge can be leveraged to design inhibitors that combat multidrug resistance and engineer transporters as needed tools for synthetic biology and biotechnology applications.
Affiliation(s)
- Silas T Miller
  - Cellular and Molecular Biology Graduate Program, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI 53706, USA
- Christian B Macdonald
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Srivatsan Raman
  - DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
3
Martí-Gómez C, Zhou J, Chen WC, Kinney JB, McCandlish DM. Inference and visualization of complex genotype-phenotype maps with gpmap-tools. bioRxiv 2025:2025.03.09.642267. [PMID: 40161830 PMCID: PMC11952336 DOI: 10.1101/2025.03.09.642267] [Indexed: 04/02/2025]
Abstract
Multiplex assays of variant effect (MAVEs) allow the functional characterization of an unprecedented number of sequence variants in both gene regulatory regions and protein coding sequences. This has enabled the study of nearly complete combinatorial libraries of mutational variants and revealed the widespread influence of higher-order genetic interactions that arise when multiple mutations are combined. However, the lack of appropriate tools for exploratory analysis of this high-dimensional data limits our overall understanding of the main qualitative properties of complex genotype-phenotype maps. To fill this gap, we have developed gpmap-tools (https://github.com/cmarti/gpmap-tools), a python library that integrates Gaussian process models for inference, phenotypic imputation, and error estimation from incomplete and noisy MAVE data and collections of natural sequences, together with methods for summarizing patterns of higher-order epistasis and non-linear dimensionality reduction techniques that allow visualization of genotype-phenotype maps containing up to millions of genotypes. Here, we used gpmap-tools to study the genotype-phenotype map of the Shine-Dalgarno sequence, a motif that modulates binding of the 16S rRNA to the 5' untranslated region (UTR) of mRNAs through base pair complementarity during translation initiation in prokaryotes. We inferred full combinatorial landscapes containing 262,144 different sequences from the sequences of 5,311 5'UTRs in the E. coli genome and from experimental MAVE data. Visualizations of the inferred landscapes were largely consistent with each other, and unveiled a simple molecular mechanism underlying the highly epistatic genotype-phenotype map of the Shine-Dalgarno sequence.
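As a quick check on the landscape size quoted above: 262,144 equals 4^9, which is the size of a complete combinatorial library over nine nucleotide positions. The abstract does not state the sequence length, so the nine-position reading is an inference from the arithmetic, and the function name below is hypothetical (it is not part of gpmap-tools). A minimal Python sketch:

```python
from itertools import product

# RNA alphabet; assumed here because the motif lies in an mRNA 5' UTR.
NUCLEOTIDES = "ACGU"


def enumerate_landscape(length):
    """Yield every sequence of the given length over the four nucleotides."""
    for letters in product(NUCLEOTIDES, repeat=length):
        yield "".join(letters)


# Assumption: nine variable positions, inferred from 4**9 == 262,144.
n_genotypes = sum(1 for _ in enumerate_landscape(9))
print(n_genotypes)  # 262144
```

Enumerating the genotypes lazily keeps memory use flat, which matters once the number of positions grows beyond this toy case.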
Affiliation(s)
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, FL, 32611
- Wei-Chia Chen
  - Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, Republic of China
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
4
Cui Q. Machine learning in molecular biophysics: Protein allostery, multi-level free energy simulations, and lipid phase transitions. Biophysics Reviews 2025; 6:011305. [PMID: 39957913 PMCID: PMC11825181 DOI: 10.1063/5.0248589] [Received: 11/12/2024] [Accepted: 01/14/2025] [Indexed: 02/18/2025]
Abstract
Machine learning (ML) techniques have been making major impacts on all areas of science and engineering, including biophysics. In this review, we discuss several applications of ML to biophysical problems based on our recent research. The topics include the use of ML techniques to identify hotspot residues in allosteric proteins using deep mutational scanning data and to analyze how mutations of these hotspots perturb co-operativity in the framework of a statistical thermodynamic model, to improve the accuracy of free energy simulations by integrating data from different levels of potential energy functions, and to determine the phase transition temperature of lipid membranes. Through these examples, we illustrate the unique value of ML in extracting patterns or parameters from complex data sets, as well as the remaining limitations. By implementing the ML approaches in the context of physically motivated models or computational frameworks, we are able to gain a deeper mechanistic understanding or better convergence in numerical simulations. We conclude by briefly discussing how the introduced models can be further expanded to tackle more complex problems.
Affiliation(s)
- Qiang Cui
- Author to whom correspondence should be addressed:
5
Ozkan S, Padilla N, de la Cruz X. QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning. Hum Genet 2025; 144:191-208. [PMID: 39048855 PMCID: PMC11976337 DOI: 10.1007/s00439-024-02692-z] [Received: 04/30/2024] [Accepted: 07/14/2024] [Indexed: 07/27/2024]
Abstract
Next-generation sequencing (NGS) has revolutionized genetic diagnostics, yet its application in precision medicine remains incomplete, despite significant advances in computational tools for variant annotation. Many variants remain unannotated, and existing tools often fail to accurately predict the range of impacts that variants have on protein function. This limitation restricts their utility in relevant applications such as predicting disease severity and onset age. In response to these challenges, a new generation of computational models is emerging, aimed at producing quantitative predictions of genetic variant impacts. However, the field is still in its early stages, and several issues need to be addressed, including improved performance and better interpretability. This study introduces QAFI, a novel methodology that integrates protein-specific regression models within an ensemble learning framework, utilizing conservation-based and structure-related features derived from AlphaFold models. Our findings indicate that QAFI significantly enhances the accuracy of quantitative predictions across various proteins. The approach has been rigorously validated through its application in the CAGI6 contest, focusing on ARSA protein variants, and further tested on a comprehensive set of clinically labeled variants, demonstrating its generalizability and robust predictive power. The straightforward nature of our models may also contribute to better interpretability of the results.
Affiliation(s)
- Selen Ozkan
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
  - Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
6
Boorla VS, Maranas CD. CatPred: a comprehensive framework for deep learning in vitro enzyme kinetic parameters. Nat Commun 2025; 16:2072. [PMID: 40021618 PMCID: PMC11871309 DOI: 10.1038/s41467-025-57215-9] [Received: 03/25/2024] [Accepted: 02/14/2025] [Indexed: 03/03/2025] [Open access]
Abstract
Estimation of enzymatic activities still heavily relies on experimental assays, which can be cost and time-intensive. We present CatPred, a deep learning framework for predicting in vitro enzyme kinetic parameters, including turnover numbers (kcat), Michaelis constants (Km), and inhibition constants (Ki). CatPred addresses key challenges such as the lack of standardized datasets, performance evaluation on enzyme sequences that are dissimilar to those used during training, and model uncertainty quantification. We explore diverse learning architectures and feature representations, including pretrained protein language models and three-dimensional structural features, to enable robust predictions. CatPred provides accurate predictions with query-specific uncertainty estimates, with lower predicted variances correlating with higher accuracy. Pretrained protein language model features particularly enhance performance on out-of-distribution samples. CatPred also introduces benchmark datasets with extensive coverage (~23 k, 41 k, and 12 k data points for kcat, Km, and Ki respectively). Our framework performs competitively with existing methods while offering reliable uncertainty quantification.
Grants
- This material is based upon work supported by the Center for Bioenergy Innovation (CBI), U.S. Department of Energy, Office of Science, Biological and Environmental Research Program under Award Number ERKP886. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the U.S. Department of Energy. This work was also supported by the U.S. National Science Foundation funded Molecule Maker Lab Institute (MMLI), award number 2019897 supported by National AI Research Institutes Program of the Directorate for Computer and Information Science and Engineering (CISE), in collaboration with the Division of Chemistry (CHE) and the Division of Chemical, Bioengineering, and Environmental Transport Systems (CBET) awarded to CDM. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Affiliation(s)
- Veda Sheersh Boorla
  - Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
  - The Center for Bioenergy Innovation, Oak Ridge, TN, 37830, USA
- Costas D Maranas
  - Department of Chemical Engineering, The Pennsylvania State University, University Park, PA, 16802, USA
  - The Center for Bioenergy Innovation, Oak Ridge, TN, 37830, USA
7
Cui Q. Identification and understanding of allostery hotspots in proteins: Integration of deep mutational scanning and multi-faceted computational analyses. J Mol Biol 2025:168998. [PMID: 39952349 DOI: 10.1016/j.jmb.2025.168998] [Received: 11/24/2024] [Revised: 01/19/2025] [Accepted: 02/08/2025] [Indexed: 02/17/2025]
Abstract
Motivated by recent deep mutational scanning (DMS) experiments, we have carried out a diverse set of computations to better understand the distribution and contributions of allostery hotspot residues in a transcription factor, TetR. These include extensive atomistic simulations and free energy computations for different functional states of TetR, machine learning analysis of the DMS data and a statistical thermodynamic model for the experimental induction data for the WT protein and a handful of hotspot mutants. Collectively, these computations provided insights into the structural and energetic basis of allostery in TetR, and the distinct contributions of allostery hotspots. The results highlight that the allostery function (i.e., the induction activity) of TetR can be modulated by perturbing both inter-domain coupling and intra-domain properties, such as the population of the binding-competent conformation of each domain. This mechanistic degeneracy qualitatively explains the broad distribution of allostery hotspots across the protein structure observed in the DMS experiments, and also informs the design of strategies aimed at identifying allostery hotspots. The mechanistic framework and the multi-faceted computational approaches are expected to be applicable to the analysis of other allostery systems, especially those sharing the similar two-domain structural topology, and to the design of allostery modulators.
Affiliation(s)
- Qiang Cui
  - Departments of Chemistry, Physics and Biomedical Engineering, Boston University, 590 Commonwealth Avenue, Boston 02215, MA, USA
8
Jiang K, Yan Z, Di Bernardo M, Sgrizzi SR, Villiger L, Kayabolen A, Kim BJ, Carscadden JK, Hiraizumi M, Nishimasu H, Gootenberg JS, Abudayyeh OO. Rapid in silico directed evolution by a protein language model with EVOLVEpro. Science 2025; 387:eadr6006. [PMID: 39571002 DOI: 10.1126/science.adr6006] [Received: 07/09/2024] [Accepted: 11/12/2024] [Indexed: 01/25/2025]
Abstract
Directed protein evolution is central to biomedical applications but faces challenges such as experimental complexity, inefficient multiproperty optimization, and local maxima traps. Although in silico methods that use protein language models (PLMs) can provide modeled fitness landscape guidance, they struggle to generalize across diverse protein families and map to protein activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of few-shot active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for artificial intelligence-guided protein engineering in biology and medicine.
Affiliation(s)
- Kaiyi Jiang
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
  - Department of Bioengineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Zhaoqing Yan
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- Matteo Di Bernardo
  - Whitehead Institute, Massachusetts Institute of Technology, Cambridge, MA, USA
- Samantha R Sgrizzi
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- Lukas Villiger
  - Department of Dermatology and Allergology, Kantonspital St. Gallen, St. Gallen, Switzerland
- Alisan Kayabolen
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- B J Kim
  - Koch Institute for Integrative Cancer Research at MIT, Massachusetts Institute of Technology, Cambridge, MA, USA
- Josephine K Carscadden
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- Masahiro Hiraizumi
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Hiroshi Nishimasu
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
  - Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo, Japan
  - Inamori Research Institute for Science, 620 Suiginya-cho, Shimogyo-ku, Kyoto, Japan
- Jonathan S Gootenberg
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
- Omar O Abudayyeh
  - Department of Medicine, Division of Engineering in Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Gene and Cell Therapy Institute, Mass General Brigham, Cambridge, MA, USA
  - Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
9
Gelman S, Johnson B, Freschlin C, Sharma A, D'Costa S, Peters J, Gitter A, Romero PA. Biophysics-based protein language models for protein engineering. bioRxiv 2025:2024.03.15.585128. [PMID: 38559182 PMCID: PMC10980077 DOI: 10.1101/2024.03.15.585128] [Indexed: 04/04/2024]
Abstract
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
Affiliation(s)
- Sam Gelman
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Bryce Johnson
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Arnav Sharma
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Sameer D'Costa
  - Department of Biochemistry, University of Wisconsin-Madison
- John Peters
  - Morgridge Institute for Research
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Anthony Gitter
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison
  - Department of Biomedical Engineering, Duke University
10
Barozi V, Chakraborty S, Govender S, Morgan E, Ramahala R, Graham SC, Bishop NT, Tastan Bishop Ö. Revealing SARS-CoV-2 M pro mutation cold and hot spots: Dynamic residue network analysis meets machine learning. Comput Struct Biotechnol J 2024; 23:3800-3816. [PMID: 39525081 PMCID: PMC11550722 DOI: 10.1016/j.csbj.2024.10.031] [Received: 08/15/2024] [Revised: 10/19/2024] [Accepted: 10/19/2024] [Indexed: 11/16/2024] [Open access]
Abstract
Deciphering the effect of evolutionary mutations of viruses and predicting future mutations is crucial for designing long-lasting and effective drugs. While understanding the impact of current mutations on protein drug targets is feasible, predicting future mutations due to natural evolution of viruses and environmental pressures remains challenging. Here, we leveraged existing mutation data during the evolution of the SARS-CoV-2 protein drug target main protease (Mpro) to test the predictive power of dynamic residue network (DRN) analysis in identifying mutation cold and hot spots. We conducted molecular dynamics simulations on the Mpro of SARS-CoV-2 (Wuhan strain) and calculated eight DRN metrics (averaged BC, CC, DC, EC, ECC, KC, L, PR), each of which identifies a unique network feature within the protein. The sets of residues with the highest and lowest values for each metric, comprising potential cold and hot spots, were compared to published biochemical analyses and per residue mutation frequencies observed across five SARS-CoV-2 lineages, encompassing a total of 191,878 sequences. Individual DRN metrics displayed only modest power to predict the mutation frequency of individual residues. However, integrating the eight DRN metrics with additional structural and sequence-derived metrics allowed us to develop machine learning models which significantly improved the prediction of residue mutation frequency. While further refinements should enhance accuracy, we demonstrated a robust method to understand pathogen evolution. This approach can also guide the development of long-lasting drugs by targeting functional residues located in and near active site, and allosteric sites, that are less prone to mutations.
Affiliation(s)
- Victor Barozi
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Shrestha Chakraborty
  - Division of Virology, Department of Pathology, University of Cambridge, Cambridge CB2 1QP, UK
- Shaylyn Govender
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Emily Morgan
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Rabelani Ramahala
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Stephen C. Graham
  - Division of Virology, Department of Pathology, University of Cambridge, Cambridge CB2 1QP, UK
- Nigel T. Bishop
  - Department of Pure and Applied Mathematics, Rhodes University, Makhanda 6139, South Africa
  - National Institute for Theoretical and Computational Sciences (NITheCS), South Africa
- Özlem Tastan Bishop
  - Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
  - National Institute for Theoretical and Computational Sciences (NITheCS), South Africa
11
Wang H, Ren Z, Sun J, Chen Y, Bo X, Xue J, Gao J, Ni M. DeepPFP: a multi-task-aware architecture for protein function prediction. Brief Bioinform 2024; 26:bbae579. [PMID: 39905954 PMCID: PMC11794456 DOI: 10.1093/bib/bbae579] [Received: 04/24/2024] [Revised: 09/14/2024] [Accepted: 01/31/2025] [Indexed: 02/06/2025] [Open access]
Abstract
Deriving protein function from protein sequences poses a significant challenge due to the intricate relationship between sequence and function. Deep learning has made remarkable strides in predicting sequence-function relationships. However, models tailored for specific tasks or protein types encounter difficulties when using transfer learning across domains. This is attributed to the fact that protein function relies heavily on structural characteristics rather than mere sequence information. Consequently, there is a pressing need for a model capable of capturing shared features among diverse sequence-function mapping tasks to address the generalization issue. In this study, we explore the potential of Model-Agnostic Meta-Learning combined with a protein language model called Evolutionary Scale Modeling to tackle this challenge. Our approach involves training the architecture on five out-domain deep mutational scanning (DMS) datasets and evaluating its performance across four key dimensions. Our findings demonstrate that the proposed architecture exhibits satisfactory performance in terms of generalization and employs an effective few-shot learning strategy. Specifically, compared to the best results, the Pearson's correlation coefficient (PCC) in the final stage increased by ~0.31%. Furthermore, we leverage the trained architecture to predict binding affinity scores of the DMS dataset of SARS-CoV-2 using transfer learning. Notably, training on a subset of the Ube4b dataset with 500 samples resulted in an improvement of 0.11 in the PCC. These results underscore the potential of our conceptual architecture as a promising methodology for multi-task protein function prediction.
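The abstract above reports its results as changes in Pearson's correlation coefficient (PCC) between predicted and measured scores. For reference, a minimal plain-Python sketch of that statistic follows; the function name and the toy numbers are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when either input is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy example: hypothetical measured DMS scores vs. model predictions.
measured = [0.1, 0.4, 0.35, 0.8, 0.9]
predicted = [0.2, 0.3, 0.5, 0.7, 1.0]
print(round(pearson_r(measured, predicted), 3))
```

Under this definition, a reported PCC gain such as 0.11 would be the difference between two coefficients computed on the same evaluation set, one per model.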
Affiliation(s)
- Han Wang
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Zilin Ren
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Jinghong Sun
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Yongbing Chen
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Xiaochen Bo
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
- JiGuo Xue
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
- Jingyang Gao
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Ming Ni
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
| |
|
12
|
Nishikawa KK, Chen J, Acheson JF, Harbaugh SV, Huss P, Frenkel M, Novy N, Sieren HR, Lodewyk EC, Lee DH, Chávez JL, Fox BG, Raman S. Highly multiplexed design of an allosteric transcription factor to sense new ligands. Nat Commun 2024; 15:10001. [PMID: 39562775 PMCID: PMC11577015 DOI: 10.1038/s41467-024-54260-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Accepted: 11/05/2024] [Indexed: 11/21/2024] Open
Abstract
Allosteric transcription factors (aTF) regulate gene expression through conformational changes induced by small molecule binding. Although widely used as biosensors, aTFs have proven challenging to design for detecting new molecules because mutation of ligand-binding residues often disrupts allostery. Here, we develop Sensor-seq, a high-throughput platform to design and identify aTF biosensors that bind to non-native ligands. We screen a library of 17,737 variants of the aTF TtgR, a regulator of a multidrug exporter, against six non-native ligands of diverse chemical structures - four derivatives of the cancer therapeutic tamoxifen, the antimalarial drug quinine, and the opiate analog naltrexone - as well as two native flavonoid ligands, naringenin and phloretin. Sensor-seq identifies biosensors for each of these ligands with high dynamic range and diverse specificity profiles. The structure of a naltrexone-bound design shows shape-complementary methionine-aromatic interactions driving ligand specificity. To demonstrate practical utility, we develop cell-free detection systems for naltrexone and quinine. Sensor-seq enables rapid and scalable design of new biosensors, overcoming constraints of natural biosensors.
Affiliation(s)
- Kyle K Nishikawa
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Jackie Chen
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Justin F Acheson
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Svetlana V Harbaugh
- 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
| | - Phil Huss
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Max Frenkel
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Nathan Novy
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Hailey R Sieren
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
| | - Ella C Lodewyk
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
| | - Daniel H Lee
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
| | - Jorge L Chávez
- 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
| | - Brian G Fox
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
| | - Srivatsan Raman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA.
- Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA.
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA.
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA.
| |
|
13
|
Sokirniy I, Inam H, Tomaszkiewicz M, Reynolds J, McCandlish D, Pritchard J. A side-by-side comparison of variant function measurements using deep mutational scanning and base editing. bioRxiv 2024:2024.06.30.601444. [PMID: 39005366 PMCID: PMC11244880 DOI: 10.1101/2024.06.30.601444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Variant annotation is a crucial objective in mammalian functional genomics. Deep Mutational Scanning (DMS) is a well-established method for annotating human gene variants, but CRISPR base editing (BE) is emerging as an alternative. However, questions remain about how well high-throughput base editing measurements can annotate variant function and the extent of downstream experimental validation required. This study presents the first direct comparison of DMS and BE in the same lab and cell line. Results indicate that focusing on the most likely edits and highest efficiency sgRNAs enhances the agreement between a "gold standard" DMS dataset and a BE screen. A simple filter for sgRNAs making single edits in their window could sufficiently annotate a large proportion of variants directly from sgRNA sequencing of large pools. When multi-edit guides are unavoidable, directly measuring the variants created in the pool, rather than sgRNA abundance, can recover high-quality variant annotation measurements in multiplexed pools. Taken together, our data show a surprising degree of correlation between base editor data and gold standard deep mutational scanning.
Affiliation(s)
- Ivan Sokirniy
- Huck Institute for the Life Sciences, University Park, PA 16802
| | - Haider Inam
- Huck Institute for the Life Sciences, University Park, PA 16802
- Department of Biomedical Engineering, University Park, PA 16802
| | - Marta Tomaszkiewicz
- Huck Institute for the Life Sciences, University Park, PA 16802
- Department of Biomedical Engineering, University Park, PA 16802
| | - Joshua Reynolds
- Huck Institute for the Life Sciences, University Park, PA 16802
- Department of Biomedical Engineering, University Park, PA 16802
| | - David McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724
| | - Justin Pritchard
- Huck Institute for the Life Sciences, University Park, PA 16802
- Department of Biomedical Engineering, University Park, PA 16802
| |
|
14
|
Xie X, Gui L, Qiao B, Wang G, Huang S, Zhao Y, Sun S. Deep learning in template-free de novo biosynthetic pathway design of natural products. Brief Bioinform 2024; 25:bbae495. [PMID: 39373052 PMCID: PMC11456888 DOI: 10.1093/bib/bbae495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/12/2024] [Accepted: 09/20/2024] [Indexed: 10/08/2024] Open
Abstract
Natural products (NPs) are indispensable in drug development, particularly in combating infections, cancer, and neurodegenerative diseases. However, their limited availability poses significant challenges. Template-free de novo biosynthetic pathway design provides a strategic solution for NP production, with deep learning standing out as a powerful tool in this domain. This review delves into state-of-the-art deep learning algorithms in NP biosynthesis pathway design. It provides an in-depth discussion of databases like Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and UniProt, which are essential for model training, along with chemical databases such as Reaxys, SciFinder, and PubChem for transfer learning to expand models' understanding of the broader chemical space. It evaluates the potential and challenges of sequence-to-sequence and graph-to-graph translation models for accurate single-step prediction. Additionally, it discusses search algorithms for multistep prediction and deep learning algorithms for predicting enzyme function. The review also highlights the pivotal role of deep learning in improving catalytic efficiency through enzyme engineering, which is essential for enhancing NP production. Moreover, it examines the application of large language models in pathway design, enzyme discovery, and enzyme engineering. Finally, it addresses the challenges and prospects associated with template-free approaches, offering insights into potential advancements in NP biosynthesis pathway design.
Affiliation(s)
- Xueying Xie
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Lin Gui
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Baixue Qiao
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Shan Huang
- Department of Neurology, The Second Affiliated Hospital, Harbin Medical University, No. 246 Xuefu Road, Nangang District, Harbin 150081, China
| | - Yuming Zhao
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Shanwen Sun
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| |
|
15
|
Meiri R, Aharoni Lotati SL, Orenstein Y, Papo N. Deep neural networks for predicting the affinity landscape of protein-protein interactions. iScience 2024; 27:110772. [PMID: 39310756 PMCID: PMC11416218 DOI: 10.1016/j.isci.2024.110772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 06/27/2024] [Accepted: 08/15/2024] [Indexed: 09/25/2024] Open
Abstract
Studies determining protein-protein interactions (PPIs) by deep mutational scanning have focused mainly on a narrow range of affinities within complexes and thus include only partial coverage of the mutation space of given proteins. By inserting an affinity-reducing N-terminal alanine in the N-terminal domain of the tissue inhibitor of metalloproteinases-2 (N-TIMP2), we overcame the limitation of its narrow affinity range for matrix metalloproteinase 9 (MMP9CAT). We trained deep neural networks (DNNs) to quantitatively predict the binding affinity of unobserved wild-type variants and variants carrying an N-terminal alanine. Good correlation was obtained between predicted and observed log2 enrichment ratio (ER) values, which also correlated with the affinity of N-TIMP2 variants to MMP9CAT. Our ability to predict affinities of unobserved N-TIMP2 variants was confirmed on an independent dataset of experimentally validated N-TIMP2 proteins. This ability is of significant importance in the field of PPI prediction and for developing therapies targeting these interactions.
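The log2 enrichment ratio (ER) mentioned above is commonly computed from sequencing read counts before and after selection. The sketch below assumes that common definition (post-selection frequency divided by pre-selection frequency); the pseudocount, the function name, and the read counts are hypothetical, and the study's exact normalization is not stated in the abstract.

```python
import math

def log2_enrichment_ratios(pre_counts, post_counts, pseudocount=0.5):
    """Return {variant: log2(post-selection frequency / pre-selection frequency)}."""
    variants = sorted(set(pre_counts) | set(post_counts))
    # The pseudocount (an assumption) keeps variants with zero reads finite.
    pre_total = sum(pre_counts.values()) + pseudocount * len(variants)
    post_total = sum(post_counts.values()) + pseudocount * len(variants)
    ratios = {}
    for variant in variants:
        pre_freq = (pre_counts.get(variant, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(variant, 0) + pseudocount) / post_total
        ratios[variant] = math.log2(post_freq / pre_freq)
    return ratios

# Hypothetical read counts, for illustration only.
pre_selection = {"WT": 1000, "A1G": 800, "S4K": 200}
post_selection = {"WT": 1200, "A1G": 300, "S4K": 500}
log2_er = log2_enrichment_ratios(pre_selection, post_selection)
```

Positive values indicate variants enriched by the selection and negative values indicate depleted ones. Setting `pseudocount=0` gives the raw ratio but fails for variants with zero reads in either pool.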
Affiliation(s)
- Reut Meiri
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Shay-Lee Aharoni Lotati
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
| | - Niv Papo
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| |
|
16
|
Hilvert D. Spiers Memorial Lecture: Engineering biocatalysts. Faraday Discuss 2024; 252:9-28. [PMID: 39046423 PMCID: PMC11389855 DOI: 10.1039/d4fd00139g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Accepted: 06/26/2024] [Indexed: 07/25/2024]
Abstract
Enzymes are being engineered to catalyze chemical reactions for many practical applications in chemistry and biotechnology. The approaches used are surveyed in this short review, emphasizing methods for accessing reactivities not expressed by native protein scaffolds. The successful generation of completely de novo enzymes that rival the rates and selectivities of their natural counterparts highlights the potential role that designer enzymes may play in the coming years in research, industry, and medicine. Some challenges that need to be addressed to realize this ambitious dream are considered together with possible solutions.
Affiliation(s)
- Donald Hilvert
- Laboratory of Organic Chemistry, ETH Zürich, 8093 Zürich, Switzerland.
| |
|
17
|
Cheng P, Mao C, Tang J, Yang S, Cheng Y, Wang W, Gu Q, Han W, Chen H, Li S, Chen Y, Zhou J, Li W, Pan A, Zhao S, Huang X, Zhu S, Zhang J, Shu W, Wang S. Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering. Cell Res 2024; 34:630-647. [PMID: 38969803 PMCID: PMC11369238 DOI: 10.1038/s41422-024-00989-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Accepted: 06/03/2024] [Indexed: 07/07/2024] Open
Abstract
Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general and multiple sequence alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The gene-editing efficiency of a 5-site mutant of TnpB reaches up to 74.04% (vs 24.66% for the wild type); and the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic protein space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.
Affiliation(s)
- Peng Cheng
- Bioinformatics Center of AMMS, Beijing, China
| | - Cong Mao
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Jin Tang
- Zhejiang Lab, Hangzhou, Zhejiang, China
| | - Sen Yang
- Bioinformatics Center of AMMS, Beijing, China
| | - Yu Cheng
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Wuke Wang
- Zhejiang Lab, Hangzhou, Zhejiang, China
| | - Qiuxi Gu
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Wei Han
- Zhejiang Lab, Hangzhou, Zhejiang, China
| | - Hao Chen
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Sihan Li
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China
| | | | | | - Wuju Li
- Bioinformatics Center of AMMS, Beijing, China
| | - Aimin Pan
- Zhejiang Lab, Hangzhou, Zhejiang, China
| | - Suwen Zhao
- iHuman Institute, ShanghaiTech University, Shanghai, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xingxu Huang
- Zhejiang Lab, Hangzhou, Zhejiang, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai, China
| | | | - Jun Zhang
- State Key Laboratory of Reproductive Medicine and Offspring Health, Women's Hospital of Nanjing Medical University, Nanjing Maternity and Child Health Care Hospital, Nanjing Medical University, Nanjing, Jiangsu, China.
| | - Wenjie Shu
- Bioinformatics Center of AMMS, Beijing, China.
| | | |
|
18
|
Li SS, Liu ZM, Li J, Ma YB, Dong ZY, Hou JW, Shen FJ, Wang WB, Li QM, Su JG. Prediction of mutation-induced protein stability changes based on the geometric representations learned by a self-supervised method. BMC Bioinformatics 2024; 25:282. [PMID: 39198740 PMCID: PMC11360314 DOI: 10.1186/s12859-024-05876-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Accepted: 07/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Thermostability is a fundamental property of proteins to maintain their biological functions. Predicting protein stability changes upon mutation is important for understanding protein structure-function relationships, and is also of great interest in protein engineering and pharmaceutical design. RESULTS Here we present mutDDG-SSM, a deep learning-based framework that uses the geometric representations encoded in protein structure to predict mutation-induced protein stability changes. mutDDG-SSM consists of two parts: a graph attention network-based protein structural feature extractor that is trained with a self-supervised learning scheme using large-scale high-resolution protein structures, and an eXtreme Gradient Boosting model-based stability change predictor with the advantage of alleviating the overfitting problem. The performance of mutDDG-SSM was tested on several widely used independent datasets. Then, myoglobin and p53 were used as case studies to illustrate the effectiveness of the model in predicting protein stability changes upon mutation. Our results show that mutDDG-SSM achieved high performance in estimating the effects of mutations on protein stability. In addition, mutDDG-SSM exhibited good unbiasedness, in that the prediction accuracy on the inverse mutations is as good as that on the direct mutations. CONCLUSION Meaningful features can be extracted from our pre-trained model to build downstream tasks and our model may serve as a valuable tool for protein engineering and drug design.
Affiliation(s)
- Shan Shan Li
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China
- National Engineering Center for New Vaccine Research, Beijing, China
| | - Zhao Ming Liu
- National Engineering Center for New Vaccine Research, Beijing, China
- The Sixth Laboratory, National Vaccine and Serum Institute (NVSI), Beijing, China
| | - Jiao Li
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China
- National Engineering Center for New Vaccine Research, Beijing, China
| | - Yi Bo Ma
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China
- National Engineering Center for New Vaccine Research, Beijing, China
| | - Ze Yuan Dong
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China
- National Engineering Center for New Vaccine Research, Beijing, China
| | - Jun Wei Hou
- National Engineering Center for New Vaccine Research, Beijing, China
- The Sixth Laboratory, National Vaccine and Serum Institute (NVSI), Beijing, China
| | - Fu Jie Shen
- National Engineering Center for New Vaccine Research, Beijing, China
- The Sixth Laboratory, National Vaccine and Serum Institute (NVSI), Beijing, China
| | - Wei Bu Wang
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China
- National Engineering Center for New Vaccine Research, Beijing, China
| | - Qi Ming Li
- National Engineering Center for New Vaccine Research, Beijing, China.
- The Sixth Laboratory, National Vaccine and Serum Institute (NVSI), Beijing, China.
| | - Ji Guo Su
- High Performance Computing Center, National Vaccine and Serum Institute (NVSI), Beijing, China.
- National Engineering Center for New Vaccine Research, Beijing, China.
| |
|
19
|
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024] Open
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present a method that utilizes statistics-informed graph networks to predict a protein's functions solely from its sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
- Department of Biochemistry, University of Oxford, Oxford, UK.
- AmoAi Technologies, Oxford, UK.
| | - Qi-Qi Qin
- AmoAi Technologies, Oxford, UK
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
| | - Si-Yu Huang
- AmoAi Technologies, Oxford, UK
- Oxford Martin School, University of Oxford, Oxford, UK
- School of Systems Science, Beijing Normal University, Beijing, China
| | | | - Xue-Ming Ding
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
| | - Benoît Kornmann
- Department of Biochemistry, University of Oxford, Oxford, UK.
| |
|
20
|
Freschlin CR, Fahlberg SA, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. Nat Commun 2024; 15:6405. [PMID: 39080282 PMCID: PMC11289474 DOI: 10.1038/s41467-024-50712-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 07/13/2024] [Indexed: 08/02/2024] Open
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
Affiliation(s)
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA.
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA.
| |
|
21
|
Ding K, Chin M, Zhao Y, Huang W, Mai BK, Wang H, Liu P, Yang Y, Luo Y. Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nat Commun 2024; 15:6392. [PMID: 39080249 PMCID: PMC11289365 DOI: 10.1038/s41467-024-50698-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Accepted: 07/19/2024] [Indexed: 08/02/2024] Open
Abstract
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
Affiliation(s)
- Kerr Ding
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
| | - Michael Chin
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Yunlong Zhao
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Wei Huang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Binh Khanh Mai
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA
| | - Huanan Wang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
| | - Peng Liu
- Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA.
| | - Yang Yang
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA.
- Biomolecular Science and Engineering (BMSE) Program, University of California, Santa Barbara, CA, 93106, USA.
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
| |
|
22
|
Vornholt T, Mutný M, Schmidt GW, Schellhaas C, Tachibana R, Panke S, Ward TR, Krause A, Jeschek M. Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning. ACS Cent Sci 2024; 10:1357-1370. [PMID: 39071060 PMCID: PMC11273458 DOI: 10.1021/acscentsci.4c00258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 04/22/2024] [Accepted: 05/02/2024] [Indexed: 07/30/2024]
Abstract
Tailored enzymes are crucial for the transition to a sustainable bioeconomy. However, enzyme engineering is laborious and failure-prone due to its reliance on serendipity. The efficiency and success rates of engineering campaigns may be improved by applying machine learning to map the sequence-activity landscape based on small experimental data sets. Yet, it often proves challenging to reliably model large sequence spaces while keeping the experimental effort tractable. To address this challenge, we present an integrated pipeline combining large-scale screening with active machine learning, which we applied to engineer an artificial metalloenzyme (ArM) catalyzing a new-to-nature hydroamination reaction. Combining lab automation and next-generation sequencing, we acquired sequence-activity data for several thousand ArM variants. We then used Gaussian process regression to model the activity landscape and guide further screening rounds. Critical characteristics of our pipeline include the cost-effective generation of information-rich data sets, the integration of an explorative round to improve the model's performance, and the inclusion of experimental noise. Our approach led to an order-of-magnitude boost in the hit rate while making efficient use of experimental resources. Search strategies like this should find broad utility in enzyme engineering and accelerate the development of novel biocatalysts.
Affiliation(s)
- Tobias Vornholt
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
- Mojmír Mutný
  - Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
- Gregor W. Schmidt
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- Christian Schellhaas
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- Ryo Tachibana
  - Department of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
- Sven Panke
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
- Thomas R. Ward
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
  - Department of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
- Andreas Krause
  - Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
- Markus Jeschek
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - Institute of Microbiology, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
23
Zhou Z, Zhang L, Yu Y, Wu B, Li M, Hong L, Tan P. Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning. Nat Commun 2024; 15:5566. [PMID: 38956442 PMCID: PMC11219809 DOI: 10.1038/s41467-024-49798-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Accepted: 06/11/2024] [Indexed: 07/04/2024] Open
Abstract
Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
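The learning-to-rank idea mentioned in this abstract trains a model to order variants correctly instead of regressing exact fitness values. The abstract does not say which ranking loss FSFP uses, so the pairwise logistic loss below is only an assumed, generic stand-in; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(scores, fitness):
    """Mean logistic loss over all pairs (i, j) in which variant i is fitter than j."""
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    score_gap = scores[:, None] - scores[None, :]
    i_beats_j = fitness[:, None] > fitness[None, :]
    if not i_beats_j.any():
        return 0.0  # no ordered pairs, nothing to penalize
    # log(1 + exp(-gap)) is small when the fitter variant also gets the higher score
    return float(np.mean(np.logaddexp(0.0, -score_gap[i_beats_j])))

# Toy example: four labeled single-site mutants (made-up fitness values)
measured_fitness = np.array([0.1, 0.5, 0.9, 1.3])
loss_correct_order = pairwise_ranking_loss([0.0, 1.0, 2.0, 3.0], measured_fitness)
loss_reversed_order = pairwise_ranking_loss([3.0, 2.0, 1.0, 0.0], measured_fitness)
```

Because the loss depends only on score differences within pairs, it is insensitive to the absolute scale of the fitness labels, which is one reason ranking objectives are attractive when only tens of labeled mutants are available.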
Affiliation(s)
- Ziyi Zhou
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
- Liang Zhang
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yuanxi Yu
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Banghao Wu
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Mingchen Li
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Liang Hong
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
  - Zhang Jiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, 201203, China
- Pan Tan
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
24
Nestl BM, Nebel BA, Resch V, Schürmann M, Tischler D. The Development and Opportunities of Predictive Biotechnology. Chembiochem 2024; 25:e202300863. [PMID: 38713151 DOI: 10.1002/cbic.202300863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 04/05/2024] [Indexed: 05/08/2024]
Abstract
Recent advances in bioeconomy allow a holistic view of existing and new process chains and enable novel production routines continuously advanced by academia and industry. All this progress benefits from a growing number of prediction tools that have found their way into the field. For example, automated genome annotations, tools for building model structures of proteins, and structural protein prediction methods such as AlphaFold2™ or RoseTTAFold have gained popularity in recent years. Recently, it has become apparent that more and more AI-based tools are being developed and used for biocatalysis and biotechnology. This is an excellent opportunity for academia and industry to accelerate advancements in the field further. Biotechnology, as a rapidly growing interdisciplinary field, stands to benefit greatly from these developments.
Affiliation(s)
- Bettina M Nestl
  - Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
  - Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Bernd A Nebel
  - Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Verena Resch
  - Innophore GmbH, Am Eisernen Tor 3, 8010, Graz, Austria
- Martin Schürmann
  - Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
  - InnoSyn B. V., Urmonderbaan 22, 6167 RD, Geleen, The Netherlands
  - SynSilico B. V., Urmonderbaan 22, 6167 RD, Geleen, The Netherlands
- Dirk Tischler
  - Joint working group on biotransformations of the Association for General and Applied Microbiology VAAM, the Society for Chemical Engineering, Biotechnology DECHEMA, Theodor-Heuss-Allee 25, 60486, Frankfurt, Germany
  - Microbial Biotechnology, Ruhr University Bochum, Universitätsstrasse 150, 44780, Bochum, Germany
25
Wirnsberger G, Pritišanac I, Oberdorfer G, Gruber K. Flattening the curve - How to get better results with small deep-mutational-scanning datasets. Proteins 2024; 92:886-902. [PMID: 38501649 DOI: 10.1002/prot.26686] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 02/24/2024] [Accepted: 03/07/2024] [Indexed: 03/20/2024]
Abstract
Proteins are used in various biotechnological applications, often requiring the optimization of protein properties by introducing specific amino-acid exchanges. Deep mutational scanning (DMS) is an effective high-throughput method for evaluating the effects of these exchanges on protein function. DMS data can then inform the training of a neural network to predict the impact of mutations. Most approaches use some representation of the protein sequence for training and prediction. As proteins are characterized by complex structures and intricate residue interaction networks, directly providing structural information as input reduces the need to learn these features from the data. We introduce a method for encoding protein structures as stacked 2D contact maps, which capture residue interactions, their evolutionary conservation, and mutation-induced interaction changes. Furthermore, we explored techniques to augment neural network training performance on smaller DMS datasets. To validate our approach, we trained three neural network architectures originally used for image analysis on three DMS datasets, and we compared their performances with networks trained solely on protein sequences. The results confirm the effectiveness of the protein structure encoding in machine learning efforts on DMS data. Using structural representations as direct input to the networks, along with data augmentation and pretraining, significantly reduced demands on training data size and improved prediction performance, especially on smaller datasets, while performance on large datasets was on par with state-of-the-art sequence convolutional neural networks. The methods presented here have the potential to provide the same workflow as DMS without the experimental and financial burden of testing thousands of mutants. Additionally, we present an open-source, user-friendly software tool to make these data analysis techniques accessible, particularly to biotechnology and protein engineering researchers who wish to apply them to their mutagenesis data.
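The idea of encoding a structure as stacked 2D contact maps can be sketched roughly as follows. This is a simplified illustration, not the encoding used in the paper: the 8 Å C-alpha cutoff, the way a per-residue conservation score is turned into a second channel, and all names are assumptions.

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Binary residue-residue contact map from C-alpha coordinates (angstroms)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    contacts = (dist < cutoff).astype(np.float32)
    np.fill_diagonal(contacts, 0.0)  # ignore self-contacts
    return contacts

def stack_channels(contacts, per_residue_score):
    """Stack the contact map with a channel that weights each contact by a
    per-residue score (here the mean score of the two residues involved)."""
    pair_score = 0.5 * (per_residue_score[:, None] + per_residue_score[None, :])
    return np.stack([contacts, contacts * pair_score], axis=0)

# Toy example: five residues on a straight line, 3.8 angstroms apart
coords = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
cmap = contact_map(coords)
conservation = np.array([1.0, 0.5, 0.5, 0.2, 1.0])  # made-up conservation scores
features = stack_channels(cmap, conservation)       # shape: (channels, L, L)
```

The resulting array has the same layout as a multi-channel image, which is what allows image-analysis network architectures to be applied to it.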
Affiliation(s)
- Iva Pritišanac
  - Institute of Molecular Biology and Biochemistry, Medical University of Graz, Graz, Austria
  - BioTechMed-Graz, Graz, Austria
- Gustav Oberdorfer
  - BioTechMed-Graz, Graz, Austria
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
- Karl Gruber
  - Institute of Molecular Biosciences, University of Graz, Graz, Austria
  - BioTechMed-Graz, Graz, Austria
  - Field of Excellence BioHealth, University of Graz, Graz, Austria
26
Nishikawa KK, Chen J, Acheson JF, Harbaugh SV, Huss P, Frenkel M, Novy N, Sieren HR, Lodewyk EC, Lee DH, Chávez JL, Fox BG, Raman S. Highly multiplexed design of an allosteric transcription factor to sense novel ligands. bioRxiv 2024:2024.03.07.583947. [PMID: 38496486 PMCID: PMC10942455 DOI: 10.1101/2024.03.07.583947] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Allosteric transcription factors (aTF), widely used as biosensors, have proven challenging to design for detecting novel molecules because mutation of ligand-binding residues often disrupts allostery. We developed Sensor-seq, a high-throughput platform to design and identify aTF biosensors that bind to non-native ligands. We screened a library of 17,737 variants of the aTF TtgR, a regulator of a multidrug exporter, against six non-native ligands of diverse chemical structures - four derivatives of the cancer therapeutic tamoxifen, the antimalarial drug quinine, and the opiate analog naltrexone - as well as two native flavonoid ligands, naringenin and phloretin. Sensor-seq identified novel biosensors for each of these ligands with high dynamic range and diverse specificity profiles. The structure of a naltrexone-bound design showed shape-complementary methionine-aromatic interactions driving ligand specificity. To demonstrate practical utility, we developed cell-free detection systems for naltrexone and quinine. Sensor-seq enables rapid, scalable design of new biosensors, overcoming constraints of natural biosensors.
Affiliation(s)
- Kyle K Nishikawa
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Jackie Chen
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Justin F Acheson
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Svetlana V Harbaugh
  - 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
- Phil Huss
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Max Frenkel
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Nathan Novy
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Hailey R Sieren
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Ella C Lodewyk
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Daniel H Lee
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Jorge L Chávez
  - 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
- Brian G Fox
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
  - Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
- Srivatsan Raman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
  - Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
27
Goshisht MK. Machine Learning and Deep Learning in Synthetic Biology: Key Architectures, Applications, and Challenges. ACS Omega 2024; 9:9921-9945. [PMID: 38463314 PMCID: PMC10918679 DOI: 10.1021/acsomega.3c05913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 01/19/2024] [Accepted: 01/30/2024] [Indexed: 03/12/2024]
Abstract
Machine learning (ML), particularly deep learning (DL), has made rapid and substantial progress in synthetic biology in recent years. Biotechnological applications of biosystems, including pathways, enzymes, and whole cells, are being probed frequently with time. The intricacy and interconnectedness of biosystems make it challenging to design them with the desired properties. ML and DL have a synergy with synthetic biology. Synthetic biology can be employed to produce large data sets for training models (for instance, by utilizing DNA synthesis), and ML/DL models can be employed to inform design (for example, by generating new parts or advising unrivaled experiments to perform). This potential has recently been brought to light by research at the intersection of engineering biology and ML/DL through achievements like the design of novel biological components, best experimental design, automated analysis of microscopy data, protein structure prediction, and biomolecular implementations of ANNs (Artificial Neural Networks). I have divided this review into three sections. In the first section, I describe predictive potential and basics of ML along with myriad applications in synthetic biology, especially in engineering cells, activity of proteins, and metabolic pathways. In the second section, I describe fundamental DL architectures and their applications in synthetic biology. Finally, I describe different challenges causing hurdles in the progress of ML/DL and synthetic biology along with their solutions.
Affiliation(s)
- Manoj Kumar Goshisht
  - Department of Chemistry, Natural and Applied Sciences, University of Wisconsin-Green Bay, Green Bay, Wisconsin 54311-7001, United States
28
Martinusen SG, Denard CA. Leveraging yeast sequestration to study and engineer posttranslational modification enzymes. Biotechnol Bioeng 2024; 121:903-914. [PMID: 38079116 PMCID: PMC11229454 DOI: 10.1002/bit.28621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 11/04/2023] [Accepted: 11/27/2023] [Indexed: 02/20/2024]
Abstract
Enzymes that catalyze posttranslational modifications (PTMs) of peptides and proteins (PTM-enzymes), including proteases, protein ligases, oxidoreductases, kinases, and other transferases, are foundational to our understanding of health and disease and empower applications in chemical biology, synthetic biology, and biomedicine. To fully harness the potential of PTM-enzymes, there is a critical need to decipher their enzymatic and biological mechanisms, develop molecules that can probe and modulate them, and endow them with improved and novel functions. These objectives are contingent upon implementation of high-throughput functional screens and selections that interrogate large sequence libraries to isolate desired PTM-enzyme properties. This review discusses the principles of Saccharomyces cerevisiae organelle sequestration to study and engineer PTM-enzymes. These include outer membrane sequestration, specifically methods that modify yeast surface display, and cytoplasmic sequestration based on enzyme-mediated transcription activation. Furthermore, we present a detailed discussion of yeast endoplasmic reticulum sequestration for the first time. Where appropriate, we highlight the major features and limitations of different systems, specifically how they can measure and control enzyme catalytic efficiencies. Taken together, yeast-based high-throughput sequestration approaches significantly lower the barrier to understanding how PTM-enzymes function and how to reprogram them.
Affiliation(s)
- Samantha G Martinusen
  - Department of Chemical Engineering, University of Florida, Gainesville, Florida, USA
- Carl A Denard
  - Department of Chemical Engineering, University of Florida, Gainesville, Florida, USA
29
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
30
Zeng M, Sarker B, Rondthaler SN, Vu V, Andrews LB. Identifying LasR Quorum Sensors with Improved Signal Specificity by Mapping the Sequence-Function Landscape. ACS Synth Biol 2024; 13:568-589. [PMID: 38206199 DOI: 10.1021/acssynbio.3c00543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2024]
Abstract
Programmable intercellular signaling using components of naturally occurring quorum sensing can allow for coordinated functions to be engineered in microbial consortia. LuxR-type transcriptional regulators are widely used for this purpose and are activated by homoserine lactone (HSL) signals. However, they often suffer from imperfect molecular discrimination of structurally similar HSLs, causing misregulation within engineered consortia containing multiple HSL signals. Here, we studied one such example, the regulator LasR from Pseudomonas aeruginosa. We elucidated its sequence-function relationship for ligand specificity using targeted protein engineering and multiplexed high-throughput biosensor screening. A pooled combinatorial saturation mutagenesis library (9,486 LasR DNA sequences) was created by mutating six residues in LasR's β5 sheet with single, double, or triple amino acid substitutions. Sort-seq assays were performed in parallel using cognate and noncognate HSLs to quantify each corresponding sensor's response to each HSL signal, which identified hundreds of highly specific variants. Sensor variants identified were individually assayed and exhibited up to 60.6-fold (p = 0.0013) improved relative activation by the cognate signal compared to the wildtype. Interestingly, we uncovered prevalent mutational epistasis and previously unidentified residues contributing to signal specificity. The resulting sensors with negligible signal crosstalk could be broadly applied to engineer bacteria consortia.
Affiliation(s)
- Min Zeng
  - Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
- Biprodev Sarker
  - Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
- Stephen N Rondthaler
  - Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
- Vanessa Vu
  - Department of Biochemistry and Molecular Biology, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
- Lauren B Andrews
  - Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
  - Molecular and Cellular Biology Graduate Program, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
  - Biotechnology Training Program, University of Massachusetts Amherst, Amherst, Massachusetts 01003, United States
31
Ao YF, Dörr M, Menke MJ, Born S, Heuson E, Bornscheuer UT. Data-Driven Protein Engineering for Improving Catalytic Activity and Selectivity. Chembiochem 2024; 25:e202300754. [PMID: 38029350 DOI: 10.1002/cbic.202300754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 11/28/2023] [Accepted: 11/29/2023] [Indexed: 12/01/2023]
Abstract
Protein engineering is essential for altering the substrate scope, catalytic activity and selectivity of enzymes for applications in biocatalysis. However, traditional approaches, such as directed evolution and rational design, encounter the challenge in dealing with the experimental screening process of a large protein mutation space. Machine learning methods allow the approximation of protein fitness landscapes and the identification of catalytic patterns using limited experimental data, thus providing a new avenue to guide protein engineering campaigns. In this concept article, we review machine learning models that have been developed to assess enzyme-substrate-catalysis performance relationships aiming to improve enzymes through data-driven protein engineering. Furthermore, we prospect the future development of this field to provide additional strategies and tools for achieving desired activities and selectivities.
Affiliation(s)
- Yu-Fei Ao
  - Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
  - Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Zhongguancun North First Street 2, Beijing, 100190, China
  - University of Chinese Academy of Sciences, Yuquan Road 19(A), Beijing, 100049, China
- Mark Dörr
  - Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
- Marian J Menke
  - Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
- Stefan Born
  - Technische Universität Berlin, Chair of Bioprocess Engineering, Ackerstraße 76, 13355, Berlin, Germany
- Egon Heuson
  - Univ. Lille, CNRS, Centrale Lille, Univ. Artois, UMR 8181 UCCS, Unité de Catalyse et Chimie du Solide, 59000, Lille, France
- Uwe T Bornscheuer
  - Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
32
Zhu D, Brookes DH, Busia A, Carneiro A, Fannjiang C, Popova G, Shin D, Donohue KC, Lin LF, Miller ZM, Williams ER, Chang EF, Nowakowski TJ, Listgarten J, Schaffer DV. Optimal trade-off control in machine learning-based library design, with application to adeno-associated virus (AAV) for gene therapy. Science Advances 2024; 10:eadj3786. [PMID: 38266077 PMCID: PMC10807795 DOI: 10.1126/sciadv.adj3786] [Citation(s) in RCA: 18] [Impact Index Per Article: 18.0] [Received: 06/23/2023] [Accepted: 12/22/2023] [Indexed: 01/26/2024]
Abstract
Adeno-associated viruses (AAVs) hold tremendous promise as delivery vectors for gene therapies. AAVs have been successfully engineered (for instance, for more efficient and/or cell-specific delivery to numerous tissues) by creating large, diverse starting libraries and selecting for desired properties. However, these starting libraries often contain a high proportion of variants unable to assemble or package their genomes, a prerequisite for any gene delivery goal. Here, we present and showcase a machine learning (ML) method for designing AAV peptide insertion libraries that achieve fivefold higher packaging fitness than the standard NNK library with negligible reduction in diversity. To demonstrate our ML-designed library's utility for downstream engineering goals, we show that it yields approximately 10-fold more successful variants than the NNK library after selection for infection of human brain tissue, leading to a promising glial-specific variant. Moreover, our design approach can be applied to other types of libraries for AAV and beyond.
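For context on the NNK baseline named in this abstract: in an NNK codon, N is any of the four bases and K is G or T. The short sketch below enumerates those codons against the standard genetic code to show what such a library can encode. The helper `stop_free_fraction` and the insert length of 7 codons used in the example are illustrative assumptions, and the stop-free fraction it computes is not the packaging fitness measured in the paper.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNK: N = A/C/G/T at positions 1 and 2, K = G/T at position 3
nnk_codons = ["".join(codon) for codon in product("ACGT", "ACGT", "GT")]
nnk_stops = [codon for codon in nnk_codons if CODON_TABLE[codon] == "*"]
nnk_amino_acids = sorted({CODON_TABLE[codon] for codon in nnk_codons} - {"*"})

def stop_free_fraction(n_codons):
    """Fraction of uniformly sampled NNK inserts of n_codons codons with no stop codon."""
    p_sense = 1.0 - len(nnk_stops) / len(nnk_codons)
    return p_sense ** n_codons

# Hypothetical insert length of 7 codons, chosen only for illustration
fraction_without_stop = stop_free_fraction(7)
```

NNK covers all 20 amino acids with 32 codons and a single stop codon (TAG), but it samples amino acids unevenly and takes no account of which peptides allow capsid assembly, which is the gap a designed library aims to close.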
Affiliation(s)
- Danqing Zhu
  - California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, CA 94720, USA
- David H. Brookes
  - Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA 94720, USA
- Akosua Busia
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
- Ana Carneiro
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
- Galina Popova
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
- David Shin
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
- Kevin C. Donohue
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - School of Medicine, University of California San Francisco, San Francisco, CA 94143, USA
  - Kavli Institute of Fundamental Neuroscience, University of California San Francisco, San Francisco, CA 94143, USA
  - Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA 94143, USA
- Li F. Lin
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
- Zachary M. Miller
  - Department of Chemistry, University of California, Berkeley, Berkeley, CA 94720, USA
- Evan R. Williams
  - Department of Chemistry, University of California, Berkeley, Berkeley, CA 94720, USA
- Edward F. Chang
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Tomasz J. Nowakowski
  - Department of Anatomy, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Psychiatry and Behavioural Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Eli and Edythe Broad Center for Regeneration Medicine and Stem Cell Research, University of California San Francisco, San Francisco, CA 94143, USA
  - Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Department of Neurological Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- David V. Schaffer
  - California Institute for Quantitative Biosciences, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
  - Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA 94720, USA
  - Innovative Genomics Institute (IGI), University of California, Berkeley, Berkeley, CA 94720, USA
33
Xing H, Cai P, Liu D, Han M, Liu J, Le Y, Zhang D, Hu QN. High-throughput prediction of enzyme promiscuity based on substrate-product pairs. Brief Bioinform 2024; 25:bbae089. [PMID: 38487850 PMCID: PMC10940840 DOI: 10.1093/bib/bbae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 01/20/2024] [Accepted: 02/03/2024] [Indexed: 03/18/2024] Open
Abstract
The screening of enzymes for catalyzing specific substrate-product pairs is often constrained in the realms of metabolic engineering and synthetic biology. Existing tools based on substrate and reaction similarity predominantly rely on prior knowledge, demonstrating limited extrapolative capabilities and an inability to incorporate custom candidate-enzyme libraries. Addressing these limitations, we have developed the Substrate-product Pair-based Enzyme Promiscuity Prediction (SPEPP) model. This innovative approach utilizes transfer learning and transformer architecture to predict enzyme promiscuity, thereby elucidating the intricate interplay between enzymes and substrate-product pairs. SPEPP exhibited robust predictive ability, eliminating the need for prior knowledge of reactions and allowing users to define their own candidate-enzyme libraries. It can be seamlessly integrated into various applications, including metabolic engineering, de novo pathway design, and hazardous material degradation. To better assist metabolic engineers in designing and refining biochemical pathways, particularly those without programming skills, we also designed EnzyPick, an easy-to-use web server for enzyme screening based on SPEPP. EnzyPick is accessible at http://www.biosynther.com/enzypick/.
Affiliation(s)
- Huadong Xing
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Pengli Cai
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Dongliang Liu
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Mengying Han
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Juan Liu
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan 430072, China
| | - Yingying Le
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Dachuan Zhang
- Institute of Environmental Engineering, ETH Zurich, Laura-Hezner-Weg 7, 8093 Zurich, Switzerland
| | - Qian-Nan Hu
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| |
|
34
|
Xi C, Diao J, Moon TS. Advances in ligand-specific biosensing for structurally similar molecules. Cell Syst 2023; 14:1024-1043. [PMID: 38128482 PMCID: PMC10751988 DOI: 10.1016/j.cels.2023.10.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Revised: 08/23/2023] [Accepted: 10/19/2023] [Indexed: 12/23/2023]
Abstract
The specificity of biological systems makes it possible to develop biosensors targeting specific metabolites, toxins, and pollutants in complex medical or environmental samples without interference from structurally similar compounds. For the last two decades, great efforts have been devoted to creating proteins or nucleic acids with novel properties through synthetic biology strategies. Beyond augmenting biocatalytic activity, expanding target substrate scopes, and enhancing enzymes' enantioselectivity and stability, an increasing research area is the enhancement of molecular specificity for genetically encoded biosensors. Here, we summarize recent advances in the development of highly specific biosensor systems and their essential applications. First, we describe the rational design principles required to create libraries containing potential mutants with less promiscuity or better specificity. Next, we review the emerging high-throughput screening techniques to engineer biosensing specificity for the desired target. Finally, we examine the computer-aided evaluation and prediction methods to facilitate the construction of ligand-specific biosensors.
Affiliation(s)
- Chenggang Xi
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA
| | - Jinjin Diao
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA
| | - Tae Seok Moon
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA; Division of Biology and Biomedical Sciences, Washington University in St. Louis, St. Louis, MO, USA.
| |
|
35
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
| |
|
36
|
Buller R, Lutz S, Kazlauskas RJ, Snajdrova R, Moore JC, Bornscheuer UT. From nature to industry: Harnessing enzymes for biocatalysis. Science 2023; 382:eadh8615. [PMID: 37995253 DOI: 10.1126/science.adh8615] [Citation(s) in RCA: 143] [Impact Index Per Article: 71.5] [Received: 07/14/2023] [Accepted: 10/17/2023] [Indexed: 11/25/2023]
Abstract
Biocatalysis harnesses enzymes to make valuable products. This green technology is used in countless applications from bench scale to industrial production and allows practitioners to access complex organic molecules, often with fewer synthetic steps and reduced waste. The last decade has seen an explosion in the development of experimental and computational tools to tailor enzymatic properties, equipping enzyme engineers with the ability to create biocatalysts that perform reactions not present in nature. By using (chemo)-enzymatic synthesis routes or orchestrating intricate enzyme cascades, scientists can synthesize elaborate targets ranging from DNA and complex pharmaceuticals to starch made in vitro from CO2-derived methanol. In addition, new chemistries have emerged through the combination of biocatalysis with transition metal catalysis, photocatalysis, and electrocatalysis. This review highlights recent key developments, identifies current limitations, and provides a future prospect for this rapidly developing technology.
Affiliation(s)
- R Buller
- Competence Center for Biocatalysis, Institute of Chemistry and Biotechnology, Zurich University of Applied Sciences, 8820 Wädenswil, Switzerland
| | - S Lutz
- Codexis Incorporated, Redwood City, CA 94063, USA
| | - R J Kazlauskas
- Department of Biochemistry, Molecular Biology and Biophysics, Biotechnology Institute, University of Minnesota, Saint Paul, MN 55108, USA
| | - R Snajdrova
- Novartis Institutes for BioMedical Research, Global Discovery Chemistry, 4056 Basel, Switzerland
| | - J C Moore
- MRL, Merck & Co., Rahway, NJ 07065, USA
| | - U T Bornscheuer
- Institute of Biochemistry, Dept. of Biotechnology and Enzyme Catalysis, Greifswald University, Greifswald, Germany
| |
|
37
|
Nisonoff H, Wang Y, Listgarten J. Coherent Blending of Biophysics-Based Knowledge with Bayesian Neural Networks for Robust Protein Property Prediction. ACS Synth Biol 2023; 12:3242-3251. [PMID: 37888887 DOI: 10.1021/acssynbio.3c00217] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 10/28/2023]
Abstract
Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall performance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty (uncertainty arising from a lack of training data rather than sensor noise) is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.
Affiliation(s)
- Hunter Nisonoff
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
| | - Yixin Wang
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109-1107, United States
| | - Jennifer Listgarten
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, California 94720-1776, United States
| |
|
38
|
Fahlberg SA, Freschlin CR, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. bioRxiv 2023:2023.11.08.566287. [PMID: 37987009 PMCID: PMC10659313 DOI: 10.1101/2023.11.08.566287] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/22/2023]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.
Affiliation(s)
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
| |
|
39
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
|
40
|
Busia A, Listgarten J. MBE: model-based enrichment estimation and prediction for differential sequencing data. Genome Biol 2023; 24:218. [PMID: 37784130 PMCID: PMC10544408 DOI: 10.1186/s13059-023-03058-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/27/2023] [Accepted: 09/14/2023] [Indexed: 10/04/2023] Open
Abstract
Characterizing differences in sequences between two conditions, such as with and without drug exposure, using high-throughput sequencing data is a prevalent problem involving quantifying changes in sequence abundances, and predicting such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot use sequencing data effectively, nor be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. We evaluate MBE using both simulated and real data. Overall, MBE improves accuracy compared to current differential analysis methods.
Affiliation(s)
- Akosua Busia
- Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.
| | - Jennifer Listgarten
- Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.
| |
|
41
|
Golinski AW, Schmitz ZD, Nielsen GH, Johnson B, Saha D, Appiah S, Hackel BJ, Martiniani S. Predicting and Interpreting Protein Developability Via Transfer of Convolutional Sequence Representation. ACS Synth Biol 2023; 12:2600-2615. [PMID: 37642646 PMCID: PMC10829850 DOI: 10.1021/acssynbio.3c00196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability (quantified by expression, solubility, and stability) hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 10^5 of 10^20 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity, when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
Affiliation(s)
- Alexander W. Golinski
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Zachary D. Schmitz
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Gregory H. Nielsen
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Bryce Johnson
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Diya Saha
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Sandhya Appiah
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Benjamin J. Hackel
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Stefano Martiniani
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
- Center for Soft Matter Research, Department of Physics, New York University, New York, NY 10003
- Simons Center for Computational Physical Chemistry, Departments of Chemistry, New York University, New York, NY 10003
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10003
| |
|
42
|
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput Biol 2023; 19:e1011460. [PMID: 37713443 PMCID: PMC10529646 DOI: 10.1371/journal.pcbi.1011460] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/24/2023] [Revised: 09/27/2023] [Accepted: 08/24/2023] [Indexed: 09/17/2023] Open
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems such as protein folding where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches due to the limitation of data scarcity. One approach to overcome data scarcity is to incorporate physical principles such as through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels that play important roles in cardiovascular and neural systems. Many mutants of the BK channel are associated with various neurological and cardiovascular diseases, but the molecular effects are unknown. The voltage gating properties of BK channels have been characterized for 473 site-specific mutations experimentally over the last three decades; yet, these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV1/2, with an RMSE ~ 32 mV and correlation coefficient of R ~ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix, mutations of which are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV1/2 agree quantitatively with prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, United States of America
| | - Kelli M. White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, United States of America
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| |
|
43
|
Chen L, Zhang Z, Li Z, Li R, Huo R, Chen L, Wang D, Luo X, Chen K, Liao C, Zheng M. Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst 2023; 14:706-721.e5. [PMID: 37591206 DOI: 10.1016/j.cels.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/15/2023] [Revised: 05/30/2023] [Accepted: 07/18/2023] [Indexed: 08/19/2023]
Abstract
One of the key points of machine learning-assisted directed evolution (MLDE) is the accurate learning of the fitness landscape, a conceptual mapping from sequence variants to the desired function. Here, we describe a multi-protein training scheme that leverages the existing deep mutational scanning data from diverse proteins to aid in understanding the fitness landscape of a new protein. Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects. Moreover, our study identified previously overlooked strong baselines, and their unexpectedly good performance brings our attention to the pitfalls of MLDE. Overall, these results may improve our understanding of the association between different protein fitness profiles and shed light on developing better machine learning-assisted approaches to the directed evolution of proteins. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Lin Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zehong Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhenghao Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; Shanghai Institute for Advanced Immunochemical Studies, School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Rui Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
| | - Ruifeng Huo
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
| | - Lifan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | | | - Xiaomin Luo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
| | - Kaixian Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
| | - Cangsong Liao
- University of Chinese Academy of Sciences, Beijing 100049, China; Chemical Biology Research Center, Shanghai Institute of Materia Medica, Chinese Academy of Science, Shanghai 201203, China.
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China; School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China.
| |
|
44
|
Li Y, Yao Y, Xia Y, Tang M. Searching for protein variants with desired properties using deep generative models. BMC Bioinformatics 2023; 24:297. [PMID: 37480001 PMCID: PMC10362698 DOI: 10.1186/s12859-023-05415-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 07/17/2023] [Indexed: 07/23/2023] Open
Abstract
BACKGROUND Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing varieties. RESULTS To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. CONCLUSION Compared to other models, the Pearson correlation coefficient between the predicted values of protein fitness obtained by T-VAE and the truth values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model.
Affiliation(s)
- Yan Li
- School of Information, Yunnan Normal University, Kunming, China
| | - Yinying Yao
- National Key Laboratory of Crop Genetic Improvement and National Centre of Plant Gene Research, Huazhong Agricultural University, Wuhan, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Yu Xia
- School of Information, Yunnan Normal University, Kunming, China
| | - Mingjing Tang
- Engineering Research Center of Sustainable Development and Utilization of Biomass Energy, Ministry of Education, Yunnan Normal University, Kunming, China
- School of Life Science, Yunnan Normal University, Kunming, China
| |
|
45
|
Nordquist E, Zhang G, Barethiya S, Ji N, White KM, Han L, Jia Z, Shi J, Cui J, Chen J. Incorporating physics to overcome data scarcity in predictive modeling of protein function: a case study of BK channels. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.24.546384. [PMID: 37425916 PMCID: PMC10327070 DOI: 10.1101/2023.06.24.546384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
Machine learning has played transformative roles in numerous chemical and biophysical problems, such as protein folding, where large amounts of data exist. Nonetheless, many important problems remain challenging for data-driven machine learning approaches because of data scarcity. One approach to overcoming data scarcity is to incorporate physical principles, for example through molecular modeling and simulation. Here, we focus on the big potassium (BK) channels, which play important roles in cardiovascular and neural systems. Many BK channel mutants are associated with various neurological and cardiovascular diseases, but their molecular effects are unknown. The voltage gating properties of BK channels have been characterized experimentally for 473 site-specific mutations over the last three decades; yet these functional data by themselves remain far too sparse to derive a predictive model of BK channel voltage gating. Using physics-based modeling, we quantify the energetic effects of all single mutations on both the open and closed states of the channel. Together with dynamic properties derived from atomistic simulations, these physical descriptors allow the training of random forest models that could reproduce unseen experimentally measured shifts in gating voltage, ΔV1/2, with an RMSE ∼ 32 mV and a correlation coefficient of R ∼ 0.7. Importantly, the model appears capable of uncovering nontrivial physical principles underlying the gating of the channel, including a central role of hydrophobic gating. The model was further evaluated using four novel mutations of L235 and V236 on the S5 helix; mutations of these two residues are predicted to have opposing effects on V1/2 and suggest a key role of S5 in mediating voltage sensor-pore coupling. The measured ΔV1/2 agrees quantitatively with the prediction for all four mutations, with a high correlation of R = 0.92 and RMSE = 18 mV. Therefore, the model can capture nontrivial voltage gating properties in regions where few mutations are known. The success of predictive modeling of BK voltage gating demonstrates the potential of combining physics and statistical learning for overcoming data scarcity in nontrivial protein function prediction.
Author Summary
Deep machine learning has brought many exciting breakthroughs in chemistry, physics, and biology. These models require large amounts of training data and struggle when data are scarce. This is the case for predictive modeling of the function of complex proteins such as ion channels, where only hundreds of mutational data points may be available. Using the big potassium (BK) channel as a biologically important model system, we demonstrate that a reliable predictive model of its voltage gating property could be derived from only 473 mutational data points by incorporating physics-derived features, which include dynamic properties from molecular dynamics simulations and energetic quantities from Rosetta mutation calculations. We show that the final random forest model captures key trends and hotspots in mutational effects on BK voltage gating, such as the important role of pore hydrophobicity. A particularly curious prediction is that mutations of two adjacent residues on the S5 helix would always have opposite effects on the gating voltage, which was confirmed by experimental characterization of four novel mutations. The current work demonstrates the importance and effectiveness of incorporating physics in predictive modeling of protein function with scarce data.
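The two agreement metrics quoted in this abstract, RMSE and the correlation coefficient R, can be computed as in the following minimal sketch. The ΔV1/2 values are hypothetical placeholders chosen only to exercise the functions; they are not data from the study, and the abstract does not state which software the authors used.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between paired predictions and observations."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gating-voltage shifts in mV (illustration only).
predicted_mv = [-20.0, 5.0, 40.0, -60.0]
measured_mv = [-35.0, 10.0, 25.0, -50.0]

error_mv = rmse(predicted_mv, measured_mv)
correlation = pearson_r(predicted_mv, measured_mv)
```

RMSE carries the units of the measurement (mV here), whereas R is unitless, which is why the abstract reports both.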
Affiliation(s)
- Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Guohui Zhang
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Nathan Ji
- Department of Biology, Boston College, Chestnut Hill, Massachusetts, USA
| | - Kelli M White
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Lu Han
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Zhiguang Jia
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Jingyi Shi
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianmin Cui
- Department of Biomedical Engineering, Center for the Investigation of Membrane Excitability Disorders, Cardiac Bioelectricity and Arrhythmia Center, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| |
|
46
|
Valeri JA, Soenksen LR, Collins KM, Ramesh P, Cai G, Powers R, Angenent-Mari NM, Camacho DM, Wong F, Lu TK, Collins JJ. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences. Cell Syst 2023; 14:525-542.e9. [PMID: 37348466 PMCID: PMC10700034 DOI: 10.1016/j.cels.2023.05.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/30/2022] [Revised: 02/17/2023] [Accepted: 05/22/2023] [Indexed: 06/24/2023]
Abstract
The design choices underlying machine-learning (ML) models present important barriers to entry for many biologists who aim to incorporate ML in their research. Automated machine-learning (AutoML) algorithms can address many challenges that come with applying ML to the life sciences. However, these algorithms are rarely used in systems and synthetic biology studies because they typically do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with other AutoML algorithms. Here, we present BioAutoMATED, an AutoML platform for biological sequence analysis that integrates multiple AutoML methods into a unified framework. Users are automatically provided with relevant techniques for analyzing, interpreting, and designing biological sequences. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, and designs optimized synthetic biology components, revealing salient sequence characteristics. By automating sequence modeling, BioAutoMATED allows life scientists to incorporate ML more readily into their work.
Affiliation(s)
- Jacqueline A Valeri
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Luis R Soenksen
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - Katherine M Collins
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Engineering, University of Cambridge, Trumpington St, Cambridge CB2 1PZ, UK
| | - Pradeep Ramesh
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - George Cai
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Rani Powers
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Pluto Biosciences, Golden, CO 80402, USA
| | - Nicolaas M Angenent-Mari
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Diogo M Camacho
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Felix Wong
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Timothy K Lu
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - James J Collins
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| |
|
47
|
Hui T, Descoteaux ML, Miao J, Lin YS. Training Neural Network Models Using Molecular Dynamics Simulation Results to Efficiently Predict Cyclic Hexapeptide Structural Ensembles. J Chem Theory Comput 2023. [PMID: 37236147 PMCID: PMC10373485 DOI: 10.1021/acs.jctc.3c00154] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/28/2023]
Abstract
Cyclic peptides have emerged as a promising class of therapeutics. However, their de novo design remains challenging, and many cyclic peptide drugs are simply natural products or their derivatives. Most cyclic peptides, including the current cyclic peptide drugs, adopt multiple conformations in water. The ability to characterize cyclic peptide structural ensembles would greatly aid their rational design. In a previous pioneering study, our group demonstrated that using molecular dynamics results to train machine learning models can efficiently predict structural ensembles of cyclic pentapeptides. Using this method, which was termed StrEAMM (Structural Ensembles Achieved by Molecular Dynamics and Machine Learning), linear regression models were able to predict the structural ensembles for an independent test set with R² = 0.94 between the predicted populations for specific structures and the observed populations in molecular dynamics simulations for cyclic pentapeptides. An underlying assumption in these StrEAMM models is that cyclic peptide structural preferences are predominantly influenced by neighboring interactions, namely, interactions between (1,2) and (1,3) residues. Here we demonstrate that for larger cyclic peptides such as cyclic hexapeptides, linear regression models including only (1,2) and (1,3) interactions fail to produce satisfactory predictions (R² = 0.47); further inclusion of (1,4) interactions leads to moderate improvements (R² = 0.75). We show that when using convolutional neural networks and graph neural networks to incorporate complex nonlinear interaction patterns, we can achieve R² = 0.97 and R² = 0.91 for cyclic pentapeptides and hexapeptides, respectively.
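One way to see why (1,2) and (1,3) terms leave a gap for hexapeptides is to enumerate residue pairs around the ring. The sketch below treats pairs as unordered and counts only ring separation, which is a simplifying assumption for illustration and not the StrEAMM feature definition; the function name is hypothetical.

```python
def ring_pairs(n_residues, separation):
    """Unordered residue pairs (0-based indices) that sit `separation`
    positions apart around a cyclic peptide of `n_residues` residues."""
    return sorted(
        {
            tuple(sorted((i, (i + separation) % n_residues)))
            for i in range(n_residues)
        }
    )


# Pentapeptide: (1,2) and (1,3) pairs already cover every residue pair.
penta_12 = ring_pairs(5, 1)
penta_13 = ring_pairs(5, 2)
penta_14 = ring_pairs(5, 3)  # identical to penta_13 on a five-membered ring

# Hexapeptide: three cross-ring (1,4) pairs are not covered by (1,2)/(1,3).
hexa_12 = ring_pairs(6, 1)
hexa_13 = ring_pairs(6, 2)
hexa_14 = ring_pairs(6, 3)
```

Under this counting a pentapeptide has 5 + 5 = 10 pairs, all of them (1,2) or (1,3), while a hexapeptide has 6 + 6 + 3 = 15, so the three (1,4) pairs are the ones a (1,2)/(1,3)-only model cannot represent.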
Affiliation(s)
- Tiffani Hui
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Marc L Descoteaux
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Jiayuan Miao
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| | - Yu-Shan Lin
- Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
| |
|
48
|
Diaz-Colunga J, Skwara A, Gowda K, Diaz-Uriarte R, Tikhonov M, Bajic D, Sanchez A. Global epistasis on fitness landscapes. Philos Trans R Soc Lond B Biol Sci 2023; 378:20220053. [PMID: 37004717 PMCID: PMC10067270 DOI: 10.1098/rstb.2022.0053] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 07/30/2022] [Accepted: 11/23/2022] [Indexed: 04/04/2023] Open
Abstract
Epistatic interactions between mutations add substantial complexity to adaptive landscapes and are often thought of as detrimental to our ability to predict evolution. Yet, patterns of global epistasis, in which the fitness effect of a mutation is well-predicted by the fitness of its genetic background, may actually be of help in our efforts to reconstruct fitness landscapes and infer adaptive trajectories. Microscopic interactions between mutations, or inherent nonlinearities in the fitness landscape, may cause global epistasis patterns to emerge. In this brief review, we provide a succinct overview of recent work about global epistasis, with an emphasis on building intuition about why it is often observed. To this end, we reconcile simple geometric reasoning with recent mathematical analyses, using these to explain why different mutations in an empirical landscape may exhibit different global epistasis patterns, ranging from diminishing to increasing returns. Finally, we highlight open questions and research directions. This article is part of the theme issue 'Interdisciplinary approaches to predicting evolutionary biology'.
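A global epistasis pattern of the kind described here is often summarized by regressing a mutation's fitness effect on the fitness of the background it lands in. The following sketch uses an ordinary least-squares line and made-up numbers; the data, the function names, and the sign-based labels are illustrative assumptions rather than the review's own analysis.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def describe_pattern(slope):
    """Label the trend by the sign of the slope (simplified convention)."""
    if slope < 0:
        return "diminishing returns"
    if slope > 0:
        return "increasing returns"
    return "no global epistasis"


# Hypothetical data: fitness of each background and the effect of adding
# the same mutation to it.
background_fitness = [0.0, 1.0, 2.0, 3.0]
mutation_effect = [1.0, 0.7, 0.4, 0.1]

slope, intercept = fit_line(background_fitness, mutation_effect)
pattern = describe_pattern(slope)
```

In this toy data set the benefit of the mutation shrinks as the background gets fitter, so the fitted slope is negative and the pattern is labeled as diminishing returns.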
Affiliation(s)
- Juan Diaz-Colunga
- Department of Ecology & Evolutionary Biology, Yale University, New Haven, CT 06511, USA
| | - Abigail Skwara
- Department of Ecology & Evolutionary Biology, Yale University, New Haven, CT 06511, USA
| | - Karna Gowda
- Department of Ecology & Evolution & Center for the Physics of Evolving Systems, The University of Chicago, Chicago, IL 60637, USA
| | - Ramon Diaz-Uriarte
- Department of Biochemistry, School of Medicine, Universidad Autónoma de Madrid, Madrid 28029, Spain
- Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (UAM-CSIC), Madrid 28029, Spain
| | - Mikhail Tikhonov
- Department of Physics, Washington University of St Louis, St Louis, MO 63130, USA
| | - Djordje Bajic
- Department of Ecology & Evolutionary Biology, Yale University, New Haven, CT 06511, USA
| | - Alvaro Sanchez
- Department of Ecology & Evolutionary Biology, Yale University, New Haven, CT 06511, USA
- Department of Microbial Biotechnology, Campus de Cantoblanco, CNB-CSIC, Madrid 28049, Spain
| |
|
49
|
Chen Y, Hu R, Li K, Zhang Y, Fu L, Zhang J, Si T. Deep Mutational Scanning of an Oxygen-Independent Fluorescent Protein CreiLOV for Comprehensive Profiling of Mutational and Epistatic Effects. ACS Synth Biol 2023; 12:1461-1473. [PMID: 37066862 PMCID: PMC10204710 DOI: 10.1021/acssynbio.2c00662] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/08/2022] [Indexed: 04/18/2023]
Abstract
Oxygen-independent, flavin mononucleotide-based fluorescent proteins (FbFPs) are promising alternatives to green fluorescent protein in anaerobic contexts. Deep mutational scanning performs systematic profiling of protein sequence-function relationships but has not been applied to FbFPs. Focusing on CreiLOV from Chlamydomonas reinhardtii, we created and analyzed two comprehensive mutant collections: (1) single-residue, site-saturation mutagenesis libraries covering all 118 residues; and (2) a full combinatorial mutagenesis library among 20 mutations at 15 residues, where mutation and residue selection was based on single-site mutagenesis results. Notably, the second type of library is indispensable to study higher-order epistasis but underrepresented in the literature. Using optimized FACS-seq assays, 2,185 (>92.5%) out of 2,360 possible single-site mutants and 165,428 (>89.7%) out of 184,320 possible combinatorial mutants were reliably assigned with fitness values. We constructed statistical and machine-learning models to analyze the CreiLOV data set, enabling accurate fitness prediction of higher-order mutants using lower-order mutagenesis data. In addition, we successfully isolated CreiLOV variants with improved fluorescence quantum yield and thermostability. This work provides new empirical data and design rules to engineer combinatorial protein variants.
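The library sizes and coverage figures in this abstract can be checked with a few lines of arithmetic. The counts come from the abstract; the reading of 2,360 as 118 positions times 20 outcomes per position, and the per-residue breakdown of the combinatorial library, are assumptions that merely reproduce the stated totals and are not taken from the paper.

```python
from math import prod

# Single-site library: 118 positions, assumed 20 outcomes per position.
n_positions = 118
outcomes_per_position = 20  # assumption; the abstract gives only the total
single_site_total = n_positions * outcomes_per_position

single_site_coverage = 2185 / 2360
combinatorial_coverage = 165428 / 184320

# One breakdown consistent with "20 mutations at 15 residues" and a full
# library of 184,320 variants (an assumption, not stated in the abstract):
# each residue contributes its wild-type state plus its mutant states.
states_per_residue = [2] * 12 + [3] * 2 + [5]
n_residues = len(states_per_residue)
n_mutations = sum(s - 1 for s in states_per_residue)
full_library_size = prod(states_per_residue)
```

The combinatorial size is a product over residues, not a sum, which is why 20 mutations at only 15 sites already yield a library of more than 180,000 variants.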
Affiliation(s)
- Yongcan Chen
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Ruyun Hu
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Keyi Li
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Yating Zhang
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Lihao Fu
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Jianzhi Zhang
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Tong Si
- CAS Key Laboratory for Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- BGI-Shenzhen, Shenzhen 518083, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| |
|
50
|
Gantz M, Neun S, Medcalf EJ, van Vliet LD, Hollfelder F. Ultrahigh-Throughput Enzyme Engineering and Discovery in In Vitro Compartments. Chem Rev 2023; 123:5571-5611. [PMID: 37126602 PMCID: PMC10176489 DOI: 10.1021/acs.chemrev.2c00910] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 12/30/2022] [Indexed: 05/03/2023]
Abstract
Novel and improved biocatalysts are increasingly sourced from libraries via experimental screening. The success of such campaigns is crucially dependent on the number of candidates tested. Water-in-oil emulsion droplets can replace the classical test tube, to provide in vitro compartments as an alternative screening format, containing genotype and phenotype and enabling a readout of function. The scale-down to micrometer droplet diameters and picoliter volumes brings about a >10^7-fold volume reduction compared to 96-well-plate screening. Droplets made in automated microfluidic devices can be integrated into modular workflows to set up multistep screening protocols involving various detection modes to sort >10^7 variants a day with kHz frequencies. The repertoire of assays available for droplet screening covers all seven enzyme commission (EC) number classes, setting the stage for widespread use of droplet microfluidics in everyday biochemical experiments. We review the practicalities of adapting droplet screening for enzyme discovery and for detailed kinetic characterization. These new ways of working will not just accelerate discovery experiments currently limited by screening capacity but profoundly change the paradigms we can probe. By interfacing the results of ultrahigh-throughput droplet screening with next-generation sequencing and deep learning, strategies for directed evolution can be implemented, examined, and evaluated.
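The orders of magnitude quoted for droplet screening can be reproduced with back-of-the-envelope arithmetic. The specific inputs below (a 100 µL working volume per well, a 5 pL droplet, and a 1 kHz sorting rate sustained for a full day) are illustrative assumptions, not figures from the review.

```python
# Volumes expressed in picoliters so that the arithmetic stays in integers.
PICOLITERS_PER_MICROLITER = 10**6

well_volume_pl = 100 * PICOLITERS_PER_MICROLITER  # assumed 100 µL per well
droplet_volume_pl = 5                             # assumed 5 pL per droplet
volume_reduction = well_volume_pl // droplet_volume_pl

SECONDS_PER_DAY = 24 * 60 * 60
sort_rate_hz = 1_000                              # assumed 1 kHz sorting
variants_per_day = sort_rate_hz * SECONDS_PER_DAY
```

Under these assumptions the volume reduction is 2 x 10^7-fold and one day of uninterrupted sorting handles about 8.6 x 10^7 droplets, both consistent with the ">10^7" figures in the abstract. Real runs are shorter than 24 hours, so the daily figure is an upper bound for this rate.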
Affiliation(s)
- Florian Hollfelder
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K.
| |
|