1
|
Zhang Q, Chen W, Qin M, Wang Y, Pu Z, Ding K, Liu Y, Zhang Q, Li D, Li X, Zhao Y, Yao J, Huang L, Wu J, Yang L, Chen H, Yu H. Integrating protein language models and automatic biofoundry for enhanced protein evolution. Nat Commun 2025; 16:1553. [PMID: 39934638 PMCID: PMC11814318 DOI: 10.1038/s41467-025-56751-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2024] [Accepted: 01/24/2025] [Indexed: 02/13/2025] Open
Abstract
Traditional protein engineering methods, such as directed evolution, while effective, are often slow and labor-intensive. Advances in machine learning and automated biofoundry present new opportunities for optimizing these processes. This study devises a protein language model-enabled automatic evolution platform, a closed-loop system for automated protein engineering within the Design-Build-Test-Learn cycle. The protein language model ESM-2 makes zero-shot prediction of 96 variants to initiate the cycle. The biofoundry constructs and evaluates these variants, and feeds the results back to a multi-layer perceptron to train a fitness predictor, which then makes prediction of second round of 96 variants with improved fitness. With the tRNA synthetase as a model enzyme, four-rounds of evolution carried out within 10 days lead to mutants with enzyme activity improved by up to 2.4-fold. Our system significantly enhances the speed and accuracy of protein evolution, driving faster advancements in protein engineering for industrial applications.
Collapse
Affiliation(s)
- Qiang Zhang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-UIUC Institute, International Campus, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Wanyi Chen
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Ming Qin
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- School of Software Technology, Zhejiang University, Hangzhou, 315103, China
| | - Yuhao Wang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Polytechnic Institute, Zhejiang University, Hangzhou, 310015, China
| | - Zhongji Pu
- Xianghu Laboratory, Hangzhou, 311231, China
| | - Keyan Ding
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Yuyue Liu
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Qunfeng Zhang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Dongfang Li
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Xinjia Li
- Xianghu Laboratory, Hangzhou, 311231, China
| | - Yu Zhao
- AI Lab, Tencent, Shenzhen, Guangdong, 518000, China
| | - Jianhua Yao
- AI Lab, Tencent, Shenzhen, Guangdong, 518000, China
| | - Lei Huang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
| | - Jianping Wu
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
- Zhejiang Key Laboratory of Intelligent Manufacturing for Functional Chemicals, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311215, China
| | - Lirong Yang
- Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China
- Zhejiang Key Laboratory of Intelligent Manufacturing for Functional Chemicals, ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311215, China
| | - Huajun Chen
- Zhejiang University, Hangzhou, Zhejiang, 310058, China.
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China.
- College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, 310027, China.
| | - Haoran Yu
- Zhejiang University, Hangzhou, Zhejiang, 310058, China.
- Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China.
- ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou, Zhejiang, 311200, China.
| |
Collapse
|
2
|
Yang J, Lal RG, Bowden JC, Astudillo R, Hameedi MA, Kaur S, Hill M, Yue Y, Arnold FH. Active learning-assisted directed evolution. Nat Commun 2025; 16:714. [PMID: 39821082 PMCID: PMC11739421 DOI: 10.1038/s41467-025-55987-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Accepted: 01/02/2025] [Indexed: 01/19/2025] Open
Abstract
Directed evolution (DE) is a powerful tool to optimize protein fitness for a specific application. However, DE can be inefficient when mutations exhibit non-additive, or epistatic, behavior. Here, we present Active Learning-assisted Directed Evolution (ALDE), an iterative machine learning-assisted DE workflow that leverages uncertainty quantification to explore the search space of proteins more efficiently than current DE methods. We apply ALDE to an engineering landscape that is challenging for DE: optimization of five epistatic residues in the active site of an enzyme. In three rounds of wet-lab experimentation, we improve the yield of a desired product of a non-native cyclopropanation reaction from 12% to 93%. We also perform computational simulations on existing protein sequence-fitness datasets to support our argument that ALDE can be more effective than DE. Overall, ALDE is a practical and broadly applicable strategy to unlock improved protein engineering outcomes.
Collapse
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Ravi G Lal
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - James C Bowden
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Computer Science, University of California-Berkeley, Berkeley, CA, USA
| | - Raul Astudillo
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
| | - Mikhail A Hameedi
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | | | - Matthew Hill
- Elegen Corp, 1300 Industrial Road #16, San Carlos, CA, USA
| | - Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA.
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
| |
Collapse
|
3
|
Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024; 30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2024] [Indexed: 02/24/2024]
Abstract
Directed evolution stands as a seminal technology for generating novel protein functionalities, a cornerstone in biocatalysis, metabolic engineering, and synthetic biology. Today, with the development of various mutagenesis methods and advanced analytical machines, the challenge of diversity generation and high-throughput screening platforms is largely solved, and one of the remaining challenges is: how to empower the potential of single beneficial substitutions with recombination to achieve the epistatic effect. This review overviews experimental and computer-assisted recombination methods in protein engineering campaigns. In addition, integrated and machine learning-guided strategies were highlighted to discuss how these recombination approaches contribute to generating the screening library with better diversity, coverage, and size. A decision tree was finally summarized to guide the further selection of proper recombination strategies in practice, which was beneficial for accelerating protein engineering.
Collapse
Affiliation(s)
- Xinyue Wang
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Anni Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Xiujuan Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Haiyang Cui
- School of Life Sciences, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| |
Collapse
|
4
|
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS CENTRAL SCIENCE 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency-or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Collapse
Affiliation(s)
- Jason Yang
- Division
of Chemistry and Chemical Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division
of Biology and Biological Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division
of Chemistry and Chemical Engineering, California
Institute of Technology, Pasadena, California 91125, United States
- Division
of Biology and Biological Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| |
Collapse
|
5
|
Chu HY, Fong JHC, Thean DGL, Zhou P, Fung FKC, Huang Y, Wong ASL. Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including the CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
Collapse
Affiliation(s)
- Hoi Yee Chu
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - John H C Fong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Dawn G L Thean
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Peng Zhou
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Frederic K C Fung
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Alan S L Wong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China.
| |
Collapse
|
6
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Collapse
Affiliation(s)
- Petr Kouba
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of
Electrical Engineering, Czech Technical
University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute
of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute
of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| |
Collapse
|
7
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Collapse
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
| |
Collapse
|
8
|
Qiu Y, Wei GW. Persistent spectral theory-guided protein engineering. NATURE COMPUTATIONAL SCIENCE 2023; 3:149-163. [PMID: 37637776 PMCID: PMC10456983 DOI: 10.1038/s43588-022-00394-y] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/09/2022] [Accepted: 12/22/2022] [Indexed: 08/29/2023]
Abstract
While protein engineering, which iteratively optimizes protein fitness by screening the gigantic mutational space, is constrained by experimental capacity, various machine learning models have substantially expedited protein engineering. Three-dimensional protein structures promise further advantages, but their intricate geometric complexity hinders their applications in deep mutational screening. Persistent homology, an established algebraic topology tool for protein structural complexity reduction, fails to capture the homotopic shape evolution during the filtration of a given data. This work introduces a Topology-offered protein Fitness (TopFit) framework to complement protein sequence and structure embeddings. Equipped with an ensemble regression strategy, TopFit integrates the persistent spectral theory, a new topological Laplacian, and two auxiliary sequence embeddings to capture mutation-induced topological invariant, shape evolution, and sequence disparity in the protein fitness landscape. The performance of TopFit is assessed by 34 benchmark datasets with 128,634 variants, involving a vast variety of protein structure acquisition modalities and training set size variations.
Collapse
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, MI, 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
| |
Collapse
|