1
Song F, Zhang H, Qin Z, Zhou J. Intelligent biomanufacturing of water-soluble vitamins. Trends Biotechnol 2025:S0167-7799(25)00134-9. [PMID: 40335344 DOI: 10.1016/j.tibtech.2025.04.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2025] [Revised: 04/05/2025] [Accepted: 04/07/2025] [Indexed: 05/09/2025]
Abstract
Given the crucial role of water-soluble vitamins in the human body and the rising demand for natural sources, their biosynthesis has gained the attention of researchers. This review offers a comprehensive look at recent progress in water-soluble vitamin biosynthesis, emphasizing synthetic biotechnology for green biomanufacturing. Specifically, it encompasses the optimization of biological components, pathways, and systems, as well as energy metabolism regulation, stress-tolerance enhancement, high-throughput screening, and the upscaling of production processes. It also envisages intelligent biomanufacturing platforms, highlighting the role of systems biology and artificial intelligence (AI), and proposes future research directions, such as integrating AI-driven metabolic models, enzyme engineering, and cell-free systems, to address limitations in the efficiency, toxicity, and scalability of water-soluble vitamin production.
Affiliation(s)
- Fuqiang Song
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Jiangsu Province Engineering Research Center of Food Synthetic Biotechnology, Jiangnan University, Wuxi 214122, China
- Heng Zhang
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Jiangsu Province Engineering Research Center of Food Synthetic Biotechnology, Jiangnan University, Wuxi 214122, China
- Zhijie Qin
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jingwen Zhou
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China; Jiangsu Province Engineering Research Center of Food Synthetic Biotechnology, Jiangnan University, Wuxi 214122, China; Jiangsu Province Basic Research Center for Synthetic Biology, Jiangnan University, Wuxi 214122, China.
2
Zhou Y, Liu Y, Sun H, Lu Y. Creating novel metabolic pathways by protein engineering for bioproduction. Trends Biotechnol 2025; 43:1094-1103. [PMID: 39632163 PMCID: PMC12064402 DOI: 10.1016/j.tibtech.2024.10.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 10/21/2024] [Accepted: 10/31/2024] [Indexed: 12/07/2024]
Abstract
A diverse array of natural products has been produced by cell biofactories through metabolic engineering, in which enzymes play essential roles in the complex metabolic network. However, the scope of such biotransformation can be limited by the capacities of natural enzymes. To broaden their scope, many natural enzymes have recently been engineered to activate non-native substrates and/or to employ new-to-nature reaction mechanisms, but most of these systems are only demonstrated for in vitro applications. To bridge the gap between in vitro and in vivo biocatalysis, we highlight recent progress in engineering enzymes with non-native substrates or new-to-nature mechanisms that have been successfully applied in living cells to create novel metabolic pathways.
Affiliation(s)
- Yu Zhou
- Department of Chemistry, University of Texas at Austin, Austin, TX 78712, USA
- Yiwei Liu
- Department of Chemistry, University of Texas at Austin, Austin, TX 78712, USA
- Haoran Sun
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA
- Yi Lu
- Department of Chemistry, University of Texas at Austin, Austin, TX 78712, USA; Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78712, USA.
3
Li Y, Li J, Zhang M, Liao Y, Wang F, Qiao M. Heterologous production of caffeic acid in microbial hosts: current status and perspectives. Front Microbiol 2025; 16:1570406. [PMID: 40365059 PMCID: PMC12069361 DOI: 10.3389/fmicb.2025.1570406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2025] [Accepted: 04/14/2025] [Indexed: 05/15/2025] Open
Abstract
Caffeic acid, a plant-derived phenolic compound, has attracted much attention in medicine and cosmetics due to its remarkable physiological activities, including antioxidant, anti-inflammatory, antibacterial, antiviral, and hemostatic effects. However, traditional plant extraction and chemical synthesis methods suffer from problems such as high production costs, low extraction efficiency, and environmental pollution. In recent years, the construction of microbial cell factories for the biosynthesis of caffeic acid has attracted much attention due to its potential to offer an efficient and environmentally friendly alternative for caffeic acid production. This review first introduces the caffeic acid biosynthesis pathway and then analyzes the characteristics of microbial hosts for caffeic acid production. Then, the main strategies for caffeic acid production in microbial hosts, including selection and optimization of heterologous enzymes, enhancement of the metabolic flux to caffeic acid, supply and recycling of cofactors, and optimization of the production process, are summarized and discussed. Finally, the future prospects of microbial caffeic acid production are discussed. Recent breakthroughs have achieved caffeic acid titers of up to 6.17 g/L, demonstrating the potential of microbial biosynthesis. Future research can focus on enhancing metabolic flux into the caffeic acid biosynthesis pathway, developing robust microbial hosts with improved tolerance to caffeic acid and its precursors, and establishing cost-effective industrial production processes.
Affiliation(s)
- Yuanzi Li
- School of Light Industry Science and Engineering, Beijing Technology and Business University (BTBU), Beijing, China
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University (BTBU), Beijing, China
- Jiaxin Li
- School of Light Industry Science and Engineering, Beijing Technology and Business University (BTBU), Beijing, China
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University (BTBU), Beijing, China
- Miao Zhang
- School of Light Industry Science and Engineering, Beijing Technology and Business University (BTBU), Beijing, China
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University (BTBU), Beijing, China
- Yonghong Liao
- School of Light Industry Science and Engineering, Beijing Technology and Business University (BTBU), Beijing, China
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University (BTBU), Beijing, China
- Fenghuan Wang
- School of Light Industry Science and Engineering, Beijing Technology and Business University (BTBU), Beijing, China
- Beijing Advanced Innovation Center for Food Nutrition and Human Health, Beijing Technology and Business University (BTBU), Beijing, China
- Mingqiang Qiao
- The Key Laboratory of Molecular Microbiology and Technology, Ministry of Education, College of Life Sciences, Nankai University, Tianjin, China
- College of Life Sciences, Shanxi University, Taiyuan, China
4
Zhang L, Liu T. ATP-Pred: Prediction of Protein-ATP Binding Residues via Fusion of Residue-Level Embeddings and Kolmogorov-Arnold Network. J Chem Inf Model 2025; 65:3812-3826. [PMID: 40119803 DOI: 10.1021/acs.jcim.5c00016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/24/2025]
Abstract
Accurately identifying protein-ATP binding residues is essential for understanding biological processes and designing drugs. However, current sequence-based methods have limitations, such as difficulties in extracting discriminative features and the need for more efficient algorithms. Additionally, methods based on multiple sequence alignments often face challenges in handling large-scale predictions. To address these issues, we developed ATP-Pred, a sequence-based method for predicting ATP-binding residues in proteins. This model applies transfer learning by using two recently developed pretrained protein language models, Ankh and ProstT5, to extract residue-level embeddings that capture protein functionality. ATP-Pred also integrates a CNN-BiLSTM network and a Kolmogorov-Arnold network to build the prediction model. To handle data imbalance, we introduced a weighted focal loss function. Experimental results on three independent test data sets showed that ATP-Pred outperforms most existing methods. Its generalizability was further validated on four protein-mononucleotide binding residue data sets, where it delivered promising results. These findings suggest that ATP-Pred is a robust and reliable predictor.
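As an illustration of the weighted focal loss mentioned in this abstract, the sketch below implements the standard alpha-weighted binary focal loss. It is not taken from ATP-Pred: the function name `weighted_focal_loss` and the default values of `alpha` and `gamma` are assumptions chosen for illustration, and the paper's exact weighting scheme may differ.

```python
import math

def weighted_focal_loss(probs, labels, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean alpha-weighted binary focal loss over per-residue predictions.

    probs  : predicted probabilities of the positive (binding) class
    labels : 0/1 ground-truth labels
    alpha  : weight on the positive class (hypothetical default)
    gamma  : focusing parameter; gamma=0 reduces to weighted cross-entropy
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p             # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t)^gamma down-weights examples that are already well classified
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `alpha` above 0.5, errors on the rare binding residues cost more than errors on the abundant non-binding residues, which is the usual motivation for this loss on imbalanced data.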
Affiliation(s)
- Lingrong Zhang
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
5
Thomas N, Belanger D, Xu C, Lee H, Hirano K, Iwai K, Polic V, Nyberg KD, Hoff KG, Frenz L, Emrich CA, Kim JW, Chavarha M, Ramanan A, Agresti JJ, Colwell LJ. Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Syst 2025; 16:101236. [PMID: 40081373 DOI: 10.1016/j.cels.2025.101236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Revised: 09/17/2024] [Accepted: 02/19/2025] [Indexed: 03/16/2025]
Abstract
Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged fitness landscape and costly experiments. In this work, we present TeleProt, a machine learning (ML) framework that blends evolutionary and experimental data to design diverse protein libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments, TeleProt found a significantly better top-performing enzyme than directed evolution (DE), had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55,000 nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Neil Thomas
- X, the Moonshot Factory, Mountain View, CA 94043, USA.
- Jun W Kim
- X, the Moonshot Factory, Mountain View, CA 94043, USA
- Abi Ramanan
- X, the Moonshot Factory, Mountain View, CA 94043, USA
- Lucy J Colwell
- Google DeepMind, Cambridge, MA 02142, USA; Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK.
6
Ozkan S, Padilla N, de la Cruz X. QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning. Hum Genet 2025; 144:191-208. [PMID: 39048855 PMCID: PMC11976337 DOI: 10.1007/s00439-024-02692-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 07/14/2024] [Indexed: 07/27/2024]
Abstract
Next-generation sequencing (NGS) has revolutionized genetic diagnostics, yet its application in precision medicine remains incomplete, despite significant advances in computational tools for variant annotation. Many variants remain unannotated, and existing tools often fail to accurately predict the range of impacts that variants have on protein function. This limitation restricts their utility in relevant applications such as predicting disease severity and onset age. In response to these challenges, a new generation of computational models is emerging, aimed at producing quantitative predictions of genetic variant impacts. However, the field is still in its early stages, and several issues need to be addressed, including improved performance and better interpretability. This study introduces QAFI, a novel methodology that integrates protein-specific regression models within an ensemble learning framework, utilizing conservation-based and structure-related features derived from AlphaFold models. Our findings indicate that QAFI significantly enhances the accuracy of quantitative predictions across various proteins. The approach has been rigorously validated through its application in the CAGI6 contest, focusing on ARSA protein variants, and further tested on a comprehensive set of clinically labeled variants, demonstrating its generalizability and robust predictive power. The straightforward nature of our models may also contribute to better interpretability of the results.
Affiliation(s)
- Selen Ozkan
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Natàlia Padilla
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
- Xavier de la Cruz
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
7
Forrest B, Derbel H, Zhao Z, Liu Q. MMRT: MultiMut Recursive Tree for predicting functional effects of high-order protein variants from low-order variants. Comput Struct Biotechnol J 2025; 27:672-681. [PMID: 40070521 PMCID: PMC11894328 DOI: 10.1016/j.csbj.2025.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 02/10/2025] [Accepted: 02/17/2025] [Indexed: 03/14/2025] Open
Abstract
Protein sequences primarily determine their stability and functions. Mutations may occur at one, two, or three positions at the same time (low-order variants) or at multiple positions simultaneously (high-order variants), which affect protein functions. So far, low-order variants, such as single variants, double variants, and triple variants, have been well-studied through high-throughput experimental scanning techniques and computational prediction methods. However, research on high-order variants remains limited because of the difficulty of scanning an exponentially large number of potential variant combinations. Nonetheless, studying higher-order variants is crucial for understanding the pathogenesis of complex diseases, advancing protein engineering, and driving precision medicine. In this work, we introduce a novel deep learning model, namely MultiMut Recursive Tree (MMRT), to address this challenge of predicting the functional effects of high-order variants. MMRT integrates deep learning with a recursive tree framework to leverage the information from low-order variants to predict functional effects of high-order variants. We evaluated MMRT on datasets comprising 685,593 high-order variants. Our results (mean Spearman's correlation coefficient 0.55) demonstrated that MMRT outperformed three existing state-of-the-art methods: ESM (evolutionary scale modeling), DeepSequence, and ECNet (evolutionary context-integrated neural network). MMRT thus provides more accurate prediction of the functional effects of high-order protein variants, offering great potential for aiding the interpretation of variants in human disease studies.
Affiliation(s)
- Bryce Forrest
- Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
- Houssemeddine Derbel
- Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
- Zhongming Zhao
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Qian Liu
- Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
- School of Life Sciences, College of Sciences, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
8
Gromiha MM, Pandey M, Kulandaisamy A, Sharma D, Ridha F. Progress on the development of prediction tools for detecting disease causing mutations in proteins. Comput Biol Med 2025; 185:109510. [PMID: 39637461 DOI: 10.1016/j.compbiomed.2024.109510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Revised: 11/27/2024] [Accepted: 11/29/2024] [Indexed: 12/07/2024]
Abstract
Proteins are involved in a variety of functions in living organisms. The mutation of amino acid residues in a protein alters its structure, stability, binding, and function, with some mutations leading to diseases. Understanding the influence of mutations on protein structure and function helps to gain deep insights into the molecular mechanisms of diseases and to devise therapeutic strategies. Hence, several generic and disease-specific methods have been proposed to reveal the pathogenic effects of mutations. In this review, we focus on the development of prediction methods for identifying disease-causing mutations in proteins. We briefly outline the existing databases for disease-causing mutations, followed by a discussion on sequence- and structure-based features used for prediction. Further, we discuss computational tools based on machine learning, deep learning and large language models for detecting disease-causing mutations. Specifically, we emphasize the advances in predicting hotspots and mutations for targets involved in cancer, neurodegenerative and infectious diseases as well as in membrane proteins. The computational resources, including databases and algorithms for understanding or predicting the effect of mutations, will be listed. Moreover, limitations of existing methods and possible improvements will be discussed.
Affiliation(s)
- M Michael Gromiha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India.
- Medha Pandey
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
- A Kulandaisamy
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
- Divya Sharma
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
- Fathima Ridha
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
9
Gelman S, Johnson B, Freschlin C, Sharma A, D'Costa S, Peters J, Gitter A, Romero PA. Biophysics-based protein language models for protein engineering. bioRxiv 2025:2024.03.15.585128. [PMID: 38559182 PMCID: PMC10980077 DOI: 10.1101/2024.03.15.585128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
Affiliation(s)
- Sam Gelman
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
- Bryce Johnson
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
- Arnav Sharma
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
- Sameer D'Costa
- Department of Biochemistry, University of Wisconsin-Madison
- John Peters
- Morgridge Institute for Research
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Anthony Gitter
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison
- Department of Biomedical Engineering, Duke University
10
Zheng N, Cai Y, Zhang Z, Zhou H, Deng Y, Du S, Tu M, Fang W, Xia X. Tailoring industrial enzymes for thermostability and activity evolution by the machine learning-based iCASE strategy. Nat Commun 2025; 16:604. [PMID: 39799136 PMCID: PMC11724889 DOI: 10.1038/s41467-025-55944-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 01/03/2025] [Indexed: 01/15/2025] Open
Abstract
The pursuit of obtaining enzymes with high activity and stability remains a grail in enzyme evolution due to the stability-activity trade-off. Here, we develop an isothermal compressibility-assisted dynamic squeezing index perturbation engineering (iCASE) strategy to construct hierarchical modular networks for enzymes of varying complexity. Molecular mechanism analysis elucidates that the peak of adaptive evolution is reached through a structural response mechanism among variants. Furthermore, this dynamic response predictive model using structure-based supervised machine learning is established to predict enzyme function and fitness, demonstrating robust performance across different datasets and reliable prediction for epistasis. The universality of the iCASE strategy is validated by four sorts of enzymes with different structures and catalytic types. This machine learning-based iCASE strategy provides guidance for future research on the fitness evolution of enzymes.
Affiliation(s)
- Nan Zheng
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Yongchao Cai
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Zehua Zhang
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Huimin Zhou
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Yu Deng
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Shuang Du
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Mai Tu
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, PR China
- Wei Fang
- School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, PR China
- Xiaole Xia
- Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China.
- College of Food Science and Engineering, Tianjin University of Science and Technology, Tianjin, PR China.
11
Dieckhaus H, Kuhlman B. Protein stability models fail to capture epistatic interactions of double point mutations. Protein Sci 2025; 34:e70003. [PMID: 39704075 DOI: 10.1002/pro.70003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Revised: 11/06/2024] [Accepted: 12/05/2024] [Indexed: 12/21/2024]
Abstract
There is strong interest in accurate methods for predicting changes in protein stability resulting from amino acid mutations to the protein sequence. Recombinant proteins must often be stabilized to be used as therapeutics or reagents, and destabilizing mutations are implicated in a variety of diseases. Due to increased data availability and improved modeling techniques, recent studies have shown advancements in predicting changes in protein stability when a single-point mutation is made. Less focus has been directed toward predicting changes in protein stability when there are two or more mutations. Here, we analyze the largest available dataset of double point mutation stability and benchmark several widely used protein stability models on this and other datasets. We find that additive models of protein stability perform surprisingly well on this task, achieving similar performance to comparable non-additive predictors according to most metrics. Accordingly, we find that neither artificial intelligence-based nor physics-based protein stability models consistently capture epistatic interactions between single mutations. We observe one notable deviation from this trend, which is that epistasis-aware models provide marginally better predictions than additive models on stabilizing double point mutations. We develop an extension of the ThermoMPNN framework for double mutant modeling, as well as a novel data augmentation scheme, which mitigates some of the limitations in currently available datasets. Collectively, our findings indicate that current protein stability models fail to capture the nuanced epistatic interactions between concurrent mutations due to several factors, including training dataset limitations and insufficient model sensitivity.
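The contrast between additive and epistasis-aware models described in this abstract can be made concrete with a small sketch. It assumes the common convention that an additive model predicts the stability change of a double mutant as the sum of the two single-mutant changes, and that epistasis is the deviation from that sum. The function names and the numeric values are hypothetical and are not taken from the paper or from ThermoMPNN.

```python
def additive_ddg(ddg_a, ddg_b):
    """Additive prediction: double-mutant ddG is the sum of the single-mutant ddGs."""
    return ddg_a + ddg_b

def epistasis(ddg_ab, ddg_a, ddg_b):
    """Deviation of the measured double-mutant ddG from the additive expectation.

    A value of zero means the two mutations act independently; any nonzero
    value is an interaction that an additive model cannot represent.
    """
    return ddg_ab - additive_ddg(ddg_a, ddg_b)

# Hypothetical values in kcal/mol, for illustration only.
single_a, single_b, measured_double = 1.0, 1.0, 1.5
interaction = epistasis(measured_double, single_a, single_b)
```

In this made-up example the interaction term is -0.5 kcal/mol, meaning the double mutant's measured change is 0.5 kcal/mol smaller than the additive model expects.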
Affiliation(s)
- Henry Dieckhaus
- Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy, Chapel Hill, North Carolina, USA
- Brian Kuhlman
- Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
12
Keen MM, Keith AD, Ortlund EA. Epitope mapping via in vitro deep mutational scanning methods and its applications. J Biol Chem 2025; 301:108072. [PMID: 39674321 PMCID: PMC11783119 DOI: 10.1016/j.jbc.2024.108072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2024] [Revised: 12/04/2024] [Accepted: 12/09/2024] [Indexed: 12/16/2024] Open
Abstract
Epitope mapping is a technique employed to define the region of an antigen that elicits an immune response, providing crucial insight into the structural architecture of the antigen as well as epitope-paratope interactions. With this breadth of knowledge, immunotherapies, diagnostics, and vaccines are being developed with a rational and data-supported design. Traditional epitope mapping methods are laborious, time-intensive, and often lack the ability to screen proteins in a high-throughput manner or provide high resolution. Deep mutational scanning (DMS), however, is revolutionizing the field as it can screen all possible single amino acid mutations and provide an efficient and high-throughput way to infer the structures of both linear and three-dimensional epitopes with high resolution. Currently, more than 50 publications take this approach to efficiently identify enhancing or escaping mutations, with many then employing this information to rapidly develop broadly neutralizing antibodies, T-cell immunotherapies, vaccine platforms, or diagnostics. We provide a comprehensive review of the approaches to accomplish epitope mapping while also providing a summation of the development of DMS technology and its impactful applications.
Affiliation(s)
- Meredith M Keen
- Department of Biochemistry, Emory School of Medicine, Emory University, Atlanta, Georgia, USA
- Alasdair D Keith
- Department of Biochemistry, Emory School of Medicine, Emory University, Atlanta, Georgia, USA
- Eric A Ortlund
- Department of Biochemistry, Emory School of Medicine, Emory University, Atlanta, Georgia, USA.
13
Ma Z, Li W, Shen Y, Xu Y, Liu G, Chang J, Li Z, Qin H, Tian B, Gong H, Liu DR, Thuronyi BW, Voigt CA, Zhang S. EvoAI enables extreme compression and reconstruction of the protein sequence space. Nat Methods 2025; 22:102-112. [PMID: 39528677 DOI: 10.1038/s41592-024-02504-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 10/10/2024] [Indexed: 11/16/2024]
Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here we establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 1048. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
Affiliation(s)
- Ziyuan Ma
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Wenjie Li
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Yunhao Shen
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Yunxin Xu
- School of Life Sciences, Tsinghua University, Beijing, China
| | - Gengjiang Liu
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Jiamin Chang
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Zeju Li
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Hong Qin
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Boxue Tian
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Haipeng Gong
- School of Life Sciences, Tsinghua University, Beijing, China
| | - David R Liu
- Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
| | - B W Thuronyi
- Department of Chemistry, Williams College, Williamstown, MA, USA
| | - Christopher A Voigt
- Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Shuyi Zhang
- School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
|
14
|
Tomilova YE, Russkikh NE, Yi IM, Shaburova EV, Tomilov VN, Pyrinova GB, Brezhneva SO, Tikhonyuk OS, Gololobova NS, Popichenko DV, Arkhipov MO, Bryzgalov LO, Brenner EV, Artyukh AA, Shtokalo DN, Antonets DV, Ivanov MK. Enhancing the reverse transcriptase function in Taq polymerase via AI-driven multiparametric rational design. Front Bioeng Biotechnol 2024; 12:1495267. [PMID: 39720166 PMCID: PMC11666352 DOI: 10.3389/fbioe.2024.1495267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Accepted: 11/19/2024] [Indexed: 12/26/2024] Open
Abstract
Introduction: Modification of natural enzymes to introduce new properties and enhance existing ones is a central challenge in bioengineering. This study is focused on the development of Taq polymerase mutants that show enhanced reverse transcriptase (RTase) activity while retaining other desirable properties such as fidelity, 5'-3' exonuclease activity, effective deoxyuracil incorporation, and tolerance to locked nucleic acid (LNA)-containing substrates. Our objective was to use AI-driven rational design combined with multiparametric wet-lab analysis to identify and validate Taq polymerase mutants with an optimal combination of these properties. Methods: The experimental procedure was conducted in several stages: 1) On the basis of a foundational paper, we selected 18 candidate mutations known to affect RTase activity across six sites. These candidates, along with the wild type, were assessed in the wet lab for multiple properties to establish an initial training dataset. 2) Using embeddings of Taq polymerase variants generated by a protein language model, we trained a Ridge regression model to predict multiple enzyme properties. This model guided the selection of 14 new candidates for experimental validation, expanding the dataset for further refinement. 3) To better manage risk by assessing confidence intervals on predictions, we transitioned to Gaussian process regression and trained this model on an expanded dataset comprising 33 data points. 4) With this enhanced model, we conducted an in silico screen of over 18 million potential mutations, narrowing the field to 16 top candidates for comprehensive wet-lab evaluation. Results and Discussion: This iterative, data-driven strategy ultimately led to the identification of 18 enzyme variants that exhibited markedly improved RTase activity while maintaining a favorable balance of other key properties. These enhancements were generally accompanied by lower Kd, moderately reduced fidelity, and greater tolerance to noncanonical substrates, thereby illustrating a strong interdependence among these traits. Several enzymes validated via this procedure were effective in single-enzyme real-time reverse-transcription PCR setups, implying their utility for the development of new tools for real-time reverse-transcription PCR technologies, such as pathogen RNA detection and gene expression analysis. This study illustrates how AI can be effectively integrated with experimental bioengineering to enhance enzyme functionality systematically. Our approach offers a robust framework for designing enzyme mutants tailored to specific biotechnological applications. The results of our biological activity predictions for mutated Taq polymerases can be accessed at https://huggingface.co/datasets/nerusskikh/taqpol_insilico_dms.
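The move from point predictions to predictions with uncertainty in stage 3 can be illustrated with a small Gaussian process regression sketch. This is not the authors' implementation: the RBF kernel, the noise level, the zero-mean prior, the 2-D toy "embeddings", and the lower-bound ranking rule are all assumptions chosen to keep the example self-contained in pure Python.

```python
import math


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential similarity between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def gp_predict(train_x, train_y, query, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at `query`."""
    k_train = [[rbf_kernel(a, b) + (noise if i == j else 0.0)
                for j, b in enumerate(train_x)]
               for i, a in enumerate(train_x)]
    k_query = [rbf_kernel(a, query) for a in train_x]
    mean = sum(k * w for k, w in zip(k_query, solve(k_train, train_y)))
    reduction = sum(k * v for k, v in zip(k_query, solve(k_train, k_query)))
    variance = rbf_kernel(query, query) - reduction
    return mean, math.sqrt(max(variance, 0.0))


# Hypothetical 2-D "embeddings" and measured activities for four variants.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
train_y = [0.1, 0.8, 0.3, 1.0]

candidates = {
    "near_training_data": [0.9, 0.9],
    "far_from_training_data": [4.0, 4.0],
}
for name, features in candidates.items():
    mean, std = gp_predict(train_x, train_y, features)
    # A pessimistic score penalises candidates the model is unsure about.
    print(f"{name}: mean={mean:.3f} std={std:.3f} "
          f"lower_bound={mean - 1.96 * std:.3f}")
```

A candidate far from all measured variants gets a standard deviation close to the prior, so a risk-aware ranking would deprioritise it even if its predicted mean looked attractive.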
Affiliation(s)
- Elizaveta V. Shaburova
- MSU Institute for Artificial Intelligence, Lomonosov Moscow State University, Moscow, Russia
| | - Dmitry N. Shtokalo
- AcademGene LLC, Novosibirsk, Russia
- MSU Institute for Artificial Intelligence, Lomonosov Moscow State University, Moscow, Russia
- Institute of Informatics Systems SB RAS, Novosibirsk, Russia
| | - Denis V. Antonets
- MSU Institute for Artificial Intelligence, Lomonosov Moscow State University, Moscow, Russia
| | - Mikhail K. Ivanov
- AO Vector-Best, Novosibirsk, Russia
- Institute of Molecular and Cellular Biology SB RAS, Novosibirsk, Russia
|
15
|
Zhang Z, Li Z, Wang Q, Wu H, Yang M, Zhao F, Tan M, Han S. A protein fitness predictive framework based on feature combination and intelligent searching. Protein Sci 2024; 33:e5211. [PMID: 39548358 PMCID: PMC11567853 DOI: 10.1002/pro.5211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2024] [Revised: 09/14/2024] [Accepted: 10/22/2024] [Indexed: 11/17/2024]
Abstract
Machine learning (ML) constructs predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of protein sequences with high fitness values without becoming trapped in local optima, as directed evolution can. However, extracting the most pertinent functional feature information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination offers comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence-to-function mapping. Compared to similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models (ECNet, EVmutation, and UniRep). In addition, scut_ProFP enables generalization from low-order mutants to high-order mutants. Finally, we used scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV, strongly enriching for high-fluorescence mutants starting from only a small number of low-fluorescence mutants. The developed method is therefore advantageous for ML in protein engineering, providing an effective approach to data-driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
Affiliation(s)
- Zhihui Zhang
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
| | - Zhixuan Li
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
| | - Qianyue Wang
- School of Software Engineering, South China University of Technology, Guangzhou, China
| | - Hanlin Wu
- School of Software Engineering, South China University of Technology, Guangzhou, China
| | - Manli Yang
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
| | - Fengguang Zhao
- School of Light Industry and Engineering, South China University of Technology, Guangzhou, China
| | - Mingkui Tan
- School of Software Engineering, South China University of Technology, Guangzhou, China
| | - Shuangyan Han
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
|
16
|
van der Flier F, Estell D, Pricelius S, Dankmeyer L, van Stigt Thans S, Mulder H, Otsuka R, Goedegebuur F, Lammerts L, Staphorst D, van Dijk AD, de Ridder D, Redestig H. Enzyme structure correlates with variant effect predictability. Comput Struct Biotechnol J 2024; 23:3489-3497. [PMID: 39435338 PMCID: PMC11491678 DOI: 10.1016/j.csbj.2024.09.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 09/03/2024] [Accepted: 09/12/2024] [Indexed: 10/23/2024] Open
Abstract
Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. To establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, which can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested: buriedness, number of contact residues, proximity to the active site, and presence of secondary structure elements. These dependencies were also found in several single-mutation enzyme variant datasets, albeit with dataset-specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning-guided protein engineering.
Affiliation(s)
- Floris van der Flier
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dave Estell
- Health & Biosciences, International Flavors and Fragrances, Palo Alto, 94304 CA, USA
| | - Sina Pricelius
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Lydia Dankmeyer
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Sander van Stigt Thans
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Harm Mulder
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Rei Otsuka
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Frits Goedegebuur
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Laurens Lammerts
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Diego Staphorst
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Aalt D.J. van Dijk
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dick de Ridder
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Henning Redestig
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
|
17
|
Jiang F, Li M, Dong J, Yu Y, Sun X, Wu B, Huang J, Kang L, Pei Y, Zhang L, Wang S, Xu W, Xin J, Ouyang W, Fan G, Zheng L, Tan Y, Hu Z, Xiong Y, Feng Y, Yang G, Liu Q, Song J, Liu J, Hong L, Tan P. A general temperature-guided language model to design proteins of enhanced stability and activity. Sci Adv 2024; 10:eadr2641. [PMID: 39602544 PMCID: PMC11601203 DOI: 10.1126/sciadv.adr2641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 10/24/2024] [Indexed: 11/29/2024]
Abstract
Designing protein mutants with both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model that can suggest protein mutants with improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive ability compared to current state-of-the-art models on the public mutagenesis dataset across 283 protein assays. Furthermore, we validated PRIME's predictions on five proteins, examining the impact of the top 30 to 45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, the ability to polymerize nonnatural nucleic acids, and resilience to extreme alkaline conditions. More than 30% of PRIME-recommended mutants exhibited superior performance compared to their premutation counterparts across all proteins and desired properties. We developed an efficient and effective method based on PRIME to rapidly obtain multisite mutants with enhanced activity and stability. Hence, PRIME demonstrates broad applicability in protein engineering.
Affiliation(s)
- Fan Jiang
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Mingchen Li
- Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200240, China
| | - Jiajun Dong
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Sciences and Technology, ShanghaiTech University, Shanghai 201210, China
- Guangzhou National Laboratory, No. 9 XingDaoHuanBei Road, Guangzhou International Bio Island, Guangzhou, Guangdong 510005, China
| | - Yuanxi Yu
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Xinyu Sun
- Department of Chemistry, University of Science and Technology of China, Hefei, Anhui 230001, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang 310018, China
| | - Banghao Wu
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Jin Huang
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Liqi Kang
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yufeng Pei
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang 310018, China
| | - Liang Zhang
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Shaojie Wang
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Sciences and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Wenxue Xu
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Sciences and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Jingyao Xin
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Sciences and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Wanli Ouyang
- Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
| | - Guisheng Fan
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200240, China
| | - Lirong Zheng
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Tan
- Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200240, China
| | - Yi Xiong
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yan Feng
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Guangyu Yang
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
- Institute of Key Biological Raw Material, Shanghai Academy of Experimental Medicine, Shanghai 201401, China
- Hzymes Biotechnology Co. Ltd, Wuhan, Hubei 430075, China
| | - Qian Liu
- School of Life Sciences and Biotechnology, & State Key Laboratory of Microbial Metabolism, & Joint International Research Laboratory of Metabolic, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Jie Song
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, Zhejiang 310018, China
| | - Jia Liu
- Shanghai Institute for Advanced Immunochemical Studies and School of Life Sciences and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Liang Hong
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
- Zhanjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Pan Tan
- School of Physics and Astronomy, & Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China
|
18
|
Xu Y, Liu D, Gong H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci 2024; 4:840-850. [PMID: 39455825 DOI: 10.1038/s43588-024-00716-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 10/03/2024] [Indexed: 10/28/2024]
Abstract
Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for the prediction of fitness score, ΔΔG and ΔTm of a protein upon mutations, respectively. GeoFitness engages a specialized loss function to allow supervised training of a unified model using the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reutilized as a pre-trained module in GeoDDG and GeoDTm to overcome the challenge of lacking sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
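Because the benchmark above is reported as a Spearman correlation coefficient, a minimal sketch of that metric may help: it is the Pearson correlation computed on ranks, so it rewards getting the ordering of mutations right rather than their exact values. The ΔΔG numbers below are invented for illustration and are unrelated to the paper's benchmark.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured and predicted stability changes for five mutations.
measured = [-1.2, 0.3, 2.1, 0.9, 3.4]
predicted = [-0.5, 0.1, 1.0, 1.5, 2.0]

rho = spearman(measured, predicted)
print(f"Spearman rho = {rho:.2f}")
```

Only one adjacent pair of mutations is ordered differently in the two lists, so the coefficient stays high even though the predicted magnitudes differ substantially from the measured ones.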
Affiliation(s)
- Yunxin Xu
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Di Liu
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
|
19
|
Majid I, Sergeev YV. Linking Protein Stability to Pathogenicity: Predicting Clinical Significance of Single-Missense Mutations in Ocular Proteins Using Machine Learning. Int J Mol Sci 2024; 25:11649. [PMID: 39519200 PMCID: PMC11546782 DOI: 10.3390/ijms252111649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 10/28/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024] Open
Abstract
Understanding the effect of single-missense mutations on protein stability is crucial for clinical decision-making and therapeutic development. The impact of these mutations on protein stability and 3D structure remains underexplored. Here, we developed a program to investigate the relationship between pathogenic mutations and protein unfolding, and compared seven machine learning (ML) models for predicting the clinical significance of single-missense mutations with unknown impacts, based on protein stability parameters. We analyzed seven proteins associated with ocular disease-causing genes. Using Decision Tree Regression, the program yielded an R-squared value of 0.846 for the relationship between pathogenic mutations and decreased protein stability, with 96.20% of pathogenic mutations in RPE65 leading to protein instability. Among the ML models, Random Forest achieved the highest AUC (0.922) and PR AUC (0.879) in predicting the clinical significance of mutations with unknown effects. Our findings indicate that most pathogenic mutations affecting protein stability occur in alpha-helices, beta-pleated sheets, and active sites. This study suggests that protein stability can serve as a valuable parameter for interpreting the clinical significance of single-missense mutations in ocular proteins.
Affiliation(s)
- Yuri V. Sergeev
- Ophthalmic Genetics and Visual Function Branch, National Eye Institute, National Institutes of Health, Bethesda, MD 20892, USA
|
20
|
Xie X, Gui L, Qiao B, Wang G, Huang S, Zhao Y, Sun S. Deep learning in template-free de novo biosynthetic pathway design of natural products. Brief Bioinform 2024; 25:bbae495. [PMID: 39373052 PMCID: PMC11456888 DOI: 10.1093/bib/bbae495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/12/2024] [Accepted: 09/20/2024] [Indexed: 10/08/2024] Open
Abstract
Natural products (NPs) are indispensable in drug development, particularly in combating infections, cancer, and neurodegenerative diseases. However, their limited availability poses significant challenges. Template-free de novo biosynthetic pathway design provides a strategic solution for NP production, with deep learning standing out as a powerful tool in this domain. This review delves into state-of-the-art deep learning algorithms in NP biosynthesis pathway design. It provides an in-depth discussion of databases like Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and UniProt, which are essential for model training, along with chemical databases such as Reaxys, SciFinder, and PubChem for transfer learning to expand models' understanding of the broader chemical space. It evaluates the potential and challenges of sequence-to-sequence and graph-to-graph translation models for accurate single-step prediction. Additionally, it discusses search algorithms for multistep prediction and deep learning algorithms for predicting enzyme function. The review also highlights the pivotal role of deep learning in improving catalytic efficiency through enzyme engineering, which is essential for enhancing NP production. Moreover, it examines the application of large language models in pathway design, enzyme discovery, and enzyme engineering. Finally, it addresses the challenges and prospects associated with template-free approaches, offering insights into potential advancements in NP biosynthesis pathway design.
Affiliation(s)
- Xueying Xie
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Lin Gui
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Baixue Qiao
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Shan Huang
- Department of Neurology, The Second Affiliated Hospital, Harbin Medical University, No. 246 Xuefu Road, Nangang District, Harbin 150081, China
| | - Yuming Zhao
- College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
| | - Shanwen Sun
- Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
- College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
|
21
|
Qin Z, Yuan B, Qu G, Sun Z. Rational enzyme design by reducing the number of hotspots and library size. Chem Commun (Camb) 2024; 60:10451-10463. [PMID: 39210728 DOI: 10.1039/d4cc01394h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
Biocatalysts that are eco-friendly, sustainable, and highly specific have great potential for applications in the production of fine chemicals, food, detergents, biofuels, pharmaceuticals, and more. However, due to factors such as low activity, narrow substrate scope, poor thermostability, or incorrect selectivity, most natural enzymes cannot be directly used for large-scale production of the desired products. To overcome these obstacles, protein engineering methods have been developed over decades and have become powerful and versatile tools for adapting enzymes with improved catalytic properties or new functions. The vastness of the protein sequence space makes screening a bottleneck in obtaining advantageous mutated enzymes in traditional directed evolution. In the realm of mathematics, there are two major constraints in the protein sequence space: (1) the number of residue substitutions (M); and (2) the number of codons encoding amino acids as building blocks (N). This feature review highlights protein engineering strategies to reduce screening efforts from two dimensions by reducing the numbers M and N, and also discusses representative seminal studies of rationally engineered natural enzymes to deliver new catalytic functions.
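The effect of reducing M and N can be made concrete with a short calculation. The sketch below assumes the library size is N^M (every one of M randomized positions can take any of N codons) and that all variants are equally likely when sampled; the NNK (32 codons) and NDT (12 codons) degenerate codon sets are well-known examples chosen here for illustration, not figures taken from the review.

```python
import math


def library_size(n_codons, m_positions):
    """Distinct codon combinations for M randomized positions, N codons each."""
    return n_codons ** m_positions


def clones_for_coverage(size, coverage=0.95):
    """Clones to sample so each variant is seen at least once with
    probability `coverage`, assuming equally likely variants."""
    if size == 1:
        return 1
    # Solve 1 - (1 - 1/size)**t = coverage for t.
    return math.ceil(math.log(1.0 - coverage) / math.log1p(-1.0 / size))


for name, n_codons in [("NNK", 32), ("NDT", 12)]:
    for m_positions in (1, 2, 3):
        size = library_size(n_codons, m_positions)
        clones = clones_for_coverage(size)
        print(f"{name} M={m_positions}: {size} variants, "
              f"about {clones} clones for 95% coverage")
```

Screening effort grows exponentially in M and polynomially in N, which is why trimming either the number of hotspots or the codon alphabet pays off so quickly.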
Affiliation(s)
- Zongmin Qin
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
| | - Bo Yuan
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
| | - Ge Qu
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
| | - Zhoutong Sun
- University of Chinese Academy of Sciences, Beijing 100049, China
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China.
- National Center of Technology Innovation for Synthetic Biology, Tianjin 300308, China
- Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin 300308, China
| |
Collapse
|
22
|
Van Gelder K, Lindner SN, Hanson AD, Zhou J. Strangers in a foreign land: 'Yeastizing' plant enzymes. Microb Biotechnol 2024; 17:e14525. [PMID: 39222378 PMCID: PMC11368087 DOI: 10.1111/1751-7915.14525] [Received: 03/03/2024] [Accepted: 07/02/2024] [Indexed: 09/04/2024] [Open Access]
Abstract
Expressing plant metabolic pathways in microbial platforms is an efficient, cost-effective solution for producing many desired plant compounds. As eukaryotic organisms, yeasts are often the preferred platform. However, expression of plant enzymes in a yeast frequently leads to failure because the enzymes are poorly adapted to the foreign yeast cellular environment. Here, we first summarize the current engineering approaches for optimizing performance of plant enzymes in yeast. A critical limitation of these approaches is that they are labour-intensive and must be customized for each individual enzyme, which significantly hinders the establishment of plant pathways in cellular factories. In response to this challenge, we propose the development of a cost-effective computational pipeline to redesign plant enzymes for better adaptation to the yeast cellular milieu. This proposition is underpinned by compelling evidence that plant and yeast enzymes exhibit distinct sequence features that are generalizable across enzyme families. Consequently, we introduce a data-driven machine learning framework designed to extract 'yeastizing' rules from natural protein sequence variations, which can be broadly applied to all enzymes. Additionally, we discuss the potential to integrate the machine learning model into a full design-build-test cycle.
Affiliation(s)
- Kristen Van Gelder
  - Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
- Steffen N. Lindner
  - Department of Systems and Synthetic Metabolism, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
  - Department of Biochemistry, Charité Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität, Berlin, Germany
- Andrew D. Hanson
  - Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, Florida, USA
23
Zhang H, Guo L, Su Y, Wang R, Yang W, Mu W, Xuan L, Huang L, Wang J, Gao W. Hosts engineering and in vitro enzymatic synthesis for the discovery of novel natural products and their derivatives. Crit Rev Biotechnol 2024; 44:1121-1139. [PMID: 37574211 DOI: 10.1080/07388551.2023.2236787] [Received: 11/03/2022] [Revised: 05/23/2023] [Accepted: 06/17/2023] [Indexed: 08/15/2023]
Abstract
Novel natural products (NPs) and their derivatives are important sources for drug discovery and have been broadly applied in agriculture, livestock, and medicine, which makes their synthesis important. In recent years, biosynthesis technology has received increasing attention because of its high efficiency in producing high value-added novel products and because it is green, environmentally friendly, and controllable. This review outlines technological advances in biosynthesis strategies for the discovery of novel NPs and their derivatives, with an emphasis on two areas: host engineering and in vitro enzymatic synthesis. In terms of host engineering, multiple microorganisms, including Streptomyces, Aspergillus, and Penicillium, have been used in recent years both as providers of biosynthetic gene clusters (BGCs) and as host strains for expressing BGCs to discover new compounds. In addition, the use of in vitro enzymatic synthesis to generate novel compounds such as triterpenoid saponins and flavonoids is described.
Affiliation(s)
- Huanyu Zhang
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
- Lanping Guo
  - National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, P.R. China
- Yaowu Su
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
- Rubing Wang
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
- Wenqi Yang
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
- Wenrong Mu
  - College of Pharmacy, Henan University of Chinese Medicine, Zhengzhou, P.R. China
- Liangshuang Xuan
  - College of Pharmacy, Henan University of Chinese Medicine, Zhengzhou, P.R. China
- Luqi Huang
  - National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, P.R. China
- Juan Wang
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
- Wenyuan Gao
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, P.R. China
  - Key Laboratory of Systems Bioengineering, Ministry of Education, Tianjin University, Tianjin, P.R. China
24
Chen Y, Xu Y, Liu D, Xing Y, Gong H. An end-to-end framework for the prediction of protein structure and fitness from single sequence. Nat Commun 2024; 15:7400. [PMID: 39191788 DOI: 10.1038/s41467-024-51776-x] [Received: 02/27/2024] [Accepted: 08/19/2024] [Indexed: 08/29/2024] [Open Access]
Abstract
Significant research progress has been made in the field of protein structure and fitness prediction. Particularly, single-sequence-based structure prediction methods like ESMFold and OmegaFold achieve a balance between inference speed and prediction accuracy, showing promise for many downstream prediction tasks. Here, we propose SPIRED, a single-sequence-based structure prediction model that exhibits comparable performance to the state-of-the-art methods but with approximately 5-fold acceleration in inference and at least one order of magnitude reduction in training consumption. By integrating SPIRED with downstream neural networks, we compose an end-to-end framework named SPIRED-Fitness for the rapid prediction of both protein structure and fitness from single sequence with satisfactory accuracy. Moreover, SPIRED-Stab, the derivative of SPIRED-Fitness, achieves state-of-the-art performance in predicting the mutational effects on protein stability.
Affiliation(s)
- Yinghui Chen
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Yunxin Xu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Di Liu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Yaoguang Xing
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Haipeng Gong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
25
Illig AM, Siedhoff NE, Davari MD, Schwaneberg U. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort. J Chem Inf Model 2024; 64:6350-6360. [PMID: 39088689 DOI: 10.1021/acs.jcim.4c00704] [Indexed: 08/03/2024]
Abstract
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
Affiliation(s)
- Niklas E Siedhoff
  - Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
- Mehdi D Davari
  - Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
- Ulrich Schwaneberg
  - Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
26
Dieckhaus H, Kuhlman B. Protein stability models fail to capture epistatic interactions of double point mutations. bioRxiv 2024:2024.08.20.608844. [PMID: 39229177 PMCID: PMC11370451 DOI: 10.1101/2024.08.20.608844] [Indexed: 09/05/2024]
Abstract
There is strong interest in accurate methods for predicting changes in protein stability resulting from amino acid mutations to the protein sequence. Recombinant proteins must often be stabilized to be used as therapeutics or reagents, and destabilizing mutations are implicated in a variety of diseases. Due to increased data availability and improved modeling techniques, recent studies have shown advancements in predicting changes in protein stability when a single point mutation is made. Less focus has been directed toward predicting changes in protein stability when there are two or more mutations, despite the significance of mutation clusters for disease pathways and protein design studies. Here, we analyze the largest available dataset of double point mutation stability and benchmark several widely used protein stability models on this and other datasets. We identify a blind spot in how predictors are typically evaluated on multiple mutations, finding that, contrary to assumptions in the field, current stability models are unable to consistently capture epistatic interactions between double mutations. We observe one notable deviation from this trend, which is that epistasis-aware models provide marginally better predictions on stabilizing double point mutations. We develop an extension of the ThermoMPNN framework for double mutant modeling as well as a novel data augmentation scheme which mitigates some of the limitations in available datasets. Collectively, our findings indicate that current protein stability models fail to capture the nuanced epistatic interactions between concurrent mutations due to several factors, including training dataset limitations and insufficient model sensitivity.
Affiliation(s)
- Henry Dieckhaus
  - Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
  - Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy, Chapel Hill, North Carolina, USA
- Brian Kuhlman
  - Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
  - Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
27
Li G, Zhang N, Dai X, Fan L. EnzyACT: A Novel Deep Learning Method to Predict the Impacts of Single and Multiple Mutations on Enzyme Activity. J Chem Inf Model 2024; 64:5912-5921. [PMID: 39038814 PMCID: PMC11323264 DOI: 10.1021/acs.jcim.4c00920] [Received: 05/28/2024] [Revised: 07/01/2024] [Accepted: 07/09/2024] [Indexed: 07/24/2024]
Abstract
Enzyme engineering involves the customization of enzymes by introducing mutations to expand the application scope of natural enzymes. One limitation of that is the complex interaction between two key properties, activity and stability, where the enhancement of one often leads to the reduction of the other, also called the trade-off mechanism. Although dozens of methods that predict the change of protein stability upon mutations have been developed, the prediction of the effect on activity is still in its early stage. Therefore, developing a fast and accurate method to predict the impact of the mutations on enzyme activity is helpful for enzyme design and understanding of the trade-off mechanism. Here, we introduce a novel approach, EnzyACT, a deep learning method that fuses graph technique and protein embedding to predict activity changes upon single or multiple mutations. Our model combines graph-based techniques and language models to predict the activity changes. Moreover, EnzyACT is trained on a new curated data set including both single- and multiple-point mutations. When benchmarked on multiple independent data sets, it shows uniform performance on problems affected by mutations. This work also provides insights into the impact of distant mutations within activity design, which could also be useful for predicting catalytic residues and developing improved enzyme-engineering strategies.
Affiliation(s)
- Gen Li
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
- Ning Zhang
  - Production and R&D Center I of LSS, GenScript Biotech Corporation, Nanjing 211122, China
- Xiaowen Dai
  - Production and R&D Center I of LSS, GenScript Biotech Corporation, Nanjing 211122, China
- Long Fan
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
28
Tan Y, Li M, Zhou Z, Tan P, Yu H, Fan G, Hong L. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications. J Cheminform 2024; 16:92. [PMID: 39095917 PMCID: PMC11297785 DOI: 10.1186/s13321-024-00884-3] [Received: 03/21/2024] [Accepted: 07/13/2024] [Indexed: 08/04/2024] [Open Access]
Abstract
Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining . SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
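To make the idea of sub-word tokenization of protein sequences concrete, here is a minimal sketch of byte-pair-encoding merges applied to amino-acid strings. It is a toy, generic BPE written for illustration only; the function names and example sequences are hypothetical, and it is not the PETA implementation or the code in the linked repository.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules, most frequent adjacent pair first."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Break ties lexicographically so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    toy_sequences = ["GSGSGS", "GSAGS"]  # hypothetical toy data
    rules = learn_bpe_merges(toy_sequences, num_merges=2)
    print(rules)
    print(tokenize("GSGSA", rules))
```

In this toy corpus the residues G and S co-occur often enough to be merged into one token, which mirrors the abstract's point that frequently occurring residue combinations can be treated as single vocabulary items.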
Affiliation(s)
- Yang Tan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Mingchen Li
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Ziyi Zhou
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Pan Tan
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
- Huiqun Yu
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Liang Hong
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
29
Wang H, Chen M, Wei X, Xia R, Pei D, Huang X, Han B. Computational tools for plant genomics and breeding. Sci China Life Sci 2024; 67:1579-1590. [PMID: 38676814 DOI: 10.1007/s11427-024-2578-6] [Received: 02/05/2024] [Accepted: 03/25/2024] [Indexed: 04/29/2024]
Abstract
Plant genomics and crop breeding are at the intersection of biotechnology and information technology. Driven by a combination of high-throughput sequencing, molecular biology and data science, great advances have been made in omics technologies at every step along the central dogma, especially in genome assembling, genome annotation, epigenomic profiling, and transcriptome profiling. These advances further revolutionized three directions of development. One is genetic dissection of complex traits in crops, along with genomic prediction and selection. The second is comparative genomics and evolution, which open up new opportunities to depict the evolutionary constraints of biological sequences for deleterious variant discovery. The third direction is the development of deep learning approaches for the rational design of biological sequences, especially proteins, for synthetic biology. All three directions of development serve as the foundation for a new era of crop breeding where agronomic traits are enhanced by genome design.
Affiliation(s)
- Hai Wang
  - State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, 100193, China
  - Sanya Institute of China Agricultural University, Sanya, 572025, China
  - Hainan Yazhou Bay Seed Laboratory, Sanya, 572025, China
- Mengjiao Chen
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xin Wei
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Rui Xia
  - College of Horticulture, South China Agricultural University, Guangzhou, 510640, China
- Dong Pei
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xuehui Huang
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Bin Han
  - National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200233, China
30
Ding K, Chin M, Zhao Y, Huang W, Mai BK, Wang H, Liu P, Yang Y, Luo Y. Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nat Commun 2024; 15:6392. [PMID: 39080249 PMCID: PMC11289365 DOI: 10.1038/s41467-024-50698-y] [Received: 05/29/2024] [Accepted: 07/19/2024] [Indexed: 08/02/2024] [Open Access]
Abstract
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
Affiliation(s)
- Kerr Ding
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Michael Chin
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
- Yunlong Zhao
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
- Wei Huang
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
- Binh Khanh Mai
  - Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Huanan Wang
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
- Peng Liu
  - Department of Chemistry, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Yang Yang
  - Department of Chemistry and Biochemistry, University of California, Santa Barbara, CA, 93106, USA
  - Biomolecular Science and Engineering (BMSE) Program, University of California, Santa Barbara, CA, 93106, USA
- Yunan Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
31
Zeng T, Jin Z, Zheng S, Yu T, Wu R. Developing BioNavi for Hybrid Retrosynthesis Planning. JACS Au 2024; 4:2492-2502. [PMID: 39055138 PMCID: PMC11267531 DOI: 10.1021/jacsau.4c00228] [Received: 03/11/2024] [Revised: 06/18/2024] [Accepted: 06/20/2024] [Indexed: 07/27/2024]
Abstract
Illuminating synthetic pathways is essential for producing valuable chemicals, such as bioactive molecules. Chemical and biological syntheses are crucial, and their integration often leads to more efficient and sustainable pathways. Despite the rapid development of retrosynthesis models, few of them consider both chemical and biological syntheses, hindering the pathway design for high-value chemicals. Here, we propose BioNavi by innovating multitask learning and reaction templates into the deep learning-driven model to design hybrid synthesis pathways in a more interpretable manner. BioNavi outperforms existing approaches on different data sets, achieving a 75% hit rate in replicating reported biosynthetic pathways and displaying superior ability in designing hybrid synthesis pathways. Additional case studies further illustrate the potential application of BioNavi in a de novo pathway design. The enhanced web server (http://biopathnavi.qmclab.com/bionavi/) simplifies input operations and implements step-by-step exploration according to user experience. We show that BioNavi is a handy navigator for designing synthetic pathways for various chemicals.
Affiliation(s)
- Tao Zeng
  - School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006, P. R. China
- Zhehao Jin
  - Center for Synthetic Biochemistry, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (CAS), Shenzhen 518055, P. R. China
- Shuangjia Zheng
  - Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai 200240, P. R. China
- Tao Yu
  - Center for Synthetic Biochemistry, CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (CAS), Shenzhen 518055, P. R. China
- Ruibo Wu
  - School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou 510006, P. R. China
32
Zhou Z, Zhang L, Yu Y, Wu B, Li M, Hong L, Tan P. Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning. Nat Commun 2024; 15:5566. [PMID: 38956442 PMCID: PMC11219809 DOI: 10.1038/s41467-024-49798-6] [Received: 02/02/2024] [Accepted: 06/11/2024] [Indexed: 07/04/2024] [Open Access]
Abstract
Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
Collapse
Affiliation(s)
- Ziyi Zhou
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Liang Zhang
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yuanxi Yu
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Banghao Wu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Mingchen Li
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
| | - Liang Hong
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China.
- Zhang Jiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, 201203, China.
| | - Pan Tan
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China.
|
33
|
Hong L, Zhang Z, Wang Z, Yu X, Zhang J. Phase separation provides a mechanism to drive phenotype switching. Phys Rev E 2024; 109:064414. [PMID: 39021038 DOI: 10.1103/physreve.109.064414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Accepted: 06/05/2024] [Indexed: 07/20/2024]
Abstract
Phenotypic switching plays a crucial role in cell fate determination across various organisms. Recent experimental findings highlight the significance of protein compartmentalization via liquid-liquid phase separation in influencing such decisions. However, the precise mechanism through which phase separation regulates phenotypic switching remains elusive. To investigate this, we established a mathematical model that couples a phase separation process and a gene expression process with feedback. We used the chemical master equation theory and mean-field approximation to study the effects of phase separation on the gene expression products. We found that phase separation can cause bistability and bimodality. Furthermore, phase separation can control the bistable properties of the system, such as bifurcation points and bistable ranges. On the other hand, in stochastic dynamics, the droplet phase exhibits double peaks within a more extensive phase separation threshold range than the dilute phase, indicating the pivotal role of the droplet phase in cell fate decisions. These findings propose an alternative mechanism that influences cell fate decisions through the phase separation process. As phase separation is increasingly discovered in gene regulatory networks, related modeling research can help build biomolecular systems with desired properties and offer insights into explaining cell fate decisions.
|
34
|
Zhou B, Zheng L, Wu B, Tan Y, Lv O, Yi K, Fan G, Hong L. Protein Engineering with Lightweight Graph Denoising Neural Networks. J Chem Inf Model 2024; 64:3650-3661. [PMID: 38630581 DOI: 10.1021/acs.jcim.4c00036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Protein engineering faces challenges in finding optimal mutants from a massive pool of candidate mutants. In this study, we introduce a deep-learning-based data-efficient fitness prediction tool to steer protein engineering. Our methodology establishes a lightweight graph neural network scheme for protein structures, which efficiently analyzes the microenvironment of amino acids in wild-type proteins and reconstructs the distribution of the amino acid sequences that are more likely to pass natural selection. This distribution serves as a general guidance for scoring proteins toward arbitrary properties on any order of mutations. Our proposed solution undergoes extensive wet-lab experimental validation spanning diverse physicochemical properties of various proteins, including fluorescence intensity, antigen-antibody affinity, thermostability, and DNA cleavage activity. More than 40% of ProtLGN-designed single-site mutants outperform their wild-type counterparts across all studied proteins and targeted properties. More importantly, our model can bypass the negative epistatic effect to combine single mutation sites and form deep mutants with up to seven mutation sites in a single round, whose physicochemical properties are significantly improved. This observation provides compelling evidence of the structure-based model's potential to guide deep mutations in protein engineering. Overall, our approach emerges as a versatile tool for protein engineering, benefiting both the computational and bioengineering communities.
Affiliation(s)
- Bingxin Zhou
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
| | - Lirong Zheng
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Banghao Wu
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Tan
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
| | - Outongyi Lv
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Kai Yi
- School of Mathematics and Statistics, University of New South Wales, Sydney 2052, Australia
| | - Guisheng Fan
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Liang Hong
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
- Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai 201203, China
|
35
|
Wu JS, Liu Y, Ge F, Yu DJ. Prediction of protein-ATP binding residues using multi-view feature learning via contextual-based co-attention network. Comput Biol Med 2024; 172:108227. [PMID: 38460308 DOI: 10.1016/j.compbiomed.2024.108227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 01/17/2024] [Accepted: 02/25/2024] [Indexed: 03/11/2024]
Abstract
Accurately predicting protein-ATP binding residues is critical for protein function annotation and drug discovery. Computational methods that predict binding residues from protein sequence information have improved notably in accuracy. Nevertheless, these methods still face several challenges, including limited means of extracting more discriminative features and inadequate algorithms for integrating protein-level and residue-level information. To address these problems, we propose ATP-Deep, a novel predictor of protein-ATP binding residues. ATP-Deep harnesses unsupervised pre-trained language models and incorporates domain-specific evolutionary context information from homologous sequences. It further refines the residue-level embedding by integrating the corresponding protein-level information and employs a contextual-based co-attention mechanism to fuse multiple sources of features. On the two benchmark datasets, ATP-Deep achieves AUCs of 0.954 and 0.951, respectively, surpassing the state-of-the-art model. These findings underscore the effectiveness of assimilating protein-level information and deploying a contextual-based co-attention mechanism to improve the prediction of protein-ATP binding residues.
Affiliation(s)
- Jia-Shun Wu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Yan Liu
- School of Information Engineering, Yangzhou University, 196 West Huayang, Yangzhou, 225100, China
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China.
|
36
|
Bibik P, Alibai S, Pandini A, Dantu SC. PyCoM: a python library for large-scale analysis of residue-residue coevolution data. Bioinformatics 2024; 40:btae166. [PMID: 38532297 PMCID: PMC11009027 DOI: 10.1093/bioinformatics/btae166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/02/2024] [Accepted: 03/25/2024] [Indexed: 03/28/2024] Open
Abstract
MOTIVATION: Computational methods to detect correlated amino acid positions in proteins have become a valuable tool to predict intra- and inter-residue protein contacts, protein structures, and effects of mutation on protein stability and function. While there are many tools and webservers to compute coevolution scoring matrices, there is no central repository of alignments and coevolution matrices for large-scale studies and pattern detection leveraging on biological and structural annotations already available in UniProt. RESULTS: We present a Python library, PyCoM, which enables users to query and analyze coevolution matrices and sequence alignments of 457 622 proteins, selected from UniProtKB/Swiss-Prot database (length ≤ 500 residues), from a precompiled coevolution matrix database (PyCoMdb). PyCoM facilitates the development of statistical analyses of residue coevolution patterns using filters on biological and structural annotations from UniProtKB/Swiss-Prot, with simple access to PyCoMdb for both novice and advanced users, supporting Jupyter Notebooks, Python scripts, and a web API access. The resource is open source and will help in generating data-driven computational models and methods to study and understand protein structures, stability, function, and design. AVAILABILITY AND IMPLEMENTATION: PyCoM code is freely available from https://github.com/scdantu/pycom and PyCoMdb and the Jupyter Notebook tutorials are freely available from https://pycom.brunel.ac.uk.
Affiliation(s)
- Philipp Bibik
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Sabriyeh Alibai
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Alessandro Pandini
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
| | - Sarath Chandra Dantu
- Department of Computer Science, Brunel University London, Uxbridge UB8 3PH, United Kingdom
|
37
|
Judge A, Sankaran B, Hu L, Palaniappan M, Birgy A, Prasad BVV, Palzkill T. Network of epistatic interactions in an enzyme active site revealed by large-scale deep mutational scanning. Proc Natl Acad Sci U S A 2024; 121:e2313513121. [PMID: 38483989 PMCID: PMC10962969 DOI: 10.1073/pnas.2313513121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Accepted: 02/14/2024] [Indexed: 03/19/2024] Open
Abstract
Cooperative interactions between amino acids are critical for protein function. A genetic reflection of cooperativity is epistasis, which is when a change in the amino acid at one position changes the sequence requirements at another position. To assess epistasis within an enzyme active site, we utilized CTX-M β-lactamase as a model system. CTX-M hydrolyzes β-lactam antibiotics to provide antibiotic resistance, allowing a simple functional selection for rapid sorting of modified enzymes. We created all pairwise mutations across 17 active site positions in the β-lactamase enzyme and quantitated the function of variants against two β-lactam antibiotics using next-generation sequencing. Context-dependent sequence requirements were determined by comparing the antibiotic resistance function of double mutations across the CTX-M active site to their predicted function based on the constituent single mutations, revealing both positive epistasis (synergistic interactions) and negative epistasis (antagonistic interactions) between amino acid substitutions. The resulting trends demonstrate that positive epistasis is present throughout the active site, that epistasis between residues is mediated through substrate interactions, and that residues more tolerant to substitutions serve as generic compensators which are responsible for many cases of positive epistasis. Additionally, we show that a key catalytic residue (Glu166) is amenable to compensatory mutations, and we characterize one such double mutant (E166Y/N170G) that acts by an altered catalytic mechanism. These findings shed light on the unique biochemical factors that drive epistasis within an enzyme active site and will inform enzyme engineering efforts by bridging the gap between amino acid sequence and catalytic function.
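The comparison described in this abstract, observed double-mutant function against the function predicted from the constituent single mutants, can be illustrated with a small calculation. The sketch below assumes a log-additive null model, in which the log fitness effects of the two single mutations are expected to add up. The abstract does not state which null model the authors used, so this choice, the function names, and the toy fitness values are all assumptions for illustration.

```python
import math


def log_additive_epistasis(w_wt, w_a, w_b, w_ab):
    """Epistasis score for a double mutant AB relative to wild type.

    Assumed null model (not confirmed by the abstract): single-mutant
    effects are additive on a log scale, so the expected double-mutant
    effect is log(w_a / w_wt) + log(w_b / w_wt).
    """
    observed = math.log(w_ab / w_wt)
    expected = math.log(w_a / w_wt) + math.log(w_b / w_wt)
    return observed - expected


def classify(eps, tol=1e-9):
    """Label the sign of an epistasis score; tol absorbs rounding error."""
    if eps > tol:
        return "positive"   # double mutant works better than expected
    if eps < -tol:
        return "negative"   # double mutant works worse than expected
    return "none"


# Hypothetical fitness values: each single mutant halves function.
print(classify(log_additive_epistasis(1.0, 0.5, 0.5, 0.25)))  # matches expectation
print(classify(log_additive_epistasis(1.0, 0.5, 0.5, 1.0)))   # compensatory pair
print(classify(log_additive_epistasis(1.0, 0.5, 0.5, 0.1)))   # antagonistic pair
```

In a real deep mutational scan the tolerance would be replaced by a threshold derived from measurement noise, since a score near zero cannot be distinguished from experimental error.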
Affiliation(s)
- Allison Judge
- Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
| | - Banumathi Sankaran
- Department of Molecular Biophysics and Integrated Bioimaging, Berkeley Center for Structural Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
| | - Liya Hu
- Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
| | - Murugesan Palaniappan
- Department of Pathology and Immunology, Center for Drug Discovery, Baylor College of Medicine, Houston, TX 77030
| | - André Birgy
- Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
- Infections, Antimicrobials, Modelling, Evolution, UMR 1137, French Institute for Medical Research (INSERM), Faculty of Health, Université Paris Cité, Paris 75006, France
| | - B. V. Venkataram Prasad
- Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
| | - Timothy Palzkill
- Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
|
38
|
Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024; 30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Indexed: 02/24/2024]
Abstract
Directed evolution is a seminal technology for generating novel protein functionalities and a cornerstone of biocatalysis, metabolic engineering, and synthetic biology. With the development of various mutagenesis methods and advanced analytical instruments, the challenges of diversity generation and high-throughput screening are largely solved. One of the remaining challenges is how to combine single beneficial substitutions through recombination so that their effects reinforce one another epistatically. This review surveys experimental and computer-assisted recombination methods used in protein engineering campaigns. In addition, integrated and machine learning-guided strategies are highlighted to show how these recombination approaches help generate screening libraries with better diversity, coverage, and size. Finally, a decision tree is provided to guide the selection of suitable recombination strategies in practice and so accelerate protein engineering.
Affiliation(s)
- Xinyue Wang
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Anni Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Xiujuan Li
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
| | - Haiyang Cui
- School of Life Sciences, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
|
39
|
Zhang S, Ma Z, Li W, Shen Y, Xu Y, Liu G, Chang J, Li Z, Qin H, Tian B, Gong H, Liu D, Thuronyi B, Voigt C. EvoAI enables extreme compression and reconstruction of the protein sequence space. Research Square 2024:rs.3.rs-3930833. [PMID: 38464127 PMCID: PMC10925456 DOI: 10.21203/rs.3.rs-3930833/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
|
40
|
Hassan J, Saeed SM, Deka L, Uddin MJ, Das DB. Applications of Machine Learning (ML) and Mathematical Modeling (MM) in Healthcare with Special Focus on Cancer Prognosis and Anticancer Therapy: Current Status and Challenges. Pharmaceutics 2024; 16:260. [PMID: 38399314 PMCID: PMC10892549 DOI: 10.3390/pharmaceutics16020260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/29/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024] Open
Abstract
The use of data-driven high-throughput analytical techniques, which has given rise to computational oncology, is undisputed. The widespread use of machine learning (ML) and mathematical modeling (MM)-based techniques is widely acknowledged. These two approaches have fueled the advancement in cancer research and eventually led to the uptake of telemedicine in cancer care. For diagnostic, prognostic, and treatment purposes concerning different types of cancer research, vast databases of varied information with manifold dimensions are required, and indeed, all this information can only be managed by an automated system developed utilizing ML and MM. In addition, MM is being used to probe the relationship between the pharmacokinetics and pharmacodynamics (PK/PD interactions) of anti-cancer substances to improve cancer treatment, and also to refine the quality of existing treatment models by being incorporated at all steps of research and development related to cancer and in routine patient care. This review will serve as a consolidation of the advancement and benefits of ML and MM techniques with a special focus on the area of cancer prognosis and anticancer therapy, leading to the identification of challenges (data quantity, ethical consideration, and data privacy) which are yet to be fully addressed in current studies.
Affiliation(s)
- Jasmin Hassan
- Drug Delivery & Therapeutics Lab, Dhaka 1212, Bangladesh (J.H., S.M.S.)
| | - Lipika Deka
- Faculty of Computing, Engineering and Media, De Montfort University, Leicester LE1 9BH, UK
| | - Md Jasim Uddin
- Department of Pharmaceutical Technology, Faculty of Pharmacy, Universiti Malaya, Kuala Lumpur 50603, Malaysia
| | - Diganta B. Das
- Department of Chemical Engineering, Loughborough University, Loughborough LE11 3TU, UK
|
41
|
Nourbakhsh M, Degn K, Saksager A, Tiberti M, Papaleo E. Prediction of cancer driver genes and mutations: the potential of integrative computational frameworks. Brief Bioinform 2024; 25:bbad519. [PMID: 38261338 PMCID: PMC10805075 DOI: 10.1093/bib/bbad519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 11/27/2023] [Accepted: 12/11/2023] [Indexed: 01/24/2024] Open
Abstract
The vast amount of available sequencing data allows the scientific community to explore different genetic alterations that may drive cancer or favor cancer progression. Software developers have proposed a myriad of predictive tools, allowing researchers and clinicians to compare and prioritize driver genes and mutations and their relative pathogenicity. However, there is little consensus on the computational approach or on a gold standard for comparison. Hence, benchmarking the different tools depends highly on the input data, indicating that overfitting is still a massive problem. One of the solutions is to limit the scope and usage of specific tools. However, such limitations force researchers to walk a tightrope between creating and using high-quality tools for a specific purpose and describing the complex alterations driving cancer. While the knowledge of cancer development increases daily, many bioinformatic pipelines rely on single nucleotide variants or alterations in a vacuum without accounting for cellular compartments, mutational burden or disease progression. Even within bioinformatics and computational cancer biology, the research fields work in silos, risking overlooking potential synergies or breakthroughs. Here, we provide an overview of databases and datasets for building or testing predictive cancer driver tools. Furthermore, we introduce predictive tools for driver genes, driver mutations, and the impact of these based on structural analysis. Additionally, we recommend directions for the field that avoid siloed research and move towards integrative frameworks.
Affiliation(s)
- Mona Nourbakhsh
- Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
| | - Kristine Degn
- Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
| | - Astrid Saksager
- Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
| | - Matteo Tiberti
- Cancer Structural Biology, Danish Cancer Institute, 2100 Copenhagen, Denmark
| | - Elena Papaleo
- Cancer Systems Biology, Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, 2800 Lyngby, Denmark
- Cancer Structural Biology, Danish Cancer Institute, 2100 Copenhagen, Denmark
|
42
|
Xi C, Diao J, Moon TS. Advances in ligand-specific biosensing for structurally similar molecules. Cell Syst 2023; 14:1024-1043. [PMID: 38128482 PMCID: PMC10751988 DOI: 10.1016/j.cels.2023.10.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Revised: 08/23/2023] [Accepted: 10/19/2023] [Indexed: 12/23/2023]
Abstract
The specificity of biological systems makes it possible to develop biosensors targeting specific metabolites, toxins, and pollutants in complex medical or environmental samples without interference from structurally similar compounds. For the last two decades, great efforts have been devoted to creating proteins or nucleic acids with novel properties through synthetic biology strategies. Beyond augmenting biocatalytic activity, expanding target substrate scopes, and enhancing enzymes' enantioselectivity and stability, an increasing research area is the enhancement of molecular specificity for genetically encoded biosensors. Here, we summarize recent advances in the development of highly specific biosensor systems and their essential applications. First, we describe the rational design principles required to create libraries containing potential mutants with less promiscuity or better specificity. Next, we review the emerging high-throughput screening techniques to engineer biosensing specificity for the desired target. Finally, we examine the computer-aided evaluation and prediction methods to facilitate the construction of ligand-specific biosensors.
Affiliation(s)
- Chenggang Xi
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA
| | - Jinjin Diao
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA
| | - Tae Seok Moon
- Department of Energy, Environmental and Chemical Engineering, Washington University in St. Louis, St. Louis, MO, USA; Division of Biology and Biomedical Sciences, Washington University in St. Louis, St. Louis, MO, USA.
|
43
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
|
44
|
Luo Y, Liu Y, Peng J. Calibrated geometric deep learning improves kinase-drug binding predictions. Nat Mach Intell 2023; 5:1390-1401. [PMID: 38962391 PMCID: PMC11221792 DOI: 10.1038/s42256-023-00751-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 12/20/2022] [Accepted: 09/29/2023] [Indexed: 07/05/2024]
Abstract
Protein kinases regulate various cellular functions and hold significant pharmacological promise in cancer and other diseases. Although kinase inhibitors are one of the largest groups of approved drugs, much of the human kinome remains unexplored but potentially druggable. Computational approaches, such as machine learning, offer efficient solutions for exploring kinase-compound interactions and uncovering novel binding activities. Despite the increasing availability of three-dimensional (3D) protein and compound structures, existing methods predominantly focus on exploiting local features from one-dimensional protein sequences and two-dimensional molecular graphs to predict binding affinities, overlooking the 3D nature of the binding process. Here we present KDBNet, a deep learning algorithm that incorporates 3D protein and molecule structure data to predict binding affinities. KDBNet uses graph neural networks to learn structure representations of protein binding pockets and drug molecules, capturing the geometric and spatial characteristics of binding activity. In addition, we introduce an algorithm to quantify and calibrate the uncertainties of KDBNet's predictions, enhancing its utility in model-guided discovery in chemical or protein space. Experiments demonstrated that KDBNet outperforms existing deep learning models in predicting kinase-drug binding affinities. The uncertainties estimated by KDBNet are informative and well-calibrated with respect to prediction errors. When integrated with a Bayesian optimization framework, KDBNet enables data-efficient active learning and accelerates the exploration and exploitation of diverse high-binding kinase-drug pairs.
Affiliation(s)
- Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- These authors contributed equally: Yunan Luo, Yang Liu
| | - Yang Liu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
- These authors contributed equally: Yunan Luo, Yang Liu
| | - Jian Peng
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA
|
45
|
Qu Y, Niu Z, Ding Q, Zhao T, Kong T, Bai B, Ma J, Zhao Y, Zheng J. Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction. Int J Mol Sci 2023; 24:16496. [PMID: 38003686 PMCID: PMC10671426 DOI: 10.3390/ijms242216496] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/13/2023] [Revised: 11/11/2023] [Accepted: 11/17/2023] [Indexed: 11/26/2023] Open
Abstract
Machine learning has been increasingly utilized in protein engineering, and research on predicting the effects of protein mutations has attracted growing attention. So far, the best results have been achieved by methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the hidden evolutionary rules in those sequences and can therefore predict fitness from sequence alone. Although many such models and methods have been successfully employed in practical protein engineering, most studies have been limited to constructing more complex language models that capture richer sequence feature information and to using this information for unsupervised protein fitness prediction. Considerable potential remains untapped in these models, such as whether prediction accuracy can be further improved by integrating different models. Furthermore, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly, given the nonlinear relationship between protein fitness and the quantification of specific functionalities. In this study, we propose an ensemble learning approach for predicting mutational effects of proteins that integrates protein sequence features extracted from multiple large protein language models with evolutionarily coupled features extracted from homologous sequences, and we compare linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutation scans and found that the integrated approach together with linear regression gives the models higher prediction accuracy and generalization. Moreover, we further illustrated the reliability of the integrated approach by exploring the differences in the predictive performance of the models across species and protein sequence lengths, as well as by visualizing clustering of ensemble and non-ensemble features.
Affiliation(s)
- Yang Qu
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Zitong Niu
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Qiaojiao Ding
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Taowa Zhao
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Tong Kong
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Bing Bai
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Jianwei Ma
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Yitian Zhao
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| | - Jianping Zheng
- Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
- Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
| |
|
46
|
Parthiban S, Vijeesh T, Gayathri T, Shanmugaraj B, Sharma A, Sathishkumar R. Artificial intelligence-driven systems engineering for next-generation plant-derived biopharmaceuticals. Front Plant Sci 2023; 14:1252166. [PMID: 38034587 PMCID: PMC10684705 DOI: 10.3389/fpls.2023.1252166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023]
Abstract
Recombinant biopharmaceuticals including antigens, antibodies, hormones, cytokines, single-chain variable fragments, and peptides have been used as vaccines, diagnostics and therapeutics. Plant molecular pharming is a robust platform that uses plants as an expression system to produce simple and complex recombinant biopharmaceuticals on a large scale. The plant system has several advantages over other host systems, such as humanized expression, glycosylation, scalability, reduced risk of human or animal pathogenic contaminants, and rapid, cost-effective production. Despite these advantages, the expression of recombinant proteins in plant systems is hindered by factors such as non-human post-translational modifications, protein misfolding, conformation changes and instability. Artificial intelligence (AI) plays a vital role in various fields of biotechnology, and in plant molecular pharming a significant increase in yield and stability can be achieved by applying AI-based multi-approach strategies to overcome these hindrances. Current limitations of plant-based recombinant biopharmaceutical production can be circumvented with the aid of synthetic biology tools and AI algorithms in plant-based glycan engineering for protein folding, stability, viability, catalytic activity and organelle targeting. The AI models, including but not limited to neural networks, support vector machines, linear regression, Gaussian processes and regressor ensembles, work by predicting the training and experimental data sets to design and validate the protein structures, thereby optimizing properties such as thermostability, catalytic activity, antibody affinity, and protein folding. This review focuses on integrating systems engineering approaches and AI-based machine learning and deep learning algorithms in protein engineering and host engineering to augment protein production in plant systems to meet the ever-expanding therapeutics market.
Affiliation(s)
- Subramanian Parthiban
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Thandarvalli Vijeesh
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Thashanamoorthi Gayathri
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Balamurugan Shanmugaraj
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Ashutosh Sharma
- Tecnologico de Monterrey, School of Engineering and Sciences, Centre of Bioengineering, Queretaro, Mexico
| | - Ramalingam Sathishkumar
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| |
|
47
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| |
|
48
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
|
49
|
Xiao B, Zhang C, Zhou J, Wang S, Meng H, Wu M, Zheng Y, Yu R. Design of SC PEP with enhanced stability against pepsin digestion and increased activity by machine learning and structural parameters modeling. Int J Biol Macromol 2023; 250:125933. [PMID: 37482154 DOI: 10.1016/j.ijbiomac.2023.125933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 06/20/2023] [Accepted: 07/20/2023] [Indexed: 07/25/2023]
Abstract
Prolyl endopeptidase from Sphingomonas capsulata (SC PEP) has attracted much attention as a promising oral therapy candidate for celiac sprue; however, its low stability in the gastric environment leads to unsatisfactory clinical results. Improving its stability against pepsin digestion at low pH is therefore crucial for clinical applications, but challenging. In this study, machine learning and a physical parameter model were combined to design SC PEP mutants. After iterations, 20 mutants had higher hydrolysis activity in a stomach environment, up to 14.1-fold that of wild-type SC PEP. Mutant M24, which combines stabilizing and activating mutations, and pegylated M24 (M24-PEG) hydrolyzed immunogens in bread more actively than wild-type SC PEP in vitro and in vivo, and residual immunogens in a simulated gastric environment were only 1/8 and 1/10 of those in the wild-type SC PEP group. The total residual immunogens in the gastrointestinal tract of mice in the M24 and M24-PEG groups were <20 ppm, reaching the standard of non-toxic food. Our results indicate that the combination of M24 (or M24-PEG) with EP-B2 may be a promising candidate for celiac disease, and the strategies developed in this study provide a paradigm for the design of SC PEP stability mutants.
Affiliation(s)
- Bin Xiao
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Chun Zhang
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Junxiu Zhou
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Sa Wang
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Huan Meng
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Miao Wu
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
| | - Yongxiang Zheng
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China.
| | - Rong Yu
- Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China.
| |
|
50
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
| |
|