1
|
Boob AG, Tan SI, Zaidi A, Singh N, Xue X, Zhou S, Martin TA, Chen LQ, Zhao H. Design of diverse, functional mitochondrial targeting sequences across eukaryotic organisms using variational autoencoder. Nat Commun 2025; 16:4151. [PMID: 40320395 PMCID: PMC12050285 DOI: 10.1038/s41467-025-59499-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Accepted: 04/16/2025] [Indexed: 05/08/2025] Open
Abstract
Mitochondria play a key role in energy production and metabolism, making them a promising target for metabolic engineering and disease treatment. However, despite the known influence of passenger proteins on localization efficiency, only a few protein-localization tags have been characterized for mitochondrial targeting. To address this limitation, we leverage a Variational Autoencoder to design novel mitochondrial targeting sequences. In silico analysis reveals that a high fraction of the generated peptides (90.14%) are functional and possess features important for mitochondrial targeting. We characterize artificial peptides in four eukaryotic organisms and, as a proof-of-concept, demonstrate their utility in increasing 3-hydroxypropionic acid titers through pathway compartmentalization and improving 5-aminolevulinate synthase delivery by 1.62-fold and 4.76-fold, respectively. Moreover, we employ latent space interpolation to shed light on the evolutionary origins of dual-targeting sequences. Overall, our work demonstrates the potential of generative artificial intelligence for both fundamental research and practical applications in mitochondrial biology.
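The latent space interpolation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which interpolation scheme the authors used, so the straight-line (linear) path below is an assumption, and `interpolate_latent` is a hypothetical helper name. In practice each intermediate vector would be passed through a trained VAE decoder, which is not modelled here.

```python
from typing import List, Sequence


def interpolate_latent(
    z_start: Sequence[float], z_end: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced points on the line from z_start to z_end.

    Assumption: linear interpolation; the paper may use another scheme.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")

    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

Decoding each point along such a path is one way to inspect how sequence features change gradually between two classes of peptide, for example between a mitochondrial and a dual-targeting sequence.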
Affiliation(s)
- Aashutosh Girish Boob
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Shih-I Tan
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Airah Zaidi
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Nilmani Singh
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Xueyi Xue
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Shuaizhen Zhou
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Teresa A Martin
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Li-Qing Chen
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
| | - Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
| |
|
2
|
Chen Q, Zhang Y, Gao J, Zhang J. CPPCGM: A Highly Efficient Sequence-Based Tool for Simultaneously Identifying and Generating Cell-Penetrating Peptides. J Chem Inf Model 2025; 65:3357-3369. [PMID: 40105337 DOI: 10.1021/acs.jcim.5c00199] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]
Abstract
Cell-penetrating peptides (CPPs) are usually short oligopeptides with 5-30 amino acid residues. CPPs have been proven as important drug delivery vehicles into cells through different mechanisms, demonstrating their potential as therapeutic candidates. However, experimental screening and synthesis of CPPs can be time-consuming and expensive. Recently, numerous attempts have been made to develop computational methods as a cost-effective way of screening potential CPP candidates. Despite significant advancements, current methods exhibit limited feature representation capabilities, thereby constraining the potential for further performance enhancements. In this study, we developed a deep learning framework called CPPCGM, which uses protein language models (PLMs) to identify and generate novel CPPs. There are two separate blocks in this framework: CPPClassifier and CPPGenerator. The former utilizes three pretrained models for simple voting, thereby accurately categorizing CPPs and non-CPPs. The latter, similar to a generative adversarial network in that it includes a discriminator and a generator, generates peptides that are not present in the training data set. Our proposed CPPCGM has achieved remarkably high Matthews correlation coefficient scores of 0.876, 0.923, and 0.664 on three data sets based on the classification results. Compared with the state-of-the-art methods, the performance of our method is significantly improved. The results also demonstrated the generating potential of CPPCGM through qualitative and quantitative evaluation of the generated samples. Significantly, using PLM-based methods can optimize peptides for biochemical functions, benefiting drug delivery and biomedical applications. Related materials are publicly available at https://github.com/QiufenChen/CPPCGM.
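The two ideas named in the abstract, simple voting across three classifiers and the Matthews correlation coefficient (MCC), can be sketched as follows. This is an illustrative sketch, not the CPPCGM code: `majority_vote` and `matthews_corrcoef` are hypothetical helper names, labels are assumed to be binary (1 = CPP, 0 = non-CPP), and the MCC is computed from the standard confusion-matrix formula.

```python
import math
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model binary predictions; a sample is 1 if most models say 1."""
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, which is why it is often preferred over accuracy on imbalanced peptide data sets.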
Affiliation(s)
- Qiufen Chen
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
| | - Yuewei Zhang
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
| | - Jiali Gao
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Department of Chemistry and Supercomputing Institute, University of Minnesota, Minneapolis, Minnesota 55455, United States
| | - Jun Zhang
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
| |
|
3
|
Wang K, Zhu M, Boulila W, Driss M, Gadekallu TR, Chen CM, Wang L, Kumari S, Yiu SM. SeqNovo: De Novo Peptide Sequencing Prediction in IoMT via Seq2Seq. IEEE J Biomed Health Inform 2025; 29:2377-2387. [PMID: 37792659 DOI: 10.1109/jbhi.2023.3321780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/06/2023]
Abstract
In the Internet of Medical Things (IoMT), de novo peptide sequencing prediction is one of the most important techniques for the fields of disease prediction, diagnosis, and treatment. Recently, deep-learning-based peptide sequencing prediction has become a new trend. However, most popular deep learning models for peptide sequencing prediction suffer from poor interpretability and a poor ability to capture long-range dependencies. To solve these issues, we propose a model named SeqNovo, which has the encoding-decoding structure of sequence to sequence (Seq2Seq), the highly nonlinear properties of the multilayer perceptron (MLP), and the ability of the attention mechanism to capture long-range dependencies. SeqNovo uses the MLP to improve feature extraction and the attention mechanism to discover key information. A series of experiments has been conducted to show that SeqNovo is superior to the Seq2Seq benchmark model, DeepNovo. SeqNovo improves both the accuracy and interpretability of the predictions, which is expected to support more related research.
|
4
|
Li W, Wang X, Chen K, Zhu Y, Yang G, Jin Y, Wang J. Engineered Bacillus subtilis WB600/ZD prevents Salmonella Infantis-induced intestinal inflammation and alters the colon microbiota in a mouse model. Vet Res 2025; 56:35. [PMID: 39920770 PMCID: PMC11806837 DOI: 10.1186/s13567-024-01438-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Accepted: 11/04/2024] [Indexed: 02/09/2025] Open
Abstract
Antimicrobial peptides (AMPs) are instrumental in maintaining intestinal homeostasis and have emerged as potential therapeutic candidates for ameliorating intestinal bacterial infections. However, the intrinsic instability associated with the in vivo delivery of AMPs constitutes a substantial impediment to their therapeutic efficacy in treating infections. In this study, we genetically modified Bacillus subtilis (B. subtilis) WB600 to express Zophobas atratus defensin (ZD), an antimicrobial peptide with broad-spectrum activity isolated from Zophobas atratus, for oral administration. This engineered strain effectively protects against Salmonella Infantis (S. Infantis) infection in mice. Pretreatment with WB600/ZD prevented NF-κB pathway activation induced by S. Infantis infection and increased expression of antioxidant and tight junction proteins, thus alleviating the severity of intestinal inflammation in both the jejunum and ileum (P < 0.01). Moreover, WB600/ZD pretreatment facilitated the growth of beneficial bacteria such as Lachnospiraceae, Butyricicoccus, Eubacterium_xylanophilum, and Clostridia_UCG-014 while decreasing the abundance of pathogenic bacteria such as Escherichia-Shigella and Salmonella (P < 0.05). In conclusion, this study underscores the protective effects of WB600/ZD on S. Infantis-induced intestinal inflammation, suggesting that oral delivery of B. subtilis WB600/ZD may be a promising prophylactic strategy for combating bacterial infections in the intestine.
Affiliation(s)
- Wei Li
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China
- Sanya Institute of China Agricultural University, Sanya, 572025, Hainan, China
| | - Xue Wang
- College of Veterinary Medicine, Inner Mongolia Agricultural University, Hohhot, 010000, China
| | - Keyuan Chen
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China
- Sanya Institute of China Agricultural University, Sanya, 572025, Hainan, China
| | - Yaohong Zhu
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China
- Sanya Institute of China Agricultural University, Sanya, 572025, Hainan, China
| | - Guiyan Yang
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China
- Sanya Institute of China Agricultural University, Sanya, 572025, Hainan, China
| | - Yipeng Jin
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China.
| | - Jiufeng Wang
- College of Veterinary Medicine, China Agricultural University, Beijing, 100193, China.
- Sanya Institute of China Agricultural University, Sanya, 572025, Hainan, China.
| |
|
5
|
Buric F, Viknander S, Fu X, Lemke O, Carmona OG, Zrimec J, Szyrwiel L, Mülleder M, Ralser M, Zelezniak A. Amino acid sequence encodes protein abundance shaped by protein stability at reduced synthesis cost. Protein Sci 2025; 34:e5239. [PMID: 39665261 PMCID: PMC11635393 DOI: 10.1002/pro.5239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 10/11/2024] [Accepted: 11/14/2024] [Indexed: 12/13/2024]
Abstract
Understanding what drives protein abundance is essential to biology, medicine, and biotechnology. Driven by evolutionary selection, an amino acid sequence is tailored to meet the required abundance of a proteome, underscoring the intricate relationship between sequence and functional demand. Yet, the specific role of amino acid sequences in determining proteome abundance remains elusive. Here we show that the amino acid sequence alone encodes over half of protein abundance variation across all domains of life, ranging from bacteria to mouse and human. With an attempt to go beyond predictions, we trained a manageable-size Transformer model to interpret latent factors predictive of protein abundances. Intuitively, the model's attention focused on the protein's structural features linked to stability and metabolic costs related to protein synthesis. To probe these relationships, we introduce MGEM (Mutation Guided by an Embedded Manifold), a methodology for guiding protein abundance through sequence modifications. We find that mutations which increase predicted abundance have significantly altered protein polarity and hydrophobicity, underscoring a connection between protein structural features and abundance. Through molecular dynamics simulations we revealed that abundance-enhancing mutations possibly contribute to protein thermostability by increasing rigidity, which occurs at a lower synthesis cost.
Affiliation(s)
- Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Sandra Viknander
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Xiaozhi Fu
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Oliver Lemke
- Department of Biochemistry, Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Oriol Gracia Carmona
- Randall Centre for Cell & Molecular Biophysics, King's College London, London, UK
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, Slovenia
| | - Lukasz Szyrwiel
- Department of Biochemistry, Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Michael Mülleder
- Core Facility High Throughput Mass Spectrometry, Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Markus Ralser
- Department of Biochemistry, Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Randall Centre for Cell & Molecular Biophysics, King's College London, London, UK
- Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania
| |
|
6
|
Gillani M, Pollastri G. Protein subcellular localization prediction tools. Comput Struct Biotechnol J 2024; 23:1796-1807. [PMID: 38707539 PMCID: PMC11066471 DOI: 10.1016/j.csbj.2024.04.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 04/11/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024] Open
Abstract
Protein subcellular localization prediction is of great significance in bioinformatics and biological research. Because most proteins do not have experimentally determined localization information, computational prediction methods and tools have been an active research area for more than two decades. Knowledge of the subcellular location of a protein provides valuable information about its functionalities, the functioning of the cell, and other possible interactions with proteins. Fast, reliable, and accurate predictors provide platforms to harness the abundance of sequence data to predict subcellular locations. During the last decade, there has been a considerable amount of research effort aimed at developing subcellular localization predictors. This paper reviews recent subcellular localization prediction tools in the Eukaryotic, Prokaryotic, and Virus-based categories, followed by a detailed analysis. Each predictor is discussed based on its main features, strengths, weaknesses, algorithms used, prediction techniques, and analysis. This review is supported by taxonomies of prediction tools that highlight their relevant areas and give examples for straightforward categorization and ease of understanding. These taxonomies help users find suitable tools according to their needs. Furthermore, recent research gaps and challenges are discussed to cover areas that need the utmost attention. This survey provides an in-depth analysis of the most recent prediction tools and can be considered a quick guide for researchers to identify and explore recent advancements in the literature.
Affiliation(s)
- Maryam Gillani
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
| | - Gianluca Pollastri
- School of Computer Science, University College Dublin (UCD), Dublin, D04 V1W8, Ireland
| |
|
7
|
Cao J, Zhang J, Yu Q, Ji J, Li J, He S, Zhu Z. TG-CDDPM: text-guided antimicrobial peptides generation based on conditional denoising diffusion probabilistic model. Brief Bioinform 2024; 26:bbae644. [PMID: 39668337 PMCID: PMC11637771 DOI: 10.1093/bib/bbae644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Revised: 11/13/2024] [Accepted: 11/27/2024] [Indexed: 12/14/2024] Open
Abstract
Antimicrobial peptides (AMPs) have emerged as a promising substitute for antibiotics thanks to their broader range of activities, lower likelihood of drug resistance, and low toxicity. Traditional biochemical methods for AMP discovery are costly and inefficient. Deep generative models, including the long short-term memory model, variational autoencoder model, and generative adversarial model, have been widely introduced to expedite AMP discovery. However, these models tend to suffer from a lack of diversity in the AMPs they generate. The denoising diffusion probabilistic model serves as a good candidate for solving this issue. We proposed a three-stage Text-Guided Conditional Denoising Diffusion Probabilistic Model (TG-CDDPM) to generate novel and homologous AMPs. In the first two stages, contrastive learning and inferring models are crafted to create better conditions for guiding AMP generation, respectively. In the last stage, a pre-trained conditional denoising diffusion probabilistic model is leveraged to enrich the peptide knowledge and fine-tuned to learn feature representation in downstream. TG-CDDPM was compared to the state-of-the-art generative models for AMP generation, and it demonstrated competitive or better performance with the assistance of text description as supervised information. The membrane penetration capabilities of the identified candidate AMPs by TG-CDDPM were also validated through molecular weight dynamics experiments.
Affiliation(s)
- Junhang Cao
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jun Zhang
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
| | - Qiyuan Yu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Junkai Ji
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
| | - Jianqiang Li
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
| | - Shan He
- School of Computer Science, University of Birmingham, Birmingham B15 2TT, United Kingdom
| | - Zexuan Zhu
- National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
| |
|
8
|
Zhou L, Zhang R, Jiang B, Meng Q, Chen J, Liu X. Efficient Production of an Alginate Lyase in Bacillus subtilis with Combined Strategy: Vector and Host Selection, Promoter and Signal Peptide Screening, and Modification of a Translation Initiation Region. J Agric Food Chem 2024; 72:19403-19412. [PMID: 39180506 DOI: 10.1021/acs.jafc.4c03532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/26/2024]
Abstract
Alginate lyases (ALys), whose degradation products, alginate oligosaccharides, exhibit various outstanding biochemical activities, have attracted increasing interest from researchers in the marine bioresource field. However, their predominant sourcing from marine bacteria, with limited yields and unclear genetic backgrounds, presents a challenge for industrial production. In this study, ALys (Aly01) from Vibrio natriegens SK 42.001 was expressed in Bacillus subtilis (B. subtilis), a nonpathogenic microorganism that is generally recognized as safe (GRAS). This accomplishment was realized through a comprehensive strategy involving vector and host selection, promoter and signal peptide screening, and engineering of the ribosome binding site (RBS) and the N-terminal coding sequence (NCS). The optimal combination was identified as pP43NMK and B. subtilis WB600. Among the 19 reported strong promoters, PnprE exhibited the best performance, showing intracellular enzyme activities of 4.47 U/mL. Despite expectations, dual promoter construction did not yield a significant increase. Further, SPydhT demonstrated the highest extracellular activity (1.33 U/mL), which was further improved by RBS/NCS engineering, reaching 4.58 U/mL. Finally, after fed-batch fermentation, the extracellular activity reached 18.01 U/mL, which was the highest of ALys with a high molecular weight expressed in B. subtilis. These findings are expected to offer valuable insights into the heterologous expression of ALys in B. subtilis.
Affiliation(s)
- Licheng Zhou
- School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- International Joint Laboratory on Food Safety, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Ran Zhang
- School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- International Joint Laboratory on Food Safety, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Bo Jiang
- School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- International Joint Laboratory on Food Safety, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Qing Meng
- School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- International Joint Laboratory on Food Safety, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Jingjing Chen
- School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
- International Joint Laboratory on Food Safety, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Xiaoyong Liu
- Shandong Haizhibao Ocean Technology Co., Ltd., Weihai 264333, China
| |
|
9
|
Lipsh-Sokolik R, Fleishman SJ. Addressing epistasis in the design of protein function. Proc Natl Acad Sci U S A 2024; 121:e2314999121. [PMID: 39133844 PMCID: PMC11348311 DOI: 10.1073/pnas.2314999121] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/29/2024] Open
Abstract
Mutations in protein active sites can dramatically improve function. The active site, however, is densely packed and extremely sensitive to mutations. Therefore, some mutations may only be tolerated in combination with others in a phenomenon known as epistasis. Epistasis reduces the likelihood of obtaining improved functional variants and dramatically slows natural and lab evolutionary processes. Research has shed light on the molecular origins of epistasis and its role in shaping evolutionary trajectories and outcomes. In addition, sequence- and AI-based strategies that infer epistatic relationships from mutational patterns in natural or experimental evolution data have been used to design functional protein variants. In recent years, combinations of such approaches and atomistic design calculations have successfully predicted highly functional combinatorial mutations in active sites. These were used to design thousands of functional active-site variants, demonstrating that, while our understanding of epistasis remains incomplete, some of the determinants that are critical for accurate design are now sufficiently understood. We conclude that the space of active-site variants that has been explored by evolution may be expanded dramatically to enhance natural activities or discover new ones. Furthermore, design opens the way to systematically exploring sequence and structure space and mutational impacts on function, deepening our understanding and control over protein activity.
Affiliation(s)
- Rosalie Lipsh-Sokolik
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
| | - Sarel J Fleishman
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
| |
|
10
|
Rutter JW, Dekker L, Clare C, Slendebroek ZF, Owen KA, McDonald JAK, Nair SP, Fedorec AJH, Barnes CP. A bacteriocin expression platform for targeting pathogenic bacterial species. Nat Commun 2024; 15:6332. [PMID: 39068147 PMCID: PMC11283563 DOI: 10.1038/s41467-024-50591-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 07/16/2024] [Indexed: 07/30/2024] Open
Abstract
Bacteriocins are antimicrobial peptides that are naturally produced by many bacteria. They hold great potential in the fight against antibiotic resistant bacteria, including ESKAPE pathogens. Engineered live biotherapeutic products (eLBPs) that secrete bacteriocins can be created to deliver targeted bacteriocin production. Here we develop a modular bacteriocin secretion platform that can be used to express and secrete multiple bacteriocins from non-pathogenic Escherichia coli host strains. As a proof of concept we create Enterocin A (EntA) and Enterocin B (EntB) secreting strains that show strong antimicrobial activity against Enterococcus faecalis and Enterococcus faecium in vitro, and characterise this activity in both solid culture and liquid co-culture. We then develop a Lotka-Volterra model that can be used to capture the interactions of these competitor strains. We show that simultaneous exposure to EntA and EntB can delay Enterococcus growth. Our system has the potential to be used as an eLBP to secrete additional bacteriocins for the targeted killing of pathogenic bacteria.
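A competitive Lotka-Volterra system of the kind mentioned in the abstract can be sketched as below. The abstract does not give the equations or parameters the authors fitted, so the two-species competitive form, the forward-Euler integrator, and every parameter name here (`r1`, `k1`, `a12`, and so on) are assumptions made for illustration only.

```python
from typing import List, Tuple


def simulate_competition(
    n1: float, n2: float,
    r1: float, r2: float,
    k1: float, k2: float,
    a12: float, a21: float,
    dt: float = 0.01, steps: int = 2000,
) -> List[Tuple[float, float]]:
    """Integrate dN_i/dt = r_i * N_i * (1 - (N_i + a_ij * N_j) / K_i) by forward Euler.

    n1 could stand for a bacteriocin-secreting strain and n2 for its target;
    a21 then measures how strongly species 1 suppresses species 2.
    """
    trajectory = [(n1, n2)]
    for _ in range(steps):
        dn1 = r1 * n1 * (1.0 - (n1 + a12 * n2) / k1)
        dn2 = r2 * n2 * (1.0 - (n2 + a21 * n1) / k2)
        # Clamp at zero so numerical error cannot produce negative densities.
        n1 = max(n1 + dt * dn1, 0.0)
        n2 = max(n2 + dt * dn2, 0.0)
        trajectory.append((n1, n2))
    return trajectory
```

Running the model once with the first species absent (`n1 = 0`) and once with it present gives a simple way to compare the target's growth with and without the competitor.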
Affiliation(s)
- Jack W Rutter
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Linda Dekker
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Chania Clare
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Zoe F Slendebroek
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Kimberley A Owen
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Julie A K McDonald
- Centre for Bacterial Resistance Biology, Department of Life Sciences, Imperial College London, London, UK
| | - Sean P Nair
- Department of Microbial Diseases, UCL Eastman Dental Institute, University College London, London, UK
| | - Alex J H Fedorec
- Department of Cell and Developmental Biology, University College London, London, UK
| | - Chris P Barnes
- Department of Cell and Developmental Biology, University College London, London, UK.
| |
|
11
|
Meynard-Piganeau B, Feinauer C, Weigt M, Walczak AM, Mora T. TULIP: A transformer-based unsupervised language model for interacting peptides and T cell receptors that generalizes to unseen epitopes. Proc Natl Acad Sci U S A 2024; 121:e2316401121. [PMID: 38838016 PMCID: PMC11181096 DOI: 10.1073/pnas.2316401121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 04/29/2024] [Indexed: 06/07/2024] Open
Abstract
The accurate prediction of binding between T cell receptors (TCR) and their cognate epitopes is key to understanding the adaptive immune response and developing immunotherapies. Current methods face two significant limitations: the shortage of comprehensive high-quality data and the bias introduced by the selection of the negative training data commonly used in the supervised learning approaches. We propose a method, Transformer-based Unsupervised Language model for Interacting Peptides and T cell receptors (TULIP), that addresses both limitations by leveraging incomplete data and unsupervised learning and using the transformer architecture of language models. Our model is flexible and integrates all possible data sources, regardless of their quality or completeness. We demonstrate the existence of a bias introduced by the sampling procedure used in previous supervised approaches, emphasizing the need for an unsupervised approach. TULIP recognizes the specific TCRs binding an epitope, performing well on unseen epitopes. Our model outperforms state-of-the-art models and offers a promising direction for the development of more accurate TCR epitope recognition models.
Affiliation(s)
- Barthelemy Meynard-Piganeau
- Laboratory of Computational and Quantitative Biology, Institut de Biologie Paris Seine, CNRS, Sorbonne Université, Paris 75005, France
- Department of Computing Sciences, Bocconi University, Milan 20100, Italy
| | | | - Martin Weigt
- Laboratory of Computational and Quantitative Biology, Institut de Biologie Paris Seine, CNRS, Sorbonne Université, Paris 75005, France
| | - Aleksandra M. Walczak
- Laboratoire de Physique de l’Ecole Normale Supérieure, Université Paris Sciences et Lettres, CNRS, Sorbonne Université, Université de Paris Cité, Paris 75005, France
| | - Thierry Mora
- Laboratoire de Physique de l’Ecole Normale Supérieure, Université Paris Sciences et Lettres, CNRS, Sorbonne Université, Université de Paris Cité, Paris 75005, France
| |
|
12
|
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Affiliation(s)
- Clara Fannjiang
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
| |
|
13
|
Min J, Rong X, Zhang J, Su R, Wang Y, Qi W. Computational Design of Peptide Assemblies. J Chem Theory Comput 2024; 20:532-550. [PMID: 38206800 DOI: 10.1021/acs.jctc.3c01054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2024]
Abstract
With the ongoing development of peptide self-assembling materials, there is growing interest in exploring novel functional peptide sequences. From short peptides to long polypeptides, as the functionality increases, the sequence space is also expanding exponentially. Consequently, attempting to explore all functional sequences comprehensively through experience and experiments alone has become impractical. By utilizing computational methods, especially artificial intelligence enhanced molecular dynamics (MD) simulation and de novo peptide design, there has been a significant expansion in the exploration of sequence space. Through these methods, a variety of supramolecular functional materials, including fibers, two-dimensional arrays, nanocages, etc., have been designed by meticulously controlling the inter- and intramolecular interactions. In this review, we first provide a brief overview of the current main computational methods and then focus on the computational design methods for various self-assembled peptide materials. Additionally, we introduce some representative protein self-assemblies to offer guidance for the design of self-assembling peptides.
Affiliation(s)
- Jiwei Min
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
| | - Xi Rong
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
| | - Jiaxing Zhang
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
| | - Rongxin Su
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
- Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300072, P. R. China
- Tianjin Key Laboratory of Membrane Science and Desalination Technology, Tianjin 300072, P. R. China
| | - Yuefei Wang
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
- Tianjin Key Laboratory of Membrane Science and Desalination Technology, Tianjin 300072, P. R. China
| | - Wei Qi
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, P. R. China
- Collaborative Innovation Center of Chemical Science and Engineering (Tianjin), Tianjin 300072, P. R. China
- Tianjin Key Laboratory of Membrane Science and Desalination Technology, Tianjin 300072, P. R. China
| |
|
14
|
Min X, Yang C, Xie J, Huang Y, Liu N, Jin X, Wang T, Kong Z, Lu X, Ge S, Zhang J, Xia N. Tpgen: a language model for stable protein design with a specific topology structure. BMC Bioinformatics 2024; 25:35. [PMID: 38254030 PMCID: PMC10804651 DOI: 10.1186/s12859-024-05637-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open
Abstract
BACKGROUND Natural proteins occupy a small portion of the protein sequence space, whereas artificial proteins can explore a wider range of possibilities within the sequence space. However, specific requirements may not be met when generating sequences blindly. Research indicates that small proteins have notable advantages, including high stability, accurate resolution prediction, and facile specificity modification. RESULTS This study involves the construction of a neural network model named TopoProGenerator (TPGen) using a transformer decoder. The model is trained with sequences consisting of a maximum of 65 amino acids. The training process of TopoProGenerator incorporates reinforcement learning and adversarial learning for fine-tuning. Additionally, it encompasses a stability predictive model trained with a dataset comprising over 200,000 sequences. The results demonstrate that TopoProGenerator is capable of designing stable small protein sequences with specified topology structures. CONCLUSION TPGen has the ability to generate protein sequences that fold into the specified topology, and the pretraining and fine-tuning methods proposed in this study can serve as a framework for designing various types of proteins.
Affiliation(s)
- Xiaoping Min
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Chongzhou Yang
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Jun Xie
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Yang Huang
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Life Sciences, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Nan Liu
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Xiaocheng Jin
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Tianshu Wang
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Zhibo Kong
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Xiaoli Lu
- Information and Networking Center, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Shengxiang Ge
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China.
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China.
- State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China.
| | - Jun Zhang
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
| | - Ningshao Xia
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
| |
|
15
|
Minot M, Reddy ST. Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering. Cell Syst 2024; 15:4-18.e4. [PMID: 38194961 DOI: 10.1016/j.cels.2023.12.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/30/2023] [Revised: 07/21/2023] [Accepted: 12/07/2023] [Indexed: 01/11/2024]
Abstract
Machine learning-guided protein engineering is rapidly progressing; however, collecting high-quality, large datasets remains a bottleneck. Directed evolution and protein engineering studies often require extensive experimental processes to eliminate noise and label protein sequence-function data. Meta learning has proven effective in other fields in learning from noisy data via bi-level optimization given the availability of a small dataset with trusted labels. Here, we leverage meta learning approaches to overcome noisy and under-labeled data and expedite workflows in antibody engineering. We generate yeast display antibody mutagenesis libraries and screen them for target antigen binding followed by deep sequencing. We then create representative learning tasks, including learning from noisy training data, positive and unlabeled learning, and learning out of distribution properties. We demonstrate that meta learning has the potential to reduce experimental screening time and improve the robustness of machine learning models by training with noisy and under-labeled training data.
Affiliation(s)
- Mason Minot
- ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland
| | - Sai T Reddy
- ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland.
| |
|
16
|
Beiki H, Murdoch BM, Park CA, Kern C, Kontechy D, Becker G, Rincon G, Jiang H, Zhou H, Thorne J, Koltes JE, Michal JJ, Davenport K, Rijnkels M, Ross PJ, Hu R, Corum S, McKay S, Smith TPL, Liu W, Ma W, Zhang X, Xu X, Han X, Jiang Z, Hu ZL, Reecy JM. Enhanced bovine genome annotation through integration of transcriptomics and epi-transcriptomics datasets facilitates genomic biology. Gigascience 2024; 13:giae019. [PMID: 38626724 PMCID: PMC11020238 DOI: 10.1093/gigascience/giae019] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/11/2023] [Revised: 07/29/2023] [Accepted: 03/27/2024] [Indexed: 04/18/2024] Open
Abstract
BACKGROUND The accurate identification of the functional elements in the bovine genome is a fundamental requirement for high-quality analysis of data informing both genome biology and genomic selection. Functional annotation of the bovine genome was performed to identify a more complete catalog of transcript isoforms across bovine tissues. RESULTS A total of 160,820 unique transcripts (50% protein coding) representing 34,882 unique genes (60% protein coding) were identified across tissues. Among them, 118,563 transcripts (73% of the total) were structurally validated by independent datasets (PacBio isoform sequencing data, Oxford Nanopore Technologies sequencing data, de novo assembled transcripts from RNA sequencing data) and comparison with Ensembl and NCBI gene sets. In addition, all transcripts were supported by extensive data from different technologies such as whole transcriptome termini site sequencing, RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression, chromatin immunoprecipitation sequencing, and assay for transposase-accessible chromatin using sequencing. A large proportion of identified transcripts (69%) were unannotated, of which 86% were produced by annotated genes and 14% by unannotated genes. A median of two 5' untranslated regions were expressed per gene. Around 50% of protein-coding genes in each tissue were bifunctional and transcribed both coding and noncoding isoforms. Furthermore, we identified 3,744 genes that functioned as noncoding genes in fetal tissues but as protein-coding genes in adult tissues. Our new bovine genome annotation extended more than 11,000 annotated gene borders compared to Ensembl or NCBI annotations. The resulting bovine transcriptome was integrated with publicly available quantitative trait loci data to study tissue-tissue interconnection involved in different traits and construct the first bovine trait similarity network. 
CONCLUSIONS These validated results show significant improvement over current bovine genome annotations.
Affiliation(s)
- Hamid Beiki
- Department of Animal Science, Iowa State University, Ames, IA 50011, USA
| | - Brenda M Murdoch
- Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
| | - Carissa A Park
- Department of Animal Science, Iowa State University, Ames, IA 50011, USA
| | - Chandlar Kern
- Department of Animal Science, Pennsylvania State University, PA 16802, USA
| | - Denise Kontechy
- Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
| | - Gabrielle Becker
- Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
| | | | - Honglin Jiang
- Department of Animal and Poultry Sciences, Virginia Tech, VA 24060, USA
| | - Huaijun Zhou
- Department of Animal Science, University of California, Davis, CA 95616, USA
| | - Jacob Thorne
- Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
| | - James E Koltes
- Department of Animal Science, Iowa State University, Ames, IA 50011, USA
| | - Jennifer J Michal
- Department of Animal Science, Washington State University, WA 99164, USA
| | - Kimberly Davenport
- Department of Animal and Veterinary and Food Science, University of Idaho, ID 83844, USA
| | - Monique Rijnkels
- Department of Veterinary Integrative Biosciences, Texas A&M University, TX 77843, USA
| | - Pablo J Ross
- Department of Animal Science, University of California, Davis, CA 95616, USA
| | - Rui Hu
- Department of Animal and Poultry Sciences, Virginia Tech, VA 24060, USA
| | - Sarah Corum
- Zoetis, Parsippany-Troy Hills, NJ 07054, USA
| | | | | | - Wansheng Liu
- Department of Animal Science, Pennsylvania State University, PA 16802, USA
| | - Wenzhi Ma
- Department of Animal Science, Pennsylvania State University, PA 16802, USA
| | - Xiaohui Zhang
- Department of Animal Science, Washington State University, WA 99164, USA
| | - Xiaoqing Xu
- Department of Animal Science, University of California, Davis, CA 95616, USA
| | - Xuelei Han
- Department of Animal Science, Washington State University, WA 99164, USA
| | - Zhihua Jiang
- Department of Animal Science, Washington State University, WA 99164, USA
| | - Zhi-Liang Hu
- Department of Animal Science, Iowa State University, Ames, IA 50011, USA
| | - James M Reecy
- Department of Animal Science, Iowa State University, Ames, IA 50011, USA
| |
|
17
|
Wu W, Krijgsveld J. Secretome Analysis: Reading Cellular Sign Language to Understand Intercellular Communication. Mol Cell Proteomics 2024; 23:100692. [PMID: 38081362 PMCID: PMC10793180 DOI: 10.1016/j.mcpro.2023.100692] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 12/07/2023] [Accepted: 12/08/2023] [Indexed: 01/06/2024] Open
Abstract
A significant portion of mammalian proteomes is secreted to the extracellular space to fulfill crucial roles in cell-to-cell communication. To best recapitulate the intricate and multi-faceted crosstalk between cells in a live organism, there is an ever-increasing need for methods to study protein secretion in model systems that include multiple cell types. In addition, posttranslational modifications further expand the complexity and versatility of cellular communication. This review aims to summarize recent strategies and model systems that employ cellular coculture, chemical biology tools, protein enrichment, and proteomic methods to characterize the composition and function of cellular secretomes. This is all geared towards gaining better understanding of organismal biology in vivo mediated by secretory signaling.
Affiliation(s)
- Wei Wu
- Singapore Immunology Network (SIgN), Agency for Science, Technology and Research (A*STAR), Singapore, Singapore; Department of Pharmacy, National University of Singapore, Singapore, Singapore.
| | - Jeroen Krijgsveld
- Division of Proteomics of Stem Cells and Cancer, German Cancer Research Center (DKFZ), Heidelberg, Germany; Medical Faculty, Heidelberg University, Heidelberg, Germany.
| |
|
18
|
Wu Z, Wu Y, Zhu C, Wu X, Zhai S, Wang X, Su Z, Duan H. Efficient Computational Framework for Target-Specific Active Peptide Discovery: A Case Study on IL-17C Targeting Cyclic Peptides. J Chem Inf Model 2023; 63:7655-7668. [PMID: 38049371 DOI: 10.1021/acs.jcim.3c01385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/06/2023]
Abstract
The development of potentially active peptides for specific targets is critical for the modern pharmaceutical industry's growth. In this study, we present an efficient computational framework for the discovery of active peptides targeting a specific pharmacological target, which combines a conditional variational autoencoder (CVAE) and a classifier named TCPP based on the Transformer and convolutional neural network. In our example scenario, we constructed an active cyclic peptide library targeting interleukin-17C (IL-17C) through a library-based in vitro selection strategy. The CVAE model is trained on the preprocessed peptide data sets to generate potentially active peptides, and the TCPP further screens the generated peptides. Ultimately, six candidate peptides predicted by the model were synthesized and assayed for their activity, and four of them exhibited promising binding affinity to IL-17C. Our study provides a one-stop shop for target-specific active peptide discovery, which is expected to accelerate peptide drug development.
Affiliation(s)
- Zhipeng Wu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Yejian Wu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Cheng Zhu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Xinyi Wu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Silong Zhai
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Xinqiao Wang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Zhihao Su
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China
| | - Hongliang Duan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
| |
|
19
|
Schilling T, Ferrero-Bordera B, Neef J, Maaß S, Becher D, van Dijl JM. Let There Be Light: Genome Reduction Enables Bacillus subtilis to Produce Disulfide-Bonded Gaussia Luciferase. ACS Synth Biol 2023; 12:3656-3668. [PMID: 38011677 PMCID: PMC10729301 DOI: 10.1021/acssynbio.3c00444] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/22/2023] [Revised: 11/09/2023] [Accepted: 11/17/2023] [Indexed: 11/29/2023]
Abstract
Bacillus subtilis is a major workhorse for enzyme production in industrially relevant quantities. Compared to mammalian-based expression systems, B. subtilis presents intrinsic advantages, such as high growth rates, high space-time yield, unique protein secretion capabilities, and low maintenance costs. However, B. subtilis shows clear limitations in the production of biopharmaceuticals, especially proteins from eukaryotic origin that contain multiple disulfide bonds. In the present study, we deployed genome minimization, signal peptide screening, and coexpression of recombinant thiol oxidases as strategies to improve the ability of B. subtilis to secrete proteins with multiple disulfide bonds. Different genome-reduced strains served as the chassis for expressing the model protein Gaussia Luciferase (GLuc), which contains five disulfide bonds. These chassis lack extracellular proteases, prophages, and key sporulation genes. Importantly, compared to the reference strain with a full-size genome, the best-performing genome-minimized strain achieved over 3000-fold increased secretion of active GLuc while growing to lower cell densities. Our results show that high-level GLuc secretion relates, at least in part, to the absence of major extracellular proteases. In addition, we show that the thiol-disulfide oxidoreductase requirements for disulfide bonding have changed upon genome reduction. Altogether, our results highlight genome-engineered Bacillus strains as promising expression platforms for proteins with multiple disulfide bonds.
Affiliation(s)
- Tobias Schilling
- Department of Medical Microbiology, University of Groningen, University Medical Center Groningen, Hanzeplein 1, P.O. Box 30001, 9700RB Groningen, The Netherlands
| | - Borja Ferrero-Bordera
- Institute of Microbiology, Department of Microbial Proteomics, University of Greifswald, D-17489 Greifswald, Germany
| | - Jolanda Neef
- Department of Medical Microbiology, University of Groningen, University Medical Center Groningen, Hanzeplein 1, P.O. Box 30001, 9700RB Groningen, The Netherlands
| | - Sandra Maaß
- Institute of Microbiology, Department of Microbial Proteomics, University of Greifswald, D-17489 Greifswald, Germany
| | - Dörte Becher
- Institute of Microbiology, Department of Microbial Proteomics, University of Greifswald, D-17489 Greifswald, Germany
| | - Jan Maarten van Dijl
- Department of Medical Microbiology, University of Groningen, University Medical Center Groningen, Hanzeplein 1, P.O. Box 30001, 9700RB Groningen, The Netherlands
| |
|
20
|
Parthiban S, Vijeesh T, Gayathri T, Shanmugaraj B, Sharma A, Sathishkumar R. Artificial intelligence-driven systems engineering for next-generation plant-derived biopharmaceuticals. Front Plant Sci 2023; 14:1252166. [PMID: 38034587 PMCID: PMC10684705 DOI: 10.3389/fpls.2023.1252166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023]
Abstract
Recombinant biopharmaceuticals including antigens, antibodies, hormones, cytokines, single-chain variable fragments, and peptides have been used as vaccines, diagnostics and therapeutics. Plant molecular pharming is a robust platform that uses plants as an expression system to produce simple and complex recombinant biopharmaceuticals on a large scale. The plant system has several advantages over other host systems, such as humanized expression, glycosylation, scalability, reduced risk of human or animal pathogenic contaminants, and rapid, cost-effective production. Despite these advantages, the expression of recombinant proteins in plant systems is hindered by factors such as non-human post-translational modifications, protein misfolding, conformation changes and instability. Artificial intelligence (AI) plays a vital role in various fields of biotechnology; in plant molecular pharming, a significant increase in yield and stability can be achieved by using AI-based approaches to overcome these hindrances. Current limitations of plant-based recombinant biopharmaceutical production can be circumvented with the aid of synthetic biology tools and AI algorithms in plant-based glycan engineering for protein folding, stability, viability, catalytic activity and organelle targeting. The AI models, including but not limited to, neural network, support vector machines, linear regression, Gaussian process and regressor ensemble, work by predicting the training and experimental data sets to design and validate the protein structures thereby optimizing properties such as thermostability, catalytic activity, antibody affinity, and protein folding. This review focuses on integrating systems engineering approaches and AI-based machine learning and deep learning algorithms in protein engineering and host engineering to augment protein production in plant systems to meet the ever-expanding therapeutics market.
Affiliation(s)
- Subramanian Parthiban
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Thandarvalli Vijeesh
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Thashanamoorthi Gayathri
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Balamurugan Shanmugaraj
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| | - Ashutosh Sharma
- Tecnologico de Monterrey, School of Engineering and Sciences, Centre of Bioengineering, Queretaro, Mexico
| | - Ramalingam Sathishkumar
- Plant Genetic Engineering Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore, India
| |
|
21
|
Romero-Romero S, Lindner S, Ferruz N. Exploring the Protein Sequence Space with Global Generative Models. Cold Spring Harb Perspect Biol 2023; 15:a041471. [PMID: 37848247 PMCID: PMC10626256 DOI: 10.1101/cshperspect.a041471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2023]
Abstract
Recent advancements in specialized large-scale architectures for training images and language have profoundly impacted the field of computer vision and natural language processing (NLP). Language models, such as the recent ChatGPT and GPT-4, have demonstrated exceptional capabilities in processing, translating, and generating human language. These breakthroughs have also been reflected in protein research, leading to the rapid development of numerous new methods in a short time, with unprecedented performance. Several of these models have been developed with the goal of generating sequences in novel regions of the protein space. In this work, we provide an overview of the use of protein generative models, reviewing (1) language models for the design of novel artificial proteins, (2) works that use non-transformer architectures, and (3) applications in directed evolution approaches.
Affiliation(s)
| | | | - Noelia Ferruz
- Barcelona Institute of Molecular Biology, 08028 Barcelona, Spain
| |
|
22
|
Tran JN, Sherwood KR, Mostafa A, Benedicto RV, ElaAlim A, Greenshields A, Keown P, Liwski R, Lan JH. Novel alleles in the era of next-generation sequencing-based HLA typing calls for standardization and policy. Front Genet 2023; 14:1282834. [PMID: 37900182 PMCID: PMC10611506 DOI: 10.3389/fgene.2023.1282834] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/24/2023] [Accepted: 10/02/2023] [Indexed: 10/31/2023] Open
Abstract
Next-Generation Sequencing (NGS) has transformed clinical histocompatibility laboratories through its capacity to provide accurate, high-throughput, high-resolution typing of Human Leukocyte Antigen (HLA) genes, which is critical for transplant safety and success. As this technology becomes widely used for clinical genotyping, histocompatibility laboratories now have an increased capability to identify novel HLA alleles that previously would not be detected using traditional genotyping methods. Standard guidelines for the clinical verification and reporting of novelties in the era of NGS are greatly needed. Here, we describe the experience of a clinical histocompatibility laboratory's use of NGS for HLA genotyping and its management of novel alleles detected in an ethnically-diverse population of British Columbia, Canada. Over a period of 18 months, 3,450 clinical samples collected for the purpose of solid organ or hematopoietic stem cell transplantation were sequenced using NGS. Overall, 29 unique novel alleles were identified at a rate of ∼1.6 per month. The majority of novelties (52%) were detected in the alpha chains of class II (HLA-DQA1 and -DPA1). Novelties were found in all 11 HLA classical genes except for HLA-DRB3, -DRB4, and -DQB1. All novelties were single nucleotide polymorphisms, where more than half led to an amino acid change, and one resulted in a premature stop codon. Missense mutations were evaluated for changes in their amino acid properties to assess the potential effect on the novel HLA protein. All novelties identified were confirmed independently at another accredited HLA laboratory using a different NGS assay and platform to ensure validity in the reporting of novelties. The novel alleles were submitted to the Immuno Polymorphism Database-Immunogenetics/HLA (IPD-IMGT/HLA) for official allele name designation and inclusion in future database releases. 
A nationwide survey involving all Canadian HLA laboratories confirmed the common occurrence of novel allele detection but identified a wide variability in the assessment and reporting of novelties. In summary, a considerable proportion of novel alleles were identified in routine clinical testing. We propose a framework for the standardization of policies on the clinical management of novel alleles and inclusion in proficiency testing programs in the era of NGS-based HLA genotyping.
Affiliation(s)
- Jenny N. Tran
  - British Columbia Provincial Immunology Laboratory, Vancouver Coastal Health, Vancouver, BC, Canada
- Karen R. Sherwood
  - British Columbia Provincial Immunology Laboratory, Vancouver Coastal Health, Vancouver, BC, Canada
- Ahmed Mostafa
  - Department of Pathology and Laboratory Medicine, University of Saskatchewan, Saskatoon, SK, Canada
- Rey Vincent Benedicto
  - British Columbia Provincial Immunology Laboratory, Vancouver Coastal Health, Vancouver, BC, Canada
- Allaa ElaAlim
  - British Columbia Provincial Immunology Laboratory, Vancouver Coastal Health, Vancouver, BC, Canada
- Paul Keown
  - Department of Pathology and Laboratory Medicine, Vancouver Coastal Health, University of British Columbia, Vancouver, BC, Canada
- Robert Liwski
  - Department of Pathology, Dalhousie University, Halifax, NS, Canada
- James H. Lan
  - Department of Pathology and Laboratory Medicine, Vancouver Coastal Health, University of British Columbia, Vancouver, BC, Canada

23
Mardikoraem M, Wang Z, Pascual N, Woldring D. Generative models for protein sequence modeling: recent advances and future directions. Brief Bioinform 2023; 24:bbad358. [PMID: 37864295 PMCID: PMC10589401 DOI: 10.1093/bib/bbad358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 09/08/2023] [Accepted: 09/12/2023] [Indexed: 10/22/2023] Open
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Affiliation(s)
- Mehrsa Mardikoraem
  - Michigan State University (MSU)’s Department of Chemical Engineering and Materials Science
- Zirui Wang
  - Regeneron Pharmaceuticals, Inc. Having received his B.S. in Chemical Engineering from MSU, he is currently pursuing a M.S. in Computer Science from Syracuse University
- Daniel Woldring
  - MSU’s Department of Chemical Engineering and Materials Science and a member of MSU’s Institute for Quantitative Health Sciences and Engineering

24
O’Neill P, Mistry RK, Brown AJ, James DC. Protein-Specific Signal Peptides for Mammalian Vector Engineering. ACS Synth Biol 2023; 12:2339-2352. [PMID: 37487508 PMCID: PMC10443038 DOI: 10.1021/acssynbio.3c00157] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/14/2023] [Indexed: 07/26/2023]
Abstract
Expression of recombinant proteins in mammalian cell factories relies on synthetic assemblies of genetic parts to optimally control flux through the product biosynthetic pathway. In comparison to other genetic part-types, there is a relative paucity of characterized signal peptide components, particularly for mammalian cell contexts. In this study, we describe a toolkit of signal peptide elements, created using bioinformatics-led and synthetic design approaches, that can be utilized to enhance production of biopharmaceutical proteins in Chinese hamster ovary cell factories. We demonstrate, for the first time in a mammalian cell context, that machine learning can be used to predict how discrete signal peptide elements will perform when utilized to drive endoplasmic reticulum (ER) translocation of specific single chain protein products. For more complex molecular formats, such as multichain monoclonal antibodies, we describe how a combination of in silico and targeted design rule-based in vitro testing can be employed to rapidly identify product-specific signal peptide solutions from minimal screening spaces. The utility of this technology is validated by deriving vector designs that increase product titers ≥1.8×, compared to standard industry systems, for a range of products, including a difficult-to-express monoclonal antibody. The availability of a vastly expanded toolbox of characterized signal peptide parts, combined with streamlined in silico/in vitro testing processes, will permit efficient expression vector re-design to maximize titers of both simple and complex protein products.
Affiliation(s)
- Pamela O’Neill
  - Department of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield S1 3JD, U.K.
- Rajesh K. Mistry
  - AstraZeneca, BioPharmaceutical Development, Cell Culture and Fermentation Sciences, Aaron Klugg Building, Granta Park, Cambridge CB21 6GH, U.K.
- Adam J. Brown
  - Department of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield S1 3JD, U.K.
  - SynGenSys Limited, Freeths LLP, Norfolk Street, Sheffield S1 2JE, U.K.
- David C. James
  - Department of Chemical and Biological Engineering, University of Sheffield, Mappin Street, Sheffield S1 3JD, U.K.
  - SynGenSys Limited, Freeths LLP, Norfolk Street, Sheffield S1 2JE, U.K.

25
Yu T, Boob AG, Singh N, Su Y, Zhao H. In vitro continuous protein evolution empowered by machine learning and automation. Cell Syst 2023; 14:633-644. [PMID: 37224814 DOI: 10.1016/j.cels.2023.04.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/04/2022] [Revised: 11/19/2022] [Accepted: 04/20/2023] [Indexed: 05/26/2023]
Abstract
Directed evolution has become one of the most successful and powerful tools for protein engineering. However, the efforts required for designing, constructing, and screening a large library of variants can be laborious, time-consuming, and costly. With the recent advent of machine learning (ML) in the directed evolution of proteins, researchers can now evaluate variants in silico and guide a more efficient directed evolution campaign. Furthermore, recent advancements in laboratory automation have enabled the rapid execution of long, complex experiments for high-throughput data acquisition in both industrial and academic settings, thus providing the means to collect a large quantity of data required to develop ML models for protein engineering. In this perspective, we propose a closed-loop in vitro continuous protein evolution framework that leverages the best of both worlds, ML and automation, and provide a brief overview of the recent developments in the field.
Affiliation(s)
- Tianhao Yu
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA
- Aashutosh Girish Boob
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Nilmani Singh
  - DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Yufeng Su
  - NSF Molecule Maker Lab Institute, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Huimin Zhao
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA

26
Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 2023; 41:1099-1106. [PMID: 36702895 PMCID: PMC10400306 DOI: 10.1038/s41587-022-01618-2] [Citation(s) in RCA: 337] [Impact Index Per Article: 168.5] [Received: 07/12/2022] [Accepted: 11/17/2022] [Indexed: 01/27/2023]
Abstract
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
Affiliation(s)
- Ali Madani
  - Salesforce Research, Palo Alto, CA, USA
  - Profluent Bio, San Francisco, CA, USA
- Eric R Greene
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Subu Subramanian
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA, USA
- James M Holton
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, CA, USA
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
- Jose Luis Olmos
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- James S Fraser
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA

27
Durmusoglu D, Al'Abri I, Li Z, Islam Williams T, Collins LB, Martínez JL, Crook N. Improving therapeutic protein secretion in the probiotic yeast Saccharomyces boulardii using a multifactorial engineering approach. Microb Cell Fact 2023; 22:109. [PMID: 37287064 PMCID: PMC10245609 DOI: 10.1186/s12934-023-02117-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/16/2023] [Accepted: 05/20/2023] [Indexed: 06/09/2023] Open
Abstract
The probiotic yeast Saccharomyces boulardii (Sb) is a promising chassis to deliver therapeutic proteins to the gut due to Sb's innate therapeutic properties, resistance to phage and antibiotics, and high protein secretion capacity. To maintain therapeutic efficacy in the context of challenges such as washout, low rates of diffusion, weak target binding, and/or high rates of proteolysis, it is desirable to engineer Sb strains with enhanced levels of protein secretion. In this work, we explored genetic modifications in both cis- (i.e. to the expression cassette of the secreted protein) and trans- (i.e. to the Sb genome) that enhance Sb's ability to secrete proteins, taking a Clostridioides difficile Toxin A neutralizing peptide (NPA) as our model therapeutic. First, by modulating the copy number of the NPA expression cassette, we found NPA concentrations in the supernatant could be varied by sixfold (76-458 mg/L) in microbioreactor fermentations. In the context of high NPA copy number, we found a previously-developed collection of native and synthetic secretion signals could further tune NPA secretion between 121 and 463 mg/L. Then, guided by prior knowledge of S. cerevisiae's secretion mechanisms, we generated a library of homozygous single gene deletion strains, the most productive of which achieved 2297 mg/L secretory production of NPA. We then expanded on this library by performing combinatorial gene deletions, supplemented by proteomics experiments. We ultimately constructed a quadruple protease-deficient Sb strain that produces 5045 mg/L secretory NPA, an improvement of > tenfold over wild-type Sb. Overall, this work systematically explores a broad collection of engineering strategies to improve protein secretion in Sb and highlights the ability of proteomics to highlight under-explored mediators of this process. 
In doing so, we created a set of probiotic strains that are capable of delivering a wide range of protein titers and therefore furthers the ability of Sb to deliver therapeutics to the gut and other settings to which it is adapted.
Affiliation(s)
- Deniz Durmusoglu
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA
- Ibrahim Al'Abri
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA
- Zidan Li
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA
- Taufika Islam Williams
  - Molecular Education, Technology and Research Innovation Center (METRIC), North Carolina State University, Raleigh, NC, USA
  - Department of Chemistry, North Carolina State University, Raleigh, NC, USA
- Leonard B Collins
  - Molecular Education, Technology and Research Innovation Center (METRIC), North Carolina State University, Raleigh, NC, USA
- José L Martínez
  - Department of Biotechnology and Biomedicine, Technical University of Denmark, Kgs. Lyngby, Denmark
- Nathan Crook
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA

28
Grasso S, Dabene V, Hendriks MMW, Zwartjens P, Pellaux R, Held M, Panke S, van Dijl JM, Meyer A, van Rij T. Signal Peptide Efficiency: From High-Throughput Data to Prediction and Explanation. ACS Synth Biol 2023; 12:390-404. [PMID: 36649479 PMCID: PMC9942255 DOI: 10.1021/acssynbio.2c00328] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/19/2022] [Indexed: 01/18/2023]
Abstract
The passage of proteins across biological membranes via the general secretory (Sec) pathway is a universally conserved process with critical functions in cell physiology and important industrial applications. Proteins are directed into the Sec pathway by a signal peptide at their N-terminus. Estimating the impact of physicochemical signal peptide features on protein secretion levels has not been achieved so far, partially due to the extreme sequence variability of signal peptides. To elucidate relevant features of the signal peptide sequence that influence secretion efficiency, an evaluation of ∼12,000 different designed signal peptides was performed using a novel miniaturized high-throughput assay. The results were used to train a machine learning model, and a post-hoc explanation of the model is provided. By describing each signal peptide with a selection of 156 physicochemical features, it is now possible to both quantify feature importance and predict the protein secretion levels directed by each signal peptide. Our analyses allow the detection and explanation of the relevant signal peptide features influencing the efficiency of protein secretion, generating a versatile tool for the de novo design and in silico evaluation of signal peptides.
Affiliation(s)
- Stefano Grasso
  - Department of Medical Microbiology, University of Groningen, University Medical Center Groningen, Hanzeplein 1, Groningen 9700 RB, The Netherlands
  - DSM Biotechnology Center, Alexander Fleminglaan 1, Delft 2613 AX, Netherlands
- Valentina Dabene
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, Basel 4058, Switzerland
  - FGen AG, Hochbergerstrasse 60C, Basel 4057, Switzerland
- Priscilla Zwartjens
  - DSM Biotechnology Center, Alexander Fleminglaan 1, Delft 2613 AX, Netherlands
- René Pellaux
  - FGen AG, Hochbergerstrasse 60C, Basel 4057, Switzerland
- Martin Held
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, Basel 4058, Switzerland
- Sven Panke
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, Basel 4058, Switzerland
- Jan Maarten van Dijl
  - Department of Medical Microbiology, University of Groningen, University Medical Center Groningen, Hanzeplein 1, Groningen 9700 RB, The Netherlands
- Andreas Meyer
  - FGen AG, Hochbergerstrasse 60C, Basel 4057, Switzerland
- Tjeerd van Rij
  - DSM Biotechnology Center, Alexander Fleminglaan 1, Delft 2613 AX, Netherlands

29
Tubiana J, Adriana-Lifshits L, Nissan M, Gabay M, Sher I, Sova M, Wolfson HJ, Gal M. Funneling modulatory peptide design with generative models: Discovery and characterization of disruptors of calcineurin protein-protein interactions. PLoS Comput Biol 2023; 19:e1010874. [PMID: 36730443 PMCID: PMC9928118 DOI: 10.1371/journal.pcbi.1010874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Revised: 02/14/2023] [Accepted: 01/16/2023] [Indexed: 02/04/2023] Open
Abstract
Design of peptide binders is an attractive strategy for targeting "undruggable" protein-protein interfaces. Current design protocols rely on the extraction of an initial sequence from one known protein interactor of the target protein, followed by in-silico or in-vitro mutagenesis-based optimization of its binding affinity. Wet lab protocols can explore only a minor portion of the vast sequence space and cannot efficiently screen for other desirable properties such as high specificity and low toxicity, while in-silico design requires intensive computational resources and often relies on simplified binding models. Yet, for a multivalent protein target, dozens to hundreds of natural protein partners already exist in the cellular environment. Here, we describe a peptide design protocol that harnesses this diversity via a machine learning generative model. After identifying putative natural binding fragments by literature and homology search, a compositional Restricted Boltzmann Machine is trained and sampled to yield hundreds of diverse candidate peptides. The latter are further filtered via flexible molecular docking and an in-vitro microchip-based binding assay. We validate and test our protocol on calcineurin, a calcium-dependent protein phosphatase involved in various cellular pathways in health and disease. In a single screening round, we identified multiple 16-length peptides with up to six mutations from their closest natural sequence that successfully interfere with the binding of calcineurin to its substrates. In summary, integrating protein interaction and sequence databases, generative modeling, molecular docking and interaction assays enables the discovery of novel protein-protein interaction modulators.
Affiliation(s)
- Jérôme Tubiana
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Lucia Adriana-Lifshits
  - Department of Oral Biology, The Goldschleger School of Dental Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Michael Nissan
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Matan Gabay
  - Department of Oral Biology, The Goldschleger School of Dental Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Inbal Sher
  - Department of Oral Biology, The Goldschleger School of Dental Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Marina Sova
  - Department of Oral Biology, The Goldschleger School of Dental Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Haim J. Wolfson
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Maayan Gal
  - Department of Oral Biology, The Goldschleger School of Dental Medicine, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel

30
Wittmund M, Cadet F, Davari MD. Learning Epistasis and Residue Coevolution Patterns: Current Trends and Future Perspectives for Advancing Enzyme Engineering. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Affiliation(s)
- Marcel Wittmund
  - Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
- Frederic Cadet
  - Laboratory of Excellence LABEX GR, DSIMB, Inserm UMR S1134, University of Paris city & University of Reunion, Paris 75014, France
- Mehdi D. Davari
  - Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany

31
Fannjiang C, Bates S, Angelopoulos AN, Listgarten J, Jordan MI. Conformal prediction under feedback covariate shift for biomolecular design. Proc Natl Acad Sci U S A 2022; 119:e2204569119. [PMID: 36256807 PMCID: PMC9618043 DOI: 10.1073/pnas.2204569119] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/15/2022] [Accepted: 06/20/2022] [Indexed: 11/18/2022] Open
Abstract
Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting: one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data (that is, the designed sequences) has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.
Affiliation(s)
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
- Stephen Bates
  - Department of Statistics, University of California, Berkeley, CA 94720
- Anastasios N. Angelopoulos
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Center for Computational Biology, University of California, Berkeley, CA 94720
- Michael I. Jordan
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
  - Department of Statistics, University of California, Berkeley, CA 94720

32
Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, Winther O, Brunak S, von Heijne G, Nielsen H. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 2022. [PMID: 34980915 DOI: 10.1038/s41587-021-01156-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
Affiliation(s)
- Felix Teufel
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- José Juan Almagro Armenteros
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Magnús Halldór Gíslason
  - Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen, Denmark
- Silas Irby Pihl
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Ole Winther
  - Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen, Denmark
  - Department of Biology, Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
  - Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
- Søren Brunak
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
  - Science for Life Laboratory, Stockholm University, Solna, Sweden
- Henrik Nielsen
  - Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark

33
Wan F, Kontogiorgos-Heintz D, de la Fuente-Nunez C. Deep generative models for peptide design. Digital Discovery 2022; 1:195-208. [PMID: 35769205 PMCID: PMC9189861 DOI: 10.1039/d1dd00024a] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 10/15/2021] [Accepted: 03/19/2022] [Indexed: 12/13/2022]
Abstract
Computers can already be programmed for superhuman pattern recognition of images and text. For machines to discover novel molecules, they must first be trained to sort through the many characteristics of molecules and determine which properties should be retained, suppressed, or enhanced to optimize functions of interest. Machines need to be able to understand, read, write, and eventually create new molecules. Today, this creative process relies on deep generative models, which have gained popularity since powerful deep neural networks were introduced to generative model frameworks. In recent years, they have demonstrated excellent ability to model complex distribution of real-world data (e.g., images, audio, text, molecules, and biological sequences). Deep generative models can generate data beyond those provided in training samples, thus yielding an efficient and rapid tool for exploring the massive search space of high-dimensional data such as DNA/protein sequences and facilitating the design of biomolecules with desired functions. Here, we review the emerging field of deep generative models applied to peptide science. In particular, we discuss several popular deep generative model frameworks as well as their applications to generate peptides with various kinds of properties (e.g., antimicrobial, anticancer, cell penetration, etc.). We conclude our review with a discussion of current limitations and future perspectives in this emerging field.
Affiliation(s)
- Fangping Wan
  - Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Daphne Kontogiorgos-Heintz
  - Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Cesar de la Fuente-Nunez
  - Machine Biology Group, Departments of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Departments of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA
  - Penn Institute for Computational Science, University of Pennsylvania, Philadelphia, Pennsylvania, USA

34
Holec PV, Camacho KV, Breuckman KC, Mou J, Birnbaum ME. Proteome-Scale Screening to Identify High-Expression Signal Peptides with Minimal N-Terminus Biases via Yeast Display. ACS Synth Biol 2022; 11:2405-2416. [PMID: 35687717 DOI: 10.1021/acssynbio.2c00101] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
Abstract
Signal peptides are critical for the efficient expression and routing of extracellular and secreted proteins. Most protein production and screening technologies rely upon a relatively small set of signal peptides. Despite their central role in biotechnology, there are limited studies comprehensively examining the interplay between signal peptides and expressed protein sequences. Here, we describe a high-throughput method to screen novel signal peptides that maintain a high degree of surface expression across a range of protein scaffolds with highly variable N-termini. We find that the canonical signal peptide used in yeast surface display, derived from Aga2p, fails to achieve high surface expression for 42.5% of constructs containing diverse N-termini. To circumvent this, we have identified two novel signal peptides derived from endogenous yeast proteins, SRL1 and KISH, which are highly tolerant to diverse N-terminal sequences. This pipeline can be used to expand our understanding of signal peptide function, identify improved signal peptides for protein expression, and refine the computational tools used for signal peptide prediction.
Affiliation(s)
- Patrick V Holec
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Karen V Camacho
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Kathryn C Breuckman
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Jody Mou
- Harvard-MIT Program in Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Michael E Birnbaum
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Ragon Institute of MGH, MIT, and Harvard, Cambridge, Massachusetts 02139, United States
|
35
|
Freschlin CR, Fahlberg SA, Romero PA. Machine learning to navigate fitness landscapes for protein engineering. Curr Opin Biotechnol 2022; 75:102713. [PMID: 35413604 PMCID: PMC9177649 DOI: 10.1016/j.copbio.2022.102713] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 10/19/2021] [Revised: 01/05/2022] [Accepted: 02/28/2022] [Indexed: 11/19/2022]
Abstract
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Affiliation(s)
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
|
36
|
Nakai K, Wei L. Recent Advances in the Prediction of Subcellular Localization of Proteins and Related Topics. Front Bioinform 2022; 2:910531. [PMID: 36304291 PMCID: PMC9580943 DOI: 10.3389/fbinf.2022.910531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 04/25/2022] [Indexed: 11/13/2022] Open
Abstract
Prediction of subcellular localization of proteins from their amino acid sequences has a long history in bioinformatics and is still actively developing, incorporating the latest advances in machine learning and proteomics. Notably, deep learning-based methods for natural language processing have made great contributions. Here, we review recent advances in the field as well as its related fields, such as subcellular proteomics and the prediction/recognition of subcellular localization from image data.
Affiliation(s)
- Kenta Nakai
- Institute of Medical Science, The University of Tokyo, Minato-Ku, Japan
- Leyi Wei
- School of Software, Shandong University, Jinan, China
|
37
|
Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022; 23:bbac102. [PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 12/16/2021] [Revised: 02/26/2022] [Accepted: 03/01/2022] [Indexed: 12/11/2022] Open
Abstract
Proteins with desired functions and properties are important in fields like nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is believed to be key to addressing real societal challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we look back on the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. The future perspectives on design goals, challenges and opportunities are also comprehensively discussed.
Affiliation(s)
- Wenze Ding
- School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
- School of Future Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
- Kenta Nakai
- Institute of Medical Science, the University of Tokyo, Tokyo 1088639, Japan
- Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
|
38
|
Computational enzyme redesign: large jumps in function. Trends Chem 2022. [DOI: 10.1016/j.trechm.2022.03.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
39
|
Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, Winther O, Brunak S, von Heijne G, Nielsen H. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 2022; 40:1023-1025. [PMID: 34980915 PMCID: PMC9287161 DOI: 10.1038/s41587-021-01156-3] [Citation(s) in RCA: 1275] [Impact Index Per Article: 425.0] [Received: 06/22/2021] [Accepted: 11/08/2021] [Indexed: 11/09/2022]
Abstract
Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data. A new version of SignalP predicts all types of signal peptides.
Affiliation(s)
- Felix Teufel
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- José Juan Almagro Armenteros
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Magnús Halldór Gíslason
- Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen, Denmark
- Silas Irby Pihl
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Ole Winther
- Center for Genomic Medicine, Rigshospitalet (Copenhagen University Hospital), Copenhagen, Denmark
- Department of Biology, Bioinformatics Centre, University of Copenhagen, Copenhagen, Denmark
- Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
- Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Gunnar von Heijne
- Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Science for Life Laboratory, Stockholm University, Solna, Sweden
- Henrik Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
|
40
|
Akbar R, Bashour H, Rawat P, Robert PA, Smorodina E, Cotet TS, Flem-Karlsen K, Frank R, Mehta BB, Vu MH, Zengin T, Gutierrez-Marcos J, Lund-Johansen F, Andersen JT, Greiff V. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs 2022; 14:2008790. [PMID: 35293269 PMCID: PMC8928824 DOI: 10.1080/19420862.2021.2008790] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Received: 08/22/2021] [Revised: 11/04/2021] [Accepted: 11/17/2021] [Indexed: 12/15/2022] Open
Abstract
Although the therapeutic efficacy and commercial success of monoclonal antibodies (mAbs) are tremendous, the design and discovery of new candidates remain a time and cost-intensive endeavor. In this regard, progress in the generation of data describing antigen binding and developability, computational methodology, and artificial intelligence may pave the way for a new era of in silico on-demand immunotherapeutics design and discovery. Here, we argue that the main necessary machine learning (ML) components for an in silico mAb sequence generator are: understanding of the rules of mAb-antigen binding, capacity to modularly combine mAb design parameters, and algorithms for unconstrained parameter-driven in silico mAb sequence synthesis. We review the current progress toward the realization of these necessary components and discuss the challenges that must be overcome to allow the on-demand ML-based discovery and design of fit-for-purpose mAb therapeutic candidates.
Affiliation(s)
- Rahmad Akbar
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Habib Bashour
- School of Life Sciences, University of Warwick, Coventry, UK
- Puneet Rawat
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
- Philippe A. Robert
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Eva Smorodina
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Russia
- Karine Flem-Karlsen
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Institute of Clinical Medicine, Department of Pharmacology, University of Oslo and Oslo University Hospital, Norway
- Robert Frank
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Brij Bhushan Mehta
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Mai Ha Vu
- Department of Linguistics and Scandinavian Studies, University of Oslo, Norway
- Talip Zengin
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Department of Bioinformatics, Mugla Sitki Kocman University, Turkey
- Jan Terje Andersen
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Institute of Clinical Medicine, Department of Pharmacology, University of Oslo and Oslo University Hospital, Norway
- Victor Greiff
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
|
41
|
Cadet XF, Gelly JC, van Noord A, Cadet F, Acevedo-Rocha CG. Learning Strategies in Protein Directed Evolution. Methods Mol Biol 2022; 2461:225-275. [PMID: 35727454 DOI: 10.1007/978-1-0716-2152-3_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Synthetic biology is a fast-evolving research field that combines biology and engineering principles to develop new biological systems for medical, pharmacological, and industrial applications. Synthetic biologists use iterative "design, build, test, and learn" cycles to efficiently engineer genetic systems that are reliable, reproducible, and predictable. Protein engineering by directed evolution can benefit from such a systematic engineering approach for various reasons. Learning can be carried out before starting, throughout or after finalizing a directed evolution project. Computational tools, bioinformatics, and scanning mutagenesis methods can be excellent starting points, while molecular dynamics simulations and other strategies can guide engineering efforts. Similarly, studying protein intermediates along evolutionary pathways offers fascinating insights into the molecular mechanisms shaped by evolution. The learning step of the cycle is not only crucial for proteins or enzymes that are not suitable for high-throughput screening or selection systems, but it is also valuable for any platform that can generate a large amount of data that can be aided by machine learning algorithms. The main challenge in protein engineering is to predict the effect of a single mutation on one functional parameter, to say nothing of several mutations on multiple parameters. This is largely due to nonadditive mutational interactions, known as epistatic effects: beneficial mutations present in one genetic background may not be beneficial in another genetic background. In this work, we provide an overview of experimental and computational strategies that can guide the user to learn protein function at different stages in a directed evolution project. We also discuss how epistatic effects can influence the success of directed evolution projects. Since machine learning is gaining momentum in protein engineering and the field is becoming more interdisciplinary thanks to collaboration between mathematicians, computational scientists, engineers, molecular biologists, and chemists, we provide a general workflow that familiarizes nonexperts with the basic concepts, dataset requirements, learning approaches, model capabilities and performance metrics of this intriguing area. Finally, we also provide some practical recommendations on how machine learning can harness epistatic effects for engineering proteins in an "outside-the-box" way.
Affiliation(s)
- Xavier F Cadet
- PEACCEL, Artificial Intelligence Department, Paris, France
- Jean Christophe Gelly
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
- Frédéric Cadet
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
|
42
|
Robinson SL, Piel J, Sunagawa S. A roadmap for metagenomic enzyme discovery. Nat Prod Rep 2021; 38:1994-2023. [PMID: 34821235 PMCID: PMC8597712 DOI: 10.1039/d1np00006c] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Received: 01/31/2021] [Indexed: 12/13/2022]
Abstract
Covering: up to 2021. Metagenomics has yielded massive amounts of sequencing data offering a glimpse into the biosynthetic potential of the uncultivated microbial majority. While genome-resolved information about microbial communities from nearly every environment on earth is now available, the ability to accurately predict biocatalytic functions directly from sequencing data remains challenging. Compared to primary metabolic pathways, enzymes involved in secondary metabolism often catalyze specialized reactions with diverse substrates, making these pathways rich resources for the discovery of new enzymology. To date, functional insights gained from studies on environmental DNA (eDNA) have largely relied on PCR- or activity-based screening of eDNA fragments cloned in fosmid or cosmid libraries. As an alternative, shotgun metagenomics holds underexplored potential for the discovery of new enzymes directly from eDNA by avoiding common biases introduced through PCR- or activity-guided functional metagenomics workflows. However, inferring new enzyme functions directly from eDNA is similar to searching for a 'needle in a haystack' without direct links between genotype and phenotype. The goal of this review is to provide a roadmap to navigate shotgun metagenomic sequencing data and identify new candidate biosynthetic enzymes. We cover both computational and experimental strategies to mine metagenomes and explore protein sequence space with a spotlight on natural product biosynthesis. Specifically, we compare in silico methods for enzyme discovery including phylogenetics, sequence similarity networks, genomic context, 3D structure-based approaches, and machine learning techniques. We also discuss various experimental strategies to test computational predictions including heterologous expression and screening. Finally, we provide an outlook for future directions in the field with an emphasis on meta-omics, single-cell genomics, cell-free expression systems, and sequence-independent methods.
Affiliation(s)
- Jörn Piel
- Eidgenössische Technische Hochschule (ETH), Zürich, Switzerland
|
43
|
Munro LJ, Kell DB. Intelligent host engineering for metabolic flux optimisation in biotechnology. Biochem J 2021; 478:3685-3721. [PMID: 34673920 PMCID: PMC8589332 DOI: 10.1042/bcj20210535] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/25/2021] [Revised: 09/22/2021] [Accepted: 09/24/2021] [Indexed: 12/13/2022]
Abstract
Optimising the function of a protein of length N amino acids by directed evolution involves navigating a 'search space' of possible sequences of some 20^N. Optimising the expression levels of P proteins that materially affect host performance, each of which might also take 20 (logarithmically spaced) values, implies a similar search space of 20^P. In this combinatorial sense, then, the problems of directed protein evolution and of host engineering are broadly equivalent. In practice, however, they have different means for avoiding the inevitable difficulties of implementation. The spare capacity exhibited in metabolic networks implies that host engineering may admit substantial increases in flux to targets of interest. Thus, we rehearse the relevant issues for those wishing to understand and exploit those modern genome-wide host engineering tools and thinking that have been designed and developed to optimise fluxes towards desirable products in biotechnological processes, with a focus on microbial systems. The aim throughout is 'making such biology predictable'. Strategies have been aimed at both transcription and translation, especially for regulatory processes that can affect multiple targets. However, because there is a limit on how much protein a cell can produce, increasing kcat in selected targets may be a better strategy than increasing protein expression levels for optimal host engineering.
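As a rough illustration of the combinatorial claim in this abstract, the sketch below computes the size of a search space in which each of N positions (or each of P expression levels) can take 20 values. The function names `search_space_size` and `log10_search_space` are hypothetical, chosen only for this example, and are not taken from the cited paper.

```python
import math


def search_space_size(n_values: int, n_positions: int) -> int:
    """Exact number of combinations: n_values ** n_positions."""
    return n_values ** n_positions


def log10_search_space(n_values: int, n_positions: int) -> float:
    """Order of magnitude (base 10) of the search space size."""
    return n_positions * math.log10(n_values)


if __name__ == "__main__":
    # Illustrative lengths only; N = 100 is an assumed example, not a value from the abstract.
    for n in (3, 10, 100):
        print(f"20^{n} is about 10^{log10_search_space(20, n):.1f}")
```

Working with the base-10 logarithm keeps the numbers readable for long sequences, where the exact integer has well over a hundred digits.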
Affiliation(s)
- Lachlan J. Munro
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
- Douglas B. Kell
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St, Liverpool L69 7ZB, U.K
- Mellizyme Biotechnology Ltd, IC1, Liverpool Science Park, 131 Mount Pleasant, Liverpool L3 5TF, U.K
|
44
|
Defresne M, Barbe S, Schiex T. Protein Design with Deep Learning. Int J Mol Sci 2021; 22:11741. [PMID: 34769173 PMCID: PMC8584038 DOI: 10.3390/ijms222111741] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 09/30/2021] [Revised: 10/23/2021] [Accepted: 10/26/2021] [Indexed: 12/21/2022] Open
Abstract
Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.
Affiliation(s)
- Marianne Defresne
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Sophie Barbe
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Thomas Schiex
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
|
45
|
Kell DB. The Transporter-Mediated Cellular Uptake and Efflux of Pharmaceutical Drugs and Biotechnology Products: How and Why Phospholipid Bilayer Transport Is Negligible in Real Biomembranes. Molecules 2021; 26:5629. [PMID: 34577099 PMCID: PMC8470029 DOI: 10.3390/molecules26185629] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 08/17/2021] [Revised: 09/03/2021] [Accepted: 09/14/2021] [Indexed: 12/12/2022] Open
Abstract
Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low. This is because (i) most real biomembranes are mostly protein, not lipid, (ii) unlike purely lipid bilayers that can form transient aqueous channels, the high concentrations of proteins serve to stop such activity, (iii) natural evolution long ago selected against transport methods that just let any undesirable products enter a cell, (iv) transporters have now been identified for all kinds of molecules (even water) that were once thought not to require them, (v) many experiments show a massive variation in the uptake of drugs between different cells, tissues, and organisms, that cannot be explained if lipid bilayer transport is significant or if efflux were the only differentiator, and (vi) many experiments that manipulate the expression level of individual transporters as an independent variable demonstrate their role in drug and nutrient uptake (including in cytotoxicity or adverse drug reactions). This makes such transporters valuable both as a means of targeting drugs (not least anti-infectives) to selected cells or tissues and also as drug targets. The same considerations apply to the exploitation of substrate uptake and product efflux transporters in biotechnology. We are also beginning to recognise that transporters are more promiscuous, and antiporter activity is much more widespread, than had been realised, and that such processes are adaptive (i.e., were selected by natural evolution). The purpose of the present review is to summarise the above, and to rehearse and update readers on recent developments. These developments lead us to retain and indeed to strengthen our contention that for transmembrane pharmaceutical drug transport "phospholipid bilayer transport is negligible".
Affiliation(s)
- Douglas B. Kell
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St, Liverpool L69 7ZB, UK
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs Lyngby, Denmark
- Mellizyme Biotechnology Ltd., IC1, Liverpool Science Park, Mount Pleasant, Liverpool L3 5TF, UK
|
46
|
Harnessing the yeast Saccharomyces cerevisiae for the production of fungal secondary metabolites. Essays Biochem 2021; 65:277-291. [PMID: 34061167 PMCID: PMC8314005 DOI: 10.1042/ebc20200137] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/24/2021] [Revised: 04/09/2021] [Accepted: 04/14/2021] [Indexed: 12/17/2022]
Abstract
Fungal secondary metabolites (FSMs) represent a remarkable array of bioactive compounds, with potential applications as pharmaceuticals, nutraceuticals, and agrochemicals. However, these molecules are typically produced only in limited amounts by their native hosts. The native organisms may also be difficult to cultivate and genetically engineer, and some can produce undesirable toxic side-products. Alternatively, recombinant production of fungal bioactives can be engineered into industrial cell factories, such as aspergilli or yeasts, which are well suited to large-scale manufacturing in submerged fermentations. In this review, we summarize the development of baker's yeast Saccharomyces cerevisiae to produce compounds derived from filamentous fungi and mushrooms. These compounds mainly include polyketides, terpenoids, and amino acid derivatives. We also describe how native biosynthetic pathways can be combined or expanded to produce novel derivatives and new-to-nature compounds. We describe some new approaches for cell factory engineering, such as genome-scale engineering, biosensor-based high-throughput screening, and machine learning, and how these tools have been applied for S. cerevisiae strain improvement. Finally, we discuss the prospective challenges and solutions in further development of yeast cell factories to more efficiently produce FSMs.
|
47
|
Wu Z, Johnston KE, Arnold FH, Yang KK. Protein sequence design with deep generative models. Curr Opin Chem Biol 2021; 65:18-27. [PMID: 34051682 DOI: 10.1016/j.cbpa.2021.04.004] [Citation(s) in RCA: 72] [Impact Index Per Article: 18.0] [Received: 02/24/2021] [Revised: 04/02/2021] [Accepted: 04/07/2021] [Indexed: 12/20/2022]
Abstract
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, protein sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
Collapse
Affiliation(s)
- Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, 91125, CA, USA
- Kadina E Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, 91125, CA, USA
- Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, 91125, CA, USA; Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, 91125, CA, USA
- Kevin K Yang
- Microsoft Research New England, 1 Memorial Drive, Cambridge, 02142, MA, USA
|
48
|
Revolutionizing enzyme engineering through artificial intelligence and machine learning. Emerg Top Life Sci 2021; 5:113-125. [PMID: 33835131 DOI: 10.1042/etls20200257] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 12/17/2020] [Revised: 03/17/2021] [Accepted: 03/22/2021] [Indexed: 12/20/2022]
Abstract
The combinatorial space of an enzyme sequence has astronomical possibilities, and exploring it with contemporary experimental techniques is arduous and often ineffective. Multi-target objectives such as concomitantly achieving improved selectivity, solubility and activity of an enzyme have narrow plausibility under approaches of restricted mutagenesis and combinatorial search. Traditional enzyme engineering approaches have a limited scope for complex optimization due to the requirement of a priori knowledge or the experimental burden of screening huge protein libraries. The recent surge in high-throughput experimental methods, including Next Generation Sequencing and automated screening, has flooded the field of molecular biology with big data, which requires us to rethink our current approaches to enzyme engineering. Artificial Intelligence (AI) and Machine Learning (ML) have great potential to revolutionize smart enzyme engineering without the explicit need for a complete understanding of the underlying molecular system. Here, we portray the role and position of AI techniques in the field of enzyme engineering along with their scope and limitations. In addition, we explain how the traditional approaches of directed evolution and rational design can be extended through AI tools. Recent successful examples of AI-assisted enzyme engineering projects and their deviation from traditional approaches are highlighted. A comprehensive picture of current challenges and future avenues for AI in enzyme engineering is also discussed.
|
49
|
Püllmann P, Knorrscheidt A, Münch J, Palme PR, Hoehenwarter W, Marillonnet S, Alcalde M, Westermann B, Weissenborn MJ. A modular two yeast species secretion system for the production and preparative application of unspecific peroxygenases. Commun Biol 2021; 4:562. [PMID: 33980981 PMCID: PMC8115255 DOI: 10.1038/s42003-021-02076-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 09/28/2020] [Accepted: 03/31/2021] [Indexed: 01/27/2023] Open
Abstract
Fungal unspecific peroxygenases (UPOs) represent an enzyme class catalysing versatile oxyfunctionalisation reactions on a broad substrate scope. They occur as secreted, glycosylated proteins bearing a haem-thiolate active site and rely on hydrogen peroxide as the oxygen source. However, their heterologous production in a fast-growing organism suitable for high-throughput screening has only succeeded once, enabled by an intensive directed evolution campaign. We developed and applied a modular Golden Gate-based secretion system, allowing the first production of four active UPOs in yeast, their one-step purification and application in an enantioselective conversion on a preparative scale. The Golden Gate setup was designed to be universally applicable and consists of three module types: i) signal peptides for secretion, ii) UPO genes, and iii) protein tags for purification and split-GFP detection. The modular episomal system is suitable for use in Saccharomyces cerevisiae and was transferred to episomal and chromosomally integrated expression cassettes in Pichia pastoris. Shake flask productions in Pichia pastoris yielded up to 24 mg/L secreted UPO enzyme, which was employed for the preparative scale conversion of a phenethylamine derivative reaching 98.6% ee. Our results demonstrate a rapid, modular yeast secretion workflow of UPOs yielding preparative scale enantioselective biotransformations.
Affiliation(s)
- Pascal Püllmann
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany
- Judith Münch
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany
- Paul R Palme
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany
- Miguel Alcalde
- Department of Biocatalysis, Institute of Catalysis, CSIC, Madrid, Spain
- Bernhard Westermann
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany
- Institute of Chemistry, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
- Martin J Weissenborn
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany
- Institute of Chemistry, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
|
50
|
Frappier V, Keating AE. Data-driven computational protein design. Curr Opin Struct Biol 2021; 69:63-69. [PMID: 33910104 DOI: 10.1016/j.sbi.2021.03.009] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 12/06/2020] [Revised: 03/18/2021] [Accepted: 03/19/2021] [Indexed: 01/28/2023]
Abstract
Computational protein design can generate proteins not found in nature that adopt desired structures and perform novel functions. Although proteins could, in theory, be designed with ab initio methods, practical success has come from using large amounts of data that describe the sequences, structures, and functions of existing proteins and their variants. We present recent creative uses of multiple-sequence alignments, protein structures, and high-throughput functional assays in computational protein design. Approaches range from enhancing structure-based design with experimental data to building regression models to training deep neural nets that generate novel sequences. Looking ahead, deep learning will be increasingly important for maximizing the value of data for protein design.
Affiliation(s)
- Vincent Frappier
- Generate Biomedicines, 26 Landsdowne Street, Cambridge, MA, 02139, USA
- Amy E Keating
- MIT Departments of Biology and Biological Engineering, 77 Massachusetts Ave., Cambridge, MA, 02139, USA
|