1. Ertelt M, Moretti R, Meiler J, Schoeder CT. Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants. Science Advances 2025; 11:eadr7338. [PMID: 39937901 PMCID: PMC11817935 DOI: 10.1126/sciadv.adr7338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Accepted: 01/10/2025] [Indexed: 02/14/2025]
Abstract
Machine learning (ML) is changing the world of computational protein design, with data-driven methods surpassing biophysical-based methods in experimental success. However, they are most often reported as case studies, lack integration and standardization, and are therefore hard to objectively compare. In this study, we established a streamlined and diverse toolbox for methods that predict amino acid probabilities inside the Rosetta software framework that allows for the side-by-side comparison of these models. Subsequently, existing protein fitness landscapes were used to benchmark novel ML methods in realistic protein design settings. We focused on the traditional problems of protein design: sampling and scoring. A major finding of our study is that ML approaches are better at purging the sampling space from deleterious mutations. Nevertheless, scoring resulting mutations without model fine-tuning showed no clear improvement over scoring with Rosetta. We conclude that ML now complements, rather than replaces, biophysical methods in protein design.
Affiliation(s)
- Moritz Ertelt
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
- Rocco Moretti
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Jens Meiler
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Clara T. Schoeder
- Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany

2. Hunter Wilson R, Diaz DJ, Damodaran AR, Bhagi-Damodaran A. Machine Learning Guided Rational Design of a Non-Heme Iron-Based Lysine Dioxygenase Improves its Total Turnover Number. Chembiochem 2024; 25:e202400495. [PMID: 39370399 DOI: 10.1002/cbic.202400495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 09/05/2024] [Accepted: 10/04/2024] [Indexed: 10/08/2024]
Abstract
Highly selective C-H functionalization remains an ongoing challenge in organic synthetic methodologies. Biocatalysts are robust tools for achieving these difficult chemical transformations. Biocatalyst engineering has often required directed evolution or structure-based rational design campaigns to improve their activities. In recent years, machine learning has been integrated into these workflows to improve the discovery of beneficial enzyme variants. In this work, we combine a structure-based self-supervised machine learning framework, MutComputeX, with classical molecular dynamics simulations to down select mutations for rational design of a non-heme iron-dependent lysine dioxygenase, LDO. This approach consistently resulted in functional LDO mutants and circumvents the need for extensive study of mutational activity before-hand. Our rationally designed single mutants purified with up to 2-fold higher expression yields than WT and displayed higher total turnover numbers (TTN). Combining five such single mutations into a pentamutant variant, LPNYI LDO, leads to a 40 % improvement in the TTN (218±3) as compared to WT LDO (TTN=160±2). Overall, this work offers a low-barrier approach for those seeking to synergize machine learning algorithms with pre-existing protein engineering strategies.
Affiliation(s)
- R Hunter Wilson
- Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN-55455, United States
- Daniel J Diaz
- Department of Chemistry, Department of Computer Science, University of Texas at Austin, Austin, TX-78705, United States
- Institute for Foundations of Machine Learning, University of Texas at Austin, Austin, TX-78705, United States
- Anoop R Damodaran
- Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN-55455, United States
- Ambika Bhagi-Damodaran
- Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN-55455, United States

3. Tripp A, Braun M, Wieser F, Oberdorfer G, Lechner H. Click, Compute, Create: A Review of Web-based Tools for Enzyme Engineering. Chembiochem 2024; 25:e202400092. [PMID: 38634409 DOI: 10.1002/cbic.202400092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/14/2024] [Accepted: 04/15/2024] [Indexed: 04/19/2024]
Abstract
Enzyme engineering, though pivotal across various biotechnological domains, is often plagued by its time-consuming and labor-intensive nature. This review aims to offer an overview of supportive in silico methodologies for this demanding endeavor. Starting from methods to predict protein structures, to classification of their activity and even the discovery of new enzymes we continue with describing tools used to increase thermostability and production yields of selected targets. Subsequently, we discuss computational methods to modulate both, the activity as well as selectivity of enzymes. Last, we present recent approaches based on cutting-edge machine learning methods to redesign enzymes. With exception of the last chapter, there is a strong focus on methods easily accessible via web-interfaces or simple Python-scripts, therefore readily useable for a diverse and broad community.
Affiliation(s)
- Adrian Tripp
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Markus Braun
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Florian Wieser
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Gustav Oberdorfer
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- BioTechMed, Graz, Austria
- Horst Lechner
- Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- BioTechMed, Graz, Austria

4. Son A, Park J, Kim W, Yoon Y, Lee S, Park Y, Kim H. Revolutionizing Molecular Design for Innovative Therapeutic Applications through Artificial Intelligence. Molecules 2024; 29:4626. [PMID: 39407556 PMCID: PMC11477718 DOI: 10.3390/molecules29194626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2024] [Revised: 09/19/2024] [Accepted: 09/27/2024] [Indexed: 10/20/2024]
Abstract
The field of computational protein engineering has been transformed by recent advancements in machine learning, artificial intelligence, and molecular modeling, enabling the design of proteins with unprecedented precision and functionality. Computational methods now play a crucial role in enhancing the stability, activity, and specificity of proteins for diverse applications in biotechnology and medicine. Techniques such as deep learning, reinforcement learning, and transfer learning have dramatically improved protein structure prediction, optimization of binding affinities, and enzyme design. These innovations have streamlined the process of protein engineering by allowing the rapid generation of targeted libraries, reducing experimental sampling, and enabling the rational design of proteins with tailored properties. Furthermore, the integration of computational approaches with high-throughput experimental techniques has facilitated the development of multifunctional proteins and novel therapeutics. However, challenges remain in bridging the gap between computational predictions and experimental validation and in addressing ethical concerns related to AI-driven protein design. This review provides a comprehensive overview of the current state and future directions of computational methods in protein engineering, emphasizing their transformative potential in creating next-generation biologics and advancing synthetic biology.
Affiliation(s)
- Ahrum Son
- Department of Molecular Medicine, Scripps Research, La Jolla, CA 92037, USA
- Jongham Park
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Woojin Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Yoonki Yoon
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Sangwoon Lee
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Yongho Park
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Hyunsoo Kim
- Department of Bio-AI Convergence, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Department of Convergent Bioscience and Informatics, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- Protein AI Design Institute, Chungnam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea
- SCICS, Prove beyond AI, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea

5. Diaz DJ, Gong C, Ouyang-Zhang J, Loy JM, Wells J, Yang D, Ellington AD, Dimakis AG, Klivans AR. Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations. Nat Commun 2024; 15:6170. [PMID: 39043654 PMCID: PMC11266546 DOI: 10.1038/s41467-024-49780-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Accepted: 06/14/2024] [Indexed: 07/25/2024]
Abstract
Engineering stabilized proteins is a fundamental challenge in the development of industrial and pharmaceutical biotechnologies. We present Stability Oracle: a structure-based graph-transformer framework that achieves SOTA performance on accurately identifying thermodynamically stabilizing mutations. Our framework introduces several innovations to overcome well-known challenges in data scarcity and bias, generalization, and computation time, such as: Thermodynamic Permutations for data augmentation, structural amino acid embeddings to model a mutation with a single structure, a protein structure-specific attention-bias mechanism that makes transformers a viable alternative to graph neural networks. We provide training/test splits that mitigate data leakage and ensure proper model evaluation. Furthermore, to examine our data engineering contributions, we fine-tune ESM2 representations (Prostata-IFML) and achieve SOTA for sequence-based models. Notably, Stability Oracle outperforms Prostata-IFML even though it was pretrained on 2000X less proteins and has 548X less parameters. Our framework establishes a path for fine-tuning structure-based transformers to virtually any phenotype, a necessary task for accelerating the development of protein-based biotechnologies.
Affiliation(s)
- Daniel J Diaz
- UT Austin, Department of Computer Science, Austin, TX, 78712, USA.
- Intelligent Proteins, LLC, Austin, TX, 78712, USA.
- UT Austin, Department of Chemistry, Austin, TX, 78712, USA.
- Chengyue Gong
- UT Austin, Department of Computer Science, Austin, TX, 78712, USA
- James M Loy
- Intelligent Proteins, LLC, Austin, TX, 78712, USA
- UT Austin, Department of Molecular Biosciences, Austin, TX, 78712, USA
- Jordan Wells
- UT Austin, McKetta Department of Chemical Engineering, Austin, TX, 78712, USA
- David Yang
- UT Austin, Department of Molecular Biosciences, Austin, TX, 78712, USA
- Alexandros G Dimakis
- UT Austin, Chandra Family Department of Electrical and Computer Engineering, Austin, TX, 78712, USA
- Adam R Klivans
- UT Austin, Department of Computer Science, Austin, TX, 78712, USA

6. Hunter Wilson R, Damodaran AR, Bhagi-Damodaran A. Machine learning guided rational design of a non-heme iron-based lysine dioxygenase improves its total turnover number. bioRxiv 2024:2024.06.04.597480. [PMID: 38895203 PMCID: PMC11185610 DOI: 10.1101/2024.06.04.597480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Highly selective C-H functionalization remains an ongoing challenge in organic synthetic methodologies. Biocatalysts are robust tools for achieving these difficult chemical transformations. Biocatalyst engineering has often required directed evolution or structure-based rational design campaigns to improve their activities. In recent years, machine learning has been integrated into these workflows to improve the discovery of beneficial enzyme variants. In this work, we combine a structure-based machine-learning algorithm with classical molecular dynamics simulations to down select mutations for rational design of a non-heme iron-dependent lysine dioxygenase, LDO. This approach consistently resulted in functional LDO mutants and circumvents the need for extensive study of mutational activity before-hand. Our rationally designed single mutants purified with up to 2-fold higher yields than WT and displayed higher total turnover numbers (TTN). Combining five such single mutations into a pentamutant variant, LPNYI LDO, leads to a 40% improvement in the TTN (218±3) as compared to WT LDO (TTN = 160±2). Overall, this work offers a low-barrier approach for those seeking to synergize machine learning algorithms with pre-existing protein engineering strategies.
Affiliation(s)
- R Hunter Wilson
- Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, 55455
- Anoop R Damodaran
- Department of Chemistry, University of Minnesota, Twin Cities, Minneapolis, MN, 55455

7. Liu Y, Bender SG, Sorigue D, Diaz DJ, Ellington AD, Mann G, Allmendinger S, Hyster TK. Asymmetric Synthesis of α-Chloroamides via Photoenzymatic Hydroalkylation of Olefins. J Am Chem Soc 2024; 146:7191-7197. [PMID: 38442365 PMCID: PMC11622607 DOI: 10.1021/jacs.4c00927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/07/2024]
Abstract
Photoenzymatic intermolecular hydroalkylations of olefins are highly enantioselective for chiral centers formed during radical termination but poorly selective for centers set in the C-C bond-forming event. Here, we report the evolution of a flavin-dependent "ene"-reductase to catalyze the coupling of α,α-dichloroamides with alkenes to afford α-chloroamides in good yield with excellent chemo- and stereoselectivity. These products can serve as linchpins in the synthesis of pharmaceutically valuable motifs. Mechanistic studies indicate that radical formation occurs by exciting a charge-transfer complex templated by the protein. Precise control over the orientation of molecules within the charge-transfer complex potentially accounts for the observed stereoselectivity. The work expands the types of motifs that can be prepared using photoenzymatic catalysis.
Affiliation(s)
- Yi Liu
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, United States
- Sophie G Bender
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, United States
- Damien Sorigue
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, United States
- Aix-Marseille University, CEA, CNRS, Institute of Biosciences and Biotechnologies, BIAM Cadarache, 13108 Saint-Paul-lez-Durance, France
- Daniel J Diaz
- Department of Chemistry, University of Texas at Austin, Austin, Texas 78712, United States
- Institute for Foundations of Machine Learning, University of Texas at Austin, Austin, Texas 78712, United States
- Andrew D Ellington
- Department of Molecular Bioscience, University of Texas at Austin, Austin, Texas 78712, United States
- Greg Mann
- Novartis Pharm. AG, Basel 4002, Switzerland
- Todd K Hyster
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York 14853, United States

8. d'Oelsnitz S, Diaz DJ, Kim W, Acosta DJ, Dangerfield TL, Schechter MW, Minus MB, Howard JR, Do H, Loy JM, Alper HS, Zhang YJ, Ellington AD. Biosensor and machine learning-aided engineering of an amaryllidaceae enzyme. Nat Commun 2024; 15:2084. [PMID: 38453941 PMCID: PMC10920890 DOI: 10.1038/s41467-024-46356-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 02/22/2024] [Indexed: 03/09/2024]
Abstract
A major challenge to achieving industry-scale biomanufacturing of therapeutic alkaloids is the slow process of biocatalyst engineering. Amaryllidaceae alkaloids, such as the Alzheimer's medication galantamine, are complex plant secondary metabolites with recognized therapeutic value. Due to their difficult synthesis they are regularly sourced by extraction and purification from the low-yielding daffodil Narcissus pseudonarcissus. Here, we propose an efficient biosensor-machine learning technology stack for biocatalyst development, which we apply to engineer an Amaryllidaceae enzyme in Escherichia coli. Directed evolution is used to develop a highly sensitive (EC50 = 20 μM) and specific biosensor for the key Amaryllidaceae alkaloid branchpoint 4'-O-methylnorbelladine. A structure-based residual neural network (MutComputeX) is subsequently developed and used to generate activity-enriched variants of a plant methyltransferase, which are rapidly screened with the biosensor. Functional enzyme variants are identified that yield a 60% improvement in product titer, 2-fold higher catalytic activity, and 3-fold lower off-product regioisomer formation. A solved crystal structure elucidates the mechanism behind key beneficial mutations.
Affiliation(s)
- Simon d'Oelsnitz
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA.
- Synthetic Biology HIVE, Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA.
- Daniel J Diaz
- Department of Chemistry, University of Texas at Austin, Austin, TX, 78712, USA
- Institute for Foundations of Machine Learning, University of Texas at Austin, Austin, TX, 78712, USA
- Wantae Kim
- McKetta Department of Chemical Engineering, University of Texas at Austin, Austin, TX, 78712, USA
- Daniel J Acosta
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- Tyler L Dangerfield
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- Mason W Schechter
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- Matthew B Minus
- Department of Chemistry, Prairie View A&M University, 100 University Dr, Prairie View, TX, 77446, USA
- James R Howard
- Department of Chemistry, University of Texas at Austin, Austin, TX, 78712, USA
- Hannah Do
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- James M Loy
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- Hal S Alper
- McKetta Department of Chemical Engineering, University of Texas at Austin, Austin, TX, 78712, USA
- Y Jessie Zhang
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA
- Andrew D Ellington
- Department of Molecular Biosciences, University of Texas at Austin, Austin, TX, 78712, USA

9. Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States

10. Wang T, Jin X, Lu X, Min X, Ge S, Li S. Empirical validation of ProteinMPNN's efficiency in enhancing protein fitness. Front Genet 2024; 14:1347667. [PMID: 38274106 PMCID: PMC10808456 DOI: 10.3389/fgene.2023.1347667] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 12/20/2023] [Indexed: 01/27/2024]
Abstract
Introduction: Protein engineering, which aims to improve the properties and functions of proteins, holds great research significance and application value. However, current models that predict the effects of amino acid substitutions often perform poorly when evaluated for precision. Recent research has shown that ProteinMPNN, a large-scale pre-training sequence design model based on protein structure, performs exceptionally well. It is capable of designing mutants with structures similar to the original protein. When applied to the field of protein engineering, the diverse designs for mutation positions generated by this model can be viewed as a more precise mutation range. Methods: We collected three biological experimental datasets and compared the design results of ProteinMPNN for wild-type proteins with the experimental datasets to verify the ability of ProteinMPNN in improving protein fitness. Results: The validation on biological experimental datasets shows that ProteinMPNN has the ability to design mutation types with higher fitness in single and multi-point mutations. We have verified the high accuracy of ProteinMPNN in protein engineering tasks from both positive and negative perspectives. Discussion: Our research indicates that using large-scale pre-trained models to design protein mutants provides a new approach for protein engineering, providing strong support for guiding biological experiments and applications in biotechnology.
Affiliation(s)
- Tianshu Wang
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, Xiamen, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University, Xiamen, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, Xiamen University, Xiamen, China
- State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Xiamen University, Xiamen, China
- Xiaocheng Jin
- State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University, Xiamen, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, Xiamen University, Xiamen, China
- State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Xiamen University, Xiamen, China
- School of Public Health, Xiamen University, Xiamen, China
- Xiaoli Lu
- Information and Networking Center, Xiamen University, Xiamen, China
- Xiaoping Min
- School of Informatics, Institute of Artificial Intelligence, Xiamen University, Xiamen, China
- State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University, Xiamen, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, Xiamen University, Xiamen, China
- State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Xiamen University, Xiamen, China
- Shengxiang Ge
- State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University, Xiamen, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, Xiamen University, Xiamen, China
- State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Xiamen University, Xiamen, China
- School of Public Health, Xiamen University, Xiamen, China
- Shaowei Li
- State Key Laboratory of Vaccines for Infectious Diseases, Xiamen University, Xiamen, China
- National Institute of Diagnostics and Vaccine Development in Infectious Diseases, Xiamen University, Xiamen, China
- State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Xiamen University, Xiamen, China
- School of Public Health, Xiamen University, Xiamen, China

11. Kulikova AV, Diaz DJ, Chen T, Cole TJ, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. Sci Rep 2023; 13:13280. [PMID: 37587128 PMCID: PMC10432456 DOI: 10.1038/s41598-023-40247-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 08/07/2023] [Indexed: 08/18/2023]
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Affiliation(s)
- Anastasiya V Kulikova
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Daniel J Diaz
- Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Tianlong Chen
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- T Jeffrey Cole
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- Andrew D Ellington
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Claus O Wilke
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.

12. Kulikova AV, Diaz DJ, Chen T, Jeffrey Cole T, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. bioRxiv 2023:2023.03.20.533508. [PMID: 36993648 PMCID: PMC10055221 DOI: 10.1101/2023.03.20.533508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Affiliation(s)
- Anastasiya V. Kulikova
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Daniel J. Diaz
  - Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Tianlong Chen
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- T. Jeffrey Cole
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
- Andrew D. Ellington
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Claus O. Wilke
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA

13
Diaz DJ, Kulikova AV, Ellington AD, Wilke CO. Using machine learning to predict the effects and consequences of mutations in proteins. Curr Opin Struct Biol 2023; 78:102518. [PMID: 36603229 PMCID: PMC9908841 DOI: 10.1016/j.sbi.2022.102518] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 08/09/2022] [Revised: 11/07/2022] [Accepted: 11/20/2022] [Indexed: 01/05/2023]
Abstract
Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that even though the specifics of the machine learning algorithms matter, the more important constraint comes from the data availability and quality utilized during training. In cases where little experimental data are available, unsupervised and self-supervised pre-training with generic protein datasets can still perform well after subsequent refinement via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.
Affiliation(s)
- Daniel J Diaz
  - Department of Chemistry, The University of Texas at Austin, 105 E 24th St., Austin, 78712, Texas, USA; Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/aiproteins
- Anastasiya V Kulikova
  - Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA
- Andrew D Ellington
  - Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/CSSBatUT
- Claus O Wilke
  - Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA

14
Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J 2022; 21:238-250. [PMID: 36544476 PMCID: PMC9755234 DOI: 10.1016/j.csbj.2022.11.014] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 08/31/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/20/2022]
Abstract
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows one to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Key Words
- ADMM, Alternating Direction Method of Multipliers
- CNN, Convolutional Neural Network
- DL, Deep learning
- Deep learning
- Drug discovery
- FNN, fully-connected neural network
- GAN, Generative Adversarial Network
- GCN, Graph Convolutional Network
- GNN, Graph Neural Network
- GO, Gene Ontology
- GVP, Geometric Vector Perceptron
- LSTM, Long-Short Term Memory
- MLP, Multilayer Perceptron
- MSA, Multiple Sequence Alignment
- NLP, Natural Language Processing
- NSR, Natural Sequence Recovery
- Protein design
- Protein language models
- Protein prediction
- VAE, Variational Autoencoder
- pLM, protein Language Model
Affiliation(s)
- Noelia Ferruz
  - Institute of Informatics and Applications, University of Girona, Girona, Spain
  - Department of Biochemistry, University of Bayreuth, Bayreuth, Germany
- Michael Heinzinger
  - Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- Mehmet Akdel
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Luca Naef
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Christian Dallago
  - Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
  - VantAI, 151 W 42nd Street, New York, NY 10036, United States
  - NVIDIA DE GmbH, Einsteinstraße 172, 81677 München, Germany

15
Linchangco GV, Foley B, Leitner T. Updated HIV-1 Consensus Sequences Change but Stay Within Similar Distance From Worldwide Samples. Front Microbiol 2022; 12:828765. [PMID: 35178042 PMCID: PMC8843389 DOI: 10.3389/fmicb.2021.828765] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/03/2021] [Accepted: 12/20/2021] [Indexed: 12/15/2022]
Abstract
HIV consensus sequences are used in various bioinformatic, evolutionary, and vaccine-related research. Since the previous HIV-1 subtype and CRF consensus sequences were constructed in 2002, the number of publicly available HIV-1 sequences has grown exponentially, especially from non-EU and US countries. Here, we reconstruct 90 new HIV-1 subtype and CRF consensus sequences from 3,470 high-quality, representative, full genome sequences in the LANL HIV database. While subtypes and CRFs are unevenly spread across the world, in total 89 countries were represented. For consensus sequences that were based on at least 20 genomes, we found that on average 2.3% (range 0.8–10%) of the consensus genome site states changed from 2002 to 2021, of which about half were nucleotide state differences and the rest insertions and deletions. Interestingly, the 2021 consensus sequences were shorter than in 2002, and compared to 4,674 HIV-1 worldwide genome sequences, the 2021 consensuses were somewhat closer to the worldwide genome sequences, i.e., showing on average fewer nucleotide state differences. Some subtypes/CRFs have had limited geographical spread, and thus sampling of subtypes/CRFs is uneven, at least in part, due to the epidemiological dynamics. Thus, taken as a whole, the 2021 consensus sequences likely are good representations of the typical subtype/CRF genome nucleotide states. The new consensus sequences are available at the LANL HIV database.
Affiliation(s)
- Gregorio V Linchangco
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, United States
- Brian Foley
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, United States
- Thomas Leitner
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, United States
