1
Govender S, Morgan E, Ramahala R, Lobb K, Bishop NT, Tastan Bishop Ö. Transfer learning towards predicting viral missense mutations: A case study on SARS-CoV-2. Comput Struct Biotechnol J 2025; 27:1686-1692. [PMID: 40352476 PMCID: PMC12063013 DOI: 10.1016/j.csbj.2025.04.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2025] [Revised: 04/16/2025] [Accepted: 04/22/2025] [Indexed: 05/14/2025] Open
Abstract
Understanding viral evolution and predicting future mutations are crucial for overcoming drug resistance and developing long-lasting treatments. Previously, we established machine learning (ML) models using dynamic residue network (DRN) metric data and leveraging a vast amount of existing mutation data from the SARS-CoV-2 main protease (Mpro). Here, we sought to assess the generalizability and robustness of the current models across other SARS-CoV-2 proteins. To achieve this, for the first time, we employed a transfer learning (TL) approach, allowing us to determine the extent to which Mpro-trained models could be applied to other SARS-CoV-2 proteins. The TL results were highly promising, with artificial neural network (ANN) and random forest (RF) correlation coefficients for Mpro closely matching those of NSP10, NSP16, and PLpro. The ANN |R| value for Mpro was 0.564, while NSP10, NSP16, and PLpro had values of 0.533, 0.527, and 0.464, respectively. Similarly, the RF |R| value for Mpro was 0.673, compared to 0.457, 0.460, and 0.437 for NSP10, NSP16, and PLpro, respectively. Interestingly, we did not observe a strong correlation for the spike (S) protein monomer and its domains. The low p-values associated with the |R| values show that the linear correlations between predicted and actual mutation frequencies are statistically significant. This indicates that TL may generalize well across structurally related viral proteins using a DRN-derived ML model trained on Mpro. Overall, we aim to develop a universal ML model for predicting missense mutation frequencies in viral proteins, and this study lays the foundation for that goal.
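As a reading aid for the |R| values quoted in this abstract, the following is a minimal sketch of how a Pearson correlation between predicted and observed mutation frequencies, and the t statistic behind its p-value, could be computed. The function names and the numbers are hypothetical illustrations, not the authors' code or data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for the null hypothesis of no linear correlation.

    It has n - 2 degrees of freedom; a p-value follows from the
    Student t distribution (not computed here). Requires |r| < 1.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical values for illustration only, NOT data from the study.
predicted = [0.10, 0.40, 0.35, 0.80, 0.55]
observed = [0.05, 0.50, 0.30, 0.70, 0.65]

r = pearson_r(predicted, observed)
abs_r = abs(r)  # the abstract reports the magnitude |R|
```

For a fixed |R|, the t statistic grows with the number of residues compared, which is why a moderate correlation can still carry a low p-value.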
Affiliation(s)
- Shaylyn Govender
- Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Emily Morgan
- Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Rabelani Ramahala
- Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- Kevin Lobb
- Department of Chemistry, Rhodes University, Makhanda 6139, South Africa
- Nigel T. Bishop
- Department of Pure and Applied Mathematics, Rhodes University, Makhanda 6139, South Africa
- National Institute for Theoretical and Computational Studies (NITheCS), South Africa
- Özlem Tastan Bishop
- Research Unit in Bioinformatics (RUBi), Department of Biochemistry, Microbiology and Bioinformatics, Rhodes University, Makhanda 6139, South Africa
- National Institute for Theoretical and Computational Studies (NITheCS), South Africa
2
Clark JD, Mi X, Mitchell DA, Shukla D. Substrate prediction for RiPP biosynthetic enzymes via masked language modeling and transfer learning. Digital Discovery 2025; 4:343-354. [PMID: 39649639 PMCID: PMC11622008 DOI: 10.1039/d4dd00170b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Accepted: 11/28/2024] [Indexed: 12/11/2024]
Abstract
Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
Affiliation(s)
- Joseph D Clark
- School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Xuenan Mi
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Douglas A Mitchell
- Department of Biochemistry, Vanderbilt University School of Medicine Nashville TN 37232 USA
- Department of Chemistry, Vanderbilt University Nashville TN 37232 USA
- Diwakar Shukla
- Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Bioengineering, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
- Department of Chemistry, University of Illinois at Urbana-Champaign Urbana IL 61801 USA
3
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024] Open
Abstract
BACKGROUND Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
- Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA.
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA.
- College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA.
- Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA.
4
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline performs comparably to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
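To make the idea of a sparse graphical model concrete, here is a minimal sketch, under stated assumptions, of scoring a mutant with per-site terms on the nodes and pairwise terms on a small set of selected links. All names and parameter values are hypothetical; the sketch only evaluates such a model and does not reproduce the authors' link selection or parameter inference.

```python
def model_score(seq, fields, couplings):
    """Sum of per-site terms plus pairwise terms over the selected links only.

    fields:    {(position, residue): value}                  node parameters
    couplings: {(i, j): {(residue_i, residue_j): value}}     edge parameters
    Missing entries are treated as zero (an assumption of this sketch).
    """
    site = sum(fields.get((i, aa), 0.0) for i, aa in enumerate(seq))
    pair = sum(
        table.get((seq[i], seq[j]), 0.0) for (i, j), table in couplings.items()
    )
    return site + pair


def mutational_effect(wild_type, mutant, fields, couplings):
    """Predicted effect: model score of the mutant minus that of the wild type."""
    return model_score(mutant, fields, couplings) - model_score(
        wild_type, fields, couplings
    )


# Hypothetical toy parameters for a 3-residue protein with one selected link (0, 2).
fields = {
    (0, "A"): 0.5, (0, "G"): 0.1,
    (1, "L"): 0.3, (1, "V"): 0.2,
    (2, "K"): 0.4, (2, "E"): 0.0,
}
couplings = {(0, 2): {("A", "K"): 0.6, ("G", "E"): 0.5}}
wild_type = "ALK"

single_a = mutational_effect(wild_type, "GLK", fields, couplings)
single_b = mutational_effect(wild_type, "ALE", fields, couplings)
double = mutational_effect(wild_type, "GLE", fields, couplings)

# Epistasis: deviation of the double mutant from the sum of the single effects.
epistasis = double - (single_a + single_b)
```

In this toy case each single mutation breaks the favourable wild-type pair, while the double mutant forms a new favourable pair, so its effect is less negative than the sum of the single effects. That non-additivity sits entirely on the one selected link.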
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
5
Yin S, Mi X, Shukla D. Leveraging machine learning models for peptide-protein interaction prediction. RSC Chem Biol 2024; 5:401-417. [PMID: 38725911 PMCID: PMC11078210 DOI: 10.1039/d3cb00208j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 02/07/2024] [Indexed: 05/12/2024] Open
Abstract
Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein-protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide-protein complexes by traditional computational approaches, such as docking and molecular dynamics simulations, remains a challenge due to the high computational cost, the flexible nature of peptides, and the limited structural information available for peptide-protein complexes. In recent years, the surge of available biological data has given rise to the development of an increasing number of machine learning models for predicting peptide-protein interactions. These models offer efficient solutions to address the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide-protein interactions.
Affiliation(s)
- Song Yin
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign Urbana 61801 Illinois USA
- Xuenan Mi
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign Urbana IL 61801 USA
- Diwakar Shukla
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign Urbana 61801 Illinois USA
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign Urbana IL 61801 USA
- Department of Bioengineering, University of Illinois Urbana-Champaign Urbana IL 61801 USA
6
Paul A, Shukla D. Oligomerization of Monoamine Transporters. Subcell Biochem 2024; 104:119-137. [PMID: 38963486 DOI: 10.1007/978-3-031-58843-3_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/05/2024]
Abstract
Transporters of the monoamine transporter (MAT) family regulate the uptake of important neurotransmitters like dopamine, serotonin, and norepinephrine. The MAT family functions using the electrochemical gradient of ions across the membrane and comprises three transporters: dopamine transporter (DAT), serotonin transporter (SERT), and norepinephrine transporter (NET). MAT transporters have been observed in states ranging from monomeric to higher-order oligomeric. Structural features, allosteric modulation, and lipid environment regulate the oligomerization of MAT transporters. NET and SERT oligomerization is regulated by levels of PIP2 present in the membrane. The kink present in TM12 in the MAT family is crucial for dimer interface formation. Allosteric modulation in the dimer interface hinders dimer formation. Oligomerization also influences the transporters' function, trafficking, and regulation. This chapter will focus on recent studies on monoamine transporters and discuss the factors affecting their oligomerization and its impact on their function.
Affiliation(s)
- Arnav Paul
- Department of Chemistry, University of Illinois Urbana-Champaign, Urbana, IL, USA
- Diwakar Shukla
- Department of Chemical and Biomolecular Engineering, Department of Bioengineering, Center for Biophysics and Quantitative Biology, Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
7
Yu T, Boob AG, Singh N, Su Y, Zhao H. In vitro continuous protein evolution empowered by machine learning and automation. Cell Syst 2023; 14:633-644. [PMID: 37224814 DOI: 10.1016/j.cels.2023.04.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 09/04/2022] [Revised: 11/19/2022] [Accepted: 04/20/2023] [Indexed: 05/26/2023]
Abstract
Directed evolution has become one of the most successful and powerful tools for protein engineering. However, the efforts required for designing, constructing, and screening a large library of variants can be laborious, time-consuming, and costly. With the recent advent of machine learning (ML) in the directed evolution of proteins, researchers can now evaluate variants in silico and guide a more efficient directed evolution campaign. Furthermore, recent advancements in laboratory automation have enabled the rapid execution of long, complex experiments for high-throughput data acquisition in both industrial and academic settings, thus providing the means to collect a large quantity of data required to develop ML models for protein engineering. In this perspective, we propose a closed-loop in vitro continuous protein evolution framework that leverages the best of both worlds, ML and automation, and provide a brief overview of the recent developments in the field.
Affiliation(s)
- Tianhao Yu
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA
- Aashutosh Girish Boob
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Nilmani Singh
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Yufeng Su
- NSF Molecule Maker Lab Institute, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
8
Makowski EK, Chen HT, Tessier PM. Simplifying complex antibody engineering using machine learning. Cell Syst 2023; 14:667-675. [PMID: 37591204 PMCID: PMC10733906 DOI: 10.1016/j.cels.2023.04.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/28/2022] [Revised: 03/06/2023] [Accepted: 04/26/2023] [Indexed: 08/19/2023]
Abstract
Machine learning is transforming antibody engineering by enabling the generation of drug-like monoclonal antibodies with unprecedented efficiency. Unsupervised algorithms trained on massive and diverse protein sequence datasets facilitate the prediction of panels of antibody variants with native-like intrinsic properties (e.g., high stability), greatly reducing the amount of subsequent experimentation needed to identify specific candidates that also possess desired extrinsic properties (e.g., high affinity). Additionally, supervised algorithms, which are trained on deep sequencing datasets obtained after enrichment of in vitro antibody libraries for one or more specific extrinsic properties, enable the prediction of antibody variants with desired combinations of extrinsic properties without the need for additional screening. Here we review recent advances using both machine learning approaches and how they are impacting the field of antibody engineering as well as key outstanding challenges and opportunities for these paradigm-changing methods.
Affiliation(s)
- Emily K Makowski
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Hsin-Ting Chen
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Peter M Tessier
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA.
9
Chen L, Zhang Z, Li Z, Li R, Huo R, Chen L, Wang D, Luo X, Chen K, Liao C, Zheng M. Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst 2023; 14:706-721.e5. [PMID: 37591206 DOI: 10.1016/j.cels.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 02/15/2023] [Revised: 05/30/2023] [Accepted: 07/18/2023] [Indexed: 08/19/2023]
Abstract
One of the key points of machine learning-assisted directed evolution (MLDE) is the accurate learning of the fitness landscape, a conceptual mapping from sequence variants to the desired function. Here, we describe a multi-protein training scheme that leverages the existing deep mutational scanning data from diverse proteins to aid in understanding the fitness landscape of a new protein. Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects. Moreover, our study identified previously overlooked strong baselines, and their unexpectedly good performance brings our attention to the pitfalls of MLDE. Overall, these results may improve our understanding of the association between different protein fitness profiles and shed light on developing better machine learning-assisted approaches to the directed evolution of proteins. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Lin Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zehong Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Zhenghao Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; Shanghai Institute for Advanced Immunochemical Studies, School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Rui Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Ruifeng Huo
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Lifan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Xiaomin Luo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Kaixian Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China
- Cangsong Liao
- University of Chinese Academy of Sciences, Beijing 100049, China; Chemical Biology Research Center, Shanghai Institute of Materia Medica, Chinese Academy of Science, Shanghai 201203, China.
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Pharmacy, China Pharmaceutical University, Nanjing 211198, China; School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China.
10
Chan MC, Chan KK, Procko E, Shukla D. Machine Learning Guided Design of High-Affinity ACE2 Decoys for SARS-CoV-2 Neutralization. J Phys Chem B 2023; 127:1995-2001. [PMID: 36827526 PMCID: PMC9999943 DOI: 10.1021/acs.jpcb.3c00469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Revised: 02/03/2023] [Indexed: 02/26/2023]
Abstract
A potential therapeutic strategy for neutralizing SARS-CoV-2 infection is engineering high-affinity soluble ACE2 decoy proteins to compete for binding to the viral spike (S) protein. Previously, a deep mutational scan of ACE2 was performed and led to the identification of a triple mutant variant, named sACE2₂.v.2.4, that exhibits subnanomolar affinity to the receptor-binding domain (RBD) of S. Using a recently developed transfer learning algorithm, TLmutation, we sought to identify other ACE2 variants that may exhibit similar binding affinity with decreased mutational load. Upon training a TLmutation model on the effects of single mutations, we identified multiple ACE2 double mutants that bind SARS-CoV-2 S with tighter affinity than the wild type, most notably L79V;N90D, which binds RBD similarly to ACE2₂.v.2.4. The experimental validation of the double mutants successfully demonstrates the use of machine learning approaches for engineering protein-protein interactions and identifying high-affinity ACE2 peptides for targeting SARS-CoV-2.
Affiliation(s)
- Matthew C. Chan
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Kui K. Chan
- Cyrus Biotechnology, Inc., Seattle, WA, 98101, USA
- Erik Procko
- Cyrus Biotechnology, Inc., Seattle, WA, 98101, USA
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Diwakar Shukla
- Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
11
Masson P, Lushchekina S. Conformational Stability and Denaturation Processes of Proteins Investigated by Electrophoresis under Extreme Conditions. Molecules 2022; 27:6861. [PMID: 36296453 PMCID: PMC9610776 DOI: 10.3390/molecules27206861] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 09/09/2022] [Revised: 10/10/2022] [Accepted: 10/10/2022] [Indexed: 11/17/2022] Open
Abstract
The functional structure of proteins results from marginally stable folded conformations. Reversible unfolding, irreversible denaturation, and deterioration can be caused by chemical and physical agents due to changes in the physicochemical conditions of pH, ionic strength, temperature, pressure, and electric field or due to the presence of a cosolvent that perturbs the delicate balance between stabilizing and destabilizing interactions and eventually induces chemical modifications. For most proteins, denaturation is a complex process involving transient intermediates in several reversible and eventually irreversible steps. Knowledge of protein stability and denaturation processes is mandatory for the development of enzymes as industrial catalysts, biopharmaceuticals, analytical and medical bioreagents, and safe industrial food. Electrophoresis techniques operating under extreme conditions are convenient tools for analyzing unfolding transitions, trapping transient intermediates, and gaining insight into the mechanisms of denaturation processes. Moreover, quantitative analysis of electrophoretic mobility transition curves allows the estimation of the conformational stability of proteins. These approaches include polyacrylamide gel electrophoresis and capillary zone electrophoresis under cold, heat, and hydrostatic pressure and in the presence of non-ionic denaturing agents or stabilizers such as polyols and heavy water. Lastly, after exposure to extremes of physical conditions, electrophoresis under standard conditions provides information on irreversible processes, slow conformational drifts, and slow renaturation processes. The impressive developments of enzyme technology with multiple applications in fine chemistry, biopharmaceutics, and nanomedicine prompted us to revisit the potentialities of these electrophoretic approaches. 
This feature review is illustrated with published and unpublished results obtained by the authors on cholinesterases and paraoxonase, two physiologically and toxicologically important enzymes.
Affiliation(s)
- Patrick Masson
- Biochemical Neuropharmacology Laboratory, Kazan Federal University, Kremlievskaya Str. 18, 420111 Kazan, Russia
- Sofya Lushchekina
- Emanuel Institute of Biochemical Physics, Russian Academy of Sciences, Kosygin Str. 4, 119334 Moscow, Russia
12
Horne J, Shukla D. Recent Advances in Machine Learning Variant Effect Prediction Tools for Protein Engineering. Ind Eng Chem Res 2022; 61:6235-6245. [PMID: 36051311 PMCID: PMC9432854 DOI: 10.1021/acs.iecr.1c04943] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 01/28/2023]
Abstract
Proteins are Nature's molecular machinery and fulfill diverse roles while consisting of chemically similar building blocks. In recent years, protein engineering and design have become important research areas, with many applications in the pharmaceutical, energy, and biocatalysis fields, among others, where the aim is ultimately to create a protein with desired structural and functional properties. It is often critical to model the relationship between a protein's sequence, folded structure, and biological function to assist in such protein engineering pursuits. However, significant challenges remain in concretely mapping an amino acid sequence to specific protein properties and biological activities. Mutations may enhance or diminish molecular protein function, and the epistatic interactions between mutations result in an inherently complex mapping between genetic modifications and protein function. Therefore, estimating the quantitative effects of mutations on protein function(s) remains a grand challenge of biology, bioinformatics, and many related fields, and success would rapidly accelerate protein engineering tasks. Such estimation is often known as variant effect prediction (VEP). Nevertheless, progress has been demonstrated in recent years with the development of machine learning (ML) methods for modeling the relationship between mutations and protein function. In this Review, recent advances in VEP are discussed as tools for protein engineering, focusing on techniques incorporating gains from the broader ML community and challenges in estimating biomolecular functional differences. Primary developments highlighted include convolutional neural networks, graph neural networks, and natural language embeddings for protein sequences.
Affiliation(s)
- Jesse Horne
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering and Department of Bioengineering, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States; Department of Plant Biology, Cancer Center at Illinois, and Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States
|
13
|
Integration of machine learning with computational structural biology of plants. Biochem J 2022; 479:921-928. [PMID: 35484946 DOI: 10.1042/bcj20200942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/17/2021] [Revised: 04/01/2022] [Accepted: 04/06/2022] [Indexed: 11/17/2022]
Abstract
Computational structural biology of proteins has developed rapidly in recent decades with the development of new computational tools and the advancement of computing hardware. However, while these techniques have widely been used to make advancements in human medicine, these methods have seen less utilization in the plant sciences. In the last several years, machine learning methods have gained popularity in computational structural biology. These methods have enabled the development of new tools which are able to address the major challenges that have hampered the wide adoption of the computational structural biology of plants. This perspective examines the remaining challenges in computational structural biology and how the development of machine learning techniques enables more in-depth computational structural biology of plants.
|
14
|
Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 2022; 40:1114-1122. [PMID: 35039677 DOI: 10.1038/s41587-021-01146-5] [Citation(s) in RCA: 92] [Impact Index Per Article: 30.7] [Received: 04/09/2021] [Accepted: 11/02/2021] [Indexed: 01/27/2023]
Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model showed the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
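The combination described in this abstract (ridge regression over site-specific amino acid features plus a single evolutionary density feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the one-hot encoding, the uniform penalty `alpha`, and the toy sequences, density scores, and fitness labels are all assumptions made for illustration, and the density scores stand in for the output of whatever evolutionary model is available.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Site-specific encoding: one indicator per (position, amino acid) pair."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def build_features(seqs, density_scores):
    """Stack one-hot site features and append one density-score column."""
    site = np.stack([one_hot(s) for s in seqs])
    dens = np.asarray(density_scores, dtype=float).reshape(-1, 1)
    return np.hstack([site, dens])

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def predict(X, w):
    """Linear prediction of fitness from the combined features."""
    return X @ w

# Hypothetical toy data: three-residue variants, made-up density scores
# (e.g. log-likelihoods from some evolutionary model) and made-up labels.
train_seqs = ["ACD", "ACE", "AGD", "CCD"]
train_density = [-1.0, -2.0, -1.5, -3.0]
train_fitness = np.array([1.0, 0.5, 0.8, 0.1])

X_train = build_features(train_seqs, train_density)
weights = fit_ridge(X_train, train_fitness, alpha=1.0)
predicted = predict(build_features(["AGE"], [-2.5]), weights)
```

In this sketch a single penalty is applied to every weight; treating the density feature differently from the site features is a design choice the sketch does not attempt to reproduce.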
Affiliation(s)
- Chloe Hsu
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, USA
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA; Center for Computational Biology, University of California, Berkeley, USA
|
15
|
Ovek D, Abali Z, Zeylan ME, Keskin O, Gursoy A, Tuncbag N. Artificial intelligence based methods for hot spot prediction. Curr Opin Struct Biol 2021; 72:209-218. [PMID: 34954608 DOI: 10.1016/j.sbi.2021.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/05/2021] [Revised: 10/07/2021] [Accepted: 11/08/2021] [Indexed: 11/29/2022]
Abstract
Proteins interact through their interfaces to fulfill essential functions in the cell. They bind to their partners in a highly specific manner and form complexes that have a profound effect on understanding the biological pathways they are involved in. Any abnormal interactions may cause diseases. Therefore, the identification of small molecules which modulate protein interactions through their interfaces has high therapeutic potential. However, discovering such molecules is challenging. Most protein-protein binding affinity is attributed to a small set of amino acids found in protein interfaces known as hot spots. Recent studies demonstrate that drug-like small molecules may bind specifically to hot spots. Therefore, hot spot prediction is crucial. As experimental data accumulate, artificial intelligence is beginning to be used for computational hot spot prediction. First, we review machine learning and deep learning for computational hot spot prediction and then explain the significance of hot spots toward drug design.
Affiliation(s)
- Damla Ovek
  - College of Engineering, Koc University, 34450 Istanbul, Turkey
- Zeynep Abali
  - College of Engineering, Koc University, 34450 Istanbul, Turkey
- Ozlem Keskin
  - College of Engineering, Koc University, 34450 Istanbul, Turkey
- Attila Gursoy
  - College of Engineering, Koc University, 34450 Istanbul, Turkey
- Nurcan Tuncbag
  - College of Engineering, Koc University, 34450 Istanbul, Turkey; School of Medicine, Koc University, 34450 Istanbul, Turkey
|
16
|
Chan MC, Chan KK, Procko E, Shukla D. Machine learning guided design of high affinity ACE2 decoys for SARS-CoV-2 neutralization. bioRxiv 2021:2021.12.22.473902. [PMID: 34981064 PMCID: PMC8722601 DOI: 10.1101/2021.12.22.473902] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023]
Abstract
A potential therapeutic strategy for neutralizing SARS-CoV-2 infection is to engineer high-affinity soluble ACE2 decoy proteins that compete for binding of the viral spike (S) protein. Previously, a deep mutational scan of ACE2 was performed and led to the identification of a triple mutant ACE2 variant, named ACE2₂.v.2.4, that exhibits nanomolar affinity binding to the receptor-binding domain (RBD) of S. Using a recently developed transfer learning algorithm, TLmutation, we sought to identify other ACE2 variants, namely double mutants, that may exhibit similar binding affinity with decreased mutational load. Upon training a TLmutation model on the effects of single mutations, we identified several ACE2 double mutants that bind to RBD with tighter affinity than the wild type, most notably L79V;N90D, which binds RBD with similar affinity to ACE2₂.v.2.4. The successful experimental validation of the double mutants demonstrated the use of transfer and supervised learning approaches for engineering protein-protein interactions and identifying high affinity ACE2 peptides for targeting SARS-CoV-2.
Affiliation(s)
- Matthew C Chan
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61081
- Kui K Chan
  - Cyrus Biotechnology, Inc., Seattle, WA, 98101
- Erik Procko
  - Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61081
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61081
|
17
|
Wittmann BJ, Yue Y, Arnold FH. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst 2021; 12:1026-1045.e7. [PMID: 34416172 DOI: 10.1016/j.cels.2021.07.008] [Citation(s) in RCA: 93] [Impact Index Per Article: 23.3] [Received: 12/15/2020] [Revised: 05/06/2021] [Accepted: 07/26/2021] [Indexed: 11/17/2022]
Abstract
Directed evolution of proteins often involves a greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. The efficiency of such a single-step greedy walk depends on the order in which beneficial mutations are identified; that is, the process is path dependent. Here, we investigate and optimize a path-independent machine learning-assisted directed evolution (MLDE) protocol that allows in silico screening of full combinatorial libraries. In particular, we evaluate the importance of different protein encoding strategies, training procedures, models, and training set design strategies on MLDE outcome, finding the most important consideration to be the implementation of strategies that reduce inclusion of minimally informative "holes" (protein variants with zero or extremely low fitness) in training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape, our optimized protocol achieved the global fitness maximum up to 81-fold more frequently than single-step greedy optimization. A record of this paper's transparent peer review process is included in the supplemental information.
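The path dependence of a single-step greedy walk mentioned in this abstract can be shown with a small sketch. Everything here is hypothetical: the two-site landscape, its three-letter alphabet, and the fitness values are made up, and the dictionary lookup stands in for either an experimental measurement or a trained model's prediction. The sketch contrasts the greedy walk with a screen of the full combinatorial library; it does not reproduce the MLDE protocol itself.

```python
from itertools import product

ALPHABET = "ABC"

# Hypothetical two-site epistatic landscape (values invented for illustration).
FITNESS = {
    "AA": 1.0, "AB": 2.0, "AC": 1.5,
    "BA": 3.0, "BB": 1.0, "BC": 2.5,
    "CA": 2.0, "CB": 5.0, "CC": 0.5,
}

def greedy_walk(start, site_order):
    """Visit sites in the given order, fixing the best residue at each site."""
    current = start
    for site in site_order:
        # All single-site variants at this position, including the current one.
        candidates = [current[:site] + aa + current[site + 1:] for aa in ALPHABET]
        current = max(candidates, key=FITNESS.get)
    return current

def exhaustive_best(n_sites):
    """Score every variant in the full combinatorial library and take the best."""
    variants = ("".join(v) for v in product(ALPHABET, repeat=n_sites))
    return max(variants, key=FITNESS.get)

end_site0_first = greedy_walk("AA", (0, 1))  # fixes B at site 0, then stays at "BA"
end_site1_first = greedy_walk("AA", (1, 0))  # fixes B at site 1, then reaches "CB"
global_best = exhaustive_best(2)             # "CB", the landscape maximum
```

On this toy landscape the walk that mutates site 0 first stops at a local optimum, while the walk that mutates site 1 first reaches the global maximum, which is the order dependence that a full-library screen avoids.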
Affiliation(s)
- Bruce J Wittmann
  - Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA
- Yisong Yue
  - Department of Computing and Mathematical Sciences, California Institute of Technology, MC 305-16, 1200 E. California Blvd., Pasadena, CA 91125, USA
- Frances H Arnold
  - Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA; Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Blvd., Pasadena, CA 91125, USA
|
18
|
Luo Y, Jiang G, Yu T, Liu Y, Vo L, Ding H, Su Y, Qian WW, Zhao H, Peng J. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat Commun 2021; 12:5743. [PMID: 34593817 PMCID: PMC8484459 DOI: 10.1038/s41467-021-25976-8] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Received: 04/10/2021] [Accepted: 09/09/2021] [Indexed: 11/28/2022] Open
Abstract
Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. We show that ECNet predicts the sequence-function relationship more accurately as compared to existing machine learning algorithms by using ~50 deep mutational scanning and random mutagenesis datasets. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.
Affiliation(s)
- Yunan Luo
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Guangde Jiang
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Tianhao Yu
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Yang Liu
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Lam Vo
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Hantian Ding
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Yufeng Su
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Wesley Wei Qian
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Huimin Zhao
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
- Jian Peng
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
|
19
|
Evolution-aided engineering of plant specialized metabolism. ABIOTECH 2021; 2:240-263. [PMID: 36303885 PMCID: PMC9590541 DOI: 10.1007/s42994-021-00052-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/05/2021] [Accepted: 06/04/2021] [Indexed: 02/07/2023]
Abstract
The evolution of new traits in living organisms occurs via the processes of mutation, recombination, genetic drift, and selection. These processes that have resulted in the immense biological diversity on our planet are also being employed in metabolic engineering to optimize enzymes and pathways, create new-to-nature reactions, and synthesize complex natural products in heterologous systems. In this review, we discuss two evolution-aided strategies for metabolic engineering: directed evolution, which improves upon existing genetic templates using the evolutionary process, and combinatorial pathway reconstruction, which brings together genes evolved in different organisms into a single heterologous host. We discuss the general principles of these strategies, describe the technologies involved and the molecular traits they influence, provide examples of their use, and discuss the roadblocks that need to be addressed for their wider adoption. A better understanding of these strategies can provide an impetus to research on gene function discovery and biochemical evolution, which is foundational for improved metabolic engineering. These evolution-aided approaches thus have a substantial potential for improving our understanding of plant metabolism in general, for enhancing the production of plant metabolites, and in sustainable agriculture.
|
20
|
Ferguson AL, Hachmann J, Miller TF, Pfaendtner J. The Journal of Physical Chemistry A/ B/ C Virtual Special Issue on Machine Learning in Physical Chemistry. J Phys Chem A 2021; 124:9113-9118. [PMID: 33147969 DOI: 10.1021/acs.jpca.0c09205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
21
|
Narayanan KK, Procko E. Deep Mutational Scanning of Viral Glycoproteins and Their Host Receptors. Front Mol Biosci 2021; 8:636660. [PMID: 33898517 PMCID: PMC8062978 DOI: 10.3389/fmolb.2021.636660] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/01/2020] [Accepted: 03/18/2021] [Indexed: 11/17/2022] Open
Abstract
Deep mutational scanning or deep mutagenesis is a powerful tool for understanding the sequence diversity available to viruses for adaptation in a laboratory setting. It generally involves tracking an in vitro selection of protein sequence variants with deep sequencing to map mutational effects based on changes in sequence abundance. Coupled with any of a number of selection strategies, deep mutagenesis can explore the mutational diversity available to viral glycoproteins, which mediate critical roles in cell entry and are exposed to the humoral arm of the host immune response. Mutational landscapes of viral glycoproteins for host cell attachment and membrane fusion reveal extensive epistasis and potential escape mutations to neutralizing antibodies or other therapeutics, as well as aiding in the design of optimized immunogens for eliciting broadly protective immunity. While less explored, deep mutational scans of host receptors further assist in understanding virus-host protein interactions. Critical residues on the host receptors for engaging with viral spikes are readily identified and may help with structural modeling. Furthermore, mutations may be found for engineering soluble decoy receptors as neutralizing agents that specifically bind viral targets with tight affinity and limited potential for viral escape. By untangling the complexities of how sequence contributes to viral glycoprotein and host receptor interactions, deep mutational scanning is impacting ideas and strategies at multiple levels for combatting circulating and emergent virus strains.
Affiliation(s)
- Erik Procko
  - Department of Biochemistry and Cancer Center at Illinois, University of Illinois, Urbana, IL, United States
|
22
|
Ferguson AL, Hachmann J, Miller TF, Pfaendtner J. The Journal of Physical Chemistry A/ B/ C Virtual Special Issue on Machine Learning in Physical Chemistry. J Phys Chem B 2021; 124:9767-9772. [PMID: 33147970 DOI: 10.1021/acs.jpcb.0c09206] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
|