1
Rossi I, Barducci G, Sanavia T, Turina P, Capriotti E, Fariselli P. Mass balance approximation of unfolding boosts potential-based protein stability predictions. Protein Sci 2025; 34:e70134. [PMID: 40277391 PMCID: PMC12023412 DOI: 10.1002/pro.70134] [Received: 12/05/2024] [Revised: 03/18/2025] [Accepted: 04/08/2025] [Indexed: 04/26/2025]
Abstract
Predicting protein stability changes upon single-point mutations is crucial in computational biology, with applications in drug design, enzyme engineering, and understanding disease mechanisms. While deep-learning approaches have emerged, many remain inaccessible for routine use. In contrast, potential-like methods, including deep-learning-based ones, are faster, user-friendly, and effective in estimating stability changes. However, most of them approximate Gibbs free-energy differences without accounting for the free-energy changes of the unfolded state, violating mass balance and potentially reducing accuracy. Here, we show that incorporating mass balance as a first approximation of the unfolded state significantly improves potential-like methods. While many machine-learning models implicitly or explicitly use mass balance, our findings suggest that a more accurate unfolded-state representation could further enhance stability change predictions.
Affiliation(s)
- Ivan Rossi
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Guido Barducci
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Tiziana Sanavia
  - Department of Medical Sciences, University of Torino, Torino, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
  - Computational Genomics Platform, IRCCS University Hospital of Bologna, Bologna, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, Torino, Italy

2
Liang T, Sun ZY, Ishima R, Xie XQ, Xue Y, Li W, Feng Z. ProstaNet: A Novel Geometric Vector Perceptrons-Graph Neural Network Algorithm for Protein Stability Prediction in Single- and Multiple-Point Mutations with Experimental Validation. Research (Washington, D.C.) 2025; 8:0674. [PMID: 40235597 PMCID: PMC11997553 DOI: 10.34133/research.0674] [Received: 02/24/2025] [Revised: 03/21/2025] [Accepted: 03/23/2025] [Indexed: 04/17/2025]
Abstract
Proteins play a critical role in biology and biopharma due to their specificity and minimal side effects. Predicting the effects of mutations on protein stability is vital but experimentally challenging. Deep learning offers an efficient solution to this problem. In the present work, we introduced ProstaNet, a deep learning framework that predicts stability changes resulting from single- and multiple-point mutations using geometric vector perceptrons-graph neural network for 3-dimensional feature processing. For training ProstaNet, we meticulously crafted ProstaDB, a comprehensive and pristine thermodynamics repository, including 3,784 single-point mutations and 1,642 multiple-point mutations. We also created thermodynamic looping for enlarging the limited data size of multiple-point mutation and applied an innovative clustering method to generate a standard testing set of multiple-point mutation. Besides, we identified residue scoring as the most important encoding method in protein properties prediction. With these innovations, ProstaNet accurately predicts thermostability changes for both single-point and multiple-point mutations without showing any bias. ProstaNet achieves an accuracy of 0.75, outperforming existing methods for single-point mutation prediction, including ThermoMPNN (0.63), PoPMuSiCsym (0.66), MUPRO (0.52), and FoldX (0.71). ProstaNet also achieves a 1.3-fold increase in accuracy compared to FoldX for multiple-point mutation predictions. Validated by experiment, 4 out of 5 single-point mutation predictions (80%) and all multiple-point mutation predictions (100%) for HuJ3 mutants were accurate, demonstrating the potential benefits of ProstaNet for protein engineering and drug development.
Affiliation(s)
- Tianjian Liang
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics and System Pharmacology PharmacoAnalytics, School of Pharmacy, National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Ze-Yu Sun
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics and System Pharmacology PharmacoAnalytics, School of Pharmacy, National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Rieko Ishima
  - Department of Structural Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Xiang-Qun Xie
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics and System Pharmacology PharmacoAnalytics, School of Pharmacy, National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Ying Xue
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics and System Pharmacology PharmacoAnalytics, School of Pharmacy, National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Wei Li
  - Department of Medicine, Center for Antibody Therapeutics, Division of Infectious Diseases, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Zhiwei Feng
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics and System Pharmacology PharmacoAnalytics, School of Pharmacy, National Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA

3
Li D, Zhu Y, Zhang W, Liu J, Yang X, Liu Z, Wei D. AI Prediction of Structural Stability of Nanoproteins Based on Structures and Residue Properties by Mean Pooled Dual Graph Convolutional Network. Interdiscip Sci 2025; 17:101-113. [PMID: 39367992 DOI: 10.1007/s12539-024-00662-7] [Received: 04/21/2024] [Revised: 09/18/2024] [Accepted: 09/22/2024] [Indexed: 10/07/2024]
Abstract
The structural stability of proteins is an important topic in various fields such as biotechnology, pharmaceuticals, and enzymology. Specifically, understanding the structural stability of protein is crucial for protein design. Artificial design, while pursuing high thermodynamic stability and rigidity of proteins, inevitably sacrifices biological functions closely related to protein flexibility. The thermodynamic stability of proteins is not always optimal when they are highest to perfectly perform their biological functions. Extensive theoretical and experimental screening is often required to obtain stable protein structures. Thus, it becomes critically important to develop a stability prediction model based on the balance between protein stability and bioactivity. To design protein drugs with better functionality in a broader structural space, a novel protein structural stability predictor called PSSP has been developed in this study. PSSP is a mean pooled dual graph convolutional network (GCN) model based on sequence characteristics and secondary structure, distance matrix, graph, and residue properties of a nanoprotein to provide rapid prediction and judgment. This model exhibits excellent robustness in predicting the structural stability of nanoproteins. Comparing with previous artificial intelligence algorithms, the results indicate this model can provide a rapid and accurate assessment of the structural stability of artificially designed proteins, which shows the great promises for promoting the robust development of protein design.
Affiliation(s)
- Daixi Li
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
  - Pengcheng Laboratory, Shenzhen, 518055, China
- Yuqi Zhu
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Wujie Zhang
  - Chemical and Biomolecular Engineering Program, Physics and Chemistry Department, Milwaukee School of Engineering, Milwaukee, 53202, USA
- Jing Liu
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Xiaochen Yang
  - Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
- Zhihong Liu
  - Pingshan Translational Medicine Center, Shenzhen Bay Laboratory, Shenzhen, 518118, China
- Dongqing Wei
  - Pengcheng Laboratory, Shenzhen, 518055, China
  - State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation, Center On Antibacterial Resistances, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China

4
Turina P, Petrosino M, Enriquez Sandoval CA, Novak L, Pasquo A, Alexov E, Alladin MA, Ascher DB, Babbi G, Bakolitsa C, Casadio R, Cheng J, Fariselli P, Folkman L, Kamandula A, Katsonis P, Li M, Li D, Lichtarge O, Mahmud S, Martelli PL, Pal D, Panday SK, Pires DEV, Portelli S, Pucci F, Rodrigues CHM, Rooman M, Savojardo C, Schwersensky M, Shen Y, Strokach AV, Sun Y, Woo J, Radivojac P, Brenner SE, Chiaraluce R, Consalvi V, Capriotti E. Assessing the predicted impact of single amino acid substitutions in MAPK proteins for CAGI6 challenges. Hum Genet 2025; 144:265-280. [PMID: 39976676 PMCID: PMC11975483 DOI: 10.1007/s00439-024-02724-8] [Received: 05/30/2024] [Accepted: 12/27/2024] [Indexed: 03/05/2025]
Abstract
New thermodynamic and functional studies have been recently conducted to evaluate the impact of amino acid substitutions on the Mitogen Activated Protein Kinases 1 and 3 (MAPK1/3). The Critical Assessment of Genome Interpretation (CAGI) data provider, at Sapienza University of Rome, measured the unfolding free energy and the enzymatic activity of a set of variants (MAPK challenge dataset). Thermodynamic measurements for the denaturant-induced equilibrium unfolding of the phosphorylated and unphosphorylated forms of the MAPKs were obtained by monitoring the far-UV circular dichroism and intrinsic fluorescence changes as a function of denaturant concentration. These values have been used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero concentration of denaturant (ΔΔG^H2O). The enzymatic activity of the phosphorylated MAPKs variants was also measured using Chelation-Enhanced Fluorescence to monitor the phosphorylation of a peptide substrate. The MAPK challenge dataset, composed of a total of 23 single amino acid substitutions (11 and 12 for MAPK1 and MAPK3, respectively), was used to assess the effectiveness of the computational methods in predicting the ΔΔG^H2O values, associated with the variants, and categorize them as destabilizing and not destabilizing. The data on the enzymatic activity of the MAPKs mutants were used to assess the performance of the methods for predicting the functional impact of the variants. For the sixth edition of CAGI, thirteen independent research groups from four continents (Asia, Australia, Europe and North America) submitted >80 sets of predictions, obtained from different approaches. In this manuscript, we summarized the results of our assessment to highlight the possible limitations of the available algorithms.
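As a reading aid, the quantity assessed in this challenge, the change in unfolding free energy between variant and wild type at zero denaturant concentration, can be written out explicitly. This is only a sketch: the sign convention (variant minus wild type, so that negative values correspond to destabilization) is an assumption, since the abstract does not state it.

```latex
\Delta\Delta G^{H_2O}
  = \Delta G^{H_2O}_{\text{variant}} - \Delta G^{H_2O}_{\text{wild-type}}
```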
Affiliation(s)
- Paola Turina
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
- Maria Petrosino
  - Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
- Leonore Novak
  - Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
- Alessandra Pasquo
  - Diagnostics and Metrology Laboratory FSN-TECFIS-DIM, ENEA CR Frascati, 00044, Frascati, Italy
- Emil Alexov
  - Department of Physics and Astronomy, Clemson University, Clemson, SC, 29634, USA
- Muttaqi Ahmad Alladin
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, 560012, India
- David B Ascher
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
  - School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
- Giulia Babbi
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
- Constantina Bakolitsa
  - Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Rita Casadio
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
- Jianlin Cheng
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO, 65211, USA
- Piero Fariselli
  - Department of Medical Sciences, University of Torino, 10126, Torino, Italy
- Lukas Folkman
  - Institute for Integrated and Intelligent Systems, Griffith University, Southport, QLD, 4222, Australia
- Akash Kamandula
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
- Panagiotis Katsonis
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Minghui Li
  - School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, 215123, Jiangsu, China
- Dong Li
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
- Olivier Lichtarge
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Sajid Mahmud
  - Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO, 65211, USA
- Pier Luigi Martelli
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
- Debnath Pal
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, 560012, India
- Douglas E V Pires
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3053, Australia
- Stephanie Portelli
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
  - School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
- Carlos H M Rodrigues
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
- Marianne Rooman
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
- Castrense Savojardo
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
- Martin Schwersensky
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
- Alexey V Strokach
  - Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
- Predrag Radivojac
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
- Steven E Brenner
  - Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
  - Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, 94720, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
- Roberta Chiaraluce
  - Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
- Valerio Consalvi
  - Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
- Emidio Capriotti
  - Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
  - Computational Genomics Platform, IRCCS University Hospital of Bologna, 40138, Bologna, Italy

5
Li Z, Luo Y. Rewiring protein sequence and structure generative models to enhance protein stability prediction. bioRxiv 2025:2025.02.13.638154. [PMID: 40027759 PMCID: PMC11870403 DOI: 10.1101/2025.02.13.638154] [Indexed: 03/05/2025]
Abstract
Predicting changes in protein thermostability due to amino acid substitutions is essential for understanding human diseases and engineering useful proteins for clinical and industrial applications. While recent advances in protein generative models, which learn probability distributions over amino acids conditioned on structural or evolutionary sequence contexts, have shown impressive performance in predicting various protein properties without task-specific training, their strong unsupervised prediction ability does not extend to all protein functions. In particular, their potential to improve protein stability prediction remains underexplored. In this work, we present SPURS, a novel deep learning framework that adapts and integrates two general-purpose protein generative models, a protein language model (ESM) and an inverse folding model (ProteinMPNN), into an effective stability predictor. SPURS employs a lightweight neural network module to rewire per-residue structure representations learned by ProteinMPNN into the attention layers of ESM, thereby informing and enhancing ESM's sequence representation learning. This rewiring strategy enables SPURS to harness evolutionary patterns from both sequence and structure data, where the sequence likelihood distribution learned by ESM is conditioned on structure priors encoded by ProteinMPNN to predict mutation effects. We steer this integrated framework to a stability prediction model through supervised training on a recently released mega-scale thermostability dataset. Evaluations across 12 benchmark datasets showed that SPURS delivers accurate, rapid, scalable, and generalizable stability predictions, consistently outperforming current state-of-the-art methods. Notably, SPURS demonstrates remarkable versatility in protein stability and function analyses: when combined with a protein language model, it accurately identifies protein functional sites in an unsupervised manner. Additionally, it enhances current low-N protein fitness prediction models by serving as a stability prior model to improve accuracy. These results highlight SPURS as a powerful tool to advance current protein stability prediction and machine learning-guided protein engineering workflows. The source code of SPURS is available at https://github.com/luo-group/SPURS.

6
Savojardo C, Manfredi M, Martelli PL, Casadio R. DDGemb: predicting protein stability change upon single- and multi-point variations with embeddings and deep learning. Bioinformatics 2024; 41:btaf019. [PMID: 39799516 PMCID: PMC11783275 DOI: 10.1093/bioinformatics/btaf019] [Received: 11/14/2024] [Revised: 11/14/2024] [Accepted: 01/10/2025] [Indexed: 01/15/2025]
Abstract
MOTIVATION: The knowledge of protein stability upon residue variation is an important step for functional protein design and for understanding how protein variants can promote disease onset. Computational methods are important to complement experimental approaches and allow a fast screening of large datasets of variations. RESULTS: In this work, we present DDGemb, a novel method combining protein language model embeddings and transformer architectures to predict protein ΔΔG upon both single- and multi-point variations. DDGemb has been trained on a high-quality dataset derived from literature and tested on available benchmark datasets of single- and multi-point variations. DDGemb performs at the state of the art in both single- and multi-point variations. AVAILABILITY AND IMPLEMENTATION: DDGemb is available as web server at https://ddgemb.biocomp.unibo.it. Datasets used in this study are available at https://ddgemb.biocomp.unibo.it/datasets.
Affiliation(s)
- Castrense Savojardo
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, Bologna, 40126, Italy
- Matteo Manfredi
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, Bologna, 40126, Italy
- Pier Luigi Martelli
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, Bologna, 40126, Italy
- Rita Casadio
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Via San Giacomo 9/2, Bologna, 40126, Italy
  - The Alma Climate Institute, Interdepartmental Center, University of Bologna, Bologna, 40100, Italy

7
Xu Y, Liu D, Gong H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nature Computational Science 2024; 4:840-850. [PMID: 39455825 DOI: 10.1038/s43588-024-00716-2] [Received: 08/29/2023] [Accepted: 10/03/2024] [Indexed: 10/28/2024]
Abstract
Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for the prediction of fitness score, ΔΔG and ΔTm of a protein upon mutations, respectively. GeoFitness engages a specialized loss function to allow supervised training of a unified model using the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reutilized as a pre-trained module in GeoDDG and GeoDTm to overcome the challenge of lacking sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
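The benchmark comparison above is reported as a Spearman correlation coefficient. As a rough illustration of how such a score might be computed between experimental and predicted ΔΔG values, here is a minimal pure-Python sketch. It is not the authors' evaluation code; the function names and the five example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values in kcal/mol, for illustration only (not from the paper).
experimental_ddg = [-1.2, 0.3, -2.5, 0.8, -0.4]
predicted_ddg = [-0.9, 0.1, -1.8, 1.1, -1.0]
rho = spearman(experimental_ddg, predicted_ddg)
print(f"Spearman rho = {rho:.2f}")
```

Because the coefficient depends only on ranks, a predictor can score well even when its predicted values are on a different scale from the measurements, which is one reason rank correlations are common in this kind of benchmark.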
Affiliation(s)
- Yunxin Xu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Di Liu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Haipeng Gong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China

8
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: trends from three decades of genetic variant impact predictors. Hum Genomics 2024; 18:90. [PMID: 39198917 PMCID: PMC11360829 DOI: 10.1186/s40246-024-00663-z] [Received: 06/22/2024] [Accepted: 08/19/2024] [Indexed: 09/01/2024]
Abstract
BACKGROUND: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). RESULTS: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past three decades, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 190 VIPs, resulting in a total of 407 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. CONCLUSIONS: VIPdb version 2 summarizes 407 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. VIPdb is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Arul S Menon
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA
  - Illumina, Foster City, CA, 94404, USA
- Steven E Brenner
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, 94720, USA
  - Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, CA, 94720, USA
  - Department of Plant and Microbial Biology, University of California, 111 Koshland Hall #3102, Berkeley, CA, 94720-3102, USA

9
Cuturello F, Celoria M, Ansuini A, Cazzaniga A. Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models. Bioinformatics 2024; 40:btae447. [PMID: 39012369 PMCID: PMC11269464 DOI: 10.1093/bioinformatics/btae447] [Received: 04/16/2024] [Revised: 06/19/2024] [Accepted: 07/10/2024] [Indexed: 07/17/2024]
Abstract
MOTIVATION: Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a Protein Language Model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data pre-processing to mitigate overfitting. RESULTS: We ensure comprehensive comparisons by fine-tuning various pre-trained models, taking advantage of analyses such as ablation studies and baselines evaluation. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations. AVAILABILITY AND IMPLEMENTATION: Code and data at https://github.com/RitAreaSciencePark/PLM4Muts. SUPPLEMENTARY INFORMATION: Supplementary Information is available at Bioinformatics online.
Affiliation(s)
- Francesca Cuturello
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Marco Celoria
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
  - HPC Department, CINECA National Supercomputing Center, Bologna 40033, Italy
- Alessio Ansuini
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Alberto Cazzaniga
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy

10
Lin YJ, Menon AS, Hu Z, Brenner SE. Variant Impact Predictor database (VIPdb), version 2: Trends from 25 years of genetic variant impact predictors. bioRxiv 2024:2024.06.25.600283. [PMID: 38979289 PMCID: PMC11230257 DOI: 10.1101/2024.06.25.600283] [Indexed: 07/10/2024]
Abstract
Background: Variant interpretation is essential for identifying patients' disease-causing genetic variants amongst the millions detected in their genomes. Hundreds of Variant Impact Predictors (VIPs), also known as Variant Effect Predictors (VEPs), have been developed for this purpose, with a variety of methodologies and goals. To facilitate the exploration of available VIP options, we have created the Variant Impact Predictor database (VIPdb). Results: The Variant Impact Predictor database (VIPdb) version 2 presents a collection of VIPs developed over the past 25 years, summarizing their characteristics, ClinGen calibrated scores, CAGI assessment results, publication details, access information, and citation patterns. We previously summarized 217 VIPs and their features in VIPdb in 2019. Building upon this foundation, we identified and categorized an additional 186 VIPs, resulting in a total of 403 VIPs in VIPdb version 2. The majority of the VIPs have the capacity to predict the impacts of single nucleotide variants and nonsynonymous variants. More VIPs tailored to predict the impacts of insertions and deletions have been developed since the 2010s. In contrast, relatively few VIPs are dedicated to the prediction of splicing, structural, synonymous, and regulatory variants. The increasing rate of citations to VIPs reflects the ongoing growth in their use, and the evolving trends in citations reveal development in the field and individual methods. Conclusions: VIPdb version 2 summarizes 403 VIPs and their features, potentially facilitating VIP exploration for various variant interpretation applications. Availability: VIPdb version 2 is available at https://genomeinterpretation.org/vipdb.
Affiliation(s)
- Yu-Jen Lin
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - Center for Computational Biology, University of California, Berkeley, California 94720, USA
- Arul S. Menon
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
- Zhiqiang Hu
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
  - Currently at: Illumina, Foster City, California 94404, USA
- Steven E. Brenner
  - Department of Molecular and Cell Biology, University of California, Berkeley, California 94720, USA
  - Center for Computational Biology, University of California, Berkeley, California 94720, USA
  - College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, USA
  - Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA
11
Aneli S, Ceccatelli Berti C, Gilea AI, Birolo G, Mutti G, Pavesi A, Baruffini E, Goffrini P, Capelli C. Functional characterization of archaic-specific variants in mitonuclear genes: insights from comparative analysis in S. cerevisiae. Hum Mol Genet 2024; 33:1152-1163. [PMID: 38558123 DOI: 10.1093/hmg/ddae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 02/29/2024] [Accepted: 03/14/2024] [Indexed: 04/04/2024] Open
Abstract
Neanderthal and Denisovan hybridisation with modern humans has generated a non-random genomic distribution of introgressed regions, the result of drift and selection dynamics. Cross-species genomic incompatibility and more efficient removal of slightly deleterious archaic variants have been proposed as selection-based processes involved in the post-hybridisation purge of archaic introgressed regions. Both scenarios require the presence of functionally different alleles across Homo species onto which selection operated differently according to which populations hosted them, but only a few of these variants have been pinpointed so far. In order to identify functionally divergent archaic variants removed in humans, we focused on mitonuclear genes, which are underrepresented in the genomic landscape of archaic humans. We searched for non-synonymous, fixed, archaic-derived variants present in mitonuclear genes, rare or absent in human populations. We then compared the functional impact of archaic and human variants in the model organism Saccharomyces cerevisiae. Notably, a variant within the mitochondrial tyrosyl-tRNA synthetase 2 (YARS2) gene exhibited a significant decrease in respiratory activity and a substantial reduction of Cox2 levels, a proxy for mitochondrial protein biosynthesis, coupled with the accumulation of the YARS2 protein precursor and a lower amount of mature enzyme. Our work suggests that this variant is associated with mitochondrial functionality impairment, thus contributing to the purging of archaic introgression in YARS2. While different molecular mechanisms may have impacted other mitonuclear genes, our approach can be extended to the functional screening of mitonuclear genetic variants present across species and populations.
Affiliation(s)
- Serena Aneli
  - Department of Public Health Sciences and Pediatrics, University of Turin, C.so Galileo Galilei 22, Turin 10126, Italy
- Camilla Ceccatelli Berti
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
- Alexandru Ionut Gilea
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
- Giovanni Birolo
  - Department of Medical Sciences, University of Turin, Via Santena 5, Turin 10126, Italy
- Giacomo Mutti
  - Barcelona Supercomputing Centre (BSC-CNS), Department of Life Sciences, Plaça Eusebi Güell, 1-3, Barcelona 08034, Spain
  - Institute for Research in Biomedicine (IRB Barcelona), Department of Mechanisms of Disease, The Barcelona Institute of Science and Technology, Baldiri Reixac, 10, Barcelona 08028, Spain
- Angelo Pavesi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
- Enrico Baruffini
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
- Paola Goffrini
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
- Cristian Capelli
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parco Area delle Scienze 11/a, Parma 43124, Italy
  - Department of Biology, University of Oxford, 11a Mansfield Rd, Oxford OX1 3SZ, United Kingdom
12
Sun X, Yang S, Wu Z, Su J, Hu F, Chang F, Li C. PMSPcnn: Predicting protein stability changes upon single point mutations with convolutional neural network. Structure 2024; 32:838-848.e3. [PMID: 38508191 DOI: 10.1016/j.str.2024.02.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 12/19/2023] [Accepted: 02/22/2024] [Indexed: 03/22/2024]
Abstract
Protein missense mutations and resulting protein stability changes are important causes for many human genetic diseases. However, the accurate prediction of stability changes due to mutations remains a challenging problem. To address this problem, we have developed an unbiased effective model: PMSPcnn that is based on a convolutional neural network. We have included an anti-symmetry property to build a balanced training dataset, which improves the prediction, in particular for stabilizing mutations. Persistent homology, which is an effective approach for characterizing protein structures, is used to obtain topological features. Additionally, a regression stratification cross-validation scheme has been proposed to improve the prediction for mutations with extreme ΔΔG. For three test datasets: Ssym, p53, and myoglobin, PMSPcnn achieves a better performance than currently existing predictors. PMSPcnn also outperforms currently available methods for membrane proteins. Overall, PMSPcnn is a promising method for the prediction of protein stability changes caused by single point mutations.
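The anti-symmetry property mentioned in this abstract states that a reverse mutation should carry the opposite ΔΔG of its forward mutation. A minimal Python sketch of balancing a training set on that basis follows. The record layout and the names `Mutation`, `reverse`, and `augment_with_reverse` are hypothetical illustrations, not PMSPcnn's actual pipeline, and the sketch ignores the fact that a reverse mutation starts from the mutant structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record: wild-type residue, position, mutant residue, ddG in kcal/mol."""
    wild: str
    position: int
    mutant: str
    ddg: float


def reverse(m: Mutation) -> Mutation:
    """Build the reverse mutation: swap the residues and negate ddG (anti-symmetry)."""
    return Mutation(wild=m.mutant, position=m.position, mutant=m.wild, ddg=-m.ddg)


def augment_with_reverse(data: List[Mutation]) -> List[Mutation]:
    """Return the original records followed by one reverse record per original."""
    return data + [reverse(m) for m in data]
```

Under this construction every ΔΔG value appears once with each sign, so the augmented set has a mean ΔΔG of zero regardless of how skewed the original measurements were.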
Affiliation(s)
- Xiaohan Sun
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Shuang Yang
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Zhixiang Wu
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Jingjie Su
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Fangrui Hu
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Fubin Chang
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
- Chunhua Li
  - College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China
13
Qiu Y, Huang T, Cai YD. Review of predicting protein stability changes upon variations. Proteomics 2024; 24:e2300371. [PMID: 38643379 DOI: 10.1002/pmic.202300371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 04/07/2024] [Accepted: 04/08/2024] [Indexed: 04/22/2024]
Abstract
Forecasting alterations in protein stability caused by variations holds immense importance. Improving the thermal stability of proteins is important for biomedical and industrial applications. This review discusses the latest methods for predicting the effects of mutations on protein stability, databases containing protein mutations and thermodynamic parameters, and experimental techniques for efficiently assessing protein stability in high-throughput settings. Various publicly available databases for protein stability prediction are introduced. Furthermore, state-of-the-art computational approaches for anticipating protein stability changes due to variants are reviewed. Each method's types of features, base algorithm, and prediction results are also detailed. Additionally, some experimental approaches for verifying the prediction results of computational methods are introduced. Finally, the review summarizes the progress and challenges of protein stability prediction and discusses potential models for future research directions.
Affiliation(s)
- Yiling Qiu
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, China
- Tao Huang
  - Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, China
14
Wagner A. Genotype sampling for deep-learning assisted experimental mapping of a combinatorially complete fitness landscape. Bioinformatics 2024; 40:btae317. [PMID: 38745436 PMCID: PMC11132821 DOI: 10.1093/bioinformatics/btae317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Revised: 03/21/2024] [Accepted: 05/14/2024] [Indexed: 05/16/2024] Open
Abstract
MOTIVATION: Experimental characterization of fitness landscapes, which map genotypes onto fitness, is important for both evolutionary biology and protein engineering. It faces a fundamental obstacle in the astronomical number of genotypes whose fitness needs to be measured for any one protein. Deep learning may help to predict the fitness of many genotypes from a smaller neural network training sample of genotypes with experimentally measured fitness. Here I use a recently published experimentally mapped fitness landscape of more than 260 000 protein genotypes to ask how such sampling is best performed. RESULTS: I show that multilayer perceptrons, recurrent neural networks, convolutional networks, and transformers, can explain more than 90% of fitness variance in the data. In addition, 90% of this performance is reached with a training sample comprising merely ≈10³ sequences. Generalization to unseen test data is best when training data is sampled randomly and uniformly, or sampled to minimize the number of synonymous sequences. In contrast, sampling to maximize sequence diversity or codon usage bias reduces performance substantially. These observations hold for more than one network architecture. Simple sampling strategies may perform best when training deep learning neural networks to map fitness landscapes from experimental data. AVAILABILITY AND IMPLEMENTATION: The fitness landscape data analyzed here is publicly available as described previously (Papkou et al. 2023). All code used to analyze this landscape is publicly available at https://github.com/andreas-wagner-uzh/fitness_landscape_sampling.
Affiliation(s)
- Andreas Wagner
  - Department of Evolutionary Biology and Environmental Studies, University of Zurich, 8057 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, Quartier Sorge-Batiment Genopode, 1015 Lausanne, Switzerland
  - The Santa Fe Institute, Santa Fe, 87501 NM, United States
15
Dieckhaus H, Brocidiacono M, Randolph NZ, Kuhlman B. Transfer learning to leverage larger datasets for improved prediction of protein stability changes. Proc Natl Acad Sci U S A 2024; 121:e2314853121. [PMID: 38285937 PMCID: PMC10861915 DOI: 10.1073/pnas.2314853121] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Received: 08/27/2023] [Accepted: 12/26/2023] [Indexed: 01/31/2024] Open
Abstract
Amino acid mutations that lower a protein's thermodynamic stability are implicated in numerous diseases, and engineered proteins with enhanced stability can be important in research and medicine. Computational methods for predicting how mutations perturb protein stability are, therefore, of great interest. Despite recent advancements in protein design using deep learning, in silico prediction of stability changes has remained challenging, in part due to a lack of large, high-quality training datasets for model development. Here, we describe ThermoMPNN, a deep neural network trained to predict stability changes for protein point mutations given an initial structure. In doing so, we demonstrate the utility of a recently released megascale stability dataset for training a robust stability model. We also employ transfer learning to leverage a second, larger dataset by using learned features extracted from ProteinMPNN, a deep neural network trained to predict a protein's amino acid sequence given its three-dimensional structure. We show that our method achieves state-of-the-art performance on established benchmark datasets using a lightweight model architecture that allows for rapid, scalable predictions. Finally, we make ThermoMPNN readily available as a tool for stability prediction and design.
Affiliation(s)
- Henry Dieckhaus
  - Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599
  - Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy, Chapel Hill, NC 27599
- Michael Brocidiacono
  - Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy, Chapel Hill, NC 27599
- Nicholas Z. Randolph
  - Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599
  - Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, NC 27599
- Brian Kuhlman
  - Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, NC 27599
  - Department of Bioinformatics and Computational Biology, University of North Carolina School of Medicine, Chapel Hill, NC 27599
  - Lineberger Comprehensive Cancer Center, University of North Carolina School of Medicine, Chapel Hill, NC 27599
16
Li G, Yao S, Fan L. ProSTAGE: Predicting Effects of Mutations on Protein Stability by Using Protein Embeddings and Graph Convolutional Networks. J Chem Inf Model 2024; 64:340-347. [PMID: 38166383 PMCID: PMC10806799 DOI: 10.1021/acs.jcim.3c01697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 12/11/2023] [Accepted: 12/12/2023] [Indexed: 01/04/2024]
Abstract
Protein thermodynamic stability is essential to clarify the relationships among structure, function, and interaction. Therefore, developing a faster and more accurate method to predict the impact of the mutations on protein stability is helpful for protein design and understanding the phenotypic variation. Recent studies have shown that protein embedding will be particularly powerful at modeling sequence information with context dependence, such as subcellular localization, variant effect, and secondary structure prediction. Herein, we introduce a novel method, ProSTAGE, which is a deep learning method that fuses structure and sequence embedding to predict protein stability changes upon single point mutations. Our model combines graph-based techniques and language models to predict stability changes. Moreover, ProSTAGE is trained on a larger data set, which is almost twice as large as the most used S2648 data set. It consistently outperforms all existing state-of-the-art methods on mutation-affected problems as benchmarked on several independent data sets. The protein embedding as the prediction input achieves better results than the previous results, which shows the potential of protein language models in predicting the effect of mutations on proteins. ProSTAGE is implemented as a user-friendly web server.
Affiliation(s)
- Gen Li
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
- Sijie Yao
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
- Long Fan
  - Production and R&D Center I of LSS, GenScript (Shanghai) Biotech Co., Ltd., Shanghai 200131, China
17
Sanavia T, Turina P, Morante S, Consalvi V, Lesk AM, Bakolitsa C, Dell'Orco D. Editorial: Computational and experimental protein variant interpretation in the era of precision medicine. Front Mol Biosci 2024; 11:1363813. [PMID: 38293601 PMCID: PMC10823009 DOI: 10.3389/fmolb.2024.1363813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2023] [Accepted: 01/04/2024] [Indexed: 02/01/2024] Open
Affiliation(s)
- Tiziana Sanavia
  - Department of Medical Sciences, Computational Biomedicine Unit, University of Torino, Torino, Italy
- Paola Turina
  - Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
- Silvia Morante
  - Department of Physics, University of Roma Tor Vergata, Roma, Italy
  - Istituto Nazionale di Fisica Nucleare, University of Roma Tor Vergata, Roma, Italy
- Valerio Consalvi
  - Department of Biochemical Sciences “A. Rossi Fanelli”, Sapienza University of Roma, Roma, Italy
- Arthur M. Lesk
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, United States
- Constantina Bakolitsa
  - Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, Berkeley, CA, United States
- Daniele Dell'Orco
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biological Chemistry, University of Verona, Verona, Italy
18
Liu B, Jiang Y, Yang Y, Chen JX. OmeDDG: Improved Protein Mutation Stability Prediction Based on Predicted 3D Structures. J Phys Chem B 2024; 128:67-76. [PMID: 38130113 DOI: 10.1021/acs.jpcb.3c05601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Determining changes in the protein's thermal stability following mutations is critical in protein engineering and understanding pathogenic missense mutations. Despite the development of various computational methods to predict the effects of single-point mutations, their accuracy remains limited. In this study, we propose a new computational method, OmeDDG, that more accurately predicts mutation-induced Gibbs free energy changes in protein folding (ΔΔG). OmeDDG takes the sequences of wild-type and mutant proteins as input, utilizes OmegaFold to obtain the 3D structure, employs a convolutional neural network to extract structural features, and combines them with protein mutation features and pretraining features to predict the stability of single-point mutations in proteins. We performed a comprehensive comparison between OmeDDG and other available prediction methods on four blind test datasets, confirming that OmeDDG can effectively enhance protein mutation prediction performance. Notably, on the antisymmetric dataset Ssym, OmeDDG achieves the best performance, demonstrating favorable antisymmetry with PCC = 0.79 and RMSE = 0.96 for forward mutations and PCC = 0.77 and RMSE = 0.97 for reverse mutant types.
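The PCC and RMSE figures quoted for forward and reverse mutations can be computed as in the Python sketch below, which uses only the standard library. It also includes two antisymmetry checks that are common in this literature but are assumptions here rather than the authors' stated protocol: the correlation between forward and reverse predictions (ideally −1) and the mean bias of their sum (ideally 0). The function names are illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (PCC) between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root-mean-square error between predicted and experimental ddG values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


def antisymmetry_bias(forward_pred: Sequence[float], reverse_pred: Sequence[float]) -> float:
    """Mean of (forward + reverse) / 2; zero for a perfectly antisymmetric predictor."""
    total = sum(f + r for f, r in zip(forward_pred, reverse_pred))
    return total / (2 * len(forward_pred))
```

A predictor would be scored twice, once with `pearson` and `rmse` on the forward set and once on the reverse set, and then checked with `pearson(forward_pred, reverse_pred)` and `antisymmetry_bias(forward_pred, reverse_pred)`.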
Affiliation(s)
- Baoying Liu
  - School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, Sichuan, China
- Yongquan Jiang
  - School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, Sichuan, China
  - Artificial Intelligence Research Institute, Southwest Jiaotong University, Chengdu 611756, Sichuan, China
- Yan Yang
  - School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, Sichuan, China
  - Artificial Intelligence Research Institute, Southwest Jiaotong University, Chengdu 611756, Sichuan, China
- Jim X Chen
  - Department of Computer Science, George Mason University, Fairfax, Virginia 22030-4444, United States
19
Zheng F, Liu Y, Yang Y, Wen Y, Li M. Assessing computational tools for predicting protein stability changes upon missense mutations using a new dataset. Protein Sci 2024; 33:e4861. [PMID: 38084013 PMCID: PMC10751734 DOI: 10.1002/pro.4861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Revised: 11/14/2023] [Accepted: 12/06/2023] [Indexed: 12/28/2023]
Abstract
Insight into how mutations affect protein stability is crucial for protein engineering, understanding genetic diseases, and exploring protein evolution. Numerous computational methods have been developed to predict the impact of amino acid substitutions on protein stability. Nevertheless, comparing these methods poses challenges due to variations in their training data. Moreover, it is observed that they tend to perform better at predicting destabilizing mutations than stabilizing ones. Here, we meticulously compiled a new dataset from three recently published databases: ThermoMutDB, FireProtDB, and ProThermDB. This dataset, which does not overlap with the well-established S2648 dataset, consists of 4038 single-point mutations, including over 1000 stabilizing mutations. We assessed these mutations using 27 computational methods, including the latest ones utilizing mega-scale stability datasets and transfer learning. We excluded entries with overlap or similarity to training datasets to ensure fairness. Pearson correlation coefficients for the tested tools ranged from 0.20 to 0.53 on unseen data, and none of the methods could accurately predict stabilizing mutations, even those performing well in anti-symmetric property analysis. While most methods present consistent trends for predicting destabilizing mutations across various properties such as solvent exposure and secondary conformation, stabilizing mutations do not exhibit a clear pattern. Our study also suggests that solely addressing training dataset bias may not significantly enhance accuracy of predicting stabilizing mutations. These findings emphasize the importance of developing precise predictive methods for stabilizing mutations.
Affiliation(s)
- Feifan Zheng
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, China
- Yang Liu
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, China
- Yan Yang
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, China
- Yuhao Wen
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, China
- Minghui Li
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, China
20
Kurniawan J, Ishida T. Comparing Supervised Learning and Rigorous Approach for Predicting Protein Stability upon Point Mutations in Difficult Targets. J Chem Inf Model 2023; 63:6778-6788. [PMID: 37897811 DOI: 10.1021/acs.jcim.3c00750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/30/2023]
Abstract
Accurate prediction of protein stability upon a point mutation has important applications in drug discovery and personalized medicine. It remains a challenging issue in computational biology. Existing computational prediction methods, which range from mechanistic to supervised learning approaches, have experienced limited progress over the last few decades. This stagnation is largely due to their heavy reliance on both the quantity and quality of the training data. This is evident in recent state-of-the-art methods that continue to yield substantial errors on two challenging blind test sets: frataxin and p53, with average root-mean-square errors exceeding 3 and 1.5 kcal/mol, respectively, which is still above the theoretical 1 kcal/mol prediction barrier. Rigorous approaches, on the other hand, offer greater potential for accuracy without relying on training data but are computationally demanding and require both wild-type and mutant structure information. Although they showed high accuracy for conserving mutations, their performance is still limited for charge-changing mutation cases. This might be due to the lack of an available mutant structure, often represented by a simplified capped peptide. The recent advances in protein structure prediction methods now make it possible to obtain structures comparable to experimental ones, including complete mutant structure information. In this work, we compare the performance of supervised learning-based methods and rigorous approaches for predicting protein stability on point mutations in difficult targets: frataxin and p53. The rigorous alchemical method significantly surpasses state-of-the-art techniques in terms of both the root-mean-squared error and Pearson correlation coefficient in these two challenging blind test sets. Additionally, we propose an improved alchemical method that employs the pmx double-system/single-box approach to accurately predict the folding free energy change upon both conserving and charge-changing mutations. The enhanced protocol can accurately predict both types of mutations, thereby outperforming existing state-of-the-art methods in overall performance.
Affiliation(s)
- Jason Kurniawan
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Takashi Ishida
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo 152-8550, Japan
21
Gong J, Jiang L, Chen Y, Zhang Y, Li X, Ma Z, Fu Z, He F, Sun P, Ren Z, Tian M. THPLM: a sequence-based deep learning framework for protein stability changes prediction upon point variations using pretrained protein language model. Bioinformatics 2023; 39:btad646. [PMID: 37874953 PMCID: PMC10627365 DOI: 10.1093/bioinformatics/btad646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Revised: 09/25/2023] [Accepted: 10/22/2023] [Indexed: 10/26/2023] Open
Abstract
MOTIVATION: Quantitative determination of protein thermodynamic stability is a critical step in protein and drug design. Reliable prediction of protein stability changes caused by point variations contributes to developing-related fields. Over the past decades, dozens of structure-based and sequence-based methods have been proposed, showing good prediction performance. Despite the impressive progress, it is necessary to explore wild-type and variant protein representations to address the problem of how to represent the protein stability change in view of global sequence. With the development of structure prediction using learning-based methods, protein language models (PLMs) have shown accurate and high-quality predictions of protein structure. Because PLM captures the atomic-level structural information, it can help to understand how single-point variations cause functional changes. RESULTS: Here, we proposed THPLM, a sequence-based deep learning model for stability change prediction using Meta's ESM-2. With ESM-2 and a simple convolutional neural network, THPLM achieved comparable or even better performance than most methods, including sequence-based and structure-based methods. Furthermore, the experimental results indicate that the PLM's ability to generate representations of sequence can effectively improve the ability of protein function prediction. AVAILABILITY AND IMPLEMENTATION: The source code of THPLM and the testing data can be accessible through the following links: https://github.com/FPPGroup/THPLM.
Affiliation(s)
- Jianting Gong
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Lili Jiang
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Yongbing Chen
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Yixiang Zhang
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Xue Li
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Zhiqiang Ma
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Department of Computer Science, College of Humanities and Sciences of Northeast Normal University, Changchun 130117, China
- Zhiguo Fu
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
- Fei He
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
- Pingping Sun
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
- Zilin Ren
  - School of Information Science and Technology, Institution of Computational Biology, Northeast Normal University, Changchun 130117, China
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
- Mingyao Tian
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Changchun 130122, China
| |
|
22
|
Umerenkov D, Nikolaev F, Shashkova TI, Strashnov PV, Sindeeva M, Shevtsov A, Ivanisenko NV, Kardymon OL. PROSTATA: a framework for protein stability assessment using transformers. Bioinformatics 2023; 39:btad671. [PMID: 37935419 PMCID: PMC10651431 DOI: 10.1093/bioinformatics/btad671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 10/25/2023] [Accepted: 11/02/2023] [Indexed: 11/09/2023] Open
Abstract
MOTIVATION: Accurate prediction of change in protein stability due to point mutations is an attractive goal that remains unachieved. Despite the high interest in this area, little consideration has been given to the transformer architecture, which is dominant in many fields of machine learning. RESULTS: In this work, we introduce PROSTATA, a predictive model built in a knowledge-transfer fashion on a new curated dataset. PROSTATA demonstrates an advantage over existing solutions based on neural networks. We show that the large improvement margin is due to both the architecture of the model and the quality of the new training dataset. This work opens up opportunities to develop new lightweight and accurate models for protein stability assessment. AVAILABILITY AND IMPLEMENTATION: PROSTATA is available at https://github.com/AIRI-Institute/PROSTATA and https://prostata.airi.net.
Affiliation(s)
| | | | | | - Pavel V Strashnov
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Department of Computer Design and Technology, Bauman Moscow State Technical University, Moscow 105005, Russia
| | | | - Andrey Shevtsov
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Regulatory Transcriptomics and Epigenomics Group, Institute of Bioengineering, Research Center of Biotechnology RAS, Moscow 117036, Russia
| | - Nikita V Ivanisenko
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Laboratory of Computational Proteomics, Institute of Cytology and Genetics SB RAS, Novosibirsk 630090, Russia
| | | |
|
23
|
Patsch D, Eichenberger M, Voss M, Bornscheuer UT, Buller RM. LibGENiE - A bioinformatic pipeline for the design of information-enriched enzyme libraries. Comput Struct Biotechnol J 2023; 21:4488-4496. [PMID: 37736300 PMCID: PMC10510078 DOI: 10.1016/j.csbj.2023.09.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/02/2023] [Revised: 09/13/2023] [Accepted: 09/13/2023] [Indexed: 09/23/2023] Open
Abstract
Enzymes are potent catalysts with high specificity and selectivity. To leverage nature's synthetic potential for industrial applications, various protein engineering techniques have emerged that allow the catalytic, biophysical, and molecular recognition properties of enzymes to be tailored. However, the many possible ways a protein can be altered force researchers to carefully balance the exhaustiveness of an enzyme screening campaign against the required resources. Consequently, the optimal engineering strategy is often defined on a case-by-case basis. Strikingly, while predicting mutations that lead to an improved target function is challenging, here we show that the prediction and exclusion of deleterious mutations is a much more straightforward task, as analyzed for an engineered carbonic acid anhydrase, a transaminase, a squalene-hopene cyclase and a Kemp eliminase. Combining such a pre-selection of allowed residues with advanced gene synthesis methods opens a path toward an efficient and generalizable library construction approach for protein engineering. To give researchers easy access to this methodology, we provide the website LibGENiE containing the bioinformatic tools for the library design workflow.
Affiliation(s)
- David Patsch
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, Institute of Chemistry and Biotechnology, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
- Institute of Biochemistry, Department of Biotechnology & Enzyme Catalysis, Greifswald University, Felix-Hausdorff-Strasse 4, 17487 Greifswald, Germany
| | - Michael Eichenberger
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, Institute of Chemistry and Biotechnology, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
| | - Moritz Voss
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, Institute of Chemistry and Biotechnology, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
| | - Uwe T. Bornscheuer
- Institute of Biochemistry, Department of Biotechnology & Enzyme Catalysis, Greifswald University, Felix-Hausdorff-Strasse 4, 17487 Greifswald, Germany
| | - Rebecca M. Buller
- Zurich University of Applied Sciences, School of Life Sciences and Facility Management, Institute of Chemistry and Biotechnology, Einsiedlerstrasse 31, 8820 Wädenswil, Switzerland
| |
|
24
|
Quan N, Eguchi Y, Geiler-Samerotte K. Intra-FCY1: a novel system to identify mutations that cause protein misfolding. Front Genet 2023; 14:1198203. [PMID: 37745845 PMCID: PMC10512024 DOI: 10.3389/fgene.2023.1198203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 08/22/2023] [Indexed: 09/26/2023] Open
Abstract
Protein misfolding is a common intracellular occurrence. Most mutations to coding sequences increase the propensity of the encoded protein to misfold. These misfolded molecules can have devastating effects on cells. Despite the importance of protein misfolding in human disease and protein evolution, there are fundamental questions that remain unanswered, such as, which mutations cause the most misfolding? These questions are difficult to answer partially because we lack high-throughput methods to compare the destabilizing effects of different mutations. Commonly used systems to assess the stability of mutant proteins in vivo often rely upon essential proteins as sensors, but misfolded proteins can disrupt the function of the essential protein enough to kill the cell. This makes it difficult to identify and compare mutations that cause protein misfolding using these systems. Here, we present a novel in vivo system named Intra-FCY1 that we use to identify mutations that cause misfolding of a model protein [yellow fluorescent protein (YFP)] in Saccharomyces cerevisiae. The Intra-FCY1 system utilizes two complementary fragments of the yeast cytosine deaminase Fcy1, a toxic protein, into which YFP is inserted. When YFP folds, the Fcy1 fragments associate together to reconstitute their function, conferring toxicity in media containing 5-fluorocytosine and hindering growth. But mutations that make YFP misfold abrogate Fcy1 toxicity, thus strains possessing misfolded YFP variants rise to high frequency in growth competition experiments. This makes such strains easier to study. The Intra-FCY1 system cancels localization of the protein of interest, thus can be applied to study the relative stability of mutant versions of diverse cellular proteins. Here, we confirm this method can identify novel mutations that cause misfolding, highlighting the potential for Intra-FCY1 to illuminate the relationship between protein sequence and stability.
Affiliation(s)
- N. Quan
- Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, United States
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
| | - Y. Eguchi
- Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, United States
| | - K. Geiler-Samerotte
- Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, United States
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
| |
|
25
|
Dutagaci B, Duan B, Qiu C, Kaplan CD, Feig M. Characterization of RNA polymerase II trigger loop mutations using molecular dynamics simulations and machine learning. PLoS Comput Biol 2023; 19:e1010999. [PMID: 36947548 PMCID: PMC10069792 DOI: 10.1371/journal.pcbi.1010999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 04/03/2023] [Accepted: 03/06/2023] [Indexed: 03/23/2023] Open
Abstract
Catalysis and fidelity of multisubunit RNA polymerases rely on a highly conserved active site domain called the trigger loop (TL), which achieves roles in transcription through conformational changes and interaction with NTP substrates. The mutations of TL residues cause distinct effects on catalysis including hypo- and hyperactivity and altered fidelity. We applied molecular dynamics simulation (MD) and machine learning (ML) techniques to characterize TL mutations in the Saccharomyces cerevisiae RNA Polymerase II (Pol II) system. We did so to determine relationships between individual mutations and phenotypes and to associate phenotypes with MD simulated structural alterations. Using fitness values of mutants under various stress conditions, we modeled phenotypes along a spectrum of continual values. We found that ML could predict the phenotypes with an R² of 0.68 from amino acid sequences alone. It was more difficult to incorporate MD data to improve predictions from machine learning, presumably because MD data is too noisy and possibly incomplete to directly infer functional phenotypes. However, a variational auto-encoder model based on the MD data allowed the clustering of mutants with different phenotypes based on structural details. Overall, we found that a subset of loss-of-function (LOF) and lethal mutations tended to increase distances of TL residues to the NTP substrate, while another subset of LOF and lethal substitutions tended to confer an increase in distances between TL and bridge helix (BH). In contrast, some of the gain-of-function (GOF) mutants appear to cause disruption of hydrophobic contacts among TL and nearby helices.
Affiliation(s)
- Bercem Dutagaci
- Department of Molecular and Cell Biology, University of California Merced, Merced, California, United States of America
| | - Bingbing Duan
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Chenxi Qiu
- Department of Genetics, Harvard Medical School, Boston, Massachusetts, United States of America
| | - Craig D. Kaplan
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Michael Feig
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
| |
|
26
|
Benevenuta S, Birolo G, Sanavia T, Capriotti E, Fariselli P. Challenges in predicting stabilizing variations: An exploration. Front Mol Biosci 2023; 9:1075570. [PMID: 36685278 PMCID: PMC9849384 DOI: 10.3389/fmolb.2022.1075570] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/20/2022] [Accepted: 12/15/2022] [Indexed: 01/06/2023] Open
Abstract
An open challenge of computational and experimental biology is understanding the impact of non-synonymous DNA variations on protein function and, subsequently, human health. The effects of these variants on protein stability can be measured as the difference in the free energy of unfolding (ΔΔG) between the mutated structure of the protein and its wild-type form. Throughout the years, bioinformaticians have developed a wide variety of tools and approaches to predict the ΔΔG. Although the performance of these tools is highly variable, overall they are less accurate in predicting the ΔΔG of stabilizing variations than that of destabilizing ones. Here, we analyze the possible reasons for this difference by focusing on the relationship between experimentally measured ΔΔG and seven protein properties on three widely used datasets (S2648, VariBench, Ssym) and a recently introduced one (S669). These properties include protein structural information, different physical properties and statistical potentials. We found that two highly used input features, i.e., hydrophobicity and the Blosum62 substitution matrix, show a performance close to random choice when trying to separate stabilizing variants from either neutral or destabilizing ones. We then speculate that, since destabilizing variations are the most abundant class in the available datasets, the overall performance of the methods is higher when including features that improve the prediction for the destabilizing variants at the expense of the stabilizing ones. These findings highlight the need to design predictive methods that can also exploit input features highly correlated with the stabilizing variants. New tools should also be tested on a dataset that is not artificially balanced, reporting the performance on all three classes (i.e., stabilizing, neutral and destabilizing variants) and not only the overall results.
Affiliation(s)
| | - Giovanni Birolo
- Department of Medical Sciences, University of Torino, Torino, Italy
| | - Tiziana Sanavia
- Department of Medical Sciences, University of Torino, Torino, Italy
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, Bologna, Italy
| | - Piero Fariselli
- Department of Medical Sciences, University of Torino, Torino, Italy. *Correspondence: Piero Fariselli
| |
|
27
|
PSP-GNM: Predicting Protein Stability Changes upon Point Mutations with a Gaussian Network Model. Int J Mol Sci 2022; 23:ijms231810711. [PMID: 36142614 PMCID: PMC9505940 DOI: 10.3390/ijms231810711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2022] [Revised: 09/05/2022] [Accepted: 09/09/2022] [Indexed: 11/26/2022] Open
Abstract
Understanding the effects of missense mutations on protein stability is a widely acknowledged significant biological problem. Genomic missense mutations may alter one or more amino acids, leading to increased or decreased stability of the encoded proteins. In this study, we describe a novel approach, Protein Stability Prediction with a Gaussian Network Model (PSP-GNM), to measure the unfolding Gibbs free energy change (ΔΔG) and evaluate the effects of single amino acid substitutions on protein stability. Specifically, PSP-GNM employs a coarse-grained Gaussian Network Model (GNM) that has interactions between amino acids weighted by the Miyazawa–Jernigan statistical potential. We used PSP-GNM to simulate partial unfolding of the wildtype and mutant protein structures, and then used the difference in the energies and entropies of the unfolded wildtype and mutant proteins to calculate ΔΔG. The extent of the agreement between the ΔΔG calculated by PSP-GNM and the experimental ΔΔG was evaluated on three benchmark datasets: 350 forward mutations (S350 dataset), 669 forward and reverse mutations (S669 dataset) and 611 forward and reverse mutations (S611 dataset). We observed a Pearson correlation coefficient as high as 0.61, which is comparable to many of the existing state-of-the-art methods. The agreement with experimental ΔΔG further increased when we considered only those measurements made close to 25 °C and neutral pH, suggesting dependence on experimental conditions. We also assessed the antisymmetry (ΔΔG_reverse = −ΔΔG_forward) between the forward and reverse mutations on the Ssym+ dataset, which has 352 forward and reverse mutations. While most available methods do not display significant antisymmetry, PSP-GNM demonstrated near-perfect antisymmetry, with a Pearson correlation of −0.97. PSP-GNM is written in Python and can be downloaded as a stand-alone code.
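The antisymmetry criterion mentioned in this abstract (the reverse-mutation ΔΔG should equal the negative of the forward-mutation ΔΔG) can be checked numerically. The following is a minimal sketch, not code from PSP-GNM: the function names `pearson` and `antisymmetry_report`, the bias definition (mean of forward plus reverse, divided by two) and the toy values are all illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def antisymmetry_report(ddg_forward, ddg_reverse):
    """Return (r, bias) for paired forward/reverse predictions.

    r    : Pearson correlation between forward and reverse values;
           a perfectly antisymmetric predictor gives r = -1.
    bias : mean of (forward + reverse) / 2 (assumed definition);
           a perfectly antisymmetric predictor gives 0.
    """
    r = pearson(ddg_forward, ddg_reverse)
    bias = sum(f + v for f, v in zip(ddg_forward, ddg_reverse)) / (2 * len(ddg_forward))
    return r, bias


# Hypothetical predictions in kcal/mol, for illustration only.
forward = [1.2, -0.4, 2.5, 0.3]
reverse = [-1.0, 0.5, -2.4, -0.1]
r, bias = antisymmetry_report(forward, reverse)
print(f"r = {r:.3f}, bias = {bias:.3f}")
```

A correlation close to −1 together with a bias close to 0 indicates that the predictor treats a mutation and its reversal consistently; either number alone can hide a systematic offset.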
|
28
|
Iqbal S, Ge F, Li F, Akutsu T, Zheng Y, Gasser RB, Yu DJ, Webb GI, Song J. PROST: AlphaFold2-aware Sequence-Based Predictor to Estimate Protein Stability Changes upon Missense Mutations. J Chem Inf Model 2022; 62:4270-4282. [PMID: 35973091 DOI: 10.1021/acs.jcim.2c00799] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 01/17/2023]
Abstract
An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model protein stability changes when such mutations occur. Here, we developed a new sequence-based predictor for the protein stability (PROST) change (Gibbs free energy change, ΔΔG) upon a single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and an extra-trees regressor; PROST is trained on both direct and hypothetical reverse mutations using the S5294 (S2647 direct mutations + S2647 inverse mutations). The parameters for the PROST model are optimized using grid searching with 5-fold cross-validation, and feature importance analysis unveils the most relevant features. The performance of PROST is evaluated in a blinded manner, employing nine distinct data sets and existing state-of-the-art sequence-based and structure-based predictors. This method consistently performs well on frataxin, S217, S349, Ssym, S669, Myoglobin, and CAGI5 data sets in blind tests and similarly to the state-of-the-art predictors for p53 and S276 data sets. When the performance of PROST is compared with the latest predictors such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST dominates these predictors. A case study of mutation scanning of the frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available.
The source code of PROST, data sets, examples, and pretrained models along with how to use PROST are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.
Affiliation(s)
- Shahid Iqbal
- Department of Data Science and AI, Faculty of IT, Monash University, Clayton, Victoria 3800, Australia
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Clayton, Victoria 3800, Australia
| | - Fang Ge
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Clayton, Victoria 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Yuanting Zheng
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Geoffrey I Webb
- Department of Data Science and AI, Faculty of IT, Monash University, Clayton, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Clayton, Victoria 3800, Australia
| | - Jiangning Song
- Department of Data Science and AI, Faculty of IT, Monash University, Clayton, Victoria 3800, Australia
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Clayton, Victoria 3800, Australia
| |
|