1
|
Kulikova AV, Diaz DJ, Chen T, Cole TJ, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. Sci Rep 2023; 13:13280. [PMID: 37587128 PMCID: PMC10432456 DOI: 10.1038/s41598-023-40247-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 08/07/2023] [Indexed: 08/18/2023]
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
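The combined-model idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how the combined model is built, so the code below assumes the simplest possible stand-in: a weighted blend of two per-site amino-acid probability distributions, with the blend weight chosen on labeled sites. All names (`combine`, `fit_weight`) and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# One site: (sequence-model probabilities, structure-model probabilities, true residue)
Site = Tuple[Dict[str, float], Dict[str, float], str]


def combine(p_seq: Dict[str, float], p_struct: Dict[str, float], w: float) -> Dict[str, float]:
    """Blend two per-site distributions; w is the weight on the sequence model."""
    return {
        aa: w * p_seq.get(aa, 0.0) + (1.0 - w) * p_struct.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }


def top1(pred: Dict[str, float]) -> str:
    """Return the amino acid with the highest predicted probability."""
    return max(pred, key=pred.get)


def accuracy(sites: List[Site], w: float) -> float:
    """Fraction of sites whose blended top-1 prediction matches the true residue."""
    hits = sum(top1(combine(s, t, w)) == truth for s, t, truth in sites)
    return hits / len(sites)


def fit_weight(sites: List[Site], steps: int = 10) -> float:
    """Grid-search the blend weight that maximizes accuracy (assumed procedure)."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda w: accuracy(sites, w))


# Hypothetical toy data: the sequence model is right on an exposed charged
# site (K), the structure model is right on a buried hydrophobic site (L).
sites: List[Site] = [
    ({"K": 0.60, "L": 0.40}, {"L": 0.55, "K": 0.45}, "K"),
    ({"L": 0.45, "V": 0.55}, {"L": 0.90, "V": 0.10}, "L"),
]

best_w = fit_weight(sites)
print("structure only:", accuracy(sites, 0.0))
print("sequence only: ", accuracy(sites, 1.0))
print("blended:       ", accuracy(sites, best_w), "at w =", best_w)
```

In this toy setting each model alone gets one of the two sites right, while the blend recovers both. A real combined model would be fit on held-out data and would likely be more expressive than a single scalar weight.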
Affiliation(s)
- Anastasiya V Kulikova
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
| | - Daniel J Diaz
- Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
| | - Tianlong Chen
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
| | - T Jeffrey Cole
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
| | - Andrew D Ellington
- The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
| | - Claus O Wilke
- Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.
| |
|
2
|
Kulikova AV, Diaz DJ, Chen T, Jeffrey Cole T, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. bioRxiv 2023:2023.03.20.533508. [PMID: 36993648 PMCID: PMC10055221 DOI: 10.1101/2023.03.20.533508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/31/2023]
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Affiliation(s)
- Anastasiya V. Kulikova
- Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
- Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
| | - Daniel J. Diaz
- Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
- Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
| | - Tianlong Chen
- Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
| | - T. Jeffrey Cole
- Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
| | - Andrew D. Ellington
- Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
| | - Claus O. Wilke
- Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
| |
|
3
|
Koseki J, Hayashi S, Kojima Y, Hirose H, Shimamura T. Topological data analysis of protein structure and inter/intra-molecular interaction changes attributable to amino acid mutations. Comput Struct Biotechnol J 2023; 21:2950-2959. [PMID: 37228703 PMCID: PMC10205437 DOI: 10.1016/j.csbj.2023.05.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/02/2023] [Revised: 05/09/2023] [Accepted: 05/09/2023] [Indexed: 05/27/2023]
Abstract
The presence of some amino acid mutations in the amino acid sequence that determines a protein's structure can significantly affect that 3D structure and its biological function. However, the effects upon structural and functional changes differ for each displaced amino acid, and it is very difficult to predict these changes in advance. Although computer simulations are very effective at predicting conformational changes, they struggle to determine whether the amino acid mutation of interest induces sufficient conformational changes, unless the researcher is a specialist in molecular structure calculations. Therefore, we created a framework that efficiently utilizes molecular dynamics and persistent homology methods to identify amino acid mutations that induce structural changes. We show that this framework can be used not only to predict conformational changes produced by amino acid mutations but also to extract groups of mutations that significantly alter similar molecular interactions, by capturing the resultant protein-protein interaction changes.
Affiliation(s)
- Jun Koseki
- Division of Systems Biology, Graduate School of Medicine, Nagoya University, Aichi 466-8550, Japan
| | - Shuto Hayashi
- Division of Systems Biology, Graduate School of Medicine, Nagoya University, Aichi 466-8550, Japan
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
| | - Yasuhiro Kojima
- Division of Systems Biology, Graduate School of Medicine, Nagoya University, Aichi 466-8550, Japan
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
- Laboratory of Computational Life Science, National Cancer Center Research Institute, Tokyo 104-0045, Japan
| | - Haruka Hirose
- Division of Systems Biology, Graduate School of Medicine, Nagoya University, Aichi 466-8550, Japan
| | - Teppei Shimamura
- Division of Systems Biology, Graduate School of Medicine, Nagoya University, Aichi 466-8550, Japan
- Department of Computational and Systems Biology, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
| |
|
4
|
Diaz DJ, Kulikova AV, Ellington AD, Wilke CO. Using machine learning to predict the effects and consequences of mutations in proteins. Curr Opin Struct Biol 2023; 78:102518. [PMID: 36603229 PMCID: PMC9908841 DOI: 10.1016/j.sbi.2022.102518] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 08/09/2022] [Revised: 11/07/2022] [Accepted: 11/20/2022] [Indexed: 01/05/2023]
Abstract
Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that even though the specifics of the machine learning algorithms matter, the more important constraint comes from the data availability and quality utilized during training. In cases where little experimental data are available, unsupervised and self-supervised pre-training with generic protein datasets can still perform well after subsequent refinement via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.
Affiliation(s)
- Daniel J Diaz
- Department of Chemistry, The University of Texas at Austin, 105 E 24TH St., Austin, 78712, Texas, USA; Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/aiproteins
| | - Anastasiya V Kulikova
- Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA
| | - Andrew D Ellington
- Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/CSSBatUT
| | - Claus O Wilke
- Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA.
| |
|
5
|
Jilani M, Turcan A, Haspel N, Jagodzinski F. Elucidating the Structural Impacts of Protein InDels. Biomolecules 2022; 12:1435. [PMID: 36291643 PMCID: PMC9599607 DOI: 10.3390/biom12101435] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/12/2022] [Revised: 09/23/2022] [Accepted: 09/27/2022] [Indexed: 09/17/2023]
Abstract
The effects of amino acid insertions and deletions (InDels) remain a rather under-explored area of structural biology. These variations oftentimes are the cause of numerous disease phenotypes. In spite of this, research to study InDels and their structural significance remains limited, primarily due to a lack of experimental information and computational methods. In this work, we fill this gap by modeling InDels computationally; we investigate the rigidity differences between the wildtype and a mutant variant with one or more InDels. Further, we compare how structural effects due to InDels differ from the effects of amino acid substitutions, which are another type of amino acid mutation. We finish by performing a correlation analysis between our rigidity-based metrics and wet lab data for their ability to infer the effects of InDels on protein fitness.
Affiliation(s)
- Muneeba Jilani
- Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA
| | - Alistair Turcan
- Department of Computer Science, Western Washington University, Bellingham, WA 98225, USA
| | - Nurit Haspel
- Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA
| | - Filip Jagodzinski
- Department of Computer Science, Western Washington University, Bellingham, WA 98225, USA
| |
|
6
|
Tam JZ, Palumbo T, Miwa JM, Chen BY. Analysis of Protein-Protein Interactions for Intermolecular Bond Prediction. Molecules 2022; 27:6178. [PMID: 36234723 PMCID: PMC9572624 DOI: 10.3390/molecules27196178] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/30/2022] [Revised: 09/03/2022] [Accepted: 09/10/2022] [Indexed: 11/24/2022]
Abstract
Protein-protein interactions often involve a complex system of intermolecular interactions between residues and atoms at the binding site. A comprehensive exploration of these interactions can help reveal key residues involved in protein-protein recognition that are not obvious using other protein analysis techniques. This paper presents and extends DiffBond, a novel method for identifying and classifying intermolecular bonds while applying standard definitions of bonds in chemical literature to explain protein interactions. DiffBond predicted intermolecular bonds from four protein complexes: Barnase-Barstar, Rap1a-raf, SMAD2-SMAD4, and a subset of complexes formed from three-finger toxins and nAChRs. Based on validation through manual literature search and through comparison of two protein complexes from the SKEMPI dataset, DiffBond was able to identify intermolecular ionic bonds and hydrogen bonds with high precision and recall, and identify salt bridges with high precision. DiffBond predictions on bond existence were also strongly correlated with observations of Gibbs free energy change and electrostatic complementarity in mutational experiments. DiffBond can be a powerful tool for predicting and characterizing influential residues in protein-protein interactions, and its predictions can support research in mutational experiments and drug design.
Affiliation(s)
- Justin Z. Tam
- Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
| | - Talulla Palumbo
- Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
| | - Julie M. Miwa
- Department of Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
| | - Brian Y. Chen
- Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
| |
|
7
|
CGRAP: A Web Server for Coarse-Grained Rigidity Analysis of Proteins. Symmetry (Basel) 2021. [DOI: 10.3390/sym13122401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Elucidating protein rigidity offers insights about protein conformational changes. An understanding of protein motion can help speed drug development, and provide general insights into the dynamic behaviors of biomolecules. Existing rigidity analysis techniques employ fine-grained, all-atom modeling, which has a costly run-time, particularly for proteins made up of more than 500 residues. In this work, we introduce coarse-grained rigidity analysis, and showcase that it provides flexibility information about a protein that is similar in accuracy to an all-atom modeling approach. We assess the accuracy of the coarse-grained method relative to an all-atom approach via a comparison metric that reasons about the largest rigid clusters of the two methods. The apparent symmetry between the all-atom and coarse-grained methods yields very similar results, but the coarse-grained method routinely exhibits 40% reduced run-times. The CGRAP web server outputs rigid cluster information, and provides data visualization capabilities, including an interactive protein visualizer.
|
8
|
Tam J, Palumbo T, Miwa JM, Chen BY. DiffBond: A Method for Predicting Intermolecular Bond Formation. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2021; 2021:2574-2586. [PMID: 35378834 DOI: 10.1109/bibm52615.2021.9669850] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
Many tools that explore models of protein complexes are also able to analyze interactions between specific residues and atoms. A comprehensive exploration of these interactions can often uncover aspects of protein-protein recognition that are not obvious using other protein analysis techniques. This paper describes DiffBond, a novel method for searching for intermolecular interactions between protein complexes while differentiating between three different types of interaction: hydrogen bonds, ionic bonds, and salt bridges. DiffBond incorporates textbook definitions of these three interactions while contending with uncertainties that are inherent in computational models of interacting proteins. We used it to examine the barnase-barstar, Rap1a-raf, and Smad2-Smad4 complexes, as well as a subset of protein complexes formed between three-finger toxins and nAChRs. Based on electrostatic interactions established by previous experimental studies, DiffBond was able to identify ionic and hydrogen bonds with high precision and recall, and identify salt bridges with high precision. In combination with other electrostatic analysis methods, DiffBond can be a useful tool in helping predict influential amino acids in protein-protein interactions and characterizing the type of interaction.
Affiliation(s)
- Justin Tam
- Dept. Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
| | - Talulla Palumbo
- Dept. Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
| | - Julie M Miwa
- Dept. Biological Sciences, Lehigh University, Bethlehem, PA 18015, USA
| | - Brian Y Chen
- Dept. Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
| |
|
9
|
Iqbal S, Li F, Akutsu T, Ascher DB, Webb GI, Song J. Assessing the performance of computational predictors for estimating protein stability changes upon missense mutations. Brief Bioinform 2021; 22:6289890. [PMID: 34058752 DOI: 10.1093/bib/bbab184] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 02/08/2021] [Revised: 04/07/2021] [Accepted: 04/21/2021] [Indexed: 11/14/2022]
Abstract
Understanding how a mutation might affect protein stability is of significant importance to protein engineering and for understanding protein evolution and genetic diseases. While a number of computational tools have been developed to predict the effect of missense mutations on protein stability, they are known to exhibit large biases imparted in part by the data used to train and evaluate them. Here, we provide a comprehensive overview of predictive tools, which has provided an evolving insight into the importance and relevance of features that can discern the effects of mutations on protein stability. A diverse selection of these freely available tools was benchmarked using a large mutation-level blind dataset of 1342 experimentally characterised mutations across 130 proteins from ThermoMutDB, a second test dataset encompassing 630 experimentally characterised mutations across 39 proteins from iStable2.0 and a third blind test dataset consisting of 268 mutations in 27 proteins from the newly published ProThermDB. The performance of the methods was further evaluated with respect to the site of mutation, type of mutant residue and by ranging the pH and temperature. Additionally, the classification performance was also evaluated by classifying the mutations as stabilizing (∆∆G ≥ 0) or destabilizing (∆∆G < 0). The results reveal that the performance of the predictors is affected by the site of mutation and the type of mutant residue. Further, the results show very low performance for pH values 6-8 and temperature higher than 65 for all predictors except iStable2.0 on the S630 dataset. To illustrate how stability and structure change upon single point mutation, we considered four stabilizing, two destabilizing and two stabilizing mutations from two proteins, namely the toxin protein and bovine liver cytochrome. Overall, the results on S268, S630 and S1342 datasets show that the performance of the integrated predictors is better than the mechanistic or individual machine learning predictors. We expect that this paper will provide useful guidance for the design and development of next-generation bioinformatic tools for predicting protein stability changes upon mutations.
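The sign-based classification described in this abstract (stabilizing when ∆∆G ≥ 0, destabilizing when ∆∆G < 0) can be sketched as follows. The thresholding follows the convention stated in the abstract; the choice of accuracy and the Matthews correlation coefficient as summary metrics, the function names, and the numeric values are assumptions made for illustration and are not data from the study.

```python
import math
from typing import Sequence, Tuple


def is_stabilizing(ddg: float) -> bool:
    """Convention from the abstract: ddG >= 0 is stabilizing, ddG < 0 destabilizing."""
    return ddg >= 0


def confusion(experimental: Sequence[float], predicted: Sequence[float]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating 'stabilizing' as the positive class."""
    tp = fp = tn = fn = 0
    for exp, pred in zip(experimental, predicted):
        truth, guess = is_stabilizing(exp), is_stabilizing(pred)
        if truth and guess:
            tp += 1
        elif not truth and guess:
            fp += 1
        elif not truth and not guess:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical ddG values (kcal/mol), for illustration only.
experimental = [1.2, -0.8, 0.0, -2.5, 0.4, -0.1]
predicted = [0.9, -1.1, -0.2, -1.9, 0.6, 0.3]

tp, fp, tn, fn = confusion(experimental, predicted)
acc = (tp + tn) / len(experimental)
print(f"tp={tp} fp={fp} tn={tn} fn={fn} accuracy={acc:.2f} mcc={mcc(tp, fp, tn, fn):.2f}")
```

Note that a measured ∆∆G of exactly 0 counts as stabilizing under this convention, so a predictor that returns a slightly negative value for such a mutation is scored as a miss.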
Affiliation(s)
- Shahid Iqbal
- Computer System Engineering from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Pakistan
| | - Fuyi Li
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, the University of Melbourne, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Japan
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Victoria 3800, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Australia
| |
|
10
|
Wilson CJ, Chang M, Karttunen M, Choy WY. KEAP1 Cancer Mutants: A Large-Scale Molecular Dynamics Study of Protein Stability. Int J Mol Sci 2021; 22:5408. [PMID: 34065616 PMCID: PMC8161161 DOI: 10.3390/ijms22105408] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/21/2021] [Revised: 05/11/2021] [Accepted: 05/13/2021] [Indexed: 12/30/2022]
Abstract
We have performed 280 μs of unbiased molecular dynamics (MD) simulations to investigate the effects of 12 different cancer mutations on Kelch-like ECH-associated protein 1 (KEAP1) (G333C, G350S, G364C, G379D, R413L, R415G, A427V, G430C, R470C, R470H, R470S and G476R), one of the frequently mutated proteins in lung cancer. The aim was to provide structural insight into the effects of these mutants, including a new class of ANCHOR (additionally NRF2-complexed hypomorph) mutant variants. Our work provides additional insight into the structural dynamics of mutants that could not be analyzed experimentally, painting a more complete picture of their mutagenic effects. Notably, blade-wise analysis of the Kelch domain points to stability as a possible target of cancer in KEAP1. Interestingly, structural analysis of the R470C ANCHOR mutant, the most prevalent missense mutation in KEAP1, revealed no significant change in structural stability or NRF2 binding site dynamics, possibly indicating a covalent modification as this mutant's mode of action.
Affiliation(s)
- Carter J. Wilson
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada
- Department of Applied Mathematics, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
| | - Megan Chang
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada
| | - Mikko Karttunen
- Department of Applied Mathematics, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
- Department of Chemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 3K7, Canada
- Centre for Advanced Materials and Biomaterials Research, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5B7, Canada
| | - Wing-Yiu Choy
- Department of Biochemistry, The University of Western Ontario, 1151 Richmond Street, London, ON N6A 5C1, Canada
| |
|
11
|
Afrasiabi F, Dehghanpoor R, Haspel N. Integrating Rigidity Analysis into the Exploration of Protein Conformational Pathways Using RRT* and MC. Molecules 2021; 26:2329. [PMID: 33923805 PMCID: PMC8073574 DOI: 10.3390/molecules26082329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2021] [Revised: 04/12/2021] [Accepted: 04/13/2021] [Indexed: 11/16/2022]
Abstract
To understand how proteins function on a cellular level, it is of paramount importance to understand their structures and dynamics, including the conformational changes they undergo to carry out their function. For the aforementioned reasons, the study of large conformational changes in proteins has been an interest to researchers for years. However, since some proteins experience rapid and transient conformational changes, it is hard to experimentally capture the intermediate structures. Additionally, computational brute force methods are computationally intractable, which makes it impossible to find these pathways which require a search in a high-dimensional, complex space. In our previous work, we implemented a hybrid algorithm that combines Monte-Carlo (MC) sampling and RRT*, a version of the Rapidly Exploring Random Trees (RRT) robotics-based method, to make the conformational exploration more accurate and efficient, and produce smooth conformational pathways. In this work, we integrated the rigidity analysis of proteins into our algorithm to guide the search to explore flexible regions. We demonstrate that rigidity analysis dramatically reduces the run time and accelerates convergence.
|
12
|
Popov AV, Endutkin AV, Yatsenko DD, Yudkina AV, Barmatov AE, Makasheva KA, Raspopova DY, Diatlova EA, Zharkov DO. Molecular dynamics approach to identification of new OGG1 cancer-associated somatic variants with impaired activity. J Biol Chem 2021; 296:100229. [PMID: 33361155 PMCID: PMC7948927 DOI: 10.1074/jbc.ra120.014455] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 05/19/2020] [Revised: 12/22/2020] [Accepted: 12/23/2020] [Indexed: 01/02/2023]
Abstract
DNA of living cells is always exposed to damaging factors. To counteract the consequences of DNA lesions, cells have evolved several DNA repair systems, among which base excision repair is one of the most important systems. Many currently used antitumor drugs act by damaging DNA, and DNA repair often interferes with chemotherapy and radiotherapy in cancer cells. Tumors are usually extremely genetically heterogeneous, often bearing mutations in DNA repair genes. Thus, knowledge of the functionality of cancer-related variants of proteins involved in DNA damage response and repair is of great interest for personalization of cancer therapy. Although computational methods to predict the variant functionality have attracted much attention, at present, they are mostly based on sequence conservation and make little use of modern capabilities in computational analysis of 3D protein structures. We have used molecular dynamics (MD) to model the structures of 20 clinically observed variants of a DNA repair enzyme, 8-oxoguanine DNA glycosylase. In parallel, we have experimentally characterized the activity, thermostability, and DNA binding in a subset of these mutant proteins. Among the analyzed variants of 8-oxoguanine DNA glycosylase, three (I145M, G202C, and V267M) were significantly functionally impaired and were successfully predicted by MD. Alone or in combination with sequence-based methods, MD may be an important functional prediction tool for cancer-related protein variants of unknown significance.
Affiliation(s)
- Aleksandr V Popov
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia.
| | - Anton V Endutkin
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
| | - Darya D Yatsenko
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia; Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| | - Anna V Yudkina
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
| | - Alexander E Barmatov
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
| | - Kristina A Makasheva
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| | - Darya Yu Raspopova
- Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia
| | - Evgeniia A Diatlova
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia
| | - Dmitry O Zharkov
- Laboratory of Genome and Protein Engineering, SB RAS Institute of Chemical Biology and Fundamental Medicine, Novosibirsk, Russia; Department of Natural Sciences, Novosibirsk State University, Novosibirsk, Russia.
| |
|
13
|
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 64] [Impact Index Per Article: 16.0] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts data representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Wen‐Feng Zeng
- Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
| | - Yuxing Liao
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Zhiao Shi
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Sara R. Savage
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Wen Jiang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| |
|
14
|
In silico features of ADAMTS13 contributing to plasmatic ADAMTS13 levels in neonates with congenital heart disease. Thromb Res 2020; 193:66-76. [PMID: 32531546 DOI: 10.1016/j.thromres.2020.05.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2020] [Revised: 04/24/2020] [Accepted: 05/26/2020] [Indexed: 11/20/2022]
Abstract
INTRODUCTION Risk factors contributing to heightened thrombosis in pediatric congenital heart disease (CHD) patients are not fully understood. Among the neonatal CHD population, those presenting with single ventricular physiology are at the highest risk for perioperative thrombosis. The von Willebrand factor and ADAMTS13 interactions have emerged as causative risk factors for pediatric stroke and could contribute to heightened thrombosis in CHD neonates. METHODS This study investigates a cohort of children with single ventricle physiology undergoing cardiac surgery, during which some patients developed thrombosis. In this cohort, we analyzed the relationship of several molecular features of ADAMTS13 with the plasma and activity levels in patients at risk of thrombosis. Additionally, in light of the natural antithrombotic activity of ADAMTS13, we sequenced the ADAMTS13 gene for each patient and evaluated the role of genetic variants in determining the plasma ADAMTS13 levels using a series of in silico tools including Hidden Markov Models, EVmutation, and Rosetta. RESULTS Lower ADAMTS13 levels were found in patients who developed thrombosis. A novel in silico analysis to assess the haplotype effect of co-occurring variants identified alterations in relative surface area and solvation energy as important contributors. Our analysis suggested that the beneficial or deleterious effect of a variant can be reasonably predicted by comprehensive analysis of in silico assessment and in vitro and/or in vivo data. CONCLUSION Findings from this study add to our understanding of the role of genetic features of ADAMTS13 in patients at high risk of thrombosis related to an imbalanced relation between VWF and ADAMTS13.
|
15
|
Huang P, Chu SKS, Frizzo HN, Connolly MP, Caster RW, Siegel JB. Evaluating Protein Engineering Thermostability Prediction Tools Using an Independently Generated Dataset. ACS Omega 2020; 5:6487-6493. [PMID: 32258884 PMCID: PMC7114132 DOI: 10.1021/acsomega.9b04105] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 12/02/2019] [Accepted: 03/06/2020] [Indexed: 05/04/2023]
Abstract
Engineering proteins to enhance thermal stability is a widely utilized approach for creating industrially relevant biocatalysts. The development of new experimental datasets and computational tools to guide these engineering efforts remains an active area of research. Thus, to complement the previously reported measures of T50 and kinetic constants, we are reporting an expansion of our previously published dataset of mutants for β-glucosidase to include both measures of TM and ΔΔG. For a set of 51 mutants, we found that T50 and TM are moderately correlated, with a Pearson correlation coefficient and Spearman's rank coefficient of 0.58 and 0.47, respectively, indicating that the two methods capture different physical features. The performance of predicted stability using nine computational tools was also evaluated on the dataset of 51 mutants, none of which are found to be strong predictors of the observed changes in T50, TM, or ΔΔG. Furthermore, the ability of the nine algorithms to predict the production of isolatable soluble protein was examined, which revealed that Rosetta ΔΔG, FoldX, DeepDDG, PoPMuSiC, and SDM were capable of predicting if a mutant could be produced and isolated as a soluble protein. These results further highlight the need for new algorithms for predicting modest, yet important, changes in thermal stability as well as a new utility for current algorithms for prescreening designs for the production of mutants that maintain fold and soluble production properties.
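The abstract above compares two stability measures using the Pearson and Spearman coefficients. As a minimal sketch of how those two statistics differ, the following computes both from scratch; the function names are illustrative assumptions, and no data from the paper is reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures linear association, while Spearman only asks whether the ordering of the two measurements agrees, so a monotonic but nonlinear relationship yields a Spearman coefficient of 1 and a Pearson coefficient below 1.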
Affiliation(s)
- Peishan Huang
- Biophysics Graduate Group, University of California, Davis 95616, California, United States
| | - Simon K. S. Chu
- Biophysics Graduate Group, University of California, Davis 95616, California, United States
| | - Henrique N. Frizzo
- Genome Center, University of California, Davis 95616, California, United States
| | - Morgan P. Connolly
- Microbiology Graduate Group, University of California, Davis 95616, California, United States
| | - Ryan W. Caster
- Genome Center, University of California, Davis 95616, California, United States
| | - Justin B. Siegel
- Genome Center, University of California, Davis 95616, California, United States
- Department of Biochemistry & Molecular Medicine, University of California, Davis 95616, California, United States
- Department of Chemistry, University of California, Davis 95616, California, United States
| |
|
16
|
Lv X, Chen J, Lu Y, Chen Z, Xiao N, Yang Y. Accurately Predicting Mutation-Caused Stability Changes from Protein Sequences Using Extreme Gradient Boosting. J Chem Inf Model 2020; 60:2388-2395. [PMID: 32203653 DOI: 10.1021/acs.jcim.0c00064] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 12/14/2022]
Abstract
Accurately predicting the impact of point mutation on protein stability has crucial roles in protein design and engineering. In this study, we proposed a novel method (BoostDDG) to predict stability changes upon point mutations from protein sequences based on the extreme gradient boosting. We extracted features comprehensively from evolutional information and predicted structures and performed feature selection by a strategy of sequential forward selection. The features and parameters were optimized by homologue-based cross-validation to avoid overfitting. Finally, we found that 14 features from six groups led to the highest Pearson correlation coefficient (PCC) of 0.535, which is consistent with the 0.540 on an independent test. Our method was indicated to consistently outperform other sequence-based methods on three precompiled test sets, and 7363 variants on two proteins (PTEN and TPMT). These results highlighted that BoostDDG is a powerful tool for predicting stability changes upon point mutations from protein sequences.
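The abstract above mentions feature selection by sequential forward selection. The following is a generic sketch of that greedy procedure, not the BoostDDG implementation: the `score` callable is a hypothetical stand-in for whatever evaluation is used (in the paper's setting, presumably a cross-validated correlation), and the stopping rule of "stop when no candidate improves the score" is an assumption.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset).

    `score` is a caller-supplied (hypothetical) function mapping a list of
    feature names to a number where higher is better.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Evaluate every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

Because each round only considers adding one feature to the current subset, the search evaluates on the order of the square of the number of features rather than every possible subset, at the cost of possibly missing combinations that only help jointly.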
Affiliation(s)
- Xuan Lv
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China
| | - Jianwen Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yutong Lu
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Zhiguang Chen
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Nong Xiao
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Changsha, Hunan 410073, China.,School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
| | - Yuedong Yang
- School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong 510275, China.,Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Ministry of Education, Guangzhou, Guangdong 510275, China
| |
|
17
|
PETRA: Drug Engineering via Rigidity Analysis. Molecules 2020; 25:molecules25061304. [PMID: 32178472 PMCID: PMC7144111 DOI: 10.3390/molecules25061304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2020] [Revised: 02/19/2020] [Accepted: 02/25/2020] [Indexed: 11/23/2022] Open
Abstract
Rational drug design aims to develop pharmaceutical agents that impart maximal therapeutic benefits via their interaction with their intended biological targets. In the past several decades, advances in computational tools that inform wet-lab techniques have aided the development of a wide variety of new medicines with high efficacies. Nonetheless, drug development remains a time and cost intensive process. In this work, we have developed a computational pipeline for assessing how individual atoms contribute to a ligand’s effect on the structural stability of a biological target. Our approach takes as input a protein-ligand resolved PDB structure file and systematically generates all possible ligand variants. We assess how the atomic-level edits to the ligand alter the drug’s effect via a graph theoretic rigidity analysis approach. We demonstrate, via four case studies of common drugs, the utility of our pipeline and corroborate our analyses with known biophysical properties of the medicines, as reported in the literature.
|
18
|
Robust Prediction of Single and Multiple Point Protein Mutations Stability Changes. Biomolecules 2019; 10:biom10010067. [PMID: 31906171 PMCID: PMC7023245 DOI: 10.3390/biom10010067] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/09/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 11/16/2022] Open
Abstract
Accurate prediction of protein stability changes resulting from amino acid substitutions is of utmost importance in medicine to better understand which mutations are deleterious, leading to diseases, and which are neutral. Since conducting wet-lab experiments to better understand protein mutations is costly and time-consuming, and because of the huge number of possible mutations, computational methods that can accurately predict the effects of amino acid mutations are greatly needed. In this research, we present a robust methodology to predict the energy changes of a protein upon mutation. The proposed prediction scheme is based on a two-step algorithm: a Holdout Random Sampler followed by a neural network model for regression. The Holdout Random Sampler is used to analyze the energy change and the corresponding uncertainty, and to obtain a set of admissible energy changes, expressed as a cumulative distribution function. These values are then used to train a simple neural network model that can predict the energy changes. Results were blindly tested (validated) against experimental energy changes, giving Pearson correlation coefficients of 0.66 for Single Point Mutations and 0.77 for Multiple Point Mutations. These results support the effectiveness of our method, since it outperforms the majority of previous studies in this field.
|
19
|
Olney R, Tuor A, Jagodzinski F, Hutchinson B. A systematic exploration of ΔΔG cutoff ranges in machine learning models for protein mutation stability prediction. J Bioinform Comput Biol 2018; 16:1840022. [DOI: 10.1142/s021972001840022x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Discerning how a mutation affects the stability of a protein is central to the study of a wide range of diseases. Mutagenesis experiments on physical proteins provide precise insights about the effects of amino acid substitutions, but such studies are time and cost prohibitive. Computational approaches for informing experimentalists where to allocate wet-lab resources are available, including a variety of machine learning models. Assessing the accuracy of machine learning models for predicting the effects of mutations is dependent on experiments for amino acid substitutions performed in vitro. When similar experiments on physical proteins have been performed by multiple laboratories, the use of the data near the juncture of stabilizing and destabilizing mutations is questionable. In this work, we explore a systematic and principled alternative to discarding experimental data close to the juncture of stabilizing and destabilizing mutations. We model the inconclusive range of experimental [Formula: see text] values via 3- and 5-way classifiers, and systematically explore potential boundaries for the range of inconclusive experimental values. We demonstrate the effectiveness of potential boundaries through confusion matrices and heat map visualizations. We explore two novel metrics for assessing viable cutoff ranges, and find that under these metrics, a lower cutoff near [Formula: see text] and an upper cutoff near [Formula: see text] are optimal across multiple machine learning models.
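The abstract above describes treating values near the stabilizing/destabilizing boundary as a separate inconclusive class. A minimal sketch of such a 3-way labeling follows. The cutoff values are parameters the caller must supply (the abstract's own values are not recoverable from this listing), the neutral class names `low` and `high` are assumptions chosen because the sign convention for ΔΔG varies between datasets, and the helper names are hypothetical.

```python
from collections import Counter


def three_way_label(ddg, lower, upper):
    """Assign one of three classes to a ΔΔG value given two cutoffs.

    Values inside [lower, upper] are treated as inconclusive. Whether "low"
    means stabilizing or destabilizing depends on the sign convention used.
    """
    if lower > upper:
        raise ValueError("lower cutoff must not exceed upper cutoff")
    if ddg < lower:
        return "low"
    if ddg > upper:
        return "high"
    return "inconclusive"


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs, i.e. confusion-matrix cells."""
    return Counter(zip(true_labels, predicted_labels))
```

Sweeping `lower` and `upper` over a grid and recomputing the confusion counts for each pair is one way to explore candidate boundaries for the inconclusive range.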
Affiliation(s)
| | - Aaron Tuor
- Pacific Northwest National Laboratory, Seattle, WA, USA
| | | | - Brian Hutchinson
- Western Washington University, Bellingham, WA, USA
- Pacific Northwest National Laboratory, Seattle, WA, USA
| |
|
20
|
Ming D, Chen R, Huang H. Amino-Acid Network Clique Analysis of Protein Mutation Non-Additive Effects: A Case Study of Lysozme. Int J Mol Sci 2018; 19:ijms19051427. [PMID: 29747478 PMCID: PMC5983764 DOI: 10.3390/ijms19051427] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 02/14/2018] [Revised: 04/28/2018] [Accepted: 05/07/2018] [Indexed: 01/23/2023] Open
Abstract
Optimizing amino-acid mutations in enzyme design has been a very challenging task in modern bio-industrial applications. It is well known that many successful designs hinge on extensive correlations among mutations at different sites within the enzyme; however, the underpinning mechanism for these correlations is far from clear. Here, we present a topology-based model to quantitatively characterize non-additive effects between mutations. The method is based on molecular dynamics simulations and amino-acid network clique analysis. It examines whether the two mutation sites of a double-site mutation fall into a 3-clique structure, and associates this topological property of the spatial distribution of mutation sites with mutation additivity features. We analyzed 13 dual mutations of T4 phage lysozyme and found that the clique-based model successfully distinguishes highly correlated or non-additive double-site mutations from additive ones whose component mutations have less correlation. We also applied the model to protein Eglin c, whose structural topology is significantly different from that of T4 phage lysozyme, and found that the model can, to some extent, still distinguish non-additive mutations from additive ones. Our calculations showed that mutation non-additive effects may heavily depend on a structural topology relationship between mutation sites, which can be quantitatively determined using amino-acid network k-cliques. We also showed that double-site mutation correlations can be significantly altered by introducing a third mutation, indicating that more detailed physicochemical interactions should be considered along with the network clique-based model for better understanding of this elusive mutation-correlation principle.
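The abstract above asks whether two mutation sites fall into a common 3-clique of an amino-acid network. One plausible reading, sketched here under stated assumptions, is that the two residues are in contact with each other and share at least one common contact. How the contact network is derived (the paper uses molecular dynamics simulations) is not modeled; the edge list is assumed to be given, and both function names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from (residue, residue) contacts."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shared_3_cliques(adjacency, site_a, site_b):
    """Return every 3-clique (triangle) that contains both sites.

    The sites must be direct neighbours; each common neighbour then closes
    one triangle. An empty list means the sites share no 3-clique.
    """
    if site_b not in adjacency.get(site_a, set()):
        return []
    common = (adjacency[site_a] & adjacency[site_b]) - {site_a, site_b}
    return sorted((site_a, site_b, third) for third in common)
```

For example, with contacts `[(1, 2), (2, 3), (1, 3), (3, 4)]`, residues 1 and 2 share the triangle `(1, 2, 3)`, whereas residues 3 and 4 are in contact but have no common neighbour and so share no 3-clique.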
Affiliation(s)
- Dengming Ming
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Biotech Building Room B1-404, 30 South Puzhu Road, Nanjing 211816, Jiangsu, China.
| | - Rui Chen
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Biotech Building Room B1-404, 30 South Puzhu Road, Nanjing 211816, Jiangsu, China.
| | - He Huang
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Biotech Building Room B1-404, 30 South Puzhu Road, Nanjing 211816, Jiangsu, China.
- College of Pharmaceutical Sciences, Nanjing Tech University, 30 Puzhu South Road, Nanjing 211816, Jiangsu, China.
| |
|