1
|
Huang Z, Cui X, Xia Y, Zhao K, Zhang G. Pathfinder: Protein folding pathway prediction based on conformational sampling. PLoS Comput Biol 2023; 19:e1011438. [PMID: 37695768 PMCID: PMC10513300 DOI: 10.1371/journal.pcbi.1011438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 09/21/2023] [Accepted: 08/17/2023] [Indexed: 09/13/2023] Open
Abstract
The study of protein folding mechanisms is a challenge in molecular biology and is of great significance for revealing the movement rules of biological macromolecules, understanding the pathogenic mechanisms of folding diseases, and designing protein engineering materials. Based on the hypothesis that conformational sampling trajectories contain information about the folding pathway, we propose a protein folding pathway prediction algorithm named Pathfinder. First, Pathfinder performs large-scale sampling of the conformational space and clusters the decoys obtained in the sampling. The heterogeneous conformations obtained by clustering are named seed states. Then, a resampling algorithm that is not constrained by the local energy basin is designed to obtain the transition probabilities of seed states. Finally, protein folding pathways are inferred from the maximum transition probabilities of seed states. Pathfinder is tested on our own test set of 34 proteins. For 11 widely studied proteins, we correctly predicted their folding pathways and specifically analyzed 5 of them. For 13 proteins, we predicted folding pathways that remain to be verified by biological experiments. For 6 proteins, we analyzed the reasons for the low prediction accuracy. For the other 4 proteins without biological experiment results, potential folding pathways were predicted to provide new insights into protein folding mechanisms. The results reveal that structural analogs may have different folding pathways to express different biological functions, homologous proteins may share common folding pathways, and α-helices may be more prone to early folding than β-strands.
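The final step of the abstract, inferring a pathway from the maximum transition probabilities of seed states, can be illustrated with a small sketch. The abstract does not say how the maxima are combined into a pathway, so the greedy walk below is an assumption, and the state names and probabilities are hypothetical values chosen for illustration; this is not the Pathfinder implementation.

```python
def infer_pathway(transition, start, goal):
    """Greedy walk: from each state, move to the not-yet-visited state
    with the highest transition probability until the goal is reached.
    Returns None if the walk gets stuck before reaching the goal."""
    path = [start]
    visited = {start}
    current = start
    while current != goal:
        # Only consider states that have not been visited yet.
        candidates = {
            state: prob
            for state, prob in transition[current].items()
            if state not in visited
        }
        if not candidates:
            return None
        current = max(candidates, key=candidates.get)
        visited.add(current)
        path.append(current)
    return path


# Hypothetical transition probabilities between seed states.
transition = {
    "unfolded": {"seed_A": 0.6, "seed_B": 0.3, "native": 0.1},
    "seed_A": {"seed_B": 0.2, "native": 0.7, "unfolded": 0.1},
    "seed_B": {"seed_A": 0.5, "native": 0.4, "unfolded": 0.1},
    "native": {},
}

pathway = infer_pathway(transition, "unfolded", "native")
print(pathway)
```

A greedy walk is the simplest reading of "maximum transition probabilities"; a most-probable-path search over the whole graph would be another plausible reading.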
Affiliation(s)
- Zhaohong Huang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
- Xinyue Cui
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
- Yuhao Xia
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
- Kailong Zhao
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
- Guijun Zhang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou, China
|
2
|
Nithiyanandam S, Sangaraju VK, Manavalan B, Lee G. Computational prediction of protein folding rate using structural parameters and network centrality measures. Comput Biol Med 2023; 155:106436. [PMID: 36848800 DOI: 10.1016/j.compbiomed.2022.106436] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/17/2022] [Revised: 11/28/2022] [Accepted: 12/13/2022] [Indexed: 02/17/2023]
Abstract
Protein folding is a complex physicochemical process whereby a polymer of amino acids samples numerous conformations in its unfolded state before settling on an essentially unique native three-dimensional (3D) structure. To understand this process, several theoretical studies have used a set of 3D structures, identified different structural parameters, and analyzed their relationships with the natural logarithm of the protein folding rate (ln(kf)). Unfortunately, these structural parameters are specific to a small set of proteins and cannot accurately predict ln(kf) for both two-state (TS) and non-two-state (NTS) proteins. To overcome the limitations of the statistical approach, a few machine learning (ML)-based models have been proposed using limited training data. However, none of these methods can explain plausible folding mechanisms. In this study, we evaluated the predictive capabilities of ten different ML algorithms using eight different structural parameters and five different network centrality measures based on newly constructed datasets. In comparison to the other nine regressors, support vector machine was found to be the most appropriate for predicting ln(kf), with mean absolute differences of 1.856, 1.55, and 1.745 for the TS, NTS, and combined datasets, respectively. Furthermore, combining structural parameters and network centrality measures improves the prediction performance compared to individual parameters, indicating that multiple factors are involved in the folding process.
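The error measure quoted in this abstract can be made concrete with a short sketch. It assumes that "mean absolute difference" means the usual mean absolute error between observed and predicted ln(kf); the folding rates and predictions below are hypothetical numbers, not data from the paper.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical folding rates in s^-1, converted to natural-log scale.
kf_observed = [1200.0, 35.0, 0.8]
ln_kf_observed = [math.log(k) for k in kf_observed]

# Hypothetical model outputs, already on the ln(kf) scale.
ln_kf_predicted = [6.0, 4.0, 1.0]

mae = mean_absolute_error(ln_kf_observed, ln_kf_predicted)
print(round(mae, 3))
```

Because the error is measured on the logarithmic scale, an error of about 1.7 corresponds to a factor of roughly e^1.7 in the folding rate itself.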
Affiliation(s)
- Saraswathy Nithiyanandam
  - Department of Molecular Science and Technology, Ajou University, 206 World Cup-ro, Suwon, 16499, South Korea
- Vinoth Kumar Sangaraju
  - Department of Physiology, Ajou University School of Medicine, 206 World Cup-ro, Suwon, 16499, South Korea
- Balachandran Manavalan
  - Department of Physiology, Ajou University School of Medicine, 206 World Cup-ro, Suwon, 16499, South Korea
- Gwang Lee
  - Department of Molecular Science and Technology, Ajou University, 206 World Cup-ro, Suwon, 16499, South Korea
  - Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Gyeonggi-do, South Korea
|
3
|
FRTpred: A novel approach for accurate prediction of protein folding rate and type. Comput Biol Med 2022; 149:105911. [DOI: 10.1016/j.compbiomed.2022.105911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2022] [Revised: 07/08/2022] [Accepted: 07/23/2022] [Indexed: 11/20/2022]
|
4
|
Packiam KAR, Ooi CW, Li F, Mei S, Tey BT, Fang Ong H, Song J, Ramanan RN. PERISCOPE-Opt: Machine learning-based prediction of optimal fermentation conditions and yields of recombinant periplasmic protein expressed in Escherichia coli. Comput Struct Biotechnol J 2022; 20:2909-2920. [PMID: 35765650 PMCID: PMC9201004 DOI: 10.1016/j.csbj.2022.06.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 06/01/2022] [Accepted: 06/01/2022] [Indexed: 11/26/2022] Open
Abstract
The ensemble model considered both fermentation conditions and protein properties. Optimal fermentation conditions and periplasmic recombinant protein yield can be predicted. Predictor’s accuracy and Pearson correlation coefficient are 75% and 0.91, respectively.
Optimization of the fermentation process for recombinant protein production (RPP) is often resource-intensive. Machine learning (ML) approaches help minimize experimentation and find wide application in RPP. However, these ML-based tools primarily focus on features derived from the amino acid sequence, ignoring the influence of fermentation process conditions. The present study combines features derived from fermentation process conditions with those derived from the amino acid sequence to construct an ML-based model that predicts the maximal protein yields and the corresponding fermentation conditions for the expression of a target recombinant protein in the Escherichia coli periplasm. Two sets of XGBoost classifiers were employed in the first stage to classify the expression levels of the target protein as high (>50 mg/L), medium (between 0.5 and 50 mg/L), or low (<0.5 mg/L). The second-stage framework consisted of three regression models involving support vector machines and random forest to predict the expression yields corresponding to each expression-level class. Independent tests showed that the predictor achieved an overall average accuracy of 75% and a Pearson correlation coefficient of 0.91 for the correctly classified instances. Therefore, our model offers a reliable substitute for numerous trial-and-error experiments to identify the optimal fermentation conditions and yield for RPP. It is also implemented as an open-access webserver, PERISCOPE-Opt (http://periscope-opt.erc.monash.edu).
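The two-stage design described in this abstract (classify the expression level, then apply the regressor for that class) can be sketched as follows. The class thresholds come from the abstract; treating the boundary values 0.5 and 50 mg/L as "medium" is an assumption, and the toy classifier, toy regressors, and the `score` feature are hypothetical stand-ins for the trained XGBoost, support vector machine, and random forest models.

```python
def expression_class(yield_mg_per_l):
    """Map a yield in mg/L to the three classes named in the abstract.
    Boundary values (exactly 0.5 or 50) are treated as medium here."""
    if yield_mg_per_l > 50.0:
        return "high"
    if yield_mg_per_l < 0.5:
        return "low"
    return "medium"


def two_stage_predict(features, classify, regressors):
    """Stage 1 picks a class; stage 2 runs that class's regressor."""
    level = classify(features)
    return level, regressors[level](features)


def toy_classifier(features):
    # Hypothetical stand-in for the trained first-stage classifier.
    if features["score"] > 0.8:
        return "high"
    if features["score"] > 0.3:
        return "medium"
    return "low"


# Hypothetical stand-ins for the three second-stage regressors.
toy_regressors = {
    "high": lambda f: 50.0 + 100.0 * f["score"],
    "medium": lambda f: 25.0,
    "low": lambda f: 0.1,
}

level, predicted_yield = two_stage_predict(
    {"score": 0.9}, toy_classifier, toy_regressors
)
print(level, predicted_yield)
```

Splitting the problem this way lets each regressor specialize on one range of yields, at the cost that a first-stage misclassification routes the sample to the wrong regressor.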
|
5
|
The protein folding rate and the geometry and topology of the native state. Sci Rep 2022; 12:6384. [PMID: 35430582 PMCID: PMC9013383 DOI: 10.1038/s41598-022-09924-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 10/06/2021] [Accepted: 03/21/2022] [Indexed: 11/08/2022] Open
Abstract
Proteins fold into three-dimensional conformations that are important for their function. Characterizing the global conformation of proteins rigorously and separating secondary structure effects from topological effects is a challenge. New developments in applied knot theory make it possible to characterize the topological characteristics of proteins (knotted or not). By analyzing a small set of two-state and multi-state proteins with no knots or slipknots, our results show that 95.4% of the analyzed proteins have non-trivial topological characteristics, as reflected by the second Vassiliev measure, and that the logarithm of the experimental protein folding rate depends on both the local geometry and the topology of the protein’s native state.
|
6
|
McBride JM, Tlusty T. Slowest-first protein translation scheme: Structural asymmetry and co-translational folding. Biophys J 2021; 120:5466-5477. [PMID: 34813729 PMCID: PMC8715247 DOI: 10.1016/j.bpj.2021.11.024] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/12/2021] [Revised: 09/30/2021] [Accepted: 11/17/2021] [Indexed: 11/19/2022] Open
Abstract
Proteins are translated from the N to the C terminus, raising the basic question of how this innate directionality affects their evolution. To explore this question, we analyze 16,200 structures from the Protein Data Bank (PDB). We find remarkable enrichment of α helices at the C terminus and β strands at the N terminus. Furthermore, this α-β asymmetry correlates with sequence length and contact order, both determinants of folding rate, hinting at possible links to co-translational folding (CTF). Hence, we propose the "slowest-first" scheme, whereby protein sequences evolved structural asymmetry to accelerate CTF: the slowest of the cooperatively folding segments are positioned near the N terminus so they have more time to fold during translation. A phenomenological model predicts that CTF can be accelerated by asymmetry in folding rate, up to double the rate, when folding time is commensurate with translation time; analysis of the PDB predicts that structural asymmetry is indeed maximal in this regime. This correspondence is greater in prokaryotes, which generally require faster protein production. Altogether, this indicates that accelerating CTF is a substantial evolutionary force whose interplay with stability and functionality is encoded in secondary structure asymmetry.
Affiliation(s)
- John M McBride
  - Center for Soft and Living Matter, Institute for Basic Science, Ulsan, South Korea
- Tsvi Tlusty
  - Center for Soft and Living Matter, Institute for Basic Science, Ulsan, South Korea
  - Departments of Physics and Chemistry, Ulsan National Institute of Science and Technology, Ulsan, South Korea
|
7
|
Stepwise optimization of recombinant protein production in Escherichia coli utilizing computational and experimental approaches. Appl Microbiol Biotechnol 2020; 104:3253-3266. [DOI: 10.1007/s00253-020-10454-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/30/2019] [Revised: 01/28/2020] [Accepted: 02/07/2020] [Indexed: 12/14/2022]
|
8
|
Khor S. Folding with a protein's native shortcut network. Proteins 2019; 86:924-934. [PMID: 29790602 DOI: 10.1002/prot.25524] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/30/2017] [Revised: 04/13/2018] [Accepted: 05/14/2018] [Indexed: 11/09/2022]
Abstract
A complex network approach to protein folding is proposed, wherein a protein's contact map is reconceptualized as a network of shortcut edges, and folding is steered by a structural characteristic of this network. Shortcut networks are generated by a known message passing algorithm operating on protein residue networks. It is found that the shortcut networks of native structures (SCN0s) are relevant graph objects with which to study protein folding at a formal level. The logarithmic form of their contact order (SCN0_lnCO) correlates significantly with the folding rate of two-state and non-two-state proteins. The clustering coefficient of SCN0s (CSCN0) correlates significantly with folding rate, transition-state placement, and stability of two-state folders. Reasonable folding pathways for several model proteins are produced when CSCN0 is used to combine protein segments incrementally to form the native structure. The folding bias captured by CSCN0 is detectable in non-native structures, as evidenced by configurations of the fast-folding Villin-headpiece peptide generated by molecular dynamics simulation. These results support the use of shortcut networks to investigate the role protein geometry plays in the folding of both small and large globular proteins, and have implications for the design of multibody interaction schemes in folding models. One facet of this geometry is the set of native shortcut triangles, whose attributes are found to be well-suited to identify dehydrated intraprotein areas in tight turns, or at the interface of different secondary structure elements.
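A clustering coefficient of the kind this abstract refers to can be computed directly from a network's edge list. The sketch below uses the average local clustering coefficient of an undirected graph, which is an assumption about the definition meant by CSCN0; the four-node edge list is a hypothetical toy network, not a shortcut network derived from a protein.

```python
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.
    Nodes with fewer than two neighbors contribute zero."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if not neighbors:
        return 0.0
    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count the pairs of neighbors that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(neighbors)


# Hypothetical toy network: a triangle (0, 1, 2) with one pendant node (3).
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
clustering = average_clustering(toy_edges)
print(round(clustering, 4))
```

In the toy network, nodes 0 and 1 each score 1, node 2 scores 1/3, and node 3 scores 0, so the average is 7/12.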
Affiliation(s)
- Susan Khor
  - Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland and Labrador, Canada
|
9
|
Network measures for protein folding state discrimination. Sci Rep 2016; 6:30367. [PMID: 27464796 PMCID: PMC4964642 DOI: 10.1038/srep30367] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/06/2016] [Accepted: 06/24/2016] [Indexed: 11/09/2022] Open
Abstract
Proteins fold using two-state or multi-state kinetic mechanisms, but up to now there is no first-principles model to explain this different behavior. We exploit the network properties of protein structures by introducing novel observables to address the problem of classifying the different types of folding kinetics. These observables have a plain physical meaning in terms of vibrational modes, possible configurations compatible with the native protein structure, and folding cooperativity. The relevance of these observables is supported by a classification performance of up to 90%, even with simple classifiers such as discriminant analysis.
|
10
|
Chang CCH, Li C, Webb GI, Tey B, Song J, Ramanan RN. Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli. Sci Rep 2016; 6:21844. [PMID: 26931649 PMCID: PMC4773868 DOI: 10.1038/srep21844] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 11/02/2015] [Accepted: 01/28/2016] [Indexed: 12/20/2022] Open
Abstract
Periplasmic expression of soluble proteins in Escherichia coli not only offers a much-simplified downstream purification process, but also enhances the probability of obtaining correctly folded and biologically active proteins. Different combinations of signal peptides and target proteins lead to different soluble protein expression levels, ranging from negligible to several grams per litre. Accurate algorithms for rational selection of promising candidates can serve as a powerful tool to complement current trial-and-error approaches. Accordingly, proteomics studies can be conducted with greater efficiency and cost-effectiveness. Here, we developed a predictor with a two-stage architecture to predict the real-valued expression level of a target protein in the periplasm. The output of the first-stage support vector machine (SVM) classifier determines which second-stage support vector regression (SVR) model is used. When tested on an independent test dataset, the predictor achieved an overall prediction accuracy of 78% and a Pearson's correlation coefficient (PCC) of 0.77. We further illustrate the relative importance of various features with respect to different models. The results indicate that the occurrence of the dipeptide glutamine and aspartic acid is the most important feature for the classification model. Finally, we provide access to the implemented predictor through the Periscope webserver, freely accessible at http://lightning.med.monash.edu/periscope/.
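The Pearson's correlation coefficient reported in this abstract is a standard statistic that can be computed with the standard library alone. The sketch below uses the textbook formula; the observed and predicted expression levels are hypothetical numbers used only to exercise the function.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical observed and predicted expression levels (arbitrary units).
observed = [0.2, 1.5, 3.0, 4.8, 6.1]
predicted = [0.5, 1.2, 3.4, 4.1, 6.6]

pcc = pearson(observed, predicted)
print(round(pcc, 3))
```

The coefficient measures linear agreement only, so it is usually reported together with an accuracy or error figure, as the abstract does.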
Affiliation(s)
- Catherine Ching Han Chang
  - Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
  - Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Chen Li
  - Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
- Geoffrey I. Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
- BengTi Tey
  - Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
  - Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
- Jiangning Song
  - Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia
  - National Engineering Laboratory for Industrial Enzymes, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, China
- Ramakrishnan Nagasundara Ramanan
  - Chemical Engineering Discipline, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
  - Advanced Engineering Platform, School of Engineering, Monash University, Jalan Lagoon Selatan 46150, Bandar Sunway, Selangor, Malaysia
  - School of Chemistry, Monash University, Melbourne VIC 3800, Australia
|
11
|
Computational and experimental approaches to reveal the effects of single nucleotide polymorphisms with respect to disease diagnostics. Int J Mol Sci 2014; 15:9670-717. [PMID: 24886813 PMCID: PMC4100115 DOI: 10.3390/ijms15069670] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 04/08/2014] [Revised: 05/15/2014] [Accepted: 05/16/2014] [Indexed: 12/25/2022] Open
Abstract
DNA mutations are the cause of many human diseases and they are the reason for natural differences among individuals by affecting the structure, function, interactions, and other properties of DNA and expressed proteins. The ability to predict whether a given mutation is disease-causing or harmless is of great importance for the early detection of patients with a high risk of developing a particular disease and would pave the way for personalized medicine and diagnostics. Here we review existing methods and techniques to study and predict the effects of DNA mutations from three different perspectives: in silico, in vitro and in vivo. It is emphasized that the problem is complicated and successful detection of a pathogenic mutation frequently requires a combination of several methods and a knowledge of the biological phenomena associated with the corresponding macromolecules.
|
12
|
Chang CCH, Tey BT, Song J, Ramanan RN. Towards more accurate prediction of protein folding rates: a review of the existing web-based bioinformatics approaches. Brief Bioinform 2014; 16:314-24. [DOI: 10.1093/bib/bbu007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023] Open
|
13
|
Identification of human protein complexes from local sub-graphs of protein-protein interaction network based on random forest with topological structure features. Anal Chim Acta 2012; 718:32-41. [PMID: 22305895 DOI: 10.1016/j.aca.2011.12.069] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 10/09/2011] [Revised: 12/28/2011] [Accepted: 12/30/2011] [Indexed: 11/20/2022]
Abstract
In the post-genomic era, one of the most important and challenging tasks is to identify protein complexes and further elucidate their molecular mechanisms in specific biological processes. Previous computational approaches usually identify protein complexes from protein interaction networks based on dense sub-graphs and incomplete prior information. Additionally, these computational approaches pay little attention to the biological properties of proteins, and there is no common evaluation metric for assessing performance. It is therefore necessary to construct a novel method for identifying protein complexes and elucidating their function. In this study, a novel approach is proposed to identify protein complexes using random forest and topological structure. Each protein complex is represented by a graph of interactions, where a descriptor of the protein primary structure is used to characterize the biological properties of each protein and each vertex is weighted by the descriptor. Topological structure features are developed and used to characterize protein complexes. The random forest algorithm is used to build a prediction model and identify protein complexes from local sub-graphs instead of dense sub-graphs. As a demonstration, the proposed approach is applied to human protein interaction data, and satisfactory results are obtained, with an accuracy of 80.24%, sensitivity of 81.94%, specificity of 80.07%, and Matthews correlation coefficient of 0.4087 in a 10-fold cross-validation test. Some new protein complexes are identified, and analysis based on Gene Ontology shows that the complexes are likely to be true complexes and play important roles in the pathogenesis of some diseases. PCI-RFTS, a corresponding executable program for protein complex identification, can be obtained freely on request from the authors.
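The four evaluation measures quoted in this abstract all derive from the counts in a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and Matthews correlation
    coefficient (MCC) from binary confusion-matrix counts."""
    total = tp + fn + tn + fp
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        # By convention MCC is set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts: 10 true complexes and 10 non-complexes.
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which makes it a useful companion measure when the two classes differ in size.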
|