1
|
Xu R, Pan Q, Zhu G, Ye Y, Xin M, Wang Z, Wang S, Li W, Wei Y, Guo J, Zheng L. ThermoLink: Bridging disulfide bonds and enzyme thermostability through database construction and machine learning prediction. Protein Sci 2024; 33:e5097. [PMID: 39145402 PMCID: PMC11325166 DOI: 10.1002/pro.5097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2024] [Revised: 05/27/2024] [Accepted: 06/15/2024] [Indexed: 08/16/2024]
Abstract
Disulfide bonds, covalently formed by sulfur atoms in cysteine residues, play a crucial role in protein folding and structure stability. Considering their significance, artificial disulfide bonds are often introduced to enhance protein thermostability. Although an increasing number of tools can assist with this task, significant amounts of time and resources are often wasted owing to inadequate consideration. To enhance the accuracy and efficiency of designing disulfide bonds for protein thermostability improvement, we initially collected disulfide bond and protein thermostability data from extensive literature sources. Thereafter, we extracted various sequence- and structure-based features and constructed machine-learning models to predict whether disulfide bonds can improve protein thermostability. Among all models, the neighborhood context model based on the Adaboost-DT algorithm performed the best, yielding "area under the receiver operating characteristic curve" and accuracy scores of 0.773 and 0.714, respectively. Furthermore, we also found AlphaFold2 to exhibit high superiority in predicting disulfide bonds, and to some extent, the coevolutionary relationship between residue pairs potentially guided artificial disulfide bond design. Moreover, several mutants of imine reductase 89 (IR89) with artificially designed thermostable disulfide bonds were experimentally proven to be considerably efficient for substrate catalysis. The SS-bond data have been integrated into an online server, namely, ThermoLink, available at guolab.mpu.edu.mo/thermoLink.
Collapse
Affiliation(s)
- Ran Xu
- Centre in Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Qican Pan
- Zelixir Biotech Company Ltd, Shanghai, China
| | | | - Yilin Ye
- Centre in Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Minghui Xin
- School of Physics, Shandong University, Jinan, China
| | - Zechen Wang
- School of Physics, Shandong University, Jinan, China
| | - Sheng Wang
- Zelixir Biotech Company Ltd, Shanghai, China
| | - Weifeng Li
- School of Physics, Shandong University, Jinan, China
| | - Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Jingjing Guo
- Centre in Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
| | - Liangzhen Zheng
- Zelixir Biotech Company Ltd, Shanghai, China
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| |
Collapse
|
2
|
Robust and accurate prediction of self-interacting proteins from protein sequence information by exploiting weighted sparse representation based classifier. BMC Bioinformatics 2022; 23:518. [PMID: 36457083 PMCID: PMC9713954 DOI: 10.1186/s12859-022-04880-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2022] [Accepted: 08/03/2022] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Self-interacting proteins (SIPs), two or more copies of the protein that can interact with each other expressed by one gene, play a central role in the regulation of most living cells and cellular functions. Although numerous SIPs data can be provided by using high-throughput experimental techniques, there are still several shortcomings such as in time-consuming, costly, inefficient, and inherently high in false-positive rates, for the experimental identification of SIPs even nowadays. Therefore, it is more and more significant how to develop efficient and accurate automatic approaches as a supplement of experimental methods for assisting and accelerating the study of predicting SIPs from protein sequence information. RESULTS In this paper, we present a novel framework, termed GLCM-WSRC (gray level co-occurrence matrix-weighted sparse representation based classification), for predicting SIPs automatically based on protein evolutionary information from protein primary sequences. More specifically, we firstly convert the protein sequence into Position Specific Scoring Matrix (PSSM) containing protein sequence evolutionary information, exploiting the Position Specific Iterated BLAST (PSI-BLAST) tool. Secondly, using an efficient feature extraction approach, i.e., GLCM, we extract abstract salient and invariant feature vectors from the PSSM, and then perform a pre-processing operation, the adaptive synthetic (ADASYN) technique, to balance the SIPs dataset to generate new feature vectors for classification. Finally, we employ an efficient and reliable WSRC model to identify SIPs according to the known information of self-interacting and non-interacting proteins. CONCLUSIONS Extensive experimental results show that the proposed approach exhibits high prediction performance with 98.10% accuracy on the yeast dataset, and 91.51% accuracy on the human dataset, which further reveals that the proposed model could be a useful tool for large-scale self-interacting protein prediction and other bioinformatics tasks detection in the future.
Collapse
|
3
|
Onah E, Uzor PF, Ugwoke IC, Eze JU, Ugwuanyi ST, Chukwudi IR, Ibezim A. Prediction of HIV-1 protease cleavage site from octapeptide sequence information using selected classifiers and hybrid descriptors. BMC Bioinformatics 2022; 23:466. [DOI: 10.1186/s12859-022-05017-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022] Open
Abstract
Abstract
Background
In most parts of the world, especially in underdeveloped countries, acquired immunodeficiency syndrome (AIDS) still remains a major cause of death, disability, and unfavorable economic outcomes. This has necessitated intensive research to develop effective therapeutic agents for the treatment of human immunodeficiency virus (HIV) infection, which is responsible for AIDS. Peptide cleavage by HIV-1 protease is an essential step in the replication of HIV-1. Thus, correct and timely prediction of the cleavage site of HIV-1 protease can significantly speed up and optimize the drug discovery process of novel HIV-1 protease inhibitors. In this work, we built and compared the performance of selected machine learning models for the prediction of HIV-1 protease cleavage site utilizing a hybrid of octapeptide sequence information comprising bond composition, amino acid binary profile (AABP), and physicochemical properties as numerical descriptors serving as input variables for some selected machine learning algorithms. Our work differs from antecedent studies exploring the same subject in the combination of octapeptide descriptors and method used. Instead of using various subsets of the dataset for training and testing the models, we combined the dataset, applied a 3-way data split, and then used a "stratified" 10-fold cross-validation technique alongside the testing set to evaluate the models.
Results
Among the 8 models evaluated in the “stratified” 10-fold CV experiment, logistic regression, multi-layer perceptron classifier, linear discriminant analysis, gradient boosting classifier, Naive Bayes classifier, and decision tree classifier with AUC, F-score, and B. Acc. scores in the ranges of 0.91–0.96, 0.81–0.88, and 80.1–86.4%, respectively, have the closest predictive performance to the state-of-the-art model (AUC 0.96, F-score 0.80 and B. Acc. ~ 80.0%). Whereas, the perceptron classifier and the K-nearest neighbors had statistically lower performance (AUC 0.77–0.82, F-score 0.53–0.69, and B. Acc. 60.0–68.5%) at p < 0.05. On the other hand, logistic regression, and multi-layer perceptron classifier (AUC of 0.97, F-score > 0.89, and B. Acc. > 90.0%) had the best performance on further evaluation on the testing set, though linear discriminant analysis, gradient boosting classifier, and Naive Bayes classifier equally performed well (AUC > 0.94, F-score > 0.87, and B. Acc. > 86.0%).
Conclusions
Logistic regression and multi-layer perceptron classifiers have comparable predictive performances to the state-of-the-art model when octapeptide sequence descriptors consisting of AABP, bond composition and standard physicochemical properties are used as input variables. In our future work, we hope to develop a standalone software for HIV-1 protease cleavage site prediction utilizing the linear regression algorithm and the aforementioned octapeptide sequence descriptors.
Collapse
|
4
|
Qin X, Zhang L, Liu M, Xu Z, Liu G. ASFold-DNN: Protein Fold Recognition Based on Evolutionary Features With Variable Parameters Using Full Connected Neural Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2712-2722. [PMID: 34133282 DOI: 10.1109/tcbb.2021.3089168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Protein fold recognition contribute to comprehend the function of proteins, which is of great help to the gene therapy of diseases and the development of new drugs. Researchers have been working in this direction and have made considerable achievements, but challenges still exist on low sequence similarity datasets. In this study, we propose the ASFold-DNN framework for protein fold recognition research. Above all, four groups of evolutionary features are extracted from the primary structures of proteins, and a preliminary selection of variable parameter is made for two groups of features including ACC _HMM and SXG _HMM, respectively. Then several feature selection algorithms are selected for comparison and the best feature selection scheme is obtained by changing their internal threshold values. Finally, multiple hyper-parameters of Full Connected Neural Network are fully optimized to construct the best model. DD, EDD and TG datasets with low sequence similarities are chosen to evaluate the performance of the models constructed by the framework, and the final prediction accuracy are 85.28, 95.00 and 88.84 percent, respectively. Furthermore, the ASTRAL186 and LE datasets are introduced to further verify the generalization ability of our proposed framework. Comprehensive experimental results prove that the ASFold-DNN framework is more prominent than the state-of-the-art studies on protein fold recognition. The source code and data of ASFold-DNN can be downloaded from https://github.com/Bioinformatics-Laboratory/project/tree/master/ASFold.
Collapse
|
5
|
Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: A Comprehensive R Package for Generating Evolutionary-based Descriptors of Protein Sequences from PSSM Profiles. BIOLOGY METHODS AND PROTOCOLS 2022; 7:bpac008. [PMID: 35388370 PMCID: PMC8977839 DOI: 10.1093/biomethods/bpac008] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2021] [Revised: 01/21/2022] [Indexed: 11/14/2022]
Abstract
Position-specific scoring matrix (PSSM), also called profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that the PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSM in recent years, there is not any integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible comprehensive R package that generates 38 PSSM-based feature vectors. To our best knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.
Collapse
Affiliation(s)
- Alireza Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | - Javad Zahiri
- Department of Neuroscience, University of California San Diego, California, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Saber Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | - Mohsen Khodarahmi
- Department of Radiology, Shahid Madani Hospital, Karaj, Iran
- Bahar Medical Imaging Center, Karaj, Iran
- Dr. Khodarahmi Medical Imaging Center, Karaj, Iran
| | - Seyed Shahriar Arab
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| |
Collapse
|
6
|
Gong Y, Dong B, Zhang Z, Zhai Y, Gao B, Zhang T, Zhang J. VTP-Identifier: Vesicular Transport Proteins Identification Based on PSSM Profiles and XGBoost. Front Genet 2022; 12:808856. [PMID: 35047020 PMCID: PMC8762342 DOI: 10.3389/fgene.2021.808856] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2021] [Accepted: 11/29/2021] [Indexed: 11/13/2022] Open
Abstract
Vesicular transport proteins are related to many human diseases, and they threaten human health when they undergo pathological changes. Protein function prediction has been one of the most in-depth topics in bioinformatics. In this work, we developed a useful tool to identify vesicular transport proteins. Our strategy is to extract transition probability composition, autocovariance transformation and other information from the position-specific scoring matrix as feature vectors. EditedNearesNeighbours (ENN) is used to address the imbalance of the data set, and the Max-Relevance-Max-Distance (MRMD) algorithm is adopted to reduce the dimension of the feature vector. We used 5-fold cross-validation and independent test sets to evaluate our model. On the test set, VTP-Identifier presented a higher performance compared with GRU. The accuracy, Matthew's correlation coefficient (MCC) and area under the ROC curve (AUC) were 83.6%, 0.531 and 0.873, respectively.
Collapse
Affiliation(s)
- Yue Gong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Benzhi Dong
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yixiao Zhai
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Bo Gao
- Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin, China
| | - Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Jingyu Zhang
- Department of Neurology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
7
|
Prediction for understanding the effectiveness of antiviral peptides. Comput Biol Chem 2021; 95:107588. [PMID: 34655913 DOI: 10.1016/j.compbiolchem.2021.107588] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/20/2022]
Abstract
The low efficacy of current antivirals in conjunction with the resistance of viruses against existing antiviral drugs has resulted in the demand for the development of novel antiviral agents. Antiviral peptides (AVPs) are those bioactive peptides having virucidal activity and they can be developed into promising antiviral drugs. They are shorter length peptides having the ability to cease the progression of viral infections. The use of antiviral peptides in therapeutics has recently attracted the attention of the research community. The development and identification of AVPs is imperative for the discovery of novel therapeutics for viral infections. In the present work, a meta classifier (stacking) based approach is implemented for the prediction of IC50 (half maximal inhibitory concentration) and pIC50 (negative log of half maximal inhibitory concentration) values. The best prediction model with evolutionary information and local alignment scores as features achieved a correlation coefficient values of 0.670 and 0.753 on the training and testing sets respectively for IC50. Further, the prediction of pIC50 reached a correlation coefficient value of 0.797 and 0.789 for training and testing sets respectively. For the development of machine learning models involved in the prediction of IC50, the use of pIC50 over IC50 is recommended as the target variable. Further on a systematic comparison of AVPs with high IC50 values and Low IC50 values, it is revealed that higher mean charge and tiny amino acids are preferred and higher length and consecutive hydrophilic amino acids are avoided in the former.
Collapse
|
8
|
Zervou MA, Doutsi E, Pavlidis P, Tsakalides P. Structural classification of proteins based on the computationally efficient recurrence quantification analysis and horizontal visibility graphs. Bioinformatics 2021; 37:1796-1804. [PMID: 34048559 DOI: 10.1093/bioinformatics/btab407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2021] [Revised: 04/13/2021] [Accepted: 05/27/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Protein structural class prediction is one of the most significant problems in bioinformatics, as it has a prominent role in understanding the function and evolution of proteins. Designing a computationally efficient but at the same time accurate prediction method remains a pressing issue, especially for sequences that we cannot obtain a sufficient amount of homologous information from existing protein sequence databases. Several studies demonstrate the potential of utilizing chaos game representation (CGR) along with time series analysis tools such as recurrence quantification analysis (RQA), complex networks, horizontal visibility graphs (HVG) and others. However, the majority of existing works involve a large amount of features and they require an exhaustive, time consuming search of the optimal parameters. To address the aforementioned problems, this work adopts the generalized multidimensional recurrence quantification analysis (GmdRQA) as an efficient tool that enables to process concurrently a multidimensional time series and reduce the number of features. In addition, two data-driven algorithms, namely average mutual information (AMI) and false nearest neighbors (FNN), are utilized to define in a fast yet precise manner the optimal GmdRQA parameters. RESULTS The classification accuracy is improved by the combination of GmdRQA with the HVG. Experimental evaluation on a real benchmark dataset demonstrates that our methods achieve similar performance with the state-of-the-art but with a smaller computational cost. AVAILABILITY The code to reproduce all the results is available at https://github.com/aretiz/protein_structure_classification/tree/main. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Michaela Areti Zervou
- Department of Computer Science, University of Crete, Heraklion, 700 13, Greece.,Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, 700 13, Greece
| | - Effrosyni Doutsi
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, 700 13, Greece
| | - Pavlos Pavlidis
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, 700 13, Greece
| | - Panagiotis Tsakalides
- Department of Computer Science, University of Crete, Heraklion, 700 13, Greece.,Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, 700 13, Greece
| |
Collapse
|
9
|
Qin X, Liu M, Zhang L, Liu G. Structural protein fold recognition based on secondary structure and evolutionary information using machine learning algorithms. Comput Biol Chem 2021; 91:107456. [PMID: 33610129 DOI: 10.1016/j.compbiolchem.2021.107456] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2020] [Revised: 01/04/2021] [Accepted: 02/06/2021] [Indexed: 11/18/2022]
Abstract
Understanding the function of protein is conducive to research in advanced fields such as gene therapy of diseases, the development and design of new drugs, etc. The prerequisite for understanding the function of a protein is to determine its tertiary structure. The realization of protein structure classification is indispensable for this problem and fold recognition is a commonly used method of protein structure classification. Protein sequences of 40% identity in the ASTRAL protein classification database are used for fold recognition research in current work to predict 27 folding types which mostly belong to four protein structural classes: α, β, α/β and α + β. We extract features from primary structure of protein using methods covering DSSP, PSSM and HMM which are based on secondary structure and evolutionary information to convert protein sequences into feature vectors that can be recognized by machine learning algorithm and utilize the combination of LightGBM feature selection algorithm and incremental feature selection method (IFS) to find the optimal classifiers respectively constructed by machine learning algorithms on the basis of tree structure including Random Forest, XGBoost and LightGBM. Bayesian optimization method is used for hyper-parameter adjustment of machine learning algorithms to make the accuracy of fold recognition reach as high as 93.45% at last. The result obtained by the model we propose is outstanding in the study of protein fold recognition.
Collapse
Affiliation(s)
- Xinyi Qin
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Lu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Guangzhong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| |
Collapse
|
10
|
Apurva M, Mazumdar H. Predicting structural class for protein sequences of 40% identity based on features of primary and secondary structure using Random Forest algorithm. Comput Biol Chem 2020; 84:107164. [DOI: 10.1016/j.compbiolchem.2019.107164] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2019] [Revised: 10/25/2019] [Accepted: 11/10/2019] [Indexed: 02/08/2023]
|
11
|
Zhu XJ, Feng CQ, Lai HY, Chen W, Hao L. Predicting protein structural classes for low-similarity sequences by evaluating different features. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.10.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
|
12
|
Wang YB, You ZH, Li X, Jiang TH, Cheng L, Chen ZH. Prediction of protein self-interactions using stacked long short-term memory from protein sequences information. BMC SYSTEMS BIOLOGY 2018; 12:129. [PMID: 30577794 PMCID: PMC6302371 DOI: 10.1186/s12918-018-0647-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/19/2023]
Abstract
BACKGROUND Self-interacting Proteins (SIPs) plays a critical role in a series of life function in most living cells. Researches on SIPs are important part of molecular biology. Although numerous SIPs data be provided, traditional experimental methods are labor-intensive, time-consuming and costly and can only yield limited results in real-world needs. Hence,it's urgent to develop an efficient computational SIPs prediction method to fill the gap. Deep learning technologies have proven to produce subversive performance improvements in many areas, but the effectiveness of deep learning methods for SIPs prediction has not been verified. RESULTS We developed a deep learning model for predicting SIPs by constructing a Stacked Long Short-Term Memory (SLSTM) neural network that contains "dropout". We extracted features from protein sequences using a novel feature extraction scheme that combined Zernike Moments (ZMs) with Position Specific Weight Matrix (PSWM). The capability of the proposed approach was assessed on S.erevisiae and Human SIPs datasets. The result indicates that the approach based on deep learning can effectively resist data skew and achieve good accuracies of 95.69 and 97.88%, respectively. To demonstrate the progressiveness of deep learning, we compared the results of the SLSTM-based method and the celebrated Support Vector Machine (SVM) method and several other well-known methods on the same datasets. CONCLUSION The results show that our method is overall superior to any of the other existing state-of-the-art techniques. As far as we know, this study first applies deep learning method to predict SIPs, and practical experimental results reveal its potential in SIPs identification.
Collapse
Affiliation(s)
- Yan-Bin Wang
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
- University of Chinese Academy of Sciences, Beijing, 100049 China
| | - Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
| | - Xiao Li
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
| | - Tong-Hai Jiang
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
| | - Li Cheng
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
| | - Zhan-Heng Chen
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi, 830011 China
- University of Chinese Academy of Sciences, Beijing, 100049 China
| |
Collapse
|
13
|
Wang YB, You ZH, Li LP, Huang DS, Zhou FF, Yang S. Improving Prediction of Self-interacting Proteins Using Stacked Sparse Auto-Encoder with PSSM profiles. Int J Biol Sci 2018; 14:983-991. [PMID: 29989064 PMCID: PMC6036743 DOI: 10.7150/ijbs.23817] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2017] [Accepted: 03/29/2018] [Indexed: 12/05/2022] Open
Abstract
Self-interacting proteins (SIPs) play a significant role in the execution of most important molecular processes in cells, such as signal transduction, gene expression regulation, immune response and enzyme activation. Although the traditional experimental methods can be used to generate SIPs data, it is very expensive and time-consuming based only on biological technique. Therefore, it is important and urgent to develop an efficient computational method for SIPs detection. In this study, we present a novel SIPs identification method based on machine learning technology by combing the Zernike Moments (ZMs) descriptor on Position Specific Scoring Matrix (PSSM) with Probabilistic Classification Vector Machines (PCVM) and Stacked Sparse Auto-Encoder (SSAE). More specifically, an efficient feature extraction technique called ZMs is firstly utilized to generate feature vectors on Position Specific Scoring Matrix (PSSM); Then, Deep neural network is employed for reducing the feature dimensions and noise; Finally, the Probabilistic Classification Vector Machine is used to execute the classification. The prediction performance of the proposed method is evaluated on S.erevisiae and Human SIPs datasets via cross-validation. The experimental results indicate that the proposed method can achieve good accuracies of 92.55% and 97.47%, respectively. To further evaluate the advantage of our scheme for SIPs prediction, we also compared the PCVM classifier with the Support Vector Machine (SVM) and other existing techniques on the same data sets. Comparison results reveal that the proposed strategy is outperforms other methods and could be a used tool for identifying SIPs.
Collapse
Affiliation(s)
- Yan-Bin Wang
- University of Chinese Academy of Sciences, Beijing 100049, China
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| | - Zhu-Hong You
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| | - Li-Ping Li
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Caoan Road 4800, Shanghai 201804, China
| | - Feng-Feng Zhou
- College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China
| | - Shan Yang
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China
| |
Collapse
|