1
|
Qin X, Zhang L, Liu M, Liu G. PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1740-1751. [PMID: 38875077 DOI: 10.1109/tcbb.2024.3414497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/16/2024]
Abstract
Understanding the tertiary structures of proteins is of great benefit to function in many aspects of human life. Protein fold recognition is a vital and salient means to know protein structure. Until now, researchers have successively proposed a variety of methods to realize protein fold recognition, but the novel and effective computational method is still needed to handle this problem with the continuous updating of protein structure databases. In this study, we develop a new protein structure dataset named AT and propose the PRFold-TNN model for protein fold recognition. First, different types of feature extraction methods including AAC, HMM, HMM-Bigram and ACC are selected to extract corresponding features for protein sequences. Then an ensemble feature selection method based on PageRank algorithm integrating various tree-based algorithms is used to screen the fusion features. Ultimately, the classifier based on the Transformer model achieves the final prediction. Experiments show that the prediction accuracy is 86.27% on the AT dataset and 88.91% on the independent test set, indicating that the model can demonstrate superior performance and generalization ability in the problem of protein fold recognition. Furthermore, we also carry out research on the DD, EDD and TG benchmark datasets, and make them achieve prediction accuracy of 88.41%, 97.91% and 95.16%, which are at least 3.0%, 0.8% and 2.5% higher than those of the state-of-the-art methods. It can be concluded that the PRFold-TNN model is more prominent.
Collapse
|
2
|
Qin X, Liu M, Liu G. ResCNNT-fold: Combining residual convolutional neural network and Transformer for protein fold recognition from language model embeddings. Comput Biol Med 2023; 166:107571. [PMID: 37864911 DOI: 10.1016/j.compbiomed.2023.107571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 09/30/2023] [Accepted: 10/11/2023] [Indexed: 10/23/2023]
Abstract
A comprehensive understanding of protein functions holds significant promise for disease research and drug development, and proteins with analogous tertiary structures tend to exhibit similar functions. Protein fold recognition stands as a classical approach in the realm of protein structure investigation. Despite significant advancements made by researchers in this field, the continuous updating of protein databases presents an ongoing challenge in accurately identifying protein fold types. In this study, we introduce a predictor, ResCNNT-fold, for protein fold recognition and employ the LE dataset for testing purpose. ResCNNT-fold leverages a pre-trained language model to obtain embedding representations for protein sequences, which are then processed by the ResCNNT feature extractor, a combination of residual convolutional neural network and Transformer, to derive fold-specific features. Subsequently, the query protein is paired with each protein whose structure is known in the template dataset. For each pair, the similarity score of their fold-specific features is calculated. Ultimately, the query protein is identified as the fold type of the template protein in the pair with the highest similarity score. To further validate the utility and efficacy of the proposed ResCNNT-fold predictor, we conduct a 2-fold cross-validation experiment on the fold level of the LE dataset. Remarkably, this rigorous evaluation yields an exceptional accuracy of 91.57%, which surpasses the best result among other state-of-the-art protein fold recognition methods by an approximate margin of 10%. The excellent performance unequivocally underscores the compelling advantages inherent to our proposed ResCNNT-fold predictor in the realm of protein fold recognition. The source code and data of ResCNNT-fold can be downloaded from https://github.com/Bioinformatics-Laboratory/ResCNNT-fold.
Collapse
Affiliation(s)
- Xinyi Qin
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Guangzhong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| |
Collapse
|
3
|
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is slightly unitary, and most of them only can predict DBPs or RBPs. In this study, we proposed a computational predictor iDRBP-EL to identify DNA- and RNA- binding proteins, and introduced hierarchical ensemble learning to integrate three level information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs only based on protein sequences.
Collapse
|
4
|
Qin X, Zhang L, Liu M, Xu Z, Liu G. ASFold-DNN: Protein Fold Recognition Based on Evolutionary Features With Variable Parameters Using Full Connected Neural Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2712-2722. [PMID: 34133282 DOI: 10.1109/tcbb.2021.3089168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
Protein fold recognition contribute to comprehend the function of proteins, which is of great help to the gene therapy of diseases and the development of new drugs. Researchers have been working in this direction and have made considerable achievements, but challenges still exist on low sequence similarity datasets. In this study, we propose the ASFold-DNN framework for protein fold recognition research. Above all, four groups of evolutionary features are extracted from the primary structures of proteins, and a preliminary selection of variable parameter is made for two groups of features including ACC _HMM and SXG _HMM, respectively. Then several feature selection algorithms are selected for comparison and the best feature selection scheme is obtained by changing their internal threshold values. Finally, multiple hyper-parameters of Full Connected Neural Network are fully optimized to construct the best model. DD, EDD and TG datasets with low sequence similarities are chosen to evaluate the performance of the models constructed by the framework, and the final prediction accuracy are 85.28, 95.00 and 88.84 percent, respectively. Furthermore, the ASTRAL186 and LE datasets are introduced to further verify the generalization ability of our proposed framework. Comprehensive experimental results prove that the ASFold-DNN framework is more prominent than the state-of-the-art studies on protein fold recognition. The source code and data of ASFold-DNN can be downloaded from https://github.com/Bioinformatics-Laboratory/project/tree/master/ASFold.
Collapse
|
5
|
Zhang T, Jia Y, Li H, Xu D, Zhou J, Wang G. CRISPRCasStack: a stacking strategy-based ensemble learning framework for accurate identification of Cas proteins. Brief Bioinform 2022; 23:6674167. [PMID: 35998924 DOI: 10.1093/bib/bbac335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2022] [Revised: 07/13/2022] [Accepted: 07/23/2022] [Indexed: 11/12/2022] Open
Abstract
CRISPR-Cas system is an adaptive immune system widely found in most bacteria and archaea to defend against exogenous gene invasion. One of the most critical steps in the study of exploring and classifying novel CRISPR-Cas systems and their functional diversity is the identification of Cas proteins in CRISPR-Cas systems. The discovery of novel Cas proteins has also laid the foundation for technologies such as CRISPR-Cas-based gene editing and gene therapy. Currently, accurate and efficient screening of Cas proteins from metagenomic sequences and proteomic sequences remains a challenge. For Cas proteins with low sequence conservation, existing tools for Cas protein identification based on homology cannot guarantee identification accuracy and efficiency. In this paper, we have developed a novel stacking-based ensemble learning framework for Cas protein identification, called CRISPRCasStack. In particular, we applied the SHAP (SHapley Additive exPlanations) method to analyze the features used in CRISPRCasStack. Sufficient experimental validation and independent testing have demonstrated that CRISPRCasStack can address the accuracy deficiencies and inefficiencies of the existing state-of-the-art tools. We also provide a toolkit to accurately identify and analyze potential Cas proteins, Cas operons, CRISPR arrays and CRISPR-Cas locus in prokaryotic sequences. The CRISPRCasStack toolkit is available at https://github.com/yrjia1015/CRISPRCasStack.
Collapse
Affiliation(s)
- Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Yuran Jia
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Dali Xu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Jie Zhou
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| |
Collapse
|
6
|
Mohammadi A, Zahiri J, Mohammadi S, Khodarahmi M, Arab SS. PSSMCOOL: A Comprehensive R Package for Generating Evolutionary-based Descriptors of Protein Sequences from PSSM Profiles. BIOLOGY METHODS AND PROTOCOLS 2022; 7:bpac008. [PMID: 35388370 PMCID: PMC8977839 DOI: 10.1093/biomethods/bpac008] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2021] [Revised: 01/21/2022] [Indexed: 11/14/2022]
Abstract
Position-specific scoring matrix (PSSM), also called profile, is broadly used for representing the evolutionary history of a given protein sequence. Several investigations reported that the PSSM-based feature descriptors can improve the prediction of various protein attributes such as interaction, function, subcellular localization, secondary structure, disorder regions, and accessible surface area. While plenty of algorithms have been suggested for extracting evolutionary features from PSSM in recent years, there is not any integrated standalone tool for providing these descriptors. Here, we introduce PSSMCOOL, a flexible comprehensive R package that generates 38 PSSM-based feature vectors. To our best knowledge, PSSMCOOL is the first PSSM-based feature extraction tool implemented in R. With the growing demand for exploiting machine-learning algorithms in computational biology, this package would be a practical tool for machine-learning predictions.
Collapse
Affiliation(s)
- Alireza Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | - Javad Zahiri
- Department of Neuroscience, University of California San Diego, California, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Saber Mohammadi
- Bioinformatics and Computational Omics Lab (BioCOOL), Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| | - Mohsen Khodarahmi
- Department of Radiology, Shahid Madani Hospital, Karaj, Iran
- Bahar Medical Imaging Center, Karaj, Iran
- Dr. Khodarahmi Medical Imaging Center, Karaj, Iran
| | - Seyed Shahriar Arab
- Department of Biophysics, Faculty of Biological Sciences, Tarbiat Modares University, Tehran, Iran
| |
Collapse
|
7
|
Predicting Protein-Protein Interactions via Random Ferns with Evolutionary Matrix Representation. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:7191684. [PMID: 35242211 PMCID: PMC8888042 DOI: 10.1155/2022/7191684] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Revised: 01/15/2022] [Accepted: 01/18/2022] [Indexed: 11/27/2022]
Abstract
Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis, genetic mechanisms, guiding drug design, and other biochemical processes, thus, the identification of PPIs is of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPIs sequence data has been accumulated. Researchers have designed many experimental methods to detect PPIs by using these sequence data, hence, the prediction of PPIs has become a research hotspot in proteomics. However, since traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems employing machine learning knowledge were widely applied to PPIs prediction, thereby improving the overall recognition rate. In this paper, a novel and efficient computational technology is presented to implement a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel data processing feature representation scheme, MatFLDA, to extract the essential information of PSSM for protein sequences and obtained five training and five testing datasets by adopting a five-fold cross-validation method. Finally, the random fern (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved good prediction performance with 95.03% average accuracy on Yeast dataset and 85.35% average accuracy on H. pylori dataset, which effectively outperformed other existing computational methods. The experimental results indicate that the proposed method is capable of yielding better prediction results of PPIs, which provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.
Collapse
|
8
|
Qin X, Liu M, Zhang L, Liu G. Structural protein fold recognition based on secondary structure and evolutionary information using machine learning algorithms. Comput Biol Chem 2021; 91:107456. [PMID: 33610129 DOI: 10.1016/j.compbiolchem.2021.107456] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2020] [Revised: 01/04/2021] [Accepted: 02/06/2021] [Indexed: 11/18/2022]
Abstract
Understanding the function of protein is conducive to research in advanced fields such as gene therapy of diseases, the development and design of new drugs, etc. The prerequisite for understanding the function of a protein is to determine its tertiary structure. The realization of protein structure classification is indispensable for this problem and fold recognition is a commonly used method of protein structure classification. Protein sequences of 40% identity in the ASTRAL protein classification database are used for fold recognition research in current work to predict 27 folding types which mostly belong to four protein structural classes: α, β, α/β and α + β. We extract features from primary structure of protein using methods covering DSSP, PSSM and HMM which are based on secondary structure and evolutionary information to convert protein sequences into feature vectors that can be recognized by machine learning algorithm and utilize the combination of LightGBM feature selection algorithm and incremental feature selection method (IFS) to find the optimal classifiers respectively constructed by machine learning algorithms on the basis of tree structure including Random Forest, XGBoost and LightGBM. Bayesian optimization method is used for hyper-parameter adjustment of machine learning algorithms to make the accuracy of fold recognition reach as high as 93.45% at last. The result obtained by the model we propose is outstanding in the study of protein fold recognition.
Collapse
Affiliation(s)
- Xinyi Qin
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Lu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| | - Guangzhong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China.
| |
Collapse
|
9
|
Bankapur S, Patil N. An Enhanced Protein Fold Recognition for Low Similarity Datasets Using Convolutional and Skip-Gram Features With Deep Neural Network. IEEE Trans Nanobioscience 2020; 20:42-49. [PMID: 32894720 DOI: 10.1109/tnb.2020.3022456] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
The protein fold recognition is one of the important tasks of structural biology, which helps in addressing further challenges like predicting the protein tertiary structures and its functions. Many machine learning works are published to identify the protein folds effectively. However, very few works have reported the fold recognition accuracy above 80% on benchmark datasets. In this study, an effective set of global and local features are extracted from the proposed Convolutional (Conv) and SkipXGram bi-gram (SXGbg) techniques, and the fold recognition is performed using the proposed deep neural network. The performance of the proposed model reported 91.4% fold accuracy on one of the derived low similarity (< 25%) datasets of latest extended version of SCOPe_2.07. The proposed model is further evaluated on three popular and publicly available benchmark datasets such as DD, EDD, and TG and obtained 85.9%, 95.8%, and 88.8% fold accuracies, respectively. This work is first to report fold recognition accuracy above 85% on all the benchmark datasets. The performance of the proposed model has outperformed the best state-of-the-art models by 5% to 23% on DD, 2% to 19% on EDD, and 3% to 30% on TG dataset.
Collapse
|
10
|
Elhefnawy W, Li M, Wang J, Li Y. DeepFrag-k: a fragment-based deep learning approach for protein fold recognition. BMC Bioinformatics 2020; 21:203. [PMID: 33203392 PMCID: PMC7672895 DOI: 10.1186/s12859-020-3504-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2020] [Accepted: 04/16/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND One of the most essential problems in structural bioinformatics is protein fold recognition. In this paper, we design a novel deep learning architecture, so-called DeepFrag-k, which identifies fold discriminative features at fragment level to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multi-modal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and then the second stage uses a deep convolutional neural network (CNN) to classify the fragment vector into the corresponding fold. RESULTS Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition. CONCLUSIONS There is a set of fragments that can serve as structural "keywords" distinguishing between major protein folds. The deep learning architecture in DeepFrag-k is able to accurately identify these fragments as structure features to improve protein fold recognition.
Collapse
Affiliation(s)
- Wessam Elhefnawy
- Department of Computer Science, Old Dominion University, Norfolk, U.S.A
| | - Min Li
- Department of Computer Science, Central South University, Changsha, China
| | - Jianxin Wang
- Department of Computer Science, Central South University, Changsha, China
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, U.S.A..
| |
Collapse
|
11
|
Li Y, Liu X, You Z, Li L, Guo J, Wang Z. A computational approach for predicting drug–target interactions from protein sequence and drug substructure fingerprint information. INT J INTELL SYST 2020. [DOI: 10.1002/int.22332] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Affiliation(s)
- Yang Li
- School of Computer Science & Cyberspace Security Hainan University Haikou China
| | - Xiao‐zhang Liu
- School of Computer Science & Cyberspace Security Hainan University Haikou China
| | - Zhu‐Hong You
- School of Information Engineering Xijing University Xi'an China
| | - Li‐Ping Li
- School of Information Engineering Xijing University Xi'an China
| | - Jian‐Xin Guo
- School of Information Engineering Xijing University Xi'an China
| | - Zheng Wang
- School of Information Engineering Xijing University Xi'an China
| |
Collapse
|
12
|
Zhang J, Lv L, Lu D, Kong D, Al-Alashaari MAA, Zhao X. Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors. BMC Bioinformatics 2020; 21:480. [PMID: 33109082 PMCID: PMC7590791 DOI: 10.1186/s12859-020-03826-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Accepted: 10/19/2020] [Indexed: 12/13/2022] Open
Abstract
Background Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should be existent in advance. However, it has been hardly ever seen in prevailing researches. Therefore, unsupervised learning rather than supervised learning (e.g. classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered. Results Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset containing 1947 protein sequences as a case, experiments are made to identify bacterial type IV secreted effectors (T4SE) from protein sequences, which are composed of 399 T4SE and 1548 non-T4SE. Comparable and quantified results are obtained only using certain components of the encoded feature, i.e., position-specific scoring matix, and that indicates the effectiveness of our method. Conclusions Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
Collapse
Affiliation(s)
- Jian Zhang
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
| | - Lixin Lv
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
| | - Donglei Lu
- College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China
| | - Denan Kong
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China
| | | | - Xudong Zhao
- College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.
| |
Collapse
|
13
|
Refahi MS, Mir A, Nasiri JA. A novel fusion based on the evolutionary features for protein fold recognition using support vector machines. Sci Rep 2020; 10:14368. [PMID: 32873824 PMCID: PMC7463267 DOI: 10.1038/s41598-020-71172-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2019] [Accepted: 08/10/2020] [Indexed: 11/29/2022] Open
Abstract
Protein fold recognition plays a crucial role in discovering three-dimensional structure of proteins and protein functions. Several approaches have been employed for the prediction of protein folds. Some of these approaches are based on extracting features from protein sequences and using a strong classifier. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physicochemical-based information to extract features. In recent years, finding an efficient technique for integrating discriminate features have been received advancing attention. In this study, we integrate Auto-Cross-Covariance and Separated dimer evolutionary feature extraction methods. The results’ features are scored by Information gain to define and select several discriminated features. According to three benchmark datasets, DD, RDD ,and EDD, the results of the support vector machine show more than 6\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\%$$\end{document}% improvement in accuracy on these benchmark datasets.
Collapse
Affiliation(s)
- Mohammad Saleh Refahi
- Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran
| | - A Mir
- Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
| | - Jalal A Nasiri
- Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran.
| |
Collapse
|
14
|
HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:1384749. [PMID: 32300371 PMCID: PMC7142336 DOI: 10.1155/2020/1384749] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/29/2020] [Accepted: 03/16/2020] [Indexed: 02/08/2023]
Abstract
Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science due to its crucial role in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been proved to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. The experimental results tested on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.
Collapse
|
15
|
Khanh Le NQ, Nguyen QH, Chen X, Rahardja S, Nguyen BP. Classification of adaptor proteins using recurrent neural networks and PSSM profiles. BMC Genomics 2019; 20:966. [PMID: 31874633 PMCID: PMC6929330 DOI: 10.1186/s12864-019-6335-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2019] [Accepted: 11/25/2019] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Adaptor proteins are carrier proteins that play a crucial role in signal transduction. They commonly consist of several modular domains, each having its own binding activity and operating by forming complexes with other intracellular-signaling molecules. Many studies determined that the adaptor proteins had been implicated in a variety of human diseases. Therefore, creating a precise model to predict the function of adaptor proteins is one of the vital tasks in bioinformatics and computational biology. Few computational biology studies have been conducted to predict the protein functions, and in most of those studies, position specific scoring matrix (PSSM) profiles had been used as the features to be fed into the neural networks. However, the neural networks could not reach the optimal result because the sequential information in PSSMs has been lost. This study proposes an innovative approach by incorporating recurrent neural networks (RNNs) and PSSM profiles to resolve this problem. RESULTS Compared to other state-of-the-art methods which had been applied successfully in other problems, our method achieves enhancement in all of the common measurement metrics. The area under the receiver operating characteristic curve (AUC) metric in prediction of adaptor proteins in the cross-validation and independent datasets are 0.893 and 0.853, respectively. CONCLUSIONS This study opens a research path that can promote the use of RNNs and PSSM profiles in bioinformatics and computational biology. Our approach is reproducible by scientists that aim to improve the performance results of different protein function prediction problems. Our source code and datasets are available at https://github.com/ngphubinh/adaptors.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Keelung Road, Da'an Distric, Taipei City 106, Taiwan (R.O.C.)
| | - Quang H Nguyen
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, Hanoi 100000, Vietnam
| | - Xuan Chen
- Beijing Genomics Institute, 21 Hongan 3rd Street, Shenzhen 518083, China
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, 127 West Youyi Road, Xi'an 710072, China.
| | - Binh P Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, Wellington 6140, New Zealand
| |
Collapse
|
16
|
Chandra A, Sharma A, Dehzangi A, Shigemizu D, Tsunoda T. Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix. BMC Mol Cell Biol 2019; 20:57. [PMID: 31856704 PMCID: PMC6923822 DOI: 10.1186/s12860-019-0240-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2019] [Accepted: 11/20/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND The biological process known as post-translational modification (PTM) is a condition whereby proteomes are modified that affects normal cell biology, and hence the pathogenesis. A number of PTMs have been discovered in the recent years and lysine phosphoglycerylation is one of the fairly recent developments. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to factors such as cost, time consumption and inefficiency involved in the experimental efforts. To overcome this issue, computational techniques have emerged to accurately identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far hold limitations to correctly predict this covalent modification. RESULTS We propose a new predictor in this paper called Bigram-PGK which uses evolutionary information of amino acids to try and predict phosphoglycerylated sites. The benchmark dataset which contains experimentally labelled sites is employed for this purpose and profile bigram occurrences is calculated from position specific scoring matrices of amino acids in the protein sequences. The statistical measures of this work, such as sensitivity, specificity, precision, accuracy, Mathews correlation coefficient and area under ROC curve have been reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively. CONCLUSIONS The proposed predictor, based on the feature of evolutionary information and support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against the existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK.
Collapse
Affiliation(s)
- Abel Chandra
- School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji.
| | - Alok Sharma
- School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, 4111, Australia. .,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan. .,CREST, JST, Tokyo, 102-8666, Japan.
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, USA
| | - Daichi Shigemizu
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan.,CREST, JST, Tokyo, 102-8666, Japan.,Medical Genome Center, National Center for Geriatrics and Gerontology, Obu, Aichi, 474-8511, Japan
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan.,CREST, JST, Tokyo, 102-8666, Japan.,Laboratory for Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, 108-8639, Japan
| |
Collapse
|
17
|
Patil K, Chouhan U. Relevance of Machine Learning Techniques and Various Protein Features in Protein Fold Classification: A Review. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190204154038] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
Background:
Protein fold prediction is a fundamental step in Structural Bioinformatics.
The tertiary structure of a protein determines its function and to predict its tertiary structure, fold
prediction serves an important role. Protein fold is simply the arrangement of the secondary
structure elements relative to each other in space. A number of studies have been carried out till
date by different research groups working worldwide in this field by using the combination of
different benchmark datasets, different types of descriptors, features and classification techniques.
Objective:
In this study, we have tried to put all these contributions together, analyze their study
and to compare different techniques used by them.
Methods:
Different features are derived from protein sequence, its secondary structure, different
physicochemical properties of amino acids, domain composition, Position Specific Scoring Matrix,
profile and threading techniques.
Conclusion:
Combination of these different features can improve classification accuracy to a
large extent. With the help of this survey, one can know the most suitable feature/attribute set and
classification technique for this multi-class protein fold classification problem.
Collapse
Affiliation(s)
- Komal Patil
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
| | - Usha Chouhan
- Department of Mathematics, Maulana Azad National Institute of Technology (MANIT), Bhopal, 462003 M.P, India
| |
Collapse
|
18
|
Abstract
This article introduces the Turn-Taking Spiking Neural Network (TTSNet), which is a cognitive model to perform early turn-taking prediction about a human or agent’s intentions. The TTSNet framework relies on implicit and explicit multimodal communication cues (physical, neurological and physiological) to be able to predict when the turn-taking event will occur in a robust and unambiguous fashion. To test the theories proposed, the TTSNet framework was implemented on an assistant robotic nurse, which predicts surgeon’s turn-taking intentions and delivers surgical instruments accordingly. Experiments were conducted to evaluate TTSNet’s performance in early turn-taking prediction. It was found to reach an [Formula: see text] score of 0.683 given 10% of completed action, and an [Formula: see text] score of 0.852 at 50% and 0.894 at 100% of the completed action. This performance outperformed multiple state-of-the-art algorithms, and surpassed human performance when limited partial observation is given (<40%). Such early turn-taking prediction capability would allow robots to perform collaborative actions proactively, in order to facilitate collaboration and increase team efficiency.
Collapse
Affiliation(s)
- Tian Zhou
- Industrial Engineering, Purdue University, USA
| | | |
Collapse
|
19
|
Peyravi F, Latif A, Moshtaghioun SM. Protein tertiary structure prediction using hidden Markov model based on lattice. J Bioinform Comput Biol 2019; 17:1950007. [PMID: 31057069 DOI: 10.1142/s0219720019500070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The prediction of protein structure from its amino acid sequence is one of the most prominent problems in computational biology. The biological function of a protein depends on its tertiary structure which is determined by its amino acid sequence via the process of protein folding. We propose a novel fold recognition method for protein tertiary structure prediction based on a hidden Markov model and 3D coordinates of amino acid residues. The method introduces states based on the basis vectors in Bravais cubic lattices to learn the path of amino acids of the proteins of each fold. Three hidden Markov models are considered based on simple cubic, body-centered cubic (BCC) and face-centered cubic (FCC) lattices. A 10-fold cross validation was performed on a set of 42 fold SCOP dataset. The proposed composite methodology is compared to fold recognition methods which have HMM as base of their algorithms having approaches on only amino acid sequence or secondary structure. The accuracy of proposed model based on face-centered cubic lattices is quite better in comparison with SAM, 3-HMM optimized and Markov chain optimized in overall experiment. The huge data of 3D space help the model to have greater performance in comparison to methods which use only primary structures or only secondary structures.
Collapse
Affiliation(s)
- Farzad Peyravi
- * Department of Computer Engineering, Yazd University, Yazd, Iran
| | | | | |
Collapse
|
20
|
Chandra AA, Sharma A, Dehzangi A, Tsunoda T. EvolStruct-Phogly: incorporating structural properties and evolutionary information from profile bigrams for the phosphoglycerylation prediction. BMC Genomics 2019; 19:984. [PMID: 30999859 PMCID: PMC7402405 DOI: 10.1186/s12864-018-5383-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2018] [Accepted: 12/17/2018] [Indexed: 01/21/2023] Open
Abstract
Background Post-translational modification (PTM), which is a biological process, tends to modify proteome that leads to changes in normal cell biology and pathogenesis. In the recent times, there has been many reported PTMs. Out of the many modifications, phosphoglycerylation has become particularly the subject of interest. The experimental procedure for identification of phosphoglycerylated residues continues to be an expensive, inefficient and time-consuming effort, even with a large number of proteins that are sequenced in the post-genomic period. Computational methods are therefore being anticipated in order to effectively predict phosphoglycerylated lysines. Even though there are predictors available, the ability to detect phosphoglycerylated lysine residues still remains inadequate. Results We have introduced a new predictor in this paper named EvolStruct-Phogly that uses structural and evolutionary information relating to amino acids to predict phosphoglycerylated lysine residues. Benchmarked data is employed containing experimentally identified phosphoglycerylated and non-phosphoglycerylated lysines. We have then extracted the three structural information which are accessible surface area of amino acids, backbone torsion angles, amino acid’s local structure conformations and profile bigrams of position-specific scoring matrices. Conclusion EvolStruct-Phogly showed a noteworthy improvement in regards to the performance when compared with the previous predictors. The performance metrics obtained are as follows: sensitivity 0.7744, specificity 0.8533, precision 0.7368, accuracy 0.8275, and Mathews correlation coefficient of 0.6242. The software package and data of this work can be obtained from https://github.com/abelavit/EvolStruct-Phogly or www.alok-ai-lab.com
Collapse
Affiliation(s)
| | - Alok Sharma
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia. .,CREST, JST, Tokyo, Japan.
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, USA
| | - Tatushiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.,CREST, JST, Tokyo, Japan.,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
| |
Collapse
|
21
|
Zhang J, Liu B. A Review on the Recent Developments of Sequence-based Protein Feature Extraction Methods. Curr Bioinform 2019. [DOI: 10.2174/1574893614666181212102749] [Citation(s) in RCA: 96] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:Proteins play a crucial role in life activities, such as catalyzing metabolic reactions, DNA replication, responding to stimuli, etc. Identification of protein structures and functions are critical for both basic research and applications. Because the traditional experiments for studying the structures and functions of proteins are expensive and time consuming, computational approaches are highly desired. In key for computational methods is how to efficiently extract the features from the protein sequences. During the last decade, many powerful feature extraction algorithms have been proposed, significantly promoting the development of the studies of protein structures and functions.Objective:To help the researchers to catch up the recent developments in this important field, in this study, an updated review is given, focusing on the sequence-based feature extractions of protein sequences.Method:These sequence-based features of proteins were grouped into three categories, including composition-based features, autocorrelation-based features and profile-based features. The detailed information of features in each group was introduced, and their advantages and disadvantages were discussed. Besides, some useful tools for generating these features will also be introduced.Results:Generally, autocorrelation-based features outperform composition-based features, and profile-based features outperform autocorrelation-based features. The reason is that profile-based features consider the evolutionary information, which is useful for identification of protein structures and functions. However, profile-based features are more time consuming, because the multiple sequence alignment process is required.Conclusion:In this study, some recently proposed sequence-based features were introduced and discussed, such as basic k-mers, PseAAC, auto-cross covariance, top-n-gram etc. These features did make great contributions to the developments of protein sequence analysis. Future studies can be focus on exploring the combinations of these features. Besides, techniques from other fields, such as signal processing, natural language process (NLP), image processing etc., would also contribute to this important field, because natural languages (such as English) and protein sequences share some similarities. Therefore, the proteins can be treated as documents, and the features, such as k-mers, top-n-grams, motifs, can be treated as the words in the languages. Techniques from these filed will give some new ideas and strategies for extracting the features from proteins.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, HIT Campus Shenzhen University Town, Xili, Shenzhen, Guangdong 518055, China
| |
Collapse
|
22
|
Zhan ZH, Jia LN, Zhou Y, Li LP, Yi HC. BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information. Int J Mol Sci 2019; 20:E978. [PMID: 30813451 PMCID: PMC6412311 DOI: 10.3390/ijms20040978] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2019] [Revised: 02/19/2019] [Accepted: 02/20/2019] [Indexed: 11/26/2022] Open
Abstract
The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as gene expression regulations. However, due to limitations, including financial and material consumptions in recent experimental methods for predicting ncRNA and protein interactions, it is essential to propose an innovative and practical approach with convincing performance of prediction accuracy. In this study, based on the protein sequences from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA and protein interactions. Protein sequences are represented by bi-gram probability feature extraction method from Position Specific Scoring Matrix (PSSM), and for ncRNA sequences, k-mers sparse matrices are employed to represent them. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method by using three datasets and a five-fold cross-validation after classifying the features through the random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for ncRNA and protein interacting predictions and it provides some serviceable guidance in future biological research.
Collapse
Affiliation(s)
- Zhao-Hui Zhan
- China University of Mining and Technology, Xuzhou 221116, China.
| | - Li-Na Jia
- College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, Shandong, China.
| | - Yong Zhou
- China University of Mining and Technology, Xuzhou 221116, China.
| | - Li-Ping Li
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| | - Hai-Cheng Yi
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China.
| |
Collapse
|
23
|
Yan K, Fang X, Xu Y, Liu B. Protein fold recognition based on multi-view modeling. Bioinformatics 2019; 35:2982-2990. [DOI: 10.1093/bioinformatics/btz040] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Revised: 12/29/2018] [Accepted: 01/16/2019] [Indexed: 12/22/2022] Open
Abstract
Abstract
Motivation
Protein fold recognition has attracted increasing attention because it is critical for studies of the 3D structures of proteins and drug design. Researchers have been extensively studying this important task, and several features with high discriminative power have been proposed. However, the development of methods that efficiently combine these features to improve the predictive performance remains a challenging problem.
Results
In this study, we proposed two algorithms: MV-fold and MT-fold. MV-fold is a new computational predictor based on the multi-view learning model for fold recognition. Different features of proteins were treated as different views of proteins, including the evolutionary information, secondary structure information and physicochemical properties. These different views constituted the latent space. The ε-dragging technique was employed to enlarge the margins between different protein folds, improving the predictive performance of MV-fold. Then, MV-fold was combined with two template-based methods: HHblits and HMMER. The ensemble method is called MT-fold incorporating the advantages of both discriminative methods and template-based methods. Experimental results on five widely used benchmark datasets (DD, RDD, EDD, TG and LE) showed that the proposed methods outperformed some state-of-the-art methods in this field, indicating that MV-fold and MT-fold are useful computational tools for protein fold recognition and protein homology detection and would be efficient tools for protein sequence analysis. Finally, we constructed an update and rigorous benchmark dataset based on SCOPe (version 2.07) to fairly evaluate the performance of the proposed method, and our method achieved stable performance on this new dataset. This new benchmark dataset will become a widely used benchmark dataset to fairly evaluate the performance of different methods for fold recognition.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Xiaozhao Fang
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
| | - Yong Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
24
|
Flot M, Mishra A, Kuchi AS, Hoque MT. StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence. Methods Mol Biol 2019; 1958:101-122. [PMID: 30945215 DOI: 10.1007/978-1-4939-9161-7_5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Supersecondary structure (SSS) refers to specific geometric arrangements of several secondary structure (SS) elements that are connected by loops. The SSS can provide useful information about the spatial structure and function of a protein. As such, the SSS is a bridge between the secondary structure and tertiary structure. In this chapter, we propose a stacking-based machine learning method for the prediction of two types of SSSs, namely, β-hairpins and β-α-β, from the protein sequence based on comprehensive feature encoding. To encode protein residues, we utilize key features such as solvent accessibility, conservation profile, half surface exposure, torsion angle fluctuation, disorder probabilities, and more. The usefulness of the proposed approach is assessed using a widely used threefold cross-validation technique. The obtained empirical result shows that the proposed approach is useful and prediction can be improved further.
Collapse
Affiliation(s)
- Michael Flot
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| | - Avdesh Mishra
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| | - Aditi Sharma Kuchi
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| | - Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
| |
Collapse
|
25
|
Dehzangi A, López Y, Taherzadeh G, Sharma A, Tsunoda T. SumSec: Accurate Prediction of Sumoylation Sites Using Predicted Secondary Structure. Molecules 2018; 23:E3260. [PMID: 30544729 PMCID: PMC6320791 DOI: 10.3390/molecules23123260] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2018] [Revised: 11/30/2018] [Accepted: 12/05/2018] [Indexed: 12/13/2022] Open
Abstract
Post Translational Modification (PTM) is defined as the modification of amino acids along the protein sequences after the translation process. These modifications significantly impact on the functioning of proteins. Therefore, having a comprehensive understanding of the underlying mechanism of PTMs turns out to be critical in studying the biological roles of proteins. Among a wide range of PTMs, sumoylation is one of the most important modifications due to its known cellular functions which include transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites via experimental methods is time-consuming and costly. This has led to a great demand for the development of fast computational methods able to accurately determine sumoylation sites in proteins. In this study, we present a new machine learning-based method for predicting sumoylation sites called SumSec. To do this, we employed the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence which has never been used for this task. As a result, our proposed method is able to enhance the sumoylation site prediction task, outperforming previously proposed methods in the literature. SumSec demonstrated high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than those reported in previous studies. The script and extracted features are publicly available at: https://github.com/YosvanyLopez/SumSec.
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD 21251, USA.
| | - Yosvany López
- Genesis Institute of Genetic Research, Genesis Healthcare Co., Tokyo 150-6015, Japan.
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Gold Coast 4222, Australia.
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane 4111, Australia.
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.
- CREST, JST, Tokyo 102-0076, Japan.
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan.
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.
- CREST, JST, Tokyo 102-0076, Japan.
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan.
| |
Collapse
|
26
|
Sudha P, Ramyachitra D, Manikandan P. Enhanced Artificial Neural Network for Protein Fold Recognition and Structural Class Prediction. GENE REPORTS 2018. [DOI: 10.1016/j.genrep.2018.07.012] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
27
|
Wang J, Yang B, Revote J, Leier A, Marquez-Lago TT, Webb G, Song J, Chou KC, Lithgow T. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 2018; 33:2756-2758. [PMID: 28903538 DOI: 10.1093/bioinformatics/btx302] [Citation(s) in RCA: 107] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2017] [Accepted: 05/09/2017] [Indexed: 11/13/2022] Open
Abstract
Summary Evolutionary information in the form of a Position-Specific Scoring Matrix (PSSM) is a widely used and highly informative representation of protein sequences. Accordingly, PSSM-based feature descriptors have been successfully applied to improve the performance of various predictors of protein attributes. Even though a number of algorithms have been proposed in previous studies, there is currently no universal web server or toolkit available for generating this wide variety of descriptors. Here, we present POSSUM ( Po sition- S pecific S coring matrix-based feat u re generator for m achine learning), a versatile toolkit with an online web server that can generate 21 types of PSSM-based feature descriptors, thereby addressing a crucial need for bioinformaticians and computational biologists. We envisage that this comprehensive toolkit will be widely used as a powerful tool to facilitate feature extraction, selection, and benchmarking of machine learning-based models, thereby contributing to a more effective analysis and modeling pipeline for bioinformatics research. Availability and implementation http://possum.erc.monash.edu/ . Contact trevor.lithgow@monash.edu or jiangning.song@monash.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jiawei Wang
- Biomedicine Discovery Institute, Monash University, VIC 3800, Australia
| | - Bingjiao Yang
- College of Mechanical Engineering, Yanshan University, Qinhuangdao 066004, China
| | - Jerico Revote
- Biomedicine Discovery Institute, Monash University, VIC 3800, Australia
| | - André Leier
- Informatics Institute and Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Tatiana T Marquez-Lago
- Informatics Institute and Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Geoffrey Webb
- Monash Centre for Data Science, Faculty of Information Technology
| | - Jiangning Song
- Biomedicine Discovery Institute, Monash University, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology.,ARC Centre of Excellence for Advanced Molecular Imaging, Monash University, VIC 3800, Australia
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia
| | - Trevor Lithgow
- Biomedicine Discovery Institute, Monash University, VIC 3800, Australia
| |
Collapse
|
28
|
Dehzangi A, López Y, Lal SP, Taherzadeh G, Sattar A, Tsunoda T, Sharma A. Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams. PLoS One 2018; 13:e0191900. [PMID: 29432431 PMCID: PMC5809022 DOI: 10.1371/journal.pone.0191900] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2017] [Accepted: 01/12/2018] [Indexed: 11/18/2022] Open
Abstract
Post-translational modification refers to the biological mechanism involved in the enzymatic modification of proteins after being translated in the ribosome. This mechanism comprises a wide range of structural modifications, which bring dramatic variations to the biological function of proteins. One of the recently discovered modifications is succinylation. Although succinylation can be detected through mass spectrometry, its current experimental detection turns out to be a timely process unable to meet the exponential growth of sequenced proteins. Therefore, the implementation of fast and accurate computational methods has emerged as a feasible solution. This paper proposes a novel classification approach, which effectively incorporates the secondary structure and evolutionary information of proteins through profile bigrams for succinylation prediction. The proposed predictor, abbreviated as SSEvol-Suc, made use of the above features for training an AdaBoost classifier and consequently predicting succinylated lysine residues. When SSEvol-Suc was compared with four benchmark predictors, it outperformed them in metrics such as sensitivity (0.909), accuracy (0.875) and Matthews correlation coefficient (0.75).
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, Maryland, United States of America
| | - Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- * E-mail:
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, Palmerston North, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Queensland, Australia
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Queensland, Australia
- Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- CREST, JST, Tokyo, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji
| |
Collapse
|
29
|
López Y, Sharma A, Dehzangi A, Lal SP, Taherzadeh G, Sattar A, Tsunoda T. Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction. BMC Genomics 2018; 19:923. [PMID: 29363424 PMCID: PMC5781056 DOI: 10.1186/s12864-017-4336-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Post-translational modification is considered an important biological mechanism with critical impact on the diversification of the proteome. Although a long list of such modifications has been studied, succinylation of lysine residues has recently attracted the interest of the scientific community. The experimental detection of succinylation sites is an expensive process, which consumes a lot of time and resources. Therefore, computational predictors of this covalent modification have emerged as a last resort to tackling lysine succinylation. RESULTS In this paper, we propose a novel computational predictor called 'Success', which efficiently uses the structural and evolutionary information of amino acids for predicting succinylation sites. To do this, each lysine was described as a vector that combined the above information of surrounding amino acids. We then designed a support vector machine with a radial basis function kernel for discriminating between succinylated and non-succinylated residues. We finally compared the Success predictor with three state-of-the-art predictors in the literature. As a result, our proposed predictor showed a significant improvement over the compared predictors in statistical metrics, such as sensitivity (0.866), accuracy (0.838) and Matthews correlation coefficient (0.677) on a benchmark dataset. CONCLUSIONS The proposed predictor effectively uses the structural and evolutionary information of the amino acids surrounding a lysine. The bigram feature extraction approach, while retaining the same number of features, facilitates a better description of lysines. A support vector machine with a radial basis function kernel was used to discriminate between modified and unmodified lysines. The aforementioned aspects make the Success predictor outperform three state-of-the-art predictors in succinylation detection.
Collapse
Affiliation(s)
- Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia. .,School of Engineering & Physics, University of the South Pacific, Suva, Fiji.
| | - Abdollah Dehzangi
- Department of Computer Science, School of Computer, Mathematical, and Natural Sciences, Morgan State University, Baltimore, Maryland, USA
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, Palmerston North, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Brisbane, Australia
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.,School of Information and Communication Technology, Griffith University, Brisbane, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.,CREST, JST, Tokyo, 113-8510, Japan
| |
Collapse
|
30
|
Ibrahim W, Abadeh MS. Protein fold recognition using Deep Kernelized Extreme Learning Machine and linear discriminant analysis. Neural Comput Appl 2018. [DOI: 10.1007/s00521-018-3346-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
31
|
Zhan ZH, You ZH, Zhou Y, Li LP, Li ZW. Efficient Framework for Predicting ncRNA-Protein Interactions Based on Sequence Information by Deep Learning. INTELLIGENT COMPUTING THEORIES AND APPLICATION 2018:337-344. [DOI: 10.1007/978-3-319-95933-7_41] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/30/2023]
|
32
|
iDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting. Sci Rep 2017; 7:17731. [PMID: 29255285 PMCID: PMC5735173 DOI: 10.1038/s41598-017-18025-2] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2017] [Accepted: 12/05/2017] [Indexed: 02/07/2023] Open
Abstract
Prediction of new drug-target interactions is critically important as it can lead the researchers to find new uses for old drugs and to disclose their therapeutic profiles or side effects. However, experimental prediction of drug-target interactions is expensive and time-consuming. As a result, computational methods for predictioning new drug-target interactions have gained a tremendous interest in recent times. Here we present iDTI-ESBoost, a prediction model for identification of drug-target interactions using evolutionary and structural features. Our proposed method uses a novel data balancing and boosting technique to predict drug-target interaction. On four benchmark datasets taken from a gold standard data, iDTI-ESBoost outperforms the state-of-the-art methods in terms of area under receiver operating characteristic (auROC) curve. iDTI-ESBoost also outperforms the latest and the best-performing method found in the literature in terms of area under precision recall (auPR) curve. This is significant as auPR curves are argued as suitable metric for comparison for imbalanced datasets similar to the one studied here. Our reported results show the effectiveness of the classifier, balancing methods and the novel features incorporated in iDTI-ESBoost. iDTI-ESBoost is a novel prediction method that has for the first time exploited the structural features along with the evolutionary features to predict drug-protein interactions. We believe the excellent performance of iDTI-ESBoost both in terms of auROC and auPR would motivate the researchers and practitioners to use it to predict drug-target interactions. To facilitate that, iDTI-ESBoost is implemented and made publicly available at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/.
Collapse
|
33
|
Zhang J, Liu B. PSFM-DBT: Identifying DNA-Binding Proteins by Combing Position Specific Frequency Matrix and Distance-Bigram Transformation. Int J Mol Sci 2017; 18:ijms18091856. [PMID: 28841194 PMCID: PMC5618505 DOI: 10.3390/ijms18091856] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2017] [Revised: 08/19/2017] [Accepted: 08/22/2017] [Indexed: 12/30/2022] Open
Abstract
DNA-binding proteins play crucial roles in various biological processes, such as DNA replication and repair, transcriptional regulation and many other biological activities associated with DNA. Experimental recognition techniques for DNA-binding proteins identification are both time consuming and expensive. Effective methods for identifying these proteins only based on protein sequences are highly required. The key for sequence-based methods is to effectively represent protein sequences. It has been reported by various previous studies that evolutionary information is crucial for DNA-binding protein identification. In this study, we employed four methods to extract the evolutionary information from Position Specific Frequency Matrix (PSFM), including Residue Probing Transformation (RPT), Evolutionary Difference Transformation (EDT), Distance-Bigram Transformation (DBT), and Trigram Transformation (TT). The PSFMs were converted into fixed length feature vectors by these four methods, and then respectively combined with Support Vector Machines (SVMs); four predictors for identifying these proteins were constructed, including PSFM-RPT, PSFM-EDT, PSFM-DBT, and PSFM-TT. Experimental results on a widely used benchmark dataset PDB1075 and an independent dataset PDB186 showed that these four methods achieved state-of-the-art-performance, and PSFM-DBT outperformed other existing methods in this field. For practical applications, a user-friendly webserver of PSFM-DBT was established, which is available at http://bioinformatics.hitsz.edu.cn/PSFM-DBT/.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China.
| |
Collapse
|
34
|
Hidden Markov model and Chapman Kolmogrov for protein structures prediction from images. Comput Biol Chem 2017; 68:231-244. [DOI: 10.1016/j.compbiolchem.2017.04.003] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2017] [Revised: 03/11/2017] [Accepted: 04/11/2017] [Indexed: 11/20/2022]
|
35
|
Yan K, Xu Y, Fang X, Zheng C, Liu B. Protein fold recognition based on sparse representation based classification. Artif Intell Med 2017; 79:1-8. [DOI: 10.1016/j.artmed.2017.03.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2016] [Revised: 03/06/2017] [Accepted: 03/07/2017] [Indexed: 12/13/2022]
|
36
|
Wang Y, You Z, Li X, Chen X, Jiang T, Zhang J. PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences. Int J Mol Sci 2017; 18:ijms18051029. [PMID: 28492483 PMCID: PMC5454941 DOI: 10.3390/ijms18051029] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2017] [Revised: 04/24/2017] [Accepted: 04/29/2017] [Indexed: 01/08/2023] Open
Abstract
Protein–protein interactions (PPIs) are essential for most living organisms’ process. Thus, detecting PPIs is extremely important to understand the molecular mechanisms of biological systems. Although many PPIs data have been generated by high-throughput technologies for a variety of organisms, the whole interatom is still far from complete. In addition, the high-throughput technologies for detecting PPIs has some unavoidable defects, including time consumption, high cost, and high error rate. In recent years, with the development of machine learning, computational methods have been broadly used to predict PPIs, and can achieve good prediction rate. In this paper, we present here PCVMZM, a computational method based on a Probabilistic Classification Vector Machines (PCVM) model and Zernike moments (ZM) descriptor for predicting the PPIs from protein amino acids sequences. Specifically, a Zernike moments (ZM) descriptor is used to extract protein evolutionary information from Position-Specific Scoring Matrix (PSSM) generated by Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then, PCVM classifier is used to infer the interactions among protein. When performed on PPIs datasets of Yeast and H. Pylori, the proposed method can achieve the average prediction accuracy of 94.48% and 91.25%, respectively. In order to further evaluate the performance of the proposed method, the state-of-the-art support vector machines (SVM) classifier is used and compares with the PCVM model. Experimental results on the Yeast dataset show that the performance of PCVM classifier is better than that of SVM classifier. The experimental results indicate that our proposed method is robust, powerful and feasible, which can be used as a helpful tool for proteomics research.
Collapse
Affiliation(s)
- Yanbin Wang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Zhuhong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Xiao Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China.
| | - Tonghai Jiang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Jingting Zhang
- Department of Mathematics and Statistics, Henan University, Kaifeng 100190, China.
| |
Collapse
|
37
|
Dehzangi A, López Y, Lal SP, Taherzadeh G, Michaelson J, Sattar A, Tsunoda T, Sharma A. PSSM-Suc: Accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction. J Theor Biol 2017; 425:97-102. [PMID: 28483566 DOI: 10.1016/j.jtbi.2017.05.005] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2017] [Revised: 04/28/2017] [Accepted: 05/03/2017] [Indexed: 11/25/2022]
Abstract
Post-translational modification (PTM) is a covalent and enzymatic modification of proteins, which contributes to diversify the proteome. Despite many reported PTMs with essential roles in cellular functioning, lysine succinylation has emerged as a subject of particular interest. Because its experimental identification remains a costly and time-consuming process, computational predictors have been recently proposed for tackling this important issue. However, the performance of current predictors is still very limited. In this paper, we propose a new predictor called PSSM-Suc which employs evolutionary information of amino acids for predicting succinylated lysine residues. Here we described each lysine residue in terms of profile bigrams extracted from position specific scoring matrices. We compared the performance of PSSM-Suc to that of existing predictors using a widely used benchmark dataset. PSSM-Suc showed a significant improvement in performance over state-of-the-art predictors. Its sensitivity, accuracy and Matthews correlation coefficient were 0.8159, 0.8199 and 0.6396, respectively.
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA.
| | - Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Jacob Michaelson
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; CREST, JST, Tokyo 113-8510, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Australia; School of Engineering & Physics, University of the South Pacific, Fiji
| |
Collapse
|
38
|
Ahmad J, Javed F, Hayat M. Intelligent computational model for classification of sub-Golgi protein using oversampling and fisher feature selection methods. Artif Intell Med 2017; 78:14-22. [DOI: 10.1016/j.artmed.2017.05.001] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2017] [Revised: 04/19/2017] [Accepted: 05/02/2017] [Indexed: 10/19/2022]
|
39
|
López Y, Dehzangi A, Lal SP, Taherzadeh G, Michaelson J, Sattar A, Tsunoda T, Sharma A. SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids. Anal Biochem 2017; 527:24-32. [PMID: 28363440 DOI: 10.1016/j.ab.2017.03.021] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2016] [Revised: 03/13/2017] [Accepted: 03/28/2017] [Indexed: 11/30/2022]
Abstract
Post-Translational Modification (PTM) is a biological reaction which contributes to diversify the proteome. Despite many modifications with important roles in cellular activity, lysine succinylation has recently emerged as an important PTM mark. It alters the chemical structure of lysines, leading to remarkable changes in the structure and function of proteins. In contrast to the huge amount of proteins being sequenced in the post-genome era, the experimental detection of succinylated residues remains expensive, inefficient and time-consuming. Therefore, the development of computational tools for accurately predicting succinylated lysines is an urgent necessity. To date, several approaches have been proposed but their sensitivity has been reportedly poor. In this paper, we propose an approach that utilizes structural features of amino acids to improve lysine succinylation prediction. Succinylated and non-succinylated lysines were first retrieved from 670 proteins and characteristics such as accessible surface area, backbone torsion angles and local structure conformations were incorporated. We used the k-nearest neighbors cleaning treatment for dealing with class imbalance and designed a pruned decision tree for classification. Our predictor, referred to as SucStruct (Succinylation using Structural features), proved to significantly improve performance when compared to previous predictors, with sensitivity, accuracy and Mathew's correlation coefficient equal to 0.7334-0.7946, 0.7444-0.7608 and 0.4884-0.5240, respectively.
Collapse
Affiliation(s)
- Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Abdollah Dehzangi
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA.
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Jacob Michaelson
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; CREST, JST, Tokyo 113-8510, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| |
Collapse
|
40
|
Tsubaki M, Shimbo M, Matsumoto Y. Protein Fold Recognition with Representation Learning and Long Short-Term Memory. ACTA ACUST UNITED AC 2017. [DOI: 10.2197/ipsjtbio.10.2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Affiliation(s)
- Masashi Tsubaki
- Graduate School of Information Science, Nara Institute of Science and Technology
| | - Masashi Shimbo
- Graduate School of Information Science, Nara Institute of Science and Technology
| | - Yuji Matsumoto
- Graduate School of Information Science, Nara Institute of Science and Technology
| |
Collapse
|
41
|
ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier. BIOMED RESEARCH INTERNATIONAL 2016; 2016:6802832. [PMID: 27660761 PMCID: PMC5021882 DOI: 10.1155/2016/6802832] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/01/2016] [Revised: 07/15/2016] [Accepted: 08/07/2016] [Indexed: 11/17/2022]
Abstract
Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in PDB is very large, but only a very small part is categorized and stored in the SCOPe database. Therefore, it is necessary to develop an efficient method for protein fold classification. In recent years, a variety of classification methods have been used in many protein fold classification studies. In this study, we propose a novel classification method called proFold. We import protein tertiary structure in the period of feature extraction and employ a novel ensemble strategy in the period of classifier training. Compared with existing similar ensemble classifiers using the same widely used dataset (DD-dataset), proFold achieves 76.2% overall accuracy. Another two commonly used datasets, EDD-dataset and TG-dataset, are also tested, of which the accuracies are 93.2% and 94.3%, higher than the existing methods. ProFold is available to the public as a web-server.
Collapse
|
42
|
Kong L, Kong L, Jing R. Improving the Prediction of Protein Structural Class for Low-Similarity Sequences by Incorporating Evolutionaryand Structural Information. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS 2016. [DOI: 10.20965/jaciii.2016.p0402] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Protein structural class prediction is beneficial to study protein function, regulation and interactions. However, protein structural class prediction for low-similarity sequences (i.e., below 40% in pairwise sequence similarity) remains a challenging problem at present. In this study, a novel computational method is proposed to accurately predict protein structural class for low-similarity sequences. This method is based on support vector machine in conjunction with integrated features from evolutionary information generated with position specific iterative basic local alignment search tool (PSI-BLAST) and predicted secondary structure. Various prediction accuracies evaluated by the jackknife tests are reported on two widely-used low-similarity benchmark datasets (25PDB and 1189), reaching overall accuracies 89.3% and 87.9%, which are significantly higher than those achieved by state-of-the-art in protein structural class prediction. The experimental results suggest that our method could serve as an effective alternative to existing methods in protein structural classification, especially for low-similarity sequences.
Collapse
|
43
|
Improving protein fold recognition and structural class prediction accuracies using physicochemical properties of amino acids. J Theor Biol 2016; 402:117-28. [PMID: 27164998 DOI: 10.1016/j.jtbi.2016.05.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2016] [Revised: 04/20/2016] [Accepted: 05/02/2016] [Indexed: 11/24/2022]
Abstract
Predicting the three-dimensional (3-D) structure of a protein is an important task in the field of bioinformatics and biological sciences. However, directly predicting the 3-D structure from the primary structure is hard to achieve. Therefore, predicting the fold or structural class of a protein sequence is generally used as an intermediate step in determining the protein's 3-D structure. For protein fold recognition (PFR) and structural class prediction (SCP), two steps are required - feature extraction step and classification step. Feature extraction techniques generally utilize syntactical-based information, evolutionary-based information and physicochemical-based information to extract features. In this study, we explore the importance of utilizing the physicochemical properties of amino acids for improving PFR and SCP accuracies. For this, we propose a Forward Consecutive Search (FCS) scheme which aims to strategically select physicochemical attributes that will supplement the existing feature extraction techniques for PFR and SCP. An exhaustive search is conducted on all the existing 544 physicochemical attributes using the proposed FCS scheme and a subset of physicochemical attributes is identified. Features extracted from these selected attributes are then combined with existing syntactical-based and evolutionary-based features, to show an improvement in the recognition and prediction performance on benchmark datasets.
Collapse
|
44
|
Lyons J, Paliwal KK, Dehzangi A, Heffernan R, Tsunoda T, Sharma A. Protein fold recognition using HMM–HMM alignment and dynamic programming. J Theor Biol 2016; 393:67-74. [DOI: 10.1016/j.jtbi.2015.12.018] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2015] [Revised: 12/17/2015] [Accepted: 12/18/2015] [Indexed: 10/22/2022]
|
45
|
Yang R, Zhang C, Gao R, Zhang L. A Novel Feature Extraction Method with Feature Selection to Identify Golgi-Resident Protein Types from Imbalanced Data. Int J Mol Sci 2016; 17:218. [PMID: 26861308 PMCID: PMC4783950 DOI: 10.3390/ijms17020218] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2015] [Accepted: 01/26/2016] [Indexed: 01/08/2023] Open
Abstract
The Golgi Apparatus (GA) is a major collection and dispatch station for numerous proteins destined for secretion, plasma membranes and lysosomes. The dysfunction of GA proteins can result in neurodegenerative diseases. Therefore, accurate identification of protein subGolgi localizations may assist in drug development and understanding the mechanisms of the GA involved in various cellular processes. In this paper, a new computational method is proposed for identifying cis-Golgi proteins from trans-Golgi proteins. Based on the concept of Common Spatial Patterns (CSP), a novel feature extraction technique is developed to extract evolutionary information from protein sequences. To deal with the imbalanced benchmark dataset, the Synthetic Minority Over-sampling Technique (SMOTE) is adopted. A feature selection method called Random Forest-Recursive Feature Elimination (RF-RFE) is employed to search the optimal features from the CSP based features and g-gap dipeptide composition. Based on the optimal features, a Random Forest (RF) module is used to distinguish cis-Golgi proteins from trans-Golgi proteins. Through the jackknife cross-validation, the proposed method achieves a promising performance with a sensitivity of 0.889, a specificity of 0.880, an accuracy of 0.885, and a Matthew's Correlation Coefficient (MCC) of 0.765, which remarkably outperforms previous methods. Moreover, when tested on a common independent dataset, our method also achieves a significantly improved performance. These results highlight the promising performance of the proposed method to identify Golgi-resident protein types. Furthermore, the CSP based feature extraction method may provide guidelines for protein function predictions.
Collapse
Affiliation(s)
- Runtao Yang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Chengjin Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
- School of Mechanical, Electrical and Information Engineering, Shandong University atWeihai, Weihai 264209, China.
| | - Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| | - Lina Zhang
- School of Control Science and Engineering, Shandong University, Jinan 250061, China.
| |
Collapse
|
46
|
Saini H, Raicar G, Dehzangi A, Lal S, Sharma A. Subcellular localization for Gram positive and Gram negative bacterial proteins using linear interpolation smoothing model. J Theor Biol 2015; 386:25-33. [DOI: 10.1016/j.jtbi.2015.08.020] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2015] [Revised: 07/10/2015] [Accepted: 08/14/2015] [Indexed: 10/23/2022]
|
47
|
Lyons J, Dehzangi A, Heffernan R, Yang Y, Zhou Y, Sharma A, Paliwal K. Advancing the Accuracy of Protein Fold Recognition by Utilizing Profiles From Hidden Markov Models. IEEE Trans Nanobioscience 2015. [DOI: 10.1109/tnb.2015.2457906] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
48
|
Saini H, Raicar G, Sharma A, Lal S, Dehzangi A, Lyons J, Paliwal KK, Imoto S, Miyano S. Probabilistic expression of spatially varied amino acid dimers into general form of Chou׳s pseudo amino acid composition for protein fold recognition. J Theor Biol 2015; 380:291-8. [PMID: 26079221 DOI: 10.1016/j.jtbi.2015.05.030] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2015] [Revised: 04/28/2015] [Accepted: 05/21/2015] [Indexed: 11/15/2022]
Abstract
BACKGROUND Identification of the tertiary structure (3D structure) of a protein is a fundamental problem in biology which helps in identifying its functions. Predicting a protein׳s fold is considered to be an intermediate step for identifying the tertiary structure of a protein. Computational methods have been applied to determine a protein׳s fold by assembling information from its structural, physicochemical and/or evolutionary properties. METHODS In this study, we propose a scheme in which a feature extraction technique that extracts probabilistic expressions of amino acid dimers, which have varying degree of spatial separation in the primary sequences of proteins, from the Position Specific Scoring Matrix (PSSM). SVM classifier is used to create a model from extracted features for fold recognition. RESULTS The performance of the proposed scheme is evaluated against three benchmarked datasets, namely the Ding and Dubchak, Extended Ding and Dubchak, and Taguchi and Gromiha datasets. CONCLUSIONS The proposed scheme performed well in the experiments conducted, providing improvements over previously published results in literature.
Collapse
Affiliation(s)
| | | | - Alok Sharma
- University of the South Pacific, Fiji; Griffith University, Brisbane, Australia.
| | - Sunil Lal
- University of the South Pacific, Fiji.
| | | | | | | | - Seiya Imoto
- Human Genome Center, University of Tokyo, Japan.
| | | |
Collapse
|
49
|
Prediction of protein structural class using tri-gram probabilities of position-specific scoring matrix and recursive feature elimination. Amino Acids 2015; 47:461-8. [DOI: 10.1007/s00726-014-1878-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2014] [Accepted: 11/17/2014] [Indexed: 10/24/2022]
|
50
|
Paliwal KK, Sharma A, Lyons J, Dehzangi A. Improving protein fold recognition using the amalgamation of evolutionary-based and structural based information. BMC Bioinformatics 2014; 15 Suppl 16:S12. [PMID: 25521502 PMCID: PMC4290640 DOI: 10.1186/1471-2105-15-s16-s12] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Deciphering three dimensional structure of a protein sequence is a challenging task in biological science. Protein fold recognition and protein secondary structure prediction are transitional steps in identifying the three dimensional structure of a protein. For protein fold recognition, evolutionary-based information of amino acid sequences from the position specific scoring matrix (PSSM) has been recently applied with improved results. On the other hand, the SPINE-X predictor has been developed and applied for protein secondary structure prediction. Several reported methods for protein fold recognition have only limited accuracy. In this paper, we have developed a strategy of combining evolutionary-based information (from PSSM) and predicted secondary structure using SPINE-X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid pairs (AAP). The proposed method has been tested on several protein benchmark datasets and an improvement of 8.9% recognition accuracy has been achieved. We have achieved, for the first time over 90% and 75% prediction accuracies for sequence similarity values below 40% and 25%, respectively. We also obtain 90.6% and 77.0% prediction accuracies, respectively, for the Extended Ding and Dubchak and Taguchi and Gromiha benchmark protein fold recognition datasets widely used for in the literature.
Collapse
|