1
|
Prasad D, Sharma R, Khan MGM, Sharma A. ProtCB-bind: Protein-carbohydrate binding site prediction using an ensemble of classifiers. Carbohydr Res 2025; 552:109453. [PMID: 40086131 DOI: 10.1016/j.carres.2025.109453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2024] [Revised: 03/01/2025] [Accepted: 03/04/2025] [Indexed: 03/16/2025]
Abstract
Proteins and carbohydrates are fundamental biomolecules that play crucial roles in life processes. The interactions between these molecules are essential for various biological functions, including immune response, cell activation, and energy storage. Therefore, understanding and identifying protein-carbohydrate binding regions is of significant importance. In this study, we propose ProtCB-Bind, a computational model for predicting protein-carbohydrate interactions. ProtCB-Bind leverages an ensemble of machine learning classifiers and utilizes a common averaging approach to form predictions. The proposed model is trained using a combination of sequence-based and evolutionary-based features of protein sequences, as well as the physicochemical properties of amino acids. To enhance predictive performance, ProtCB-Bind incorporates features derived from recent advancements in transformer-based Natural Language Processing (NLP) for proteins. ProtCB-Bind was designed by systematically identifying the best combination of classifiers and features, and was evaluated using a state-of-the-art benchmark dataset. Its performance was compared against established predictors, including SPRINT-CBH, StackCB-Pred, and StackCB-Embed. ProtCB-Bind outperformed these state-of-the-art predictors, achieving an approximate 3 % improvement in overall performance on benchmark dataset. The sources code for ProtCB-Bind is available at https://github.com/Divnesh/ProtCB-Bind.
Collapse
Affiliation(s)
- Divnesh Prasad
- School of Information Technology, Engineering, Mathematics and Physics, The University of the South Pacific, Suva, Fiji
| | - Ronesh Sharma
- School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji.
| | - M G M Khan
- School of Information Technology, Engineering, Mathematics and Physics, The University of the South Pacific, Suva, Fiji
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane, QLD, 4111, Australia; College of Informatics, Korea University, Seoul, South Korea
| |
Collapse
|
2
|
Ali F, Kumar H, Patil S, Kotecha K, Banjar A, Daud A. Target-DBPPred: An intelligent model for prediction of DNA-binding proteins using discrete wavelet transform based compression and light eXtreme gradient boosting. Comput Biol Med 2022; 145:105533. [PMID: 35447463 DOI: 10.1016/j.compbiomed.2022.105533] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2022] [Revised: 04/11/2022] [Accepted: 04/13/2022] [Indexed: 11/03/2022]
Abstract
DNA-protein interaction is a critical biological process that performs influential activities, including DNA transcription and recombination. DBPs (DNA-binding proteins) are closely associated with different kinds of human diseases (asthma, cancer, and AIDS), while some of the DBPs are used in the production of antibiotics, steroids, and anti-inflammatories. Several methods have been reported for the prediction of DBPs. However, a more intelligent method is still highly desirable for the accurate prediction of DBPs. This study presents an intelligent computational method, Target-DBPPred, to improve DBPs prediction. Important features from primary protein sequences are investigated via a novel feature descriptor, called EDF-PSSM-DWT (Evolutionary difference formula position-specific scoring matrix-discrete wavelet transform) and several other multi-evolutionary methods, including F-PSSM (Filtered position-specific scoring matrix), EDF-PSSM (Evolutionary difference formula position-specific scoring matrix), PSSM-DPC (Position-specific scoring matrix-dipeptide composition), and Lead-BiPSSM (Lead-bigram-position specific scoring matrix) to encapsulate diverse multivariate features. The best feature set from the features of each descriptor is selected using sequential forward selection (SFS). Further, four models are trained using Adaboost, XGB (eXtreme gradient boosting), ERT (extremely randomized trees), and LiXGB (Light eXtreme gradient boosting) classifiers. LiXGB, with the best feature set of EDF-PSSM-DWT, has attained 6.69% and 15.07% higher performance in terms of accuracies using training and testing datasets, respectively. The obtained results verify the improved performance of our proposed predictor over the existing predictors.
Collapse
Affiliation(s)
- Farman Ali
- Department of Elementary and Secondary Education, Peshawar, Khyber Pakhtunkhwa, Pakistan; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Harish Kumar
- Department of Computer Science, College of Computer Science, King Khalid University, Abha, Saudi Arabia
| | - Shruti Patil
- Symbiosis Institute of Technology, Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International University, Pune, India
| | - Ketan Kotecha
- Symbiosis Institute of Technology, Symbiosis Centre for Applied Artificial Intelligence, Symbiosis International University, Pune, India.
| | - Ameen Banjar
- Department of Information Systems, College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
| | - Ali Daud
- Key Laboratory of Oceanographic Big Data Mining & Application of Zhejiang Province, School of Information Engineering, Zhejiang Ocean University, Zhoushan, 316022, China; Department of Computer Science and Artificial Intelligence, University of Jeddah, Jeddah, Saudi Arabia.
| |
Collapse
|
3
|
Long short term memory based functional characterization model for unknown protein sequences using ensemble of shallow and deep features. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-06674-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
4
|
Agrawal S, Sisodia DS, Nagwani NK. Augmented sequence features and subcellular localization for functional characterization of unknown protein sequences. Med Biol Eng Comput 2021; 59:2297-2310. [PMID: 34545514 DOI: 10.1007/s11517-021-02436-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2020] [Accepted: 08/29/2021] [Indexed: 11/24/2022]
Abstract
Advances in high-throughput techniques lead to evolving a large number of unknown protein sequences (UPS). Functional characterization of UPS is significant for the investigation of disease symptoms and drug repositioning. Protein subcellular localization is imperative for the functional characterization of protein sequences. Diverse techniques are used on protein sequences for feature extraction. However, many times a single feature extraction technique leads to poor prediction performance. In this paper, two feature augmentations are described through sequence induced, physicochemical, and evolutionary information of the amino acid residues. While augmented features preserve the sequence-order-information and protein-residue-properties. Two bacterial protein datasets Gram-Positive (G +) and Gram-Negative (G-) are utilized for the experimental work. After performing essential preprocessing on protein datasets, two sets of feature vectors are obtained. These feature vectors are used separately to train the different individual and ensembles such as decision tree (C 4.5), k-nearest neighbor (k-NN), multi-layer perceptron (MLP), Naïve Bayes (NB), support vector machine (SVM), AdaBoost, gradient boosting machine (GBM), and random forest (RF) with fivefold cross-validation. Prediction results of the model demonstrate that overall accuracy reported by C4.5 is highest 99.57% on G + and 97.47% on G- datasets with known protein sequences. Similarly, for the UPS overall accuracy of G + is 85.17% with SVM and 82.45% with G- dataset using MLP.
Collapse
Affiliation(s)
- Saurabh Agrawal
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India.
| | - Dilip Singh Sisodia
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
| | - Naresh Kumar Nagwani
- Department of Computer Science & Engineering, National Institute of Technology Raipur, GE Road, Raipur, Chhattisgarh, 492010, India
| |
Collapse
|
5
|
Chen Y, Fu X, Li Z, Peng L, Zhuo L. Prediction of lncRNA-Protein Interactions via the Multiple Information Integration. Front Bioeng Biotechnol 2021; 9:647113. [PMID: 33718346 PMCID: PMC7947871 DOI: 10.3389/fbioe.2021.647113] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2020] [Accepted: 01/19/2021] [Indexed: 01/09/2023] Open
Abstract
The long non-coding RNA (lncRNA)-protein interaction plays an important role in the post-transcriptional gene regulation, such as RNA splicing, translation, signaling, and the development of complex diseases. The related research on the prediction of lncRNA-protein interaction relationship is beneficial in the excavation and the discovery of the mechanism of lncRNA function and action occurrence, which are important. Traditional experimental methods for detecting lncRNA-protein interactions are expensive and time-consuming. Therefore, computational methods provide many effective strategies to deal with this problem. In recent years, most computational methods only use the information of the lncRNA-lncRNA or the protein-protein similarity and cannot fully capture all features to identify their interactions. In this paper, we propose a novel computational model for the lncRNA-protein prediction on the basis of machine learning methods. First, a feature method is proposed for representing the information of the network topological properties of lncRNA and protein interactions. The basic composition feature information and evolutionary information based on protein, the lncRNA sequence feature information, and the lncRNA expression profile information are extracted. Finally, the above feature information is fused, and the optimized feature vector is used with the recursive feature elimination algorithm. The optimized feature vectors are input to the support vector machine (SVM) model. Experimental results show that the proposed method has good effectiveness and accuracy in the lncRNA-protein interaction prediction.
Collapse
Affiliation(s)
- Yifan Chen
- College of Information Science and Engineering, Hunan University, Changsha, China
- School of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
| | - Xiangzheng Fu
- College of Information Science and Engineering, Hunan University, Changsha, China
| | - Zejun Li
- School of Computer and Information Science, Hunan Institute of Technology, Hengyang, China
| | - Li Peng
- College of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, China
| | - Linlin Zhuo
- Department of Mathematics and Information Engineering, Wenzhou University Oujiang College, Wenzhou, China
| |
Collapse
|
6
|
Zhang Y, Yu S, Xie R, Li J, Leier A, Marquez-Lago TT, Akutsu T, Smith AI, Ge Z, Wang J, Lithgow T, Song J. PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins. Bioinformatics 2020; 36:704-712. [PMID: 31393553 DOI: 10.1093/bioinformatics/btz629] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2019] [Revised: 07/17/2019] [Accepted: 08/07/2019] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, the identification of features that select a protein for efficient secretion from these microorganisms has become an important task. Among all the secreted proteins, 'non-classical' secreted proteins are difficult to identify as they lack discernable signal peptide sequences and can make use of diverse secretion pathways. Currently, several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often employ basic features to train the models in a simple and coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data. RESULTS In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. At this stage, LightGBM, a gradient boosting machine, was used as a machine learning approach and the necessary parameter optimization was performed by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. Consequently, the final ensemble model achieved a superior performance with an accuracy of 0.900, an F-value of 0.903, Matthew's correlation coefficient of 0.803 and an area under the curve value of 0.963, and outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors. AVAILABILITY AND IMPLEMENTATION http://pengaroo.erc.monash.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yanju Zhang
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Sha Yu
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.,Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
| | - Ruopeng Xie
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.,Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia
| | - Jiahui Li
- Bioinformatics Group, School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.,Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
| | - André Leier
- Department of Genetics, AL, USA.,Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatiana T Marquez-Lago
- Department of Genetics, AL, USA.,Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - A Ian Smith
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.,ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
| | - Zongyuan Ge
- Monash e-Research Centre and Faculty of Engineering, Monash University, Melbourne, VIC 3800, Australia
| | - Jiawei Wang
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
| | - Trevor Lithgow
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
| | - Jiangning Song
- Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, VIC 3800, Australia.,ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia
| |
Collapse
|
7
|
Sharma R, Kumar S, Tsunoda T, Kumarevel T, Sharma A. Single-stranded and double-stranded DNA-binding protein prediction using HMM profiles. Anal Biochem 2020; 612:113954. [PMID: 32946833 DOI: 10.1016/j.ab.2020.113954] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2020] [Revised: 08/26/2020] [Accepted: 09/10/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND DNA-binding proteins perform important roles in cellular processes and are involved in many biological activities. These proteins include crucial protein-DNA binding domains and can interact with single-stranded or double-stranded DNA, and accordingly classified as single-stranded DNA-binding proteins (SSBs) or double-stranded DNA-binding proteins (DSBs). Computational prediction of SSBs and DSBs helps in annotating protein functions and understanding of protein-binding domains. RESULTS Performance is reported using the DNA-binding protein dataset that was recently introduced by Wang et al., [1]. The proposed method achieved a sensitivity of 0.600, specificity of 0.792, AUC of 0.758, MCC of 0.369, accuracy of 0.744, and F-measure of 0.536, on the independent test set. CONCLUSION The proposed method with the hidden Markov model (HMM) profiles for feature extraction, outperformed the benchmark method in the literature and achieved an overall improvement of approximately 3%. The source code and supplementary information of the proposed method is available at https://github.com/roneshsharma/Predict-DNA-binding-proteins/wiki.
Collapse
Affiliation(s)
- Ronesh Sharma
- School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji.
| | - Shiu Kumar
- School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji.
| | - Tatsuhiko Tsunoda
- Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan; Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan; Laboratory of Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, University of Tokyo, Tokyo, 113-0033, Japan.
| | - Thirumananseri Kumarevel
- Laboratory for Transcription Structural Biology, RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan.
| | - Alok Sharma
- Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan; Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan; School of Engineering and Physics, The University of the South Pacific, Suva, Fiji; Institute for Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane, QLD, Australia.
| |
Collapse
|
8
|
|
9
|
Fu X, Cai L, Zeng X, Zou Q. StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics 2020; 36:3028-3034. [DOI: 10.1093/bioinformatics/btaa131] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2019] [Revised: 02/19/2020] [Accepted: 02/25/2020] [Indexed: 12/19/2022] Open
Abstract
Abstract
Motivation
Cell-penetrating peptides (CPPs) are a vehicle for transporting into living cells pharmacologically active molecules, such as short interfering RNAs, nanoparticles, plasmid DNAs and small peptides, thus offering great potential as future therapeutics. Existing experimental techniques for identifying CPPs are time-consuming and expensive. Thus, the prediction of CPPs from peptide sequences by using computational methods can be useful to annotate and guide the experimental process quickly. Many machine learning-based methods have recently emerged for identifying CPPs. Although considerable progress has been made, existing methods still have low feature representation capabilities, thereby limiting further performance improvements.
Results
We propose a method called StackCPPred, which proposes three feature methods on the basis of the pairwise energy content of the residue as follows: RECM-composition, PseRECM and RECM–DWT. These features are used to train stacking-based machine learning methods to effectively predict CPPs. On the basis of the CPP924 and CPPsite3 datasets with jackknife validation, StackDPPred achieved 94.5% and 78.3% accuracy, which was 2.9% and 5.8% higher than the state-of-the-art CPP predictors, respectively. StackCPPred can be a powerful tool for predicting CPPs and their uptake efficiency, facilitating hypothesis-driven experimental design and accelerating their applications in clinical therapy.
Availability and implementation
Source code and data can be downloaded from https://github.com/Excelsior511/StackCPPred.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Lijun Cai
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
10
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
11
|
Shao YT, Liu XX, Lu Z, Chou KC. pLoc_Deep-mHum: Predict Subcellular Localization of Human Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.127042] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
12
|
Shao Y, Chou KC. pLoc_Deep-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by Deep Learning. ACTA ACUST UNITED AC 2020. [DOI: 10.4236/ns.2020.126034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
13
|
Chandra A, Sharma A, Dehzangi A, Shigemizu D, Tsunoda T. Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix. BMC Mol Cell Biol 2019; 20:57. [PMID: 31856704 PMCID: PMC6923822 DOI: 10.1186/s12860-019-0240-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2019] [Accepted: 11/20/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND The biological process known as post-translational modification (PTM) is a condition whereby proteomes are modified that affects normal cell biology, and hence the pathogenesis. A number of PTMs have been discovered in the recent years and lysine phosphoglycerylation is one of the fairly recent developments. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to factors such as cost, time consumption and inefficiency involved in the experimental efforts. To overcome this issue, computational techniques have emerged to accurately identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far hold limitations to correctly predict this covalent modification. RESULTS We propose a new predictor in this paper called Bigram-PGK which uses evolutionary information of amino acids to try and predict phosphoglycerylated sites. The benchmark dataset which contains experimentally labelled sites is employed for this purpose and profile bigram occurrences is calculated from position specific scoring matrices of amino acids in the protein sequences. The statistical measures of this work, such as sensitivity, specificity, precision, accuracy, Mathews correlation coefficient and area under ROC curve have been reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively. CONCLUSIONS The proposed predictor, based on the feature of evolutionary information and support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against the existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK.
Collapse
Affiliation(s)
- Abel Chandra
- School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji.
| | - Alok Sharma
- School of Engineering and Physics, Faculty of Science Technology and Environment, University of the South Pacific, Suva, Fiji. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, 4111, Australia. .,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan. .,CREST, JST, Tokyo, 102-8666, Japan.
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, USA
| | - Daichi Shigemizu
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan.,CREST, JST, Tokyo, 102-8666, Japan.,Medical Genome Center, National Center for Geriatrics and Gerontology, Obu, Aichi, 474-8511, Japan
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo, 113-8510, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, 230-0045, Japan.,CREST, JST, Tokyo, 102-8666, Japan.,Laboratory for Medical Science Mathematics, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, 108-8639, Japan
| |
Collapse
|
14
|
pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. Genomics 2019; 111:1274-1282. [DOI: 10.1016/j.ygeno.2018.08.007] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2018] [Revised: 08/14/2018] [Accepted: 08/16/2018] [Indexed: 12/17/2022]
|
15
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
16
|
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most
of the functions critical to the cell’s survival are performed by these proteins located in its different
organelles, usually called ‘‘subcellular locations”. Information of subcellular localization
for a protein can provide useful clues about its function. To reveal the intricate pathways at the
cellular level, knowledge of the subcellular localization of proteins in a cell is prerequisite.
Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine
the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing
and selecting the right targets for drug development. Unfortunately, it is both timeconsuming
and costly to determine the subcellular locations of proteins purely based on experiments.
With the avalanche of protein sequences generated in the post-genomic age, it is highly
desired to develop computational methods for rapidly and effectively identifying the subcellular
locations of uncharacterized proteins based on their sequences information alone. Actually,
considerable progresses have been achieved in this regard. This review is focused on those
methods, which have the capacity to deal with multi-label proteins that may simultaneously
exist in two or more subcellular location sites. Protein molecules with this kind of characteristic
are vitally important for finding multi-target drugs, a current hot trend in drug development.
Focused in this review are also those methods that have use-friendly web-servers established so
that the majority of experimental scientists can use them to get the desired results without the
need to go through the detailed mathematics involved.
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
17
|
Arif M, Ali F, Ahmad S, Kabir M, Ali Z, Hayat M. Pred-BVP-Unb: Fast prediction of bacteriophage Virion proteins using un-biased multi-perspective properties with recursive feature elimination. Genomics 2019; 112:1565-1574. [PMID: 31526842 DOI: 10.1016/j.ygeno.2019.09.006] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2019] [Revised: 08/27/2019] [Accepted: 09/11/2019] [Indexed: 10/26/2022]
Abstract
Bacteriophage virion proteins (BVPs) are bacterial viruses that have a great impact on different biological functions of bacteria. They are significantly used in genetic engineering and phage therapy applications. Correct identification of BVP through conventional pathogen methods are slow and expensive. Thus, designing a Bioinformatics predictor is urgently desirable to accelerate correct identification of BVPs within a huge volume of proteins. However, available prediction tools performance is inadequate due to the lack of useful feature representation and severe imbalance issue. In the present study, we propose an intelligent model, called Pred-BVP-Unb for discrimination of BVPs that employed three nominal sequences-driven descriptors, i.e. Bi-PSSM evolutionary information, composition & translation, and split amino acid composition. The imbalance phenomena between classes were coped with the help of a synthetic minority oversampling technique. The essential attributes are selected by a robust algorithm called recursive feature elimination. Finally, the optimal feature space is provided to support vector machine classifier using a radial base kernel in order to train the model. Our predictor remarkably outperforms than existing approaches in the literature by achieving the highest accuracy of 92.54% and 83.06% respectively on the benchmark and independent datasets. We expect that Pred-BVP-Unb tool can provide useful hints for designing antibacterial drugs and also helpful to expedite large scale discovery of new bacteriophage virion proteins. The source code and all datasets are publicly available at https://github.com/Muhammad-Arif-NUST/BVP_Pred_Unb.
Collapse
Affiliation(s)
- Muhammad Arif
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan.
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.
| | - Saeed Ahmad
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Muhammad Kabir
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Zakir Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, KP, Pakistan.
| |
Collapse
|
18
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
19
|
|
20
|
Xiao X, Cheng X, Chen G, Mao Q, Chou KC. pLoc_bal-mVirus: Predict Subcellular Localization of Multi-Label Virus Proteins by Chou's General PseAAC and IHTS Treatment to Balance Training Dataset. Med Chem 2019; 15:496-509. [DOI: 10.2174/1573406415666181217114710] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/17/2022]
Abstract
Background/Objective:Knowledge of protein subcellular localization is vitally important for both basic research and drug development. Facing the avalanche of protein sequences emerging in the post-genomic age, it is urgent to develop computational tools for timely and effectively identifying their subcellular localization based on the sequence information alone. Recently, a predictor called “pLoc-mVirus” was developed for identifying the subcellular localization of virus proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, known as “multiplex proteins”, may simultaneously occur in, or move between two or more subcellular location sites. Despite the fact that it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mVirus was trained by an extremely skewed dataset in which some subset was over 10 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset.Methods:Using the Chou's general PseAAC (Pseudo Amino Acid Composition) approach and the IHTS (Inserting Hypothetical Training Samples) treatment to balance out the training dataset, we have developed a new predictor called “pLoc_bal-mVirus” for predicting the subcellular localization of multi-label virus proteins.Results:Cross-validation tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mVirus, the existing state-of-theart predictor for the same purpose.Conclusion:Its user-friendly web-server is available at http://www.jci-bioinfo.cn/pLoc_balmVirus/, by which the majority of experimental scientists can easily get their desired results without the need to go through the detailed complicated mathematics. Accordingly, pLoc_bal-mVirus will become a very useful tool for designing multi-target drugs and in-depth understanding of the biological process in a cell.
Collapse
Affiliation(s)
- Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Genqiang Chen
- College of Chemistry, Chemical Engineering and Biotechnology, Donghua University, Shanghai 201620, China
| | - Qi Mao
- College of Information Science and Technology, Donghua University, Shanghai, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
21
|
Chou KC, Cheng X, Xiao X. pLoc_bal-mEuk: Predict Subcellular Localization of Eukaryotic Proteins by General PseAAC and Quasi-balancing Training Dataset. Med Chem 2019; 15:472-485. [DOI: 10.2174/1573406415666181218102517] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2018] [Revised: 10/23/2018] [Accepted: 12/12/2018] [Indexed: 12/24/2022]
Abstract
<P>Background/Objective: Information of protein subcellular localization is crucially important for both basic research and drug development. With the explosive growth of protein sequences discovered in the post-genomic age, it is highly demanded to develop powerful bioinformatics tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called “pLoc-mEuk” was developed for identifying the subcellular localization of eukaryotic proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems where many proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mEuk was trained by an extremely skewed dataset where some subset was about 200 times the size of the other subsets. Accordingly, it cannot avoid the biased consequence caused by such an uneven training dataset. </P><P> Methods: To alleviate such bias, we have developed a new predictor called pLoc_bal-mEuk by quasi-balancing the training dataset. Cross-validation tests on exactly the same experimentconfirmed dataset have indicated that the proposed new predictor is remarkably superior to pLocmEuk, the existing state-of-the-art predictor in identifying the subcellular localization of eukaryotic proteins. It has not escaped our notice that the quasi-balancing treatment can also be used to deal with many other biological systems. </P><P> Results: To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mEuk/. </P><P> Conclusion: It is anticipated that the pLoc_bal-Euk predictor holds very high potential to become a useful high throughput tool in identifying the subcellular localization of eukaryotic proteins, particularly for finding multi-target drugs that is currently a very hot trend trend in drug development.</P>
Collapse
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xiang Cheng
- Gordon Life Science Institute, Boston, MA 02478, United States
| | - Xuan Xiao
- Gordon Life Science Institute, Boston, MA 02478, United States
| |
Collapse
|
22
|
Sharma R, Raicar G, Tsunoda T, Patil A, Sharma A. OPAL: prediction of MoRF regions in intrinsically disordered protein sequences. Bioinformatics 2019; 34:1850-1858. [PMID: 29360926 DOI: 10.1093/bioinformatics/bty032] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2017] [Accepted: 01/17/2018] [Indexed: 12/15/2022] Open
Abstract
Motivation Intrinsically disordered proteins lack stable 3-dimensional structure and play a crucial role in performing various biological functions. Key to their biological function are the molecular recognition features (MoRFs) located within long disordered regions. Computationally identifying these MoRFs from disordered protein sequences is a challenging task. In this study, we present a new MoRF predictor, OPAL, to identify MoRFs in disordered protein sequences. OPAL utilizes two independent sources of information computed using different component predictors. The scores are processed and combined using common averaging method. The first score is computed using a component MoRF predictor which utilizes composition and sequence similarity of MoRF and non-MoRF regions to detect MoRFs. The second score is calculated using half-sphere exposure (HSE), solvent accessible surface area (ASA) and backbone angle information of the disordered protein sequence, using information from the amino acid properties of flanks surrounding the MoRFs to distinguish MoRF and non-MoRF residues. Results OPAL is evaluated using test sets that were previously used to evaluate MoRF predictors, MoRFpred, MoRFchibi and MoRFchibi-web. The results demonstrate that OPAL outperforms all the available MoRF predictors and is the most accurate predictor available for MoRF prediction. It is available at http://www.alok-ai-lab.com/tools/opal/. Contact ashwini@hgc.jp or alok.sharma@griffith.edu.au. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ronesh Sharma
- School of Engineering and Physics, The University of the South Pacific, Suva, Fiji.,School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji
| | - Gaurav Raicar
- School of Engineering and Physics, The University of the South Pacific, Suva, Fiji
| | - Tatsuhiko Tsunoda
- Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan.,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo 113-8510, Japan.,CREST, JST, Tokyo 113-8510, Japan
| | - Ashwini Patil
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan
| | - Alok Sharma
- School of Engineering and Physics, The University of the South Pacific, Suva, Fiji.,Laboratory of Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan.,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University (TMDU), Tokyo 113-8510, Japan.,CREST, JST, Tokyo 113-8510, Japan.,Institute for Integrated and Intelligent Systems, Griffith University, Nathan, Brisbane, QLD, Australia
| |
Collapse
|
23
|
Ali F, Ahmed S, Swati ZNK, Akbar S. DP-BINDER: machine learning model for prediction of DNA-binding proteins by fusing evolutionary and physicochemical information. J Comput Aided Mol Des 2019; 33:645-658. [PMID: 31123959 DOI: 10.1007/s10822-019-00207-x] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2019] [Accepted: 05/18/2019] [Indexed: 12/28/2022]
Abstract
DNA-binding proteins (DBPs) participate in various biological processes including DNA replication, recombination, and repair. In the human genome, about 6-7% of these proteins are utilized for genes encoding. DBPs shape the DNA into a compact structure known chromatin while some of these proteins regulate the chromosome packaging and transcription process. In the pharmaceutical industry, DBPs are used as a key component of antibiotics, steroids, and cancer drugs. These proteins also involve in biophysical, biological, and biochemical studies of DNA. Due to the crucial role in various biological activities, identification of DBPs is a hot issue in protein science. A series of experimental and computational methods have been proposed, however, some methods didn't achieve the desired results while some are inadequate in its accuracy and authenticity. Still, it is highly desired to present more intelligent computational predictors. In this work, we introduce an innovative computational method namely DP-BINDER based on physicochemical and evolutionary information. We captured local highly decisive features from physicochemical properties of primary protein sequences via normalized Moreau-Broto autocorrelation (NMBAC) and evolutionary information by position specific scoring matrix-transition probability composition (PSSM-TPC) and pseudo-position specific scoring matrix (PsePSSM) using training and independent datasets. The optimal features were selected by the support vector machine-recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) from fused features and were fed into random forest (RF) and support vector machine (SVM). Our method attained 92.46% and 89.58% accuracy with jackknife and ten-fold cross-validation, respectively on the training dataset, while 81.17% accuracy on the independent dataset for prediction of DBPs. These results demonstrate that our method attained the highest success rate in the literature. The superiority of DP-BINDER over existing approaches due to several reasons including abstraction of local dominant features via effective feature descriptors, utilization of appropriate feature selection algorithms and effective classifier.
Collapse
Affiliation(s)
- Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | - Saeed Ahmed
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
| | - Zar Nawab Khan Swati
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Department of Computer Science, Karakoram International University, Gilgit, Gilgit-Baltistan, 15100, Pakistan
| | - Shahid Akbar
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
| |
Collapse
|
24
|
Chandra AA, Sharma A, Dehzangi A, Tsunoda T. EvolStruct-Phogly: incorporating structural properties and evolutionary information from profile bigrams for the phosphoglycerylation prediction. BMC Genomics 2019; 19:984. [PMID: 30999859 PMCID: PMC7402405 DOI: 10.1186/s12864-018-5383-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2018] [Accepted: 12/17/2018] [Indexed: 01/21/2023] Open
Abstract
Background Post-translational modification (PTM), which is a biological process, tends to modify proteome that leads to changes in normal cell biology and pathogenesis. In the recent times, there has been many reported PTMs. Out of the many modifications, phosphoglycerylation has become particularly the subject of interest. The experimental procedure for identification of phosphoglycerylated residues continues to be an expensive, inefficient and time-consuming effort, even with a large number of proteins that are sequenced in the post-genomic period. Computational methods are therefore being anticipated in order to effectively predict phosphoglycerylated lysines. Even though there are predictors available, the ability to detect phosphoglycerylated lysine residues still remains inadequate. Results We have introduced a new predictor in this paper named EvolStruct-Phogly that uses structural and evolutionary information relating to amino acids to predict phosphoglycerylated lysine residues. Benchmarked data is employed containing experimentally identified phosphoglycerylated and non-phosphoglycerylated lysines. We have then extracted the three structural information which are accessible surface area of amino acids, backbone torsion angles, amino acid’s local structure conformations and profile bigrams of position-specific scoring matrices. Conclusion EvolStruct-Phogly showed a noteworthy improvement in regards to the performance when compared with the previous predictors. The performance metrics obtained are as follows: sensitivity 0.7744, specificity 0.8533, precision 0.7368, accuracy 0.8275, and Mathews correlation coefficient of 0.6242. The software package and data of this work can be obtained from https://github.com/abelavit/EvolStruct-Phogly or www.alok-ai-lab.com
Collapse
Affiliation(s)
| | - Alok Sharma
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia. .,CREST, JST, Tokyo, Japan.
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, USA
| | - Tatushiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.,CREST, JST, Tokyo, Japan.,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
| |
Collapse
|
25
|
Sharma A, Lysenko A, López Y, Dehzangi A, Sharma R, Reddy H, Sattar A, Tsunoda T. HseSUMO: Sumoylation site prediction using half-sphere exposures of amino acids residues. BMC Genomics 2019; 19:982. [PMID: 30999862 PMCID: PMC7402407 DOI: 10.1186/s12864-018-5206-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2018] [Accepted: 10/28/2018] [Indexed: 02/06/2023] Open
Abstract
Background Post-translational modifications are viewed as an important mechanism for controlling protein function and are believed to be involved in multiple important diseases. However, their profiling using laboratory-based techniques remain challenging. Therefore, making the development of accurate computational methods to predict post-translational modifications is particularly important for making progress in this area of research. Results This work explores the use of four half-sphere exposure-based features for computational prediction of sumoylation sites. Unlike most of the previously proposed approaches, which focused on patterns of amino acid co-occurrence, we were able to demonstrate that protein structural based features could be sufficiently informative to achieve good predictive performance. The evaluation of our method has demonstrated high sensitivity (0.9), accuracy (0.89) and Matthew’s correlation coefficient (0.78–0.79). We have compared these results to the recently released pSumo-CD method and were able to demonstrate better performance of our method on the same evaluation dataset. Conclusions The proposed predictor HseSUMO uses half-sphere exposures of amino acids to predict sumoylation sites. It has shown promising results on a benchmark dataset when compared with the state-of-the-art method. The extracted data of this study can be accessed at https://github.com/YosvanyLopez/HseSUMO. Electronic supplementary material The online version of this article (10.1186/s12864-018-5206-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Q, Brisbane, LD-4111, Australia. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan. .,School of Engineering and Physics, Faculty of Science, Technology and Environment, University of the South Pacific, Suva, Fiji Islands.
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| | - Yosvany López
- Genesis Institute of Genetic Research, Genesis Healthcare Co, Tokyo, Japan
| | - Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD, USA
| | - Ronesh Sharma
- School of Engineering and Physics, Faculty of Science, Technology and Environment, University of the South Pacific, Suva, Fiji Islands.,School of Electrical and Electronics Engineering, Fiji National University, Suva, Fiji
| | - Hamendra Reddy
- School of Engineering and Physics, Faculty of Science, Technology and Environment, University of the South Pacific, Suva, Fiji Islands
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Q, Brisbane, LD-4111, Australia
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan. .,Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan. .,CREST, JST, Tokyo, 113-8510, Japan.
| |
Collapse
|
26
|
Characterization of human proteins with different subcellular localizations by topological and biological properties. Genomics 2018; 111:1831-1838. [PMID: 30543849 DOI: 10.1016/j.ygeno.2018.12.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2018] [Revised: 12/02/2018] [Accepted: 12/07/2018] [Indexed: 11/20/2022]
Abstract
Knowing the protein localization can provide valuable information resource for elucidating protein function. In recent years, with the advances of human genomics and proteomics, it is possible to characterize human proteins that are located in different subcellular localizations. In this study, we used the topological properties and biological properties to characterize human proteins with six subcellular localizations. Almost all of these properties were found to be significantly different among six protein categories. Network topology analysis indicated that several significant topological properties, including the degree and k-core, were higher for the mitochondrial proteins. Biological property analysis showed that the nuclear proteins appeared to be correlated with important biological function. We hope these findings may provide some important help for comprehensive understanding the biological function of proteins, and prediction of protein subcellular localizations in human.
Collapse
|
27
|
Dehzangi A, López Y, Taherzadeh G, Sharma A, Tsunoda T. SumSec: Accurate Prediction of Sumoylation Sites Using Predicted Secondary Structure. Molecules 2018; 23:E3260. [PMID: 30544729 PMCID: PMC6320791 DOI: 10.3390/molecules23123260] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2018] [Revised: 11/30/2018] [Accepted: 12/05/2018] [Indexed: 12/13/2022] Open
Abstract
Post Translational Modification (PTM) is defined as the modification of amino acids along the protein sequences after the translation process. These modifications significantly impact on the functioning of proteins. Therefore, having a comprehensive understanding of the underlying mechanism of PTMs turns out to be critical in studying the biological roles of proteins. Among a wide range of PTMs, sumoylation is one of the most important modifications due to its known cellular functions which include transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites via experimental methods is time-consuming and costly. This has led to a great demand for the development of fast computational methods able to accurately determine sumoylation sites in proteins. In this study, we present a new machine learning-based method for predicting sumoylation sites called SumSec. To do this, we employed the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence which has never been used for this task. As a result, our proposed method is able to enhance the sumoylation site prediction task, outperforming previously proposed methods in the literature. SumSec demonstrated high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than those reported in previous studies. The script and extracted features are publicly available at: https://github.com/YosvanyLopez/SumSec.
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, MD 21251, USA.
| | - Yosvany López
- Genesis Institute of Genetic Research, Genesis Healthcare Co., Tokyo 150-6015, Japan.
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Gold Coast 4222, Australia.
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane 4111, Australia.
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.
- CREST, JST, Tokyo 102-0076, Japan.
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan.
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa 230-0045, Japan.
- CREST, JST, Tokyo 102-0076, Japan.
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo 113-8510, Japan.
| |
Collapse
|
28
|
Cheng X, Xiao X, Chou KC. pLoc_bal-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by quasi-balancing training dataset and general PseAAC. J Theor Biol 2018; 458:92-102. [DOI: 10.1016/j.jtbi.2018.09.005] [Citation(s) in RCA: 65] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2018] [Revised: 09/05/2018] [Accepted: 09/07/2018] [Indexed: 01/03/2023]
|
29
|
Ju Z, Wang SY. Prediction of S-sulfenylation sites using mRMR feature selection and fuzzy support vector machine algorithm. J Theor Biol 2018; 457:6-13. [DOI: 10.1016/j.jtbi.2018.08.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2018] [Revised: 08/07/2018] [Accepted: 08/15/2018] [Indexed: 11/29/2022]
|
30
|
Zhang Q, Wang S, Pan Y, Su D, Lu Q, Zuo Y, Yang L. Characterization of proteins in different subcellular localizations for Escherichia coli K12. Genomics 2018; 111:1134-1141. [PMID: 30026105 DOI: 10.1016/j.ygeno.2018.07.008] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2018] [Revised: 07/07/2018] [Accepted: 07/11/2018] [Indexed: 10/28/2022]
Abstract
Knowing the comprehensive knowledge about the protein subcellular localization is an important step to understand the function of the proteins. Recent advances in system biology have allowed us to develop more accurate methods for characterizing the proteins at subcellular localization level. In this study, the analysis method was developed to characterize the topological properties and biological properties of the cytoplasmic proteins, inner membrane proteins, outer membrane proteins and periplasmic proteins in Escherichia coli (E. coli). Statistical significant differences were found in all topological properties and biological properties among proteins in different subcellular localizations. In addition, investigation was carried out to analyze the differences in 20 amino acid compositions for four protein categories. We also found that there were significant differences in all of the 20 amino acid compositions. These findings may be helpful for understanding the comprehensive relationship between protein subcellular localization and biological function.
Collapse
Affiliation(s)
- Qi Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yi Pan
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Dongqing Su
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Qianzi Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China
| | - Yongchun Zuo
- The State key Laboratory of Reproductive Regulation, Breeding of Grassland Livestock, College of Life Sciences, Inner Mongolia University, Hohhot 010070, China.
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, China.
| |
Collapse
|
31
|
Cheng X, Lin WZ, Xiao X, Chou KC. pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC. Bioinformatics 2018; 35:398-406. [DOI: 10.1093/bioinformatics/bty628] [Citation(s) in RCA: 79] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2018] [Accepted: 07/11/2018] [Indexed: 12/25/2022] Open
Affiliation(s)
- Xiang Cheng
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Wei-Zhong Lin
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
| | - Xuan Xiao
- Computer Science, Jingdezhen Ceramic Institute, Jingdezhen, China
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
| | - Kuo-Chen Chou
- Computational Biology, Gordon Life Science Institute, Boston, MA, USA
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
32
|
pLoc_bal-mGpos: Predict subcellular localization of Gram-positive bacterial proteins by quasi-balancing training dataset and PseAAC. Genomics 2018; 111:886-892. [PMID: 29842950 DOI: 10.1016/j.ygeno.2018.05.017] [Citation(s) in RCA: 79] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2018] [Revised: 05/14/2018] [Accepted: 05/18/2018] [Indexed: 12/12/2022]
Abstract
Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called "pLoc-mGpos" was developed for identifying the subcellular localization of Gram-positive bacterial proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called "multiplex proteins", may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mGpos was trained by an extremely skewed dataset in which some subset (subcellular location) was over 11 times the size of the other subsets. Accordingly, it cannot avoid the bias consequence caused by such an uneven training dataset. To alleviate such bias consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mGpos by quasi-balancing the training dataset. Rigorous target jackknife tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mGpos, the existing state-of-the-art predictor in identifying the subcellular localization of Gram-positive bacterial proteins. To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGpos/, by which users can easily get their desired results without the need to go through the detailed mathematics.
Collapse
|
33
|
Yang H, Qiu WR, Liu G, Guo FB, Chen W, Chou KC, Lin H. iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. Int J Biol Sci 2018; 14:883-891. [PMID: 29989083 PMCID: PMC6036749 DOI: 10.7150/ijbs.24616] [Citation(s) in RCA: 135] [Impact Index Per Article: 19.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2017] [Accepted: 02/04/2018] [Indexed: 02/06/2023] Open
Abstract
Meiotic recombination caused by meiotic double-strand DNA breaks. In some regions the frequency of DNA recombination is relatively higher, while in other regions the frequency is lower: the former is usually called "recombination hotspot", while the latter the "recombination coldspot". Information of the hot and cold spots may provide important clues for understanding the mechanism of genome revolution. Therefore, it is important to accurately predict these spots. In this study, we rebuilt the benchmark dataset by unifying its samples with a same length (131 bp). Based on such a foundation and using SVM (Support Vector Machine) classifier, a new predictor called "iRSpot-Pse6NC" was developed by incorporating the key hexamer features into the general PseKNC (Pseudo K-tuple Nucleotide Composition) via the binomial distribution approach. It has been observed via rigorous cross-validations that the proposed predictor is superior to its counterparts in overall accuracy, stability, sensitivity and specificity. For the convenience of most experimental scientists, the web-server for iRSpot-Pse6NC has been established at http://lin-group.cn/server/iRSpot-Pse6NC, by which users can easily obtain their desired result without the need to go through the detailed mathematical equations involved.
Collapse
Affiliation(s)
- Hui Yang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wang-Ren Qiu
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403, China
| | - Guoqing Liu
- School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Feng-Biao Guo
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China.,Gordon Life Science Institute, Boston, MA 02478, USA
| | - Kuo-Chen Chou
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Gordon Life Science Institute, Boston, MA 02478, USA
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.,Gordon Life Science Institute, Boston, MA 02478, USA
| |
Collapse
|
34
|
Uddin MR, Sharma A, Farid DM, Rahman MM, Dehzangi A, Shatabda S. EvoStruct-Sub: An accurate Gram-positive protein subcellular localization predictor using evolutionary and structural features. J Theor Biol 2018; 443:138-146. [DOI: 10.1016/j.jtbi.2018.02.002] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2017] [Revised: 01/18/2018] [Accepted: 02/03/2018] [Indexed: 12/21/2022]
|
35
|
Dehzangi A, López Y, Lal SP, Taherzadeh G, Sattar A, Tsunoda T, Sharma A. Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams. PLoS One 2018; 13:e0191900. [PMID: 29432431 PMCID: PMC5809022 DOI: 10.1371/journal.pone.0191900] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2017] [Accepted: 01/12/2018] [Indexed: 11/18/2022] Open
Abstract
Post-translational modification refers to the biological mechanism involved in the enzymatic modification of proteins after being translated in the ribosome. This mechanism comprises a wide range of structural modifications, which bring dramatic variations to the biological function of proteins. One of the recently discovered modifications is succinylation. Although succinylation can be detected through mass spectrometry, its current experimental detection turns out to be a timely process unable to meet the exponential growth of sequenced proteins. Therefore, the implementation of fast and accurate computational methods has emerged as a feasible solution. This paper proposes a novel classification approach, which effectively incorporates the secondary structure and evolutionary information of proteins through profile bigrams for succinylation prediction. The proposed predictor, abbreviated as SSEvol-Suc, made use of the above features for training an AdaBoost classifier and consequently predicting succinylated lysine residues. When SSEvol-Suc was compared with four benchmark predictors, it outperformed them in metrics such as sensitivity (0.909), accuracy (0.875) and Matthews correlation coefficient (0.75).
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Computer Science, Morgan State University, Baltimore, Maryland, United States of America
| | - Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- * E-mail:
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, Palmerston North, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Queensland, Australia
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Queensland, Australia
- Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- CREST, JST, Tokyo, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia
- School of Engineering & Physics, University of the South Pacific, Suva, Fiji
| |
Collapse
|
36
|
López Y, Sharma A, Dehzangi A, Lal SP, Taherzadeh G, Sattar A, Tsunoda T. Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction. BMC Genomics 2018; 19:923. [PMID: 29363424 PMCID: PMC5781056 DOI: 10.1186/s12864-017-4336-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Post-translational modification is considered an important biological mechanism with critical impact on the diversification of the proteome. Although a long list of such modifications has been studied, succinylation of lysine residues has recently attracted the interest of the scientific community. The experimental detection of succinylation sites is an expensive process, which consumes a lot of time and resources. Therefore, computational predictors of this covalent modification have emerged as a last resort to tackling lysine succinylation. RESULTS In this paper, we propose a novel computational predictor called 'Success', which efficiently uses the structural and evolutionary information of amino acids for predicting succinylation sites. To do this, each lysine was described as a vector that combined the above information of surrounding amino acids. We then designed a support vector machine with a radial basis function kernel for discriminating between succinylated and non-succinylated residues. We finally compared the Success predictor with three state-of-the-art predictors in the literature. As a result, our proposed predictor showed a significant improvement over the compared predictors in statistical metrics, such as sensitivity (0.866), accuracy (0.838) and Matthews correlation coefficient (0.677) on a benchmark dataset. CONCLUSIONS The proposed predictor effectively uses the structural and evolutionary information of the amino acids surrounding a lysine. The bigram feature extraction approach, while retaining the same number of features, facilitates a better description of lysines. A support vector machine with a radial basis function kernel was used to discriminate between modified and unmodified lysines. The aforementioned aspects make the Success predictor outperform three state-of-the-art predictors in succinylation detection.
Collapse
Affiliation(s)
- Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan. .,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan. .,Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia. .,School of Engineering & Physics, University of the South Pacific, Suva, Fiji.
| | - Abdollah Dehzangi
- Department of Computer Science, School of Computer, Mathematical, and Natural Sciences, Morgan State University, Baltimore, Maryland, USA
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, Palmerston North, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Brisbane, Australia
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.,School of Information and Communication Technology, Griffith University, Brisbane, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan.,Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.,CREST, JST, Tokyo, 113-8510, Japan
| |
Collapse
|
37
|
Sharma R, Bayarjargal M, Tsunoda T, Patil A, Sharma A. MoRFPred-plus: Computational Identification of MoRFs in Protein Sequences using Physicochemical Properties and HMM profiles. J Theor Biol 2018; 437:9-16. [DOI: 10.1016/j.jtbi.2017.10.015] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2017] [Revised: 10/12/2017] [Accepted: 10/13/2017] [Indexed: 11/26/2022]
|
38
|
Wang L, Zhao Y, Chen Y, Wang D. The effect of three novel feature extraction methods on the prediction of the subcellular localization of multi-site virus proteins. Bioengineered 2018; 9:196-202. [PMID: 28886267 PMCID: PMC5972939 DOI: 10.1080/21655979.2017.1373536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2017] [Accepted: 07/05/2017] [Indexed: 11/08/2022] Open
Abstract
Experimental methods play a crucial role in identifying the subcellular localization of proteins and building high-quality databases. However, more efficient, automated computational methods are required to predict the subcellular localization of proteins on a large scale. Various efficient feature extraction methods have been proposed to predict subcellular localization, but challenges remain. In this paper, three novel feature extraction methods are established to improve multi-site prediction. The first novel feature extraction method utilizes repetitive information via moving windows based on a dipeptide pseudo amino acid composition method (R-Dipeptide). The second novel feature extraction method utilizes the impact of each amino acid residue on its following residues based on pseudo amino acids (I-PseAAC). The third novel feature extraction method provides local information about protein sequences that reflects the strength of the physicochemical properties of residues (PseAAC2). The multi-label k-nearest neighbor algorithm (MLKNN) is used to predict the subcellular localization of multi-site virus proteins. The best overall accuracy values of R-Dipeptide, I-PseAAC, and PseAAC2 when applied to dataset S from Virus-mPloc are 59.92%, 59.13%, and 57.94% respectively.
Collapse
Affiliation(s)
- Lei Wang
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Yaou Zhao
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Yuehui Chen
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| | - Dong Wang
- School of Information Science and Engineering, University of Jinan, Jinan, China
- Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Jinan, China
| |
Collapse
|
39
|
pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 2018; 110:50-58. [DOI: 10.1016/j.ygeno.2017.08.005] [Citation(s) in RCA: 180] [Impact Index Per Article: 25.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2017] [Revised: 08/10/2017] [Accepted: 08/11/2017] [Indexed: 11/22/2022]
|
40
|
iDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting. Sci Rep 2017; 7:17731. [PMID: 29255285 PMCID: PMC5735173 DOI: 10.1038/s41598-017-18025-2] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2017] [Accepted: 12/05/2017] [Indexed: 02/07/2023] Open
Abstract
Prediction of new drug-target interactions is critically important as it can lead the researchers to find new uses for old drugs and to disclose their therapeutic profiles or side effects. However, experimental prediction of drug-target interactions is expensive and time-consuming. As a result, computational methods for predictioning new drug-target interactions have gained a tremendous interest in recent times. Here we present iDTI-ESBoost, a prediction model for identification of drug-target interactions using evolutionary and structural features. Our proposed method uses a novel data balancing and boosting technique to predict drug-target interaction. On four benchmark datasets taken from a gold standard data, iDTI-ESBoost outperforms the state-of-the-art methods in terms of area under receiver operating characteristic (auROC) curve. iDTI-ESBoost also outperforms the latest and the best-performing method found in the literature in terms of area under precision recall (auPR) curve. This is significant as auPR curves are argued as suitable metric for comparison for imbalanced datasets similar to the one studied here. Our reported results show the effectiveness of the classifier, balancing methods and the novel features incorporated in iDTI-ESBoost. iDTI-ESBoost is a novel prediction method that has for the first time exploited the structural features along with the evolutionary features to predict drug-protein interactions. We believe the excellent performance of iDTI-ESBoost both in terms of auROC and auPR would motivate the researchers and practitioners to use it to predict drug-target interactions. To facilitate that, iDTI-ESBoost is implemented and made publicly available at: http://farshidrayhan.pythonanywhere.com/iDTI-ESBoost/.
Collapse
|
41
|
Shatabda S, Saha S, Sharma A, Dehzangi A. iPHLoc-ES: Identification of bacteriophage protein locations using evolutionary and structural features. J Theor Biol 2017; 435:229-237. [DOI: 10.1016/j.jtbi.2017.09.022] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2017] [Revised: 09/18/2017] [Accepted: 09/20/2017] [Indexed: 10/18/2022]
|
42
|
HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features. BIOMED RESEARCH INTERNATIONAL 2017; 2017:4590609. [PMID: 29270430 PMCID: PMC5706079 DOI: 10.1155/2017/4590609] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/29/2017] [Accepted: 10/22/2017] [Indexed: 12/21/2022]
Abstract
DNA-binding proteins often play important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to solve this problem. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.
Collapse
|
43
|
iDNAProt-ES: Identification of DNA-binding Proteins Using Evolutionary and Structural Features. Sci Rep 2017; 7:14938. [PMID: 29097781 PMCID: PMC5668250 DOI: 10.1038/s41598-017-14945-1] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2017] [Accepted: 10/19/2017] [Indexed: 11/12/2022] Open
Abstract
DNA-binding proteins play a very important role in the structural composition of the DNA. In addition, they regulate and effect various cellular processes like transcription, DNA replication, DNA recombination, repair and modification. The experimental methods used to identify DNA-binding proteins are expensive and time consuming and thus attracted researchers from computational field to address the problem. In this paper, we present iDNAProt-ES, a DNA-binding protein prediction method that utilizes both sequence based evolutionary and structure based features of proteins to identify their DNA-binding functionality. We used recursive feature elimination to extract an optimal set of features and train them using Support Vector Machine (SVM) with linear kernel to select the final model. Our proposed method significantly outperforms the existing state-of-the-art predictors on standard benchmark dataset. The accuracy of the predictor is 90.18% using jack knife test and 88.87% using 10-fold cross validation on the benchmark dataset. The accuracy of the predictor on the independent dataset is 80.64% which is also significantly better than the state-of-the-art methods. iDNAProt-ES is a novel prediction method that uses evolutionary and structural based features. We believe the superior performance of iDNAProt-ES will motivate the researchers to use this method to identify DNA-binding proteins. iDNAProt-ES is publicly available as a web server at: http://brl.uiu.ac.bd/iDNAProt-ES/.
Collapse
|
44
|
Cheng X, Xiao X, Chou KC. pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by deep gene ontology learning via general PseAAC. Genomics 2017; 110:S0888-7543(17)30102-7. [PMID: 28989035 DOI: 10.1016/j.ygeno.2017.10.002] [Citation(s) in RCA: 92] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2017] [Revised: 09/28/2017] [Accepted: 10/04/2017] [Indexed: 01/21/2023]
Abstract
Information of the proteins' subcellular localization is crucially important for revealing their biological functions in a cell, the basic unit of life. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop computational tools for timely identifying their subcellular locations based on the sequence information alone. The current study is focused on the Gram-negative bacterial proteins. Although considerable efforts have been made in protein subcellular prediction, the problem is far from being solved yet. This is because mounting evidences have indicated that many Gram-negative bacterial proteins exist in two or more location sites. Unfortunately, most existing methods can be used to deal with single-location proteins only. Actually, proteins with multi-locations may have some special biological functions important for both basic research and drug design. In this study, by using the multi-label theory, we developed a new predictor called "pLoc-mGneg" for predicting the subcellular localization of Gram-negative bacterial proteins with both single and multiple locations. Rigorous cross-validation on a high quality benchmark dataset indicated that the proposed predictor is remarkably superior to "iLoc-Gneg", the state-of-the-art predictor for the same purpose. For the convenience of most experimental scientists, a user-friendly web-server for the novel predictor has been established at http://www.jci-bioinfo.cn/pLoc-mGneg/, by which users can easily get their desired results without the need to go through the complicated mathematics involved.
Collapse
Affiliation(s)
- Xiang Cheng
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Xuan Xiao
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, China; The Gordon Life Science Institute, Boston, MA 02478, USA.
| | - Kuo-Chen Chou
- The Gordon Life Science Institute, Boston, MA 02478, USA; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah, Saudi Arabia.
| |
Collapse
|
45
|
Huo H, Li T, Wang S, Lv Y, Zuo Y, Yang L. Prediction of presynaptic and postsynaptic neurotoxins by combining various Chou's pseudo components. Sci Rep 2017; 7:5827. [PMID: 28724993 PMCID: PMC5517432 DOI: 10.1038/s41598-017-06195-y] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2017] [Accepted: 06/08/2017] [Indexed: 11/09/2022] Open
Abstract
Presynaptic and postsynaptic neurotoxins are two groups of neurotoxins. Identification of presynaptic and postsynaptic neurotoxins is an important work for numerous newly found toxins. It is both costly and time consuming to determine these two neurotoxins by experimental methods. As a complement, using computational methods for predicting presynaptic and postsynaptic neurotoxins could provide some useful information in a timely manner. In this study, we described four algorithms for predicting presynaptic and postsynaptic neurotoxins from sequence driven features by using Increment of Diversity (ID), Multinomial Naive Bayes Classifier (MNBC), Random Forest (RF), and K-nearest Neighbours Classifier (IBK). Each protein sequence was encoded by pseudo amino acid (PseAA) compositions and three biological motif features, including MEME, Prosite and InterPro motif features. The Maximum Relevance Minimum Redundancy (MRMR) feature selection method was used to rank the PseAA compositions and the 50 top ranked features were selected to improve the prediction accuracy. The PseAA compositions and three kinds of biological motif features were combined and 12 different parameters that defined as P1-P12 were selected as the input parameters of ID, MNBC, RF, and IBK. The prediction results obtained in this study were significantly better than those of previously developed methods.
Collapse
Affiliation(s)
- Haiyan Huo
- Department of Environmental Engineering, Hohhot University for Nationalities, Hohhot, 010051, China
| | - Tao Li
- College of Life Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
| | - Shiyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| | - Yingli Lv
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China
| | - Yongchun Zuo
- The Key Laboratory of Mammalian Reproductive Biology and Biotechnology of the Ministry of Education, Inner Mongolia University, Hohhot, 010021, China.
| | - Lei Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, China.
| |
Collapse
|
46
|
Yang H, Li X, Cai Y, Wang Q, Li W, Liu G, Tang Y. In silico prediction of chemical subcellular localization via multi-classification methods. MEDCHEMCOMM 2017; 8:1225-1234. [PMID: 30108833 PMCID: PMC6072212 DOI: 10.1039/c7md00074j] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/14/2017] [Accepted: 03/22/2017] [Indexed: 12/16/2022]
Abstract
Chemical subcellular localization is closely related to drug distribution in the body and hence important in drug discovery and design. Although many in vivo and in vitro methods have been developed, in silico methods play key roles in the prediction of chemical subcellular localization due to their low costs and high performance. For that purpose, machine learning-based methods were developed here. At first, 614 unique compounds localized in the lysosome, mitochondria, nucleus and plasma membrane were collected from the literature. 80% of the compounds were used to build the models and the rest as the external validation set. Both fingerprints and molecular descriptors were used to describe the molecules, and six machine learning methods were applied to build the multi-classification models. The performance of the models was measured by 5-fold cross-validation and external validation. We further detected key substructures for each localization and analyzed potential structure-localization relationships, which could be very helpful for molecular design and modification. The key substructures can also be used as features complementary to fingerprints to improve the performance of the models.
Collapse
Affiliation(s)
- Hongbin Yang
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Xiao Li
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Yingchun Cai
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Qin Wang
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Weihua Li
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Guixia Liu
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| | - Yun Tang
- Shanghai Key Laboratory of New Drug Design , School of Pharmacy , East China University of Science and Technology , Shanghai 200237 , China .
| |
Collapse
|
47
|
Dehzangi A, López Y, Lal SP, Taherzadeh G, Michaelson J, Sattar A, Tsunoda T, Sharma A. PSSM-Suc: Accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction. J Theor Biol 2017; 425:97-102. [PMID: 28483566 DOI: 10.1016/j.jtbi.2017.05.005] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2017] [Revised: 04/28/2017] [Accepted: 05/03/2017] [Indexed: 11/25/2022]
Abstract
Post-translational modification (PTM) is a covalent and enzymatic modification of proteins, which contributes to diversify the proteome. Despite many reported PTMs with essential roles in cellular functioning, lysine succinylation has emerged as a subject of particular interest. Because its experimental identification remains a costly and time-consuming process, computational predictors have been recently proposed for tackling this important issue. However, the performance of current predictors is still very limited. In this paper, we propose a new predictor called PSSM-Suc which employs evolutionary information of amino acids for predicting succinylated lysine residues. Here we described each lysine residue in terms of profile bigrams extracted from position specific scoring matrices. We compared the performance of PSSM-Suc to that of existing predictors using a widely used benchmark dataset. PSSM-Suc showed a significant improvement in performance over state-of-the-art predictors. Its sensitivity, accuracy and Matthews correlation coefficient were 0.8159, 0.8199 and 0.6396, respectively.
Collapse
Affiliation(s)
- Abdollah Dehzangi
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA.
| | - Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Jacob Michaelson
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; CREST, JST, Tokyo 113-8510, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Australia; School of Engineering & Physics, University of the South Pacific, Fiji
| |
Collapse
|
48
|
López Y, Dehzangi A, Lal SP, Taherzadeh G, Michaelson J, Sattar A, Tsunoda T, Sharma A. SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids. Anal Biochem 2017; 527:24-32. [PMID: 28363440 DOI: 10.1016/j.ab.2017.03.021] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2016] [Revised: 03/13/2017] [Accepted: 03/28/2017] [Indexed: 11/30/2022]
Abstract
Post-Translational Modification (PTM) is a biological reaction which contributes to diversify the proteome. Despite many modifications with important roles in cellular activity, lysine succinylation has recently emerged as an important PTM mark. It alters the chemical structure of lysines, leading to remarkable changes in the structure and function of proteins. In contrast to the huge amount of proteins being sequenced in the post-genome era, the experimental detection of succinylated residues remains expensive, inefficient and time-consuming. Therefore, the development of computational tools for accurately predicting succinylated lysines is an urgent necessity. To date, several approaches have been proposed but their sensitivity has been reportedly poor. In this paper, we propose an approach that utilizes structural features of amino acids to improve lysine succinylation prediction. Succinylated and non-succinylated lysines were first retrieved from 670 proteins and characteristics such as accessible surface area, backbone torsion angles and local structure conformations were incorporated. We used the k-nearest neighbors cleaning treatment for dealing with class imbalance and designed a pruned decision tree for classification. Our predictor, referred to as SucStruct (Succinylation using Structural features), proved to significantly improve performance when compared to previous predictors, with sensitivity, accuracy and Mathew's correlation coefficient equal to 0.7334-0.7946, 0.7444-0.7608 and 0.4884-0.5240, respectively.
Collapse
Affiliation(s)
- Yosvany López
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan.
| | - Abdollah Dehzangi
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA.
| | - Sunil Pranit Lal
- School of Engineering & Advanced Technology, Massey University, New Zealand
| | - Ghazaleh Taherzadeh
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia
| | - Jacob Michaelson
- Department of Psychiatry, Carver College of Medicine, University of Iowa, Iowa, USA
| | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Parklands Drive, Southport, Queensland 4215, Australia; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| | - Tatsuhiko Tsunoda
- Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan; Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; CREST, JST, Tokyo 113-8510, Japan
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan; Institute for Integrated and Intelligent Systems, Griffith University, Australia
| |
Collapse
|
49
|
Liu B, Wu H, Chou KC. Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. ACTA ACUST UNITED AC 2017. [DOI: 10.4236/ns.2017.94007] [Citation(s) in RCA: 91] [Impact Index Per Article: 11.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
50
|
Tiwari AK. Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou's general PseAAC. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 134:197-213. [PMID: 27480744 DOI: 10.1016/j.cmpb.2016.07.004] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/19/2016] [Revised: 05/27/2016] [Accepted: 07/01/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND AND OBJECTIVE The G-protein coupled receptors are the largest superfamilies of membrane proteins and important targets for the drug design. G-protein coupled receptors are responsible for many physiochemical processes such as smell, taste, vision, neurotransmission, metabolism, cellular growth and immune response. So it is necessary to design a robust and efficient approach for the prediction of G-protein coupled receptors and their subfamilies. METHODS In this paper, the protein samples are represented by amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition with total 1497 number of sequence derived features. To address the issue of efficient classification of G-protein coupled receptors and their subfamilies, we propose to use a weighted k-nearest neighbor classifier with UNION of best 50 features, selected by Fisher score based feature selection, ReliefF, fast correlation based filter, minimum redundancy maximum relevancy, and support vector machine based recursive elimination feature selection methods to exploit the advantages of these feature selection methods. RESULTS The proposed method achieved an overall accuracy of 99.9%, 98.3%, 95.4%, MCC values of 1.00, 0.98, 0.95, ROC area values of 1.00, 0.998, 0.996 and precision of 99.9%, 98.3% and 95.5% using 10-fold cross-validation to predict the G-protein coupled receptors and non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively. CONCLUSIONS The high accuracies, MCC, ROC area values, and precision values indicate that the proposed method is better for the prediction of G-protein coupled receptors families and their subfamilies.
Collapse
|