1
|
Wang M, Song L, Zhang Y, Gao H, Yan L, Yu B. Malsite-Deep: Prediction of protein malonylation sites through deep learning and multi-information fusion based on NearMiss-2 strategy. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108191] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
2
|
Mahmud SMH, Chen W, Liu Y, Awal MA, Ahmed K, Rahman MH, Moni MA. PreDTIs: prediction of drug-target interactions based on multiple feature information using gradient boosting framework with data balancing and feature selection techniques. Brief Bioinform 2021; 22:6168499. [PMID: 33709119 PMCID: PMC7989622 DOI: 10.1093/bib/bbab046] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2020] [Revised: 01/25/2021] [Accepted: 01/29/2021] [Indexed: 12/13/2022] Open
Abstract
Discovering drug–target (protein) interactions (DTIs) is of great significance for researching and developing novel drugs, having a tremendous advantage to pharmaceutical industries and patients. However, the prediction of DTIs using wet-lab experimental methods is generally expensive and time-consuming. Therefore, different machine learning-based methods have been developed for this purpose, but there are still substantial unknown interactions needed to discover. Furthermore, data imbalance and feature dimensionality problems are a critical challenge in drug-target datasets, which can decrease the classifier performances that have not been significantly addressed yet. This paper proposed a novel drug–target interaction prediction method called PreDTIs. First, the feature vectors of the protein sequence are extracted by the pseudo-position-specific scoring matrix (PsePSSM), dipeptide composition (DC) and pseudo amino acid composition (PseAAC); and the drug is encoded with MACCS substructure fingerings. Besides, we propose a FastUS algorithm to handle the class imbalance problem and also develop a MoIFS algorithm to remove the irrelevant and redundant features for getting the best optimal features. Finally, balanced and optimal features are provided to the LightGBM Classifier to identify DTIs, and the 5-fold CV validation test method was applied to evaluate the prediction ability of the proposed method. Prediction results indicate that the proposed model PreDTIs is significantly superior to other existing methods in predicting DTIs, and our model could be used to discover new drugs for unknown disorders or infections, such as for the coronavirus disease 2019 using existing drugs compounds and severe acute respiratory syndrome coronavirus 2 protein sequences.
Collapse
Affiliation(s)
- S M Hasan Mahmud
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wenyu Chen
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Yongsheng Liu
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Md Abdul Awal
- Electronics and Communication Engineering Discipline, Khulna University, Khulna 9208, Bangladesh
| | - Kawsar Ahmed
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail-1902, Bangladesh
| | - Md Habibur Rahman
- Department of Computer Science and Engineering, Islamic University, Kushtia-7003, Bangladesh
| | - Mohammad Ali Moni
- UNSW Digital Health, WHO Center for eHealth, School of Public Health and Community Medicine, Faculty of Medicine, The University of New South Wales, Sydney, Australia
| |
Collapse
|
3
|
Wang M, Cui X, Yu B, Chen C, Ma Q, Zhou H. SulSite-GTB: identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04792-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
|
4
|
Malebary SJ, Rehman MSU, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou's 5-step rule. PLoS One 2019; 14:e0223993. [PMID: 31751380 PMCID: PMC6874067 DOI: 10.1371/journal.pone.0223993] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Accepted: 10/02/2019] [Indexed: 01/22/2023] Open
Abstract
Among different post-translational modifications (PTMs), one of the most important one is the lysine crotonylation in proteins. Its importance cannot be undermined related to different diseases and essential biological practice. The key step for finding the hidden mechanisms of crotonylation along with their occurrence sites is to completely apprehend the mechanism behind this biological process. In previously reported studies, researchers have used different techniques, like position weighted matrix (PWM), support vector machine (SVM), k nearest neighbors (KNN), and many others. However, the maximum prediction accuracy achieved was not such high. To address this, herein, we propose an improved predictor for lysine crotonylation sites named iCrotoK-PseAAC, in which we have incorporated various position and composition relative features along with statistical moments into PseAAC. The results of self-consistency testing were 100% accurate, while the 10-fold cross validation gave 99.0% accuracy. Based on the validation and comparison of model, it is concluded that the iCrotoK-PseAAC is more accurate than the previously proposed models.
Collapse
Affiliation(s)
- Sharaf Jameel Malebary
- Department of Information Technology, King Abdul Aziz University, Rabigh, Kingdom of Saudi Arabia
| | - Muhammad Safi ur Rehman
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
5
|
Lin J, Chen H, Li S, Liu Y, Li X, Yu B. Accurate prediction of potential druggable proteins based on genetic algorithm and Bagging-SVM ensemble classifier. Artif Intell Med 2019; 98:35-47. [DOI: 10.1016/j.artmed.2019.07.005] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2018] [Revised: 03/03/2019] [Accepted: 07/18/2019] [Indexed: 12/14/2022]
|
6
|
Hoseini ASH, Mirzarezaee M. Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks. IRANIAN JOURNAL OF BIOTECHNOLOGY 2018; 16:e1933. [PMID: 31457027 PMCID: PMC6697825 DOI: 10.15171/ijb.1933] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/29/2017] [Revised: 01/11/2018] [Accepted: 01/13/2018] [Indexed: 01/09/2023]
Abstract
Background Prediction of the protein localization is among the most important issues in the bioinformatics that is used for the prediction of the proteins in the cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of the intracellular protein locations. These algorithms use the features extracted from protein sequences. In contrast, protein interactions have been less investigated. Objectives As protein interactions usually occur in the same or adjacent places, using this feature to find the location would be efficient and impressive. This study did not aim at increasing the total accuracy of the conducted research. The study has focused on the features of the proteins’ interaction and their employment which lead to a higher accuracy. Materials and Methods In this study, we have examined the protein interaction network as one of the features for prediction of the protein localization and its effects on the prediction results. In this regards, we have gathered some of the most common features including Amino Acid Composition, Dipeptide Compositions, Pseudo Amino Acid Compositions (PseAAC), Position Specific Scoring Matrix (PSSM), Functional Domain, Gene Ontology information, and the Pair-wise sequence alignment. The results of the classification are compared to the ones using protein interactions. For achieving this goal different machine learning algorithms were tested. Results The best-obtained results of using single feature set obtained using SVM classifier for PseAAC feature. The accuracy of combining all features with PPI data, using the Decision Tree and Random Forest classifiers, was 82.49% and 83.35%, respectively. In another experiment, using just protein interaction data with the different cutting points resulted in obtaining an accuracy of 93.035% for the protein location prediction. Conclusion In total, it was shown that protein(s) interaction has a significant impact on the prediction of the mitochondrial proteins’ location. This feature can separately distinguish the locations well. Using this feature the accuracy of the results is raised up to 5%.
Collapse
Affiliation(s)
| | - Mitra Mirzarezaee
- Department of Computer Engineering, Science and Research branch, Islamic Azad University, Tehran, Iran.,School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
| |
Collapse
|
7
|
Yu B, Li S, Qiu W, Wang M, Du J, Zhang Y, Chen X. Prediction of subcellular location of apoptosis proteins by incorporating PsePSSM and DCCA coefficient based on LFDA dimensionality reduction. BMC Genomics 2018; 19:478. [PMID: 29914358 PMCID: PMC6006758 DOI: 10.1186/s12864-018-4849-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2017] [Accepted: 06/01/2018] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Apoptosis is associated with some human diseases, including cancer, autoimmune disease, neurodegenerative disease and ischemic damage, etc. Apoptosis proteins subcellular localization information is very important for understanding the mechanism of programmed cell death and the development of drugs. Therefore, the prediction of subcellular localization of apoptosis protein is still a challenging task. RESULTS In this paper, we propose a novel method for predicting apoptosis protein subcellular localization, called PsePSSM-DCCA-LFDA. Firstly, the protein sequences are extracted by combining pseudo-position specific scoring matrix (PsePSSM) and detrended cross-correlation analysis coefficient (DCCA coefficient), then the extracted feature information is reduced dimensionality by LFDA (local Fisher discriminant analysis). Finally, the optimal feature vectors are input to the SVM classifier to predict subcellular location of the apoptosis proteins. The overall prediction accuracy of 99.7, 99.6 and 100% are achieved respectively on the three benchmark datasets by the most rigorous jackknife test, which is better than other state-of-the-art methods. CONCLUSION The experimental results indicate that our method can significantly improve the prediction accuracy of subcellular localization of apoptosis proteins, which is quite high to be able to become a promising tool for further proteomics studies. The source code and all datasets are available at https://github.com/QUST-BSBRC/PsePSSM-DCCA-LFDA/ .
Collapse
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China. .,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China. .,School of Life Sciences, University of Science and Technology of China, Hefei, 230027, China.
| | - Shan Li
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Wenying Qiu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Minghui Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China.,Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Junwei Du
- College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
| | - Yusen Zhang
- School of Mathematics and Statistics, Shandong University at Weihai, Weihai, 264209, China
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, 21116, China
| |
Collapse
|
8
|
Kabir M, Hayat M. iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples. Mol Genet Genomics 2015; 291:285-96. [DOI: 10.1007/s00438-015-1108-5] [Citation(s) in RCA: 95] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2015] [Accepted: 08/19/2015] [Indexed: 10/23/2022]
|
9
|
Large-scale identification of potential drug targets based on the topological features of human protein–protein interaction network. Anal Chim Acta 2015; 871:18-27. [DOI: 10.1016/j.aca.2015.02.032] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2014] [Revised: 01/29/2015] [Accepted: 02/10/2015] [Indexed: 01/17/2023]
|
10
|
HBC-Evo: predicting human breast cancer by exploiting amino acid sequence-based feature spaces and evolutionary ensemble system. Amino Acids 2014; 47:217-21. [DOI: 10.1007/s00726-014-1871-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2014] [Accepted: 11/04/2014] [Indexed: 10/24/2022]
|
11
|
Zuo YC, Peng Y, Liu L, Chen W, Yang L, Fan GL. Predicting peroxidase subcellular location by hybridizing different descriptors of Chou’ pseudo amino acid patterns. Anal Biochem 2014; 458:14-9. [DOI: 10.1016/j.ab.2014.04.032] [Citation(s) in RCA: 81] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2014] [Revised: 04/22/2014] [Accepted: 04/25/2014] [Indexed: 11/28/2022]
|
12
|
Majid A, Ali S, Iqbal M, Kausar N. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2014; 113:792-808. [PMID: 24472367 DOI: 10.1016/j.cmpb.2014.01.001] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/06/2013] [Revised: 12/29/2013] [Accepted: 01/03/2014] [Indexed: 06/03/2023]
Abstract
This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the samples of the minority class, thereby balancing the dataset. In the predictor stage, machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. MTD-SVM model has provided the best values of accuracy, G-mean and Matthew's correlation coefficient of 96.71%, 96.70% and 71.98% for cancer/non-cancer dataset, breast/non-breast cancer dataset and colon/non-colon cancer dataset, respectively. We found that hybrid MTD-SVM is the best with respect to prediction performance and computational cost. MTD-KNN model has achieved moderately better prediction as compared to hybrid MTD-NB (Naïve Bayes) but at the expense of higher computing cost. MTD-KNN model is faster than MTD-RF (random forest) but its prediction is not better than MTD-RF. To the best of our knowledge, the reported results are the best results, so far, for these datasets. The proposed scheme indicates that the developed models can be used as a tool for the prediction of cancer. This scheme may be useful for study of any sequential information such as protein sequence or any nucleic acid sequence.
Collapse
Affiliation(s)
- Abdul Majid
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Safdar Ali
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Mubashar Iqbal
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Nabeela Kausar
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| |
Collapse
|
13
|
Ali S, Majid A, Khan A. IDM-PhyChm-Ens: intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids. Amino Acids 2014; 46:977-93. [PMID: 24390396 DOI: 10.1007/s00726-013-1659-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2013] [Accepted: 12/20/2013] [Indexed: 12/21/2022]
Abstract
Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
Collapse
Affiliation(s)
- Safdar Ali
- Department of Computer and Information Sciences, Pakistan Institute of Engineering, and Applied Sciences, Nilore, Islamabad, 45650, Pakistan,
| | | | | |
Collapse
|
14
|
Protein subcellular localization in human and hamster cell lines: Employing local ternary patterns of fluorescence microscopy images. J Theor Biol 2014; 340:85-95. [DOI: 10.1016/j.jtbi.2013.08.017] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2013] [Revised: 07/09/2013] [Accepted: 08/15/2013] [Indexed: 11/21/2022]
|
15
|
Wang X, Li GZ. Multilabel learning via random label selection for protein subcellular multilocations prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:436-446. [PMID: 23929867 DOI: 10.1109/tcbb.2013.21] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Prediction of protein subcellular localization is an important but challenging problem, particularly when proteins may simultaneously exist at, or move between, two or more different subcellular location sites. Most of the existing protein subcellular localization methods are only used to deal with the single-location proteins. In the past few years, only a few methods have been proposed to tackle proteins with multiple locations. However, they only adopt a simple strategy, that is, transforming the multilocation proteins to multiple proteins with single location, which does not take correlations among different subcellular locations into account. In this paper, a novel method named random label selection (RALS) (multilabel learning via RALS), which extends the simple binary relevance (BR) method, is proposed to learn from multilocation proteins in an effective and efficient way. RALS does not explicitly find the correlations among labels, but rather implicitly attempts to learn the label correlations from data by augmenting original feature space with randomly selected labels as its additional input features. Through the fivefold cross-validation test on a benchmark data set, we demonstrate our proposed method with consideration of label correlations obviously outperforms the baseline BR method without consideration of label correlations, indicating correlations among different subcellular locations really exist and contribute to improvement of prediction performance. Experimental results on two benchmark data sets also show that our proposed methods achieve significantly higher performance than some other state-of-the-art methods in predicting subcellular multilocations of proteins. The prediction web server is available at >http://levis.tongji.edu.cn:8080/bioinfo/MLPred-Euk/ for the public usage.
Collapse
Affiliation(s)
- Xiao Wang
- Key Laboratory of Embedded System and Service Computing, Ministry of Education, Department of Control Science and Engineering, Tongji University, Shanghai 201804, China
| | | |
Collapse
|
16
|
Yu D, Wu X, Shen H, Yang J, Tang Z, Qi Y, Yang J. Enhancing Membrane Protein Subcellular Localization Prediction by Parallel Fusion of Multi-View Features. IEEE Trans Nanobioscience 2012; 11:375-85. [PMID: 22875262 DOI: 10.1109/tnb.2012.2208473] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Affiliation(s)
- Dongjun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| | | | | | | | | | | | | |
Collapse
|
17
|
Wang X, Li GZ. A multi-label predictor for identifying the subcellular locations of singleplex and multiplex eukaryotic proteins. PLoS One 2012; 7:e36317. [PMID: 22629314 PMCID: PMC3358325 DOI: 10.1371/journal.pone.0036317] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2011] [Accepted: 04/01/2012] [Indexed: 01/30/2023] Open
Abstract
Subcellular locations of proteins are important functional attributes. An effective and efficient subcellular localization predictor is necessary for rapidly and reliably annotating subcellular locations of proteins. Most of existing subcellular localization methods are only used to deal with single-location proteins. Actually, proteins may simultaneously exist at, or move between, two or more different subcellular locations. To better reflect characteristics of multiplex proteins, it is highly desired to develop new methods for dealing with them. In this paper, a new predictor, called Euk-ECC-mPLoc, by introducing a powerful multi-label learning approach which exploits correlations between subcellular locations and hybridizing gene ontology with dipeptide composition information, has been developed that can be used to deal with systems containing both singleplex and multiplex eukaryotic proteins. It can be utilized to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centrosome, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole. Experimental results on a stringent benchmark dataset of eukaryotic proteins by jackknife cross validation test show that the average success rate and overall success rate obtained by Euk-ECC-mPLoc were 69.70% and 81.54%, respectively, indicating that our approach is quite promising. Particularly, the success rates achieved by Euk-ECC-mPLoc for small subsets were remarkably improved, indicating that it holds a high potential for simulating the development of the area. As a user-friendly web-server, Euk-ECC-mPLoc is freely accessible to the public at the website http://levis.tongji.edu.cn:8080/bioinfo/Euk-ECC-mPLoc/. We believe that Euk-ECC-mPLoc may become a useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying subcellular locations of eukaryotic proteins.
Collapse
Affiliation(s)
| | - Guo-Zheng Li
- The MOE Key Laboratory of Embedded System and Service Computing, Department of Control Science and Engineering, Tongji University, Shanghai, China
| |
Collapse
|
18
|
Li L, Zhang Y, Zou L, Li C, Yu B, Zheng X, Zhou Y. An ensemble classifier for eukaryotic protein subcellular location prediction using gene ontology categories and amino acid hydrophobicity. PLoS One 2012; 7:e31057. [PMID: 22303481 PMCID: PMC3268814 DOI: 10.1371/journal.pone.0031057] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2011] [Accepted: 12/31/2011] [Indexed: 02/05/2023] Open
Abstract
With the rapid increase of protein sequences in the post-genomic age, it is challenging to develop accurate and automated methods for reliably and quickly predicting their subcellular localizations. Till now, many efforts have been tried, but most of which used only a single algorithm. In this paper, we proposed an ensemble classifier of KNN (k-nearest neighbor) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic proteins based on a voting system. The overall prediction accuracies by the one-versus-one strategy are 78.17%, 89.94% and 75.55% for three benchmark datasets of eukaryotic proteins. The improved prediction accuracies reveal that GO annotations and hydrophobicity of amino acids help to predict subcellular locations of eukaryotic proteins.
Collapse
Affiliation(s)
- Liqi Li
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Yuan Zhang
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Lingyun Zou
- Department of Microbiology, College of Basic Medical Sciences, Third Military Medical University, Chongqing, China
| | - Changqing Li
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Bo Yu
- Department of Orthopedics, Yichun People's Hospital, Yichun, China
| | - Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
- Scientific Computing Key Laboratory of Shanghai Universities, Shanghai, China
| | - Yue Zhou
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| |
Collapse
|
19
|
Hayat M, Khan A. MemHyb: Predicting membrane protein types by hybridizing SAAC and PSSM. J Theor Biol 2012; 292:93-102. [DOI: 10.1016/j.jtbi.2011.09.026] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2011] [Revised: 09/21/2011] [Accepted: 09/22/2011] [Indexed: 01/08/2023]
|
20
|
Tahir M, Khan A, Majid A. Protein subcellular localization of fluorescence imagery using spatial and transform domain features. Bioinformatics 2011; 28:91-7. [DOI: 10.1093/bioinformatics/btr624] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
|