51
|
Hayat M, Iqbal N. Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou's general PseAAC and Support Vector Machine. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2014; 116:184-192. [PMID: 24997484 DOI: 10.1016/j.cmpb.2014.06.007] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/13/2014] [Revised: 06/09/2014] [Accepted: 06/13/2014] [Indexed: 06/03/2023]
Abstract
Proteins control all biological functions in living species. Protein structure is comprised of four major classes including all-α class, all-β class, α+β, and α/β. Each class performs different function according to their nature. Owing to the large exploration of protein sequences in the databanks, the identification of protein structure classes is difficult through conventional methods with respect to cost and time. Looking at the importance of protein structure classes, it is thus highly desirable to develop a computational model for discriminating protein structure classes with high accuracy. For this purpose, we propose a silco method by incorporating Pseudo Average Chemical Shift and Support Vector Machine. Two feature extraction schemes namely Pseudo Amino Acid Composition and Pseudo Average Chemical Shift are used to explore valuable information from protein sequences. The performance of the proposed model is assessed using four benchmark datasets 25PDB, 1189, 640 and 399 employing jackknife test. The success rates of the proposed model are 84.2%, 85.0%, 86.4%, and 89.2%, respectively on the four datasets. The empirical results reveal that the performance of our proposed model compared to existing models is promising in the literature so far and might be useful for future research.
Collapse
Affiliation(s)
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
| | - Nadeem Iqbal
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| |
Collapse
|
52
|
Prediction of DNase I hypersensitive sites by using pseudo nucleotide compositions. ScientificWorldJournal 2014; 2014:740506. [PMID: 25215331 PMCID: PMC4152949 DOI: 10.1155/2014/740506] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2014] [Accepted: 08/03/2014] [Indexed: 11/19/2022] Open
Abstract
DNase I hypersensitive sites (DHS) associated with a wide variety of regulatory DNA elements. Knowledge about the locations of DHS is helpful for deciphering the function of noncoding genomic regions. With the acceleration of genome sequences in the postgenomic age, it is highly desired to develop cost-effective computational methods to identify DHS. In the present work, a support vector machine based model was proposed to identify DHS by using the pseudo dinucleotide composition. In the jackknife test, the proposed model obtained an accuracy of 83%, which is competitive with that of the existing method. This result suggests that the proposed model may become a useful tool for DHS identifications.
Collapse
|
53
|
Predicting the types of J-proteins using clustered amino acids. BIOMED RESEARCH INTERNATIONAL 2014; 2014:935719. [PMID: 24804260 PMCID: PMC3996952 DOI: 10.1155/2014/935719] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/24/2014] [Revised: 03/04/2014] [Accepted: 03/13/2014] [Indexed: 01/24/2023]
Abstract
J-proteins are molecular chaperones and present in a wide variety of organisms from prokaryote to eukaryote. Based on their domain organizations, J-proteins can be classified into 4 types, that is, Type I, Type II, Type III, and Type IV. Different types of J-proteins play distinct roles in influencing cancer properties and cell death. Thus, reliably annotating the types of J-proteins is essential to better understand their molecular functions. In the present work, a support vector machine based method was developed to identify the types of J-proteins using the tripeptide composition of reduced amino acid alphabet. In the jackknife cross-validation, the maximum overall accuracy of 94% was achieved on a stringent benchmark dataset. We also analyzed the amino acid compositions by using analysis of variance and found the distinct distributions of amino acids in each family of the J-proteins. To enhance the value of the practical applications of the proposed model, an online web server was developed and can be freely accessed.
Collapse
|
54
|
Prediction of multi-type membrane proteins in human by an integrated approach. PLoS One 2014; 9:e93553. [PMID: 24676214 PMCID: PMC3968155 DOI: 10.1371/journal.pone.0093553] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2013] [Accepted: 03/05/2014] [Indexed: 11/29/2022] Open
Abstract
Membrane proteins were found to be involved in various cellular processes performing various important functions, which are mainly associated to their types. However, it is very time-consuming and expensive for traditional biophysical methods to identify membrane protein types. Although some computational tools predicting membrane protein types have been developed, most of them can only recognize one kind of type. Therefore, they are not as effective as one membrane protein can have several types at the same time. To our knowledge, few methods handling multiple types of membrane proteins were reported. In this study, we proposed an integrated approach to predict multiple types of membrane proteins by employing sequence homology and protein-protein interaction network. As a result, the prediction accuracies reached 87.65%, 81.39% and 70.79%, respectively, by the leave-one-out test on three datasets. It outperformed the nearest neighbor algorithm adopting pseudo amino acid composition. The method is anticipated to be an alternative tool for identifying membrane protein types. New metrics for evaluating performances of methods dealing with multi-label problems were also presented. The program of the method is available upon request.
Collapse
|
55
|
Protein subcellular localization in human and hamster cell lines: Employing local ternary patterns of fluorescence microscopy images. J Theor Biol 2014; 340:85-95. [DOI: 10.1016/j.jtbi.2013.08.017] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2013] [Revised: 07/09/2013] [Accepted: 08/15/2013] [Indexed: 11/21/2022]
|
56
|
Hayat M, Tahir M, Khan SA. Prediction of protein structure classes using hybrid space of multi-profile Bayes and bi-gram probability feature spaces. J Theor Biol 2013; 346:8-15. [PMID: 24384128 DOI: 10.1016/j.jtbi.2013.12.015] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2013] [Revised: 10/30/2013] [Accepted: 12/12/2013] [Indexed: 11/28/2022]
Abstract
Proteins are the executants of biological functions in living organisms. Comprehension of protein structure is a challenging problem in the era of proteomics, computational biology, and bioinformatics because of its pivotal role in protein folding patterns. Owing to the large exploration of protein sequences in protein databanks and intricacy of protein structures, experimental and theoretical methods are insufficient for prediction of protein structure classes. Therefore, it is highly desirable to develop an accurate, reliable, and high throughput computational model to predict protein structure classes correctly from polygenetic sequences. In this regard, we propose a promising model employing hybrid descriptor space in conjunction with optimized evidence-theoretic K-nearest neighbor algorithm. Hybrid space is the composition of two descriptor spaces including Multi-profile Bayes and bi-gram probability. In order to enhance the generalization power of the classifier, we have selected high discriminative descriptors from the hybrid space using particle swarm optimization, a well-known evolutionary feature selection technique. Performance evaluation of the proposed model is performed using the jackknife test on three low similarity benchmark datasets including 25PDB, 1189, and 640. The success rates of the proposed model are 87.0%, 86.6%, and 88.4%, respectively on the three benchmark datasets. The comparative analysis exhibits that our proposed model has yielded promising results compared to the existing methods in the literature. In addition, our proposed prediction system might be helpful in future research particularly in cases where the major focus of research is on low similarity datasets.
Collapse
Affiliation(s)
- Maqsood Hayat
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan.
| | - Muhammad Tahir
- Department of Computer Science, National University of Computer and Emerging Science, Peshawar, Pakistan
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| |
Collapse
|
57
|
Han GS, Yu ZG, Anh V. A two-stage SVM method to predict membrane protein types by incorporating amino acid classifications and physicochemical properties into a general form of Chou's PseAAC. J Theor Biol 2013; 344:31-9. [PMID: 24316387 DOI: 10.1016/j.jtbi.2013.11.017] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2013] [Revised: 10/16/2013] [Accepted: 11/24/2013] [Indexed: 01/12/2023]
Abstract
Membrane proteins play important roles in many biochemical processes and are also attractive targets of drug discovery for various diseases. The elucidation of membrane protein types provides clues for understanding the structure and function of proteins. Recently we developed a novel system for predicting protein subnuclear localizations. In this paper, we propose a simplified version of our system for predicting membrane protein types directly from primary protein structures, which incorporates amino acid classifications and physicochemical properties into a general form of pseudo-amino acid composition. In this simplified system, we will design a two-stage multi-class support vector machine combined with a two-step optimal feature selection process, which proves very effective in our experiments. The performance of the present method is evaluated on two benchmark datasets consisting of five types of membrane proteins. The overall accuracies of prediction for five types are 93.25% and 96.61% via the jackknife test and independent dataset test, respectively. These results indicate that our method is effective and valuable for predicting membrane protein types. A web server for the proposed method is available at http://www.juemengt.com/jcc/memty_page.php.
Collapse
Affiliation(s)
- Guo-Sheng Han
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China
| | - Zu-Guo Yu
- School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia.
| | - Vo Anh
- School of Mathematical Science, Queensland University of Technology, GPO Box 2434, Brisbane Q 4001, Australia
| |
Collapse
|
58
|
Predicting protein subchloroplast locations with both single and multiple sites via three different modes of Chou's pseudo amino acid compositions. J Theor Biol 2013; 335:205-12. [DOI: 10.1016/j.jtbi.2013.06.034] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2013] [Revised: 05/26/2013] [Accepted: 06/29/2013] [Indexed: 12/19/2022]
|
59
|
Feng PM, Lin H, Chen W. Identification of antioxidants from sequence information using naïve Bayes. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:567529. [PMID: 24062796 PMCID: PMC3766563 DOI: 10.1155/2013/567529] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 05/03/2013] [Revised: 07/20/2013] [Accepted: 07/22/2013] [Indexed: 12/22/2022]
Abstract
Antioxidant proteins are substances that protect cells from the damage caused by free radicals. Accurate identification of new antioxidant proteins is important in understanding their roles in delaying aging. Therefore, it is highly desirable to develop computational methods to identify antioxidant proteins. In this study, a Naïve Bayes-based method was proposed to predict antioxidant proteins using amino acid compositions and dipeptide compositions. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 66.88% for the discrimination between antioxidant and nonantioxidant proteins, which is superior to that of other state-of-the-art classifiers. These results suggest that the proposed method could be an effective and promising high-throughput method for antioxidant protein identification.
Collapse
Affiliation(s)
- Peng-Mian Feng
- School of Public Health, Hebei United University, Tangshan 063000, China
| | - Hao Lin
- Key Laboratory for NeuroInformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China
| |
Collapse
|
60
|
Feng PM, Ding H, Chen W, Lin H. Naïve Bayes classifier with feature selection to identify phage virion proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:530696. [PMID: 23762187 PMCID: PMC3671239 DOI: 10.1155/2013/530696] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 03/10/2013] [Revised: 04/16/2013] [Accepted: 04/28/2013] [Indexed: 12/31/2022]
Abstract
Knowledge about the protein composition of phage virions is a key step to understand the functions of phage virion proteins. However, the experimental method to identify virion proteins is time consuming and expensive. Thus, it is highly desirable to develop novel computational methods for phage virion protein identification. In this study, a Naïve Bayes based method was proposed to predict phage virion proteins using amino acid composition and dipeptide composition. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 79.15% for phage virion and nonvirion proteins classification, which are superior to that of other state-of-the-art classifiers. These results indicate that the proposed method could be as an effective and promising high-throughput method in phage proteomics research.
Collapse
Affiliation(s)
- Peng-Mian Feng
- School of Public Health, Hebei United University, Tangshan 063000, China
| | - Hui Ding
- Key Laboratory for Neuroinformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan 063000, China
| | - Hao Lin
- Key Laboratory for Neuroinformation of Ministry of Education, Center of Bioinformatics, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
61
|
Sharma A, Lyons J, Dehzangi A, Paliwal KK. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J Theor Biol 2013; 320:41-6. [DOI: 10.1016/j.jtbi.2012.12.008] [Citation(s) in RCA: 118] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2012] [Revised: 12/04/2012] [Accepted: 12/05/2012] [Indexed: 11/26/2022]
|
62
|
Xu Y, Ding J, Wu LY, Chou KC. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 2013; 8:e55844. [PMID: 23409062 PMCID: PMC3567014 DOI: 10.1371/journal.pone.0055844] [Citation(s) in RCA: 282] [Impact Index Per Article: 23.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2012] [Accepted: 01/02/2013] [Indexed: 11/18/2022] Open
Abstract
Posttranslational modifications (PTMs) of proteins are responsible for sensing and transducing signals to regulate various cellular functions and signaling events. S-nitrosylation (SNO) is one of the most important and universal PTMs. With the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop computational methods for timely identifying the exact SNO sites in proteins because this kind of information is very useful for both basic research and drug development. Here, a new predictor, called iSNO-PseAAC, was developed for identifying the SNO sites in proteins by incorporating the position-specific amino acid propensity (PSAAP) into the general form of pseudo amino acid composition (PseAAC). The predictor was implemented using the conditional random field (CRF) algorithm. As a demonstration, a benchmark dataset was constructed that contains 731 SNO sites and 810 non-SNO sites. To reduce the homology bias, none of these sites were derived from the proteins that had [Formula: see text] pairwise sequence identity to any other. It was observed that the overall cross-validation success rate achieved by iSNO-PseAAC in identifying nitrosylated proteins on an independent dataset was over 90%, indicating that the new predictor is quite promising. Furthermore, a user-friendly web-server for iSNO-PseAAC was established at http://app.aporc.org/iSNO-PseAAC/, by which users can easily obtain the desired results without the need to follow the mathematical equations involved during the process of developing the prediction method. It is anticipated that iSNO-PseAAC may become a useful high throughput tool for identifying the SNO sites, or at the very least play a complementary role to the existing methods in this area.
Collapse
Affiliation(s)
- Yan Xu
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
- Gordon Life Science Institute, San Diego, California, United States of America
| | - Jun Ding
- Department of Information and Computer Science, University of Science and Technology Beijing, Beijing, China
| | - Ling-Yun Wu
- Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
| |
Collapse
|
63
|
Ou YY, Chen SA, Wu SC. ETMB-RBF: discrimination of metal-binding sites in electron transporters based on RBF networks with PSSM profiles and significant amino acid pairs. PLoS One 2013; 8:e46572. [PMID: 23405059 PMCID: PMC3566168 DOI: 10.1371/journal.pone.0046572] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2012] [Accepted: 08/31/2012] [Indexed: 11/18/2022] Open
Abstract
Background Cellular respiration is the process by which cells obtain energy from glucose and is a very important biological process in living cell. As cells do cellular respiration, they need a pathway to store and transport electrons, the electron transport chain. The function of the electron transport chain is to produce a trans-membrane proton electrochemical gradient as a result of oxidation–reduction reactions. In these oxidation–reduction reactions in electron transport chains, metal ions play very important role as electron donor and acceptor. For example, Fe ions are in complex I and complex II, and Cu ions are in complex IV. Therefore, to identify metal-binding sites in electron transporters is an important issue in helping biologists better understand the workings of the electron transport chain. Methods We propose a method based on Position Specific Scoring Matrix (PSSM) profiles and significant amino acid pairs to identify metal-binding residues in electron transport proteins. Results We have selected a non-redundant set of 55 metal-binding electron transport proteins as our dataset. The proposed method can predict metal-binding sites in electron transport proteins with an average 10-fold cross-validation accuracy of 93.2% and 93.1% for metal-binding cysteine and histidine, respectively. Compared with the general metal-binding predictor from A. Passerini et al., the proposed method can improve over 9% of sensitivity, and 14% specificity on the independent dataset in identifying metal-binding cysteines. The proposed method can also improve almost 76% sensitivity with same specificity in metal-binding histidine, and MCC is also improved from 0.28 to 0.88. Conclusions We have developed a novel approach based on PSSM profiles and significant amino acid pairs for identifying metal-binding sites from electron transport proteins. The proposed approach achieved a significant improvement with independent test set of metal-binding electron transport proteins.
Collapse
Affiliation(s)
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan.
| | | | | |
Collapse
|
64
|
Chen W, Feng PM, Lin H, Chou KC. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res 2013; 41:e68. [PMID: 23303794 PMCID: PMC3616736 DOI: 10.1093/nar/gks1450] [Citation(s) in RCA: 478] [Impact Index Per Article: 39.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
Meiotic recombination is an important biological process. As a main driving force of evolution, recombination provides natural new combinations of genetic variations. Rather than randomly occurring across a genome, meiotic recombination takes place in some genomic regions (the so-called ‘hotspots’) with higher frequencies, and in the other regions (the so-called ‘coldspots’) with lower frequencies. Therefore, the information of the hotspots and coldspots would provide useful insights for in-depth studying of the mechanism of recombination and the genome evolution process as well. So far, the recombination regions have been mainly determined by experiments, which are both expensive and time-consuming. With the avalanche of genome sequences generated in the postgenomic age, it is highly desired to develop automated methods for rapidly and effectively identifying the recombination regions. In this study, a predictor, called ‘iRSpot-PseDNC’, was developed for identifying the recombination hotspots and coldspots. In the new predictor, the samples of DNA sequences are formulated by a novel feature vector, the so-called ‘pseudo dinucleotide composition’ (PseDNC), into which six local DNA structural properties, i.e. three angular parameters (twist, tilt and roll) and three translational parameters (shift, slide and rise), are incorporated. It was observed by the rigorous jackknife test that the overall success rate achieved by iRSpot-PseDNC was >82% in identifying recombination spots in Saccharomyces cerevisiae, indicating the new predictor is promising or at least may become a complementary tool to the existing methods in this area. Although the benchmark data set used to train and test the current method was from S. cerevisiae, the basic approaches can also be extended to deal with all the other genomes. Particularly, it has not escaped our notice that the PseDNC approach can be also used to study many other DNA-related problems. As a user-friendly web-server, iRSpot-PseDNC is freely accessible at http://lin.uestc.edu.cn/server/iRSpot-PseDNC.
Collapse
Affiliation(s)
- Wei Chen
- Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan, China.
| | | | | | | |
Collapse
|
65
|
Lin WZ, Fang JA, Xiao X, Chou KC. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. MOLECULAR BIOSYSTEMS 2013; 9:634-44. [DOI: 10.1039/c3mb25466f] [Citation(s) in RCA: 218] [Impact Index Per Article: 18.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
|
66
|
Ding C, Yuan LF, Guo SH, Lin H, Chen W. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J Proteomics 2012; 77:321-8. [DOI: 10.1016/j.jprot.2012.09.006] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2012] [Revised: 08/18/2012] [Accepted: 09/08/2012] [Indexed: 11/25/2022]
|
67
|
Predicting secretory proteins of malaria parasite by incorporating sequence evolution information into pseudo amino acid composition via grey system model. PLoS One 2012. [PMID: 23189138 PMCID: PMC3506597 DOI: 10.1371/journal.pone.0049040] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023] Open
Abstract
The malaria disease has become a cause of poverty and a major hindrance to economic development. The culprit of the disease is the parasite, which secretes an array of proteins within the host erythrocyte to facilitate its own survival. Accordingly, the secretory proteins of malaria parasite have become a logical target for drug design against malaria. Unfortunately, with the increasing resistance to the drugs thus developed, the situation has become more complicated. To cope with the drug resistance problem, one strategy is to timely identify the secreted proteins by malaria parasite, which can serve as potential drug targets. However, it is both expensive and time-consuming to identify the secretory proteins of malaria parasite by experiments alone. To expedite the process for developing effective drugs against malaria, a computational predictor called "iSMP-Grey" was developed that can be used to identify the secretory proteins of malaria parasite based on the protein sequence information alone. During the prediction process a protein sample was formulated with a 60D (dimensional) feature vector formed by incorporating the sequence evolution information into the general form of PseAAC (pseudo amino acid composition) via a grey system model, which is particularly useful for solving complicated problems that are lack of sufficient information or need to process uncertain information. It was observed by the jackknife test that iSMP-Grey achieved an overall success rate of 94.8%, remarkably higher than those by the existing predictors in this area. As a user-friendly web-server, iSMP-Grey is freely accessible to the public at http://www.jci-bioinfo.cn/iSMP-Grey. Moreover, for the convenience of most experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematical equations involved in this paper.
Collapse
|
68
|
Kandaswamy KK, Pugalenthi G, Kalies KU, Hartmann E, Martinetz T. EcmPred: prediction of extracellular matrix proteins based on random forest with maximum relevance minimum redundancy feature selection. J Theor Biol 2012; 317:377-83. [PMID: 23123454 DOI: 10.1016/j.jtbi.2012.10.015] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2012] [Revised: 10/08/2012] [Accepted: 10/09/2012] [Indexed: 12/11/2022]
Abstract
The extracellular matrix (ECM) is a major component of tissues of multicellular organisms. It consists of secreted macromolecules, mainly polysaccharides and glycoproteins. Malfunctions of ECM proteins lead to severe disorders such as marfan syndrome, osteogenesis imperfecta, numerous chondrodysplasias, and skin diseases. In this work, we report a random forest approach, EcmPred, for the prediction of ECM proteins from protein sequences. EcmPred was trained on a dataset containing 300 ECM and 300 non-ECM and tested on a dataset containing 145 ECM and 4187 non-ECM proteins. EcmPred achieved 83% accuracy on the training and 77% on the test dataset. EcmPred predicted 15 out of 20 experimentally verified ECM proteins. By scanning the entire human proteome, we predicted novel ECM proteins validated with gene ontology and InterPro. The dataset and standalone version of the EcmPred software is available at http://www.inb.uni-luebeck.de/tools-demos/Extracellular_matrix_proteins/EcmPred.
Collapse
|
69
|
Li T, Li QZ. Annotating the protein-RNA interaction sites in proteins using evolutionary information and protein backbone structure. J Theor Biol 2012; 312:55-64. [PMID: 22874580 DOI: 10.1016/j.jtbi.2012.07.020] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2012] [Revised: 07/19/2012] [Accepted: 07/21/2012] [Indexed: 12/11/2022]
Abstract
RNA-protein interactions play important roles in various biological processes. The precise detection of RNA-protein interaction sites is very important for understanding essential biological processes and annotating the function of the proteins. In this study, based on various features from amino acid sequence and structure, including evolutionary information, solvent accessible surface area and torsion angles (φ, ψ) in the backbone structure of the polypeptide chain, a computational method for predicting RNA-binding sites in proteins is proposed. When the method is applied to predict RNA-binding sites in three datasets: RBP86 containing 86 protein chains, RBP107 containing 107 proteins chains and RBP109 containing 109 proteins chains, better sensitivities and specificities are obtained compared to previously published methods in five-fold cross-validation tests. In order to make further examination for the efficiency of our method, the RBP107 dataset is used as training set, RBP86 and RBP109 datasets are used as the independent test sets. In addition, as examples of our prediction, RNA-binding sites in a few proteins are presented. The annotated results are consistent with the PDB annotation. These results show that our method is useful for annotating RNA binding sites of novel proteins.
Collapse
Affiliation(s)
- Tao Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
| | - Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China.
| |
Collapse
|
70
|
Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC. Prediction of protein domain with mRMR feature selection and analysis. PLoS One 2012; 7:e39308. [PMID: 22720092 PMCID: PMC3376124 DOI: 10.1371/journal.pone.0039308] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2011] [Accepted: 05/17/2012] [Indexed: 11/30/2022] Open
Abstract
The domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop effective methods for predicting the protein domains according to the sequences information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from the sequence information still remains a challenging and elusive problem. Here, a new method was developed by combing the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than those by the existing method on the same benchmark dataset. Furthermore, it was revealed by an in-depth analysis that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool in annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impacts to the domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
Collapse
Affiliation(s)
- Bi-Qing Li
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Shanghai Center for Bioinformation Technology, Shanghai, China
| | - Le-Le Hu
- Institute of Systems Biology, Shanghai University, Shanghai, China
| | - Lei Chen
- College of Information Engineering, Shanghai Maritime University, Shanghai, China
| | - Kai-Yan Feng
- Shanghai Center for Bioinformation Technology, Shanghai, China
| | - Yu-Dong Cai
- Institute of Systems Biology, Shanghai University, Shanghai, China
- Gordon Life Science Institute, San Diego, California, United States of America
- * E-mail: (YDC) (YC); (KCC) (KC)
| | - Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, California, United States of America
- * E-mail: (YDC) (YC); (KCC) (KC)
| |
Collapse
|
71
|
Chen YZ, Chen Z, Gong YA, Ying G. SUMOhydro: a novel method for the prediction of sumoylation sites based on hydrophobic properties. PLoS One 2012; 7:e39195. [PMID: 22720073 PMCID: PMC3375222 DOI: 10.1371/journal.pone.0039195] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2011] [Accepted: 05/21/2012] [Indexed: 01/19/2023] Open
Abstract
Sumoylation is one of the most essential mechanisms of reversible protein post-translational modifications and is a crucial biochemical process in the regulation of a variety of important biological functions. Sumoylation is also closely involved in various human diseases. The accurate computational identification of sumoylation sites in protein sequences aids in experimental design and mechanistic research in cellular biology. In this study, we introduced amino acid hydrophobicity as a parameter into a traditional binary encoding scheme and developed a novel sumoylation site prediction tool termed SUMOhydro. With the assistance of a support vector machine, the proposed method was trained and tested using a stringent non-redundant sumoylation dataset. In a leave-one-out cross-validation, the proposed method yielded an excellent performance with a correlation coefficient, specificity, sensitivity and accuracy equal to 0.690, 98.6%, 71.1% and 97.5%, respectively. In addition, SUMOhydro has been benchmarked against previously described predictors based on an independent dataset, thereby suggesting that the introduction of hydrophobicity as an additional parameter could assist in the prediction of sumoylation sites. Currently, SUMOhydro is freely accessible at http://protein.cau.edu.cn/others/SUMOhydro/.
Collapse
Affiliation(s)
- Yong-Zi Chen
- Laboratory of Cancer Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China
- Tianjin Key Laboratory of Cancer Prevention and Therapy, Tianjin Municipal Science and Technology Commission, Tianjin, China
| | - Zhen Chen
- Bioinformatics Center, College of Biological Sciences, China Agricultural University, Beijing, China
| | - Yu-Ai Gong
- Laboratory of Cancer Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China
- Tianjin Key Laboratory of Cancer Prevention and Therapy, Tianjin Municipal Science and Technology Commission, Tianjin, China
| | - Guoguang Ying
- Laboratory of Cancer Cell Biology, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China
- Tianjin Key Laboratory of Cancer Prevention and Therapy, Tianjin Municipal Science and Technology Commission, Tianjin, China
| |
Collapse
|