1
MPPIF-Net: Identification of Plasmodium Falciparum Parasite Mitochondrial Proteins Using Deep Features with Multilayer Bi-directional LSTM. Processes (Basel) 2020. [DOI: 10.3390/pr8060725] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 12/29/2022] Open
Abstract
Mitochondrial proteins of Plasmodium falciparum (MPPF) are an important target for anti-malarial drugs, but identifying them through manual experimentation is costly, which in turn prolongs the production of related drugs by pharmaceutical institutions. It is therefore highly desirable for pharmaceutical companies to develop a computationally automated and reliable approach that identifies these proteins precisely, so that appropriate drugs can be produced in a timely manner. To this end, several computationally intelligent techniques have been developed that extract local features from biological sequences using machine learning methods, followed by various classifiers that discriminate the nature of the proteins. Unfortunately, these techniques perform poorly when capturing contextual features from sequence patterns, yielding non-representative classifiers. In this paper, we propose a sequence-based framework to extract deep and representative features that are trustworthy for identifying Plasmodium mitochondrial proteins. The backbone of the proposed framework is the MPPF identification network (MPPIF-Net), which is based on a convolutional neural network (CNN) with multilayer bi-directional long short-term memory (MBD-LSTM). MPPIF-Net takes protein sequences as input and passes them through several convolution and pooling layers to extract learned features. These features are then passed to our sequence learning mechanism, MBD-LSTM, which is trained to classify them into their relevant classes. The proposed model is evaluated experimentally on a newly prepared dataset, PF2095, and two existing benchmark datasets, PF175 and MPD, using the holdout method. It achieved 97.6%, 97.1%, and 99.5% testing accuracy on the PF2095, PF175, and MPD datasets, respectively, outperforming the state-of-the-art approaches.
2
Nithya V. SubmitoLoc: Identification of mitochondrial sub cellular locations of proteins using support vector machine. Bioinformation 2019; 15:863-868. [PMID: 32256006 PMCID: PMC7088428 DOI: 10.6026/97320630015863] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/28/2019] [Revised: 12/31/2019] [Accepted: 12/31/2019] [Indexed: 11/23/2022] Open
Abstract
Mitochondria are important sub-cellular organelles in eukaryotes. Defects in the mitochondrial system lead to a variety of diseases. Therefore, detailed knowledge of the mitochondrial proteome is vital for understanding the mitochondrial system and its function. Sequence databases contain a large number of mitochondrial proteins, but most of them are not annotated. In this study, we developed a support vector machine approach, SubmitoLoc, to predict the mitochondrial sub-cellular locations of proteins based on various sequence-derived properties. We evaluated the predictor using 10-fold cross-validation. Our method achieved 88.56% accuracy using all features. The average sensitivity and specificity for four-subclass prediction are 85.37% and 87.25%, respectively. The high prediction accuracy suggests that SubmitoLoc will be useful for researchers studying mitochondrial biology and drug discovery.
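The 10-fold cross-validation mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the folds were drawn, so the shuffled, near-equal split below is an assumption, and `k_fold_indices`, `cross_validate`, `train_fn`, and `predict_fn` are hypothetical names introduced only for illustration (any classifier, such as an SVM, could be plugged in).

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled folds whose sizes differ by at most 1."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the overall accuracy over k held-out folds.

    train_fn(train_x, train_y) -> model and predict_fn(model, x) -> label
    are caller-supplied placeholders (hypothetical, not from the paper).
    """
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_x = [s for i, s in enumerate(samples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        model = train_fn(train_x, train_y)
        correct += sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
    return correct / len(samples)
```

Each sample is predicted exactly once by a model that never saw it during training, which is what makes the reported accuracy an out-of-sample estimate.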
Affiliation(s)
- Varadharaju Nithya
- Department of Animal Health Management, Alagappa University, Karaikudi-630003, India
3
Hybrid model based on Genetic Algorithms and SVM applied to variable selection within fruit juice classification. ScientificWorldJournal 2014; 2013:982438. [PMID: 24453933 PMCID: PMC3874306 DOI: 10.1155/2013/982438] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 09/24/2013] [Accepted: 10/21/2013] [Indexed: 11/17/2022] Open
Abstract
Given the background of the use of Neural Networks in problems of apple juice classification, this paper aims to apply a newly developed method in the field of machine learning: Support Vector Machines (SVM). To that end, a hybrid model combining genetic algorithms and support vector machines is proposed, in which the SVM serves as the fitness function of the Genetic Algorithm (GA) so that the most representative variables for a specific classification problem can be selected.
4
Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011; 35:218-29. [PMID: 21864791 DOI: 10.1016/j.compbiolchem.2011.05.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 02/23/2011] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 12/18/2022]
Abstract
Precise information about protein locations in a cell facilitates the understanding of a protein's function and its interactions in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on the fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of the individual learning mechanisms, such as support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, which are trained using PseAAC based features, is first analyzed. Classifiers are then developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated on two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47% and 83.99% are obtained for the 12 and 14 subcellular location datasets, respectively. In the independent dataset test, prediction accuracies are 87.04% and 87.33% for the 12 and 14 class datasets, respectively.
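As an illustration of the dipeptide composition (DC) features named in this abstract, the sketch below computes the usual 400-dimensional vector of adjacent residue-pair frequencies. It is a minimal sketch under assumptions: the abstract gives no implementation details, so the handling of non-standard residues (skipped here) and the names `DIPEPTIDES` and `dipeptide_composition` are illustrative choices, not the authors' code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        pair = a + b
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

Unlike plain amino acid composition, this representation keeps some local sequence-order information, since each feature depends on two neighbouring residues.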
Affiliation(s)
- Asifullah Khan
- Department of Information and Computer Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.
5
Naveed M, Khan AU. GPCR-MPredictor: multi-level prediction of G protein-coupled receptors using genetic ensemble. Amino Acids 2011; 42:1809-23. [DOI: 10.1007/s00726-011-0902-6] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 09/03/2010] [Accepted: 03/26/2011] [Indexed: 11/27/2022]
6
Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition. Amino Acids 2011; 42:1443-54. [DOI: 10.1007/s00726-011-0888-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 09/08/2010] [Accepted: 03/09/2011] [Indexed: 12/15/2022]
7
Identification of RNA-binding sites in proteins by integrating various sequence information. Amino Acids 2010; 40:239-48. [DOI: 10.1007/s00726-010-0639-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 10/06/2009] [Accepted: 05/22/2010] [Indexed: 12/12/2022]
8
Using auto covariance method for functional discrimination of membrane proteins based on evolution information. Amino Acids 2009; 38:1497-503. [DOI: 10.1007/s00726-009-0362-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 04/24/2009] [Accepted: 09/24/2009] [Indexed: 11/29/2022]
9
Abstract
Beta-turn is a secondary protein structure type that plays an important role in protein configuration and function. Here, we introduce an approach to beta-turn prediction that uses the support vector machine (SVM) algorithm combined with predicted secondary structure information. The secondary structure information was obtained using E-SSpred, a new secondary protein structure prediction method. A 7-fold cross-validation on the benchmark dataset of 426 non-homologous protein chains was used to evaluate the performance of our method. The prediction results broke the 80% Q (total) barrier, achieving Q (total) = 80.9% and MCC = 0.44, with Q (predicted) 0.9% higher than that of the best existing method. Our results are consistent with the conclusion that beta-turn prediction accuracy can be improved by including secondary structure information.
10
Zeng YH, Guo YZ, Xiao RQ, Yang L, Yu LZ, Li ML. Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J Theor Biol 2009; 259:366-72. [PMID: 19341746 DOI: 10.1016/j.jtbi.2009.03.028] [Citation(s) in RCA: 139] [Impact Index Per Article: 9.3] [Received: 12/25/2008] [Revised: 02/25/2009] [Accepted: 03/13/2009] [Indexed: 12/20/2022]
Abstract
The submitochondria location of a mitochondrial protein is very important for further understanding the structure and function of this protein. Hence, it is of great practical significance to develop an automated and reliable method for timely identifying the submitochondria locations of novel mitochondrial proteins. In this study, a sequence-based algorithm combining the augmented Chou's pseudo amino acid composition (Chou's PseAA) based on auto covariance (AC) is developed to predict protein submitochondria locations and membrane protein types in mitochondria inner membrane. The model fully considers the sequence-order effects between residues a certain distance apart in the sequence by AC combined with eight representative descriptors for both common proteins and membrane proteins. As a result of jackknife cross-validation tests, the method for submitochondria location prediction yields the accuracies of 91.8%, 96.4% and 66.1% for inner membrane, matrix, and outer membrane, respectively. The total accuracy is 89.7%. When predicting membrane protein types in mitochondria inner membrane, the method achieves the prediction performance with the accuracies of 98.4%, 64.3% and 86.7% for multi-pass inner membrane, single-pass inner membrane, and matrix side inner membrane, where the total accuracy is 93.6%. The overall performance of our method is better than the achievements of the previous studies. So our method can be an effective supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://chemlab.scu.edu.cn/Predict_subMITO/index.htm.
Affiliation(s)
- Yu-hong Zeng
- College of Chemistry, Sichuan University, Chengdu 610064, PR China.
11
12
Xiao X, Lin WZ, Chou KC. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008; 29:2018-24. [PMID: 18381630 DOI: 10.1002/jcc.20955] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.0] [Indexed: 11/10/2022]
Abstract
Using the pseudo amino acid (PseAA) composition to represent the sample of a protein can incorporate a considerable amount of sequence pattern information so as to improve the prediction quality for its structural or functional classification. However, how to optimally formulate the PseAA composition is an important problem yet to be solved. In this article the grey modeling approach is introduced that is particularly efficient in coping with complicated systems such as the one consisting of many proteins with different sequence orders and lengths. On the basis of the grey model, four coefficients derived from each of the protein sequences concerned are adopted for its PseAA components. The PseAA composition thus formulated is called the "grey-PseAA" composition that can catch the essence of a protein sequence and better reflect its overall pattern. In our study we have demonstrated that introduction of the grey-PseAA composition can remarkably enhance the success rates in predicting the protein structural class. It is anticipated that the concept of grey-PseAA composition can be also used to predict many other protein attributes, such as subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, among many others.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
13
Xiao X, Wang P, Chou KC. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image. J Theor Biol 2008; 254:691-6. [PMID: 18634802 DOI: 10.1016/j.jtbi.2008.06.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 5.6] [Received: 04/02/2008] [Revised: 06/18/2008] [Accepted: 06/18/2008] [Indexed: 11/28/2022]
Abstract
A novel approach was developed for predicting the structural classes of proteins based on their sequences. It was assumed that proteins belonging to the same structural class must bear some sort of similar texture on the images generated by the cellular automaton evolving rule [Wolfram, S., 1984. Cellular automata as models of complexity. Nature 311, 419-424]. Based on this, two geometric invariant moment factors derived from the image functions were used as the pseudo amino acid components [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct., Funct., Genet. (Erratum: ibid., 2001, vol. 44, 60) 43, 246-255] to formulate the protein samples for statistical prediction. The success rates thus obtained on a previously constructed benchmark dataset are quite promising, implying that the cellular automaton image can help to reveal some inherent and subtle features deeply hidden in a pile of long and complicated amino acid sequences.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
14
Feng Y, Luo L. Use of tetrapeptide signals for protein secondary-structure prediction. Amino Acids 2008; 35:607-14. [PMID: 18431531 DOI: 10.1007/s00726-008-0089-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Received: 12/15/2007] [Accepted: 03/04/2008] [Indexed: 10/22/2022]
Abstract
This paper develops a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. The proposed TPIDQD method is based on tetra-peptide signals and is used to predict the structure of the central residue of a sequence fragment. The three-state overall per-residue accuracy (Q (3)) is about 80% in the threefold cross-validated test for 21-residue fragments in the CB513 dataset. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The results show the tetra-peptide signals can indeed reflect some relationship between an amino acid's sequence and its secondary structure, indicating the importance of tetra-peptide signals as the protein folding code in the protein structure prediction.
Affiliation(s)
- Yonge Feng
- Laboratory of Theoretical Biophysics, Faculty of Science and Technology, Inner Mongolia University, Hohhot, 010021, China.
15
Zhang SW, Chen W, Yang F, Pan Q. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 2008; 35:591-8. [PMID: 18427713 DOI: 10.1007/s00726-008-0086-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.4] [Received: 02/19/2008] [Accepted: 02/28/2008] [Indexed: 12/11/2022]
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
16
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework for protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan
17
Prediction of protein structure class by coupling improved genetic algorithm and support vector machine. Amino Acids 2008; 35:581-90. [DOI: 10.1007/s00726-008-0084-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 12/10/2007] [Accepted: 01/31/2008] [Indexed: 10/22/2022]
18
An ensemble of support vector machines for predicting the membrane protein type directly from the amino acid sequence. Amino Acids 2008; 35:573-80. [DOI: 10.1007/s00726-008-0083-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 01/21/2008] [Accepted: 02/26/2008] [Indexed: 11/26/2022]
19
Guo Y, Yu L, Wen Z, Li M. Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res 2008; 36:3025-30. [PMID: 18390576 PMCID: PMC2396404 DOI: 10.1093/nar/gkn159] [Citation(s) in RCA: 391] [Impact Index Per Article: 24.4] [Indexed: 01/10/2023] Open
Abstract
Compared to the available protein sequences of different organisms, the number of revealed protein–protein interactions (PPIs) is still very limited. Many computational methods have therefore been developed to facilitate the identification of novel PPIs. However, methods that use only protein sequence information are more universal than those that depend on additional information or predictions about the proteins. In this article, a sequence-based method is proposed by combining a new feature representation using auto covariance (AC) and support vector machine (SVM). AC accounts for the interactions between residues a certain distance apart in the sequence, so this method adequately takes the neighbouring effect into account. When applied to the PPI data of the yeast Saccharomyces cerevisiae, the method achieved a very promising prediction result. An independent data set of 11 474 yeast PPIs was used to evaluate this prediction model, and the prediction accuracy is 88.09%. The performance of this method is superior to that of the existing sequence-based methods, so it can be a useful supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://www.scucic.cn/Predict_PPI/index.htm.
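The auto covariance (AC) idea in this abstract, correlating a residue property with itself at a fixed distance along the sequence, can be sketched as follows. This is a minimal sketch under assumptions: the abstract does not list the descriptors or the normalisation used, so a single property (the Kyte-Doolittle hydropathy scale) stands in for them, and `auto_covariance`, `sequence_ac_features`, and `max_lag` are hypothetical names.

```python
# Kyte-Doolittle hydropathy values, used here only as an illustrative descriptor
# (an assumption; the abstract does not name the descriptors).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(values, max_lag):
    """AC(lag) = 1/(n - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean), lag = 1..max_lag."""
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def sequence_ac_features(sequence, scale=KYTE_DOOLITTLE, max_lag=3):
    """Map residues to property values (unknown residues are skipped), then compute AC."""
    values = [scale[r] for r in sequence.upper() if r in scale]
    return auto_covariance(values, max_lag)
```

Because the output length depends only on `max_lag` and the number of descriptors, proteins of different lengths map to fixed-length vectors that a classifier such as an SVM can consume.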
Affiliation(s)
- Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu 610064 and State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610041, P.R. China
20
An ensemble of reduced alphabets with protein encoding based on grouped weight for predicting DNA-binding proteins. Amino Acids 2008; 36:167-75. [DOI: 10.1007/s00726-008-0044-7] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 11/30/2007] [Accepted: 02/07/2008] [Indexed: 10/22/2022]
21
Lin H. The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. J Theor Biol 2008; 252:350-6. [PMID: 18355838 DOI: 10.1016/j.jtbi.2008.02.004] [Citation(s) in RCA: 201] [Impact Index Per Article: 12.6] [Received: 11/08/2007] [Revised: 12/02/2007] [Accepted: 02/04/2008] [Indexed: 11/15/2022]
Abstract
The outer membrane proteins (OMPs) are beta-barrel membrane proteins that perform many biological functions. Discriminating OMPs from non-OMPs is an important task for understanding some biochemical processes. In this study, a method that combines increment of diversity with a modified Mahalanobis Discriminant, called IDQD, is presented to predict 208 OMPs, 206 transmembrane helical proteins (TMHPs) and 673 globular proteins (GPs) using Chou's pseudo amino acid compositions as parameters. The overall accuracies of jackknife cross-validation are 93.2% for the three datasets (OMPs, TMHPs and GPs) and 96.1% for the two datasets (OMPs and non-OMPs). These results suggest that the method can be effectively applied to discriminate OMPs, TMHPs and GPs. They also indicate that the pseudo amino acid composition reflects the core features of membrane proteins better than the classical amino acid composition.
Affiliation(s)
- Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
22
Yang Y, Tantoso E, Li KB. Remote protein homology detection using recurrence quantification analysis and amino acid physicochemical properties. J Theor Biol 2008; 252:145-54. [PMID: 18342336 DOI: 10.1016/j.jtbi.2008.01.028] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 10/12/2007] [Revised: 11/29/2007] [Accepted: 01/26/2008] [Indexed: 11/29/2022]
Abstract
Remote homology detection refers to the detection of structure homology in evolutionarily related proteins with low sequence similarity. Supervised learning algorithms such as support vector machine (SVM) are currently the most accurate methods. In most of these SVM-based methods, efforts have been dedicated to developing new kernels to better use the pairwise alignment scores or sequence profiles. Moreover, amino acids' physicochemical properties are not generally used in the feature representation of protein sequences. In this article, we present a remote homology detection method that incorporates two novel features: (1) a protein's primary sequence is represented using amino acid's physicochemical properties and (2) the similarity between two proteins is measured using recurrence quantification analysis (RQA). An optimization scheme was developed to select different amino acid indices (up to 10 for a protein family) that are best to characterize the given protein family. The selected amino acid indices may enable us to draw better biological explanation of the protein family classification problem than using other alignment-based methods. An SVM-based classifier will then work on the space described by the RQA metrics. The classification scheme is named as SVM-RQA. Experiments at the superfamily level of the SCOP1.53 dataset show that, without using alignment or sequence profile information, the features generated from amino acid indices are able to produce results that are comparable to those obtained by the published state-of-the-art SVM kernels. In the future, better prediction accuracies can be expected by combining the alignment-based features with our amino acids property-based features. Supplementary information including the raw dataset, the best-performing amino acid indices for each protein family and the computed RQA metrics for all protein sequences can be downloaded from http://ym151113.ym.edu.tw/svm-rqa.
Affiliation(s)
- Yuchen Yang
- Institute of Molecular and Cell Biology, 61 Biopolis Drive, Singapore 138673, Singapore
23
Using Chou's pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location. Amino Acids 2008; 34:669-75. [PMID: 18256886 DOI: 10.1007/s00726-008-0034-9] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Received: 11/19/2007] [Accepted: 01/17/2008] [Indexed: 10/22/2022]
Abstract
Knowledge of subnuclear localization in eukaryotic cells is essential for understanding the life function of the nucleus. Developing prediction methods and tools for protein subnuclear localization has become an important research field in protein science because of the special characteristics of the cell nucleus. In this study, a novel approach is proposed to predict protein subnuclear localization. Each protein sample is represented by a Pseudo Amino Acid (PseAA) composition based on the approximate entropy (ApEn) concept, which reflects the complexity of a time series. A novel ensemble classifier is designed that incorporates three AdaBoost classifiers. The base classifier algorithms in the three AdaBoost classifiers are decision stumps, a fuzzy K nearest neighbors classifier, and radial basis-support vector machines, respectively. Different PseAA compositions are used as input data for the different AdaBoost classifiers in the ensemble. A genetic algorithm is used to optimize the dimension and weight factor of the PseAA composition. Two datasets often used in published works are used to validate the performance of the proposed approach. The results of the jackknife cross-validation test are higher and more balanced than those of other methods on the same datasets. These promising results indicate that the proposed approach is effective and practical, and it might become a useful tool in protein subnuclear localization. The software in Matlab and supplementary materials are freely available by contacting the corresponding author.
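The approximate entropy (ApEn) concept referred to in this abstract can be sketched with the standard definition, ApEn(m, r) = phi(m) - phi(m + 1), using the Chebyshev distance between windows and counting self-matches. The sketch takes a plain numeric series as input; how the authors turn a protein sequence into such a series, and which `m` and `r` they use, is not stated in the abstract, so the defaults below and the function name are assumptions.

```python
import math


def approximate_entropy(series, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) = phi(m) - phi(m + 1) of a numeric series."""
    n = len(series)
    if n < m + 1:
        raise ValueError("series must contain at least m + 1 points")

    def phi(dim):
        # All overlapping windows of length dim.
        windows = [series[i:i + dim] for i in range(n - dim + 1)]
        total = 0.0
        for w in windows:
            # Count windows within tolerance r of w (Chebyshev distance, self-match included).
            matches = sum(
                1 for v in windows
                if max(abs(a - b) for a, b in zip(w, v)) <= r
            )
            total += math.log(matches / len(windows))
        return total / len(windows)

    return phi(m) - phi(m + 1)
```

Highly regular series give values near zero, while irregular series give larger values, which is why ApEn can serve as a complexity descriptor. Note that this direct implementation is quadratic in the series length.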
24
Wang Y, Xue Z, Shen G, Xu J. PRINTR: Prediction of RNA binding sites in proteins using SVM and profiles. Amino Acids 2008; 35:295-302. [DOI: 10.1007/s00726-007-0634-9] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.8] [Received: 09/29/2007] [Accepted: 11/05/2007] [Indexed: 10/22/2022]
25
Yan C, Hu J, Wang Y. Discrimination of outer membrane proteins using a K-nearest neighbor method. Amino Acids 2008; 35:65-73. [PMID: 18219549 DOI: 10.1007/s00726-007-0628-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 09/17/2007] [Accepted: 10/28/2007] [Indexed: 11/29/2022]
Abstract
Identification of outer membrane proteins (OMPs) from genome is an important task. This paper presents a k-nearest neighbor (K-NN) method for discriminating outer membrane proteins (OMPs). The method makes predictions based on a weighted Euclidean distance that is computed from residue composition. The method achieves 89.1% accuracy with 0.668 MCC (Matthews correlation coefficient) in discriminating OMPs and non-OMPs. The performance of the method is improved by including homologous information into the calculation of residue composition. The final method achieves an accuracy of 96.1%, with 0.873 MCC, 87.5% sensitivity, and 98.2% specificity. Comparisons with multiple recently published methods show that the method proposed in this study outperforms the others.
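A K-NN prediction based on a weighted Euclidean distance over residue composition, as this abstract describes, might look like the sketch below. It is only a sketch under assumptions: the abstract does not say how the weights are chosen or how homologous information is folded in, so `weights` is a caller-supplied list of 20 numbers, and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_composition(sequence):
    """Return the 20-dimensional vector of standard amino acid frequencies."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    counts = Counter(residues)
    n = len(residues) or 1  # avoid division by zero for empty input
    return [counts[a] / n for a in AMINO_ACIDS]


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


def knn_predict(query, training, weights, k=3):
    """Majority label among the k training sequences closest to the query.

    training is a list of (sequence, label) pairs; weights are assumed given.
    """
    q = residue_composition(query)
    ranked = sorted(
        training,
        key=lambda item: weighted_distance(q, residue_composition(item[0]), weights),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Setting every weight to 1.0 recovers the ordinary Euclidean distance; larger weights make the corresponding residue types count more in the neighbour search.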
Affiliation(s)
- C Yan
- Department of Computer Science, Utah State University, Logan, UT 84322-4205, USA.
|
26
|
Nanni L, Lumini A. Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization. Amino Acids 2008; 34:653-60. [PMID: 18175047 DOI: 10.1007/s00726-007-0018-1] [Citation(s) in RCA: 124] [Impact Index Per Article: 7.8] [Received: 11/26/2007] [Accepted: 12/11/2007] [Indexed: 01/25/2023]
|
27
|
Nanni L, Lumini A. Combing ontologies and dipeptide composition for predicting DNA-binding proteins. Amino Acids 2008; 34:635-41. [PMID: 18175049 DOI: 10.1007/s00726-007-0016-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 11/18/2007] [Accepted: 12/06/2007] [Indexed: 12/11/2022]
Abstract
Given a novel protein, it is very important to know whether it is a DNA-binding protein, because DNA-binding proteins play a fundamental role in regulating gene expression. In this work, we propose a parallel fusion between a classifier trained using features extracted from the gene ontology database and a classifier trained using the dipeptide composition of the protein. The support vector machine (SVM) and the 1-nearest neighbour are used as classifiers. The Matthews correlation coefficient obtained by our fusion method is approximately 0.97 when jackknife cross-validation is used; this result outperforms the best performance reported in the literature (0.924) on the same dataset, where the SVM is trained using only Chou's pseudo amino acid based features. The area under the ROC curve (AUC) is also reported, and our results show that the fusion achieves an AUC of 0.995. In particular, our fusion obtains a 5% false-negative rate with a 0% false-positive rate. The Matthews correlation coefficient obtained using the single best GO number is only 0.7211, and hence it is not possible to use the gene ontology database as a simple lookup table. Finally, we test the complementarity of the two feature extraction methods using the Q-statistic. We obtain a value of 0.58, which means that the features extracted from the gene ontology database and the features extracted from the amino acid sequence are partially independent and that their parallel fusion should be studied further.
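Two quantities named in this abstract can be sketched briefly: the dipeptide composition of a sequence and a Q-statistic measuring how similarly two classifiers err. The Q-statistic below is the pairwise Yule's Q commonly used for classifier diversity; that the authors used exactly this form is an assumption, as are the function names and the choice to skip pairs containing non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values, keyed by pair)."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no standard dipeptides")
    return {pair: c / total for pair, c in counts.items()}


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-sample correctness flags.

    Q is near 1 when both classifiers succeed and fail on the same samples,
    near 0 when their errors are independent, and negative when they tend
    to fail on different samples.
    """
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q-statistic is undefined for these counts")
    return (n11 * n00 - n01 * n10) / denom
```

Under this definition, a value well below 1 indicates that the two classifiers make different mistakes, which is the property a fusion can exploit.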
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
|
28
|
Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids 2007; 34:565-72. [PMID: 18074191 DOI: 10.1007/s00726-007-0010-9] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.8] [Received: 10/23/2007] [Accepted: 11/15/2007] [Indexed: 11/24/2022]
Abstract
The rapidly increasing number of sequences entering genome databanks calls for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for revealing their functions in a timely manner and for conducting systems biology studies at the cellular level. Based on the concept of Chou's pseudo-amino acid composition, a series of useful information sources and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and a weighted auto-correlation function, were used to generate the pseudo-amino acid components that represent the protein samples. On this basis, a hybrid predictor was developed for assigning uncharacterized proteins to one of the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Compared with the results reported by previous investigators, higher success rates were obtained, suggesting that the current approach is quite promising and may become a useful high-throughput tool in the relevant areas.
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, No. 127 Youyi West Road, Xi'an 710072, China.
|