1
|
Alphonse AS, Mary NAB, Starvin MS. Classification of membrane protein using Tetra Peptide Pattern. Anal Biochem 2020; 606:113845. [PMID: 32739352 DOI: 10.1016/j.ab.2020.113845] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/16/2020] [Revised: 06/17/2020] [Accepted: 06/22/2020] [Indexed: 11/29/2022]
Abstract
Membrane proteins play an important role in the life activities of organisms. The mechanisms of cell structures and biological activities can be identified only by knowing the functional types of membrane proteins which accelerate the process. Therefore, computational approaches are needed for timely and accurate prediction of the functional types of membrane proteins. The proposed method analyzes the structure of membrane proteins using a novel Tetra Peptide Pattern (TPP)-based feature extraction technique. A frequency occurrence matrix is created, from which a feature vector is formed. This feature vector captures the pattern among amino acids in a membrane protein sequence. The dimension of the feature vector is reduced using General Kernel-based Supervised Principal Component Analysis (GKSPCA). Stacked Restricted Boltzmann Machines (RBMs), the building blocks of a Deep Belief Network (DBN), are used for classification. Performance was analyzed on two datasets using accuracy, specificity, sensitivity, and the Matthews correlation coefficient, and the proposed method achieves good results compared with other state-of-the-art techniques.
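As a rough illustration of the frequency-based feature idea described in this abstract, the sketch below counts overlapping four-residue windows in a sequence and normalizes them to relative frequencies. The abstract does not specify how the TPP frequency occurrence matrix is actually built, so the function name `tetrapeptide_frequencies` and the plain sliding-window counting are assumptions made only for illustration, not the authors' method.

```python
from collections import Counter


def tetrapeptide_frequencies(sequence: str) -> dict[str, float]:
    """Relative frequency of each overlapping 4-residue window.

    Hypothetical helper: a generic 4-mer count, not the published TPP scheme.
    """
    seq = sequence.upper()
    # Every window of length 4, stepping one residue at a time.
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    if not windows:
        return {}
    counts = Counter(windows)
    total = len(windows)
    return {pattern: n / total for pattern, n in counts.items()}


if __name__ == "__main__":
    for pattern, freq in sorted(tetrapeptide_frequencies("MKVLMKVL").items()):
        print(pattern, freq)
```

A sparse dictionary is used here because the full space of tetrapeptides over 20 amino acids has 160,000 entries, most of which are zero for any single protein.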
Affiliation(s)
- M S Starvin
- University College of Engineering, Nagercoil, 629004, India.
|
2
|
Chou KC. Advances in Predicting Subcellular Localization of Multi-label Proteins and its Implication for Developing Multi-target Drugs. Curr Med Chem 2019; 26:4918-4943. [PMID: 31060481 DOI: 10.2174/0929867326666190507082559] [Citation(s) in RCA: 78] [Impact Index Per Article: 13.0] [Received: 09/28/2018] [Revised: 01/29/2019] [Accepted: 01/31/2019] [Indexed: 12/16/2022]
Abstract
The smallest unit of life is a cell, which contains numerous protein molecules. Most of the functions critical to the cell’s survival are performed by these proteins located in its different organelles, usually called “subcellular locations”. Information on the subcellular localization of a protein can provide useful clues about its function. To reveal the intricate pathways at the cellular level, knowledge of the subcellular localization of proteins in a cell is a prerequisite. Therefore, one of the fundamental goals in molecular cell biology and proteomics is to determine the subcellular locations of proteins in an entire cell. It is also indispensable for prioritizing and selecting the right targets for drug development. Unfortunately, it is both time-consuming and costly to determine the subcellular locations of proteins purely by experiment. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop computational methods for rapidly and effectively identifying the subcellular locations of uncharacterized proteins based on their sequence information alone. Considerable progress has been achieved in this regard. This review focuses on methods that can deal with multi-label proteins, which may simultaneously exist in two or more subcellular location sites. Protein molecules with this characteristic are vitally important for finding multi-target drugs, a current hot trend in drug development. The review also focuses on methods that have user-friendly web servers, so that the majority of experimental scientists can use them to get the desired results without going through the detailed mathematics involved.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, United States
|
3
|
|
4
|
Jayapriya K, Mary NAB. Employing a novel 2-gram subgroup intra pattern (2GSIP) with stacked auto encoder for membrane protein classification. Mol Biol Rep 2019; 46:2259-2272. [PMID: 30778923 DOI: 10.1007/s11033-019-04680-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 10/29/2018] [Accepted: 02/07/2019] [Indexed: 12/01/2022]
Abstract
Cell membrane proteins play an essential role in regulating the behaviour of cells. Examination of amino acid sequences can offer useful insights into the tertiary structures of proteins and their biological functions. One of the important problems in amino acid analysis is the uncertainty in establishing a digital coding system that better reflects the properties of amino acids and their degeneracy. To overcome this limitation, the proposed method uses a novel representation of protein sequences that incorporates a new feature named 2-gram subgroup intra pattern (2GSIP). Classifying the functional types of membrane proteins helps explain their biological functions. For classification, a Stacked Auto Encoder deep learning method is applied. The performance of the proposed method is evaluated on two benchmark data sets using the self-consistency test, the jackknife test, and an independent data set test, with accuracy, specificity, sensitivity, and the Matthews correlation coefficient as measures; in these tests the proposed method outperformed other existing techniques commonly used in the literature.
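The four evaluation measures named in this abstract have standard definitions in terms of a binary confusion matrix. The sketch below computes them from counts of true/false positives and negatives; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative counts only.
    print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```

For a multi-class problem such as membrane protein typing, these measures are usually computed per class in a one-versus-rest fashion; whether the paper does so is not stated in the abstract.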
Affiliation(s)
- K Jayapriya
- Vin Solutions, Tirunelveli, Tamilnadu, India.
|
5
|
Protein Sequence Comparison Based on Physicochemical Properties and the Position-Feature Energy Matrix. Sci Rep 2017; 7:46237. [PMID: 28393857 PMCID: PMC5385872 DOI: 10.1038/srep46237] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 11/21/2016] [Accepted: 03/14/2017] [Indexed: 11/08/2022] Open
Abstract
We develop a novel position-feature-based model for protein sequences by employing physicochemical properties of the 20 amino acids and the measure of graph energy. The method puts the emphasis on sequence order information and describes local dynamic distributions of sequences, from which one can get a characteristic B-vector. Afterwards, we apply the relative entropy to the B-vectors representing the sequences to measure their similarity/dissimilarity. The numerical results obtained in this study show that the proposed method leads to meaningful results compared with competitors such as Clustal W.
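The relative entropy step mentioned in this abstract can be sketched as follows. The construction of the B-vector is not given in the abstract, so the inputs here are simply assumed to be non-negative vectors of equal length that are normalized to sum to 1. The names `relative_entropy` and `symmetric_divergence`, the `eps` guard against zero entries, and the symmetrization are all assumptions made for illustration.

```python
import math


def relative_entropy(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """Kullback-Leibler divergence D(p || q) after normalizing both vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    total = 0.0
    for a, b in zip(p, q):
        a, b = a / sum_p, b / sum_q
        if a > 0:
            # eps avoids division by zero when q has an empty component.
            total += a * math.log(a / max(b, eps))
    return total


def symmetric_divergence(p: list[float], q: list[float]) -> float:
    """Symmetrized form, so that the order of the two sequences does not matter."""
    return relative_entropy(p, q) + relative_entropy(q, p)


if __name__ == "__main__":
    print(symmetric_divergence([0.2, 0.3, 0.5], [0.25, 0.25, 0.5]))
```

A value of zero means the two normalized vectors are identical; larger values indicate greater dissimilarity.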
|
6
|
Kavianpour H, Vasighi M. Structural classification of proteins using texture descriptors extracted from the cellular automata image. Amino Acids 2016; 49:261-271. [DOI: 10.1007/s00726-016-2354-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 08/17/2016] [Accepted: 10/18/2016] [Indexed: 12/12/2022]
|
7
|
A new multi-label classifier in identifying the functional types of human membrane proteins. J Membr Biol 2014; 248:179-86. [PMID: 25433431 DOI: 10.1007/s00232-014-9755-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 09/17/2014] [Accepted: 11/11/2014] [Indexed: 10/24/2022]
Abstract
Membrane proteins are involved in various cellular processes and perform various important functions, which are mainly associated with their type. Given a membrane protein sequence, how can we identify its type(s)? In particular, how can we deal with the multi-type problem, since one membrane protein may simultaneously belong to two or more different types? To address these problems, which are important to both basic research and drug development, a new multi-label classifier was developed based on pseudo amino acid composition with the multi-label k-nearest neighbor algorithm. The success rate achieved by the new predictor on the benchmark dataset in the jackknife test is 73.94%, indicating that the method is promising and that the predictor may become a very useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying functional types of membrane proteins.
|
8
|
A Multi-label Classifier for Prediction Membrane Protein Functional Types in Animal. J Membr Biol 2014; 247:1141-8. [DOI: 10.1007/s00232-014-9708-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/14/2014] [Accepted: 07/14/2014] [Indexed: 11/26/2022]
|
9
|
Nanni L, Lumini A, Brahnam S. An empirical study of different approaches for protein classification. ScientificWorldJournal 2014; 2014:236717. [PMID: 25028675 PMCID: PMC4084589 DOI: 10.1155/2014/236717] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 03/24/2014] [Revised: 05/05/2014] [Accepted: 05/07/2014] [Indexed: 01/05/2023] Open
Abstract
Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position specific scoring matrix (PSSM) of the proteins, those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the different descriptors provide a performance that works well across all tested datasets, in some cases performing better than the state-of-the-art.
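The sum rule named in this abstract combines classifiers by adding their per-class scores and choosing the class with the largest total. A minimal sketch follows, assuming each classifier already outputs comparable scores (for example, normalized to a common range); the function name `sum_rule_fusion` and the example scores are hypothetical.

```python
def sum_rule_fusion(score_sets: list[list[float]]) -> tuple[int, list[float]]:
    """Fuse per-class scores from several classifiers by summation.

    score_sets[i][c] is the score classifier i assigns to class c.
    Returns the index of the winning class and the fused score vector.
    """
    n_classes = len(score_sets[0])
    if any(len(scores) != n_classes for scores in score_sets):
        raise ValueError("every classifier must score the same set of classes")
    fused = [sum(scores[c] for scores in score_sets) for c in range(n_classes)]
    winner = max(range(n_classes), key=fused.__getitem__)
    return winner, fused


if __name__ == "__main__":
    # Three classifiers (one per descriptor) scoring two classes.
    winner, fused = sum_rule_fusion([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
    print(winner, fused)
```

In the example, the first classifier alone would pick class 0, but the summed evidence from all three favors class 1, which is the kind of robustness across descriptors that fusion is meant to provide.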
Affiliation(s)
- Loris Nanni
- Dipartimento di Ingegneria dell'Informazione, Via Gradenigo 6/A, 35131 Padova, Italy
- Sheryl Brahnam
- Computer Information Systems, Missouri State University, 901 South National, Springfield, MO 65804, USA
|
10
|
An empirical study on the matrix-based protein representations and their combination with sequence-based approaches. Amino Acids 2012; 44:887-901. [PMID: 23108592 DOI: 10.1007/s00726-012-1416-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 03/15/2012] [Accepted: 10/03/2012] [Indexed: 10/27/2022]
Abstract
Many domains have a stake in the development of reliable systems for automatic protein classification. Of particular interest in recent studies of automatic protein classification is the exploration of new methods for extracting features from a protein that enhance classification for specific problems. These methods have proven very useful in one or two domains, but they have failed to generalize well across several domains (i.e. classification problems). In this paper, we evaluate several feature extraction approaches for representing proteins with the aim of sequence-based protein classification. Several protein representations are evaluated, starting from: the position specific scoring matrix (PSSM) of the proteins; the amino-acid sequence; and a matrix representation of the protein, of dimension (length of the protein) × 20, obtained using substitution matrices to represent each amino acid as a vector. A valuable result is that a texture descriptor can be extracted from the PSSM protein representation which improves the performance of standard descriptors based on the PSSM representation. Experimentally, we develop our systems by comparing several protein descriptors on nine different datasets. Each descriptor is used to train a support vector machine (SVM) or an ensemble of SVMs. Although different stand-alone descriptors work well on some datasets (but not on others), we have discovered that fusion among classifiers trained using different descriptors obtains a good performance across all the tested datasets. The Matlab code and datasets used in this paper are available at http://www.bias.csr.unibo.it\nanni\PSSM.rar.
|
11
|
Optimal atomic-resolution structures of prion AGAAAAGA amyloid fibrils. J Theor Biol 2011; 279:17-28. [DOI: 10.1016/j.jtbi.2011.02.012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/28/2010] [Revised: 02/05/2011] [Accepted: 02/16/2011] [Indexed: 11/20/2022]
|
12
|
Khan A, Majid A, Hayat M. CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Comput Biol Chem 2011; 35:218-29. [PMID: 21864791 DOI: 10.1016/j.compbiolchem.2011.05.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Received: 02/23/2011] [Revised: 05/17/2011] [Accepted: 05/18/2011] [Indexed: 12/18/2022]
Abstract
Precise information about protein locations in a cell facilitates the understanding of the function of a protein and its interaction in the cellular environment. This information further helps in the study of specific metabolic pathways and other biological processes. We propose an ensemble approach called "CE-PLoc" for predicting subcellular locations based on fusion of individual classifiers. The proposed approach utilizes features obtained from both dipeptide composition (DC) and amphiphilic pseudo amino acid composition (PseAAC) based feature extraction strategies. Different feature spaces are obtained by varying the dimensionality using PseAAC for a selected base learner. The performance of individual learning mechanisms such as support vector machine, nearest neighbor, probabilistic neural network, and covariant discriminant, trained using PseAAC based features, is first analyzed. Classifiers are then developed using the same learning mechanism but trained on PseAAC based feature spaces of varying dimensions. These classifiers are combined through a voting strategy, and an improvement in prediction performance is achieved. Prediction performance is further enhanced by developing CE-PLoc through the combination of different learning mechanisms trained on both the DC based feature space and PseAAC based feature spaces of varying dimensions. The predictive performance of the proposed CE-PLoc is evaluated on two benchmark datasets of protein subcellular locations using accuracy, MCC, and Q-statistics. Using the jackknife test, prediction accuracies of 81.47% and 83.99% are obtained for the 12 and 14 subcellular location datasets, respectively. In the independent dataset test, prediction accuracies are 87.04% and 87.33% for the 12 and 14 class datasets, respectively.
Affiliation(s)
- Asifullah Khan
- Department of Information and Computer Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.
|
13
|
Mahdavi A, Jahandideh S. Application of density similarities to predict membrane protein types based on pseudo-amino acid composition. J Theor Biol 2011; 276:132-7. [PMID: 21296088 DOI: 10.1016/j.jtbi.2011.01.048] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 08/23/2010] [Revised: 01/28/2011] [Accepted: 01/30/2011] [Indexed: 11/26/2022]
Abstract
Cell membranes provide the integrity of living cells. Although the stability of biological membranes is maintained by the lipid bilayer, membrane proteins perform most of the specific functions, such as signal transduction and transmembrane transport. It is therefore plausible that membrane proteins are attractive drug targets. In this article, based on the concept of using the pseudo-amino acid composition to define a protein, three different density similarities are developed for predicting the membrane protein type. The predicted results showed that the proposed approach can remarkably improve the accuracy, and might become a useful tool for predicting other attributes of proteins as well.
Affiliation(s)
- Abbas Mahdavi
- Department of Statistics, Faculty of Science, Shiraz University, Shiraz, Iran
|
14
|
|
15
|
Wu ZC, Xiao X, Chou KC. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. MOLECULAR BIOSYSTEMS 2011; 7:3287-97. [PMID: 21984117 DOI: 10.1039/c1mb05232b] [Citation(s) in RCA: 163] [Impact Index Per Article: 11.6] [Indexed: 11/21/2022]
Affiliation(s)
- Zhi-Cheng Wu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333046, China
|
16
|
Xiao X, Wang P, Chou KC. GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. MOLECULAR BIOSYSTEMS 2010; 7:911-9. [PMID: 21180772 DOI: 10.1039/c0mb00170h] [Citation(s) in RCA: 102] [Impact Index Per Article: 6.8] [Indexed: 12/25/2022]
Abstract
G protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. With the avalanche of newly generated protein sequences in the post-genomic age, to expedite the process of drug discovery, it is highly desirable to develop an automated method to rapidly identify GPCRs and their types. A new predictor was developed by hybridizing two different modes of pseudo-amino acid composition (PseAAC): the functional domain PseAAC and the low-frequency Fourier spectrum PseAAC. The new predictor is called GPCR-2L, where "2L" means that it is a two-layer predictor: the first-layer prediction engine identifies a query protein as GPCR or not; if it is, the prediction automatically continues to further identify it as belonging to one of the following six types: (1) rhodopsin-like (Class A), (2) secretin-like (Class B), (3) metabotropic glutamate/pheromone (Class C), (4) fungal pheromone (Class D), (5) cAMP receptor (Class E), or (6) frizzled/smoothened family (Class F). The overall success rate of GPCR-2L in identifying proteins as GPCRs or non-GPCRs is over 97.2%, while that in identifying GPCRs among their six types is over 97.8%. Such high success rates were derived by rigorous jackknife cross-validation on a stringent benchmark dataset, in which none of the included proteins had ≥40% pairwise sequence identity to any other protein in the same subset. As a user-friendly web-server, GPCR-2L is freely accessible to the public at http://icpr.jci.edu.cn/, by which one can obtain the 2-level results in about 20 s for a query protein sequence of 500 amino acids. The longer the sequence is, the more time it usually needs. The high success rates reported here indicate that identifying GPCRs and their types with functional domain information and low-frequency Fourier spectrum analysis is quite effective. It is anticipated that GPCR-2L may become a useful tool for both basic research and drug development in the areas related to GPCRs.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333403, China.
|
17
|
Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 2010; 273:236-47. [PMID: 21168420 PMCID: PMC7125570 DOI: 10.1016/j.jtbi.2010.12.024] [Citation(s) in RCA: 971] [Impact Index Per Article: 64.7] [Received: 10/21/2010] [Revised: 12/08/2010] [Accepted: 12/13/2010] [Indexed: 11/29/2022]
Abstract
With the accomplishment of human genome sequencing, the number of sequence-known proteins has increased explosively. In contrast, the pace is much slower in determining their biological attributes. As a consequence, the gap between sequence-known proteins and attribute-known proteins has become increasingly large. This unbalanced situation, which has critically limited our ability to utilize the newly discovered proteins for basic research and drug development in a timely manner, has called for developing computational methods or high-throughput automated tools for quickly and reliably identifying various attributes of uncharacterized proteins based on their sequence information alone. During the last two decades or so, many methods in this regard have been established in the hope of bridging such a gap. In the course of developing these methods, the following often needed to be considered: (1) benchmark dataset construction, (2) protein sample formulation, (3) operating algorithm (or engine), (4) anticipated accuracy, and (5) web-server establishment. In this review, we discuss each of the five procedures, with a special focus on the introduction of pseudo amino acid composition (PseAAC), its different modes and applications as well as its recent development, particularly in how to use the general formulation of PseAAC to reflect the core and essential features that are deeply hidden in complicated protein sequences.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar Drive, San Diego, CA 92130, USA.
|
18
|
A study of entropy/clarity of genetic sequences using metric spaces and fuzzy sets. J Theor Biol 2010; 267:95-105. [PMID: 20708019 DOI: 10.1016/j.jtbi.2010.08.010] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Received: 01/21/2010] [Revised: 07/22/2010] [Accepted: 08/06/2010] [Indexed: 11/22/2022]
Abstract
The study of genetic sequences is of great importance in biology and medicine. Sequence analysis and taxonomy are two major fields of application of bioinformatics. In the present paper we extend the notion of entropy and clarity to the use of different metrics and apply them in the case of the Fuzzy Polynucleotide Space (FPS). Applications of these notions to selected polynucleotides and complete genomes, both in the I(12×k) space and using their representation in FPS, are presented. Our results show that the values of fuzzy entropy/clarity are indicative of the degree of complexity necessary for the description of the polynucleotides in the FPS, although in the latter case the interpretation is slightly different than in the case of the I(12×k) hypercube. Fuzzy entropy/clarity along with the use of appropriate metrics can contribute to sequence analysis and taxonomy.
|
19
|
Yu L, Guo Y, Li Y, Li G, Li M, Luo J, Xiong W, Qin W. SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. J Theor Biol 2010; 267:1-6. [PMID: 20691704 DOI: 10.1016/j.jtbi.2010.08.001] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.5] [Received: 05/31/2010] [Revised: 07/30/2010] [Accepted: 08/01/2010] [Indexed: 11/17/2022]
Abstract
Protein secretion plays an important role in bacterial lifestyles. Secreted proteins are crucial for bacterial pathogenesis by making bacteria interact with their environments, particularly delivering pathogenic and symbiotic bacteria into their eukaryotic hosts. Therefore, identification of bacterial secreted proteins becomes an important process for the study of various diseases and the corresponding drugs. In this paper, by fusing several new features into Chou's pseudo-amino acid composition (PseAAC), two support vector machine (SVM)-based ternary classifiers are developed to predict secreted proteins of Gram-negative and Gram-positive bacteria. For the two types of bacteria, high accuracies of 94.03% and 94.36% are obtained in distinguishing classically secreted, non-classically secreted and non-secreted proteins by our method. In order to compare the practical ability of our method in identifying bacterial secreted proteins with those of six published methods, proteins in Escherichia coli and Bacillus subtilis are collected to construct the test sets of Gram-negative and Gram-positive bacteria, and the prediction results of our method are comparable to those of existing methods. When applied to two public independent data sets for predicting non-classically secreted proteins (NCSPs), it also yields satisfactory results for Gram-negative bacterial proteins. The prediction server SecretP can be accessed at http://cic.scu.edu.cn/bioinformatics/secretPV2/index.htm.
Affiliation(s)
- Lezheng Yu
- College of Chemistry, Sichuan University, Chengdu 610064, PR China
|
20
|
Ji G, Wu X, Shen Y, Huang J, Quinn Li Q. A classification-based prediction model of messenger RNA polyadenylation sites. J Theor Biol 2010; 265:287-96. [DOI: 10.1016/j.jtbi.2010.05.015] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 12/13/2009] [Revised: 03/21/2010] [Accepted: 05/13/2010] [Indexed: 12/30/2022]
|
21
|
Huang W, Zhang J, Wang Y, Huang D. A simple method to analyze the similarity of biological sequences based on the fuzzy theory. J Theor Biol 2010; 265:323-8. [DOI: 10.1016/j.jtbi.2010.05.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/29/2009] [Revised: 04/01/2010] [Accepted: 05/07/2010] [Indexed: 11/28/2022]
|
22
|
Abstract
The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Affiliation(s)
- Alberto Apostolico
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30318, USA.
|
23
|
Kurić L. Molecular biocoding of insulin. Adv Appl Bioinform Chem 2010; 3:45-58. [PMID: 21918626 PMCID: PMC3170004 DOI: 10.2147/aabc.s9994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
Abstract
This paper discusses cyberinformation studies of the amino acid composition of insulin, in particular the identification of scientific terminology that could describe this phenomenon, i.e., the study of genetic information, as well as the relationship between the genetic language of proteins and theoretical aspects of this system and cybernetics. The results of this research show that there is a matrix code for insulin. They also show that the coding system within the amino acid language gives detailed information, not only on the amino acid “record”, but also on its structure, configuration, and various shapes. The issue of the existence of an insulin code and coding of the individual structural elements of this protein are discussed. Answers to the following questions are sought. Does the matrix mechanism for biosynthesis of this protein function within the law of the general theory of information systems, and what is the significance of this for understanding the genetic language of insulin? What is the essence of the existence and functioning of this language? Is the genetic information characterized only by biochemical principles, or is it also characterized by cyberinformation principles? The potential effects of physical and chemical, as well as cybernetic and information principles, on the biochemical basis of insulin are also investigated. This paper discusses new methods for developing genetic technologies, in particular more advanced digital technology based on programming, cybernetics, and informational laws and systems, and how this new technology could be useful in medicine, bioinformatics, genetics, biochemistry, and other natural sciences.
Affiliation(s)
- Lutvo Kurić
- Novi Travnik, Kalinska, Bosnia and Herzegovina
|
24
|
Zheng X, Li C, Wang J. An information-theoretic approach to the prediction of protein structural class. J Comput Chem 2010; 31:1201-6. [PMID: 19777491 DOI: 10.1002/jcc.21406] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
An information-theoretical approach, which combines a sequence decomposition technique and a fuzzy clustering algorithm, is proposed for prediction of protein structural class. This approach bypasses the process of selecting and comparing sequence features required by previous methods. First, distances between each pair of protein sequences are estimated using a conditional decomposition technique from information theory. Then, the fuzzy k-nearest neighbor algorithm is used to identify the structural class of a protein given a set of sample sequences. To verify the strength of our method, we choose three widely used datasets constructed by Chou and Zhou. The jackknife test shows that our approach improves prediction accuracy over existing methods.
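As a rough illustration of the second step only, the following sketch shows a fuzzy k-nearest-neighbor vote over precomputed distances. It assumes crisp training labels and the common inverse-distance weighting with fuzzifier `m`; the function name `fuzzy_knn` and its parameters are hypothetical, and the distance values would come from whatever sequence distance is in use (the abstract's conditional decomposition measure is not reproduced here).

```python
from collections import defaultdict

def fuzzy_knn(distances, labels, k=3, m=2.0, eps=1e-12):
    """Classify a query from its distances to labelled training samples.

    distances[i] is the distance from the query to training sample i.
    Each of the k nearest samples votes with weight 1 / d^(2/(m-1)).
    """
    exponent = 2.0 / (m - 1.0)
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    weights = defaultdict(float)
    for i in nearest:
        # eps guards against division by zero for identical sequences
        weights[labels[i]] += 1.0 / (max(distances[i], eps) ** exponent)
    total = sum(weights.values())
    memberships = {label: w / total for label, w in weights.items()}
    best = max(memberships, key=memberships.get)
    return best, memberships
```

The returned membership values sum to one, so they can be read as a soft assignment over the classes seen among the k neighbors.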
Affiliation(s)
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
|
25
|
High performance set of PseAAC and sequence based descriptors for protein classification. J Theor Biol 2010; 266:1-10. [PMID: 20558184 DOI: 10.1016/j.jtbi.2010.06.006] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Received: 03/01/2010] [Revised: 05/31/2010] [Accepted: 06/02/2010] [Indexed: 11/21/2022]
Abstract
The study of reliable automatic systems for protein classification is important for several domains, including finding novel drugs and vaccines. The last decade has seen a number of advances in the development of reliable systems for classifying proteins. Of particular interest has been the exploration of new methods for extracting features from a protein that enhance classification for a given problem. Most methods developed to date, however, have been evaluated in only one or two application areas. Methods have not been explored that generalize well across a number of application areas and datasets. The aim of this study is to find a general method, or an ensemble of methods, that works well on different protein classification datasets and problems. Towards this end, we evaluate several feature extraction approaches for representing proteins starting from their amino acid sequence as well as different feature descriptor combinations using an ensemble of classifiers (support vector machines). In our experiments, more than ten different protein descriptors are compared using nine different datasets. We develop our system using a blind testing protocol, where the parameters of the system are optimized using one dataset and then validated using the other datasets (and so on for each dataset). Although different stand-alone classifiers work well on some datasets and not on others, we have discovered that fusion among different methods obtains a good performance across all the tested datasets, especially when using the weighted sum rule. Included in our feature descriptor combinations is the introduction of two new descriptors, one based on wavelets and the other based on amino acid groups. Using our system, both outperform their standard implementations. We also consider as a baseline the simple amino acid composition (AC) and dipeptide composition (2G), since they have been widely used for protein classification. Our proposed method outperforms AC and 2G.
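The two baseline descriptors named at the end of this abstract are simple enough to sketch. The code below is a minimal illustration, assuming the standard 20-letter amino acid alphabet and relative-frequency normalization; the function names are hypothetical and the study's own implementation may differ in detail.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(seq):
    """AC: 20 values, the relative frequency of each amino acid."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """2G: 400 values, the relative frequency of each adjacent residue pair."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in DIPEPTIDES]
```

AC discards all order information, while 2G keeps only the order of immediate neighbors, which is why both serve as baselines for richer descriptors.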
|
26
|
Yan S, Wu G. Linking mutated primary structure of adrenoleukodystrophy protein with X-linked adrenoleukodystrophy. Comput Methods Biomech Biomed Engin 2010; 13:403-11. [DOI: 10.1080/10255840903279974] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
27
|
Wang S, Tian F, Qiu Y, Liu X. Bilateral similarity function: a novel and universal method for similarity analysis of biological sequences. J Theor Biol 2010; 265:194-201. [PMID: 20399215 DOI: 10.1016/j.jtbi.2010.04.013] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 12/17/2009] [Revised: 04/11/2010] [Accepted: 04/12/2010] [Indexed: 11/26/2022]
Abstract
In this paper, a bilateral similarity function is designed for analyzing the similarities of biological sequences such as DNA, RNA secondary structure, or protein. The function compares sequences comprehensively and remarkably well, in terms of both the Hamming distance between the two compared sequences and the corresponding location difference. Compared with existing methods for similarity analysis, the examination of similarities/dissimilarities illustrates that the proposed method, with a computational complexity of O(N), is effective for these three kinds of biological sequences and applies universally to them.
Affiliation(s)
- Shiyuan Wang
- College of Communication Engineering, Chongqing University, Chongqing 400044, China.
|
28
|
Nanni L, Shi JY, Brahnam S, Lumini A. Protein classification using texture descriptors extracted from the protein backbone image. J Theor Biol 2010; 264:1024-32. [PMID: 20307550 DOI: 10.1016/j.jtbi.2010.03.020] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 12/07/2009] [Revised: 01/28/2010] [Accepted: 03/11/2010] [Indexed: 10/19/2022]
Abstract
In this work, we propose a method for protein classification that combines different texture descriptors extracted from the 2-D distance matrix obtained from the 3-D tertiary structure of a given protein. Instead of considering all atoms in the protein, the distance matrix is calculated by considering only those atoms that belong to the protein backbone. The positive results reported in this paper offer further experimental confirmation that the distance matrix contains sufficient information for describing a protein. Moreover, we show that combining features extracted from the primary structure with features extracted from the distance matrix increases the performance of our classification system. We demonstrate this finding by comparing the performance of an ensemble of classifiers that uses the combined features. The classifiers used in our experiments are support vector machines and random subspace of support vector machines. The experimental results, validated using three different datasets (protein fold recognition; DNA-binding protein recognition; and biological process and molecular function recognition) along with different texture feature extraction methods (variants of local binary patterns, Radon feature transform based approaches, and Haralick descriptors), demonstrate the effectiveness of the proposed approach. Particularly interesting are the results in the classification of 27 types of structural properties: our proposed approach achieves significant improvement compared with other reported methods.
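To make the starting point of this method concrete, the sketch below builds a pairwise Euclidean distance matrix from backbone atoms only. It is an illustration under stated assumptions: atoms are given as `(name, (x, y, z))` tuples, and the backbone is taken to be the atoms named N, CA, and C; the abstract does not say which atom set the authors used, and the function name is hypothetical.

```python
import math

# Assumption: backbone = amide nitrogen, alpha carbon, carbonyl carbon.
BACKBONE_ATOMS = {"N", "CA", "C"}

def backbone_distance_matrix(atoms):
    """Return the square matrix of Euclidean distances between backbone atoms.

    atoms is a list of (atom_name, (x, y, z)) tuples in chain order;
    side-chain atoms are filtered out before distances are computed.
    """
    coords = [xyz for name, xyz in atoms if name in BACKBONE_ATOMS]
    return [[math.dist(a, b) for b in coords] for a in coords]
```

The resulting matrix is symmetric with a zero diagonal and can be treated as a grayscale image, which is what makes texture descriptors applicable to it.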
Affiliation(s)
- Loris Nanni
- DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy.
|
29
|
Xiao X, Wang P, Chou KC. Quat-2L: a web-server for predicting protein quaternary structural attributes. Mol Divers 2010; 15:149-55. [PMID: 20148364 DOI: 10.1007/s11030-010-9227-8] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 11/13/2009] [Accepted: 01/21/2010] [Indexed: 11/24/2022]
Abstract
By hybridizing the functional-domain and sequence-correlated pseudo amino acid composition approaches, a 2-layer predictor called "Quat-2L" was developed for predicting the quaternary structural attribute of a protein according to its sequence information alone. The 1st layer identifies the query protein as a monomer, homo-oligomer, or hetero-oligomer. If the result thus obtained is homo-oligomer or hetero-oligomer, the prediction automatically continues to identify it as belonging to one of the following six subtypes: (1) dimer, (2) trimer, (3) tetramer, (4) pentamer, (5) hexamer, and (6) octamer. The overall success rate of Quat-2L for the 1st layer identification was 71.14%, while the overall success rates of the 2nd layer for homo-oligomers and hetero-oligomers were 76.91 and 82.52%, respectively. These rates were derived by jackknife cross-validation tests on a stringent benchmark data set in which none of the proteins has ≥ 60% pairwise sequence identity to any other in the same subset. As a web-server, Quat-2L is freely accessible to the public via http://icpr.jci.jx.cn/bioinfo/Quat-2L, where one can get 2-level results in about 15 s.
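The two-layer control flow described above can be sketched independently of the underlying classifiers. In this minimal illustration, `first_layer` and the entries of `second_layers` are hypothetical callables standing in for trained models; nothing here reproduces the Quat-2L feature encoding or its actual prediction engines.

```python
def two_layer_predict(features, first_layer, second_layers):
    """Run the first-layer classifier, then a class-specific second layer.

    first_layer maps features to a top-level label. second_layers maps
    those top-level labels that have subtypes to a subtype classifier.
    Labels without a second-layer entry (e.g. "monomer") stop at layer 1.
    """
    top = first_layer(features)
    if top not in second_layers:
        return top, None
    return top, second_layers[top](features)

# Hypothetical stand-in classifiers, for demonstration only.
demo_first = lambda f: "homo-oligomer" if f[0] > 0.5 else "monomer"
demo_second = {
    "homo-oligomer": lambda f: "dimer",
    "hetero-oligomer": lambda f: "trimer",
}
```

Keeping a separate second-layer model per top-level class mirrors the abstract's separate success rates for homo-oligomers and hetero-oligomers.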
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen, 333001, China.
|
30
|
Nanni L, Lumini A. Coding of amino acids by texture descriptors. Artif Intell Med 2010; 48:43-50. [PMID: 19892537 DOI: 10.1016/j.artmed.2009.10.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 12/12/2008] [Revised: 09/24/2009] [Accepted: 10/03/2009] [Indexed: 11/26/2022]
|
31
|
Protein location prediction using atomic composition and global features of the amino acid sequence. Biochem Biophys Res Commun 2010; 391:1670-4. [DOI: 10.1016/j.bbrc.2009.12.118] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 12/14/2009] [Accepted: 12/21/2009] [Indexed: 11/17/2022]
|
32
|
Yan SM, Wu G. Trends in global warming and evolution of matrix protein 2 family from influenza A virus. Interdiscip Sci 2009; 1:272-9. [PMID: 20640805 PMCID: PMC7091293 DOI: 10.1007/s12539-009-0053-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/01/2009] [Revised: 05/22/2009] [Accepted: 05/25/2009] [Indexed: 05/29/2023]
Abstract
Global warming is an important factor affecting biological evolution, and influenza is an important disease that threatens humans with possible epidemics or pandemics. In this study, we attempted to analyze the trends in global warming and the evolution of the matrix protein 2 family from influenza A virus, because this protein is a target of anti-flu drugs and its mutation would have a significant effect on resistance to these drugs. The evolution of matrix protein 2 of influenza A virus from 1959 to 2008 was defined using the unpredictable portion of amino-acid pair predictability. The trend in this evolution was then compared with the trends in the global temperature, the temperature in the northern and southern hemispheres, the temperature at the influenza A virus sampling sites, and the species carrying influenza A virus. The results showed similar trends in global warming and in the evolution of M2 proteins, although we could not correlate them at this stage of the study. The study suggested a potential impact of global warming on the evolution of proteins from influenza A virus.
Affiliation(s)
- Shao-Min Yan
- National Engineering Research Center for Non-food Biorefinery, Guangxi Academy of Sciences, Nanning, Guangxi 530007 P.R. China
| | - Guang Wu
- Computational Mutation Project, DreamSciTech Consulting, Shenzhen, Guangdong, 518054 P.R. China
|
33
|
Shine Y, Kikuchi T. Estimation of relative binding free energy based on a free energy variational principle for quantitative structure activity relationship analyses. Chem Phys 2009. [DOI: 10.1016/j.chemphys.2009.09.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/26/2022]
|
34
|
Wang T, Xia T, Hu XM. Geometry preserving projections algorithm for predicting membrane protein types. J Theor Biol 2009; 262:208-13. [PMID: 19800352 DOI: 10.1016/j.jtbi.2009.09.027] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 10/23/2008] [Revised: 09/22/2009] [Accepted: 09/22/2009] [Indexed: 10/20/2022]
Abstract
Given a new uncharacterized protein sequence, a biologist may want to know whether it is a membrane protein and, if it is, which membrane protein type it belongs to. Because knowing the type of an uncharacterized membrane protein often provides useful clues for finding the biological function of the query protein, developing computational methods to address these questions can be really helpful. In this study, a sequence encoding scheme based on combining the pseudo position-specific score matrix (PsePSSM) and dipeptide composition (DC) is introduced to represent protein samples. However, this sequence encoding scheme corresponds to a very high-dimensional feature vector. A dimensionality reduction algorithm, the so-called geometry preserving projections (GPP), is introduced to extract the key features from the high-dimensional space and reduce the original high-dimensional vector to a lower-dimensional one. Finally, the K-nearest neighbor (K-NN) and support vector machine (SVM) classifiers are employed to identify the types of membrane proteins based on their reduced low-dimensional features. The jackknife and independent dataset test results thus obtained are quite encouraging, indicating that the above methods deal effectively with the complicated problem of predicting the membrane protein type.
Affiliation(s)
- Tong Wang
- Institute of Computer and Information, Shanghai Second Polytechnic University, Shanghai 201209, China.
|
35
|
Lin WZ, Xiao X, Chou KC. GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis. Protein Eng Des Sel 2009; 22:699-705. [PMID: 19776029 DOI: 10.1093/protein/gzp057] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022]
Abstract
G-protein-coupled receptors (GPCRs) play fundamental roles in regulating various physiological processes as well as the activity of virtually all cells. Different GPCR families are responsible for different functions. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop an automated method to address two problems: given the sequence of a query protein, can we identify whether it is a GPCR? If it is, what family class does it belong to? Here, a two-layer ensemble classifier called GPCR-GIA was proposed by introducing a novel scale called 'grey incident degree'. The overall success rate of GPCR-GIA in identifying GPCR and non-GPCR was about 95%, and that in identifying the GPCRs among their nine family classes was about 80%. These rates were obtained by jackknife cross-validation tests on stringent benchmark data sets where none of the proteins has ≥ 50% pairwise sequence identity to any other in the same class. Moreover, a user-friendly web-server was established at http://218.65.61.89:8080/bioinfo/GPCR-GIA. For users' convenience, a step-by-step guide on how to use the GPCR-GIA web server is provided. Generally speaking, one can get the desired two-level results in around 10 s for a query protein sequence of 300-400 amino acids; the longer the sequence is, the more time is needed.
Affiliation(s)
- Wei-Zhong Lin
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333001, China
|
36
|
Xiao X, Wang P, Chou KC. GPCR-CA: A cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 2009; 30:1414-23. [PMID: 19037861 DOI: 10.1002/jcc.21163] [Citation(s) in RCA: 126] [Impact Index Per Article: 7.9] [Indexed: 11/06/2022]
Abstract
Given an uncharacterized protein sequence, how can we identify whether it is a G-protein-coupled receptor (GPCR) or not? If it is, which functional family class does it belong to? It is important to address these questions because GPCRs are among the most frequent targets of therapeutic drugs and the information thus obtained is very useful for "comparative and evolutionary pharmacology," a technique often used for drug development. Here, we present a web-server predictor called "GPCR-CA," where "CA" stands for "Cellular Automaton" (Wolfram, S. Nature 1984, 311, 419), meaning that the CA images have been utilized to reveal the pattern features hidden in piles of long and complicated protein sequences. Meanwhile, the gray-level co-occurrence matrix factors extracted from the CA images are used to represent the samples of proteins through their pseudo amino acid composition (Chou, K.C. Proteins 2001, 43, 246). GPCR-CA is a two-layer predictor: the first-layer prediction engine identifies a query protein as GPCR or non-GPCR; if it is a GPCR protein, the process automatically continues with the second-layer prediction engine to further identify its type among the following six functional classes: (a) rhodopsin-like, (b) secretin-like, (c) metabotropic/glutamate/pheromone, (d) fungal pheromone, (e) cAMP receptor, and (f) frizzled/smoothened family. The overall success rates of the predictor for the first and second layers are over 91% and 83%, respectively; these were obtained through rigorous jackknife cross-validation tests on a newly constructed stringent benchmark dataset in which none of the proteins has ≥40% pairwise sequence identity to any other in the same subset. GPCR-CA is freely accessible at http://218.65.61.89:8080/bioinfo/GPCR-CA, by which one can get the desired two-layer results for a query protein sequence within about 20 seconds.
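The gray-level co-occurrence matrix mentioned above is a standard texture statistic, and a minimal sketch may help readers unfamiliar with it. The code assumes the image is a 2-D list of integer gray levels in `range(levels)` and counts ordered pixel pairs at a single offset; the function name, the default offset, and the lack of symmetrization or normalization are illustrative choices, not details taken from GPCR-CA.

```python
def gray_level_cooccurrence(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy).

    image is a list of rows of integers in range(levels).
    Returns a levels x levels matrix of raw (unnormalized) counts.
    """
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix
```

Texture factors such as contrast or energy are then computed from this matrix, usually after dividing by the total count.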
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 33300, China.
|
37
|
Cao H, Xie HZ, Zhang W, Wang K, Li W, Liu CQ. Dynamic extended folding: modeling the RNA secondary structures during co-transcriptional folding. J Theor Biol 2009; 261:93-9. [PMID: 19643109 DOI: 10.1016/j.jtbi.2009.07.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 09/16/2008] [Revised: 07/08/2009] [Accepted: 07/15/2009] [Indexed: 02/02/2023]
Abstract
For RNA secondary structure prediction, an important issue is how to deal with co-transcriptional folding during RNA synthesis in the cell. On one hand, co-transcriptional folding leads to the correct final structure of the whole RNA molecule. On the other hand, it may form the recognition sites for the progress of the transcription. Considering the hurdles in the experimental determination of RNA folding structures, we proposed a so-called "dynamic extended folding simulation" approach. We used two human pre-mRNA samples, the first functional alpha-gene HBZ and the fifth beta-gene HBB, to "display" the co-transcriptional folding images in detail. The modeling process starts from the prediction of a 30-nucleotide (nt) sequence; in each update the sequence is then extended by 30 nts, i.e., 1-30, 1-60, 1-90, 1-120, ..., 1-1651 nts (for HBB, 1-1606 nts). We selected the RNAstructure program to predict the folding secondary structures of all the segments. We defined the "hairpin" as the unit of the secondary structure and analyzed the states of such units during the sequential dynamic extended folding processes. We found that some hairpins are "conserved", i.e., once such a hairpin appears, it is always present in the subsequent foldings. Some hairpins are present in only part of the folding segments, and some hairpins appear only once or twice. This phenomenon vividly depicts the generation and adjustment of the temporal structural units during the co-transcriptional folding process. It is these "hairpins" that support the thermodynamically stable structure at the end of the RNA synthesis. They may also play a role in the RNA splicing process and even in the folding structure of the synthesized protein.
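The growing-prefix schedule described above (1-30, 1-60, and so on up to the full length) can be sketched as a small helper. This is only an illustration of how the segment endpoints might be generated; the function name is hypothetical, the final partial step is assumed to end at the full sequence length, and the structure prediction itself (done with RNAstructure in the paper) is not included.

```python
def extension_endpoints(total_length, step=30):
    """Return the end positions of the prefixes 1-step, 1-2*step, ..., 1-total_length."""
    ends = list(range(step, total_length + 1, step))
    if not ends or ends[-1] != total_length:
        ends.append(total_length)  # last segment covers the whole transcript
    return ends

def prefixes(sequence, step=30):
    """Yield each growing prefix of the sequence, ready to be folded."""
    for end in extension_endpoints(len(sequence), step):
        yield sequence[:end]
```

Each prefix would be passed to a folding program in turn, and hairpins could then be tracked across consecutive predictions.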
Affiliation(s)
- Huai Cao
- Modern Biological Research Center, Yunnan University, Kunming 650091, China.
|
38
|
Liu T, Zheng X, Wang J. Prediction of protein structural class using a complexity-based distance measure. Amino Acids 2009; 38:721-8. [PMID: 19330425 DOI: 10.1007/s00726-009-0276-1] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Received: 11/25/2008] [Accepted: 03/11/2009] [Indexed: 11/30/2022]
Abstract
Knowledge of structural class plays an important role in understanding protein folding patterns, so it is necessary to develop effective and reliable computational methods for prediction of protein structural class. To this end, we present a new method called NN-CDM, a nearest neighbor classifier with a complexity-based distance measure. Instead of extracting features from protein sequences as done previously, the distance between each pair of protein sequences is directly evaluated by a complexity measure of symbol sequences. Then the nearest neighbor classifier is adopted as the predictive engine. To verify the performance of this method, jackknife cross-validation tests are performed on several benchmark datasets. Results show that our approach achieves higher prediction accuracy than some classical methods.
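The general idea of a feature-free, complexity-based distance followed by a nearest-neighbor decision can be sketched as follows. Note the substitution: this example uses the normalized compression distance with `zlib` as a readily available stand-in for sequence complexity, which is not the complexity measure defined in the NN-CDM paper. All function names are hypothetical.

```python
import zlib

def compressed_size(text):
    """Approximate the complexity of a sequence by its compressed length."""
    return len(zlib.compress(text.encode("ascii"), 9))

def ncd(x, y):
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def nearest_neighbor_class(query, training):
    """training is a list of (sequence, label); return the closest label."""
    return min(training, key=lambda item: ncd(query, item[0]))[1]
```

Because the distance is computed directly on the raw sequences, no feature vector has to be designed, which is the property the abstract emphasizes.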
Affiliation(s)
- Taigang Liu
- Department of Applied Mathematics, Dalian University of Technology, 116024 Dalian, China.
|
39
|
Using the nonlinear dimensionality reduction method for the prediction of subcellular localization of Gram-negative bacterial proteins. Mol Divers 2009; 13:475-81. [PMID: 19330461 DOI: 10.1007/s11030-009-9134-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 12/20/2008] [Accepted: 02/25/2009] [Indexed: 10/21/2022]
|
40
|
Georgiou D, Karakasidis T, Nieto J, Torres A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. J Theor Biol 2009; 257:17-26. [DOI: 10.1016/j.jtbi.2008.11.003] [Citation(s) in RCA: 132] [Impact Index Per Article: 8.3] [Received: 02/27/2008] [Revised: 10/14/2008] [Accepted: 11/01/2008] [Indexed: 11/25/2022]
|
41
|
Gao QB, Jin ZC, Ye XF, Wu C, He J. Prediction of nuclear receptors with optimal pseudo amino acid composition. Anal Biochem 2009; 387:54-9. [PMID: 19454254 DOI: 10.1016/j.ab.2009.01.018] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 09/06/2008] [Revised: 12/04/2008] [Accepted: 01/09/2009] [Indexed: 10/21/2022]
Abstract
Nuclear receptors are involved in multiple cellular signaling pathways that affect and regulate processes such as organ development and maintenance, ion transport, homeostasis, and apoptosis. In this article, an optimal pseudo amino acid composition based on physicochemical characters of amino acids is suggested to represent proteins for predicting the subfamilies of nuclear receptors. Six physicochemical characters of amino acids were adopted to generate the protein sequence features via web server PseAAC. The optimal values of the rank of correlation factor and the weighting factor about PseAAC were determined to get the appropriate descriptor of proteins that leads to the best performance. A nonredundant dataset of nuclear receptors in four subfamilies is constructed to evaluate the method using support vector machines. An overall accuracy of 99.6% was achieved in the fivefold cross-validation test as well as the jackknife test, and an overall accuracy of 98.4% was reached in a blind dataset test. The performance is very competitive with that of some previous methods.
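For readers unfamiliar with the two tuned parameters, the commonly cited (type 1) form of the pseudo amino acid composition is reproduced here as a reminder. It is assumed, not confirmed by the abstract, that the "rank of correlation factor" corresponds to λ and the "weighting factor" to w in this formulation. Here L is the sequence length, f_i the normalized frequency of amino acid i, and H_m the m-th normalized physicochemical property (n = 6 in this study).

```latex
\theta_k = \frac{1}{L-k}\sum_{i=1}^{L-k}\Theta(R_i, R_{i+k}),
\qquad k = 1, \dots, \lambda,\quad \lambda < L

\Theta(R_i, R_j) = \frac{1}{n}\sum_{m=1}^{n}\bigl[H_m(R_j) - H_m(R_i)\bigr]^2

x_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 1 \le u \le 20,\\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w\sum_{j=1}^{\lambda}\theta_j}, & 21 \le u \le 20+\lambda.
\end{cases}
```

Under this reading, a larger λ adds longer-range sequence-order terms and a larger w gives those terms more influence relative to plain composition, which is why both need to be tuned.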
Affiliation(s)
- Qing-Bin Gao
- Department of Health Statistics, Second Military Medical University, Shanghai 200433, China
|
42
|
Nanni L, Mazzara S, Pattini L, Lumini A. Protein classification combining surface analysis and primary structure. Protein Eng Des Sel 2009; 22:267-72. [DOI: 10.1093/protein/gzn084] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
|
43
|
Prediction of subcellular location apoptosis proteins with ensemble classifier and feature selection. Amino Acids 2008; 38:975-83. [PMID: 19048186 DOI: 10.1007/s00726-008-0209-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 09/09/2008] [Accepted: 11/03/2008] [Indexed: 10/21/2022]
Abstract
Apoptosis proteins have a central role in the development and the homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. The function of an apoptosis protein is closely related to its subcellular location. It is crucial to develop powerful tools to predict apoptosis protein locations because of the rapidly increasing gap between the number of known structural proteins and the number of known sequences in protein databanks. In this study, amino acid pair compositions with different spaces are used to construct feature sets for representing protein samples. A feature selection approach based on binary particle swarm optimization is applied to extract effective features. An ensemble classifier is used as the prediction engine, of which the basic classifier is the fuzzy K-nearest neighbor. Each basic classifier is trained with a different feature set. Two datasets often used in prior works are selected to validate the performance of the proposed approach. The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a potentially useful tool for predicting the subcellular location of apoptosis proteins, or at least can play a complementary role to the existing methods in the relevant areas. The supplementary information and software written in Matlab are available by contacting the corresponding author.
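The pair compositions "with different spaces" can be illustrated with a gapped pair counter. This sketch assumes that a space of g means g residues lie between the two paired positions and that counts are normalized to relative frequencies; the function name and the sparse dictionary output are illustrative choices rather than the authors' format.

```python
from collections import Counter

def gapped_pair_composition(seq, gap):
    """Relative frequency of residue pairs separated by `gap` residues.

    gap=0 gives ordinary adjacent dipeptides; gap=1 pairs positions
    i and i+2, and so on. Only pairs that occur are returned.
    """
    span = gap + 1
    pairs = [seq[i] + seq[i + span] for i in range(len(seq) - span)]
    total = len(pairs)
    if total == 0:
        return {}
    return {pair: count / total for pair, count in Counter(pairs).items()}
```

Computing this for several gap values yields one feature set per gap, which matches the abstract's use of a separate basic classifier for each feature set.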
|
44
|
Xiao X, Lin WZ. Application of protein grey incidence degree measure to predict protein quaternary structural types. Amino Acids 2008; 37:741-9. [DOI: 10.1007/s00726-008-0212-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 08/23/2008] [Accepted: 11/10/2008] [Indexed: 10/21/2022]
|
45
|
Shen HB, Chou KC. Identification of proteases and their types. Anal Biochem 2008; 385:153-60. [PMID: 19007742 DOI: 10.1016/j.ab.2008.10.020] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Received: 09/09/2009] [Revised: 10/13/2008] [Accepted: 10/14/2008] [Indexed: 10/21/2022]
Abstract
Called by many as biology's version of Swiss army knives, proteases cut long sequences of amino acids into fragments and regulate most physiological processes. They are vitally important in the life cycle. Different types of proteases have different action mechanisms and biological processes. With the avalanche of protein sequences generated during the postgenomic age, it is highly desirable for both basic research and drug design to develop a fast and reliable method for identifying the types of proteases according to their sequences or even just for whether they are proteases or not. In this article, three recently developed identification methods in this regard are discussed: (i) FunD-PseAAC, (ii) GO-PseAAC, and (iii) FunD-PsePSSM. The first two were established by hybridizing the FunD (functional domain) approach and the GO (gene ontology) approach, respectively, with the PseAAC (pseudo amino acid composition) approach. The third method was established by fusing the FunD approach with the PsePSSM (pseudo position-specific scoring matrix) approach. Of these three methods, only FunD-PsePSSM has provided a server called ProtIdent (protease identifier), which is freely accessible to the public via the website at http://www.csbio.sjtu.edu.cn/bioinf/Protease. For the convenience of users, a step-by-step guide on how to use ProtIdent is illustrated. Meanwhile, the caveat in using ProtIdent and how to understand the success expectancy rate of a statistical predictor are discussed. Finally, the essence of why ProtIdent can yield a high success rate in identifying proteases and their types is elucidated.
Affiliation(s)
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200240, China.
|
46
|
Using Chou’s pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier. Pattern Recognit Lett 2008. [DOI: 10.1016/j.patrec.2008.06.007] [Citation(s) in RCA: 135] [Impact Index Per Article: 7.9] [Indexed: 11/18/2022]
|
47
|
Xiao X, Lin WZ, Chou KC. Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes. J Comput Chem 2008; 29:2018-24. [PMID: 18381630 DOI: 10.1002/jcc.20955] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.8] [Indexed: 11/10/2022]
Abstract
Using the pseudo amino acid (PseAA) composition to represent the sample of a protein can incorporate a considerable amount of sequence pattern information so as to improve the prediction quality for its structural or functional classification. However, how to optimally formulate the PseAA composition is an important problem yet to be solved. In this article the grey modeling approach is introduced that is particularly efficient in coping with complicated systems such as the one consisting of many proteins with different sequence orders and lengths. On the basis of the grey model, four coefficients derived from each of the protein sequences concerned are adopted for its PseAA components. The PseAA composition thus formulated is called the "grey-PseAA" composition that can catch the essence of a protein sequence and better reflect its overall pattern. In our study we have demonstrated that introduction of the grey-PseAA composition can remarkably enhance the success rates in predicting the protein structural class. It is anticipated that the concept of grey-PseAA composition can be also used to predict many other protein attributes, such as subcellular localization, membrane protein type, enzyme functional class, GPCR type, protease type, among many others.
Affiliation(s)
- Xuan Xiao
- Computer Department, Jing-De-Zhen Ceramic Institute, Jing-De-Zhen 333000, China.
48
A complexity-based method for predicting protein subcellular location. Amino Acids 2008; 37:427-33. [DOI: 10.1007/s00726-008-0172-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 05/30/2008] [Accepted: 08/04/2008] [Indexed: 11/30/2022]
49
Zhang SW, Chen W, Yang F, Pan Q. Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 2008; 35:591-8. [PMID: 18427713 DOI: 10.1007/s00726-008-0086-x] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.2] [Received: 02/19/2008] [Accepted: 02/28/2008] [Indexed: 12/11/2022]
Abstract
In the protein universe, many proteins are composed of two or more polypeptide chains, generally referred to as subunits, which associate through noncovalent interactions and, occasionally, disulfide bonds to form protein quaternary structures. It has long been known that the functions of proteins are closely related to their quaternary structures; some examples include enzymes, hemoglobin, DNA polymerase, and ion channels. However, it is extremely labor-expensive and even impossible to quickly determine the structures of hundreds of thousands of protein sequences solely from experiments. Since the number of protein sequences entering databanks is increasing rapidly, it is highly desirable to develop computational methods for classifying the quaternary structures of proteins from their primary sequences. Since the concept of Chou's pseudo amino acid composition (PseAAC) was introduced, a variety of approaches, such as residue conservation scores, von Neumann entropy, multiscale energy, autocorrelation function, moment descriptors, and cellular automata, have been utilized to formulate the PseAAC for predicting different attributes of proteins. Here, in a different approach, a sequence-segmented PseAAC is introduced to represent protein samples. Meanwhile, multiclass SVM classifier modules were adopted to classify protein quaternary structures. As a demonstration, the dataset constructed by Chou and Cai [(2003) Proteins 53:282-289] was adopted as a benchmark dataset. The overall jackknife success rates thus obtained were 88.2-89.1%, indicating that the new approach is quite promising for predicting protein quaternary structure.
Affiliation(s)
- Shao-Wu Zhang
- College of Automation, Northwestern Polytechnical University, 710072, Xi'an, China.
50
Zhao XM, Chen L, Aihara K. Protein function prediction with high-throughput data. Amino Acids 2008; 35:517-30. [PMID: 18427717 DOI: 10.1007/s00726-008-0077-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 01/31/2008] [Accepted: 03/13/2008] [Indexed: 12/12/2022]
Abstract
Protein function prediction is one of the main challenges in the post-genomic era. The availability of large amounts of high-throughput data provides an alternative approach to handling this problem from the computational viewpoint. In this review, we provide a comprehensive description of the computational methods that are currently applicable to protein function prediction, especially from the perspective of machine learning. Machine learning techniques can generally be classified as supervised learning, semi-supervised learning and unsupervised learning. By classifying the existing computational methods for protein annotation into these three groups, we are able to present a comprehensive framework for protein annotation based on machine learning techniques. In addition to describing recently developed theoretical methodologies, we also cover representative databases and software tools that are widely utilized in the prediction of protein function.
Affiliation(s)
- Xing-Ming Zhao
- ERATO Aihara Complexity Modelling Project, JST, Tokyo, 151-0064, Japan