301
|
Hillson NJ, Hu P, Andersen GL, Shapiro L. Caulobacter crescentus as a whole-cell uranium biosensor. Appl Environ Microbiol 2007; 73:7615-21. [PMID: 17905881 PMCID: PMC2168040 DOI: 10.1128/aem.01566-07] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Indexed: 11/20/2022]
Abstract
We engineered a strain of the bacterium Caulobacter crescentus to fluoresce in the presence of micromolar levels of uranium at ambient temperatures when it is exposed to a hand-held UV lamp. Previous microarray experiments revealed that several Caulobacter genes are significantly upregulated in response to uranium but not in response to other heavy metals. We designated one of these genes urcA (for uranium response in caulobacter). We constructed a reporter that utilizes the urcA promoter to produce a UV-excitable green fluorescent protein in the presence of the uranyl cation, a soluble form of uranium. This reporter is specific for uranium and has little cross specificity for nitrate (<400 µM), lead (<150 µM), cadmium (<48 µM), or chromium (<41.6 µM). The uranium reporter construct was effective for discriminating contaminated groundwater samples (4.2 µM uranium) from uncontaminated groundwater samples (<0.1 µM uranium) collected at the Oak Ridge Field Research Center. In contrast to other uranium detection methodologies, the Caulobacter reporter strain can provide on-demand usability in the field; it requires minimal sample processing and no equipment other than a hand-held UV lamp, and it may be sprayed directly on soil, groundwater, or industrial surfaces.
Affiliation(s)
- Nathan J Hillson
- Department of Developmental Biology, Beckman Center, Stanford University School of Medicine, Stanford, California 94305, USA
|
302
|
Bhasin M, Raghava GPS. A hybrid approach for predicting promiscuous MHC class I restricted T cell epitopes. J Biosci 2007; 32:31-42. [PMID: 17426378 DOI: 10.1007/s12038-007-0004-5] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.2] [Indexed: 10/23/2022]
Abstract
In the present study, a systematic attempt has been made to develop an accurate method for predicting MHC class I restricted T cell epitopes for a large number of MHC class I alleles. Initially, a quantitative matrix (QM)-based method was developed for 47 MHC class I alleles having at least 15 binders. A secondary artificial neural network (ANN)-based method was developed for 30 out of 47 MHC alleles having a minimum of 40 binders. Combination of these ANN- and QM-based prediction methods for 30 alleles improved the accuracy of prediction by 6% compared to each individual method. The average accuracy of the hybrid method for 30 MHC alleles is 92.8%. This method also allows prediction of binders for 20 additional alleles using QM that has been reported in the literature, thus allowing prediction for 67 MHC class I alleles. The performance of the method was evaluated using a jack-knife validation test. The performance of the method was also evaluated on blind or independent data. Comparison of our method with existing MHC binder prediction methods for alleles studied by both methods shows that our method is superior to other existing methods. This method also identifies proteasomal cleavage sites in antigen sequences by implementing the matrices described earlier. Thus, the method described here allows the identification of promiscuous MHC class I binders (peptides binding with many MHC alleles) having a proteasomal cleavage site at the C-terminus. The user-friendly result display format (HTML-II) can assist in locating the promiscuous MHC binding regions in an antigen sequence. The method is available on the web at www.imtech.res.in/raghava/nhlapred and its mirror site is available at http://bioinformatics.uams.edu/mirror/nhlapred/.
Affiliation(s)
- Manoj Bhasin
- Institute of Microbial Technology, Sector 39A, Chandigarh 160 036, India
|
303
|
Rashid M, Saha S, Raghava GPS. Support Vector Machine-based method for predicting subcellular localization of mycobacterial proteins using evolutionary information and motifs. BMC Bioinformatics 2007; 8:337. [PMID: 17854501 PMCID: PMC2147037 DOI: 10.1186/1471-2105-8-337] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.2] [Received: 05/15/2007] [Accepted: 09/13/2007] [Indexed: 11/17/2022]
Abstract
Background In the past, a number of methods have been developed for predicting the subcellular location of eukaryotic, prokaryotic (Gram-negative and Gram-positive bacteria) and human proteins, but no method has been developed for mycobacterial proteins, which may represent a repertoire of potent immunogens of this dreaded pathogen. In this study, an attempt has been made to develop a method for predicting the subcellular location of mycobacterial proteins. Results The models were trained and tested on 852 mycobacterial proteins and evaluated using a five-fold cross-validation technique. First, an SVM (Support Vector Machine) model was developed using amino acid composition, and an overall accuracy of 82.51% was achieved with an average accuracy (mean of class-wise accuracy) of 68.47%. In order to utilize evolutionary information, an SVM model was developed using PSSM (Position-Specific Scoring Matrix) profiles obtained from PSI-BLAST (Position-Specific Iterated BLAST), and the overall accuracy achieved was 86.62% with an average accuracy of 73.71%. In addition, HMM (Hidden Markov Model), MEME/MAST (Multiple Em for Motif Elicitation/Motif Alignment and Search Tool) and hybrid models that combined two or more models were also developed. We achieved a maximum overall accuracy of 86.8% with an average accuracy of 89.00% using a combination of the PSSM-based SVM model and MEME/MAST. The performance of our method was compared with that of the existing methods developed for predicting subcellular locations of Gram-positive bacterial proteins. Conclusion A highly accurate method has been developed for predicting the subcellular location of mycobacterial proteins. This method also predicts a very important class of proteins, namely membrane-attached proteins. This method will be useful in annotating newly sequenced or hypothetical mycobacterial proteins. Based on the above study, a freely accessible web server, TBpred (http://www.imtech.res.in/raghava/tbpred/), has been developed.
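Note on the two accuracy figures quoted in this abstract: the following is a minimal Python sketch, not the authors' code, of how overall accuracy and average accuracy (the mean of class-wise accuracies) can be computed and why they may differ when classes are imbalanced. The function name and the toy labels are hypothetical.

```python
from collections import defaultdict


def overall_and_average_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of class-wise accuracies)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = defaultdict(int)  # correct predictions per true class
    total = defaultdict(int)    # number of examples per true class
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    overall = sum(correct.values()) / len(y_true)
    average = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, average


# Hypothetical imbalanced toy data: 8 cytoplasmic and 2 membrane proteins.
y_true = ["cytoplasmic"] * 8 + ["membrane"] * 2
y_pred = ["cytoplasmic"] * 9 + ["membrane"]
print(overall_and_average_accuracy(y_true, y_pred))
```

In this toy case one of the two membrane proteins is misclassified, so the minority class pulls the class-wise mean well below the overall figure.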
Affiliation(s)
- Mamoon Rashid
- Bioinformatics Centre, Institute of Microbial Technology, Sector-39A, Chandigarh, India
- Sudipto Saha
- Bioinformatics Centre, Institute of Microbial Technology, Sector-39A, Chandigarh, India
- Gajendra PS Raghava
- Bioinformatics Centre, Institute of Microbial Technology, Sector-39A, Chandigarh, India
|
304
|
Su ECY, Chiu HS, Lo A, Hwang JK, Sung TY, Hsu WL. Protein subcellular localization prediction based on compartment-specific features and structure conservation. BMC Bioinformatics 2007; 8:330. [PMID: 17825110 PMCID: PMC2040162 DOI: 10.1186/1471-2105-8-330] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Received: 04/25/2007] [Accepted: 09/08/2007] [Indexed: 01/17/2023]
Abstract
Background Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches become highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. Results We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machine (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains a prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%.
Conclusion Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, the overall accuracy is further improved by combining the structural homology approach, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.
Affiliation(s)
- Emily Chia-Yu Su
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
- Hua-Sheng Chiu
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Allan Lo
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Department of Life Sciences, National Tsing Hua University, Hsinchu, Taiwan
- Jenn-Kang Hwang
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu, Taiwan
- Ting-Yi Sung
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
- Bioinformatics Lab., Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
305
|
Wang Z, Jiang L, Li M, Sun L, Lin R. Fast Fourier transform-based support vector machine for subcellular localization prediction using different substitution models. Acta Biochim Biophys Sin (Shanghai) 2007; 39:715-21. [PMID: 17805467 DOI: 10.1111/j.1745-7270.2007.00326.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
Abstract
There are approximately 10^9 proteins in a cell. A hotspot in bioinformatics is how to identify a protein's subcellular localization if its sequence is known. In this paper, a method using a fast Fourier transform-based support vector machine is developed to predict the subcellular localization of proteins from their physicochemical properties and structural parameters. The prediction accuracies reached 83% in prokaryotic organisms and 84% in eukaryotic organisms with the substitution model of the c-p-v matrix (c, composition; p, polarity; and v, molecular volume). The overall prediction accuracy was also evaluated using the "leave-one-out" jackknife procedure. The influence of the substitution model on prediction accuracy is also discussed. The source code of the new program is available on request from the authors.
Affiliation(s)
- Zhimeng Wang
- College of Chemistry, Sichuan University, Chengdu 610064, China
|
306
|
Huang WL, Tung CW, Huang HL, Hwang SF, Ho SY. ProLoc: Prediction of protein subnuclear localization using SVM with automatic selection from physicochemical composition features. Biosystems 2007; 90:573-81. [PMID: 17291684 DOI: 10.1016/j.biosystems.2007.01.001] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.9] [Received: 10/12/2006] [Revised: 12/09/2006] [Accepted: 01/01/2007] [Indexed: 11/24/2022]
Abstract
Accurate prediction methods of protein subnuclear localizations rely on the cooperation between informative features and classifier design. Support vector machine (SVM) based learning methods have been shown to be effective for predictions of protein subcellular and subnuclear localizations. This study proposes an evolutionary support vector machine (ESVM) based classifier with automatic selection from a large set of physicochemical composition (PCC) features to design an accurate system for predicting protein subnuclear localization, named ProLoc. ESVM, which uses an inheritable genetic algorithm combined with SVM, can automatically determine the best number m of PCC features and identify m out of 526 PCC features simultaneously. To evaluate ESVM, this study uses two datasets, SNL6 and SNL9, which have 504 proteins localized in 6 subnuclear compartments and 370 proteins localized in 9 subnuclear compartments, respectively. Using leave-one-out cross-validation, ProLoc utilizing the selected m=33 and 28 PCC features has accuracies of 56.37% for SNL6 and 72.82% for SNL9, which are better than 51.4% for the SVM-based system using k-peptide composition features applied on SNL6, and 64.32% for an optimized evidence-theoretic k-nearest neighbor classifier utilizing pseudo amino acid composition applied on SNL9, respectively.
Affiliation(s)
- Wen-Lin Huang
- Institute of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
|
307
|
Chen YL, Li QZ. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J Theor Biol 2007; 248:377-81. [PMID: 17572445 DOI: 10.1016/j.jtbi.2007.05.019] [Citation(s) in RCA: 113] [Impact Index Per Article: 6.3] [Received: 03/03/2007] [Revised: 04/18/2007] [Accepted: 05/10/2007] [Indexed: 10/23/2022]
Abstract
Apoptosis proteins are very important for understanding the mechanism of programmed cell death. The localization of an apoptosis protein can provide valuable information about its molecular function. The prediction of localization of an apoptosis protein is a challenging task. In our previous work we proposed an increment of diversity (ID) method using protein sequence information for this prediction task. In this work, based on the concept of Chou's pseudo-amino acid composition [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. (Erratum: Chou, K.C., 2001, vol. 44, 60) 43, 246-255; Chou, K.C., 2005. Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19], a different pseudo-amino acid composition using the hydropathy distribution information is introduced. A novel ID_SVM algorithm that combines ID with a support vector machine (SVM) is proposed. This method is applied to three data sets (317 apoptosis proteins, 225 apoptosis proteins and 98 apoptosis proteins). In jackknife tests, it obtains higher predictive success rates than the previous algorithms.
Affiliation(s)
- Ying-Li Chen
- Laboratory of Theoretical Biophysics, Department of Physics, College of Sciences and Technology, Inner Mongolia University, Hohhot 010021, China
|
308
|
Tan F, Feng X, Fang Z, Li M, Guo Y, Jiang L. Prediction of mitochondrial proteins based on genetic algorithm - partial least squares and support vector machine. Amino Acids 2007; 33:669-75. [PMID: 17701100 DOI: 10.1007/s00726-006-0465-0] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 09/21/2006] [Accepted: 10/15/2006] [Indexed: 11/25/2022]
Abstract
Mitochondria are essential cell organelles of eukaryotes. Hence, it is vitally important to develop an automated and reliable method for timely identification of novel mitochondrial proteins. In this study, mitochondrial proteins were encoded by dipeptide composition; then, the genetic algorithm-partial least squares (GA-PLS) method was used to identify the dipeptide composition elements that are more important in recognizing mitochondrial proteins; further, these selected dipeptide composition elements were applied to support vector machine (SVM)-based classifiers to predict the mitochondrial proteins. All the models were trained and validated by the jackknife cross-validation test. The prediction accuracy is 85%, suggesting that the method performs reasonably well in predicting the mitochondrial proteins. Our results strongly imply that not all the dipeptide compositions are informative and indispensable for predicting proteins. The MATLAB source code and the dataset are available on request from liml@scu.edu.cn.
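Note on the encoding step mentioned in this abstract: the following is a minimal Python sketch of a dipeptide composition encoding, assuming the common definition (frequency of each of the 400 ordered pairs of the 20 standard amino acids among adjacent residue pairs). It is not the authors' MATLAB code; the function name and the handling of non-standard residues are assumptions made for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(sequence):
    """Encode a protein sequence as a 400-dimensional frequency vector."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_pairs = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: skip pairs with non-standard letters
            counts[pair] += 1
            n_pairs += 1
    if n_pairs == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / n_pairs for d in DIPEPTIDES]


vector = dipeptide_composition("ACAC")
print(len(vector), vector[DIPEPTIDES.index("AC")])
```

A feature-selection step such as the GA-PLS procedure described in the abstract would then keep only a subset of these 400 components before training the classifier.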
Affiliation(s)
- F Tan
- College of Chemistry, Sichuan University, Chengdu, China
|
309
|
GO molecular function coding based protein subcellular localization prediction. CHINESE SCIENCE BULLETIN-CHINESE 2007. [DOI: 10.1007/s11434-007-0336-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
310
|
Liu J, Kang S, Tang C, Ellis LB, Li T. Meta-prediction of protein subcellular localization with reduced voting. Nucleic Acids Res 2007; 35:e96. [PMID: 17670799 PMCID: PMC1976432 DOI: 10.1093/nar/gkm562] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022]
Abstract
Meta-prediction seeks to harness the combined strengths of multiple predicting programs with the hope of achieving predicting performance surpassing that of all existing predictors in a defined problem domain. We investigated meta-prediction for the four-compartment eukaryotic subcellular localization problem. We compiled an unbiased subcellular localization dataset of 1693 nuclear, cytoplasmic, mitochondrial and extracellular animal proteins from Swiss-Prot 50.2. Using this dataset, we assessed the predicting performance of 12 predictors from eight independent subcellular localization predicting programs: ELSPred, LOCtree, PLOC, Proteome Analyst, PSORT, PSORT II, SubLoc and WoLF PSORT. Gorodkin correlation coefficient (GCC) was one of the performance measures. Proteome Analyst is the best individual subcellular localization predictor tested in this four-compartment prediction problem, with GCC = 0.811. A reduced voting strategy eliminating six of the 12 predictors yields a meta-predictor (RAW-RAG-6) with GCC = 0.856, substantially better than all tested individual subcellular localization predictors (P = 8.2 × 10^-6, Fisher's Z-transformation test). The improvement in performance persists when the meta-predictor is tested with data not used in its development. This and similar voting strategies, when properly applied, are expected to produce meta-predictors with outstanding performance in other life sciences problem domains.
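Note on voting over a reduced set of predictors: the following is a minimal Python sketch of plain plurality voting restricted to a chosen subset of predictors. It is only an illustration of the general idea; it is not the RAW-RAG-6 scheme, whose weighting and selection details are not given in the abstract. The function name, the predictor names (`p1` to `p5`) and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def vote(predictions, keep=None):
    """Plurality vote over predictor outputs.

    predictions: dict mapping predictor name -> predicted compartment.
    keep: optional list of predictor names to retain (the reduced set).
    Ties are broken in favour of the earliest retained predictor.
    """
    names = list(predictions) if keep is None else [n for n in keep if n in predictions]
    if not names:
        raise ValueError("no predictors left to vote")
    tally = Counter(predictions[n] for n in names)
    top = max(tally.values())
    for name in names:
        if tally[predictions[name]] == top:
            return predictions[name]


# Hypothetical outputs of five predictors for one protein.
preds = {"p1": "nuclear", "p2": "cytoplasmic", "p3": "nuclear",
         "p4": "cytoplasmic", "p5": "cytoplasmic"}
print(vote(preds))                           # all five predictors vote
print(vote(preds, keep=["p1", "p2", "p3"]))  # reduced set of three
```

Dropping predictors can change the outcome, which is why the choice of the retained subset matters for the meta-predictor's performance.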
Affiliation(s)
- Jie Liu
- Department of Neuroscience and Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- Shuli Kang
- Department of Neuroscience and Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- Chuanning Tang
- Department of Neuroscience and Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- Lynda B.M. Ellis
- Department of Neuroscience and Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- Tongbin Li
- Department of Neuroscience and Department of Laboratory Medicine and Pathology, University of Minnesota, Minneapolis, MN 55455, USA
- *To whom correspondence should be addressed. +1 612 626 3481; +1 612 626 5009
|
311
|
Abstract
Background The CSL (CBF1/RBP-Jκ/Suppressor of Hairless/LAG-1) transcription factor family members are well-known components of the transmembrane receptor Notch signaling pathway, which plays a critical role in metazoan development. They function as context-dependent activators or repressors of transcription of their responsive genes, the promoters of which harbor the GTG(G/A)GAA consensus elements. Recently, several studies described Notch-independent activities of the CSL proteins. Conclusion Our findings support the evolutionary origin of the CSL transcription factor family in the last common ancestor of fungi and metazoans. We hypothesize that the ancestral CSL function involved DNA binding and Notch-independent regulation of transcription and that this function may still be shared, to a certain degree, by the present CSL family members from both fungi and metazoans.
|
312
|
Quantitative assessment of relationship between sequence similarity and function similarity. BMC Genomics 2007; 8:222. [PMID: 17620139 PMCID: PMC1949826 DOI: 10.1186/1471-2164-8-222] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.8] [Received: 07/05/2006] [Accepted: 07/09/2007] [Indexed: 11/16/2022]
Abstract
Background Comparative sequence analysis is considered the first step towards annotating new proteins in genome annotation. However, sequence comparison may lead to creation and propagation of function assignment errors. Thus, it is important to perform a thorough, systematic analysis of the quality of sequence-based function assignment using large-scale data. Results We present an analysis of the relationship between sequence similarity and function similarity for the proteins in four model organisms, i.e., Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster. Using a measure of functional similarity based on the three categories of Gene Ontology (GO) classifications (biological process, molecular function, and cellular component), we quantified the correlation between functional similarity and sequence similarity measured by sequence identity or statistical significance of the alignment and compared such a correlation against randomly chosen protein pairs. Conclusion Various sequence-function relationships were identified from BLAST versus PSI-BLAST, sequence identity versus Expectation Value, GO indices versus semantic similarity approaches, and within genome versus between genome comparisons, for the three GO categories. Our study provides a benchmark to estimate the confidence in assignment of functions purely based on sequence similarity.
|
313
|
Thireou T, Reczko M. Bidirectional Long Short-Term Memory Networks for predicting the subcellular localization of eukaryotic proteins. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:441-446. [PMID: 17666763 DOI: 10.1109/tcbb.2007.1015] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Indexed: 05/16/2023]
Abstract
An algorithm called Bidirectional Long Short-Term Memory Networks (BLSTM) for processing sequential data is introduced. This supervised learning method trains a special recurrent neural network to use very long-range symmetric sequence context, using a combination of nonlinear processing elements and linear feedback loops for storing long-range context. The algorithm is applied to the sequence-based prediction of protein localization and predicts 93.3 percent of novel non-plant proteins and 88.4 percent of novel plant proteins correctly, which is an improvement over feedforward and standard recurrent networks solving the same problem. The BLSTM system is available as a web service (http://www.stepc.gr/~synaptic/blstm.html).
|
314
|
Emanuelsson O, Brunak S, von Heijne G, Nielsen H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc 2007; 2:953-71. [PMID: 17446895 DOI: 10.1038/nprot.2007.131] [Citation(s) in RCA: 2496] [Impact Index Per Article: 138.7] [Indexed: 12/12/2022]
Abstract
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.
Affiliation(s)
- Olof Emanuelsson
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
|
315
|
Zhou XB, Chen C, Li ZC, Zou XY. Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol 2007; 248:546-51. [PMID: 17628605 DOI: 10.1016/j.jtbi.2007.06.001] [Citation(s) in RCA: 233] [Impact Index Per Article: 12.9] [Received: 03/30/2007] [Revised: 05/16/2007] [Accepted: 06/02/2007] [Indexed: 10/23/2022]
Abstract
With the rapid increase of protein sequence data, it is indispensable to develop automated and reliable predictive methods for protein function annotation. One approach for facilitating protein function prediction is to classify proteins into functional families from primary sequence. Enzymes being the most important group of all proteins, the accurate prediction of enzyme family classes and subfamily classes is closely related to their biological functions. In this paper, for the prediction of enzyme subfamily classes, Chou's amphiphilic pseudo-amino acid composition [Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19] has been adopted to represent the protein samples for training the 'one-versus-rest' support vector machine. As a demonstration, the jackknife test was performed on the dataset that contains 2640 oxidoreductase sequences classified into 16 subfamily classes [Chou, K.C., Elrod, D.W., 2003. Prediction of enzyme family classes. J. Proteome Res. 2, 183-190]. The overall accuracy thus obtained was 80.87%. The significant enhancement in the accuracy indicates that the current method might play a complementary role to the existing methods.
Affiliation(s)
- Xi-Bin Zhou
- School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, People's Republic of China
|
316
|
Abstract
Disulfide bonds play an important role in stabilizing protein structure and regulating protein function. Therefore, the ability to infer disulfide connectivity from protein sequences will be valuable in structural modeling and functional analysis. However, to predict disulfide connectivity directly from sequences presents a challenge to computational biologists due to the nonlocal nature of disulfide bonds, i.e., the close spatial proximity of the cysteine pair that forms the disulfide bond does not necessarily imply the short sequence separation of the cysteine residues. Recently, Chen and Hwang (Proteins 2005;61:507-512) treated this problem as a multiple class classification by defining each distinct disulfide pattern as a class. They used multiple support vector machines based on a variety of sequence features to predict the disulfide patterns. Their results compare favorably with those in the literature for a benchmark dataset sharing less than 30% sequence identity. However, since the number of disulfide patterns grows rapidly when the number of disulfide bonds increases, their method performs unsatisfactorily for the cases of large number of disulfide bonds. In this work, we propose a novel method to represent disulfide connectivity in terms of cysteine pairs, instead of disulfide patterns. Since the number of bonding states of the cysteine pairs is independent of that of disulfide bonds, the problem of class explosion is avoided. The bonding states of the cysteine pairs are predicted using the support vector machines together with the genetic algorithm optimization for feature selection. The complete disulfide patterns are then determined from the connectivity matrices that are constructed from the predicted bonding states of the cysteine pairs. Our approach outperforms the current approaches in the literature.
Affiliation(s)
- Chih-Hao Lu
- Institute of Bioinformatics, National Chiao Tung University, Hsinchu 30050, Taiwan
|
317
|
Jiang P, Wu H, Wei J, Sang F, Sun X, Lu Z. RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features. Nucleic Acids Res 2007; 35:W47-51. [PMID: 17478517 PMCID: PMC1933199 DOI: 10.1093/nar/gkm217] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Indexed: 11/24/2022]
Abstract
In yeast, meiotic recombination is initiated by double-strand DNA breaks (DSBs) which occur at relatively high frequencies in some genomic regions (hotspots) and relatively low frequencies in others (coldspots). Although observations concerning individual hot/cold spots have given clues as to the mechanism of recombination initiation, the prediction of hot/cold spots from DNA sequence information is a challenging task. In this article, we introduce a random forest (RF) prediction model to detect recombination hot/cold spots from the yeast genome. The out-of-bag (OOB) estimation of the model indicated that the RF classifier achieved high prediction performance with 82.05% total accuracy and a Matthews correlation coefficient (MCC) value of 0.638. Compared with an alternative machine-learning algorithm, the support vector machine (SVM), the RF method performs better in both sensitivity and specificity. The prediction model is implemented as a web server (RF-DYMHC) and it is freely available at http://www.bioinf.seu.edu.cn/Recombination/rf_dymhc.htm. Given a yeast genome and prediction parameters (RI-value and non-overlapping window scan size), the program reports the predicted hot/cold spots and marks them in color.
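Note on the MCC figure quoted in this abstract: the following is a minimal Python sketch of the standard Matthews correlation coefficient for a binary confusion matrix, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name and the example counts are hypothetical and are not taken from the paper; returning 0.0 when a marginal is empty is a common convention adopted here as an assumption.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: hotspots as the positive class, coldspots as negative.
print(mcc(tp=5, tn=5, fp=0, fn=0))  # perfect prediction
print(mcc(tp=5, tn=5, fp=5, fn=5))  # no better than chance
```

Unlike total accuracy, the coefficient stays near zero for a classifier that ignores one of the two classes, which is why it is often reported alongside accuracy.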
Affiliation(s)
- Zuhong Lu
- *To whom correspondence should be addressed: +86 25 83793779
|
318
|
Guo J, Pu X, Lin Y, Leung H. Protein subcellular localization based on PSI-BLAST and machine learning. J Bioinform Comput Biol 2007; 4:1181-95. [PMID: 17245809 DOI: 10.1142/s0219720006002405] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 06/19/2006] [Revised: 08/02/2006] [Accepted: 08/02/2006] [Indexed: 11/18/2022]
Abstract
Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method which extracts features from protein profiles rather than from amino acid sequences. The protein profile represents a protein family, discards the part of the sequence information that is not conserved throughout the family, and is therefore more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of the N-terminus of the profile are extracted to train and test the probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method performs better than methods based on amino acid sequences. The prediction results of the proposed method are also compared with SubLoc on two redundancy-reduced datasets.
Affiliation(s)
- Jian Guo
- Laboratory of Statistical Computation, Department of Mathematical Sciences, Tsinghua University, China
|
319
|
Jia P, Qian Z, Zeng Z, Cai Y, Li Y. Prediction of subcellular protein localization based on functional domain composition. Biochem Biophys Res Commun 2007; 357:366-70. [PMID: 17428441 DOI: 10.1016/j.bbrc.2007.03.139] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 03/19/2007] [Accepted: 03/21/2007] [Indexed: 11/19/2022]
Abstract
Assigning subcellular localization (SL) to proteins is one of the major tasks of functional proteomics. Despite the impressive technical advances of the past decades, it is still time-consuming and laborious to experimentally determine SL on a high throughput scale. Thus, computational predictions are the preferred method for large-scale assignment of protein SL, and if appropriate, followed up by experimental studies. In this report, using a machine learning approach, the Nearest Neighbor Algorithm (NNA), we developed a prediction system for protein SL in which we incorporated a protein functional domain profile. The overall accuracy achieved by this system is 93.96%. Furthermore, comparisons with other methods have been conducted to demonstrate the validity and efficiency of our prediction system. We also provide an implementation of our Subcellular Location Prediction System (SLPS), which is available at http://pcal.biosino.org.
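A nearest-neighbor classifier over functional domain composition can be sketched as follows. This is a minimal illustration, not the SLPS implementation: each protein is assumed to be represented by the set of domains it contains (a binary vector), the distance is assumed to be one minus the cosine similarity, and the domain identifiers are made up.

```python
from math import sqrt


def cosine_distance(a: set, b: set) -> float:
    """1 - cosine similarity between two binary domain vectors stored as sets."""
    if not a or not b:
        return 1.0
    # For binary vectors the dot product is the size of the intersection
    # and each norm is the square root of the set size.
    return 1.0 - len(a & b) / sqrt(len(a) * len(b))


def nearest_neighbor_label(query: set, training: list) -> str:
    """Return the label of the training protein closest to the query."""
    return min(training, key=lambda item: cosine_distance(query, item[0]))[1]


# Hypothetical training proteins: (set of domain IDs, localization label).
training = [
    ({"D1", "D2"}, "nucleus"),
    ({"D3"}, "cytoplasm"),
]
```

With this toy data, `nearest_neighbor_label({"D2", "D4"}, training)` picks the first protein, because it shares one domain with the query and the second shares none.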
Affiliation(s)
- Peilin Jia
- Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China.
|
320
|
Oğul H, Mumcuoğu EU. Subcellular localization prediction with new protein encoding schemes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:227-32. [PMID: 17473316 DOI: 10.1109/tcbb.2007.070209] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/15/2023]
Abstract
Subcellular localization is one of the key properties in functional annotation of proteins. Support vector machines (SVMs) have been widely used for automated prediction of subcellular localizations. Existing methods differ in the protein encoding schemes used. In this study, we present two methods for protein encoding to be used for SVM-based subcellular localization prediction: n-peptide compositions with reduced amino acid alphabets for larger values of n, and pairwise sequence similarity scores based on the whole sequence and the N-terminal sequence. We tested the methods on a common benchmarking data set that consists of 2,427 eukaryotic proteins with four localization sites. In 5-fold cross-validation tests, the encoding with n-peptide compositions provided accuracies of 84.5, 88.9, 66.3, and 94.3 percent for cytoplasmic, extracellular, mitochondrial, and nuclear proteins, with an overall accuracy of 87.1 percent. The second method provided 83.6, 87.7, 87.9, and 90.5 percent accuracies for individual locations and 87.8 percent overall accuracy. A hybrid system, which we call PredLOC, makes a final decision based on the results of the two presented methods; it achieved an overall accuracy of 91.3 percent, which is better than many of the existing methods. The new system also outperformed recent methods in experiments conducted on a new, unique SWISSPROT test set.
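The first encoding, n-peptide composition over a reduced alphabet, can be illustrated with a short sketch. The four-group alphabet below is a hypothetical grouping chosen for the example, not the alphabet used in the paper, and the function name is likewise an assumption.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet: 20 amino acids mapped to 4 group symbols.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # basic
    "-": "DE",        # acidic
}
REDUCE = {aa: g for g, members in GROUPS.items() for aa in members}


def reduced_npeptide_composition(seq: str, n: int) -> dict[str, float]:
    """Frequencies of all length-n words over the reduced alphabet."""
    # Map each residue to its group symbol; unknown residues are dropped.
    reduced = "".join(REDUCE[aa] for aa in seq.upper() if aa in REDUCE)
    windows = [reduced[i:i + n] for i in range(len(reduced) - n + 1)]
    counts = Counter(windows)
    keys = ["".join(p) for p in product(GROUPS, repeat=n)]
    total = len(windows)
    return {k: (counts[k] / total if total else 0.0) for k in keys}
```

Reducing the alphabet is what makes larger n practical: with four groups a tripeptide composition has 4^3 = 64 features, compared with 20^3 = 8000 over the full alphabet.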
Affiliation(s)
- Hasan Oğul
- Department of Computer Engineering, Baskent University, Ankara, Turkey.
|
321
|
Ensemblator: An ensemble of classifiers for reliable classification of biological data. Pattern Recognit Lett 2007. [DOI: 10.1016/j.patrec.2006.10.012] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Indexed: 11/16/2022]
|
322
|
Holloway DT, Kon M, DeLisi C. Machine learning for regulatory analysis and transcription factor target prediction in yeast. SYSTEMS AND SYNTHETIC BIOLOGY 2007; 1:25-46. [PMID: 19003435 PMCID: PMC2533145 DOI: 10.1007/s11693-006-9003-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 12/19/2022]
Abstract
High throughput technologies, including array-based chromatin immunoprecipitation, have rapidly increased our knowledge of transcriptional maps-the identity and location of regulatory binding sites within genomes. Still, the full identification of sites, even in lower eukaryotes, remains largely incomplete. In this paper we develop a supervised learning approach to site identification using support vector machines (SVMs) to combine 26 different data types. A comparison with the standard approach to site identification using position specific scoring matrices (PSSMs) for a set of 104 Saccharomyces cerevisiae regulators indicates that our SVM-based target classification is more sensitive (73 vs. 20%) when specificity and positive predictive value are the same. We have applied our SVM classifier for each transcriptional regulator to all promoters in the yeast genome to obtain thousands of new targets, which are currently being analyzed and refined to limit the risk of classifier over-fitting. For the purpose of illustration we discuss several results, including biochemical pathway predictions for Gcn4 and Rap1. For both transcription factors SVM predictions match well with the known biology of control mechanisms, and possible new roles for these factors are suggested, such as a function for Rap1 in regulating fermentative growth. We also examine the promoter melting temperature curves for the targets of YJR060W, and show that targets of this TF have potentially unique physical properties which distinguish them from other genes. The SVM output automatically provides the means to rank dataset features to identify important biological elements. We use this property to rank classifying k-mers, thereby reconstructing known binding sites for several TFs, and to rank expression experiments, determining the conditions under which Fhl1, the factor responsible for expression of ribosomal protein genes, is active. 
We can see that targets of Fhl1 are differentially expressed in the chosen conditions as compared to the expression of average and negative set genes. SVM-based classifiers provide a robust framework for analysis of regulatory networks. Processing of classifier outputs can provide high quality predictions and biological insight into functions of particular transcription factors. Future work on this method will focus on increasing the accuracy and quality of predictions using feature reduction and clustering strategies. Since predictions have been made on only 104 TFs in yeast, new classifiers will be built for the remaining 100 factors which have available binding data.
Affiliation(s)
- Dustin T. Holloway
- Molecular Biology Cell Biology and Biochemistry, Boston University, Boston, MA 02215 USA
- Mark Kon
- Department of Mathematics and Statistics, Boston University, Boston, MA 02215 USA
- Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA
- Charles DeLisi
- Bioinformatics and Systems Biology, Boston University, Boston, MA 02215 USA
|
323
|
Klee EW, Sosa CP. Computational classification of classically secreted proteins. Drug Discov Today 2007; 12:234-40. [PMID: 17331888 DOI: 10.1016/j.drudis.2007.01.008] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 10/16/2006] [Revised: 01/05/2007] [Accepted: 01/23/2007] [Indexed: 11/21/2022]
Abstract
The ability to identify classically secreted proteins is an important component of targeted therapeutic studies and the discovery of circulating biomarkers. Here, we review some of the most recent programs available for the in silico prediction of secretory proteins, the performance of which is benchmarked with an independent set of annotated human proteins. The description of these programs and the results of this benchmarking provide insights into the most recently developed prediction programs, which will enable investigators to make more informed decisions about which program best addresses their research needs.
Affiliation(s)
- Eric W Klee
- Stabile 3-15, Department of Laboratory Medicine, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA
|
324
|
Raghuraj R, Lakshminarayanan S. VPMCD: Variable interaction modeling approach for class discrimination in biological systems. FEBS Lett 2007; 581:826-30. [PMID: 17289035 DOI: 10.1016/j.febslet.2007.01.052] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 09/11/2006] [Revised: 01/19/2007] [Accepted: 01/25/2007] [Indexed: 11/20/2022]
Abstract
Data classification algorithms applied for class prediction in computational biology literature are data specific and have shown varying degrees of performance. Different classes cannot be distinguished solely based on interclass distances or decision boundaries. We propose that inter-relations among the features be exploited for separating observations into specific classes. A new variable predictive model based class discrimination (VPMCD) method is described here. Three well established and proven data sets of varying statistical and biological significance are utilized as benchmark. The performance of the new method is compared with advanced classification algorithms. The new method performs better during different tests and shows higher stability and robustness. The VPMCD is observed to be a potentially strong classification approach and can be effectively extended to other data mining applications involving biological systems.
Affiliation(s)
- Rao Raghuraj
- Department of Chemical and Biomolecular Engineering, 4 Engineering Drive 4, National University of Singapore, Singapore
|
325
|
Carrie C, Murcha MW, Millar AH, Smith SM, Whelan J. Nine 3-ketoacyl-CoA thiolases (KATs) and acetoacetyl-CoA thiolases (ACATs) encoded by five genes in Arabidopsis thaliana are targeted either to peroxisomes or cytosol but not to mitochondria. PLANT MOLECULAR BIOLOGY 2007; 63:97-108. [PMID: 17120136 DOI: 10.1007/s11103-006-9075-1] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.3] [Received: 06/25/2006] [Accepted: 08/10/2006] [Indexed: 05/12/2023]
Abstract
The sub-cellular location of enzymes of fatty acid beta-oxidation in plants is controversial. In the current debate the role and location of particular thiolases in fatty acid degradation, fatty acid synthesis and isoleucine degradation are important. The aim of this research was to determine the sub-cellular location and hence provide information about possible functions of all the putative 3-ketoacyl-CoA thiolases (KAT) and acetoacetyl-CoA thiolases (ACAT) in Arabidopsis. Arabidopsis has three genes predicted to encode KATs, one of which encodes two polypeptides that differ at the N-terminal end. Expression in Arabidopsis cells of cDNAs encoding each of these KATs fused to green fluorescent protein (GFP) at their C-termini showed that three are targeted to peroxisomes while the fourth is apparently cytosolic. The four KATs are also predicted to have mitochondrial targeting sequences, but purified mitochondria were unable to import any of the proteins in vitro. Arabidopsis also has two genes encoding a total of five different putative ACATs. One isoform is targeted to peroxisomes as a fusion with GFP, while the others display no targeting in vivo as GFP fusions, or import into isolated mitochondria. Analysis of gene co-expression clusters in Arabidopsis suggests a role for peroxisomal KAT2 in beta-oxidation, while KAT5 co-expresses with genes of the flavonoid biosynthesis pathway and cytosolic ACAT2 clearly co-expresses with genes of the cytosolic mevalonate biosynthesis pathway. We conclude that KATs and ACATs are present in the cytosol and peroxisome, but are not found in mitochondria. The implications for fatty acid beta-oxidation and for isoleucine degradation in mitochondria are discussed.
Affiliation(s)
- Chris Carrie
- ARC Centre of Excellence in Plant Energy Biology, University of Western Australia, MCS building M316, 35 Stirling Highway, Crawley, 6009, WA, Australia
|
326
|
Mukai Y, Hirokawa T, Tomii K, Asai K, Akiyama Y, Suwa M. Identification of Glycosyltransferases Focusing on Golgi Transmembrane Region. TRENDS GLYCOSCI GLYC 2007. [DOI: 10.4052/tigg.19.41] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/20/2022]
|
327
|
González-Díaz H, Pérez-Castillo Y, Podda G, Uriarte E. Computational chemistry comparison of stable/nonstable protein mutants classification models based on 3D and topological indices. J Comput Chem 2007; 28:1990-5. [PMID: 17450569 DOI: 10.1002/jcc.20700] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.6] [Indexed: 11/11/2022]
Abstract
In principle, there are different protein structural parameters that can be used in computational chemistry studies to classify protein mutants according to thermal stability, including sequence, connectivity, and 3D descriptors. Connectivity parameters (called topological indices, TIs) are simpler than 3D parameters and thus less computationally expensive. However, TIs ignore important aspects of protein structure and hence are expected to be less accurate. In any case, a comparison of 3D descriptors and TIs has not been reported with respect to their power to discriminate proteins according to stability. In this study, we compare both classes of indices in this sense for the first time. The best model found, based on 3D spectral moments, correctly classified 507 out of 525 (96.6%) proteins, while the TIs model correctly classified 404 out of 525 (77.0%) proteins. We have shown that 3D descriptor models do give more accurate results than TIs but, interestingly, TIs give acceptable results in a timely way in spite of their simplicity.
Affiliation(s)
- Humberto González-Díaz
- Faculty of Pharmacy, University of Santiago de Compostela, Santiago de Compostela 15782, Spain.
|
328
|
Abstract
It is widely recognized that much of the information for determining the final subcellular localization of proteins is found in their amino acid sequences. Thus the prediction of protein localization sites is of both theoretical and practical interest. In most cases, the prediction has been attempted in two ways: one is based on the knowledge of experimentally characterized targeting signals, while the other utilizes the statistical differences of general sequence characteristics, such as amino acid composition, between localization sites. Both approaches have limitations, and it is recommended to check the results of various prediction methods based on different principles as well as training data. Recently, increased proteomic analyses of localization sites have provided new data to assess the current status of predictive methods. In this chapter we discuss these issues and close with an example illustrating the use of the WoLF PSORT web server for localization prediction.
Affiliation(s)
- Kenta Nakai
- Laboratory of Functional Analysis in silico, Human Genome Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan
|
329
|
González-Díaz H, Uriarte E. Biopolymer stochastic moments. I. Modeling human rhinovirus cellular recognition with protein surface electrostatic moments. Biopolymers 2006; 77:296-303. [PMID: 15648087 DOI: 10.1002/bip.20234] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022]
Abstract
Stochastic moments may be applied as molecular descriptors in quantitative structure-activity relationship (QSAR) studies for small molecules (H. González-Díaz et al., Journal of Molecular Modeling, 2002, Vol. 8, pp. 237-245; 2003, Vol. 9, pp. 395-407). However, applications in the field of biopolymers are less known. Recently, the MARCH-INSIDE approach has been generalized to encode structural features of proteins and other biopolymers (H. González-Díaz et al., Bioinformatics, 2003, Vol. 19, pp. 2079-2087; Bioorganic & Medicinal Chemistry Letters, 2004, Vol. 14, pp. 4691-4695; Polymers, 2004, Vol. 45, pp. 3845-3853; Bioorganic & Medicinal Chemistry, 2005, Vol. 13, pp. 323-331). The present article attempts to extend this research by introducing for the first time stochastic moments for a surface road map of viral proteins. These moments are afterward used to seek a model that predicts the cellular receptor for human rhinoviruses. The model correctly classified 100% of 10 viruses binding to the low-density lipoprotein receptor (LDLR) and 88.9% of 9 viruses binding to the intercellular adhesion molecule (ICAM) receptors in training. The same results have been obtained in four cross-validation experiments using a resubstitution technique. The present model compares favorably, in terms of complexity, with others previously reported based on entropy considerations, and offers a quantitative basis for the visual rule previously reported by Vlasak et al.
|
330
|
Bodył A, Mackiewicz P. Analysis of the targeting sequences of an iron-containing superoxide dismutase (SOD) of the dinoflagellate Lingulodinium polyedrum suggests function in multiple cellular compartments. Arch Microbiol 2006; 187:281-96. [PMID: 17143625 DOI: 10.1007/s00203-006-0194-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 10/02/2006] [Accepted: 11/06/2006] [Indexed: 01/19/2023]
Abstract
One of the proteins targeted to the peridinin plastid of the dinoflagellate Lingulodinium polyedrum is the iron-containing superoxide dismutase (LpSOD). Like dinoflagellate plastid proteins of class II, LpSOD carries a bipartite presequence comprising a signal peptide followed by a transit peptide. Our bioinformatic studies suggest that its signal peptide is atypical, however, and that the entire presequence may function as a mitochondrial targeting signal. It is possible that LpSOD represents a new class of proteins in algae with complex plastids, which are co-targeted to the plastid and mitochondrion. In addition to the ambiguous N-terminal targeting signal, LpSOD contains a potential type-1 peroxisome-targeting signal (PTS1) located at its C-terminus. In accordance with a peroxisome localization of this dismutase, its mRNA has two in-frame AUG codons. Our bioinformatic analyses indicate that the first start codon resides in a much weaker oligonucleotide context than the second one. This suggests that synthesis of the plastid/mitochondrion-targeted and peroxisome-targeted isoforms could proceed through so-called leaky scanning. Moreover, our results show that expression of the two isoforms could be regulated by a 'hairpin' structure located between the first and second start codons.
Affiliation(s)
- Andrzej Bodył
- Department of Biodiversity and Evolutionary Taxonomy, Zoological Institute, University of Wrocław, ul. Przybyszewskiego 63/77, 51-148 Wrocław, Poland.
|
331
|
Prediction of protein submitochondria locations by hybridizing pseudo-amino acid composition with various physicochemical features of segmented sequence. BMC Bioinformatics 2006; 7:518. [PMID: 17134515 PMCID: PMC1716183 DOI: 10.1186/1471-2105-7-518] [Citation(s) in RCA: 137] [Impact Index Per Article: 7.2] [Received: 07/25/2006] [Accepted: 11/30/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Knowing the submitochondria localization of a mitochondria protein is an important step to understand its function. We develop a method which is based on an extended version of pseudo-amino acid composition to predict the protein localization within mitochondria. This work goes one step further than predicting protein subcellular location. We also try to predict the membrane protein type for mitochondrial inner membrane proteins. RESULTS By using leave-one-out cross validation, the prediction accuracy is 85.5% for inner membrane, 94.5% for matrix and 51.2% for outer membrane. The overall prediction accuracy for submitochondria location prediction is 85.2%. For proteins predicted to localize at inner membrane, the accuracy is 94.6% for membrane protein type prediction. CONCLUSION Our method is an effective method for predicting protein submitochondria location. But even with our method or the methods at subcellular level, the prediction of protein submitochondria location is still a challenging problem. The online service SubMito is now available at: http://bioinfo.au.tsinghua.edu.cn/subMito.
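The leave-one-out evaluation mentioned in this abstract, with per-class and overall accuracy, can be sketched as below. The 1-nearest-neighbor classifier and the two-dimensional toy features are stand-ins chosen for brevity; they are assumptions and not the classifier or the pseudo-amino acid features used by SubMito.

```python
from collections import defaultdict
from math import dist


def one_nn(query, training):
    """Stand-in classifier: label of the closest training point."""
    return min(training, key=lambda item: dist(query, item[0]))[1]


def leave_one_out(samples, classify=one_nn):
    """Hold out each sample in turn, predict it from all the others."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for i, (x, y) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        total[y] += 1
        if classify(x, rest) == y:
            correct[y] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / len(samples)
    return per_class, overall


# Toy data: (feature vector, submitochondrial class). Values are made up.
samples = [
    ((0.0, 0.0), "inner"), ((0.0, 1.0), "inner"),
    ((5.0, 5.0), "matrix"), ((5.0, 6.0), "matrix"),
    ((10.0, 0.0), "outer"),
]
```

In this toy set the single "outer" sample can never be predicted correctly once it is held out, which mirrors how a small class can drag down per-class accuracy even when the overall figure looks good.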
|
332
|
Machine learning techniques in disease forecasting: a case study on rice blast prediction. BMC Bioinformatics 2006; 7:485. [PMID: 17083731 PMCID: PMC1647291 DOI: 10.1186/1471-2105-7-485] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.2] [Received: 04/19/2006] [Accepted: 11/03/2006] [Indexed: 11/22/2022] Open
Abstract
Background Diverse modeling approaches, viz. neural networks and multiple regression, have been followed to date for disease prediction in plant populations. However, due to their inability to predict the value of unknown data points and their longer training times, there is a need to exploit new prediction software for better understanding of plant-pathogen-environment relationships. Further, there is no online tool available which can help plant researchers or farmers in the timely application of control measures. This paper introduces a new prediction approach based on support vector machines for developing weather-based prediction models of plant diseases. Results Six significant weather variables were selected as predictor variables. Two series of models (cross-location and cross-year) were developed and validated using a five-fold cross validation procedure. For cross-year models, the conventional multiple regression (REG) approach achieved an average correlation coefficient (r) of 0.50, which increased to 0.60, and percent mean absolute error (%MAE) decreased from 65.42 to 52.24, when a back-propagation neural network (BPNN) was used. With a generalized regression neural network (GRNN), r increased to 0.70 and %MAE improved to 46.30; these further improved to r = 0.77 and %MAE = 36.66 when the support vector machine (SVM) based method was used. Similarly, cross-location validation achieved r = 0.48, 0.56 and 0.66 using REG, BPNN and GRNN respectively, with corresponding %MAE of 77.54, 66.11 and 58.26. The SVM-based method outperformed all three approaches by further increasing r to 0.74, with an improvement in %MAE to 44.12. Overall, this SVM-based prediction approach will open new vistas in the area of forecasting plant diseases of various crops. Conclusion Our case study demonstrated that SVM is better than existing machine learning techniques and conventional REG approaches in forecasting plant diseases. 
In this direction, we have also developed an SVM-based web server for rice blast prediction, a first of its kind worldwide, which can help the plant science community and farmers in their decision making process. The server is freely available at .
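The two evaluation measures quoted in this abstract, the correlation coefficient r and the percent mean absolute error, can be computed as in the sketch below. The abstract does not define %MAE, so the form used here (mean of the absolute error relative to each observed value, times 100) is an assumption, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = sqrt(sum((o - mo) ** 2 for o in observed))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (so * sp)


def percent_mae(observed, predicted):
    """Assumed definition: 100 * mean(|observed - predicted| / observed).

    Requires all observed values to be non-zero.
    """
    n = len(observed)
    return 100.0 * sum(abs(o - p) / o for o, p in zip(observed, predicted)) / n
```

A higher r and a lower %MAE both indicate a better fit, which is the direction of improvement the abstract reports from REG through BPNN and GRNN to SVM.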
|
333
|
Huang WL, Chen HM, Hwang SF, Ho SY. Accurate prediction of enzyme subfamily class using an adaptive fuzzy k-nearest neighbor method. Biosystems 2006; 90:405-13. [PMID: 17140725 DOI: 10.1016/j.biosystems.2006.10.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 06/15/2006] [Revised: 10/15/2006] [Accepted: 10/22/2006] [Indexed: 10/24/2022]
Abstract
Amphiphilic pseudo-amino acid composition (Am-Pse-AAC) with extra sequence-order information is a useful feature for representing enzymes. This study first utilizes the k-nearest neighbor (k-NN) rule to analyze the distribution of enzymes in the Am-Pse-AAC feature space. This analysis indicates that the distributions of multiple classes of enzymes overlap heavily. To cope with the overlap problem, this study proposes an efficient non-parametric classifier for predicting enzyme subfamily class using an adaptive fuzzy k-nearest neighbor (AFK-NN) method, where k and a fuzzy strength parameter m are adaptively specified. The fuzzy membership values of a query sample Q are dynamically determined according to the position of Q and its weighted distances to the k nearest neighbors. Using the same enzymes of the oxidoreductases family for comparisons, the prediction accuracy of AFK-NN is 76.6%, which is better than those of the Support Vector Machine (73.6%), the decision tree method C5.0 (75.4%) and the existing covariant-discriminant algorithm (70.6%) using a jackknife test. To evaluate the generalization ability of AFK-NN, datasets for all six families of entirely sequenced enzymes are established from the newly updated SWISS-PROT and ENZYME databases. The accuracy of AFK-NN on the new large-scale dataset of the oxidoreductases family is 83.3%, and the mean accuracy over the six families is 92.1%.
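The distance-weighted membership idea can be sketched with the standard fuzzy k-NN rule, in which each of the k nearest neighbors votes with weight 1/d^(2/(m-1)). This is a minimal illustration with fixed k and m and crisp training labels; the adaptive selection of k and m that distinguishes AFK-NN is not reproduced, and the toy data are made up.

```python
from math import dist


def fuzzy_knn_memberships(query, training, k=3, m=2.0, eps=1e-12):
    """Class memberships of `query` from its k nearest training samples.

    training: list of (feature_vector, label) pairs. Requires m > 1.
    eps avoids division by zero when the query coincides with a sample.
    """
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / (dist(query, x) ** exponent + eps) for x, _ in neighbors]
    total = sum(weights)
    memberships = {}
    for w, (_, label) in zip(weights, neighbors):
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships


# Toy data: two nearby samples of class "A" and one distant sample of class "B".
training = [((0.0, 0.0), "A"), ((1.0, 0.0), "A"), ((5.0, 5.0), "B")]
```

The memberships sum to one, so they can be read as graded class assignments; a query lying between overlapping classes receives comparable values for each instead of a hard vote.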
Affiliation(s)
- Wen-Lin Huang
- Institute of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
|
334
|
Zhang ZH, Wang ZH, Zhang ZR, Wang YX. A novel method for apoptosis protein subcellular localization prediction combining encoding based on grouped weight and support vector machine. FEBS Lett 2006; 580:6169-74. [PMID: 17069811 DOI: 10.1016/j.febslet.2006.10.017] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.7] [Received: 08/23/2006] [Revised: 09/25/2006] [Accepted: 10/06/2006] [Indexed: 11/25/2022]
Abstract
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. Based on the idea of coarse-grained description and grouping in physics, a new feature extraction method with grouped weight for protein sequences is presented and applied, together with a support vector machine, to apoptosis protein subcellular localization prediction. For the same training dataset and the same predictive algorithm, the overall prediction accuracy of our method in the jackknife test is 13.2% and 15.3% higher than the accuracy based on the amino acid composition and the instability index, respectively. For the "else" class of apoptosis proteins in particular, the increase in prediction accuracy is 41.7 and 33.3 percentage points, respectively. The experimental results show that the new feature extraction method efficiently extracts the structural information implicit in a protein sequence and that the method reaches satisfactory performance despite its simplicity. The overall prediction accuracy of the EBGW_SVM model on dataset ZD98 reaches 92.9% in the jackknife test, which is 8.2-20.4 percentage points higher than other existing models. For a new dataset, ZW225, the overall prediction accuracy of EBGW_SVM is 83.1%. These results imply that EBGW_SVM is a simple but efficient model for apoptosis protein subcellular location prediction.
Affiliation(s)
- Zhen-Hui Zhang
- Department of Mathematics and System Science, School of Science, National University of Defense Technology, 410073 Changsha, China.
|
335
|
Abstract
Because a protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vector derived from sequences; the second-level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.
Affiliation(s)
- Chin-Sheng Yu
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan, Republic of China
|
336
|
Song J, Burrage K. Predicting residue-wise contact orders in proteins by support vector regression. BMC Bioinformatics 2006; 7:425. [PMID: 17014735 PMCID: PMC1618864 DOI: 10.1186/1471-2105-7-425] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Received: 05/26/2006] [Accepted: 10/03/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and a root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance, with a CC of 0.57 and an RMSE of 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and an RMSE of 0.78, which is at least comparable with the other existing methods. CONCLUSION The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the view that support vector regression is a powerful tool for extracting the protein sequence-structure relationship and for estimating protein structural profiles from amino acid sequences.
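The two figures of merit quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between predicted and observed values, can be illustrated with a minimal sketch. The function names `pearson_cc` and `rmse` are hypothetical, and the sketch assumes plain unnormalised values; the paper may normalise RWCO values before computing these measures.

```python
import math


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))
```

A higher CC and a lower RMSE both indicate closer agreement, which is why the abstract reports the two together.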
Affiliation(s)
- Jiangning Song
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
- Kevin Burrage
- Advanced Computational Modelling Centre, The University of Queensland, Brisbane Qld 4072, Australia
|
337
|
Chou KC, Shen HB. Predicting protein subcellular location by fusing multiple classifiers. J Cell Biochem 2006; 99:517-27. [PMID: 16639720 DOI: 10.1002/jcb.20879] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.1] [Indexed: 11/11/2022]
Abstract
One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. Knowledge of subcellular locations of proteins can provide key hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, it is both time-consuming and expensive to determine the localization of an uncharacterized protein in a living cell purely based on experiments. With the avalanche of newly found protein sequences emerging in the post-genomic era, we are facing a critical challenge: how to develop an automated method that identifies their subcellular locations quickly and reliably, so that they can be used in a timely manner for basic research and drug discovery. In view of this, an ensemble classifier was developed by fusing many basic individual classifiers through a voting system. Each of these basic classifiers was trained in a different dimension of the amphiphilic pseudo amino acid composition (Chou [2005] Bioinformatics 21: 10-19). As a demonstration, predictions were performed with the fusion classifier for proteins among the following 14 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. The overall success rates thus obtained via the resubstitution test, jackknife test, and independent dataset test were all significantly higher than those obtained with the existing classifiers. It is anticipated that the novel ensemble classifier may also become a very useful vehicle for classifying other attributes of proteins according to their sequences, such as membrane protein type, enzyme family/sub-family, G-protein coupled receptor (GPCR) type, and structural class, among many others.
The fusion ensemble classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
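The idea of fusing individual classifiers through a voting system can be sketched as follows. This is only an illustration under stated assumptions: the function name `fuse_by_vote`, the optional per-classifier weights, and the tie-breaking rule (earliest prediction wins) are hypothetical choices, not details taken from the paper.

```python
from collections import Counter


def fuse_by_vote(predictions, weights=None):
    """Return the label with the largest (optionally weighted) vote.

    predictions: one predicted label per base classifier.
    weights: optional per-classifier vote weights (assumed, default 1.0 each).
    Ties are broken in favour of the label that appears first in predictions.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("weights and predictions must have the same length")

    tally = Counter()
    for label, weight in zip(predictions, weights):
        tally[label] += weight

    best = max(tally.values())
    for label in predictions:
        if tally[label] == best:
            return label
```

For example, three base classifiers voting "nucleus", "cytoplasm", "nucleus" would yield "nucleus" under equal weights.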
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, 13784 Torrey Del Mar, San Diego, California 92130, USA.
|
338
|
Zhang T, Ding Y, Chou KC. Prediction of protein subcellular location using hydrophobic patterns of amino acid sequence. Comput Biol Chem 2006; 30:367-71. [PMID: 16963318 DOI: 10.1016/j.compbiolchem.2006.08.003] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Received: 03/24/2006] [Accepted: 08/03/2006] [Indexed: 11/17/2022]
Abstract
The function of a eukaryotic protein is closely correlated with its subcellular location. The number of newly found protein sequences entering data banks is rapidly increasing with the success of the human genome project. It is highly desirable to predict a protein's subcellular location automatically from its amino acid sequence. In this paper, amino acid hydrophobic patterns and average power-spectral density (APSD) are introduced to define pseudo amino acid composition. The covariant-discriminant predictor is used to predict subcellular location. An immune-genetic algorithm (IGA) is used to find the fittest weight factors, which are very important in this method. As such, high success rates are obtained by both the self-consistency test (86%) and the jackknife test (73%). More than 80% predictive accuracy is achieved in the independent dataset test. The results demonstrate that the proposed method is practical. The method also shows that protein subcellular location can be predicted from the surface physico-chemical characteristics of protein folding.
Affiliation(s)
- Tongliang Zhang
- Bio-Informatics Research Center, College of Information Sciences and Technology, Donghua University, Shanghai 201620, PR China
|
339
|
Wang Y, Xue Z, Xu J. Better prediction of the location of alpha-turns in proteins with support vector machine. Proteins 2006; 65:49-54. [PMID: 16894602 DOI: 10.1002/prot.21062] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
We have developed a novel method named AlphaTurn to predict alpha-turns in proteins based on the support vector machine (SVM). The prediction was done on a data set of 469 nonhomologous proteins containing 967 alpha-turns. A great improvement in prediction performance was achieved by using multiple sequence alignment generated by PSI-BLAST as input instead of the single amino acid sequence. The introduction of secondary structure information predicted by PSIPRED also improved the prediction performance. Moreover, we handled the very uneven data set by combining the cost factor j with the "state-shifting" rule. This further improved the prediction quality of our method. The final SVM model yielded a Matthews correlation coefficient (MCC) of 0.25 in a 10-fold cross-validation. To our knowledge, this MCC value is the highest obtained so far for predicting alpha-turns. An online Web server based on this method has been developed and can be freely accessed at http://bmc.hust.edu.cn/bioinformatics/ or http://210.42.106.80/.
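For readers unfamiliar with the Matthews correlation coefficient reported here, a minimal sketch of its standard definition from confusion-matrix counts follows. The function name `mcc` is hypothetical, and returning 0.0 when the denominator vanishes is a common convention assumed here rather than something stated in the abstract.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives, false positives,
    false negatives. Returns 0.0 when any marginal total is zero
    (assumed convention for the otherwise undefined case).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Because MCC uses all four cells of the confusion matrix, it stays informative on very uneven data sets such as turn versus non-turn residues, where plain accuracy can be misleading.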
Affiliation(s)
- Yan Wang
- Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan City, China
|
340
|
Gardy JL, Brinkman FSL. Methods for predicting bacterial protein subcellular localization. Nat Rev Microbiol 2006; 4:741-51. [PMID: 16964270 DOI: 10.1038/nrmicro1494] [Citation(s) in RCA: 120] [Impact Index Per Article: 6.3] [Indexed: 01/08/2023]
Abstract
The computational prediction of the subcellular localization of bacterial proteins is an important step in genome annotation and in the search for novel vaccine or drug targets. Since the 1991 release of PSORT I, the first comprehensive algorithm to predict bacterial protein localization, many other localization prediction tools have been developed. These methods offer significant improvements in predictive performance over PSORT I, and the accuracy of some methods now rivals that of certain high-throughput laboratory methods for protein localization identification.
Affiliation(s)
- Jennifer L Gardy
- Centre for Microbial Diseases and Immunity Research, University of British Columbia, Vancouver, British Columbia, V6T 1Z4 Canada
|
341
|
Lee K, Kim DW, Na D, Lee KH, Lee D. PLPD: reliable protein localization prediction from imbalanced and overlapped datasets. Nucleic Acids Res 2006; 34:4655-66. [PMID: 16966337 PMCID: PMC1636404 DOI: 10.1093/nar/gkl638] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient prediction method for protein subcellular localization is highly desirable owing to the need for large-scale genome analysis. From a machine learning point of view, a dataset of protein localization has several characteristics: the dataset has many classes (there are more than 10 localizations in a cell), it is a multi-label dataset (a protein may occur in several different subcellular locations), and it is highly imbalanced (the number of proteins in each localization is remarkably different). Although many methods have been proposed for the prediction of protein subcellular localization, none of them effectively tackles all these characteristics at the same time. Thus, a new computational method for protein localization is needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD), which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. Results on various datasets derived from the experiments of Huh et al. (2003) indicate that the proposed PLPD method represents a different approach that might play a complementary role to existing methods, such as the Nearest Neighbor method and the discriminate covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted localizations for 138 proteins whose subcellular localizations could not be clearly observed in the experiments of Huh et al. (2003).
Affiliation(s)
- KiYoung Lee
- Department of BioSystems, KAIST, Daejeon City, Republic of Korea
- Advanced Information Technology Research Center, KAIST, Daejeon City, Republic of Korea
- Dae-Won Kim
- School of Computer Science and Engineering, Chung-Ang University, Seoul City, Republic of Korea
- DoKyun Na
- Department of BioSystems, KAIST, Daejeon City, Republic of Korea
- Kwang H. Lee
- Department of BioSystems, KAIST, Daejeon City, Republic of Korea
- Advanced Information Technology Research Center, KAIST, Daejeon City, Republic of Korea
- Doheon Lee
- Department of BioSystems, KAIST, Daejeon City, Republic of Korea
- To whom correspondence should be addressed. Tel: +82 42 869 4316; Fax: +82 42 869 8680
|
342
|
Guo J, Lin Y, Liu X. GNBSL: A new integrative system to predict the subcellular location for Gram-negative bacteria proteins. Proteomics 2006; 6:5099-105. [PMID: 16955516 DOI: 10.1002/pmic.200600064] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.5] [Indexed: 11/06/2022]
Abstract
This paper proposes a new integrative system (GNBSL, Gram-negative bacteria subcellular localization) for subcellular localization prediction, specialized for Gram-negative bacterial proteins. First, the system generates a position-specific frequency matrix (PSFM) and a position-specific scoring matrix (PSSM) for each protein sequence by searching the Swiss-Prot database. Then different features are extracted by four modules from the PSFM and the PSSM. The features include whole-sequence amino acid composition, N- and C-terminus amino acid composition, dipeptide composition, and segment composition. Four probabilistic neural network (PNN) classifiers are used to classify these modules. To further improve the performance, two modules trained by support vector machine (SVM) are added to this system. One module extracts the residue-couple distribution from the amino acid sequence and the other module applies a pairwise profile alignment kernel to measure the local similarity between every two sequences. Finally, an additional SVM is used to fuse the outputs from the six modules. Tests on a benchmark dataset show that the overall success rate of GNBSL is higher than those of PSORT-B, CELLO, and PSLpred. The GNBSL web server can be accessed at http://166.111.24.5/webtools/GNBSL/index.htm.
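Two of the feature types named in this abstract, whole-sequence amino acid composition and dipeptide composition, can be sketched as below. Note the simplifying assumption: GNBSL extracts its features from the PSFM and PSSM profiles, whereas this sketch counts residues directly in the raw sequence. The function names and the alphabetical residue ordering are hypothetical choices made for illustration.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Non-standard characters (for example 'X') are ignored.
    """
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    total = 0
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
            total += 1
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent residue-pair frequencies.

    Pairs containing a non-standard character are ignored.
    """
    keys = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    seq = sequence.upper()
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]
```

Applying `aa_composition` to only the first or last residues of a sequence would give a raw-sequence analogue of the N- and C-terminus composition features; the window length to use is not given in the abstract.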
Affiliation(s)
- Jian Guo
- Department of Mathematical Sciences, Laboratory of Statistical Computing & Bioinformatics, Tsinghua University, Beijing, P R China.
|
343
|
Abstract
Cytokine-induced apoptosis inhibitor 1 (CIAPIN1) is a newly identified anti-apoptotic molecule. Our previous studies have demonstrated that CIAPIN1 is ubiquitously expressed in normal fetal and adult human tissues and confers multidrug resistance in gastric cancer cells, possibly by upregulating the expression of multidrug resistance gene 1 and multidrug resistance-related protein 1. However, fundamental biological functions of CIAPIN1 have not been elucidated. In this study, we first predicted the subcellular localization of CIAPIN1 with bioinformatic approaches and then characterized the intracellular localization of CIAPIN1 in both human and mouse cells by a combination of techniques including (a) immunohistochemistry and immunofluorescence, (b) His-tagged CIAPIN1 expression, and (c) subcellular fractionation and analysis of CIAPIN1 in the fractions by Western blotting. All methods produced consistent results: CIAPIN1 was localized in both the cytoplasm and the nucleus and accumulated in the nucleolus. Bioinformatic prediction disclosed a putative nuclear localization signal and a putative nuclear export signal within both human and mouse CIAPIN1. These findings suggest that CIAPIN1 may undergo a cytoplasm-nucleus-nucleolus translocation.
Affiliation(s)
- Zhiming Hao
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
- Xiaohua Li
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
- Taidong Qiao
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
- Rui Du
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
- Guoyun Zhang
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
- Daiming Fan
- State Key Laboratory of Cancer Biology, Institute of Digestive Diseases, Xijing Hospital, The Fourth Military Medical University, Xi'an, Shaanxi Province, China
|
344
|
Azizi AA, Gelpi E, Yang JW, Rupp B, Godwin AK, Slater C, Slavc I, Lubec G. Mass spectrometric identification of serine hydrolase OVCA2 in the medulloblastoma cell line DAOY. Cancer Lett 2006; 241:235-49. [PMID: 16368187 DOI: 10.1016/j.canlet.2005.10.023] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 08/17/2005] [Revised: 10/14/2005] [Accepted: 10/17/2005] [Indexed: 11/18/2022]
Abstract
OVCA2 is a putative serine hydrolase. In protein profiling of human tumour cell lines, OVCA2 was detected in DAOY medulloblastoma cells as a high-abundance protein. The protein was unambiguously identified by 2D gel electrophoresis, MALDI-MS and MS/MS, and its presence was confirmed by western blotting. Immunohistochemistry revealed expression in medulloblastoma and predominantly in oligodendrocytes. Computational approaches predicted functional motifs and domains, interaction with the apoptosis-related protein BAG, and 3D structure. In addition to its presence in medulloblastoma, OVCA2 was detectable as a high-abundance protein in three out of 10 human tumour cell lines, suggesting a possible role in tumour biology.
Affiliation(s)
- Amedeo A Azizi
- Department of Pediatrics, Medical University of Vienna, Währinger Gürtel 19-21, A-1090 Vienna, Austria
|
345
|
Wang Y, Xue ZD, Shi XH, Xu J. Prediction of π-turns in proteins using PSI-BLAST profiles and secondary structure information. Biochem Biophys Res Commun 2006; 347:574-80. [PMID: 16844090 DOI: 10.1016/j.bbrc.2006.06.066] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/25/2006] [Accepted: 06/14/2006] [Indexed: 11/28/2022]
Abstract
Due to the structural and functional importance of tight turns, some methods have been proposed to predict gamma-turns, beta-turns, and alpha-turns in proteins. Pi-turns have been studied in the past, but no prediction approach has been developed so far. It would be useful to develop a method for identifying pi-turns in a protein sequence. In this paper, the support vector machine (SVM) method is introduced to predict pi-turns from the amino acid sequence. The training and testing of this approach are performed with a newly collected data set of 640 non-homologous protein chains containing 1931 pi-turns. Different sequence encoding schemes have been explored in order to investigate their effects on the prediction performance. With multiple sequence alignment and predicted secondary structure, the final SVM model yields a Matthews correlation coefficient (MCC) of 0.556 in a 7-fold cross-validation. A web server implementing the prediction method is available at the following URL: http://210.42.106.80/piturn/.
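The k-fold cross-validation protocol mentioned here (7-fold in this study) partitions the data into k disjoint test folds, each used once while the remaining folds serve for training. The sketch below is illustrative only: the function name `k_fold_indices` is hypothetical, samples are assigned to folds by simple interleaving without shuffling, and the abstract does not say how the 640 chains were actually partitioned.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Sample i is assigned to fold i % k (interleaved, no shuffling), so fold
    sizes differ by at most one when n_samples is not divisible by k.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    folds = [list(range(start, n_samples, k)) for start in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For protein data, splitting by whole chains rather than by individual residues keeps residues of one chain from appearing in both training and test sets; that is a design consideration assumed here, not a detail confirmed by the abstract.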
Affiliation(s)
- Yan Wang
- Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan City, China.
|
346
|
Abstract
The pTARGET web server enables prediction of nine distinct protein subcellular localizations in eukaryotic non-plant species. Predictions are made using a new algorithm [C. Guda and S. Subramaniam (2005) pTARGET [corrected] a new method for predicting protein subcellular localization in eukaryotes. Bioinformatics, 21, 3963–3969], which is primarily based on the occurrence patterns of location-specific protein functional domains in different subcellular locations. We have implemented a relational database, PreCalcDB, to store pre-computed prediction results for all eukaryotic non-plant protein sequences in the public domain that includes about 770 000 entries. Queries can be made by entering protein sequences or by uploading a file containing up to 5000 protein sequences in FASTA format. Prediction results for queries with matching entries in the PreCalcDB will be retrieved instantly; while for the missing ones new predictions will be computed and sent by email. Pre-computed predictions can also be downloaded for complete proteomes of Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila, Mus musculus and Homo sapiens. The server, its documentation and the data are accessible from .
Affiliation(s)
- Chittibabu Guda
- Gen*NY*sis Center for Excellence in Cancer Genomics and Department of Epidemiology and Biostatistics, University at Albany, State University of New York, 1 Discovery drive, Rensselaer, NY 12144-3456, USA
|
347
|
Haveman SA, Holmes DE, Ding YHR, Ward JE, Didonato RJ, Lovley DR. c-Type cytochromes in Pelobacter carbinolicus. Appl Environ Microbiol 2006; 72:6980-5. [PMID: 16936056 PMCID: PMC1636167 DOI: 10.1128/aem.01128-06] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022] Open
Abstract
Previous studies failed to detect c-type cytochromes in Pelobacter species despite the fact that other close relatives in the Geobacteraceae, such as Geobacter and Desulfuromonas species, have abundant c-type cytochromes. Analysis of the recently completed genome sequence of Pelobacter carbinolicus revealed 14 open reading frames that could encode c-type cytochromes. Transcripts for all but one of these open reading frames were detected in acetoin-fermenting and/or Fe(III)-reducing cells. Three putative c-type cytochrome genes were expressed specifically during Fe(III) reduction, suggesting that the encoded proteins may participate in electron transfer to Fe(III). One of these proteins was a periplasmic triheme cytochrome with a high level of similarity to PpcA, which has a role in Fe(III) reduction in Geobacter sulfurreducens. Genes for heme biosynthesis and system II cytochrome c biogenesis were identified in the genome and shown to be expressed. Sodium dodecyl sulfate-polyacrylamide gel electrophoresis gels of protein extracted from acetoin-fermenting P. carbinolicus cells contained three heme-staining bands which were confirmed by mass spectrometry to be among the 14 predicted c-type cytochromes. The number of cytochrome genes, the predicted amount of heme c per protein, and the ratio of heme-stained protein to total protein were much smaller in P. carbinolicus than in G. sulfurreducens. Furthermore, many of the c-type cytochromes that genetic studies have indicated are required for optimal Fe(III) reduction in G. sulfurreducens were not present in the P. carbinolicus genome. These results suggest that further evaluation of the functions of c-type cytochromes in the Geobacteraceae is warranted.
Affiliation(s)
- Shelley A Haveman
- Department of Microbiology, University of Massachusetts, Amherst, MA 01003, USA.
|
348
|
Ofran Y, Margalit H. Proteins of the same fold and unrelated sequences have similar amino acid composition. Proteins 2006; 64:275-9. [PMID: 16565950 DOI: 10.1002/prot.20964] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/11/2022]
Abstract
It is well established that there is a relationship between the amino acid composition of a protein and its structural class (i.e., alpha, beta, alpha + beta, or alpha/beta). Several studies have even shown the power of amino acid composition in predicting the secondary structure class of a protein. Herein, we show that significant similarity in amino acid composition exists not only between proteins of the same class, but even between proteins of the same fold. To test conjectural explanations for this phenomenon, we analyzed a set of structurally similar proteins that are dissimilar in sequence. Based on this analysis, we suggest that specific residues that are involved in intramolecular interactions may account for this surprising relationship between composition and structure.
Affiliation(s)
- Yanay Ofran
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
|
349
|
|
350
|
Cruz-Monteagudo M, González-Díaz H, Uriarte E. Simple Stochastic Fingerprints Towards Mathematical Modeling in Biology and Medicine 2. Unifying Markov Model for Drugs Side Effects. Bull Math Biol 2006; 68:1527-54. [PMID: 16847720 DOI: 10.1007/s11538-005-9013-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 01/17/2005] [Accepted: 05/09/2005] [Indexed: 10/24/2022]
Abstract
Most present mathematical models for biological activity consider just the molecular structure. In the present article we aim to extend the use of Markov chain models to define novel molecular descriptors that also consider other parameters, such as target site or biological effect. Specifically, this mathematical model takes into consideration not only the molecular structure but also the specific biological system the drug affects. Herein, a general Markov model is developed that describes 19 different drug side effects grouped into eight affected biological systems for 178 drugs, giving 270 cases in total. The data were processed by linear discriminant analysis (LDA), classifying drugs according to their specific side effects; forward stepwise selection was used as the strategy for variable selection. The average percentage of good classification and number of compounds used in the training/predicting sets were 100/95.8% for endocrine manifestations, (18 out of 18)/(13 out of 14); 90.5/92.3% for gastrointestinal manifestations, (38 out of 42)/(30 out of 32); 88.5/86.5% for systemic phenomena, (23 out of 26)/(17 out of 20); 81.8/77.3% for neurological manifestations, (27 out of 33)/(19 out of 25); 81.6/86.2% for dermal manifestations, (31 out of 38)/(25 out of 29); 78.4/85.1% for cardiovascular manifestations, (29 out of 37)/(24 out of 28); 77.1/75.7% for breathing manifestations, (27 out of 35)/(20 out of 26) and 75.6/75% for psychiatric manifestations, (31 out of 41)/(23 out of 31). Additionally, a back-projection analysis (BPA) was carried out for two ulcerogenic drugs to demonstrate in structural terms the physical interpretation of the models obtained. This article develops, for the first time, a mathematical model that encompasses a large number of drug side effects grouped into specific biological systems using stochastic absolute probabilities of interaction ((A)pi(k)(j)).
Affiliation(s)
- Maykel Cruz-Monteagudo
- Applied Chemistry Research Center and Chemical Bioactives Center, Central University of Las Villas, Santa Clara, 54830, Cuba
|