451
Chou KC, Cai YD. Predicting protein structural class by functional domain composition. Biochem Biophys Res Commun 2004; 321:1007-9. [PMID: 15358128 DOI: 10.1016/j.bbrc.2004.07.059] [Citation(s) in RCA: 144] [Impact Index Per Article: 6.9] [Received: 07/02/2004] [Indexed: 11/16/2022]
Abstract
The functional domain composition is introduced to predict the structural class of a protein or domain according to the following classification: all-α, all-β, α/β, α+β, μ (multi-domain), σ (small protein), and ρ (peptide). The advantage of doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset consisting of proteins and domains with less than 20% sequence identity to each other, in order to remove any homology bias. The overall success rate thus obtained was 98%. In contrast, the corresponding rates obtained by the simple geometry approaches based on the amino acid composition were only 36-39%. This indicates that using the functional domain composition to represent the sample of a protein for statistical prediction is very promising, and that the functional type of a domain is closely correlated with its structural class.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
452
Abstract
MOTIVATION With protein sequences entering into databanks at an explosive pace, the early determination of the family or subfamily class for a newly found enzyme molecule becomes important because this is directly related to the detailed information about which specific target it acts on, as well as to its catalytic process and biological function. Unfortunately, it is both time-consuming and costly to do so by experiments alone. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone without including any sequence-order information. Therefore, it is worthy of further investigation. RESULTS To incorporate the sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2λ discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on such a concept and formulation scheme, a new predictor is developed. It is shown by the self-consistency test, jackknife test and independent dataset tests that the success rates obtained by the new predictor are all significantly higher than those by the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function.
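The 20 + 2λ representation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which property scales, weight, or normalization were used, so the Kyte-Doolittle hydropathy and Hopp-Woods hydrophilicity scales, the `weight` parameter, and the function names below are all assumptions.

```python
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed stand-in scales (the abstract does not name the ones actually used).
KYTE_DOOLITTLE = dict(zip(AMINO_ACIDS, [
    1.8, 2.5, -3.5, -3.5, 2.8, -0.4, -3.2, 4.5, -3.9, 3.8,
    1.9, -3.5, -1.6, -3.5, -4.5, -0.8, -0.7, 4.2, -0.9, -1.3]))
HOPP_WOODS = dict(zip(AMINO_ACIDS, [
    -0.5, -1.0, 3.0, 3.0, -2.5, 0.0, -0.5, -1.8, 3.0, -1.8,
    -1.3, 0.2, 0.0, 0.2, 3.0, 0.3, -0.4, -1.5, -3.4, -2.3]))


def standardize(scale):
    """Rescale a property to zero mean and unit spread over the 20 residues."""
    values = [scale[a] for a in AMINO_ACIDS]
    mu, sd = mean(values), pstdev(values)
    return {a: (scale[a] - mu) / sd for a in AMINO_ACIDS}


def amphiphilic_pseaac(seq, hydrophobicity, hydrophilicity, lam=3, weight=0.5):
    """Return a vector of 20 + 2*lam numbers for a protein sequence.

    The first 20 entries come from the amino acid composition; the next
    2*lam are sequence-order correlation factors, two per residue gap k
    (one per property scale). `weight` is a hypothetical balancing factor.
    """
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h1, h2 = standardize(hydrophobicity), standardize(hydrophilicity)
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        for h in (h1, h2):
            # Average product of the property values of residues k apart.
            total = sum(h[seq[i]] * h[seq[i + k]] for i in range(n - k))
            taus.append(total / (n - k))
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


vector = amphiphilic_pseaac("MKTAYIAKQRQISFVKSHFSRQ", KYTE_DOOLITTLE, HOPP_WOODS, lam=3)
print(len(vector))  # 20 + 2 * 3 components
```

Dividing every component by the same denominator keeps the whole vector summing to one, so the composition part and the correlation part stay on a comparable scale.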
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
453
Bendtsen JD, Nielsen H, von Heijne G, Brunak S. Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 2004; 340:783-95. [PMID: 15223320 DOI: 10.1016/j.jmb.2004.05.028] [Citation(s) in RCA: 5185] [Impact Index Per Article: 246.9] [Received: 03/04/2004] [Revised: 05/17/2004] [Accepted: 05/17/2004] [Indexed: 10/26/2022]
Abstract
We describe improvements of the currently most popular method for prediction of classically secreted proteins, SignalP. SignalP consists of two different predictors based on neural network and hidden Markov model algorithms, where both components have been updated. Motivated by the idea that the cleavage site position and the amino acid composition of the signal peptide are correlated, new features have been included as input to the neural network. This addition, combined with a thorough error-correction of a new data set, has improved the performance of the predictor significantly over SignalP version 2. In version 3, correctness of the cleavage site predictions has increased notably for all three organism groups: eukaryotes, Gram-negative and Gram-positive bacteria. The accuracy of cleavage site prediction has increased in the range 6-17% over the previous version, whereas the signal peptide discrimination improvement is mainly due to the elimination of false-positive predictions, as well as the introduction of a new discrimination score for the neural network. The new method has been benchmarked against other available methods. Predictions can be made at the publicly available web server
Affiliation(s)
- Jannick Dyrløv Bendtsen
- Center for Biological Sequence Analysis, BioCentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark
454
Bhasin M, Raghava GPS. GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res 2004; 32:W383-9. [PMID: 15215416 PMCID: PMC441554 DOI: 10.1093/nar/gkh416] [Citation(s) in RCA: 97] [Impact Index Per Article: 4.6] [Received: 02/12/2004] [Revised: 04/02/2004] [Accepted: 04/02/2004] [Indexed: 11/13/2022] Open
Abstract
G-protein coupled receptors (GPCRs) belong to one of the largest superfamilies of membrane proteins and are important targets for drug design. In this study, a support vector machine (SVM)-based method, GPCRpred, has been developed for predicting families and subfamilies of GPCRs from the dipeptide composition of proteins. The dataset used in this study for training and testing was obtained from http://www.soe.ucsc.edu/research/compbio/gpcr/. The method classified GPCRs and non-GPCRs with an accuracy of 99.5% when evaluated using 5-fold cross-validation. The method is further able to predict five major classes or families of GPCRs with an overall Matthews correlation coefficient (MCC) and accuracy of 0.81 and 97.5%, respectively. In recognizing the subfamilies of the rhodopsin-like family, the method achieved an average MCC and accuracy of 0.97 and 97.3%, respectively. The method achieved overall accuracy of 91.3% and 96.4% at family and subfamily level, respectively, when evaluated on an independent/blind dataset of 650 GPCRs. A server for recognition and classification of GPCRs based on multiclass SVMs has been set up at http://www.imtech.res.in/raghava/gpcrpred/. We have also suggested subfamilies for 42 sequences which were previously identified as unclassified Class A GPCRs. The supplementary information is available at http://www.imtech.res.in/raghava/gpcrpred/info.html.
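The dipeptide composition mentioned in this abstract is a fixed-length, 400-dimensional description of a sequence (20 × 20 ordered residue pairs). A minimal sketch follows; the function name and the choice to skip pairs containing non-standard residue letters are assumptions, and the resulting vector would only be the input to an SVM, which is not shown here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so every protein maps to the same layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides among adjacent residue pairs."""
    total = len(seq) - 1  # number of overlapping adjacent pairs
    if total < 1:
        raise ValueError("sequence needs at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard letters are ignored
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


features = dipeptide_composition("ACAC")
print(len(features))  # 400
```

For the toy sequence `ACAC` the three adjacent pairs are AC, CA and AC, so the AC component is 2/3 and the CA component is 1/3.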
Affiliation(s)
- Manoj Bhasin
- Institute of Microbial Technology Sector 39-A, Chandigarh, 160036, India
455
Nair R, Rost B. LOCnet and LOCtarget: sub-cellular localization for structural genomics targets. Nucleic Acids Res 2004; 32:W517-21. [PMID: 15215440 PMCID: PMC441579 DOI: 10.1093/nar/gkh441] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Received: 02/15/2004] [Revised: 03/26/2004] [Accepted: 04/16/2004] [Indexed: 11/14/2022] Open
Abstract
LOCtarget is a web server and database that predicts and annotates sub-cellular localization for structural genomics targets; LOCnet is one of the methods used in LOCtarget that can predict sub-cellular localization for all eukaryotic and prokaryotic proteins. Targets are taken from the central registration database for structural genomics, namely, TargetDB. LOCtarget predicts localization through a combination of four different methods: known nuclear localization signals (PredictNLS), homology-based transfer of experimental annotations (LOChom), inference through automatic text analysis of SWISS-PROT keywords (LOCkey) and de novo prediction through a system of neural networks (LOCnet). Additionally, we report predictions from SignalP. The final prediction is based on the method with the highest confidence. The web server can be used to predict sub-cellular localization of proteins from their amino acid sequence. The LOCtarget database currently contains localization predictions for all eukaryotic proteins from TargetDB and is updated every week. The server is available at http://www.rostlab.org/services/LOCtarget/.
Affiliation(s)
- Rajesh Nair
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
456
Bhasin M, Raghava GPS. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res 2004; 32:W414-9. [PMID: 15215421 PMCID: PMC441488 DOI: 10.1093/nar/gkh350] [Citation(s) in RCA: 215] [Impact Index Per Article: 10.2] [Received: 01/14/2004] [Accepted: 02/09/2004] [Indexed: 11/13/2022] Open
Abstract
Automated prediction of subcellular localization of proteins is an important step in the functional annotation of genomes. The existing subcellular localization prediction methods are based on either amino acid composition or N-terminal characteristics of the proteins. In this paper, support vector machine (SVM) has been used to predict the subcellular location of eukaryotic proteins from their different features such as amino acid composition, dipeptide composition and physico-chemical properties. The SVM module based on dipeptide composition performed better than the SVM modules based on amino acid composition or physico-chemical properties. In addition, PSI-BLAST was also used to search the query sequence against the dataset of proteins (experimentally annotated proteins) to predict its subcellular location. In order to improve the prediction accuracy, we developed a hybrid module using all features of a protein, which consisted of an input vector of 458 dimensions (400 dipeptide compositions, 33 properties, 20 amino acid compositions of the protein and 5 from PSI-BLAST output). Using this hybrid approach, the prediction accuracies of nuclear, cytoplasmic, mitochondrial and extracellular proteins reached 95.3, 85.2, 68.2 and 88.9%, respectively. The overall prediction accuracy of SVM modules based on amino acid composition, physico-chemical properties, dipeptide composition and the hybrid approach was 78.1, 77.8, 82.9 and 88.0%, respectively. The accuracy of all the modules was evaluated using a 5-fold cross-validation technique. With a reliability index ≥3, 73.5% of predictions can be made with an accuracy of 96.4%. Based on the above approach, an online web server ESLpred was developed, which is available at http://www.imtech.res.in/raghava/eslpred/.
Affiliation(s)
- Manoj Bhasin
- Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India
457
Dönnes P, Höglund A, Sturm M, Comtesse N, Backes C, Meese E, Kohlbacher O, Lenhof HP. Integrative analysis of cancer-related data using CAP. FASEB J 2004; 18:1465-7. [PMID: 15231723 DOI: 10.1096/fj.04-1797fje] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Abstract
The development of human cancer is a highly complex process and can be considered the result of several combined events, such as genetic alterations, disturbance of signal transduction, or failure of immunological surveillance. Cancer-related databases usually focus on specific fields of research, e.g., cancer genetics or cancer immunology, whereas the complexity of cancer genesis requires an integrated analysis of heterogeneous data from several sources. Here we present the cancer-associated protein database (CAP), a novel analysis system for cancer-related data. CAP integrates data from multiple external databases, augments these data with functional annotations, and offers tools for statistical analysis of these data. We have employed CAP to analyze genes that have been found to cause an autoimmune response in cancer. In particular, we explored the connection between the autoimmune response, mutations, and overexpression of these genes. Our preliminary results suggest that mutations are not significant contributors to raising an antibody response against tumor antigens, whereas overexpression seems to play a more important role. We hereby demonstrate how different types of data can be integrated and analyzed successfully, providing interesting results. As the amount of available data is growing rapidly, a combined analysis will play an important role in exploring the genetic and immunological basis of cancer. CAP is freely available at the following web site: http://www.bioinf.uni-sb.de/CAP/.
Affiliation(s)
- Pierre Dönnes
- Center for Bioinformatics, Saarland University, Saarbrücken, Germany
458
Huang K, Murphy RF. Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics 2004; 5:78. [PMID: 15207009 PMCID: PMC449699 DOI: 10.1186/1471-2105-5-78] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.4] [Received: 01/26/2004] [Accepted: 06/18/2004] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Detailed knowledge of the subcellular location of each expressed protein is critical to a full understanding of its function. Fluorescence microscopy, in combination with methods for fluorescent tagging, is the most suitable current method for proteome-wide determination of subcellular location. Previous work has shown that neural network classifiers can distinguish all major protein subcellular location patterns in both 2D and 3D fluorescence microscope images. Building on these results, we evaluate here new classifiers and features to improve the recognition of protein subcellular location patterns in both 2D and 3D fluorescence microscope images. RESULTS We report here a thorough comparison of the performance on this problem of eight different state-of-the-art classification methods, including neural networks, support vector machines with linear, polynomial, radial basis, and exponential radial basis kernel functions, and ensemble methods such as AdaBoost, Bagging, and Mixtures-of-Experts. Ten-fold cross validation was used to evaluate each classifier with various parameters on different Subcellular Location Feature sets representing both 2D and 3D fluorescence microscope images, including new feature sets incorporating features derived from Gabor and Daubechies wavelet transforms. After optimal parameters were chosen for each of the eight classifiers, optimal majority-voting ensemble classifiers were formed for each feature set. Comparison of results for each image for all eight classifiers permits estimation of the lower bound classification error rate for each subcellular pattern, which we interpret to reflect the fraction of cells whose patterns are distorted by mitosis, cell death or acquisition errors. 
Overall, we obtained statistically significant improvements in classification accuracy over the best previously published results, with the overall error rate being reduced by one-third to one-half and with the average accuracy for single 2D images being higher than 90% for the first time. In particular, the classification accuracy for the easily confused endomembrane compartments (endoplasmic reticulum, Golgi, endosomes, lysosomes) was improved by 5-15%. We achieved further improvements when classification was conducted on image sets rather than on individual cell images. CONCLUSIONS The availability of accurate, fast, automated classification systems for protein location patterns in conjunction with high throughput fluorescence microscope imaging techniques enables a new subfield of proteomics, location proteomics. The accuracy and sensitivity of this approach represents an important alternative to low-resolution assignments by curation or sequence-based prediction.
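The majority-voting ensemble step in this abstract can be illustrated with a short sketch. It assumes each base classifier has already produced one label per image; the function name, the tie-breaking rule (the earliest classifier among the tied labels wins), and the toy labels are assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per image.

    `predictions[c][i]` is the label classifier c assigned to image i.
    Ties go to the label voted for by the earliest classifier (an assumed rule).
    """
    combined = []
    for votes in zip(*predictions):  # votes for a single image
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


# Toy outputs from three hypothetical classifiers on three images.
per_classifier = [
    ["ER", "Golgi", "ER"],
    ["ER", "ER", "lysosome"],
    ["Golgi", "Golgi", "endosome"],
]
print(majority_vote(per_classifier))  # ['ER', 'Golgi', 'ER']
```

The third image shows the tie case: all three classifiers disagree, so the first classifier's label is kept.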
Affiliation(s)
- Kai Huang
- Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Center for Automated Learning and Discovery, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Robert F Murphy
- Department of Biological Sciences, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Department of Biomedical Engineering, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
- Center for Automated Learning and Discovery, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, PA 15213 USA
459
Cui Q, Jiang T, Liu B, Ma S. Esub8: a novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinformatics 2004; 5:66. [PMID: 15163352 PMCID: PMC420457 DOI: 10.1186/1471-2105-5-66] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.1] [Received: 12/16/2003] [Accepted: 05/27/2004] [Indexed: 11/29/2022] Open
Abstract
Background Subcellular localization of a new protein sequence is very important and fruitful for understanding its function. As the number of new genomes has dramatically increased over recent years, a reliable and efficient system to predict protein subcellular location is urgently needed. Results Esub8 was developed to predict protein subcellular localizations for eukaryotic proteins based on amino acid composition. In this research, the proteins are classified into the following eight groups: chloroplast, cytoplasm, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus and peroxisome. We know subcellular localization is a typical classification problem; consequently, a one-against-one (1-v-1) multi-class support vector machine was introduced to construct the classifier. Unlike previous methods, ours considers the order information of protein sequences by a different method. Our method is tested in three subcellular localization predictions for prokaryotic proteins and four subcellular localization predictions for eukaryotic proteins on Reinhardt's dataset. The results are then compared to several other methods. The total prediction accuracies of two tests are both 100% by a self-consistency test, and are 92.9% and 84.14% by the jackknife test, respectively. Esub8 also provides excellent results: the total prediction accuracies are 100% by a self-consistency test and 87% by the jackknife test. Conclusions Our method represents a different approach for predicting protein subcellular localization and achieved a satisfactory result; furthermore, we believe Esub8 will be a useful tool for predicting protein subcellular localizations in eukaryotic organisms.
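The one-against-one scheme named in this abstract trains one binary classifier per pair of classes and lets them vote. A minimal sketch of the voting logic follows; `toy_decide` is a hypothetical placeholder for a trained pairwise SVM, and the score dictionary is made up for illustration only.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["chloroplast", "cytoplasm", "extracellular", "Golgi apparatus",
           "lysosome", "mitochondria", "nucleus", "peroxisome"]


def one_vs_one_predict(x, classes, decide):
    """Run every pairwise decider on x and return the class with the most votes.

    `decide(a, b, x)` must return either a or b. Ties in the vote count go
    to the class listed first (an assumed rule).
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[decide(a, b, x)] += 1
    return max(classes, key=lambda c: votes[c])


def toy_decide(a, b, x):
    """Hypothetical stand-in for a trained binary SVM: pick the higher score."""
    return a if x[a] >= x[b] else b


n_pairs = len(list(combinations(CLASSES, 2)))
print(n_pairs)  # 8 * 7 / 2 = 28 pairwise classifiers for eight classes

scores = {c: 0.0 for c in CLASSES}
scores["nucleus"] = 1.0
print(one_vs_one_predict(scores, CLASSES, toy_decide))  # nucleus
```

With eight location classes the scheme needs 28 binary classifiers, which is the main cost of one-against-one compared with training a single classifier per class.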
Affiliation(s)
- Qinghua Cui
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Tianzi Jiang
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Bing Liu
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
- Songde Ma
- National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, P. R. China
460
Ramos de Armas R, González Díaz H, Molina R, Uriarte E. Markovian Backbone Negentropies: Molecular descriptors for protein research. I. Predicting protein stability in Arc repressor mutants. Proteins 2004; 56:715-23. [PMID: 15281125 DOI: 10.1002/prot.20159] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.0] [Indexed: 11/08/2022]
Abstract
As more and more protein structures are determined and applied to drug manufacture, there is increasing interest in studying their stability. In this sense, developing novel computational methods to predict and study protein stability in relation to their amino acid sequences has become a significant goal in applied proteomics. In the study described here, Markovian Backbone Negentropies (MBN) have been introduced in order to model the effect on protein stability of a complete set of alanine substitutions in the Arc repressor. A total of 53 proteins were studied by means of Linear Discriminant Analysis using MBN as molecular descriptors. MBN are molecular descriptors based on a Markov chain model of electron delocalization throughout the protein backbone. The model correctly classified 43 out of 53 (81.13%) proteins according to their thermal stability. More specifically, the model classified 20/28 (71.4%) proteins with near wild-type stability and 23/25 (92%) proteins with reduced stability. Moreover, the model presented a good Matthews correlation coefficient of 0.643. Validation of the model was carried out by several jackknife procedures. The method compares favorably with surface-dependent and thermodynamic parameter stability scoring functions. For instance, the D-FIRE potential classification function shows a level of good classification of 76.9%. On the other hand, surface, volume, logP, and molar refractivity show accuracies of 70.7, 62.3, 59.0, and 60.0%, respectively.
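The accuracy and the 0.643 coefficient quoted in this abstract can be checked from the per-class counts it reports (20 of 28 near wild-type and 23 of 25 reduced-stability proteins classified correctly). The sketch below computes the Matthews correlation coefficient from a 2 × 2 confusion table; treating the near wild-type group as the "positive" class is an assumption, and it does not affect the result because the coefficient is unchanged when the two classes are swapped.

```python
from math import sqrt


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a two-class confusion table."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts derived from the abstract: 20/28 and 23/25 correct.
tp, fn = 20, 28 - 20   # near wild-type stability (assumed positive class)
tn, fp = 23, 25 - 23   # reduced stability

accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2))                      # 81.13
print(round(matthews_cc(tp, tn, fp, fn), 3))   # 0.643
```

Both printed values agree with the figures in the abstract, which also confirms that the reported statistic is the Matthews correlation coefficient.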
461
González Díaz H, Molina R, Uriarte E. Stochastic molecular descriptors for polymers. 1. Modelling the properties of icosahedral viruses with 3D-Markovian negentropies. POLYMER 2004. [DOI: 10.1016/j.polymer.2004.03.071] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.6] [Indexed: 11/26/2022]
462
Bhasin M, Raghava GPS. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem 2004; 279:23262-6. [PMID: 15039428 DOI: 10.1074/jbc.m401932200] [Citation(s) in RCA: 189] [Impact Index Per Article: 9.0] [Indexed: 11/06/2022] Open
Abstract
Nuclear receptors are key transcription factors that regulate crucial gene networks responsible for cell growth, differentiation, and homeostasis. Nuclear receptors form a superfamily of phylogenetically related proteins and control functions associated with major diseases (e.g. diabetes, osteoporosis, and cancer). In this study, a novel method has been developed for classifying the subfamilies of nuclear receptors. The classification was achieved on the basis of amino acid and dipeptide composition from a sequence of receptors using support vector machines. Training and testing were done on a non-redundant data set of 282 proteins obtained from the NucleaRDB database (1). The performance of all classifiers was evaluated using a 5-fold cross-validation test. In the 5-fold cross-validation, the data set was randomly partitioned into five equal sets and evaluated five times on each distinct set while keeping the remaining four sets for training. It was found that different subfamilies of nuclear receptors were quite closely correlated in terms of amino acid composition as well as dipeptide composition. The overall accuracies of the amino acid composition-based and dipeptide composition-based classifiers were 82.6 and 97.5%, respectively. Therefore, our results prove that different subfamilies of nuclear receptors are predictable with considerable accuracy using amino acid or dipeptide composition. Furthermore, based on the above approach, an online web service, NRpred, was developed, which is available at www.imtech.res.in/raghava/nrpred.
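The 5-fold cross-validation procedure described in this abstract can be sketched as a random partition into five folds, each used once for testing. The function name and the fixed seed are assumptions for illustration. Note that 282 items cannot be split into five exactly equal sets, so the folds below differ in size by at most one.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # random partition, reproducible
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i, test in enumerate(folds):
        # Train on the other k - 1 folds, test on the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(282, k=5))
print(sorted(len(test) for _, test in splits))  # [56, 56, 56, 57, 57]
```

Every protein lands in exactly one test fold, so each is evaluated once by a model that never saw it during training.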
Affiliation(s)
- Manoj Bhasin
- Institute of Microbial Technology, Chandigarh 160036, India
463
Kim H, Park H. Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins 2004; 54:557-62. [PMID: 14748002 DOI: 10.1002/prot.10602] [Citation(s) in RCA: 94] [Impact Index Per Article: 4.5] [Indexed: 11/08/2022]
Abstract
The prediction of protein relative solvent accessibility gives us helpful information for the prediction of tertiary structure of a protein. The SVMpsi method, which uses support vector machines (SVMs), and the position-specific scoring matrix (PSSM) generated from PSI-BLAST have been applied to achieve better prediction accuracy of the relative solvent accessibility. We have introduced a three-dimensional local descriptor that contains information about the expected remote contacts by both the long-range interaction matrix and neighbor sequences. Moreover, we applied feature weights to kernels in SVMs in order to consider the degree of significance that depends on the distance from the specific amino acid. Relative solvent accessibility based on a two-state model is predicted at 78.7%, 80.7%, 82.4%, and 87.4% accuracy for accessibility thresholds of 25%, 16%, 5%, and 0%, respectively. Three-state prediction reaches 64.5% accuracy with thresholds of 9% and 36%. The support vector machine approach has successfully been applied for solvent accessibility prediction by considering long-range interaction and handling unbalanced data.
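The two-state and three-state labels that such predictors are trained on come from thresholding a residue's relative solvent accessibility (RSA, in percent). The sketch below shows that labeling step only; the state names and the boundary conventions (strictly greater than the threshold counts as exposed, lower bound inclusive for the middle state) are assumptions, since the abstract gives only the threshold values.

```python
def two_state(rsa_percent, threshold):
    """Label a residue as buried or exposed at a given RSA threshold (percent)."""
    return "exposed" if rsa_percent > threshold else "buried"


def three_state(rsa_percent, low=9.0, high=36.0):
    """Label a residue as buried, intermediate or exposed using two thresholds."""
    if rsa_percent < low:
        return "buried"
    if rsa_percent < high:
        return "intermediate"
    return "exposed"


for rsa in (5.0, 20.0, 50.0):
    print(rsa, two_state(rsa, 16.0), three_state(rsa))
```

Lower thresholds leave fewer residues in the buried class, which makes the two classes more unbalanced; that is one reason accuracies at different thresholds are not directly comparable.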
Affiliation(s)
- Hyunsoo Kim
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota 55455, USA
464
Guo T, Hua S, Ji X, Sun Z. DBSubLoc: database of protein subcellular localization. Nucleic Acids Res 2004; 32:D122-4. [PMID: 14681374 PMCID: PMC308843 DOI: 10.1093/nar/gkh109] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.0] [Indexed: 11/13/2022] Open
Abstract
We have built a protein subcellular localization annotation database, the DBSubLoc database, which is available at http://www.bioinfo.tsinghua.edu.cn/dbsubloc.html. Annotations were taken from primary protein databases, model organism genome projects and literature texts, and then were analyzed to dig out the subcellular localization features of the proteins. The proteins are also classified into different categories. Based on sequence alignment, non-redundant subsets of the database have been built, which may provide useful information for subcellular localization prediction. The database now contains >60,000 protein sequences including approximately 30,000 protein sequences in the non-redundant data sets. Online download, search and Blast tools are also available.
Affiliation(s)
- Tao Guo
- Institute of Bioinformatics, Department of Biological Sciences and Biotechnology, Tsinghua University, Beijing 100084, China
465
Lin W, Yuan X, Yuen P, Wei WI, Sham J, Shi P, Qu J. Classification of in vivo autofluorescence spectra using support vector machines. JOURNAL OF BIOMEDICAL OPTICS 2004; 9:180-6. [PMID: 14715071 DOI: 10.1117/1.1628244] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.1] [Indexed: 05/19/2023]
Abstract
An algorithm based on support vector machines (SVM), the most recent advance in pattern recognition, is presented for use in classifying light-induced autofluorescence collected from cancerous and normal tissues. The in vivo autofluorescence spectra used for development and evaluation of SVM diagnostic algorithms were measured from 85 nasopharyngeal carcinoma (NPC) lesions and 131 normal tissue sites from 59 subjects during routine nasal endoscopy. Leave-one-out cross-validation was used to evaluate the performance of the algorithms. An overall diagnostic accuracy of 96%, a sensitivity of 94%, and a specificity of 97% for discriminating nasopharyngeal carcinomas from normal tissues were achieved using a linear SVM algorithm. A diagnostic accuracy of 98%, a sensitivity of 95%, and a specificity of 99% for detecting NPC were achieved with a nonlinear SVM algorithm. In a comparison with previously developed algorithms using the same dataset and the principal component analysis (PCA) technique, the SVM algorithms produced better diagnostic accuracy in all instances. In addition, we investigated a method combining PCA and SVM techniques for reducing the complexity of the SVM algorithms.
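The evaluation scheme in this abstract, leave-one-out cross-validation followed by sensitivity, specificity and accuracy, can be sketched as below. This is an illustration under stated assumptions: a nearest-centroid rule is swapped in for the SVM to keep the example self-contained, all function names are hypothetical, and the eight one-dimensional samples are invented toy data, not measurements from the study.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def nearest_centroid_predict(train, x):
    """Stand-in classifier (not an SVM): pick the label with the closest centroid."""
    labels = {label for _, label in train}
    cents = {lab: centroid([v for v, l in train if l == lab]) for lab in labels}
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(cents[lab], x)))


def leave_one_out(samples, predict):
    """Hold out each sample once, train on the rest, and record (true, predicted)."""
    results = []
    for i, (x, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        results.append((label, predict(train, x)))
    return results


def diagnostic_rates(results, positive=1):
    """Return (sensitivity, specificity, accuracy) from (true, predicted) pairs."""
    tp = sum(1 for t, p in results if t == positive and p == positive)
    fn = sum(1 for t, p in results if t == positive and p != positive)
    tn = sum(1 for t, p in results if t != positive and p != positive)
    fp = sum(1 for t, p in results if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(results)


# Invented toy data: label 1 = lesion, label 0 = normal tissue.
toy = [([2.0], 1), ([2.2], 1), ([1.9], 1), ([0.4], 1),
       ([0.1], 0), ([0.3], 0), ([0.2], 0), ([0.5], 0)]
print(diagnostic_rates(leave_one_out(toy, nearest_centroid_predict)))
# (0.75, 1.0, 0.875)
```

In the toy run one lesion sample sits among the normal values and is missed, which lowers sensitivity while specificity stays at 1.0; reporting both rates alongside accuracy exposes that kind of asymmetry.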
Affiliation(s)
- WuMei Lin
- Hong Kong University of Science & Technology, Department of Electrical & Electronic Engineering, Clear Water Bay, Kowloon, Hong Kong, China
466
Heazlewood JL, Tonti-Filippini JS, Gout AM, Day DA, Whelan J, Millar AH. Experimental analysis of the Arabidopsis mitochondrial proteome highlights signaling and regulatory components, provides assessment of targeting prediction programs, and indicates plant-specific mitochondrial proteins. THE PLANT CELL 2004; 16:241-56. [PMID: 14671022 PMCID: PMC301408 DOI: 10.1105/tpc.016055] [Citation(s) in RCA: 430] [Impact Index Per Article: 20.5] [Received: 08/06/2003] [Accepted: 11/06/2003] [Indexed: 05/17/2023]
Abstract
A novel insight into Arabidopsis mitochondrial function was revealed from a large experimental proteome derived by liquid chromatography-tandem mass spectrometry. Within the experimental set of 416 identified proteins, a significant number of low-abundance proteins involved in DNA synthesis, transcriptional regulation, protein complex assembly, and cellular signaling were discovered. Nearly 20% of the experimentally identified proteins are of unknown function, suggesting a wealth of undiscovered mitochondrial functions in plants. Only approximately half of the experimental set is predicted to be mitochondrial by targeting prediction programs, allowing an assessment of the benefits and limitations of these programs in determining plant mitochondrial proteomes. Maps of putative orthology networks between yeast, human, and Arabidopsis mitochondrial proteomes and the Rickettsia prowazekii proteome provide detailed insights into the divergence of the plant mitochondrial proteome from those of other eukaryotes. These show a clear set of putative cross-species orthologs in the core metabolic functions of mitochondria, whereas considerable diversity exists in many signaling and regulatory functions.
Affiliation(s)
- Joshua L Heazlewood
- Plant Molecular Biology Group, School of Biomedical and Chemical Sciences, University of Western Australia, Crawley 6009, Western Australia, Australia
|
467
|
Vinayagam A, Pugalenthi G, Rajesh R, Sowdhamini R. DSDBASE: a consortium of native and modelled disulphide bonds in proteins. Nucleic Acids Res 2004; 32:D200-2. [PMID: 14681394 PMCID: PMC308760 DOI: 10.1093/nar/gkh026] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2003] [Revised: 08/16/2003] [Accepted: 09/03/2003] [Indexed: 11/13/2022] Open
Abstract
DSDBASE is a database of disulphide bonds in proteins, which provides information on native disulphides and those that are stereochemically possible between pairs of residues for all known protein structural entries. The modelling of disulphides has been performed, using MODIP, by the identification of residue pairs that can strainlessly accommodate a covalent cross-link. We also assess the stereochemical quality of the covalent cross-link and grade them appropriately. One of the potential uses of the database is to design site-directed mutants in order to enhance the thermal stability of a protein. The proposed sites of mutations can be viewed specifically with respect to active sites of enzymes and across physiological dimers. The occurrence of native and modelled disulphides increases the dimensions of the database enormously. This database can also be employed for proposing three-dimensional models of disulphide-rich short polypeptides. The database can be accessed from http://www.ncbs.res.in/~faculty/mini/dsdbase/dsdbase.html. Supplementary information can be accessed from http://www.ncbs.res.in/~faculty/mini/dsdbase/nar/suppl.htm.
Affiliation(s)
- A Vinayagam
- National Centre for Biological Sciences, UAS-GKVK Campus, Bangalore 560065, India
|
468
|
|
469
|
Nair R, Rost B. Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins 2003; 53:917-30. [PMID: 14635133 DOI: 10.1002/prot.10507] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.
Affiliation(s)
- Rajesh Nair
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.
|
470
|
Chou KC, Cai YD. A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem Biophys Res Commun 2003; 311:743-7. [PMID: 14623335 DOI: 10.1016/j.bbrc.2003.10.062] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Based on the recent development in the gene ontology and functional domain databases, a new hybridization approach is developed for predicting protein subcellular location by combining the gene product, functional domain, and quasi-sequence-order effects. As a showcase, the same prokaryotic and eukaryotic datasets, which were studied by many previous investigators, are used for demonstration. The overall success rate by the jackknife test for the prokaryotic set is 94.7% and that for the eukaryotic set is 92.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross-validation test procedure, suggesting that such a hybrid approach may become a very useful high-throughput tool in the areas of bioinformatics, proteomics, and molecular cell biology. The very high success rates also reflect the fact that the subcellular localization of a protein is closely correlated with (1) the biological objective to which the gene or gene product contributes, (2) the biochemical activity of a gene product, and (3) the place in the cell where a gene product is active.
Affiliation(s)
- Kuo-Chen Chou
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
471
|
Gonzáles-Díaz H, Gia O, Uriarte E, Hernádez I, Ramos R, Chaviano M, Seijo S, Castillo JA, Morales L, Santana L, Akpaloo D, Molina E, Cruz M, Torres LA, Cabrera MA. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J Mol Model 2003; 9:395-407. [PMID: 13680309 DOI: 10.1007/s00894-003-0148-7] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2003] [Accepted: 07/07/2003] [Indexed: 10/26/2022]
Abstract
A simple stochastic approach, designed to model the movement of electrons throughout chemical bonds, is introduced. This model makes use of a Markov matrix to codify useful structural information in QSAR. The self-return probabilities of this matrix throughout time ($^{SR}\pi_k$) are then used as molecular descriptors. Firstly, a calculation of $^{SR}\pi_k$ is made for a large series of anticancer and non-anticancer chemicals. Then, k-Means Cluster Analysis allows us to split the data series into clusters and ensure a representative design of training and predicting series. Next, we develop a classification function through Linear Discriminant Analysis (LDA). This QSAR discriminates between anticancer compounds and non-active compounds with a correct global classification of 90.5% in the training series. The model also correctly classified 86.07% of the compounds in the predicting series. This classification function is then used to perform a virtual screening of a combinatorial library of coumarins. In this connection, the biological assay of some furocoumarins, selected by virtual screening using the present model, gives good results. In particular, a tetracyclic derivative of 5-methoxypsoralen (5-MOP) has an IC50 against HL-60 tumoral line around 6 to 10 times lower than those for 8-MOP and 5-MOP (reference drugs), respectively. Finally, application of Iso-contribution Zone Analysis (IZA) provides structural interpretation of the biological activity predicted with this QSAR.
Affiliation(s)
- Humberto Gonzáles-Díaz
- Chemical Bioactives Center, Central University of Las Villas, 54830 Santa Clara, Villa Clara, Cuba.
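The self-return probabilities mentioned in the abstract above can be illustrated with a minimal sketch. It assumes only the general Markov-chain idea: for a row-stochastic matrix P, the i-th diagonal entry of P^k is the probability of being at state i again after k steps. The 3-state matrix `P` and the function names are hypothetical; the abstract does not say how MARCH-INSIDE builds its matrix from a molecule, so this is not the authors' procedure.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def self_return_probabilities(p, max_k):
    """Return [diag(P^1), diag(P^2), ..., diag(P^max_k)].

    Entry [k - 1][i] is the probability of being at state i again
    k steps after starting from state i.
    """
    power = p
    result = []
    for _ in range(max_k):
        result.append([power[i][i] for i in range(len(p))])
        power = mat_mul(power, p)
    return result


# Hypothetical 3-state row-stochastic matrix (each row sums to 1).
P = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]

sr = self_return_probabilities(P, 3)
for k, diag in enumerate(sr, start=1):
    print(k, diag)
```

Each row of `sr` could then serve as a descriptor vector for a downstream classifier, which is the role the abstract assigns to these probabilities.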
|
472
|
Peng SH, Fan LJ, Peng XN, Zhuang SL, Du W, Chen LB. Splicing-site recognition of rice (Oryza sativa L.) DNA sequences by support vector machines. JOURNAL OF ZHEJIANG UNIVERSITY. SCIENCE 2003; 4:573-577. [PMID: 12958717 DOI: 10.1631/jzus.2003.0573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
MOTIVATION It was found that high accuracy splicing-site recognition of rice (Oryza sativa L.) DNA sequence is especially difficult. We described a new method for the splicing-site recognition of rice DNA sequences. METHOD Based on the GT-AG rule that introns in eukaryotic organisms follow, we used support vector machines (SVM) to predict the splicing sites. By machine learning, we built a model and evaluated it on a test data set of true and pseudo splicing sites. RESULTS The prediction accuracy we obtained was 87.53% at the true 5' end splicing site and 87.37% at the true 3' end splicing sites. The results suggested that the SVM approach could achieve higher accuracy than the previous approaches.
Affiliation(s)
- Si-hua Peng
- Department of Control Science and Engineering, College of Information Science and Engineering, Zhejiang University, Hangzhou 310027, China.
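The GT-AG principle in the abstract above suggests a simple way to enumerate candidate sites before any classifier is applied. The sketch below assumes that every GT is a candidate 5' (donor) site and every AG a candidate 3' (acceptor) site, and that a fixed flanking window is cut around each one. The function name, the `flank` parameter, and the toy sequence are hypothetical; the abstract does not state the window size or feature encoding the authors used.

```python
def candidate_splice_sites(seq, flank=3):
    """Collect (position, window) pairs for GT and AG dinucleotides.

    The window spans `flank` bases on each side of the dinucleotide.
    Sites too close to either end of the sequence are skipped.
    """
    seq = seq.upper()
    donors, acceptors = [], []
    for i in range(len(seq) - 1):
        dinuc = seq[i:i + 2]
        if dinuc not in ("GT", "AG"):
            continue
        start, end = i - flank, i + 2 + flank
        if start < 0 or end > len(seq):
            continue
        window = seq[start:end]
        if dinuc == "GT":
            donors.append((i, window))
        else:
            acceptors.append((i, window))
    return donors, acceptors


# Toy sequence for illustration only (not rice genomic data).
donors, acceptors = candidate_splice_sites("CCAGGTAAGTCCTTAGGCC", flank=3)
print(donors)
print(acceptors)
```

Most windows produced this way are pseudo sites, which is why a trained classifier such as an SVM is needed to separate them from true sites.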
|
473
|
Jin L, Fang W, Tang H. Prediction of protein structural classes by a new measure of information discrepancy. Comput Biol Chem 2003; 27:373-80. [PMID: 12927111 DOI: 10.1016/s1476-9271(02)00087-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
Since it was observed that the structural class of a protein is related to its amino acid composition, various methods based on amino acid composition have been proposed to predict protein structural classes. Though those methods are effective to some degree, their predictive quality is limited because amino acid composition cannot sufficiently capture the information in protein sequences. In this paper, a measure of information discrepancy is applied to the prediction of protein structural classes. Unlike the previous methods, this new approach is based on comparisons of subsequence distributions; therefore, the effect of residue order on protein structure is taken into account. The predictive results of the new approach on the same data set are better than those of the previous methods. For a data set of 1401 sequences with no more than 30% redundancy, the overall correctness rates of the resubstitution test and the jackknife test are 99.4% and 75.02%, respectively, and similar results are obtained for other data sets. All tests demonstrate that the residue order along protein sequences plays an important role in the recognition of protein structural classes, especially for alpha/beta proteins and alpha+beta proteins. In addition, the tests also show that the new method is simple and efficient.
Affiliation(s)
- Lixia Jin
- Institute of Computational Biology and Bioinformatics, Dalian University of Technology, 116025, Dalian, People's Republic of China.
|
474
|
Gardy JL, Spencer C, Wang K, Ester M, Tusnády GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FSL. PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res 2003; 31:3613-7. [PMID: 12824378 PMCID: PMC169008 DOI: 10.1093/nar/gkg602] [Citation(s) in RCA: 312] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Automated prediction of bacterial protein subcellular localization is an important tool for genome annotation and drug discovery. PSORT has been one of the most widely used computational methods for such bacterial protein analysis; however, it has not been updated since it was introduced in 1991. In addition, neither PSORT nor any of the other computational methods available make predictions for all five of the localization sites characteristic of Gram-negative bacteria. Here we present PSORT-B, an updated version of PSORT for Gram-negative bacteria, which is available as a web-based application at http://www.psort.org. PSORT-B examines a given protein sequence for amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane alpha-helices and motifs corresponding to specific localizations. A probabilistic method integrates these analyses, returning a list of five possible localization sites with associated probability scores. PSORT-B, designed to favor high precision (specificity) over high recall (sensitivity), attained an overall precision of 97% and recall of 75% in 5-fold cross-validation tests, using a dataset we developed of 1443 proteins of experimentally known localization. This dataset, the largest of its kind, is freely available, along with the PSORT-B source code (under GNU General Public License).
Affiliation(s)
- Jennifer L Gardy
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada V5A 1S6
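The trade-off between precision and recall described in the abstract above can be made concrete with a small sketch. It assumes a predictor that may decline to predict (returning `None`), so that precision is computed over the predictions actually made and recall over all proteins. This is a simplified overall definition chosen for illustration; the function name and the example labels are hypothetical and are not taken from the PSORT-B evaluation.

```python
def precision_recall(true_labels, predicted_labels):
    """Overall precision and recall when the predictor may abstain.

    A prediction of None means "no localization reported".
    precision = correct / predictions made
    recall    = correct / all proteins
    """
    made = [(t, p) for t, p in zip(true_labels, predicted_labels)
            if p is not None]
    correct = sum(1 for t, p in made if t == p)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(true_labels) if true_labels else 0.0
    return precision, recall


# Hypothetical example: four proteins, one abstention, one error.
truth = ["cytoplasm", "periplasm", "outer membrane", "inner membrane"]
preds = ["cytoplasm", None, "outer membrane", "cytoplasm"]

precision, recall = precision_recall(truth, preds)
print(precision, recall)
```

A predictor that abstains on uncertain cases raises precision at the cost of recall, which matches the design preference the abstract reports.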
|
475
|
Cai YD, Chou KC. Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Commun 2003; 305:407-11. [PMID: 12745090 DOI: 10.1016/s0006-291x(03)00775-7] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
Abstract
In this paper, based on the approach by combining the "functional domain composition" [K.C. Chou, Y. D. Cai, J. Biol. Chem. 277 (2002) 45765] and the pseudo-amino acid composition [K.C. Chou, Proteins Struct. Funct. Genet. 43 (2001) 246; Correction Proteins Struct. Funct. Genet. 2044 (2001) 2060], the Nearest Neighbour Algorithm (NNA) was developed for predicting the protein subcellular location. Very high success rates were observed, suggesting that such a hybrid approach may become a useful high-throughput tool in the area of bioinformatics and proteomics.
Affiliation(s)
- Yu-Dong Cai
- Shanghai Research Centre of Biotechnology, Chinese Academy of Sciences, Shanghai 200233, China.
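As a rough illustration of the nearest neighbour idea named in the abstract above, the sketch below assigns a query protein the label of its closest training example. It assumes each protein is already represented as a numeric feature vector and uses Euclidean distance; the abstract does not specify the distance measure, and the two-dimensional vectors and labels here are hypothetical placeholders rather than real functional domain or pseudo-amino acid composition features.

```python
import math


def nearest_neighbour(query, training):
    """Return the label of the training vector closest to `query`.

    `training` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption made for this sketch.
    """
    best_label, best_dist = None, math.inf
    for features, label in training:
        dist = math.dist(query, features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical 2-D feature vectors with subcellular location labels.
training_set = [
    ([0.9, 0.1], "nucleus"),
    ([0.2, 0.8], "cytoplasm"),
    ([0.5, 0.5], "mitochondrion"),
]

predicted = nearest_neighbour([0.85, 0.2], training_set)
print(predicted)
```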
|
476
|
Pillai S, Good B, Richman D, Corbeil J. A new perspective on V3 phenotype prediction. AIDS Res Hum Retroviruses 2003; 19:145-9. [PMID: 12643277 DOI: 10.1089/088922203762688658] [Citation(s) in RCA: 96] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The particular coreceptor used by a strain of HIV-1 to enter a host cell is highly indicative of its pathology. HIV-1 coreceptor usage is primarily determined by the amino acid sequences of the V3 loop region of the viral envelope glycoprotein. The canonical approach to sequence-based prediction of coreceptor usage was derived via statistical analysis of a less reliable and significantly smaller data set than is presently available. We aimed to produce a superior phenotypic classifier by applying modern machine learning (ML) techniques to the current database of V3 loop sequences with known phenotype. The trained classifiers along with the sequence data are available for public use at the supplementary website: http://genomiac2.ucsd.edu:8080/wetcat/v3.html and http://www.cs.waikato.ac.nz/ml/weka [corrected].
Affiliation(s)
- Satish Pillai
- University of California, San Diego, La Jolla, California 92093, USA.
|
477
|
Nair R, Rost B. Sequence conserved for subcellular localization. Protein Sci 2002; 11:2836-47. [PMID: 12441382 PMCID: PMC2373743 DOI: 10.1110/ps.0207402] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2002] [Revised: 09/05/2002] [Accepted: 09/10/2002] [Indexed: 10/27/2022]
Abstract
The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.
Affiliation(s)
- Rajesh Nair
- Columbia University Bioinformatics Center (CUBIC), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
478
|
Yuan Z, Burrage K, Mattick JS. Prediction of protein solvent accessibility using support vector machines. Proteins 2002; 48:566-70. [PMID: 12112679 DOI: 10.1002/prot.10176] [Citation(s) in RCA: 83] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% that splits the dataset evenly (an equal number of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for single sequence input and 73.9% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis.
Affiliation(s)
- Zheng Yuan
- Institute for Molecular Bioscience and ARC Special Centre for Functional and Applied Genomics, The University of Queensland, Brisbane, Australia.
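The sliding-window input mentioned in the abstract above can be sketched as follows. The example assumes a one-hot encoding over the 20 standard amino acids plus one padding slot for window positions that fall off the ends of the sequence; the abstract does not state the authors' exact encoding, so the alphabet layout, the padding slot, and the function name are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def window_features(sequence, window=3):
    """Build one feature vector per residue from a centred window.

    Each window position contributes 21 values: a one-hot code over
    AMINO_ACIDS plus a final slot that marks padding beyond the
    sequence ends. `window` must be odd so the window is centred.
    Residues outside AMINO_ACIDS raise ValueError.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    rows = []
    for center in range(len(sequence)):
        row = []
        for pos in range(center - half, center + half + 1):
            one_hot = [0] * (len(AMINO_ACIDS) + 1)
            if 0 <= pos < len(sequence):
                one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
            else:
                one_hot[-1] = 1  # padding marker
            row.extend(one_hot)
        rows.append(row)
    return rows


features = window_features("ACD", window=3)
print(len(features), len(features[0]))
```

Each row would be paired with a buried or exposed label for the central residue and passed to a classifier; varying `window` corresponds to the window-size exploration the abstract describes.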
|
479
|
Mott R, Schultz J, Bork P, Ponting CP. Predicting protein cellular localization using a domain projection method. Genome Res 2002; 12:1168-74. [PMID: 12176924 PMCID: PMC186639 DOI: 10.1101/gr.96802] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2002] [Accepted: 05/15/2002] [Indexed: 11/24/2022]
Abstract
We investigate the co-occurrence of domain families in eukaryotic proteins to predict protein cellular localization. Approximately half (300) of SMART domains form a "small-world network", linked by no more than seven degrees of separation. Projection of the domains onto two-dimensional space reveals three clusters that correspond to cellular compartments containing secreted, cytoplasmic, and nuclear proteins. The projection method takes into account the existence of "bridging" domains, that is, instances where two domains might not occur with each other but frequently co-occur with a third domain; in such circumstances the domains are neighbors in the projection. While the majority of domains are specific to a compartment ("locale"), and hence may be used to localize any protein that contains such a domain, a small subset of domains either are present in multiple locales or occur in transmembrane proteins. Comparison with previously annotated proteins shows that SMART domain data used with this approach can predict, with 92% accuracy, the localizations of 23% of eukaryotic proteins. The coverage and accuracy will increase with improvements in domain database coverage. This method is complementary to approaches that use amino-acid composition or identify sorting sequences; these methods may be combined to further enhance prediction accuracy.
Affiliation(s)
- Richard Mott
- Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
|
480
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2002. [PMCID: PMC2447253 DOI: 10.1002/cfg.117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
|
481
|
Abstract
Human chromosomal region 11p15 is known to be associated with several diseases including predispositions to develop various tumor types. In search of candidate genes, a novel human kinase gene is described, STK33, which codes for a serine/threonine protein kinase. The gene was discovered by comparative genome analysis of human chromosome 11p15.3 and its orthologous region on distal mouse chromosome 7. Human STK33 gene contains 12 exons as has been determined by the comparison to the full-length transcript amplified from human uterus RNA. Transcripts are found in a variety of tissues in at least two alternatively spliced forms as revealed by reverse transcriptase-polymerase chain reaction, cDNA sequencing and expressed sequence tag clustering. Phylogenetic analysis suggests that STK33 may belong to the calcium/calmodulin-dependent protein kinase group, even though, like several other members of the group, it lacks the calcium/calmodulin binding domain [FASEB J. 9 (1995) 576]. STK33 shows a differential expression in a variety of normal and malignant tissues.
Affiliation(s)
- A O Mujica
- Institut für Molekulargenetik, Gentechnologische Sicherheitsforschung und Beratung, Johannes Gutenberg Universität Mainz, J.J. Becherweg 32, D-55099 Mainz, Germany
|