1
Kanto S, Grynberg M, Kaneko Y, Fujita J, Satake M. A variant of Runx2 that differs from the bone isoform in its splicing is expressed in spermatogenic cells. PeerJ 2016; 4:e1862. [PMID: 27069802 PMCID: PMC4824880 DOI: 10.7717/peerj.1862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2016] [Accepted: 03/09/2016] [Indexed: 11/20/2022]
Abstract
Background. Members of the Runx gene family encode transcription factors that bind to DNA in a sequence-specific manner. Among the three Runx proteins, Runx2 comprises 607 amino acid (aa) residues, is expressed in bone, and plays crucial roles in osteoblast differentiation and bone development. We examined whether the Runx2 gene is also expressed in testes. Methods. Testes from 1-, 2-, 3-, 4-, and 10-week-old male mice of the C57BL/6J and W/Wv strains were used throughout the study. Northern blot analyses were performed using extracts from the murine testes. Sequencing of cDNA clones and 5′-rapid amplification of cDNA ends were performed to determine the full length of the transcripts, which revealed that the testicular Runx2 transcript encodes a novel protein of 106 aa residues. An antiserum was generated using the amino-terminal 15 aa of Runx2 (Met1 to Gly15) as an antigen, and immunoblot analyses were performed to detect the predicted polypeptide of 106 aa residues with the initiating Met1. With the affinity-purified anti-Runx2 antibody, immunohistochemical analyses were performed to elucidate the localization of the protein. Furthermore, bioinformatic analyses were performed to predict the function of the protein. Results. A Runx2 transcript was detected in testes and was specifically expressed in germ cells. Determination of the transcript structure indicated that the testicular Runx2 is a splice isoform. The predicted testicular Runx2 polypeptide is composed of only 106 aa residues, lacks a Runt domain, and appears to be a basic protein with a predominantly alpha-helical conformation. Immunoblot analyses with an anti-Runx2 antibody revealed that Met1 in the deduced open reading frame of Runx2 is used as the initiation codon to express an 11 kDa protein. Furthermore, immunohistochemical analyses revealed that the Runx2 polypeptide was located in the nuclei and was detected in late pachytene and diplotene spermatocytes and in cells at the second meiotic division, as well as in round spermatids. Bioinformatic analyses suggested that the testicular Runx2 is a histone-like protein. Discussion. A variant of Runx2 that differs from the bone isoform in its splicing is expressed in pachytene spermatocytes and round spermatids in testes, and encodes a histone-like nuclear protein of 106 aa residues. Considering its nuclear localization and differentiation stage-dependent expression, Runx2 may function as a chromatin-remodeling factor during spermatogenesis. We thus conclude that a single Runx2 gene can encode two different types of nuclear proteins: a previously defined transcription factor in bone and cartilage, and a short testicular variant that lacks a Runt domain.
Affiliation(s)
- Satoru Kanto
- Department of Molecular Immunology, Institute of Development, Aging and Cancer, Tohoku University, Sendai, Miyagi, Japan; Department of Urology, Graduate School of Medicine, Tohoku University, Sendai, Miyagi, Japan
- Marcin Grynberg
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland; Program in Bioinformatics and Systems Biology, Sanford Burnham Medical Research Institute, La Jolla, CA, United States of America
- Yoshiyuki Kaneko
- Department of Clinical Molecular Biology, Faculty of Medicine, Kyoto University, Kyoto, Japan
- Jun Fujita
- Department of Clinical Molecular Biology, Faculty of Medicine, Kyoto University, Kyoto, Japan
- Masanobu Satake
- Department of Molecular Immunology, Institute of Development, Aging and Cancer, Tohoku University, Sendai, Miyagi, Japan
2
Tong J, Sadreyev RI, Pei J, Kinch LN, Grishin NV. Using homology relations within a database markedly boosts protein sequence similarity search. Proc Natl Acad Sci U S A 2015; 112:7003-8. [PMID: 26038555 PMCID: PMC4460465 DOI: 10.1073/pnas.1424324112] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
Abstract
Inference of homology from protein sequences provides an essential tool for analyzing protein structure, function, and evolution. Current sequence-based homology search methods are still unable to detect many similarities evident from protein spatial structures. In computer science a search engine can be improved by considering networks of known relationships within the search database. Here, we apply this idea to protein-sequence-based homology search and show that it dramatically enhances the search accuracy. Our new method, COMPADRE (COmparison of Multiple Protein sequence Alignments using Database RElationships) assesses the relationship between the query sequence and a hit in the database by considering the similarity between the query and hit's known homologs. This approach increases detection quality, boosting the precision rate from 18% to 83% at half-coverage of all database homologs. The increased precision rate allows detection of a large fraction of protein structural relationships, thus providing structure and function predictions for previously uncharacterized proteins. Our results suggest that this general approach is applicable to a wide variety of methods for detection of biological similarities. The web server is available at prodata.swmed.edu/compadre.
Affiliation(s)
- Jing Tong
- Department of Molecular Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Ruslan I Sadreyev
- Department of Molecular Biology, Massachusetts General Hospital, Boston, MA 02114; Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114
- Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
- Nick V Grishin
- Department of Molecular Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050; Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050
3
Yaseen A, Li Y. Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features. BMC Bioinformatics 2014; 15 Suppl 8:S3. [PMID: 25080939 PMCID: PMC4120151 DOI: 10.1186/1471-2105-15-s8-s3] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
Abstract
Background Secondary structure prediction of proteins is important to many protein structure modeling applications. Correct prediction of secondary structures can significantly reduce the degrees of freedom in protein tertiary structure modeling and therefore reduces the difficulty of obtaining high-resolution 3D models. Methods In this work, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. We construct structural templates from known protein structures with certain sequence similarity. The structural templates are then incorporated as features, together with sequence and evolutionary information, to train two-stage neural networks. When no structural templates are available, heuristic structural information is incorporated instead. Results After applying the template-based 8-state secondary structure prediction method, the 7-fold cross-validated Q8 accuracy is 78.85%. Even templates from structures with only 20%-30% sequence similarity can help improve the 8-state prediction accuracy. More importantly, when good templates are available, the prediction accuracy for less frequent secondary structures, such as 3-10 helices, turns, and bends, is greatly improved, which is useful for practical applications. Conclusions Our computational results show that templates containing structural information are effective features for enhancing 8-state secondary structure predictions. Our prediction algorithm is implemented on a web server named "C8-SCORPION" available at: http://hpcr.cs.odu.edu/c8scorpion.
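The Q8 accuracy quoted in this abstract is, in its usual definition, the fraction of residues whose predicted 8-state label matches the observed one. The sketch below illustrates only that metric, not the C8-SCORPION predictor; the function name `q8_accuracy` and the two label strings are hypothetical, and `C` is assumed here to stand for coil.

```python
def q8_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted 8-state label equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical DSSP-style strings over the states H, G, I, E, B, T, S, C.
observed = "CHHHHHTTEEEESC"
predicted = "CHHHHHTSEEEECC"
print(f"Q8 = {q8_accuracy(predicted, observed):.3f}")
```

A cross-validated figure such as the one reported would be the same quantity averaged over held-out folds.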
4
Tataru P, Sand A, Hobolth A, Mailund T, Pedersen CNS. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology 2013; 2:1282-95. [PMID: 24833225 PMCID: PMC4009796 DOI: 10.3390/biology2041282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/28/2013] [Revised: 10/08/2013] [Accepted: 11/05/2013] [Indexed: 11/24/2022]
Abstract
Hidden Markov Models (HMMs) are widely used probabilistic models, particularly for annotating sequential data with an underlying hidden structure. Patterns in the annotation are often more relevant to study than the hidden structure itself. A typical HMM analysis consists of annotating the observed data using a decoding algorithm and analyzing the annotation to study patterns of interest. For example, given an HMM modeling genes in DNA sequences, the focus is on occurrences of genes in the annotation. In this paper, we define a pattern through a regular expression and present a restriction of three classical algorithms to take the number of occurrences of the pattern in the hidden sequence into account. We present a new algorithm to compute the distribution of the number of pattern occurrences, and we extend the two most widely used existing decoding algorithms to employ information from this distribution. We show experimentally that the expectation of the distribution of the number of pattern occurrences gives a highly accurate estimate, while the typical procedure can be biased in the sense that the identified number of pattern occurrences does not correspond to the true number. We furthermore show that using this distribution in the decoding algorithms improves the predictive power of the model.
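The "typical procedure" this abstract contrasts itself with (decode the hidden path, then count pattern occurrences in it) can be sketched as follows. This is an illustration of that baseline only, not the authors' restricted algorithms. The two-state model (`N` for non-gene, `G` for gene), all probabilities, and the pattern `NG+` are hypothetical values chosen for the example.

```python
import math
import re


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for obs, computed in log space."""
    # scores[s] holds the best log-probability of any path ending in state s.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per observation after the first
    for symbol in obs[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            scores[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Hypothetical toy model: N = non-gene (A/T rich), G = gene (G/C rich).
states = "NG"
start_p = {"N": 0.9, "G": 0.1}
trans_p = {"N": {"N": 0.8, "G": 0.2}, "G": {"N": 0.2, "G": 0.8}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "G": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path = viterbi("ATATGCGCGATATAGCGCCAT", states, start_p, trans_p, emit_p)
# A gene occurrence is modeled here as a non-gene state followed by a run of gene states.
occurrences = len(re.findall(r"NG+", path))
print(path, occurrences)
```

Because the count comes from a single decoded path, it can differ from the true number of occurrences, which is the bias the abstract describes.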
Affiliation(s)
- Paula Tataru
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Andreas Sand
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Asger Hobolth
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Thomas Mailund
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
- Christian N S Pedersen
- Bioinformatics Research Centre, Aarhus University, C. F. Møllers Allé 8, DK-8000 Aarhus C, Denmark.
5
Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinformatics 2011; 12:217. [PMID: 21619696 PMCID: PMC3133552 DOI: 10.1186/1471-2105-12-217] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 12/27/2010] [Accepted: 05/29/2011] [Indexed: 11/23/2022]
Abstract
Background Conotoxin has been proven to be effective in drug design and could be used to treat various disorders such as schizophrenia, neuromuscular disorders and chronic pain. With the rapidly growing interest in conotoxin, accurate conotoxin superfamily classification tools are desirable to systematize the increasing number of newly discovered sequences and structures. However, despite the significance of conotoxin and the extensive experimental investigations of it, such tools have not been intensively explored. Results In this paper, we propose to consider suboptimal alignments of words with restricted length. We developed a scoring system based on local alignment partition functions, called free score. The scoring system plays the key role in the feature extraction step of support vector machine classification. In the classification of conotoxin proteins, our method, SVM-Freescore, improves sensitivity and specificity by approximately 5.864% and 3.76%, respectively, over previously reported methods. To test its generality, SVM-Freescore was also applied to classify superfamilies from curated, high-quality databases such as ConoServer. The average computed sensitivity and specificity for the superfamily classification were found to be 0.9742 and 0.9917, respectively. Conclusions The SVM-Freescore method is shown to be a useful sequence-based analysis tool for functional and structural characterization of conotoxin proteins. The datasets and the software are available at http://faculty.uaeu.ac.ae/nzaki/SVM-Freescore.htm.
6
Arenas NE, Salazar LM, Soto CY, Vizcaíno C, Patarroyo ME, Patarroyo MA, Gómez A. Molecular modeling and in silico characterization of Mycobacterium tuberculosis TlyA: possible misannotation of this tubercle bacilli-hemolysin. BMC Structural Biology 2011; 11:16. [PMID: 21443791 PMCID: PMC3072309 DOI: 10.1186/1472-6807-11-16] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 12/30/2010] [Accepted: 03/28/2011] [Indexed: 11/24/2022]
Abstract
Background The TlyA protein has a controversial function as a virulence factor in Mycobacterium tuberculosis (M. tuberculosis). At present, its dual activity as hemolysin and RNA methyltransferase in M. tuberculosis has been indirectly proposed based on in vitro results. There is, however, no evidence of TlyA's relevance to the survival of tubercle bacilli inside host cells or of a functional link between the two activities. A thorough analysis of structure prediction for this mycobacterial protein in this study shows the need for reevaluating TlyA's function in virulence. Results Bioinformatics analysis of TlyA identified a ribosomal protein binding domain (S4 domain) located between residues 5 and 68, as well as an FtsJ-like methyltransferase domain spanning residues 62 to 247, both of which have been previously described in translation machinery-associated proteins. Subcellular localization prediction showed that TlyA lacks a signal peptide, and its hydrophobicity profile showed no evidence of transmembrane helices. These findings suggested that it may not be attached to the membrane, which is consistent with a cytoplasmic localization. Three-dimensional modeling of TlyA showed a consensus structure with a common core formed by a six-stranded β-sheet between two α-helix layers, which is consistent with an RNA methyltransferase structure. Phylogenetic analyses showed high conservation of the tlyA gene among Mycobacterium species. Additionally, the nucleotide substitution rates suggested purifying selection during tlyA gene evolution and the absence of a common ancestor between TlyA proteins and bacterial pore-forming proteins. Conclusion Altogether, our manual in silico curation suggested that TlyA is involved in ribosomal biogenesis and that there is a functional annotation error regarding this protein family in several microbial and plant genomes, including the M. tuberculosis genome.
Affiliation(s)
- Nelson E Arenas
- Departamento de Química, Facultad de Ciencias, Universidad Nacional de Colombia, Carrera 45 No. 26-85 Bogotá, DC, Colombia
7
Kalkhof S, Haehn S, Paulsson M, Smyth N, Meiler J, Sinz A. Computational modeling of laminin N-terminal domains using sparse distance constraints from disulfide bonds and chemical cross-linking. Proteins 2010; 78:3409-27. [PMID: 20939100 PMCID: PMC5079110 DOI: 10.1002/prot.22848] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 04/23/2010] [Revised: 07/16/2010] [Accepted: 07/25/2010] [Indexed: 11/10/2022]
Abstract
Basement membranes are thin extracellular protein layers that separate endothelial and epithelial cells from the underlying connective tissue. The main noncollagenous components of basement membranes are laminins, trimeric glycoproteins that form polymeric networks by interactions of their N-terminal (LN) domains; however, no high-resolution structure of laminin LN domains exists so far. To construct models for laminin β1 and γ1 LN domains, 14 potentially suitable template structures were identified using fold recognition methods. For each target/template combination, comparative models were created with Rosetta. Final models were selected based on their agreement with experimentally obtained distance constraints from natural cross-links, that is, disulfide bonds, as well as chemical cross-links obtained from reactions with two amine-reactive cross-linkers. We predict that laminin β1 and γ1 LN domains share the galactose-binding domain-like fold.
Affiliation(s)
- Stefan Kalkhof
- Department of Pharmaceutical Chemistry & Bioanalytics, Institute of Pharmacy, Martin Luther University Halle-Wittenberg, Wolfgang-Langenbeck-Strasse 4, D-06120 Halle (Saale), Germany
- Sebastian Haehn
- Center for Biochemistry, Faculty of Medicine, Center for Molecular Medicine Cologne (CMMC), and Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Joseph-Stelzmann-Strasse 52, Cologne D-50931, Germany
- Mats Paulsson
- Center for Biochemistry, Faculty of Medicine, Center for Molecular Medicine Cologne (CMMC), and Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Joseph-Stelzmann-Strasse 52, Cologne D-50931, Germany
- Neil Smyth
- School of Biological Sciences, University of Southampton, Bassett Crescent East, Southampton, SO16 7PX, United Kingdom
- Jens Meiler
- Department of Chemistry and Center for Structural Biology, Vanderbilt University, Nashville, TN 37212, USA
- Andrea Sinz
- Department of Pharmaceutical Chemistry & Bioanalytics, Institute of Pharmacy, Martin Luther University Halle-Wittenberg, Wolfgang-Langenbeck-Strasse 4, D-06120 Halle (Saale), Germany
8
Kountouris P, Hirst JD. Predicting beta-turns and their types using predicted backbone dihedral angles and secondary structures. BMC Bioinformatics 2010; 11:407. [PMID: 20673368 PMCID: PMC2920885 DOI: 10.1186/1471-2105-11-407] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 05/12/2010] [Accepted: 07/31/2010] [Indexed: 11/29/2022]
Abstract
Background β-turns are secondary structure elements usually classified as coil. Their prediction is important because of their role in protein folding and their frequent occurrence in protein chains. Results We have developed a novel method that predicts β-turns and their types using information from multiple sequence alignments, predicted secondary structures and, for the first time, predicted dihedral angles. Our method uses support vector machines, a supervised classification technique, and is trained and tested on three established datasets of 426, 547 and 823 protein chains. We achieve a Matthews correlation coefficient of up to 0.49 when predicting the location of β-turns, the highest reported value to date. Moreover, the additional dihedral information improves the prediction of β-turn types I, II, IV, VIII and "non-specific", achieving correlation coefficients up to 0.39, 0.33, 0.27, 0.14 and 0.38, respectively. Our results are more accurate than those of other methods. Conclusions We have created an accurate predictor of β-turns and their types. Our method, called DEBT, is available online at http://comp.chem.nottingham.ac.uk/debt/.
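The Matthews correlation coefficient reported in this abstract is computed from the four cells of a binary confusion matrix (here, turn versus non-turn residues). The sketch below shows the standard formula only; the function name `matthews_cc` and the residue counts are hypothetical and are not taken from the paper.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical per-residue counts for a turn / non-turn predictor.
print(f"MCC = {matthews_cc(tp=50, tn=120, fp=20, fn=30):.3f}")
```

Unlike plain accuracy, this measure stays informative when turn residues are much rarer than non-turn residues, which is likely why it is the headline metric for this task.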
Affiliation(s)
- Petros Kountouris
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK
9
Schmidt am Busch M, Sedano A, Simonson T. Computational protein design: validation and possible relevance as a tool for homology searching and fold recognition. PLoS One 2010; 5:e10410. [PMID: 20463972 PMCID: PMC2864755 DOI: 10.1371/journal.pone.0010410] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 10/15/2009] [Accepted: 03/31/2010] [Indexed: 11/19/2022]
Abstract
BACKGROUND Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases. METHODOLOGY/PRINCIPAL FINDINGS We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates in the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000-300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences resemble moderately distant natural homologues of the initial templates; for example, the SUPERFAMILY library of profile hidden Markov models recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains are detected and around 90% of known SKIs and chemokines. Energy components and inter-residue correlations are analyzed and ways to improve the method are discussed. CONCLUSIONS/SIGNIFICANCE For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
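The step of deriving a Position Specific Scoring Matrix from a set of sequences and scoring candidates against it can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline: it assumes equal-length, gap-free sequences, a uniform background distribution, and add-one pseudocounts, and the function names and the three "designed" sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, gap-free sequences.

    Assumes a uniform background frequency of 1/20 per amino acid.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in sequences)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, candidate):
    """Sum the per-position log-odds scores of a candidate of matching length."""
    if len(candidate) != len(pssm):
        raise ValueError("candidate length must match the PSSM length")
    return sum(column[aa] for column, aa in zip(pssm, candidate))


# Hypothetical designed sequences for one short backbone template.
designed = ["ACDK", "ACEK", "ACDR"]
pssm = build_pssm(designed)
print(score(pssm, "ACDK"), score(pssm, "WWWW"))
```

In a database search, each candidate window would be scored this way and ranked; production tools typically use database-derived background frequencies and sequence weighting rather than the uniform assumptions made here.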
Affiliation(s)
- Marcel Schmidt am Busch
- Laboratoire de Biochimie (CNRS UMR7654), Department of Biology, Ecole Polytechnique, Palaiseau, France
- Audrey Sedano
- Laboratoire de Biochimie (CNRS UMR7654), Department of Biology, Ecole Polytechnique, Palaiseau, France
- Thomas Simonson
- Laboratoire de Biochimie (CNRS UMR7654), Department of Biology, Ecole Polytechnique, Palaiseau, France
10
Kountouris P, Hirst JD. Prediction of backbone dihedral angles and protein secondary structure using support vector machines. BMC Bioinformatics 2009; 10:437. [PMID: 20025785 PMCID: PMC2811710 DOI: 10.1186/1471-2105-10-437] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 07/09/2009] [Accepted: 12/22/2009] [Indexed: 11/26/2022]
Abstract
Background The prediction of the secondary structure of a protein is a critical step in the prediction of its tertiary structure and, potentially, its function. Moreover, the backbone dihedral angles, highly correlated with secondary structures, provide crucial information about the local three-dimensional structure. Results We predict independently both the secondary structure and the backbone dihedral angles and combine the results in a loop to enhance each prediction reciprocally. Support vector machines, a state-of-the-art supervised classification technique, achieve a secondary structure prediction accuracy of 80% on a non-redundant set of 513 proteins, significantly higher than other methods on the same dataset. The dihedral angle space is divided into a number of regions using two unsupervised clustering techniques in order to predict the region to which a new residue belongs. Our method performs comparably to, and in some cases more accurately than, other multi-class dihedral prediction methods. Conclusions We have created an accurate predictor of backbone dihedral angles and secondary structure. Our method, called DISSPred, is available online at http://comp.chem.nottingham.ac.uk/disspred/.
Affiliation(s)
- Petros Kountouris
- School of Chemistry, University of Nottingham, University Park, Nottingham NG7 2RD, UK.
11
Green JR, Korenberg MJ, Aboul-Magd MO. PCI-SS: MISO dynamic nonlinear protein secondary structure prediction. BMC Bioinformatics 2009; 10:222. [PMID: 19615046 PMCID: PMC2720391 DOI: 10.1186/1471-2105-10-222] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Received: 11/19/2008] [Accepted: 07/17/2009] [Indexed: 11/10/2022]
Abstract
Background Since the function of a protein is largely dictated by its three dimensional configuration, determining a protein's structure is of fundamental importance to biology. Here we report on a novel approach to determining the one dimensional secondary structure of proteins (distinguishing α-helices, β-strands, and non-regular structures) from primary sequence data which makes use of Parallel Cascade Identification (PCI), a powerful technique from the field of nonlinear system identification. Results Using PSI-BLAST divergent evolutionary profiles as input data, dynamic nonlinear systems are built through a black-box approach to model the process of protein folding. Genetic algorithms (GAs) are applied in order to optimize the architectural parameters of the PCI models. The three-state prediction problem is broken down into a combination of three binary sub-problems and protein structure classifiers are built using 2 layers of PCI classifiers. Careful construction of the optimization, training, and test datasets ensures that no homology exists between any training and testing data. A detailed comparison between PCI and 9 contemporary methods is provided over a set of 125 new protein chains guaranteed to be dissimilar to all training data. Unlike other secondary structure prediction methods, here a web service is developed to provide both human- and machine-readable interfaces to PCI-based protein secondary structure prediction. This server, called PCI-SS, is available at . In addition to a dynamic PHP-generated web interface for humans, a Simple Object Access Protocol (SOAP) interface is added to permit invocation of the PCI-SS service remotely. This machine-readable interface facilitates incorporation of PCI-SS into multi-faceted systems biology analysis pipelines requiring protein secondary structure information, and greatly simplifies high-throughput analyses. 
XML is used to represent the input protein sequence data and also to encode the resulting structure prediction in a machine-readable format. To our knowledge, this represents the only publicly available SOAP interface for a protein secondary structure prediction service with a published WSDL interface definition. Conclusion Relative to the 9 contemporary methods included in the comparison, cascaded PCI classifiers perform well; however, PCI finds its greatest application as a consensus classifier. When PCI is used to combine a sequence-to-structure PCI-based classifier with the current leading ANN-based method, PSIPRED, the overall error rate (Q3) is maintained while the rate of occurrence of a particularly detrimental error is reduced by up to 25%. This improvement in BAD score, combined with the machine-readable SOAP web service interface, makes PCI-SS particularly useful for inclusion in a tertiary structure prediction pipeline.
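The decomposition of the three-state problem into three binary sub-problems, as this abstract describes, can be illustrated with a generic one-vs-rest combination. The sketch below is not the PCI-SS method: the abstract says the binary outputs are combined by a second layer of PCI classifiers, whereas this example substitutes a simple per-residue argmax as a stand-in. The function name and all scores are hypothetical.

```python
def combine_binary_scores(helix, strand, coil):
    """Merge per-residue one-vs-rest scores into a three-state string.

    Each argument holds one score per residue from a binary classifier
    (helix vs rest, strand vs rest, coil vs rest). The state with the
    highest score wins; ties resolve in the order H, E, C.
    """
    if not (len(helix) == len(strand) == len(coil)):
        raise ValueError("score lists must have equal length")
    labels = "HEC"
    return "".join(
        labels[max(range(3), key=lambda k: triple[k])]
        for triple in zip(helix, strand, coil)
    )


# Hypothetical classifier outputs for a five-residue fragment.
helix_scores = [0.9, 0.8, 0.2, 0.1, 0.1]
strand_scores = [0.05, 0.1, 0.7, 0.6, 0.3]
coil_scores = [0.05, 0.1, 0.1, 0.3, 0.6]
print(combine_binary_scores(helix_scores, strand_scores, coil_scores))
```

A learned second layer can exploit context between neighbouring residues, which a per-residue argmax cannot.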
Affiliation(s)
- James R Green
- Department of Systems and Computer Engineering, Carleton University, Ottawa, Ontario, Canada.
12
Abstract
The SAM-T08 web server is a protein structure prediction server that provides several useful intermediate results in addition to the final predicted 3D structure: three multiple sequence alignments of putative homologs using different iterated search procedures, prediction of local structure features including various backbone and burial properties, calibrated E-values for the significance of template searches of PDB and residue–residue contact predictions. The server has been validated as part of the CASP8 assessment of structure prediction as having good performance across all classes of predictions. The SAM-T08 server is available at http://compbio.soe.ucsc.edu/SAM_T08/T08-query.html
Affiliation(s)
- Kevin Karplus
- Department of Biomolecular Engineering, Baskin School of Engineering, University of California, Santa Cruz, CA 95064, USA.
13
Wang Y, Sadreyev RI, Grishin NV. PROCAIN: protein profile comparison with assisting information. Nucleic Acids Res 2009; 37:3522-30. [PMID: 19357092 PMCID: PMC2699500 DOI: 10.1093/nar/gkp212] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
Detection of remote sequence homology is essential for the accurate inference of protein structure, function and evolution. The most sensitive detection methods involve the comparison of evolutionary patterns reflected in multiple sequence alignments (MSAs) of protein families. We present PROCAIN, a new method for MSA comparison based on the combination of 'vertical' MSA context (substitution constraints at individual sequence positions) and 'horizontal' context (patterns of residue content at multiple positions). Based on a simple and tractable profile methodology and primitive measures for the similarity of horizontal MSA patterns, the method achieves a quality of homology detection comparable to that of a more complex advanced method employing hidden Markov models (HMMs) and secondary structure (SS) prediction. Adding SS information further improves PROCAIN performance beyond the capabilities of current state-of-the-art tools. The potential value of the method for structure/function predictions is illustrated by the detection of subtle homology between evolutionarily distant yet structurally similar protein domains. PROCAIN, relevant databases and tools can be downloaded from: http://prodata.swmed.edu/procain/download. The web server can be accessed at http://prodata.swmed.edu/procain/procain.php.
Affiliation(s)
- Yong Wang
- Biomedical Engineering Program, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
14
Katzman S, Barrett C, Thiltgen G, Karchin R, Karplus K. PREDICT-2ND: a tool for generalized protein local structure prediction. Bioinformatics 2008; 24:2453-9. [PMID: 18757875 DOI: 10.1093/bioinformatics/btn438] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION Predictions of protein local structure, derived from sequence alignment information alone, provide visualization tools for biologists to evaluate the importance of amino acid residue positions of interest in the absence of X-ray crystal/NMR structures or homology models. They are also useful as inputs to sequence analysis and modeling tools, such as hidden Markov models (HMMs), which can be used to search for homology in databases of known protein structure. In addition, local structure predictions can be used as a component of cost functions in genetic algorithms that predict protein tertiary structure. We have developed a program (predict-2nd) that trains multilayer neural networks and have applied it to numerous local structure alphabets, tuning network parameters such as the number of layers, the number of units in each layer and the window sizes of each layer. We have had the most success with four-layer networks, with gradually increasing window sizes at each layer. RESULTS Because the four-layer neural nets occasionally get trapped in poor local optima, our training protocol now uses many different random starts, with short training runs, followed by more training on the best performing networks from the short runs. One recent addition to the program is the option to add a guide sequence to the profile inputs, increasing the number of inputs per position by 20. We find that use of a guide sequence provides a small but consistent improvement in the predictions for several different local-structure alphabets. AVAILABILITY Local structure prediction with the methods described here is available for use online at http://www.soe.ucsc.edu/compbio/SAM_T08/T08-query.html. The source code and example networks for PREDICT-2ND are available at http://www.soe.ucsc.edu/~karplus/predict-2nd/ A required C++ library is available at http://www.soe.ucsc.edu/~karplus/ultimate/
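The guide-sequence option described above can be pictured with a small sketch. It assumes, purely for illustration, that each profile column is a list of 20 amino acid frequencies and that the guide residue is appended as a 20-element one-hot vector, which is consistent with the stated increase of 20 inputs per position. The function name `add_guide_sequence`, the residue ordering and the encoding details are hypothetical and are not taken from the PREDICT-2ND source code.

```python
# Assumed residue ordering for the one-hot block (hypothetical, not from PREDICT-2ND).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def add_guide_sequence(profile, guide):
    """Append a one-hot encoding of the guide residue to every profile column.

    profile: list of per-position vectors (e.g. 20 residue frequencies each).
    guide:   string with one residue per profile position.
    Returns a new list in which each column has 20 extra inputs.
    """
    if len(profile) != len(guide):
        raise ValueError("profile and guide sequence must have the same length")
    extended = []
    for column, residue in zip(profile, guide):
        # A residue outside AMINO_ACIDS (e.g. 'X') yields an all-zero block.
        one_hot = [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
        extended.append(list(column) + one_hot)
    return extended


# Usage: a flat three-position profile with guide sequence "ACD".
toy_profile = [[0.05] * 20 for _ in range(3)]
toy_inputs = add_guide_sequence(toy_profile, "ACD")
```

Each position in `toy_inputs` now carries 40 values: the original 20 profile entries followed by the 20 guide-residue indicators.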
Affiliation(s)
- Sol Katzman
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA 95064, USA
15
Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection. BMC Bioinformatics 2008; 9:298. [PMID: 18590572 PMCID: PMC2459191 DOI: 10.1186/1471-2105-9-298] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 01/08/2008] [Accepted: 07/01/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement. RESULTS The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognizes 30% of homologous proteins at > 0.99 ROC scores, while the original PPA features, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At > 0.90 ROC50 scores, 25% of proteins correctly detect remotely related proteins with NMF features, whereas only 1% do so with the original PPA features. In addition, we investigate the effect of the number of positive training examples and the number of basis vectors on performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins. CONCLUSION The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.
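As background for the abstract above, here is a minimal sketch of NMF itself. It assumes the standard multiplicative update rules of Lee and Seung for the squared-error objective; the abstract does not state which NMF algorithm or feature layout the authors used, so the function names, the toy matrix and the parameter defaults are illustrative only.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between V and its approximation W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Toy nonnegative feature matrix (rows: samples, columns: features).
toy_v = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
toy_w, toy_h = nmf(toy_v, rank=2)
```

The rows of `toy_h` play the role of basis vectors, and each row of `toy_w` gives the nonnegative weights with which a sample combines them, which is the part-based representation the abstract refers to.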
16
Singh A, Kushwaha HR, Sharma P. Molecular modelling and comparative structural account of aspartyl beta-semialdehyde dehydrogenase of Mycobacterium tuberculosis (H37Rv). J Mol Model 2008; 14:249-63. [PMID: 18236087 DOI: 10.1007/s00894-008-0267-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/09/2007] [Accepted: 01/03/2008] [Indexed: 11/29/2022]
Abstract
Aspartyl beta-semialdehyde dehydrogenase (ASADH) is an important enzyme, occupying the first branch position of the biosynthetic pathway of the aspartate family of amino acids in bacteria, fungi and higher plants. It catalyses reversible dephosphorylation of L-beta-aspartyl phosphate (betaAP) to L-aspartate-beta-semialdehyde (ASA), a key intermediate in the biosynthesis of diaminopimelic acid (DAP), an essential component of cross linkages in bacterial cell walls. Since the aspartate pathway is unique to plants and bacteria, and ASADH is the key enzyme in this pathway, it becomes an attractive target for antimicrobial agent development. Therefore, with the objective of deducing comparative structural models, we have described a molecular model emphasizing the uniqueness of ASADH from Mycobacterium tuberculosis (H37Rv) that should generate insights into the structural distinctiveness of this protein as compared to structurally resolved ASADH from other bacterial species. We find that mtASADH exhibits structural features common to bacterial ASADH, while other structural motifs are not present. Structural analysis of various domains in mtASADH reveals structural conservation among all bacterial ASADH proteins. The results suggest that the probable mechanism of action of the mtASADH enzyme might be the same as that of other bacterial ASADH. Analysis of the structure of mtASADH will shed light on its mechanism of action and may help in designing suitable antagonists against this enzyme that could control the growth of Mycobacterium tuberculosis.
Affiliation(s)
- Anupama Singh
- Centre of Computational Biology and Bioinformatics (CCBB), School of Information Technology, Jawaharlal Nehru University, New Delhi, 110067, India
17
Yao XQ, Zhu H, She ZS. A dynamic Bayesian network approach to protein secondary structure prediction. BMC Bioinformatics 2008; 9:49. [PMID: 18218144 PMCID: PMC2266706 DOI: 10.1186/1471-2105-9-49] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Received: 09/05/2007] [Accepted: 01/25/2008] [Indexed: 11/19/2022] Open
Abstract
Background Protein secondary structure prediction methods based on probabilistic models such as the hidden Markov model (HMM) appeal to many because they provide meaningful information relevant to the sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM). Results In this paper, we report a new method of probabilistic nature for protein secondary structure prediction, based on dynamic Bayesian networks (DBN). The new method models the PSI-BLAST profile of a protein sequence using a multivariate Gaussian distribution, and simultaneously takes into account the dependency between the profile and secondary structure and the dependency between profiles of neighboring residues. In addition, a segment length distribution is introduced for each secondary structure state. Tests show that the DBN method achieves a significant improvement in accuracy compared to other pure HMM-type methods. Further improvement is achieved by combining the DBN with an NN, a method called DBNN, which shows better Q3 accuracy than many popular methods and is competitive with the current state of the art. The most interesting feature of DBN/DBNN is that a significant improvement in the prediction accuracy is achieved when combined with other methods by a simple consensus. Conclusion The DBN method using a Gaussian distribution for the PSI-BLAST profile and a high-order dependency between profiles of neighboring residues produces significantly better prediction accuracy than other HMM-type probabilistic methods. Owing to their different nature, the DBN and NN combine to form a more accurate method, DBNN. Future improvement may be achieved by combining DBNN with a method of SVM type.
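Two quantities mentioned in this abstract, Q3 accuracy and a simple consensus of several predictors, can be sketched as follows. Q3 is taken here in its usual sense, the percentage of residues whose three-state label (H, E or C) is predicted correctly. The consensus rule shown is a per-position majority vote, which is only an assumption: the abstract does not specify how the consensus was formed, and the function names are hypothetical.

```python
from collections import Counter


def q3(predicted, observed):
    """Percentage of positions whose three-state label matches."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def majority_consensus(predictions):
    """Per-position majority vote over several predicted label strings.

    Ties are resolved in favour of the earliest-listed predictor
    (an arbitrary choice made for this sketch).
    """
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        consensus.append(next(s for s in column if counts[s] == best))
    return "".join(consensus)


# Usage with three toy predictions for a four-residue protein.
combined = majority_consensus(["HHEC", "HHCC", "HECC"])
accuracy = q3(combined, "HHCC")
```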
Affiliation(s)
- Xin-Qiu Yao
- State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, Peking University, Beijing 100871, China.
18
Abstract
COMPASS is a method for homology detection and local alignment construction based on the comparison of multiple sequence alignments (MSAs). The method derives numerical profiles from given MSAs, constructs local profile-profile alignments and analytically estimates E-values for the detected similarities. Until now, COMPASS was only available for download and local installation. Here, we present a new web server featuring the latest version of COMPASS, which provides (i) increased sensitivity and selectivity of homology detection; (ii) longer, more complete alignments; and (iii) faster computational speed. After submission of the query MSA or single sequence, the server performs searches versus a user-specified database. The server includes detailed and intuitive control of the search parameters. A flexible output format, structured similarly to BLAST and PSI-BLAST, provides an easy way to read and analyze the detected profile similarities. Brief help sections are available for all input parameters and output options, along with detailed documentation. To illustrate the value of this tool for protein structure-functional prediction, we present two examples of detecting distant homologs for uncharacterized protein families. Available at http://prodata.swmed.edu/compass.
Affiliation(s)
- Ruslan I Sadreyev
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA.
19
Saha RP, Chakrabarti P. Molecular modeling and characterization of Vibrio cholerae transcription regulator HlyU. BMC Structural Biology 2006; 6:24. [PMID: 17116251 PMCID: PMC1665450 DOI: 10.1186/1472-6807-6-24] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Received: 08/17/2006] [Accepted: 11/20/2006] [Indexed: 11/15/2022]
Abstract
Background The SmtB/ArsR family of prokaryotic metal-regulatory transcriptional repressors represses the expression of operons linked to stress-inducing concentrations of heavy metal ions, while derepression results from direct binding of metal ions by these 'metal-sensor' proteins. The HlyU protein from Vibrio cholerae is the positive regulator of the haemolysin gene; it also plays an important role in regulating the expression of the virulence genes. Despite the understanding of its biochemical properties, its structure and relationship to other protein families remain unknown. Results We find that HlyU exhibits structural features common to the SmtB/ArsR family of transcriptional repressors. Analysis of the modeled structure of HlyU reveals that it does not have the key metal-sensing residues which are unique to the SmtB/ArsR family of repressors, yet the tertiary structure is very similar to that of the family members. HlyU is the only member that has a positive control on transcription, while all the other members in the family are repressors. An evolutionary analysis with other SmtB/ArsR family members suggests that during evolution HlyU probably arose through gene duplication and mutational events that led to the emergence of this protein from an ancestral transcriptional repressor by the loss of the metal-binding sites. Conclusion The study indicates that the same protein family can contain both a positive regulator of transcription and repressors, with the exact function being controlled by the absence or the presence of metal-binding sites.
Affiliation(s)
- Rudra P Saha
- Department of Biochemistry, Bose Institute, P-1/12 CIT Scheme VIIM, Calcutta 700 054, India
- Pinak Chakrabarti
- Department of Biochemistry, Bose Institute, P-1/12 CIT Scheme VIIM, Calcutta 700 054, India
20
Ku CJ, Yona G. The distance-profile representation and its application to detection of distantly related protein families. BMC Bioinformatics 2005; 6:282. [PMID: 16316461 PMCID: PMC1345692 DOI: 10.1186/1471-2105-6-282] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/25/2005] [Accepted: 11/29/2005] [Indexed: 11/11/2022] Open
Abstract
Background Detecting homology between remotely related protein families is an important problem in computational biology since the biological properties of uncharacterized proteins can often be inferred from those of homologous proteins. Many existing approaches address this problem by measuring the similarity between proteins through sequence or structural alignment. However, these methods do not exploit collective aspects of the protein space and the computed scores are often noisy and frequently fail to recognize distantly related protein families. Results We describe an algorithm that improves over the state of the art in homology detection by utilizing global information on the proximity of entities in the protein space. Our method relies on a vectorial representation of proteins and protein families and uses structure-specific association measures between proteins and template structures to form a high-dimensional feature vector for each query protein. These vectors are then processed and transformed to sparse feature vectors that are treated as statistical fingerprints of the query proteins. The new representation induces a new metric between proteins measured by the statistical difference between their corresponding probability distributions. Conclusion Using several performance measures we show that the new tool considerably improves the performance in recognizing distant homologies compared to existing approaches such as PSIBLAST and FUGUE.
Affiliation(s)
- Chin-Jen Ku
- Department of Computer Science, Cornell University, Ithaca, NY, USA
- Golan Yona
- Department of Computer Science, Cornell University, Ithaca, NY, USA
21
Choo KH, Tong JC, Zhang L. Recent applications of Hidden Markov Models in computational biology. Genomics Proteomics & Bioinformatics 2005; 2:84-96. [PMID: 15629048 PMCID: PMC5172443 DOI: 10.1016/s1672-0229(04)02014-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Indexed: 11/17/2022]
Abstract
This paper examines recent developments and applications of Hidden Markov Models (HMMs) to various problems in computational biology, including multiple sequence alignment, homology detection, protein sequences classification, and genomic annotation.
Affiliation(s)
- Khar Heng Choo
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Joo Chuan Tong
- Department of Biochemistry, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
- Louxin Zhang
- Department of Mathematics, National University of Singapore, 2 Science Drive 2, Singapore 117543
- Corresponding author.
22
Nakai S, Li-Chan ECY, Dou J. Pattern similarity study of functional sites in protein sequences: lysozymes and cystatins. BMC Biochemistry 2005; 6:9. [PMID: 15904486 PMCID: PMC1173080 DOI: 10.1186/1471-2091-6-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 07/08/2004] [Accepted: 05/18/2005] [Indexed: 11/10/2022]
Abstract
BACKGROUND Although it is generally agreed that topography is more conserved than sequences, proteins sharing the same fold can have different functions, while there are protein families with low sequence similarity. An alternative method for profile analysis of characteristic conserved positions of the motifs within the 3D structures may be needed for functional annotation of protein sequences. Using the approach of quantitative structure-activity relationships (QSAR), we have proposed a new algorithm for postulating functional mechanisms on the basis of pattern similarity and average of property values of side-chains in segments within sequences. This approach was used to search for functional sites of proteins belonging to the lysozyme and cystatin families. RESULTS Hydrophobicity and beta-turn propensity of reference segments with 3-7 residues were used for the homology similarity search (HSS) for active sites. Hydrogen bonding was used as the side-chain property for searching the binding sites of lysozymes. The profiles of similarity constants and average values of these parameters as functions of their positions in the sequences could identify both active and substrate binding sites of the lysozyme of Streptomyces coelicolor, which has been reported as a new fold enzyme (Cellosyl). The same approach was successfully applied to cystatins, especially for postulating the mechanisms of amyloidosis of human cystatin C as well as human lysozyme. CONCLUSION Pattern similarity and average index values of structure-related properties of side chains in short segments of three residues or longer were, for the first time, successfully applied for predicting functional sites in sequences. This new approach may be applicable to studying functional sites in un-annotated proteins, for which complete 3D structures are not yet available.
Affiliation(s)
- Shuryo Nakai
- Food, Nutrition and Health, The University of British Columbia, 6650 Marine Drive, Vancouver, B.C., Canada
- Eunice CY Li-Chan
- Food, Nutrition and Health, The University of British Columbia, 6650 Marine Drive, Vancouver, B.C., Canada
- Jinglie Dou
- Food, Nutrition and Health, The University of British Columbia, 6650 Marine Drive, Vancouver, B.C., Canada
23
Ginalski K, Grishin NV, Godzik A, Rychlewski L. Practical lessons from protein structure prediction. Nucleic Acids Res 2005; 33:1874-91. [PMID: 15805122 PMCID: PMC1074308 DOI: 10.1093/nar/gki327] [Citation(s) in RCA: 99] [Impact Index Per Article: 5.2] [Indexed: 11/25/2022] Open
Abstract
Despite recent efforts to develop automated protein structure determination protocols, structural genomics projects are slow in generating fold assignments for complete proteomes, and spatial structures remain unknown for many protein families. Alternative cheap and fast methods to assign folds using prediction algorithms continue to provide valuable structural information for many proteins. The development of high-quality prediction methods has been boosted in the last years by objective community-wide assessment experiments. This paper gives an overview of the currently available practical approaches to protein structure prediction capable of generating accurate fold assignment. Recent advances in assessment of the prediction quality are also discussed.
Affiliation(s)
- Krzysztof Ginalski
- BioInfoBank Institute, ul. Limanowskiego 24A, 60-744 Poznań, Poland
- Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5a, 02-106 Warsaw, Poland
- Department of Biochemistry, University of Texas, Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9038, USA
- Nick V. Grishin
- Department of Biochemistry, University of Texas, Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9038, USA
- Howard Hughes Medical Institute, University of Texas, Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA
- Adam Godzik
- The Burnham Institute, 10901 N. Torrey Pines Road, La Jolla, CA 92037, USA
- Leszek Rychlewski
- BioInfoBank Institute, ul. Limanowskiego 24A, 60-744 Poznań, Poland
- To whom correspondence should be addressed. Tel: +48 604 628805; Fax: +48 61 8643350;
24
Wu KP, Lin HN, Chang JM, Sung TY, Hsu WL. HYPROSP: a hybrid protein secondary structure prediction algorithm--a knowledge-based approach. Nucleic Acids Res 2004; 32:5059-65. [PMID: 15448186 PMCID: PMC521652 DOI: 10.1093/nar/gkh836] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022] Open
Abstract
We develop a knowledge-based approach (called PROSP) for protein secondary structure prediction. The knowledge base contains small peptide fragments together with their secondary structural information. A quantitative measure M, called match rate, is defined to measure the amount of structural information that a target protein can extract from the knowledge base. Our experimental results show that proteins with a higher match rate will likely be predicted more accurately based on PROSP. That is, there is roughly a monotone correlation between the prediction accuracy and the amount of structure matching with the knowledge base. To fully utilize the strength of our knowledge base, a hybrid prediction method is proposed as follows: if the match rate of a target protein is at least 80%, we use the extracted information to make the prediction; otherwise, we adopt a popular machine-learning approach. This comprises our hybrid protein structure prediction (HYPROSP) approach. We use the DSSP and EVA data as our datasets and PSIPRED as our underlying machine-learning algorithm. For target proteins with match rate at least 80%, the average Q3 of PROSP is 3.96 and 7.2 better than that of PSIPRED on DSSP and EVA data, respectively.
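The hybrid rule stated in the abstract (use the knowledge-based prediction when the match rate is at least 80%, otherwise fall back to a machine-learning predictor) can be sketched as a simple dispatch. The function and parameter names below are hypothetical, and the two predictors are placeholder stubs; how PROSP computes the match rate and how PSIPRED is invoked are not described in the abstract and are not modelled here.

```python
def hybrid_predict(sequence, match_rate, knowledge_based, machine_learning,
                   threshold=80.0):
    """Choose a predictor according to the match rate (in percent).

    knowledge_based / machine_learning: callables mapping a sequence to a
    secondary-structure string. Both are supplied by the caller.
    Returns (method_name, prediction).
    """
    if match_rate >= threshold:
        return "knowledge-based", knowledge_based(sequence)
    return "machine-learning", machine_learning(sequence)


# Placeholder predictors standing in for PROSP and PSIPRED (stubs only).
def stub_knowledge_based(sequence):
    return "H" * len(sequence)


def stub_machine_learning(sequence):
    return "C" * len(sequence)


# Usage: a match rate of 85% selects the knowledge-based branch.
method, prediction = hybrid_predict("MKTAYIAK", 85.0,
                                    stub_knowledge_based, stub_machine_learning)
```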
Affiliation(s)
- Kuen-Pin Wu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
25
Sadreyev RI, Baker D, Grishin NV. Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci 2004; 12:2262-72. [PMID: 14500884 PMCID: PMC2366929 DOI: 10.1110/ps.03197403] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/24/2022]
Abstract
Recently we proposed a novel method of alignment–alignment comparison, COMPASS (the tool for COmparison of Multiple Protein Alignments with Assessment of Statistical Significance). Here we present several examples of the relations between PFAM protein families that were detected by COMPASS and that lead to the predictions of presently unresolved protein structures. We discuss relatively straightforward COMPASS predictions that are new and interesting to us, and that would require a substantial time and effort to justify even for a skilled PSI-BLAST user. All of the presented COMPASS hits are independently confirmed by other methods, including the ab initio structure-prediction method ROSETTA. The tertiary structure predictions made by ROSETTA proved to be useful for improving sequence-derived alignments, because they are based on a reasonable folding of the polypeptide chain rather than on the information from sequence databases. The ability of COMPASS to predict new relations within the PFAM database indicates the high sensitivity of COMPASS searches and substantiates its potential value for the discovery of previously unknown similarities between protein families.
Affiliation(s)
- Ruslan I Sadreyev
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas 75390-9050, USA
26
Zhang DQ, Liu B, Feng DR, He YM, Wang SQ, Wang HB, Wang JF. Significance of conservative asparagine residues in the thermal hysteresis activity of carrot antifreeze protein. Biochem J 2004; 377:589-95. [PMID: 14531728 PMCID: PMC1223888 DOI: 10.1042/bj20031249] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Received: 08/15/2003] [Revised: 10/06/2003] [Accepted: 10/08/2003] [Indexed: 11/17/2022]
Abstract
The approximately 24-amino-acid leucine-rich tandem repeat motif (PXXXXXLXXLXXLXLSXNXLXGXI) of carrot antifreeze protein comprises most of the processed protein and should contribute at least partly to the ice-binding site. Structural predictions using publicly available online sources indicated that the theoretical three-dimensional model of this plant protein includes a 10-loop beta-helix containing the approximately 24-amino-acid tandem repeat. This theoretical model indicated that conservative asparagine residues create putative ice-binding sites with surface complementarity to the {10-10} prism plane of ice. We used site-specific mutagenesis to test the importance of these residues, and observed a distinct loss of thermal hysteresis activity when conservative asparagines were replaced with valine or glutamine, whereas a large increase in thermal hysteresis was observed when phenylalanine or threonine residues were replaced with asparagine, putatively resulting in the formation of an ice-binding site. These results confirmed that the ice-binding site of carrot antifreeze protein consists of conservative asparagine residues in each beta-loop. We also found that its thermal hysteresis activity is directly correlated with the length of its asparagine-rich binding site, and hence with the size of its ice-binding face.
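The repeat motif quoted in the abstract can be turned into a search pattern, as in the minimal sketch below. It assumes that X stands for any single residue and that every repeat is exactly 24 residues long; because the abstract describes the repeat as approximately 24 residues, real repeats may vary in length and would need a looser pattern. The helper names and the toy sequence are hypothetical.

```python
import re

MOTIF = "PXXXXXLXXLXXLXLSXNXLXGXI"  # motif as quoted in the abstract


def motif_to_regex(motif):
    """Translate a motif string into a regex, treating X as any residue."""
    return re.compile("".join("." if c == "X" else re.escape(c) for c in motif))


def find_repeats(sequence, motif=MOTIF):
    """Return start positions (0-based) of non-overlapping motif matches."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]


# Toy sequence: two exact copies of the motif with every X replaced by alanine.
toy_repeat = MOTIF.replace("X", "A")
positions = find_repeats(toy_repeat * 2)
```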
Affiliation(s)
- Dang-Quan Zhang
- The Key Laboratory of Gene Engineering of Ministry of Education, School of Life Sciences, Sun Yat-sen University, Guangzhou 510275, China
27
Koh IYY, Eyrich VA, Marti-Renom MA, Przybylski D, Madhusudhan MS, Eswar N, Graña O, Pazos F, Valencia A, Sali A, Rost B. EVA: Evaluation of protein structure prediction servers. Nucleic Acids Res 2003; 31:3311-5. [PMID: 12824315 PMCID: PMC169025 DOI: 10.1093/nar/gkg619] [Citation(s) in RCA: 134] [Impact Index Per Article: 6.4] [Indexed: 11/14/2022] Open
Abstract
EVA (http://cubic.bioc.columbia.edu/eva/) is a web server for evaluation of the accuracy of automated protein structure prediction methods. The evaluation is updated automatically each week, to cope with the large number of existing prediction servers and the constant changes in the prediction methods. EVA currently assesses servers for secondary structure prediction, contact prediction, comparative protein structure modelling and threading/fold recognition. Every day, sequences of newly available protein structures in the Protein Data Bank (PDB) are sent to the servers and their predictions are collected. The predictions are then compared to the experimental structures once a week; the results are published on the EVA web pages. Over time, EVA has accumulated prediction results for a large number of proteins, ranging from hundreds to thousands, depending on the prediction method. This large sample assures that methods are compared reliably. As a result, EVA provides useful information to developers as well as users of prediction methods.
Affiliation(s)
- Ingrid Y Y Koh
- Columbia University Center for Computational Biology and Bioinformatics (C2B2), Russ Berrie Pavilion, 1150 St Nicholas Avenue, New York, NY 10032, USA.
28
Mallick P, Weiss R, Eisenberg D. The directional atomic solvation energy: an atom-based potential for the assignment of protein sequences to known folds. Proc Natl Acad Sci U S A 2002; 99:16041-6. [PMID: 12461172 PMCID: PMC138561 DOI: 10.1073/pnas.252626399] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022] Open
Abstract
The Directional Atomic Solvation EnergY (DASEY) is an atom-based description of the environment of an amino acid position within a known 3D protein structure. The DASEY has been developed to align and score a probe amino acid sequence to a library of template protein structures for fold assignment. DASEY is computed by summing the atomic solvation parameters of atoms falling within a tetrahedral sector, or petal, extending 16 Å along each of the four bond axes of each alpha-carbon atom of the protein. The DASEY discriminates between pairs of structurally equivalent positions and random pairs in protein structures sharing a fold but belonging to different superfamilies, unlike some previous descriptors of protein environments, such as buried area. Furthermore, the DASEY values have characteristic patterns of residue replacement, an essential feature of a successful fold assignment method. Benchmarking fold assignment with DASEY achieves coverage of 56% of sequences with 90% accuracy when probe sequences are matched to protein structural templates belonging to the same fold but to a different superfamily, an improvement of greater than 200% over a previous method.
Affiliation(s)
- Parag Mallick
- Department of Chemistry and Biochemistry, and University of California, UCLA-DOE Center for Genomics and Proteomics, Molecular Biology Institute, Howard Hughes Medical Institute, University of California, Los Angeles, CA 90095-1570, USA
29
Samudrala R, Levitt M. A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2002; 2:3. [PMID: 12150712 PMCID: PMC122083 DOI: 10.1186/1472-6807-2-3] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Received: 04/09/2002] [Accepted: 08/01/2002] [Indexed: 11/21/2022]
Abstract
BACKGROUND We thoroughly analyse the results of 40 blind predictions for which an experimental answer was made available at the fourth meeting on the critical assessment of protein structure methods (CASP4). Using our comparative modelling and fold recognition methodologies, we made 29 predictions for targets that had sequence identities ranging from 50% to 10% to the nearest related protein with known structure. Using our ab initio methodologies, we made eleven predictions for targets that had no detectable sequence relationships. RESULTS For 23 of these proteins, we produced models ranging from 1.0 to 6.0 Å root mean square deviation (RMSD) for the Cα atoms between the model and the corresponding experimental structure for all or large parts of the protein, with model accuracies scaling fairly linearly with respect to sequence identity (i.e., the higher the sequence identity, the better the prediction). We produced nine models with accuracies ranging from 4.0 to 6.0 Å Cα RMSD for 60-100 residue proteins (or large fragments of a protein), with a prediction accuracy of 4.0 Å Cα RMSD for residues 1-80 for T110/rbfa. CONCLUSIONS The areas of protein structure prediction that work well, and areas that need improvement, are discernible by examining how our methods have performed over the past four CASP experiments. These results have implications for modelling the structure of all tractable proteins encoded by the genome of an organism.
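The accuracy measure used throughout this abstract, the root mean square deviation over alpha-carbon atoms, can be computed as in the sketch below. It assumes that the model and the experimental structure have already been optimally superimposed and that the two coordinate lists are in the same residue order; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math


def ca_rmsd(model_coords, reference_coords):
    """RMSD between two equally long lists of (x, y, z) alpha-carbon positions.

    Assumes the structures are already superimposed; no fitting is done here.
    The result is in the same unit as the input coordinates.
    """
    if len(model_coords) != len(reference_coords) or not model_coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(model_coords, reference_coords):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(model_coords))


# Usage: two atoms, one displaced by 2 units along z.
toy_rmsd = ca_rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
                   [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)])
```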
Affiliation(s)
- Ram Samudrala
- Department of Microbiology, University of Washington, School of Medicine, Seattle, WA 98195, USA
- Michael Levitt
- Department of Structural Biology, Stanford University, School of Medicine, Stanford, CA 94305, USA
|
30
|
Abstract
Fold recognition predicts protein three-dimensional structure by establishing relationships between a protein sequence and known protein structures. Most methods explicitly use information derived from the secondary and tertiary structure of the templates. Here we show that rigorous application of a sequence search method (PSI-BLAST) with no reference to secondary or tertiary structure information is able to perform as well as traditional fold recognition methods. Since the method, SENSER, does not require knowledge of the three-dimensional structure, it can be used to infer relationships that are not tractable by methods dependent on structural templates.
Affiliation(s)
- Kristin K Koretke
- Microbial Bioinformatics Group, GlaxoSmithKline, Collegeville, Pennsylvania 19426-0989, USA.
|
31
|
MacGregor EA. Possible structure and active site residues of starch, glycogen, and sucrose synthases. Journal of Protein Chemistry 2002; 21:297-306. [PMID: 12168700 DOI: 10.1023/a:1019701621256] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
A group of enzymes that include muscle glycogen phosphorylase and sugar transferases involved in, for example, the glucosylation of DNA and the synthesis of peptidoglycan are known to possess the same basic three-dimensional fold. Here the possibility is examined that other monosaccharide transferases, those that catalyze synthesis of starch, glycogen, and the disaccharide sucrose, resemble the phosphorylase-type enzymes in structure. In particular, a clear relationship is shown, for the first time, between mammalian glycogen synthases and the phosphorylase structural group of proteins. Domain architecture and secondary structure are discussed, and the possible role of several conserved amino acids at the active site is explored.
Affiliation(s)
- E Ann MacGregor
- Department of Chemistry, University of Manitoba, Winnipeg, Canada.
|
32
|
Bujnicki JM, Rychlewski L. RNA:(guanine-N2) methyltransferases RsmC/RsmD and their homologs revisited--bioinformatic analysis and prediction of the active site based on the uncharacterized Mj0882 protein structure. BMC Bioinformatics 2002; 3:10. [PMID: 11929612 PMCID: PMC102759 DOI: 10.1186/1471-2105-3-10] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Received: 09/29/2001] [Accepted: 04/03/2002] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: Escherichia coli guanine-N2 (m2G) methyltransferases (MTases) RsmC and RsmD modify nucleosides G1207 and G966 of 16S rRNA. They possess a common MTase domain in the C-terminus and a variable region in the N-terminus. Their C-terminal domain is related to the YbiN family of hypothetical MTases, but nothing is known about the structure or function of the N-terminal domain. RESULTS: Using a combination of sequence database searches and fold recognition methods, it has been demonstrated that the N-termini of RsmC and RsmD are related to each other and that they represent a "degenerated" version of the C-terminal MTase domain. Novel members of the YbiN family from Archaea and Eukaryota were also identified. It is inferred that YbiN and both domains of RsmC and RsmD are closely related to a family of putative MTases from Gram-positive bacteria and Archaea, typified by the Mj0882 protein from M. jannaschii (1dus in PDB). Based on the results of sequence analysis and structure prediction, the residues involved in cofactor binding, target recognition and catalysis were identified, and the mechanism of the guanine-N2 methyltransfer reaction was proposed. CONCLUSIONS: Using the known Mj0882 structure, a comprehensive analysis of sequence-structure-function relationships in the family of genuine and putative m2G MTases was performed. The results provide novel insight into the mechanism of m2G methylation and will serve as a platform for experimental analysis of numerous uncharacterized N-MTases.
Affiliation(s)
- Janusz M Bujnicki
- Bioinformatics Laboratory, International Institute of Cell and Molecular Biology, ul. ks. Trojdena 4, 02-109 Warsaw, Poland
|
33
|
Bonneau R, Baker D. Ab initio protein structure prediction: progress and prospects. Annual Review of Biophysics and Biomolecular Structure 2001; 30:173-89. [PMID: 11340057 DOI: 10.1146/annurev.biophys.30.1.173] [Citation(s) in RCA: 226] [Impact Index Per Article: 9.8] [Indexed: 11/09/2022]
Abstract
Considerable recent progress has been made in the field of ab initio protein structure prediction, as witnessed by the third Critical Assessment of Structure Prediction (CASP3). In spite of this progress, much work remains, for the field has yet to produce consistently reliable ab initio structure prediction protocols. In this work, we review the features of current ab initio protocols in an attempt to highlight the foundations of recent progress in the field and suggest promising directions for future work.
Affiliation(s)
- R Bonneau
- Department of Biochemistry, University of Washington, Seattle, Washington, Box 357350, 98195, USA.
|
34
|
Rognes T. ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res 2001; 29:1647-52. [PMID: 11266569 PMCID: PMC31274 DOI: 10.1093/nar/29.7.1647] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.8] [Indexed: 11/12/2022] Open
Abstract
There is a need for faster and more sensitive algorithms for sequence similarity searching in view of the rapidly increasing amounts of genomic sequence data available. Parallel processing capabilities in the form of the single instruction, multiple data (SIMD) technology are now available in common microprocessors and enable a single microprocessor to perform many operations in parallel. The ParAlign algorithm has been specifically designed to take advantage of this technology. The new algorithm initially exploits parallelism to perform a very rapid computation of the exact optimal ungapped alignment score for all diagonals in the alignment matrix. Then, a novel heuristic is employed to compute an approximate score of a gapped alignment by combining the scores of several diagonals. This approximate score is used to select the most interesting database sequences for a subsequent Smith-Waterman alignment, which is also parallelised. The resulting method represents a substantial improvement compared to existing heuristics. The sensitivity and specificity of ParAlign were found to be as good as those of Smith-Waterman implementations when the same method for computing the statistical significance of the matches was used. In terms of speed, only the significantly less sensitive NCBI BLAST 2 program was found to outperform the new approach. Online searches are available at http://dna.uio.no/search/
Affiliation(s)
- T Rognes
- Department of Molecular Biology, Institute of Medical Microbiology, University of Oslo, The National Hospital, NO-0027 Oslo, Norway.
|
35
|
Bujnicki JM, Elofsson A, Fischer D, Rychlewski L. LiveBench-1: continuous benchmarking of protein structure prediction servers. Protein Sci 2001; 10:352-61. [PMID: 11266621 PMCID: PMC2373940 DOI: 10.1110/ps.40501] [Citation(s) in RCA: 101] [Impact Index Per Article: 4.4] [Indexed: 10/17/2022]
Abstract
We present a novel, continuous approach aimed at the large-scale assessment of the performance of available fold-recognition servers. Six popular servers were investigated: PDB-Blast, FFAS, T98-lib, GenTHREADER, 3D-PSSM, and INBGU. The assessment was conducted using as prediction targets a large number of selected protein structures released from October 1999 to April 2000. A target was selected if its sequence showed no significant similarity to any of the proteins previously available in the structural database. Overall, the servers were able to produce structurally similar models for one-half of the targets, but significantly accurate sequence-structure alignments were produced for only one-third of the targets. We further classified the targets into two sets: easy and hard. We found that all servers were able to find the correct answer for the vast majority of the easy targets if a structurally similar fold was present in the server's fold libraries. However, among the hard targets--where standard methods such as PSI-BLAST fail--the most sensitive fold-recognition servers were able to produce similar models for only 40% of the cases, half of which had a significantly accurate sequence-structure alignment. Among the hard targets, the presence of updated libraries appeared to be less critical for the ranking. An "ideally combined consensus" prediction, where the results of all servers are considered, would increase the percentage of correct assignments by 50%. Each server had a number of cases with a correct assignment, where the assignments of all the other servers were wrong. This emphasizes the benefits of considering more than one server in difficult prediction tasks. The LiveBench program (http://BioInfo.PL/LiveBench) is being continued, and all interested developers are cordially invited to join.
Affiliation(s)
- J M Bujnicki
- Bioinformatics Laboratory, International Institute of Molecular and Cell Biology, 02-109 Warsaw, Poland
|
36
|
Abstract
Several recent publications illustrated advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ in various methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined and it is shown that they can be identified based on the model quality.
Affiliation(s)
- L Jaroszewski
- The Burnham Institute, La Jolla, California 92037, USA
|