101
Sitbon E, Pietrokovski S. Occurrence of protein structure elements in conserved sequence regions. BMC Struct Biol 2007; 7:3. [PMID: 17210087 PMCID: PMC1781454 DOI: 10.1186/1472-6807-7-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 09/04/2006] [Accepted: 01/09/2007] [Indexed: 11/19/2022]
Abstract
BACKGROUND Conserved protein sequence regions are extremely useful for identifying and studying functionally and structurally important regions. By means of an integrated analysis of large-scale protein structure and sequence data, structural features of conserved protein sequence regions were identified. RESULTS Helices and turns were found to be underrepresented in conserved regions, while strands were found to be overrepresented. Similar numbers of loops were found in conserved and random regions. CONCLUSION These results can be understood in light of the structural constraints on different secondary structure elements, and their role in protein structural stabilization and topology. Strands can tolerate fewer sequence changes and nonetheless keep their specific shape and function. They thus tend to be more conserved than helices, which can keep their shape and function with more changes. Loop behavior can be explained by the presence of both constrained and freely changing loops in proteins. Our detailed statistical analysis of diverse proteins links protein evolution to the biophysics of protein thermodynamic stability and folding. The basic structural features of conserved sequence regions are also important determinants of protein structure motifs and their function.
Affiliation(s)
- Einat Sitbon
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
- Shmuel Pietrokovski
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
102
Raghava GPS, Barton GJ. Quantification of the variation in percentage identity for protein sequence alignments. BMC Bioinformatics 2006; 7:415. [PMID: 16984632 PMCID: PMC1592310 DOI: 10.1186/1471-2105-7-415] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Received: 12/23/2005] [Accepted: 09/19/2006] [Indexed: 11/26/2022]
Abstract
Background Percentage Identity (PID) is frequently quoted in discussion of sequence alignments since it appears simple and easy to understand. However, although there are several different ways to calculate percentage identity and each may yield a different result for the same alignment, the method of calculation is rarely reported. Accordingly, quantification of the variation in PID caused by the different calculations would help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparison of the protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of algorithm, and the combination of algorithm and PID method. Results The maximum variation in PID due to the calculation method was 11.5% while the effect of alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. In order to see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score calculated by reference to randomized sequences gave a correlation coefficient of 0.84. Conclusion Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. 
This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and in warning those who seek to interpret the value of a PID reported in the literature.
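To illustrate why the same alignment can yield several PID values, the sketch below computes percentage identity with four commonly used denominators. The function name, the denominator labels, and the example alignment are assumptions made for illustration; they are not necessarily the exact definitions evaluated in the study.

```python
def percent_identities(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Return PID under several denominators for two aligned rows.

    row_a and row_b are the two rows of a pairwise alignment and must
    have equal length. Denominator choices are illustrative assumptions.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")

    # Ignore columns in which both rows carry a gap.
    cols = [(x, y) for x, y in zip(row_a, row_b) if not (x == gap and y == gap)]

    identical = sum(1 for x, y in cols if x == y and x != gap)
    aligned_pairs = sum(1 for x, y in cols if x != gap and y != gap)
    len_a = sum(1 for x, _ in cols if x != gap)
    len_b = sum(1 for _, y in cols if y != gap)

    denominators = {
        "alignment_columns": len(cols),           # residues plus all gaps
        "aligned_residue_pairs": aligned_pairs,   # gap-free columns only
        "shorter_sequence": min(len_a, len_b),
        "mean_sequence_length": (len_a + len_b) / 2,
    }
    # Skip any denominator that is zero to avoid division errors.
    return {name: 100.0 * identical / d for name, d in denominators.items() if d}


if __name__ == "__main__":
    for name, value in percent_identities("ACDEFG-HIK", "ACD-FGAHLK").items():
        print(f"{name}: {value:.1f}%")
```

For this toy alignment the values range from 70.0% (all alignment columns) to 87.5% (gap-free columns only), which is why a PID quoted without its method of calculation is hard to interpret.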
Affiliation(s)
- GPS Raghava
- Bioinformatics Centre, Institute of Microbial Technology, Sector 39A, Chandigarh, India
- This work was initiated when both authors were at the University of Oxford, Laboratory of Molecular Biophysics, Rex Richards Building, Oxford, OX1 3QU, UK
- Geoffrey J Barton
- School of Life Sciences Research, University of Dundee, Dow Street, Dundee, DD1 5EH, Scotland, UK
- This work was initiated when both authors were at the University of Oxford, Laboratory of Molecular Biophysics, Rex Richards Building, Oxford, OX1 3QU, UK
103
Sidhu A, Yang ZR. Prediction of signal peptides using bio-basis function neural networks and decision trees. Appl Bioinformatics 2006; 5:13-9. [PMID: 16539533 DOI: 10.2165/00822942-200605010-00002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/02/2022]
Abstract
Signal peptide identification is of immense importance in drug design. Accurate identification of signal peptides is the first critical step to be able to change the direction of the targeting proteins and use the designed drug to target a specific organelle to correct a defect. Because experimental identification is the most accurate method, but is expensive and time-consuming, an efficient and affordable automated system is of great interest. In this article, we propose using an adapted neural network, called a bio-basis function neural network, and decision trees for predicting signal peptides. The bio-basis function neural network model and decision trees achieved 97.16% and 97.63% accuracy respectively, demonstrating that the methods work well for the prediction of signal peptides. Moreover, decision trees revealed that position P(1'), which is important in forming signal peptides, most commonly comprises either leucine or alanine. This concurs with the (P(3)-P(1)-P(1')) coupling model.
Affiliation(s)
- Ateesh Sidhu
- Biological Science, University of Warwick, Coventry, UK.
104
Abstract
Protein sequence comparison is the most powerful tool for the inference of novel protein structure and function. This type of inference is commonly based on the similar sequence-similar structure-similar function paradigm, and derived by sequence similarity searching on databases of protein sequences. As entire genomes have been being determined at a rapid rate, computational methods for comparing protein sequences will be more essential for probing the complexity of molecular machines. In this paper we introduce a pattern-comparison algorithm, which is based on the mathematical concepts of linear predictive coding (LPC) and LPC cepstral distortion measure, for computing similarities/dissimilarities between protein sequences. Experimental results on a real data set of functionally related and functionally nonrelated protein sequences have shown the effectiveness of the proposed approach on both accuracy and computational efficiency.
Affiliation(s)
- Tuan D Pham
- Bioinformatics Applications Research Center, School of Information Technology, James Cook University, Townsville, QLD 4811, Australia.
105
Krishna SS, Sadreyev RI, Grishin NV. A tale of two ferredoxins: sequence similarity and structural differences. BMC Struct Biol 2006; 6:8. [PMID: 16603087 PMCID: PMC1459171 DOI: 10.1186/1472-6807-6-8] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 07/19/2005] [Accepted: 04/09/2006] [Indexed: 11/10/2022]
Abstract
Background Sequence similarity between proteins is usually considered a reliable indicator of homology. Pyruvate-ferredoxin oxidoreductase and quinol-fumarate reductase contain ferredoxin domains that bind [Fe-S] clusters and are involved in electron transport. Profile-based methods for sequence comparison, such as PSI-BLAST and HMMer, suggest statistically significant similarity between these domains. Results The sequence similarity between these ferredoxin domains resides in the area of the [Fe-S] cluster-binding sites. Although overall folds of these ferredoxins bear no obvious similarity, the regions of sequence similarity display a remarkable local structural similarity. These short regions with pronounced sequence motifs are incorporated in completely different structural environments. In pyruvate-ferredoxin oxidoreductase (bacterial ferredoxin), the hydrophobic core of the domain is completed by two β-hairpins, whereas in quinol-fumarate reductase (α-helical ferredoxin), the cluster-binding motifs are part of a larger all-α-helical globin-like fold core. Conclusion Functionally meaningful sequence similarity may sometimes be reflected only in local structural similarity, but not in global fold similarity. If detected and used naively, such similarities may lead to incorrect fold predictions.
Affiliation(s)
- S Sri Krishna
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX, 75390-8816, USA
- Joint Center for Structural Genomics, University of California, San Diego, La Jolla, CA, 92093-0314, USA
- Ruslan I Sadreyev
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX, 75390-9050, USA
- Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX, 75390-9050, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX, 75390-8816, USA
106
Williams TJ, Zhang CL, Scott JH, Bazylinski DA. Evidence for autotrophy via the reverse tricarboxylic acid cycle in the marine magnetotactic coccus strain MC-1. Appl Environ Microbiol 2006; 72:1322-9. [PMID: 16461683 PMCID: PMC1392968 DOI: 10.1128/aem.72.2.1322-1329.2006] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.1] [Received: 09/25/2005] [Accepted: 11/30/2005] [Indexed: 11/20/2022]
Abstract
Strain MC-1 is a marine, microaerophilic, magnetite-producing, magnetotactic coccus phylogenetically affiliated with the alpha-Proteobacteria. Strain MC-1 grew chemolithotrophically with sulfide and thiosulfate as electron donors with HCO3-/CO2 as the sole carbon source. Experiments with cells grown microaerobically in liquid with thiosulfate and H14CO3-/14CO2 showed that all cell carbon was derived from H14CO3-/14CO2 and therefore that MC-1 is capable of chemolithoautotrophy. Cell extracts did not exhibit ribulose-1,5-bisphosphate carboxylase-oxygenase (RubisCO) activity, nor were RubisCO genes found in the draft genome of MC-1. Thus, unlike other chemolithoautotrophic, magnetotactic bacteria, strain MC-1 does not appear to utilize the Calvin-Benson-Bassham cycle for autotrophy. Cell extracts did not exhibit carbon monoxide dehydrogenase activity, indicating that the acetyl-coenzyme A pathway also does not function in strain MC-1. The 13C content of whole cells of MC-1 relative to the 13C content of the inorganic carbon source (Δδ13C) was -11.4‰. Cellular fatty acids showed enrichment of 13C relative to whole cells. Strain MC-1 cell extracts showed activities for several key enzymes of the reverse (reductive) tricarboxylic acid (rTCA) cycle including fumarate reductase, pyruvate:acceptor oxidoreductase and 2-oxoglutarate:acceptor oxidoreductase. Although ATP citrate lyase (another key enzyme of the rTCA cycle) activity was not detected in strain MC-1 using commonly used assays, cell extracts did cleave citrate, and the reaction was dependent upon the presence of ATP and coenzyme A. Thus, we infer the presence of an ATP-dependent citrate-cleaving mechanism. These results are consistent with the operation of the rTCA cycle in MC-1. Strain MC-1 appears to be the first known representative of the alpha-Proteobacteria to use the rTCA cycle for autotrophy.
Affiliation(s)
- Timothy J Williams
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, Iowa 50011, USA
107
Li H, Li J, Wong L. Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale. Bioinformatics 2006; 22:989-96. [PMID: 16446278 DOI: 10.1093/bioinformatics/btl020] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION Protein-protein interaction, mediated by protein interaction sites, is intrinsic to many functional processes in the cell. In this paper, we propose a novel method to discover patterns in protein interaction sites. We observed from protein interaction networks that there exist a kind of significant substructures called interacting protein group pairs, which exhibit an all-versus-all interaction between the two protein-sets in such a pair. The full-interaction between the pair indicates a common interaction mechanism shared by the proteins in the pair, which can be referred as an interaction type. Motif pairs at the interaction sites of the protein group pairs can be used to represent such interaction type, with each motif derived from the sequences of a protein group by standard motif discovery algorithms. The systematic discovery of all pairs of interacting protein groups from large protein interaction networks is a computationally challenging problem. By a careful and sophisticated problem transformation, the problem is solved using efficient algorithms for mining frequent patterns, a problem extensively studied in data mining. RESULTS We found 5349 pairs of interacting protein groups from a yeast interaction dataset. The expected value of sequence identity within the groups is only 7.48%, indicating non-homology within these protein groups. We derived 5343 motif pairs from these group pairs, represented in the form of blocks. Comparing our motifs with domains in the BLOCKS and PRINTS databases, we found that our blocks could be mapped to an average of 3.08 correlated blocks in these two databases. The mapped blocks occur 4221 out of total 6794 domains (protein groups) in these two databases. Comparing our motif pairs with iPfam consisting of 3045 interacting domain pairs derived from PDB, we found 47 matches occurring in 105 distinct PDB complexes. Comparing with another putative domain interaction database InterDom, we found 203 matches. 
AVAILABILITY http://research.i2r.a-star.edu.sg/BindingMotifPairs/resources. SUPPLEMENTARY INFORMATION http://research.i2r.a-star.edu.sg/BindingMotifPairs and Bioinformatics online.
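The "all-versus-all interaction" condition that defines an interacting protein group pair can be stated as a complete bipartite check on the interaction network. The sketch below shows only that check; the function name and toy protein identifiers are hypothetical, and it does not reproduce the frequent-pattern mining the authors use to enumerate such pairs efficiently.

```python
from itertools import product


def is_interacting_group_pair(left, right, interactions):
    """Return True if every protein in `left` interacts with every protein in `right`.

    `interactions` is an iterable of unordered protein pairs (u, v).
    """
    # Store edges as frozensets so that (u, v) and (v, u) are equivalent.
    edges = {frozenset(pair) for pair in interactions}
    return all(frozenset((u, v)) in edges for u, v in product(left, right))


if __name__ == "__main__":
    # Hypothetical toy network, not data from the paper.
    network = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("C", "X")]
    print(is_interacting_group_pair({"A", "B"}, {"X", "Y"}, network))
    print(is_interacting_group_pair({"A", "C"}, {"X", "Y"}, network))
```

In the toy network, groups {A, B} and {X, Y} satisfy the condition, whereas {A, C} and {X, Y} do not because C and Y have no recorded interaction.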
Affiliation(s)
- Haiquan Li
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore
108
Huang YM, Bystroff C. Improved pairwise alignments of proteins in the Twilight Zone using local structure predictions. Bioinformatics 2005; 22:413-22. [PMID: 16352653 DOI: 10.1093/bioinformatics/bti828] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.
Affiliation(s)
- Yao-Ming Huang
- Center for Bioinformatics, Department of Biology, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
109
Kunin V, Goldovsky L, Darzentas N, Ouzounis CA. The net of life: reconstructing the microbial phylogenetic network. Genome Res 2005; 15:954-9. [PMID: 15965028 PMCID: PMC1172039 DOI: 10.1101/gr.3666505] [Citation(s) in RCA: 164] [Impact Index Per Article: 8.2] [Indexed: 11/24/2022]
Abstract
It has previously been suggested that the phylogeny of microbial species might be better described as a network containing vertical and horizontal gene transfer (HGT) events. Yet, all phylogenetic reconstructions so far have presented microbial trees rather than networks. Here, we present a first attempt to reconstruct such an evolutionary network, which we term the "net of life". We use available tree reconstruction methods to infer vertical inheritance, and use an ancestral state inference algorithm to map HGT events on the tree. We also describe a weighting scheme used to estimate the number of genes exchanged between pairs of organisms. We demonstrate that vertical inheritance constitutes the bulk of gene transfer on the tree of life. We term the bulk of horizontal gene flow between tree nodes as "vines", and demonstrate that multiple but mostly tiny vines interconnect the tree. Our results strongly suggest that the HGT network is a scale-free graph, a finding with important implications for genome evolution. We propose that genes might propagate extremely rapidly across microbial species through the HGT network, using certain organisms as hubs.
Affiliation(s)
- Victor Kunin
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, United Kingdom
110
Wen ZN, Wang KL, Li ML, Nie FS, Yang Y. Analyzing functional similarity of protein sequences with discrete wavelet transform. Comput Biol Chem 2005; 29:220-8. [PMID: 15979042 DOI: 10.1016/j.compbiolchem.2005.04.007] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Received: 10/09/2004] [Accepted: 04/14/2005] [Indexed: 10/25/2022]
Abstract
This paper applies discrete wavelet transform (DWT) with various protein substitution models to find functional similarity of proteins with low identity. A new metric, 'S' function, based on the DWT is proposed to measure the pair-wise similarity. We also develop a segmentation technique, combined with DWT, to handle long protein sequences. The results are compared with those using the pair-wise alignment and PSI-BLAST.
Affiliation(s)
- Zhi-ning Wen
- College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
111
Haas BJ, Wortman JR, Ronning CM, Hannick LI, Smith RK, Maiti R, Chan AP, Yu C, Farzad M, Wu D, White O, Town CD. Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol 2005; 3:7. [PMID: 15784138 PMCID: PMC1082884 DOI: 10.1186/1741-7007-3-7] [Citation(s) in RCA: 111] [Impact Index Per Article: 5.6] [Received: 11/01/2004] [Accepted: 03/22/2005] [Indexed: 11/29/2022]
Abstract
Background Since the initial publication of its complete genome sequence, Arabidopsis thaliana has become more important than ever as a model for plant research. However, the initial genome annotation was submitted by multiple centers using inconsistent methods, making the data difficult to use for many applications. Results Over the course of three years, TIGR has completed its effort to standardize the structural and functional annotation of the Arabidopsis genome. Using both manual and automated methods, Arabidopsis gene structures were refined and gene products were renamed and assigned to Gene Ontology categories. We present an overview of the methods employed, tools developed, and protocols followed, summarizing the contents of each data release with special emphasis on our final annotation release (version 5). Conclusion Over the entire period, several thousand new genes and pseudogenes were added to the annotation. Approximately one third of the originally annotated gene models were significantly refined yielding improved gene structure annotations, and every protein-coding gene was manually inspected and classified using Gene Ontology terms.
Affiliation(s)
- Brian J Haas
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Jennifer R Wortman
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Catherine M Ronning
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Linda I Hannick
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Roger K Smith
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Rama Maiti
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Agnes P Chan
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Chunhui Yu
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Maryam Farzad
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Dongying Wu
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Owen White
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
- Christopher D Town
- The Institute for Genomic Research, 9172 Medical Center Drive, Rockville, Maryland, 20850, USA
112
A configuration space of homologous proteins conserving mutual information and allowing a phylogeny inference based on pair-wise Z-score probabilities. BMC Bioinformatics 2005; 6:49. [PMID: 15757521 PMCID: PMC555736 DOI: 10.1186/1471-2105-6-49] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Received: 11/25/2004] [Accepted: 03/10/2005] [Indexed: 11/15/2022]
Abstract
Background Popular methods to reconstruct molecular phylogenies are based on multiple sequence alignments, in which addition or removal of data may change the resulting tree topology. We have sought a representation of homologous proteins that would conserve the information of pair-wise sequence alignments, respect probabilistic properties of Z-scores (Monte Carlo methods applied to pair-wise comparisons) and be the basis for a novel method of consistent and stable phylogenetic reconstruction. Results We have built up a spatial representation of protein sequences using concepts from particle physics (configuration space) and respecting a frame of constraints deduced from pair-wise alignment score properties in information theory. The obtained configuration space of homologous proteins (CSHP) allows the representation of real and shuffled sequences, and thereupon an expression of the TULIP theorem for Z-score probabilities. Based on the CSHP, we propose a phylogeny reconstruction using Z-scores. Deduced trees, called TULIP trees, are consistent with multiple-alignment based trees. Furthermore, the TULIP tree reconstruction method provides a solution for some previously reported incongruent results, such as the apicomplexan enolase phylogeny. Conclusion The CSHP is a unified model that conserves mutual information between proteins in the way physical models conserve energy. Applications include the reconstruction of evolutionary consistent and robust trees, the topology of which is based on a spatial representation that is not reordered after addition or removal of sequences. The CSHP and its assigned phylogenetic topology, provide a powerful and easily updated representation for massive pair-wise genome comparisons based on Z-score computations.
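A pair-wise alignment Z-score of the kind referred to above compares the real alignment score with scores obtained after shuffling one sequence (a Monte Carlo null). The sketch below is a minimal illustration under stated assumptions: it uses a simple global alignment with a match/mismatch/linear-gap scheme rather than a substitution matrix, and the function names, parameters, and defaults are hypothetical, not taken from the paper.

```python
import random
import statistics


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Z = (real score - mean shuffled score) / std of shuffled scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    real = nw_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        null_scores.append(nw_score(a, "".join(letters)))
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return float("nan")  # Z is undefined when the null has no variance
    return (real - statistics.mean(null_scores)) / spread


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
    print(z_score(seq, seq))
```

Shuffling preserves amino acid composition and length, so the Z-score reflects the contribution of residue order to the alignment score rather than composition alone.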
113
Kunin V, Ahren D, Goldovsky L, Janssen P, Ouzounis CA. Measuring genome conservation across taxa: divided strains and united kingdoms. Nucleic Acids Res 2005; 33:616-21. [PMID: 15681613 PMCID: PMC548337 DOI: 10.1093/nar/gki181] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.6] [Indexed: 11/12/2022]
Abstract
Species evolutionary relationships have traditionally been defined by sequence similarities of phylogenetic marker molecules, recently followed by whole-genome phylogenies based on gene order, average ortholog similarity or gene content. Here, we introduce genome conservation--a novel metric of evolutionary distances between species that simultaneously takes into account, both gene content and sequence similarity at the whole-genome level. Genome conservation represents a robust distance measure, as demonstrated by accurate phylogenetic reconstructions. The genome conservation matrix for all presently sequenced organisms exhibits a remarkable ability to define evolutionary relationships across all taxonomic ranges. An assessment of taxonomic ranks with genome conservation shows that certain ranks are inadequately described and raises the possibility for a more precise and quantitative taxonomy in the future. All phylogenetic reconstructions are available at the genome phylogeny server: <http://maine.ebi.ac.uk:8000/cgi-bin/gps/GPS.pl>.
Affiliation(s)
- Paul Janssen
- Laboratory of Microbiology, Belgian Nuclear Research Centre SCK/CEN, Boeretang 200, B-2400-MOL, Belgium
- Christos A. Ouzounis
- To whom correspondence should be addressed. Tel: +44 1223 494653; Fax: +44 1223 494471;
114
Pirun M, Babnigg G, Stevens FJ. Template-based recognition of protein fold within the midnight and twilight zones of protein sequence similarity. J Mol Recognit 2005; 18:203-12. [PMID: 15540237 DOI: 10.1002/jmr.728] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
Most homologous pairs of proteins have no significant sequence similarity to each other and are not identified by direct sequence comparison or profile-based strategies. However, multiple sequence alignments of low similarity homologues typically reveal a limited number of positions that are well conserved despite diversity of function. It may be inferred that conservation at most of these positions is the result of the importance of the contribution of these amino acids to the folding and stability of the protein. As such, these amino acids and their relative positions may define a structural signature. We demonstrate that extraction of this fold template provides the basis for the sequence database to be searched for patterns consistent with the fold, enabling identification of homologs that are not recognized by global sequence analysis. The fold template method was developed to address the need for a tool that could comprehensively search the midnight and twilight zones of protein sequence similarity without reliance on global statistical significance. Manual implementations of the fold template method were performed on three folds--immunoglobulin, c-lectin and TIM barrel. Following proof of concept of the template method, an automated version of the approach was developed. This automated fold template method was used to develop fold templates for 10 of the more populated folds in the SCOP database. The fold template method developed three-dimensional structural motifs or signatures that were able to return a diverse collection of proteins, while maintaining a low false positive rate. Although the results of the manual fold template method were more comprehensive than the automated fold template method, the diversity of the results from the automated fold template method surpassed those of current methods that rely on statistical significance to infer evolutionary relationships among divergent proteins.
Affiliation(s)
- Mono Pirun
- Department of Bioengineering, University of Illinois at Chicago, 60607, USA
115
Stevens FJ. Efficient recognition of protein fold at low sequence identity by conservative application of Psi-BLAST: validation. J Mol Recognit 2005; 18:139-49. [PMID: 15558595 DOI: 10.1002/jmr.721] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.
Affiliation(s)
- F J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.

116
Hsieh MJ, Luo R. Physical scoring function based on AMBER force field and Poisson-Boltzmann implicit solvent for protein structure prediction. Proteins 2004; 56:475-86. [PMID: 15229881 DOI: 10.1002/prot.20133] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
Abstract
A well-behaved physics-based all-atom scoring function for protein structure prediction is analyzed with several widely used all-atom decoy sets. The scoring function, termed AMBER/Poisson-Boltzmann (PB), is based on a refined AMBER force field for intramolecular interactions and an efficient PB model for solvation interactions. Testing on the chosen decoy sets shows that the scoring function, which is designed to consider detailed chemical environments, is able to consistently discriminate all 62 native crystal structures after considering the heteroatom groups, disulfide bonds, and crystal packing effects that are not included in the decoy structures. When NMR structures are considered in the testing, the scoring function is able to discriminate 8 out of 10 targets. In the more challenging test of selecting near-native structures, the scoring function also performs very well: for the majority of the targets studied, the scoring function is able to select decoys that are close to the corresponding native structures as evaluated by ranking numbers and backbone Calpha root mean square deviations. Various important components of the scoring function are also studied to understand their discriminative contributions toward the rankings of native and near-native structures. It is found that neither the nonpolar solvation energy as modeled by the surface area model nor a higher protein dielectric constant improves its discriminative power. The terms remaining to be improved are related to 1-4 interactions. The most troublesome term is found to be the large and highly fluctuating 1-4 electrostatics term, not the dihedral-angle term. These data support ongoing efforts in the community to develop protein structure prediction methods with physics-based potentials that are competitive with knowledge-based potentials.
Affiliation(s)
- Meng-Juei Hsieh
- Department of Molecular Biology and Biochemistry, University of California, Irvine, California 92697-3900, USA

117
Hall BG. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol Biol Evol 2004; 22:792-802. [PMID: 15590907 DOI: 10.1093/molbev/msi066] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.4] [Indexed: 11/12/2022] Open
Abstract
A biologically realistic method was used to simulate evolutionary trees. The method uses a real DNA coding sequence as the starting point, simulates mutation according to the mutational spectrum of Escherichia coli (including base substitutions, insertions, and deletions), and separates the processes of mutation and selection. Trees of 8, 16, 32, and 64 taxa were simulated with average branch lengths of 50, 100, 150, 200, and 250 changes per branch. The resulting sequences were aligned with ClustalX, and trees were estimated by Neighbor Joining, Parsimony, Maximum Likelihood, and Bayesian methods from both DNA sequences and the corresponding protein sequences. The estimated trees were compared with the true trees, and both topological and branch length accuracies were scored. Over the variety of conditions tested, Bayesian trees estimated from DNA sequences that had been aligned according to the alignment of the corresponding protein sequences were the most accurate, followed by Maximum Likelihood trees estimated from DNA sequences and Parsimony trees estimated from protein sequences.
Affiliation(s)
- Barry G Hall
- Biology Department, University of Rochester, USA.

118
Abstract
The type III secretion system (TTSS) of gram-negative bacteria is responsible for delivering bacterial proteins, termed effectors, from the bacterial cytosol directly into the interior of host cells. The TTSS is expressed predominantly by pathogenic bacteria and is usually used to introduce deleterious effectors into host cells. While biochemical activities of effectors vary widely, the TTSS apparatus used to deliver these effectors is conserved and shows functional complementarity for secretion and translocation. This review focuses on proteins that constitute the TTSS apparatus and on mechanisms that guide effectors to the TTSS apparatus for transport. The TTSS apparatus includes predicted integral inner membrane proteins that are conserved widely across TTSSs and in the basal body of the bacterial flagellum. It also includes proteins that are specific to the TTSS and contribute to ring-like structures in the inner membrane and includes secretin family members that form ring-like structures in the outer membrane. Most prominently situated on these coaxial, membrane-embedded rings is a needle-like or pilus-like structure that is implicated as a conduit for effector translocation into host cells. A short region of mRNA sequence or protein sequence in effectors acts as a signal sequence, directing proteins for transport through the TTSS. Additionally, a number of effectors require the action of specific TTSS chaperones for efficient and physiologically meaningful translocation into host cells. Numerous models explaining how effectors are transported into host cells have been proposed, but understanding of this process is incomplete and this topic remains an active area of inquiry.
Affiliation(s)
- Partho Ghosh
- Department of Chemistry & Biochemistry, University of California-San Diego, La Jolla, CA 92093-0314, USA.

119
Abstract
MOTIVATION Protein homology detection and sequence alignment are at the basis of protein structure prediction, function prediction and evolution. RESULTS We have generalized the alignment of protein sequences with a profile hidden Markov model (HMM) to the case of pairwise alignment of profile HMMs. We present a method for detecting distant homologous relationships between proteins based on this approach. The method (HHsearch) is benchmarked together with BLAST, PSI-BLAST, HMMER and the profile-profile comparison tools PROF_SIM and COMPASS, in an all-against-all comparison of a database of 3691 protein domains from SCOP 1.63 with pairwise sequence identities below 20%. Sensitivity: When the predicted secondary structure is included in the HMMs, HHsearch is able to detect between 2.7 and 4.2 times more homologs than PSI-BLAST or HMMER and between 1.44 and 1.9 times more than COMPASS or PROF_SIM for a rate of false positives of 10%. Approximately half of the improvement over the profile-profile comparison methods is attributable to the use of profile HMMs in place of simple profiles. Alignment quality: Higher sensitivity is mirrored by an increased alignment quality. HHsearch produced 1.2, 1.7 and 3.3 times more good alignments ('balanced' score >0.3) than the next best method (COMPASS), and 1.6, 2.9 and 9.4 times more than PSI-BLAST, at the family, superfamily and fold level, respectively. Speed: HHsearch scans a query of 200 residues against 3691 domains in 33 s on an AMD64 2GHz PC. This is 10 times faster than PROF_SIM and 17 times faster than COMPASS.
Affiliation(s)
- Johannes Söding
- Department of Protein Evolution, Max-Planck-Institute for Developmental Biology Spemannstrasse 35, D-72076 Tübingen, Germany.

120
Ouyang Z, Zhu H, Wang J, She ZS. Multivariate entropy distance method for prokaryotic gene identification. J Bioinform Comput Biol 2004; 2:353-73. [PMID: 15297987 DOI: 10.1142/s0219720004000624] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Revised: 07/10/2003] [Indexed: 11/18/2022]
Abstract
A new simple method is found for efficient and accurate identification of coding sequences in prokaryotic genome. The method employs a Shannon description of artificial language for DNA sequences. It consists in translating a DNA sequence into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. With an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space depending on its nature. It is found that the ratio of the relative distance to an averaged coding and non-coding EDP over a small number (up to one) of open reading frames (ORFs) can serve as a good coding potential. An iterative algorithm is designed for finding a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it has a feature to combine the use of a coding potential and an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. It is demonstrated to be able to detect 95-99% genes with 10-30% of additional genes when tested against the RefSeq database of NCBI and to detect 97.5-99.8% of confirmed genes with known functions. It is also shown to be able to find a set of (functionally known) genes that are missed by other well-known gene finding algorithms. All measurements show that the MED algorithm reaches a similar performance level as the algorithms like GeneMark and Glimmer for prokaryotic gene prediction.
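The mapping from a sequence to a point in 20-dimensional space described above can be sketched as follows. This is a minimal illustration, not the authors' MED implementation. It assumes that the entropy density profile has components s_i = -(p_i log p_i)/H, where p_i is the frequency of amino acid i and H is the Shannon entropy of the sequence, and it takes an already translated amino-acid string as input. The function names `entropy_density_profile` and `edp_distance` are hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids ("fundamental words"), in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def entropy_density_profile(protein: str) -> list[float]:
    """Map an amino-acid string to a 20-component vector.

    Assumed definition: s_i = -(p_i * log p_i) / H, with H the Shannon
    entropy of the residue frequencies. Non-standard symbols are ignored.
    """
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    freqs = [counts[aa] / total for aa in AMINO_ACIDS]
    terms = [-p * math.log(p) if p > 0 else 0.0 for p in freqs]
    entropy = sum(terms)
    if entropy == 0:
        # Only one residue type is present, so the profile is undefined;
        # returning zeros is a choice made for this sketch.
        return [0.0] * len(AMINO_ACIDS)
    return [t / entropy for t in terms]


def edp_distance(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two profiles (an illustrative choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, a coding potential in the spirit of the abstract would compare `edp_distance` from an ORF's profile to an averaged coding profile against its distance to an averaged non-coding profile.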
Affiliation(s)
- Zhengqing Ouyang
- State Key Lab for Turbulence and Complex Systems and Center for Theoretical Biology, Peking University, Beijing 100871, China

121
Bazylinski DA, Dean AJ, Williams TJ, Long LK, Middleton SL, Dubbels BL. Chemolithoautotrophy in the marine, magnetotactic bacterial strains MV-1 and MV-2. Arch Microbiol 2004; 182:373-87. [PMID: 15338111 DOI: 10.1007/s00203-004-0716-y] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.1] [Received: 04/29/2004] [Revised: 06/14/2004] [Accepted: 07/19/2004] [Indexed: 11/28/2022]
Abstract
Magnetite-producing magnetotactic bacteria collected from the oxic-anoxic transition zone of chemically stratified marine environments characterized by O2/H2S inverse double gradients, contained internal S-rich inclusions resembling elemental S globules, suggesting they oxidize reduced S compounds that could support autotrophy. Two strains of marine magnetotactic bacteria, MV-1 and MV-2, isolated from such sites grew in O2-gradient media with H2S or thiosulfate (S2O3(2-)) as electron sources and O2 as electron acceptor or anaerobically with S2O3(2-) and N2O as electron acceptor, with bicarbonate (HCO3-)/CO2 as sole C source. Cells grown with H2S contained S-rich inclusions. Cells oxidized S2O3(2-) to sulfate (SO4(2-)). Both strains grew microaerobically with formate. Neither grew microaerobically with tetrathionate (S4O6(2-)), methanol, or Fe2+ as FeS, or siderite (FeCO3). Growth with S2O3(2-) and radiolabeled 14C-HCO3- showed that cell C was derived from HCO3-/CO2. Cell-free extracts showed ribulose 1,5-bisphosphate carboxylase/oxygenase (RubisCO) activity. Southern blot analyses indicated the presence of a form II RubisCO (cbbM) but no form I (cbbL) in both strains. cbbM and cbbQ, a putative post-translational activator of RubisCO, were identified in MV-1. MV-1 and MV-2 are thus chemolithoautotrophs that use the Calvin-Benson-Bassham pathway. cbbM was also identified in Magnetospirillum magnetotacticum. Thus, magnetotactic bacteria at the oxic-anoxic transition zone of chemically stratified aquatic environments are important in C cycling and primary productivity.
Affiliation(s)
- Dennis A Bazylinski
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011, USA.

122
Bhaduri A, Pugalenthi G, Gupta N, Sowdhamini R. iMOT: an interactive package for the selection of spatially interacting motifs. Nucleic Acids Res 2004; 32:W602-5. [PMID: 15215459 PMCID: PMC441513 DOI: 10.1093/nar/gkh375] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
Functional selection and three-dimensional structural constraints of proteins relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor overall sequence identity and evolutionary pressures. We report the availability of 'iMOT' (interacting MOTif) server, an interactive package for the automatic identification of spatially interacting motifs among distantly related proteins sharing similar folds and possessing common ancestral lineage. Spatial interactions between conserved stretches of a protein are evaluated by calculations of pseudo-potentials that describe the strength of interactions. Such an evaluation permits the automatic identification of highly interacting conserved regions of a protein. Interacting motifs have been shown to be useful in searching for distant homologues and establishing remote homologies among the largely unassigned sequences in genome databases. Information on such motifs should also be of value in protein folding, modelling and engineering experiments. The iMOT server can be accessed from http://www.ncbs.res.in/~faculty/mini/imot/iMOTserver.html. Supplementary Material can be accessed from: http://www.ncbs.res.in/~faculty/mini/imot/supplementary.html.
Affiliation(s)
- A Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK Campus, Bellary Road, Bangalore 560 065, Karnataka, India

123
Sadreyev RI, Grishin NV. Estimates of statistical significance for comparison of individual positions in multiple sequence alignments. BMC Bioinformatics 2004; 5:106. [PMID: 15296518 PMCID: PMC516024 DOI: 10.1186/1471-2105-5-106] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 04/01/2004] [Accepted: 08/05/2004] [Indexed: 11/17/2022] Open
Abstract
Background Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities between (1) MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families. Results For problems (1) and (2), we propose analytical estimates of P-value and apply them to the detection of significant positional dissimilarities in various experimental situations. (a) We compare structure-based predictions of residue propensities at a protein position to the actual residue frequencies in the MSA of homologs. (b) We evaluate our method by the ability to detect erroneous position matches produced by an automatic sequence aligner. (c) We compare MSA positions that correspond to residues aligned by automatic structure aligners. (d) We compare MSA positions that are aligned by high-quality manual superposition of structures. Detected dissimilarities reveal shortcomings of the automatic methods for residue frequency prediction and alignment construction. For the high-quality structural alignments, the dissimilarities suggest sites of potential functional or structural importance. Conclusion The proposed computational method is of significant potential value for the analysis of protein families.
Affiliation(s)
- Ruslan I Sadreyev
- Howard Hughes Medical Institute, and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Nick V Grishin
- Howard Hughes Medical Institute, and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA

124

125
Newlove T, Konieczka JH, Cordes MHJ. Secondary Structure Switching in Cro Protein Evolution. Structure 2004; 12:569-81. [PMID: 15062080 DOI: 10.1016/j.str.2004.02.024] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.7] [Received: 09/29/2003] [Revised: 01/05/2004] [Accepted: 01/05/2004] [Indexed: 11/28/2022]
Abstract
We report the solution structure of the Cro protein from bacteriophage P22. Comparisons of its sequence and structure to those of lambda Cro strongly suggest an alpha-to-beta secondary structure switching event during Cro evolution. The folds of P22 Cro and lambda Cro share a three alpha helix fragment comprising the N-terminal half of the domain. However, P22 Cro's C terminus folds as two helices, while lambda Cro's folds as a beta hairpin. The all-alpha fold found for P22 Cro appears to be ancestral, since it also occurs in cI proteins, which are anciently duplicated paralogues of Cro. PSI-BLAST and transitive homology analyses strongly suggest that the sequences of P22 Cro and lambda Cro are globally homologous despite encoding different folds. The alpha+beta fold of lambda Cro therefore likely evolved from its all-alpha ancestor by homologous secondary structure switching, rather than by nonhomologous replacement of both sequence and structure.
Affiliation(s)
- Tracey Newlove
- Department of Biochemistry and Molecular Biophysics, University of Arizona, Tucson, AZ 85701 USA

126
Pandit SB, Bhadra R, Gowri VS, Balaji S, Anand B, Srinivasan N. SUPFAM: a database of sequence superfamilies of protein domains. BMC Bioinformatics 2004; 5:28. [PMID: 15113407 PMCID: PMC394316 DOI: 10.1186/1471-2105-5-28] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.4] [Received: 12/30/2003] [Accepted: 03/15/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND SUPFAM database is a compilation of superfamily relationships between protein domain families of either known or unknown 3-D structure. In SUPFAM, sequence families from Pfam and structural families from SCOP are associated, using profile matching, to result in sequence superfamilies of known structure. Subsequently all-against-all family profile matches are made to deduce a list of new potential superfamilies of yet unknown structure. DESCRIPTION The current version of SUPFAM (release 1.4) corresponds to significant enhancements and major developments compared to the earlier and basic version. In the present version we have used RPS-BLAST, which is robust and sensitive, for profile matching. The reliability of connections between protein families is ensured better than before by use of benchmarked criteria involving strict e-value cut-off and a minimal alignment length condition. An e-value based indication of reliability of connections is now presented in the database. Web access to a RPS-BLAST-based tool to associate a query sequence to one of the family profiles in SUPFAM is available with the current release. In terms of the scientific content the present release of SUPFAM is entirely reorganized with the use of 6190 Pfam families and 2317 structural families derived from SCOP. Due to a steep increase in the number of sequence and structural families used in SUPFAM the details of scientific content in the present release are almost entirely complementary to previous basic version. Of the 2286 families, we could relate 245 Pfam families with apparently no structural information to families of known 3-D structures, thus resulting in the identification of new families in the existing superfamilies. Using the profiles of 3904 Pfam families of yet unknown structure, an all-against-all comparison involving sequence-profile match resulted in clustering of 96 Pfam families into 39 new potential superfamilies. 
CONCLUSION SUPFAM presents many non-trivial superfamily relationships of sequence families involved in a variety of functions and hence the information content is of interest to a wide scientific community. The grouping of related proteins without a known structure in SUPFAM is useful in identifying priority targets for structural genomics initiatives and in the assignment of putative functions. Database URL: http://pauling.mbu.iisc.ernet.in/~supfam.
Affiliation(s)
- Shashi B Pandit
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- Rana Bhadra
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- VS Gowri
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- S Balaji
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- B Anand
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK campus, Bangalore 560 065, India
- Present address: Department of Biosciences and Bioengineering, Indian Institute of Technology, Kanpur, Kanpur – 208 016, India
- N Srinivasan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India

127
Sunyaev SR, Bogopolsky GA, Oleynikova NV, Vlasov PK, Finkelstein AV, Roytberg MA. From analysis of protein structural alignments toward a novel approach to align protein sequences. Proteins 2003; 54:569-82. [PMID: 14748004 DOI: 10.1002/prot.10503] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
Abstract
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D) structure. We investigated correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pair-wise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, "islands" between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive score, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas resulting alignments are on average of the same quality with respect to the gold standard. This finding shows that the decrease of alignment accuracy is not necessarily a price for the computational efficiency.
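For orientation, the Smith-Waterman baseline referred to above can be sketched as a short dynamic program. This is a minimal illustration that returns only the best local alignment score. It assumes a flat match/mismatch score and a linear gap penalty with illustrative values, whereas protein alignment in practice typically uses a substitution matrix and affine gap penalties. It is not the hierarchical algorithm proposed by the authors, and the function name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback).

    Scoring values are illustrative assumptions, not those of the cited study.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the matrix is filled regardless of how promising the region is, which is the cost the abstract describes as unnecessary for segments whose scores are negative or only weakly positive.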
Affiliation(s)
- Shamil R Sunyaev
- Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia

128
Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res 2003; 31:4632-8. [PMID: 12888524 PMCID: PMC169885 DOI: 10.1093/nar/gkg495] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.5] [Indexed: 11/14/2022] Open
Abstract
Accurate detection of protein families allows assignment of protein function and the analysis of functional diversity in complete genomes. Recently, we presented a novel algorithm called TribeMCL for the detection of protein families that is both accurate and efficient. This method allows family analysis to be carried out on a very large scale. Using TribeMCL, we have generated a resource called TRIBES that contains protein family information, comprising annotations, protein sequence alignments and phylogenetic distributions describing 311 257 proteins from 83 completely sequenced genomes. The analysis of at least 60 934 detected protein families reveals that, with the essential families excluded, paralogy levels are similar between prokaryotes, irrespective of genome size. The number of essential families is estimated to be between 366 and 426. We also show that the currently known space of protein families is scale free and discuss the implications of this distribution. In addition, we show that smaller families are often formed by shorter proteins and discuss the reasons for this intriguing pattern. Finally, we analyse the functional diversity of protein families in entire genome sequences. The TRIBES protein family resource is accessible at http://www.ebi.ac.uk/research/cgg/tribes/.
Affiliation(s)
- Anton J Enright
- Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK

129
Hung LH, Samudrala R. PROTINFO: Secondary and tertiary protein structure prediction. Nucleic Acids Res 2003; 31:3296-9. [PMID: 12824311 PMCID: PMC168948 DOI: 10.1093/nar/gkg541] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.5] [Received: 02/14/2003] [Revised: 03/31/2003] [Accepted: 03/31/2003] [Indexed: 11/14/2022] Open
Abstract
Information about the secondary and tertiary structure of a protein sequence can greatly assist biologists in the generation and testing of hypotheses, as well as design of experiments. The PROTINFO server enables users to submit a protein sequence and request a prediction of the three-dimensional (tertiary) structure based on comparative modeling, fold generation and de novo methods developed by the authors. In addition, users can submit NMR chemical shift data and request protein secondary structure assignment that is based on using neural networks to combine the chemical shifts with secondary structure predictions. The server is available at http://protinfo.compbio.washington.edu.
Affiliation(s)
- Ling-Hong Hung
- Computational Genomics Group, Department of Microbiology, University of Washington School of Medicine, Seattle, WA 98195, USA

130
Simmons MP, Freudenstein JV. The effects of increasing genetic distance on alignment of, and tree construction from, rDNA internal transcribed spacer sequences. Mol Phylogenet Evol 2003; 26:444-51. [PMID: 12644403 DOI: 10.1016/s1055-7903(02)00366-4] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
Abstract
We examined how alignment of internal transcribed spacers of rDNA in fungi and plants changes with increasing genetic distance by successive removal of sequences from each data set followed by realignment and phylogenetic analysis. Increasing genetic distance can negatively affect phylogenetic reconstruction in two ways. First, it may cause errors in the alignment and therefore the homology hypotheses of the sequence characters. Second, it may cause errors in the homology assessments of character states because of multiple hits on individual branches. These two causes of error in phylogenetic inference were distinguished from one another in our analysis. The errors in alignment caused by increasing genetic distance were primarily due to inserting too few gaps and inserting gaps at the wrong positions. Errors in tree resolution, topology, and/or branch-support values were more often caused by multiple hits than by misaligned positions. This suggests that increasing genetic distance negatively affects our primary homology assessments of character states more severely than our primary homology assessments of characters. We suggest that increasing taxon sampling with the aim of subdividing long branches is a strategy for obtaining reliable alignments.
Affiliation(s)
- Mark P Simmons
- The Ohio State University Herbarium, 1315 Kinnear Road, Columbus, OH 43212, USA.

131
Sadreyev R, Grishin N. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003; 326:317-36. [PMID: 12547212 DOI: 10.1016/s0022-2836(02)01371-2] [Citation(s) in RCA: 198] [Impact Index Per Article: 9.0] [Indexed: 11/21/2022]
Abstract
We present a novel method for the comparison of multiple protein alignments with assessment of statistical significance (COMPASS). The method derives numerical profiles from alignments, constructs optimal local profile-profile alignments and analytically estimates E-values for the detected similarities. The scoring system and E-value calculation are based on a generalization of the PSI-BLAST approach to profile-sequence comparison, which is adapted for the profile-profile case. Tested along with existing methods for profile-sequence (PSI-BLAST) and profile-profile (prof_sim) comparison, COMPASS shows increased abilities for sensitive and selective detection of remote sequence similarities, as well as improved quality of local alignments. The method allows prediction of relationships between protein families in the PFAM database beyond the range of conventional methods. Two predicted relations with high significance are similarities between various Rossmann-type folds and between various helix-turn-helix-containing families. The potential value of COMPASS for structure/function predictions is illustrated by the detection of an intricate homology between the DNA-binding domain of the CTF/NFI family and the MH1 domain of the Smad family.
Affiliation(s)
- Ruslan Sadreyev
- Howard Hughes Medical Institute, and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA

132
Edwards YJK, Cottage A. Bioinformatics methods to predict protein structure and function. A practical approach. Mol Biotechnol 2003; 23:139-66. [PMID: 12632698 DOI: 10.1385/mb:23:2:139] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/11/2022]
Abstract
Protein structure prediction by using bioinformatics can involve sequence similarity searches, multiple sequence alignments, identification and characterization of domains, secondary structure prediction, solvent accessibility prediction, automatic protein fold recognition, constructing three-dimensional models to atomic detail, and model validation. Not all protein structure prediction projects involve the use of all these techniques. A central part of a typical protein structure prediction is the identification of a suitable structural target from which to extrapolate three-dimensional information for a query sequence. The way in which this is done defines three types of projects. The first involves the use of standard and well-understood techniques. If a structural template remains elusive, a second approach using nontrivial methods is required. If a target fold cannot be reliably identified because inconsistent results have been obtained from nontrivial data analyses, the project falls into the third type of project and will be virtually impossible to complete with any degree of reliability. In this article, a set of protocols to predict protein structure from sequence is presented and distinctions among the three types of project are given. These methods, if used appropriately, can provide valuable indicators of protein structure and function.
Affiliation(s)
- Yvonne J K Edwards
- Research Division, UK Human Genome Mapping Project Resource Center, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10, 1SB, England, UK.
|
133
|
Abstract
It is commonly believed that similarities between the sequences of two proteins infer similarities between their structures. Sequence alignments reliably recognize pairs of protein of similar structures provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30% and little is known then about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.
Affiliation(s)
- Patrice Koehl
- Department of Structural Biology, Fairchild Building, Stanford University, Stanford, CA 94305, USA.
|
134
|
Samudrala R, Levitt M. A comprehensive analysis of 40 blind protein structure predictions. BMC STRUCTURAL BIOLOGY 2002; 2:3. [PMID: 12150712 PMCID: PMC122083 DOI: 10.1186/1472-6807-2-3] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.8] [Received: 04/09/2002] [Accepted: 08/01/2002] [Indexed: 11/21/2022]
Abstract
BACKGROUND We thoroughly analyse the results of 40 blind predictions for which an experimental answer was made available at the fourth meeting on the critical assessment of protein structure methods (CASP4). Using our comparative modelling and fold recognition methodologies, we made 29 predictions for targets that had sequence identities ranging from 50% to 10% to the nearest related protein with known structure. Using our ab initio methodologies, we made eleven predictions for targets that had no detectable sequence relationships. RESULTS For 23 of these proteins, we produced models ranging from 1.0 to 6.0 Å root mean square deviation (RMSD) for the Cα atoms between the model and the corresponding experimental structure for all or large parts of the protein, with model accuracies scaling fairly linearly with respect to sequence identity (i.e., the higher the sequence identity, the better the prediction). We produced nine models with accuracies ranging from 4.0 to 6.0 Å Cα RMSD for 60-100 residue proteins (or large fragments of a protein), with a prediction accuracy of 4.0 Å Cα RMSD for residues 1-80 for T110/rbfa. CONCLUSIONS The areas of protein structure prediction that work well, and areas that need improvement, are discernable by examining how our methods have performed over the past four CASP experiments. These results have implications for modelling the structure of all tractable proteins encoded by the genome of an organism.
Affiliation(s)
- Ram Samudrala
- Department of Microbiology, University of Washington, School of Medicine, Seattle, WA 98195, USA
- Michael Levitt
- Department of Structural Biology, Stanford University, School of Medicine, Stanford, CA 94305, USA
|
135
|
de Trad CH, Fang Q, Cosic I. Protein sequence comparison based on the wavelet transform approach. Protein Eng Des Sel 2002; 15:193-203. [PMID: 11932490 DOI: 10.1093/protein/15.3.193] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.4] [Indexed: 11/14/2022] Open
Abstract
A protein's chemical properties, the chain conformation, the function of the protein and its species specificity are determined by the information contained in the amino acid sequence. Proteins of similar functions have at some level sequential identical amino acid sequences. The closer the phylogenetic relationship, the more similar are the sequences. To find the similarities between two or more protein sequences is of great importance for protein sequence analysis. The differences in the amino acid sequences permit the construction of a family tree of evolution. In this work, a comparison method was devised that is capable of analysing a protein sequence 'hierarchically', i.e. it can examine a protein sequence at different spatial resolutions. Based on a wavelet decomposition of protein sequences and a cross-correlation study, a sequence-scale similarity concept is proposed for generating a similarity vector, which renders the comparison of two sequences feasible at different spatial resolutions (scales). This new similarity concept is an expansion of the conventional sequence similarity, which only takes into account the local pairwise amino acid match and ignores the information contained in coarser spatial resolutions.
Affiliation(s)
- Chafia Hejase de Trad
- BioElectronics Group, Department of Electrical and Computer Systems Engineering, PO Box 35, Monash University, VIC 3800, Australia
|
136
|
Campos F, Richardson M. The complete amino acid sequence of the α-amylase inhibitor I-2 from seeds of ragi (Indian finger millet, Eleusine coracana Gaertn.). FEBS Lett 2001. [DOI: 10.1016/0014-5793(84)80130-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 1.8] [Indexed: 10/18/2022]
|
137
|
Abstract
Typically, protein spatial structures are more conserved in evolution than amino acid sequences. However, the recent explosion of sequence and structure information accompanied by the development of powerful computational methods led to the accumulation of examples of homologous proteins with globally distinct structures. Significant sequence conservation, local structural resemblance, and functional similarity strongly indicate evolutionary relationships between these proteins despite pronounced structural differences at the fold level. Several mechanisms such as insertions/deletions/substitutions, circular permutations, and rearrangements in beta-sheet topologies account for the majority of detected structural irregularities. The existence of evolutionarily related proteins that possess different folds brings new challenges to the homology modeling techniques and the structure classification strategies and offers new opportunities for protein design in experimental studies.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute, Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, Texas 75390-9050, USA
|
138
|
Balaji S, Srinivasan N. Use of a database of structural alignments and phylogenetic trees in investigating the relationship between sequence and structural variability among homologous proteins. PROTEIN ENGINEERING 2001; 14:219-26. [PMID: 11391013 DOI: 10.1093/protein/14.4.219] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
The database PALI (Phylogeny and ALIgnment of homologous protein structures) consists of families of protein domains of known three-dimensional (3D) structure. In a PALI family, every member has been structurally aligned with every other member (pairwise) and also simultaneous superposition (multiple) of all the members has been performed. The database also contains 3D structure-based and structure-dependent sequence similarity-based phylogenetic dendrograms for all the families. The PALI release used in the present analysis comprises 225 families derived largely from the HOMSTRAD and SCOP databases. The quality of the multiple rigid-body structural alignments in PALI was compared with that obtained from COMPARER, which encodes a procedure based on properties and relationships. The alignments from the two procedures agreed very well and variations are seen only in the low sequence similarity cases often in the loop regions. A validation of Direct Pairwise Alignment (DPA) between two proteins is provided by comparing it with Pairwise alignment extracted from Multiple Alignment of all the members in the family (PMA). In general, DPA and PMA are found to vary rarely. The ready availability of pairwise alignments allows the analysis of variations in structural distances as a function of sequence similarities and number of topologically equivalent Cα atoms. The structural distance metric used in the analysis combines root mean square deviation (r.m.s.d.) and number of equivalences, and is shown to vary similarly to r.m.s.d. The correlation between sequence similarity and structural similarity is poor in pairs with low sequence similarities. A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms. The difference could occur when the sequence similarity among the homologues is low or when the structures are subjected to evolutionary pressure for the retention of function. The PALI database is expected to be useful in furthering our understanding of the relationship between sequences and structures of homologous proteins and their evolution.
Affiliation(s)
- S Balaji
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
|
139
|
Reddy BV, Li WW, Shindyalov IN, Bourne PE. Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins 2001. [DOI: 10.1002/1097-0134(20010201)42:2<148::aid-prot20>3.0.co;2-r] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/05/2022]
|
140
|
Reddy BV, Li WW, Shindyalov IN, Bourne PE. Conserved key amino acid positions (CKAAPs) derived from the analysis of common substructures in proteins. Proteins 2001; 42:148-63. [PMID: 11119639 DOI: 10.1002/1097-0134(20010201)42:2<148::aid-prot20>3.0.co;2-r] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
An all-against-all protein structure comparison using the Combinatorial Extension (CE) algorithm applied to a representative set of PDB structures revealed a gallery of common substructures in proteins (http://cl.sdsc.edu/ce.html). These substructures represent commonly identified folds, domains, or components thereof. Most of the subsequences forming these similar substructures have no significant sequence similarity. We present a method to identify conserved amino acid positions and residue-dependent property clusters within these subsequences starting with structure alignments. Each of the subsequences is aligned to its homologues in SWALL, a nonredundant protein sequence database. The most similar sequences are purged into a common frequency matrix, and weighted homologues of each one of the subsequences are used in scoring for conserved key amino acid positions (CKAAPs). We have set the top 20% of the high-scoring positions in each substructure to be CKAAPs. It is hypothesized that CKAAPs may be responsible for the common folding patterns in either a local or global view of the protein-folding pathway. Where a significant number of structures exist, CKAAPs have also been identified in structure alignments of complete polypeptide chains from the same protein family or superfamily. Evidence to support the presence of CKAAPs comes from other computational approaches and experimental studies of mutation and protein-folding experiments, notably the Paracelsus challenge. Finally, the structural environment of CKAAPs versus non-CKAAPs is examined for solvent accessibility, hydrogen bonding, and secondary structure. The identification of CKAAPs has important implications for protein engineering, fold recognition, modeling, and structure prediction studies and is dependent on the availability of structures and an accurate structure alignment methodology. Proteins 2001;42:148-163.
Affiliation(s)
- B V Reddy
- San Diego Supercomputer Center, University of California, San Diego, La Jolla, California 92093-0505, USA
|
141
|
Grishin NV. KH domain: one motif, two folds. Nucleic Acids Res 2001; 29:638-43. [PMID: 11160884 PMCID: PMC30387 DOI: 10.1093/nar/29.3.638] [Citation(s) in RCA: 241] [Impact Index Per Article: 10.0] [Received: 10/16/2000] [Revised: 12/01/2000] [Accepted: 12/01/2000] [Indexed: 11/14/2022] Open
Abstract
The K homology (KH) module is a widespread RNA-binding motif that has been detected by sequence similarity searches in such proteins as heterogeneous nuclear ribonucleoprotein K (hnRNP K) and ribosomal protein S3. Analysis of spatial structures of KH domains in hnRNP K and S3 reveals that they are topologically dissimilar and thus belong to different protein folds. Thus KH motif proteins provide a rare example of protein domains that share significant sequence similarity in the motif regions but possess globally distinct structures. The two distinct topologies might have arisen from an ancestral KH motif protein by N- and C-terminal extensions, or one of the existing topologies may have evolved from the other by extension, displacement and deletion. C-terminal extension (deletion) requires β-sheet rearrangement through the insertion (removal) of a β-strand in a manner similar to that observed in serine protease inhibitors serpins. Current analysis offers a new look on how proteins can change fold in the course of evolution.
Affiliation(s)
- N V Grishin
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA.
|
142
|
Balaji S, Sujatha S, Kumar SS, Srinivasan N. PALI-a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res 2001; 29:61-5. [PMID: 11125050 PMCID: PMC29825 DOI: 10.1093/nar/29.1.61] [Citation(s) in RCA: 65] [Impact Index Per Article: 2.7] [Received: 08/23/2000] [Revised: 10/25/2000] [Accepted: 10/25/2000] [Indexed: 11/13/2022] Open
Abstract
PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous super-position (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 'orphans' (single member families). Using the web interface involving PSI_BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet.in/~pali.
Affiliation(s)
- S Balaji
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560 012, India
|
143
|
Abstract
Success in the protein structure prediction problem relies heavily on the choice of an appropriate potential function. One approach toward extracting these potentials from a database of known protein structures is to maximize the Z-score of the database proteins, which represents the ability of the potential to discriminate correct from random conformations. These optimization methods model the entire distribution of alternative structures, reducing their ability to concentrate on the lowest energy structures most competitive with the native state and resulting in an unfortunate tendency to underestimate the repulsive interactions. This leads to reduced accuracy and predictive ability. Using a lattice model, we demonstrate how we can weight the distribution to suppress the contributions of the high-energy conformations to the Z-score calculation. The result is a potential that is more accurate and more likely to yield correct predictions than other Z-score optimization methods as well as potentials of mean force.
Affiliation(s)
- T L Chiu
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109-1055, USA
|
144
|
Abstract
An analysis of amino acid composition of small, naturally occurring peptides ranging in size from 3 to 50 residues has been carried out. The purpose of the study is to determine whether differential trends in amino acid usage exist for small peptides compared to larger polypeptides and proteins. Results indicate that Cys, Trp, and Phe are substantially more frequent in peptides compared to their abundance in proteins at large. Aliphatic hydrophobic residues, particularly Leu and Ile, are somewhat underrepresented, while the frequency of Glu is significantly reduced. The shorter peptides are also more frequently neutral and become increasingly charged as their size increases.
Affiliation(s)
- H O Villar
- Telik, Inc., 750 Gateway Blvd., South San Francisco, CA 94080, USA.
|
145
|
Thomas MC, García-Pérez JL, Alonso C, López MC. Molecular characterization of KMP11 from Trypanosoma cruzi: a cytoskeleton-associated protein regulated at the translational level. DNA Cell Biol 2000; 19:47-57. [PMID: 10668791 DOI: 10.1089/104454900314708] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022] Open
Abstract
Kinetoplastid membrane protein-11 (KMP11) is present in a wide range of trypanosomatids. In the present paper, we show that the T. cruzi KMP11 gene is organized in a cluster formed by four gene units arranged in a head-to-tail tandem manner located on a chromosome of about 1900 kb. Northern blot analyses indicated that the steady-state level of mature KMP11 transcripts of 0.52 kb is high and similar in the three forms of the parasite. The KMP11 mRNAs have a half-life of about 16 h whose steady-state level is strongly downregulated when the parasites reach the stationary growth phase. The T. cruzi KMP11 sequence presents a significant homology with the amino-terminal third of the cytoskeleton-associated protein CIP1 from Arabidopsis thaliana. Western blot and immunoelectron microscopy studies showed that KMP11 is present in the cytoskeleton structure. Because the strong downregulation observed in the de novo synthesis of KMP11 protein in parasites treated with vinblastine is not accompanied by a significant fall in the steady-state level of KMP11 mRNAs, regulatory control of the protein at the translational level is suggested.
Affiliation(s)
- M C Thomas
- Instituto de Parasitología y Biomedicina López Neyra, Consejo Superior de Investigaciones Científicas, Granada, Spain
|
146
|
Chopra S, Brendel V, Zhang J, Axtell JD, Peterson T. Molecular characterization of a mutable pigmentation phenotype and isolation of the first active transposable element from Sorghum bicolor. Proc Natl Acad Sci U S A 1999; 96:15330-5. [PMID: 10611384 PMCID: PMC24819 DOI: 10.1073/pnas.96.26.15330] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.3] [Indexed: 11/18/2022] Open
Abstract
Accumulation of red phlobaphene pigments in sorghum grain pericarp is under the control of the Y gene. A mutable allele of Y, designated as y-cs (y-candystripe), produces a variegated pericarp phenotype. Using probes from the maize p1 gene that cross-hybridize with the sorghum Y gene, we isolated the y-cs allele containing a large insertion element. Our results show that the Y gene is a member of the MYB-transcription factor family. The insertion element, named Candystripe1 (Cs1), is present in the second intron of the Y gene and shares features of the CACTA superfamily of transposons. Cs1 is 23,018 bp in size and is bordered by 20-bp terminal inverted repeat sequences. It generated a 3-bp target site duplication upon insertion within the Y gene and excised from y-cs, leaving a 2-bp footprint in two cases analyzed. Reinsertion of the excised copy of Cs1 was identified by Southern hybridization in the genome of each of seven red pericarp revertant lines tested. Cs1 is the first active transposable element isolated from sorghum. Our analysis suggests that Cs1-homologous sequences are present in low copy number in sorghum and other grasses, including sudangrass, maize, rice, teosinte, and sugarcane. The low copy number and high transposition frequency of Cs1 imply that this transposon could prove to be an efficient gene isolation tool in sorghum.
Affiliation(s)
- S Chopra
- Department of Zoology, Iowa State University, Ames, IA 50011, USA
|
147
|
Desiere F, Lucchini S, Brüssow H. Comparative sequence analysis of the DNA packaging, head, and tail morphogenesis modules in the temperate cos-site Streptococcus thermophilus bacteriophage Sfi21. Virology 1999; 260:244-53. [PMID: 10417259 DOI: 10.1006/viro.1999.9830] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.1] [Indexed: 11/22/2022]
Abstract
The temperate Streptococcus thermophilus bacteriophage Sfi21 possesses 15-nucleotide-long cohesive ends with a 3' overhang that reconstitutes a cos-site with twofold hyphenated rotational symmetry. Over the DNA packaging, head and tail morphogenesis modules, the Sfi21 sequence predicts a gene map that is strikingly similar to that of lambdoid coliphages in the absence of any sequence similarity. A nearly one to one gene correlation was found with the phage lambda genes Nu1 to H, except for gene B-to-E complex, where the Sfi21 map resembled that of coliphage HK97. The similarity between Sfi21 and HK97 was striking: both major head proteins showed an N-terminal coiled-coil structure, the mature major head proteins started at amino acid positions 105 and 104, respectively, and both major head genes were preceded by genes encoding a possible protease and portal protein. The purported Sfi21 protease is the first viral member of the ClpP protease family. The prediction of Sfi21 gene functions by reference to the gene map of intensively investigated coliphages was experimentally confirmed for the major head and tail gene. Phage Sfi21 shows nucleotide sequence similarity with Lactococcus phage BK5-T and a lactococcal prophage and amino acid sequence similarity with the Lactobacillus phage A2 and the Staphylococcus phage PVL. PVL is a missing link that connects the portal proteins from Sfi21 and HK97 with respect to sequence similarity. These observations and database searches, which demonstrate sequence similarity between proteins of phage from gram-positive bacteria, proteobacteria, and Archaea, constrain models of phage evolution.
Affiliation(s)
- F Desiere
- Nestlé Research Centre, Nestec Ltd., Vers-chez-les-Blanc, Lausanne 26, CH-1000, Switzerland
|
148
|
Fraternali F, Pastore A. Modularity and homology: modelling of the type II module family from titin. J Mol Biol 1999; 290:581-93. [PMID: 10390355 DOI: 10.1006/jmbi.1999.2876] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022]
Abstract
We report the homology modelling of the structures of the 162 type II modules from the giant multi-domain protein titin (also known as connectin). The package MODELLER was used and implemented in an automated fashion using four experimentally determined structures as templates. Validation of the models was assessed in terms of divergence from the templates and consensus of the alignments. The homology within the whole family of type II modules as well as with the templates is relatively high (20-35% identity and ca 50% similarity). Comparison between the models of domains for which an NMR structure has been solved and the experimental solution gives an estimate of the quality of the modelling. Our results allow us to distinguish between a set of structurally relevant residues, which are conserved throughout the whole family and buried in the hydrophobic core, from the residues that are conserved and exposed. These latter residues are potentially functionally important. Comparison of exposed conserved patches for modules in different regions of the titin molecule suggests potential interaction surfaces. Our results may be tested directly for those modules whose binding partner is known.
|
149
|
Thompson JD, Plewniak F, Poch O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999; 27:2682-90. [PMID: 10373585 PMCID: PMC148477 DOI: 10.1093/nar/27.13.2682] [Citation(s) in RCA: 387] [Impact Index Per Article: 14.9] [Indexed: 11/13/2022] Open
Abstract
In recent years improvements to existing programs and the introduction of new iterative algorithms have changed the state-of-the-art in protein sequence alignment. This paper presents the first systematic study of the most commonly used alignment programs using BAliBASE benchmark alignments as test cases. Even below the 'twilight zone' at 10-20% residue identity, the best programs were capable of correctly aligning on average 47% of the residues. We show that iterative algorithms often offer improved alignment accuracy though at the expense of computation time. A notable exception was the effect of introducing a single divergent sequence into a set of closely related sequences, causing the iteration to diverge away from the best alignment. Global alignment programs generally performed better than local methods, except in the presence of large N/C-terminal extensions and internal insertions. In these cases, a local algorithm was more successful in identifying the most conserved motifs. This study enables us to propose appropriate alignment strategies, depending on the nature of a particular set of sequences. The employment of more than one program based on different alignment techniques should significantly improve the quality of automatic protein sequence alignment methods. The results also indicate guidelines for improvement of alignment algorithms.
Affiliation(s)
- J D Thompson
- Laboratoire de Biologie Structurale, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex, France
|
150
|
Benvenga S, Alesci S, Trimarchi F, Facchiano A. Homologies of the thyroid sodium-iodide symporter with bacterial and viral proteins. J Endocrinol Invest 1999; 22:535-40. [PMID: 10475151 DOI: 10.1007/bf03343605] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.7] [Indexed: 10/25/2022]
Abstract
We have demonstrated that Na+/I- symporter (NIS), a novel thyroid autoantigen, has local amino acid sequence homologies with the other thyroid autoantigens: Thyroglobulin (Tg), thyroid peroxidase (TPO) and thyrotropin receptor (TSH-R). These homologies concern the 4th, 5th, 6th extracellular loop and the beginning of the intracellular tail. We have expanded our studies and found that there are significant local homologies with other 11 proteins, most of them of bacterial or viral origin (e.g., Streptococcus or Herpes). These homologies concern the 2nd and 4th extracellular loop, and both the beginning and the end of the intracellular tail. These 11 homologies were retrieved by a computer-assisted search and extracted out of a database containing almost 300,000 amino acid sequences. These homologies were of magnitude greater than those concerning the three thyroid autoantigens [identities = 51.1 ± 7.3% vs 25.3 ± 7.8% (mean ± SD), p < 0.001; similarities = 70.6 ± 10.7% vs 43.3 ± 8.5%; p < 0.001]. In addition, extensive, not local, homology was found with a number of unknown proteins from invertebrates (Drosophila melanogaster and Caenorhabditis elegans) and bacteria such as Bacillus subtilis and Xanthobacter. Previously, we had found that NIS has no extensive homology with Tg or TPO or TSH-R. This is the first demonstration of both extensive and local homologies between one thyroid autoantigen (NIS) and microbiological proteins. Taken together with data of the literature on the homologies between other thyroid antigens (Tg, TPO, TSH-R) and bacteria, the homologies we have now found reinforce the view that both bacterial and viral infections may trigger autoimmune thyroid diseases.
Affiliation(s)
- S Benvenga
- Cattedra di Endocrinologia, Università di Messina, Italy
|