1
Cascarina SM, Ross ED. Identification of Low-Complexity Domains by Compositional Signatures Reveals Class-Specific Frequencies and Functions Across the Domains of Life. PLoS Comput Biol 2024; 20:e1011372. [PMID: 38748749 PMCID: PMC11132505 DOI: 10.1371/journal.pcbi.1011372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Revised: 05/28/2024] [Accepted: 05/04/2024] [Indexed: 05/29/2024] Open
Abstract
Low-complexity domains (LCDs) in proteins are typically enriched in one or two predominant amino acids. As a result, LCDs often exhibit unusual structural/biophysical tendencies and can occupy functional niches. However, for each organism, protein sequences must be compatible with intracellular biomolecules and physicochemical environment, both of which vary from organism to organism. This raises the possibility that LCDs may occupy sequence spaces in select organisms that are otherwise prohibited in most organisms. Here, we report a comprehensive survey and functional analysis of LCDs in all known reference proteomes (>21k organisms), with added focus on rare and unusual types of LCDs. LCDs were classified according to both the primary amino acid and secondary amino acid in each LCD sequence, facilitating detailed comparisons of LCD class frequencies across organisms. Examination of LCD classes at different depths (i.e., domain of life, organism, protein, and per-residue levels) reveals unique facets of LCD frequencies and functions. To our surprise, all 400 LCD classes occur in nature, although some are exceptionally rare. A number of rare classes can be defined for each domain of life, with many LCD classes appearing to be eukaryote-specific. Certain LCD classes were consistently associated with identical functions across many organisms, particularly in eukaryotes. Our analysis methods enable simultaneous, direct comparison of all LCD classes between individual organisms, resulting in a proteome-scale view of differences in LCD frequencies and functions. Together, these results highlight the remarkable diversity and functional specificity of LCDs across all known life forms.
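The classification scheme described in this abstract (a primary and a secondary amino acid per LCD, giving 20 × 20 = 400 classes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `lcd_class`, the alphabetical tie-breaking rule, and the convention that a class such as "AA" denotes an LCD with no distinct secondary residue are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

# Every ordered (primary, secondary) pair: 20 * 20 = 400 class labels.
ALL_CLASSES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def lcd_class(lcd_sequence: str) -> str:
    """Label an already-identified LCD by its two most frequent residues.

    Hypothetical helper: ties are broken alphabetically, and a sequence
    containing a single residue type is labelled with that residue twice.
    """
    if not lcd_sequence:
        raise ValueError("empty sequence")
    counts = Counter(lcd_sequence.upper())
    # Sort by descending count, then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    primary = ranked[0][0]
    secondary = ranked[1][0] if len(ranked) > 1 else primary
    return primary + secondary


if __name__ == "__main__":
    print(len(ALL_CLASSES))
    print(lcd_class("QQQQNQQNQQ"))
```

The sketch only labels a region that has already been identified as an LCD; detecting the region in the first place requires composition thresholds that the abstract does not specify.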
Affiliation(s)
- Sean M. Cascarina
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, Colorado, United States of America
- Eric D. Ross
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, Colorado, United States of America
2
Harrison PM. Optimizing strategy for the discovery of compositionally-biased or low-complexity regions in proteins. Sci Rep 2024; 14:680. [PMID: 38182699 PMCID: PMC10770407 DOI: 10.1038/s41598-023-50991-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Accepted: 12/28/2023] [Indexed: 01/07/2024] Open
Abstract
Proteins can contain tracts dominated by a subset of amino acids and that have a functional significance. These are often termed 'low-complexity regions' (LCRs) or 'compositionally-biased regions' (CBRs). However, a wide spectrum of compositional bias is possible, and program parameters used to annotate these regions are often arbitrarily chosen. Also, investigators are sometimes interested in longer regions, or sometimes very short ones. Here, two programs for annotating LCRs/CBRs, namely SEG and fLPS, are investigated in detail across the whole expanse of their parameter spaces. In doing so, boundary behaviours are resolved that are used to derive an optimized systematic strategy for annotating LCRs/CBRs. Sets of parameters that progressively annotate or 'cover' more of protein sequence space and are optimized for a given target length have been derived. This progressive annotation can be applied to discern the biological relevance of CBRs, e.g., in parsing domains for experimental constructs and in generating hypotheses. It is also useful for picking out candidate regions of interest of a given target length and bias signature, and for assessing the parameter dependence of annotations. This latter application is demonstrated for a set of human intrinsically-disordered proteins associated with cancer.
Affiliation(s)
- Paul M Harrison
  - Department of Biology, McGill University, Montreal, QC, Canada.
3
Wesp V, Theißen G, Schuster S. Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content. Sci Rep 2023; 13:22996. [PMID: 38151539 PMCID: PMC10752896 DOI: 10.1038/s41598-023-49626-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 12/10/2023] [Indexed: 12/29/2023] Open
Abstract
Knowledge of the frequencies of synonymous triplets in protein-coding and non-coding DNA stretches can be used in gene finding. These frequencies depend on the GC content of the genome or parts of it. An example of interest is provided by stop codons. This is relevant for the definition of Open Reading Frames. A generic case is provided by pseudo-random sequences, especially when they code for complex proteins or when they are non-coding and not subject to selection pressure. Here, we calculate, for such sequences and for all 25 known genetic codes, the frequency of each amino acid and stop codon based on their set of codons and as a function of GC content. The amino acids can be classified into five groups according to the GC content where their expected frequency reaches its maximum. We determine the overall Shannon information based on groups of synonymous codons and show that it becomes maximum at a percent GC of 43.3% (for the standard code). This is in line with the observation that in most fungi, plants, and animals, this genomic parameter is in the range from 35 to 50%. By analysing natural sequences, we show that there is a clear bias for triplets corresponding to stop codons near the 5'- and 3'-splice sites in the introns of various clades.
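The dependence of stop-codon frequency on GC content in pseudo-random sequences can be sketched as follows. The model here is an assumption, not taken from the paper: nucleotides are drawn independently, with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 − GC)/2, and only the three stop codons of the standard genetic code are considered. The function names are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # standard genetic code only


def base_probabilities(gc: float) -> dict:
    """Per-base probabilities for a given GC fraction (0..1).

    Assumes G and C are equally likely, and likewise A and T.
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def stop_codon_frequency(gc: float) -> float:
    """Expected fraction of random triplets that are stop codons."""
    p = base_probabilities(gc)
    return sum(p[c[0]] * p[c[1]] * p[c[2]] for c in STOP_CODONS)


if __name__ == "__main__":
    for gc in (0.2, 0.35, 0.5, 0.65, 0.8):
        print(f"GC = {gc:.2f}: stop frequency = {stop_codon_frequency(gc):.4f}")
```

Under this model the stop codons, being AT-rich, become rarer as GC content rises, so random open reading frames are expected to be longer in GC-rich sequence.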
Affiliation(s)
- Valentin Wesp
  - Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany
- Günter Theißen
  - Department of Genetics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany
- Stefan Schuster
  - Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany.
4
Harrison PM. fLPS 2.0: rapid annotation of compositionally-biased regions in biological sequences. PeerJ 2021; 9:e12363. [PMID: 34760378 PMCID: PMC8557692 DOI: 10.7717/peerj.12363] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/23/2021] [Accepted: 09/30/2021] [Indexed: 12/12/2022] Open
Abstract
Compositionally-biased (CB) regions in biological sequences are enriched for a subset of sequence residue types. These can be shorter regions with a concentrated bias (i.e., those termed ‘low-complexity’), or longer regions that have a compositional skew. These regions comprise a prominent class of the uncharacterized ‘dark matter’ of the protein universe. Here, I report the latest version of the fLPS package for the annotation of CB regions, which includes added consideration of DNA sequences, to label the eight possible biased regions of DNA. In this version, the user is now able to restrict analysis to a specified subset of residue types, and also to filter for previously annotated domains to enable detection of discontinuous CB regions. A ‘thorough’ option has been added which enables the labelling of subtler biases, typically made from a skew for several residue types. In the output, protein CB regions are now labelled with bias classes reflecting the physico-chemical character of the biasing residues. The fLPS 2.0 package is available from: https://github.com/pmharrison/flps2 or in a Supplemental File of this paper.
Affiliation(s)
- Paul M Harrison
  - Department of Biology, McGill University, Montreal, QC, Canada
5
Cascarina SM, King DC, Osborne Nishimura E, Ross ED. LCD-Composer: an intuitive, composition-centric method enabling the identification and detailed functional mapping of low-complexity domains. NAR Genom Bioinform 2021; 3:lqab048. [PMID: 34056598 PMCID: PMC8153834 DOI: 10.1093/nargab/lqab048] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/10/2020] [Revised: 04/13/2021] [Accepted: 05/06/2021] [Indexed: 02/07/2023] Open
Abstract
Low complexity domains (LCDs) in proteins are regions predominantly composed of a small subset of the possible amino acids. LCDs are involved in a variety of normal and pathological processes across all domains of life. Existing methods define LCDs using information-theoretical complexity thresholds, sequence alignment with repetitive regions, or statistical overrepresentation of amino acids relative to whole-proteome frequencies. While these methods have proven valuable, they are all indirectly quantifying amino acid composition, which is the fundamental and biologically-relevant feature related to protein sequence complexity. Here, we present a new computational tool, LCD-Composer, that directly identifies LCDs based on amino acid composition and linear amino acid dispersion. Using LCD-Composer's default parameters, we identified simple LCDs across all organisms available through UniProt and provide the resulting data in an accessible form as a resource. Furthermore, we describe large-scale differences between organisms from different domains of life and explore organisms with extreme LCD content for different LCD classes. Finally, we illustrate the versatility and specificity achievable with LCD-Composer by identifying diverse classes of LCDs using both simple and multifaceted composition criteria. We demonstrate that the ability to dissect LCDs based on these multifaceted criteria enhances the functional mapping and classification of LCDs.
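The idea of identifying LCDs directly from amino acid composition can be illustrated with a simple sliding-window scan. This is a sketch only and is not LCD-Composer: the function name, the window length of 20, and the 40% composition threshold are illustrative assumptions, and the linear dispersion criterion mentioned in the abstract is omitted entirely.

```python
def find_enriched_regions(sequence, residue, window=20, min_fraction=0.4):
    """Return merged (start, end) spans where `residue` is locally enriched.

    A window qualifies when the residue makes up at least `min_fraction`
    of it; overlapping or adjacent qualifying windows are merged.
    Coordinates are 0-based and end-exclusive.
    """
    regions = []
    for start in range(len(sequence) - window + 1):
        fraction = sequence[start:start + window].count(residue) / window
        if fraction >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Extend the previous region instead of starting a new one.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical test sequence: a 20-residue glutamine tract in a mixed context.
    seq = "MKT" + "Q" * 20 + "ACDEFGHIKLMNPRSTVWYACDEFGHIKL"
    print(find_enriched_regions(seq, "Q"))
```

Because the scan is parameterised by a single residue, running it once per amino acid gives a per-residue view of local enrichment; combining criteria for several residues would require additional logic not shown here.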
Affiliation(s)
- Sean M Cascarina
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
- David C King
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
- Erin Osborne Nishimura
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
- Eric D Ross
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
6
Atypical structural tendencies among low-complexity domains in the Protein Data Bank proteome. PLoS Comput Biol 2020; 16:e1007487. [PMID: 31986130 PMCID: PMC7004392 DOI: 10.1371/journal.pcbi.1007487] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/10/2019] [Revised: 02/06/2020] [Accepted: 12/23/2019] [Indexed: 11/29/2022] Open
Abstract
A variety of studies have suggested that low-complexity domains (LCDs) tend to be intrinsically disordered and are relatively rare within structured proteins in the Protein Data Bank (PDB). Although LCDs are often treated as a single class, we previously found that LCDs enriched in different amino acids can exhibit substantial differences in protein metabolism and function. Therefore, we wondered whether the structural conformations of LCDs are likewise dependent on which specific amino acids are enriched within each LCD. Here, we directly examined relationships between enrichment of individual amino acids and secondary structure tendencies across the entire PDB proteome. Secondary structure tendencies varied as a function of the identity of the amino acid enriched and its degree of enrichment. Furthermore, divergence in secondary structure profiles often occurred for LCDs enriched in physicochemically similar amino acids (e.g. valine vs. leucine), indicating that LCDs composed of related amino acids can have distinct secondary structure tendencies. Comparison of LCD secondary structure tendencies with numerous pre-existing secondary structure propensity scales resulted in relatively poor correlations for certain types of LCDs, indicating that these scales may not capture secondary structure tendencies as sequence complexity decreases. Collectively, these observations provide a highly resolved view of structural tendencies among LCDs parsed by the nature and magnitude of single amino acid enrichment.
The structures that proteins adopt are directly related to their amino acid sequences. Low-complexity domains (LCDs) in protein sequences are unusual regions made up of only a few different types of amino acids. Although this is the key feature that classifies sequences as LCDs, the physical properties of LCDs will differ based on the types of amino acids that are found in each domain. For example, the sequences “AAAAAAAAAA”, “EEEEEEEEEE”, and “EEKRKEEEKE” will have very different properties, even though they would all be classified as LCDs by traditional methods. In a previous study, we developed a new method to further divide LCDs into categories that more closely reflect the differences in their physical properties. In this study, we apply that approach to examine the structures of LCDs when sorted into different categories based on their amino acids. This allowed us to define relationships between the types of amino acids in the LCDs and their corresponding structures. Since protein structure is closely related to protein function, this has important implications for understanding the basic functions and properties of LCDs in a variety of proteins.
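The point about the three example sequences can be made concrete with Shannon entropy, used here as a stand-in for the "traditional" complexity measures the summary alludes to (an assumption: the text does not name a specific measure). The sketch shows that an entropy score cannot tell a poly-alanine tract from a poly-glutamate tract, even though their chemistry differs sharply.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy (bits per residue) of a sequence's composition."""
    if not sequence:
        raise ValueError("empty sequence")
    n = len(sequence)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(sequence).values())
    return entropy + 0.0  # normalise -0.0 to 0.0 for homopolymers


if __name__ == "__main__":
    for seq in ("AAAAAAAAAA", "EEEEEEEEEE", "EEKRKEEEKE"):
        most_common = Counter(seq).most_common(1)[0][0]
        print(seq, round(shannon_entropy(seq), 3), "dominant residue:", most_common)
```

The two homopolymers receive the same score of zero, and the mixed-charge sequence also scores far below the maximum of log2(20) ≈ 4.32 bits, so all three would pass a typical low-complexity cutoff. Only the composition itself, here reduced to the dominant residue, separates them.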
7
Cascarina SM, Ross ED. Proteome-scale relationships between local amino acid composition and protein fates and functions. PLoS Comput Biol 2018; 14:e1006256. [PMID: 30248088 PMCID: PMC6171957 DOI: 10.1371/journal.pcbi.1006256] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 05/30/2018] [Revised: 10/04/2018] [Accepted: 08/16/2018] [Indexed: 11/26/2022] Open
Abstract
Proteins with low-complexity domains continue to emerge as key players in both normal and pathological cellular processes. Although low-complexity domains are often grouped into a single class, individual low-complexity domains can differ substantially with respect to amino acid composition. These differences may strongly influence the physical properties, cellular regulation, and molecular functions of low-complexity domains. Therefore, we developed a bioinformatic approach to explore relationships between amino acid composition, protein metabolism, and protein function. We find that local compositional enrichment within protein sequences is associated with differences in translation efficiency, abundance, half-life, protein-protein interaction promiscuity, subcellular localization, and molecular functions of proteins on a proteome-wide scale. However, local enrichment of related amino acids is sometimes associated with opposite effects on protein regulation and function, highlighting the importance of distinguishing between different types of low-complexity domains. Furthermore, many of these effects are discernible at amino acid compositions below those required for classification as low-complexity or statistically-biased by traditional methods and in the absence of homopolymeric amino acid repeats, indicating that thresholds employed by classical methods may not reflect biologically relevant criteria. Application of our analyses to composition-driven processes, such as the formation of membraneless organelles, reveals distinct composition profiles even for closely related organelles. Collectively, these results provide a unique perspective and detailed insights into relationships between amino acid composition, protein metabolism, and protein functions.
Low-complexity domains in protein sequences are regions that are composed of only a few amino acids in the protein “alphabet”. These domains often have unique chemical properties and play important biological roles in both normal and disease-related processes. While a number of approaches have been developed to define low-complexity domains, these methods each possess conceptual limitations. Therefore, we developed a complementary approach that focuses on local amino acid composition (i.e. the amino acid composition within small regions of proteins). We find that high local composition of individual amino acids is associated with pervasive effects on protein metabolism, subcellular localization, and molecular function on a proteome-wide scale. Importantly, the nature of the effects depends on the type of amino acid enriched within the examined domains, and the effects are observable in the absence of classically-defined low-complexity (and related) domains. Furthermore, we define the compositions of proteins involved in the formation of membraneless, protein-rich organelles such as stress granules and P-bodies. Our results provide a coherent view and unprecedented resolution of the effects of local amino acid enrichment on protein biology.
Affiliation(s)
- Sean M. Cascarina
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO, United States of America
  - * E-mail: (SMC); (EDR)
- Eric D. Ross
  - Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO, United States of America
  - * E-mail: (SMC); (EDR)
8
Chen GL, Chang YJ, Hsueh CH. PRAP: an ab initio software package for automated genome-wide analysis of DNA repeats for prokaryotes. Bioinformatics 2013; 29:2683-9. [DOI: 10.1093/bioinformatics/btt482] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 01/05/2023] Open
9
Pascal C, Paté F, Cheynier V, Delsuc MA. Study of the interactions between a proline-rich protein and a flavan-3-ol by NMR: Residual structures in the natively unfolded protein provides anchorage points for the ligands. Biopolymers 2009; 91:745-56. [DOI: 10.1002/bip.21221] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 11/06/2022]
10
Bannen RM, Bingman CA, Phillips GN. Effect of low-complexity regions on protein structure determination. J Struct Funct Genomics 2008; 8:217-26. [PMID: 18302007 DOI: 10.1007/s10969-008-9039-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 08/03/2007] [Accepted: 02/05/2008] [Indexed: 11/24/2022]
Abstract
It has been previously shown that protein sequences containing a quasi-repetitive assortment of amino acids are common in genomes and databases such as Swiss-Prot but are under-represented in the structure-based Protein Data Bank (PDB). Structural genomics groups have been using the absence of these "low-complexity" sequences for several years as a way to select proteins that have a good chance of successful structure determination. In this study, we examine the data deposited in the PDB as well as the available data from structural genomics groups in TargetDB and PepcDB to reveal interesting trends that could be taken into consideration when using low-complexity sequences as part of the target selection process.
Affiliation(s)
- Ryan M Bannen
  - Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Drive, Madison, WI 53711, USA
11
Coronado JE, Attie O, Epstein SL, Qiu WG, Lipke PN. Composition-modified matrices improve identification of homologs of Saccharomyces cerevisiae low-complexity glycoproteins. Eukaryot Cell 2006; 5:628-37. [PMID: 16607010 PMCID: PMC1459670 DOI: 10.1128/ec.5.4.628-637.2006] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/20/2022]
Abstract
Yeast glycoproteins are representative of low-complexity sequences, those sequences rich in a few types of amino acids. Low-complexity protein sequences comprise more than 10% of the proteome but are poorly aligned by existing methods. Under default conditions, BLAST and FASTA use the scoring matrix BLOSUM62, which is optimized for sequences with diverse amino acid compositions. Because low-complexity sequences are rich in a few amino acids, these tools tend to align the most common residues in nonhomologous positions, thereby generating anomalously high scores, deviations from the expected extreme value distribution, and small e values. This anomalous scoring prevents BLOSUM62-based BLAST and FASTA from identifying correct homologs for proteins with low-complexity sequences, including Saccharomyces cerevisiae wall proteins. We have devised and empirically tested scoring matrices that compensate for the overrepresentation of some amino acids in any query sequence in different ways. These matrices were tested for sensitivity in finding true homologs, discrimination against nonhomologous and random sequences, conformance to the extreme value distribution, and accuracy of e values. Of the tested matrices, the two best matrices (called E and gtQ) gave reliable alignments in BLAST and FASTA searches, identified a consistent set of paralogs of the yeast cell wall test set proteins, and improved the consistency of secondary structure predictions for cell wall proteins.
Affiliation(s)
- Juan E Coronado
  - Department of Biological Sciences, Hunter College, 695 Park Ave., New York, NY 10021, USA
12
Li X, Kahveci T. A novel algorithm for identifying low-complexity regions in a protein sequence. Bioinformatics 2006; 22:2980-7. [PMID: 17018537 DOI: 10.1093/bioinformatics/btl495] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
MOTIVATION: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats. RESULTS: We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM 62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG. AVAILABILITY: The program is available on request.
Affiliation(s)
- Xuehui Li
  - Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA.
13
Subramanyam MB, Gnanamani M, Ramachandran S. Simple sequence proteins in prokaryotic proteomes. BMC Genomics 2006; 7:141. [PMID: 16762057 PMCID: PMC1524752 DOI: 10.1186/1471-2164-7-141] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 01/05/2006] [Accepted: 06/08/2006] [Indexed: 12/05/2022] Open
Abstract
Background: The structural and functional features associated with Simple Sequence Proteins (SSPs) are non-globularity, disease states, signaling and post-translational modification. SSPs are also an important source of genetic and possibly phenotypic variation. Analysis of 249 prokaryotic proteomes offers a new opportunity to examine the genomic properties of SSPs. Results: SSPs are a minority but they grow with proteome size. This relationship is exhibited across species varying in genomic GC, mutational bias, life style, and pathogenicity. Their proportion in each proteome is strongly influenced by genomic base compositional bias. In most species simple duplications are favoured, but in a few cases such as Mycobacteria, large families of duplications occur. Amino acid preference in SSPs exhibits a trend towards low cost of biosynthesis. In SSPs and in non-SSPs, Alanine, Glycine, Leucine, and Valine are abundant in species widely varying in genomic GC whereas Isoleucine and Lysine are rich only in organisms with low genomic GC. Arginine is abundant in SSPs of two species and in the non-SSPs of Xanthomonas oryzae. Asparagine is abundant only in SSPs of low GC species. Aspartic acid is abundant only in the non-SSPs of Halobacterium sp NRC1. The abundance of Serine in SSPs of 62 species extends over a broader range compared to that of non-SSPs. Threonine (T) is abundant only in SSPs of a couple of species. SSPs exhibit preferential association with Cell surface, Cell membrane and Transport functions and a negative association with Metabolism. Mesophiles and Thermophiles display similar ranges in the content of SSPs. Conclusion: Although SSPs are a minority, the genomic forces of base compositional bias and duplications influence their growth and pattern in each species. The preferences and abundance of amino acids are governed by low biosynthetic cost, evolutionary age and base composition of codons. Abundance of charged amino acids Arginine and Aspartic acid is severely restricted. SSPs preferentially associate with cell surface and interface functions as opposed to metabolism, wherein proteins of high sequence complexity with globular structures are preferred. Mesophiles and Thermophiles are similar with respect to the content of SSPs. Our analysis serves to expand the commonly held views on SSPs.
Affiliation(s)
- Mekapati Bala Subramanyam
  - G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall road, Delhi-110007, India
- Muthiah Gnanamani
  - G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall road, Delhi-110007, India
- Srinivasan Ramachandran
  - G.N. Ramachandran Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, Mall road, Delhi-110007, India
14
Prakash T, Ramakrishnan C, Dash D, Brahmachari SK. Conformational Analysis of Invariant Peptide Sequences in Bacterial Genomes. J Mol Biol 2005; 345:937-55. [PMID: 15644196 DOI: 10.1016/j.jmb.2004.11.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 06/09/2004] [Revised: 10/26/2004] [Accepted: 11/05/2004] [Indexed: 10/26/2022]
Abstract
The functional significance of evolutionarily conserved motifs/patterns of short regions in proteins is well documented. Although a large number of sequences are conserved, only a small fraction of these are invariant across several organisms. Here, we have examined the structural features of the functionally important peptide sequences, which have been found invariant across diverse bacterial genera. Ramachandran angles (phi,psi) have been used to analyze the conformation, folding patterns and geometrical location (buried/exposed) of these invariant peptides in different crystal structures harboring these sequences. The analysis indicates that the peptides preferred a single conformation in different protein structures, with the exception of only a few longer peptides that exhibited some conformational variability. In addition, it is noticed that the variability of conformation occurs mainly due to flipping of peptide units about the virtual C(alpha)...C(alpha) bond. However, for a given invariant peptide, the folding patterns are found to be similar in almost all the cases. Over and above, such peptides are found to be buried in the protein core. Thus, we can safely conclude that these invariant peptides are structurally important for the proteins, since they acquire unique structures across different proteins and can act as structural determinants (SD) of the proteins. The location of these SD peptides on the protein chain indicated that most of them are clustered towards the N-terminal and middle region of the protein with the C-terminal region exhibiting low preference. Another feature that emerges out of this study is that some of these SD peptides can also play the roles of "fold boundaries" or "hinge nucleus" in the protein structure. The study indicates that these SD peptides may act as chain-reversal signatures, guiding the proteins to adopt appropriate folds. 
In some cases the invariant signature peptides may also act as folding nuclei (FN) of the proteins.
Affiliation(s)
- Tulika Prakash
  - G.N.R. Knowledge Centre for Genome Informatics, Institute of Genomics and Integrative Biology, CSIR, Mall Road, Delhi 110007, India
15
Sachdeva G, Kumar K, Jain P, Ramachandran S. SPAAN: a software program for prediction of adhesins and adhesin-like proteins using neural networks. Bioinformatics 2004; 21:483-91. [PMID: 15374866 PMCID: PMC7109999 DOI: 10.1093/bioinformatics/bti028] [Citation(s) in RCA: 130] [Impact Index Per Article: 6.2] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The adhesion of microbial pathogens to host cells is mediated by adhesins. Experimental methods used for characterizing adhesins are time-consuming and demand large resources. The availability of specialized software can rapidly aid experimenters in simplifying this problem. We have employed 105 compositional properties and artificial neural networks to develop SPAAN, which predicts the probability of a protein being an adhesin (Pad). RESULTS: SPAAN had optimal sensitivity of 89% and specificity of 100% on a defined test set and could identify 97.4% of known adhesins at high Pad value from a wide range of bacteria. Furthermore, SPAAN facilitated improved annotation of several proteins as adhesins. Novel adhesins were identified in 17 pathogenic organisms causing diseases in humans and plants. In the severe acute respiratory syndrome (SARS) associated human corona virus, the spike glycoprotein and nsps (nsp2, nsp5, nsp6 and nsp7) were identified as having adhesin-like characteristics. These results offer new leads for rapid experimental testing. AVAILABILITY: SPAAN is freely available through ftp://203.195.151.45. CONTACT: ramu@igib.res.in.
Affiliation(s)
- Gaurav Sachdeva
  - G.N. Ramachandran Knowledge Center for Genome Informatics, Institute of Genomics and Integrative Biology Mall Road, Delhi 110 007, India
16
Knight CG, Kassen R, Hebestreit H, Rainey PB. Global analysis of predicted proteomes: functional adaptation of physical properties. Proc Natl Acad Sci U S A 2004; 101:8390-5. [PMID: 15150418 PMCID: PMC420404 DOI: 10.1073/pnas.0307270101] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.6] [Indexed: 01/10/2023] Open
Abstract
The physical characteristics of proteins are fundamentally important in organismal function. We used the complete predicted proteomes of >100 organisms spanning the three domains of life to investigate the comparative biology and evolution of proteomes. Theoretical 2D gels were constructed with axes of protein mass and charge (pI) and converted to density estimates comparable across all types and sizes of proteome. We asked whether we could detect general patterns of proteome conservation and variation. The overall pattern of theoretical 2D gels was strongly conserved across all life forms. Nevertheless, coevolved replicons from the same organism (different chromosomes or plasmid and host chromosomes) encode proteomes more similar to each other than those from different organisms. Furthermore, there was disparity between the membrane and nonmembrane subproteomes within organisms (proteins of membrane proteomes are on the average more basic and heavier) and their variation across organisms, suggesting that membrane proteomes evolve most rapidly. Experimentally, a significant positive relationship independent of phylogeny was found between the predicted proteome and Biolog profile, a measure associated with the ecological niche. Finally, we show that, for the smallest and most alkaline proteomes, there is a negative relationship between proteome size and basicity. This relationship is not adequately explained by AT bias at the DNA sequence level. Together, these data provide evidence of functional adaptation in the properties of complete proteomes.
Affiliation(s)
- Christopher G Knight
  - Department of Plant Sciences, University of Oxford, South Parks Road, Oxford OX1 3RB, United Kingdom.
17
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003. [PMCID: PMC2447285 DOI: 10.1002/cfg.230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open