1
Teekas L, Sharma S, Vijay N. Terminal regions of a protein are a hotspot for low complexity regions and selection. Open Biol 2024; 14:230439. [PMID: 38862022 DOI: 10.1098/rsob.230439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 05/13/2024] [Indexed: 06/13/2024] Open
Abstract
Volatile low complexity regions (LCRs) are a novel source of adaptive variation, functional diversification and evolutionary novelty. An interplay of selection and mutation governs the composition and length of low complexity regions. High %GC and mutations provide length variability because of mechanisms like replication slippage. Owing to the complex dynamics between selection and mutation, we need a better understanding of their coexistence. Our findings underscore that positively selected sites (PSS) and low complexity regions prefer the terminal regions of genes, co-occurring in most Tetrapoda clades. We observed that positively selected sites within a gene have position-specific roles. Central-positively selected site genes primarily participate in defence responses, whereas terminal-positively selected site genes exhibit non-specific functions. Low complexity region-containing genes in the Tetrapoda clade exhibit a significantly higher %GC and lower ω (dN/dS: non-synonymous substitution rate/synonymous substitution rate) compared with genes without low complexity regions. This lower ω implies that despite providing rapid functional diversity, low complexity region-containing genes are subjected to intense purifying selection. Furthermore, we observe that low complexity regions consistently display ubiquitous prevalence at lower purity levels, but exhibit a preference for specific positions within a gene as the purity of the low complexity region stretch increases, implying a composition-dependent evolutionary role. Our findings collectively contribute to the understanding of how genetic diversity and adaptation are shaped by the interplay of selection and low complexity regions in the Tetrapoda clade.
Affiliation(s)
- Lokdeep Teekas
- Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India
- Sandhya Sharma
- Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India
- Nagarjun Vijay
- Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India

2
Dickson ZW, Golding GB. Evolution of Transcript Abundance is Influenced by Indels in Protein Low Complexity Regions. J Mol Evol 2024; 92:153-168. [PMID: 38485789 DOI: 10.1007/s00239-024-10158-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2023] [Accepted: 01/24/2024] [Indexed: 04/02/2024]
Abstract
Protein low complexity regions (LCRs) are compositionally biased amino acid sequences, many of which have significant evolutionary impacts on the proteins which contain them. They are mutationally unstable, experiencing higher rates of indels and substitutions than higher complexity regions. LCRs also impact the expression of their proteins, likely through multiple effects along the path from gene transcription, through translation, and eventual protein degradation. It has been observed that proteins which contain LCRs are associated with elevated transcript abundance (TAb), despite having lower protein abundance. We have gathered and integrated human data to investigate the co-evolution of TAb and LCRs through ancestral reconstructions and model inference using an approximate Bayesian calculation based method. We observe that on short evolutionary timescales TAb evolution is significantly impacted by changes in LCR length, with insertions driving TAb down. But in contrast, the observed data is best explained by indel rates in LCRs which are unaffected by shifts in TAb. Our work demonstrates a coupling between LCR and TAb evolution, and the utility of incorporating multiple responses into evolutionary analyses.
Affiliation(s)
- G Brian Golding
- Department of Biology, McMaster University, Hamilton, ON, Canada

3
Chan-Yao-Chong M, Chan J, Kono H. Benchmarking of force fields to characterize the intrinsically disordered R2-FUS-LC region. Sci Rep 2023; 13:14226. [PMID: 37648703 PMCID: PMC10468508 DOI: 10.1038/s41598-023-40801-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 08/16/2023] [Indexed: 09/01/2023] Open
Abstract
Intrinsically Disordered Proteins (IDPs) play crucial roles in numerous diseases like Alzheimer's and ALS by forming irreversible amyloid fibrils. The effectiveness of force fields (FFs) developed for globular proteins and their modified versions for IDPs varies depending on the specific protein. This study assesses 13 FFs, including AMBER and CHARMM, by simulating the R2 region of the FUS-LC domain (R2-FUS-LC region), an IDP implicated in ALS. Due to the flexibility of the region, we show that utilizing multiple measures, which evaluate the local and global conformations, and combining them into a final score are important for a comprehensive evaluation of force fields. The results suggest c36m2021s3p with the mTIP3P water model is the most balanced FF, capable of generating various conformations compatible with known ones. In addition, the mTIP3P water model is computationally more efficient than those of top-ranked AMBER FFs with four-site water models. The evaluation also reveals that AMBER FFs tend to generate more compact conformations compared to CHARMM FFs but also more non-native contacts. The top-ranking AMBER and CHARMM FFs can reproduce intra-peptide contacts but underperform for inter-peptide contacts, indicating there is room for improvement.
Affiliation(s)
- Maud Chan-Yao-Chong
- Molecular Modeling and Simulation (MMS) Team, Institute for Quantum Life Science, National Institutes for Quantum Science and Technology (QST), 4-9-1, Anagawa, Inage Ward, Chiba City, Chiba, 263-8555, Japan
- Toulouse Biotechnology Institute, TBI, Université de Toulouse, CNRS, INRAE, INSA, 135, Avenue de Rangueil, 31077, Toulouse Cedex 04, France
- Justin Chan
- Molecular Modeling and Simulation (MMS) Team, Institute for Quantum Life Science, National Institutes for Quantum Science and Technology (QST), 4-9-1, Anagawa, Inage Ward, Chiba City, Chiba, 263-8555, Japan
- Hidetoshi Kono
- Molecular Modeling and Simulation (MMS) Team, Institute for Quantum Life Science, National Institutes for Quantum Science and Technology (QST), 4-9-1, Anagawa, Inage Ward, Chiba City, Chiba, 263-8555, Japan

4
Sousa e Silva R, Sousa AD, Vieira J, Vieira CP. The Josephin domain (JD) containing proteins are predicted to bind to the same interactors: Implications for spinocerebellar ataxia type 3 (SCA3) studies using Drosophila melanogaster mutants. Front Mol Neurosci 2023; 16:1140719. [PMID: 37008788 PMCID: PMC10050893 DOI: 10.3389/fnmol.2023.1140719] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/09/2023] [Accepted: 02/21/2023] [Indexed: 03/17/2023] Open
Abstract
Spinocerebellar ataxia type 3, also known as Machado-Joseph disease (SCA3/ MJD), is the most frequent polyglutamine (polyQ) neurodegenerative disorder. It is caused by a pathogenic expansion of the polyQ tract, located at the C-terminal region of the protein encoded by the ATXN3 gene. This gene codes for a deubiquitinating enzyme (DUB) that belongs to a gene family, that in humans is composed by three more genes (ATXN3L, JOSD1, and JOSD2), that define two gene lineages (the ATXN3 and the Josephins). These proteins have in common the N-terminal catalytic domain (Josephin domain, JD), that in Josephins is the only domain present. In ATXN3 knock-out mouse and nematode models, the SCA3 neurodegeneration phenotype is not, however, reproduced, suggesting that in the genome of these species there are other genes that are able to compensate for the lack of ATXN3. Moreover, in mutant Drosophila melanogaster, where the only JD protein is coded by a Josephin-like gene, expression of the expanded human ATXN3 gene reproduces multiple aspects of the SCA3 phenotype, in contrast with the results of the expression of the wild type human form. In order to explain these findings, phylogenetic, as well as, protein–protein docking inferences are here performed. Here we show multiple losses of JD containing genes across the animal kingdom, suggesting partial functional redundancy of these genes. Accordingly, we predict that the JD is essential for binding with ataxin-3 and proteins of the Josephin lineages, and that D. melanogaster mutants are a good model of SCA3 despite the absence of a gene from the ATXN3 lineage. The molecular recognition regions of the ataxin-3 binding and those predicted for the Josephins are, however, different. We also report different binding regions between the two ataxin-3 forms (wild-type (wt) and expanded (exp)). 
The interactors that show an increase in the interaction strength with exp ataxin-3 are enriched in extrinsic components of mitochondrial outer membrane and endoplasmic reticulum membrane. On the other hand, the group of interactors that show a decrease in the interaction strength with exp ataxin-3 is significantly enriched in extrinsic component of cytoplasm.

5
Mier P, Elena-Real CA, Cortés J, Bernadó P, Andrade-Navarro MA. The sequence context in poly-alanine regions: structure, function and conservation. Bioinformatics 2022; 38:4851-4858. [PMID: 36106994 PMCID: PMC9620824 DOI: 10.1093/bioinformatics/btac610] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/05/2022] [Revised: 07/07/2022] [Accepted: 09/05/2022] [Indexed: 11/24/2022] Open
Abstract
MOTIVATION: Poly-alanine (polyA) regions are protein stretches mostly composed of alanines. Despite their abundance in eukaryotic proteomes and their association to nine inherited human diseases, the structural and functional roles exerted by polyA stretches remain poorly understood. In this work we study how the amino acid context in which polyA regions are settled in proteins influences their structure and function. RESULTS: We identified glycine and proline as the most abundant amino acids within polyA and in the flanking regions of polyA tracts, in human proteins as well as in 17 additional eukaryotic species. Our analyses indicate that the non-structuring nature of these two amino acids influences the α-helical conformations predicted for polyA, suggesting a relevant role in reducing the inherent aggregation propensity of long polyA. Then, we show how polyA position in protein N-termini relates with their function as transit peptides. PolyA placed just after the initial methionine is often predicted as part of mitochondrial transit peptides, whereas when placed in downstream positions, polyA are part of signal peptides. A few examples from known structures suggest that short polyA can emerge by alanine substitutions in α-helices; but evolution by insertion is observed for longer polyA. Our results showcase the importance of studying the sequence context of homorepeats as a mechanism to shape their structure-function relationships. AVAILABILITY AND IMPLEMENTATION: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany
- Carlos A Elena-Real
- Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS, 34090 Montpellier, France
- Juan Cortés
- LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
- Pau Bernadó
- Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS, 34090 Montpellier, France
- Miguel A Andrade-Navarro
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany

6
Lee J, Cho H, Kwon I. Phase separation of low-complexity domains in cellular function and disease. Experimental & Molecular Medicine 2022; 54:1412-1422. [PMID: 36175485 PMCID: PMC9534829 DOI: 10.1038/s12276-022-00857-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/07/2022] [Revised: 07/15/2022] [Accepted: 07/19/2022] [Indexed: 11/09/2022]
Abstract
In this review, we discuss the ways in which recent studies of low-complexity (LC) domains have challenged our understanding of the mechanisms underlying cellular organization. LC sequences, long believed to function in the absence of a molecular structure, are abundant in the proteomes of all eukaryotic organisms. Over the past decade, the phase separation of LC domains has emerged as a fundamental mechanism driving dynamic multivalent interactions of many cellular processes. We review the key evidence showing the role of phase separation of individual proteins in organizing cellular assemblies and facilitating biological function while implicating the dynamics of phase separation as a key to biological validity and functional utility. We also highlight the evidence showing that pathogenic LC proteins alter various phase separation-dependent interactions to elicit debilitating human diseases, including cancer and neurodegenerative diseases. Progress in understanding the biology of phase separation may offer useful hints toward possible therapeutic interventions to combat the toxicity of pathogenic proteins.
Affiliation(s)
- Jiwon Lee
- Department of Anatomy and Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419, Korea
- Hana Cho
- Department of Physiology, Sungkyunkwan University School of Medicine, Suwon, 16419, Korea
- Ilmin Kwon
- Department of Anatomy and Cell Biology, Sungkyunkwan University School of Medicine, Suwon, 16419, Korea

7
Dickson ZW, Golding GB. Low complexity regions in mammalian proteins are associated with low protein abundance and high transcript abundance. Mol Biol Evol 2022; 39:6575407. [PMID: 35482425 PMCID: PMC9070799 DOI: 10.1093/molbev/msac087] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/14/2022] Open
Abstract
Low Complexity Regions (LCRs) are present in a surprisingly large number of eukaryotic proteins. These highly repetitive and compositionally biased sequences are often structurally disordered, bind promiscuously, and evolve rapidly. Frequently studied in terms of evolutionary dynamics, little is known about how LCRs affect the expression of the proteins which contain them. It would be expected that rapidly evolving LCRs are unlikely to be tolerated in strongly conserved, highly abundant proteins, leading to lower overall abundance in proteins which contain LCRs. To test this hypothesis and examine the associations of protein abundance and transcript abundance with the presence of LCRs, we have integrated high-throughput data from across mammals. We have found that LCRs are indeed associated with reduced protein abundance, but are also associated with elevated transcript abundance. These associations are qualitatively consistent across 12 human tissues and nine mammalian species. The differential impacts of LCRs on abundance at the protein and transcript level are not explained by differences in either protein degradation rates or the inefficiency of translation for LCR containing proteins. We suggest that rapidly evolving LCRs are a source of selective pressure on the regulatory mechanisms which maintain steady-state protein abundance levels.
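As a rough illustration of what "compositionally biased" can mean in practice, the sketch below flags sliding windows of a protein sequence whose Shannon entropy is low. It is not the procedure used in the study above; the window length, the entropy threshold, and the function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq: str, window: int = 12, max_entropy: float = 2.0):
    """Return (start, end, entropy) for every window at or below max_entropy.

    window and max_entropy are hypothetical defaults, not published cutoffs.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        h = shannon_entropy(segment)
        if h <= max_entropy:
            hits.append((start, start + window, h))
    return hits
```

A homopolymer window has entropy 0, while a window drawn evenly from all 20 amino acids approaches log2(20), about 4.3 bits, so a threshold somewhere between the two separates biased from unbiased stretches.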
Affiliation(s)
- Zachery W Dickson
- Department of Biology, McMaster University, Hamilton, Ontario, Canada
- G Brian Golding
- Department of Biology, McMaster University, Hamilton, Ontario, Canada

8
Chavali S, Singh AK, Santhanam B, Babu MM. Amino acid homorepeats in proteins. Nat Rev Chem 2020; 4:420-434. [PMID: 37127972 DOI: 10.1038/s41570-020-0204-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Accepted: 06/04/2020] [Indexed: 12/16/2022]
Abstract
Amino acid homorepeats, or homorepeats, are polypeptide segments found in proteins that contain stretches of identical amino acid residues. Although abnormal homorepeat expansions are linked to pathologies such as neurodegenerative diseases, homorepeats are prevalent in eukaryotic proteomes, suggesting that they are important for normal physiology. In this Review, we discuss recent advances in our understanding of the biological functions of homorepeats, which range from facilitating subcellular protein localization to mediating interactions between proteins across diverse cellular pathways. We explore how the functional diversity of homorepeat-containing proteins could be linked to the ability of homorepeats to adopt different structural conformations, an ability influenced by repeat composition, repeat length and the nature of flanking sequences. We conclude by highlighting how an understanding of homorepeats will help us better characterize and develop therapeutics against the human diseases to which they contribute.
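The definition above (stretches of identical amino acid residues) can be sketched as a simple scan over a protein sequence. This is a minimal illustration only: the function name `find_homorepeats` and the default minimum run length are assumptions, not values taken from the review.

```python
import re


def find_homorepeats(seq: str, min_len: int = 5):
    """Return (residue, start, end) for each run of one residue >= min_len.

    min_len is an illustrative cutoff; end is exclusive (Python slicing).
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # A captured residue followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end()) for m in pattern.finditer(seq)]
```

For example, `find_homorepeats("MAAAAAAGQQQQQK")` reports one alanine run and one glutamine run, and `seq[start:end]` recovers each stretch.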
Affiliation(s)
- Sreenivas Chavali
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK
- Department of Biology, Indian Institute of Science Education and Research (IISER) Tirupati, Tirupati, India
- Anjali K Singh
- Department of Biology, Indian Institute of Science Education and Research (IISER) Tirupati, Tirupati, India
- Balaji Santhanam
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK
- Department of Structural Biology and Center for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN, USA
- M Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK
- Department of Structural Biology and Center for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN, USA

9
Cooper DG, Fassler JS. Med15: Glutamine-Rich Mediator Subunit with Potential for Plasticity. Trends Biochem Sci 2019; 44:737-751. [PMID: 31036407 DOI: 10.1016/j.tibs.2019.03.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 12/07/2018] [Revised: 03/16/2019] [Accepted: 03/25/2019] [Indexed: 02/07/2023]
Abstract
The Mediator complex is required for basal activity of the RNA polymerase (Pol) II transcriptional apparatus and for responsiveness to some activator proteins. Med15, situated in the Mediator tail, plays a role in transmitting regulatory information from distant DNA-bound transcription factors to the transcriptional apparatus poised at promoters. Yeast Med15 and its orthologs share an unusual, glutamine-rich amino acid composition. Here, we discuss this sequence feature and the tendency of polyglutamine tracts to vary in length among strains of Saccharomyces cerevisiae, and we propose that different polyglutamine tract lengths may be adaptive within certain domestication habitats.
Affiliation(s)
- David G Cooper
- Department of Biology, University of Iowa, Iowa City, IA 52242, USA
- Jan S Fassler
- Department of Biology, University of Iowa, Iowa City, IA 52242, USA

10
Abstract
Liquid-liquid phase separation seems to play critical roles in the compartmentalization of cells through the formation of biomolecular condensates. Many proteins with low-complexity regions are found in these condensates, and they can undergo phase separation in vitro in response to changes in temperature, pH, and ion concentration. Low-complexity regions are thus likely important players in mediating compartmentalization in response to stress. However, how the phase behavior is encoded in their amino acid composition and patterning is only poorly understood. We discuss here that polymer physics provides a powerful framework for our understanding of the thermodynamics of mixing and demixing and for how the phase behavior is encoded in the primary sequence. We propose to classify low-complexity regions further into subcategories based on their sequence properties and phase behavior. Ongoing research promises to improve our ability to link the primary sequence of low-complexity regions to their phase behavior as well as the emerging miscibility and material properties of the resulting biomolecular condensates, providing mechanistic insight into this fundamental biological process across length scales.
Affiliation(s)
- Erik W Martin
- Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105-3678, United States
- Tanja Mittag
- Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, Tennessee 38105-3678, United States

11
Zhang Y, Man VH, Roland C, Sagui C. Amyloid Properties of Asparagine and Glutamine in Prion-like Proteins. ACS Chem Neurosci 2016; 7:576-87. [PMID: 26911543 DOI: 10.1021/acschemneuro.5b00337] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 11/28/2022] Open
Abstract
Sequences rich in glutamine (Q) and asparagine (N) are intrinsically disordered in monomeric form, but can aggregate into highly ordered amyloids, as seen in Q/N-rich prion domains (PrDs). Amyloids are fibrillar protein aggregates rich in β-sheet structures that can self-propagate through protein-conformational chain reactions. Here, we present a comprehensive theoretical study of N/Q-rich peptides, including sequences found in the yeast Sup35 PrD, in parallel and antiparallel β-sheet aggregates, and probe via fully atomistic molecular dynamics simulations all their possible steric-zipper interfaces in order to determine their protofibril structure and their relative stability. Our results show that polyglutamine aggregates are more stable than polyasparagine aggregates. Enthalpic contributions to the free energy favor the formation of polyQ protofibrils, while entropic contributions favor the formation of polyN protofibrils. The considerably larger phase space that disordered polyQ must sample on its way to aggregation probably is at the root of the associated slower kinetics observed experimentally. When other amino acids are present, such as in the Sup35 PrD, their shorter side chains favor steric-zipper formation for N but not Q, as they preclude the in-register association of the long Q side chains.
Affiliation(s)
- Yuan Zhang
- Department of Physics, and Center for High Performance Simulations (CHiPS), North Carolina State University, Raleigh, North Carolina 27695, United States
- Viet Hoang Man
- Department of Physics, and Center for High Performance Simulations (CHiPS), North Carolina State University, Raleigh, North Carolina 27695, United States
- Christopher Roland
- Department of Physics, and Center for High Performance Simulations (CHiPS), North Carolina State University, Raleigh, North Carolina 27695, United States
- Celeste Sagui
- Department of Physics, and Center for High Performance Simulations (CHiPS), North Carolina State University, Raleigh, North Carolina 27695, United States

12
Wu R, Liu Q, Zhang P, Liang D. Tandem amino acid repeats in the green anole (Anolis carolinensis) and other squamates may have a role in increasing genetic variability. BMC Genomics 2016; 17:109. [PMID: 26868501 PMCID: PMC4751654 DOI: 10.1186/s12864-016-2430-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/06/2015] [Accepted: 02/02/2016] [Indexed: 01/04/2023] Open
Abstract
Background: Tandem amino acid repeats are characterised by the consecutive recurrence of a single amino acid. They exhibit high rates of length mutations in addition to point mutations and have been proposed to be involved in genetic plasticity. Squamate reptiles (lizards and snakes) diversify in both morphology and physiology. The underlying mechanism is yet to be understood. In a previous phylogenomic analysis of reptiles, the density of tandem repeats in an anole lizard diverged heavily from that of the other reptiles. To gain further insight into the tandem amino acid repeats in squamates, we analysed the repeat content in the green anole (Anolis carolinensis) proteome and compared the amino acid repeats in a large orthologous protein data set from six vertebrates (the Western clawed frog, the green anole, the Chinese softshell turtle, the zebra finch, mouse and human). Results: Our results revealed that the number of amino acid repeats in the green anole exceeded those found in the other five species studied. Species-only repeats were found in high proportion in the green anole but not in the other five species, suggesting that the green anole had gained many amino acid repeats in either the Anolis or the squamate lineage. Since the amino acid repeat containing genes in the green anole were highly enriched in genes related to transcription and development, an important family of developmental genes, i.e., the Hox family, was further studied in a wide collection of squamates. Abundant amino acid repeats were also observed, implying the general high tolerance of amino acid repeats in squamates. A particular enrichment of amino acid repeats was observed in the central class Hox genes that are known to be responsible for defining cervical to lumbar regions.
Conclusions: Our study suggests that the abundant amino acid repeats in the green anole, and possibly in other squamates, may play a role in increasing the genetic variability, and contribute to the evolutionary diversity of this clade. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2430-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Riga Wu
- Key Laboratory of Gene Engineering of the Ministry of Education, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Qingfeng Liu
- Key Laboratory of Gene Engineering of the Ministry of Education, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Peng Zhang
- Key Laboratory of Gene Engineering of the Ministry of Education, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, People's Republic of China
- Dan Liang
- Key Laboratory of Gene Engineering of the Ministry of Education, State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-Sen University, Guangzhou, People's Republic of China

13
The Octatricopeptide Repeat Protein Raa8 Is Required for Chloroplast trans Splicing. Eukaryotic Cell 2015. [PMID: 26209695 DOI: 10.1128/ec.00096-15] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 11/20/2022]
Abstract
The mRNA maturation of the tripartite chloroplast psaA gene from the green alga Chlamydomonas reinhardtii depends on various nucleus-encoded factors that participate in trans splicing of two group II introns. Recently, a multiprotein complex was identified that is involved in processing the psaA precursor mRNA. Using coupled tandem affinity purification (TAP) and mass spectrometry analyses with the trans-splicing factor Raa4 as a bait protein, we recently identified a multisubunit ribonucleoprotein (RNP) complex comprising the previously characterized trans-splicing factors Raa1, Raa3, Raa4, and Rat2 plus novel components. Raa1 and Rat2 share a structural motif, an octatricopeptide repeat (OPR), that presumably functions as an RNA interaction module. Two of the novel RNP complex components also exhibit a predicted OPR motif and were therefore considered potential trans-splicing factors. In this study, we selected bacterial artificial chromosome (BAC) clones encoding these OPR proteins and conducted functional complementation assays using previously generated trans-splicing mutants. Our assay revealed that the trans-splicing defect of mutant F19 was restored by a new factor we named RAA8; molecular characterization of complemented strains verified that Raa8 participates in splicing of the first psaA group II intron. Three of six OPR motifs are located in the C-terminal end of Raa8, which was shown to be essential for restoring psaA mRNA trans splicing. Our results support the important role played by OPR proteins in chloroplast RNA metabolism and also demonstrate that combining TAP and mass spectrometry with functional complementation studies represents a vigorous tool for identifying trans-splicing factors.

14
Lenz C, Haerty W, Golding GB. Increased substitution rates surrounding low-complexity regions within primate proteins. Genome Biol Evol 2014; 6:655-65. [PMID: 24572016 PMCID: PMC3971593 DOI: 10.1093/gbe/evu042] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022] Open
Abstract
Previous studies have found that DNA-flanking low-complexity regions (LCRs) have an increased substitution rate. Here, the substitution rate was confirmed to increase in the vicinity of LCRs in several primate species, including humans. This effect was also found among human sequences from the 1000 Genomes Project. A strong correlation was found between average substitution rate per site and distance from the LCR, as well as the proportion of genes with gaps in the alignment at each site and distance from the LCR. Along with substitution rates, dN/dS ratios were also determined for each site, and the proportion of sites undergoing negative selection was found to have a negative relationship with distance from the LCR.
Affiliation(s)
- Carolyn Lenz
- Department of Biology, McMaster University, Hamilton, Ontario, Canada

15
Espinosa Angarica V, Ventura S, Sancho J. Discovering putative prion sequences in complete proteomes using probabilistic representations of Q/N-rich domains. BMC Genomics 2013; 14:316. [PMID: 23663289 PMCID: PMC3654983 DOI: 10.1186/1471-2164-14-316] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 10/08/2012] [Accepted: 05/06/2013] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Prion proteins form a special class among amyloids due to their ability to transmit aggregative folds. Prions are known to act as infectious agents in neurodegenerative diseases in animals, or as key elements in transcription and translation processes in yeast. It has been suggested that prions contain specific sequential domains with distinctive amino acid composition and physicochemical properties that allow them to control the switch between soluble and β-sheet aggregated states. Those prion-forming domains are low complexity segments enriched in glutamine/asparagine and depleted in charged residues and prolines. Different predictive methods have been developed to discover novel prions by either assessing the compositional bias of these stretches or estimating the propensity of protein sequences to form amyloid aggregates. However, the algorithms available so far lack a thorough statistical calibration against large sequence databases, which makes them unable to accurately predict prions without retrieving a large number of false positives. RESULTS Here we present a computational strategy to predict putative prion-forming proteins in complete proteomes using probabilistic representations of prionogenic glutamine/asparagine rich regions. After benchmarking our predictive model against large sets of non-prionic sequences, we were able to filter out known prions with high precision and accuracy, generating prediction sets with few false positives. The algorithm was used to scan all the proteomes annotated in public databases for the presence of putative prion proteins. We analyzed the presence of putative prion proteins in all taxa, from viruses and archaea to plants and higher eukaryotes, and found that most organisms encode evolutionarily unrelated proteins with susceptibility to behave as prions. CONCLUSIONS To our knowledge, this is the first wide-ranging study aiming to predict prion domains in complete proteomes. Approaches of this kind could be of great importance to identify potential targets for further experimental testing and to try to reach a deeper understanding of prions’ functional and regulatory mechanisms.
Affiliation(s)
- Vladimir Espinosa Angarica
- Departamento de Bioquímica y Biología Molecular y Celular, Facultad de Ciencias, Universidad de Zaragoza, Pedro Cerbuna 12, Zaragoza 50009, Spain
|
16
|
Radó-Trilla N, Albà M. Dissecting the role of low-complexity regions in the evolution of vertebrate proteins. BMC Evol Biol 2012; 12:155. [PMID: 22920595 PMCID: PMC3523016 DOI: 10.1186/1471-2148-12-155] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 04/17/2012] [Accepted: 07/30/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Low-complexity regions (LCRs) in proteins are tracts that are highly enriched in one or a few amino acids. Given their high abundance, and their capacity to expand in relatively short periods of time through replication slippage, they can greatly contribute to increase protein sequence space and generate novel protein functions. However, little is known about the global impact of LCRs on protein evolution. RESULTS We have traced back the evolutionary history of 2,802 LCRs from a large set of homologous protein families from H.sapiens, M.musculus, G.gallus, D.rerio and C.intestinalis. Transcriptional factors and other regulatory functions are overrepresented in proteins containing LCRs. We have found that the gain of novel LCRs is frequently associated with repeat expansion whereas the loss of LCRs is more often due to accumulation of amino acid substitutions as opposed to deletions. This dichotomy results in net protein sequence gain over time. We have detected a significant increase in the rate of accumulation of novel LCRs in the ancestral Amniota and mammalian branches, and a reduction in the chicken branch. Alanine and/or glycine-rich LCRs are overrepresented in recently emerged LCR sets from all branches, suggesting that their expansion is better tolerated than for other LCR types. LCRs enriched in positively charged amino acids show the contrary pattern, indicating an important effect of purifying selection in their maintenance. CONCLUSION We have performed the first large-scale study on the evolutionary dynamics of LCRs in protein families. The study has shown that the composition of an LCR is an important determinant of its evolutionary pattern.
Affiliation(s)
- Núria Radó-Trilla
- Evolutionary Genomics Group, Research Programme on Biomedical Informatics - IMIM Hospital del Mar Research Institute, Universitat Pompeu Fabra, Dr. Aiguader 88, Barcelona 08003, Spain
|
17
|
Li H, Liu J, Wu K, Chen Y. Insight into role of selection in the evolution of polyglutamine tracts in humans. PLoS One 2012; 7:e41167. [PMID: 22848438 PMCID: PMC3405088 DOI: 10.1371/journal.pone.0041167] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 03/16/2012] [Accepted: 06/18/2012] [Indexed: 11/21/2022] Open
Abstract
Glutamine tandem repeats are common in eukaryotic proteins. Although some studies have proposed that replication slippage plays an important role in shaping these repeats, the role of natural selection in glutamine tandem repeat evolution is somewhat unclear. In this study, we identified all of the glutamine tandem repeats containing four or more glutamines in human proteins and then estimated the nonsynonymous (dN) and synonymous (dS) substitution rates for the regions flanking the glutamine tandem repeats and the proteins containing them. The results indicated that most of the proteins containing polyglutamine (polyQ) tracts of four or more glutamines have undergone purifying selection, and that the purifying selection for the regions flanking the repeats is weaker. Additionally, we observed that the conserved repeats were under stronger selection constraints than the nonconserved repeats. Interestingly, we found that there was a higher level of purifying selection for the regions flanking the polyQ tracts encoded by pure CAG codons compared with those encoded by mixed codons. Based on our findings, we propose that selection has played a more important role than was previously speculated in constraining the expansion of polyQ tracts encoded by pure codons.
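The repeat criterion stated in this abstract (runs of four or more consecutive glutamines) can be illustrated with a minimal sketch. The function name `find_polyq_tracts` and the example sequence are hypothetical and are not taken from the paper, whose actual pipeline is not described here.

```python
import re

def find_polyq_tracts(protein, min_len=4):
    """Return (start, length) for every run of at least min_len consecutive Q residues.

    Hypothetical helper: it only illustrates the "four or more glutamines"
    criterion mentioned in the abstract, not the authors' own method.
    """
    pattern = re.compile(r"Q{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(protein)]

# Made-up sequence: one run of five Q (reported) and one run of three Q (ignored).
print(find_polyq_tracts("MAQQQQQLKQQQA"))
```

Positions are 0-based; a run shorter than `min_len` is simply not reported.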
Affiliation(s)
- Hongwei Li
- College of Veterinary Medicine, China Agricultural University, Beijing, China.
|
18
|
Behura SK, Severson DW. Genome-wide comparative analysis of simple sequence coding repeats among 25 insect species. Gene 2012; 504:226-32. [PMID: 22633877 DOI: 10.1016/j.gene.2012.05.020] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 02/07/2012] [Revised: 05/11/2012] [Accepted: 05/12/2012] [Indexed: 10/28/2022]
Abstract
We present a detailed genome-scale comparative analysis of simple sequence repeats within protein coding regions among 25 insect genomes. The repetitive sequences in the coding regions primarily represented single codon repeats and codon pair repeats. The CAG triplet is highly repetitive in the coding regions of insect genomes. It is frequently paired with the synonymous codon CAA to code for polyglutamine repeats. The codon pairs that are least repetitive code for polyalanine repeats. The frequency of hexanucleotide and dinucleotide motifs of codon pair repeats is significantly (p < 0.001) different in the Drosophila species compared to the non-Drosophila species. However, the frequency of synonymous and non-synonymous codon pair repeats varies in a correlated manner (r² = 0.79) among all the species. Results further show that perfect and imperfect repeats have significant association with the trinucleotide and hexanucleotide coding repeats in most of these insects. However, only select species show significant association between the numbers of perfect/imperfect hexamers and repeats coding for single amino acid/amino acid pair runs. Our data further suggest that genes containing simple sequence coding repeats may be under negative selection as they tend to be poorly conserved across species. The sequences of coding repeats of orthologous genes vary according to the known phylogeny among the species. In conclusion, the study shows that simple sequence coding repeats are important features of genome diversity among insects.
Affiliation(s)
- Susanta K Behura
- Eck Institute for Global Health, Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA.
|
19
|
Zhou Y, Liu J, Han L, Li ZG, Zhang Z. Comprehensive analysis of tandem amino acid repeats from ten angiosperm genomes. BMC Genomics 2011; 12:632. [PMID: 22195734 PMCID: PMC3283746 DOI: 10.1186/1471-2164-12-632] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 08/02/2011] [Accepted: 12/23/2011] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The presence of tandem amino acid repeats (AARs) is one of the signatures of eukaryotic proteins. AARs were thought to be frequently involved in bio-molecular interactions. Comprehensive studies that primarily focused on metazoan AARs have suggested that AARs are evolving rapidly and are highly variable among species. However, there is still controversy over causal factors of this inter-species variation. In this work, we attempted to investigate this topic mainly by comparing AARs in orthologous proteins from ten angiosperm genomes. RESULTS Angiosperm AAR content is positively correlated with the GC content of the protein coding sequence. However, based on observations from fungal AARs and insect AARs, we argue that the applicability of this kind of correlation is limited by AAR residue composition and species' life history traits. Angiosperm AARs also tend to be fast evolving and structurally disordered, supporting the results of comprehensive analyses of metazoans. The functions of conserved long AARs are summarized. Finally, we propose that the rapid mRNA decay rate, alternative splicing and tissue specificity are regulatory processes that are associated with angiosperm proteins harboring AARs. CONCLUSIONS Our investigation suggests that GC content is a predictor of AAR content in the protein coding sequence under certain conditions. Although angiosperm AARs lack conservation and 3D structure, a fraction of the proteins that contain AARs may be functionally important and are under extensive regulation in plant cells.
Affiliation(s)
- Yuan Zhou
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Jing Liu
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Lei Han
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Zhi-Gang Li
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
- Ziding Zhang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
|
20
|
Toll-Riera M, Radó-Trilla N, Martys F, Albà MM. Role of low-complexity sequences in the formation of novel protein coding sequences. Mol Biol Evol 2011; 29:883-6. [PMID: 22045997 DOI: 10.1093/molbev/msr263] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Indexed: 12/28/2022] Open
Abstract
Low-complexity sequences are extremely abundant in eukaryotic proteins for reasons that remain unclear. One hypothesis is that they contribute to the formation of novel coding sequences, facilitating the generation of novel protein functions. Here, we test this hypothesis by examining the content of low-complexity sequences in proteins of different age. We show that recently emerged proteins contain more low-complexity sequences than older proteins and that these sequences often form functional domains. These data are consistent with the idea that low-complexity sequences may play a key role in the emergence of novel genes.
Affiliation(s)
- Macarena Toll-Riera
- Evolutionary Genomics Group, Research Programme in Biomedical Informatics, Universitat Pompeu Fabra (UPF)-Institute Municipal d'Investigació Mèdica (IMIM), Barcelona, Spain
|
21
|
Kurosaki T, Gojobori J, Ueda S. Comparative Genetics of the Poly-Q Tract of Ataxin-1 and Its Binding Protein PQBP-1. Biochem Genet 2011; 50:309-17. [DOI: 10.1007/s10528-011-9473-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/26/2010] [Accepted: 06/14/2011] [Indexed: 11/28/2022]
|
22
|
Haerty W, Golding GB. Low-complexity sequences and single amino acid repeats: not just "junk" peptide sequences. Genome 2011; 53:753-62. [PMID: 20962881 DOI: 10.1139/g10-063] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Indexed: 12/21/2022]
Abstract
For decades proteins were thought to interact in a "lock and key" system, which led to the definition of a paradigm linking stable three-dimensional structure to biological function. As a consequence, any non-structured peptide was considered to be nonfunctional and to evolve neutrally. Surprisingly, the most commonly shared peptides between eukaryotic proteomes are low-complexity sequences that in most conditions do not present a stable three-dimensional structure. However, because these sequences evolve rapidly and because the size variation of a few of them can have deleterious effects, low-complexity sequences have been suggested to be the target of selection. Here we review evidence that supports the idea that these simple sequences should not be considered just "junk" peptides and that selection drives the evolution of many of them.
Affiliation(s)
- Wilfried Haerty
- Biology Department, McMaster University, Hamilton, ON, Canada
|
23
|
Gojobori J, Ueda S. Elevated evolutionary rate in genes with homopolymeric amino acid repeats constituting nondisordered structure. Mol Biol Evol 2010; 28:543-50. [PMID: 20798138 DOI: 10.1093/molbev/msq225] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 12/19/2022] Open
Abstract
Homopolymeric amino acid repeats are tandem repeats of single amino acids. About 650 genes in the human genome are known to have repeats of this kind comprising seven residues or more. Based on their evolutionary conservation, we classified the repeats into three categories: those whose length is conserved among mammals (CM), those whose length differs among nonprimate mammals but is conserved among primates (CP), and those whose length differs among primates (VP). The frequency of each repeat, especially Ala, Leu, Pro, and Glu repeats, varies greatly in each category. The 3D structure of homopolymeric amino acid repeats is considered to be intrinsically disordered. As expected, a large proportion of the repeats had a disordered structure, and nearly half of the repeats were predicted as completely disordered. However, a number of the repeats were predicted to have nondisordered structure: 13% and 25% of the repeats for categories CM and VP, respectively. Comparison of the substitution rates showed a higher Ka/Ks ratio for the genes with nondisordered repeats than for the genes with disordered repeats. These results indicate that amino acid substitution rates have been elevated in the genes with nondisordered repeats.
Affiliation(s)
- Jun Gojobori
- School of Advanced Studies, Graduate University for Advanced Studies, Hayama, Kanagawa, Japan
|
24
|
Haerty W, Golding GB. Genome-wide evidence for selection acting on single amino acid repeats. Genome Res 2010; 20:755-60. [PMID: 20056893 DOI: 10.1101/gr.101246.109] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 12/13/2022]
Abstract
Low complexity and homopolymer sequences within coding regions are known to evolve rapidly. While their expansion may be deleterious, there is increasing evidence for a functional role associated with these amino acid sequences. Homopolymer sequences are thought to evolve mostly through replication slippage and, therefore, they may be expected to be longer in regions with relaxed selective constraint. Within the coding sequences of eukaryotes, alternatively spliced exons are known to evolve under relaxed constraints in comparison to those exons that are constitutively spliced because they are not included in all of the mature mRNA of a gene. This relaxed exposure to selection leads to faster rates of evolution for alternatively spliced exons in comparison to constitutively spliced exons. Here, we have tested the effect of splicing on the structure (composition, length) of homopolymer sequences in relation to the splicing pattern in which they are found. We observed a significant relationship between alternative splicing and homopolymer sequences with alternatively spliced genes being enriched in number and length of homopolymer sequences. We also observed lower codon diversity and longer homocodons, suggesting a balance between slippage and point mutations linked to the constraints imposed by selection.
Affiliation(s)
- Wilfried Haerty
- Biology Department, McMaster University, Hamilton, Ontario L8S4L8, Canada
|
25
|
Du X, Xiao Q, Zhao R, Wu F, Xu Q, Chong K, Meng Z. TrMADS3, a new MADS-box gene, from a perennial species Taihangia rupestris (Rosaceae) is upregulated by cold and experiences seasonal fluctuation in expression level. Dev Genes Evol 2008; 218:281-92. [PMID: 18465139 DOI: 10.1007/s00427-008-0218-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/09/2007] [Accepted: 04/02/2008] [Indexed: 11/26/2022]
Abstract
In many temperate perennial plants, floral transition is initiated in the first growth season but the development of flower is arrested during the winter to ensure production of mature flowers in the next spring. The molecular mechanisms of the process remain poorly understood with few well-characterized regulatory genes. Here, a MADS-box gene, named as TrMADS3, was isolated from the overwintering inflorescences of Taihangia rupestris, a temperate perennial in the rose family. Phylogenetic analysis reveals that TrMADS3 is more closely related to the homologs of the FLOWERING LOCUS C lineage than to any of the other MIKC-type MADS-box lineages known from Arabidopsis. The TrMADS3 transcripts are extensively distributed in inflorescences, roots, and leaves during the winter. In controlled conditions, the TrMADS3 expression level is upregulated by a chilling exposure for 1 to 2 weeks and remains high for a longer period of time in warm conditions after cold treatment. In situ hybridization reveals that TrMADS3 is predominantly expressed in the vegetative and reproductive meristems. Ectopic expression of TrMADS3 in Arabidopsis promotes seed germination on the media containing relatively high NaCl or mannitol concentrations. These data indicate that TrMADS3 in a perennial species might have its role in both vegetative and reproductive meristems in response to cold.
Affiliation(s)
- Xiaoqiu Du
- Laboratory of Photosynthesis and Environmental Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, Xiangshan, Beijing, China
|
26
|
Sharon I, Birkland A, Chang K, El-Yaniv R, Yona G. Correcting BLAST e-values for low-complexity segments. J Comput Biol 2008; 12:980-1003. [PMID: 16201917 DOI: 10.1089/cmb.2005.12.980] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 01/09/2023] Open
Abstract
The statistical estimates of BLAST and PSI-BLAST are of extreme importance to determine the biological relevance of sequence matches. While being very effective in evaluating most matches, these estimates usually overestimate the significance of matches in the presence of low complexity segments. In this paper, we present a model, based on divergence measures and statistics of the alignment structure, that corrects BLAST e-values for low complexity sequences without filtering or excluding them and generates scores that are more effective in distinguishing true similarities from chance similarities. We evaluate our method and compare it to other known methods using the Gene Ontology (GO) knowledge resource as a benchmark. Various performance measures, including ROC analysis, indicate that the new model improves upon the state of the art. The program is available at biozon.org/ftp/ and www.cs.technion.ac.il/~itaish/lowcomp/.
Affiliation(s)
- Itai Sharon
- Department of Computer Science, Technion, Haifa, Israel
|
27
|
Bannen RM, Bingman CA, Phillips GN. Effect of low-complexity regions on protein structure determination. J Struct Funct Genomics 2008; 8:217-26. [PMID: 18302007 DOI: 10.1007/s10969-008-9039-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 08/03/2007] [Accepted: 02/05/2008] [Indexed: 11/24/2022]
Abstract
It has been previously shown that protein sequences containing a quasi-repetitive assortment of amino acids are common in genomes and databases such as Swiss-Prot but are under-represented in the structure-based Protein Data Bank (PDB). Structural genomics groups have been using the absence of these "low-complexity" sequences for several years as a way to select proteins that have a good chance of successful structure determination. In this study, we examine the data deposited in the PDB as well as the available data from structural genomics groups in TargetDB and PepcDB to reveal interesting trends that could be taken into consideration when using low-complexity sequences as part of the target selection process.
Affiliation(s)
- Ryan M Bannen
- Department of Biochemistry, University of Wisconsin-Madison, 433 Babcock Drive, Madison, WI 53711, USA
|
28
|
Huntley MA, Clark AG. Evolutionary Analysis of Amino Acid Repeats across the Genomes of 12 Drosophila Species. Mol Biol Evol 2007; 24:2598-609. [PMID: 17602168 DOI: 10.1093/molbev/msm129] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.2] [Indexed: 11/13/2022] Open
Abstract
Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on a phylogeny with a variety of timescales. We show that there is a higher percentage of proteins containing repeats within the Drosophila genus than in most other eukaryotes, including non-Drosophila insects, which makes this collection of species particularly useful for the study of protein repeats. We also find that proteins containing repeats are overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of evolution. We also determine that in general the position of repeats within a protein sequence is non-random, with repeats more often being absent from the middle regions of sequences. Finally, we find evidence to suggest that the presence of repeats is associated with an increase in evolutionary rate across the entire sequence in which they are embedded. With additional evidence to suggest a corresponding elevation in positive selection, we propose that some repeats may be inducing compensatory substitutions in their surrounding sequence.
Affiliation(s)
- Melanie A Huntley
- Department of Molecular Biology and Genetics Cornell University, USA.
|
29
|
Zhang L, Yu S, Cao Y, Wang J, Zuo K, Qin J, Tang K. Distributional gradient of amino acid repeats in plant proteins. Genome 2007; 49:900-5. [PMID: 17036065 DOI: 10.1139/g06-054] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022]
Abstract
A computer-based analysis was conducted to assess the characteristics of amino acid repeats in Arabidopsis and rice. Our analysis showed a negative gradient in amino acid repeat distribution along the direction of translation in plants. Repeat occurrences are obviously associated with position in plant proteins but are not consistent with the corresponding amino acid contents. These repeats are encoded by the mixed synonymous codons rather than the uninterrupted reiterations of a single codon, and both Arabidopsis and rice have gradients in their distribution. Functional investigation showed that these repeat-containing proteins are preferentially involved in transcription regulation and protein ubiquitination but significantly underrepresented in the processes of DNA recombination and DNA replication. These data reveal that the direction-related mutation bias and functional selection have influenced the distribution of amino acid repeats in plants.
Affiliation(s)
- Lida Zhang
- Plant Biotechnology Research Center, Institute of Systems Biology, Shanghai Key Laboratory of Agrobiotechnology, Fudan-SJTU-Nottingham Plant Biotechnology R & D Center, School of Agriculture, Shanghai, China
|
30
|
Romov PA, Li F, Lipke PN, Epstein SL, Qiu WG. Comparative genomics reveals long, evolutionarily conserved, low-complexity islands in yeast proteins. J Mol Evol 2006; 63:415-25. [PMID: 16927006 DOI: 10.1007/s00239-005-0291-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 11/30/2005] [Accepted: 04/27/2006] [Indexed: 01/12/2023]
Abstract
Eukaryotic proteomes abound in low-complexity sequences, including tandem repeats and regions with significantly biased amino acid compositions. We assessed the functional importance of compositionally biased sequences in the yeast proteome using an evolutionary analysis of 2838 orthologous open reading frame (ORF) families from three Saccharomyces species (S. cerevisiae, S. bayanus, and S. paradoxus). Sequence conservation was measured by the amino acid sequence variability and by the ratio of nonsynonymous-to-synonymous nucleotide substitutions (Ka/Ks) between pairs of orthologous ORFs. A total of 1033 ORF families contained one or more long (at least 45 residues), low-complexity islands as defined by a measure based on the Shannon information index. Low-complexity islands were generally less conserved than ORFs as a whole; on average they were 50% more variable in amino acid sequences and 50% higher in Ka/Ks ratios. Fast-evolving low-complexity sequences outnumbered conserved low-complexity sequences by a ratio of 10 to 1. Sequence differences between orthologous ORFs fit well to a selectively neutral Poisson model of sequence divergence. We therefore used the Poisson model to identify conserved low-complexity sequences. ORFs containing the 33 most conserved low-complexity sequences were overrepresented by those encoding nucleic acid binding proteins, cytoskeleton components, and intracellular transporters. While a few conserved low-complexity islands were known functional domains (e.g., DNA/RNA-binding domains), most were uncharacterized. We discuss how comparative genomics of closely related species can be employed further to distinguish functionally important, shorter, low-complexity sequences from the vast majority of such sequences likely maintained by neutral processes.
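As a rough illustration of a Shannon-index-based complexity measure like the one this abstract mentions, the sketch below scores fixed-length windows by the entropy of their residue composition. The 45-residue window follows the length quoted in the abstract; the 2.5-bit cutoff and both function names are assumptions made for the example, and the authors' exact measure may differ.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_windows(seq, window=45, max_entropy=2.5):
    """Start indices of windows whose entropy is at or below max_entropy.

    The max_entropy cutoff is a hypothetical value chosen for illustration.
    """
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) <= max_entropy]

# A 50-residue serine run yields six overlapping 45-residue windows with zero entropy.
print(low_complexity_windows("S" * 50))
```

A homopolymer scores 0 bits, while a window using all 20 amino acids equally approaches log2(20), about 4.32 bits, so lower values mean lower complexity.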
Affiliation(s)
- Philip A Romov
- Department of Computer Science, Hunter College, City University of New York, New York, New York 10021, USA
|
31
|
Abstract
Highly repetitive sequence within proteins is an abundant feature yet is considered by some to be the protein equivalent of "junk DNA." Homopolymer sequences, the most highly repetitive of this group, are typically encoded by trinucleotide repeats at the DNA level. It is thought that many of these sequences are produced by a replicative slippage mechanism. Recent studies suggest that these highly mutable regions within proteins may allow for rapid morphological evolution emerging from the increased variability afforded by such coding structures. However, in a homopolymer, it is difficult to determine if the repeated amino acid is due to slippage at the DNA level or due to selection at the protein level. Here we develop and test a model to detect cases for which the homopolymer tract has clearly been selected for, with no evidence of slippage at the DNA level. The polyserine tract within the phosphatidylserine receptor protein is used as an excellent example of one such case.
Affiliation(s)
- Melanie A Huntley
- Department of Biology, McMaster University, Hamilton, Ontario, Canada
|
32
|
Abstract
Proteins associated with disease and development of the nervous system are thought to contain repetitive, simple sequences. However, genome-wide surveys for simple sequences within proteins have revealed that repetitive peptide sequences are the most frequent shared peptide segments among eukaryotic proteins, including those of Saccharomyces cerevisiae, which has few to no specialized developmental and neurological proteins. It is therefore of interest to determine if these specialized proteins have an excess of simple sequences when compared to other sets of compositionally similar proteins. We have determined the relative abundance of simple sequences within neurological proteins and find no excess of repetitive simple sequence within this class. In fact, polyglutamine repeats that are associated with many neurodegenerative diseases are no more abundant within neurological specialized proteins than within nonneurological collections of proteins. We also examined the codon composition of serine homopolymers to determine what forces may play a role in the evolution of extended homopolymers. Codon type homogeneity tends to be favored, suggesting replicative slippage instead of selection as the main force responsible for producing these homopolymers.
Affiliation(s)
- Melanie A Huntley
- Department of Biology, McMaster University, Hamilton, Ontario L8S 4K1, Canada
|
33
|
Abstract
Protein simple sequences are a subclass of low complexity regions of sequence that are highly enriched in one or a few residue types. Such sequences are common in transcription regulatory proteins, in structural proteins, in proteins involved in nucleic acid interactions, and in mediating protein-protein interactions. Simple sequences of 10 or more residues containing ≥50% of a single residue type are surveyed in this work. Both eukaryote and prokaryote proteomes are investigated with emphasis on the eukaryotes. Very large numbers of such sequences are found in all organisms surveyed. It is found that eukaryotes possess far more simple sequences per protein than do the prokaryotes. Prokaryotes display a linear relationship between number of proteins containing simple sequences and proteome size, whereas it is not clear that such a relationship holds for eukaryotes. Strikingly, it is found that each eukaryote possesses its own unique distribution of simple sequences. Within those distributions it is found that simple sequences enriched in certain residue types are clearly favored, whereas others are just as clearly discriminated against. The preferences observed are not correlated with residue occurrence. An analysis of classes of proteins of known function suggests that simple sequence occurrence and distribution may be related to protein function. Based upon this analysis, the large number of simple sequences found above what would be expected from a simple statistical model, plus the known functional importance of numerous such sequences, it is postulated that eukaryotes have evolved to not only tolerate large numbers of simple sequences but also to require them.
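The survey criterion stated in this abstract (10 or more residues, at least half of them a single residue type) can be sketched as a fixed-window scan. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `find_simple_regions`, the fixed sliding window, and the merging of overlapping windows are all hypothetical choices.

```python
from collections import Counter

def find_simple_regions(protein, window=10, min_fraction=0.5):
    """Return merged half-open (start, end) spans of qualifying windows.

    A window qualifies when its most common residue makes up at least
    min_fraction of the window. Overlapping or adjacent qualifying windows
    are merged, so a merged span is a union of windows and is not
    guaranteed to meet the threshold as a whole.
    """
    regions = []
    for start in range(len(protein) - window + 1):
        counts = Counter(protein[start:start + window])
        top_count = counts.most_common(1)[0][1]
        if top_count / window >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Extend the previous span instead of opening a new one.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

For example, a protein made of three ordinary residues, a run of twelve glutamines, and a tail of distinct residues yields a single span that covers the run plus the flanking positions pulled in by the window.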
Affiliation(s)
- Kim Lan Sim
- Center for Structural Biology, Department of Molecular and Cellular Biochemistry, University of Kentucky, Lexington, Kentucky 40536-0298, USA
|
34
|
Abstract
Simple sequence is abundant in the proteins that have been sequenced to date. But unusual protein features, such as simple sequence, are not present at the same high frequency within structural databases. A subset of these simple sequences, a group with a highly repetitive nature, has been shown to be abundant in eukaryotes but not in prokaryotes. In this study, an examination of the eukaryotic proteins in the Protein Data Bank (PDB) has revealed a large deficiency of low complexity, highly repetitive protein repeats. Through simulated databases of similar samples of eukaryotic proteins taken from the National Center for Biotechnology Information (NCBI) database, it is shown that the PDB contains significantly less highly repetitive simple sequence than artificial databases of similar composition randomly derived from NCBI. When the structural data for those few PDB sequences that did contain highly repetitive simple sequence are examined in detail, it is found that in most cases the tertiary structure is unknown for the regions consisting of simple sequence. This lack of simple sequence both in the PDB and in the structural information suggests that this type of simple sequence may produce disordered structures that make structural characterization difficult.
Affiliation(s)
- Melanie A Huntley
- Department of Biology, McMaster University, Hamilton, Ontario, Canada
|
35
|
Abstract
Full-sequence data available for Plasmodium falciparum chromosomes 2 and 3 are exploited to perform a statistical analysis of the long tracts of biased amino acid composition that characterize the vast majority of P. falciparum proteins and to make a comparison with similarly defined tracts from other simple eukaryotes. When the relatively minor subset of prevalently hydrophobic segments is discarded from the set of low-complexity segments identified by current segmentation methods in P. falciparum proteins, a good correspondence is found between prevalently hydrophilic low-complexity segments and the species-specific, rapidly diverging insertions detected by multiple-alignment procedures when sequences of bona fide homologs are available. Amino acid preferences are fairly uniform in the set of hydrophilic low-complexity segments identified in the two P. falciparum chromosomes sequenced, as well as in sequenced genes from Plasmodium berghei, but differ from those observed in Saccharomyces cerevisiae and Dictyostelium discoideum. In the two plasmodial species, amino acid frequencies do not correlate with properties such as hydrophilicity, small volume, or flexibility, which might be expected to characterize residues involved in nonglobular domains but do correlate with A-richness in codons. An effect of phenotypic selection versus neutral drift, however, is suggested by the predominance of asparagine over lysine.
|
36
|
Abstract
Full-sequence data available for Plasmodium falciparum chromosomes 2 and 3 are exploited to perform a statistical analysis of the long tracts of biased amino acid composition that characterize the vast majority of P. falciparum proteins and to make a comparison with similarly defined tracts from other simple eukaryotes. When the relatively minor subset of prevalently hydrophobic segments is discarded from the set of low-complexity segments identified by current segmentation methods in P. falciparum proteins, a good correspondence is found between prevalently hydrophilic low-complexity segments and the species-specific, rapidly diverging insertions detected by multiple-alignment procedures when sequences of bona fide homologs are available. Amino acid preferences are fairly uniform in the set of hydrophilic low-complexity segments identified in the two P. falciparum chromosomes sequenced, as well as in sequenced genes from Plasmodium berghei, but differ from those observed in Saccharomyces cerevisiae and Dictyostelium discoideum. In the two plasmodial species, amino acid frequencies do not correlate with properties such as hydrophilicity, small volume, or flexibility, which might be expected to characterize residues involved in nonglobular domains but do correlate with A-richness in codons. An effect of phenotypic selection versus neutral drift, however, is suggested by the predominance of asparagine over lysine.
Affiliation(s)
- E Pizzi
- Laboratorio di Biologia Cellulare, Istituto Superiore di Sanità, 00161 Rome, Italy
|
37
|
Katti MV, Sami-Subbu R, Ranjekar PK, Gupta VS. Amino acid repeat patterns in protein sequences: their diversity and structural-functional implications. Protein Sci 2000; 9:1203-9. [PMID: 10892812 PMCID: PMC2144659 DOI: 10.1110/ps.9.6.1203] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Abstract
All the protein sequences from SWISS-PROT database were analyzed for occurrence of single amino acid repeats, tandem oligo-peptide repeats, and periodically conserved amino acids. Single amino acid repeats of glutamine, serine, glutamic acid, glycine, and alanine seem to be tolerated to a considerable extent in many proteins. Tandem oligo-peptide repeats of different types with varying levels of conservation were detected in several proteins and found to be conspicuous, particularly in structural and cell surface proteins. It appears that repeated sequence patterns may be a mechanism that provides regular arrays of spatial and functional groups, useful for structural packing or for one to one interactions with target molecules. To facilitate further explorations, a database of Tandem Repeats in Protein Sequences (TRIPS) has been developed and is available at URL: http://www.ncl-india.org/trips.
Affiliation(s)
- M V Katti
- Division of Biochemical Sciences, National Chemical Laboratory, Pune, India
|