1. Steczkiewicz K, Kossakowski A, Janik S, Muszewska A. Low-complexity regions in fungi display functional groups and are depleted in positively charged amino acids. NAR Genom Bioinform 2025; 7:lqaf014. [PMID: 40041205 PMCID: PMC11878562 DOI: 10.1093/nargab/lqaf014] [Received: 07/31/2024] [Revised: 01/29/2025] [Accepted: 02/20/2025] [Indexed: 03/06/2025]
Abstract
Reports on the diversity and occurrence of low-complexity regions (LCRs) in Eukaryota are limited. Some studies have provided a more extensive characterization of LCR proteins in prokaryotes. There is a growing body of knowledge about a plethora of biological functions attributable to LCRs. However, it is hard to determine to what extent observed phenomena apply to fungi since most studies of fungal LCRs were limited to model yeasts. To fill this gap, we performed a survey of LCRs in proteins across all branches of the fungal tree of life. We show that the abundance of LCRs and the abundance of proteins with LCRs are positively correlated with proteome size. We observed that most LCRs are present in proteins with protein domains but do not overlap with the domain regions. LCRs are associated with many duplicated protein domains. The quantity of particular amino acids in LCRs deviates from the background frequency with a clear over-representation of amino acids with functional groups and a negative charge. Moreover, we discovered that each lineage of fungi favors distinct LCR expansions. Early diverging fungal lineages differ in LCR abundance and composition, pointing to a different evolutionary trajectory for each fungal group.
Affiliation(s)
- Kamil Steczkiewicz: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5A, 02-106 Warsaw, Poland
- Aleksander Kossakowski: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5A, 02-106 Warsaw, Poland
- Stanisław Janik: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5A, 02-106 Warsaw, Poland; Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Stefana Banacha 2, 02-097 Warsaw, Poland
- Anna Muszewska: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5A, 02-106 Warsaw, Poland
2. Bonet DF, Ranyai S, Aswad L, Lane DP, Arsenian-Henriksson M, Landreh M, Lama D. AlphaFold with conformational sampling reveals the structural landscape of homorepeats. Structure 2024; 32:2160-2167.e2. [PMID: 39299235 DOI: 10.1016/j.str.2024.08.016] [Received: 03/04/2024] [Revised: 06/27/2024] [Accepted: 08/23/2024] [Indexed: 09/22/2024]
Abstract
Homorepeats are motifs with reiterations of the same amino acid. They are prevalent in proteins associated with diverse physiological functions but also linked to several pathologies. Structural characterization of homorepeats has remained largely elusive, primarily because they generally occur in the disordered regions of proteins. Here, we address this subject by combining structures derived from machine learning with conformational sampling through physics-based simulations. We find that hydrophobic homorepeats have a tendency to fold into structured secondary conformations, while hydrophilic ones predominantly exist in unstructured states. Our data show that the flexibility rendered by disorder is a critical component besides the chemical feature that drives homorepeat composition toward hydrophilicity. The formation of regular secondary structures also influences their solubility, as pathologically relevant homorepeats display a direct correlation between repeat expansion, induction of helicity, and self-assembly. Our study provides critical insights into the conformational landscape of protein homorepeats and their structure-activity relationship.
Affiliation(s)
- David Fernandez Bonet: Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, Tomtebodavägen 23A, SE-171 65 Solna, Sweden
- Shahrayar Ranyai: Department of Chemical Engineering, KTH Royal Institute of Technology, Teknikringen 42, SE-114 28 Stockholm, Sweden
- Luay Aswad: Clinical Proteomics Mass Spectrometry, Department of Oncology-Pathology, Karolinska Institutet, Science for Life Laboratory, Tomtebodavagen 23A, SE-171 65 Solna, Sweden
- David P Lane: Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Biomedicum, SE-171 65 Stockholm, Sweden
- Marie Arsenian-Henriksson: Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Biomedicum, SE-171 65 Stockholm, Sweden
- Michael Landreh: Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Biomedicum, SE-171 65 Stockholm, Sweden; Department of Cell and Molecular Biology, Uppsala University, SE-751 24 Uppsala, Sweden
- Dilraj Lama: Department of Microbiology, Tumor and Cell Biology (MTC), Karolinska Institutet, Biomedicum, SE-171 65 Stockholm, Sweden
3. Dovidchenko NV, Lobanov MY, Galzitskaya OV. Is there a bias in the codon frequency corresponding to homo-repeats found in human proteins? Biosystems 2024; 246:105357. [PMID: 39442908 DOI: 10.1016/j.biosystems.2024.105357] [Received: 06/28/2024] [Revised: 09/30/2024] [Accepted: 10/20/2024] [Indexed: 10/25/2024]
Abstract
It is well known that there is a codon usage bias in genomes, that is, some codons are observed more often than others. Codons implicated in the homo-repeat regions in human proteins are no exception. In this work, we analyzed the codon usage bias for all amino acid residues in homo-repeats larger than 4 in 3753 human proteins from 20447 protein sequences from the canonically reviewed human proteome. We have discovered that almost all homo-repeats in the human proteome, most of which encode Ala, Glu, Gly, Leu, Pro, and Ser (∼80% of all homo-repeats), have a codon usage bias, i.e. are mainly encoded by one codon. Moreover, there is a strong shift in homo-repeats in favor of the content of GC-rich codons. Homo-repeats with Ala, Glu, Gly, Leu, Pro, and Ser predominate in the PDB, which has both ordered and disordered status. Examining the distribution of splicing sites, we found that about 15% of homo-repeats either contain or are located within 10 nucleotides of the splicing site, and Glu and Leu predominate in these homo-repeats. Our data are important for future study of the functions of homo-repeats, protein-protein interactions, and evolutionary fitness.
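The kind of analysis described in this abstract can be illustrated with a minimal sketch. It assumes that "homo-repeats larger than 4" means uninterrupted runs of at least five identical residues, and that the coding sequence is available in frame with the protein; the function names `find_homorepeats` and `codon_usage` are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def find_homorepeats(sequence, min_len=5):
    """Return (residue, start, length) for each run of >= min_len identical residues.

    Assumption: a homo-repeat is a perfect, uninterrupted run; start is 0-based.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_len - 1))
    return [(m.group(1), m.start(), m.end() - m.start())
            for m in pattern.finditer(sequence)]


def codon_usage(cds, start, length):
    """Count the codons encoding residues start .. start + length - 1.

    Assumption: cds is the in-frame coding DNA, so residue i maps to cds[3*i:3*i+3].
    """
    codons = [cds[3 * i:3 * i + 3] for i in range(start, start + length)]
    return Counter(codons)
```

For a toy protein `MAAAAAS` encoded by `ATGGCCGCCGCCGCGGCCTCT`, `find_homorepeats` reports one alanine run starting at residue 1, and `codon_usage` shows that four of its five codons are `GCC`, which is the sort of single-codon dominance the abstract refers to.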
Affiliation(s)
- Nikita V Dovidchenko: Gamaleya Research Centre of Epidemiology and Microbiology, 123098, Moscow, Russia; Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- Mikhail Yu Lobanov: Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
- Oxana V Galzitskaya: Gamaleya Research Centre of Epidemiology and Microbiology, 123098, Moscow, Russia; Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia; Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia
4. Teekas L, Sharma S, Vijay N. Terminal regions of a protein are a hotspot for low complexity regions and selection. Open Biol 2024; 14:230439. [PMID: 38862022 DOI: 10.1098/rsob.230439] [Received: 12/01/2023] [Accepted: 05/13/2024] [Indexed: 06/13/2024]
Abstract
Volatile low complexity regions (LCRs) are a novel source of adaptive variation, functional diversification and evolutionary novelty. An interplay of selection and mutation governs the composition and length of low complexity regions. High %GC and mutations provide length variability because of mechanisms like replication slippage. Owing to the complex dynamics between selection and mutation, we need a better understanding of their coexistence. Our findings underscore that positively selected sites (PSS) and low complexity regions prefer the terminal regions of genes, co-occurring in most Tetrapoda clades. We observed that positively selected sites within a gene have position-specific roles. Central-positively selected site genes primarily participate in defence responses, whereas terminal-positively selected site genes exhibit non-specific functions. Low complexity region-containing genes in the Tetrapoda clade exhibit a significantly higher %GC and lower ω (dN/dS: non-synonymous substitution rate/synonymous substitution rate) compared with genes without low complexity regions. This lower ω implies that despite providing rapid functional diversity, low complexity region-containing genes are subjected to intense purifying selection. Furthermore, we observe that low complexity regions consistently display ubiquitous prevalence at lower purity levels, but exhibit a preference for specific positions within a gene as the purity of the low complexity region stretch increases, implying a composition-dependent evolutionary role. Our findings collectively contribute to the understanding of how genetic diversity and adaptation are shaped by the interplay of selection and low complexity regions in the Tetrapoda clade.
Affiliation(s)
- Lokdeep Teekas: Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India
- Sandhya Sharma: Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India
- Nagarjun Vijay: Computational Evolutionary Genomics Lab, Department of Biological Sciences, IISER Bhopal, Bhauri, Madhya Pradesh, India
5. Elena-Real CA, Mier P, Sibille N, Andrade-Navarro MA, Bernadó P. Structure-function relationships in protein homorepeats. Curr Opin Struct Biol 2023; 83:102726. [PMID: 37924569 DOI: 10.1016/j.sbi.2023.102726] [Received: 07/25/2023] [Revised: 10/06/2023] [Accepted: 10/09/2023] [Indexed: 11/06/2023]
Abstract
Homorepeats (or polyX), protein segments containing repetitions of the same amino acid, are abundant in proteomes from all kingdoms of life and are involved in crucial biological functions as well as several neurodegenerative and developmental diseases. Mainly inserted in disordered segments of proteins, the structure/function relationships of homorepeats remain largely unexplored. In this review, we summarize present knowledge for the most abundant homorepeats, highlighting the role of the inherent structure and the conformational influence exerted by their flanking regions. Recent experimental and computational methods enable residue-specific investigations of these regions and promise novel structural and dynamic information for this elusive group of proteins. This information should increase our knowledge about the structural bases of phenomena such as liquid-liquid phase separation and trinucleotide repeat disorders.
Affiliation(s)
- Carlos A Elena-Real: Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS. 29 rue de Navacelles, 34090 Montpellier, France. https://twitter.com/carloselenareal
- Pablo Mier: Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz. Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Nathalie Sibille: Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS. 29 rue de Navacelles, 34090 Montpellier, France
- Miguel A Andrade-Navarro: Institute of Organismic and Molecular Evolution, Faculty of Biology, Johannes Gutenberg University Mainz. Hans-Dieter-Hüsch-Weg 15, 55128 Mainz, Germany
- Pau Bernadó: Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS. 29 rue de Navacelles, 34090 Montpellier, France
6. Barbosa Pereira PJ, Manso JA, Macedo-Ribeiro S. The structural plasticity of polyglutamine repeats. Curr Opin Struct Biol 2023; 80:102607. [PMID: 37178477 DOI: 10.1016/j.sbi.2023.102607] [Received: 03/14/2023] [Revised: 04/11/2023] [Accepted: 04/12/2023] [Indexed: 05/15/2023]
Abstract
From yeast to humans, polyglutamine (polyQ) repeat tracts are found frequently in the proteome and are particularly prominent in the activation domains of transcription factors. PolyQ is a polymorphic motif that modulates functional protein-protein interactions and aberrant self-assembly. Expansion of the polyQ repeated sequences beyond critical physiological repeat length thresholds triggers self-assembly and is linked to severe pathological implications. This review provides an overview of the current knowledge on the structures of polyQ tracts in the soluble and aggregated states and discusses the influence of neighboring regions on polyQ secondary structure, aggregation, and fibril morphologies. The influence of the genetic context of the polyQ-encoding trinucleotides is briefly discussed as a challenge for future endeavors in this field.
Affiliation(s)
- Pedro José Barbosa Pereira: IBMC - Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, 4200-135, Porto, Portugal
- José A Manso: IBMC - Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, 4200-135, Porto, Portugal
- Sandra Macedo-Ribeiro: IBMC - Instituto de Biologia Molecular e Celular, Universidade do Porto, 4200-135, Porto, Portugal; Instituto de Investigação e Inovação em Saúde, Universidade do Porto, 4200-135, Porto, Portugal
7. Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid modeling and training of deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
8. Ito Y, Chadani Y, Niwa T, Yamakawa A, Machida K, Imataka H, Taguchi H. Nascent peptide-induced translation discontinuation in eukaryotes impacts biased amino acid usage in proteomes. Nat Commun 2022; 13:7451. [PMID: 36460666 PMCID: PMC9718836 DOI: 10.1038/s41467-022-35156-x] [Received: 02/04/2022] [Accepted: 11/18/2022] [Indexed: 12/04/2022]
Abstract
Robust translation elongation of any given amino acid sequence is required to shape proteomes. Nevertheless, nascent peptides occasionally destabilize ribosomes, since consecutive negatively charged residues in bacterial nascent chains can stochastically induce discontinuation of translation, in a phenomenon termed intrinsic ribosome destabilization (IRD). Here, using budding yeast and a human factor-based reconstituted translation system, we show that IRD also occurs in eukaryotic translation. Nascent chains enriched in aspartic acid (D) or glutamic acid (E) in their N-terminal regions alter canonical ribosome dynamics, stochastically aborting translation. Although eukaryotic ribosomes are more robust to ensure uninterrupted translation, we find many endogenous D/E-rich peptidyl-tRNAs in the N-terminal regions in cells lacking a peptidyl-tRNA hydrolase, indicating that the translation of the N-terminal D/E-rich sequences poses an inherent risk of failure. Indeed, a bioinformatics analysis reveals that the N-terminal regions of ORFs lack D/E enrichment, implying that the translation defect partly restricts the overall amino acid usage in proteomes.
Affiliation(s)
- Yosuke Ito: School of Life Science and Technology, Tokyo Institute of Technology, Yokohama, 226-8503 Japan
- Yuhei Chadani: Cell Biology Center, Institute of Innovative Research, Tokyo Institute of Technology, Yokohama, 226-8503 Japan
- Tatsuya Niwa: School of Life Science and Technology, Tokyo Institute of Technology, Yokohama, 226-8503 Japan; Cell Biology Center, Institute of Innovative Research, Tokyo Institute of Technology, Yokohama, 226-8503 Japan
- Ayako Yamakawa: School of Life Science and Technology, Tokyo Institute of Technology, Yokohama, 226-8503 Japan
- Kodai Machida: Graduate School of Engineering, University of Hyogo, Himeji, Hyogo 671-2280 Japan
- Hiroaki Imataka: Graduate School of Engineering, University of Hyogo, Himeji, Hyogo 671-2280 Japan
- Hideki Taguchi: School of Life Science and Technology, Tokyo Institute of Technology, Yokohama, 226-8503 Japan; Cell Biology Center, Institute of Innovative Research, Tokyo Institute of Technology, Yokohama, 226-8503 Japan
9. Mier P, Elena-Real CA, Cortés J, Bernadó P, Andrade-Navarro MA. The sequence context in poly-alanine regions: structure, function and conservation. Bioinformatics 2022; 38:4851-4858. [PMID: 36106994 PMCID: PMC9620824 DOI: 10.1093/bioinformatics/btac610] [Received: 01/05/2022] [Revised: 07/07/2022] [Accepted: 09/05/2022] [Indexed: 11/24/2022]
Abstract
MOTIVATION: Poly-alanine (polyA) regions are protein stretches mostly composed of alanines. Despite their abundance in eukaryotic proteomes and their association with nine inherited human diseases, the structural and functional roles exerted by polyA stretches remain poorly understood. In this work we study how the amino acid context in which polyA regions are settled in proteins influences their structure and function. RESULTS: We identified glycine and proline as the most abundant amino acids within polyA and in the flanking regions of polyA tracts, in human proteins as well as in 17 additional eukaryotic species. Our analyses indicate that the non-structuring nature of these two amino acids influences the α-helical conformations predicted for polyA, suggesting a relevant role in reducing the inherent aggregation propensity of long polyA. Then, we show how polyA position in protein N-termini relates to their function as transit peptides. PolyA placed just after the initial methionine is often predicted as part of mitochondrial transit peptides, whereas when placed in downstream positions, polyA are part of signal peptides. A few examples from known structures suggest that short polyA can emerge by alanine substitutions in α-helices, but evolution by insertion is observed for longer polyA. Our results showcase the importance of studying the sequence context of homorepeats as a mechanism to shape their structure-function relationships. AVAILABILITY AND IMPLEMENTATION: The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pablo Mier: Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany
- Carlos A Elena-Real: Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS, 34090 Montpellier, France
- Juan Cortés: LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
- Pau Bernadó: Centre de Biologie Structurale (CBS), Université de Montpellier, INSERM, CNRS, 34090 Montpellier, France
- Miguel A Andrade-Navarro: Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University Mainz, 55128 Mainz, Germany
10. Cascarina SM, Ross ED. Expansion and functional analysis of the SR-related protein family across the domains of life. RNA (New York, N.Y.) 2022; 28:1298-1314. [PMID: 35863866 PMCID: PMC9479744 DOI: 10.1261/rna.079170.122] [Received: 03/22/2022] [Accepted: 06/29/2022] [Indexed: 06/15/2023]
Abstract
Serine/arginine-rich (SR) proteins comprise a family of proteins that is predominantly found in eukaryotes and plays a prominent role in RNA splicing. A characteristic feature of SR proteins is the presence of an S/R-rich low-complexity domain (RS domain), often in conjunction with spatially distinct RNA recognition motifs (RRMs). To date, 52 human proteins have been classified as SR or SR-related proteins. Here, using an unbiased series of composition criteria together with enrichment for known RNA binding activity, we identified >100 putative SR-related proteins in the human proteome. This method recovers known SR and SR-related proteins with high sensitivity (∼94%), yet identifies a number of additional proteins with many of the hallmark features of true SR-related proteins. Newly identified SR-related proteins display slightly different amino acid compositions yet similar levels of post-translational modification, suggesting that these new SR-related candidates are regulated in vivo and functionally important. Furthermore, candidate SR-related proteins with known RNA-binding activity (but not currently recognized as SR-related proteins) are nevertheless strongly associated with a variety of functions related to mRNA splicing and nuclear speckles. Finally, we applied our SR search method to all available reference proteomes, and provide maps of RS domains and Pfam annotations for all putative SR-related proteins as a resource. Together, these results expand the set of SR-related proteins in humans, and identify the most common functions associated with SR-related proteins across all domains of life.
Affiliation(s)
- Sean M Cascarina: Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, Colorado 80523, USA
- Eric D Ross: Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, Colorado 80523, USA
11. Mier P, Andrade-Navarro MA. Regions with two amino acids in protein sequences: a step forward from homorepeats into the low complexity landscape. Comput Struct Biotechnol J 2022; 20:5516-5523. [PMID: 36249567 PMCID: PMC9550522 DOI: 10.1016/j.csbj.2022.09.011] [Received: 07/14/2022] [Revised: 09/07/2022] [Accepted: 09/07/2022] [Indexed: 11/17/2022]
Abstract
Low complexity regions (LCRs) differ in amino acid composition from the background provided by the corresponding proteomes. The simplest LCRs are homorepeats (or polyX), regions composed of mostly-one amino acid type. Extensive research has been done to characterize homorepeats, and their taxonomic, functional and structural features depend on the amino acid type and sequence context. From them, the next step towards the study of LCRs are the regions composed of two types of amino acids, which we call polyXY. We classify polyXY in three categories based on the arrangement of the two amino acid types ‘X’ and ‘Y’: direpeats (e.g. ‘XYXYXY’), joined (e.g. ‘XXXYYY’) and shuffled (e.g. ‘XYYXXY’). We developed a script to search for polyXY, and located them in a comprehensive set of 20,340 reference proteomes. These results are available in a dedicated web server called XYs, in which the user can also submit their own protein datasets to detect polyXY. We studied the distribution of polyXY types by amino acid pair XY and category, and show that polyXY in Eukaryota are mainly located within intrinsically disordered regions. Our study provides a first step towards the characterization of polyXY as protein motifs.
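The three polyXY categories named in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions: the region contains exactly two residue types with no impurities, a direpeat alternates strictly, and a joined region is one block of X followed by one block of Y. The function `classify_polyxy` is hypothetical and is not the script or the XYs server described by the authors.

```python
def classify_polyxy(region):
    """Classify a two-residue region as 'direpeat', 'joined' or 'shuffled'.

    Assumption: region is a pure polyXY stretch (exactly two residue types).
    """
    if len(set(region)) != 2:
        raise ValueError("region must contain exactly two amino acid types")
    # Count the positions where a residue differs from the one before it.
    changes = sum(1 for a, b in zip(region, region[1:]) if a != b)
    if changes == len(region) - 1:
        return "direpeat"   # strict alternation, e.g. XYXYXY
    if changes == 1:
        return "joined"     # one block of X next to one block of Y, e.g. XXXYYY
    return "shuffled"       # any other arrangement, e.g. XYYXXY
```

Applied to the abstract's own patterns with X = Q and Y = A, `classify_polyxy("QAQAQA")` gives `"direpeat"`, `classify_polyxy("QQQAAA")` gives `"joined"`, and `classify_polyxy("QAAQQA")` gives `"shuffled"`.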
Affiliation(s)
- Pablo Mier (corresponding author): Hanns-Dieter-Hüsch-Weg 15 55118 Mainz (Germany)
12. Chakrabarty B, Parekh N. DbStRiPs: Database of structural repeats in proteins. Protein Sci 2022; 31:23-36. [PMID: 33641184 PMCID: PMC8740836 DOI: 10.1002/pro.4052] [Received: 11/25/2020] [Revised: 02/11/2021] [Accepted: 02/15/2021] [Indexed: 01/03/2023]
Abstract
Recent interest in repeat proteins has arisen due to the stable structural folds, high evolutionary conservation and repertoire of functions provided by these proteins. However, repeat proteins are poorly characterized because of high sequence variation between repeating units, so structure-based identification and classification of repeats is desirable. Using a robust network-based pipeline, manual curation and Kajava's structure-based classification schema, we have developed a database of tandem structural repeats, Database of Structural Repeats in Proteins (DbStRiPs). A unique feature of this database is that available knowledge on sequence repeat families is incorporated by mapping the Pfam classification scheme onto the structural classification. Integration of sequence and structure-based classifications helps in identifying different functional groups within the same structural subclass, leading to refinement in the annotation of repeat proteins. Analysis of the complete Protein Data Bank revealed 16,472 repeat annotations in 15,141 protein chains, one previously uncharacterized novel protein repeat family (PRF), named left-handed beta helix, and 33 protein repeat clusters (PRCs). Based on their unique structural motif, ~79% of these repeat proteins are classified in one of the 14 PRFs or 33 PRCs, and the remaining are grouped as unclassified repeat proteins. Each repeat protein is provided with a detailed annotation in DbStRiPs that includes start and end boundaries of repeating units, copy number, secondary and tertiary structure view, repeat class/subclass, disease association, MSA of repeating units and cross-references to various protein pattern databases, human protein atlas and interaction resources. DbStRiPs provides easy search and download options to high-quality annotations of structural repeat proteins (URL: http://bioinf.iiit.ac.in/dbstrips/).
Affiliation(s)
- Broto Chakrabarty: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
- Nita Parekh: Centre for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
13. Eaton AF, Brown D, Merkulova M. The evolutionary conserved TLDc domain defines a new class of (H+)V-ATPase interacting proteins. Sci Rep 2021; 11:22654. [PMID: 34811399 PMCID: PMC8608904 DOI: 10.1038/s41598-021-01809-y] [Received: 06/09/2021] [Accepted: 11/02/2021] [Indexed: 01/26/2023]
Abstract
We recently found that nuclear receptor coactivator 7 (Ncoa7) and Oxr1 interact with the proton-pumping V-ATPase. Ncoa7 and Oxr1 belong to a group of proteins playing a role in the oxidative stress response, that contain the conserved “TLDc” domain. Here we asked if the three other proteins in this family, i.e., Tbc1d24, Tldc1 and Tldc2 also interact with the V-ATPase and if the TLDc domains are involved in all these interactions. By co-immunoprecipitation, endogenous kidney Tbc1d24 (and Ncoa7 and Oxr1) and overexpressed Tldc1 and Tldc2, all interacted with the V-ATPase. In addition, purified TLDc domains of Ncoa7, Oxr1 and Tldc2 (but not Tbc1d24 or Tldc1) interacted with V-ATPase in GST pull-downs. At the amino acid level, point mutations G815A, G845A and G896A in conserved regions of the Ncoa7 TLDc domain abolished interaction with the V-ATPase, and S817A, L926A and E938A mutations resulted in decreased interaction. Furthermore, poly-E motifs upstream of the TLDc domain in Ncoa7 and Tldc2 show a (nonsignificant) trend towards enhancing the interaction with V-ATPase. Our principal finding is that all five members of the TLDc family of proteins interact with the V-ATPase. We conclude that the TLDc motif defines a new class of V-ATPase interacting regulatory proteins.
Collapse
Affiliation(s)
- A F Eaton
- Program in Membrane Biology and Division of Nephrology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA
| | - D Brown
- Program in Membrane Biology and Division of Nephrology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA
| | - M Merkulova
- Program in Membrane Biology and Division of Nephrology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA. .,Program in Membrane Biology and Division of Nephrology, Massachusetts General Hospital, Simches Research Center, 128 Cambridge St., Boston, MA, 02114, USA.
|
14
|
Mier P, Andrade-Navarro MA. Between Interactions and Aggregates: The PolyQ Balance. Genome Biol Evol 2021; 13:evab246. [PMID: 34791220 PMCID: PMC8763233 DOI: 10.1093/gbe/evab246] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Accepted: 10/27/2021] [Indexed: 11/17/2022] Open
Abstract
Polyglutamine (polyQ) regions are highly abundant consecutive runs of glutamine residues. They have been generally studied in relation to the so-called polyQ-associated diseases, characterized by protein aggregation caused by the expansion of the polyQ tract via a CAG-slippage mechanism. However, more than 4,800 human proteins contain a polyQ, and only nine of these regions are known to be associated with disease. Computational sequence studies and experimental structure determinations are completing a more interesting picture in which polyQ emerge as a motif for modulation of protein-protein interactions. But long polyQ regions may lead to an excess of interactions, and produce aggregates. Within this mechanistic perspective of polyQ function and malfunction, we discuss polyQ definition and properties such as variable codon usage, sequence and context structure imposition, functional relevance, evolutionary patterns in species-centered analyses, and open resources.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University of Mainz, Mainz, Germany
| | - Miguel A Andrade-Navarro
- Faculty of Biology, Institute of Organismic and Molecular Evolution, Johannes Gutenberg University of Mainz, Mainz, Germany
|
15
|
Deryusheva EI, Machulin AV, Galzitskaya OV. Structural, Functional, and Evolutionary Characteristics of Proteins with Repeats. Mol Biol 2021. [DOI: 10.1134/s0026893321040038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
16
|
Cascarina SM, King DC, Osborne Nishimura E, Ross ED. LCD-Composer: an intuitive, composition-centric method enabling the identification and detailed functional mapping of low-complexity domains. NAR Genom Bioinform 2021; 3:lqab048. [PMID: 34056598 PMCID: PMC8153834 DOI: 10.1093/nargab/lqab048] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/10/2020] [Revised: 04/13/2021] [Accepted: 05/06/2021] [Indexed: 02/07/2023] Open
Abstract
Low complexity domains (LCDs) in proteins are regions predominantly composed of a small subset of the possible amino acids. LCDs are involved in a variety of normal and pathological processes across all domains of life. Existing methods define LCDs using information-theoretical complexity thresholds, sequence alignment with repetitive regions, or statistical overrepresentation of amino acids relative to whole-proteome frequencies. While these methods have proven valuable, they are all indirectly quantifying amino acid composition, which is the fundamental and biologically-relevant feature related to protein sequence complexity. Here, we present a new computational tool, LCD-Composer, that directly identifies LCDs based on amino acid composition and linear amino acid dispersion. Using LCD-Composer's default parameters, we identified simple LCDs across all organisms available through UniProt and provide the resulting data in an accessible form as a resource. Furthermore, we describe large-scale differences between organisms from different domains of life and explore organisms with extreme LCD content for different LCD classes. Finally, we illustrate the versatility and specificity achievable with LCD-Composer by identifying diverse classes of LCDs using both simple and multifaceted composition criteria. We demonstrate that the ability to dissect LCDs based on these multifaceted criteria enhances the functional mapping and classification of LCDs.
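The idea of identifying LCDs directly from amino acid composition can be illustrated with a simple sliding-window scan. This is only a rough sketch and not LCD-Composer's actual algorithm: it ignores the linear dispersion criterion entirely, and the function name, window size, and threshold are illustrative assumptions rather than the tool's defaults.

```python
def composition_windows(seq, residue, window=20, min_fraction=0.4):
    """Return merged (start, end) spans (0-based, half-open) in which
    `residue` makes up at least `min_fraction` of a sliding window.

    Hypothetical helper for illustration; parameters are assumptions.
    """
    spans = []
    for start in range(max(len(seq) - window + 1, 0)):
        chunk = seq[start:start + window]
        if chunk.count(residue) / window >= min_fraction:
            end = start + window
            if spans and start <= spans[-1][1]:
                # Window overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], end)
            else:
                spans.append((start, end))
    return spans


# Toy sequence with a 12-residue glutamine tract at positions 5..16.
toy = "MKTLV" + "Q" * 12 + "GSDEF"
print(composition_windows(toy, "Q", window=10, min_fraction=0.8))
```

Merging passing windows yields one span per compositionally biased region, which may extend a few residues beyond the tract itself because every window that meets the threshold is included.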
Affiliation(s)
- Sean M Cascarina
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
| | - David C King
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
| | - Erin Osborne Nishimura
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
| | - Eric D Ross
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
|
17
|
Moving beyond disease to function: Physiological roles for polyglutamine-rich sequences in cell decisions. Curr Opin Cell Biol 2021; 69:120-126. [PMID: 33610098 DOI: 10.1016/j.ceb.2021.01.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/10/2020] [Revised: 12/18/2020] [Accepted: 01/12/2021] [Indexed: 12/17/2022]
Abstract
Glutamine-rich tracts, also known as polyQ domains, have received a great deal of attention for their role in multiple neurodegenerative diseases, including Huntington's disease (HD), spinocerebellar ataxia (SCA), and others [22], [27]. Expansions in the normal polyQ tracts are thus commonly linked to disease, but polyQ domains themselves play multiple important functional roles in cells that are being increasingly appreciated. The biochemical nature of these domains allows them to adopt a number of different structures and form large assemblies that enable environmental responsiveness, localized signaling, and cellular memory. In many cases, these involve the formation of condensates that have varied material states. In this review, we highlight known and emerging functional roles for polyQ tracts in normal cell physiology.
|
18
|
Hernandez R, Facelli JC. Structure analysis of the proteins associated with polyA repeat expansion disorders. J Biomol Struct Dyn 2021; 40:5556-5565. [PMID: 33459170 PMCID: PMC8286276 DOI: 10.1080/07391102.2021.1871957] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Abstract
Repeat regions are low-complexity regions in the human genome that largely code for intrinsic disorder in proteins. Expansions outside the normal thresholds in repeat regions are likely to be pathogenic, leading to the so-called repeat expansion diseases. There have been numerous studies on the most common group of repeat expansion diseases, which are the polyglutamine (polyQ) repeat expansion diseases, but there has been much less work done on the second-largest group of expansion repeats disorders, which involves the expansion of polyalanine (polyA) repeat tracts. In this article, we present a comprehensive study of the structural changes predicted using I-TASSER when comparing the wild type and enlarged structures of all known polyA expansion disorders. The results show that there is a reduction in α helices, an increase in extended strands in parallel and/or anti-parallel β-sheet conformation, an increase in random coils/loops and irregular elements, and a large increase in solvent-accessible surface area. When compared to the findings in polyQ expansions disorders we see similar trends, suggesting that the polyQ and polyA repeat expansion causes similar effects on the respective proteins, which lead to higher misfolding and aggregation propensities. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Rolando Hernandez
- Department of Biomedical Informatics, The University of Utah, Salt Lake City, UT, USA
| | - Julio C Facelli
- Department of Biomedical Informatics, The University of Utah, Salt Lake City, UT, USA; Center for Clinical and Translational Science, The University of Utah, Salt Lake City, UT, USA
|
19
|
Wang Y, Yang HJ, Harrison PM. The relationship between protein domains and homopeptides in the Plasmodium falciparum proteome. PeerJ 2020; 8:e9940. [PMID: 33062426 PMCID: PMC7534687 DOI: 10.7717/peerj.9940] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2020] [Accepted: 08/24/2020] [Indexed: 12/03/2022] Open
Abstract
The proteome of the malaria parasite Plasmodium falciparum is notable for the pervasive occurrence of homopeptides or low-complexity regions (i.e., regions that are made from a small subset of amino-acid residue types). The most prevalent of these are made from residues encoded by adenine/thymidine (AT)-rich codons, in particular asparagine. We examined homopeptide occurrences within protein domains in P. falciparum. Homopeptide enrichments occur for hydrophobic (e.g., valine), or small residues (alanine or glycine) in short spans (<5 residues), but these enrichments disappear for longer lengths. We observe that short asparagine homopeptides (<10 residues long) have a dramatic relative depletion inside protein domains, indicating some selective constraint to keep them from forming. We surmise that this is possibly linked to co-translational protein folding, although there are specific protein domains that are enriched in longer asparagine homopeptides (≥10 residues) indicating a functional linkage for specific poly-asparagine tracts. Top gene ontology functional category enrichments for homopeptides associated with diverse protein domains include “vesicle-mediated transport”, and “DNA-directed 5′-3′ RNA polymerase activity”, with various categories linked to “binding” evidencing significant homopeptide depletions. Also, in general homopeptides are substantially enriched in the parts of protein domains that are near/in IDRs. The implications of these findings are discussed.
|
20
|
Chavali S, Singh AK, Santhanam B, Babu MM. Amino acid homorepeats in proteins. Nat Rev Chem 2020; 4:420-434. [PMID: 37127972 DOI: 10.1038/s41570-020-0204-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Accepted: 06/04/2020] [Indexed: 12/16/2022]
Abstract
Amino acid homorepeats, or homorepeats, are polypeptide segments found in proteins that contain stretches of identical amino acid residues. Although abnormal homorepeat expansions are linked to pathologies such as neurodegenerative diseases, homorepeats are prevalent in eukaryotic proteomes, suggesting that they are important for normal physiology. In this Review, we discuss recent advances in our understanding of the biological functions of homorepeats, which range from facilitating subcellular protein localization to mediating interactions between proteins across diverse cellular pathways. We explore how the functional diversity of homorepeat-containing proteins could be linked to the ability of homorepeats to adopt different structural conformations, an ability influenced by repeat composition, repeat length and the nature of flanking sequences. We conclude by highlighting how an understanding of homorepeats will help us better characterize and develop therapeutics against the human diseases to which they contribute.
Affiliation(s)
- Sreenivas Chavali
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK.
- Department of Biology, Indian Institute of Science Education and Research (IISER) Tirupati, Tirupati, India.
| | - Anjali K Singh
- Department of Biology, Indian Institute of Science Education and Research (IISER) Tirupati, Tirupati, India
| | - Balaji Santhanam
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK
- Department of Structural Biology and Center for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN, USA
| | - M Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge, UK.
- Department of Structural Biology and Center for Data Driven Discovery, St. Jude Children's Research Hospital, Memphis, TN, USA.
|
21
|
Lobanov MY, Likhachev IV, Galzitskaya OV. Disordered Residues and Patterns in the Protein Data Bank. Molecules 2020; 25:molecules25071522. [PMID: 32230759 PMCID: PMC7180803 DOI: 10.3390/molecules25071522] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/06/2020] [Revised: 03/24/2020] [Accepted: 03/25/2020] [Indexed: 01/05/2023] Open
Abstract
We created a new library of disordered patterns and disordered residues in the Protein Data Bank (PDB). To obtain such datasets, we clustered the PDB and obtained the groups of chains with different identities and marked disordered residues. We elaborated a new procedure for finding disordered patterns and created a new version of the library. This library includes three sets of patterns: unique patterns, patterns consisting of two kinds of amino acids, and homo-repeats. Using this database, the user can: (1) find homologues in the entire Protein Data Bank; (2) perform a statistical analysis of disordered residues in protein structures; (3) search for disordered patterns and homo-repeats; (4) search for disordered regions in different chains of the same protein; (5) download clusters of protein chains with different identity from our database and library of disordered patterns; and (6) observe 3D structure interactively using MView. A new library of disordered patterns will help improve the accuracy of predictions for residues that will be structured or unstructured in a given region.
Affiliation(s)
- Mikhail Yu. Lobanov
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
| | - Ilya V. Likhachev
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Institute of Mathematical Problems of Biology, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Vitkevicha str.1, Pushchino, 142290 Moscow, Russia
| | - Oxana V. Galzitskaya
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, Pushchino, 142290 Moscow, Russia
- Correspondence: Tel.: +7-903-675-0156
|
22
|
Atypical structural tendencies among low-complexity domains in the Protein Data Bank proteome. PLoS Comput Biol 2020; 16:e1007487. [PMID: 31986130 PMCID: PMC7004392 DOI: 10.1371/journal.pcbi.1007487] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/10/2019] [Revised: 02/06/2020] [Accepted: 12/23/2019] [Indexed: 11/29/2022] Open
Abstract
A variety of studies have suggested that low-complexity domains (LCDs) tend to be intrinsically disordered and are relatively rare within structured proteins in the Protein Data Bank (PDB). Although LCDs are often treated as a single class, we previously found that LCDs enriched in different amino acids can exhibit substantial differences in protein metabolism and function. Therefore, we wondered whether the structural conformations of LCDs are likewise dependent on which specific amino acids are enriched within each LCD. Here, we directly examined relationships between enrichment of individual amino acids and secondary structure tendencies across the entire PDB proteome. Secondary structure tendencies varied as a function of the identity of the amino acid enriched and its degree of enrichment. Furthermore, divergence in secondary structure profiles often occurred for LCDs enriched in physicochemically similar amino acids (e.g. valine vs. leucine), indicating that LCDs composed of related amino acids can have distinct secondary structure tendencies. Comparison of LCD secondary structure tendencies with numerous pre-existing secondary structure propensity scales resulted in relatively poor correlations for certain types of LCDs, indicating that these scales may not capture secondary structure tendencies as sequence complexity decreases. Collectively, these observations provide a highly resolved view of structural tendencies among LCDs parsed by the nature and magnitude of single amino acid enrichment. The structures that proteins adopt are directly related to their amino acid sequences. Low-complexity domains (LCDs) in protein sequences are unusual regions made up of only a few different types of amino acids. Although this is the key feature that classifies sequences as LCDs, the physical properties of LCDs will differ based on the types of amino acids that are found in each domain. 
For example, the sequences “AAAAAAAAAA”, “EEEEEEEEEE”, and “EEKRKEEEKE” will have very different properties, even though they would all be classified as LCDs by traditional methods. In a previous study, we developed a new method to further divide LCDs into categories that more closely reflect the differences in their physical properties. In this study, we apply that approach to examine the structures of LCDs when sorted into different categories based on their amino acids. This allowed us to define relationships between the types of amino acids in the LCDs and their corresponding structures. Since protein structure is closely related to protein function, this has important implications for understanding the basic functions and properties of LCDs in a variety of proteins.
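The point about the three example sequences can be checked numerically. The sketch below computes Shannon entropy, one common information-theoretical complexity measure; it is an illustration only and not the classification method used in the study. All three sequences score far below the roughly 4.32-bit maximum for a 20-letter alphabet, and the two homopolymers score identically, so a complexity threshold alone cannot tell poly-A from poly-E. Only the composition does.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the residue composition of `seq`, in bits."""
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# The three example sequences quoted in the abstract above.
for s in ("AAAAAAAAAA", "EEEEEEEEEE", "EEKRKEEEKE"):
    print(s, round(shannon_entropy(s), 3), dict(Counter(s)))
```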
|
23
|
Galzitskaya OV, Novikov GS. An Overlap between Splicing Sites in RNA and Homo-Repeats in Human Proteins. Mol Biol 2019. [DOI: 10.1134/s0026893319030063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
24
|
Galzitskaya OV, Novikov GS, Dovidchenko NV, Lobanov MY. Is there codon usage bias for poly-Q stretches in the human proteome? J Bioinform Comput Biol 2019; 17:1950010. [PMID: 30866735 DOI: 10.1142/s0219720019500100] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
We have analyzed codon usage for poly-Q stretches of different lengths for the human proteome. First, we have obtained that all long poly-Q stretches in Protein Data Bank (PDB) belong to the disordered regions. Second, we have found the bias for codon usage for glutamine homo-repeats in the human proteome. In the cases when the same codon is used for poly-Q stretches only CAG triplets are found. Similar results are obtained for human proteins with glutamine homo-repeats associated with diseases. Moreover, for proteins associated with diseases (from the HraDis database), the fraction of proteins for which the same codon is used for glutamine homo-repeats is less (22%) than for proteins from the human proteome (26%). We have demonstrated for poly-Q stretches in the human proteome that in some cases (28) the splicing sites correspond to the homo-repeats and in 11 cases, these sites appear at the C-terminal part of the homo-repeats with statistical significance 10⁻⁸.
Affiliation(s)
- Oxana V Galzitskaya
- * Institute of Protein Research, Russian Academy of Sciences, Institutskaya Str., 4, Pushchino, Moscow Region 142290, Russia
| | - Georgii S Novikov
- † St. Petersburg Academic University, Nanotechnology Research and Education Centre of the Russian Academy of Sciences, St. Petersburg, Khlopina Str., 8/3, 194021, Russia
| | - Nikita V Dovidchenko
- * Institute of Protein Research, Russian Academy of Sciences, Institutskaya Str., 4, Pushchino, Moscow Region 142290, Russia
| | - Mikhail Yu Lobanov
- * Institute of Protein Research, Russian Academy of Sciences, Institutskaya Str., 4, Pushchino, Moscow Region 142290, Russia
|
25
|
Galzitskaya OV, Lobanov MY. Proteome-scale understanding of relationship between homo-repeat enrichments and protein aggregation properties. PLoS One 2018; 13:e0206941. [PMID: 30399196 PMCID: PMC6219797 DOI: 10.1371/journal.pone.0206941] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/06/2018] [Accepted: 10/22/2018] [Indexed: 02/07/2023] Open
Abstract
Expansion of homo-repeats is a molecular basis for human neurological diseases. We are the first who studied the influence of homo-repeats with lengths larger than four amino acid residues on the aggregation properties of 1449683 proteins across 122 eukaryotic and bacterial proteomes. Only 15% of proteins (215481) include homo-repeats of such length. We demonstrated that RNA-binding proteins with a prion-like domain are enriched with homo-repeats in comparison with other non-redundant protein sequences and those in the PDB. We performed a bioinformatics analysis for these proteins and found that proteins with homo-repeats are on average two times longer than those in the whole database. Moreover, we are first to discover that as a rule, homo-repeats appear in proteins not alone but in pairs: hydrophobic and aromatic homo-repeats appear with similar ones, while homo-repeats with small, polar and charged amino acids appear together with different preferences. We elaborated a new complementary approach to demonstrate the influence of homo-repeats on their host protein aggregation properties. We have shown that addition of artificial homo-repeats to natural and random proteins results in intensification of aggregation properties of the proteins. The maximal effect is observed for the insertion of artificial homo-repeats with 5–6 residues, which is consistent with the minimal length of an amyloidogenic region. We have also demonstrated that the ability of proteins with homo-repeats to aggregate cannot be explained only by the presence of long homo-repeats in them. There should be other characteristics of proteins intensifying the aggregation property including such as the appearance of homo-repeats in pairs in the same protein. We are the first who elaborated a new approach to study the influence of homo-repeats present in proteins on their aggregation properties and performed an appropriate analysis of the large number of proteomes and proteins.
Affiliation(s)
- Oxana V. Galzitskaya
- Group of Bioinformatics, Institute of Protein Research, Russian Academy of Science, Pushchino, Moscow Region, Russia
| | - Michail Yu. Lobanov
- Group of Bioinformatics, Institute of Protein Research, Russian Academy of Science, Pushchino, Moscow Region, Russia
|
26
|
Cascarina SM, Ross ED. Proteome-scale relationships between local amino acid composition and protein fates and functions. PLoS Comput Biol 2018; 14:e1006256. [PMID: 30248088 PMCID: PMC6171957 DOI: 10.1371/journal.pcbi.1006256] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 05/30/2018] [Revised: 10/04/2018] [Accepted: 08/16/2018] [Indexed: 11/26/2022] Open
Abstract
Proteins with low-complexity domains continue to emerge as key players in both normal and pathological cellular processes. Although low-complexity domains are often grouped into a single class, individual low-complexity domains can differ substantially with respect to amino acid composition. These differences may strongly influence the physical properties, cellular regulation, and molecular functions of low-complexity domains. Therefore, we developed a bioinformatic approach to explore relationships between amino acid composition, protein metabolism, and protein function. We find that local compositional enrichment within protein sequences is associated with differences in translation efficiency, abundance, half-life, protein-protein interaction promiscuity, subcellular localization, and molecular functions of proteins on a proteome-wide scale. However, local enrichment of related amino acids is sometimes associated with opposite effects on protein regulation and function, highlighting the importance of distinguishing between different types of low-complexity domains. Furthermore, many of these effects are discernible at amino acid compositions below those required for classification as low-complexity or statistically-biased by traditional methods and in the absence of homopolymeric amino acid repeats, indicating that thresholds employed by classical methods may not reflect biologically relevant criteria. Application of our analyses to composition-driven processes, such as the formation of membraneless organelles, reveals distinct composition profiles even for closely related organelles. Collectively, these results provide a unique perspective and detailed insights into relationships between amino acid composition, protein metabolism, and protein functions. Low-complexity domains in protein sequences are regions that are composed of only a few amino acids in the protein “alphabet”. 
These domains often have unique chemical properties and play important biological roles in both normal and disease-related processes. While a number of approaches have been developed to define low-complexity domains, these methods each possess conceptual limitations. Therefore, we developed a complementary approach that focuses on local amino acid composition (i.e. the amino acid composition within small regions of proteins). We find that high local composition of individual amino acids is associated with pervasive effects on protein metabolism, subcellular localization, and molecular function on a proteome-wide scale. Importantly, the nature of the effects depend on the type of amino acid enriched within the examined domains, and are observable in the absence of classically-defined low-complexity (and related) domains. Furthermore, we define the compositions of proteins involved in the formation of membraneless, protein-rich organelles such as stress granules and P-bodies. Our results provide a coherent view and unprecedented resolution of the effects of local amino acid enrichment on protein biology.
Affiliation(s)
- Sean M. Cascarina
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO, United States of America
| | - Eric D. Ross
- Department of Biochemistry and Molecular Biology, Colorado State University, Fort Collins, CO, United States of America
|
27
|
Mier P, Andrade-Navarro MA. Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length. Genome Biol Evol 2018; 10:816-825. [PMID: 29608721 PMCID: PMC5841385 DOI: 10.1093/gbe/evy046] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Accepted: 02/19/2018] [Indexed: 12/16/2022] Open
Abstract
Amino acid usage in a proteome depends mostly on its taxonomy, as it does the codon usage in transcriptomes. Here, we explore the level of variation in the codon usage of a specific amino acid, glutamine, in relation to the number of consecutive glutamine residues. We show that CAG triplets are consistently more abundant in short glutamine homorepeats (polyQ, four to eight residues) than in shorter glutamine stretches (one to three residues), leading to the evolutionary growth of the repeat region in a CAG-dependent manner. The length of orthologous polyQ regions is mostly stable in primates, particularly the short ones. Interestingly, given a short polyQ the CAG usage is higher in unstable-in-length orthologous polyQ regions. This indicates that CAG triplets produce the necessary instability for a glutamine stretch to grow. Proteins related to polyQ-associated diseases behave in a more extreme way, with longer glutamine stretches in human and evolutionarily closer nonhuman primates, and an overall higher CAG usage. In the light of our results, we suggest an evolutionary model to explain the glutamine codon usage in polyQ regions.
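The comparison of CAG usage across glutamine stretches of different lengths can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes an upper-case, in-frame coding sequence, relies only on the fact that glutamine is encoded by CAG and CAA, and the function name is hypothetical.

```python
from collections import defaultdict


def cag_fraction_by_run_length(cds):
    """Map each glutamine run length (in codons) to the fraction of CAG
    codons among all glutamine codons found in runs of that length.

    Assumes `cds` is an upper-case coding sequence starting in frame.
    """
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    totals = defaultdict(lambda: [0, 0])  # run length -> [CAG count, Q codons]
    run = []
    for codon in codons + [""]:  # empty sentinel flushes a trailing run
        if codon in ("CAG", "CAA"):
            run.append(codon)
        elif run:
            totals[len(run)][0] += run.count("CAG")
            totals[len(run)][1] += len(run)
            run = []
    return {n: cag / total for n, (cag, total) in sorted(totals.items())}


# Toy CDS: one isolated CAA codon and one four-codon glutamine run.
toy_cds = "ATG" + "CAA" + "GCT" + "CAGCAGCAACAG" + "TAA"
print(cag_fraction_by_run_length(toy_cds))
```

Pooling these per-length fractions over many genes would give the kind of length-dependent CAG usage profile the abstract describes, although a real analysis would also need orthology information to assess length stability.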
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg University Mainz, Germany
- Institute of Molecular Biology, Mainz, Germany
| | - Miguel A Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg University Mainz, Germany
- Institute of Molecular Biology, Mainz, Germany
|
28
|
Comparative mechanical unfolding studies of spectrin domains R15, R16 and R17. J Struct Biol 2017; 201:162-170. [PMID: 29221897 DOI: 10.1016/j.jsb.2017.12.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 04/19/2017] [Revised: 11/08/2017] [Accepted: 12/04/2017] [Indexed: 11/20/2022]
Abstract
Spectrins belong to repetitive three-helix bundle proteins that have vital functions in multicellular organisms and are of potential value in nanotechnology. To reveal the unique physical features of repeat proteins we have studied the structural and mechanical properties of three repeats of chicken brain α-spectrin (R15, R16 and R17) at the atomic level under stretching at constant velocities (0.01, 0.05 and 0.1 Å·ps⁻¹) and constant forces (700 and 900 pN) using molecular dynamics (MD) simulations at T = 300 K. 114 independent MD simulations were performed and their analysis has been done. Despite structural similarity of these domains we have found that R15 is less mechanically stable than R16, which is less stable than R17. This result is in agreement with the thermal unfolding rates. Moreover, we have observed the relationship between mechanical stability, flexibility of the domains and the number of aromatic residues involved in aromatic clusters.
|
29
|
Michel CJ, Ngoune VN, Poch O, Ripp R, Thompson JD. Enrichment of Circular Code Motifs in the Genes of the Yeast Saccharomyces cerevisiae. Life (Basel) 2017; 7:life7040052. [PMID: 29207500 PMCID: PMC5745565 DOI: 10.3390/life7040052] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 10/09/2017] [Revised: 11/27/2017] [Accepted: 11/27/2017] [Indexed: 12/17/2022] Open
Abstract
A set X of 20 trinucleotides has been found to have the highest average occurrence in the reading frame, compared to the two shifted frames, of genes of bacteria, archaea, eukaryotes, plasmids and viruses. This set X has an interesting mathematical property, since X is a maximal C3 self-complementary trinucleotide circular code. Furthermore, any motif obtained from this circular code X has the capacity to retrieve, maintain and synchronize the original (reading) frame. Since 1996, the theory of circular codes in genes has mainly been developed by analysing the properties of the 20 trinucleotides of X, using combinatorics and statistical approaches. For the first time, we test this theory by analysing the X motifs, i.e., motifs from the circular code X, in the complete genome of the yeast Saccharomyces cerevisiae. Several properties of X motifs are identified by basic statistics (at the frequency level), and evaluated by comparison to R motifs, i.e., random motifs generated from 30 different random codes R. We first show that the frequency of X motifs is significantly greater than that of R motifs in the genome of S. cerevisiae. We then verify that no significant difference is observed between the frequencies of X and R motifs in the non-coding regions of S. cerevisiae, but that the occurrence number of X motifs is significantly higher than R motifs in the genes (protein-coding regions). This property is true for all cardinalities of X motifs (from 4 to 20) and for all 16 chromosomes. We further investigate the distribution of X motifs in the three frames of S. cerevisiae genes and show that they occur more frequently in the reading frame, regardless of their cardinality or their length. Finally, the ratio of X genes, i.e., genes with at least one X motif, to non-X genes, in the set of verified genes is significantly different to that observed in the set of putative or dubious genes with no experimental evidence. 
These results, taken together, represent the first evidence for a significant enrichment of X motifs in the genes of an extant organism. They raise two hypotheses: the X motifs may be evolutionary relics of the primitive codes used for translation, or they may continue to play a functional role in the complex processes of genome decoding and protein synthesis.
Affiliation(s)
- Christian J Michel
- Complex Systems and Translational Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
| | - Viviane Nguefack Ngoune
- Complex Systems and Translational Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
| | - Olivier Poch
- Complex Systems and Translational Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
| | - Raymond Ripp
- Complex Systems and Translational Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
| | - Julie D Thompson
- Complex Systems and Translational Bioinformatics, ICube, University of Strasbourg, CNRS, 300 Boulevard Sébastien Brant, 67400 Illkirch, France.
| |
|
30
|
Mier P, Andrade-Navarro MA. dAPE: a web server to detect homorepeats and follow their evolution. Bioinformatics 2017; 33:1221-1223. [PMID: 28031183 PMCID: PMC5408840 DOI: 10.1093/bioinformatics/btw790] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/30/2016] [Accepted: 12/09/2016] [Indexed: 01/10/2023] Open
Abstract
Summary: Homorepeats are low-complexity regions consisting of repetitions of a single amino acid residue. There is no current consensus on the minimum number of residues needed to define a functional homorepeat, nor even on whether mismatches are allowed. Here we present dAPE, a web server that helps follow the evolution of homorepeats based on orthology information, using a sensitive but tunable cutoff to help identify emerging homorepeats. Availability and Implementation: dAPE can be accessed from http://cbdm-01.zdv.uni-mainz.de/~munoz/polyx. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pablo Mier
- Faculty of Biology, Johannes Gutenberg Universität, Institute of Molecular Biology, Mainz, Germany
- To whom correspondence should be addressed.
| | - Miguel A Andrade-Navarro
- Faculty of Biology, Johannes Gutenberg Universität, Institute of Molecular Biology, Mainz, Germany
| |
|
31
|
Campbell L, Turner SR. A Comprehensive Analysis of RALF Proteins in Green Plants Suggests There Are Two Distinct Functional Groups. Frontiers in Plant Science 2017; 8:37. [PMID: 28174582 PMCID: PMC5258720 DOI: 10.3389/fpls.2017.00037] [Citation(s) in RCA: 76] [Impact Index Per Article: 9.5] [Received: 11/03/2016] [Accepted: 01/09/2017] [Indexed: 05/20/2023]
Abstract
Rapid Alkalinization Factors (RALFs) are small, cysteine-rich peptides known to be involved in various aspects of plant development and growth. Although RALF peptides have been identified within many species, a single wide-ranging phylogenetic analysis of the family across the plant kingdom has not yet been undertaken. Here, we identified RALF proteins from 51 plant species that represent a variety of land plant lineages. The inferred evolutionary history of the 795 identified RALFs suggests that the family has diverged into four major clades. We found that much of the variation across the family exists within the mature peptide region, suggesting clade-specific functional diversification. Clades I, II, and III contain the features that have been identified as important for RALF activity, including the RRXL cleavage site and the YISY motif required for receptor binding. In contrast, members of clade IV, which represent a third of the total dataset, are highly diverged and lack these features that are typical of RALFs. Members of clade IV also exhibit distinct expression patterns and physico-chemical properties. These differences suggest a functional divergence of clades and, consequently, we propose that the peptides within clade IV are not true RALFs but are more accurately described as RALF-related peptides. Expansion of this RALF-related clade in the Brassicaceae is responsible for the large number of RALF genes that have been previously described in Arabidopsis thaliana. Future experimental work will help to establish the nature of the relationship between the true RALFs and the RALF-related peptides, and whether they function in a similar manner.
|
32
|
Adaptive Variation and Introgression of a CONSTANS-Like Gene in North American Red Oaks. Forests 2016. [DOI: 10.3390/f8010003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
|