1
Song J. In the Beginning: Let Hydration Be Coded in Proteins for Manifestation and Modulation by Salts and Adenosine Triphosphate. Int J Mol Sci 2024; 25:12817. [PMID: 39684527 DOI: 10.3390/ijms252312817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Revised: 11/25/2024] [Accepted: 11/26/2024] [Indexed: 12/18/2024] [Open Access]
Abstract
Water exists in the beginning and hydrates all matter. Life emerged in water, requiring three essential components in compartmentalized spaces: (1) universal energy sources driving biochemical reactions and processes, (2) molecules that store, encode, and transmit information, and (3) functional players carrying out biological activities and structural organization. Phosphorus has been selected to create adenosine triphosphate (ATP) as the universal energy currency, nucleic acids for genetic information storage and transmission, and phospholipids for cellular compartmentalization. Meanwhile, proteins composed of 20 α-amino acids have evolved into extremely diverse three-dimensional forms, including folded domains, intrinsically disordered regions (IDRs), and membrane-bound forms, to fulfill functional and structural roles. This review examines several unique findings: (1) insoluble proteins, including membrane proteins, can become solubilized in unsalted water, while folded cytosolic proteins can acquire membrane-inserting capacity; (2) Hofmeister salts affect protein stability by targeting hydration; (3) ATP biphasically modulates liquid-liquid phase separation (LLPS) of IDRs; (4) ATP antagonizes crowding-induced protein destabilization; and (5) ATP and triphosphates have the highest efficiency in inducing protein folding. These findings imply the following: (1) hydration might be encoded in protein sequences, central to manifestation and modulation of protein structures, dynamics, and functionalities; (2) phosphate anions have a unique capacity in enhancing μs-ms protein dynamics, likely through ionic state exchanges in the hydration shell, underpinning ATP, polyphosphate, and nucleic acids as molecular chaperones for protein folding; and (3) ATP, by linking triphosphate with adenosine, has acquired the capacity to spacetime-specifically release energy and modulate protein hydration, thus possessing myriad energy-dependent and -independent functions. 
In light of the success of AlphaFolds in accurately predicting protein structures by neural networks that store information as distributed patterns across nodes, a fundamental question arises: Could cellular networks also handle information similarly but with more intricate coding, diverse topological architectures, and spacetime-specific ATP energy supply in membrane-compartmentalized aqueous environments?
Affiliation(s)
- Jianxing Song
- Department of Biological Sciences, Faculty of Science, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260, Singapore
2
Mahapatra A, Newberry RW. Liquid-liquid phase separation of α-synuclein is highly sensitive to sequence complexity. Protein Sci 2024; 33:e4951. [PMID: 38511533 PMCID: PMC10955625 DOI: 10.1002/pro.4951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 02/06/2024] [Accepted: 02/19/2024] [Indexed: 03/22/2024]
Abstract
The Parkinson's-associated protein α-synuclein (α-syn) can undergo liquid-liquid phase separation (LLPS), which typically leads to the formation of amyloid fibrils. The coincidence of LLPS and amyloid formation has complicated the identification of the molecular determinants unique to LLPS of α-syn. Moreover, the lack of strategies to selectively perturb LLPS makes it difficult to dissect the biological roles specific to α-syn LLPS, independent of fibrillation. Herein, using a combination of subtle missense mutations, we show that LLPS of α-syn is highly sensitive to its sequence complexity. In fact, we find that even a highly conservative mutation (V16I) that increases sequence complexity without perturbing physicochemical and structural properties, is sufficient to reduce LLPS by 75%; this effect can be reversed by an adjacent V-to-I mutation (V15I) that restores the original sequence complexity. A18T, a complexity-enhancing PD-associated mutation, was likewise found to reduce LLPS, implicating sequence complexity in α-syn pathogenicity. Furthermore, leveraging the differences in LLPS propensities among different α-syn variants, we demonstrate that fibrillation of α-syn does not necessarily correlate with its LLPS. In fact, we identify mutations that selectively perturb LLPS or fibrillation of α-syn, unlike previously studied mutations. The variants and design principles reported herein should therefore empower future studies to disentangle these two phenomena and distinguish their (patho)biological roles.
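The complexity argument in this abstract can be illustrated with a small sketch. It assumes Shannon entropy of the residue composition as the complexity measure; the abstract does not say which measure the authors used. The eight-residue segment is a toy example chosen so that two adjacent valines can be mutated, and it is not presented as the α-syn sequence.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy segment with two adjacent valines (hypothetical, for illustration only).
wild_type = "EGVVAAAE"
single = "EGVIAAAE"   # one V-to-I change: a new residue type appears
double = "EGIIAAAE"   # both V-to-I changes: same composition pattern as wild type

for name, seq in [("wild type", wild_type), ("single", single), ("double", double)]:
    print(f"{name:10s} {seq}  H = {shannon_entropy(seq):.3f} bits")
```

Under this assumed measure, the single substitution raises the entropy and the second adjacent substitution brings it back to the wild-type value, which mirrors the compensating effect described for V16I and V15I.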
3
Tagad A, Singh RK, Patwari GN. Binary Matrix Method to Enumerate, Hierarchically Order, and Structurally Classify Peptide Aggregation. J Chem Inf Model 2022; 62:1585-1594. [PMID: 35232014 DOI: 10.1021/acs.jcim.2c00069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Protein aggregation is a common and complex phenomenon in biological processes, yet a robust analysis of this aggregation process remains elusive. The commonly used methods such as center-of-mass to center-of-mass (COM-COM) distance, the radius of gyration (Rg), hydrogen bonding (HB), and solvent accessible surface area do not quantify the aggregation accurately. Herein, a new and robust method that uses an aggregation matrix (AM) approach to investigate peptide aggregation in a MD simulation trajectory is presented. An nxn two-dimensional AM is created by using the interpeptide Cα-Cα cutoff distances, which are binarily encoded (0 or 1). These aggregation matrices are analyzed to enumerate, hierarchically order, and structurally classify the aggregates. Comparison of the present AM method suggests that it is superior to the HB method since it can incorporate nonspecific interactions and the Rg and COM-COM methods since the cutoff distance is independent of the length of the peptide. More importantly, the present method can structurally classify the peptide aggregates, which the conventional Rg, COM-COM, and HB methods fail to do. The unique selling point of this method is its ability to structurally classify peptide aggregates using two-dimensional matrices.
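A minimal sketch of the binary aggregation-matrix idea follows. The abstract states only that an n x n matrix is built from interpeptide Cα-Cα cutoff distances encoded as 0 or 1. The contact rule used here (any Cα pair within the cutoff), the connected-component grouping, and the names `aggregation_matrix` and `aggregates` are assumptions made for illustration and are not the authors' implementation.

```python
from math import dist


def aggregation_matrix(peptides, cutoff):
    """Build an n x n binary matrix: 1 if two peptides have any
    C-alpha pair within `cutoff` (assumed contact rule), else 0."""
    n = len(peptides)
    am = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            close = any(dist(a, b) <= cutoff
                        for a in peptides[i] for b in peptides[j])
            am[i][j] = am[j][i] = int(close)
    return am


def aggregates(am):
    """Enumerate aggregates as connected components of the matrix."""
    n = len(am)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if am[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Three toy peptides, each a list of C-alpha coordinates (hypothetical data).
peptides = [
    [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0)],
    [(50.0, 50.0, 50.0), (53.8, 50.0, 50.0)],
]
am = aggregation_matrix(peptides, cutoff=6.0)
clusters = aggregates(am)
print(am)
print(clusters)
```

Because the cutoff applies to individual Cα pairs, it does not have to be rescaled with peptide length, which is the advantage the abstract claims over Rg and COM-COM criteria.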
Affiliation(s)
- Amol Tagad
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Reman Kumar Singh
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- G Naresh Patwari
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
4
Yu B, Kong D, Cheng C, Xiang D, Cao L, Liu Y, He Y. Assembly and recognition of keratins: A structural perspective. Semin Cell Dev Biol 2021; 128:80-89. [PMID: 34654627 DOI: 10.1016/j.semcdb.2021.09.018] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/30/2021] [Revised: 09/22/2021] [Accepted: 09/29/2021] [Indexed: 12/21/2022]
Abstract
Keratins are one of the major components of the cytoskeletal network and assemble into fibrous structures named intermediate filaments (IFs), which are important for maintaining the mechanical properties of cells and tissues. Over the past decades, evidence has shown that the functions of keratins go beyond providing mechanical support for cells: they interact with multiple cellular components and are widely involved in the pathways of cell proliferation, differentiation, motility and death. However, the structural details of keratins and IFs are largely missing, and many questions remain regarding the mechanisms of keratin assembly and recognition. Here we briefly review the current structural models and assembly of keratins as well as the interactions of keratins with their binding partners, which may provide a structural view for understanding the mechanisms of keratins in biological activities and related diseases.
Affiliation(s)
- Bowen Yu
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Immunology, School of Basic Medical Sciences, Weifang Medical University, Weifang, China
- Dandan Kong
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Chen Cheng
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Dongxi Xiang
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Biliary-Pancreatic Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Longxing Cao
- School of Life Science, Westlake University, Hangzhou, Zhejiang, China
- Yingbin Liu
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Biliary-Pancreatic Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yongning He
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Biliary-Pancreatic Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai, China.
5
Lau Y, Oamen HP, Caudron F. Protein Phase Separation during Stress Adaptation and Cellular Memory. Cells 2020; 9:cells9051302. [PMID: 32456195 PMCID: PMC7291175 DOI: 10.3390/cells9051302] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/21/2020] [Revised: 05/14/2020] [Accepted: 05/21/2020] [Indexed: 12/13/2022] [Open Access]
Abstract
Cells need to organise and regulate their biochemical processes both in space and time in order to adapt to their surrounding environment. Spatial organisation of cellular components is facilitated by a complex network of membrane-bound organelles. Both the membrane composition and the intra-organellar content of these organelles can be specifically and temporally controlled by imposing gates, much like bouncers controlling entry into night-clubs. In addition, a new level of compartmentalisation has recently emerged as a fundamental principle of cellular organisation: the formation of membrane-less organelles. Many of these structures are dynamic, rapidly condensing or dissolving, and are therefore ideally suited to be involved in emergency cellular adaptation to stresses. Remarkably, the same proteins also have the propensity to adopt self-perpetuating assemblies whose properties fit the needs of encoding cellular memory. Here, we review some of the principles of phase separation and the function of membrane-less organelles, focusing particularly on their roles during stress response and cellular memory.
6
Suvorova YM, Korotkov EV. New Method for Potential Fusions Detection in Protein-Coding Sequences. J Comput Biol 2019; 26:1253-1261. [PMID: 31211597 DOI: 10.1089/cmb.2019.0122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022] [Open Access]
Abstract
Gene fusion is known to be one of the mechanisms of new gene formation. Most bioinformatics methods for studying fused genes are based on sequence similarity search. However, if the ancestral sequences were lost during evolution or changed too much, it is impossible to detect the fusion this way. Previously, we developed a method of searching for triplet periodicity (TP) change points in protein-coding sequences (CDS) and showed the possible relation of this phenomenon to gene formation as a result of fusion. In this study, we improved the TP change point detection method and studied the genes of six eukaryotic genomes. At a type I error probability of 2%-3%, TP change points were found in 20%-40% of genes. Further analysis showed that about 30% of the TP change points can be explained by amino acid repeats. Another 30% can be potentially fused genes, for which an alignment was detected by the BLAST program. We believe that the rest of the results can be fused genes for which the ancestral sequences have been lost. The method is more sensitive to TP changes and allowed us to find two to three times more cases of significant TP change points than our previous method.
Affiliation(s)
- Yulia M Suvorova
- Federal State Institution "Federal Research Centre "Fundamentals of Biotechnology" of the Russian Academy of Sciences", Moscow, Russian Federation
- Eugene V Korotkov
- Federal State Institution "Federal Research Centre "Fundamentals of Biotechnology" of the Russian Academy of Sciences", Moscow, Russian Federation; Applied Mathematics Department, National Research Nuclear University MEPhI, Moscow, Russian Federation
7
Urbanek A, Morató A, Allemand F, Delaforge E, Fournet A, Popovic M, Delbecq S, Sibille N, Bernadó P. A General Strategy to Access Structural Information at Atomic Resolution in Polyglutamine Homorepeats. Angew Chem Int Ed Engl 2018; 57:3598-3601. [PMID: 29359503 PMCID: PMC5901001 DOI: 10.1002/anie.201711530] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 11/09/2017] [Revised: 12/28/2017] [Indexed: 12/31/2022]
Abstract
Homorepeat (HR) proteins are involved in key biological processes and multiple pathologies; however, their high-resolution characterization has been impaired by their homotypic nature. To overcome this problem, we have developed a strategy to isotopically label individual glutamines within HRs by combining nonsense suppression and cell-free expression. Our method has enabled the NMR investigation of huntingtin exon1 with a 16-residue polyglutamine (poly-Q) tract, and the results indicate the presence of an N-terminal α-helix at near neutral pH that vanishes towards the end of the HR. The generality of the strategy was demonstrated by introducing a labeled glutamine into a pathological version of huntingtin with 46 glutamines. This methodology paves the way to decipher the structural and dynamic perturbations induced by HR extensions in poly-Q-related diseases. Our approach can be extended to other amino acids to investigate biological processes involving proteins containing low-complexity regions (LCRs).
Affiliation(s)
- Annika Urbanek
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Anna Morató
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Frédéric Allemand
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Elise Delaforge
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Aurélie Fournet
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Matija Popovic
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Stephane Delbecq
- Laboratoire de Biologie Cellulaire et Moléculaire (LBCM-EA4558 Vaccination Antiparasitaire), UFR Pharmacie, Université de Montpellier, Montpellier, France
- Nathalie Sibille
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
- Pau Bernadó
- Centre de Biochimie Structurale (CBS), INSERM, CNRS, Université de Montpellier, 29 rue de Navacelles, 34090 Montpellier, France
8
Urbanek A, Morató A, Allemand F, Delaforge E, Fournet A, Popovic M, Delbecq S, Sibille N, Bernadó P. A General Strategy to Access Structural Information at Atomic Resolution in Polyglutamine Homorepeats. Angew Chem Int Ed Engl 2018. [DOI: 10.1002/ange.201711530] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/20/2022]
Affiliation(s)
- Annika Urbanek
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Anna Morató
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Frédéric Allemand
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Elise Delaforge
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Aurélie Fournet
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Matija Popovic
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Stephane Delbecq
- Laboratoire de Biologie Cellulaire et Moléculaire, (LBCM-EA4558 Vaccination Antiparasitaire); UFR Pharmacie; Université de Montpellier; Montpellier France
- Nathalie Sibille
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
- Pau Bernadó
- Centre de Biochimie Structurale (CBS), INSERM, CNRS; Université de Montpellier; 29 rue de Navacelles 34090 Montpellier France
9
Song J. Environment-transformable sequence-structure relationship: a general mechanism for proteotoxicity. Biophys Rev 2017; 10:503-516. [PMID: 29204881 DOI: 10.1007/s12551-017-0369-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/10/2017] [Accepted: 11/19/2017] [Indexed: 12/15/2022] [Open Access]
Abstract
In his Nobel Lecture, Anfinsen stated "the native conformation is determined by the totality of interatomic interactions and hence by the amino acid sequence, in a given environment." As aqueous solutions and membrane systems co-exist in cells, proteins are classified into membrane and non-membrane proteins, but whether one can transform one into the other remains unknown. Intriguingly, many well-folded non-membrane proteins are converted into "insoluble" and toxic forms by aging- or disease-associated factors, but the underlying mechanisms remain elusive. In 2005, we discovered a previously unknown regime of proteins seemingly inconsistent with the classic "Salting-in" dogma: "insoluble" proteins, including integral membrane fragments, could be solubilized in ion-minimized water. We have thus successfully studied "insoluble" forms of ALS-causing P56S-MSP, L126Z-SOD1, nascent SOD1 and C71G-Profilin1, as well as E. coli S1 fragments. The results revealed that these "insoluble" forms are either unfolded or co-exist with their unfolded states. Most unexpectedly, these unfolded states acquire a novel capacity of interacting with membranes, energetically driven by the formation of helices/loops over amphiphilic/hydrophobic regions which universally exist in proteins but are normally locked away in their folded native states. Our studies suggest that most, if not all, proteins contain segments which have the dual ability to fold into distinctive structures in aqueous and membrane environments. The abnormal membrane interaction might initiate disease and/or aging processes; and its further coupling with protein aggregation could result in radical proteotoxicity by forming inclusions composed of damaged membranous organelles and protein aggregates. Therefore, the environment-transformable sequence-structure relationship may represent a general mechanism for proteotoxicity.
Affiliation(s)
- Jianxing Song
- Department of Biological Sciences, Faculty of Science, National University of Singapore, 10 Kent Ridge Crescent, Singapore, 119260, Singapore.
10
Screening of nucleotide variations in genomic sequences encoding charged protein regions in the human genome. BMC Genomics 2017; 18:588. [PMID: 28789634 PMCID: PMC5549384 DOI: 10.1186/s12864-017-4000-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2017] [Accepted: 08/01/2017] [Indexed: 11/24/2022] [Open Access]
Abstract
Background: Studying genetic variation distribution in proteins containing charged regions, called charge clusters (CCs), is of great interest to unravel their functional role. Charge clusters are 20 to 75 residue segments with high net positive charge, high net negative charge, or high total charge relative to the overall charge composition of the protein. We previously developed a bioinformatics tool (FCCP) to detect charge clusters in proteomes and scanned the human proteome for the occurrence of CCs. In this paper we investigate the genetic variations in the human proteins harbouring CCs.
Results: We studied the coding regions of 317 positively charged clusters and 1020 negatively charged ones previously detected in human proteins. Results revealed that coding parts of CCs are richer in sequence variants than their corresponding genes, full mRNAs, and exonic + intronic sequences and that these variants are predominantly rare (minor allele frequency < 0.005). Furthermore, variants occurring in the coding parts of positively charged regions of proteins are more often pathogenic than those occurring in negatively charged ones. Classification of variants according to their types showed that substitution is the major type, followed by indels (insertions-deletions). Concerning substitutions, it was found that within clusters of both charges, the charged amino acids were the greatest loser groups whereas polar residues were the greatest gainers.
Conclusions: Our findings highlight the prominent features of the human charged regions from the DNA up to the protein sequence, which might provide potential clues to improve the current understanding of those charged regions and their implication in the emergence of diseases.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4000-3) contains supplementary material, which is available to authorized users.
11
Berezovsky IN, Guarnera E, Zheng Z. Basic units of protein structure, folding, and function. Progress in Biophysics and Molecular Biology 2016; 128:85-99. [PMID: 27697476 DOI: 10.1016/j.pbiomolbio.2016.09.009] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.0] [Received: 07/29/2016] [Revised: 09/05/2016] [Accepted: 09/26/2016] [Indexed: 10/20/2022]
Abstract
Study of the hierarchy of domain structure with alternative sets of domains and analysis of discontinuous domains, consisting of remote segments of the polypeptide chain, raised a question about the minimal structural unit of the protein domain. The hypothesis on the decisive role of the polypeptide backbone in determining the elementary units of globular proteins has led to the discovery of closed loops. It is reviewed here how closed loops form the loop-n-lock structure of proteins, providing the foundation for stability and designability of protein folds/domains and underlying their co-translational folding. Simplified protein sequences are considered here with the aim of exploring the basic principles that presumably dominated the folding and stability of proteins in the early stages of structural evolution. Elementary functional loops (EFLs), closed loops with one or a few catalytic residues, are, in turn, units of protein function. They are apparent descendants of the prebiotic ring-like peptides, which gave rise to the first functional folds/domains by being fused at the beginning of the evolution of protein structure. It is also shown how evolutionary relations between protein functional superfamilies and folds delineated with the help of EFLs can contribute to establishing the rules for design of desired enzymatic functions. Generalized descriptors of the elementary functions are proposed to be used as basic units in future computational design.
Affiliation(s)
- Igor N Berezovsky
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, 117579, Singapore.
- Enrico Guarnera
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
- Zejun Zheng
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
12
The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. Methods Mol Biol 2016; 1415:477-506. [PMID: 27115649 DOI: 10.1007/978-1-4939-3572-7_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 03/02/2023]
13
Chou CC, Wang AHJ. Structural D/E-rich repeats play multiple roles especially in gene regulation through DNA/RNA mimicry. Molecular BioSystems 2016; 11:2144-51. [PMID: 26088262 DOI: 10.1039/c5mb00206k] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Indexed: 01/14/2023]
Abstract
Aspartic acid and glutamic acid repeats in proteins exhibit a strong negative charge distribution and may play special biological roles. From 39,684 unique structural data in the RCSB Protein Data Bank (PDB), 173 structures were found to contain ordered D/E-rich repeat structures, and 57 of them were related to DNA/RNA functions. The frequency of occurrence of glutamic acid (36.90%) was higher than that of aspartic acid (27.02%). Glycine (2.38%), alanine (2.68%), valine (3.54%), leucine (5.57%), and isoleucine (3.34%), but not methionine (0.91%), were the most abundant hydrophobic residues. The available complex structures suggested that D/E-rich proteins might be involved in DNA mimicry, mRNA processing and regulation of the transcription complex. The region surrounding the D/E-rich repeat sequences plays important roles in the binding specificity toward the target proteins. The numbers and composition of aspartic acid and glutamic acid might also affect binding properties. Aspartic acid and glutamic acid are disorder-promoting residues in intrinsically disordered proteins. Our findings suggest that the D/E-rich repeats are unique components of intrinsically disordered proteins, which are involved in gene regulation and could serve as potential druggable fragments or drug targets.
Affiliation(s)
- Chia-Cheng Chou
- Institute of Biological Chemistry, Academia Sinica, Taipei, Taiwan.
14
Kumari B, Kumar R, Kumar M. Low complexity and disordered regions of proteins have different structural and amino acid preferences. Molecular BioSystems 2014; 11:585-94. [PMID: 25468592 DOI: 10.1039/c4mb00425f] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 12/31/2022]
Abstract
Low complexity regions (LCRs) or non-random regions of a few amino acids are abundantly present in proteins. LCRs are traditionally considered as floppy structures with high solvent accessibility. Thus little attention was paid to them for structural studies. However LCRs have been found to contain information relevant to protein structure and various important functions. The present study is an attempt to understand the structural trend of LCRs. Here we report a study conducted to understand the structural trend, solvent accessibility and amino acid preferences of LCRs. The results show that LCRs might attain any type of secondary structure; however, the helix is frequently seen, whereas sheets occur rarely. We also found that LCRs are not always exposed on the surface. We found insignificant contribution of trans-membrane helices to the overall helix content. The LCRs having a secondary structure have different enrichment and depletion of amino acids from LCRs without a secondary structure and disordered protein sequences. However, LCRs of NMR structures showed compositional and functional similarity to the disordered regions of proteins. We also noted that in ∼3/4 LCRs, the entire amino acid did not have a single structural class, but rather an ensemble of more than one secondary structure, which indicates that they are found at places where structure transition occurs. Overall analysis suggests that the overall protein sequence has a greater influence on the structural and sequence enrichment rather than only the local amino acid composition of LCRs.
Affiliation(s)
- Bandana Kumari
- Department of Biophysics, University of Delhi South Campus, New Delhi, India.
15
Das S, Pal U, Das S, Bagga K, Roy A, Mrigwani A, Maiti NC. Sequence complexity of amyloidogenic regions in intrinsically disordered human proteins. PLoS One 2014; 9:e89781. [PMID: 24594841 PMCID: PMC3940659 DOI: 10.1371/journal.pone.0089781] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 07/17/2013] [Accepted: 01/26/2014] [Indexed: 01/03/2023] [Open Access]
Abstract
An amyloidogenic region (AR) in a protein sequence plays a significant role in protein aggregation and amyloid formation. We have investigated the sequence complexity of ARs present in intrinsically disordered human proteins. More than 80% of the human proteins in the disordered protein databases (DisProt+IDEAL) contained one or more ARs. As protein disorder decreased, the AR content of the protein sequence also decreased. A probability density distribution analysis and discrete analysis of AR sequences showed that ∼8% of the residues in a protein sequence were in ARs and that an AR was on average 8 residues long. The residues in the ARs were high in sequence complexity, and ARs seldom overlapped with low complexity regions (LCRs), which were abundant in disordered proteins. The sequences in the ARs showed mixed conformational adaptability towards α-helix, β-sheet/strand and coil conformations.
Affiliation(s)
- Swagata Das
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Uttam Pal
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Supriya Das
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Khyati Bagga
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Anupam Roy
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Arpita Mrigwani
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
- Nakul C. Maiti
- Structural Biology and Bioinformatics Division, Council of Scientific and Industrial Research (CSIR)-Indian Institute of Chemical Biology (IICB), Kolkata, India
16
Persi E, Horn D. Systematic analysis of compositional order of proteins reveals new characteristics of biological functions and a universal correlate of macroevolution. PLoS Comput Biol 2013; 9:e1003346. [PMID: 24278003 PMCID: PMC3836704 DOI: 10.1371/journal.pcbi.1003346] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/21/2013] [Accepted: 10/03/2013] [Indexed: 01/01/2023] Open
Abstract
We present a novel analysis of compositional order (CO) based on the occurrence of Frequent amino-acid Triplets (FTs) that appear much more than random in protein sequences. The method captures all types of proteomic compositional order including single amino-acid runs, tandem repeats, periodic structure of motifs and otherwise low complexity amino-acid regions. We introduce new order measures, distinguishing between ‘regularity’, ‘periodicity’ and ‘vocabulary’, to quantify these phenomena and to facilitate the identification of evolutionary effects. Detailed analysis of representative species across the tree-of-life demonstrates that CO proteins exhibit numerous functional enrichments, including a wide repertoire of particular patterns of dependencies on regularity and periodicity. Comparison between human and mouse proteomes further reveals the interplay of CO with evolutionary trends, such as faster substitution rate in mouse leading to decrease of periodicity, while innovation along the human lineage leads to larger regularity. Large-scale analysis of 94 proteomes leads to systematic ordering of all major taxonomic groups according to FT-vocabulary size. This is measured by the count of Different Frequent Triplets (DFT) in proteomes. The latter provides a clear hierarchical delineation of vertebrates, invertebrates, plants, fungi and prokaryotes, with thermophiles showing the lowest level of FT-vocabulary. Among eukaryotes, this ordering correlates with phylogenetic proximity. Interestingly, in all kingdoms CO accumulation in the proteome has universal characteristics. We suggest that CO is a genomic-information correlate of both macroevolution and various protein functions. The results indicate a mechanism of genomic ‘innovation’ at the peptide level, involved in protein elongation, shaped in a universal manner by mutational and selective forces. 
Variations in compositionally ordered (CO) sections of proteins, such as amino acid runs, tandem repeats and low complexity regions, are often considered as a third type of genomic variation along with SNP and CNV. At the microevolutionary scale, they are involved in the rapid evolution of numerous biological functions and the development of novel phenotypic complex traits, including disease in human, in particular neurodegeneration and cancer. At the macroevolutionary scale, the best discriminating proteomic factor between super-kingdoms is the prevalence of CO proteins in eukaryotes. The analysis of CO structures has so far been quite eclectic. Here we introduce a novel unifying methodology, accounting for all types of low-complexity regions and repetitive phenomena, including the existence of large periodic structures in protein sequences. We define new CO measures providing insights into the correlation of CO with protein function and with evolution. In particular, a large-scale analysis of 94 proteomes shows that the CO vocabulary of frequently appearing amino acid triplets serves as a measure of taxonomic ordering separating major clades from each other. It unravels a missing genomic correlate of macroevolution and serves as a novel phylogenetic tool. This suggests that major CO generation occurs during the creation of a completely new species, i.e. during macroevolutionary events.
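The idea of a frequent-triplet vocabulary described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the null model (independent residue frequencies within each sequence), the thresholds `min_ratio` and `min_count`, and the function names are all assumptions chosen for illustration.

```python
from collections import Counter


def frequent_triplets(seq, min_ratio=3.0, min_count=2):
    """Return overlapping triplets that occur more often than expected.

    Assumption: "expected" is computed from the single-residue
    frequencies of the same sequence; the thresholds are hypothetical.
    """
    n = len(seq)
    if n < 3:
        return {}
    residue_freq = {aa: c / n for aa, c in Counter(seq).items()}
    triplet_counts = Counter(seq[i:i + 3] for i in range(n - 2))
    frequent = {}
    for triplet, observed in triplet_counts.items():
        expected = (n - 2) * (
            residue_freq[triplet[0]]
            * residue_freq[triplet[1]]
            * residue_freq[triplet[2]]
        )
        if observed >= min_count and observed / expected >= min_ratio:
            frequent[triplet] = observed
    return frequent


def dft_count(sequences, **kwargs):
    """Count distinct frequent triplets over a set of sequences."""
    vocabulary = set()
    for seq in sequences:
        vocabulary.update(frequent_triplets(seq, **kwargs))
    return len(vocabulary)


# Toy example: a short tandem repeat embedded in a non-repetitive context.
toy = "MKT" + "PQR" * 4 + "LVDEFGHIW"
print(frequent_triplets(toy))
```

On the toy sequence only the three phases of the `PQR` repeat are flagged, which is the kind of signal a vocabulary-size measure would accumulate across a proteome.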
Affiliation(s)
- Erez Persi
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- David Horn
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
17
Velasco AM, Becerra A, Hernández-Morales R, Delaye L, Jiménez-Corona ME, Ponce-de-Leon S, Lazcano A. Low complexity regions (LCRs) contribute to the hypervariability of the HIV-1 gp120 protein. J Theor Biol 2013; 338:80-6. [PMID: 24021867 DOI: 10.1016/j.jtbi.2013.08.039] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/10/2013] [Revised: 08/01/2013] [Accepted: 08/31/2013] [Indexed: 01/27/2023]
Abstract
Low complexity regions (LCRs) are sequences of nucleic acids or proteins defined by a compositional bias. Their occurrence has been confirmed in sequences of the three cellular lineages (Bacteria, Archaea and Eucarya), and has also been reported in viral genomes. We present here the results of a detailed computer analysis of the LCRs present in the HIV-1 glycoprotein 120 (gp120) encoded by the viral gene env. The analysis was performed using a sample of 3637 Env polyprotein sequences derived from 4117 completely sequenced and translated HIV-1 genomes available in public databases as of December 2012. We have identified 1229 LCRs located in four different regions of the gp120 protein that correspond to four of the five regions that have been identified as hypervariable (V1, V2, V4 and V5). The remaining 29 LCRs are found in the signal peptide and in the conserved regions C2, C3, C4 and C5. No LCR has been identified in the hypervariable region V3. The LCRs detected in the V1, V2, V4, and V5 hypervariable regions exhibit a high Asn content in their amino acid composition, which very likely correspond to glycosylation sites, which may contribute to the retroviral ability to avoid the immune system. In sharp contrast with what is observed in gp120 proteins lacking LCRs, the glycosylation sites present in LCRs tend to be clustered towards the center of the region forming well-defined islands. The results presented here suggest that LCRs represent a hitherto undescribed source of genomic variability in lentivirus, and that these repeats may represent an important source of antigenic variation in HIV-1 populations. The results reported here may exemplify the evolutionary processes that may have increased the size of primitive cellular RNA genomes and the role of LCRs as a source of raw material during the processes of evolutionary acquisition of new functions.
Affiliation(s)
- Ana María Velasco
- Facultad de Ciencias, UNAM, Ciudad Universitaria, Apdo. Postal 70-407, México D. F. 04510, Mexico; Laboratorios de Biológicos y Reactivos de México, Amores 1240, Colonia Del Valle, México D. F. 03100, Mexico
18
Marie A, Alves S, Marie B, Dubost L, Bédouet L, Berland S. Analysis of low complex region peptides derived from mollusk shell matrix proteins using CID, high-energy collisional dissociation, and electron transfer dissociation on an LTQ-orbitrap: Implications for peptide to spectrum match. Proteomics 2012; 12:3069-75. [DOI: 10.1002/pmic.201200143] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/05/2012] [Revised: 07/09/2012] [Accepted: 07/25/2012] [Indexed: 11/11/2022]
Affiliation(s)
- Arul Marie
- Plateforme de spectrométrie de masse et de protéomique; UMR7245 CNRS; Muséum National d'Histoire Naturelle; Paris France
- Sandra Alves
- UMR 7201 CNRS; Institut Parisien de Chimie Moléculaire; Université Pierre et Marie Curie; Paris France
- Benjamin Marie
- UMR7245 CNRS, Département RDDM; Muséum National d'Histoire Naturelle; Paris France
- Lionel Dubost
- Plateforme de spectrométrie de masse et de protéomique; UMR7245 CNRS; Muséum National d'Histoire Naturelle; Paris France
- Laurent Bédouet
- UMR BOREA; MNHN/CNRS 7208/IRD 207; Muséum National d'Histoire Naturelle; Paris France
- Sophie Berland
- UMR BOREA; MNHN/CNRS 7208/IRD 207; Muséum National d'Histoire Naturelle; Paris France
19
Glanz S, Jacobs J, Kock V, Mishra A, Kück U. Raa4 is a trans-splicing factor that specifically binds chloroplast tscA intron RNA. Plant J 2012; 69:421-431. [PMID: 21954961 DOI: 10.1111/j.1365-313x.2011.04801.x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 05/31/2023]
Abstract
During trans-splicing of discontinuous organellar introns, independently transcribed coding sequences are joined together to generate a continuous mRNA. The chloroplast psaA gene from Chlamydomonas reinhardtii encoding the P(700) core protein of photosystem I (PSI) is split into three exons and two group IIB introns, which are both spliced in trans. Using forward genetics, we isolated a novel PSI mutant, raa4, with a defect in trans-splicing of the first intron. Complementation analysis identified the affected gene encoding the 112.4 kDa Raa4 protein, which shares no strong sequence identity with other known proteins. The chloroplast localization of the protein was confirmed by confocal fluorescence microscopy, using a GFP-tagged Raa4 fusion protein. RNA-binding studies showed that Raa4 binds specifically to domains D2 and D3, but not to other conserved domains of the tripartite group II intron. Raa4 may play a role in stabilizing folding intermediates or functionally active structures of the split intron RNA.
Affiliation(s)
- Stephanie Glanz
- Department for General and Molecular Botany, Ruhr-University Bochum, D-44780 Bochum, Germany
20
Zhang T, Faraggi E, Xue B, Dunker AK, Uversky VN, Zhou Y. SPINE-D: accurate prediction of short and long disordered regions by a single neural-network based method. J Biomol Struct Dyn 2012; 29:799-813. [PMID: 22208280 PMCID: PMC3297974 DOI: 10.1080/073911012010525022] [Citation(s) in RCA: 138] [Impact Index Per Article: 10.6] [Indexed: 10/28/2022]
Abstract
Short and long disordered regions of proteins have different preferences for different amino acid residues. Different methods often have to be trained to predict them separately. In this study, we developed a single neural-network-based technique called SPINE-D that makes a three-state prediction first (ordered residues and disordered residues in short and long disordered regions) and reduces it into a two-state prediction afterwards. SPINE-D was tested on various sets composed of different combinations of Disprot annotated proteins and proteins directly from the PDB annotated for disorder by missing coordinates in X-ray determined structures. While disorder annotations are different according to Disprot and X-ray approaches, SPINE-D's prediction accuracy and ability to predict disorder are relatively independent of how the method was trained and what type of annotation was employed but strongly depend on the balance in the relative populations of ordered and disordered residues in short and long disordered regions in the test set. With greater than 85% overall specificity for detecting residues in both short and long disordered regions, the residues in long disordered regions are easier to predict at 81% sensitivity in a balanced test dataset with 56.5% ordered residues but more challenging (at 65% sensitivity) in a test dataset with 90% ordered residues. Compared to eleven other methods, SPINE-D yields the highest area under the curve (AUC), the highest Matthews correlation coefficient for residue-based prediction, and the lowest mean square error in predicting disorder contents of proteins for an independent test set with 329 proteins. In particular, SPINE-D is comparable to a meta predictor in predicting disordered residues in long disordered regions and superior in short disordered regions. SPINE-D participated in CASP 9 blind prediction and is one of the top servers according to the official ranking.
In addition, SPINE-D was examined for prediction of functional molecular recognition motifs in several case studies.
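The residue-level measures quoted in this abstract (sensitivity, specificity, Matthews correlation coefficient) follow standard definitions, sketched below. This is not SPINE-D code; the function name and the toy labels (1 = disordered, 0 = ordered) are assumptions for illustration.

```python
import math


def binary_metrics(true_labels, predicted_labels):
    """Sensitivity, specificity and MCC for per-residue binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical labels for a 10-residue stretch.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(truth, pred))
```

Because sensitivity and specificity are computed within each class, they do not depend on the ordered/disordered balance of the test set, whereas MCC does; this is one reason the abstract reports results for test sets with different proportions of ordered residues.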
Affiliation(s)
- Tuo Zhang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Eshel Faraggi
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Bin Xue
- Department of Molecular Medicine, University of South Florida, Tampa, FL 33612, USA
- A. Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Vladimir N. Uversky
- Department of Molecular Medicine, University of South Florida, Tampa, FL 33612, USA
- Institute for Biological Instrumentation, Russian Academy of Sciences, 142290 Pushchino, Moscow Region, Russia
- Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
21
Zhang T, Faraggi E, Zhou Y. Fluctuations of backbone torsion angles obtained from NMR-determined structures and their prediction. Proteins 2010; 78:3353-62. [PMID: 20818661 PMCID: PMC2976825 DOI: 10.1002/prot.22842] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 11/06/2022]
Abstract
Protein molecules exhibit varying degrees of flexibility throughout their three-dimensional structures. Protein structural flexibility is often characterized by fluctuations in the Cartesian coordinate space. On the other hand, the protein backbone can be mostly defined by two torsion angles ϕ and ψ only. We introduce a new flexibility descriptor, backbone torsion-angle fluctuation derived from the variation of backbone torsion angles from different NMR models. The torsion-angle fluctuations correlate with mean-squared spatial fluctuations derived from the same collection of NMR models. We developed a neural-network based real-value predictor based on sequence information only. The predictor achieved ten-fold cross-validated correlation coefficients of 0.59 and 0.60, and mean absolute errors of 22.7° and 24.3° for the angle fluctuation of ϕ and ψ, respectively. This predictor is expected to be useful for function prediction and protein structure prediction when predicted torsion angles are used as restraints. Both sequence- and structure-based prediction of torsion-angle fluctuation will be available at http://sparks.informatics.iupui.edu within the SPINE-X package.
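The abstract does not give the exact formula for torsion-angle fluctuation, so the following is only a sketch under a stated assumption: fluctuation is taken as the mean absolute angular deviation from the circular mean across NMR models. The function names are hypothetical, and the point of the example is the wrap-around handling that any angle-based descriptor needs.

```python
import math


def circular_mean_deg(angles):
    """Circular mean of angles given in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c))


def angular_difference_deg(a, b):
    """Signed difference a - b wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def torsion_fluctuation_deg(angles):
    """Mean absolute deviation from the circular mean (assumed definition)."""
    mean = circular_mean_deg(angles)
    return sum(abs(angular_difference_deg(a, mean)) for a in angles) / len(angles)


# Hypothetical phi values for one residue in two NMR models.
print(torsion_fluctuation_deg([170.0, -170.0]))  # close to 10, not 170
```

A naive arithmetic mean of 170° and -170° would be 0° and would report a deviation of 170°, although the two conformations differ by only 20°.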
Affiliation(s)
- Tuo Zhang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202
- Center for computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave., Walker Plaza Building Suite 319, Indianapolis, IN 46202, USA
- Eshel Faraggi
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202
- Center for computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave., Walker Plaza Building Suite 319, Indianapolis, IN 46202, USA
- Yaoqi Zhou
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202
- Center for computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Ave., Walker Plaza Building Suite 319, Indianapolis, IN 46202, USA
22
Abstract
The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on behalf of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters grasp structural/functional information, and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Affiliation(s)
- Alberto Apostolico
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30318, USA.
23
Capone G, Novello G, Fasano C, Trost B, Bickis M, Kusalik A, Kanduc D. The oligodeoxynucleotide sequences corresponding to never-expressed peptide motifs are mainly located in the non-coding strand. BMC Bioinformatics 2010; 11:383. [PMID: 20646284 PMCID: PMC2919516 DOI: 10.1186/1471-2105-11-383] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 06/03/2010] [Accepted: 07/20/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We study the usage of specific peptide platforms in protein composition. Using the pentapeptide as a unit of length, we find that in the universal proteome many pentapeptides are heavily repeated (even thousands of times), whereas some are quite rare, and a small number do not appear at all. To understand the physico-chemical-biological basis underlying peptide usage at the proteomic level, in this study we analyse the energetic costs for the synthesis of rare and never-expressed versus frequent pentapeptides. In addition, we explore residue bulkiness, hydrophobicity, and codon number as factors able to modulate specific peptide frequencies. Then, the possible influence of amino acid composition is investigated in zero- and high-frequency pentapeptide sets by analysing the frequencies of the corresponding inverse-sequence pentapeptides. As a final step, we analyse the pentadecamer oligodeoxynucleotide sequences corresponding to the never-expressed pentapeptides. RESULTS We find that only DNA context-dependent constraints (such as oligodeoxynucleotide sequence location in the minus strand, introns, pseudogenes, frameshifts, etc.) provide a coherent mechanistic platform to explain the occurrence of never-expressed versus frequent pentapeptides in the protein world. CONCLUSIONS This study is of importance in cell biology. Indeed, the rarity (or lack of expression) of specific 5-mer peptide modules implies the rarity (or lack of expression) of the corresponding n-mer peptide sequences (with n < 5), so possibly modulating protein compositional trends. Moreover the data might further our understanding of the role exerted by rare pentapeptide modules as critical biological effectors in protein-protein interactions.
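The pentapeptide census underlying this abstract amounts to sliding-window 5-mer counting over a set of protein sequences. The sketch below is a minimal illustration, not the authors' pipeline; the function names and the toy sequences are hypothetical, and windows containing non-standard residue codes are skipped as an assumption.

```python
from collections import Counter

AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def count_kmers(sequences, k=5):
    """Count overlapping k-mers made only of the 20 standard residues."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if all(aa in AMINO_ACIDS for aa in kmer):
                counts[kmer] += 1
    return counts


def absent_kmer_count(counts, k=5):
    """Number of possible k-mers (20**k) never observed in the counts."""
    return len(AMINO_ACIDS) ** k - len(counts)


# Toy "proteome" of two short sequences; X marks an unknown residue.
toy_proteome = ["MKTAYIAKQR", "AYIAKXQR"]
counts = count_kmers(toy_proteome)
print(counts.most_common(1), absent_kmer_count(counts))
```

With 20 residues there are 20^5 = 3,200,000 possible pentapeptides, so a full proteome-scale count fits comfortably in memory, and the zero-frequency set is obtained by enumerating the k-mers missing from the counter.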
Affiliation(s)
- Giovanni Capone
- Department of Biochemistry and Molecular Biology "Ernesto Quagliariello", University of Bari, Bari, Italy
- Giuseppe Novello
- Department of Biochemistry and Molecular Biology "Ernesto Quagliariello", University of Bari, Bari, Italy
- Candida Fasano
- Department of Biochemistry and Molecular Biology "Ernesto Quagliariello", University of Bari, Bari, Italy
- Brett Trost
- Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
- Mik Bickis
- Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, Canada
- Anthony Kusalik
- Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
- Darja Kanduc
- Department of Biochemistry and Molecular Biology "Ernesto Quagliariello", University of Bari, Bari, Italy
24
Coletta A, Pinney JW, Solís DYW, Marsh J, Pettifer SR, Attwood TK. Low-complexity regions within protein sequences have position-dependent roles. BMC Syst Biol 2010; 4:43. [PMID: 20385029 PMCID: PMC2873317 DOI: 10.1186/1752-0509-4-43] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.0] [Received: 10/13/2009] [Accepted: 04/13/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Regions of protein sequences with biased amino acid composition (so-called Low-Complexity Regions (LCRs)) are abundant in the protein universe. A number of studies have revealed that i) these regions show significant divergence across protein families; ii) the genetic mechanisms from which they arise lends them remarkable degrees of compositional plasticity. They have therefore proved difficult to compare using conventional sequence analysis techniques, and functions remain to be elucidated for most of them. Here we undertake a systematic investigation of LCRs in order to explore their possible functional significance, placed in the particular context of Protein-Protein Interaction (PPI) networks and Gene Ontology (GO)-term analysis. RESULTS In keeping with previous results, we found that LCR-containing proteins tend to have more binding partners across different PPI networks than proteins that have no LCRs. More specifically, our study suggests i) that LCRs are preferentially positioned towards the protein sequence extremities and, in contrast with centrally-located LCRs, such terminal LCRs show a correlation between their lengths and degrees of connectivity, and ii) that centrally-located LCRs are enriched with transcription-related GO terms, while terminal LCRs are enriched with translation and stress response-related terms. CONCLUSIONS Our results suggest not only that LCRs may be involved in flexible binding associated with specific functions, but also that their positions within a sequence may be important in determining both their binding properties and their biological roles.
Affiliation(s)
- Alain Coletta
- Faculty of Life Sciences, University of Manchester, Manchester M13 9PL, UK.
25
The universal trend of amino acid gain-loss is caused by CpG hypermutability. J Mol Evol 2008; 67:334-42. [PMID: 18810523 DOI: 10.1007/s00239-008-9141-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 03/23/2008] [Revised: 05/31/2008] [Accepted: 06/23/2008] [Indexed: 10/21/2022]
Abstract
Understanding the cause of the changes in the amino acid composition of proteins is essential for understanding the evolution of protein functions. Since the early 1970s, it has been known that the frequency of some amino acids in protein sequences is increasing and that of others is decreasing. Recently, it was found that the trends of amino acid changes were similar in 15 taxa representing Bacteria, Archaea, and Eukaryota. However, the cause of this similarity in the trend of the gains and losses of amino acids continued to be debated. Here, we show that this trend of the gain and loss of amino acids can be simply explained by CpG hypermutability. We found that the frequency of amino acids coded by codons with TpG dinucleotides and those with CpA dinucleotides is increasing, while that of amino acids coded by codons with CpG dinucleotides is decreasing. We also found that organisms that lack DNA methyltransferase show different trends of the gain and loss of amino acids. DNA methyltransferase methylates CpG dinucleotides and induces CpG hypermutability. The incorporation of CpG hypermutability into models of protein evolution will improve studies on protein evolution in different organisms.
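The grouping of amino acids by the dinucleotides in their codons can be made concrete with the standard genetic code. The sketch below is illustrative only: it considers dinucleotides inside a single codon, which is an assumption (the abstract does not say whether dinucleotides spanning codon boundaries were counted), and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def amino_acids_with(dinucleotide):
    """Amino acids having at least one codon that contains the dinucleotide."""
    return {
        aa
        for codon, aa in CODON_TABLE.items()
        if dinucleotide in codon and aa != "*"  # stop codons are ignored
    }


for dinuc in ("CG", "TG", "CA"):
    print(dinuc, "".join(sorted(amino_acids_with(dinuc))))
```

Under this within-codon assumption, CpG-containing codons encode Arg, Ser, Pro, Thr and Ala, while the TpG and CpA products of methylated-CpG deamination appear in codons for Cys, Trp, Leu, Met and Val, and for His, Gln, Ser, Pro, Thr and Ala, respectively.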
26
Towards completion of the Earth's proteome. EMBO Rep 2008; 8:1135-41. [PMID: 18059312 DOI: 10.1038/sj.embor.7401117] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 07/09/2007] [Accepted: 10/15/2007] [Indexed: 11/08/2022] Open
Abstract
New protein sequences are deposited in databases at an accelerating pace; however, many of these are homologous to known proteins and could be considered redundant. If all historical releases of the protein database are analysed using the original sequence-clustering procedure described here, the fraction of newly sequenced proteins that are redundant is increasing. We interpret this as an indication that the sequencing of the Earth's proteome--the complete set of proteins on Earth--is approaching completion. We estimate the approximate size of the Earth's proteome to be 5 million sequences, most of which will be identified during the next 5 years. As the Earth's proteome nears completion, cluster analysis of the protein database will become essential to identify under-explored taxa to which future sequencing efforts should be directed and to focus research on protein families without experimental characterization.
27
Sharon I, Birkland A, Chang K, El-Yaniv R, Yona G. Correcting BLAST e-values for low-complexity segments. J Comput Biol 2008; 12:980-1003. [PMID: 16201917 DOI: 10.1089/cmb.2005.12.980] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 01/09/2023] Open
Abstract
The statistical estimates of BLAST and PSI-BLAST are of extreme importance to determine the biological relevance of sequence matches. While being very effective in evaluating most matches, these estimates usually overestimate the significance of matches in the presence of low complexity segments. In this paper, we present a model, based on divergence measures and statistics of the alignment structure, that corrects BLAST e-values for low complexity sequences without filtering or excluding them and generates scores that are more effective in distinguishing true similarities from chance similarities. We evaluate our method and compare it to other known methods using the Gene Ontology (GO) knowledge resource as a benchmark. Various performance measures, including ROC analysis, indicate that the new model improves upon the state of the art. The program is available at biozon.org/ftp/ and www.cs.technion.ac.il/~itaish/lowcomp/.
Affiliation(s)
- Itai Sharon
- Department of Computer Science, Technion, Haifa, Israel
28
Huska MR, Buschmann H, Andrade-Navarro MA. BiasViz: visualization of amino acid biased regions in protein alignments. Bioinformatics 2007; 23:3093-4. [PMID: 17921493 DOI: 10.1093/bioinformatics/btm489] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022] Open
Abstract
About a third of all protein sequences have at least one composition biased region (CBR). Such regions might act as linkers between protein domains but often confer specific binding to various molecules; therefore, their characterization in terms of their boundaries and over-represented residues is important. Analysis of CBRs in a particular sequence can be time consuming if several types of biases have to be explored and their position visualized. Assessment of the significance of the detected CBRs can be approached by comparison to homologous protein sequences. To assist this procedure, we have developed BiasViz, a tool for graphically studying local amino acid composition in the protein sequences of a multiple sequence alignment.
Affiliation(s)
- Matthew R Huska
- Molecular Medicine, Ottawa Health Research Institute, 501 Smyth Road, Ottawa, ON, Canada.
29
Ogata H, Claverie JM. Unique genes in giant viruses: regular substitution pattern and anomalously short size. Genome Res 2007; 17:1353-61. [PMID: 17652424 PMCID: PMC1950904 DOI: 10.1101/gr.6358607] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Indexed: 11/24/2022]
Abstract
Large DNA viruses, including giant mimivirus with a 1.2-Mb genome, exhibit numerous orphan genes possessing no database homologs or genes with homologs solely in close members of the same viral family. Due to their solitary nature, the functions and evolutionary origins of those genes remain obscure. We examined sequence features and evolutionary rates of viral family-specific genes in three nucleo-cytoplasmic large DNA virus (NCLDV) lineages. First, we showed that the proportion of family-specific genes does not correlate with sequence divergence rate. Second, position-dependent nucleotide statistics were similar between family-specific genes and the remaining genes in the genome. Third, we showed that the synonymous-to-nonsynonymous substitution ratios in those viruses are at levels comparable to those estimated for vertebrate proteomes. Thus, the vast majority of family-specific genes do not exhibit an accelerated evolutionary rate, and are thus likely to specify functional polypeptides. On the other hand, these family-specific proteins exhibit several distinct properties: (1) they are shorter, (2) they include a larger fraction of predicted transmembrane proteins, and (3) they are enriched in low-complexity sequences. These results suggest that family-specific genes do not correspond to recent horizontal gene transfer. We propose that their characteristic features are the consequences of the specific evolutionary forces shaping the viral gene repertoires in the context of their parasitic lifestyles.
Affiliation(s)
- Hiroyuki Ogata
- Structural and Genomic Information Laboratory CNRS-UPR 2589, IBSM Parc Scientifique de Luminy, Case 934 13288 Marseille Cedex 9, France.
|
30
|
Huntley MA, Clark AG. Evolutionary Analysis of Amino Acid Repeats across the Genomes of 12 Drosophila Species. Mol Biol Evol 2007; 24:2598-609. [PMID: 17602168 DOI: 10.1093/molbev/msm129] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Repeated motifs of amino acids within proteins are an abundant feature of eukaryotic sequences and may catalyze the rapid production of genetic and even phenotypic variation among organisms. The completion of the genome sequencing projects of 12 distinct Drosophila species provides a unique dataset to study these intriguing sequence features on a phylogeny with a variety of timescales. We show that there is a higher percentage of proteins containing repeats within the Drosophila genus than most other eukaryotes, including non-Drosophila insects, which makes this collection of species particularly useful for the study of protein repeats. We also find that proteins containing repeats are overrepresented in functional categories involving developmental processes, signaling, and gene regulation. Using the set of 1-to-1 ortholog alignments for the 12 Drosophila species, we test the ability of repeats to act as reliable phylogenetic signals and find that they resolve the generally accepted phylogeny despite the noise caused by their accelerated rate of evolution. We also determine that in general the position of repeats within a protein sequence is non-random, with repeats more often being absent from the middle regions of sequences. Finally, we find evidence to suggest that the presence of repeats is associated with an increase in evolutionary rate upon the entire sequence in which they are embedded. With additional evidence to suggest a corresponding elevation in positive selection, we propose that some repeats may be inducing compensatory substitutions in their surrounding sequence.
Affiliation(s)
- Melanie A Huntley
- Department of Molecular Biology and Genetics Cornell University, USA.
|
32
|
Markov Additive Processes and Repeats in Sequences. J Appl Probab 2007. [DOI: 10.1017/s0021900200003132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Computer analysis of biological sequences often detects deviations from a random model. In the usual model, sequence letters are chosen independently, according to some fixed distribution over the relevant alphabet. Real biological sequences often contain simple repeats, however, which can be broadly characterized as multiple contiguous copies (usually inexact) of a specific word. This paper quantifies inexact simple repeats as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic distribution with two parameters (λ and k), which are given by general MAP formulas. The general MAP formulas are usually computationally intractable, but an essential simplification in the case of repeats permits λ and k to be computed from matrices whose dimension equals the size of the relevant alphabet. The simplification applies to some MAPs where the summand distributions do not depend on consecutive pairs of Markov states as usual, but on pairs with a fixed time-lag larger than one.
|
33
|
Hirose S, Shimizu K, Kanai S, Kuroda Y, Noguchi T. POODLE-L: a two-level SVM prediction system for reliably predicting long disordered regions. Bioinformatics 2007; 23:2046-53. [PMID: 17545177 DOI: 10.1093/bioinformatics/btm302] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
MOTIVATION Recent experimental and theoretical studies have revealed several proteins containing sequence segments that are unfolded under physiological conditions. These segments are called disordered regions. They are actively investigated because of their possible involvement in various biological processes, such as cell signaling, transcriptional and translational regulation. Additionally, disordered regions can represent a major obstacle to high-throughput proteome analysis and often need to be removed from experimental targets. The accurate prediction of long disordered regions is thus expected to provide annotations that are useful for a wide range of applications. RESULTS We developed Prediction Of Order and Disorder by machine LEarning (POODLE-L; L stands for long), a Support Vector Machine (SVM)-based method for predicting long disordered regions using 10 kinds of simple physico-chemical properties of amino acids. POODLE-L assembles the output of 10 two-level SVM predictors into a final prediction of disordered regions. The performance of POODLE-L for predicting long disordered regions, which exhibited a Matthews correlation coefficient of 0.658, was the highest when compared with eight well-established publicly available disordered region predictors. AVAILABILITY POODLE-L is freely available at http://mbs.cbrc.jp/poodle/poodle-l.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
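The Matthews correlation coefficient quoted in this abstract is a standard summary of binary classification quality. The sketch below shows how it is computed from a confusion matrix, assuming per-residue labels of disordered (positive) versus ordered (negative). The function name `mcc` and the example counts are hypothetical and are not taken from the POODLE-L evaluation.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/fn: disordered residues predicted as disordered/ordered.
    tn/fp: ordered residues predicted as ordered/disordered.
    Returns a value in [-1, 1]; 0.0 is returned by convention when
    any marginal total is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical counts, for illustration only.
print(round(mcc(tp=80, tn=90, fp=10, fn=20), 3))  # prints 0.704
```

Unlike plain accuracy, this coefficient stays informative when disordered residues are a small minority of the dataset, which is one common reason for reporting it for disorder predictors.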
|
35
|
Barney BM. Classification of proteins based on minimal modular repeats: lessons from nature in protein design. J Proteome Res 2007; 5:473-82. [PMID: 16512661 DOI: 10.1021/pr050103m] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Proteins containing internal repeats within their primary sequence have received increased attention recently, as the extent of their presence in various organisms is recognized more fully, and their role in evolution is more thoroughly studied. Presented here is a technique used to detect and classify proteins based on a modular evolutionary phenomenon that results in a series of small internal repeats. The parameters chosen are based on a minimum segment of seven residues that result in simple functional scaffolds. The genomes and corresponding proteomes of a variety of eubacteria and archaea have been analyzed using an algorithm that searches prokaryotic genomes for proteins containing small conserved repeats assembled in a modular fashion similar to a recently characterized protein from the organism Nitrosomonas europaea. This analysis has revealed additional proteins present in N. europaea with similar modular characteristics. A further survey of a variety of organisms demonstrates that this evolutionary pathway has been utilized in other organisms as well, to yield a broad assortment of small modular proteins. A thorough description of the sequential characteristics of these modular proteins follows, along with a selection and discussion of the various proteins uncovered through this expanded search and analysis. Several databases of the proteins uncovered from this work and the program used to perform the search are available.
Affiliation(s)
- Brett M Barney
- Department of Chemistry and Biochemistry, 0300 Old Main Hill, Utah State University, Logan, Utah 84322, USA.
|
36
|
Radivojac P, Iakoucheva LM, Oldfield CJ, Obradovic Z, Uversky VN, Dunker AK. Intrinsic disorder and functional proteomics. Biophys J 2007; 92:1439-56. [PMID: 17158572 PMCID: PMC1796814 DOI: 10.1529/biophysj.106.094045] [Citation(s) in RCA: 560] [Impact Index Per Article: 31.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2006] [Accepted: 11/15/2006] [Indexed: 11/18/2022] Open
Abstract
The recent advances in the prediction of intrinsically disordered proteins and the use of protein disorder prediction in the fields of molecular biology and bioinformatics are reviewed here, especially with regard to protein function. First, a close look is taken at intrinsically disordered proteins and then at the methods used for their experimental characterization. Next, the major statistical properties of disordered regions are summarized, and prediction models developed thus far are described, including their numerous applications in functional proteomics. The future of the prediction of protein disorder and the future uses of such predictions in functional proteomics comprise the last section of this article.
Affiliation(s)
- Predrag Radivojac
- School of Informatics, Indiana University, Bloomington, Indiana, USA
|
37
|
Weathers EA, Paulaitis ME, Woolf TB, Hoh JH. Insights into protein structure and function from disorder-complexity space. Proteins 2007; 66:16-28. [PMID: 17044059 DOI: 10.1002/prot.21055] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Intrinsically disordered proteins have a wide variety of important functional roles. However, the relationship between sequence and function in these proteins is significantly different than that for well-folded proteins. In a previous work, we showed that the propensity to be disordered can be recognized based on sequence composition alone. Here that analysis is furthered by examining the relationship of disorder propensity to sequence complexity, where the metrics for these two properties depend only on composition. The distributions of 40 amino acid peptides from both ordered and disordered proteins are graphed in this disorder-complexity space. An analysis of Swiss-Prot shows that most peptides have high complexity and relatively low disorder. However, there are also an appreciable number of low complexity-high disorder peptides in the database. In contrast, there are no low complexity-low disorder peptides. A similar analysis for peptides in the PDB reveals a much narrower distribution, with few peptides of low complexity and high disorder. In this case, the bounds of the disorder-complexity distribution are well defined and might be used to evaluate the likelihood that a peptide can be crystallized with current methods. The disorder-complexity distributions of individual proteins and sets of proteins grouped by function are also examined. Among individual proteins, there is an enormous variety of distributions that in some cases can be rationalized with regard to function. Groups of functionally related proteins are found to have distributions that are similar within each group but show notable differences between groups. Finally, a pattern matching algorithm is used to search for proteins with particular disorder-complexity distributions. The results suggest that this approach might be used to identify relationships between otherwise dissimilar proteins.
Affiliation(s)
- Edward A Weathers
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
|
38
|
Hernandez VP, Fallon AM. Histone H1-like, lysine-rich low complexity amino acid extensions in mosquito ribosomal proteins RpL23a and RpS6 have evolved independently. Arch Insect Biochem Physiol 2007; 64:100-10. [PMID: 17212354 DOI: 10.1002/arch.20163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/13/2023]
Abstract
Histone H1-like amino acid extensions have been described at the amino terminus of Drosophila RpL22 and RpL23a, and at the carboxyl terminus of mosquito ribosomal protein RpS6. An in silico search suggested that RpL23a, but not RpL22, in Anopheles gambiae has an amino-terminal extension. Because low complexity amino acid extensions are not common on eukaryotic ribosomal proteins, and their functions are unknown, we cloned cDNAs encoding RpL23a from Aedes albopictus and Anopheles stephensi mosquito cell lines. RpL23a proteins in Aedes and Anopheles mosquitoes are rich in lysine (approximately 25%), alanine (approximately 21%), and proline (approximately 8%), have a mass of approximately 40 kDa, a pI of 11.4 to 11.5, and contain an N-terminal extension of approximately 260 amino acid residues. The N-terminal extension in mosquito RpL23a is about 100 amino acids longer than that in the Drosophila RpL23a homolog, and contains several repeated amino acid motifs. Analysis of exon-intron organization in the An. gambiae and in D. melanogaster genes suggests that a short first exon encodes a series of 11 amino acid residues conserved in RpL23a proteins from Drosophila, mosquitoes, and the moth, Bombyx mori. The histone H1-like sequence in RpL23a is encoded entirely within the second exon. The C-terminal 126 amino acid residues of the RpL23a protein, encoded by exon 3 in Drosophila, and by exons 3 and 4 in Anopheles gambiae, are well conserved, and correspond to Escherichia coli RpL23 with the addition of the eukaryotic N-terminal nuclear localization sequence. Sequence comparisons indicate that the histone H1-like extensions on mosquito RpS6 and RpL23a have evolved independently of each other, and of histone H1 proteins.
Affiliation(s)
- Vida P Hernandez
- Department of Entomology, University of Minnesota, St. Paul, MN 55108, USA
|
39
|
Dosztányi Z, Chen J, Dunker AK, Simon I, Tompa P. Disorder and sequence repeats in hub proteins and their implications for network evolution. J Proteome Res 2007; 5:2985-95. [PMID: 17081050 DOI: 10.1021/pr060171o] [Citation(s) in RCA: 265] [Impact Index Per Article: 14.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/20/2023]
Abstract
Protein interaction networks display approximate scale-free topology, in which hub proteins that interact with a large number of other proteins determine the overall organization of the network. In this study, we aim to determine whether hubs are distinguishable from other networked proteins by specific sequence features. Proteins of different connectednesses were compared in the interaction networks of Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens with respect to the distribution of predicted structural disorder, sequence repeats, low complexity regions, and chain length. Highly connected proteins ("hub proteins") contained significantly more of, and a greater proportion of, these sequence features and tended to be longer overall as compared to less connected proteins. These sequence features provide two different functional means for realizing multiple interactions: (1) extended interaction surface and (2) flexibility and adaptability, providing a mechanism for the same region to bind distinct partners. Our view contradicts the prevailing view that scaling in protein interactomes arose from gene duplication and preferential attachment of equivalent proteins. We propose an alternative evolutionary network specialization process, in which certain components of the protein interactome improved their fitness for binding by becoming longer or accruing regions of disorder and/or internal repeats and have therefore become specialized in network organization.
Affiliation(s)
- Zsuzsanna Dosztányi
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, 1518 Budapest, Hungary
|
40
|
Li X, Kahveci T. A novel algorithm for identifying low-complexity regions in a protein sequence. Bioinformatics 2006; 22:2980-7. [PMID: 17018537 DOI: 10.1093/bioinformatics/btl495] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats. RESULTS We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM 62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG. AVAILABILITY The program is available on request.
Affiliation(s)
- Xuehui Li
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA.
|
41
|
Dicko C, Kenney JM, Vollrath F. β‐Silks: Enhancing and Controlling Aggregation. Adv Protein Chem 2006; 73:17-53. [PMID: 17190610 DOI: 10.1016/s0065-3233(06)73002-9] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/07/2023]
Abstract
It appears that fiber-forming proteins are not an exclusive group but that, with appropriate conditions, many proteins can potentially aggregate and form fibrils; though only certain proteins, for example, amyloids and silks, do so under normal physiological conditions. Even so, this suggests a ubiquitous aggregation mechanism in which the protein environment is at least as important as the sequence. An ideal model system in which forced and natural aggregation has been observed is silk. Silks have evolved specifically to readily form insoluble ordered structures with a wide range of structural functionality. The animal, be it silkworm or spider, will produce, store, and transport high molecular weight proteins in a complex environment to eventually allow formation of silk fibers with a variety of mechanical properties. Here we review fiber formation and its prerequisites, and discuss the mechanism by which the animal facilitates and modulates silk assembly to achieve controlled protein aggregation.
Affiliation(s)
- Cedric Dicko
- Zoology Department, Oxford University, OX1 3PS, United Kingdom
|
42
|
Huq NL, Cross KJ, Ung M, Reynolds EC. A review of protein structure and gene organisation for proteins associated with mineralised tissue and calcium phosphate stabilisation encoded on human chromosome 4. Arch Oral Biol 2005; 50:599-609. [PMID: 15892946 DOI: 10.1016/j.archoralbio.2004.12.009] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2004] [Accepted: 12/23/2004] [Indexed: 12/14/2022]
Abstract
Several proteins associated with mineralised tissue (teeth and bone) or involved in calcium phosphate stabilisation in the body fluids, milk and saliva have been mapped to the q arm of human chromosome 4. These include the dentine/bone proteins dentine sialophosphoprotein (DSPP), dentine matrix protein 1 (DMP1), bone sialoprotein (BSP), matrix extracellular phosphoglycoprotein, osteopontin (OPN), enamelin, ameloblastin, milk caseins, salivary statherin, and proline-rich proteins. The proposed function of those that are multiphosphorylated is: (i) the stabilisation of calcium phosphate in solution (e.g. casein, statherin) preventing spontaneous precipitation and seeded-crystal growth or (ii) promoting biomineralisation (e.g. the phosphophoryn domain of DSPP), where the protein described as a template macromolecule, is proposed to act as a nucleator/promoter of crystal growth. The genes of these proteins have been subjected to conserved chromosomal synteny during mammalian evolution. The multiphosphorylated proteins statherin, caseins, phosphophoryn, BSP and OPN have been characterised as intrinsically disordered. The codon usage patterns for the amino acid serine reveal a bias for AGC and AGT codons within the human genes dspp, dmp1 and bsp, mouse dspp and dmp1 but not significantly for statherin or caseins. This pattern was also observed in the gene encoding hen phosvitin that also contains stretches of multiphosphorylated serines and in the dmp1 gene sequences of mammalian, reptilian and avian classes. In conclusion, these intrinsically disordered multiphosphorylated proteins are the translation products of genes displaying examples of codon usage bias, internal repeats and conserved chromosomal synteny within the mammalian class.
Affiliation(s)
- N Laila Huq
- Cooperative Research Centre for Oral Health Science, School of Dental Science, The University of Melbourne, 711 Elizabeth Street, Melbourne, Vic. 3010, Australia
|
43
|
Gustiananda M, Liggins JR, Cummins PL, Gready JE. Conformation of prion protein repeat peptides probed by FRET measurements and molecular dynamics simulations. Biophys J 2004; 86:2467-83. [PMID: 15041684 PMCID: PMC1304095 DOI: 10.1016/s0006-3495(04)74303-9] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
We report the combined use of steady-state fluorescence resonance energy transfer (FRET) experiments and molecular dynamics (MD) simulations to investigate conformational distributions of the prion protein (PrP) repeat system. FRET was used for the first time to probe the distance, as a function of temperature and pH, between a donor Trp residue and an acceptor dansyl group attached to the N-terminus in seven model peptides containing one to three repeats of the second decarepeat of PrP from marsupial possum (PHPGGSNWGQ)nG, and one and two human PrP consensus octarepeats (PHGGGWGQ)nG. In multirepeat peptides, single-Trp mutants were made by replacing other Trp(s) with Phe. As previous work has shown PrP repeats do not adopt a single preferred stable conformation, the FRET values are averages reflecting heterogeneity in the donor-acceptor distances. The T-dependence of the conformational distributions, and derived average dansyl-Trp distances, were obtained directly from MD simulation of the marsupial dansyl-PHPGGSNWGQG peptide. The results show excellent agreement between the FRET and MD T-dependent distances, and demonstrate the remarkable sensitivity and reproducibility of the FRET method in this first-time use for a set of disordered peptides. Based on the results, we propose a model involving cation-π or π-π His-Trp interactions to explain the T- (5-85 °C) and pH- (6.0, 7.2) dependencies on distance, with HW i, i + 4 or WH i, i + 4 separations in sequence being more stable than HW i, i + 6 or WH i, i + 6 separations. The model has peptides adopting loosely folded conformations, with dansyl-Trp distances very much less than estimates for fully extended conformations, for example, approximately 16 vs. 33, approximately 21 vs. 69, and approximately 22 vs. 106 Å for 1-3 decarepeats, and approximately 14 vs. 25 and approximately 19 vs. 54 Å for 1-2 octarepeats, respectively. The study demonstrates the usefulness of combining FRET with MD, a combination reported only once previously. Initial "mapping" of the conformational distribution of flexible peptides by simulation can assist in designing and interpreting experiments using steady-state intensity methods, and indicating how time-resolved or anisotropy methods might be used.
Affiliation(s)
- Marsia Gustiananda
- Computational Proteomics Group, John Curtin School of Medical Research, Australian National University, Canberra ACT 2601, Australia
|
44
|
Mukhopadhyay R, Kumar S, Hoh JH. Molecular mechanisms for organizing the neuronal cytoskeleton. Bioessays 2004; 26:1017-25. [PMID: 15351972 DOI: 10.1002/bies.20088] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Neurofilaments and microtubules are important components of the neuronal cytoskeleton. In axons or dendrites, these filaments are aligned in parallel arrays, and separated from one another by nonrandom distances. This distinctive organization has been attributed to cross bridges formed by NF side arms or microtubule-associated proteins. We recently proposed a polymer-brush-based mechanism for regulating interactions between neurofilaments and between microtubules. In this model, the side arms of neurofilaments and the projection domains of microtubule-associated proteins are highly unstructured and exert long-range repulsive forces that are largely entropic in origin; these forces then act to organize the cytoskeleton in axons and dendrites. Here, we review the biochemical, biophysical, genetic and cell biological data for the polymer-brush and cross-bridging models. We explore how the data traditionally used to support cross bridging may be reconciled with a polymer-brush mechanism and compare the implications of recent experimental insights into axonal transport and physiology for each model.
|
45
|
Dunker AK, Brown CJ, Obradovic Z. Identification and functions of usefully disordered proteins. Adv Protein Chem 2004; 62:25-49. [PMID: 12418100 DOI: 10.1016/s0065-3233(02)62004-2] [Citation(s) in RCA: 291] [Impact Index Per Article: 13.9] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Affiliation(s)
- A Keith Dunker
- School of Molecular Biosciences, Washington State University, Pullman, Washington 99164, USA
|
46
|
Abstract
Late embryogenesis abundant (LEA) proteins are produced in maturing seeds and anhydrobiotic plants, animals and microorganisms, in which their expression correlates with desiccation tolerance. However, their function has remained obscure for 20 years. We argue that novel computational tools devised for non-globular proteins might now overcome this problem. Predictions arising from bioinformatics fit well with recent data on Group 3 proteins, which potentially form cytoskeletal filaments, and suggest experimentally testable functions for these and other LEA protein groups.
Affiliation(s)
- Michael J Wise
- Department of Genetics, University of Cambridge, Downing Street, CB2 3EH, Cambridge, UK
|
47
|
|
48
|
Abstract
The proportion of the genome encoding intrinsically unstructured proteins increases with the complexity of organisms, which demands specific mechanism(s) for generating novel genetic material of this sort. Here it is suggested that one such mechanism is the expansion of internal repeat regions, i.e., coding micro- and minisatellites. An analysis of 126 known unstructured sequences shows the preponderance of repeats: the percentage of proteins with tandemly repeated short segments is much higher in this class (39%) than earlier reported for all Swiss-Prot (14%), yeast (18%) or human (28%) proteins. Furthermore, prime examples, such as salivary proline-rich proteins, titin, eukaryotic RNA polymerase II, the prion protein and several others, demonstrate that the repetitive segments carry fundamental function in these proteins. In addition, their repeat numbers show functionally significant interspecies variation and polymorphism, which underlines that these regions have been shaped by intense evolutionary activity. In all, the major point of this paper is that the genetic instability of repetitive regions combined with the structurally and functionally permissive nature of unstructured proteins has powered the extension and possible functional expansion of this newly recognized protein class.
Affiliation(s)
- Peter Tompa
- Institute of Enzymology, Biological Research Center, Hungarian Academy of Sciences, 1518 Budapest, PO Box 7, Hungary.
|
49
|
Wan H, Li L, Federhen S, Wootton JC. Discovering simple regions in biological sequences associated with scoring schemes. J Comput Biol 2003; 10:171-85. [PMID: 12804090 DOI: 10.1089/106652703321825955] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Let A denote an alphabet consisting of n types of letters. Given a sequence S of length L with $v_i$ letters of type i on A, to describe the compositional properties and combinatorial structure of S, we propose a new complexity function of S, called the reciprocal complexity of S, as $C(S) = \prod_{i=1}^{n} \left( \frac{L}{n v_i} \right)^{v_i}$. Based on this complexity measure, an efficient algorithm is developed for classifying and analyzing simple segments of protein and nucleotide sequence databases associated with scoring schemes. The running time of the algorithm is nearly proportional to the sequence length. The program DSR corresponding to the algorithm was written in C++, associated with two parameters (window length and cutoff value) and a scoring matrix. Some examples regarding protein sequences illustrate how the method can be used to find regions. The first application of DSR is the masking of simple sequences for searching databases. Queries masked by DSR returned a manageable set of hits below the E-value cutoff score, which contained all true positive homologues. The second application is to study simple regions detected by the DSR program corresponding to known structural features of proteins. An extensive computational analysis has been made of protein sequences with known, physicochemically defined nonglobular segments. For the SWISS-PROT amino acid sequence database (Release 40.2 of 02-Nov-2001), we determine that the best parameters and the best BLOSUM matrix are, respectively, for automatic segmentation of amino acid sequences into nonglobular and globular regions by the DSR program: Window length k = 35, cutoff value b = 0.46, and the BLOSUM 62.5 matrix. The average "agreement accuracy (sensitivity)" of DSR segmentation for the SWISS-PROT database is 97.3%.
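The composition-only part of this measure can be sketched in a few lines. The example below assumes the formula reads as C(S) = ∏ (L / (n·v_i))^(v_i), taken over the n letter types with v_i the count of type i in a segment of length L. The function name `reciprocal_complexity` is hypothetical, and the sketch does not reproduce the DSR program: it has no scoring matrix, sliding window, or cutoff.

```python
import math
from collections import Counter


def reciprocal_complexity(segment: str, alphabet_size: int = 20) -> float:
    """Composition-based reciprocal complexity of one segment.

    Evaluates prod_i (L / (n * v_i)) ** v_i, where L is the segment
    length, n the alphabet size, and v_i the count of letter type i.
    Letter types that are absent (v_i = 0) contribute a factor of 1
    and are skipped.
    """
    length = len(segment)
    if length == 0:
        raise ValueError("segment must be non-empty")
    counts = Counter(segment)
    # Sum in log space so long segments do not underflow mid-product.
    log_value = sum(
        v * math.log(length / (alphabet_size * v)) for v in counts.values()
    )
    return math.exp(log_value)


# One of each of the 20 amino acids: perfectly even composition.
print(reciprocal_complexity("ACDEFGHIKLMNPQRSTVWY"))
# A homopolymer is the most biased composition possible.
print(reciprocal_complexity("AAAAAAAAAA"))
```

Under this reading, a perfectly even composition gives a value of 1 and a homopolymer of length L gives n^(-L), so smaller values indicate simpler, more biased segments.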
Affiliation(s)
- Honghui Wan
- National Center for Genome Resources, Santa Fe, NM 87505, USA.
|
50
|
Abstract
The current theory of protein evolution is that all contemporary proteins are derived from an ancestral subset. However, each new sequenced genome exhibits many genes with no detectable homologues in other species, leading to the paradoxical picture of a universal ancestor with more genes than any of its progeny. Standard explanations indicate that fast evolving genes might disappear into the 'twilight zone' of sequence similarity. Regardless of the size of the original ancestral subset, its origin and the potential mechanisms of its subsequent enlargement are rarely addressed. Sequencing of Rickettsia conorii genome recently led to the discovery of three families of repeat-mobile elements frequently inserted into the middle of protein coding genes. Although not yet identified in other species of bacteria, this discovery has provided the first clear evidence for the de novo creation of long protein segments (up to 50 amino acid residues) by repeat insertion. Based on previous results and theories on the coding potential of palindromic elements, we speculate that their insertion and mobility might have played a significant role in the early stages of protein evolution.
Affiliation(s)
- Jean-Michel Claverie
- Information Génétique et Structurale, CNRS-AVENTIS UMR 1889, Institut de Biologie Structurale et Microbiologie, Marseille, France.
|