1
Borriello E, Walker SI, Laubichler MD. Cell phenotypes as macrostates of the GRN dynamics. JOURNAL OF EXPERIMENTAL ZOOLOGY PART B-MOLECULAR AND DEVELOPMENTAL EVOLUTION 2020; 334:213-224. [DOI: 10.1002/jez.b.22938] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2018] [Revised: 02/16/2020] [Accepted: 02/17/2020] [Indexed: 01/04/2023]
Affiliation(s)
- Enrico Borriello
  - ASU‐SFI Center for Biosocial Complex Systems, Arizona State University, Tempe, Arizona
- Sara I. Walker
  - ASU‐SFI Center for Biosocial Complex Systems, Arizona State University, Tempe, Arizona
  - Beyond Center for Fundamental Concepts in Science, Arizona State University, Tempe, Arizona
  - School of Earth and Space Exploration, Arizona State University, Tempe, Arizona
  - Blue Marble Space Institute of Science, Seattle, Washington
- Manfred D. Laubichler
  - ASU‐SFI Center for Biosocial Complex Systems, Arizona State University, Tempe, Arizona
  - Santa Fe Institute, Santa Fe, New Mexico
  - Marine Biological Laboratory, Woods Hole, Massachusetts
  - School of Life Sciences, Arizona State University, Tempe, Arizona
2
Will WR, Brzovic P, Le Trong I, Stenkamp RE, Lawrenz MB, Karlinsey JE, Navarre WW, Main-Hester K, Miller VL, Libby SJ, Fang FC. The Evolution of SlyA/RovA Transcription Factors from Repressors to Countersilencers in Enterobacteriaceae. mBio 2019; 10:e00009-19. [PMID: 30837332 PMCID: PMC6401476 DOI: 10.1128/mbio.00009-19] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 01/02/2019] [Accepted: 01/29/2019] [Indexed: 02/02/2023] Open
Abstract
Gene duplication and subsequent evolutionary divergence have allowed conserved proteins to develop unique roles. The MarR family of transcription factors (TFs) has undergone extensive duplication and diversification in bacteria, where they act as environmentally responsive repressors of genes encoding efflux pumps that confer resistance to xenobiotics, including many antimicrobial agents. We have performed structural, functional, and genetic analyses of representative members of the SlyA/RovA lineage of MarR TFs, which retain some ancestral functions, including repression of their own expression and that of divergently transcribed multidrug efflux pumps, as well as allosteric inhibition by aromatic carboxylate compounds. However, SlyA and RovA have acquired the ability to countersilence horizontally acquired genes, which has greatly facilitated the evolution of Enterobacteriaceae by horizontal gene transfer. SlyA/RovA TFs in different species have independently evolved novel regulatory circuits to provide the enhanced levels of expression required for their new role. Moreover, in contrast to MarR, SlyA is not responsive to copper. These observations demonstrate the ability of TFs to acquire new functions as a result of evolutionary divergence of both cis-regulatory sequences and in trans interactions with modulatory ligands. IMPORTANCE Bacteria primarily evolve via horizontal gene transfer, acquiring new traits such as virulence and antibiotic resistance in single transfer events. However, newly acquired genes must be integrated into existing regulatory networks to allow appropriate expression in new hosts. This is accommodated in part by the opposing mechanisms of xenogeneic silencing and countersilencing. An understanding of these mechanisms is necessary to understand the relationship between gene regulation and bacterial evolution.
Here we examine the functional evolution of an important lineage of countersilencers belonging to the ancient MarR family of classical transcriptional repressors. We show that although members of the SlyA lineage retain some ancestral features associated with the MarR family, their cis-regulatory sequences have evolved significantly to support their new function. Understanding the mechanistic requirements for countersilencing is critical to understanding the pathoadaptation of emerging pathogens and also has practical applications in synthetic biology.
Affiliation(s)
- W Ryan Will
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
- Peter Brzovic
  - Department of Biochemistry, University of Washington, Seattle, Washington, USA
- Isolde Le Trong
  - Department of Biological Structure, University of Washington, Seattle, Washington, USA
- Ronald E Stenkamp
  - Department of Biochemistry, University of Washington, Seattle, Washington, USA
  - Department of Biological Structure, University of Washington, Seattle, Washington, USA
- Matthew B Lawrenz
  - Department of Microbiology and Immunology and the Center for Predictive Medicine for Biodefense and Emerging Infectious Diseases, University of Louisville School of Medicine, Louisville, Kentucky, USA
- Joyce E Karlinsey
  - Department of Microbiology, University of Washington, Seattle, Washington, USA
- William W Navarre
  - Department of Microbiology, University of Washington, Seattle, Washington, USA
- Kara Main-Hester
  - Department of Microbiology, University of Washington, Seattle, Washington, USA
- Virginia L Miller
  - Department of Microbiology and Immunology, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
  - Department of Genetics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA
- Stephen J Libby
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
- Ferric C Fang
  - Department of Laboratory Medicine, University of Washington, Seattle, Washington, USA
  - Department of Microbiology, University of Washington, Seattle, Washington, USA
3
Strygina KV, Börner A, Khlestkina EK. Identification and characterization of regulatory network components for anthocyanin synthesis in barley aleurone. BMC PLANT BIOLOGY 2017; 17:184. [PMID: 29143621 PMCID: PMC5688479 DOI: 10.1186/s12870-017-1122-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 05/08/2023]
Abstract
BACKGROUND Among natural populations, there are different colours of barley (Hordeum vulgare L.). The colour of barley grains is directly related to the accumulation of different pigments in the aleurone layer, pericarp and lemma. Blue grain colour is due to the accumulation of anthocyanins in the aleurone layer, which is dependent on the presence of five Blx genes that have not yet been sequenced (Blx1, Blx3 and Blx4 genes clustering on chromosome 4HL and Blx2 and Blx5 on 7HL). Due to the health benefits of anthocyanins, blue-grained barley can be considered as a source of dietary food. The goal of the current study was to identify and characterize components of the anthocyanin synthesis regulatory network for the aleurone layer in barley. RESULTS The candidate genes for components of the regulatory complex MBW (consisting of transcription factors MYB, bHLH/MYC and WD40) for anthocyanin synthesis in barley aleurone were identified. These genes were designated HvMyc2 (4HL), HvMpc2 (4HL), and HvWD40 (6HL). HvMyc2 was expressed in aleurone cells only. A loss-of-function (frame shift) mutation in HvMyc2 of non-coloured compared to blue-grained barley was revealed. Unlike aleurone-specific HvMyc2, the HvMpc2 gene was expressed in different tissues; however, its activity was not detected in non-coloured aleurone in contrast to a coloured aleurone, and allele-specific mutations in its promoter region were found. The single-copy gene HvWD40, which encodes the required component of the regulatory MBW complex, was expressed constantly in coloured and non-coloured tissues and had no allelic differences. HvMyc2 and HvMpc2 were genetically mapped using newly developed allele-specific CAPS markers. HvMyc2 was mapped in position between SSR loci XGBS0875-4H (3.4 cM distal) and XGBM1048-4H (3.4 cM proximal) matching the region of chromosome 4HL where the Blx-cluster was found.
In this position, one of the anthocyanin biosynthesis structural genes (HvF3'5'H) was also mapped using an allele-specific CAPS-marker developed in the current study. CONCLUSIONS The genes involved in anthocyanin synthesis in the barley aleurone layer were identified and characterized, including components of the regulatory complex MBW, from which the MYC-encoding gene (HvMyc2) appeared to be the main factor underlying variation of barley by aleurone colour.
Affiliation(s)
- Ksenia V. Strygina
  - Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Lavrentjeva ave. 10, Novosibirsk, 630090 Russia
- Andreas Börner
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Corrensstr. 3, 06466 Stadt Seeland, OT Gatersleben Germany
- Elena K. Khlestkina
  - Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Lavrentjeva ave. 10, Novosibirsk, 630090 Russia
  - Novosibirsk State University, Pirogova str., 1, Novosibirsk, 630090 Russia
4
Gene-Family Extension Measures and Correlations. Life (Basel) 2016; 6:life6030030. [PMID: 27527218 PMCID: PMC5041006 DOI: 10.3390/life6030030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/02/2016] [Revised: 07/18/2016] [Accepted: 07/18/2016] [Indexed: 12/28/2022] Open
Abstract
The existence of multiple copies of genes is a well-known phenomenon. A gene family is a set of sufficiently similar genes, formed by gene duplication. In earlier works conducted on a limited number of completely sequenced and annotated genomes it was found that size of gene family and size of genome are positively correlated. Additionally, it was found that several atypical microbes deviated from the observed general trend. In this study, we reexamined these associations on a larger dataset consisting of 1484 prokaryotic genomes and using several ranking approaches. We applied ranking methods in such a way that genomes with lower numbers of gene copies would have lower rank. Until now only simple ranking methods were used; we applied the Kemeny optimal aggregation approach as well. Regression and correlation analysis were utilized in order to accurately quantify and characterize the relationships between measures of paralog indices and genome size. In addition, boxplot analysis was employed as a method for outlier detection. We found that, in general, all paralog indexes positively correlate with an increase of genome size. As expected, different groups of atypical prokaryotic genomes were found for different types of paralog quantities. Mycoplasmataceae and Halobacteria appeared to be among the most interesting candidates for further research of evolution through gene duplication.
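The rank-based correlation analysis described in this abstract can be illustrated with a minimal sketch. It computes a Spearman rank correlation between genome size and a paralog count using only the Python standard library. The function names and the toy numbers are illustrative assumptions, not the study's data or code, and the Kemeny aggregation step mentioned in the abstract is not shown.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: genome sizes (Mb) and paralog counts for five genomes.
genome_size = [0.6, 1.8, 2.9, 4.6, 6.3]
paralog_count = [12, 85, 140, 610, 590]
rho = spearman(genome_size, paralog_count)
```

Working on ranks rather than raw values keeps the measure insensitive to the scale of any particular paralog index, which is one reason rank-based comparisons suit heterogeneous genome collections.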
5
Cai S, Liu Z, Lee HC. Mean field theory for biology inspired duplication-divergence network model. CHAOS (WOODBURY, N.Y.) 2015; 25:083106. [PMID: 26328557 DOI: 10.1063/1.4928212] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
The duplication-divergence network model is generally thought to incorporate key ingredients underlying the growth and evolution of protein-protein interaction networks. Properties of the model have been elucidated through numerous simulation studies. However, a comprehensive theoretical study of the model is lacking. Here, we derived analytic expressions for quantities describing key characteristics of the network (the average degree, the degree distribution, the clustering coefficient, and the neighbor connectivity) in the mean-field, large-N limit of an extended version of the model, duplication-divergence complemented with heterodimerization and addition. We carried out extensive simulations and verified excellent agreement between simulation and theory except for one partial case. All four quantities obeyed power-laws even at moderate network size (N ∼ 10^4), except the degree distribution, which had an additional exponential factor observed to obey power-law. It is shown that our network model can lead to the emergence of scale-free property and hierarchical modularity simultaneously, reproducing the important topological properties of real protein-protein interaction networks.
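As a rough illustration of the kind of growth process this abstract refers to, the sketch below simulates a basic duplication-divergence rule with an optional parent-copy link standing in for heterodimerization. It is a simplified variant written under stated assumptions, not the extended model analyzed in the paper: the parameter names `delta` and `p_dimer` are hypothetical, and the "addition" step is omitted.

```python
import random


def duplication_divergence(n_nodes, delta=0.6, p_dimer=0.1, seed=0):
    """Grow an undirected graph by node duplication followed by divergence.

    delta   -- probability that the copy loses each inherited edge (assumed name)
    p_dimer -- probability of linking the copy to its parent (assumed name)
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}  # seed graph: a single edge
    while len(adj) < n_nodes:
        parent = rng.choice(sorted(adj))
        child = len(adj)
        # Duplication: the copy inherits the parent's neighbours.
        # Divergence: each inherited edge is dropped with probability delta.
        kept = {v for v in sorted(adj[parent]) if rng.random() >= delta}
        # Heterodimerization-like step: optionally connect copy and parent.
        if rng.random() < p_dimer:
            kept.add(parent)
        adj[child] = kept
        for v in kept:
            adj[v].add(child)
    return adj


def average_degree(adj):
    """Mean number of neighbours per node."""
    return sum(len(neighbours) for neighbours in adj.values()) / len(adj)


graph = duplication_divergence(500, delta=0.6, p_dimer=0.1, seed=1)
mean_k = average_degree(graph)
```

Averaging `mean_k` over many seeds and network sizes is the sort of simulation that can be compared against a mean-field expression; here the copy may end up isolated, which is a property of this simplified rule.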
Affiliation(s)
- Shuiming Cai
  - Faculty of Science, Jiangsu University, Zhenjiang 212013, China
- Zengrong Liu
  - Institute of Systems Biology, Shanghai University, Shanghai 200444, China
- H C Lee
  - Institute of Systems Biology and Bioinformatics, National Central University, Zhongli, 32001 Taiwan
6
Espinoza-Valles I, Vora GJ, Lin B, Leekitcharoenphon P, González-Castillo A, Ussery D, Høj L, Gomez-Gil B. Unique and conserved genome regions in Vibrio harveyi and related species in comparison with the shrimp pathogen Vibrio harveyi CAIM 1792. MICROBIOLOGY-SGM 2015. [PMID: 26198743 DOI: 10.1099/mic.0.000141] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/30/2022]
Abstract
Vibrio harveyi CAIM 1792 is a marine bacterial strain that causes mortality in farmed shrimp in north-west Mexico, and the identification of virulence genes in this strain is important for understanding its pathogenicity. The aim of this work was to compare the V. harveyi CAIM 1792 genome with related genome sequences to determine their phylogenic relationship and explore unique regions in silico that differentiate this strain from other V. harveyi strains. Twenty-one newly sequenced genomes were compared in silico against the CAIM 1792 genome at nucleotidic and predicted proteome levels. The proteome of CAIM 1792 had higher similarity to those of other V. harveyi strains (78%) than to those of the other closely related species Vibrio owensii (67%), Vibrio rotiferianus (63%) and Vibrio campbellii (59%). Pan-genome ORFans trees showed the best fit with the accepted phylogeny based on DNA-DNA hybridization and multi-locus sequence analysis of 11 concatenated housekeeping genes. SNP analysis clustered 34/38 genomes within their accepted species. The pangenomic and SNP trees showed that V. harveyi is the most conserved of the four species studied and V. campbellii may be divided into at least three subspecies, supported by intergenomic distance analysis. blastp atlases were created to identify unique regions among the genomes most related to V. harveyi CAIM 1792; these regions included genes encoding glycosyltransferases, specific type restriction modification systems and a transcriptional regulator, LysR, reported to be involved in virulence, metabolism, quorum sensing and motility.
Affiliation(s)
- Gary J Vora
  - Center for Bio/Molecular Science & Engineering, Naval Research Laboratory, Washington, DC, USA
- Baochuan Lin
  - Center for Bio/Molecular Science & Engineering, Naval Research Laboratory, Washington, DC, USA
- Pimlapas Leekitcharoenphon
  - National Food Institute, Division for Epidemiology and Microbial Genomics, Technical University of Denmark, Kongens Lyngby, Denmark
  - Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark
- Dave Ussery
  - Department of Systems Biology, Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark
  - Comparative Genomics group, Biosciences Division, Oak Ridge National Labs, Oak Ridge, Tennessee, USA
- Lone Høj
  - Australian Institute of Marine Science, Townsville, Queensland, Australia
- Bruno Gomez-Gil
  - CIAD A.C., Mazatlán Unit for Aquaculture, Mazatlán, Sinaloa, Mexico
7
Yadav A, Jalan S. Origin and implications of zero degeneracy in networks spectra. CHAOS (WOODBURY, N.Y.) 2015; 25:043110. [PMID: 25933658 DOI: 10.1063/1.4917286] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 05/11/2023]
Abstract
The spectra of many real-world networks exhibit properties which are different from those of random networks generated using various models. One such property is the existence of a very high degeneracy at the zero eigenvalue. In this work, we provide all the possible reasons behind the occurrence of the zero degeneracy in the network spectra, namely, the complete and partial duplications, as well as their implications. The power-law degree sequence and the preferential attachment are the properties that enhance the occurrence of such duplications, hence leading to the zero degeneracy. A comparison of the zero degeneracy in protein-protein interaction networks of six different species and in their corresponding model networks indicates the importance of the degree sequences and the power-law exponent for the occurrence of zero degeneracy.
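The link between complete duplication and zero eigenvalues can be checked on small graphs. The sketch below is an illustration under simple assumptions, not the authors' method: since an adjacency matrix is symmetric, the multiplicity of its zero eigenvalue equals the number of nodes minus the matrix rank, and two nodes with identical neighbour sets contribute identical rows, which lowers the rank. The rank is computed exactly with rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def adjacency_matrix(n, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph on nodes 0..n-1."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = matrix[v][u] = 1
    return matrix


def rank(matrix):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            if m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def zero_degeneracy(n, edges):
    """Multiplicity of the zero eigenvalue of the adjacency matrix."""
    return n - rank(adjacency_matrix(n, edges))


# Path 0-1-2: nodes 0 and 2 share the neighbour set {1} (a complete duplication).
path = zero_degeneracy(3, [(0, 1), (1, 2)])
# Star with three leaves: all leaves are duplicates of one another.
star = zero_degeneracy(4, [(0, 1), (0, 2), (0, 3)])
# Triangle: no two nodes have the same neighbour set.
triangle = zero_degeneracy(3, [(0, 1), (1, 2), (0, 2)])
```

In these toy cases each extra duplicate node adds one zero eigenvalue, which matches the intuition that duplication-rich networks show a pronounced peak at zero.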
Affiliation(s)
- Alok Yadav
  - Complex Systems Lab, Discipline of Physics, Indian Institute of Technology Indore, Indore 452017, India
- Sarika Jalan
  - Complex Systems Lab, Discipline of Physics, Indian Institute of Technology Indore, Indore 452017, India
8
Nunes A, Borrego MJ, Gomes JP. Genomic features beyond Chlamydia trachomatis phenotypes: what do we think we know? INFECTION GENETICS AND EVOLUTION 2013; 16:392-400. [PMID: 23523596 DOI: 10.1016/j.meegid.2013.03.018] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 01/11/2013] [Revised: 02/25/2013] [Accepted: 03/13/2013] [Indexed: 10/27/2022]
Abstract
The obligate intracellular pathogen Chlamydia trachomatis is the causative agent of the blinding trachoma and the world's leading cause of bacterial sexually transmitted infections. Despite aggressive antibacterial control measures, C. trachomatis infections have been increasing, constituting a serious public health concern due to its morbidity and socioeconomic burden. Still, very little is known about the molecular basis underlying the phenotypic disparities observed among C. trachomatis serovars in terms of tissue tropism (ocular conjunctiva, epithelial-genitalia and lymph nodes), virulence (disease outcomes) and ecological success. This is in part due to the lack of straightforward tools to genetically manipulate Chlamydiae and host cell-free growth systems, hampering the elucidation of the biological role of loci. The recent release of tens of full-genome C. trachomatis sequences depicts a strain-clustering scenario reflecting the organ/cell-type that they preferentially infect. However, the high degree of genomic conservation implies that few genetic features are involved in phenotypic dissimilarities. The purpose of this review is to gather the most relevant data dispersed throughout the literature concerning the genotypic evidence that supports niche-specific phenotypes. This review focuses on chromosomal dynamics phenomena like recombination and point-mutations, essentially involving outer and inclusion membrane proteins, type III secretion effectors, and hypothetical proteins with unknown function. The scrutiny of C. trachomatis loci involved in tissue tropism, pathogenesis and ecological success is crucial for the development of disease-specific prophylaxis.
Affiliation(s)
- Alexandra Nunes
  - Department of Infectious Diseases, National Institute of Health, Av. Padre Cruz, 1649-016 Lisbon, Portugal.
9
Roach JM, Racioppi L, Jones CD, Masci AM. Phylogeny of Toll-like receptor signaling: adapting the innate response. PLoS One 2013; 8:e54156. [PMID: 23326591 PMCID: PMC3543326 DOI: 10.1371/journal.pone.0054156] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 09/28/2012] [Accepted: 12/10/2012] [Indexed: 02/06/2023] Open
Abstract
The Toll-like receptors represent a largely evolutionarily conserved pathogen recognition machinery responsible for recognition of bacterial, fungal, protozoan, and viral pathogen associated microbial patterns and initiation of inflammatory response. Structurally the Toll-like receptors are comprised of an extracellular leucine rich repeat domain and a cytoplasmic Toll/Interleukin 1 receptor domain. Recognition takes place in the extracellular domain whereas the cytoplasmic domain triggers a complex signal network required to sustain appropriate immune response. Signal transduction is regulated by the recruitment of different intracellular adaptors. The Toll-like receptors can be grouped depending on the usage of the adaptor, MyD88, into MyD88-dependent and MyD88-independent subsets. Herein, we present a unique phylogenetic analysis of domain regions of these receptors and their cognate signaling adaptor molecules. Although previously unclear from the phylogeny of full length receptors, these analyses indicate a separate evolutionary origin for the MyD88-dependent and MyD88-independent signaling pathway and provide evidence of a common ancestor for the vertebrate and invertebrate orthologs of the adaptor molecule MyD88. Together these observations suggest a very ancient origin of the MyD88-dependent pathway. Additionally, we show that early duplications gave rise to several adaptor molecule families. In some cases there is also a strong pattern of parallel duplication between adaptor molecules and their corresponding TLR. Our results further support the hypothesis that phylogeny of specific domains involved in signaling pathway can shed light on key processes that link innate to adaptive immune response.
Affiliation(s)
- Jeffrey M. Roach
  - Research Computing Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Luigi Racioppi
  - Department of Medicine, Duke University, Durham, North Carolina, United States of America
  - Department of Cellular and Molecular Biology and Pathology, University of Naples Federico II, Naples, Italy
- Corbin D. Jones
  - Department of Biology, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Anna Maria Masci
  - Department of Immunology, Duke University, Durham, North Carolina, United States of America
10
Early Career Research Award Lecture. Structure, evolution and dynamics of transcriptional regulatory networks. Biochem Soc Trans 2011; 38:1155-78. [PMID: 20863280 DOI: 10.1042/bst0381155] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 01/09/2023]
Abstract
The availability of entire genome sequences and the wealth of literature on gene regulation have enabled researchers to model an organism's transcriptional regulation system in the form of a network. In such a network, TFs (transcription factors) and TGs (target genes) are represented as nodes and regulatory interactions between TFs and TGs are represented as directed links. In the present review, I address the following topics pertaining to transcriptional regulatory networks. (i) Structure and organization: first, I introduce the concept of networks and discuss our understanding of the structure and organization of transcriptional networks. (ii) Evolution: I then describe the different mechanisms and forces that influence network evolution and shape network structure. (iii) Dynamics: I discuss studies that have integrated information on dynamics such as mRNA abundance or half-life, with data on transcriptional network in order to elucidate general principles of regulatory network dynamics. In particular, I discuss how cell-to-cell variability in the expression level of TFs could permit differential utilization of the same underlying network by distinct members of a genetically identical cell population. Finally, I conclude by discussing open questions for future research and highlighting the implications for evolution, development, disease and applications such as genetic engineering.
11
Soyer OS, Creevey CJ. Duplicate retention in signalling proteins and constraints from network dynamics. J Evol Biol 2010; 23:2410-21. [DOI: 10.1111/j.1420-9101.2010.02101.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 01/09/2023]
12
Abstract
From a comparatively small number of protein structural domains a staggering array of structural variants has evolved which has, in turn, facilitated an expanse of functional derivatives. Herein I review the primary mechanisms which have contributed to the vastness of our existing, and expanding, protein repertoires.
Affiliation(s)
- Roy D Sleator
  - Department of Biological Sciences, Cork Institute of Technology.
13
Farré D, Albà MM. Heterogeneous patterns of gene-expression diversification in mammalian gene duplicates. Mol Biol Evol 2009; 27:325-35. [PMID: 19822635 DOI: 10.1093/molbev/msp242] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 12/29/2022] Open
Abstract
Gene duplication is a major mechanism for molecular evolutionary innovation. Young gene duplicates typically exhibit elevated rates of protein evolution and, according to a number of recent studies, increased expression divergence. However, the nature of these changes is still poorly understood. To gain novel insights into the functional consequences of gene duplication, we have undertaken an in-depth analysis of a large data set of gene families containing primate- and/or rodent-specific gene duplicates. We have found a clear tendency toward an increase in protein, promoter, and expression divergence with increasing number of duplication events undergone by each gene since the human-mouse split. In addition, gene duplication is significantly associated with a reduction in expression breadth and intensity. Interestingly, it is possible to identify three main groups regarding the evolution of gene expression following gene duplication. The first group, which comprises around 25% of the families, shows patterns compatible with tissue-expression partitioning. The second and largest group, comprising 33-53% of the families, shows broad expression of one of the gene copies and reduced, overlapping, expression of the other copy or copies. This can be attributed, in most cases, to loss of expression in several tissues of one or more gene copies. Finally, a substantial number of families, 19-35%, maintain a very high level of tissue-expression overlap (>0.8) after tens of millions of years of evolution. These families may have been subject to selection for increased gene dosage.
14
Abstract
It has been known for more than 35 years that, during evolution, new proteins are formed by gene duplications, sequence and structural divergence and, in many cases, gene combinations. The genome projects have produced complete, or almost complete, descriptions of the protein repertoires of over 600 distinct organisms. Analyses of these data have dramatically increased our understanding of the formation of new proteins. At the present time, we can accurately trace the evolutionary relationships of about half the proteins found in most genomes, and it is these proteins that we discuss in the present review. Usually, the units of evolution are protein domains that are duplicated, diverge and form combinations. Small proteins contain one domain, and large proteins contain combinations of two or more domains. Domains descended from a common ancestor are clustered into superfamilies. In most genomes, the net growth of superfamily members means that more than 90% of domains are duplicates. In a section on domain duplications, we discuss the number of currently known superfamilies, their size and distribution, and superfamily expansions related to biological complexity and to specific lineages. In a section on divergence, we describe how sequences and structures diverge, the changes in stability produced by acceptable mutations, and the nature of functional divergence and selection. In a section on domain combinations, we discuss their general nature, the sequential order of domains, how combinations modify function, and the extraordinary variety of the domain combinations found in different genomes. We conclude with a brief note on other forms of protein evolution and speculations of the origins of the duplication, divergence and combination processes.
15
Kummerfeld SK, Teichmann SA. Protein domain organisation: adding order. BMC Bioinformatics 2009; 10:39. [PMID: 19178743 PMCID: PMC2657131 DOI: 10.1186/1471-2105-10-39] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.1] [Received: 05/07/2008] [Accepted: 01/29/2009] [Indexed: 11/30/2022] Open
Abstract
Background Domains are the building blocks of proteins. During evolution, they have been duplicated, fused and recombined, to produce proteins with novel structures and functions. Structural and genome-scale studies have shown that pairs or groups of domains observed together in a protein are almost always found in only one N to C terminal order and are the result of a single recombination event that has been propagated by duplication of the multi-domain unit. Previous studies of domain organisation have used graph theory to represent the co-occurrence of domains within proteins. We build on this approach by adding directionality to the graphs and connecting nodes based on their relative order in the protein. Most of the time, the linear order of domains is conserved. However, using the directed graph representation we have identified non-linear features of domain organization that are over-represented in genomes. Recognising these patterns and unravelling how they have arisen may allow us to understand the functional relationships between domains and understand how the protein repertoire has evolved. Results We identify groups of domains that are not linearly conserved, but instead have been shuffled during evolution so that they occur in multiple different orders. We consider 192 genomes across all three kingdoms of life and use domain and protein annotation to understand their functional significance. To identify these features and assess their statistical significance, we represent the linear order of domains in proteins as a directed graph and apply graph theoretical methods. We describe two higher-order patterns of domain organisation: clusters and bi-directionally associated domain pairs and explore their functional importance and phylogenetic conservation. Conclusion Taking into account the order of domains, we have derived a novel picture of global protein organization. 
We found that all genomes have a higher than expected degree of clustering and more domain pairs in forward and reverse orientation in different proteins relative to random graphs with identical degree distributions. While these features were statistically over-represented, they are still fairly rare. Looking in detail at the proteins involved, we found strong functional relationships within each cluster. In addition, the domains tended to be involved in protein-protein interaction and are able to function as independent structural units. A particularly striking example was the human Jak-STAT signalling pathway which makes use of a set of domains in a range of orders and orientations to provide nuanced signaling functionality. This illustrated the importance of functional and structural constraints (or lack thereof) on domain organisation.
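The directed-graph representation of domain order described in this abstract can be sketched in a few lines. The example below assumes that an edge is drawn between domains that are directly adjacent in N to C terminal order (the abstract does not specify whether non-adjacent pairs are also linked), and it uses made-up toy architectures rather than real annotation. The function names are hypothetical.

```python
from collections import defaultdict


def domain_order_graph(proteins):
    """Map each directed edge (a, b) to the proteins where domain a directly precedes b."""
    edges = defaultdict(set)
    for name, domains in proteins.items():
        for a, b in zip(domains, domains[1:]):
            edges[(a, b)].add(name)
    return edges


def bidirectional_pairs(edges):
    """Domain pairs observed in both orders, i.e. edges (a, b) and (b, a) both exist."""
    return sorted(
        {tuple(sorted((a, b))) for (a, b) in edges if a != b and (b, a) in edges}
    )


# Hypothetical toy architectures, listed N to C terminal (not real proteins).
toy_proteins = {
    "P1": ["SH2", "Kinase"],
    "P2": ["Kinase", "SH2"],
    "P3": ["SH3", "SH2", "Kinase"],
}
graph = domain_order_graph(toy_proteins)
pairs = bidirectional_pairs(graph)  # [("Kinase", "SH2")]
```

Assessing whether such pairs are over-represented would additionally require comparison against randomized graphs with the same degree distribution, which this sketch does not attempt.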
Affiliation(s)
- Sarah K Kummerfeld
- Department of Developmental Biology, 279 Campus Dr, Stanford, 94305, CA, USA.
|
16
|
Sales-Pardo M, Chan AOB, Amaral LAN, Guimerà R. Evolution of protein families: is it possible to distinguish between domains of life? Gene 2007; 402:81-93. [PMID: 17826006 PMCID: PMC2441766 DOI: 10.1016/j.gene.2007.07.029] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2007] [Revised: 07/18/2007] [Accepted: 07/23/2007] [Indexed: 11/28/2022]
Abstract
Understanding evolutionary relationships between species can shed new light on the rooting of the tree of life and the origin of eukaryotes, which explains the long-standing interest in accurately assessing evolutionary parameters at time scales on the order of a billion years. Prior work suggests large variability in molecular substitution rates; however, we still do not know whether such variability is due to species-specific trends at a genomic scale, or whether it can be attributed to the fluctuations inherent in any stochastic process. Here, we study the statistical properties of gene and protein-family sizes in order to quantify the long time scale evolutionary differences and similarities across species. We first determine the protein families of 209 species of bacteria and 20 species of archaea. We find that we are unable to reject the null hypothesis that the protein-family sizes of these species are drawn from the same distribution. In addition, we find that for species classified in the same phylogenetic branch or in the same lifestyle group, family size distributions are not significantly more similar than for species in different branches. These two findings can be accounted for in terms of a dynamical birth, death, and innovation model that assumes identical protein-family evolutionary rates for all species. Our theoretical and empirical results thus strongly suggest that the variability empirically observed in protein-family size distributions is compatible with the expected stochastic fluctuations for an evolutionary process with identical genomic evolutionary rates. Our findings hold special importance for the plausibility of some theories of the origin of eukaryotes which require drastic changes in evolutionary rates for some period during the last 2 billion years.
Affiliation(s)
- Marta Sales-Pardo
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA.
|
17
|
Zhang Z, Liu C, Skogerbø G, Zhu X, Lu H, Chen L, Shi B, Zhang Y, Wang J, Wu T, Chen R. Dynamic changes in subgraph preference profiles of crucial transcription factors. PLoS Comput Biol 2006; 2:e47. [PMID: 16699597 PMCID: PMC1458966 DOI: 10.1371/journal.pcbi.0020047] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2005] [Accepted: 03/24/2006] [Indexed: 12/05/2022] Open
Abstract
Transcription factors with a large number of target genes—transcription hub(s), or THub(s)—are usually crucial components of the regulatory system of a cell, and the different patterns through which they transfer the transcriptional signal to downstream cascades are of great interest. By profiling normalized abundances (AN) of basic regulatory patterns of individual THubs in the yeast Saccharomyces cerevisiae transcriptional regulation network under five different cellular states and environmental conditions, we have investigated their preferences for different basic regulatory patterns. Subgraph-normalized abundances downstream of individual THubs often differ significantly from that of the network as a whole, and conversely, certain over-represented subgraphs are not preferred by any THub. The THub preferences changed substantially when the cellular or environmental conditions changed. This switching of regulatory pattern preferences suggests that a change in conditions does not only elicit a change in response by the regulatory network, but also a change in the mechanisms by which the response is mediated. The THub subgraph preference profile thus provides a novel tool for description of the structure and organization between the large-scale exponents and local regulatory patterns. Transcription factors are proteins that bind to short segments of DNA, thereby controlling transcription and expression of other genes. Transcription factors may control a number of other genes, and in turn be controlled by other transcription factors, thus forming an extensive transcriptional network of control and counter-control, which acts through space and time in the cell. In transcriptional networks, transcription factors and their target genes form various patterns (called subgraphs or motifs) that are suspected of being of importance to how transcription factors exert their control of cellular processes. 
Zhang and colleagues have studied how a subset of transcription factors (called transcription hubs) utilizes such subgraphs in networks generated from yeast cells under various cellular states and environmental conditions. Their analyses show that different transcription hubs in the same network prefer different types of subgraphs, and that these preferences are not governed by subgraph frequencies in the network. They further show that when cellular conditions change, the transcription hubs frequently change their subgraph preferences, indicating that different modes of control require different types of subgraph use. These findings could have implications for our understanding of the mechanisms that underlie the fine-tuned control systems that govern a cell or an organism.
Affiliation(s)
- Zhihua Zhang
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Changning Liu
- Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Geir Skogerbø
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
| | - Xiaopeng Zhu
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Hongchao Lu
- Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Lan Chen
- Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Baochen Shi
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Yong Zhang
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Jie Wang
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Tao Wu
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate School of the Chinese Academy of Sciences, Beijing, China
| | - Runsheng Chen
- Bioinformatics Laboratory and National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Bioinformatics Research Group, Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- * To whom correspondence should be addressed. E-mail:
|
18
|
Bonomo J, Warnecke T, Hume P, Marizcurrena A, Gill RT. A comparative study of metabolic engineering anti-metabolite tolerance in Escherichia coli. Metab Eng 2006; 8:227-39. [PMID: 16497527 DOI: 10.1016/j.ymben.2005.12.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2005] [Revised: 12/15/2005] [Accepted: 12/28/2005] [Indexed: 11/22/2022]
Abstract
A problem in strain engineering is that mutations that benefit the expression of a phenotype in one environment may impose a cost to biological fitness in a new environment. The overall objective of this study was to improve understanding of this phenomenon within the context of a classic anti-metabolite selection strategy. We have engineered Escherichia coli using three mutagenesis techniques (chemical mutagenesis, insertional mutagenesis, and plasmid-based overexpression) and assessed the relative costs and benefits to biological fitness of mutants selected for tolerance to five amino acid analogs whose target amino acids (glutamic acid, aspartic acid, tryptophan, glycine, and serine) differ in metabolic connectivity and biosynthetic energy requirements. Our major findings include (i) the fold increase in anti-metabolite tolerance, independent of mutagenesis strategy, was much greater for aspartic acid beta-hydroxamate (AAH) compared to all other tested hydroxamates, (ii) increased tolerance to glutamic acid gamma-hydroxamate (GAH) was not achieved using any of the mutagenesis strategies, and (iii) characteristics of the anti-metabolite, rather than those of the corresponding metabolite, were more important in determining the ability to increase tolerance.
Affiliation(s)
- Jeanne Bonomo
- Department of Chemical and Biological Engineering, University of Colorado, Boulder, Campus Box 424, Boulder, CO 80309, USA
|
19
|
Price GA, Crooks GE, Green RE, Brenner SE. Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics 2005; 21:3824-31. [PMID: 16105900 DOI: 10.1093/bioinformatics/bti627] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap. RESULTS We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small, but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices. AVAILABILITY The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/. CONTACT brenner@compbio.berkeley.edu.
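The abstract describes the Bayesian bootstrap as operationally similar to the standard bootstrap. A minimal sketch of the generic technique follows: instead of resampling items with replacement, each replicate draws continuous Dirichlet(1, ..., 1) weights over the items and computes a weighted statistic. The function name, the per-item scores, and the choice of the mean as the statistic are illustrative assumptions and do not reproduce the authors' benchmark code.

```python
import random


def bayesian_bootstrap_means(values, n_replicates=1000, seed=0):
    """Return weighted means of `values` under Bayesian bootstrap replicates."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_replicates):
        # Normalized Exponential(1) draws are distributed as Dirichlet(1, ..., 1).
        draws = [rng.expovariate(1.0) for _ in values]
        total = sum(draws)
        means.append(sum(d * v for d, v in zip(draws, values)) / total)
    return means


# Hypothetical per-query performance scores for one comparison method.
scores = [0.2, 0.5, 0.9, 0.4]
replicates = sorted(bayesian_bootstrap_means(scores, n_replicates=2000, seed=1))
# Rough 95% interval read from the sorted replicate means.
print(replicates[50], replicates[1949])
```

The spread of the replicate means gives an error estimate for the average score. How weights should be assigned when a benchmark database has internal structure is the subject of the paper itself and is not modelled here.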
Affiliation(s)
- Gavin A Price
- Department of Bioengineering, University of California, Berkeley, 94720, USA
|
20
|
Li H, Pellegrini M, Eisenberg D. Detection of parallel functional modules by comparative analysis of genome sequences. Nat Biotechnol 2005; 23:253-60. [PMID: 15696156 DOI: 10.1038/nbt1065] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Parallel functional modules are separate sets of proteins in an organism that catalyze the same or similar biochemical reactions but act on different substrates or use different cofactors. They originate by gene duplication during evolution. Parallel functional modules provide versatility and complexity to organisms, and increase cellular flexibility and robustness. We have developed a four-step approach for genome-wide discovery of parallel modules from protein functional linkages. From ten genomes, we identified 37 cellular systems that consist of parallel functional modules. This approach recovers known parallel complexes and pathways, and discovers new ones that conventional homology-based methods did not previously reveal, as illustrated by examples of peptide transporters in Escherichia coli and nitrogenases in Rhodopseudomonas palustris. The approach untangles intertwined functional linkages between parallel functional modules and expands our ability to decode protein functions from genome sequences.
Affiliation(s)
- Huiying Li
- Howard Hughes Medical Institute, UCLA-DOE Institute for Genomics and Proteomics, Department of Chemistry and Biochemistry, 90095-1570, USA
|
21
|
Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nat Genet 2004; 36:492-6. [PMID: 15107850 DOI: 10.1038/ng1340] [Citation(s) in RCA: 403] [Impact Index Per Article: 20.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2003] [Accepted: 03/01/2004] [Indexed: 11/09/2022]
Abstract
We are beginning to elucidate transcriptional regulatory networks on a large scale and to understand some of the structural principles of these networks, but the evolutionary mechanisms that form these networks are still mostly unknown. Here we investigate the role of gene duplication in network evolution. Gene duplication is the driving force for creating new genes in genomes: at least 50% of prokaryotic genes and over 90% of eukaryotic genes are products of gene duplication. The transcriptional interactions in regulatory networks consist of multiple components, and duplication processes that generate new interactions would need to be more complex. We define possible duplication scenarios and show that they formed the regulatory networks of the prokaryote Escherichia coli and the eukaryote Saccharomyces cerevisiae. Gene duplication has had a key role in network evolution: more than one-third of known regulatory interactions were inherited from the ancestral transcription factor or target gene after duplication, and roughly one-half of the interactions were gained during divergence after duplication. In addition, we conclude that evolution has been incremental, rather than making entire regulatory circuits or motifs by duplication with inheritance of interactions.
Affiliation(s)
- Sarah A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.
|
22
|
Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV. Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol 2002; 2:18. [PMID: 12379152 PMCID: PMC137606 DOI: 10.1186/1471-2148-2-18] [Citation(s) in RCA: 112] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2002] [Accepted: 10/14/2002] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Power distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interaction partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to pre-existing hubs. Exploration of different types of evolutionary models in an attempt to determine which of them lead to power law distributions has the potential of revealing non-trivial aspects of genome evolution. RESULTS A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a birth, death and innovation model (BDIM). The formulas for equilibrium frequencies of domain families of different size and the total number of families at equilibrium are derived for a general BDIM. All asymptotics of equilibrium frequencies of domain families possible for the given type of models are found and their appearance depending on model parameters is investigated. It is proved that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with the degree not equal to -1 can appear only if the hypothesis of independence of the duplication/deletion rates on the size of a domain family is rejected. 
Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in detail and the distributions of the equilibrium frequencies of domain families of different size are determined for each case. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the second-order balanced linear BDIM. Calculation of the parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes. CONCLUSIONS We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.
Affiliation(s)
- Georgy P Karev
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Andrey Y Rzhetsky
- Columbia Genome Center, Columbia University, 1150 St. Nicholas Avenue, Unit 109, New York, NY 10032, USA
| | - Faina S Berezovskaya
- Department of Mathematics, Howard University, 2400 Sixth Str., Washington D.C., 20059, USA
| | - Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
23
|
Mercereau-Puijalon O, Barale JC, Bischoff E. Three multigene families in Plasmodium parasites: facts and questions. Int J Parasitol 2002; 32:1323-44. [PMID: 12350369 DOI: 10.1016/s0020-7519(02)00111-x] [Citation(s) in RCA: 66] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Multigene families optimise fitness by providing a set of related genes with possibly different temporal and/or topological expression patterns. We analyse here the structural organisation and sequence diversity of the rDNA, sera and var C Plasmodium falciparum families, and discuss their consequences for parasite biology. The low rDNA copy number, which reduces reshuffling, is probably the corollary of the need for functionally distinct rRNAs in the insect and in the vertebrate host. The unusual intra-genome and population rDNA sequence diversity results in cells equipped with mosaic ribosome sets. The functional constraints are such that ribosome compatibility could influence parasite fitness and contribute to population structuring. Unlike the dispersed rDNA units, the sera family is arranged as a tandem gene cluster, with seven contiguous similar genes, and one more distantly related paralog. We address the question of the inclusion criteria in family definition. We discuss the results concerning the SERA proteins expression and function in the context of the long overlooked multigene family. The var C module is shared by var genes, 'orphan' var C and var C pseudogenes. Analysis of 125 var C deduced protein sequences highlights a well-conserved framework, including putative phosphorylation sites, consistent with the proposed function of mediating interaction with cytoskeletal proteins. The 5' and 3' flanking sequences of the var C pseudogenes are heterogeneous. In contrast, the flanking sequences of the uninterrupted var C modules show remarkable conservation. This is interesting in view of the silencing activity of the var intronic sequence on var expression. The 5' flanking sequence dichotomy reported for internal and sub-telomeric var genes extends to the 3' flanking sequences. This has profound implications for transcription regulation and generation of diversity. 
The var C family suggests a role for pseudogenes as a diversity reservoir and in genome dynamics by promoting ectopic recombination.
Affiliation(s)
- Odile Mercereau-Puijalon
- Unité d'Immunologie Moléculaire des Parasites, Unité de Recherche Associée 1960 du Centre National de la Recherche Scientifique, Institut Pasteur, 25 rue du Dr ROUX, 75015, Paris, France.
|
24
|
Das R, Junker J, Greenbaum D, Gerstein MB. Global perspectives on proteins: comparing genomes in terms of folds, pathways and beyond. THE PHARMACOGENOMICS JOURNAL 2002; 1:115-25. [PMID: 11911438 DOI: 10.1038/sj.tpj.6500021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The sequencing of complete genomes provides us with a global view of all the proteins in an organism. Proteomic analysis can be done on a purely sequence-based level, with a focus on finding homologues and grouping them into families and clusters of orthologs. However, incorporating protein structure into this analysis provides valuable simplification; it allows one to collect together very distantly related sequences, thus condensing the proteome into a minimal number of 'parts.' We describe issues related to surveying proteomes in terms of structural parts, including methods for fold assignment and formats for comparisons (e.g. top-10 lists and whole-genome trees), and show how biases in the databases and in sampling can affect these surveys. We illustrate our main points through a case study on the unique protein properties evident in many thermophile genomes (e.g. more salt bridges). Finally, we discuss metabolic pathways as an even greater simplification of genomes. In comparison to folds these allow the organization of many more genes into coherent systems, yet can nevertheless be understood in many of the same terms.
Affiliation(s)
- R Das
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA
|
25
|
Orengo CA, Sillitoe I, Reeves G, Pearl FM. Review: what can structural classifications reveal about protein evolution? J Struct Biol 2001; 134:145-65. [PMID: 11551176 DOI: 10.1006/jsbi.2001.4398] [Citation(s) in RCA: 42] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
In this article we present a review of the methods used for comparing and classifying protein structures. We discuss the hierarchies and populations of fold groups and evolutionary families in some of the major classifications and we consider some of the problems confronting any general analyses of structural evolution in protein families. We also review some more recent analyses that have expanded these classifications by identifying sequence relatives in the genomes and thereby reveal interesting trends in fold usage and recurrence.
Affiliation(s)
- C A Orengo
- Department of Biochemistry and Molecular Biology, University College, Gower Street, London, WC1E 6BT, United Kingdom
|
26
|
Qian J, Stenger B, Wilson CA, Lin J, Jansen R, Teichmann SA, Park J, Krebs WG, Yu H, Alexandrov V, Echols N, Gerstein M. PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res 2001; 29:1750-64. [PMID: 11292848 PMCID: PMC31319 DOI: 10.1093/nar/29.8.1750] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2000] [Revised: 02/27/2001] [Accepted: 02/27/2001] [Indexed: 11/14/2022] Open
Abstract
As the number of protein folds is quite limited, a mode of analysis that will be increasingly common in the future, especially with the advent of structural genomics, is to survey and re-survey the finite parts list of folds from an expanding number of perspectives. We have developed a new resource, called PartsList, that lets one dynamically perform these comparative fold surveys. It is available on the web at http://bioinfo.mbb.yale.edu/partslist and http://www.partslist.org. The system is based on the existing fold classifications and functions as a form of companion annotation for them, providing 'global views' of many already completed fold surveys. The central idea in the system is that of comparison through ranking; PartsList will rank the approximately 420 folds based on more than 180 attributes. These include: (i) occurrence in a number of completely sequenced genomes (e.g. it will show the most common folds in the worm versus yeast); (ii) occurrence in the structure databank (e.g. most common folds in the PDB); (iii) both absolute and relative gene expression information (e.g. most changing folds in expression over the cell cycle); (iv) protein-protein interactions, based on experimental data in yeast and comprehensive PDB surveys (e.g. most interacting fold); (v) sensitivity to inserted transposons; (vi) the number of functions associated with the fold (e.g. most multi-functional folds); (vii) amino acid composition (e.g. most Cys-rich folds); (viii) protein motions (e.g. most mobile folds); and (ix) the level of similarity based on a comprehensive set of structural alignments (e.g. most structurally variable folds). The integration of whole-genome expression and protein-protein interaction data with structural information is a particularly novel feature of our system. 
We provide three ways of visualizing the rankings: a profiler emphasizing the progression of high and low ranks across many pre-selected attributes, a dynamic comparer for custom comparisons and a numerical rankings correlator. These allow one to directly compare very different attributes of a fold (e.g. expression level, genome occurrence and maximum motion) in the uniform numerical format of ranks. This uniform framework, in turn, highlights the way that the frequency of many of the attributes falls off with approximate power-law behavior (i.e. according to V^(-b), for attribute value V and constant exponent b), with a few folds having large values and most having small values.
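The power-law fall-off mentioned at the end of this abstract, where frequency scales as the attribute value V raised to the power -b, can be made concrete with a small sketch. It estimates b by ordinary least squares on log-transformed data, which is one common but rough approach and is not stated to be the method used in PartsList. The function name and the synthetic data are hypothetical.

```python
import math


def fit_power_law_exponent(values, frequencies):
    """Estimate b in frequency ~ C * V**(-b) by least squares in log-log space."""
    xs = [math.log(v) for v in values]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The log-log slope is -b, so flip the sign.
    return -covariance / variance


# Synthetic attribute values and frequencies following an exact power law.
attribute_values = [1, 2, 4, 8, 16]
frequencies = [100 * v ** -1.5 for v in attribute_values]
print(fit_power_law_exponent(attribute_values, frequencies))
```

On noise-free synthetic data the fit recovers the exponent used to generate it. Real rank or frequency data would call for more careful estimation, since log-log regression is sensitive to binning and to sparse tails.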
Affiliation(s)
- J Qian
- Department of Molecular Biophysics and Biochemistry, Yale University, PO Box 208114, New Haven, CT 06520, USA
|
27
|
Jordan IK, Makarova KS, Spouge JL, Wolf YI, Koonin EV. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res 2001; 11:555-65. [PMID: 11282971 PMCID: PMC311027 DOI: 10.1101/gr.gr-1660r] [Citation(s) in RCA: 122] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Gene duplication is an important mechanistic antecedent to the evolution of new genes and novel biochemical functions. In an attempt to assess the contribution of gene duplication to genome evolution in archaea and bacteria, clusters of related genes that appear to have expanded subsequent to the diversification of the major prokaryotic lineages (lineage-specific expansions) were analyzed. Analysis of 21 completely sequenced prokaryotic genomes shows that lineage-specific expansions comprise a substantial fraction (approximately 5%-33%) of their coding capacities. A positive correlation exists between the fraction of the genes taken up by lineage-specific expansions and the total number of genes in a genome. Consistent with the notion that lineage-specific expansions are made up of relatively recently duplicated genes, >90% of the detected clusters consist of only two to four genes. The more common smaller clusters tend to include genes with higher pairwise similarity (as reflected by average score density) than larger clusters. Regardless of size, cluster members tend to be located more closely on bacterial chromosomes than expected by chance, which could reflect a history of tandem gene duplication. In addition to the small clusters, almost all genomes also contain rare large clusters of size ≥20. Several examples of the potential adaptive significance of these large clusters are explored. The presence or absence of clusters and their related genes was used as the basis for the construction of a similarity graph for completely sequenced prokaryotic genomes. The topology of the resulting graph seems to reflect a combined effect of common ancestry, horizontal transfer, and lineage-specific gene loss.
Affiliation(s)
- I K Jordan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
29
|
Yanai I, Camacho CJ, DeLisi C. Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. PHYSICAL REVIEW LETTERS 2000; 85:2641-2644. [PMID: 10978127 DOI: 10.1103/physrevlett.85.2641] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.3] [Received: 03/02/2000] [Indexed: 05/23/2023]
Abstract
A universal property of microbial genomes is the considerable fraction of genes that are homologous to other genes within the same genome. The process by which these homologues are generated is not well understood, but sequence analysis of 20 microbial genomes unveils a recurrent distribution of gene family sizes. We show that a simple evolutionary model based on random gene duplication and point mutations fully accounts for these distributions and permits predictions for the number of gene families in genomes not yet complete. Our findings are consistent with the notion that a genome evolves from a set of precursor genes to a mature size by gene duplications and increasing modifications.
Affiliation(s)
- I Yanai
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA
|
30
|
Abstract
The distribution of genes coding for membrane proteins was investigated in 16 complete genomes: 4 archaea, 11 bacteria, and 1 eukaryote. Membrane proteins were identified by our new method of predicting transmembrane segments () after the removal of amino-terminal signal peptides. Interestingly, about half of the membrane protein genes in each genome were found to be located next to another, forming tandem clusters. Roughly 10%-30% of the tandem clusters were conserved among organisms, and most of the conserved tandem clusters belonged to one of the three functional groups, namely, transporters, the electron transport system, and cell motility. A tandem cluster sometimes contained paralogous membrane proteins, in which case the cluster size and the number of transmembrane segments could be related to a functional category, especially to transporters. In addition to the clustering of membrane proteins, the clustering of membrane proteins and ATP-binding proteins in the complete genomes was also analyzed. Although this clustering was not statistically significant, it was useful to identify candidate membrane protein partners of isolated ATP-binding protein components in the ABC transporters. Possible implications of tandem cluster organization of membrane protein genes are discussed including the complex formation and other functional coupling of protein products and the mechanism of protein translocation to the cell membrane.
Affiliation(s)
- D Kihara
- Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
|
31
|
Bischoff E, Guillotte M, Mercereau-Puijalon O, Bonnefoy S. A member of the Plasmodium falciparum Pf60 multigene family codes for a nuclear protein expressed by readthrough of an internal stop codon. Mol Microbiol 2000; 35:1005-16. [PMID: 10712683 DOI: 10.1046/j.1365-2958.2000.01788.x] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.2] [Indexed: 11/20/2022]
Abstract
Four large multigene families have been described in Plasmodium falciparum malaria parasites (var, rif, stevor and Pf60). var and rif genes code for erythrocyte surface proteins and undergo clonal antigenic variation. We report here the characterization of the first Pf60 gene. The 6.1 gene is constitutively expressed by all mature blood stages and codes for a protein located within the nucleus. It has a single copy, 7-exon, 5' domain, separated by an internal stop codon from a 3' domain that presents a high homology with var exon II. Double-site immunoassay and P. falciparum transient transfection using the reporter luciferase gene demonstrated translation through the internal ochre codon. The 6.1 N-terminal domain has no homology with any protein described to date. Sequence analysis identified a leucine zipper and a putative nuclear localization signal and showed a high probability for coiled coils. Evidence for N-terminal coiled coil-mediated protein interactions was obtained. This identifies the 6.1 protein as a novel nuclear protein. These data show that the Pf60 and var genes form a superfamily with a common 3' domain, possibly involved in regulating homo- or heteromeric interactions.
Affiliation(s)
- E Bischoff
- Unité d'Immunologie Moléculaire des Parasites, CNRS URA 1960, Institut Pasteur, 25 rue du Dr Roux, 75724 Paris Cedex 15, France
|
32
|
Abstract
In this study, we analyzed all known protein sequences for repeating amino acid segments. Although duplicated sequence segments occur in 14% of all proteins, eukaryotic proteins are three times more likely to have internal repeats than prokaryotic proteins. After clustering the repetitive sequence segments into families, we find repeats from eukaryotic proteins have little similarity with prokaryotic repeats, suggesting most repeats arose after the prokaryotic and eukaryotic lineages diverged. Consequently, protein classes with the highest incidence of repetitive sequences perform functions unique to eukaryotes. The frequency distribution of the repeating units shows only weak length dependence, implicating recombination rather than duplex melting or DNA hairpin formation as the limiting mechanism underlying repeat formation. The mechanism favors additional repeats once an initial duplication has been incorporated. Finally, we show that repetitive sequences are favored that contain small and relatively water-soluble residues. We propose that error-prone repeat expansion allows repetitive proteins to evolve more quickly than non-repeat-containing proteins.
Affiliation(s)
- E M Marcotte
- Molecular Biology Institute, UCLA-DOE Lab of Structural Biology and Molecular Medicine, Los Angeles, CA, P.O. Box 951570, USA
|
33
|
|
34
|
Abstract
New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.
Affiliation(s)
- S A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
|
35
|
Gerstein M. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. FOLDING & DESIGN 1999; 3:497-512. [PMID: 9889159 DOI: 10.1016/s1359-0278(98)00066-2] [Citation(s) in RCA: 100] [Impact Index Per Article: 4.0] [Indexed: 01/20/2023]
Abstract
BACKGROUND Determining how representative the known structures are of the proteins encoded by a complete genome is important for assessing to what extent our current picture of protein stability and folding is overly influenced by biases in the structure databank (PDB). It is also important for improving database-based methods of structure prediction and genome annotation. RESULTS The known structures are compared to the proteins encoded by eight complete microbial genomes in terms of simple statistics such as sequence length, composition and secondary structure. The known structures are represented by a collection of nonhomologous domains from the PDB and a smaller list of 'biophysical proteins' on which folding experiments have concentrated. The proteins encoded by the genomes are considered as a whole and divided into various regions, such as known-structure homologue, low complexity (nonglobular), transmembrane or linker. Various tests are performed to assess the significance of the reported differences, in both a practical and a statistical sense. CONCLUSIONS The proteins encoded by the genomes are significantly different from those in the PDB. Their sequence lengths, which follow an extreme value distribution, are longer than the PDB proteins and much longer than the biophysical proteins. Their composition differs from the PDB proteins in having more Lys, Ile, Asn and Gln and less Cys and Trp. This is true overall and especially for the regions corresponding to soluble proteins of as yet unknown fold. Secondary-structure prediction on these uncharacterized regions indicates that they contain on average more helical structure than the PDB; differences about this mean are small, with yeast having slightly more sheet structure and Haemophilus influenzae and Helicobacter pylori more helical structure. Further information is available through the GeneCensus system at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA.
|
36
|
Teichmann SA, Park J, Chothia C. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 1998; 95:14658-63. [PMID: 9843945 PMCID: PMC24505 DOI: 10.1073/pnas.95.25.14658] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022] Open
Abstract
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.
Affiliation(s)
- S A Teichmann
- Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, United Kingdom.
|
37
|
Abstract
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage, that is, whether a given fold occurs in a particular organism. Of the approximately 340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many non-homologous sequence families, and are especially similar in overall architecture, with eight having helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly expressed folds are considerably different from the most highly duplicated folds. A tree can be constructed grouping genomes in terms of their shared folds. This has a remarkably similar topology to more conventional classifications, based on very different measures of relatedness. Finally, folds of membrane proteins can be analyzed through transmembrane-helix (TM) prediction. All the genomes appear to have similar usage patterns for these folds, with the occurrence of a particular fold falling off rapidly with increasing numbers of TM-elements, according to a "Zipf-like" law. This implies there are no marked preferences for proteins with particular numbers of TM-helices (e.g. 7-TM) in microbial genomes.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
|
38
|
Stephens RS, Kalman S, Lammel C, Fan J, Marathe R, Aravind L, Mitchell W, Olinger L, Tatusov RL, Zhao Q, Koonin EV, Davis RW. Genome sequence of an obligate intracellular pathogen of humans: Chlamydia trachomatis. Science 1998; 282:754-9. [PMID: 9784136 DOI: 10.1126/science.282.5389.754] [Citation(s) in RCA: 1133] [Impact Index Per Article: 43.6] [Indexed: 11/02/2022]
Abstract
Analysis of the 1,042,519-base pair Chlamydia trachomatis genome revealed unexpected features related to the complex biology of chlamydiae. Although chlamydiae lack many biosynthetic capabilities, they retain functions for performing key steps and interconversions of metabolites obtained from their mammalian host cells. Numerous potential virulence-associated proteins also were characterized. Several eukaryotic chromatin-associated domain proteins were identified, suggesting a eukaryotic-like mechanism for chlamydial nucleoid condensation and decondensation. The phylogenetic mosaic of chlamydial genes, including a large number of genes with phylogenetic origins from eukaryotes, implies a complex evolution for adaptation to obligate intracellular parasitism.
Affiliation(s)
- R S Stephens
- Program in Infectious Diseases, University of California, Berkeley, CA 94720, USA.
|
39
|
Fani R, Mori E, Tamburini E, Lazcano A. Evolution of the structure and chromosomal distribution of histidine biosynthetic genes. ORIGINS LIFE EVOL B 1998; 28:555-70. [PMID: 9742729 DOI: 10.1023/a:1006531526299] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
A database of more than 100 histidine biosynthetic genes from different organisms belonging to the three primary domains has been analyzed, including those found in the now completely sequenced genomes of Haemophilus influenzae, Mycoplasma genitalium, Synechocystis sp., Methanococcus jannaschii, and Saccharomyces cerevisiae. The ubiquity of his genes suggests that it is a highly conserved pathway that was probably already present in the last common ancestor of all extant life. The chromosomal distribution of the his genes shows that the enterobacterial histidine operon structure is not the only possible organization, and that there is a diversity of gene arrays for the his pathway. Analysis of the available sequences shows that gene fusions (like those involved in the origin of the Escherichia coli and Salmonella typhimurium hisIE and hisB gene structures) are not universal. In contrast, the elongation event that led to the extant hisA gene from two homologous ancestral modules, as well as the subsequent paralogous duplication that originated hisF, appear to be irreversible and are conserved in all known organisms. The available evidence supports the hypothesis that histidine biosynthesis was assembled by a gene recruitment process.
Affiliation(s)
- R Fani
- Dipartimento di Biologia Animale e Genetica, Università degli Studi di Firenze, Italy.
|
40
|
Gerstein M, Hegyi H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 1998; 22:277-304. [PMID: 10357579 DOI: 10.1111/j.1574-6976.1998.tb00371.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022] Open
Abstract
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in some respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias.
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
|
41
|
Nakatsu CH, Korona R, Lenski RE, de Bruijn FJ, Marsh TL, Forney LJ. Parallel and divergent genotypic evolution in experimental populations of Ralstonia sp. J Bacteriol 1998; 180:4325-31. [PMID: 9721265 PMCID: PMC107437 DOI: 10.1128/jb.180.17.4325-4331.1998] [Citation(s) in RCA: 47] [Impact Index Per Article: 1.8] [Indexed: 11/20/2022] Open
Abstract
Genetic rearrangements within a population of bacteria were analyzed to understand the degree of divergence occurring after experimental evolution. We used 18 replicate populations founded from Ralstonia sp. strain TFD41 that had been propagated for 1,000 generations with 2,4-dichlorophenoxyacetic acid (2,4-D) as the carbon source. Genetic divergence was examined by restriction fragment length polymorphism analysis of the incumbent plasmid that carries the 2,4-D catabolic genes and by amplification of random regions of the genome via PCR. In 18 evolved clones examined, we observed duplication within the plasmid, including the tfdA gene, which encodes a 2,4-D dioxygenase that catalyzes the first step in the 2,4-D catabolic pathway. In 71 of 72 evolved clones, a common 2.4-kb PCR product was lost when genomic fingerprints produced by PCR amplification using degenerate primers based on repetitive extragenic palindromic (REP) sequences (REP-PCR) were compared. The nucleotide sequence of the 2.4-kb PCR product has homology to the TRAP (tripartite ATP-independent periplasmic) solute transporter gene family. Hybridization of the 2.4-kb REP-PCR product from the ancestor to genomic DNA from the evolved populations showed that the loss of the PCR product resulted from deletions in the genome. Deletions in the plasmid and presence and/or absence of other REP-PCR products were also found in these clones but at much lower frequencies. The common and uncommon genetic changes observed show that both parallel and divergent genotypic evolution occurred in replicate populations of this bacterium.
Affiliation(s)
- C H Nakatsu
- NSF Center for Microbial Ecology, Michigan State University, East Lansing, Michigan 48824, USA.
|
42
|
Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan Y, Bork P. Homology-based fold predictions for Mycoplasma genitalium proteins. J Mol Biol 1998; 280:323-6. [PMID: 9665839 DOI: 10.1006/jmbi.1998.1884] [Citation(s) in RCA: 88] [Impact Index Per Article: 3.4] [Indexed: 11/22/2022]
Abstract
Homology search techniques based on the iterative PSI-BLAST method in combination with various filters for low sequence complexity are applied to assign folds to all Mycoplasma genitalium proteins. The resulting procedure (implemented as a web server) is able to predict at least one domain in 37% of these proteins automatically, with an estimated accuracy higher than 98%. Taking structural features such as coiled coil or transmembrane regions aside, folds can be assigned to more than half of the globular proteins in a bacterium just by iterative sequence comparison.
Affiliation(s)
- M Huynen
- EMBL, Max-Delbrück-Center for Molecular Medicine, Meyerhofstr. 1, Heidelberg, 69012, Germany
|
43
|
Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: from sequence to structure and function. Curr Opin Struct Biol 1998; 8:355-63. [PMID: 9666332 DOI: 10.1016/s0959-440x(98)80070-5] [Citation(s) in RCA: 114] [Impact Index Per Article: 4.4] [Indexed: 02/08/2023]
Abstract
Computer analysis of complete prokaryotic genomes shows that microbial proteins are in general highly conserved: approximately 70% of them contain ancient conserved regions. This allows us to delineate families of orthologs across a wide phylogenetic range and, in many cases, predict protein functions with considerable precision. Sequence database searches using newly developed, sensitive algorithms result in the unification of such orthologous families into larger superfamilies sharing common sequence motifs. For many of these superfamilies, prediction of the structural fold and specific amino acid residues involved in enzymatic catalysis is possible. Taken together, sequence and structure comparisons provide a powerful methodology that can successfully complement traditional experimental approaches.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
|
44
|
Abstract
We introduce and discuss a new computational approach towards prediction and inference of biological functions from genomic sequences by making use of the pathway data in KEGG. Due to its piecewise nature, the current approach of predicting each gene function based on sequence similarity searches often fails to reconstruct cellular functions with all necessary components. The pathway diagram in KEGG, which may be considered a wiring diagram of molecules in biological systems, can be utilised as a reference for functional reconstruction. KEGG also contains binary relations that represent molecular interactions and relations and that can be utilised for computing and comparing pathways.
Affiliation(s)
- H Ogata
- Institute for Chemical Research, Kyoto University, Japan
|
45
|
Abstract
Computational biology exploits the evolutionary connectivity between proteins and protein families to predict structural and functional properties of uncharacterized gene products. In the past year, conceptual and statistical refinements have substantially improved algorithms for the detection of remote homologues. In conjunction with the rapid growth of biological databases, the global organization of proteins into sequence families, functional families and structural families has become both pertinent and feasible.
Affiliation(s)
- L Holm
- European Molecular Biology Laboratory-European Bio-informatics Institute, Cambridge, UK
|
46
|
Levitt M, Gerstein M. A unified statistical framework for sequence comparison and structure comparison. Proc Natl Acad Sci U S A 1998; 95:5913-20. [PMID: 9600892 PMCID: PMC34495 DOI: 10.1073/pnas.95.11.5913] [Citation(s) in RCA: 232] [Impact Index Per Article: 8.9] [Indexed: 02/07/2023] Open
Abstract
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.
Affiliation(s)
- M Levitt
- Department of Structural Biology, Stanford University, Stanford, CA 94305, USA.
|
47
|
Huynen M, Dandekar T, Bork P. Differential genome analysis applied to the species-specific features of Helicobacter pylori. FEBS Lett 1998; 426:1-5. [PMID: 9598967 DOI: 10.1016/s0014-5793(98)00276-2] [Citation(s) in RCA: 68] [Impact Index Per Article: 2.6] [Indexed: 02/07/2023]
Abstract
We introduce a simple and rapid strategy to identify genes that are responsible for species-specific phenotypes. The genome of a species that has a specific phenotype is compared with at least one, closely related, species that lacks this phenotype. Homologous genes that are shared among the species compared are identified and discarded from the list of candidates for species-specific genes. The process is automated and rapidly yields a small subset of the genome that likely contains genes responsible for the species-specific features. Functions are assigned to the genes, and dubious annotations are filtered out. Information is extracted not only from the presence of genes, but also from their absence with respect to known phenotypes. We have applied the technique to identify a set of species-specific genes in Helicobacter pylori by comparing it with its closest relatives for which complete genome sequences are available, Haemophilus influenzae and Escherichia coli. Of the genes of this set for which functional features can be obtained, a large fraction (63%, 123 proteins) is (potentially) involved in H. pylori's interaction with its host. We hypothesize that a family of outer membrane proteins is critical for the ability of H. pylori to colonize host cells in highly acidic environments.
|
48
|
Gerstein M, Levitt M. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci 1998; 7:445-56. [PMID: 9521122 PMCID: PMC2143933 DOI: 10.1002/pro.5560070226] [Citation(s) in RCA: 157] [Impact Index Per Article: 6.0] [Indexed: 02/06/2023]
Abstract
We apply a simple method for aligning protein sequences on the basis of a 3D structure, on a large scale, to the proteins in the scop classification of fold families. This allows us to assess, understand, and improve our automatic method against an objective, manually derived standard, a type of comprehensive evaluation that has not yet been possible for other structural alignment algorithms. Our basic approach directly matches the backbones of two structures, using repeated cycles of dynamic programming and least-squares fitting to determine an alignment minimizing coordinate difference. Because of simplicity, our method can be readily modified to take into account additional features of protein structure such as the orientation of side chains or the location-dependent cost of opening a gap. Our basic method, augmented by such modifications, can find reasonable alignments for all but 1.5% of the known structural similarities in scop, i.e., all but 32 of the 2,107 superfamily pairs. We discuss the specific protein structural features that make these 32 pairs so difficult to align and show how our procedure effectively partitions the relationships in scop into different categories, depending on what aspects of protein structure are involved (e.g., depending on whether or not consideration of side-chain orientation is necessary for proper alignment). We also show how our pairwise alignment procedure can be extended to generate a multiple alignment for a group of related structures. We have compared these alignments in detail with corresponding manual ones culled from the literature. We find good agreement (to within 95% for the core regions), and detailed comparison highlights how particular protein structural features (such as certain strands) are problematical to align, giving somewhat ambiguous results. With these improvements and systematic tests, our procedure should be useful for the development of scop and the future classification of protein folds.
Affiliation(s)
- M Gerstein
- Molecular Biophysics & Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA.
|
49
|
Gerstein M. A structural census of genomes: comparing bacterial, eukaryotic, and archaeal genomes in terms of protein structure. J Mol Biol 1997; 274:562-76. [PMID: 9417935 DOI: 10.1006/jmbi.1997.1412] [Citation(s) in RCA: 124] [Impact Index Per Article: 4.6] [Indexed: 02/05/2023]
Abstract
Representative genomes from each of the three kingdoms of life are compared in terms of protein structure, in particular, those of Haemophilus influenzae (a bacterium), Methanococcus jannaschii (an archaeon), and yeast (a eukaryote). The comparison is in the form of a census (or comprehensive accounting) of the relative occurrence of secondary and tertiary structures in the genomes, with particular emphasis on patterns of supersecondary structure. Comparison of secondary structure shows that the three genomes have nearly the same overall secondary-structure content, although they differ markedly in amino acid composition. Comparison of super-secondary structure, using a novel "frequent-words" approach, shows that yeast has a preponderance of consecutive strands (e.g. beta-beta-beta patterns), Haemophilus, consecutive helices (alpha-alpha-alpha), and Methanococcus, alternating helix-strand structures (beta-alpha-beta). Yeast also has significantly more helical membrane proteins than the other two genomes, with most of the differences concentrated in proteins containing two transmembrane segments. Comparison of tertiary structure (by sequence matching and domain-level clustering) highlights the substantial duplication in each genome (approximately 30% to 50%), with the degree of duplication following similar patterns in all three. Many sequence families are shared among the genomes, with the degree of overlap between any two genomes being roughly similar. In total, the three genomes contain 148 of the approximately 300 known protein folds. Forty-five of these 148 that are present in all three genomes are especially enriched in mixed super-secondary structures (alpha/beta). Moreover, the five most common of these 45 (the "top-5") have a remarkably similar super-secondary structure architecture, containing a central sheet of parallel strands with helices packed onto at least one face and beta-alpha-beta connections between adjacent strands. These most basic molecular parts, which, presumably, were present in the last common ancestor to the three kingdoms, include the TIM-barrel, Rossmann, flavodoxin, thiamin-binding, and P-loop-hydrolase folds.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, USA
|
50
|
Koonin EV, Galperin MY. Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr Opin Genet Dev 1997; 7:757-63. [PMID: 9468784 DOI: 10.1016/s0959-437x(97)80037-8] [Citation(s) in RCA: 110] [Impact Index Per Article: 4.1] [Indexed: 02/06/2023]
Abstract
Comparative analysis of the complete sequences of seven bacterial and three archaeal genomes leads to the first generalizations of emerging genome-based microbiology. Protein sequences are, generally, highly conserved, with ~70% of the gene products in bacteria and archaea containing ancient conserved regions. In contrast, there is little conservation of genome organization, except for a few essential operons. The most striking conclusions derived by comparison of multiple genomes from phylogenetically distant species are that the number of universally conserved gene families is very small and that multiple events of horizontal gene transfer and genome fusion are major forces in evolution.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
|