1
Seto K, Mok W, Stone J. Bridging the gap between theory and practice in elucidating modular gene regulatory sequence organisation within genomes. Genome 2020; 63:281-289. [PMID: 32114793 DOI: 10.1139/gen-2019-0150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Changes to promoter regions probably have been responsible for many morphological evolutionary transitions, especially in animals. This idea is becoming testable, as data from genome projects amass and enable bioinformaticians to conduct comparative sequence analyses and test for correlations between genotypic similarities or differences and phenotypic likeness or disparity. Although such practical pursuits have initiated some theoretical considerations, a conceptual framework for understanding promoter region evolution, potentially effecting morphological evolution, is only starting to emerge, predominantly resulting from computational research. We contribute to this framework by specifying three big problems for promoter region research; reviewing computational research on promoter region evolution; and exemplifying a topic for future promoter region research - module evolution.
Affiliation(s)
- Kelly Seto: Department of Molecular & Medical Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Wendy Mok: Department of Molecular Biology & Biophysics, University of Connecticut Health, Farmington, CT 06032, USA
- Jonny Stone: Department of Biology, McMaster University, Hamilton, ON L8S 4K1, Canada; SHARCNet, McMaster University, Hamilton, ON L8S 4L8, Canada; Origins Institute, McMaster University, Hamilton, ON L8S 4M1, Canada
2
Ren J, Lee J, Na D. Recent advances in genetic engineering tools based on synthetic biology. J Microbiol 2020; 58:1-10. [PMID: 31898252 DOI: 10.1007/s12275-020-9334-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 06/03/2019] [Revised: 08/19/2019] [Accepted: 11/05/2019] [Indexed: 12/26/2022]
Abstract
Genome-scale engineering is a crucial methodology to rationally regulate microbiological system operations, leading to expected biological behaviors or enhanced bioproduct yields. Over the past decade, innovative genome modification technologies have been developed for effectively regulating and manipulating genes at the genome level. Here, we discuss the current genome-scale engineering technologies used for microbial engineering. Recently developed strategies, such as clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9, multiplex automated genome engineering (MAGE), promoter engineering, CRISPR-based regulations, and synthetic small regulatory RNA (sRNA)-based knockdown, are considered as powerful tools for genome-scale engineering in microbiological systems. MAGE, which modifies specific nucleotides of the genome sequence, is utilized as a genome-editing tool. Contrastingly, synthetic sRNA, CRISPRi, and CRISPRa are mainly used to regulate gene expression without modifying the genome sequence. This review introduces the recent genome-scale editing and regulating technologies and their applications in metabolic engineering.
Affiliation(s)
- Jun Ren: School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
- Jingyu Lee: School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
- Dokyun Na: School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
3
Langer BE, Hiller M. TFforge utilizes large-scale binding site divergence to identify transcriptional regulators involved in phenotypic differences. Nucleic Acids Res 2019; 47:e19. [PMID: 30496469 PMCID: PMC6393245 DOI: 10.1093/nar/gky1200] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/09/2018] [Revised: 11/06/2018] [Accepted: 11/15/2018] [Indexed: 12/19/2022]
Abstract
Changes in gene regulation are important for phenotypic and in particular morphological evolution. However, it remains challenging to identify the transcription factors (TFs) that contribute to differences in gene regulation and thus to phenotypic differences between species. Here, we present TFforge (Transcription Factor forward genomics), a computational method to identify TFs that are involved in the loss of phenotypic traits. TFforge screens an input set of regulatory genomic regions to detect TFs that exhibit a significant binding site divergence signature in species that lost a particular phenotypic trait. Using simulated data of modular and pleiotropic regulatory elements, we show that TFforge can identify the correct TFs for many different evolutionary scenarios. We applied TFforge to available eye regulatory elements to screen for TFs that exhibit a significant binding site decay signature in subterranean mammals. This screen identified interacting and co-binding eye-related TFs, and thus provides new insights into which TFs likely contribute to eye degeneration in these species. TFforge has broad applicability to identify the TFs that contribute to phenotypic changes between species, and thus can help to unravel the gene-regulatory differences that underlie phenotypic evolution.
Affiliation(s)
- Björn E Langer: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology Dresden, Germany
- Michael Hiller: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology Dresden, Germany
4
Langer BE, Roscito JG, Hiller M. REforge Associates Transcription Factor Binding Site Divergence in Regulatory Elements with Phenotypic Differences between Species. Mol Biol Evol 2019; 35:3027-3040. [PMID: 30256993 PMCID: PMC6278867 DOI: 10.1093/molbev/msy187] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 12/18/2022]
Abstract
Elucidating the genomic determinants of morphological differences between species is key to understanding how morphological diversity evolved. While differences in cis-regulatory elements are an important genetic source for morphological evolution, it remains challenging to identify regulatory elements involved in phenotypic differences. Here, we present Regulatory Element forward genomics (REforge), a computational approach that detects associations between transcription factor binding site divergence in putative regulatory elements and phenotypic differences between species. By simulating regulatory element evolution in silico, we show that this approach has substantial power to detect such associations. To validate REforge on real data, we used known binding motifs for eye-related transcription factors and identified significant binding site divergence in vision-impaired subterranean mammals in 1% of all conserved noncoding elements. We show that these genomic regions are significantly enriched in regulatory elements that are specifically active in mouse eye tissues, and that several of them are located near genes, which are required for eye development and photoreceptor function and are implicated in human eye disorders. Thus, our genome-wide screen detects widespread divergence of eye-regulatory elements and highlights regulatory regions that likely contributed to eye degeneration in subterranean mammals. REforge has broad applicability to detect regulatory elements that could be involved in many other phenotypes, which will help to reveal the genomic basis of morphological diversity.
Affiliation(s)
- Björn E Langer: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology, Dresden, Germany
- Juliana G Roscito: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology, Dresden, Germany
- Michael Hiller: Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology, Dresden, Germany
5
Abstract
Whole-genome alignment (WGA) is the prediction of evolutionary relationships at the nucleotide level between two or more genomes. It combines aspects of both colinear sequence alignment and gene orthology prediction and is typically more challenging to address than either of these tasks due to the size and complexity of whole genomes. Despite the difficulty of this problem, numerous methods have been developed for its solution because WGAs are valuable for genome-wide analyses such as phylogenetic inference, genome annotation, and function prediction. In this chapter, we discuss the meaning and significance of WGA and present an overview of the methods that address it. We also examine the problem of evaluating whole-genome aligners and offer a set of methodological challenges that need to be tackled in order to make most effective use of our rapidly growing databases of whole genomes.
Affiliation(s)
- Colin N Dewey: Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
6
Majumdar R, Shao L, Turlapati SA, Minocha SC. Polyamines in the life of Arabidopsis: profiling the expression of S-adenosylmethionine decarboxylase (SAMDC) gene family during its life cycle. BMC Plant Biol 2017; 17:264. [PMID: 29281982 PMCID: PMC5745906 DOI: 10.1186/s12870-017-1208-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/29/2017] [Accepted: 12/08/2017] [Indexed: 05/07/2023]
Abstract
BACKGROUND Arabidopsis has 5 paralogs of the S-adenosylmethionine decarboxylase (SAMDC) gene. Neither their specific role in development nor the role of positive/purifying selection in genetic divergence of this gene family is known. While some data are available on organ-specific expression of AtSAMDC1, AtSAMDC2, AtSAMDC3 and AtSAMDC4, not much is known about their promoters including AtSAMDC5, which is believed to be non-functional. RESULTS (1) Phylogenetic analysis of the five AtSAMDC genes shows a similar divergence pattern for promoters and coding sequences (CDSs), whereas genetic divergence of 5'UTRs and 3'UTRs was independent of the promoters and CDSs; (2) while AtSAMDC1 and AtSAMDC4 promoters exhibit high activity (constitutive in the former), promoter activities of AtSAMDC2, AtSAMDC3 and AtSAMDC5 are moderate to low in seedlings (depending upon translational or transcriptional fusions), and are localized mainly in the vascular tissues and reproductive organs in mature plants; (3) based on promoter activity, it appears that AtSAMDC5 is both transcriptionally and translationally active, but based on its coding sequence it seems to produce a non-functional protein; (4) though 5'UTR-based regulation of AtSAMDC expression through upstream open reading frames (uORFs) in the 5'UTR is well known, no such uORFs are present in AtSAMDC4 and AtSAMDC5; (5) the promoter regions of all five AtSAMDC genes contain common stress-responsive elements and hormone-responsive elements; (6) at the organ level, the activity of AtSAMDC enzyme does not correlate with the expression of specific AtSAMDC genes or with the contents of spermidine and spermine. CONCLUSIONS Differential roles of positive/purifying selection were observed in genetic divergence of the AtSAMDC gene family. All tissues express one or more AtSAMDC genes with significant redundancy, and concurrently, there is cell/tissue-specificity of gene expression, particularly in mature organs.
This study provides valuable information about AtSAMDC promoters, which could be useful in future manipulation of crop plants for nutritive purposes, stress tolerance or bioenergy needs. The AtSAMDC1 core promoter might serve the need of a strong constitutive promoter, and its high expression in the gametophytic cells could be exploited, where strong male/female gametophyte-specific expression is desired; e.g. in transgenic modification of crop varieties.
Affiliation(s)
- Rajtilak Majumdar: Department of Biological Sciences, University of New Hampshire, Durham, NH USA; USDA-ARS, SRRC, 1100 Robert E. Lee Blvd, New Orleans, LA 70124 USA
- Lin Shao: Department of Biological Sciences, University of New Hampshire, Durham, NH USA
- Swathi A. Turlapati: Department of Biological Sciences, University of New Hampshire, Durham, NH USA
- Subhash C. Minocha: Department of Biological Sciences, University of New Hampshire, Durham, NH USA
7
Lin H, Qin S. Tipping points in seaweed genetic engineering: scaling up opportunities in the next decade. Mar Drugs 2014; 12:3025-45. [PMID: 24857961 PMCID: PMC4052329 DOI: 10.3390/md12053025] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 02/08/2014] [Revised: 04/04/2014] [Accepted: 04/25/2014] [Indexed: 12/30/2022]
Abstract
Seaweed genetic engineering is a transgenic expression system with unique features compared with those of heterotrophic prokaryotes and higher plants. This study discusses several newly sequenced seaweed nuclear genomes and the necessity that research on vector design should consider endogenous promoters, codon optimization, and gene copy number. Seaweed viruses and artificial transposons can be applied as transformation methods after acquiring a comprehensive understanding of the mechanism of viral infections in seaweeds and transposon patterns in seaweed genomes. After cultivating transgenic algal cells and tissues in a photobioreactor, a biosafety assessment of genetically modified (GM) seaweeds must be conducted before open-sea application. We propose a set of programs for the evaluation of gene flow from GM seaweeds to local/geographical environments. The effective implementation of such programs requires fundamentally systematic and interdisciplinary studies on algal physiology and genetics, marine hydrology, reproductive biology, and ecology.
Affiliation(s)
- Hanzhi Lin: Environmental Biophysics and Molecular Ecology Program, Institute of Marine and Coastal Sciences, Rutgers University, 71 Dudley Road, New Brunswick, NJ 08901, USA
- Song Qin: Key Lab of Coastal Biology and Bio-resource Utilization, Yantai Institute of Coastal Zone Research, Chinese Academy of Sciences, 17 Chunhui Road, Yantai 264003, China
8
Signal correlations in ecological niches can shape the organization and evolution of bacterial gene regulatory networks. Adv Microb Physiol 2013; 61:1-36. [PMID: 23046950 DOI: 10.1016/b978-0-12-394423-8.00001-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 02/06/2023]
Abstract
Transcriptional regulation plays a significant role in the biological response of bacteria to changing environmental conditions. Therefore, mapping transcriptional regulatory networks is an important step not only in understanding how bacteria sense and interpret their environment but also in identifying the functions involved in biological responses to specific conditions. Recent experimental and computational developments have facilitated the characterization of regulatory networks on a genome-wide scale in model organisms. In addition, the multiplication of complete genome sequences has encouraged comparative analyses to detect conserved regulatory elements and infer regulatory networks in other less well-studied organisms. However, transcription regulation appears to evolve rapidly, thus creating challenges for the transfer of knowledge to nonmodel organisms. Nevertheless, the mechanisms and constraints driving the evolution of regulatory networks have been the subjects of numerous analyses, and several models have been proposed. Overall, the contributions of mutations, recombination, and horizontal gene transfer are complex. Finally, the rapid evolution of regulatory networks plays a significant role in the remarkable capacity of bacteria to adapt to new or changing environments. Conversely, the characteristics of environmental niches determine the selective pressures and can shape the structure of regulatory networks accordingly.
9
Ma X, Sela H, Jiao G, Li C, Wang A, Pourkheirandish M, Weiner D, Sakuma S, Krugman T, Nevo E, Komatsuda T, Korol A, Chen G. Population-genetic analysis of HvABCG31 promoter sequence in wild barley (Hordeum vulgare ssp. spontaneum). BMC Evol Biol 2012; 12:188. [PMID: 23006777 PMCID: PMC3544613 DOI: 10.1186/1471-2148-12-188] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 05/14/2012] [Accepted: 09/18/2012] [Indexed: 01/31/2023]
Abstract
Background The cuticle is an important adaptive structure whose origin played a crucial role in the transition of plants from aqueous to terrestrial conditions. HvABCG31/Eibi1 is an ABCG transporter gene, involved in cuticle formation that was recently identified in wild barley (Hordeum vulgare ssp. spontaneum). To study the genetic variation of HvABCG31 in different habitats, its 2 kb promoter region was sequenced from 112 wild barley accessions collected from five natural populations from southern and northern Israel. The sites included three mesic and two xeric habitats, and differed in annual rainfall, soil type, and soil water capacity. Results Phylogenetic analysis of the aligned HvABCG31 promoter sequences clustered the majority of accessions (69 out of 71) from the three northern mesic populations into one cluster, while all 21 accessions from the Dead Sea area, a xeric southern population, and two isolated accessions (one from a xeric population at Mitzpe Ramon and one from the xeric ‘African Slope’ of “Evolution Canyon”) formed the second cluster. The southern arid populations included six haplotypes, but they differed from the consensus sequence at a large number of positions, while the northern mesic populations included 15 haplotypes that were, on average, more similar to the consensus sequence. Most of the haplotypes (20 of 22) were unique to a population. Interestingly, higher genetic variation occurred within populations (54.2%) than among populations (45.8%). Analysis of the promoter region detected a large number of transcription factor binding sites: 121–128 and 121–134 sites in the two southern arid populations, and 123–128, 125–128, and 123–125 sites in the three northern mesic populations. Three types of TFBSs were significantly enriched: those related to GA (gibberellin), Dof (DNA binding with one finger), and light.
Conclusions Drought stress and adaptive natural selection may have been important determinants in the observed sequence variation of the HvABCG31 promoter. Abiotic stresses may be involved in the regulation of HvABCG31 gene transcription, generating more protective cuticles in plants under stress.
Affiliation(s)
- Xiaoying Ma: Extreme Stress Resistance and Biotechnology Laboratory, Cold and Arid Regions Environmental and Engineering Institute, Chinese Academy of Sciences, Lanzhou 730000, China
10
Erb I, González-Vallinas JR, Bussotti G, Blanco E, Eyras E, Notredame C. Use of ChIP-Seq data for the design of a multiple promoter-alignment method. Nucleic Acids Res 2012; 40:e52. [PMID: 22230796 PMCID: PMC3326335 DOI: 10.1093/nar/gkr1292] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 01/22/2023]
Abstract
We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters. This dataset was used to evaluate the accuracy for predicting true human orthologs among their paralogs. We found that whereas other methods achieve on average 73.5% accuracy, and 77.6% when trained on that same dataset, the figure goes up to 80.4% for Pro-Coffee. We then applied a novel validation procedure based on multi-species ChIP-seq data. Trained and untrained methods were tested for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. Not only has this interesting biological consequences, it also allows us to conclude that any method that is trained on the ortholog data set will result in functionally more informative alignments.
Affiliation(s)
- Ionas Erb: Bioinformatics and Genomics program, Centre for Genomic Regulation and UPF, 08003 Barcelona, Spain
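The relative gains quoted in the Pro-Coffee abstract (entry 10) can be re-derived from the averages it reports. The following is a minimal arithmetic sketch using only the figures stated in the abstract; the helper name `relative_gain` is illustrative and is not part of Pro-Coffee.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Return the percentage by which `new` exceeds `baseline`."""
    return (new - baseline) / baseline * 100.0


# Average numbers of correctly aligned binding sites, as reported in the abstract.
default_sites = 284
trained_sites = 316
procoffee_sites = 331

gain_vs_default = relative_gain(procoffee_sites, default_sites)
gain_vs_trained = relative_gain(procoffee_sites, trained_sites)

# Ortholog-classification accuracies are already percentages,
# so their difference is expressed in percentage points.
accuracy_gap_points = 80.4 - 73.5

print(f"Gain over default methods: {gain_vs_default:.1f}%")
print(f"Gain over trained methods: {gain_vs_trained:.1f}%")
print(f"Accuracy gap vs untrained methods: {accuracy_gap_points:.1f} points")
```

The first figure matches the "16.5% above the default average" stated in the abstract; the gain over trained methods is not quoted there and is shown only for comparison.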
11
Taher L, McGaughey DM, Maragh S, Aneas I, Bessling SL, Miller W, Nobrega MA, McCallion AS, Ovcharenko I. Genome-wide identification of conserved regulatory function in diverged sequences. Genome Res 2011; 21:1139-49. [PMID: 21628450 PMCID: PMC3129256 DOI: 10.1101/gr.119016.110] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.2] [Received: 12/08/2010] [Accepted: 04/19/2011] [Indexed: 01/16/2023]
Abstract
Plasticity of gene regulatory encryption can permit DNA sequence divergence without loss of function. Functional information is preserved through conservation of the composition of transcription factor binding sites (TFBS) in a regulatory element. We have developed a method that can accurately identify pairs of functional noncoding orthologs at evolutionarily diverged loci by searching for conserved TFBS arrangements. With an estimated 5% false-positive rate (FPR) in approximately 3000 human and zebrafish syntenic loci, we detected approximately 300 pairs of diverged elements that are likely to share common ancestry and have similar regulatory activity. By analyzing a pool of experimentally validated human enhancers, we demonstrated that 7/8 (88%) of their predicted functional orthologs retained in vivo regulatory control. Moreover, in 5/7 (71%) of assayed enhancer pairs, we observed concordant expression patterns. We argue that TFBS composition is often necessary to retain and sufficient to predict regulatory function in the absence of overt sequence conservation, revealing an entire class of functionally conserved, evolutionarily diverged regulatory elements that we term "covert."
Affiliation(s)
- Leila Taher: Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- David M. McGaughey: McKusick–Nathans Institute of Genetic Medicine, Department of Molecular and Comparative Pathobiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
- Samantha Maragh: McKusick–Nathans Institute of Genetic Medicine, Department of Molecular and Comparative Pathobiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA; Biochemical Science Division, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, USA
- Ivy Aneas: Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Seneca L. Bessling: McKusick–Nathans Institute of Genetic Medicine, Department of Molecular and Comparative Pathobiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
- Webb Miller: Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Marcelo A. Nobrega: Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA
- Andrew S. McCallion: McKusick–Nathans Institute of Genetic Medicine, Department of Molecular and Comparative Pathobiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
- Ivan Ovcharenko: Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
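The abstract of entry 11 (Taher et al.) argues that the composition of transcription factor binding sites can be conserved while the underlying sequence diverges. The sketch below illustrates that idea only in a generic way: it compares two elements by the cosine similarity of their TFBS count profiles. This is an assumed, simplified stand-in and not the authors' method; the function name `composition_similarity`, the motif labels (`TF_A` and so on), and the example elements are all hypothetical.

```python
from collections import Counter
from math import sqrt


def composition_similarity(sites_a, sites_b):
    """Cosine similarity between the TFBS count profiles of two elements.

    Each argument is a list of motif labels, one per predicted binding site.
    Site order and spacing are ignored; only the counts matter.
    """
    counts_a, counts_b = Counter(sites_a), Counter(sites_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[m] * counts_b[m] for m in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical elements: same motif content in a different order,
# versus an element with entirely different motifs.
human_element = ["TF_A", "TF_B", "TF_B", "TF_C"]
zebrafish_element = ["TF_B", "TF_A", "TF_C", "TF_B"]
unrelated_element = ["TF_D", "TF_E"]

same_composition = composition_similarity(human_element, zebrafish_element)
no_overlap = composition_similarity(human_element, unrelated_element)

print(f"Shuffled but identical composition: {same_composition:.2f}")
print(f"No shared motifs: {no_overlap:.2f}")
```

A count-based profile deliberately discards nucleotide-level alignment, which is why two elements can score as highly similar here even when their sequences would not align.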
12
Soccio RE, Tuteja G, Everett LJ, Li Z, Lazar MA, Kaestner KH. Species-specific strategies underlying conserved functions of metabolic transcription factors. Mol Endocrinol 2011; 25:694-706. [PMID: 21292830 DOI: 10.1210/me.2010-0454] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Indexed: 12/23/2022]
Abstract
The winged helix protein FOXA2 and the nuclear receptor peroxisome proliferator-activated receptor-γ (PPARγ) are highly conserved, regionally expressed transcription factors (TFs) that regulate networks of genes controlling complex metabolic functions. Cistrome analysis for Foxa2 in mouse liver and PPARγ in mouse adipocytes has previously produced consensus-binding sites that are nearly identical to those used by the corresponding TFs in human cells. We report here that, despite the conservation of the canonical binding motif, the great majority of binding regions for FOXA2 in human liver and for PPARγ in human adipocytes are not in the orthologous locations corresponding to the mouse genome, and vice versa. Of note, TF binding can be absent in one species despite sequence conservation, including motifs that do support binding in the other species, demonstrating a major limitation of in silico binding site prediction. Whereas only approximately 10% of binding sites are conserved, gene-centric analysis reveals that about 50% of genes with nearby TF occupancy are shared across species for both hepatic FOXA2 and adipocyte PPARγ. Remarkably, for both TFs, many of the shared genes function in tissue-specific metabolic pathways, whereas species-unique genes fail to show enrichment for these pathways. Nonetheless, the species-unique genes, like the shared genes, showed the expected transcriptional regulation by the TFs in loss-of-function experiments. Thus, species-specific strategies underlie the biological functions of metabolic TFs that are highly conserved across mammalian species. Analysis of factor binding in multiple species may be necessary to distinguish apparent species-unique noise and reveal functionally relevant information.
Affiliation(s)
- Raymond E Soccio: Division of Endocrinology, Diabetes, and Metabolism, Department of Medicine, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania 19104-6149, USA
13
Majoros WH, Ohler U. Modeling the evolution of regulatory elements by simultaneous detection and alignment with phylogenetic pair HMMs. PLoS Comput Biol 2010; 6:e1001037. [PMID: 21187896 PMCID: PMC3002982 DOI: 10.1371/journal.pcbi.1001037] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 04/09/2010] [Accepted: 11/17/2010] [Indexed: 11/18/2022]
Abstract
The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.
Affiliation(s)
- William H Majoros: Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
14
Kim J, Sinha S. Towards realistic benchmarks for multiple alignments of non-coding sequences. BMC Bioinformatics 2010; 11:54. [PMID: 20102627 PMCID: PMC2823711 DOI: 10.1186/1471-2105-11-54] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 08/07/2009] [Accepted: 01/26/2010] [Indexed: 02/02/2023]
Abstract
BACKGROUND: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.
RESULTS: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.
CONCLUSION: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
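The contrast this abstract draws between simulating with averaged parameters and simulating with genome-wide distributions can be illustrated with a minimal Python sketch. This is not the authors' simulator: the uniform rate distribution, the region count and length, and the helper names `mutate`, `identity`, and `simulate_identities` are all assumptions made for illustration, and only substitutions (no insertions or deletions) are modelled.

```python
import random
import statistics

BASES = "ACGT"

def mutate(seq, sub_rate, rng):
    """Substitute each base independently with probability sub_rate."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            # Always pick a different base so every event is a visible change.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def simulate_identities(rates, length, rng):
    """Evolve one random ancestral region per rate; return percent identities."""
    ids = []
    for rate in rates:
        ancestor = "".join(rng.choice(BASES) for _ in range(length))
        ids.append(identity(ancestor, mutate(ancestor, rate, rng)))
    return ids

rng = random.Random(0)
n_regions, length = 200, 300

# Hypothetical stand-in for a genome-wide distribution of substitution rates.
per_region_rates = [rng.uniform(0.0, 0.6) for _ in range(n_regions)]
mean_rate = statistics.mean(per_region_rates)

# Conventional approach: every region uses the single averaged rate.
ids_avg = simulate_identities([mean_rate] * n_regions, length, rng)
# Distribution-based approach: each region draws its own rate.
ids_dist = simulate_identities(per_region_rates, length, rng)

spread_avg = statistics.pstdev(ids_avg)
spread_dist = statistics.pstdev(ids_dist)
print(f"spread with averaged rate:      {spread_avg:.3f}")
print(f"spread with per-region rates:   {spread_dist:.3f}")
```

Under these assumptions both runs have nearly the same mean identity, but only the per-region version spans a wide range of conservation levels, which is the kind of variability the abstract says averaged parameters fail to reproduce.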
Affiliation(s)
- Jaebum Kim
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
15
|
He X, Sinha S. Evolution of cis-regulatory sequences in Drosophila. Methods Mol Biol 2010; 674:283-296. [PMID: 20827599 DOI: 10.1007/978-1-60761-854-6_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
Cross-species comparison is an emerging paradigm for identifying cis-regulatory sequences and understanding their function and evolution. In this chapter, we review probabilistic models of evolution of transcription factor binding sites, which provide the theoretical basis for a number of new bioinformatics tools for comparative sequence analysis. We illustrate how important functional and evolutionary insights on binding site gain and loss can be acquired through sequence comparison. This includes the observation that binding site turnover follows a molecular clock and that its rate correlates with the strength of binding sites and the presence of other sites in the neighborhood. We also comment on emerging trends that go beyond individual binding sites to a more holistic study of regulatory evolution. We point out common technical challenges, such as reliable sequence alignment and binding site prediction, when doing comparative regulatory sequence analysis and note some potential solutions thereof.
Affiliation(s)
- Xin He
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
|
16
|
He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, Zhong S. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS One 2009; 4:e8155. [PMID: 19956545 PMCID: PMC2780727 DOI: 10.1371/journal.pone.0008155] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.9] [Received: 09/30/2009] [Accepted: 11/10/2009] [Indexed: 11/19/2022] Open
Abstract
Background: How transcription factors (TFs) interact with cis-regulatory sequences and interact with each other is a fundamental, but not well understood, aspect of gene regulation.
Methodology/Principal Findings: We present a computational method to address this question, relying on the established biophysical principles. This method, STAP (sequence to affinity prediction), takes into account all combinations and configurations of strong and weak binding sites to analyze large scale transcription factor (TF)-DNA binding data to discover cooperative interactions among TFs, infer sequence rules of interaction and predict TF target genes in new conditions with no TF-DNA binding data. The distinctions between STAP and other statistical approaches for analyzing cis-regulatory sequences include the utility of physical principles and the treatment of the DNA binding data as quantitative representation of binding strengths. Applying this method to the ChIP-seq data of 12 TFs in mouse embryonic stem (ES) cells, we found that the strength of TF-DNA binding could be significantly modulated by cooperative interactions among TFs with adjacent binding sites. However, further analysis on five putatively interacting TF pairs suggests that such interactions may be relatively insensitive to the distance and orientation of binding sites. Testing a set of putative Nanog motifs, STAP showed that a novel Nanog motif could better explain the ChIP-seq data than previously published ones. We then experimentally tested and verified the new Nanog motif. A series of comparisons showed that STAP has more predictive power than several state-of-the-art methods for cis-regulatory sequence analysis. We took advantage of this power to study the evolution of TF-target relationship in Drosophila. By learning the TF-DNA interaction models from the ChIP-chip data of D. melanogaster (Mel) and applying them to the genome of D. pseudoobscura (Pse), we found that only about half of the sequences strongly bound by TFs in Mel have high binding affinities in Pse. We show that prediction of functional TF targets from ChIP-chip data can be improved by using the conservation of STAP predicted affinities as an additional filter.
Conclusions/Significance: STAP is an effective method to analyze binding site arrangements, TF cooperativity, and TF target genes from genome-wide TF-DNA binding data.
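The idea of accounting for "all combinations and configurations" of binding sites can be sketched with a generic thermodynamic occupancy calculation. The Python below is an illustrative toy, not the STAP implementation: the function name `expected_bound`, the site weights, and the single cooperativity factor `coop` applied to neighbouring occupied sites are all assumptions.

```python
from itertools import combinations
import math

def expected_bound(site_weights, coop=1.0):
    """Expected number of occupied sites, summing over every configuration.

    site_weights: one statistical weight per site (a hypothetical stand-in
        for TF concentration times binding affinity); larger means stronger.
    coop: assumed multiplicative bonus for each pair of occupied sites that
        are neighbours in the list (1.0 means no cooperativity).
    """
    n = len(site_weights)
    partition = 0.0   # sum of weights over all configurations
    weighted_n = 0.0  # sum of (number bound * weight) over all configurations
    for k in range(n + 1):
        for occupied in combinations(range(n), k):
            weight = 1.0
            for i in occupied:
                weight *= site_weights[i]
            # Indices come out sorted, so consecutive entries that differ
            # by one are adjacent occupied sites.
            adjacent_pairs = sum(
                1 for a, b in zip(occupied, occupied[1:]) if b - a == 1
            )
            weight *= coop ** adjacent_pairs
            partition += weight
            weighted_n += k * weight
    return weighted_n / partition

weights = [2.0, 0.1, 0.5]  # one strong, one weak, one intermediate site
independent = expected_bound(weights)
cooperative = expected_bound(weights, coop=5.0)
print(f"expected sites bound, no cooperativity:   {independent:.3f}")
print(f"expected sites bound, with cooperativity: {cooperative:.3f}")
```

With `coop=1.0` the result reduces to the sum of independent single-site occupancies, and raising `coop` increases predicted binding, mirroring the abstract's observation that adjacent sites can modulate binding strength. Enumerating all subsets is exponential in the number of sites, so this form is only practical for short sequences.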
Affiliation(s)
- Xin He
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Chieh-Chun Chen
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Feng Hong
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Fang Fang
  - Gene Regulation Laboratory, Genome Institute of Singapore, Singapore, Singapore
- Saurabh Sinha
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
- Huck-Hui Ng
  - Gene Regulation Laboratory, Genome Institute of Singapore, Singapore, Singapore
- Sheng Zhong
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
  - Department of Bioengineering, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
  - Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
|
17
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.1] [Indexed: 11/20/2022] Open
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
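Motif stringency, described above as the extent to which a site fits the optimal TFBS sequence, can be made concrete with a small scoring sketch. The position frequency matrix, the uniform background, and the min-max normalisation in `stringency` are assumptions chosen for illustration; the review does not prescribe this particular formula.

```python
import math

# Hypothetical position frequency matrix for a 4-bp motif (illustrative only).
PFM = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.10, "T": 0.80},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform background base frequency

def column_scores(column):
    """Log-odds score of each base at one motif position."""
    return {base: math.log2(p / BACKGROUND) for base, p in column.items()}

SCORES = [column_scores(col) for col in PFM]
MAX_SCORE = sum(max(col.values()) for col in SCORES)
MIN_SCORE = sum(min(col.values()) for col in SCORES)

def stringency(site):
    """Scale a site's log-odds score to 0..1 (1 = optimal sequence)."""
    if len(site) != len(SCORES):
        raise ValueError("site length must match motif length")
    raw = sum(col[base] for col, base in zip(SCORES, site))
    return (raw - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)

# The optimal site takes the most frequent base at every position.
consensus = "".join(max(col, key=col.get) for col in PFM)

for candidate in (consensus, "ATGA", "CAAA"):
    print(candidate, round(stringency(candidate), 3))
```

Here the consensus site scores 1, a site with one mismatch scores lower, and a site built from the least favoured base at each position scores 0, giving a simple ranking of candidate sites by how closely they match the optimum.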
Affiliation(s)
- Sacha A F T van Hijum
  - Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
|
18
|
Rach EA, Yuan HY, Majoros WH, Tomancak P, Ohler U. Motif composition, conservation and condition-specificity of single and alternative transcription start sites in the Drosophila genome. Genome Biol 2009; 10:R73. [PMID: 19589141 PMCID: PMC2728527 DOI: 10.1186/gb-2009-10-7-r73] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.8] [Received: 12/29/2008] [Revised: 04/21/2009] [Accepted: 07/09/2009] [Indexed: 01/05/2023] Open
Abstract
A map of transcription start sites across the Drosophila genome, providing insights into initiation patterns and spatiotemporal conditions.
Background: Transcription initiation is a key component in the regulation of gene expression. mRNA 5' full-length sequencing techniques have enhanced our understanding of mammalian transcription start sites (TSSs), revealing different initiation patterns on a genomic scale.
Results: To identify TSSs in Drosophila melanogaster, we applied a hierarchical clustering strategy on available 5' expressed sequence tags (ESTs) and identified a high quality set of 5,665 TSSs for approximately 4,000 genes. We distinguished two initiation patterns: 'peaked' TSSs, and 'broad' TSS cluster groups. Peaked promoters were found to contain location-specific sequence elements; conversely, broad promoters were associated with non-location-specific elements. In alignments across other Drosophila genomes, conservation levels of sequence elements exceeded 90% within the melanogaster subgroup, but dropped considerably for distal species. Elements in broad promoters had lower levels of conservation than those in peaked promoters. When characterizing the distributions of ESTs, 64% of TSSs showed distinct associations to one out of eight different spatiotemporal conditions. Available whole-genome tiling array time series data revealed different temporal patterns of embryonic activity across the majority of genes with distinct alternative promoters. Many genes with maternally inherited transcripts were found to have alternative promoters utilized later in development. Core promoters of maternally inherited transcripts showed differences in motif composition compared to zygotically active promoters.
Conclusions: Our study provides a comprehensive map of Drosophila TSSs and the conditions under which they are utilized. Distinct differences in motif associations with initiation pattern and spatiotemporal utilization illustrate the complex regulatory code of transcription initiation.
Affiliation(s)
- Elizabeth A Rach
  - Program in Computational Biology and Bioinformatics, Duke University, Science Drive, Durham, NC 27708, USA
|
19
|
He X, Ling X, Sinha S. Alignment and prediction of cis-regulatory modules based on a probabilistic model of evolution. PLoS Comput Biol 2009; 5:e1000299. [PMID: 19293946 PMCID: PMC2657044 DOI: 10.1371/journal.pcbi.1000299] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.0] [Received: 07/01/2008] [Accepted: 01/22/2009] [Indexed: 11/30/2022] Open
Abstract
Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process being recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context. 
Comparison of noncoding DNA sequences across species has the potential to significantly improve our understanding of gene regulation and our ability to annotate regulatory regions of the genome. This potential is evident from recent publications analyzing 12 Drosophila genomes for regulatory annotation. However, because noncoding sequences are much less structured than coding sequences, their interspecies comparison presents technical challenges, such as ambiguity about how to align them and how to predict transcription factor binding sites, which are the fundamental units that make up regulatory sequences. This article describes how to build an integrated probabilistic framework that performs alignment and binding site prediction simultaneously, in the process improving the accuracy of both tasks. It defines a stochastic model for the evolution of entire “cis-regulatory modules,” with its highlight being a novel theoretical treatment of the commonly observed loss and gain of binding sites during evolution. This new evolutionary model forms the backbone of newly developed software for the prediction of new cis-regulatory modules, alignment of known modules to elucidate general principles of cis-regulatory evolution, or both. The new software is demonstrated to provide benefits in performance of these two crucial genomics tasks.
Affiliation(s)
- Xin He
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
- Xu Ling
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
- Saurabh Sinha
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
|
20
|
Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet 2008; 4:e1000106. [PMID: 18584029 PMCID: PMC2430619 DOI: 10.1371/journal.pgen.1000106] [Citation(s) in RCA: 220] [Impact Index Per Article: 12.9] [Received: 02/28/2008] [Accepted: 05/22/2008] [Indexed: 12/31/2022] Open
Abstract
The gene expression pattern specified by an animal regulatory sequence is generally viewed as arising from the particular arrangement of transcription factor binding sites it contains. However, we demonstrate here that regulatory sequences whose binding sites have been almost completely rearranged can still produce identical outputs. We sequenced the even-skipped locus from six species of scavenger flies (Sepsidae) that are highly diverged from the model species Drosophila melanogaster, but share its basic patterns of developmental gene expression. Although there is little sequence similarity between the sepsid eve enhancers and their well-characterized D. melanogaster counterparts, the sepsid and Drosophila enhancers drive nearly identical expression patterns in transgenic D. melanogaster embryos. We conclude that the molecular machinery that connects regulatory sequences to the transcription apparatus is more flexible than previously appreciated. In exploring this diverse collection of sequences to identify the shared features that account for their similar functions, we found a small number of short (20-30 bp) sequences nearly perfectly conserved among the species. These highly conserved sequences are strongly enriched for pairs of overlapping or adjacent binding sites. Together, these observations suggest that the local arrangement of binding sites relative to each other is more important than their overall arrangement into larger units of cis-regulatory function.
Affiliation(s)
- Emily E. Hare
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- Brant K. Peterson
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
  - Center for Integrative Genomics, University of California Berkeley, Berkeley, California, United States of America
- Venky N. Iyer
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- Rudolf Meier
  - Department of Biological Sciences, National University of Singapore, Singapore
- Michael B. Eisen
  - Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
  - Center for Integrative Genomics, University of California Berkeley, Berkeley, California, United States of America
  - Genomics Division, Ernest Orlando Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - California Institute for Quantitative Biosciences, Berkeley, California, United States of America
|
21
|
Ray P, Shringarpure S, Kolar M, Xing EP. CSMET: comparative genomic motif detection via multi-resolution phylogenetic shadowing. PLoS Comput Biol 2008; 4:e1000090. [PMID: 18535663 PMCID: PMC2396503 DOI: 10.1371/journal.pcbi.1000090] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 06/29/2007] [Accepted: 04/28/2008] [Indexed: 11/19/2022] Open
Abstract
Functional turnover of transcription factor binding sites (TFBSs), such as whole-motif loss or gain, are common events during genome evolution. Conventional probabilistic phylogenetic shadowing methods model the evolution of genomes only at nucleotide level, and lack the ability to capture the evolutionary dynamics of functional turnover of aligned sequence entities. As a result, comparative genomic search of non-conserved motifs across evolutionarily related taxa remains a difficult challenge, especially in higher eukaryotes, where the cis-regulatory regions containing motifs can be long and divergent; existing methods rely heavily on specialized pattern-driven heuristic search or sampling algorithms, which can be difficult to generalize and hard to interpret based on phylogenetic principles. We propose a new method: Conditional Shadowing via Multi-resolution Evolutionary Trees, or CSMET, which uses a context-dependent probabilistic graphical model that allows aligned sites from different taxa in a multiple alignment to be modeled by either a background or an appropriate motif phylogeny conditioning on the functional specifications of each taxon. The functional specifications themselves are the output of a phylogeny which models the evolution not of individual nucleotides, but of the overall functionality (e.g., functional retention or loss) of the aligned sequence segments over lineages. Combining this method with a hidden Markov model that autocorrelates evolutionary rates on successive sites in the genome, CSMET offers a principled way to take into consideration lineage-specific evolution of TFBSs during motif detection, and a readily computable analytical form of the posterior distribution of motifs under TFBS turnover. On both simulated and real Drosophila cis-regulatory modules, CSMET outperforms other state-of-the-art comparative genomic motif finders.
Affiliation(s)
- Pradipta Ray
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Suyash Shringarpure
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Mladen Kolar
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Eric P. Xing
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
|