1
Thakur V, Bains S, Kaur R, Singh K. Identification and characterization of SlbHLH, SlDof and SlWRKY transcription factors interacting with SlDPD gene involved in costunolide biosynthesis in Saussurea lappa. Int J Biol Macromol 2021; 173:146-159. [PMID: 33482203 DOI: 10.1016/j.ijbiomac.2021.01.114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2020] [Revised: 12/26/2020] [Accepted: 01/17/2021] [Indexed: 11/27/2022]
Abstract
The genes involved in costunolide biosynthesis in Saussurea lappa have been identified recently by our lab. However, the transcriptional regulators of these genes had not been studied, which limited opportunities for engineering this pharmacologically important biosynthetic pathway. Therefore, we cloned the promoter region of the diphosphomevalonate decarboxylase gene (DPD) and analyzed its cis-acting regulatory elements to reveal the potential transcription factor (TF) binding sites for Dof, bHLH and WRKY family proteins in the gene promoter. A transcriptome study followed by a hidden Markov model-based search, digital gene expression, co-expression network analysis, analysis of conserved domain properties and evolutionary analyses was carried out to screen seven putative TFs for the DPD-TF interaction studies. Yeast one-hybrid assays were performed, and three TFs, namely SlDOF2, SlbHLH3 and SlWRKY2 from the Dof, bHLH and WRKY families, respectively, were found to interact positively with the DPD gene of the costunolide biosynthetic pathway. The tissue-specific relative gene expression studies also supported the linked co-expression of the gene and its interacting TFs. The present report will improve the understanding of the transcriptional regulation pattern of the costunolide biosynthetic pathway.
Affiliation(s)
- Vasundhara Thakur
- Department of Biotechnology, Panjab University, BMS Block I, Sector 25, Chandigarh 160014, India
- Savita Bains
- Department of Biotechnology, Panjab University, BMS Block I, Sector 25, Chandigarh 160014, India
- Ravneet Kaur
- Department of Biotechnology, Panjab University, BMS Block I, Sector 25, Chandigarh 160014, India
- Kashmir Singh
- Department of Biotechnology, Panjab University, BMS Block I, Sector 25, Chandigarh 160014, India.
2
Carazo F, Romero JP, Rubio A. Upstream analysis of alternative splicing: a review of computational approaches to predict context-dependent splicing factors. Brief Bioinform 2020; 20:1358-1375. [PMID: 29390045 DOI: 10.1093/bib/bby005] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/24/2017] [Revised: 12/14/2017] [Indexed: 12/13/2022]
Abstract
Alternative splicing (AS) has been shown to play a pivotal role in the development of diseases, including cancer. Specifically, all the hallmarks of cancer (angiogenesis, cell immortality, avoiding immune system response, etc.) are found to have a counterpart in aberrant splicing of key genes. Identifying the context-specific regulators of splicing provides valuable information to find new biomarkers, as well as to define alternative therapeutic strategies. The computational models to identify these regulators are not trivial and require three conceptual steps: the detection of AS events, the identification of splicing factors that potentially regulate these events and the contextualization of these pieces of information for a specific experiment. In this work, we review the different algorithmic methodologies developed for each of these tasks. The main weaknesses and strengths of the different steps of the pipeline are discussed. Finally, a case study is detailed to make the reader aware of the potential and limitations of this computational approach.
3
Lenzini L, Di Patti F, Livi R, Fondi M, Fani R, Mengoni A. A Method for the Structure-Based, Genome-Wide Analysis of Bacterial Intergenic Sequences Identifies Shared Compositional and Functional Features. Genes (Basel) 2019; 10:genes10100834. [PMID: 31652625 PMCID: PMC6826451 DOI: 10.3390/genes10100834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2019] [Revised: 10/07/2019] [Accepted: 10/16/2019] [Indexed: 11/16/2022]
Abstract
In this paper, we propose a computational strategy for performing genome-wide analyses of intergenic sequences in bacterial genomes. Following the approach of a previous paper, in which a method for genome-wide analysis of eukaryotic intergenic sequences was proposed, we developed a tool that implements similar concepts for bacterial genomes. This allows us to (i) classify intergenic sequences into clusters characterized by specific global structural features and (ii) draw possible relations with their functional features.
Affiliation(s)
- Leonardo Lenzini
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Istituto Nazionale di Fisica Nucleare, Sesto Fiorentino, 50019, Italy.
- Francesca Di Patti
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Centro Interdipartimentale per lo Studio delle Dinamiche Complesse, Sesto Fiorentino, 50019, Italy.
- Roberto Livi
- Dipartimento di Fisica e Astronomia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Istituto Nazionale di Fisica Nucleare, Sesto Fiorentino, 50019, Italy.
- Centro Interdipartimentale per lo Studio delle Dinamiche Complesse, Sesto Fiorentino, 50019, Italy.
- Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, Sesto Fiorentino, 50019, Italy.
- Marco Fondi
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Renato Fani
- Istituto dei Sistemi Complessi, Consiglio Nazionale delle Ricerche, Sesto Fiorentino, 50019, Italy.
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
- Alessio Mengoni
- Dipartimento di Biologia, Università degli Studi di Firenze, Sesto Fiorentino, 50019, Italy.
4
Xie J, Li Y, Liu X, Zhao Y, Li B, Ingvarsson PK, Zhang D. Evolutionary Origins of Pseudogenes and Their Association with Regulatory Sequences in Plants. THE PLANT CELL 2019; 31:563-578. [PMID: 30760562 PMCID: PMC6482637 DOI: 10.1105/tpc.18.00601] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/11/2018] [Revised: 12/03/2018] [Accepted: 02/12/2019] [Indexed: 05/06/2023]
Abstract
Pseudogenes (Ψs), nonfunctional relatives of functional genes, form by duplication or retrotransposition, and loss of gene function by disabling mutations. Evolutionary analysis provides clues to Ψ origins and effects on gene regulation. However, few systematic studies of plant Ψs have been conducted, hampering comparative analyses. Here, we examined the origin, evolution, and expression patterns of Ψs and their relationships with noncoding sequences in seven angiosperm plants. We identified ∼250,000 Ψs, most of which are more lineage specific than protein-coding genes. The distribution of Ψs on the chromosome indicates that genome recombination may contribute to Ψ elimination. Most Ψs evolve rapidly in terms of sequence and expression levels, showing tissue- or stage-specific expression patterns. We found that a surprisingly large fraction of nontransposable element regulatory noncoding RNAs (microRNAs and long noncoding RNAs) originate from transcription of Ψ proximal upstream regions. We also found that transcription factor binding sites preferentially occur in putative Ψ proximal upstream regions compared with random intergenic regions, suggesting that Ψs have conditioned genome evolution by providing transcription factor binding sites that serve as promoters and enhancers. We therefore propose that rapid rewiring of Ψ transcriptional regulatory regions is a major mechanism driving the origin of novel regulatory modules.
Affiliation(s)
- Jianbo Xie
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Ying Li
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Xiaomin Liu
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Yiyang Zhao
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Bailian Li
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Department of Forestry, North Carolina State University, Raleigh, North Carolina 27695-8203
- Pär K Ingvarsson
- Linnean Center for Plant Biology, Department of Plant Biology, Swedish University of Agricultural Sciences, Box 7080, SE-750 07 Uppsala, Sweden
- Deqiang Zhang
- Beijing Advanced Innovation Center for Tree Breeding by Molecular Design, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- National Engineering Laboratory for Tree Breeding, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
- Key Laboratory of Genetics and Breeding in Forest Trees and Ornamental Plants, Ministry of Education, College of Biological Sciences and Technology, Beijing Forestry University, No. 35, Qinghua East Road, Beijing 100083, People's Republic of China
5
Raghunath A, Nagarajan R, Sundarraj K, Panneerselvam L, Perumal E. Genome-wide identification and analysis of Nrf2 binding sites - Antioxidant response elements in zebrafish. Toxicol Appl Pharmacol 2018; 360:236-248. [PMID: 30243843 DOI: 10.1016/j.taap.2018.09.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 04/28/2018] [Revised: 09/08/2018] [Accepted: 09/13/2018] [Indexed: 12/30/2022]
Abstract
In the post-genomic era, deciphering the Nrf2 binding sites - antioxidant response elements (AREs) is an essential task that underlies and governs the Keap1-Nrf2-ARE pathway - a cell survival response pathway to environmental stresses in the vertebrate model system. AREs regulate the transcription of a repertoire of phase II detoxifying and/or oxidative-stress responsive genes, offering protection against toxic chemicals, carcinogens, and xenobiotics. In order to identify and analyze AREs in zebrafish, a pattern search algorithm was developed to identify AREs and computational tools available online were utilized to analyze the identified AREs in zebrafish. This study identified the AREs within 30 kb upstream from the transcription start site of antioxidant genes and mitochondrial genes. We report for the first time the AREs of all the known protein coding genes in the zebrafish genome. Western blotting, RT2 profiler array PCR, and qRT-PCR were performed to test whether AREs influence the Nrf2 target genes expression in the zebrafish larvae using sulforaphane. This study reveals unique AREs that have not been previously reported in the cytoprotective genes. Nine TGAG/CNNNTC and six TGAG/CNNNGC AREs were observed significantly. Our findings suggest that AREs drive the dynamic transcriptional events of Nrf2 target genes in the zebrafish larvae on exposure to sulforaphane. The identified abundant putative AREs will define the Keap1-Nrf2-ARE network and elucidate the precise regulation of Nrf2-ARE pathway in not only diseases but also in embryonic development, inflammation, and aerobic respiration. Our results help to understand the dynamic complexity of the Nrf2-ARE system in zebrafish.
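The abstract does not describe the pattern search algorithm itself, so the following is only a minimal sketch of how such a scan could look. It assumes that "TGAG/CNNNTC" and "TGAG/CNNNGC" mean TGA, then G or C, then any three bases, then TC or GC; the function name `find_are_sites` is hypothetical, and only the forward strand is scanned.

```python
import re

# Assumption: the slash in "TGAG/CNNNTC" marks alternative bases (G or C)
# at the fourth position, and N stands for any nucleotide. The two reported
# motif variants (ending in TC or GC) are merged into one pattern.
# A lookahead is used so that overlapping sites are all reported.
ARE_PATTERN = re.compile(r"(?=(TGA[GC][ACGT]{3}[GT]C))")


def find_are_sites(sequence):
    """Return (start, matched_site) pairs for ARE-like sites (0-based, forward strand)."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in ARE_PATTERN.finditer(seq)]


if __name__ == "__main__":
    # Toy sequence, not taken from the zebrafish genome.
    print(find_are_sites("AATGACTCAGCAA"))
```

A genome-wide scan in the spirit of the study would apply such a function to the upstream region of each gene and to the reverse complement as well; those steps are left out here.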
Affiliation(s)
- Azhwar Raghunath
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Raju Nagarajan
- Department of Biotechnology, Indian Institute of Technology Madras, Chennai 600 036, Tamilnadu, India
- Kiruthika Sundarraj
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Lakshmikanthan Panneerselvam
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India
- Ekambaram Perumal
- Molecular Toxicology Laboratory, Department of Biotechnology, Bharathiar University, Coimbatore 641 046, Tamilnadu, India.
6
Caldonazzo Garbelini JM, Kashiwabara AY, Sanches DS. Sequence motif finder using memetic algorithm. BMC Bioinformatics 2018; 19:4. [PMID: 29298679 PMCID: PMC5751424 DOI: 10.1186/s12859-017-2005-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/08/2017] [Accepted: 12/18/2017] [Indexed: 11/10/2022]
Abstract
Background: De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult task and an important problem in Bioinformatics. The correct recognition of TFBS plays an important role in understanding the mechanisms of gene regulation and helps to develop new drugs.
Results: We here present Memetic Framework for Motif Discovery (MFMD), an algorithm that uses semi-greedy constructive heuristics as a local optimizer. In addition, we used a hybridization of the classic genetic algorithm as a global optimizer to refine the solutions initially found. MFMD can find and classify overrepresented patterns in DNA sequences and predict their respective initial positions. MFMD performance was assessed using ChIP-seq data retrieved from the JASPAR site, promoter sequences extracted from the ABS site, and artificially generated synthetic data. MFMD was evaluated and compared with two well-known approaches in the literature, MEME and Gibbs Motif Sampler, achieving a higher f-score in most of the datasets used in this work.
Conclusions: We have developed an approach for detecting motifs in biopolymer sequences. MFMD is freely available software that may be a promising alternative for the development of new tools for de novo motif discovery. Its open-source software can be downloaded at https://github.com/jadermcg/mfmd.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-2005-1) contains supplementary material, which is available to authorized users.
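The abstract reports f-scores without saying how predicted sites were matched to known ones. As a hedged illustration, the sketch below computes the F1 score under the assumption that a prediction counts as correct only when its start position exactly equals an annotated start position; the function name `f_score` and the toy positions are hypothetical, and the paper may use a different matching rule, such as nucleotide-level overlap.

```python
def f_score(predicted_starts, true_starts):
    """F1 score of predicted motif start positions against annotated ones.

    Assumption: a predicted site is a true positive only if its start
    position exactly matches an annotated start position.
    """
    predicted = set(predicted_starts)
    actual = set(true_starts)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy positions: 2 of 3 predictions are correct, 2 of 4 true sites found.
    print(f_score([1, 5, 9], [1, 5, 7, 8]))
```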
Affiliation(s)
- Jader M Caldonazzo Garbelini
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil.
- André Y Kashiwabara
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Danilo S Sanches
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
7
Peters B, Casey J, Aidley J, Zohrab S, Borg M, Twell D, Brownfield L. A Conserved cis-Regulatory Module Determines Germline Fate through Activation of the Transcription Factor DUO1 Promoter. PLANT PHYSIOLOGY 2017; 173:280-293. [PMID: 27624837 PMCID: PMC5210719 DOI: 10.1104/pp.16.01192] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 07/29/2016] [Accepted: 09/07/2016] [Indexed: 05/07/2023]
Abstract
The development of the male germline within pollen relies upon the activation of numerous target genes by the transcription factor DUO POLLEN1 (DUO1). The expression of DUO1 is restricted to the male germline and is first detected shortly after the asymmetric division that segregates the germ cell lineage. Transcriptional regulation is critical in controlling DUO1 expression, since transcriptional and translational fusions show similar expression patterns. Here, we identify key promoter sequences required for the germline-specific regulation of DUO1 transcription. Combining promoter deletion analyses with phylogenetic footprinting in eudicots and in Arabidopsis accessions, we identify a cis-regulatory module, Regulatory region of DUO1 (ROD1), which replicates the expression pattern of DUO1 in Arabidopsis (Arabidopsis thaliana). We show that ROD1 from the legume Medicago truncatula directs male germline-specific expression in Arabidopsis, demonstrating conservation of DUO1 regulation among eudicots. ROD1 contains several short conserved cis-regulatory elements, including three copies of the motif DNGTGGV, required for germline expression and tandem repeats of the motif YAACYGY, which enhance DUO1 transcription in a positive feedback loop. We conclude that a cis-regulatory module conserved in eudicots directs the spatial and temporal expression of the transcription factor DUO1 to specify male germline fate and sperm cell differentiation.
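The motifs DNGTGGV and YAACYGY are written in IUPAC nucleotide ambiguity codes (for example D = A, G or T; V = A, C or G; Y = C or T; N = any base). The following minimal sketch shows one way to locate such degenerate motifs in a sequence by translating them into regular expressions. The helper names `iupac_to_regex` and `motif_positions` and the toy sequences are assumptions for illustration, not part of the study.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'DNGTGGV' into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def motif_positions(motif, sequence):
    """Return 0-based start positions of all (possibly overlapping) forward-strand matches."""
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy sequences, not taken from the DUO1 promoter.
    print(motif_positions("DNGTGGV", "CCAAGTGGACC"))
    print(motif_positions("YAACYGY", "TTAACCGTAACTGC"))
```

Counting matches per promoter in this way gives a quick check for repeated copies of a motif, although it says nothing about whether a given copy is functional.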
Affiliation(s)
- Benjamin Peters
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- Jonathan Casey
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- Jack Aidley
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- Stuart Zohrab
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- Michael Borg
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- David Twell
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
- Lynette Brownfield
- Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand (B.P., J.C., S.Z., L.B.); and
- Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom (J.A., M.B., D.T.)
8
Acevedo-Luna N, Mariño-Ramírez L, Halbert A, Hansen U, Landsman D, Spouge JL. Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 2016; 17:479. [PMID: 27871221 PMCID: PMC5117513 DOI: 10.1186/s12859-016-1354-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Accepted: 11/11/2016] [Indexed: 11/24/2022]
Abstract
Background: Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA, to control specific sets of genes. Some transcription factor binding sites (TFBSs) near the transcription start site (TSS) display tight positional preferences relative to the TSS. Furthermore, near the TSS, RMs can co-localize TFBSs with each other and the TSS. The proportion of TFBS positional preferences due to TFBS co-localization within RMs is unknown, however. ChIP experiments confirm co-localization of some TFBSs genome-wide, including near the TSS, but they typically examine only a few TFs at a time, using non-physiological conditions that can vary from lab to lab. In contrast, sequence analysis can examine many TFs uniformly and methodically, broadly surveying the co-localization of TFBSs with tight positional preferences relative to the TSS.
Results: Our statistics found 43 significant sets of human motifs in the JASPAR TF Database with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a gene group of 135 to 3304 genes, with 42/43 (98%) gene groups independently validated by DAVID, a gene ontology database, with FDR < 0.05. Motifs corresponding to two TFBSs in an RM should co-occur more than by chance alone, enriching the intersection of the gene groups corresponding to the two TFs. Thus, a gene-group intersection systematically enriched beyond chance alone provides evidence that the two TFs participate in an RM. Of the 903 = 43*42/2 intersections of the 43 significant gene groups, we found 768/903 (85%) pairs of gene groups with significantly enriched intersections, with 564/768 (73%) intersections independently validated by DAVID with FDR < 0.05. A user-friendly web site at http://go.usa.gov/3kjsH permits biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs.
Conclusions: Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM near the TSS that binds a particular TF subunit. Of all intersections of our 43 significant gene groups, 85% were significantly enriched, with 73% of the significant enrichments independently validated by gene ontology. The co-localization of TFBSs within RMs therefore likely explains much of the tight TFBS positional preferences near the TSS.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1354-5) contains supplementary material, which is available to authorized users.
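The count 903 = 43*42/2 is the number of unordered pairs among 43 gene groups. The abstract does not name the enrichment statistic, so the sketch below uses a standard one-sided hypergeometric test purely as an assumed stand-in: given a universe of genes, it asks how likely an intersection at least as large as the observed one would be by chance. The function names and the toy numbers are hypothetical.

```python
from math import comb


def n_pairs(k):
    """Number of unordered pairs among k groups, e.g. 43 groups -> 903 pairs."""
    return k * (k - 1) // 2


def hypergeom_tail(universe, group_a, group_b, overlap):
    """P(X >= overlap) when group_b genes are drawn at random from a universe
    that contains group_a marked genes (one-sided hypergeometric test).

    Assumption: this is a generic enrichment test, not necessarily the
    statistic used in the paper.
    """
    upper = min(group_a, group_b)
    total = comb(universe, group_b)
    favorable = sum(
        comb(group_a, i) * comb(universe - group_a, group_b - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


if __name__ == "__main__":
    print(n_pairs(43))
    # Toy numbers: 20000 genes, groups of 135 and 300 genes sharing 15 genes.
    print(hypergeom_tail(20000, 135, 300, 15))
```

With 903 pairs tested, a multiple-testing correction would be needed before calling any intersection significant.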
Affiliation(s)
- Natalia Acevedo-Luna
- Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA, 50011, USA
- Leonardo Mariño-Ramírez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Armand Halbert
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Ulla Hansen
- Department of Biology, Boston University, 5 Cummington Mall, Boston, MA, 02215, USA
- David Landsman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
9
Taher L, Narlikar L, Ovcharenko I. Identification and computational analysis of gene regulatory elements. Cold Spring Harb Protoc 2015; 2015:pdb.top083642. [PMID: 25561628 PMCID: PMC5885252 DOI: 10.1101/pdb.top083642] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 06/04/2023]
Abstract
Over the last two decades, advances in experimental and computational technologies have greatly facilitated genomic research. Next-generation sequencing technologies have made de novo sequencing of large genomes affordable, and powerful computational approaches have enabled accurate annotations of genomic DNA sequences. Charting functional regions in genomes must account for not only the coding sequences, but also noncoding RNAs, repetitive elements, chromatin states, epigenetic modifications, and gene regulatory elements. A mix of comparative genomics, high-throughput biological experiments, and machine learning approaches has played a major role in this truly global effort. Here we describe some of these approaches and provide an account of our current understanding of the complex landscape of the human genome. We also present overviews of different publicly available, large-scale experimental data sets and computational tools, which we hope will prove beneficial for researchers working with large and complex genomes.
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, 18051 Rostock, Germany
- Leelavati Narlikar
- Chemical Engineering and Process Development Division, National Chemical Laboratory, CSIR, Pune 411008, India
- Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
10
Cha M, Zhou Q. Detecting clustering and ordering binding patterns among transcription factors via point process models. Bioinformatics 2014; 30:2263-71. [PMID: 24790155 DOI: 10.1093/bioinformatics/btu303] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Recent developments in ChIP-Seq technology have generated binding data for many transcription factors (TFs) in various cell types and cellular conditions. This opens great opportunities for studying combinatorial binding patterns among a set of TFs active in a particular cellular condition, which is a key component for understanding the interaction between TFs in gene regulation.
RESULTS: As a first step toward the identification of combinatorial binding patterns, we develop statistical methods to detect clustering and ordering patterns among binding sites (BSs) of a pair of TFs. Testing procedures based on Ripley's K-function and its generalizations are developed to identify binding patterns from large collections of BSs in ChIP-Seq data. We have applied our methods to the ChIP-Seq data of 91 pairs of TFs in mouse embryonic stem cells. Our methods have detected clustering binding patterns between most TF pairs, which is consistent with the findings in the literature, and have identified significant ordering preferences, relative to the direction of target gene transcription, among the BSs of seven TFs. More interestingly, our results demonstrate that the identified clustering and ordering binding patterns between TFs are associated with the expression of the target genes. These findings provide new insights into co-regulation between TFs.
AVAILABILITY AND IMPLEMENTATION: See 'www.stat.ucla.edu/~zhou/TFKFunctions/' for source code.
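To give a feel for the K-function idea, here is a minimal sketch of a naive cross-type K estimate for two sets of binding-site coordinates on a one-dimensional region. It is an assumption-laden simplification: it applies no edge correction, treats the region as a single segment of known length, and does not reproduce the generalizations or testing procedures developed in the paper. The function name `cross_k` and the coordinates are hypothetical.

```python
def cross_k(sites_a, sites_b, r, region_length):
    """Naive cross-type Ripley's K estimate on a 1-D region.

    K_ab(r) = region_length / (n_a * n_b) * #{(a, b) : |a - b| <= r}

    Under independent, uniformly scattered sites the expected value is
    roughly 2 * r, so values well above 2 * r hint at clustering of the
    two site types. No edge correction is applied (simplifying assumption).
    """
    close_pairs = sum(
        1 for a in sites_a for b in sites_b if abs(a - b) <= r
    )
    return region_length * close_pairs / (len(sites_a) * len(sites_b))


if __name__ == "__main__":
    # Toy coordinates in base pairs on a 1000 bp region.
    tf1_sites = [100, 500]
    tf2_sites = [110, 900]
    r = 20
    print(cross_k(tf1_sites, tf2_sites, r, 1000), "vs. baseline", 2 * r)
```

The double loop is quadratic in the number of sites, which is fine for a toy example but would need sorting or binning for genome-scale ChIP-Seq peak sets.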
Affiliation(s)
- Maria Cha
- Department of Statistics, University of California, Los Angeles, CA 90095, USA
- Qing Zhou
- Department of Statistics, University of California, Los Angeles, CA 90095, USA
11
Hu ZP, Chen LS, Jia CY, Zhu HZ, Wang W, Zhong J. Screening of potential pseudo att sites of Streptomyces phage ΦC31 integrase in the human genome. Acta Pharmacol Sin 2013; 34:561-9. [PMID: 23416928 DOI: 10.1038/aps.2012.173] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/26/2022]
Abstract
AIM: ΦC31 integrase mediates site-specific recombination between two short sequences, attP and attB, in phage and bacterial genomes, which is a promising tool in gene regulation-based therapy since the zinc finger structure is probably the DNA recognizing domain that can further be engineered. The aim of this study was to screen potential pseudo att sites of ΦC31 integrase in the human genome, and evaluate the risks of its application in human gene therapy.
METHODS: TFBS (transcription factor binding sites) were found on the basis of reported pseudo att sites using multiple motif-finding tools, including AlignACE, BioProspector, Consensus, MEME, and Weeder. The human genome was scanned with the proposed motif to find the potential pseudo att sites of ΦC31 integrase.
RESULTS: The possible recognition motif of ΦC31 integrase was identified, which was composed of two co-occurring conserved elements that were reverse complements of each other, flanking the core sequence TTG. In the human genome, a total of 27924 potential pseudo att sites of ΦC31 integrase were found, which were distributed in each human chromosome with high-risk specificity values in chromosomes 16, 17, and 19. When the risks of the sites were evaluated more rigorously, 53 hits were discovered, and some of them fell within vital functional genes or regulatory regions, such as ACYP2, AKR1B1, DUSP4, etc.
CONCLUSION: The results provide clues for more comprehensive evaluation of the risks of using ΦC31 integrase in human gene therapy and for drug discovery.
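The structure of the reported motif, two elements that are reverse complements of each other on either side of the core TTG, can be illustrated with a small scan. This is a sketch under strong assumptions: the abstract does not give the element sequences or lengths, so the arm length is a free parameter, matching is exact rather than degenerate, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def flanked_core_sites(sequence, arm_len, core="TTG"):
    """Return 0-based starts of windows shaped like  LEFT + core + revcomp(LEFT).

    Assumption: arms match exactly and have a fixed length `arm_len`;
    the real recognition motif is likely degenerate.
    """
    seq = sequence.upper()
    width = 2 * arm_len + len(core)
    hits = []
    for start in range(len(seq) - width + 1):
        left = seq[start:start + arm_len]
        middle = seq[start + arm_len:start + arm_len + len(core)]
        right = seq[start + arm_len + len(core):start + width]
        if middle == core and right == reverse_complement(left):
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy sequence: ACGG + TTG + CCGT (reverse complement of ACGG), padded.
    print(flanked_core_sites("GGACGGTTGCCGTAA", arm_len=4))
```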
12
|
Ding J, Li X, Hu H. Systematic prediction of cis-regulatory elements in the Chlamydomonas reinhardtii genome using comparative genomics. Plant Physiol 2012; 160:613-23. [PMID: 22915576 PMCID: PMC3461543 DOI: 10.1104/pp.112.200840] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 05/04/2023]
Abstract
Chlamydomonas reinhardtii is one of the most important microalgal model organisms and has been widely studied to understand chloroplast functions and various cellular processes. Further exploitation of C. reinhardtii as a model system to elucidate various molecular mechanisms and pathways requires systematic study of gene regulation. However, there is a general lack of genome-scale gene regulation studies, such as global cis-regulatory element (CRE) identification, in C. reinhardtii. Recently, large-scale genomic data in microalgae species have become available, which enable the development of efficient computational methods to systematically identify CREs and characterize their roles in microalgae gene regulation. Here, we performed in silico CRE identification at the whole genome level in C. reinhardtii using a comparative genomics-based method. We predicted a large number of CREs in C. reinhardtii that are consistent with experimentally verified CREs. We also discovered that a large percentage of these CREs form combinations and have the potential to work together for coordinated gene regulation in C. reinhardtii. Multiple lines of evidence from literature, gene transcriptional profiles, and gene annotation resources support our prediction. The predicted CREs will serve, to our knowledge, as the first large-scale collection of CREs in C. reinhardtii to facilitate further experimental study of microalgae gene regulation. The accompanying software tool and the predictions in C. reinhardtii are also made available through a Web-accessible database (http://hulab.ucf.edu/research/projects/Microalgae/sdcre/motifcomb.html).
|
13
|
Sun H, Guns T, Fierro AC, Thorrez L, Nijssen S, Marchal K. Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection. Nucleic Acids Res 2012; 40:e90. [PMID: 22422841 PMCID: PMC3384348 DOI: 10.1093/nar/gks237] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Indexed: 01/16/2023] Open
Abstract
Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that score as high as the biologically true ones. Using ChIP information not only reduces the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also restricts the valid CRMs to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP information in a query-based way makes in silico CRM detection a much more feasible endeavor. To handle the large datasets, the query-based setting and other specificities of CRM detection on ChIP-Seq based data, we developed a novel, powerful CRM detection method, 'CPModule'. By applying it to a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover combinatorial regulation of five known TFs that are key in this self-renewal. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.
Affiliation(s)
- Hong Sun
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Leuven, Belgium
|
14
|
Girgis HZ, Ovcharenko I. Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs. BMC Bioinformatics 2012; 13:25. [PMID: 22313678 PMCID: PMC3359238 DOI: 10.1186/1471-2105-13-25] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 09/16/2011] [Accepted: 02/07/2012] [Indexed: 12/26/2022] Open
Abstract
Background: Researchers seeking to unlock the genetic basis of human physiology and diseases have been studying gene transcription regulation. The temporal and spatial patterns of gene expression are controlled mainly by non-coding elements known as cis-regulatory modules (CRMs) and by epigenetic factors. CRMs modulating related genes share a regulatory signature, which consists of transcription factor (TF) binding sites (TFBSs). Identifying such CRMs is a challenging problem due to the prohibitive number of sequence sets that need to be analyzed. Results: We formulated the challenge as a supervised classification problem even though experimentally validated CRMs were not required. Our efforts resulted in a software system named CrmMiner. The system mines for CRMs in the vicinity of related genes. CrmMiner requires two sets of sequences: a mixed set and a control set. Sequences in the vicinity of the related genes comprise the mixed set, whereas the control set includes random genomic sequences. CrmMiner assumes that a large percentage of the mixed set is made of background sequences that do not include CRMs. The system identifies pairs of closely located motifs representing vertebrate TFBSs that are enriched in the training mixed set consisting of 50% of the gene loci. In addition, CrmMiner selects a group of the enriched pairs to represent the tissue-specific regulatory signature. The mixed and the control sets are searched for candidate sequences that include any of the selected pairs. Next, an optimal Bayesian classifier is used to distinguish candidates found in the mixed set from their control counterparts. Our study proposes 62 tissue-specific regulatory signatures and putative CRMs for different human tissues and cell types. These signatures consist of assortments of ubiquitously expressed TFs and tissue-specific TFs. Under controlled settings, CrmMiner identified known CRMs in noisy sets up to a 1:25 signal-to-noise ratio. CrmMiner was 21-75% more precise than a related CRM predictor. The sensitivity of the system to locate known human heart enhancers reached up to 83%. CrmMiner precision reached 82% while mining for CRMs specific to human CD4+ T cells. On several data sets, the system achieved 99% specificity. Conclusion: These results suggest that CrmMiner predictions are accurate and likely to be tissue-specific CRMs. We expect that the predicted tissue-specific CRMs and the regulatory signatures will broaden our knowledge of gene transcription regulation.
Affiliation(s)
- Hani Z Girgis
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health 9600 Rockville Pike, Bethesda, MD 20896, USA
|
15
|
Huang Q, Gong C, Li J, Zhuo Z, Chen Y, Wang J, Hua ZC. Distance and helical phase dependence of synergistic transcription activation in cis-regulatory module. PLoS One 2012; 7:e31198. [PMID: 22299056 PMCID: PMC3267773 DOI: 10.1371/journal.pone.0031198] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 07/13/2011] [Accepted: 01/03/2012] [Indexed: 01/21/2023] Open
Abstract
The spatial and stereospecific constraints on synergistic transcription activation between activators bound to cis-regulatory elements are important for understanding gene regulation, yet remain largely unknown. It has been commonly believed that two activators activate transcription most effectively when they are bound on the same face of the DNA double helix and within a boundary distance from the transcription initiation complex attached to the TATA box. In this work, we studied the spatial and stereospecific constraints on activation by multiple copies of bound model activators using a series of engineered relative distances and stereospecific orientations. We observed that multiple copies of the activators GAL4-VP16 and ZEBRA bound to engineered promoters activated transcription more effectively when bound on opposite faces of the DNA double helix. This phenomenon was not affected by the spatial relationship between the proximal activator and the initiation complex. To explain these results, we proposed the novel concentration field model, which posits that the effective concentration of bound activators, and therefore the transcription activation potential, is affected by their stereospecific positioning. These results could be used to reexamine synergistic transcription activation and to aid the development of predictive models for the identification of cis-regulatory elements.
Affiliation(s)
- Qilai Huang
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- The State Key Laboratory of Quality Research in Chinese Medicine and Macau Institute for Applied Research in Medicine, Macau University of Science and Technology, Macau, People's Republic of China
- Changzhou High-Tech Research Institute of Nanjing University and Jiangsu TargetPharma Laboratories Inc., Changzhou, People's Republic of China
- Chenguang Gong
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- Jiahuang Li
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- Zhu Zhuo
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- Yuan Chen
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- Jin Wang
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- * E-mail: (JW); (ZH)
- Zi-Chun Hua
- The State Key Laboratory of Pharmaceutical Biotechnology and Affiliated Stomatological Hospital, Nanjing University, Nanjing, People's Republic of China
- The State Key Laboratory of Quality Research in Chinese Medicine and Macau Institute for Applied Research in Medicine, Macau University of Science and Technology, Macau, People's Republic of China
- Changzhou High-Tech Research Institute of Nanjing University and Jiangsu TargetPharma Laboratories Inc., Changzhou, People's Republic of China
- * E-mail: (JW); (ZH)
|
16
|
A generalized hidden Markov model for determining sequence-based predictors of nucleosome positioning. Stat Appl Genet Mol Biol 2012; 11:/j/sagmb.2012.11.issue-2/1544-6115.1707/1544-6115.1707.xml. [PMID: 22499697 DOI: 10.2202/1544-6115.1707] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
Chromatin structure, in terms of positioning of nucleosomes and nucleosome-free regions in the DNA, has been found to have an immense impact on various cell functions and processes, ranging from transcriptional regulation to growth and development. In spite of numerous experimental and computational approaches being developed in the past few years to determine the intrinsic relationship between chromatin structure (nucleosome positioning) and DNA sequence features, there is yet no universally accurate approach to predict nucleosome positioning from the underlying DNA sequence alone. We here propose an alternative approach to predicting nucleosome positioning from sequence, making use of characteristic sequence differences, and inherent dependencies in overlapping sequence features. Our nucleosomal positioning prediction algorithm, based on the idea of generalized hierarchical hidden Markov models (HGHMMs), was used to predict nucleosomal state based on the DNA sequence in yeast chromosome III, and compared with two other existing methods. The HGHMM method performed favorably among the three models in terms of specificity and sensitivity, and provided estimates that were largely consistent with predictions from the method of Yuan and Liu (2008). However, all the methods still give higher than desirable misclassification rates, indicating that sequence-based features may provide only limited information towards understanding positioning of nucleosomes. The method is implemented in the open-source statistical software R, and is freely available from the authors' website.
|
17
|
Ding J, Hu H, Li X. Thousands of cis-regulatory sequence combinations are shared by Arabidopsis and poplar. Plant Physiol 2012; 158:145-55. [PMID: 22058225 PMCID: PMC3252106 DOI: 10.1104/pp.111.186080] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 05/08/2023]
Abstract
The identification of cis-regulatory modules (CRMs) can greatly advance our understanding of gene regulatory mechanisms. Despite the existence of binding sites of more than three transcription factors (TFs) in a CRM, studies in plants often consider only the co-occurrence of binding sites of one or two TFs. In addition, CRM studies in plants are limited to combinations of only a few families of TFs. It is thus not clear how widely plant TFs work together, which TFs work together to regulate plant genes, and how the combinations of these TFs are shared by different plants. To fill these gaps, we applied a frequent pattern-mining-based approach to identify frequently used cis-regulatory sequence combinations in the promoter sequences of two plant species, Arabidopsis (Arabidopsis thaliana) and poplar (Populus trichocarpa). A cis-regulatory sequence here corresponds to a DNA motif bound by a TF. We identified 18,638 combinations composed of two to six cis-regulatory sequences that are shared by the two plant species. In addition, with known cis-regulatory sequence combinations, gene function annotation, gene expression data, and known functional gene sets, we showed that the functionality of at least 96.8% and 65.2% of these shared combinations in Arabidopsis is partially supported, under a false discovery rate of 0.1 and 0.05, respectively. Finally, we discovered that 796 of the 18,638 combinations might relate to functions that are important in bioenergy research. Our work will facilitate the study of gene transcriptional regulation in plants.
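The idea of mining frequently co-occurring motif combinations across promoters can be sketched as follows. This is a brute-force illustration, not the authors' algorithm: the function name, the `min_support` and `max_size` parameters and the toy promoter data are assumptions, and real frequent pattern miners prune the search space rather than enumerating every combination.

```python
from collections import Counter
from itertools import combinations


def frequent_combinations(promoter_motifs, min_support, max_size=3):
    """Count motif combinations (size 2..max_size) and keep the frequent ones.

    promoter_motifs maps a gene name to the motifs found in its promoter.
    A combination is kept if it occurs in at least min_support promoters.
    """
    counts = Counter()
    for motifs in promoter_motifs.values():
        present = sorted(set(motifs))  # ignore repeats and ordering
        for size in range(2, max_size + 1):
            for combo in combinations(present, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    # Hypothetical motif hits per promoter
    toy = {
        "geneA": ["M1", "M2", "M3"],
        "geneB": ["M1", "M2"],
        "geneC": ["M2", "M3", "M1"],
        "geneD": ["M4"],
    }
    print(frequent_combinations(toy, min_support=2))
```

Comparing the frequent combinations obtained separately for two species would then give the shared set, in the spirit of the Arabidopsis and poplar comparison described above.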
|
18
|
Rye M, Sætrom P, Håndstad T, Drabløs F. Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements. BMC Biol 2011; 9:80. [PMID: 22115494 PMCID: PMC3239327 DOI: 10.1186/1741-7007-9-80] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 11/11/2011] [Accepted: 11/24/2011] [Indexed: 12/16/2022] Open
Abstract
Background: Transcription factor binding to DNA requires both an appropriate binding element and suitably open chromatin, which together help to define regulatory elements within the genome. Current methods of identifying regulatory elements, such as promoters or enhancers, typically rely on sequence conservation, existing gene annotations or specific marks, such as histone modifications and p300 binding, each of which has its own biases. Results: Herein we show that an approach based on clustering of transcription factor peaks from high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) can be used to evaluate markers for regulatory elements. We used 67 data sets for 54 unique transcription factors distributed over two cell lines to create regulatory element clusters. By integrating the clusters from our approach with histone modifications and data for open chromatin, we identified general methylation of lysine 4 on histone H3 (H3K4me) as the most specific marker for transcription factor clusters. Clusters mapping to annotated genes showed distinct patterns in cluster composition related to gene expression and histone modifications. Clusters mapping to intergenic regions fall into two groups: those directly involved in transcription, including miRNAs and long noncoding RNAs, and those facilitating transcription by long-range interactions. The latter clusters were specifically enriched with H3K4me1, but less with acetylation of lysine 27 on histone H3 or p300 binding. Conclusion: By integrating genome-wide data on transcription factor binding and chromatin structure and using our data-driven approach, we pinpointed the chromatin marks that best explain transcription factor association with different regulatory elements. Our results also indicate that a modest selection of transcription factors may be sufficient to map most regulatory elements in the human genome.
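One simple way to picture clustering of transcription factor peaks is to merge peaks that lie close together on the same chromosome. The abstract does not state how clusters were actually built, so everything in this sketch is an assumption: the `max_gap` threshold, the tuple layout `(chrom, start, end, tf)` and the toy coordinates are hypothetical.

```python
def cluster_peaks(peaks, max_gap=100):
    """Merge peaks on the same chromosome separated by at most max_gap bases.

    peaks is an iterable of (chrom, start, end, tf) tuples.
    Returns a list of dicts with the cluster span and the set of TFs in it.
    """
    clusters = []
    for chrom, start, end, tf in sorted(peaks):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            # Extend the current cluster and record the additional factor
            last["end"] = max(last["end"], end)
            last["tfs"].add(tf)
        else:
            clusters.append(
                {"chrom": chrom, "start": start, "end": end, "tfs": {tf}}
            )
    return clusters


if __name__ == "__main__":
    toy_peaks = [
        ("chr1", 100, 200, "A"),
        ("chr1", 250, 300, "B"),
        ("chr1", 1000, 1100, "A"),
        ("chr2", 120, 180, "C"),
    ]
    for cluster in cluster_peaks(toy_peaks):
        print(cluster)
```

Each cluster's span could then be intersected with histone modification or open chromatin tracks to ask which marks best coincide with multi-factor clusters.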
Affiliation(s)
- Morten Rye
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.
|
19
|
Cheng C, Shou C, Yip KY, Gerstein MB. Genome-wide analysis of chromatin features identifies histone modification sensitive and insensitive yeast transcription factors. Genome Biol 2011; 12:R111. [PMID: 22060676 PMCID: PMC3334597 DOI: 10.1186/gb-2011-12-11-r111] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 05/24/2011] [Revised: 10/12/2011] [Accepted: 11/07/2011] [Indexed: 12/20/2022] Open
Abstract
We propose a method to predict yeast transcription factor targets by integrating histone modification profiles with transcription factor binding motif information. It shows improved predictive power compared to a binding motif-only method. We find that transcription factors cluster into histone-sensitive and -insensitive classes. The target genes of histone-sensitive transcription factors have stronger histone modification signals than those of histone-insensitive ones. The two classes also differ in tendency to interact with histone modifiers, degree of connectivity in protein-protein interaction networks, position in the transcriptional regulation hierarchy, and in a number of additional features, indicating possible differences in their transcriptional regulation mechanisms.
Affiliation(s)
- Chao Cheng
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
|
20
|
Ab initio identification of novel regulatory elements in the genome of Trypanosoma brucei by Bayesian inference on sequence segmentation. PLoS One 2011; 6:e25666. [PMID: 21991330 PMCID: PMC3185004 DOI: 10.1371/journal.pone.0025666] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 02/11/2011] [Accepted: 09/08/2011] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND: The rapid increase in the availability of genome information has created considerable demand for both comparative and ab initio predictive bioinformatic analyses. The biology laid bare in the genomes of many organisms is often novel, presenting new challenges for bioinformatic interrogation. A paradigm for this is the collected genomes of the kinetoplastid parasites, a group which includes Trypanosoma brucei, the causative agent of human African trypanosomiasis. These genomes, though outwardly simple in organisation and gene content, have historically challenged many theories for gene expression regulation in eukaryotes. METHODOLOGY/PRINCIPAL FINDINGS: Here we utilise a Bayesian approach to identify local changes in nucleotide composition in the genome of T. brucei. We show that there are several elements which are found at the starts and ends of multicopy gene arrays and that there are compositional elements that are common to all intergenic regions. We also show that there is a composition-inversion element that occurs at the position of the trans-splice site. CONCLUSIONS/SIGNIFICANCE: The nature of the elements discovered reinforces the hypothesis that context-dependent RNA secondary structure has an important influence on gene expression regulation in Trypanosoma brucei.
|
21
|
Wang Y, Li X, Hu H. Transcriptional regulation of co-expressed microRNA target genes. Genomics 2011; 98:445-52. [PMID: 22002038 DOI: 10.1016/j.ygeno.2011.09.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 05/12/2011] [Revised: 08/12/2011] [Accepted: 09/24/2011] [Indexed: 01/26/2023]
Abstract
MicroRNAs play pivotal roles in gene regulation. Despite various research efforts on microRNAs, how microRNA target genes are transcriptionally regulated and how the transcriptional regulation of microRNA target genes relates to that of the microRNA genes are not well studied. By investigating the transcriptional regulation of microRNA target genes, we found that different groups of target genes of the same microRNA are co-expressed under different conditions, and these groups rarely overlap with each other for the majority of microRNAs. We also discovered that co-expressed microRNA target genes are often co-regulated, and different groups of target genes of the same microRNA are often regulated differently. In addition, we observed that transcription factors regulating a microRNA gene often regulate its target genes. Our study sheds light on the regulation of microRNA target genes, which will facilitate the prediction of microRNA target genes and the understanding of the transcriptional regulation of microRNA genes.
Affiliation(s)
- Ying Wang
- Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA
|
22
|
Xu M, Weinberg CR, Umbach DM, Li L. coMOTIF: a mixture framework for identifying transcription factor and a coregulator motif in ChIP-seq data. Bioinformatics 2011; 27:2625-32. [PMID: 21775309 DOI: 10.1093/bioinformatics/btr397] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 02/07/2023]
Abstract
MOTIVATION: ChIP-seq data are enriched in binding sites for the immunoprecipitated protein. Some sequences may also contain binding sites for a coregulator. Biologists are interested in knowing which coregulatory factor motifs may be present in the sequences bound by the ChIP'ed protein. RESULTS: We present a finite mixture framework with an expectation-maximization algorithm that considers two motifs jointly and simultaneously determines which sequences contain both motifs, only one, or neither of them. Tested on 10 simulated ChIP-seq datasets, our method performed better than repeated application of MEME in predicting sequences containing both motifs. When applied to a mouse liver Foxa2 ChIP-seq dataset involving ~12,000 400-bp sequences, coMOTIF identified co-occurrence of Foxa2 with Hnf4a, Cebpa, E-box, Ap1/Maf or Sp1 motifs in ~6-33% of these sequences. These motifs either belong to known liver-specific transcription factors or have an important role in liver function. AVAILABILITY: Freely available at http://www.niehs.nih.gov/research/resources/software/comotif/. CONTACT: li3@niehs.nih.gov SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengyuan Xu
- Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
|
23
|
Dojer N, Biecek P, Tiuryn J. Bi-billboard: symmetrization and careful choice of informant species results in higher accuracy of regulatory element prediction. J Comput Biol 2011; 18:809-19. [PMID: 21563976 DOI: 10.1089/cmb.2010.0299] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
The identification of cis-regulatory modules (CRMs) is one of the most important problems in understanding transcriptional regulation in higher eukaryotes. Computational methods for CRM detection are gaining importance due to the availability of genomic data on the one hand, and the costs and difficulties of experimental methods on the other. One of the proposed approaches, called Billboard, predicts CRMs based on the location of transcription factor binding sites in an analyzed sequence and in a related sequence from a so-called informant species. In the present article, we show how to combine information obtained in two symmetric runs of the Billboard tool (on the sequence of interest and on the related one). In a series of experiments on data from various organisms, we show that the predictive power of our symmetric approach is significantly higher than that of the one-way approach of Billboard. Moreover, we show that the evolutionary distance between organisms considerably influences the quality of prediction, and we provide guidelines on the choice of an informant species.
Affiliation(s)
- Norbert Dojer
- Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.
|
24
|
Bickel PJ, Boley N, Brown JB, Huang H, Zhang NR. Subsampling methods for genomic inference. Ann Appl Stat 2010. [DOI: 10.1214/10-aoas363] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Indexed: 11/19/2022]
|
25
|
Cai X, Hou L, Su N, Hu H, Deng M, Li X. Systematic identification of conserved motif modules in the human genome. BMC Genomics 2010; 11:567. [PMID: 20946653 PMCID: PMC3091716 DOI: 10.1186/1471-2164-11-567] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 06/07/2010] [Accepted: 10/14/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The identification of motif modules, groups of multiple motifs frequently occurring in DNA sequences, is one of the most important tasks necessary for annotating the human genome. Current approaches to identifying motif modules are often restricted to searches within promoter regions or rely on multiple genome alignments. However, the promoter regions only account for a limited number of locations where transcription factor binding sites can occur, and multiple genome alignments often cannot align binding sites with their true counterparts because of the short and degenerate nature of these transcription factor binding sites. RESULTS: To identify motif modules systematically, we developed a computational method for the entire non-coding regions around human genes that does not rely upon the use of multiple genome alignments. First, we selected orthologous DNA blocks approximately 1 kilobase in length based on discontiguous sequence similarity. Next, we scanned the conserved segments in these blocks using known motifs in the TRANSFAC database. Finally, a frequent pattern mining technique was applied to identify motif modules within these blocks. In total, with a false discovery rate cutoff of 0.05, we predicted 3,161,839 motif modules, 90.8% of which are supported by various forms of functional evidence. Compared with experimental data from 14 ChIP-seq experiments, on average, our methods predicted 69.6% of the ChIP-seq peaks with TFBSs of multiple TFs. Our findings also show that many motif modules have distance preference and order preference among the motifs, which further supports the functionality of these predictions. CONCLUSIONS: Our work provides a large-scale prediction of motif modules in mammals, which will facilitate the understanding of gene regulation in a systematic way.
Affiliation(s)
- Xiaohui Cai
- Center for Research in Biological Systems, University of California, San Diego, La Jolla, CA 92093, USA
|
26
|
Picot E, Krusche P, Tiskin A, Carré I, Ott S. Evolutionary analysis of regulatory sequences (EARS) in plants. Plant J 2010; 64:165-176. [PMID: 20659275 DOI: 10.1111/j.1365-313x.2010.04314.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 05/29/2023]
Abstract
Identification of regulatory sequences within non-coding regions of DNA is an essential step towards elucidation of gene networks. This approach constitutes a major challenge, however, as only a very small fraction of non-coding DNA is thought to contribute to gene regulation. The mapping of regulatory regions traditionally involves the laborious construction of promoter deletion series, which are then fused to reporter genes and assayed in transgenic organisms. Bioinformatic methods can be used to scan sequences for matches to known regulatory motifs; however, these methods are currently hampered by the relatively small number of such motifs and by a high false-discovery rate. Here, we demonstrate a robust and highly sensitive in silico method to identify evolutionarily conserved regions within non-coding DNA. Sequence conservation within these regions is taken as evidence for evolutionary pressure against mutations, which is suggestive of functional importance. We test this method on a small set of well characterised promoters, and show that it successfully identifies known regulatory regions. We further show that these evolutionarily conserved sequences contain clusters of transcription factor binding sites, often described as regulatory modules. A version of the tool optimised for the analysis of plant promoters is available online at http://wsbc.warwick.ac.uk/ears/main.php.
Affiliation(s)
- Emma Picot
- Systems Biology Doctoral Training Centre, University of Warwick, Coventry CV4 7AL, UK
|
27
|
Babu MM. Early Career Research Award Lecture. Structure, evolution and dynamics of transcriptional regulatory networks. Biochem Soc Trans 2010; 38:1155-78. [PMID: 20863280 DOI: 10.1042/bst0381155] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Indexed: 01/09/2023]
Abstract
The availability of entire genome sequences and the wealth of literature on gene regulation have enabled researchers to model an organism's transcriptional regulation system in the form of a network. In such a network, TFs (transcription factors) and TGs (target genes) are represented as nodes and regulatory interactions between TFs and TGs are represented as directed links. In the present review, I address the following topics pertaining to transcriptional regulatory networks. (i) Structure and organization: first, I introduce the concept of networks and discuss our understanding of the structure and organization of transcriptional networks. (ii) Evolution: I then describe the different mechanisms and forces that influence network evolution and shape network structure. (iii) Dynamics: I discuss studies that have integrated information on dynamics such as mRNA abundance or half-life, with data on transcriptional network in order to elucidate general principles of regulatory network dynamics. In particular, I discuss how cell-to-cell variability in the expression level of TFs could permit differential utilization of the same underlying network by distinct members of a genetically identical cell population. Finally, I conclude by discussing open questions for future research and highlighting the implications for evolution, development, disease and applications such as genetic engineering.
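The network representation described in this abstract, with TFs and target genes as nodes and regulatory interactions as directed links, can be sketched with plain dictionaries. The function name and the toy interactions are hypothetical and serve only to illustrate the data structure.

```python
from collections import defaultdict


def build_network(interactions):
    """Build a directed TF -> target network from (tf, target) pairs.

    Returns two mappings: targets[tf] is the set of genes a TF regulates
    (outgoing links), regulators[gene] is the set of TFs regulating a gene
    (incoming links).
    """
    targets = defaultdict(set)
    regulators = defaultdict(set)
    for tf, tg in interactions:
        targets[tf].add(tg)
        regulators[tg].add(tf)
    return targets, regulators


if __name__ == "__main__":
    # Hypothetical interactions; a TF can itself be a target of another TF
    toy_edges = [
        ("TF1", "geneA"),
        ("TF1", "geneB"),
        ("TF2", "geneB"),
        ("TF1", "TF2"),
    ]
    targets, regulators = build_network(toy_edges)
    for tf in sorted(targets):
        print(tf, "regulates", sorted(targets[tf]))
    for gene in sorted(regulators):
        print(gene, "is regulated by", sorted(regulators[gene]))
```

The sizes of these sets give each node's out-degree and in-degree, which are the usual starting point for describing network structure.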
Affiliation(s)
- M Madan Babu
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, UK.
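The network representation described in this abstract (TFs and TGs as nodes, regulatory interactions as directed links) can be sketched as a pair of adjacency maps. This is only a minimal illustration: the edge list, the names `TF_A`, `TF_B`, `gene1` to `gene3`, and the helper functions are hypothetical placeholders, not data or code from the review.

```python
from collections import defaultdict

# Hypothetical regulatory interactions, written as (TF, target) pairs.
# A TF can itself be a target (TF_A -> TF_B), so the graph is directed.
EDGES = [
    ("TF_A", "gene1"),
    ("TF_A", "gene2"),
    ("TF_A", "TF_B"),
    ("TF_B", "gene2"),
    ("TF_B", "gene3"),
]


def build_network(edges):
    """Return (targets, regulators): outgoing and incoming adjacency maps."""
    targets = defaultdict(set)     # TF -> set of genes it regulates
    regulators = defaultdict(set)  # gene -> set of TFs regulating it
    for tf, target in edges:
        targets[tf].add(target)
        regulators[target].add(tf)
    return targets, regulators


def out_degree(targets, tf):
    """Number of genes directly regulated by a TF."""
    return len(targets.get(tf, ()))


def in_degree(regulators, gene):
    """Number of TFs directly regulating a gene."""
    return len(regulators.get(gene, ()))


if __name__ == "__main__":
    targets, regulators = build_network(EDGES)
    for tf in sorted(targets):
        print(tf, "regulates", sorted(targets[tf]))
```

Keeping both maps makes out-degree (how many targets a TF has) and in-degree (how many TFs converge on a gene) cheap to query, which are the basic quantities used when describing network structure.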
|
28
|
Hödar C, Assar R, Colombres M, Aravena A, Pavez L, González M, Martínez S, Inestrosa NC, Maass A. Genome-wide identification of new Wnt/beta-catenin target genes in the human genome using CART method. BMC Genomics 2010; 11:348. [PMID: 20515496 PMCID: PMC2996972 DOI: 10.1186/1471-2164-11-348] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Received: 09/10/2009] [Accepted: 06/01/2010] [Indexed: 11/21/2022]
Abstract
Background: The importance of in silico predictions for understanding cellular processes is now widely accepted, and a variety of algorithms useful for studying different biological features have been designed. In particular, the prediction of cis regulatory modules in non-coding human genome regions represents a major challenge for understanding gene regulation in several diseases. Recently, studies of the Wnt signaling pathway revealed a connection with neurodegenerative diseases such as Alzheimer's. In this article, we construct a classification tool that uses the transcription factor binding site motif composition of some gene promoters to identify new Wnt/β-catenin pathway target genes potentially involved in brain diseases. Results: In this study, we propose 89 new Wnt/β-catenin pathway target genes predicted in silico by using a method based on multiple Classification and Regression Tree (CART) analysis. We used as decision variables the presence of transcription factor binding site motifs in the upstream region of each gene. This prediction was validated by RT-qPCR in a sample of 9 genes. As expected, LEF1, a member of the T-cell factor/lymphoid enhancer-binding factor family (TCF/LEF1), was relevant for the classification algorithm and, remarkably, other factors related directly or indirectly to the inflammatory response and amyloidogenic processes also appeared to be relevant for the classification. Among the 89 new Wnt/β-catenin pathway targets, we found a group expressed in brain tissue that could be involved in diverse responses to neurodegenerative diseases, like Alzheimer's disease (AD). These genes represent new candidates to protect cells against amyloid β toxicity, in agreement with the proposed neuroprotective role of the Wnt signaling pathway. Conclusions: Our multiple CART strategy proved to be an effective tool to identify new Wnt/β-catenin pathway targets based on the study of their regulatory regions in the human genome. In particular, several of these genes represent a new group of transcriptional dependent targets of the canonical Wnt pathway. The functions of these genes indicate that they are involved in pathophysiology related to Alzheimer's disease or other brain disorders.
Affiliation(s)
- Christian Hödar
- Laboratorio de Bioinformática y Expresión Génica, INTA, Universidad de Chile, Santiago, Chile.
|
29
|
Wang M, Yang F, Zhang X, Zhao H, Wang Q, Pan Y. Comparative analysis of MTF-1 binding sites between human and mouse. Mamm Genome 2010; 21:287-98. [PMID: 20383712 DOI: 10.1007/s00335-010-9257-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/19/2009] [Accepted: 03/26/2010] [Indexed: 01/19/2023]
Abstract
MTF-1 is a crucial transcription factor involved in the cellular response to heavy-metal load and other stresses by specifically binding to metal response elements (MREs). Thus far only a handful of direct target genes are known for this transcription factor, limiting our understanding of the biological network it governs. In this article we employ a computational strategy based on the generation of literature-based positional weight matrices (PWM) and log-likelihood scoring of the candidate binding sites (BSs) to identify direct targets of the transcription factor MTF-1 in human and mouse. Through comparisons, we explore the conserved and unique characteristics of the two species. Our results show that the numbers of MREs differ dramatically between species and that their positions relative to their cognate promoters are also flexible. Importantly, we identify a set of target genes generally well conserved between human and mouse. Finally, by incorporating expression analysis, we provide two putative targets (HMGCR and CYP51A), which regulate lipid metabolism conserved in human and mouse. Overall, interspecies comparison from our study may provide some valuable information for further studying human Wilson disease (WD) using mouse model systems.
Affiliation(s)
- Minghui Wang
- Department of Animal Sciences, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai, People's Republic of China
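As a rough illustration of the PWM and log-likelihood scoring strategy mentioned in this abstract, the sketch below builds a log-odds matrix from a handful of aligned sites and scans a sequence with it. The aligned sites, the uniform background, the pseudocount, and the threshold are all made-up assumptions for demonstration; they are not the authors' MRE matrices or parameters.

```python
import math

# Hypothetical aligned binding sites (illustrative only, not real MRE data).
SITES = ["TGCACAC", "TGCGCAC", "TGCACGC", "TGCGCGC"]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PSEUDOCOUNT = 1.0


def build_pwm(sites, background, pseudocount):
    """Return one dict per position holding log2(p_base / background_base)."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in "ACGT"}
        )
    return pwm


def score_window(pwm, window):
    """Log-likelihood ratio of a window: the sum of per-position log-odds."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_window(pwm, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = build_pwm(SITES, BACKGROUND, PSEUDOCOUNT)
    for start, window, score in scan(pwm, "AAATGCACACAAA", threshold=5.0):
        print(start, window, round(score, 2))
```

The pseudocount keeps unseen bases from producing a log of zero; in practice the threshold would be calibrated against known sites and background sequence rather than fixed by hand.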
|
30
|
Won KJ, Ren B, Wang W. Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol 2010; 11:R7. [PMID: 20096096 PMCID: PMC2847719 DOI: 10.1186/gb-2010-11-1-r7] [Citation(s) in RCA: 82] [Impact Index Per Article: 5.5] [Received: 06/18/2009] [Revised: 10/30/2009] [Accepted: 01/22/2010] [Indexed: 12/19/2022]
Abstract
A new approach for genome-wide transcription factor binding site prediction is presented that integrates sequence and chromatin modification data. We present an integrated method called Chromia for the genome-wide identification of functional target loci of transcription factors. Designed to capture the characteristic patterns of transcription factor binding motif occurrences and the histone profiles associated with regulatory elements such as promoters and enhancers, Chromia significantly outperforms other methods in the identification of 13 transcription factor binding sites in mouse embryonic stem cells, evaluated by both binding (ChIP-seq) and functional (RNA interference knockdown) experiments.
Affiliation(s)
- Kyoung-Jae Won
- University of California, San Diego, Department of Chemistry and Biochemistry, 9500 Gilman Drive, La Jolla, CA 92093, USA.
|
31
|
Abstract
The challenge of identifying cis-regulatory modules (CRMs) is an important milestone for the ultimate goal of understanding transcriptional regulation in eukaryotic cells. It has been approached, among others, by motif-finding algorithms that identify overrepresented motifs in regulatory sequences. These methods succeed in finding single, well-conserved motifs, but fail to identify combinations of degenerate binding sites, like the ones often found in CRMs. We have developed a method that combines the abilities of existing motif finding with the discriminative power of a machine learning technique to model the regulation of genes (Schultheiss et al. (2009) Bioinformatics 25, 2126-2133). Our software is called KIRMES, which stands for kernel-based identification of regulatory modules in eukaryotic sequences. Starting from a set of genes thought to be co-regulated, KIRMES can identify the key CRMs responsible for this behavior and can be used to determine for any other gene not included on that list if it is also regulated by the same mechanism. Such gene sets can be derived from microarrays, chromatin immunoprecipitation experiments combined with next-generation sequencing or promoter/whole genome microarrays. The use of an established machine learning method makes the approach fast to use and robust with respect to noise. By providing easily understood visualizations for the results returned, they become interpretable and serve as a starting point for further analysis. Even for complex regulatory relationships, KIRMES can be a helpful tool in directing the design of biological experiments.
|
32
|
Breen G. Practical informatics approaches to microsatellite and variable number tandem repeat analysis. Methods Mol Biol 2010; 628:181-94. [PMID: 20238082 DOI: 10.1007/978-1-60327-367-1_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/31/2025]
Abstract
The second most common source of genetic variation after SNPs is polymorphic tandem repeats, the alleles of which consist of a variable number of repeated units that can be either small (e.g., CA) or large (to >100 nucleotides in length). There are perhaps over half a million of these in the human genome. They have been implicated as functional promoter polymorphisms acting as common genetic risk factors for complex disorders (in diabetes and depression), as pathogenic mutations (Spinocerebellar Ataxias, Huntington's Disease) and in association mapping, linkage and forensics, but while they enjoyed much success and use in early genetic linkage and association studies, they have recently been neglected. While SNPs are markers of great utility in genetic studies, different alleles of a polymorphic tandem repeat represent a very large physical and chemical change to a stretch of DNA sequence. They can act variously as: (a) functional elements binding transcription factors and other proteins that inhibit or promote expression; (b) motif elements affecting the efficiency of mRNA splicing; and (c) elements having physical effects, such as varying the spacing between functional motifs or in altering the structure and melting properties of DNA in their proximity. For these reasons, they are very good a priori functional candidates. Geneticists wishing to work with these polymorphisms need to know how to find them in sequence, use their annotation in genome browsers and online databases, use specialist bioinformatics web-tools for their analysis, and how to go about analyzing them in the lab and for genetic association.
Affiliation(s)
- Gerome Breen
- Division of Psychological Medicine and Social, Genetic and Developmental Psychiatry Centre, Institute of Psychiatry, King's College London, London, UK
|
33
|
Schultheiss SJ, Busch W, Lohmann J, Kohlbacher O, Rätsch G. KIRMES: kernel-based identification of regulatory modules in euchromatic sequences. BMC Bioinformatics 2009; 10 Suppl 13:I1, O1-7, P1-7. [PMID: 19856525 PMCID: PMC2764125 DOI: 10.1186/1471-2105-10-s13-o1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
|
34
|
Benita Y, Kikuchi H, Smith AD, Zhang MQ, Chung DC, Xavier RJ. An integrative genomics approach identifies Hypoxia Inducible Factor-1 (HIF-1)-target genes that form the core response to hypoxia. Nucleic Acids Res 2009; 37:4587-602. [PMID: 19491311 PMCID: PMC2724271 DOI: 10.1093/nar/gkp425] [Citation(s) in RCA: 372] [Impact Index Per Article: 23.3] [Received: 04/20/2009] [Revised: 05/06/2009] [Accepted: 05/08/2009] [Indexed: 02/06/2023]
Abstract
The transcription factor Hypoxia-inducible factor 1 (HIF-1) plays a central role in the transcriptional response to oxygen flux. To gain insight into the molecular pathways regulated by HIF-1, it is essential to identify the downstream-target genes. We report here a strategy to identify HIF-1-target genes based on an integrative genomic approach combining computational strategies and experimental validation. To identify HIF-1-target genes, microarray data sets were used to rank genes based on their differential response to hypoxia. The proximal promoters of these genes were then analyzed for the presence of conserved HIF-1-binding sites. Genes were scored and ranked based on their response to hypoxia and their HIF-binding site score. Using this strategy we recovered 41% of the previously confirmed HIF-1-target genes that responded to hypoxia in the microarrays and provide a catalogue of predicted HIF-1 targets. We present experimental validation for ANKRD37 as a novel HIF-1-target gene. Together these analyses demonstrate the potential to recover novel HIF-1-target genes and the discovery of mammalian-regulatory elements operative in the context of microarray data sets.
Affiliation(s)
- Yair Benita
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Hirotoshi Kikuchi
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Andrew D. Smith
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Michael Q. Zhang
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Daniel C. Chung
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Ramnik J. Xavier
- Center for Computational and Integrative Biology, Gastrointestinal Unit, Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114 and Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
|
35
|
Drawid A, Gupta N, Nagaraj VH, Gélinas C, Sengupta AM. OHMM: a Hidden Markov Model accurately predicting the occupancy of a transcription factor with a self-overlapping binding motif. BMC Bioinformatics 2009; 10:208. [PMID: 19583839 PMCID: PMC2718928 DOI: 10.1186/1471-2105-10-208] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/20/2008] [Accepted: 07/07/2009] [Indexed: 12/29/2022]
Abstract
Background DNA sequence binding motifs for several important transcription factors happen to be self-overlapping. Many of the current regulatory site identification methods do not explicitly take into account the overlapping sites. Moreover, most methods use arbitrary thresholds and fail to provide a biophysical interpretation of statistical quantities. In addition, commonly used approaches do not include the location of a site with respect to the transcription start site (TSS) in an integrated probabilistic framework while identifying sites. Ignoring these features can lead to inaccurate predictions as well as incorrect design and interpretation of experimental results. Results We have developed a tool based on a Hidden Markov Model (HMM) that identifies binding location of transcription factors with preference for self-overlapping DNA motifs by combining the effects of their alternative binding modes. Interpreting HMM parameters as biophysical quantities, this method uses the occupancy probability of a transcription factor on a DNA sequence as the discriminant function, earning the algorithm the name OHMM: Occupancy via Hidden Markov Model. OHMM learns the classification threshold by training emission probabilities using unaligned sequences containing known sites and estimating transition probabilities to reflect site density in all promoters in a genome. While identifying sites, it adjusts parameters to model site density changing with the distance from the transcription start site. Moreover, it provides guidance for designing padding sequences in gel shift experiments. In the context of binding sites to transcription factor NF-κB, we find that the occupancy probability predicted by OHMM correlates well with the binding affinity in gel shift experiments. High evolutionary conservation scores and enrichment in experimentally verified regulated genes suggest that NF-κB binding sites predicted by our method are likely to be functional. 
Conclusion Our method deals specifically with identifying locations with multiple overlapping binding sites by computing the local occupancy of the transcription factor. Moreover, considering OHMM as a biophysical model allows us to learn the classification threshold in a principled manner. Another feature of OHMM is that we allow transition probabilities to change with location relative to the TSS. OHMM could be used to predict physical occupancy, and provides guidance for proper design of gel-shift experiments. Based upon our predictions, new insights into NF-κB function and regulation and possible new biological roles of NF-κB were uncovered.
Affiliation(s)
- Amar Drawid
- BioMAPS Institute for Quantitative Biology, Rutgers University, Piscataway, NJ, USA.
|
36
|
Zamdborg L, Ma P. Discovery of protein-DNA interactions by penalized multivariate regression. Nucleic Acids Res 2009; 37:5246-54. [PMID: 19578060 PMCID: PMC2760818 DOI: 10.1093/nar/gkp554] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
Discovering which regulatory proteins, especially transcription factors (TFs), are active under certain experimental conditions and identifying the corresponding binding motifs is essential for understanding the regulatory circuits that control cellular programs. The experimental methods used for this purpose are laborious. Computational methods have been proven extremely effective in identifying TF-binding motifs (TFBMs). In this article, we propose a novel computational method called MotifExpress for discovering active TFBMs. Unlike existing methods, which either use only DNA sequence information or integrate sequence information with a single-sample measurement of gene expression, MotifExpress integrates DNA sequence information with gene expression measured in multiple samples. By selecting TFBMs that are significantly associated with gene expression, we can identify active TFBMs under specific experimental conditions and thus provide clues for the construction of regulatory networks. Compared with existing methods, MotifExpress substantially reduces the number of spurious results. Statistically, MotifExpress uses a penalized multivariate regression approach with a composite absolute penalty, which is highly stable and can effectively find the globally optimal set of active motifs. We demonstrate the excellent performance of MotifExpress by applying it to synthetic data and real examples of Saccharomyces cerevisiae. MotifExpress is available at http://www.stat.illinois.edu/~pingma/MotifExpress.htm.
Affiliation(s)
- Leonid Zamdborg
- Department of Statistics, University of Illinois at Urbana-Champaign, Center for Biophysics and Computational Biology, Institute for Genomic Biology, IL, USA
|
37
|
Janky R, Helden JV, Babu MM. Investigating transcriptional regulation: from analysis of complex networks to discovery of cis-regulatory elements. Methods 2009; 48:277-86. [PMID: 19450688 DOI: 10.1016/j.ymeth.2009.04.022] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 02/06/2009] [Revised: 04/17/2009] [Accepted: 04/18/2009] [Indexed: 10/20/2022]
Abstract
Regulation of gene expression at the transcriptional level is a fundamental mechanism that is well conserved in all cellular systems. Due to advances in large-scale experimental analyses, we now have a wealth of information on gene regulation such as mRNA expression level across multiple conditions, genome-wide location data of transcription factors and data on transcription factor binding sites. This knowledge can be used to reconstruct transcriptional regulatory networks. Such networks are usually represented as directed graphs where regulatory interactions are depicted as directed edges from the transcription factor nodes to the target gene nodes. This abstract representation allows us to apply graph theory to study transcriptional regulation at global and local levels, to predict regulatory motifs and regulatory modules such as regulons and to compare the regulatory network of different genomes. Here we review some of the available computational methodologies for studying transcriptional regulatory networks as well as their interpretation.
Affiliation(s)
- Rekin's Janky
- Structural Studies Division, Medical Research Council - Laboratory of Molecular Biology, Hills Road, Cambridge CB2 0QH, United Kingdom.
|
38
|
Liu R, Hannenhalli S, Bucan M. Motifs and cis-regulatory modules mediating the expression of genes co-expressed in presynaptic neurons. Genome Biol 2009; 10:R72. [PMID: 19570198 PMCID: PMC2728526 DOI: 10.1186/gb-2009-10-7-r72] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 05/08/2009] [Revised: 06/11/2009] [Accepted: 07/01/2009] [Indexed: 12/19/2022]
Abstract
An integrative strategy of comparative genomics, experimental and computational approaches reveals aspects of a regulatory network controlling neuronal-specific expression in presynaptic neurons. Background Hundreds of proteins modulate neurotransmitter release and synaptic plasticity during neuronal development and in response to synaptic activity. The expression of genes in the pre- and post-synaptic neurons is under stringent spatio-temporal control, but the mechanism underlying the neuronal expression of these genes remains largely unknown. Results Using unbiased in vivo and in vitro screens, we characterized the cis elements regulating the Rab3A gene, which is expressed abundantly in presynaptic neurons. A set of identified regulatory elements of the Rab3A gene corresponded to the defined Rab3A multi-species conserved elements. In order to identify clusters of enriched transcription factor binding sites, for example, cis-regulatory modules, we analyzed intergenic multi-species conserved elements in the vicinity of nine presynaptic genes, including Rab3A, that are highly and specifically expressed in brain regions. Sixteen transcription factor binding motifs were over-represented in these multi-species conserved elements. Based on a combined occurrence for these enriched motifs, multi-species conserved elements in the vicinity of 107 previously identified presynaptic genes were scored and ranked. We then experimentally validated the scoring strategy by showing that 12 of 16 (75%) high-scoring multi-species conserved elements functioned as neuronal enhancers in a cell-based assay. Conclusions This work introduces an integrative strategy of comparative genomics, experimental, and computational approaches to reveal aspects of a regulatory network controlling neuronal-specific expression of genes in presynaptic neurons.
Affiliation(s)
- Rui Liu
- Department of Genetics and Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104, USA
|
39
|
Van Loo P, Marynen P. Computational methods for the detection of cis-regulatory modules. Brief Bioinform 2009; 10:509-24. [DOI: 10.1093/bib/bbp025] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Indexed: 01/06/2023]
|
40
|
Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. Brief Funct Genomic Proteomic 2009; 8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Indexed: 12/30/2022]
Abstract
Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.
Affiliation(s)
- Leelavati Narlikar
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
41
|
Danko CG, Pertsov AM. Identification of gene co-regulatory modules and associated cis-elements involved in degenerative heart disease. BMC Med Genomics 2009; 2:31. [PMID: 19476647 PMCID: PMC2700136 DOI: 10.1186/1755-8794-2-31] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 11/10/2008] [Accepted: 05/28/2009] [Indexed: 11/17/2022]
Abstract
BACKGROUND Cardiomyopathies, degenerative diseases of cardiac muscle, are among the leading causes of death in the developed world. Microarray studies of cardiomyopathies have identified up to several hundred genes that significantly alter their expression patterns as the disease progresses. However, the regulatory mechanisms driving these changes, in particular the networks of transcription factors involved, remain poorly understood. Our goals are (A) to identify modules of co-regulated genes that undergo similar changes in expression in various types of cardiomyopathies, and (B) to reveal the specific pattern of transcription factor binding sites, cis-elements, in the proximal promoter region of genes comprising such modules. METHODS We analyzed 149 microarray samples from human hypertrophic and dilated cardiomyopathies of various etiologies. Hierarchical clustering and Gene Ontology annotations were applied to identify modules enriched in genes with highly correlated expression and a similar physiological function. To discover motifs that may underlie changes in expression, we used the promoter regions for genes in three of the most interesting modules as input to motif discovery algorithms. The resulting motifs were used to construct a probabilistic model predictive of changes in expression across different cardiomyopathies. RESULTS We found that three modules with the highest degree of functional enrichment contain genes involved in myocardial contraction (n = 9), energy generation (n = 20), or protein translation (n = 20). Motif discovery tools revealed that genes in the contractile module contain a TATA-box followed by a CACC-box and are depleted in other GC-rich motifs, whereas genes in the translation module contain a pyrimidine-rich initiator, Elk-1, SP-1, and a novel motif with a GCGC core. A naïve Bayes classifier revealed that patterns of motifs are statistically predictive of expression patterns, with odds ratios of 2.7 (contractile), 1.9 (energy generation), and 5.5 (protein translation). CONCLUSION We identified patterns comprised of putative cis-regulatory motifs enriched in the upstream promoter sequence of genes that undergo similar changes in expression secondary to cardiomyopathies of various etiologies. Our analysis is a first step towards understanding transcription factor networks that are active in regulating gene expression during degenerative heart disease.
Affiliation(s)
- Charles G Danko
- Department of Pharmacology, SUNY Upstate Medical University, Syracuse, NY, USA
- Arkady M Pertsov
- Department of Pharmacology, SUNY Upstate Medical University, Syracuse, NY, USA
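The idea of predicting module membership from motif presence with a naïve Bayes classifier, as this abstract describes, can be sketched with a small Bernoulli model. Everything below is an illustrative assumption: the three binary features borrow motif names from the abstract, but the training rows, labels, and smoothing value are toy inputs, and the code is not the authors' model.

```python
import math

# Feature order (presence = 1, absence = 0): TATA-box, CACC-box, GCGC-core motif.
# Toy promoters and module labels, invented purely for demonstration.
ROWS = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0],   # labelled "contractile"
    [0, 0, 1], [0, 0, 1], [0, 1, 1],   # labelled "translation"
]
LABELS = ["contractile"] * 3 + ["translation"] * 3


def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed per-feature presence probabilities."""
    model = {}
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        log_prior = math.log(len(members) / len(rows))
        # Laplace smoothing keeps probabilities strictly between 0 and 1.
        probs = [
            (sum(row[j] for row in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(len(rows[0]))
        ]
        model[cls] = (log_prior, probs)
    return model


def log_posterior(model, row):
    """Unnormalised log posterior of each class for one promoter."""
    scores = {}
    for cls, (log_prior, probs) in model.items():
        scores[cls] = log_prior + sum(
            math.log(p if present else 1.0 - p) for p, present in zip(probs, row)
        )
    return scores


def predict(model, row):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, row)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train_naive_bayes(ROWS, LABELS)
    print(predict(model, [1, 1, 0]))
```

The classifier treats motifs as independent given the module, which is the defining simplification of naïve Bayes; a real analysis would evaluate it on held-out genes before reading anything into the predictions.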
|
42
|
An integrated approach to identifying cis-regulatory modules in the human genome. PLoS One 2009; 4:e5501. [PMID: 19434238 PMCID: PMC2677454 DOI: 10.1371/journal.pone.0005501] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 10/02/2008] [Accepted: 04/21/2009] [Indexed: 11/21/2022]
Abstract
In eukaryotic genomes, it is challenging to accurately determine target sites of transcription factors (TFs) by only using sequence information. Previous efforts were made to tackle this task by considering the fact that TF binding sites tend to be more conserved than other functional sites and the binding sites of several TFs are often clustered. Recently, ChIP-chip and ChIP-sequencing experiments have been accumulated to identify TF binding sites as well as survey the chromatin modification patterns at the regulatory elements such as promoters and enhancers. We propose here a hidden Markov model (HMM) to incorporate sequence motif information, TF-DNA interaction data and chromatin modification patterns to precisely identify cis-regulatory modules (CRMs). We conducted ChIP-chip experiments on four TFs, CREB, E2F1, MAX, and YY1 in 1% of the human genome. We then trained a hidden Markov model (HMM) to identify the labels of the CRMs by incorporating the sequence motifs recognized by these TFs and the ChIP-chip ratio. Chromatin modification data was used to predict the functional sites and to further remove false positives. Cross-validation showed that our integrated HMM had a performance superior to other existing methods on predicting CRMs. Incorporating histone signature information successfully penalized false prediction and improved the whole performance. The dataset we used and the software are available at http://nash.ucsd.edu/CIS/.
|
43
|
Schultheiss SJ, Busch W, Lohmann JU, Kohlbacher O, Rätsch G. KIRMES: kernel-based identification of regulatory modules in euchromatic sequences. Bioinformatics 2009; 25:2126-33. [PMID: 19389732 PMCID: PMC2722996 DOI: 10.1093/bioinformatics/btp278] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/15/2022]
Abstract
Motivation: Understanding transcriptional regulation is one of the main challenges in computational biology. An important problem is the identification of transcription factor (TF) binding sites in promoter regions of potential TF target genes. It is typically approached by position weight matrix-based motif identification algorithms using Gibbs sampling, or heuristics to extend seed oligos. Such algorithms succeed in identifying single, relatively well-conserved binding sites, but tend to fail when it comes to the identification of combinations of several degenerate binding sites, such as those often found in cis-regulatory modules. Results: We propose a new algorithm that combines the benefits of existing motif finding with those of support vector machines (SVMs) to find degenerate motifs in order to improve the modeling of regulatory modules. In experiments on microarray data from Arabidopsis thaliana, we were able to show that the newly developed strategy significantly improves the recognition of TF targets. Availability: The Python source code (open source-licensed under GPL), the data for the experiments and a Galaxy-based web service are available at http://www.fml.mpg.de/raetsch/suppl/kirmes/ Contact: sebi@tuebingen.mpg.de Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sebastian J Schultheiss
- Friedrich Miescher Laboratory of the Max Planck Society, and Max Planck Institute for Developmental Biology, Tübingen, Germany.
|
44
|
Pape UJ, Klein H, Vingron M. Statistical detection of cooperative transcription factors with similarity adjustment. Bioinformatics 2009; 25:2103-9. [PMID: 19286833 PMCID: PMC2722994 DOI: 10.1093/bioinformatics/btp143] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022] Open
Abstract
Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment. Results: Based on previous work, we propose to adjust the window size for co-occurrence detection. Using the derived approximation, one obtains different window sizes for different sets of DNA motifs depending on their similarities. This ensures that the probabilities of co-occurrences in random sequences are equal. Applying the approach to selected similar and dissimilar DNA motifs from human TFs shows the necessity of adjustment and confirms the accuracy of the approximation by comparison to simulated data. Furthermore, it becomes clear that approaches ignoring similarities strongly underestimate P-values for cooperativity of TFs with similar DNA motifs. In addition, the approach is extended to deal with overlapping windows. We derive Chen–Stein error bounds for the approximation. Comparing the error bounds for similar and dissimilar DNA motifs shows that the approximation for similar DNA motifs yields large bounds. Hence, one has to be careful when using overlapping windows. Based on the error bounds, one can precompute the approximation errors and select an appropriate overlap scheme before running the analysis. Availability: Software to perform the calculation for pairs of position frequency matrices (PFMs) is available at http://mosta.molgen.mpg.de as well as C++ source code for downloading. Contact: utz.pape@molgen.mpg.de
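The raw quantity behind such a test, the number of motif-hit pairs falling within a window of each other, can be sketched as follows. This only illustrates the counting step under assumed inputs (lists of hit start positions and a symmetric distance window); it does not implement the similarity-adjusted window size or the Chen–Stein bounds from the paper, and the function name is hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_cooccurrences(pos_a, pos_b, window):
    """Count pairs (a, b) of motif hit positions with |a - b| <= window."""
    pos_b = sorted(pos_b)
    count = 0
    for a in pos_a:
        # Hits of motif B inside [a - window, a + window].
        lo = bisect_left(pos_b, a - window)
        hi = bisect_right(pos_b, a + window)
        count += hi - lo
    return count

# Hypothetical hit positions for two motifs on one sequence.
hits_a = [10, 50]
hits_b = [12, 40, 100]
wide = count_cooccurrences(hits_a, hits_b, window=10)
narrow = count_cooccurrences(hits_a, hits_b, window=5)
```

Shrinking the window lowers the count, which is the lever the abstract describes adjusting so that similar and dissimilar motif pairs have comparable chance co-occurrence rates.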
Affiliation(s)
- Utz J Pape
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73 and Mathematics and Computer Science, Free University of Berlin, Takustr. 9, 14195 Berlin, Germany.
|
45
|
Sun H, De Bie T, Storms V, Fu Q, Dhollander T, Lemmens K, Verstuyf A, De Moor B, Marchal K. ModuleDigger: an itemset mining framework for the detection of cis-regulatory modules. BMC Bioinformatics 2009; 10 Suppl 1:S30. [PMID: 19208131 PMCID: PMC2648767 DOI: 10.1186/1471-2105-10-s1-s30] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022] Open
Abstract
Background: The detection of cis-regulatory modules (CRMs) that mediate transcriptional responses in eukaryotes remains a key challenge in the postgenomic era. A CRM is characterized by a set of co-occurring transcription factor binding sites (TFBS). In silico methods have been developed to search for CRMs by determining the combination of TFBS that are statistically overrepresented in a certain gene set. Most of these methods solve this combinatorial problem by relying on computationally intensive optimization methods. As a result, their usage is limited to finding CRMs in small datasets (containing a few genes only) and using binding sites for a restricted number of transcription factors (TFs) out of which the optimal module will be selected. Results: We present an itemset mining based strategy for computationally detecting cis-regulatory modules (CRMs) in a set of genes. We tested our method by applying it to a large benchmark data set, derived from a ChIP-chip analysis, and compared its performance with other well-known cis-regulatory module detection tools. Conclusion: We show that by exploiting the computational efficiency of an itemset mining approach and combining it with a well-designed statistical scoring scheme, we were able to prioritize the biologically valid CRMs in a large set of coregulated genes using binding sites for a large number of potential TFs as input.
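The idea of treating each gene as a "transaction" of TFBS and mining motif combinations shared by many genes can be sketched with a generic Apriori-style enumeration. This is an illustrative assumption, not ModuleDigger's actual miner or scoring scheme; the function name, the `min_support` and `max_size` parameters, and the toy gene-to-motif map are all hypothetical.

```python
from itertools import combinations

def frequent_motif_sets(gene_to_motifs, min_support, max_size=3):
    """Return {motif set: number of genes containing it} for sets meeting min_support."""
    sets = [frozenset(m) for m in gene_to_motifs.values()]
    items = sorted(set().union(*sets)) if sets else []
    frequent = {}
    level = [frozenset([m]) for m in items]  # candidate sets of size 1
    size = 1
    while level and size <= max_size:
        kept = []
        for cand in level:
            support = sum(1 for s in sets if cand <= s)
            if support >= min_support:
                frequent[cand] = support
                kept.append(cand)
        # Apriori step: join surviving sets into candidates one motif larger.
        size += 1
        level = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return frequent

# Hypothetical predicted binding sites per gene promoter.
genes = {
    "g1": {"E2F1", "MAX", "YY1"},
    "g2": {"E2F1", "MAX"},
    "g3": {"E2F1", "YY1"},
    "g4": {"CREB"},
}
modules = frequent_motif_sets(genes, min_support=2)
```

Because only sets that already meet the support threshold are extended, the search avoids enumerating every motif combination, which is the efficiency argument the abstract makes; a statistical score would still be needed to rank the surviving candidates.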
Affiliation(s)
- Hong Sun
- Department of Electrical Engineering, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium.
|
46
|
Vingron M, Brazma A, Coulson R, van Helden J, Manke T, Palin K, Sand O, Ukkonen E. Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol 2009; 10:202. [PMID: 19226437 PMCID: PMC2687781 DOI: 10.1186/gb-2009-10-1-202] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022] Open
Abstract
With genome analysis expanding from the study of genes to the study of gene regulation, 'regulatory genomics' utilizes sequence information, evolution and functional genomics measurements to unravel how regulatory information is encoded in the genome.
Affiliation(s)
- Martin Vingron
- Computational Molecular Biology, Max-Planck-Institut für molekulare Genetik, Berlin, Germany.
|
47
|
Wan L, Li D, Zhang D, Liu X, Fu WJ, Zhu L, Deng M, Sun F, Qian M. Conservation and implications of eukaryote transcriptional regulatory regions across multiple species. BMC Genomics 2008; 9:623. [PMID: 19099599 PMCID: PMC2640395 DOI: 10.1186/1471-2164-9-623] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 09/19/2008] [Accepted: 12/20/2008] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND: Increasing evidence shows that whole genomes of eukaryotes are almost entirely transcribed into both protein coding genes and an enormous number of non-protein-coding RNAs (ncRNAs). Therefore, revealing the underlying regulatory mechanisms of transcripts becomes imperative. However, for a complete understanding of transcriptional regulatory mechanisms, we need to identify the regions in which they are found. We will call these transcriptional regulation regions, or TRRs, which can be considered functional regions containing a cluster of regulatory elements that cooperatively recruit transcription factors for binding and then regulate the expression of transcripts. RESULTS: We constructed a hierarchical stochastic language (HSL) model for the identification of core TRRs in yeast based on regulatory cooperation among TRR elements. The HSL model trained on yeast achieved comparable accuracy in predicting TRRs in other species, e.g., fruit fly, human, and rice, thus demonstrating the conservation of TRRs across species. The HSL model was also used to identify the TRRs of genes, such as p53 or OsALYL1, as well as microRNAs. In addition, the ENCODE regions were examined by HSL, and TRRs were found to be pervasively located in the genomes. CONCLUSION: Our findings indicate that 1) the HSL model can be used to accurately predict core TRRs of transcripts across species and 2) core TRRs identified by HSL are proper candidates for the further scrutiny of specific regulatory elements and mechanisms. Meanwhile, the regulatory activity taking place in the abundant ncRNAs might account for the ubiquitous presence of TRRs across the genome. In addition, we found that the TRRs of protein coding genes and ncRNAs are similar in structure, with the latter being more conserved than the former.
Affiliation(s)
- Lin Wan
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
- Dayong Li
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Donglei Zhang
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Xue Liu
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Wenjiang J Fu
- Department of Epidemiology, Michigan State University, East Lansing, Michigan 48824, USA
- Lihuang Zhu
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
- Fengzhu Sun
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100871, PR China
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California 90089, USA
- Minping Qian
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
|
48
|
Won KJ, Chepelev I, Ren B, Wang W. Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics 2008; 9:547. [PMID: 19094206 PMCID: PMC2657164 DOI: 10.1186/1471-2105-9-547] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.4] [Received: 08/23/2008] [Accepted: 12/18/2008] [Indexed: 01/31/2023] Open
Abstract
Background: Recent genome-scale surveys of epigenetic states in mammalian genomes have shown that promoters and enhancers are correlated with distinct chromatin signatures, providing a pragmatic way for systematic mapping of these regulatory elements in the genome. With rapid accumulation of chromatin modification profiles in the genomes of various organisms and cell types, this chromatin-based approach promises to uncover many new regulatory elements, but computational methods to effectively extract information from these datasets are still limited. Results: We present here a supervised learning method to predict promoters and enhancers based on their unique chromatin modification signatures. We trained hidden Markov models (HMMs) on the histone modification data for known promoters and enhancers, and then used the trained HMMs to identify promoter- or enhancer-like sequences in the human genome. Using a simulated annealing (SA) procedure, we searched for the most informative combination and the optimal window size of histone marks. Conclusion: Compared with the previous methods, the HMM method can capture the complex patterns of histone modifications, particularly from the weak signals. Cross-validation and scanning the ENCODE regions showed that our method outperforms the previous profile-based method in mapping promoters and enhancers. We also showed that including more histone marks can further boost the performance of our method. This observation suggests that the HMM is robust and is capable of integrating information from multiple histone marks. To further demonstrate the usefulness of our method, we applied it to analyzing genome-wide ChIP-Seq data in three mouse cell lines and correctly predicted active and inactive promoters with positive predictive values of more than 80%. The software is available at .
Affiliation(s)
- Kyoung-Jae Won
- Dept of Chemistry & Biochemistry, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0359, USA.
|
49
|
Sandve GK, Abul O, Drabløs F. Compo: composite motif discovery using discrete models. BMC Bioinformatics 2008; 9:527. [PMID: 19063744 PMCID: PMC2614996 DOI: 10.1186/1471-2105-9-527] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/21/2008] [Accepted: 12/08/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Computational discovery of motifs in biomolecular sequences is an established field, with applications both in the discovery of functional sites in proteins and regulatory sites in DNA. In recent years there has been increased attention towards the discovery of composite motifs, typically occurring in cis-regulatory regions of genes. RESULTS: This paper describes Compo: a discrete approach to composite motif discovery that supports richer modeling of composite motifs and a more realistic background model compared to previous methods. Furthermore, multiple parameter and threshold settings are tested automatically, and the most interesting motifs across settings are selected. This avoids reliance on single hard thresholds, which has been a weakness of previous discrete methods. Comparison of motifs across parameter settings is made possible by the use of p-values as a general significance measure. Compo can either return an ordered list of motifs, ranked according to the general significance measure, or a Pareto front corresponding to a multi-objective evaluation on sensitivity, specificity and spatial clustering. CONCLUSION: Compo performs very competitively compared to several existing methods on a collection of benchmark data sets. These benchmarks include a recently published, large benchmark suite where the use of support across sequences allows Compo to correctly identify binding sites even when the relevant PWMs are mixed with a large number of noise PWMs. Furthermore, the possibility of parameter-free running offers high usability, the support for multi-objective evaluation allows a rich view of potential regulators, and the discrete model allows flexibility in modeling and interpretation of motifs.
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
|
50
|
Terenius O, Marinotti O, Sieglaff D, James AA. Molecular genetic manipulation of vector mosquitoes. Cell Host Microbe 2008; 4:417-23. [PMID: 18996342 PMCID: PMC2656434 DOI: 10.1016/j.chom.2008.09.002] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 05/02/2008] [Revised: 08/29/2008] [Accepted: 09/09/2008] [Indexed: 01/01/2023]
Abstract
Genetic strategies for reducing populations of vector mosquitoes or replacing them with those that are not able to transmit pathogens benefit greatly from molecular tools that allow gene manipulation and transgenesis. Mosquito genome sequences and associated EST (expressed sequence tags) databases enable large-scale investigations to provide new insights into evolutionary, biochemical, genetic, metabolic, and physiological pathways. Additionally, comparative genomics reveals the bases for evolutionary mechanisms with particular focus on specific interactions between vectors and pathogens. We discuss how this information may be exploited for the optimization of transgenes that interfere with the propagation and development of pathogens in their mosquito hosts.
Affiliation(s)
- Olle Terenius
- Department of Molecular Biology and Biochemistry, 3205 McGaugh Hall, University of California, Irvine, CA 92697, USA
- Osvaldo Marinotti
- Department of Molecular Biology and Biochemistry, 3205 McGaugh Hall, University of California, Irvine, CA 92697, USA
- Douglas Sieglaff
- Department of Molecular Biology and Biochemistry, 3205 McGaugh Hall, University of California, Irvine, CA 92697, USA
- Institute for Genomics and Bioinformatics, University of California, Irvine
- Anthony A. James
- Department of Molecular Biology and Biochemistry, 3205 McGaugh Hall, University of California, Irvine, CA 92697, USA
- Department of Microbiology & Molecular Genetics, University of California, Irvine, CA 92697, USA
|