1
|
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023] Open
Abstract
We review current methods and bioinformatics tools for the text complexity estimates (information and entropy measures). The search DNA regions with extreme statistical characteristics such as low complexity regions are important for biophysical models of chromosome function and gene transcription regulation in genome scale. We discuss the complexity profiling for segmentation and delineation of genome sequences, search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new applications fields: analysis of mutation hotspots loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. The algorithms implementing various numerical measures of text complexity estimates including combinatorial and linguistic measures have been developed before genome sequencing era. The series of tools to estimate sequence complexity use compression approaches, mainly by modification of Lempel-Ziv compression. Most of the tools are available online providing large-scale service for whole genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present comparison of the complexity methods on the different sequence sets, the applications for gene transcription regulatory regions analysis. Furthermore, we discuss approaches and application of sequence complexity for proteins. The complexity measures for amino acid sequences could be calculated by the same entropy and compression-based algorithms. But the functional and evolutionary roles of low complexity regions in protein have specific features differing from DNA. The tools for protein sequence complexity aimed for protein structural constraints. It was shown that low complexity regions in protein sequences are conservative in evolution and have important biological and structural functions. Finally, we summarize recent findings in large scale genome complexity comparison and applications for coronavirus genome analysis.
Collapse
Affiliation(s)
- Yuriy L. Orlov
- The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
- Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
| | - Nina G. Orlova
- Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
| |
Collapse
|
2
|
Keränen SVE, Villahoz-Baleta A, Bruno AE, Halfon MS. REDfly: An Integrated Knowledgebase for Insect Regulatory Genomics. INSECTS 2022; 13:618. [PMID: 35886794 PMCID: PMC9323752 DOI: 10.3390/insects13070618] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 07/01/2022] [Accepted: 07/06/2022] [Indexed: 11/29/2022]
Abstract
We provide here an updated description of the REDfly (Regulatory Element Database for Fly) database of transcriptional regulatory elements, a unique resource that provides regulatory annotation for the genome of Drosophila and other insects. The genomic sequences regulating insect gene expression-transcriptional cis-regulatory modules (CRMs, e.g., "enhancers") and transcription factor binding sites (TFBSs)-are not currently curated by any other major database resources. However, knowledge of such sequences is important, as CRMs play critical roles with respect to disease as well as normal development, phenotypic variation, and evolution. Characterized CRMs also provide useful tools for both basic and applied research, including developing methods for insect control. REDfly, which is the most detailed existing platform for metazoan regulatory-element annotation, includes over 40,000 experimentally verified CRMs and TFBSs along with their DNA sequences, their associated genes, and the expression patterns they direct. Here, we briefly describe REDfly's contents and data model, with an emphasis on the new features implemented since 2020. We then provide an illustrated walk-through of several common REDfly search use cases.
Collapse
Affiliation(s)
| | - Angel Villahoz-Baleta
- Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA; (A.V.-B.); (A.E.B.)
- New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
| | - Andrew E. Bruno
- Center for Computational Research, State University of New York at Buffalo, Buffalo, NY 14203, USA; (A.V.-B.); (A.E.B.)
- New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
| | - Marc S. Halfon
- New York State Center of Excellence in Bioinformatics and Life Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY 14203, USA
- Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
| |
Collapse
|
3
|
Abstract
BACKGROUND Topic models are statistical algorithms which try to discover the structure of a set of documents according to the abstract topics contained in them. Here we try to apply this approach to the discovery of the structure of the transcription factor binding sites (TFBS) contained in a set of biological sequences, which is a fundamental problem in molecular biology research for the understanding of transcriptional regulation. Here we present two methods that make use of topic models for motif finding. First, we developed an algorithm in which first a set of biological sequences are treated as text documents, and the k-mers contained in them as words, to then build a correlated topic model (CTM) and iteratively reduce its perplexity. We also used the perplexity measurement of CTMs to improve our previous algorithm based on a genetic algorithm and several statistical coefficients. RESULTS The algorithms were tested with 56 data sets from four different species and compared to 14 other methods by the use of several coefficients both at nucleotide and site level. The results of our first approach showed a performance comparable to the other methods studied, especially at site level and in sensitivity scores, in which it scored better than any of the 14 existing tools. In the case of our previous algorithm, the new approach with the addition of the perplexity measurement clearly outperformed all of the other methods in sensitivity, both at nucleotide and site level, and in overall performance at site level. CONCLUSIONS The statistics obtained show that the performance of a motif finding method based on the use of a CTM is satisfying enough to conclude that the application of topic models is a valid method for developing motif finding algorithms. Moreover, the addition of topic models to a previously developed method dramatically increased its performance, suggesting that this combined algorithm can be a useful tool to successfully predict motifs in different kinds of sets of DNA sequences.
Collapse
Affiliation(s)
- Josep Basha Gutierrez
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561 Chiba, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639 Tokyo, Japan
| | - Kenta Nakai
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 277-8561 Chiba, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, 108-8639 Tokyo, Japan
| |
Collapse
|
4
|
Rouault H, Santolini M, Schweisguth F, Hakim V. Imogene: identification of motifs and cis-regulatory modules underlying gene co-regulation. Nucleic Acids Res 2014; 42:6128-45. [PMID: 24682824 PMCID: PMC4041412 DOI: 10.1093/nar/gku209] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
Cis-regulatory modules (CRMs) and motifs play a central role in tissue and condition-specific gene expression. Here we present Imogene, an ensemble of statistical tools that we have developed to facilitate their identification and implemented in a publicly available software. Starting from a small training set of mammalian or fly CRMs that drive similar gene expression profiles, Imogene determines de novocis-regulatory motifs that underlie this co-expression. It can then predict on a genome-wide scale other CRMs with a regulatory potential similar to the training set. Imogene bypasses the need of large datasets for statistical analyses by making central use of the information provided by the sequenced genomes of multiple species, based on the developed statistical tools and explicit models for transcription factor binding site evolution. We test Imogene on characterized tissue-specific mouse developmental CRMs. Its ability to identify CRMs with the same specificity based on its de novo created motifs is comparable to that of previously evaluated ‘motif-blind’ methods. We further show, both in flies and in mammals, that Imogene de novo generated motifs are sufficient to discriminate CRMs related to different developmental programs. Notably, purely relying on sequence data, Imogene performs as well in this discrimination task as a previously reported learning algorithm based on Chromatin Immunoprecipitation (ChIP) data for multiple transcription factors at multiple developmental stages.
Collapse
Affiliation(s)
- Hervé Rouault
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France CNRS, URA2578, F-75015 Paris, France
| | - Marc Santolini
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
| | - François Schweisguth
- Developmental and Stem Cell Biology Department, Institut Pasteur, F-75015 Paris, France CNRS, URA2578, F-75015 Paris, France
| | - Vincent Hakim
- Laboratoire de Physique Statistique, CNRS, École Normale Supérieure, Université P. et M. Curie, Université Paris-Diderot
| |
Collapse
|
5
|
Shu JJ, Li Y. A statistical thin-tail test of predicting regulatory regions in the Drosophila genome. Theor Biol Med Model 2013; 10:11. [PMID: 23409927 PMCID: PMC3598831 DOI: 10.1186/1742-4682-10-11] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2012] [Accepted: 02/07/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs) is a crucial step in studying gene expression, but the computational method attempting to distinguish CRMs from NCNRs still remains a challenging problem due to the limited knowledge of specific interactions involved. METHODS The statistical properties of cis-regulatory modules (CRMs) are explored by estimating the similar-word set distribution with overrepresentation (Z-score). It is observed that CRMs tend to have a thin-tail Z-score distribution. A new statistical thin-tail test with two thinness coefficients is proposed to distinguish CRMs from non-coding non-regulatory regions (NCNRs). RESULTS As compared with the existing fluffy-tail test, the first thinness coefficient is designed to reduce computational time, making the novel thin-tail test very suitable for long sequences and large database analysis in the post-genome time and the second one to improve the separation accuracy between CRMs and NCNRs. These two thinness coefficients may serve as valuable filtering indexes to predict CRMs experimentally. CONCLUSIONS The novel thin-tail test provides an efficient and effective means for distinguishing CRMs from NCNRs based on the specific statistical properties of CRMs and can guide future experiments aimed at finding new CRMs in the post-genome time.
Collapse
Affiliation(s)
- Jian-Jun Shu
- School of Mechanical & Aerospace Engineering, Nanyang Technological University, Singapore 639798, Singapore.
| | | |
Collapse
|
6
|
Shu JJ, Li Y. A statistical fat-tail test of predicting regulatory regions in the Drosophila genome. Comput Biol Med 2012; 42:935-41. [PMID: 22884312 DOI: 10.1016/j.compbiomed.2012.07.007] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2010] [Revised: 05/29/2012] [Accepted: 07/18/2012] [Indexed: 11/19/2022]
Affiliation(s)
- Jian-Jun Shu
- School of Mechanical & Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore.
| | | |
Collapse
|
7
|
Busser BW, Taher L, Kim Y, Tansey T, Bloom MJ, Ovcharenko I, Michelson AM. A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 2012; 8:e1002531. [PMID: 22412381 PMCID: PMC3297574 DOI: 10.1371/journal.pgen.1002531] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2011] [Accepted: 12/23/2011] [Indexed: 12/22/2022] Open
Abstract
Transcriptional enhancers integrate the contributions of multiple classes of transcription factors (TFs) to orchestrate the myriad spatio-temporal gene expression programs that occur during development. A molecular understanding of enhancers with similar activities requires the identification of both their unique and their shared sequence features. To address this problem, we combined phylogenetic profiling with a DNA-based enhancer sequence classifier that analyzes the TF binding sites (TFBSs) governing the transcription of a co-expressed gene set. We first assembled a small number of enhancers that are active in Drosophila melanogaster muscle founder cells (FCs) and other mesodermal cell types. Using phylogenetic profiling, we increased the number of enhancers by incorporating orthologous but divergent sequences from other Drosophila species. Functional assays revealed that the diverged enhancer orthologs were active in largely similar patterns as their D. melanogaster counterparts, although there was extensive evolutionary shuffling of known TFBSs. We then built and trained a classifier using this enhancer set and identified additional related enhancers based on the presence or absence of known and putative TFBSs. Predicted FC enhancers were over-represented in proximity to known FC genes; and many of the TFBSs learned by the classifier were found to be critical for enhancer activity, including POU homeodomain, Myb, Ets, Forkhead, and T-box motifs. Empirical testing also revealed that the T-box TF encoded by org-1 is a previously uncharacterized regulator of muscle cell identity. Finally, we found extensive diversity in the composition of TFBSs within known FC enhancers, suggesting that motif combinatorics plays an essential role in the cellular specificity exhibited by such enhancers. In summary, machine learning combined with evolutionary sequence analysis is useful for recognizing novel TFBSs and for facilitating the identification of cognate TFs that coordinate cell type-specific developmental gene expression patterns.
Collapse
Affiliation(s)
- Brian W. Busser
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Yongsok Kim
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Terese Tansey
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Molly J. Bloom
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
| | - Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (IO); (AMM)
| | - Alan M. Michelson
- Laboratory of Developmental Systems Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- * E-mail: (IO); (AMM)
| |
Collapse
|
8
|
Abnizova I, Walter K, Te Boekhorst R, Elgar G, Gilks WR. STATISTICAL INFORMATION CHARACTERIZATION OF CONSERVED NON-CODING ELEMENTS IN VERTEBRATES. J Bioinform Comput Biol 2011; 5:533-47. [PMID: 17636860 DOI: 10.1142/s0219720007002898] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2006] [Revised: 02/23/2007] [Accepted: 02/23/2007] [Indexed: 01/28/2023]
Abstract
Recently, a set of highly conserved non-coding elements (CNEs) has been derived from a comparison between the genomes of the puffer fish, Takifugu or Fugu rubripes, and man. In order to facilitate the identification of these conserved elements in silico, we characterize them by a number of statistical features.We found a pronounced information pattern around CNE borders; although the CNEs themselves are AT rich and have high entropy (complexity), they are flanked by GC-rich regions of low entropy (complexity). We also identified the most abundant motifs within and around of CNEs, and identified those that group around their borders. Like in human promoter regions, the TBP, NF-Y and some other binding motifs are clustered around CNE boundaries, which may suggest a possible transcription regulatory function of CNEs.
Collapse
|
9
|
Nourmohammad A, Lässig M. Formation of regulatory modules by local sequence duplication. PLoS Comput Biol 2011; 7:e1002167. [PMID: 21998564 PMCID: PMC3188502 DOI: 10.1371/journal.pcbi.1002167] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2011] [Accepted: 06/30/2011] [Indexed: 11/24/2022] Open
Abstract
Turnover of regulatory sequence and function is an important part of molecular evolution. But what are the modes of sequence evolution leading to rapid formation and loss of regulatory sites? Here we show that a large fraction of neighboring transcription factor binding sites in the fly genome have formed from a common sequence origin by local duplications. This mode of evolution is found to produce regulatory information: duplications can seed new sites in the neighborhood of existing sites. Duplicate seeds evolve subsequently by point mutations, often towards binding a different factor than their ancestral neighbor sites. These results are based on a statistical analysis of 346 cis-regulatory modules in the Drosophila melanogaster genome, and a comparison set of intergenic regulatory sequence in Saccharomyces cerevisiae. In fly regulatory modules, pairs of binding sites show significantly enhanced sequence similarity up to distances of about 50 bp. We analyze these data in terms of an evolutionary model with two distinct modes of site formation: (i) evolution from independent sequence origin and (ii) divergent evolution following duplication of a common ancestor sequence. Our results suggest that pervasive formation of binding sites by local sequence duplications distinguishes the complex regulatory architecture of higher eukaryotes from the simpler architecture of unicellular organisms. Since Jacob and Monod stressed the importance of gene regulation in evolution, our understanding of the mechanisms of regulation has substantially advanced. In higher eukaryotes, genes often have complex regulatory input, which is encoded in cis-regulatory sequence with multiple transcription factor binding sites. However, the modes of genome evolution generating regulatory complexity are much less understood. This study reports a surprising finding: in fly regulatory modules, the majority of transcription factor binding sites show evidence of a local sequence duplication in their evolutionary history, which relates their sequence information to that of neighboring binding sites. Our analysis suggests that local sequence duplications are a pervasive production mode of regulatory information. This mode appears to be specific to higher eukaryotes; we have not found evidence of frequent local duplications in the yeast genome. Our results affect genomic sequence analysis, in particular, computational identification of cis-regulatory elements and alignment of regulatory DNA. At the same time, they address fundamental questions on the evolution of regulation: How much of the regulatory “grammar” observed in higher eukaryotes is due to optimization of function, and how much reflects the underlying sequence evolution modes? What is the result and what is the substrate of natural selection?
Collapse
Affiliation(s)
| | - Michael Lässig
- Institute for Theoretical Physics, University of Cologne, Köln, Germany
- * E-mail:
| |
Collapse
|
10
|
Bernard V, Lecharny A, Brunaud V. Improved detection of motifs with preferential location in promoters. Genome 2011; 53:739-52. [PMID: 20924423 DOI: 10.1139/g10-042] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
Many transcription factor binding sites (TFBSs) involved in gene expression regulation are preferentially located relative to the transcription start site. This property is exploited in in silico prediction approaches, one of which involves studying the local overrepresentation of motifs using a sliding window to scan promoters with considerable accuracy. Nevertheless, the consequences of the choice of the sliding window size have never before been analysed. We propose an automatic adaptation of this size to each motif distribution profile. This approach allows a better characterization of the topological constraints of the motifs and the lists of genes containing them. Moreover, our approach allowed us to highlight a nonconstant frequency of occurrence of spurious motifs that could be counter-selected close to their functional area. Therefore, to improve the accuracy of in silico prediction of TFBSs and the sensitivity of the promoter cartography, we propose, in addition to automatic adaptation of window size, consideration of the nonconstant frequency of motifs in promoters.
Collapse
Affiliation(s)
- Virginie Bernard
- Unité de Recherche en Génomique Végétale (URGV), UMR INRA 1165 - CNRS 8114 - UEVE, 91057 Evry CEDEX, France
| | | | | |
Collapse
|
11
|
Genome-wide identification of cis-regulatory motifs and modules underlying gene coregulation using statistics and phylogeny. Proc Natl Acad Sci U S A 2010; 107:14615-20. [PMID: 20671200 DOI: 10.1073/pnas.1002876107] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2023] Open
Abstract
Cell fate determination depends in part on the establishment of specific transcriptional programs of gene expression. These programs result from the interpretation of the genomic cis-regulatory information by sequence-specific factors. Decoding this information in sequenced genomes is an important issue. Here, we developed statistical analysis tools to computationally identify the cis-regulatory elements that control gene expression in a set of coregulated genes. Starting with a small number of validated and/or predicted cis-regulatory modules (CRMs) in a reference species as a training set, but with no a priori knowledge of the factors acting in trans, we computationally predicted transcription factor binding sites (TFBSs) and genomic CRMs underlying coregulation. This method was applied to the gene expression program active in Drosophila melanogaster sensory organ precursor cells (SOPs), a specific type of neural progenitor cells. Mutational analysis showed that four, including one newly characterized, out of the five top-ranked families of predicted TFBSs were required for SOP-specific gene expression. Additionaly, 19 out of the 29 top-ranked predicted CRMs directed gene expression in neural progenitor cells, i.e., SOPs or larval brain neuroblasts, with a notable fraction active in SOPs (11/29). We further identified the lola gene as the target of two SOP-specific CRMs and found that the lola gene contributed to SOP specification. The statistics and phylogeny-based tools described here can be more generally applied to identify the cis-regulatory elements of specific gene regulatory networks in any family of related species with sequenced genomes.
Collapse
|
12
|
Jung CH, Makunin IV, Mattick JS. Identification of conserved Drosophila-specific euchromatin-restricted non-coding sequence motifs. Genomics 2010; 96:154-66. [PMID: 20595017 DOI: 10.1016/j.ygeno.2010.05.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2010] [Revised: 05/25/2010] [Accepted: 05/26/2010] [Indexed: 01/19/2023]
Abstract
Non-protein-coding DNA comprises the majority of animal genomes but its functions are largely unknown. We identified over 17,000 different tetranucleotide pairs in the Drosophila melanogaster genome that are over-represented at distances up to 100nt in conserved non-exonic sequences. Those exhibiting the highest information content in surrounding nucleotides were classified into five groups: tRNAs, motifs associated with histone genes, Suppressor-of-Hairy-wing binding sites, and two sets of previously unrecognized motifs (DLM3 and DLM4). There are hundreds to thousands of copies of DLM3 and DLM4, respectively, in the genome, located almost exclusively in non-coding regions. They have similar copy numbers among drosophilids, but are largely absent in other insects. DLM3 is likely a cis-regulatory element, whereas DLM4 sequences are capable of forming a short hairpin structure and are expressed as approximately 80nt RNAs. This work reports the existence of Drosophila genus-specific sequence motifs, and suggests that many more novel functional elements may be discovered in genomes using the general approach outlined herein.
Collapse
Affiliation(s)
- Chol-Hee Jung
- Institute for Molecular Bioscience, The University of Queensland, St Lucia QLD, Australia
| | | | | |
Collapse
|
13
|
Evans KJ. Most transcription factor binding sites are in a few mosaic classes of the human genome. BMC Genomics 2010; 11:286. [PMID: 20459624 PMCID: PMC2881025 DOI: 10.1186/1471-2164-11-286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2010] [Accepted: 05/06/2010] [Indexed: 12/02/2022] Open
Abstract
Background Many algorithms for finding transcription factor binding sites have concentrated on the characterisation of the binding site itself: and these algorithms lead to a large number of false positive sites. The DNA sequence which does not bind has been modeled only to the extent necessary to complement this formulation. Results We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies, (or more precisely by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites are in the same four pairs of classes. In our sample of seventeen transcription factors — taken from different families of transcription factors — the average proportion of sites in this subset of classes was 75%, with values for individual factors ranging from 48% to 98%. By contrast these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors — that is places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions. Conclusions This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes.
Collapse
Affiliation(s)
- Kenneth J Evans
- School of Crystallography, Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK.
| |
Collapse
|
14
|
Kantorovitz MR, Kazemian M, Kinston S, Miranda-Saavedra D, Zhu Q, Robinson GE, Göttgens B, Halfon MS, Sinha S. Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev Cell 2009; 17:568-79. [PMID: 19853570 DOI: 10.1016/j.devcel.2009.09.002] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2009] [Revised: 07/02/2009] [Accepted: 09/09/2009] [Indexed: 12/24/2022]
Abstract
We present new approaches to cis-regulatory module (CRM) discovery in the common scenario where relevant transcription factors and/or motifs are unknown. Beginning with a small list of CRMs mediating a common gene expression pattern, we search genome-wide for CRMs with similar functionality, using new statistical scores and without requiring known motifs or accurate motif discovery. We cross-validate our predictions on 31 regulatory networks in Drosophila and through correlations with gene expression data. Five predicted modules tested using an in vivo reporter gene assay all show tissue-specific regulatory activity. We also demonstrate our methods' ability to predict mammalian tissue-specific enhancers. Finally, we predict human CRMs that regulate early blood and cardiovascular development. In vivo transgenic mouse analysis of two predicted CRMs demonstrates that both have appropriate enhancer activity. Overall, 7/7 predictions were validated successfully in vivo, demonstrating the effectiveness of our approach for insect and mammalian genomes.
Collapse
Affiliation(s)
- Miriam R Kantorovitz
- Department of Mathematics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
15
|
Identifying cis-regulatory sequences by word profile similarity. PLoS One 2009; 4:e6901. [PMID: 19730735 PMCID: PMC2731932 DOI: 10.1371/journal.pone.0006901] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2008] [Accepted: 08/07/2009] [Indexed: 12/13/2022] Open
Abstract
Background Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples. Methodology/Principal Findings We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila. Conclusions/Significance Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz.
Collapse
|
16
|
Sun Y, Robinson M, Adams R, te Boekhorst R, Rust AG, Davey N. Integrating genomic binding site predictions using real-valued meta classifiers. Neural Comput Appl 2009. [DOI: 10.1007/s00521-008-0204-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
17
|
Wilczynski B, Dojer N, Patelak M, Tiuryn J. Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs. BMC Bioinformatics 2009; 10:82. [PMID: 19284541 PMCID: PMC2669485 DOI: 10.1186/1471-2105-10-82] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2008] [Accepted: 03/10/2009] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Finding functional regulatory elements in DNA sequences is a very important problem in computational biology and providing a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on genome-wide scale. Major obstacles in this respect are that the fact that the amount of non-coding DNA is vast, and that the methods for predicting functional transcription factor binding sites tend to produce results with a high percentage of false positives. This makes the problem of finding regions significantly enriched in binding sites difficult. RESULTS We develop a novel method for predicting regulatory regions in DNA sequences, which is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms. CONCLUSION We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Also, our method is robust enough to perform well in predicting CRMs, despite differences in tissue specificity and even across species, provided that the evolutionary distances between compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.
Collapse
|
18
|
Ma X, Li-Ling J, Huang Q, Chen X, Hou L, Ma F. Systematic analysis of alternative promoters correlated with alternative splicing in human genes. Genomics 2009; 93:420-5. [PMID: 19442634 DOI: 10.1016/j.ygeno.2009.01.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2008] [Revised: 01/22/2009] [Accepted: 01/28/2009] [Indexed: 11/17/2022]
Abstract
Interactions between various events are essential for complex and delicate transcriptional regulation. To delineate the features and potential roles of alternative promoters (APs) correlated with alternative splicing (AS), we have systematically analyzed 9908 putative alternative promoters (PAPs) from 3797 human genes. Our results showed that approximately 65% of AS events are associated with PAPs. Intriguingly, PAPs per human AS gene only averaged 2.6 for our dataset, which was significantly lower than previously reported. This seems to imply that the human genome contains a small pool of appropriable PAPs for AS genes. Exploration of the characteristics of PAPs such as CpG islands, TATA boxes, GC-content, transcription factor binding sites (TFBSs) and repetitive elements suggested that, respectively, 87% and 90% of PAPs of human AS genes are CpG- and TATA box-poor. The GC-content is significantly higher in the downstream of transcription start sites (TSSs) than upstream (58% vs. 53%), and there is a strong negative correlation between the GC-content and the number of PAPs. These suggested that GC-content around the TSSs plays an important role in the regulation of AS. Moreover, different APs contain distinct densities of repetitive elements and TFBSs, indicating that such sequences have an intrinsic role in the divergent regulation of PAPs and AS. Substantial difference was also found between human AS genes in terms of PAP numbers. A close connection between PAPs and AS may play a critical role in the choice of APs and regulation of AS genes. Furthermore, the distribution of AS genes on different human chromosomes also influences the numbers of PAPs and isoforms of AS genes. Our results may provide important clues for further studies on regulatory network of transcription-related events.
Collapse
Affiliation(s)
- Xiaojuan Ma
- College of Life Science, Liaoning Normal University, Dalian 116029, China
| | | | | | | | | | | |
Collapse
|
19
|
Comparative analysis of distinct non-coding characteristics potentially contributing to the divergence of human tissue-specific genes. Genetica 2008; 136:127-34. [DOI: 10.1007/s10709-008-9323-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2007] [Accepted: 08/25/2008] [Indexed: 10/21/2022]
|
20
|
Li L, Zhu Q, He X, Sinha S, Halfon MS. Large-scale analysis of transcriptional cis-regulatory modules reveals both common features and distinct subclasses. Genome Biol 2008; 8:R101. [PMID: 17550599 PMCID: PMC2394749 DOI: 10.1186/gb-2007-8-6-r101] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2007] [Revised: 05/23/2007] [Accepted: 06/05/2007] [Indexed: 02/01/2023] Open
Abstract
Analysis of 280 experimentally-verified cis-regulatory modules from Drosophila reveal features both common to all and unique to distinct subclasses of modules. Background Transcriptional cis-regulatory modules (for example, enhancers) play a critical role in regulating gene expression. While many individual regulatory elements have been characterized, they have never been analyzed as a class. Results We have performed the first such large-scale study of cis-regulatory modules in order to determine whether they have common properties that might aid in their identification and contribute to our understanding of the mechanisms by which they function. A total of 280 individual, experimentally verified cis-regulatory modules from Drosophila were analyzed for a range of sequence-level and functional properties. We report here that regulatory modules do indeed share common properties, among them an elevated GC content, an increased level of interspecific sequence conservation, and a tendency to be transcribed into RNA. However, we find that dense clustering of transcription factor binding sites, especially homotypic clustering, which is commonly believed to be a general characteristic of regulatory modules, is rather a feature that belongs chiefly to a specific subclass. This has important implications for current computational approaches, many of which are biased toward this subset. We explore two new strategies to assess binding site clustering and gauge their performances with respect to their ability to detect all 280 modules and various functionally coherent subsets. Conclusion Our findings demonstrate that cis-regulatory modules share common features that help to define them as a class and that may lead to new insights into mechanisms of gene regulation. However, these properties alone may not be sufficient to reliably distinguish regulatory from non-regulatory sequences. We also demonstrate that there are distinct subclasses of cis-regulatory modules that are more amenable to in silico detection than others and that these differences must be taken into account when attempting genome-wide regulatory element discovery.
Collapse
Affiliation(s)
- Long Li
- Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
| | - Qianqian Zhu
- Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
| | - Xin He
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Marc S Halfon
- Department of Biochemistry, State University of New York at Buffalo, Buffalo, NY 14214, USA
- Department of Biological Sciences, State University of New York at Buffalo, Buffalo, NY 14214, USA
- New York State Center of Excellence in Bioinformatics and the Life Sciences, Buffalo, NY 14203, USA
- Department of Molecular and Cellular Biology, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
| |
Collapse
|
21
|
Abnizova I, Gilks WR. Studying statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the eukaryotic genomes. Brief Bioinform 2006; 7:48-54. [PMID: 16761364 DOI: 10.1093/bib/bbk004] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
There are no well-known properties in regulatory DNA analogous to those in coding sequences; their spatial location is not regular, the consensus regulatory elements are often degenerate and there are no understandable rules governing their evolution. This makes it difficult to recognize regulatory regions within genome. We review developments in the statistical characterization of regulatory regions and methods of their recognition in eukaryotic genomes.
Collapse
|
22
|
Abstract
Bioinformatics studies of transcriptional regulation in the metazoa are significantly hindered by the absence of readily available data on large numbers of transcriptional cis-regulatory modules (CRMs). Even the richly annotated Drosophila melanogaster genome lacks extensive CRM information. We therefore present here a database of Drosophila CRMs curated from the literature complete with both DNA sequence and a searchable description of the gene expression pattern regulated by each CRM. This resource should greatly facilitate the development of computational approaches to CRM discovery as well as bioinformatics analyses of regulatory sequence properties and evolution.
Collapse
Affiliation(s)
- Steven M Gallo
- Center for Computational Research, 140 Farber Hall, State University of New York at Buffalo, 3435 Main Street, Buffalo, NY 14214, USA
| | | | | | | |
Collapse
|
23
|
Chan BY, Kibler D. Using hexamers to predict cis-regulatory motifs in Drosophila. BMC Bioinformatics 2005; 6:262. [PMID: 16253142 PMCID: PMC1291357 DOI: 10.1186/1471-2105-6-262] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2005] [Accepted: 10/27/2005] [Indexed: 12/22/2022] Open
Abstract
Background Cis-regulatory modules (CRMs) are short stretches of DNA that help regulate gene expression in higher eukaryotes. They have been found up to 1 megabase away from the genes they regulate and can be located upstream, downstream, and even within their target genes. Due to the difficulty of finding CRMs using biological and computational techniques, even well-studied regulatory systems may contain CRMs that have not yet been discovered. Results We present a simple, efficient method (HexDiff) based only on hexamer frequencies of known CRMs and non-CRM sequence to predict novel CRMs in regulatory systems. On a data set of 16 gap and pair-rule genes containing 52 known CRMs, predictions made by HexDiff had a higher correlation with the known CRMs than several existing CRM prediction algorithms: Ahab, Cluster Buster, MSCAN, MCAST, and LWF. After combining the results of the different algorithms, 10 putative CRMs were identified and are strong candidates for future study. The hexamers used by HexDiff to distinguish between CRMs and non-CRM sequence were also analyzed and were shown to be enriched in regulatory elements. Conclusion HexDiff provides an efficient and effective means for finding new CRMs based on known CRMs, rather than known binding sites.
Collapse
Affiliation(s)
- Bob Y Chan
- School of Information and Computer Science, University of California, Irvine, Irvine, California, USA
| | - Dennis Kibler
- School of Information and Computer Science, University of California, Irvine, Irvine, California, USA
| |
Collapse
|