1
Leoncini M, Montangero M, Pellegrini M, Tillan KP. CMStalker: A Combinatorial Tool for Composite Motif Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:1123-1136. [PMID: 26451824 DOI: 10.1109/tcbb.2014.2359444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Controlling the differential expression of many thousands of different genes at any given time is a fundamental task of metazoan organisms, and this complex orchestration is controlled by the so-called regulatory genome encoding complex regulatory networks: several Transcription Factors bind to precise DNA regions, so as to perform in a cooperative manner a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity in the last two decades. In this paper, we describe a new efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given as input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery which embodies a new approach that combines a constraint satisfaction formulation with a parameter relaxation technique to explore efficiently the space of possible solutions. Extensive experiments with 12 data sets and 11 state-of-the-art tools are reported, showing an average value of the correlation coefficient of 0.54 (against a value of 0.41 for the closest competitor). This improvement in output quality due to CMStalker is statistically significant.
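The abstract takes Position Weight Matrices as its input representation for binding sites. As a rough illustration of what scanning a promoter with one such matrix involves, here is a minimal Python sketch. It is not CMStalker's algorithm: the count matrix `COUNTS`, the uniform background, the pseudocount, and the threshold are all hypothetical values chosen for the example.

```python
import math

# Hypothetical count matrix for a 4-bp motif (one dict per position).
# These numbers are made up for illustration, not taken from any database.
COUNTS = [
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 1, "T": 8},
    {"A": 8, "C": 1, "G": 0, "T": 1},
]


def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position counts into log2-odds scores against a uniform background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every window scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(pwm[i][window[i]] for i in range(width))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_pwm(COUNTS)
hits = scan("GGTATACC", pwm, threshold=3.0)
for start, site, score in hits:
    print(start, site, round(score, 2))
```

A composite-motif tool would then look for combinations of such single-matrix hits that co-occur across sequences; that step is not shown here.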
2
Abstract
The term “transcriptional network” refers to the mechanism(s) that underlies coordinated expression of genes, typically involving transcription factors (TFs) binding to the promoters of multiple genes, and individual genes controlled by multiple TFs. A multitude of studies in the last two decades have aimed to map and characterize transcriptional networks in the yeast Saccharomyces cerevisiae. We review the methodologies and accomplishments of these studies, as well as challenges we now face. For most yeast TFs, data have been collected on their sequence preferences, in vivo promoter occupancy, and gene expression profiles in deletion mutants. These systematic studies have led to the identification of new regulators of numerous cellular functions and shed light on the overall organization of yeast gene regulation. However, many yeast TFs appear to be inactive under standard laboratory growth conditions, and many of the available data were collected using techniques that have since been improved. Perhaps as a consequence, comprehensive and accurate mapping among TF sequence preferences, promoter binding, and gene expression remains an open challenge. We propose that the time is ripe for renewed systematic efforts toward a complete mapping of yeast transcriptional regulatory mechanisms.
3
Functional analysis: evaluation of response intensities--tailoring ANOVA for lists of expression subsets. BMC Bioinformatics 2010; 11:510. [PMID: 20942918 PMCID: PMC2964684 DOI: 10.1186/1471-2105-11-510] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 03/30/2010] [Accepted: 10/13/2010] [Indexed: 02/06/2023] Open
Abstract
Background Microarray data is frequently used to characterize the expression profile of a whole genome and to compare the characteristics of that genome under several conditions. Geneset analysis methods have been described previously to analyze the expression values of several genes related by known biological criteria (metabolic pathway, pathology signature, co-regulation by a common factor, etc.) at the same time and the cost of these methods allows for the use of more values to help discover the underlying biological mechanisms. Results As several methods assume different null hypotheses, we propose to reformulate the main question that biologists seek to answer. To determine which genesets are associated with expression values that differ between two experiments, we focused on three ad hoc criteria: expression levels, the direction of individual gene expression changes (up or down regulation), and correlations between genes. We introduce the FAERI methodology, tailored from a two-way ANOVA to examine these criteria. The significance of the results was evaluated according to the self-contained null hypothesis, using label sampling or by inferring the null distribution from normally distributed random data. Evaluations performed on simulated data revealed that FAERI outperforms currently available methods for each type of set tested. We then applied the FAERI method to analyze three real-world datasets on hypoxia response. FAERI was able to detect more genesets than other methodologies, and the genesets selected were coherent with current knowledge of cellular response to hypoxia. Moreover, the genesets selected by FAERI were confirmed when the analysis was repeated on two additional related datasets. Conclusions The expression values of genesets are associated with several biological effects. The underlying mathematical structure of the genesets allows for analysis of data from several genes at the same time. Focusing on expression levels, the direction of the expression changes, and correlations, we showed that two-step data reduction allowed us to significantly improve the performance of geneset analysis using a modified two-way ANOVA procedure, and to detect genesets that current methods fail to detect.
4
Abstract
ErmineJ is software for the analysis of functionally interesting patterns in large gene lists drawn from gene expression profiling data or other high-throughput genomics studies. It can be used by biologists with no bioinformatics background to conduct sophisticated analyses of gene sets with multiple methods. It allows users to assess whether microarray data or other gene lists are enriched for a particular pathway or gene class. This protocol provides steps on how to format data files, determine analysis type, create custom gene sets and perform specific analyses, including overrepresentation analysis, gene score resampling and correlation resampling. ErmineJ differs from other methods in providing a rapid, simple and customizable analysis, including high-level visualization through its graphical user interface and scripting tools through its command-line interface, as well as custom gene sets and a variety of statistical methods. The protocol should take approximately 1 h, including (one-time) installation and setup.
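Overrepresentation analysis, one of the methods named above, is commonly formulated as a one-sided hypergeometric test: given a selected gene list, how surprising is its overlap with a gene class? The sketch below shows that textbook formulation only. It is an assumption that it resembles what any particular tool computes, and the function name and arguments are hypothetical, not part of ErmineJ.

```python
from math import comb


def hypergeom_pvalue(population, in_class, selected, overlap):
    """One-sided P(X >= overlap) for X ~ Hypergeometric.

    population: total number of genes assayed
    in_class:   genes in the population that belong to the gene class
    selected:   size of the selected (e.g. differentially expressed) gene list
    overlap:    selected genes that belong to the gene class
    """
    upper = min(in_class, selected)
    total = comb(population, selected)
    # math.comb returns 0 when k > n, so impossible terms vanish on their own.
    tail = sum(
        comb(in_class, k) * comb(population - in_class, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, 5 in the class, 5 selected, all 5 selected are in the class.
print(hypergeom_pvalue(10, 5, 5, 5))
```

In practice the test is repeated for every gene class, so the resulting p-values would normally be corrected for multiple testing before interpretation.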
5
Wang LY, Snyder M, Gerstein M. BoCaTFBS: a boosted cascade learner to refine the binding sites suggested by ChIP-chip experiments. Genome Biol 2007; 7:R102. [PMID: 17078876 PMCID: PMC1794589 DOI: 10.1186/gb-2006-7-11-r102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/20/2006] [Revised: 08/29/2006] [Accepted: 11/01/2006] [Indexed: 11/23/2022] Open
Abstract
BoCaTFBS, a new method that combines noisy data from ChIP-chip experiments with known binding-site patterns, is described and applied to the ENCODE project. Comprehensive mapping of transcription factor binding sites is essential in postgenomic biology. For this, we propose a mining approach combining noisy data from ChIP (chromatin immunoprecipitation)-chip experiments with known binding site patterns. Our method (BoCaTFBS) uses boosted cascades of classifiers for optimum efficiency, in which components are alternating decision trees; it exploits interpositional correlations; and it explicitly integrates massive negative information from ChIP-chip experiments. We applied BoCaTFBS within the ENCODE project and showed that it outperforms many traditional binding site identification methods (for instance, profiles).
Affiliation(s)
- Lu-yong Wang
- Integrated Data Systems Department, Siemens Corporate Research, 755 College Road East, Princeton, New Jersey 08540, USA
- Michael Snyder
- Department of Molecular, Cellular, and Developmental Biology, KBT 926, 266 Whitney Ave, Yale University, New Haven, Connecticut 06520, USA
- Mark Gerstein
- Department of Molecular Biophysics and Biochemistry, Bass 432A, 266 Whitney Ave, Yale University, New Haven, CT 06520, USA
- Program in Computational Biology and Bioinformatics, Bass 432A, 266 Whitney Ave, Yale University, New Haven, CT 06520, USA
- Department of Computer Science, 51 Prospect Street, Yale University, New Haven, Connecticut 06520, USA
6
Friberg M, von Rohr P, Gonnet G. Scoring functions for transcription factor binding site prediction. BMC Bioinformatics 2005; 6:84. [PMID: 15807889 PMCID: PMC1140076 DOI: 10.1186/1471-2105-6-84] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 11/12/2004] [Accepted: 04/04/2005] [Indexed: 11/12/2022] Open
Abstract
Background Transcription factor binding site (TFBS) prediction is a difficult problem, which requires a good scoring function to discriminate between real binding sites and background noise. Many scoring functions have been proposed in the literature, but it is difficult to assess their relative performance, because they are implemented in different software tools using different search methods and different TFBS representations. Results Here we compare how several scoring functions perform on both real and semi-simulated data sets in a common test environment. We have also developed two new scoring functions and included them in the comparison. The data sets are from the yeast (S. cerevisiae) genome. Our new scoring function LLBG (least likely under the background model) performs best in this study. It achieves the best average rank for the correct motifs. Scoring functions based on positional bias performed quite poorly in this study. Conclusion LLBG may provide an interesting alternative to current scoring functions for TFBS prediction.
Affiliation(s)
- Markus Friberg
- Institute of Computational Science, ETH, 8092 Zurich, Switzerland
- Peter von Rohr
- Institute of Computational Science, ETH, 8092 Zurich, Switzerland
- Gaston Gonnet
- Institute of Computational Science, ETH, 8092 Zurich, Switzerland
7
Abstract
Identifying genomic locations of transcription-factor binding sites, particularly in higher eukaryotic genomes, has been an enormous challenge. Various experimental and computational approaches have been used to detect these sites; methods involving computational comparisons of related genomes have been particularly successful.
Affiliation(s)
- Martha L Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA.
8
Chiang DY, Moses AM, Kellis M, Lander ES, Eisen MB. Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Genome Biol 2003; 4:R43. [PMID: 12844359 PMCID: PMC193630 DOI: 10.1186/gb-2003-4-7-r43] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.0] [Received: 01/28/2003] [Revised: 04/28/2003] [Accepted: 05/15/2003] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Transcriptional regulation in eukaryotes often involves multiple transcription factors binding to the same transcription control region, and to understand the regulatory content of eukaryotic genomes it is necessary to consider the co-occurrence and spatial relationships of individual binding sites. The determination of conserved sequences (often known as phylogenetic footprinting) has identified individual transcription factor binding sites. We extend this concept of functional conservation to higher-order features of transcription control regions. RESULTS We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 989 potential regulatory 'templates': pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of word pairs to define sequence features yields more specific predictions of average expression profiles and more informative regression models for genome-wide expression data than considering sequence conservation alone. CONCLUSIONS The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that are specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions.
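The notion of a word pair with non-random relative spacing can be made concrete with a small counting sketch in Python. This is only an illustration of the idea, not the authors' pipeline: it ignores cross-species conservation and any significance testing, and the function names, the `max_gap` cutoff, and the toy sequences are all hypothetical.

```python
from collections import Counter


def word_positions(seq, word):
    """Start indices of every (possibly overlapping) occurrence of `word` in `seq`."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def pair_spacings(sequences, word_a, word_b, max_gap=50):
    """Histogram of start-to-start distances from word_a to a downstream word_b."""
    spacing = Counter()
    for seq in sequences:
        starts_b = word_positions(seq, word_b)
        for a in word_positions(seq, word_a):
            for b in starts_b:
                distance = b - a
                if 0 < distance <= max_gap:
                    spacing[distance] += 1
    return spacing


# Toy promoter fragments (made up): both place the second hexamer 10 bp downstream.
promoters = ["ACGCGTGGGGTGATAA", "CCACGCGTAAAATGATAAC"]
histogram = pair_spacings(promoters, "ACGCGT", "TGATAA")
print(histogram.most_common(3))
```

A spacing histogram that piles up on one or a few distances, compared with what shuffled sequences give, is the kind of signal the abstract describes as non-random relative spacing.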
Affiliation(s)
- Derek Y Chiang
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
- Alan M Moses
- Graduate Group in Biophysics, University of California, Berkeley, CA 94720, USA
- Manolis Kellis
- Whitehead/MIT Center for Genome Research, Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Eric S Lander
- Whitehead/MIT Center for Genome Research, Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Michael B Eisen
- Department of Genome Sciences, Life Sciences Division, Ernest Orlando Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Center for Integrative Genomics and Division of Genetics and Development, Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
9
Coin L, Bateman A, Durbin R. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci U S A 2003; 100:4516-20. [PMID: 12668763 PMCID: PMC404693 DOI: 10.1073/pnas.0737502100] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.1] [Indexed: 11/18/2022] Open
Abstract
Most modern speech recognition uses probabilistic models to interpret a sequence of sounds. Hidden Markov models, in particular, are used to recognize words. The same techniques have been adapted to find domains in protein sequences of amino acids. To increase word accuracy in speech recognition, language models are used to capture the information that certain word combinations are more likely than others, thus improving detection based on context. However, to date, these context techniques have not been applied to protein domain discovery. Here we show that the application of statistical language modeling methods can significantly enhance domain recognition in protein sequences. As an example, we discover an unannotated Tf_Otx Pfam domain on the cone rod homeobox protein, which suggests a possible mechanism for how the V242M mutation on this protein causes cone-rod dystrophy.
Affiliation(s)
- Lachlan Coin
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, United Kingdom
10
Frith MC, Spouge JL, Hansen U, Weng Z. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res 2002; 30:3214-24. [PMID: 12136103 PMCID: PMC135758 DOI: 10.1093/nar/gkf438] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.1] [Indexed: 11/13/2022] Open
Abstract
The human genome encodes the transcriptional control of its genes in clusters of cis-elements that constitute enhancers, silencers and promoter signals. The sequence motifs of individual cis-elements are usually too short and degenerate for confident detection. In most cases, the requirements for organization of cis-elements within these clusters are poorly understood. Therefore, we have developed a general method to detect local concentrations of cis-element motifs, using predetermined matrix representations of the cis-elements, and calculate the statistical significance of these motif clusters. The statistical significance calculation is highly accurate not only for idealized, pseudorandom DNA, but also for real human DNA. We use our method 'cluster of motifs E-value tool' (COMET) to make novel predictions concerning the regulation of genes by transcription factors associated with muscle. COMET performs comparably with two alternative state-of-the-art techniques, which are more complex and lack E-value calculations. Our statistical method enables us to clarify the major bottleneck in the hard problem of detecting cis-regulatory regions, which is that many known enhancers do not contain very significant clusters of the motif types that we search for. Thus, discovery of additional signals that belong to these regulatory regions will be the key to future progress.
Affiliation(s)
- Martin C Frith
- Bioinformatics Program, Boston University, 44 Cummington Street, Boston MA 02215, USA