51
fREDUCE: detection of degenerate regulatory elements using correlation with expression. BMC Bioinformatics 2007; 8:399. [PMID: 17941998 PMCID: PMC2174516 DOI: 10.1186/1471-2105-8-399] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 03/06/2007] [Accepted: 10/17/2007] [Indexed: 12/30/2022]
Abstract
BACKGROUND: The precision of transcriptional regulation is made possible by the specificity of physical interactions between transcription factors and their cognate binding sites on DNA. A major challenge is to decipher transcription factor binding sites from sequence and functional genomic data using computational means. While current methods can detect strong binding sites, they are less sensitive to degenerate motifs. RESULTS: We present fREDUCE, a computational method specialized for the detection of weak or degenerate binding motifs from gene expression or ChIP-chip data. fREDUCE is built upon the widely applied program REDUCE, which elicits motifs by global statistical correlation of motif counts with expression data. fREDUCE introduces several algorithmic refinements that allow efficient exhaustive searches of oligonucleotides with a specified number of degenerate IUPAC symbols. On yeast ChIP-chip benchmarks, fREDUCE correctly identified motifs and their degeneracies with accuracies greater than its predecessor REDUCE as well as other known motif-finding programs. We have also used fREDUCE to make novel motif predictions for transcription factors with poorly characterized binding sites. CONCLUSION: We demonstrate that fREDUCE is a valuable tool for the prediction of degenerate transcription factor binding sites, especially from array datasets with weak signals that may elude other motif detection methods.
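The core idea of correlating degenerate motif counts with expression can be illustrated with a small sketch. This is not the fREDUCE implementation: the function names and the toy data are hypothetical, only the forward strand is scanned, and a plain Pearson correlation stands in for whatever statistic the program actually uses.

```python
import math
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_motif(motif, sequence):
    """Count forward-strand matches of an IUPAC motif, overlaps included."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")
    return sum(1 for _ in pattern.finditer(sequence))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def motif_expression_correlation(motif, promoters, expression):
    """Correlate per-gene motif counts with per-gene expression values."""
    genes = list(expression)
    counts = [count_motif(motif, promoters[g]) for g in genes]
    values = [expression[g] for g in genes]
    return pearson(counts, values)


# Hypothetical toy data: three promoters and their expression values.
toy_promoters = {"g1": "ACGTACGA", "g2": "ACGT", "g3": "TTTT"}
toy_expression = {"g1": 2.0, "g2": 1.0, "g3": 0.0}
toy_r = motif_expression_correlation("ACGW", toy_promoters, toy_expression)
```

An exhaustive search in this spirit would score every candidate oligonucleotide with a bounded number of degenerate symbols and keep the best-correlated ones; the sketch scores a single motif only.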
52
Shi W, Zhou W, Xu D. Identifying cis-regulatory elements by statistical analysis and phylogenetic footprinting and analyzing their coexistence and related gene ontology. Physiol Genomics 2007; 31:374-84. [PMID: 17848606 DOI: 10.1152/physiolgenomics.00085.2006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
Abstract
Discovery of cis-regulatory elements in gene promoters is a highly challenging research issue in computational molecular biology. This paper presents a novel approach to searching putative cis-regulatory elements in human promoters by first finding 8-mer sequences of high statistical significance from gene promoters of humans, mice, and Drosophila melanogaster, respectively, and then identifying the most conserved ones across the three species (phylogenetic footprinting). In this study, a conservation analysis on both closely related species (humans and mice) and distantly related species (humans/mice and Drosophila) is conducted not only to examine more candidates but also to improve the prediction accuracy. We have found 124 putative cis-regulatory elements and grouped these into 20 clusters. The investigation on the coexistence of these clusters in human gene promoters reveals that SP1, EGR, and NRF-1 are the dominant clusters appearing in the combinatorial combination of up to five clusters. Gene Ontology (GO) analysis also shows that many GO categories of transcription factors binding to these cis-regulatory elements match the GO categories of genes whose promoters contain these elements. Compared with previous research, the contribution of this study lies not only in the finding of new cis-regulatory elements, but also in its pioneering exploration on the coexistence of discovered elements and the GO relationship between transcription factors and regulated genes. This exploration verifies the putative cis-regulatory elements that have been found from this study and also gives new insight on the regulation mechanisms of gene expression.
Affiliation(s)
- Wei Shi
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia.
53
So AYL, Chaivorapol C, Bolton EC, Li H, Yamamoto KR. Determinants of cell- and gene-specific transcriptional regulation by the glucocorticoid receptor. PLoS Genet 2007; 3:e94. [PMID: 17559307 PMCID: PMC1904358 DOI: 10.1371/journal.pgen.0030094] [Citation(s) in RCA: 230] [Impact Index Per Article: 12.8] [Received: 01/09/2007] [Accepted: 04/23/2007] [Indexed: 11/19/2022]
Abstract
The glucocorticoid receptor (GR) associates with glucocorticoid response elements (GREs) and regulates selective gene transcription in a cell-specific manner. Native GREs are typically thought to be composite elements that recruit GR as well as other regulatory factors into functional complexes. We assessed whether GR occupancy is commonly a limiting determinant of GRE function as well as the extent to which core GR binding sequences and GRE architecture are conserved at functional loci. We surveyed 100-kb regions surrounding each of 548 known or potentially glucocorticoid-responsive genes in A549 human lung cells for GR-occupied GREs. We found that GR was bound in A549 cells predominately near genes responsive to glucocorticoids in those cells and not at genes regulated by GR in other cells. The GREs were positionally conserved at each responsive gene but across the set of responsive genes were distributed equally upstream and downstream of the transcription start sites, with 63% of them >10 kb from those sites. Strikingly, although the core GR binding sequences across the set of GREs varied extensively around a consensus, the precise sequence at an individual GRE was conserved across four mammalian species. Similarly, sequences flanking the core GR binding sites also varied among GREs but were conserved at individual GREs. We conclude that GR occupancy is a primary determinant of glucocorticoid responsiveness in A549 cells and that core GR binding sequences as well as GRE architecture likely harbor gene-specific regulatory information.
The glucocorticoid receptor (GR) regulates a myriad of physiological functions, such as cell differentiation and metabolism, achieved through modulating transcription in a cell- and gene-specific manner. However, the determinants that specify cell- and gene-specific GR transcriptional regulation are not well established. We describe three properties that contribute to this specificity: (1) GR occupancy at genomic glucocorticoid response elements (GREs) appears to be a primary determinant of glucocorticoid responsiveness; (2) the DNA sequences bound by GR vary widely around a consensus, but the precise sequences of individual GREs are highly conserved, suggesting a role for these sequences in gene-specific GR transcriptional regulation; and (3) native chromosomal GREs were generally found to be composite elements, comprised of multiple factor binding sites that were highly variable in composition, but as with the GR binding sequences, highly conserved at individual GREs. In addition, we discovered that most GREs were positioned far from their GR target genes and that they were equally distributed upstream and downstream of the target genes. These findings, which may be applicable to other regulatory factors, provide fundamental insights for understanding cell- and gene-specific transcriptional regulation.
Affiliation(s)
- Alex Yick-Lun So
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, California, United States of America
- Chemistry and Chemical Biology Graduate Program, University of California San Francisco, San Francisco, California, United States of America
- Christina Chaivorapol
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America
- Eric C Bolton
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, California, United States of America
- Hao Li
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California, United States of America
- Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America
- Keith R Yamamoto
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, California, United States of America
- Chemistry and Chemical Biology Graduate Program, University of California San Francisco, San Francisco, California, United States of America
- * To whom correspondence should be addressed. E-mail:
54
Segal L, Lapidot M, Solan Z, Ruppin E, Pilpel Y, Horn D. Nucleotide variation of regulatory motifs may lead to distinct expression patterns. Bioinformatics 2007; 23:i440-9. [PMID: 17646329 DOI: 10.1093/bioinformatics/btm183] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Current methodologies for the selection of putative transcription factor binding sites (TFBS) rely on various assumptions such as over-representation of motifs occurring on gene promoters, and the use of motif descriptions such as consensus or position-specific scoring matrices (PSSMs). In order to avoid bias introduced by such assumptions, we apply an unsupervised motif extraction (MEX) algorithm to sequences of promoters. The extracted motifs are assessed for their likely cis-regulatory function by calculating the expression coherence (EC) of the corresponding genes, across a set of biological conditions. RESULTS: Applying MEX to all Saccharomyces cerevisiae promoters, followed by EC analysis across 40 biological conditions, we obtained a high percentage of putative cis-regulatory motifs. We clustered motifs that obtained highly significant EC scores, based on both their sequence similarity and similarity in the biological conditions these motifs appear to regulate. We describe 20 clusters, some of which regroup known TFBS. The clusters display different mRNA expression profiles, correlated with typical changes in the nucleotide composition of their relevant motifs. In several cases, a variation of a single nucleotide is shown to lead to distinct differences in expression patterns. These results are confronted with additional information, such as binding of transcription factors to groups of genes. Detailed analysis is presented for clusters related to MCB/SCB, STRE and PAC. In the first two cases, we provide evidence for different binding mechanisms of different clusters of motifs. For PAC-related motifs we uncover a new cluster that has so far been overshadowed by the stronger effects of known PAC motifs. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://adios.tau.ac.il/regmotifs and at Bioinformatics online.
Affiliation(s)
- Liat Segal
- Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
55
Bolton EC, So AY, Chaivorapol C, Haqq CM, Li H, Yamamoto KR. Cell- and gene-specific regulation of primary target genes by the androgen receptor. Genes Dev 2007; 21:2005-17. [PMID: 17699749 PMCID: PMC1948856 DOI: 10.1101/gad.1564207] [Citation(s) in RCA: 262] [Impact Index Per Article: 14.6] [Received: 04/23/2007] [Accepted: 07/06/2007] [Indexed: 01/08/2023]
Abstract
The androgen receptor (AR) mediates the physiologic and pathophysiologic effects of androgens including sexual differentiation, prostate development, and cancer progression by binding to genomic androgen response elements (AREs), which influence transcription of AR target genes. The composition and context of AREs differ between genes, thus enabling AR to confer multiple regulatory functions within a single nucleus. We used expression profiling of an immortalized human prostate epithelial cell line to identify 205 androgen-responsive genes (ARGs), most of them novel. In addition, we performed chromatin immunoprecipitation to identify 524 AR binding regions and validated in reporter assays the ARE activities of several such regions. Interestingly, 67% of our AREs resided within approximately 50 kb of the transcription start sites of 84% of our ARGs. Indeed, most ARGs were associated with two or more AREs, and ARGs were sometimes themselves linked in gene clusters containing up to 13 AREs and 12 ARGs. AREs appeared typically to be composite elements, containing AR binding sequences adjacent to binding motifs for other transcriptional regulators. Functionally, ARGs were commonly involved in prostate cell proliferation, communication, differentiation, and possibly cancer progression. Our results provide new insights into cell- and gene-specific mechanisms of transcriptional regulation of androgen-responsive gene networks.
Affiliation(s)
- Eric C. Bolton
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94143, USA
- Alex Y. So
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94143, USA
- Chemistry and Chemical Biology Graduate Program, University of California, San Francisco, California 94143, USA
- Christina Chaivorapol
- Department of Biochemistry and Biophysics, University of California, San Francisco, California 94143, USA
- California Institute for Quantitative Biomedical Research, University of California, San Francisco, California 94143, USA
- Graduate Program in Biological and Medical Informatics, University of California, San Francisco, California 94143, USA
- Christopher M. Haqq
- Department of Urology, University of California, San Francisco, California 94143, USA
- Hao Li
- Department of Biochemistry and Biophysics, University of California, San Francisco, California 94143, USA
- California Institute for Quantitative Biomedical Research, University of California, San Francisco, California 94143, USA
- Graduate Program in Biological and Medical Informatics, University of California, San Francisco, California 94143, USA
- Keith R. Yamamoto
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California 94143, USA
- Chemistry and Chemical Biology Graduate Program, University of California, San Francisco, California 94143, USA
56
Lladser ME, Betterton MD, Knight R. Multiple pattern matching: a Markov chain approach. J Math Biol 2007; 56:51-92. [PMID: 17668213 DOI: 10.1007/s00285-007-0109-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 01/01/2007] [Revised: 05/15/2007] [Indexed: 10/23/2022]
Abstract
RNA motifs typically consist of short, modular patterns that include base pairs formed within and between modules. Estimating the abundance of these patterns is of fundamental importance for assessing the statistical significance of matches in genomewide searches, and for predicting whether a given function has evolved many times in different species or arose from a single common ancestor. In this manuscript, we review in an integrated and self-contained manner some basic concepts of automata theory, generating functions and transfer matrix methods that are relevant to pattern analysis in biological sequences. We formalize, in a general framework, the concept of Markov chain embedding to analyze patterns in random strings produced by a memoryless source. This conceptualization, together with the capability of automata to recognize complicated patterns, allows a systematic analysis of problems related to the occurrence and frequency of patterns in random strings. The applications we present focus on the concept of synchronization of automata, as well as automata used to search for a finite number of keywords (including sets of patterns generated according to base pairing rules) in a general text.
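As a minimal illustration of Markov chain embedding, the sketch below builds a prefix-matching automaton for a single keyword and propagates a probability distribution over its states to obtain the chance that the keyword occurs at least once in a random string from a memoryless source. It handles one keyword only and is not the paper's general framework; all function names are hypothetical.

```python
from itertools import product


def build_automaton(pattern, alphabet):
    """Transition table; state k = length of the longest pattern prefix matched."""
    m = len(pattern)
    table = []
    for k in range(m):
        row = {}
        for symbol in alphabet:
            text = pattern[:k] + symbol
            # Longest suffix of `text` that is also a prefix of `pattern`.
            j = len(text)
            while j > 0 and text[len(text) - j:] != pattern[:j]:
                j -= 1
            row[symbol] = j
        table.append(row)
    return table


def occurrence_probability(pattern, n, probs):
    """P(pattern occurs at least once in a random string of length n)."""
    m = len(pattern)
    table = build_automaton(pattern, list(probs))
    dist = [0.0] * (m + 1)
    dist[0] = 1.0
    for _ in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]  # state m (pattern seen) is absorbing
        for k in range(m):
            for symbol, p in probs.items():
                new[table[k][symbol]] += dist[k] * p
        dist = new
    return dist[m]


def brute_force_probability(pattern, n, probs):
    """Exhaustive cross-check, feasible only for small n."""
    total = 0.0
    for letters in product(probs, repeat=n):
        if pattern in "".join(letters):
            weight = 1.0
            for symbol in letters:
                weight *= probs[symbol]
            total += weight
    return total


uniform = {s: 0.25 for s in "ACGT"}
p_embedded = occurrence_probability("ACA", 5, uniform)
p_enumerated = brute_force_probability("ACA", 5, uniform)
```

The embedded chain needs work proportional to n times the number of states, whereas the enumeration grows exponentially in n; the two should agree on small inputs.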
Affiliation(s)
- Manuel E Lladser
- Department of Applied Mathematics, University of Colorado at Boulder, 526 UCB, Boulder, CO 80309-0526, USA.
57
Cho KH, Choo SM, Jung SH, Kim JR, Choi HS, Kim J. Reverse engineering of gene regulatory networks. IET Syst Biol 2007; 1:149-63. [PMID: 17591174 DOI: 10.1049/iet-syb:20060075] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.1] [Indexed: 11/20/2022]
Abstract
Systems biology is a multi-disciplinary approach to the study of the interactions of various cellular mechanisms and cellular components. Owing to the development of new technologies that simultaneously measure the expression of genetic information, systems biological studies involving gene interactions are increasingly prominent. In this regard, reconstructing gene regulatory networks (GRNs) forms the basis for the dynamical analysis of gene interactions and related effects on cellular control pathways. Various approaches of inferring GRNs from gene expression profiles and biological information, including machine learning approaches, have been reviewed, with a brief introduction of DNA microarray experiments as typical tools for measuring levels of messenger ribonucleic acid (mRNA) expression. In particular, the inference methods are classified according to the required input information, and the main idea of each method is elucidated by comparing its advantages and disadvantages with respect to the other methods. In addition, recent developments in this field are introduced and discussions on the challenges and opportunities for future research are provided.
Affiliation(s)
- K H Cho
- College of Medicine, Seoul National University, Jongnogu, Seoul 110-799, South Korea.
58
Zhou Q, Wong WH. Coupling hidden Markov models for the discovery of Cis-regulatory modules in multiple species. Ann Appl Stat 2007. [DOI: 10.1214/07-aoas103] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
59
Teif VB. General transfer matrix formalism to calculate DNA-protein-drug binding in gene regulation: application to OR operator of phage lambda. Nucleic Acids Res 2007; 35:e80. [PMID: 17526526 PMCID: PMC1920246 DOI: 10.1093/nar/gkm268] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Received: 02/02/2007] [Revised: 04/09/2007] [Accepted: 04/09/2007] [Indexed: 11/24/2022]
Abstract
The transfer matrix methodology is proposed as a systematic tool for the statistical-mechanical description of DNA-protein-drug binding involved in gene regulation. We show that a genetic system of several cis-regulatory modules is calculable using this method, considering explicitly the site-overlapping, competitive, cooperative binding of regulatory proteins, their multilayer assembly and DNA looping. In the methodological section, the matrix models are solved for the basic types of short- and long-range interactions between DNA-bound proteins, drugs and nucleosomes. We apply the matrix method to gene regulation at the O(R) operator of phage lambda. The transfer matrix formalism allowed the description of the lambda-switch at a single-nucleotide resolution, taking into account the effects of a range of inter-protein distances. Our calculations confirm previously established roles of the contact CI-Cro-RNAP interactions. Concerning long-range interactions, we show that while the DNA loop between the O(R) and O(L) operators is important at the lysogenic CI concentrations, the interference between the adjacent promoters P(R) and P(RM) becomes more important at small CI concentrations. A large change in the expression pattern may arise in this regime due to anticooperative interactions between DNA-bound RNA polymerases. The applicability of the matrix method to more complex systems is discussed.
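A toy version of the transfer matrix idea may help. The sketch assumes a deliberately simple model that is not taken from the paper: a linear lattice of identical sites, a ligand that covers exactly one site with statistical weight `kc` (binding constant times concentration), and a cooperativity factor `w` for ligands bound at adjacent sites. Site overlap, multilayer assembly and DNA looping, which the abstract mentions, are not modelled.

```python
import math


def partition_function(n_sites, kc, w):
    """Partition function of the toy lattice via a 2x2 transfer matrix.

    The transfer matrix, indexed by (previous site state, current site state)
    with states (free, bound), is [[1, kc], [1, kc * w]].
    """
    # Row vector of weights for the last processed site; the virtual site
    # before the lattice is taken to be free.
    free, bound = 1.0, 0.0
    for _ in range(n_sites):
        free, bound = free + bound, free * kc + bound * kc * w
    return free + bound


def mean_occupancy(n_sites, kc, w, eps=1e-6):
    """Average fraction of bound sites: d ln Z / d ln kc, divided by n_sites."""
    # Central finite difference in ln(kc); an analytic derivative would also work.
    up = math.log(partition_function(n_sites, kc * math.exp(eps), w))
    down = math.log(partition_function(n_sites, kc * math.exp(-eps), w))
    return (up - down) / (2 * eps * n_sites)


z_two_sites = partition_function(2, 2.0, 3.0)       # 1 + 2*kc + kc**2 * w
theta_independent = mean_occupancy(10, 1.0, 1.0)    # w = 1: sites bind independently
```

With `w = 1` the sites are independent and the occupancy reduces to `kc / (1 + kc)`, which gives a quick sanity check on the recursion.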
Affiliation(s)
- Vladimir B Teif
- Institute of Bioorganic Chemistry, Belarus National Academy of Sciences, Street Kuprevich 5/2, 220141, Minsk, Belarus.
60
Feng J, Naiman DQ, Cooper B. Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies. Bioinformatics 2007; 23:2210-7. [PMID: 17510167 DOI: 10.1093/bioinformatics/btm267] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION: In proteomics, reverse database searching is used to control the false match frequency for tandem mass spectrum/peptide sequence matches, but reversal creates sequences devoid of patterns that usually challenge database-search software. RESULTS: We designed an unsupervised pattern recognition algorithm for detecting patterns with various lengths from large sequence datasets. The patterns found in a protein sequence database were used to create decoy databases using a Monte Carlo sampling algorithm. Searching these decoy databases led to the prediction of false positive rates for spectrum/peptide sequence matches. We show examples where this method, independent of instrumentation, database-search software and samples, provides better estimation of false positive identification rates than a prevailing reverse database searching method. The pattern detection algorithm can also be used to analyze sequences for other purposes in biology or cryptology. AVAILABILITY: On request from the authors. SUPPLEMENTARY INFORMATION: http://bioinformatics.psb.ugent.be/.
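For context, the baseline that the abstract compares against, reverse database searching, can be sketched in a few lines. The code below is a generic target/decoy illustration and not the authors' pattern-based Monte Carlo method; the `REV_` prefix, the function names and the toy scores are all assumptions made for the example.

```python
def reversed_decoys(proteins):
    """Build a decoy database by reversing every protein sequence."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def estimated_false_match_rate(matches, threshold):
    """Estimate the false match rate as decoy hits / target hits above a score.

    `matches` is a list of (score, is_decoy) pairs, one per accepted
    spectrum/peptide match from a search of targets plus decoys.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical toy data.
toy_decoys = reversed_decoys({"P1": "MKTAY"})
toy_matches = [(50, False), (45, False), (40, True), (30, False), (20, True)]
toy_rate = estimated_false_match_rate(toy_matches, threshold=35)
```

The abstract's point is that plain reversal destroys sequence patterns, so a decoy generator that preserves them would replace `reversed_decoys` while the rate estimate stays the same.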
Affiliation(s)
- Jian Feng
- Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore, Maryland, USA
61
Oiwa NN. The nucleotide sequence and the local electronic structure. J Phys Condens Matter 2007; 19:181001. [PMID: 21690977 DOI: 10.1088/0953-8984/19/18/181001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Affiliation(s)
- Nestor Norio Oiwa
- Núcleo de Neurociências e Comportamento, Departamento de Psicologia Experimental, Instituto de Psicologia, Universidade de São Paulo, Brazil. Departamento de Física Geral, Instituto de Física, Universidade de São Paulo, Brazil
62
Reddy TE, DeLisi C, Shakhnovich BE. Binding site graphs: a new graph theoretical framework for prediction of transcription factor binding sites. PLoS Comput Biol 2007; 3:e90. [PMID: 17500587 PMCID: PMC1866359 DOI: 10.1371/journal.pcbi.0030090] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 09/06/2006] [Accepted: 04/09/2007] [Indexed: 11/25/2022]
Abstract
Computational prediction of nucleotide binding specificity for transcription factors remains a fundamental and largely unsolved problem. Determination of binding positions is a prerequisite for research in gene regulation, a major mechanism controlling phenotypic diversity. Furthermore, an accurate determination of binding specificities from high-throughput data sources is necessary to realize the full potential of systems biology. Unfortunately, recently performed independent evaluation showed that more than half the predictions from most widely used algorithms are false. We introduce a graph-theoretical framework to describe local sequence similarity as the pair-wise distances between nucleotides in promoter sequences, and hypothesize that densely connected subgraphs are indicative of transcription factor binding sites. Using a well-established sampling algorithm coupled with simple clustering and scoring schemes, we identify sets of closely related nucleotides and test those for known TF binding activity. Using an independent benchmark, we find our algorithm predicts yeast binding motifs considerably better than currently available techniques and without manual curation. Importantly, we reduce the number of false positive predictions in yeast to less than 30%. We also develop a framework to evaluate the statistical significance of our motif predictions. We show that our approach is robust to the choice of input promoters, and thus can be used in the context of predicting binding positions from noisy experimental data. We apply our method to identify binding sites using data from genome scale ChIP–chip experiments. Results from these experiments are publicly available at http://cagt10.bu.edu/BSG. The graphical framework developed here may be useful when combining predictions from numerous computational and experimental measures. Finally, we discuss how our algorithm can be used to improve the sensitivity of computational predictions of transcription factor binding specificities.
A historically difficult problem in computational biology is the identification of transcription factor binding sites (TFBS) in the promoters of co-regulated genes. With increasing emphasis on research in transcriptional regulation, this problem is also uniquely relevant to emerging results from recent experiments in high-throughput and systems biology. Despite extensive research in the area, recent evaluations of previously published techniques show much room for improvement. In this paper, we introduce a fundamentally new approach to the identification of TFBS. First, we start by representing nucleotides in promoters as an undirected, weighted graph. Given this representation of a binding site graph (BSG), we employ relatively simple graph clustering techniques to identify functional TFBS. We show that BSG predictions significantly outperform all previously evaluated methods in nearly every performance measure using a standardized assessment benchmark. We also find that this approach is more robust than traditional Gibbs sampling to selection of input promoters, and thus more likely to perform well under noisy experimental conditions. Finally, BSGs are very good at predicting specificity determining nucleotides. Using BSG predictions, we were able to confirm recent experimental results on binding specificity of E-box TFs CBF1 and PHO4 and predict novel specificity determining nucleotides for TYE7.
Affiliation(s)
- Timothy E Reddy
- Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, United States of America
- Charles DeLisi
- Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, United States of America
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, United States of America
- Boris E Shakhnovich
- Program in Bioinformatics and Systems Biology, Boston University, Boston, Massachusetts, United States of America
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts, United States of America
- * To whom correspondence should be addressed. E-mail:
63
Abnizova I, Subhankulova T, Gilks WR. Recent computational approaches to understand gene regulation: mining gene regulation in silico. Curr Genomics 2007; 8:79-91. [PMID: 18660846 PMCID: PMC2435357 DOI: 10.2174/138920207780368150] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 11/30/2006] [Revised: 12/13/2006] [Accepted: 12/15/2006] [Indexed: 01/03/2023]
Abstract
This paper reviews recent computational approaches to the understanding of gene regulation in eukaryotes. Cis-regulation of gene expression by the binding of transcription factors is a critical component of cellular physiology. In eukaryotes, a number of transcription factors often work together in a combinatorial fashion to enable cells to respond to a wide spectrum of environmental and developmental signals. Integration of genome sequences and/or Chromatin Immunoprecipitation on chip data with gene-expression data has facilitated in silico discovery of how the combinatorics and positioning of transcription factors binding sites underlie gene activation in a variety of cellular processes. The process of gene regulation is extremely complex and intriguing, therefore all possible points of view and related links should be carefully considered. Here we attempt to collect an inventory, not claiming it to be comprehensive and complete, of related computational biological topics covering gene regulation, which may enlighten the process, and briefly review what is currently occurring in these areas. We will consider the following computational areas:
- gene regulatory network construction;
- evolution of regulatory DNA;
- studies of its structural and statistical informational properties;
- and finally, regulatory RNA.
Affiliation(s)
- T Subhankulova
- Wellcome Trust/Cancer Research UK Gurdon Institute of Cancer and Developmental Biology, Cambridge, UK
64
Kirzhner V, Paz A, Volkovich Z, Nevo E, Korol A. Different clustering of genomes across life using the A-T-C-G and degenerate R-Y alphabets: early and late signaling on genome evolution? J Mol Evol 2007; 64:448-56. [PMID: 17479343 DOI: 10.1007/s00239-006-0178-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 08/01/2006] [Accepted: 01/11/2007] [Indexed: 10/23/2022]
Abstract
In this study, we have calculated distances between genomes based on our previously developed compositional spectra (CS) analysis. The study was conducted using genomes of 39 species of Eukarya, Eubacteria, and Archaea. Based on CS distances, we produced two different consensus dendrograms for four- and two-letter (purine-pyrimidine) alphabets. A comparison of the obtained structure using purine-pyrimidine alphabet with the standard three-kingdom (3K) scheme reveals substantial similarity. Surprisingly, this is not the case when the same procedure is based on the four-letter alphabet. In this situation, we also found three main clusters-but different from those in the 3K scheme. In particular, one of the clusters includes Eukarya and thermophilic bacteria and a part of the considered Archaea species. We speculate that the key factor in the last classification (based on the A-T-G-C alphabet) is related to ecology: two ecological parameters, temperature and oxygen, distinctly explain the clustering revealed by compositional spectra in the four-letter alphabet. Therefore, we assume that this result reflects two interdependent processes: evolutionary divergence and superimposed ecological convergence of the genomes, albeit another process, horizontal transfer, cannot be excluded as an important contributing factor.
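The effect of switching alphabets can be seen in a small sketch. The abstract does not define compositional spectra, so the code substitutes exact-match k-mer frequencies and an L1 distance as stand-ins; the function names and toy sequences are hypothetical.

```python
from collections import Counter

# Purines (A, G) map to R and pyrimidines (C, T) map to Y.
PURINE_PYRIMIDINE = str.maketrans("AGCT", "RRYY")


def to_ry(sequence):
    """Reduce a DNA sequence to the two-letter purine-pyrimidine alphabet."""
    return sequence.upper().translate(PURINE_PYRIMIDINE)


def kmer_spectrum(sequence, k):
    """Relative frequencies of all overlapping k-mers in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def spectrum_distance(spec_a, spec_b):
    """L1 distance between two k-mer frequency spectra."""
    words = set(spec_a) | set(spec_b)
    return sum(abs(spec_a.get(w, 0.0) - spec_b.get(w, 0.0)) for w in words)


seq_a, seq_b = "ACGT", "GTAC"
d_four_letter = spectrum_distance(kmer_spectrum(seq_a, 2), kmer_spectrum(seq_b, 2))
d_two_letter = spectrum_distance(
    kmer_spectrum(to_ry(seq_a), 2), kmer_spectrum(to_ry(seq_b), 2)
)
```

In this toy case the two sequences differ in the four-letter alphabet but become identical after the R-Y reduction, which is the kind of discrepancy between the two dendrograms that the abstract describes at genome scale.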
Collapse
Affiliation(s)
- V Kirzhner
- Institute of Evolution, University of Haifa, Mount Carmel, Haifa, Israel.
|
65
|
Abstract
Computational biology is a rapidly evolving area where methodologies from computer science, mathematics, and statistics are applied to address fundamental problems in biology. The study of gene regulatory information is a central problem in current computational biology. This article reviews recent development of statistical methods related to this field. Starting from microarray gene selection, we examine methods for finding transcription factor binding motifs and cis-regulatory modules in coregulated genes, and methods for utilizing information from cross-species comparisons and ChIP-chip experiments. The ultimate understanding of cis-regulatory logic in mammalian genomes may require the integration of information collected from all these steps.
Affiliation(s)
- Hongkai Ji
- Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, Massachusetts 02138, USA.
|
66
|
Reddy TE, Shakhnovich BE, Roberts DS, Russek SJ, DeLisi C. Positional clustering improves computational binding site detection and identifies novel cis-regulatory sites in mammalian GABAA receptor subunit genes. Nucleic Acids Res 2007; 35:e20. [PMID: 17204484 PMCID: PMC1807961 DOI: 10.1093/nar/gkl1062] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 06/13/2006] [Revised: 10/18/2006] [Accepted: 11/20/2006] [Indexed: 11/12/2022] Open
Abstract
Understanding transcription factor (TF) mediated control of gene expression remains a major challenge at the interface of computational and experimental biology. Computational techniques predicting TF-binding site specificity are frequently unreliable. On the other hand, comprehensive experimental validation is difficult and time consuming. We introduce a simple strategy that dramatically improves robustness and accuracy of computational binding site prediction. First, we evaluate the rate of recurrence of computational TFBS predictions by commonly used sampling procedures. We find that the vast majority of results are biologically meaningless. However, clustering results based on nucleotide position improves predictive power. Additionally, we find that positional clustering increases robustness to long or imperfectly selected input sequences. Positional clustering can also be used as a mechanism to integrate results from multiple sampling approaches for improvements in accuracy over each one alone. Finally, we predict and validate regulatory sequences partially responsible for transcriptional control of the mammalian type A gamma-aminobutyric acid receptor (GABA(A)R) subunit genes. Positional clustering is useful for improving computational binding site predictions, with potential application to improving our understanding of mammalian gene expression. In particular, predicted regulatory mechanisms in the mammalian GABA(A)R subunit gene family may open new avenues of research towards understanding this pharmacologically important neurotransmitter receptor system.
Affiliation(s)
- Timothy E. Reddy
- Bioinformatics Program, Boston University, 24 Cummington Street, Boston, MA 02215, USA
- Boris E. Shakhnovich
- Bioinformatics Program, Boston University, 24 Cummington Street, Boston, MA 02215, USA
- Daniel S. Roberts
- Laboratory of Molecular Neurobiology, Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, 715 Albany St., Boston, MA 02118, USA
- Program in BioMedical Neuroscience, Boston University, 44 Cummington Street, Boston, MA 02215, USA
- Shelley J. Russek
- Laboratory of Molecular Neurobiology, Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, 715 Albany St., Boston, MA 02118, USA
- Charles DeLisi
- Bioinformatics Program, Boston University, 24 Cummington Street, Boston, MA 02215, USA
- Laboratory of Molecular Neurobiology, Department of Pharmacology and Experimental Therapeutics, Boston University School of Medicine, 715 Albany St., Boston, MA 02118, USA
- Biomedical Engineering, Boston University, 44 Cummington Street, Boston, MA 02215, USA
|
67
|
Ayers KL, Sabatti C, Lange K. A dictionary model for haplotyping, genotype calling, and association testing. Genet Epidemiol 2007; 31:672-83. [PMID: 17487885 DOI: 10.1002/gepi.20232] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
We propose a new method for haplotyping, genotype calling, and association testing based on a dictionary model for haplotypes. In this framework, a haplotype arises as a concatenation of conserved haplotype segments, drawn from a predefined dictionary according to segment specific probabilities. The observed data consist of unphased multimarker genotypes gathered on a random sample of unrelated individuals. These genotypes are subject to mutation, genotyping errors, and missing data. The true pair of haplotypes corresponding to a person's multimarker genotype is reconstructed using a Markov chain that visits haplotype pairs according to their posterior probabilities. Our implementation of the chain alternates Gibbs steps, which rearrange the phase of a single marker, and Metropolis steps, which swap maternal and paternal haplotypes from a given marker onward. Outputs of the chain include the most likely haplotype pairs, the most likely genotypes at each marker, and the expected number of occurrences of each haplotype segment. Reconstruction accuracy is comparable to that achieved by the best existing algorithms. More importantly, the dictionary model yields expected counts of conserved haplotype segments. These imputed counts can serve as genetic predictors in association studies, as we illustrate by examples on cystic fibrosis, Friedreich's ataxia, and angiotensin-I converting enzyme levels.
Affiliation(s)
- Kristin L Ayers
- Department of Biomathematics, UCLA School of Medicine, Los Angeles, CA 90095-1766, USA.
|
68
|
|
69
|
Itzkovitz S, Tlusty T, Alon U. Coding limits on the number of transcription factors. BMC Genomics 2006; 7:239. [PMID: 16984633 PMCID: PMC1590034 DOI: 10.1186/1471-2164-7-239] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.1] [Received: 05/25/2006] [Accepted: 09/19/2006] [Indexed: 12/02/2022] Open
Abstract
Background: Transcription factor proteins bind specific DNA sequences to control the expression of genes. They contain DNA binding domains which belong to several super-families, each with a specific mechanism of DNA binding. The total number of transcription factors encoded in a genome increases with the number of genes in the genome. Here, we examined the number of transcription factors from each super-family in diverse organisms. Results: We find that the number of transcription factors from most super-families appears to be bounded. For example, the number of winged helix factors does not generally exceed 300, even in very large genomes. The magnitude of the maximal number of transcription factors from each super-family seems to correlate with the number of DNA bases effectively recognized by the binding mechanism of that super-family. Coding theory predicts that such upper bounds on the number of transcription factors should exist, in order to minimize cross-binding errors between transcription factors. This theory further predicts that factors with similar binding sequences should tend to have similar biological effect, so that errors based on mis-recognition are minimal. We present evidence that transcription factors with similar binding sequences tend to regulate genes with similar biological functions, supporting this prediction. Conclusion: The present study suggests limits on the transcription factor repertoire of cells, and suggests coding constraints that might apply more generally to the mapping between binding sites and biological function.
Affiliation(s)
- Shalev Itzkovitz
- Dept. Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Dept. Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
- Tsvi Tlusty
- Dept. Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
- Uri Alon
- Dept. Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Dept. Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
|
70
|
Murphy CT. The search for DAF-16/FOXO transcriptional targets: approaches and discoveries. Exp Gerontol 2006; 41:910-21. [PMID: 16934425 DOI: 10.1016/j.exger.2006.06.040] [Citation(s) in RCA: 151] [Impact Index Per Article: 7.9] [Received: 04/14/2006] [Revised: 06/02/2006] [Accepted: 06/12/2006] [Indexed: 11/23/2022]
Abstract
The insulin/IGF-1 receptor (IIR)/FOXO pathway is remarkably conserved in worms, flies, and mammals, and downregulation of signaling in this pathway has been shown to extend lifespan in all of these animals. FOXO-mediated transcription is required for the long lifespan of IIR mutants; thus, there is great interest in identifying FOXO target genes, as they may carry out the biochemical activities that extend longevity. A number of approaches have been used to identify the transcriptional targets of FOXO. Thus far, the best data available on the components downstream of this pathway are from experiments involving the Caenorhabditis elegans FOXO transcription factor, DAF-16; some of these targets have been tested for their contributions to longevity, dauer formation, and fat storage. Here, I examine and compare the approaches used to identify DAF-16/FOXO targets, review the genes regulated by DAF-16, and discuss the processes that may be at work to extend lifespan in IIR mutants. Rather than upregulating every possible beneficial gene, DAF-16 appears to selectively upregulate genes that contribute to specific protective mechanisms, while simultaneously downregulating potentially deleterious genes. In addition to genes that carry out expected roles in stress protection, many previously unknown targets have been identified in these studies, suggesting that some mechanisms of lifespan extension still await discovery. These mechanisms may act cooperatively or cumulatively to increase longevity, and are likely to be at least partially conserved in higher organisms.
Affiliation(s)
- Coleen T Murphy
- Department of Molecular Biology, Lewis-Sigler Institute of Genomics, 148 Carl Icahn Laboratory, Princeton University, Princeton, NJ 08544, USA.
|
71
|
GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006; 34:3585-98. [PMID: 16855295 PMCID: PMC1524905 DOI: 10.1093/nar/gkl372] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.2] [Indexed: 01/13/2023] Open
Abstract
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
Affiliation(s)
- Debraj GuhaThakurta
- Research Genetics Division, Rosetta Inpharmatics LLC, Merck & Co., Inc, 401 Terry Avenue North, Seattle, WA 98109, USA.
|
72
|
Mahony S, Benos PV, Smith TJ, Golden A. Self-organizing neural networks to support the discovery of DNA-binding motifs. Neural Netw 2006; 19:950-62. [PMID: 16839740 DOI: 10.1016/j.neunet.2006.05.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 10/24/2022]
Abstract
Identification of the short DNA sequence motifs that serve as binding targets for transcription factors is an important challenge in bioinformatics. Unsupervised techniques from the statistical learning theory literature have often been applied to motif discovery, but effective solutions for large genomic datasets have yet to be found. We present here three self-organizing neural networks that have applicability to the motif-finding problem. The core system in this study is a previously described SOM-based motif-finder named SOMBRERO. The motif-finder is integrated in this work with a SOM-based method that automatically constructs generalized models for structurally related motifs and initializes SOMBRERO with relevant biological knowledge. A self-organizing tree method that displays the relationships between various motifs is also presented, and it is shown that such a method can act as an effective structural classifier of novel motifs. The performance of the three self-organizing neural networks is evaluated here using various datasets.
Affiliation(s)
- Shaun Mahony
- Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.
|
73
|
Abstract
We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given dataset. Application of the model to simulated data gives encouraging results. In a real dataset, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
Affiliation(s)
- Kristin L Ayers
- Department of Biomathematics, University of California, Los Angeles, CA 90095-1766, USA
|
74
|
Abnizova I, Rust AG, Robinson M, Te Boekhorst R, Gilks WR. Transcription binding site prediction using Markov models. J Bioinform Comput Biol 2006; 4:425-41. [PMID: 16819793 DOI: 10.1142/s0219720006001813] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 09/30/2005] [Revised: 12/28/2005] [Accepted: 01/08/2006] [Indexed: 11/18/2022]
Abstract
One of the main goals of analysing DNA sequences is to understand the temporal and positional information that specifies gene expression. An important step in this process is the recognition of gene expression regulatory elements. Experimental procedures for this are slow and costly. In this paper we present a computational non-supervised algorithm that facilitates the process by statistically identifying the most likely regions within a putative regulatory sequence. A probabilistic technique is presented, based on the approximation of regulatory DNA with a Markov chain, for the location of putative transcription factor binding sites in a single stretch of DNA. Hereto we developed a procedure to approximate the order of Markov model for a given DNA sequence that circumvents some of the prohibitive assumptions underlying Markov modeling. Application of the algorithm to data from 55 genes in five species shows the high sensitivity of this Markov search algorithm. Our algorithm does not require any prior knowledge in the form of description or cross-genomic comparison; it is context sensitive and takes DNA heterogeneity into account.
|
75
|
|
76
|
Huber BR, Bulyk ML. Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics 2006; 7:229. [PMID: 16643658 PMCID: PMC1522027 DOI: 10.1186/1471-2105-7-229] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 10/27/2005] [Accepted: 04/27/2006] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND: A key step in the regulation of gene expression is the sequence-specific binding of transcription factors (TFs) to their DNA recognition sites. However, elucidating TF binding site (TFBS) motifs in higher eukaryotes has been challenging, even when employing cross-species sequence conservation. We hypothesized that for human and mouse, many orthologous genes expressed in a similarly tissue-specific manner in both human and mouse gene expression data, are likely to be co-regulated by orthologous TFs that bind to DNA sequence motifs present within noncoding sequence conserved between these genomes. RESULTS: We performed automated motif searching and merging across four different motif finding algorithms, followed by filtering of the resulting motifs for those that contain blocks of information content. Applying this motif finding strategy to conserved noncoding regions surrounding co-expressed tissue-specific human genes allowed us to discover both previously known, and many novel candidate, regulatory DNA motifs in all 18 tissue-specific expression clusters that we examined. For previously known TFBS motifs, we observed that if a TF was expressed in the specified tissue of interest, then in most cases we identified a motif that matched its TRANSFAC motif; conversely, of all those discovered motifs that matched TRANSFAC motifs, most of the corresponding TF transcripts were expressed in the tissue(s) corresponding to the expression cluster for which the motif was found. CONCLUSION: Our results indicate that the integration of the results from multiple motif finding tools identifies and ranks highly more known and novel motifs than does the use of just one of these tools. In addition, we believe that our simultaneous enrichment strategies helped to identify likely human cis regulatory elements. A number of the discovered motifs may correspond to novel binding site motifs for as yet uncharacterized tissue-specific TFs. We expect this strategy to be useful for identifying motifs in other metazoan genomes.
Affiliation(s)
- Bertrand R Huber
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA 02115, USA
- Martha L Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115, USA
- Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA 02115, USA
|
77
|
Sadovsky MG. Information capacity of nucleotide sequences and its applications. Bull Math Biol 2006; 68:785-806. [PMID: 16802083 DOI: 10.1007/s11538-005-9017-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 04/21/2004] [Accepted: 03/10/2005] [Indexed: 10/24/2022]
Abstract
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another one containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
Affiliation(s)
- M G Sadovsky
- Institute of Biophysics of Siberian Division of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036, Russia.
|
78
|
Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct 2006; 1:11. [PMID: 16600018 PMCID: PMC1479319 DOI: 10.1186/1745-6150-1-11] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.7] [Received: 03/29/2006] [Accepted: 04/06/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: There has been a growing interest in computational discovery of regulatory elements, and a multitude of motif discovery methods have been proposed. Computational motif discovery has been used with some success in simple organisms like yeast. However, as we move to higher organisms with more complex genomes, more sensitive methods are needed. Several recent methods try to integrate additional sources of information, including microarray experiments (gene expression and ChIP-chip). There is also a growing awareness that regulatory elements work in combination, and that this combinatorial behavior must be modeled for successful motif discovery. However, the multitude of methods and approaches makes it difficult to get a good understanding of the current status of the field. RESULTS: This paper presents a survey of methods for motif discovery in DNA, based on a structured and well defined framework that integrates all relevant elements. Existing methods are discussed according to this framework. CONCLUSION: The survey shows that although no single method takes all relevant elements into consideration, a very large number of different models treating the various elements separately have been tried. Very often the choices that have been made are not explicitly stated, making it difficult to compare different implementations. Also, the tests that have been used are often not comparable. Therefore, a stringent framework and improved test methods are needed to evaluate the different approaches in order to conclude which ones are most promising. REVIEWERS: This article was reviewed by Eugene V. Koonin, Philipp Bucher (nominated by Mikhail Gelfand) and Frank Eisenhaber.
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, NTNU – Norwegian University of Science and Technology, N-7052, Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, NTNU – Norwegian University of Science and Technology, N-7006, Trondheim, Norway
|
79
|
Abnizova I, Gilks WR. Studying statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the eukaryotic genomes. Brief Bioinform 2006; 7:48-54. [PMID: 16761364 DOI: 10.1093/bib/bbk004] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 01/22/2023] Open
Abstract
There are no well-known properties in regulatory DNA analogous to those in coding sequences; their spatial location is not regular, the consensus regulatory elements are often degenerate and there are no understandable rules governing their evolution. This makes it difficult to recognize regulatory regions within a genome. We review developments in the statistical characterization of regulatory regions and methods of their recognition in eukaryotic genomes.
|
80
|
Abstract
MOTIVATION: We present a novel algorithm, MaMF, for identifying transcription factor (TF) binding site motifs. The method is deterministic and depends on an indexing technique to optimize the search process. On common yeast datasets, MaMF performs competitively with other methods. We also present results on a challenging group of eight sets of human genes known to be responsive to a diverse group of TFs. In every case, MaMF finds the annotated motif among the top scoring putative motifs. We compared MaMF against other motif finders on a larger human group of 21 gene sets and found that MaMF performs better than other algorithms. We analyzed the remaining high scoring motifs and show that many correspond to other TFs that are known to co-occur with the annotated TF motifs. The significant and frequent presence of co-occurring transcription factor binding sites explains in part the difficulty of human motif finding. MaMF is a very fast algorithm, suitable for application to large numbers of interesting gene sets.
Affiliation(s)
- Lawrence S Hon
- UCSF Cancer Research Institute and Comprehensive Cancer Center, University of California San Francisco, CA, USA
|
81
|
Wang G, Zhang W. A steganalysis-based approach to comprehensive identification and characterization of functional regulatory elements. Genome Biol 2006; 7:R49. [PMID: 16787547 PMCID: PMC1779545 DOI: 10.1186/gb-2006-7-6-r49] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 02/03/2006] [Revised: 04/10/2006] [Accepted: 05/17/2006] [Indexed: 11/23/2022] Open
Abstract
The comprehensive identification of cis-regulatory elements on a genome scale is a challenging problem. We develop a novel, steganalysis-based approach for genome-wide motif finding, called WordSpy, by viewing regulatory regions as a stegoscript with cis-elements embedded in 'background' sequences. We apply WordSpy to the promoters of cell-cycle-related genes of Saccharomyces cerevisiae and Arabidopsis thaliana, identifying all known cell-cycle motifs with high ranking. WordSpy can discover a complete set of cis-elements and facilitate the systematic study of regulatory networks.
Affiliation(s)
- Guandong Wang
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA
- Weixiong Zhang
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA
- Department of Genetics, Washington University, St. Louis, MO 63130, USA
|
82
|
Ronen M, Botstein D. Transcriptional response of steady-state yeast cultures to transient perturbations in carbon source. Proc Natl Acad Sci U S A 2005; 103:389-94. [PMID: 16381818 PMCID: PMC1326188 DOI: 10.1073/pnas.0509978103] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.9] [Indexed: 11/18/2022] Open
Abstract
To understand the dynamics of transcriptional response to changing environments, well defined, easily controlled, and short-term perturbation experiments were undertaken. We subjected steady-state cultures of Saccharomyces cerevisiae in chemostats growing on limiting galactose to two different size pulses of glucose, well known to be a preferred carbon source. Although these pulses were not large enough to change growth rates or cell size, approximately 25% of the genes changed their expression at least 2-fold. Using DNA microarrays to estimate mRNA abundance, we found a number of distinguishable patterns of transcriptional response among the many genes whose expression changed. Many of these genes were already known to be regulated by particular transcription factors; we estimated five potentially relevant transcription factor activities from the observed changes in gene expression (i.e., Mig1p, Gal4p, Cat8p, Rgt1p, Adr1p, and Rcs1p). With these estimates, for two regulatory circuits involving interaction among multiple regulators we could generate dynamical models that quantitatively account for the observed transcriptional responses to the transient perturbations.
Affiliation(s)
- Michal Ronen
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
|
83
|
Schwartz D, Gygi SP. An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets. Nat Biotechnol 2005; 23:1391-8. [PMID: 16273072 DOI: 10.1038/nbt1146] [Citation(s) in RCA: 721] [Impact Index Per Article: 36.1] [Indexed: 11/09/2022]
Abstract
With the recent exponential increase in protein phosphorylation sites identified by mass spectrometry, a unique opportunity has arisen to understand the motifs surrounding such sites. Here we present an algorithm designed to extract motifs from large data sets of naturally occurring phosphorylation sites. The methodology relies on the intrinsic alignment of phospho-residues and the extraction of motifs through iterative comparison to a dynamic statistical background. Results show the identification of dozens of novel and known phosphorylation motifs from recently published serine, threonine and tyrosine phosphorylation studies. When applied to a linguistic data set to test the versatility of the approach, the algorithm successfully extracted hundreds of language motifs. This method, in addition to shedding light on the consensus sequences of identified and as yet unidentified kinases and modular protein domains, may also eventually be used as a tool to determine potential phosphorylation sites in proteins of interest.
Affiliation(s)
- Daniel Schwartz
- Department of Cell Biology, 240 Longwood Ave., Harvard Medical School, Boston, Massachusetts 02115, USA.
|
84
|
Mahony S, Hendrix D, Smith TJ, Golden A. Self-Organizing Maps of Position Weight Matrices for Motif Discovery in Biological Sequences. Artif Intell Rev 2005. [DOI: 10.1007/s10462-005-9011-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
|
85
|
Riva A, Carpentier AS, Torrésani B, Hénaut A. Comments on selected fundamental aspects of microarray analysis. Comput Biol Chem 2005; 29:319-36. [PMID: 16219488 DOI: 10.1016/j.compbiolchem.2005.08.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 05/26/2005] [Revised: 08/18/2005] [Accepted: 08/18/2005] [Indexed: 11/17/2022]
Abstract
Microarrays are becoming a ubiquitous tool of research in life sciences. However, the working principles of microarray-based methodologies are often misunderstood or apparently ignored by the researchers who actually perform and interpret experiments. This in turn seems to lead to a common over-expectation regarding the explanatory and/or knowledge-generating power of microarray analyses. In this note we intend to explain basic principles of five (5) major groups of analytical techniques used in studies of microarray data and their interpretation: the principal component analysis (PCA), the independent component analysis (ICA), the t-test, the analysis of variance (ANOVA), and self organizing maps (SOM). We discuss answers to selected practical questions related to the analysis of microarray data. We also take a closer look at the experimental setup and the rules, which have to be observed in order to exploit microarrays efficiently. Finally, we discuss in detail the scope and limitations of microarray-based methods. We emphasize the fact that no amount of statistical analysis can compensate for (or replace) a well thought through experimental setup. We conclude that microarrays are indeed useful tools in life sciences but by no means should they be expected to generate complete answers to complex biological questions. We argue that even well posed questions, formulated within a microarray-specific terminology, cannot be completely answered with the use of microarray analyses alone.
Affiliation(s)
- Alessandra Riva
- Laboratoire Génome et Informatique UMR 8116 Tour Evry2, 523 Place des Terrasses, 91034 Evry Cedex, France.
|
86
|
Wang G, Yu T, Zhang W. WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar. Nucleic Acids Res 2005; 33:W412-6. [PMID: 15980501 PMCID: PMC1160252 DOI: 10.1093/nar/gki492] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Received: 02/14/2005] [Revised: 04/25/2005] [Accepted: 04/25/2005] [Indexed: 11/14/2022] Open
Abstract
Transcription factor (TF) binding sites or motifs (TFBMs) are functional cis-regulatory DNA sequences that play an essential role in gene transcriptional regulation. Although many experimental and computational methods have been developed, finding TFBMs remains a challenging problem. We propose and develop a novel dictionary based motif finding algorithm, which we call WordSpy. One significant feature of WordSpy is the combination of a word counting method and a statistical model which consists of a dictionary of motifs and a grammar specifying their usage. The algorithm is suitable for genome-wide motif finding; it is capable of discovering hundreds of motifs from a large set of promoters in a single run. We further enhance WordSpy by applying gene expression information to separate true TFBMs from spurious ones, and by incorporating negative sequences to identify discriminative motifs. In addition, we also use randomly selected promoters from the genome to evaluate the significance of the discovered motifs. The output from WordSpy consists of an ordered list of putative motifs and a set of regulatory sequences with motif binding sites highlighted. The web server of WordSpy is available at http://cic.cs.wustl.edu/wordspy.
Affiliation(s)
- Guandong Wang
- Department of Computer Science and Engineering, Washington University in Saint Louis, Saint Louis, MO 63130, USA
- Taotao Yu
- Department of Computer Science and Engineering, Washington University in Saint Louis, Saint Louis, MO 63130, USA
- Weixiong Zhang
- Department of Computer Science and Engineering, Washington University in Saint Louis, Saint Louis, MO 63130, USA
- Department of Genetics, Washington University in Saint Louis, Saint Louis, MO 63130, USA
|
87
|
Sivaraman K, Seshasayee ASN, Swaminathan K, Muthukumaran G, Pennathur G. Promoter addresses: revelations from oligonucleotide profiling applied to the Escherichia coli genome. Theor Biol Med Model 2005; 2:20. [PMID: 15927055 PMCID: PMC1166578 DOI: 10.1186/1742-4682-2-20] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 04/19/2005] [Accepted: 05/31/2005] [Indexed: 11/17/2022] Open
Abstract
Background Transcription is the first step in cellular information processing. It is regulated by cis-acting elements such as promoters and operators in the DNA, and trans-acting elements such as transcription factors and sigma factors. Identification of cis-acting regulatory elements on a genomic scale requires computational analysis. Results We have used oligonucleotide profiling to predict regulatory regions in a bacterial genome. The method has been applied to the Escherichia coli K12 genome and the results analyzed. The information content of the putative regulatory oligonucleotides so predicted is validated through intra-genomic analyses, correlations with experimental data and inter-genome comparisons. Based on the results we have proposed a model for the bacterial promoter. The results show that the method is capable of identifying, in the E.coli genome, cis-acting elements such as TATAAT (sigma70 binding site), CCCTAT (1 base relative of sigma32 binding site), CTATNN (LexA binding site), AGGA-containing hexanucleotides (Shine Dalgarno consensus) and CTAG-containing hexanucleotides (core binding sites for Trp and Met repressors). Conclusion The method adopted is simple yet effective in predicting upstream regulatory elements in bacteria. It does not need any prior experimental data except the sequence itself. This method should be applicable to most known genomes. Profiling, as applied to the E.coli genome, picks up known cis-acting and regulatory elements. Based on the profile results, we propose a model for the bacterial promoter that is extensible even to eukaryotes. The model is that the core promoter lies within a plateau of bent AT-rich DNA. This bent DNA acts as a homing segment for the sigma factor to recognize the promoter. The model thus suggests an important role for local landscapes in prokaryotic and eukaryotic gene regulation.
Affiliation(s)
- Gautam Pennathur
- Centre for Biotechnology, Anna University, Chennai, India
- AU-KBC for Research, MIT Campus, Anna University, Chennai, India
|
88
|
Aalberts DP, Daub EG, Dill JW. Quantifying optimal accuracy of local primary sequence bioinformatics methods. Bioinformatics 2005; 21:3347-51. [PMID: 15923206 DOI: 10.1093/bioinformatics/bti521] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Traditional bioinformatics methods scan primary sequences for local patterns. It is important to assess how accurate local primary sequence methods can be. RESULTS We study the problem of donor pre-mRNA splice site recognition, where the sequence overlaps between real and decoy datasets can be quantified, exposing the intrinsic limitations of the performance of local primary sequence methods. We assess the accuracy of primary sequence methods generally by studying how they scale with dataset size and demonstrate that our new primary sequence ranking methods have superior performance.
Affiliation(s)
- Daniel P Aalberts
- Department of Physics, Williams College, Williamstown, MA 01267, USA.
|
89
|
Gupta M, Liu JS. De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Natl Acad Sci U S A 2005; 102:7079-84. [PMID: 15883375 PMCID: PMC1129096 DOI: 10.1073/pnas.0408743102] [Citation(s) in RCA: 91] [Impact Index Per Article: 4.6] [Received: 11/23/2004] [Indexed: 11/18/2022] Open
Abstract
Transcription regulation is controlled by coordinated binding of one or more transcription factors in the promoter regions of genes. In many species, especially higher eukaryotes, transcription factor binding sites tend to occur as homotypic or heterotypic clusters, also known as cis-regulatory modules. The number of sites and distances between the sites, however, vary greatly in a module. We propose a statistical model to describe the underlying cluster structure as well as individual motif conservation and develop a Monte Carlo motif screening strategy for predicting novel regulatory modules in upstream sequences of coregulated genes. We demonstrate the power of the method with examples ranging from bacterial to insect and human genomes.
Affiliation(s)
- Mayetri Gupta
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA.
|
90
|
Chen HD, Chang CH, Hsieh LC, Lee HC. Divergence and Shannon information in genomes. Phys Rev Lett 2005; 94:178103. [PMID: 15904339 DOI: 10.1103/physrevlett.94.178103] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/23/2004] [Indexed: 05/02/2023]
Abstract
Shannon information (SI) and its special case, divergence, are defined for a DNA sequence in terms of probabilities of chemical words in the sequence and are computed for a set of complete genomes highly diverse in length and composition. We find the following: SI (but not divergence) is inversely proportional to sequence length for a random sequence but is length independent for genomes; the genomic SI is always greater and, for shorter words and longer sequences, hundreds to thousands of times greater than the SI in a random sequence whose length and composition match those of the genome; genomic SIs appear to have word-length dependent universal values. The universality is inferred to be an evolution footprint of a universal mode for genome growth.
Affiliation(s)
- Hong-Da Chen
- Department of Physics, National Central University, Chungli, Taiwan 320, Republic of China
|
91
|
Abstract
How is the information from a thousand gene-expression arrays, the location of more than two hundred regulatory factors, and nine sequenced genomes to be integrated into a global view of the regulatory network in budding yeast? Computational methods that fit incomplete noisy data provide the outlines of regulatory pathways, but the errors are not quantified. In the fly, embryonic patterning has proved amenable to computational prediction, but only when the DNA-binding preferences of the relevant factors are taken into account. In both these model organisms, simply restricting attention to regulatory sequences that align with related species (i.e. "conserved") discards much information regarding what is functional.
Affiliation(s)
- Eric D Siggia
- Center for Studies in Physics and Biology, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA.
|
92
|
Marinescu VD, Kohane IS, Riva A. MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes. BMC Bioinformatics 2005; 6:79. [PMID: 15799782 PMCID: PMC1131891 DOI: 10.1186/1471-2105-6-79] [Citation(s) in RCA: 161] [Impact Index Per Article: 8.1] [Received: 08/06/2004] [Accepted: 03/30/2005] [Indexed: 12/19/2022] Open
Abstract
Background Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity. Results We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several cases tested, the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. Conclusion The search engine, available at , allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Due to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.
Affiliation(s)
- Voichita D Marinescu
- Children's Hospital Informatics Program, Children's Hospital Boston, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA
- Isaac S Kohane
- Children's Hospital Informatics Program, Children's Hospital Boston, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA
- Alberto Riva
- Children's Hospital Informatics Program, Children's Hospital Boston, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA
|
93
|
Cole SW, Yan W, Galic Z, Arevalo J, Zack JA. Expression-based monitoring of transcription factor activity: the TELiS database. Bioinformatics 2005; 21:803-10. [PMID: 15374858 DOI: 10.1093/bioinformatics/bti038] [Citation(s) in RCA: 141] [Impact Index Per Article: 7.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION In microarray studies it is often of interest to identify upstream transcription control pathways mediating observed changes in gene expression. The Transcription Element Listening System (TELiS) combines sequence-based analysis of gene regulatory regions with statistical prevalence analyses to identify transcription-factor binding motifs (TFBMs) that are over-represented among the promoters of up- or down-regulated genes. Efficiency is maximized by decomposing the problem into two steps: (1) a priori compilation of prevalence matrices specifying the number of putative binding sites for a variety of transcription factors in promoters from all genes assayed by a given microarray, and (2) real-time statistical analysis of pre-compiled prevalence matrices to identify TFBMs that are over- or under-represented in promoters of differentially expressed genes. The interlocking JAVA applications, namely PromoterScan and PromoterStats, carry out these tasks and together constitute the TELiS database for reverse inference of transcription factor activity. RESULTS In two validation studies, TELiS accurately detected in vivo activation of NF-kappaB and the Type I interferon system by HIV-1 infection and pharmacologic activation of the glucocorticoid receptor in peripheral blood mononuclear cells. The population-based statistical inference underlying TELiS out-performed conventional statistical tests in analytic sensitivity, with parametric studies demonstrating accurate identification of transcription factor activity from as few as 20 differentially expressed genes. TELiS thus provides a simple, rapid and sensitive tool for identifying transcription control pathways mediating observed gene expression dynamics.
Affiliation(s)
- Steve W Cole
- UCLA Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1678, USA.
|
94
|
Zhang Z, Gu J, Gu X. How much expression divergence after yeast gene duplication could be explained by regulatory motif evolution? Trends Genet 2005; 20:403-7. [PMID: 15313547 DOI: 10.1016/j.tig.2004.07.006] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.6] [Indexed: 10/26/2022]
Abstract
We used the yeast genome sequences of gene families, microarray profiles and regulatory motif data to test the current wisdom that there is a strong correlation between regulatory motif structure and gene expression profile. Our results suggest that duplicate genes tend to be co-expressed but the correlation between motif content and expression similarity is generally poor: only approximately 2-3% of expression variation can be explained by motif divergence. Our observations suggest that, in addition to the cis-regulatory motif structure in the upstream region of the gene, multiple trans-acting factors in the gene network can influence the pattern of gene expression significantly.
Affiliation(s)
- Zhongqi Zhang
- Department of Genetics, Development & Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, IA 50011, USA
|
95
|
McCarroll SA, Li H, Bargmann CI. Identification of Transcriptional Regulatory Elements in Chemosensory Receptor Genes by Probabilistic Segmentation. Curr Biol 2005; 15:347-52. [PMID: 15723796 DOI: 10.1016/j.cub.2005.02.023] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Received: 11/23/2004] [Revised: 12/19/2004] [Accepted: 12/21/2004] [Indexed: 11/16/2022]
Abstract
Genome sequencing has allowed many gene regulatory elements to be identified through cross-species comparisons. However, the expression of genes in multigene families can diverge rapidly between related species. An alternative approach to characterizing multigene families utilizes the fact that genes within the group are likely to share aspects of their regulation. Here, we use a statistical approach, probabilistic segmentation, to identify sequences that are overrepresented in the regions upstream of C. elegans chemosensory receptor genes. Although each of these elements is present in only a subset of the genes, their distribution across and within the promoters of chemosensory receptor genes makes it possible to detect them. Many of the motifs show positional preference with respect to the translational start site and correspond to the binding sites of known families of transcription factors. We verified one motif, the E-box sequence WWYCACSTGYY, by showing that it directs expression of reporter genes to the ADL chemosensory neurons. Thus, probabilistic segmentation can be used to identify functional regulatory elements with no previous knowledge of gene expression or regulation. This approach may be of particular value for rapidly evolving genes in the immune system and the nervous system.
Affiliation(s)
- Steven A McCarroll
- Department of Anatomy, University of California, San Francisco, San Francisco, CA 94143 USA
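The degenerate E-box consensus WWYCACSTGYY quoted in the abstract above is written with IUPAC nucleotide codes (W = A or T, Y = C or T, S = C or G). The following is a minimal sketch of how such a consensus could be matched against a sequence by expanding each code into a regular-expression character class. It is an illustration only and is not the probabilistic segmentation method of the paper; the function names `iupac_to_regex` and `find_motif` and the example sequence are hypothetical, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes, each expanded to a regex character class.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[symbol] for symbol in motif.upper())


def find_motif(sequence, motif):
    """Return (start, matched_site) pairs for every forward-strand match.

    A lookahead is used so that overlapping occurrences are also reported.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical 15-nt sequence containing one instance of the consensus.
    print(find_motif("GGATCCACGTGCTAA", "WWYCACSTGYY"))  # [(2, 'ATCCACGTGCT')]
```

A regular expression treats every allowed base at a degenerate position as equally good; weight-matrix or dictionary models, as used by several methods in this list, score positions probabilistically instead.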
|
96
|
Mahony S, Hendrix D, Golden A, Smith TJ, Rokhsar DS. Transcription factor binding site identification using the self-organizing map. Bioinformatics 2005; 21:1807-14. [PMID: 15647296 DOI: 10.1093/bioinformatics/bti256] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.8] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The automatic identification of over-represented motifs present in a collection of sequences continues to be a challenging problem in computational biology. In this paper, we propose a self-organizing map of position weight matrices as an alternative method for motif discovery. The advantage of this approach is that it can be used to simultaneously characterize every feature present in the dataset, thus lessening the chance that weaker signals will be missed. Features identified are ranked in terms of over-representation relative to a background model. RESULTS We present an implementation of this approach, named SOMBRERO (self-organizing map for biological regulatory element recognition and ordering), which is capable of discovering multiple distinct motifs present in a single dataset. Demonstrated here are the advantages of our approach on various datasets and SOMBRERO's improved performance over two popular motif-finding programs, MEME and AlignACE. AVAILABILITY SOMBRERO is available free of charge from http://bioinf.nuigalway.ie/sombrero SUPPLEMENTARY INFORMATION http://bioinf.nuigalway.ie/sombrero/additional.
Affiliation(s)
- Shaun Mahony
- National Centre for Biomedical Engineering Science, NUI Galway, Ireland.
|
97
|
Ganapathiraju M, Balakrishnan N, Reddy R, Klein-Seetharaman J. Computational Biology and Language. Ambient Intelligence for Scientific Discovery 2005. [DOI: 10.1007/978-3-540-32263-4_2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
|
98
|
Adaptive evolution of transcription factor binding sites. BMC Evol Biol 2004; 4:42. [PMID: 15511291 PMCID: PMC535555 DOI: 10.1186/1471-2148-4-42] [Citation(s) in RCA: 140] [Impact Index Per Article: 6.7] [Received: 07/20/2004] [Accepted: 10/28/2004] [Indexed: 11/18/2022] Open
Abstract
Background The regulation of a gene depends on the binding of transcription factors to specific sites located in the regulatory region of the gene. The generation of these binding sites and of cooperativity between them are essential building blocks in the evolution of complex regulatory networks. We study a theoretical model for the sequence evolution of binding sites by point mutations. The approach is based on biophysical models for the binding of transcription factors to DNA. Hence we derive empirically grounded fitness landscapes, which enter a population genetics model including mutations, genetic drift, and selection. Results We show that the selection for factor binding generically leads to specific correlations between nucleotide frequencies at different positions of a binding site. We demonstrate the possibility of rapid adaptive evolution generating a new binding site for a given transcription factor by point mutations. The evolutionary time required is estimated in terms of the neutral (background) mutation rate, the selection coefficient, and the effective population size. Conclusions The efficiency of binding site formation is seen to depend on two joint conditions: the binding site motif must be short enough and the promoter region must be long enough. These constraints on promoter architecture are indeed seen in eukaryotic systems. Furthermore, we analyse the adaptive evolution of genetic switches and of signal integration through binding cooperativity between different sites. Experimental tests of this picture involving the statistics of polymorphisms and phylogenies of sites are discussed.
|
99
|
Sabatti C, Rohlin L, Lange K, Liao JC. Vocabulon: a dictionary model approach for reconstruction and localization of transcription factor binding sites. Bioinformatics 2004; 21:922-31. [PMID: 15509602 DOI: 10.1093/bioinformatics/bti083] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Gene expression arrays enable measurements of transcription values for a large number or all genes in the genome. In order to better interpret these results and to use them to reconstruct transcription networks, information on location of binding sites for regulatory proteins in the entire genome is needed. In particular, this represents an open problem in Escherichia coli. RESULTS We describe the first application of dictionary-style models to the study of transcription factor binding sites in an entire genome. Vocabulon's unique feature is that it can both reconstruct binding sites characterized by unknown motifs and impute locations of known binding sites in long sequences by simultaneous search. On one hand, the dictionary model specifies a probability for the entire sequence taking simultaneously into account all the possible binding sites. This greatly reduces the number of false positives. On the other hand, the possibility of refining motif description, as an increasing number of binding sites are identified, augments the sensitivity of the method. We illustrate these properties with examples in E.coli. The results of gene expression arrays are used both to guide the search and corroborate it.
Affiliation(s)
- Chiara Sabatti
- Department of Human Genetics, UCLA Los Angeles, CA 90095-7088, USA.
|
100
|
Berg J, Lässig M. Local graph alignment and motif search in biological networks. Proc Natl Acad Sci U S A 2004; 101:14689-94. [PMID: 15448202 PMCID: PMC522014 DOI: 10.1073/pnas.0305199101] [Citation(s) in RCA: 145] [Impact Index Per Article: 6.9] [Indexed: 11/18/2022] Open
Abstract
Interaction networks are of central importance in postgenomic molecular biology, with increasing amounts of data becoming available by high-throughput methods. Examples are gene regulatory networks or protein interaction maps. The main challenge in the analysis of these data is to read off biological functions from the topology of the network. Topological motifs, i.e., patterns occurring repeatedly at different positions in the network, have recently been identified as basic modules of molecular information processing. In this article, we discuss motifs derived from families of mutually similar but not necessarily identical patterns. We establish a statistical model for the occurrence of such motifs, from which we derive a scoring function for their statistical significance. Based on this scoring function, we develop a search algorithm for topological motifs called graph alignment, a procedure with some analogies to sequence alignment. The algorithm is applied to the gene regulation network of Escherichia coli.
Affiliation(s)
- Johannes Berg
- Institut für Theoretische Physik, Universität zu Köln, Zülpicherstrasse 77, 50937 Cologne, Germany.
|