1
Lengyel IM, Morelli LG. Multiple binding sites for transcriptional repressors can produce regular bursting and enhance noise suppression. Phys Rev E 2017; 95:042412. [PMID: 28505727 DOI: 10.1103/physreve.95.042412] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/24/2016] [Indexed: 06/07/2023]
Abstract
Cells may control fluctuations in protein levels by means of negative autoregulation, where transcription factors bind DNA sites to repress their own production. Theoretical studies have assumed a single binding site for the repressor, while in most species it is found that multiple binding sites are arranged in clusters. We study a stochastic description of negative autoregulation with multiple binding sites for the repressor. We find that increasing the number of binding sites induces regular bursting of gene products. By tuning the threshold for repression, we show that multiple binding sites can also suppress fluctuations. Our results highlight possible roles for the presence of multiple binding sites of negative autoregulators.
Affiliation(s)
- Iván M Lengyel
- Instituto de Investigación en Biomedicina de Buenos Aires (IBioBA)-CONICET-Partner Institute of the Max Planck Society, Polo Científico Tecnológico, Godoy Cruz 2390, C1425FQD, Buenos Aires, Argentina
- Departamento de Física, FCEyN UBA, Ciudad Universitaria, 1428 Buenos Aires, Argentina
- Luis G Morelli
- Instituto de Investigación en Biomedicina de Buenos Aires (IBioBA)-CONICET-Partner Institute of the Max Planck Society, Polo Científico Tecnológico, Godoy Cruz 2390, C1425FQD, Buenos Aires, Argentina
- Departamento de Física, FCEyN UBA, Ciudad Universitaria, 1428 Buenos Aires, Argentina
- Max Planck Institute for Molecular Physiology, Department of Systemic Cell Biology, Otto-Hahn-Strasse 11, D-44227 Dortmund, Germany
2
Bi C. SEAM: a stochastic EM-type algorithm for motif-finding in biopolymer sequences. J Bioinform Comput Biol 2011; 5:47-77. [PMID: 17477491 DOI: 10.1142/s0219720007002527] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 06/09/2006] [Revised: 08/22/2006] [Accepted: 10/14/2006] [Indexed: 12/21/2022]
Abstract
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
Affiliation(s)
- Chengpeng Bi
- Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Pediatrics Research Building, Third Floor, Kansas City, Missouri 64108, USA.
4
Li XY, Thomas S, Sabo PJ, Eisen MB, Stamatoyannopoulos JA, Biggin MD. The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome Biol 2011; 12:R34. [PMID: 21473766 PMCID: PMC3218860 DOI: 10.1186/gb-2011-12-4-r34] [Citation(s) in RCA: 156] [Impact Index Per Article: 11.1] [Received: 03/19/2011] [Accepted: 04/07/2011] [Indexed: 12/11/2022] [Open access]
Abstract
Background In Drosophila embryos, many biochemically and functionally unrelated transcription factors bind quantitatively to highly overlapping sets of genomic regions, with much of the lowest levels of binding being incidental, non-functional interactions on DNA. The primary biochemical mechanisms that drive these genome-wide occupancy patterns have yet to be established. Results Here we use data resulting from the DNaseI digestion of isolated embryo nuclei to provide a biophysical measure of the degree to which proteins can access different regions of the genome. We show that the in vivo binding patterns of 21 developmental regulators are quantitatively correlated with DNA accessibility in chromatin. Furthermore, we find that levels of factor occupancy in vivo correlate much more with the degree of chromatin accessibility than with occupancy predicted from in vitro affinity measurements using purified protein and naked DNA. Within accessible regions, however, the intrinsic affinity of the factor for DNA does play a role in determining net occupancy, with even weak affinity recognition sites contributing. Finally, we show that programmed changes in chromatin accessibility between different developmental stages correlate with quantitative alterations in factor binding. Conclusions Based on these and other results, we propose a general mechanism to explain the widespread, overlapping DNA binding by animal transcription factors. In this view, transcription factors are expressed at sufficiently high concentrations in cells such that they can occupy their recognition sequences in highly accessible chromatin without the aid of physical cooperative interactions with other proteins, leading to highly overlapping, graded binding of unrelated factors.
Affiliation(s)
- Xiao-Yong Li
- Genomics Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road MS 84-171, Berkeley, CA 94720, USA
5
Wang X, Gowik U, Tang H, Bowers JE, Westhoff P, Paterson AH. Comparative genomic analysis of C4 photosynthetic pathway evolution in grasses. Genome Biol 2009; 10:R68. [PMID: 19549309 PMCID: PMC2718502 DOI: 10.1186/gb-2009-10-6-r68] [Citation(s) in RCA: 112] [Impact Index Per Article: 7.0] [Received: 03/18/2009] [Revised: 05/27/2009] [Accepted: 06/23/2009] [Indexed: 11/10/2022] [Open access]
Abstract
BACKGROUND Sorghum is the first C4 plant and the second grass with a full genome sequence available. This makes it possible to perform a whole-genome-level exploration of C4 pathway evolution by comparing key photosynthetic enzyme genes in sorghum, maize (C4) and rice (C3), and to investigate a long-standing hypothesis that a reservoir of duplicated genes is a prerequisite for the evolution of C4 photosynthesis from a C3 progenitor. RESULTS We show that both whole-genome and individual gene duplication have contributed to the evolution of C4 photosynthesis. The C4 gene isoforms show differential duplicability, with some C4 genes being recruited from whole genome duplication duplicates by multiple modes of functional innovation. The sorghum and maize carbonic anhydrase genes display a novel mode of new gene formation, with recursive tandem duplication and gene fusion accompanied by adaptive evolution to produce C4 genes with one to three functional units. Other C4 enzymes in sorghum and maize also show evidence of adaptive evolution, though differing in level and mode. Intriguingly, a phosphoenolpyruvate carboxylase gene in the C3 plant rice has also been evolving rapidly and shows evidence of adaptive evolution, although lacking key mutations that are characteristic of C4 metabolism. We also found evidence that both gene redundancy and alternative splicing may have sheltered the evolution of new function. CONCLUSIONS Gene duplication followed by functional innovation is common to evolution of most but not all C4 genes. The apparently long time-lag between the availability of duplicates for recruitment into C4 and the appearance of C4 grasses, together with the heterogeneity of origins of C4 genes, suggests that there may have been a long transition process before the establishment of C4 photosynthesis.
Affiliation(s)
- Xiyin Wang
- Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA
- College of Sciences, Hebei Polytechnic University, Tangshan, Hebei 063000, China
- Udo Gowik
- Institut für Entwicklungs- und Molekularbiologie der Pflanzen, Heinrich-Heine-Universität, Universitätsstrasse 1, D-40225 Düsseldorf, Germany
- Haibao Tang
- Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA
- Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
- John E Bowers
- Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA
- Peter Westhoff
- Institut für Entwicklungs- und Molekularbiologie der Pflanzen, Heinrich-Heine-Universität, Universitätsstrasse 1, D-40225 Düsseldorf, Germany
- Andrew H Paterson
- Plant Genome Mapping Laboratory, University of Georgia, Athens, GA 30602, USA
- Department of Plant Biology, University of Georgia, Athens, GA 30602, USA
6
Pape UJ, Klein H, Vingron M. Statistical detection of cooperative transcription factors with similarity adjustment. Bioinformatics 2009; 25:2103-9. [PMID: 19286833 PMCID: PMC2722994 DOI: 10.1093/bioinformatics/btp143] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022] [Open access]
Abstract
Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment. Results: Based on previous work, we propose to adjust the window size for co-occurrence detection. Using the derived approximation, one obtains different window sizes for different sets of DNA motifs depending on their similarities. This ensures that the probabilities of co-occurrence in random sequences are equal. Applying the approach to selected similar and dissimilar DNA motifs from human TFs shows the necessity of adjustment and confirms the accuracy of the approximation by comparison to simulated data. Furthermore, it becomes clear that approaches ignoring similarities strongly underestimate P-values for cooperativity of TFs with similar DNA motifs. In addition, the approach is extended to deal with overlapping windows. We derive Chen–Stein error bounds for the approximation. Comparing the error bounds for similar and dissimilar DNA motifs shows that the approximation for similar DNA motifs yields large bounds. Hence, one has to be careful when using overlapping windows. Based on the error bounds, one can precompute the approximation errors and select an appropriate overlap scheme before running the analysis. Availability: Software to perform the calculation for pairs of position frequency matrices (PFMs) is available at http://mosta.molgen.mpg.de, along with C++ source code for download. Contact: utz.pape@molgen.mpg.de
Affiliation(s)
- Utz J Pape
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestr. 73 and Mathematics and Computer Science, Free University of Berlin, Takustr. 9, 14195 Berlin, Germany.
7
Sun YV, Jacobsen DM, Turner ST, Boerwinkle E, Kardia SLR. A Fast Implementation of a Scan Statistic for Identifying Chromosomal Patterns of Genome Wide Association Studies. Comput Stat Data Anal 2009; 53:1794-1801. [PMID: 20161066 DOI: 10.1016/j.csda.2008.04.013] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 01/22/2023]
Abstract
To take into account the complex genomic distribution of SNP variations when identifying chromosomal regions with significant SNP effects, a single nucleotide polymorphism (SNP) association scan statistic was developed. To address the computational needs of genome wide association (GWA) studies, a fast Java application, which combines single-locus SNP tests and a scan statistic for identifying chromosomal regions with significant clusters of significant SNP effects, was developed and implemented. To illustrate this application, SNP associations were analyzed in a pharmacogenomic study of the blood pressure lowering effect of thiazide diuretics (N=195) using the Affymetrix Human Mapping 100K Set. 55,335 tagSNPs (pair-wise linkage disequilibrium R² < 0.5) were selected to reduce the frequency correlation between SNPs. A typical workstation can complete the whole genome scan, including 10,000 permutation tests, within 3 hours. The most significant regions are located on chromosomes 3, 6, 13 and 16, two of which contain candidate genes that may be involved in the underlying drug response mechanism. The computational performance of ChromoScan-GWA and its scalability were tested with up to 1,000,000 SNPs and up to 4,000 subjects. Using 10,000 permutations, the computation time grew linearly in these datasets. This scan statistic application provides a robust statistical and computational foundation for identifying genomic regions associated with disease and provides a method to compare GWA results even across different platforms.
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, Michigan
9
Wang Q, Wan L, Li D, Zhu L, Qian M, Deng M. Searching for bidirectional promoters in Arabidopsis thaliana. BMC Bioinformatics 2009; 10 Suppl 1:S29. [PMID: 19208129 PMCID: PMC2648788 DOI: 10.1186/1471-2105-10-s1-s29] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022] [Open access]
Abstract
Background A "bidirectional gene pair" is defined as two adjacent genes which are located on opposite strands of DNA with transcription start sites (TSSs) not more than 1000 base pairs apart and the intergenic region between two TSSs is commonly designated as a putative "bidirectional promoter". Individual examples of bidirectional gene pairs have been reported for years, as well as a few genome-wide analyses have been studied in mammalian and human genomes. However, no genome-wide analysis of bidirectional genes for plants has been done. Furthermore, the exact mechanism of this gene organization is still less understood. Results We conducted comprehensive analysis of bidirectional gene pairs through the whole Arabidopsis thaliana genome and identified 2471 bidirectional gene pairs. The analysis shows that bidirectional genes are often coexpressed and tend to be involved in the same biological function. Furthermore, bidirectional gene pairs associated with similar functions seem to have stronger expression correlation. We pay more attention to the regulatory analysis on the intergenic regions between bidirectional genes. Using a hierarchical stochastic language model (HSL) (which is developed by ourselves), we can identify intergenic regions enriched of regulatory elements which are essential for the initiation of transcription. Finally, we picked 27 functionally associated bidirectional gene pairs with their intergenic regions enriched of regulatory elements and hypothesized them to be regulated by bidirectional promoters, some of which have the same orthologs in ancient organisms. More than half of these bidirectional gene pairs are further supported by sharing similar functional categories as these of handful experimental verified bidirectional genes. Conclusion Bidirectional gene pairs are concluded also prevalent in plant genome. 
Promoter analyses of the intergenic regions between bidirectional genes could be a new way to study the bidirectional gene structure and may provide an important clue for further analysis. Such a method could be applied to other genomes.
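The definition of a bidirectional gene pair given in this abstract (adjacent genes on opposite strands with TSSs at most 1000 bp apart) can be illustrated with a short Python sketch. This is not the authors' pipeline: the `Gene` record, the function name, and the toy coordinates are hypothetical, all genes are assumed to lie on one chromosome, and adjacency is approximated by ordering genes by TSS position.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str    # hypothetical identifier
    strand: str  # "+" or "-"
    tss: int     # transcription start site coordinate in bp


def bidirectional_pairs(genes, max_tss_distance=1000):
    """Return neighbouring gene pairs on opposite strands whose TSSs
    are at most max_tss_distance bp apart (single chromosome assumed)."""
    ordered = sorted(genes, key=lambda g: g.tss)  # adjacency by TSS order (assumption)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        opposite = left.strand != right.strand
        close = right.tss - left.tss <= max_tss_distance
        if opposite and close:
            pairs.append((left.name, right.name))
    return pairs


# Toy data, invented for illustration only.
genes = [
    Gene("geneA", "-", 1_000),
    Gene("geneB", "+", 1_600),
    Gene("geneC", "+", 5_000),
    Gene("geneD", "-", 9_000),
]
print(bidirectional_pairs(genes))  # [('geneA', 'geneB')]
```

The intergenic region between the two TSSs of each returned pair would then be the candidate region to inspect for regulatory elements.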
Affiliation(s)
- Quan Wang
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China.
10
Wan L, Li D, Zhang D, Liu X, Fu WJ, Zhu L, Deng M, Sun F, Qian M. Conservation and implications of eukaryote transcriptional regulatory regions across multiple species. BMC Genomics 2008; 9:623. [PMID: 19099599 PMCID: PMC2640395 DOI: 10.1186/1471-2164-9-623] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 09/19/2008] [Accepted: 12/20/2008] [Indexed: 01/14/2023] [Open access]
Abstract
BACKGROUND Increasing evidence shows that whole genomes of eukaryotes are almost entirely transcribed into both protein coding genes and an enormous number of non-protein-coding RNAs (ncRNAs). Therefore, revealing the underlying regulatory mechanisms of transcripts becomes imperative. However, for a complete understanding of transcriptional regulatory mechanisms, we need to identify the regions in which they are found. We will call these transcriptional regulation regions, or TRRs, which can be considered functional regions containing a cluster of regulatory elements that cooperatively recruit transcriptional factors for binding and then regulating the expression of transcripts. RESULTS We constructed a hierarchical stochastic language (HSL) model for the identification of core TRRs in yeast based on regulatory cooperation among TRR elements. The HSL model trained based on yeast achieved comparable accuracy in predicting TRRs in other species, e.g., fruit fly, human, and rice, thus demonstrating the conservation of TRRs across species. The HSL model was also used to identify the TRRs of genes, such as p53 or OsALYL1, as well as microRNAs. In addition, the ENCODE regions were examined by HSL, and TRRs were found to pervasively locate in the genomes. CONCLUSION Our findings indicate that 1) the HSL model can be used to accurately predict core TRRs of transcripts across species and 2) identified core TRRs by HSL are proper candidates for the further scrutiny of specific regulatory elements and mechanisms. Meanwhile, the regulatory activity taking place in the abundant numbers of ncRNAs might account for the ubiquitous presence of TRRs across the genome. In addition, we also found that the TRRs of protein coding genes and ncRNAs are similar in structure, with the latter being more conserved than the former.
Affiliation(s)
- Lin Wan
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
- Dayong Li
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Donglei Zhang
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Xue Liu
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Wenjiang J Fu
- Department of Epidemiology, Michigan State University, East Lansing, Michigan 48824, USA
- Lihuang Zhu
- State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, PR China
- Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
- Fengzhu Sun
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100871, PR China
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California 90089, USA
- Minping Qian
- School of Mathematical Sciences, Peking University, Beijing 100871, PR China
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
11
Morgan XC, Ni S, Miranker DP, Iyer VR. Predicting combinatorial binding of transcription factors to regulatory elements in the human genome by association rule mining. BMC Bioinformatics 2007; 8:445. [PMID: 18005433 PMCID: PMC2211755 DOI: 10.1186/1471-2105-8-445] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 05/09/2007] [Accepted: 11/15/2007] [Indexed: 12/20/2022] [Open access]
Abstract
Background Cis-acting transcriptional regulatory elements in mammalian genomes typically contain specific combinations of binding sites for various transcription factors. Although some cis-regulatory elements have been well studied, the combinations of transcription factors that regulate normal expression levels for the vast majority of the 20,000 genes in the human genome are unknown. We hypothesized that it should be possible to discover transcription factor combinations that regulate gene expression in concert by identifying over-represented combinations of sequence motifs that occur together in the genome. In order to detect combinations of transcription factor binding motifs, we developed a data mining approach based on the use of association rules, which are typically used in market basket analysis. We scored each segment of the genome for the presence or absence of each of 83 transcription factor binding motifs, then used association rule mining algorithms to mine this dataset, thus identifying frequently occurring pairs of distinct motifs within a segment. Results Support for most pairs of transcription factor binding motifs was highly correlated across different chromosomes although pair significance varied. Known true positive motif pairs showed higher association rule support, confidence, and significance than background. Our subsets of high-confidence, high-significance mined pairs of transcription factors showed enrichment for co-citation in PubMed abstracts relative to all pairs, and the predicted associations were often readily verifiable in the literature. Conclusion Functional elements in the genome where transcription factors bind to regulate expression in a combinatorial manner are more likely to be predicted by identifying statistically and biologically significant combinations of transcription factor binding motifs than by simply scanning the genome for the occurrence of binding sites for a single transcription factor.
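The support and confidence measures mentioned in this abstract can be made concrete with a small Python sketch. It uses the standard textbook definitions from association rule mining and is not the authors' implementation: the function name, the presence/absence encoding as sets, and the placeholder motif names `M1` to `M3` are assumptions for illustration, and no significance measure is computed.

```python
from itertools import combinations


def pair_rules(segments, min_support=0.0):
    """segments: list of sets, each holding the motifs present in one genome segment.
    Returns {(a, b): (support, confidence)} for the rule a -> b, where
    support = fraction of segments containing both motifs and
    confidence = fraction of segments containing a that also contain b."""
    n = len(segments)
    single_counts = {}
    pair_counts = {}
    for motifs in segments:
        for m in motifs:
            single_counts[m] = single_counts.get(m, 0) + 1
        for a, b in combinations(sorted(motifs), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = (support, count / single_counts[a])
            rules[(b, a)] = (support, count / single_counts[b])
    return rules


# Placeholder motif names and segments, invented for illustration.
segments = [{"M1", "M2"}, {"M1", "M2", "M3"}, {"M1"}, {"M3"}]
rules = pair_rules(segments)
print(rules[("M2", "M1")])  # (0.5, 1.0): every segment with M2 also has M1
```

Support is symmetric in the two motifs while confidence is directional, which is why both orderings of each pair are stored.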
Affiliation(s)
- Xochitl C Morgan
- Institute for Cellular and Molecular Biology and Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712-0159, USA.
12
Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2007; 2:13. [PMID: 17927813 PMCID: PMC2174486 DOI: 10.1186/1748-7188-2-13] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Received: 07/13/2007] [Accepted: 10/10/2007] [Indexed: 11/15/2022] [Open access]
Abstract
Background cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. Results We developed and implemented an algorithm computing the p-value that s different motifs occur respectively $k_1, \ldots, k_s$ or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. Method The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with the $O\left(n|\Sigma|\left(m|\mathcal{H}| + K|\sigma|^{K}\right)\prod_i k_i\right)$ time complexity, where $n$ is the length of the text, $|\Sigma|$ is the alphabet size, $m$ is the maximal motif length, $|\mathcal{H}|$ is the total number of words in motifs, $K$ is the order of Markov model, and $k_i$ is the number of occurrences of the $i$th motif. Conclusion The primary objective of the program is to assess the likelihood that a given DNA segment is CRM regulated with a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. Availability Project web page, stand-alone version and documentation can be found at
13
Abstract
Positive selection in genes and genomes can point to the evolutionary basis for differences among species and among races within a species. The detection of positive selection can also help identify functionally important protein regions and thus guide protein engineering. Many existing tests for positive selection are excessively conservative, vulnerable to artifacts caused by demographic population history, or computationally very intensive. I here propose a simple and rapid test that is complementary to existing tests and that overcomes some of these problems. It relies on the null hypothesis that neutrally evolving DNA regions should show a Poisson distribution of nucleotide substitutions. The test detects significant deviations from this expectation in the form of variation clusters, highly localized groups of amino acid changes in a coding region. In applying this test to several thousand human-chimpanzee gene orthologs, I show that such variation clusters are not generally caused by relaxed selection. They occur in well-defined domains of a protein's tertiary structure and show a large excess of amino acid replacement over silent substitutions. I also identify multiple new human-chimpanzee orthologs subject to positive selection, among them genes that are involved in reproductive functions, immune defense, and the nervous system.
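The null hypothesis described in this abstract (substitutions in a neutrally evolving region follow a Poisson distribution) can be illustrated with a minimal Python sketch. This is not the author's test: the sliding-window scan, the function names, the window width, and the toy positions are all assumptions for illustration, and the reported tail probability is not corrected for having examined several windows.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def densest_window(positions, seq_length, window):
    """Find the window [start, start + window) holding the most substitutions.
    Returns (start, count, uncorrected Poisson tail probability)."""
    positions = sorted(positions)
    # Expected substitutions per window if they were spread uniformly (null model).
    lam = len(positions) / seq_length * window
    best_start, best_count = 0, 0
    for i, start in enumerate(positions):
        count = sum(1 for p in positions[i:] if p < start + window)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count, poisson_tail(best_count, lam)


# Toy substitution positions in a 300-site region, invented for illustration.
positions = [12, 48, 50, 51, 53, 55, 120, 260]
start, count, p = densest_window(positions, seq_length=300, window=10)
print(start, count, p)
```

A small tail probability for the densest window indicates more clustering than the Poisson null would predict; a real analysis would also need to account for the number of windows scanned.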
Affiliation(s)
- Andreas Wagner
- Department of Biochemistry, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland.
14
Papatsenko D. ClusterDraw web server: a tool to identify and visualize clusters of binding motifs for transcription factors. Bioinformatics 2007; 23:1032-4. [PMID: 17308342 DOI: 10.1093/bioinformatics/btm047] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
ClusterDraw is a program for identifying binding sites and binding-site clusters. Its major difference from existing tools is its ability to scan a wide range of parameter values and weigh the statistical significance of all possible clusters smaller than a selected size. The program produces graphs along with decorated FASTA files. The ClusterDraw web server is available at the following URL: http://flydev.berkeley.edu/cgi-bin/cld/submit.cgi
Affiliation(s)
- Dmitri Papatsenko
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA.
|
15
|
Schones DE, Smith AD, Zhang MQ. Statistical significance of cis-regulatory modules. BMC Bioinformatics 2007; 8:19. [PMID: 17241466 PMCID: PMC1796902 DOI: 10.1186/1471-2105-8-19] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.6] [Received: 10/06/2006] [Accepted: 01/22/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software.
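The max-gap idea mentioned in the abstract can be sketched as follows. This shows only the grouping step, under the assumption that a cluster is a run of sites whose neighbouring positions are at most `max_gap` apart; it is not the STORM or MODSTORM implementation, it does not compute any p-value, and the function name and parameters are hypothetical.

```python
def max_gap_clusters(positions, max_gap, min_sites=2):
    """Group site positions into clusters of neighbours <= max_gap apart.

    Clusters with fewer than min_sites members are discarded.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

clusters = max_gap_clusters([5, 20, 30, 200, 210, 500], max_gap=25)
```

In a full pipeline each cluster would then be scored, for example by combining per-site match significance with the probability of the observed spacing.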
Affiliation(s)
- Dustin E Schones
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Department of Physics and Astronomy, Stony Brook University, Stony Brook, NY 11790, USA
| | - Andrew D Smith
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
| | - Michael Q Zhang
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
|
16
|
Sun YV, Levin AM, Boerwinkle E, Robertson H, Kardia SLR. A scan statistic for identifying chromosomal patterns of SNP association. Genet Epidemiol 2007; 30:627-35. [PMID: 16858698 DOI: 10.1002/gepi.20173] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/09/2022]
Abstract
We have developed a single nucleotide polymorphism (SNP) association scan statistic that takes into account the complex distribution of the human genome variation in the identification of chromosomal regions with significant SNP associations. This scan statistic has wide applicability for genetic analysis, whether to identify important chromosomal regions associated with common diseases based on whole-genome SNP association studies or to identify disease susceptibility genes based on dense SNP positional candidate studies. To illustrate this method, we analyzed patterns of SNP associations on chromosome 19 in a large cohort study. Among 2,944 SNPs, we found seven regions that contained clusters of significantly associated SNPs. The average width of these regions was 35 kb with a range of 10-72 kb. We compared the scan statistic results to Fisher's product method using a sliding window approach, and detected 22 regions with significant clusters of SNP associations. The average width of these regions was 131 kb with a range of 10.1-615 kb. Given that the distances between SNPs are not taken into consideration in the sliding window approach, it is likely that a large fraction of these regions represents false positives. However, all seven regions detected by the scan statistic were also detected by the sliding window approach. The linkage disequilibrium (LD) patterns within the seven regions were highly variable indicating that the clusters of SNP associations were not due to LD alone. The scan statistic developed here can be used to make gene-based or region-based SNP inferences about disease association.
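The sliding-window comparison in the abstract uses Fisher's product method, which can be sketched in a few lines. This is a generic illustration rather than the authors' code: the window is defined by a fixed number of consecutive SNPs (so physical distances are ignored, as the abstract notes), and the function names and the example p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's product method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom, whose survival function has a closed form
    for even degrees of freedom.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

def sliding_fisher(p_values, width):
    """Combined p-value for every window of `width` consecutive SNPs."""
    return [fisher_combined_p(p_values[i:i + width])
            for i in range(len(p_values) - width + 1)]

window_p = sliding_fisher([0.5, 0.01, 0.01, 0.6, 0.9], width=2)
```

All input p-values must be strictly positive, and the method assumes independence, which linkage disequilibrium between neighbouring SNPs can violate.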
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, University of Michigan, 611 Church Street, Ann Arbor, MI 48104, USA
|
17
|
Sun YV, Jacobsen DM, Kardia SLR. ChromoScan: a scan statistic application for identifying chromosomal regions in genomic studies. Bioinformatics 2006; 22:2945-7. [PMID: 17032677 DOI: 10.1093/bioinformatics/btl503] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/14/2022] Open
Abstract
ChromoScan is an implementation of a genome-based scan statistic that detects genomic regions that are statistically significant for targeted measurements, such as genetic associations with disease, gene expression profiles, DNA copy number variations, and other genome-based measurements. A Java graphical user interface (GUI) is provided to allow users to select appropriate data transformations and thresholds for defining the significant events. AVAILABILITY ChromoScan is freely available from http://www.epidkardia.sph.umich.edu/software/chromoscan/
Affiliation(s)
- Yan V Sun
- Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI 48104-3028, USA.
|
18
|
Papatsenko D, Levine M. Computational identification of regulatory DNAs underlying animal development. Nat Methods 2005; 2:529-34. [PMID: 16170869 DOI: 10.1038/nmeth0705-529] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023]
Affiliation(s)
- Dmitri Papatsenko
- Department of Molecular and Cellular Biology, Division of Genetics & Development, University of California, Berkeley, California 94720, USA.
|
19
|
Abnizova I, te Boekhorst R, Walter K, Gilks WR. Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: the fluffy-tail test. BMC Bioinformatics 2005; 6:109. [PMID: 15857505 PMCID: PMC1127108 DOI: 10.1186/1471-2105-6-109] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Received: 12/17/2004] [Accepted: 04/27/2005] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND This paper addresses the problem of recognising DNA cis-regulatory modules which are located far from genes. Experimental procedures for this are slow and costly, and computational methods are hard, because they lack positional information. RESULTS We present a novel statistical method, the "fluffy-tail test", to recognise regulatory DNA. We exploit one of the basic informational properties of regulatory DNA: abundance of over-represented transcription factor binding site (TFBS) motifs, although we do not look for specific TFBS motifs, per se. Though overrepresentation of TFBS motifs in regulatory DNA has been intensively exploited by many algorithms, it is still a difficult problem to distinguish regulatory from other genomic DNA. CONCLUSION We show that, in the data used, our method is able to distinguish cis-regulatory modules by exploiting statistical differences between the probability distributions of similar words in regulatory and other DNA. The potential application of our method includes annotation of new genomic sequences and motif discovery.
Affiliation(s)
- Irina Abnizova
- MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK
| | - Rene te Boekhorst
- Computer Science Department, University of Hertfordshire, College Lane, AL10 92BA, Hatfield Campus, UK
| | - Klaudia Walter
- MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK
| | - Walter R Gilks
- MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge CB2 2SR, UK
|
20
|
Papatsenko D, Levine M. Quantitative analysis of binding motifs mediating diverse spatial readouts of the Dorsal gradient in the Drosophila embryo. Proc Natl Acad Sci U S A 2005; 102:4966-71. [PMID: 15795372 PMCID: PMC555988 DOI: 10.1073/pnas.0409414102] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.6] [Received: 12/16/2004] [Indexed: 01/26/2023] Open
Abstract
Dorsal is a sequence-specific transcription factor that is distributed in a broad nuclear gradient across the dorsal-ventral (DV) axis of the early Drosophila embryo. It initiates gastrulation by regulating at least 30-50 target genes in a concentration-dependent fashion. Previous studies identified 18 enhancers that are directly regulated by different concentrations of Dorsal. Here, we employ computational methods to determine the basis for these distinct transcriptional outputs. Orthologous enhancers were identified in a variety of divergent Drosophila species, and their comparison revealed several conserved sequence features responsible for DV patterning. In particular, the quality of Dorsal and Twist recognition sequences correlates with the DV coordinates of gene expression relative to the Dorsal gradient. These findings are entirely consistent with a gradient threshold model for DV patterning, whereby the quality of individual Dorsal binding sites determines in vivo occupancy of target enhancers by the Dorsal gradient. Linked Dorsal and Twist binding sites constitute a conserved composite element in certain "type 2" Dorsal target enhancers, which direct gene expression in ventral regions of the neurogenic ectoderm in response to intermediate levels of the Dorsal gradient. Similar motif arrangements were identified in orthologous loci in the distant mosquito genome, Anopheles gambiae. We discuss how Dorsal and Twist work either additively or synergistically to activate different target enhancers.
Affiliation(s)
- Dmitri Papatsenko
- Department of Molecular and Cell Biology, Division of Genetics, Genomics, and Development, Center for Integrative Genomics, University of California, 16 Barker Hall No. 3204, Berkeley, CA 94720-3204, USA.
|
21
|
Lapidot M, Pilpel Y. Comprehensive quantitative analyses of the effects of promoter sequence elements on mRNA transcription. Nucleic Acids Res 2003; 31:3824-8. [PMID: 12824429 PMCID: PMC168999 DOI: 10.1093/nar/gkg593] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Received: 02/15/2003] [Revised: 04/02/2003] [Accepted: 04/02/2003] [Indexed: 11/13/2022] Open
Abstract
We have generated a WWW interface for automated comprehensive analyses of promoter regulatory motifs and the effect they exert on mRNA expression profiles. The server provides a wide spectrum of analysis tools that allow de novo discovery of regulatory motifs, along with refinement and in-depth investigation of fully or partially characterized motifs. The presented discovery and analysis tools are fundamentally different from existing tools in their basic rationale, statistical background, and specificity and sensitivity towards true regulatory elements. We thus anticipate that the service will be of great importance to the experimental and computational biology communities alike. The motif discovery and diagnosis workbench is available at http://longitude.weizmann.ac.il/rMotif/.
Affiliation(s)
- Michal Lapidot
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 76100, Israel
|
22
|
Abstract
Cis-regulatory modules (CRMs) are transcription regulatory DNA segments (approximately 1 Kb range) that control the expression of developmental genes in higher eukaryotes. We analyzed clustering of known binding motifs for transcription factors (TFs) in over 60 known CRMs from 20 Drosophila developmental genes, and we present evidence that each type of recognition motif forms significant clusters within the regulatory regions regulated by the corresponding TF. We demonstrate how a search with a single binding motif can be applied to explore gene regulatory networks and to discover coregulated genes in the genome. We also discuss the potential of the clustering method in interpreting the differential response of genes to various levels of transcriptional regulators.
|
23
|
Sudarsanam P, Pilpel Y, Church GM. Genome-wide co-occurrence of promoter elements reveals a cis-regulatory cassette of rRNA transcription motifs in Saccharomyces cerevisiae. Genome Res 2002; 12:1723-31. [PMID: 12421759 PMCID: PMC187556 DOI: 10.1101/gr.301202] [Citation(s) in RCA: 70] [Impact Index Per Article: 3.0] [Received: 04/17/2002] [Accepted: 09/10/2002] [Indexed: 11/25/2022]
Abstract
Combinatorial regulation is an important feature of eukaryotic transcription. However, only a limited number of studies have characterized this aspect on a whole-genome level. We have conducted a genome-wide computational survey to identify cis-regulatory motif pairs that co-occur in a significantly high number of promoters in the S. cerevisiae genome. A pair of novel motifs, mRRPE and PAC, co-occur most highly in the genome, primarily in the promoters of genes involved in rRNA transcription and processing. The two motifs show significant positional and orientational bias with mRRPE being closer to the ATG than PAC in most promoters. Two additional rRNA-related motifs, mRRSE3 and mRRSE10, also co-occur with mRRPE and PAC. mRRPE and PAC are the primary determinants of expression profiles while mRRSE3 and mRRSE10 modulate these patterns. We describe a new computational approach for studying the functional significance of the physical locations of promoter elements that combine analyses of genome sequence and microarray data. Applying this methodology to the regulatory cassette containing the four rRNA motifs demonstrates that the relative promoter locations of these elements have a profound effect on the expression patterns of the downstream genes. These findings provide a function for these novel motifs and insight into the mechanism by which they regulate gene expression. The methodology introduced here should prove particularly useful for analyzing transcriptional regulation in more complex genomes.
Affiliation(s)
- Priya Sudarsanam
- Department of Genetics and Lipper Center for Computational Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA
|
24
|
Loots GG, Ovcharenko I, Pachter L, Dubchak I, Rubin EM. rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res 2002; 12:832-9. [PMID: 11997350 PMCID: PMC186580 DOI: 10.1101/gr.225502] [Citation(s) in RCA: 273] [Impact Index Per Article: 11.9] [Indexed: 11/24/2022]
Abstract
Identifying transcriptional regulatory elements represents a significant challenge in annotating the genomes of higher vertebrates. We have developed a computational tool, rVista, for high-throughput discovery of cis-regulatory elements that combines clustering of predicted transcription factor binding sites (TFBSs) and the analysis of interspecies sequence conservation to maximize the identification of functional sites. To assess the ability of rVista to discover true positive TFBSs while minimizing the prediction of false positives, we analyzed the distribution of several TFBSs across 1 Mb of the well-annotated cytokine gene cluster (Hs5q31; Mm11). Because a large number of AP-1, NFAT, and GATA-3 sites have been experimentally identified in this interval, we focused our analysis on the distribution of all binding sites specific for these transcription factors. The exploitation of the orthologous human-mouse dataset resulted in the elimination of > 95% of the approximately 58,000 binding sites predicted on analysis of the human sequence alone, whereas it identified 88% of the experimentally verified binding sites in this region.
Affiliation(s)
- Gabriela G Loots
- Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA.
|
25
|
Caselle M, Di Cunto F, Provero P. Correlating overrepresented upstream motifs to gene expression: a computational approach to regulatory element discovery in eukaryotes. BMC Bioinformatics 2002; 3:7. [PMID: 11876822 PMCID: PMC77394 DOI: 10.1186/1471-2105-3-7] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.3] [Received: 12/13/2001] [Accepted: 02/14/2002] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene regulation in eukaryotes is mainly effected through transcription factors binding to rather short recognition motifs generally located upstream of the coding region. We present a novel computational method to identify regulatory elements in the upstream region of eukaryotic genes. The genes are grouped in sets sharing an overrepresented short motif in their upstream sequence. For each set, the average expression level from a microarray experiment is determined: if this level is significantly higher or lower than the average taken over the whole genome, then the overrepresented motif shared by the genes in the set is likely to play a role in their regulation. RESULTS The method was tested by applying it to the genome of Saccharomyces cerevisiae, using the publicly available results of a DNA microarray experiment, in which expression levels for virtually all the genes were measured during the diauxic shift from fermentation to respiration. Several known motifs were correctly identified, and a new candidate regulatory sequence was determined. CONCLUSIONS We have described and successfully tested a simple computational method to identify upstream motifs relevant to gene regulation in eukaryotes by studying the statistical correlation between overrepresented upstream motifs and gene expression levels.
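The comparison between a motif-sharing gene set and the genome-wide average can be sketched with a simple z-score. The abstract does not state which significance test the authors used, so the z-score, the function name, and the toy expression values below are assumptions for illustration only.

```python
import math

def motif_set_z(expression, gene_set):
    """Z-score of a gene set's mean expression against the genome-wide mean.

    expression: dict mapping gene name -> expression value (one experiment).
    gene_set: genes sharing an overrepresented upstream motif.
    """
    values = list(expression.values())
    n = len(values)
    genome_mean = sum(values) / n
    variance = sum((v - genome_mean) ** 2 for v in values) / (n - 1)
    m = len(gene_set)
    set_mean = sum(expression[g] for g in gene_set) / m
    # Standard error of the mean of m genes drawn from the genome.
    return (set_mean - genome_mean) / math.sqrt(variance / m)

expression = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 4.0, "e": 4.0, "f": 4.0}
z_up = motif_set_z(expression, ["d", "e", "f"])
z_down = motif_set_z(expression, ["a", "b", "c"])
```

A large positive or negative score suggests the shared motif is associated with induced or repressed expression; real data would need a correction for the number of motifs tested.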
Affiliation(s)
- Michele Caselle
- Dipartimento di Fisica Teorica, Università di Torino, and INFN, Sezione di Torino, Torino, Italy
| | - Ferdinando Di Cunto
- Dipartimento di Genetica, Biologia e Biochimica, Università di Torino, Torino, Italy
| | - Paolo Provero
- Dipartimento di Fisica Teorica, Università di Torino, and INFN, Sezione di Torino, Torino, Italy
- Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale, Alessandria, Italy
|
26
|
Birnbaum K, Benfey PN, Shasha DE. cis element/transcription factor analysis (cis/TF): a method for discovering transcription factor/cis element relationships. Genome Res 2001; 11:1567-73. [PMID: 11544201 PMCID: PMC311103 DOI: 10.1101/gr.158301] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.4] [Received: 02/27/2001] [Accepted: 06/13/2001] [Indexed: 11/25/2022]
Abstract
We report a simple new algorithm, cis/TF, that uses genomewide expression data and the full genomic sequence to match transcription factors to their binding sites. Most previous computational methods discovered binding sites by clustering genes having similar expression patterns and then identifying over-represented subsequences in the promoter regions of those genes. By contrast, cis/TF asserts that B is a likely binding site of a transcription factor T if the expression pattern of T is correlated to the composite expression patterns of all genes containing B, even when those genes are not mutually correlated. Thus, our method focuses on binding sites rather than genes. The algorithm has successfully identified experimentally-supported transcription factor binding relationships in tests on several data sets from Saccharomyces cerevisiae.
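The core comparison described in the abstract, correlating a transcription factor's expression pattern with the composite pattern of all genes containing a candidate site, can be sketched as below. The abstract does not define how the composite is formed, so averaging across genes, the use of Pearson correlation, and all names and values here are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def composite_profile(expression, genes):
    """Mean expression across the given genes at each condition."""
    profiles = [expression[g] for g in genes]
    return [sum(column) / len(column) for column in zip(*profiles)]

def cis_tf_score(expression, tf, genes_with_site):
    """Correlation between a TF's profile and the composite of site-bearing genes."""
    return pearson(expression[tf], composite_profile(expression, genes_with_site))

# Toy data: g1 and g2 are anticorrelated with each other,
# yet their composite tracks the TF exactly.
expression = {"TF": [1, 2, 3, 4], "g1": [2, 0, 6, 4], "g2": [0, 4, 0, 4]}
score = cis_tf_score(expression, "TF", ["g1", "g2"])
```

The toy data mirror the abstract's point that the site-bearing genes need not be mutually correlated for the composite to follow the transcription factor.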
Affiliation(s)
- K Birnbaum
- Department of Biology, New York University, New York, New York 10003, USA
|
27
|
Abstract
With the continuing accomplishments of the human genome project, high-throughput strategies to identify DNA sequences that are important in mammalian gene regulation are becoming increasingly feasible. In contrast to the historic, labour-intensive, wet-laboratory methods for identifying regulatory sequences, many modern approaches are heavily focused on the computational analysis of large genomic data sets. Data from inter-species genomic sequence comparisons and genome-wide expression profiling, integrated with various computational tools, are poised to contribute to the decoding of genomic sequence and to the identification of those sequences that orchestrate gene regulation. In this review, we highlight several genomic approaches that are being used to identify regulatory sequences in mammalian genomes.
Affiliation(s)
- L A Pennacchio
- Genome Sciences Department, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, USA
|
28
|
Abstract
Computational genomics is a subfield of computational biology that deals with the analysis of entire genome sequences. Transcending the boundaries of classical sequence analysis, computational genomics exploits the inherent properties of entire genomes by modelling them as systems. We review recent developments in the field, discuss in some detail a number of novel approaches that take into account the genomic context and argue that progress will be made by novel knowledge representation and simulation technologies.
Affiliation(s)
- S Tsoka
- Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, UK
|
29
|
Abstract
Availability of complete bacterial genomes opens the way to the comparative approach to the recognition of transcription regulatory sites. Assumption of regulon conservation in conjunction with profile analysis provides two lines of independent evidence making it possible to make highly specific predictions. Recently this approach was used to analyze several regulons in eubacteria and archaebacteria. The present review covers recent advances in the comparative analysis of transcriptional regulation in prokaryotes and phylogenetic fingerprinting techniques in eukaryotes, and describes the emerging patterns of the evolution of regulatory systems.
Affiliation(s)
- M S Gelfand
- State Scientific Center for Biotechnology 'NIIGenetika', Moscow, Russia.
|
30
|
Zhang MQ. Large-Scale Gene Expression Data Analysis: A New Challenge to Computational Biologists. Genome Res 1999. [DOI: 10.1101/gr.9.8.681] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7] [Indexed: 11/24/2022]
Abstract
The use of high-density DNA arrays to monitor gene expression at a genome-wide scale constitutes a fundamental advance in biology. In particular, the expression pattern of all genes in Saccharomyces cerevisiae can be interrogated using microarray analysis where cDNAs are hybridized to an array of each of the ∼6000 genes in the yeast genome. In this survey I review three recent experiments related to transcriptional regulation and discuss the great challenge for computational biologists trying to extract functional information from such large-scale gene expression data.
|
31
|
Wagner A. Distribution of transcription factor binding sites in the yeast genome suggests abundance of coordinately regulated genes. Genomics 1998; 50:293-5. [PMID: 9653659 DOI: 10.1006/geno.1998.5303] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Affiliation(s)
- A Wagner
- Department of Biology, University of New Mexico, Albuquerque 87131, USA.
|
32
|
Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998; 278:167-81. [PMID: 9571041 DOI: 10.1006/jmbi.1998.1700] [Citation(s) in RCA: 306] [Impact Index Per Article: 11.3] [Indexed: 11/22/2022]
Abstract
For many newly sequenced genes, sequence analysis of the putative protein yields no clue on function. It would be beneficial to be able to identify in the genome the regulatory regions that confer temporal and spatial expression patterns for the uncharacterized genes. Additionally, it would be advantageous to identify regulatory regions within genes of known expression pattern without performing the costly and time consuming laboratory studies now required. To achieve these goals, the wealth of case studies performed over the past 15 years will have to be collected into predictive models of expression. Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression. However, potential binding sites for these factors occur with sufficient frequency that it is rare for a gene to be found without one. Analysis of experimentally determined muscle regulatory sequences indicates that muscle expression requires multiple elements in close proximity. A model is generated with predictive capability for identifying these muscle-specific regulatory modules. Phylogenetic footprinting, the identification of sequences conserved between distantly related species, complements the statistical predictions. Through the use of logistic regression analysis, the model promises to be easily modified to take advantage of the elucidation of additional factors, cooperation rules, and spacing constraints.
Affiliation(s)
- W W Wasserman
- Bioinformatics Research Group, SmithKline Beecham Pharmaceuticals, 709 Swedeland Road, King of Prussia, PA 19406, USA
|