1
Abstract
Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite the completion of thousands of studies on cancer gene expression, there have been few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied the Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, implemented as the cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top-scoring motifs are typically shared between several tissues; they define sets of target genes responsible for certain functions in carcinogenesis.
Both the co-expression modules and a database of precalculated motifs are publicly available and accessible for further studies.
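To make the co-expression step concrete, the following is a minimal sketch of how a Pearson-correlation adjacency could be computed for a handful of genes. It assumes the unsigned soft-thresholding form |r|^beta commonly associated with WGCNA; the value beta=6, the function names and the toy expression values are illustrative assumptions, not details taken from the study or from the cisExpress software.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def soft_threshold_adjacency(expression, beta=6):
    """Return {(gene_i, gene_j): |r|**beta} for every unordered gene pair.

    `expression` maps a gene name to its expression profile across samples.
    beta is a hypothetical soft-thresholding power chosen for illustration.
    """
    genes = sorted(expression)
    adjacency = {}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            r = pearson(expression[gene], expression[other])
            adjacency[(gene, other)] = abs(r) ** beta
    return adjacency


# Toy data (made up): geneA and geneB are perfectly correlated.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
adj = soft_threshold_adjacency(toy)
```

Raising the correlation to a power suppresses weak correlations while keeping strong ones close to 1, so clusters would then be sought among gene pairs with high adjacency.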
2
Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 2016; 17:479. [PMID: 27871221 PMCID: PMC5117513 DOI: 10.1186/s12859-016-1354-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2016] [Accepted: 11/11/2016] [Indexed: 11/24/2022] Open
Abstract
Background Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA, to control specific sets of genes. Some transcription factor binding sites (TFBSs) near the transcription start site (TSS) display tight positional preferences relative to the TSS. Furthermore, near the TSS, RMs can co-localize TFBSs with each other and the TSS. The proportion of TFBS positional preferences due to TFBS co-localization within RMs is unknown, however. ChIP experiments confirm co-localization of some TFBSs genome-wide, including near the TSS, but they typically examine only a few TFs at a time, using non-physiological conditions that can vary from lab to lab. In contrast, sequence analysis can examine many TFs uniformly and methodically, broadly surveying the co-localization of TFBSs with tight positional preferences relative to the TSS. Results Our statistics found 43 significant sets of human motifs in the JASPAR TF Database with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a gene group of 135 to 3304 genes, with 42/43 (98%) gene groups independently validated by DAVID, a gene ontology database, with FDR < 0.05. Motifs corresponding to two TFBSs in an RM should co-occur more than by chance alone, enriching the intersection of the gene groups corresponding to the two TFs. Thus, a gene-group intersection systematically enriched beyond chance alone provides evidence that the two TFs participate in an RM. Of the 903 = 43*42/2 intersections of the 43 significant gene groups, we found 768/903 (85%) pairs of gene groups with significantly enriched intersections, with 564/768 (73%) intersections independently validated by DAVID with FDR < 0.05. A user-friendly web site at http://go.usa.gov/3kjsH permits biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs.
Conclusions Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM near the TSS that binds a particular TF subunit. Of all intersections of our 43 significant gene groups, 85% were significantly enriched, with 73% of the significant enrichments independently validated by gene ontology. The co-localization of TFBSs within RMs therefore likely explains much of the tight TFBS positional preferences near the TSS. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1354-5) contains supplementary material, which is available to authorized users.
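The pairwise intersection count and the idea of an intersection "enriched beyond chance" can be illustrated with a short sketch. The abstract does not state which test the authors used; the one-sided hypergeometric test below is an assumption chosen because it is a standard way to score the overlap of two gene sets, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total, size_a, size_b, overlap):
    """P(X >= overlap) when size_b genes are drawn at random from `total`
    genes, of which size_a belong to group A (hypergeometric null)."""
    max_overlap = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(total - size_a, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(total, size_b)


# Number of unordered pairs among 43 gene groups, as stated in the abstract.
n_pairs = comb(43, 2)

# Hypothetical example: 20000 genes, groups of 135 and 300 genes sharing 12.
p_value = hypergeom_upper_tail(20000, 135, 300, 12)
```

With 903 pairs tested, any such p-values would still need a multiple-testing correction before being called significant.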
3
NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data. BMC Genomics 2013; 14:349. [PMID: 23706083 PMCID: PMC3672025 DOI: 10.1186/1471-2164-14-349] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/25/2013] [Accepted: 05/20/2013] [Indexed: 11/18/2022] Open
Abstract
Background Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies. Results We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. Conclusions The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.
4
The PARIGA server for real time filtering and analysis of reciprocal BLAST results. PLoS One 2013; 8:e62224. [PMID: 23667459 PMCID: PMC3646873 DOI: 10.1371/journal.pone.0062224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2012] [Accepted: 03/19/2013] [Indexed: 12/03/2022] Open
Abstract
BLAST-based similarity searches are commonly used in several applications involving both nucleotide and protein sequences. These applications span from simple tasks such as mapping sequences over a database to more complex procedures such as clustering or annotation processes. When the amount of analysed data increases, manual inspection of BLAST results becomes a tedious procedure. Tools for parsing or filtering BLAST results for different purposes are then required. We describe here PARIGA (http://resources.bioinformatica.crs4.it/pariga/), a server that enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST outputs in a python-serialized-objects database, results can be filtered according to several parameters in real time, without re-running the process and avoiding additional programming efforts. Results can be interrogated by the user using logical operations, for example to retrieve cases where two queries match the same targets, or when sequences from the two datasets are reciprocal best hits, or when a query matches a target in multiple regions. The Pariga web server is designed to be a helpful tool for managing the results of sequence similarity searches. The design and implementation of the server renders all operations very fast and easy to use.
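The reciprocal-best-hit filter mentioned above can be sketched in a few lines. This is not PARIGA's code or API; it assumes each BLAST result has already been parsed into (query, target, bitscore) tuples, and the function names and toy hits are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    `hits` is an iterable of (query, target, bitscore) tuples, for example
    parsed from tabular BLAST output (the parsing step is not shown).
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy hits (made up): a1 and b1 are each other's best hit; a2 is not b1's.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b1", 120.0)]
b_vs_a = [("b1", "a1", 245.0), ("b1", "a2", 118.0), ("b2", "a1", 88.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Keeping both parsed result sets in memory, as in this sketch, is what lets such filters be re-applied with different criteria without re-running BLAST.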
5
Transcriptional regulation of tocopherol biosynthesis in tomato. PLANT MOLECULAR BIOLOGY 2013; 81:309-25. [PMID: 23247837 DOI: 10.1007/s11103-012-0001-4] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 09/11/2012] [Accepted: 12/10/2012] [Indexed: 05/21/2023]
Abstract
Tocopherols, compounds with vitamin E (VTE) activity, are potent lipid-soluble antioxidants synthesized only by photosynthetic organisms. Their biosynthesis requires the condensation of phytyl-diphosphate and homogentisate, derived from the methylerythritol phosphate (MEP) and shikimate (SK) pathways, respectively. These metabolic pathways are central in plant chloroplast metabolism and are involved in the biosynthesis of important molecules such as chlorophyll, carotenoids, aromatic amino-acids and prenylquinones. In the last decade, few studies have provided insights into the regulation of VTE biosynthesis and its accumulation. However, the regulatory mechanisms of the pathway at the mRNA level remain unclear. We have recently identified a collection of tomato genes involved in tocopherol biosynthesis. In this work, by a dedicated qPCR array platform, the transcript levels of 47 genes, including paralogs, were determined in leaves and across fruit development. Expression data were analyzed for correlation with tocopherol profiles by coregulation network and neural clustering approaches. The results showed that tocopherol biosynthesis is controlled both temporally and spatially; however, total tocopherol content remains constant. These analyses exposed 18 key genes from MEP, SK, phytol recycling and VTE-core pathways highly associated with VTE content in leaves and fruits. Moreover, genomic analyses of promoter regions suggested that the expression of the tocopherol-core pathway genes is transcriptionally coregulated with specific genes of the upstream pathways. Whilst the transcriptional profiles of the precursor pathway genes would suggest an increase in VTE content across fruit development, the data indicate that in the M82 cultivar phytyl diphosphate supply limits tocopherol biosynthesis in later fruit stages. This is in part due to the decreasing transcript levels of geranylgeranyl reductase (GGDR), which restricts the isoprenoid precursor availability.
As a proof of concept, by analyzing a collection of Andean landrace tomato genotypes, the role of the pinpointed genes in determining fruit tocopherol content was confirmed. The results uncovered a finely tuned regulation able to shift the precursor pathways controlling substrate influx for VTE biosynthesis and overcoming endogenous competition for intermediates. Taken together, the data suggest that the genes encoding 1-deoxy-D-xylulose-5-phosphate synthase and GGDR, which determine phytyl-diphosphate availability, together with genes encoding enzymes involved in chlorophyll-derived phytol metabolism, are the most plausible engineering targets for improving tomato fruit nutritional value.
6
Differences in local genomic context of bound and unbound motifs. Gene 2012; 506:125-34. [PMID: 22692006 DOI: 10.1016/j.gene.2012.06.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/27/2012] [Accepted: 06/04/2012] [Indexed: 11/25/2022]
Abstract
Understanding gene regulation is a major objective in molecular biology research. Frequently, transcription is driven by transcription factors (TFs) that bind to specific DNA sequences. These motifs are usually short and degenerate, so multiple copies are likely to occur throughout the genome by chance. Despite this, TFs only bind to a small subset of sites, thus prompting our investigation into the differences between motifs that are bound by TFs and those that remain unbound. Here we constructed vectors representing various chromatin- and sequence-based features for a published set of bound and unbound motifs representing nine TFs in the budding yeast Saccharomyces cerevisiae. Using a machine learning approach, we identified a set of features that can be used to discriminate between bound and unbound motifs. We also discovered that some TFs bind most or all of their strong motifs in intergenic regions. Our data demonstrate that local sequence context can be strikingly different around motifs that are bound compared to motifs that are unbound. We concluded that there are multiple combinations of genomic features that characterize bound or unbound motifs.
7
Abstract
Transcription factors and the short, often degenerate DNA sequences they recognize are central regulators of gene expression, but their regulatory code is challenging to dissect experimentally. Thus, computational approaches have long been used to identify putative regulatory elements from the patterns in promoter sequences. Here we present a new algorithm “POWRS” (POsition-sensitive WoRd Set) for identifying regulatory sequence motifs, specifically developed to address two common shortcomings of existing algorithms. First, POWRS uses the position-specific enrichment of regulatory elements near transcription start sites to significantly increase sensitivity, while providing new information about the preferred localization of those elements. Second, POWRS forgoes position weight matrices for a discrete motif representation that appears more resistant to over-generalization. We apply this algorithm to discover sequences related to constitutive, high-level gene expression in the model plant Arabidopsis thaliana, and then experimentally validate the importance of those elements by systematically mutating two endogenous promoters and measuring the effect on gene expression levels. This provides a foundation for future efforts to rationally engineer gene expression in plants, a problem of great importance in developing biotech crop varieties. Availability: BSD-licensed Python code at http://grassrootsbio.com/papers/powrs/.
8
SEAM: A STOCHASTIC EM-TYPE ALGORITHM FOR MOTIF-FINDING IN BIOPOLYMER SEQUENCES. J Bioinform Comput Biol 2011; 5:47-77. [PMID: 17477491 DOI: 10.1142/s0219720007002527] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 06/09/2006] [Revised: 08/22/2006] [Accepted: 10/14/2006] [Indexed: 12/21/2022]
Abstract
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
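For readers unfamiliar with position weight matrices, the sketch below shows the basic object that EM-type motif finders refine: a log-odds matrix built from aligned sites and used to score sequence windows. It is not the SEAM or DEM algorithm (no expectation or maximization step is performed); the pseudocount, the uniform background, the function names and the toy sites are all illustrative assumptions.

```python
import math

ALPHABET = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length sites."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in ALPHABET
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of one sequence window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_site(pwm, sequence):
    """Return the start index of the highest-scoring window."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda s: score(pwm, sequence[s:s + width]))


# Toy aligned sites (made up) and a sequence containing the consensus.
sites = ["TGTGA", "TGTGA", "TGTGA", "TGAGA"]
pwm = build_pwm(sites)
start = best_site(pwm, "CCCTGTGACCC")
```

An EM-type method would alternate between re-estimating such a matrix from the current site assignments and re-assigning sites using the matrix; a stochastic variant would sample the assignments instead of taking expectations.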
9
The spatial distribution of cis regulatory elements in yeast promoters and its implications for transcriptional regulation. BMC Genomics 2010; 11:581. [PMID: 20958978 PMCID: PMC3091728 DOI: 10.1186/1471-2164-11-581] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Received: 06/08/2010] [Accepted: 10/19/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND How the transcription factor binding sites (TFBSs) are distributed in the promoter region has implications for gene regulation. Previous studies used the translation start codon as the reference point to infer the TFBS distribution. However, it is biologically more relevant to use the transcription start site (TSS) as the reference point. In this study, we reexamined the spatial distribution of TFBSs, investigated various promoter features that may affect the distribution, and studied the effect of TFBS distribution on transcriptional regulation. RESULTS We found a sharp peak for the distribution of TFBSs at ~115 bp upstream of the TSS, but no clear peak when the translation start codon was used as the reference point. Our analysis of sequence variation data among 63 yeast strains revealed very low deletion polymorphisms in the region between the distribution peak and the TSS, suggesting that the distances between TFBSs and the TSS have been selectively constrained in evolution. As in previous studies, we found that the nucleosome occupancy and the presence/absence of TATA-box in the promoter region affect the TFBS distribution pattern. In addition, we found that there exists a correlation between the 5'UTR length and the TFBS distribution pattern and we showed that the TFBS distribution pattern affects gene transcription level and plasticity. CONCLUSIONS The spatial distribution of TFBSs obtained using the TSS as the reference point shows a much sharper peak than does the distribution obtained using the translation start codon as the reference point. The TFBS distribution pattern is affected by nucleosome occupancy and presence of TATA-box and it affects the transcription level and transcription plasticity of the gene.
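A TSS-anchored distribution of the kind described here can be sketched as a strand-aware histogram. The coordinate convention (negative values upstream of the TSS), the 10 bp bin width, the function names and the toy coordinates are assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def relative_position(tfbs, tss, strand):
    """Position of a TFBS relative to the TSS; negative means upstream."""
    return tfbs - tss if strand == "+" else tss - tfbs


def position_histogram(sites, bin_width=10):
    """Count sites per bin, keyed by the bin's lower edge.

    `sites` is an iterable of (tfbs_coordinate, tss_coordinate, strand).
    Floor division keeps negative positions in the correct bin,
    e.g. -115 falls in the bin starting at -120.
    """
    histogram = Counter()
    for tfbs, tss, strand in sites:
        rel = relative_position(tfbs, tss, strand)
        histogram[(rel // bin_width) * bin_width] += 1
    return histogram


# Toy sites (made up): two lie about 115 bp upstream, one about 50 bp.
sites = [(885, 1000, "+"), (5112, 5000, "-"), (1950, 2000, "+")]
hist = position_histogram(sites)
peak_bin = max(hist, key=hist.get)
```

Repeating the same count with the translation start codon as the reference coordinate would give the comparison distribution discussed in the abstract.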
10
Abstract
MOTIVATION Discovery of nucleotide motifs that are localized with respect to a certain biological landmark is important in several applications, such as in regulatory sequences flanking the transcription start site, in the neighborhood of known transcription factor binding sites, and in transcription factor binding regions discovered by massively parallel sequencing (ChIP-Seq). RESULTS We report an algorithm called LocalMotif to discover such localized motifs. The algorithm is based on a novel scoring function, called spatial confinement score, which can determine the exact interval of localization of a motif. This score is combined with other existing scoring measures including over-representation and relative entropy to determine the overall prominence of the motif. The approach successfully discovers biologically relevant motifs and their intervals of localization in scenarios where the motifs cannot be discovered by general motif finding tools. It is especially useful for discovering multiple co-localized motifs in a set of regulatory sequences, such as those identified by ChIP-Seq. AVAILABILITY AND IMPLEMENTATION The LocalMotif software is available at http://www.comp.nus.edu.sg/~bioinfo/LocalMotif.
11
Abstract
MOTIVATION Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. RESULTS This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. AVAILABILITY AND IMPLEMENTATION The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
12
Identifying regulatory elements in eukaryotic genomes. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS 2009; 8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.9] [Indexed: 12/30/2022]
Abstract
Proper development and functioning of an organism depend on precise spatial and temporal expression of all its genes. These coordinated expression patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. We highlight available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information such as gene expression data, chromatin structure, DNA-binding specificities of transcription factors and cooperativity of transcription factors. We also discuss the relevance of regulatory elements to human health through examples of mutations in these regions that lead to gene misregulation and are strongly associated with human disorders.
13
Measuring spatial preferences at fine-scale resolution identifies known and novel cis-regulatory element candidates and functional motif-pair relationships. Nucleic Acids Res 2009; 37:e92. [PMID: 19483094 PMCID: PMC2715254 DOI: 10.1093/nar/gkp423] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 02/03/2023] Open
Abstract
Transcriptional regulation is mediated by the collective binding of proteins called transcription factors to cis-regulatory elements. A handful of factors are known to function at particular distances from the transcription start site, although the extent to which this occurs is not well understood. Spatial dependencies can also exist between pairs of binding motifs, facilitating factor-pair interactions. We sought to determine to what extent spatial preferences measured at high-scale resolution could be utilized to predict cis-regulatory elements as well as motif-pairs binding interacting proteins. We introduce the ‘motif positional function’ model which predicts spatial biases using regression analysis, differentiating noise from true position-specific overrepresentation at single-nucleotide resolution. Our method predicts 48 consensus motifs exhibiting positional enrichment within human promoters, including fourteen motifs without known binding partners. We then extend the model to analyze distance preferences between pairs of motifs. We find that motif-pairs binding interacting factors often co-occur preferentially at multiple distances, with intervals between preferred distances often corresponding to the turn of the DNA double-helix. This offers a novel means by which to predict sequence elements with a collective role in gene regulation.
14
Genomewide bioinformatic analysis negates any specific role for Dof, GATA and Ag/cTCA motifs in nitrate responsive gene expression in Arabidopsis. PHYSIOLOGY AND MOLECULAR BIOLOGY OF PLANTS : AN INTERNATIONAL JOURNAL OF FUNCTIONAL PLANT BIOLOGY 2009; 15:145-50. [PMID: 23572923 PMCID: PMC3550365 DOI: 10.1007/s12298-009-0016-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/02/2023]
Abstract
Nitrate response at the plant level is mediated by the transcriptional regulation of several hundreds of genes, but no common cis-acting nitrate-responsive elements (NREs) have been identified so far. Earlier, we bioinformatically ruled out the possibility that the previously published [(a/t)7Ag/cTCA] motif could act as NRE on its own (Das et al., 2007, Mol. Genet. Genomics, 278: 519-525). In the present study, we examined other motifs such as Dof and GATA binding elements in homologous as well as heterologous pairwise combinations in the Arabidopsis genome in silico. None of the above three motifs revealed any unique association with nitrate responsive genes or their subsets in any combination, either within their ORFs or 1 kb flanking sequences on either side. Additionally, twelve new, top-scoring candidate motifs that were generated using different online motif samplers were analyzed in silico using a subset of 21 'early' nitrate responsive genes, but did not reveal any specificity of occurrence. These results underscore the need to continue the search for novel candidate NREs, as possible sites of intervention to understand/improve nitrate-responsive gene expression and nitrate use efficiency.
15
Repetitive DNA elements, nucleosome binding and human gene expression. Gene 2009; 436:12-22. [PMID: 19393174 DOI: 10.1016/j.gene.2009.01.013] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Received: 01/22/2009] [Accepted: 01/23/2009] [Indexed: 11/26/2022]
Abstract
We evaluated the epigenetic contributions of repetitive DNA elements to human gene regulation. Human proximal promoter sequences show distinct distributions of transposable elements (TEs) and simple sequence repeats (SSRs). TEs are enriched distal from transcriptional start sites (TSSs) and their frequency decreases closer to TSSs, being largely absent from the core promoter region. SSRs, on the other hand, are found at low frequency distal to the TSS and then increase in frequency starting approximately 150 bp upstream of the TSS. The peak of SSR density is centered around the -35 bp position where the basal transcriptional machinery assembles. These trends in repetitive sequence distribution are strongly correlated, positively for TEs and negatively for SSRs, with relative nucleosome binding affinities along the promoters. Nucleosomes bind with highest probability distal from the TSS and the nucleosome binding affinity steadily decreases reaching its nadir just upstream of the TSS at the same point where SSR frequency is at its highest. Promoters that are enriched for TEs are more highly and broadly expressed, on average, than promoters that are devoid of TEs. In addition, promoters that have similar repetitive DNA profiles regulate genes that have more similar expression patterns and encode proteins with more similar functions than promoters that differ with respect to their repetitive DNA. Furthermore, distinct repetitive DNA promoter profiles are correlated with tissue-specific patterns of expression. These observations indicate that repetitive DNA elements mediate chromatin accessibility in proximal promoter regions and the repeat content of promoters is relevant to both gene expression and function.
16
Abstract
Reliable identification and assignment of cis-regulatory elements in promoter regions is a challenging problem in biology. The sophistication of transcriptional regulation in higher eukaryotes, particularly in metazoans, could be an important factor contributing to their organismal complexity. Here we present an integrated approach where networks of co-expressed genes are combined with gene ontology-derived functional networks to discover clusters of genes that share both similar expression patterns and functions. Regulatory elements are identified in the promoter regions of these gene clusters using a Gibbs sampling algorithm implemented in the A-GLAM software package. Using this approach, we analyze the cell-cycle co-expression network of the yeast Saccharomyces cerevisiae, showing that this approach correctly identifies cis-regulatory elements present in clusters of co-expressed genes.
17
Abstract
Reliable detection of cis-regulatory elements in promoter regions is a difficult and unsolved problem in computational biology. The intricacy of transcriptional regulation in higher eukaryotes, primarily in metazoans, could be a major driving force of organismal complexity. Eukaryotic genome annotations have improved greatly due to large-scale characterization of full-length cDNAs, transcriptional start sites (TSSs), and comparative genomics. Regulatory elements are identified in promoter regions using a variety of enumerative or alignment-based methods. Here we present a survey of recent computational methods for eukaryotic promoter analysis and describe the use of an alignment-based method implemented in the A-GLAM program.
18
Evolutionary computation for discovery of composite transcription factor binding sites. Nucleic Acids Res 2008; 36:e142. [PMID: 18927103 PMCID: PMC2588514 DOI: 10.1093/nar/gkn738] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 07/07/2008] [Revised: 09/05/2008] [Accepted: 10/02/2008] [Indexed: 12/02/2022] Open
Abstract
Previous research demonstrated the use of evolutionary computation for the discovery of transcription factor binding sites (TFBS) in promoter regions upstream of coexpressed genes. However, it remained unclear whether or not composite TFBS elements, commonly found in higher organisms where two or more TFBSs form functional complexes, could also be identified by using this approach. Here, we present an important refinement of our previous algorithm and test the identification of composite elements using NFAT/AP-1 as an example. We demonstrate that by using appropriate existing parameters such as window size, novel scoring methods such as central bonusing, and methods of self-adaptation to automatically adjust the variation operators during the evolutionary search, TFBSs of different sizes and complexity can be identified as top solutions. Some of these solutions have known experimental relationships with NFAT/AP-1. We also indicate that even after properly tuning the model parameters, the choice of the appropriate window size has a significant effect on algorithm performance. We believe that this improved algorithm will greatly augment TFBS discovery.
19
Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 2008; 9:262. [PMID: 18533028 PMCID: PMC2432075 DOI: 10.1186/1471-2105-9-262] [Received: 10/29/2007] [Accepted: 06/04/2008]
Abstract
Background: Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches. Results: A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly. Conclusion: Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.
20
The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 2008; 36:2777-86. [PMID: 18367472 PMCID: PMC2377430 DOI: 10.1093/nar/gkn137]
Abstract
A number of previous studies have predicted transcription factor binding sites (TFBSs) by exploiting the position of genomic landmarks like the transcriptional start site (TSS). The studies’ methods are generally too computationally intensive for genome-scale investigation, so the full potential of ‘positional regulomics’ to discover TFBSs and determine their function remains unknown. Because databases often annotate the genomic landmarks in DNA sequences, the methodical exploitation of positional regulomics has become increasingly urgent. Accordingly, we examined a set of 7914 human putative promoter regions (PPRs) with a known TSS. Our methods identified 1226 eight-letter DNA words with significant positional preferences with respect to the TSS, of which only 608 matched known TFBSs. However, many groups of genes whose PPRs contained a common word displayed similar expression profiles and related biological functions. Most interestingly, our results included 78 words, each of which clustered significantly in two or three different positions relative to the TSS. Often, the gene groups corresponding to different positional clusters of the same word corresponded to diverse functions, e.g. activation or repression in different tissues. Thus, different clusters of the same word likely reflect the phenomenon of ‘positional regulation’, i.e. a word's regulatory function can vary with its position relative to a genomic landmark, a conclusion inaccessible to methods based purely on sequence. Further integrative analysis of words co-occurring in PPRs also yielded 24 different groups of genes, likely identifying cis-regulatory modules de novo. Whereas comparative genomics requires precise sequence alignments, positional regulomics exploits genomic landmarks to provide a ‘poor man's alignment’. By exploiting the phenomenon of positional regulation, it uses position to differentiate the biological functions of subsets of TFBSs sharing a common sequence motif.
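The idea of a word's positional preference relative to the TSS can be illustrated with a small Python sketch. This is not the statistic used in the study above: the function names `word_positions` and `peak_bin`, the fixed-width binning, the uniform-background expectation, and the toy promoters are all assumptions made for illustration. Each promoter is assumed to end at the TSS and to be no longer than `span`.

```python
from collections import defaultdict

def word_positions(promoters, k=8):
    """Map each k-letter word to the start offsets (relative to the TSS)
    at which it occurs. Each promoter is assumed to end at the TSS,
    so its last base has offset -1."""
    positions = defaultdict(list)
    for seq in promoters:
        n = len(seq)
        for i in range(n - k + 1):
            positions[seq[i:i + k]].append(i - n)
    return positions

def peak_bin(offsets, span, bin_width):
    """Return (bin_start, observed, expected) for the most populated bin.
    'expected' assumes offsets are spread uniformly over [-span, 0),
    which is a deliberately crude background model."""
    n_bins = span // bin_width
    counts = [0] * n_bins
    for off in offsets:
        counts[(off + span) // bin_width] += 1
    best = max(range(n_bins), key=lambda b: counts[b])
    expected = len(offsets) / n_bins
    return best * bin_width - span, counts[best], expected

# Toy data (hypothetical): two 20-base promoters sharing one 6-letter word.
promoters = ["TTTTTTTTTTTTGATAAGCC", "CCCCCCCCCCCCGATAAGTT"]
pos = word_positions(promoters, k=6)
print(pos["GATAAG"])                                   # [-8, -8]
print(peak_bin(pos["GATAAG"], span=20, bin_width=5))   # (-10, 2, 0.5)
```

A word whose observed count in one bin far exceeds the uniform expectation is a candidate for positional preference; a real analysis would need a proper significance test and correction for the number of words examined.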
21
Features generated for computational splice-site prediction correspond to functional elements. BMC Bioinformatics 2007; 8:410. [PMID: 17958908 PMCID: PMC2241647 DOI: 10.1186/1471-2105-8-410] [Received: 03/19/2007] [Accepted: 10/24/2007]
Abstract
Background: Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. Results: We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Conclusion: Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.
22
Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res 2007; 35:6219-26. [PMID: 17855401 PMCID: PMC2094075 DOI: 10.1093/nar/gkm685]
Abstract
Mammalian promoters are categorized into TATA and CpG-related groups, and they have complementary roles associated with differentiated transcriptional characteristics. While the TATA box is also found in plant promoters, it is not known if CpG-type promoters exist in plants. Plant promoters contain Y Patches (pyrimidine patches) in the core promoter region, and whether these occur beyond higher plants is likewise not understood. Sets of promoter sequences were utilized for the analysis of local distribution of short sequences (LDSS), and approximately one thousand octamer sequences have been identified as promoter constituents in each of Arabidopsis, rice, human and mouse. Based on their localization profiles, the identified octamer sequences were classified into several major groups: REG (Regulatory Element Group), TATA box, Inr (Initiator), Kozak, CpG and Y Patch. Comparison of the four species has revealed three categories: (i) shared groups found in both plants and mammals (TATA box), (ii) common groups found in both kingdoms but with differentiated sequences (REG, Inr and Kozak) and (iii) specific groups found in either plants or mammals (CpG and Y Patch). Our comparative LDSS analysis has identified conservation and differentiation of promoter architectures between higher plants and mammals.
23
Abstract
Motivation: A tool that simultaneously aligns multiple protein sequences, automatically utilizes information about protein domains, and has a good compromise between speed and accuracy will have practical advantages over current tools. Results: We describe COBALT, a constraint-based alignment tool that implements a general framework for multiple alignment of protein sequences. COBALT finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, combines these pairwise constraints, and then incorporates them into a progressive multiple alignment. We show that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality. We also show that COBALT has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems. Availability: COBALT is included in the NCBI C++ toolkit. A Linux executable for COBALT and the CDD and PROSITE data used are available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt
24
Abstract
A multitude of motif-finding tools have been published, which can generally be assigned to one of three classes: expectation-maximization, Gibbs-sampling or enumeration. Irrespective of this grouping, most motif detection tools only take into account similarities across ungapped sequence regions, possibly causing short motifs located peripherally and at varying distances from a 'core' motif to be missed. We present a new method, adding to the set of expectation-maximization approaches, that permits the use of gapped alignments for motif elucidation. Availability: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/mclip.jar. Supplementary information: http://bioinfoserver.rsbs.anu.edu.au/utils/mclip/info.php.
25
Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements. BMC Bioinformatics 2006; 7:408. [PMID: 16961919 PMCID: PMC1599759 DOI: 10.1186/1471-2105-7-408] [Received: 06/14/2006] [Accepted: 09/08/2006]
Abstract
Background: Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. Results: We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. Conclusion: Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances.
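The scan-and-update loop described in this abstract can be sketched in Python as follows. This is a simplified illustration, not A-GLAM's implementation: it uses a plain log-odds score cutoff in place of E-values, a pseudocount in place of the Bayesian update, and a uniform background; the function names, the threshold of 5.0, and the toy sequences are all hypothetical.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from equal-length sites (pseudocounts stand in
    for a proper Bayesian update)."""
    width = len(sites[0])
    pssm = []
    for j in range(width):
        col = {b: pseudocount for b in BASES}
        for s in sites:
            col[s[j]] += 1
        total = sum(col.values())
        pssm.append({b: math.log2(col[b] / total / background) for b in BASES})
    return pssm

def score(pssm, window):
    """Sum of per-column log-odds for one window."""
    return sum(col[base] for col, base in zip(pssm, window))

def scan_to_convergence(sequences, seed_sites, threshold, max_iter=20):
    """Scan every window, let windows scoring >= threshold rebuild the PSSM,
    and repeat until the set of contributing windows stops changing."""
    width = len(seed_sites[0])
    sites = list(seed_sites)
    hits = set()
    for _ in range(max_iter):
        pssm = build_pssm(sites)
        new_hits = {
            (si, i)
            for si, seq in enumerate(sequences)
            for i in range(len(seq) - width + 1)
            if score(pssm, seq[i:i + width]) >= threshold
        }
        if new_hits == hits:  # converged: no change in contributing windows
            break
        hits = new_hits
        sites = [sequences[si][i:i + width] for si, i in sorted(hits)]
    return sorted(hits)

# Toy input (hypothetical): the seed holds one site per sequence, as a Gibbs
# sampler would report; the scan recovers the second instance in sequence 0.
sequences = ["CCCCTGACTCACCCCTGACTCACCCC", "GGGTGAGTCAGGG"]
seed = ["TGACTCA", "TGAGTCA"]
hits = scan_to_convergence(sequences, seed, threshold=5.0)
print(hits)  # [(0, 4), (0, 15), (1, 3)]
```

Each hit is a `(sequence_index, start)` pair. Replacing the fixed score cutoff with an E-value computed from the score distribution would bring the sketch closer to the procedure the abstract describes.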
26
Exopolysaccharide-associated protein sorting in environmental organisms: the PEP-CTERM/EpsH system. Application of a novel phylogenetic profiling heuristic. BMC Biol 2006; 4:29. [PMID: 16930487 PMCID: PMC1569441 DOI: 10.1186/1741-7007-4-29] [Received: 05/22/2006] [Accepted: 08/24/2006]
Abstract
Background: Protein translocation to the proper cellular destination may be guided by various classes of sorting signals recognizable in the primary sequence. Detection in some genomes, but not others, may reveal sorting system components by comparison of the phylogenetic profile of the class of sorting signal to that of various protein families. Results: We describe a short C-terminal homology domain, sporadically distributed in bacteria, with several key characteristics of protein sorting signals. The domain includes a near-invariant motif Pro-Glu-Pro (PEP). This possible recognition or processing site is followed by a predicted transmembrane helix and a cluster rich in basic amino acids. We designate this domain PEP-CTERM. It tends to occur multiple times in a genome if it occurs at all, with a median count of eight instances; Verrucomicrobium spinosum has sixty-five. PEP-CTERM-containing proteins generally contain an N-terminal signal peptide and exhibit high diversity and little homology to known proteins. All bacteria with PEP-CTERM have both an outer membrane and exopolysaccharide (EPS) production genes. By a simple heuristic for screening phylogenetic profiles in the absence of pre-formed protein families, we discovered that a homolog of the membrane protein EpsH (exopolysaccharide locus protein H) occurs in a species when PEP-CTERM domains are found. The EpsH family contains invariant residues consistent with a transpeptidase function. Most PEP-CTERM proteins are encoded by single-gene operons preceded by large intergenic regions. In the Proteobacteria, most of these upstream regions share a DNA sequence, a probable cis-regulatory site that contains a sigma-54 binding motif. The phylogenetic profile for this DNA sequence exactly matches that of three proteins: a sigma-54-interacting response regulator (PrsR), a transmembrane histidine kinase (PrsK), and a TPR protein (PrsT).
Conclusion: These findings are consistent with the hypothesis that PEP-CTERM and EpsH form a protein export sorting system, analogous to the LPXTG/sortase system of Gram-positive bacteria, and correlated to EPS expression. It occurs preferentially in bacteria from sediments, soils, and biofilms. The novel method that led to these findings, partial phylogenetic profiling, requires neither global sequence clustering nor arbitrary similarity cutoffs and appears to be a rapid, effective alternative to other profiling methods.
27
A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Res 2006; 34:2238-46. [PMID: 16670430 PMCID: PMC1456330 DOI: 10.1093/nar/gkl248]
Abstract
Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein–DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection.
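One simple way to picture "inexact sites" near a cognate site is to list every window within a few mismatches of a consensus. The Python sketch below does only that; it is an assumption-laden illustration rather than the method of the study above, which may define degeneracy differently. The function names, the E-box consensus `CACGTG` used as a stand-in for a c-myc site, and the toy sequence are chosen for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def degenerate_sites(seq, consensus, max_mismatches):
    """Return (start, mismatches) for every window of seq that differs
    from the consensus at no more than max_mismatches positions."""
    k = len(consensus)
    sites = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], consensus)
        if d <= max_mismatches:
            sites.append((i, d))
    return sites

# Toy sequence (hypothetical): one exact E-box and one single-mismatch copy.
seq = "AACACGTGTTCACGAGAA"
print(degenerate_sites(seq, "CACGTG", max_mismatches=1))  # [(2, 0), (10, 1)]
```

Comparing the density of such near-matches around known sites with their density elsewhere in the genome would be the next step toward testing for a nonrandom distribution.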