1
|
Using methylation data to improve transcription factor binding prediction. Epigenetics 2024; 19:2309826. [PMID: 38300850 PMCID: PMC10841018 DOI: 10.1080/15592294.2024.2309826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] Open
Abstract
Modelling the regulatory mechanisms that determine cell fate, response to external perturbation, and disease state depends on measuring many factors, a task made more difficult by the plasticity of the epigenome. Scanning the genome for the sequence patterns defined by Position Weight Matrices (PWM) can be used to estimate transcription factor (TF) binding locations. However, this approach does not incorporate information regarding the epigenetic context necessary for TF binding. CpG methylation is an epigenetic mark influenced by environmental factors that is commonly assayed in human cohort studies. We developed a framework to score inferred TF binding locations using methylation data. We intersected motif locations identified using PWMs with methylation information captured in both whole-genome bisulfite sequencing and Illumina EPIC array data for six cell lines, scored motif locations based on these data, and compared with experimental data characterizing TF binding (ChIP-seq). We found that for most TFs, binding prediction improves using methylation-based scoring compared to standard PWM-scores. We also illustrate that our approach can be generalized to infer TF binding when methylation information is only proximally available, i.e. measured for nearby CpGs that do not directly overlap with a motif location. Overall, our approach provides a framework for inferring context-specific TF binding using methylation data. Importantly, the availability of DNA methylation data in existing patient populations provides an opportunity to use our approach to understand the impact of methylation on gene regulatory processes in the context of human disease.
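The idea of scanning with a PWM and then scoring each motif location with overlapping methylation data can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy matrix, the function names, and the scoring rule (one minus the mean methylation of CpGs overlapping the motif) are hypothetical and are not taken from the paper, which does not specify its scoring formula in the abstract. The rule also assumes that methylation lowers binding, which does not hold for every TF.

```python
import math

# Toy position frequency matrix for a 4-bp motif (hypothetical values).
# Each entry is one motif position; probabilities in each entry sum to 1.
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # uniform background frequency (assumption)
WIDTH = len(PFM)


def pwm_score(window):
    """Log-odds score (bits) of one window against the uniform background."""
    return sum(math.log2(PFM[i][base] / BACKGROUND) for i, base in enumerate(window))


def scan(sequence, threshold=0.0):
    """Return (start, score) for every window scoring at or above threshold."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        score = pwm_score(sequence[start:start + WIDTH])
        if score >= threshold:
            hits.append((start, score))
    return hits


def methylation_score(start, beta_by_position):
    """Hypothetical rule: 1 - mean beta of CpGs overlapping the motif.

    beta_by_position maps a 0-based sequence position to a methylation
    level in [0, 1]. Returns None when no measured CpG overlaps the motif.
    """
    betas = [beta_by_position[p] for p in range(start, start + WIDTH)
             if p in beta_by_position]
    if not betas:
        return None
    return 1.0 - sum(betas) / len(betas)
```

With this sketch, `scan("TTACGTTT")` finds the single `ACGT` match at position 2, and `methylation_score(2, {3: 0.8})` down-weights it because the overlapping CpG is highly methylated.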
|
2
|
Detecting clusters of transcription factors based on a nonhomogeneous poisson process model. BMC Bioinformatics 2022; 23:535. [PMID: 36494794 PMCID: PMC9738027 DOI: 10.1186/s12859-022-05090-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Accepted: 11/30/2022] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Rapidly growing genome-wide ChIP-seq data have provided unprecedented opportunities to explore transcription factor (TF) binding under various cellular conditions. Despite the rich resources, development of analytical methods for studying the interaction among TFs in gene regulation still lags behind. RESULTS In order to address cooperative TF binding and detect TF clusters with coordinative functions, we have developed novel computational methods based on clustering the sample paths of nonhomogeneous Poisson processes. Simulation studies demonstrated the capability of these methods to accurately detect TF clusters and uncover the hierarchy of TF interactions. A further application to the multiple-TF ChIP-seq data in mouse embryonic stem cells (ESCs) showed that our methods identified the cluster of core ESC regulators reported in the literature and provided new insights on functional implications of transcriptional regulatory modules. CONCLUSIONS Effective analytical tools are essential for studying protein-DNA relations. Information derived from this research will help us better understand the orchestration of transcription factors in gene regulation processes.
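To make the notion of a sample path of a nonhomogeneous Poisson process concrete, the sketch below simulates event positions by thinning (the Lewis-Shedler approach). It does not reproduce the authors' clustering method; the function name, the rate function, and the interval are assumptions chosen for illustration.

```python
import math
import random


def simulate_nhpp(rate, rate_max, t_end, rng):
    """Simulate event times on [0, t_end] for a time-varying rate.

    Thinning: draw candidates from a homogeneous process with intensity
    rate_max, then keep each candidate at time t with probability
    rate(t) / rate_max. Requires rate(t) <= rate_max on the interval.
    """
    events = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)  # gap to the next candidate event
        if t > t_end:
            break
        if rng.random() < rate(t) / rate_max:
            events.append(t)
    return events


if __name__ == "__main__":
    # Hypothetical binding intensity that oscillates along a region.
    path = simulate_nhpp(lambda t: 5.0 * (1.0 + math.sin(t)), 10.0, 20.0,
                         random.Random(0))
    print(len(path), "events")
```

Each TF would contribute one such path of binding positions; the paper's contribution is in how paths are compared and clustered, which this sketch leaves out.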
|
3
|
Identification of Enriched Regions in ChIP-Seq Data via a Linear-Time Multi-Level Thresholding Algorithm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2842-2850. [PMID: 34398762 DOI: 10.1109/tcbb.2021.3104734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has emerged as a superior alternative to microarray technology as it provides higher resolution, less noise, greater coverage and wider dynamic range. While ChIP-Seq enables probing of DNA-protein interactions over the entire genome, it requires the use of sophisticated tools to recognize hidden patterns and extract meaningful data. Over the years, various attempts have resulted in several algorithms that use different heuristics to accurately determine individual peaks corresponding to unique DNA-protein interactions. However, finding all the significant peaks with high accuracy in a reasonable time is still a challenge. In this work, we propose the use of a multi-level thresholding algorithm, which we call LinMLTBS, to identify the enriched regions in ChIP-Seq data. Although various suboptimal heuristics have been proposed for multi-level thresholding, we emphasize the use of an algorithm capable of obtaining an optimal solution while maintaining linear-time complexity. Testing various algorithms on various ENCODE project datasets shows that our approach attains higher accuracy relative to previously proposed peak finders while retaining a reasonable processing speed.
|
4
|
A sequence-based two-layer predictor for identifying enhancers and their strength through enhanced feature extraction. J Bioinform Comput Biol 2022; 20:2250005. [PMID: 35264081 DOI: 10.1142/s0219720022500056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Enhancers are short regulatory DNA fragments that are bound by proteins called activators. They are free-bound and distant elements, which play a vital role in controlling gene expression. It is challenging to identify enhancers and their strength due to their dynamic nature. Although some machine learning methods exist to accelerate the identification process, their prediction accuracy and efficiency need further improvement. In this regard, we propose a two-layer prediction model with an enhanced feature extraction strategy that combines features from an improved position-specific amino acid propensity (PSTKNC) method with Enhanced Nucleic Acid Composition (ENAC) and Composition of k-spaced Nucleic Acid Pairs (CKSNAP). The feature sets from all three feature extraction approaches were concatenated and then passed through a simple artificial neural network (ANN) to identify enhancers in the first layer and their strength in the second layer. Experiments are conducted on a benchmark chromatin dataset of nine cell lines. A 10-fold cross-validation method is employed to evaluate the model's performance. The results show that the proposed model performs well, with an accuracy of 94.50% and a Matthews correlation coefficient (MCC) of 0.8903 in predicting enhancers, and that it also performs fairly well on an independent test when compared with other existing methods.
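Two of the quantities named in this abstract are simple enough to sketch. The block below computes CKSNAP features as they are commonly defined (the frequency of each of the 16 nucleotide pairs separated by a gap of 0 to `max_gap` bases) and the Matthews correlation coefficient from a confusion matrix. The function names and the dictionary layout are assumptions for illustration, and the sketch is not the authors' pipeline, which also uses PSTKNC and ENAC features and a neural network.

```python
import math
from itertools import product

# All 16 ordered nucleotide pairs: AA, AC, ..., TT.
PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]


def cksnap(sequence, max_gap=1):
    """Composition of k-spaced nucleic acid pairs.

    For each gap g in 0..max_gap, count pairs (x, y) where y lies g + 1
    positions after x, and divide by the number of such windows.
    Returns a dict keyed by (gap, pair).
    """
    features = {}
    for gap in range(max_gap + 1):
        n_windows = len(sequence) - gap - 1
        counts = dict.fromkeys(PAIRS, 0)
        for i in range(n_windows):
            pair = sequence[i] + sequence[i + gap + 1]
            if pair in counts:  # skip ambiguous bases such as N
                counts[pair] += 1
        for pair in PAIRS:
            features[(gap, pair)] = counts[pair] / n_windows if n_windows > 0 else 0.0
    return features


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

For the sequence `ACGT` with `max_gap=1`, the feature for `(0, "AC")` is 1/3 and the feature for `(1, "AG")` is 1/2, giving a 32-dimensional vector overall.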
|
5
|
Enhancers as potential targets for engineering salinity stress tolerance in crop plants. PHYSIOLOGIA PLANTARUM 2021; 173:1382-1391. [PMID: 33837536 DOI: 10.1111/ppl.13421] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/08/2021] [Revised: 03/19/2021] [Accepted: 04/06/2021] [Indexed: 06/12/2023]
Abstract
Enhancers represent noncoding regulatory regions of the genome located distantly from their target genes. They regulate gene expression programs in a context-specific manner by interacting with promoters of one or more target genes and are generally associated with transcription factor binding sites and epi(genomic)/chromatin features, such as regions of chromatin accessibility and histone modifications. Enhancers are difficult to identify due to the modularity of their associated features. Although enhancers have been studied extensively in humans and animals, only a handful have been identified in a few plant species to date, owing to the lack of plant-specific experimental and computational approaches for their discovery. Being an important regulatory component of the genome, enhancers represent potential targets for engineering agronomic traits, including salinity stress tolerance in plants. Here, we provide a review of the available experimental and computational approaches along with the associated sequence and chromatin/epigenetic features for the discovery of enhancers in plants. In addition, we provide insights into the challenges and future prospects of enhancer research in plant biology with emphasis on potential applications in engineering salinity stress tolerance in crop plants.
|
6
|
CrepHAN: Cross-species prediction of enhancers by using hierarchical attention networks. Bioinformatics 2021; 37:3436-3443. [PMID: 33978703 DOI: 10.1093/bioinformatics/btab349] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/21/2020] [Revised: 04/21/2021] [Accepted: 05/06/2021] [Indexed: 01/17/2023] Open
Abstract
MOTIVATION Enhancers are important functional elements in genome sequences. The identification of enhancers is a very challenging task due to the great diversity of enhancer sequences and their flexible localization on genomes. The interactions between enhancers and genes have not yet been fully understood. To speed up studies of the regulatory roles of enhancers, computational tools for the prediction of enhancers have emerged in recent years. In particular, thanks to the ENCODE project and advances in high-throughput experimental techniques, a large number of experimentally verified enhancers have been annotated on the human genome, which allows large-scale predictions of unknown enhancers using data-driven methods. However, except for human and some model organisms, validated enhancer annotations are scarce for most species, which makes the computational identification of enhancers in their genomes more difficult. RESULTS In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which features a hierarchical attention neural network and word embedding-based representations for DNA sequences. We use the experimentally supported data of the human genome to train the model, and perform experiments on human and other mammals, including mouse, cow, and dog. The experimental results show that CrepHAN has advantages in cross-species predictions and outperforms the existing models by a large margin. In particular, for human-mouse cross-predictions, the AUC score of the ROC curve is increased by 0.033∼0.145 on the combined tissue dataset and 0.032∼0.109 on tissue-specific datasets. AVAILABILITY bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
|
7
|
Inferring time series chromatin states for promoter-enhancer pairs based on Hi-C data. BMC Genomics 2021; 22:84. [PMID: 33509077 PMCID: PMC7841892 DOI: 10.1186/s12864-021-07373-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/10/2020] [Accepted: 01/07/2021] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND Co-localized combinations of histone modifications ("chromatin states") have been shown to correlate with promoter and enhancer activity. Changes in chromatin states over multiple time points ("chromatin state trajectories") have previously been analyzed at promoters and enhancers separately. With the advent of time series Hi-C data it is now possible to connect promoters and enhancers and to analyze chromatin state trajectories at promoter-enhancer pairs. RESULTS We present TimelessFlex, a framework for investigating chromatin state trajectories at promoters and enhancers and at promoter-enhancer pairs based on Hi-C information. TimelessFlex extends our previous approach Timeless, a Bayesian network for clustering multiple histone modification data sets at promoter and enhancer feature regions. We utilize time series ATAC-seq data measuring open chromatin to define promoters and enhancer candidates. We developed an expectation-maximization algorithm to assign promoters and enhancers to each other based on Hi-C interactions and jointly cluster their feature regions into paired chromatin state trajectories. We find jointly clustered promoter-enhancer pairs showing the same activation patterns on both sides but with a stronger trend at the enhancer side. While the promoter side remains accessible across the time series, the enhancer side becomes dynamically more open towards the gene activation time point. Promoter cluster patterns show strong correlations with gene expression signals, whereas Hi-C signals get only slightly stronger towards activation. The code of the framework is available at https://github.com/henriettemiko/TimelessFlex. CONCLUSIONS TimelessFlex clusters time series histone modifications at promoter-enhancer pairs based on Hi-C and it can identify distinct chromatin states at promoter and enhancer feature regions and their changes over time.
|
8
|
MEIRLOP: improving score-based motif enrichment by incorporating sequence bias covariates. BMC Bioinformatics 2020; 21:410. [PMID: 32938397 PMCID: PMC7493370 DOI: 10.1186/s12859-020-03739-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/25/2020] [Accepted: 09/04/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Motif enrichment analysis (MEA) identifies over-represented transcription factor (TF) binding motifs in the DNA sequence of regulatory regions, enabling researchers to infer which transcription factors can regulate transcriptional response to a stimulus, or identify sequence features found near a target protein in a ChIP-seq experiment. Score-based MEA determines motifs enriched in regions exhibiting extreme differences in regulatory activity, but existing methods do not control for biases in GC content or dinucleotide composition. This lack of control for sequence bias, such as that often found in CpG islands, can obscure the enrichment of biologically relevant motifs. RESULTS We developed Motif Enrichment In Ranked Lists of Peaks (MEIRLOP), a novel MEA method that determines enrichment of TF binding motifs in a list of scored regulatory regions, while controlling for sequence bias. In this study, we compare MEIRLOP against other MEA methods in identifying binding motifs found enriched in differentially active regulatory regions after interferon-beta stimulus, finding that using logistic regression and covariates improves the ability to call enrichment of ISGF3 binding motifs from differential acetylation ChIP-seq data compared to other methods. Our method achieves similar or better performance compared to other methods when quantifying the enrichment of TF binding motifs from ENCODE TF ChIP-seq datasets. We also demonstrate how MEIRLOP is broadly applicable to the analysis of numerous types of NGS assays and experimental designs. CONCLUSIONS Our results demonstrate the importance of controlling for sequence bias when accurately identifying enriched DNA sequence motifs using score-based MEA. MEIRLOP is available for download from https://github.com/npdeloss/meirlop under the MIT license.
|
9
|
On the problem of confounders in modeling gene expression. Bioinformatics 2019; 35:711-719. [PMID: 30084962 PMCID: PMC6530814 DOI: 10.1093/bioinformatics/bty674] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Revised: 06/21/2018] [Accepted: 08/02/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information Supplementary data are available at Bioinformatics online.
|
10
|
Functional characterization of two enhancers located downstream FOXP2. BMC MEDICAL GENETICS 2019; 20:65. [PMID: 31046704 PMCID: PMC6498672 DOI: 10.1186/s12881-019-0810-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/09/2019] [Accepted: 04/17/2019] [Indexed: 01/01/2023]
Abstract
BACKGROUND Mutations in the coding region of FOXP2 are known to cause speech and language impairment. However, it is not clear how dysregulation of the gene contributes to language deficit. Interestingly, microdeletions of the region downstream of the gene have been associated with cognitive deficits. METHODS Here, we investigate changes in FOXP2 expression in the SK-N-MC neuroblastoma human cell line after deletion by CRISPR-Cas9 of two enhancers located downstream of the gene. RESULTS Deletion of either of these two functional enhancers downregulates FOXP2, but also upregulates the closest 3' gene MDFIC. Because this effect is not statistically significant in a HEK 293 cell line, derived from the human kidney, both enhancers might confer a tissue-specific regulation to both genes. We have also found that the deletion of either of these enhancers downregulates six well-known FOXP2 target genes in the SK-N-MC cell line. CONCLUSIONS We expect these findings to contribute to a deeper understanding of how FOXP2 and MDFIC are regulated to pace neuronal development supporting cognition, speech and language.
|
11
|
Changes in long-range rDNA-genomic interactions associate with altered RNA polymerase II gene programs during malignant transformation. Commun Biol 2019; 2:39. [PMID: 30701204 PMCID: PMC6349880 DOI: 10.1038/s42003-019-0284-y] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 03/04/2018] [Accepted: 12/28/2018] [Indexed: 12/15/2022] Open
Abstract
The three-dimensional organization of the genome contributes to its maintenance and regulation. While chromosomal regions associate with nucleolar ribosomal RNA genes (rDNA), the biological significance of rDNA-genome interactions and whether they are dynamically regulated during disease remain unclear. rDNA chromatin exists in multiple inactive and active states and their transition is regulated by the RNA polymerase I transcription factor UBTF. Here, using a MYC-driven lymphoma model, we demonstrate that during malignant progression the rDNA chromatin converts to the open state, which is required for tumor cell survival. Moreover, this rDNA transition co-occurs with a reorganization of rDNA-genome contacts which correlate with gene expression changes at associated loci, impacting gene ontologies including B-cell differentiation, cell growth and metabolism. We propose that UBTF-mediated conversion to open rDNA chromatin during malignant transformation contributes to the regulation of specific gene pathways that regulate growth and differentiation through reformed long-range physical interactions with the rDNA.
|
12
|
Epigenetic impacts of stress priming of the neuroinflammatory response to sarin surrogate in mice: a model of Gulf War illness. J Neuroinflammation 2018; 15:86. [PMID: 29549885 PMCID: PMC5857314 DOI: 10.1186/s12974-018-1113-9] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 12/22/2017] [Accepted: 03/01/2018] [Indexed: 12/12/2022] Open
Abstract
Background Gulf War illness (GWI) is an archetypal, medically unexplained, chronic condition characterised by persistent sickness behaviour and neuroimmune and neuroinflammatory components. An estimated 25–32% of the over 900,000 veterans of the 1991 Gulf War fulfil the requirements of a GWI diagnosis. It has been hypothesised that the high physical and psychological stress of combat may have increased vulnerability to irreversible acetylcholinesterase (AChE) inhibitors leading to a priming of the neuroimmune system. A number of studies have linked high levels of psychophysiological stress and toxicant exposures to epigenetic modifications that regulate gene expression. Recent research in a mouse model of GWI has shown that pre-exposure with the stress hormone corticosterone (CORT) causes an increase in expression of specific chemokines and cytokines in response to diisopropyl fluorophosphate (DFP), a sarin surrogate and irreversible AChE inhibitor. Methods C57BL/6J mice were exposed to CORT for 4 days, and exposed to DFP on day 5, before sacrifice 6 h later. The transcriptome was examined using RNA-seq, and the epigenome was examined using reduced representation bisulfite sequencing and H3K27ac ChIP-seq. Results We show transcriptional, histone modification (H3K27ac) and DNA methylation changes in genes related to the immune and neuronal system, potentially relevant to neuroinflammatory and cognitive symptoms of GWI. Further evidence suggests altered proportions of myelinating oligodendrocytes in the frontal cortex, perhaps connected to white matter deficits seen in GWI sufferers. Conclusions Our findings may reflect the early changes which occurred in GWI veterans, and we observe alterations in several pathways altered in GWI sufferers. These close links to changes seen in veterans with GWI indicate that this model reflects the environmental exposures related to GWI and may provide a model for biomarker development and testing future treatments.
Electronic supplementary material The online version of this article (10.1186/s12974-018-1113-9) contains supplementary material, which is available to authorized users.
|
13
|
Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res 2017; 45:e27. [PMID: 27899659 PMCID: PMC5389469 DOI: 10.1093/nar/gkw1036] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 04/07/2016] [Accepted: 10/19/2016] [Indexed: 02/06/2023] Open
Abstract
Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes.
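As background for the information theory-based matrices mentioned here, the sketch below computes the standard Shannon information content of a motif column (2 bits minus the column entropy) and the individual information of a single site (the sum of 2 + log2 of the frequency of the observed base at each position). It omits the small-sample correction and does not reproduce the paper's entropy minimization pipeline; the function names and the toy matrix in the usage are assumptions.

```python
import math


def column_information(column):
    """Information content (bits) of one motif column.

    column maps each base to its frequency; frequencies sum to 1.
    The maximum is 2 bits for DNA, reached when one base has frequency 1.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def site_information(site, matrix):
    """Individual information (bits) of one site under a frequency matrix.

    Returns -inf if the site contains a base never seen at that position.
    """
    total = 0.0
    for base, column in zip(site, matrix):
        p = column.get(base, 0.0)
        if p == 0:
            return float("-inf")
        total += 2.0 + math.log2(p)
    return total
```

A column split evenly between A and C carries 1 bit, so a site matching that column and then a fully conserved column scores 1 + 2 = 3 bits. Stronger predicted sites have higher individual information.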
|
14
|
Direct GR Binding Sites Potentiate Clusters of TF Binding across the Human Genome. Cell 2016; 166:1269-1281.e19. [PMID: 27565349 DOI: 10.1016/j.cell.2016.07.049] [Citation(s) in RCA: 98] [Impact Index Per Article: 12.3] [Received: 02/15/2016] [Revised: 05/12/2016] [Accepted: 07/27/2016] [Indexed: 12/21/2022]
Abstract
The glucocorticoid receptor (GR) binds the human genome at >10,000 sites but only regulates the expression of hundreds of genes. To determine the functional effect of each site, we measured the glucocorticoid (GC) responsive activity of nearly all GR binding sites (GBSs) captured using chromatin immunoprecipitation (ChIP) in A549 cells. 13% of GBSs assayed had GC-induced activity. The responsive sites were defined by direct GR binding via a GC response element (GRE) and exclusively increased reporter-gene expression. Meanwhile, most GBSs lacked GC-induced reporter activity. The non-responsive sites had epigenetic features of steady-state enhancers and clustered around direct GBSs. Together, our data support a model in which clusters of GBSs observed with ChIP-seq reflect interactions between direct and tethered GBSs over tens of kilobases. We further show that those interactions can synergistically modulate the activity of direct GBSs and may therefore play a major role in driving gene activation in response to GCs.
|
15
|
ChARM: Discovery of combinatorial chromatin modification patterns in hepatitis B virus X-transformed mouse liver cancer using association rule mining. BMC Bioinformatics 2016; 17:452. [PMID: 28105934 PMCID: PMC5249029 DOI: 10.1186/s12859-016-1307-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/25/2022] Open
Abstract
Background Various chromatin modifications, identified in large-scale epigenomic analyses, are associated with distinct phenotypes of different cells and disease phases. To improve our understanding of these variations, many computational methods have been developed to discover novel sites and cell-specific chromatin modifications. Despite the availability of existing methods, there is still room for further improvement when they are applied to resolve the histone code hypothesis. Hence, we aim to investigate the development of a computational method to provide new insights into de novo combinatorial pattern discovery of chromatin modifications to characterize epigenetic variations in distinct phenotypes of different cells. Results We report a new computational approach, ChARM (Combinatorial Chromatin Modification Patterns using Association Rule Mining), that can be employed for the discovery of de novo combinatorial patterns of differential chromatin modifications. We used ChARM to analyse chromatin modification data from the livers of normal (non-cancerous) mice and hepatitis B virus X (HBx)-transgenic mice with hepatocellular carcinoma, and discovered 2,409 association rules representing combinatorial chromatin modification patterns. Among these, the combination of three histone modifications, a loss of H3K4Me3 and gains of H3K27Me3 and H3K36Me3, was the most striking pattern associated with the cancer. This pattern was enriched in functional elements of the mouse genome such as promoters, coding exons and 5′UTR with high CpG content, and CpG islands. It also showed strong correlations with polymerase activity at promoters and DNA methylation levels at gene bodies. We found that 30% of the genes associated with the pattern were differentially expressed in the HBx compared to the normal, and 78.9% of these genes were down-regulated. The significant canonical pathways (Wnt/β-catenin, cAMP, Ras, and Notch signalling) that were enriched in the pattern could account for the pathogenesis of HBx. Conclusions ChARM, an unsupervised method for discovering combinatorial chromatin modification patterns, can identify histone modifications that occur globally. ChARM provides a scalable framework that can easily be applied to find various levels of combination patterns, which should reflect a range of globally common to locally rare chromatin modifications. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1307-z) contains supplementary material, which is available to authorized users.
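The two basic measures behind association rule mining, support and confidence, can be sketched as follows. The regions and their modification changes in the usage are invented toy data, not results from the paper, and the function names are assumptions; ChARM's actual rule generation and filtering are not shown.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent); 0.0 if antecedent never occurs."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / s_antecedent


if __name__ == "__main__":
    # Hypothetical genomic regions, each a set of observed modification changes.
    regions = [
        {"H3K4Me3_loss", "H3K27Me3_gain", "H3K36Me3_gain"},
        {"H3K4Me3_loss", "H3K27Me3_gain"},
        {"H3K27Me3_gain"},
        {"H3K4Me3_loss", "H3K36Me3_gain"},
    ]
    print(support({"H3K4Me3_loss", "H3K27Me3_gain"}, regions))
    print(confidence({"H3K4Me3_loss"}, {"H3K27Me3_gain"}, regions))
```

A rule such as "loss of H3K4Me3 implies gain of H3K27Me3" is typically kept only when both its support and its confidence exceed chosen thresholds.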
|
16
|
Abstract
Ribosomal RNA genes are highly repetitive and therefore not annotated in genome assemblies. We recently analyzed the epigenetic and architectural regulation of murine ribosomal genes by the Transcription Termination Factor I and made use of genome-wide histone modification ChIP-seq data. This method paper describes how repetitive genomic regions can be integrated into custom genomic assemblies and be used with genome-wide profiling data.
|
17
|
Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinform 2015; 17:967-979. [PMID: 26634919 PMCID: PMC5142011 DOI: 10.1093/bib/bbv101] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 08/27/2015] [Revised: 10/22/2015] [Indexed: 12/20/2022] Open
Abstract
Enhancers are cis-acting DNA elements that play critical roles in distal regulation of gene expression. Identifying enhancers is an important step for understanding distinct gene expression programs that may reflect normal and pathogenic cellular conditions. Experimental identification of enhancers is constrained by the set of conditions used in the experiment. This requires multiple experiments to identify enhancers, as they can be active under specific cellular conditions but not in different cell types/tissues or cellular states. This has opened prospects for computational prediction methods that can be used for high-throughput identification of putative enhancers to complement experimental approaches. Potential functions and properties of predicted enhancers have been catalogued and summarized in several enhancer-oriented databases. Because the current methods for the computational prediction of enhancers produce significantly different enhancer predictions, it will be beneficial for the research community to have an overview of the strategies and solutions developed in this field. In this review, we focus on the identification and analysis of enhancers by bioinformatics approaches. First, we describe a general framework for computational identification of enhancers, present relevant data types and discuss possible computational solutions. Next, we cover over 30 existing computational enhancer identification methods that were developed since 2000. Our review highlights advantages, limitations and potentials, while suggesting pragmatic guidelines for development of more efficient computational enhancer prediction methods. Finally, we discuss challenges and open problems of this topic, which require further consideration.
|
18
|
Compound hierarchical correlated beta mixture with an application to cluster mouse transcription factor DNA binding data. Biostatistics 2015; 16:641-54. [PMID: 25964663 DOI: 10.1093/biostatistics/kxv016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/28/2014] [Accepted: 04/10/2015] [Indexed: 11/12/2022]
Abstract
Modeling correlation structures is a challenge in bioinformatics, especially when dealing with high-throughput genomic data. A compound hierarchical correlated beta mixture (CBM) with an exchangeable correlation structure is proposed to cluster genetic vectors into mixture components. The correlation coefficient, [Formula: see text], is homogeneous within a mixture component and heterogeneous between mixture components. A random CBM with [Formula: see text] brings more flexibility in explaining correlation variations among genetic variables. The Expectation-Maximization (EM) algorithm and the Stochastic Expectation-Maximization (SEM) algorithm are used to estimate the parameters of the CBM. The number of mixture components can be determined using model selection criteria such as AIC, BIC and ICL-BIC. Extensive simulation studies were conducted to compare EM, SEM and the model selection criteria. Simulation results suggest that the CBM outperforms the traditional beta mixture model, with lower estimation bias and higher classification accuracy. The proposed method is applied to cluster transcription factor-DNA binding probabilities in mouse genome data generated by Lahdesmaki and others (2008, Probabilistic inference of transcription factor binding from multiple data sources. PLoS One, 3: e1820). The results reveal distinct clusters of transcription factors when binding to promoter regions of genes in the JAK-STAT, MAPK and two other pathways.
|
19
|
Modeling the relationship of epigenetic modifications to transcription factor binding. Nucleic Acids Res 2015; 43:3873-85. [PMID: 25820421 PMCID: PMC4417166 DOI: 10.1093/nar/gkv255] [Citation(s) in RCA: 74] [Impact Index Per Article: 8.2] [Received: 03/11/2015] [Accepted: 03/12/2015] [Indexed: 12/19/2022]
Abstract
Transcription factors (TFs) and epigenetic modifications play crucial roles in the regulation of gene expression, and correlations between the two types of factors have been discovered. However, methods for quantitatively studying these correlations remain limited. Here, we present a computational approach to systematically investigate how epigenetic changes in chromatin architectures or DNA sequences relate to TF binding. We implemented statistical analyses to illustrate that epigenetic modifications are predictive of TF binding affinities, without the need for sequence information. Intriguingly, by considering genome locations relative to transcription start sites (TSSs) or enhancer midpoints, our analyses show that different locations display different relationship patterns. For instance, H3K4me3, H3K9ac and H3K27ac contribute more in the regions near TSSs, whereas H3K4me1 and H3K79me2 dominate in the regions far from TSSs. DNA methylation plays a relatively more important role close to TSSs than in other regions. In addition, the results show that epigenetic modification models for the prediction of TF binding affinities are cell line-specific. Taken together, our study elucidates highly coordinated, but location- and cell type-specific, relationships between epigenetic modifications and binding affinities of TFs.
|
20
|
Chromatin properties of regulatory DNA probed by manipulation of transcription factors. J Comput Biol 2014; 21:569-77. [PMID: 24918633 DOI: 10.1089/cmb.2013.0126] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
Transcription factors (TFs) bind to DNA and regulate the transcription of nearby genes. However, only a small fraction of TF binding sites have such regulatory effects. Here we search for predictors of functional binding sites by carrying out a systematic computational screening of a variety of contextual factors (histone modifications, nuclear lamin binding, and cofactor binding). We used regression analysis to test whether contextual factors are associated with upregulation or downregulation of neighboring genes following the induction or knockdown of nine TFs in mouse embryonic stem (ES) cells. Functional TF binding sites appeared to be either active (i.e., bound by P300, CHD7, mediator, cohesin, and SWI/SNF) or repressed (i.e., with H3K27me3 histone marks and bound by Polycomb factors). Active binding sites mediated the downregulation of nearby genes upon knocking down the activating TFs or inducing repressors. Repressed TF binding sites mediated the upregulation of nearby genes (e.g., poised developmental regulators) upon inducing TFs. In addition, repressed binding sites mediated repressive effects of TFs, identified by the downregulation of target genes after the induction of TFs or by the upregulation of target genes after the knockdown of TFs. The contextual factors associated with functions of DNA-bound TFs were used to improve the identification of candidate target genes regulated by TFs.
|
21
|
Alterations in DNA methylation of Fkbp5 as a determinant of blood-brain correlation of glucocorticoid exposure. Psychoneuroendocrinology 2014; 44:112-22. [PMID: 24767625 PMCID: PMC4047971 DOI: 10.1016/j.psyneuen.2014.03.003] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.9] [Received: 12/03/2013] [Revised: 02/17/2014] [Accepted: 03/10/2014] [Indexed: 11/21/2022]
Abstract
BACKGROUND: Epigenetic studies that utilize peripheral tissues to identify molecular substrates of neuropsychiatric disorders rely on the assumption that disease-relevant cellular alterations that occur in the brain are mirrored and detectable in peripheral tissues such as blood. We sought to test this assumption by using a mouse model of Cushing's disease and asking whether epigenetic changes induced by glucocorticoids can be correlated between these tissue types. METHODS: Mice were treated with different doses of glucocorticoids in their drinking water for four weeks to assess gene expression and DNA methylation (DNAm) changes in the stress response gene Fkbp5. RESULTS: Significant linear relationships were observed between DNAm and four-week mean plasma corticosterone levels for both blood (R² = 0.68, P = 7.1 × 10⁻¹⁰) and brain (R² = 0.33, P = 0.001). Further, the degree of methylation change in blood correlated significantly with both methylation (R² = 0.49, P = 2.7 × 10⁻⁵) and expression (R² = 0.43, P = 3.5 × 10⁻⁵) changes in hippocampus, with the notable observation that methylation changes occurred at different intronic regions between blood and brain tissues. CONCLUSION: Although our findings are limited to several intronic CpGs in a single gene, our results demonstrate that DNA from blood can be used to assess dynamic, glucocorticoid-induced changes occurring in the brain. However, for such correlation analyses to be effective, tissue-specific locations of these epigenetic changes may need to be considered when investigating brain-relevant changes in peripheral tissues.
|
22
|
Chromatin-specific regulation of mammalian rDNA transcription by clustered TTF-I binding sites. PLoS Genet 2013; 9:e1003786. [PMID: 24068958 PMCID: PMC3772059 DOI: 10.1371/journal.pgen.1003786] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 01/28/2013] [Accepted: 07/26/2013] [Indexed: 12/04/2022]
Abstract
Enhancers and promoters often contain multiple binding sites for the same transcription factor, suggesting that homotypic clustering of binding sites may serve a role in transcription regulation. Here we show that clustering of binding sites for the transcription termination factor TTF-I downstream of the pre-rRNA coding region specifies transcription termination, increases the efficiency of transcription initiation and affects the three-dimensional structure of rRNA genes. On chromatin templates, but not on free rDNA, clustered binding sites promote cooperative binding of TTF-I, loading TTF-I to the downstream terminators before it binds to the rDNA promoter. Interaction of TTF-I with target sites upstream and downstream of the rDNA transcription unit connects these distal DNA elements by forming a chromatin loop between the rDNA promoter and the terminators. The results imply that clustered binding sites increase the binding affinity of transcription factors in chromatin, thus influencing the timing and strength of DNA-dependent processes.
The sequence-specific binding of proteins to regulatory regions controls gene expression. Binding sites for transcription factors are rather short and present several million times in large genomes. However, only a small number of these binding sites are functionally important. How proteins can discriminate and select their functional regions is not clear to date. Regulatory loci like gene promoters and enhancers commonly comprise multiple binding sites for either one factor or a combination of several DNA binding proteins, allowing efficient factor recruitment. We studied the cluster of TTF-I binding sites downstream of the rRNA gene and identified that cooperative binding to the multimeric termination sites in combination with low-affinity binding of TTF-I to individual sites upstream of the gene serves multiple regulatory functions. Packaging of the clustered sites into chromatin is a prerequisite for high-affinity binding, coordinated activation of transcription and the formation of a chromatin loop between the promoter and the terminator.
|
23
|
Abstract
Genome-wide binding assays can determine where individual transcription factors bind in the genome. However, these factors rarely bind chromatin alone, but instead frequently bind to cis-regulatory elements (CREs) together with other factors thus forming protein complexes. Currently there are no integrative analytical approaches that can predict which complexes are formed on chromatin. Here, we describe a computational methodology to systematically capture protein complexes and infer their impact on gene expression. We applied our method to three human cell types, identified thousands of CREs, inferred known and undescribed complexes recruited to these CREs, and determined the role of the complexes as activators or repressors. Importantly, we found that the predicted complexes have a higher number of physical interactions between their members than expected by chance. Our work provides a mechanism for developing hypotheses about gene regulation via binding partners, and deciphering the interplay between combinatorial binding and gene expression.
|
24
|
Transcription factor and chromatin features predict genes associated with eQTLs. Nucleic Acids Res 2013; 41:1450-63. [PMID: 23275551 PMCID: PMC3561974 DOI: 10.1093/nar/gks1339] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 06/25/2012] [Revised: 11/28/2012] [Accepted: 12/01/2012] [Indexed: 01/11/2023]
Abstract
Cell type-specific gene expression in humans involves complex interactions between regulatory factors and DNA at enhancers and promoters. Mapping studies for expression quantitative trait loci (eQTLs), transcription factors (TFs) and chromatin markers have become widely used tools for identifying gene regulatory elements, but prediction of target genes remains a major challenge. Here, we integrate genome-wide data on TF-binding sites, chromatin markers and functional annotations to predict genes associated with human eQTLs. Using a random forest classifier, we found that genomic proximity plus five TF and chromatin features are able to predict >90% of target genes within 1 megabase of eQTLs. Despite being regularly used to map target genes, proximity is not a good indicator of eQTL targets for genes 150 kilobases away; insulators, TF co-occurrence, open chromatin and functional similarities between TFs and genes are better indicators. Using all six features in the classifier achieved an area under the specificity and sensitivity curve of 0.91, compared with at most 0.75 when using any single feature. We hope this study will not only provide validation of eQTL-mapping studies, but also provide insight into the molecular mechanisms explaining how genetic variation can influence gene expression.
|
25
|
Cell-type specificity of ChIP-predicted transcription factor binding sites. BMC Genomics 2012; 13:372. [PMID: 22863112 PMCID: PMC3574057 DOI: 10.1186/1471-2164-13-372] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 02/12/2012] [Accepted: 07/06/2012] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Context-dependent transcription factor (TF) binding is one reason for differences in gene expression patterns between different cellular states. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) identifies genome-wide TF binding sites for one particular context: the cells used in the experiment. But can such ChIP-seq data predict TF binding in other cellular contexts, and is it possible to distinguish context-dependent from ubiquitous TF binding? RESULTS: We compared ChIP-seq data on TF binding for multiple TFs in two different cell types and found that on average only a third of ChIP-seq peak regions are common to both cell types. As expected, common peaks occur more frequently in certain genomic contexts, such as CpG-rich promoters, whereas chromatin differences characterize cell-type specific TF binding. We also find, however, that genotype differences between the cell types can explain differences in binding. Moreover, ChIP-seq signal intensity and peak clustering are the strongest predictors of common peaks. Compared with strong peaks located in regions containing peaks for multiple transcription factors, weak and isolated peaks are less common between the cell types and are less associated with data that indicate regulatory activity. CONCLUSIONS: Together, the results suggest that experimental noise is prevalent among weak peaks, whereas strong and clustered peaks represent high-confidence binding events that often occur in other cellular contexts. Nevertheless, 30-40% of the strongest and most clustered peaks show context-dependent regulation. We show that by combining signal intensity with additional data (ranging from context-independent information such as binding site conservation and position weight matrix scores to context-dependent chromatin structure), we can predict whether a ChIP-seq peak is likely to be present in other cellular contexts.
|