1
Wanniarachchi DV, Viswakula S, Wickramasuriya AM. The evaluation of transcription factor binding site prediction tools in human and Arabidopsis genomes. BMC Bioinformatics 2024; 25:371. [PMID: 39623329; PMCID: PMC11613939; DOI: 10.1186/s12859-024-05995-0] [Received: 06/23/2024] [Accepted: 11/21/2024]
Abstract
BACKGROUND The precise prediction of transcription factor binding sites (TFBSs) is pivotal for unraveling the gene regulatory networks underlying biological processes. While numerous tools have emerged for in silico TFBS prediction in recent years, the evolving landscape of computational biology necessitates thorough assessments of tool performance to ensure accuracy and reliability. Only a limited number of studies have comprehensively evaluated the performance of TFBS prediction tools. Thus, the present study assessed twelve widely used TFBS prediction tools and four de novo motif discovery tools using a benchmark dataset comprising real, generic, Markov, and negative sequences. TFBSs of Arabidopsis thaliana and Homo sapiens genomes downloaded from the JASPAR database were implanted in these sequences, and the performance of the tools was evaluated using several statistical parameters at different overlap percentages between the lengths of known and predicted binding sites.
RESULTS Overall, the Multiple Cluster Alignment and Search Tool (MCAST) emerged as the best TFBS prediction tool, followed by Find Individual Motif Occurrences (FIMO) and MOtif Occurrence Detection Suite (MOODS). In addition, MotEvo and Dinucleotide Weight Tensor Toolbox (DWT-toolbox) demonstrated the highest sensitivity in identifying TFBSs at 90% and 80% overlap. Further, MCAST and DWT-toolbox demonstrated the highest sensitivity across all three data types: real, generic, and Markov. Among the de novo motif discovery tools, the Multiple Em for Motif Elicitation (MEME) emerged as the best performer. An analysis of the promoter regions of genes involved in the anthocyanin biosynthesis pathway in plants and the pentose phosphate pathway in humans, using the three best-performing tools, revealed considerable variation among the top 20 motifs identified by these tools.
CONCLUSION The findings of this study lay a robust groundwork for selecting optimal TFBS prediction tools for future research. Given the variability observed in tool performance, employing multiple tools for identifying TFBSs in a set of sequences is highly recommended. In addition, further studies are recommended to develop an integrated toolbox that incorporates TFBS prediction or motif discovery tools, aiming to streamline result precision and accuracy.
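The evaluation idea in this abstract (counting a known site as recovered only when a predicted site overlaps it by at least a given percentage) can be illustrated with a minimal sketch. The abstract does not give the study's exact overlap definition, so the choice to measure overlap relative to the known site's length, the half-open coordinates, and all names and toy intervals below are assumptions for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates on one sequence


def overlap_fraction(known: Interval, predicted: Interval) -> float:
    """Fraction of the known site's length that the predicted site covers."""
    start = max(known[0], predicted[0])
    end = min(known[1], predicted[1])
    return max(0, end - start) / (known[1] - known[0])


def sensitivity(known_sites: List[Interval],
                predicted_sites: List[Interval],
                min_overlap: float) -> float:
    """TP / (TP + FN): share of known sites recovered at the given overlap."""
    recovered = sum(
        1 for known in known_sites
        if any(overlap_fraction(known, pred) >= min_overlap
               for pred in predicted_sites)
    )
    return recovered / len(known_sites)


# Toy data (hypothetical): three implanted sites and three predictions.
known = [(10, 20), (50, 60), (100, 110)]
predicted = [(10, 20), (52, 62), (200, 210)]

for threshold in (0.9, 0.8):
    print(threshold, sensitivity(known, predicted, threshold))
```

In this toy case the second prediction covers 8 of the 10 known positions, so it counts as a hit at the 80% threshold but not at 90%, which is why sensitivity depends on the overlap cutoff chosen.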
Affiliation(s)
- Dinithi V Wanniarachchi
  - Department of Plant Sciences, Faculty of Science, University of Colombo, Colombo 03, Sri Lanka
- Sameera Viswakula
  - Department of Statistics, Faculty of Science, University of Colombo, Colombo 03, Sri Lanka
2
Xu J, Gao J, Ni P, Gerstein M. Less-is-more: selecting transcription factor binding regions informative for motif inference. Nucleic Acids Res 2024; 52:e20. [PMID: 38214231; PMCID: PMC10899791; DOI: 10.1093/nar/gkad1240] [Received: 12/08/2022] [Revised: 12/06/2023] [Accepted: 12/17/2023]
Abstract
Numerous statistical methods have emerged for inferring DNA motifs for transcription factors (TFs) from genomic regions. However, the process of selecting informative regions for motif inference remains understudied. Current approaches select regions with strong ChIP-seq signal for a given TF, assuming that such strong signal primarily results from specific interactions between the TF and its motif. Additionally, these selection approaches do not account for non-target motifs, i.e. motifs of other TFs; they presume that non-target motifs occur infrequently compared with the target motif and therefore interfere minimally with its identification. Leveraging extensive ChIP-seq datasets, we introduced the concept of TF signal 'crowdedness', referred to as C-score, for each genomic region. The C-score helps highlight TF signals arising from non-specific interactions. Moreover, by considering the C-score (and adjusting for the length of genomic regions), we can effectively mitigate interference from non-target motifs. Using these tools, we find that in many instances, strong ChIP-seq signal stems mainly from non-specific interactions, and the occurrence of non-target motifs significantly impairs the accurate inference of the target motif. Prioritizing genomic regions with reduced crowdedness and short length markedly improves motif inference. This 'less-is-more' effect suggests that ChIP-seq region selection warrants more attention.
Affiliation(s)
- Jinrui Xu
  - Department of Biology, Howard University, Washington, DC 20059, USA
  - Center for Applied Data Science and Analytics, Howard University, Washington, DC 20059, USA
- Jiahao Gao
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Pengyu Ni
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Mark Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
  - Department of Computer Science, Yale University, New Haven, CT 06520, USA
  - Department of Statistics and Data Science, Yale University, New Haven, CT 06520, USA
3
Luan Y, Tang Z, He Y, Xie Z. Intra-Domain Residue Coevolution in Transcription Factors Contributes to DNA Binding Specificity. Microbiol Spectr 2023; 11:e0365122. [PMID: 36943132; PMCID: PMC10100741; DOI: 10.1128/spectrum.03651-22] [Received: 09/09/2022] [Accepted: 02/22/2023]
Abstract
Understanding the basis of the DNA-binding specificity of transcription factors (TFs) has been of long-standing interest. Despite extensive efforts to map millions of putative TF binding sequences, identifying the critical determinants for DNA binding specificity remains a major challenge. The coevolution of residues in proteins occurs due to a shared evolutionary history. However, it is unclear how coevolving residues in TFs contribute to DNA binding specificity. Here, we systematically collected publicly available data sets from multiple large-scale high-throughput TF-DNA interaction screening experiments for the major TF families with large numbers of TF members. These families included the Homeobox, HLH, bZIP_1, Ets, HMG_box, ZF-C4, and Zn_clus TFs. We detected TF subclass-determining sites (TSDSs) and showed that the TSDSs were more likely to coevolve with other TSDSs than with non-TSDSs, particularly for the Homeobox, HLH, Ets, bZIP_1, and HMG_box TF families. By in silico modeling, we showed that mutation of the highly coevolving residues could significantly reduce the stability of the TF-DNA complex. Residues distant from the DNA interface also contributed to TF-DNA binding activity. Overall, our study gave evidence that coevolved residues relate to transcriptional regulation and provided insights into the potential application of engineered DNA-binding domains and proteins.
IMPORTANCE While unraveling the DNA-binding specificity of TFs is key to understanding the basis and molecular mechanism of gene expression regulation, identifying the critical determinants that contribute to DNA binding specificity remains a major challenge. In this study, we provided evidence that coevolving residues in TF domains contribute to DNA binding specificity. We demonstrated that the TSDSs were more likely to coevolve with other TSDSs than with non-TSDSs. Mutation of the coevolving residue pairs (CRPs) could significantly reduce the stability of the TF-DNA complex, and even residues distant from the DNA interface contribute to TF-DNA binding activity. Collectively, our study expands our knowledge of the interactions among coevolved residues in TFs, their tertiary contacts, and their functional importance in refined transcriptional regulation. Understanding the impact of coevolving residues in TFs will help clarify the details of transcriptional gene regulation and advance the application of engineered DNA-binding domains and proteins.
Affiliation(s)
- Yizhao Luan
  - State Key Laboratory of Ophthalmology, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Zehua Tang
  - State Key Laboratory of Ophthalmology, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Yao He
  - State Key Laboratory of Ophthalmology, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
- Zhi Xie
  - State Key Laboratory of Ophthalmology, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, China
4
Ding K, Dixit G, Parker BJ, Wen J. CRMnet: A deep learning model for predicting gene expression from large regulatory sequence datasets. Front Big Data 2023; 6:1113402. [PMID: 36999047; PMCID: PMC10043243; DOI: 10.3389/fdata.2023.1113402] [Received: 12/01/2022] [Accepted: 02/23/2023]
Abstract
Recent large datasets measuring the gene expression of millions of possible gene promoter sequences provide a resource to design and train optimized deep neural network architectures to predict expression from sequences. High predictive performance due to the modeling of dependencies within and between regulatory sequences is an enabler for biological discoveries in gene regulation through model interpretation techniques. To understand the regulatory code that delineates gene expression, we have designed a novel deep-learning model (CRMnet) to predict gene expression in Saccharomyces cerevisiae. Our model outperforms the current benchmark models and achieves a Pearson correlation coefficient of 0.971 and a mean squared error of 3.200. Interpretation of informative genomic regions determined from model saliency maps, and overlapping the saliency maps with known yeast motifs, supports that our model can successfully locate the binding sites of transcription factors that actively modulate gene expression. We compare our model's training times on a large compute cluster with GPUs and Google TPUs to indicate practical training times on similar datasets.
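The two metrics quoted in this abstract, the Pearson correlation coefficient and the mean squared error between measured and predicted expression, can be computed as in the following minimal sketch. The function names and the toy values are hypothetical and are not taken from CRMnet or its data.

```python
import math
from typing import Sequence


def pearson_r(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    covariance = sum((m - mean_m) * (p - mean_p)
                     for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / math.sqrt(var_m * var_p)


def mean_squared_error(measured: Sequence[float],
                       predicted: Sequence[float]) -> float:
    """Average squared difference between measured and predicted values."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


# Toy expression values (hypothetical), one entry per promoter sequence.
measured = [2.0, 5.0, 9.0, 12.0]
predicted = [3.0, 4.0, 10.0, 11.0]

print("Pearson r:", pearson_r(measured, predicted))
print("MSE:", mean_squared_error(measured, predicted))
```

The two numbers answer different questions: the correlation is insensitive to a constant offset or scale in the predictions, whereas the mean squared error penalizes any absolute deviation, which is one reason to report both.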
Affiliation(s)
- Ke Ding
  - Division of Genome Science and Cancer, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- Gunjan Dixit
  - Division of Genome Science and Cancer, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
- Brian J. Parker
  - School of Computing and Biological Data Science Institute, Australian National University, Canberra, ACT, Australia
  - Correspondence: Brian J. Parker
- Jiayu Wen
  - Division of Genome Science and Cancer, John Curtin School of Medical Research, Australian National University, Canberra, ACT, Australia
5
Sarkar S, Yadav S, Mehta P, Gupta G, Rajender S. Histone Methylation Regulates Gene Expression in the Round Spermatids to Set the RNA Payloads of Sperm. Reprod Sci 2022; 29:857-882. [PMID: 35015293; DOI: 10.1007/s43032-021-00837-3] [Received: 09/07/2021] [Accepted: 12/19/2021]
Abstract
Gene expression during spermatogenesis undergoes significant changes due to a demanding sequence of mitosis, meiosis, and differentiation. We investigated the contribution of H3 histone modifications to gene regulation in the round spermatids. Round spermatids were purified from rat testes using centrifugal elutriation and Percoll density-gradient centrifugation. After enzymatic chromatin shearing, immunoprecipitation using antibodies against the histone marks H3K4me3 and H3K9me3 was undertaken. The immunoprecipitated DNA fragments were subjected to massively parallel sequencing. Gene expression in round spermatids and sperm was analyzed by transcriptome sequencing using next-generation sequencing methods. ChIP-seq analysis showed significant peak enrichment in H3K4me3 marks in active chromatin regions and H3K9me3 peak enrichment in repressive regions. We found 53 genes that showed overlapping peak enrichment in both H3K4me3 and H3K9me3 marks. Some of the top H3K4me3-enriched genes were involved in sperm tail formation (Odf1, Odf3, Odf4, Oaz3, Ccdc42, Ccdc63, and Ccdc181), chromatin condensation (Dync1h1, Dynll1, and Kdm3a), and sperm functions such as acrosome reaction (Acrbp and Fabp9), energy generation (Gapdhs), and signaling for motility (Tssk1b, Tssk2, and Tssk4). Transcriptome sequencing in round spermatids found transcripts of 64% of the H3K4me3-enriched genes at high levels and transcripts of about 25% of the H3K9me3-enriched genes at very low levels. Transcriptome sequencing in sperm found that more than 99% of the transcripts corresponding to the ChIP-seq genes were also present in sperm. H3K4me3 enrichment in the round spermatids correlates significantly with gene expression, and H3K9me3 enrichment correlates with gene silencing; both contribute to sperm differentiation and to setting the RNA payloads of sperm.
Affiliation(s)
- Saumya Sarkar
  - Division of Endocrinology, CSIR-Central Drug Research Institute, Lucknow, India
- Santosh Yadav
  - Division of Endocrinology, CSIR-Central Drug Research Institute, Lucknow, India
- Poonam Mehta
  - Division of Endocrinology, CSIR-Central Drug Research Institute, Lucknow, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Gopal Gupta
  - Division of Endocrinology, CSIR-Central Drug Research Institute, Lucknow, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
- Singh Rajender
  - Division of Endocrinology, CSIR-Central Drug Research Institute, Lucknow, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, India
6
Corrales E, Levit-Zerdoun E, Metzger P, Kowar S, Ku M, Brummer T, Boerries M. Dynamic transcriptome analysis reveals signatures of paradoxical effect of vemurafenib on human dermal fibroblasts. Cell Commun Signal 2021; 19:123. [PMID: 34930313; PMCID: PMC8686565; DOI: 10.1186/s12964-021-00801-3] [Received: 05/07/2021] [Accepted: 11/09/2021]
Abstract
BACKGROUND Vemurafenib (PLX4032) is one of the most frequently used treatments for late-stage melanoma patients with the BRAFV600E mutation; however, acquired resistance to the drug poses a major challenge. It remains to be determined whether off-target effects of vemurafenib on normal stroma components could reshape the tumor microenvironment in a way that contributes to cancer progression and drug resistance.
METHODS By using temporally-resolved RNA- and ATAC-seq, we studied the early molecular changes induced by vemurafenib in human dermal fibroblasts (HDF), a main stromal component in melanoma and other tumors with high prevalence of BRAFV600 mutations.
RESULTS Transcriptomics analyses revealed a stepwise up-regulation of proliferation signatures, together with a down-regulation of autophagy and proteolytic processes. The gene expression changes in HDF correlated strongly and inversely with those in BRAFV600E mutant malignant melanoma (MaMel) cell lines, consistent with the observation of a paradoxical effect of vemurafenib, leading to hyperphosphorylation of MEK1/2 and ERK1/2. The transcriptional changes in HDF were not strongly determined by alterations in chromatin accessibility; rather, an already permissive chromatin landscape seemed to facilitate the early accessibility to MAPK/ERK-regulated transcription factor binding sites. Combinatorial treatment with the MEK inhibitor trametinib did not preclude the paradoxical activation of MAPK/ERK signaling in HDF. When administered together, vemurafenib partially compensated for the reduction of cell viability and proliferation induced by trametinib. These paradoxical changes were restrained by using the third generation BRAF inhibitor PLX8394, a so-called paradox breaker compound. However, the advantageous effects on HDF during combination therapies were also lost.
CONCLUSIONS Vemurafenib induces paradoxical changes in HDF, enabled by a permissive chromatin landscape. These changes might provide an advantage during combination therapies, by compensating for the toxicity induced in stromal cells by less specific MAPK/ERK inhibitors. Our results highlight the relevance of evaluating the effects of the drugs on non-transformed stromal components, carefully considering the implications of their administration either as mono- or combination therapies.
Affiliation(s)
- Eyleen Corrales
  - Institute of Molecular Medicine and Cell Research (IMMZ), University of Freiburg, Stefan-Meier-Str. 17, 79104 Freiburg, Germany
  - Institute of Medical Bioinformatics and Systems Medicine (IBSM), Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacherstr. 153, 79110 Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Schänzlestr. 1, 79104 Freiburg, Germany
- Ella Levit-Zerdoun
  - Institute of Molecular Medicine and Cell Research (IMMZ), University of Freiburg, Stefan-Meier-Str. 17, 79104 Freiburg, Germany
  - Institute of Medical Bioinformatics and Systems Medicine (IBSM), Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacherstr. 153, 79110 Freiburg, Germany
  - German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Patrick Metzger
  - Institute of Medical Bioinformatics and Systems Medicine (IBSM), Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacherstr. 153, 79110 Freiburg, Germany
- Silke Kowar
  - Institute of Molecular Medicine and Cell Research (IMMZ), University of Freiburg, Stefan-Meier-Str. 17, 79104 Freiburg, Germany
  - Institute of Medical Bioinformatics and Systems Medicine (IBSM), Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacherstr. 153, 79110 Freiburg, Germany
- Manching Ku
  - Department of Pediatrics and Adolescent Medicine, Division of Pediatric Hematology and Oncology, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Mathildenstr. 1, 79106 Freiburg, Germany
- Tilman Brummer
  - Institute of Molecular Medicine and Cell Research (IMMZ), University of Freiburg, Stefan-Meier-Str. 17, 79104 Freiburg, Germany
  - German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
  - German Cancer Consortium (DKTK), Freiburg, Germany
  - Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Schänzlestr. 18, 79104 Freiburg, Germany
- Melanie Boerries
  - Institute of Molecular Medicine and Cell Research (IMMZ), University of Freiburg, Stefan-Meier-Str. 17, 79104 Freiburg, Germany
  - Institute of Medical Bioinformatics and Systems Medicine (IBSM), Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Breisacherstr. 153, 79110 Freiburg, Germany
  - German Cancer Research Center (DKFZ), Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
  - German Cancer Consortium (DKTK), Freiburg, Germany
  - Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Schänzlestr. 18, 79104 Freiburg, Germany
7
Toivonen J, Das PK, Taipale J, Ukkonen E. MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. Bioinformatics 2020; 36:2690-2696. [PMID: 31999322; PMCID: PMC7203737; DOI: 10.1093/bioinformatics/btaa045] [Received: 05/21/2019] [Revised: 12/23/2019] [Accepted: 01/23/2020]
Abstract
MOTIVATION Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously for both monomers and their dimers has been missing.
RESULTS We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average.
AVAILABILITY AND IMPLEMENTATION Software implementation is available from https://github.com/jttoivon/moder2.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
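The difference between the two motif models named in this abstract can be shown with a minimal sketch: a PPM multiplies independent per-position base probabilities, whereas a first-order Markov model (ADM) conditions each base on the one before it. This is not the MODER2 implementation; the data layout, function names, and all probabilities below are toy assumptions chosen only to show the effect of an adjacent-base dependency.

```python
from typing import Dict, List

Column = Dict[str, float]              # base -> probability at one position
Transition = Dict[str, Column]         # previous base -> distribution of next base


def ppm_probability(ppm: List[Column], site: str) -> float:
    """PPM: positions are independent, so per-position probabilities multiply."""
    prob = 1.0
    for column, base in zip(ppm, site):
        prob *= column[base]
    return prob


def adm_probability(initial: Column,
                    transitions: List[Transition],
                    site: str) -> float:
    """First-order Markov model: each base is conditioned on the previous one."""
    prob = initial[site[0]]
    for step, (prev, cur) in enumerate(zip(site, site[1:])):
        prob *= transitions[step][prev][cur]
    return prob


# Toy two-position motif (hypothetical numbers).
first = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
second = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
ppm = [first, second]

# Dependent model: G is strongly favoured only when the previous base is A.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dependent = [{
    "A": {"A": 0.03, "C": 0.03, "G": 0.9, "T": 0.04},
    "C": uniform, "G": uniform, "T": uniform,
}]

# Independent special case: every previous base maps to the same PPM column,
# so the first-order model reduces to the PPM.
independent = [{base: second for base in "ACGT"}]

print(ppm_probability(ppm, "AG"))
print(adm_probability(first, dependent, "AG"))
print(adm_probability(first, independent, "AG"))
```

The last line illustrates that a PPM is the special case of the first-order model in which the conditional distributions ignore the previous base, which is why the richer model can only fit dependent data at least as well.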
Affiliation(s)
- Jarkko Toivonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
- Pratyush K Das
  - Applied Tumor Genomics, Research Programs Unit, University of Helsinki, Helsinki FI-00014, Finland
- Jussi Taipale
  - Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, SE 141 83 Stockholm, Sweden
  - Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
  - Genome-Scale Biology Program, University of Helsinki, Helsinki FI-00014, Finland
- Esko Ukkonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
8
Carazo F, Romero JP, Rubio A. Upstream analysis of alternative splicing: a review of computational approaches to predict context-dependent splicing factors. Brief Bioinform 2020; 20:1358-1375. [PMID: 29390045; DOI: 10.1093/bib/bby005] [Received: 10/24/2017] [Revised: 12/14/2017]
Abstract
Alternative splicing (AS) has been shown to play a pivotal role in the development of diseases, including cancer. Specifically, all the hallmarks of cancer (angiogenesis, cell immortality, avoiding immune system response, etc.) are found to have a counterpart in aberrant splicing of key genes. Identifying the context-specific regulators of splicing provides valuable information to find new biomarkers, as well as to define alternative therapeutic strategies. The computational models to identify these regulators are not trivial and require three conceptual steps: the detection of AS events, the identification of splicing factors that potentially regulate these events, and the contextualization of these pieces of information for a specific experiment. In this work, we review the different algorithmic methodologies developed for each of these tasks. The main strengths and weaknesses of the different steps of the pipeline are discussed. Finally, a case study is detailed to make the reader aware of the potential and limitations of this computational approach.
9
Toivonen J, Kivioja T, Jolma A, Yin Y, Taipale J, Ukkonen E. Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets. Nucleic Acids Res 2019; 46:e44. [PMID: 29385521; PMCID: PMC5934673; DOI: 10.1093/nar/gky027] [Received: 12/01/2017] [Accepted: 01/12/2018]
Abstract
In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in a modular fashion such that deviations from the expected motif would become explicit and the noise from dimeric occurrences would not corrupt monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as the software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations. Given training data and seeds for monomeric motifs, the algorithm learns, in the same probabilistic framework, a mixture model that represents monomeric motifs as standard position-specific probability matrices (PPMs), and dimeric motifs as pairs of monomeric PPMs, with associated orientation and spacing preferences. For dimers, the model represents deviations from the pure modular model of two independent monomers, thus making co-operative binding effects explicit. MODER can analyze in reasonable time tens of Mbps of training data. We validated the tool on HT-SELEX and ChIP-seq data. Our findings include some TFs whose expected model has palindromic symmetry but whose observed model is directional.
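The "expected" dimer described in this abstract, two independently binding monomers with a given orientation and spacing, can be pictured with a minimal sketch that assembles a dimeric PPM from two monomeric PPMs. This is not MODER's code: the function names, the uniform gap columns, the restriction to non-negative spacing, and the toy matrices are all assumptions made for illustration.

```python
from typing import Dict, List

Column = Dict[str, float]  # base -> probability at one motif position
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(ppm: List[Column]) -> List[Column]:
    """Flip a PPM to the opposite strand: reverse columns, complement bases."""
    return [{COMPLEMENT[base]: prob for base, prob in column.items()}
            for column in reversed(ppm)]


def expected_dimer(first: List[Column],
                   second: List[Column],
                   spacing: int,
                   flip_second: bool = False) -> List[Column]:
    """Expected dimer PPM if the two monomers bind independently.

    Assumption: gap positions are uninformative (uniform), and overlapping
    arrangements (negative spacing) are out of scope for this sketch.
    """
    if spacing < 0:
        raise ValueError("this sketch only handles non-overlapping dimers")
    right = reverse_complement(second) if flip_second else second
    gap = [{base: 0.25 for base in "ACGT"} for _ in range(spacing)]
    return [dict(col) for col in first] + gap + [dict(col) for col in right]


# Toy two-position monomer (hypothetical numbers).
monomer = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

# Same monomer twice, second copy on the opposite strand, two-base gap.
dimer = expected_dimer(monomer, monomer, spacing=2, flip_second=True)
for position, column in enumerate(dimer):
    print(position, column)
```

Comparing such an expected matrix with the dimer matrix actually learned from data is one way to picture how deviations from independent binding become explicit.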
Affiliation(s)
- Jarkko Toivonen
  - Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Helsinki, Finland
- Teemu Kivioja
  - Genome-Scale Biology Program, P.O. Box 63, FI-00014 University of Helsinki, Helsinki, Finland
- Arttu Jolma
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
- Yimeng Yin
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
- Jussi Taipale
  - Genome-Scale Biology Program, P.O. Box 63, FI-00014 University of Helsinki, Helsinki, Finland
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
  - Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
- Esko Ukkonen
  - Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology HIIT, University of Helsinki & Aalto University, Helsinki, Finland
10
The Identification and Interpretation of cis-Regulatory Noncoding Mutations in Cancer. High Throughput 2018; 8:ht8010001. [PMID: 30577431; PMCID: PMC6473693; DOI: 10.3390/ht8010001] [Received: 11/06/2018] [Revised: 12/11/2018] [Accepted: 12/14/2018]
Abstract
To characterise the genomic landscape of cancers and to establish novel biomarkers and therapeutic targets, studies have largely focused on the identification of driver mutations within the protein-coding gene regions, where the most pathogenic alterations are known to occur. However, the noncoding genome is significantly larger than its protein-coding counterpart, and evidence reveals that regulatory sequences also harbour functional mutations that significantly affect the regulation of genes and pathways implicated in cancer. Due to the sheer number of noncoding mutations (NCMs) and the limited knowledge of regulatory element functionality in cancer genomes, differentiating pathogenic mutations from background passenger noise is particularly challenging, both technically and computationally. Here we review up-to-date high-throughput sequencing data and studies, as well as in silico methods, that can be employed to interrogate the noncoding genome. We aim to provide an overview of available data resources as well as computational and molecular techniques that can guide the search for functional NCMs in cancer genomes.
11
Kinjo S, Monma N, Misu S, Kitamura N, Imoto J, Yoshitake K, Gojobori T, Ikeo K. Maser: one-stop platform for NGS big data from analysis to visualization. Database: The Journal of Biological Databases and Curation 2018; 2018:4970007. [PMID: 29688385; PMCID: PMC5905357; DOI: 10.1093/database/bay027] [Received: 09/15/2017] [Accepted: 02/26/2018]
Abstract
A major challenge in analyzing the data from high-throughput next-generation sequencing (NGS) is how to handle the huge amounts of data and the variety of NGS tools and how to visualize the resultant outputs. To address these issues, we developed a cloud-based data analysis platform, Maser (Management and Analysis System for Enormous Reads), and an original genome browser, Genome Explorer (GE). Maser enables users to manage up to 2 terabytes of data and to conduct analyses with easy graphical user interface operations, and it offers analysis pipelines in which several individual tools are combined into a single pipeline for very common and standard analyses. GE automatically visualizes genome assembly and mapping results output from Maser pipelines, without requiring additional data upload. With this function, the Maser pipelines can graphically display the results output from all the embedded tools and mapping results in a web browser. Maser therefore provides a more user-friendly analysis platform, especially for beginners, by improving graphical display and providing selected standard pipelines that work with the built-in genome browser. In addition, all the analyses executed on Maser are recorded in the analysis history, helping users to trace and repeat the analyses. The entire process of analysis and its histories can be shared with collaborators or opened to the public. In conclusion, our system is useful for managing, analyzing, and visualizing NGS data and achieves traceability, reproducibility, and transparency of NGS analysis. Database URL: http://cell-innovation.nig.ac.jp/maser/
Affiliation(s)
- Sonoko Kinjo
- Center for Information Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Norikazu Kitamura
- Center for Information Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Junichi Imoto
- Center for Information Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Kazutoshi Yoshitake
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Kazuho Ikeo
- Center for Information Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan; Department of Genetics, SOKENDAI, Mishima, Japan
|
12
|
Martins-Santana L, Nora LC, Sanches-Medeiros A, Lovate GL, Cassiano MHA, Silva-Rocha R. Systems and Synthetic Biology Approaches to Engineer Fungi for Fine Chemical Production. Front Bioeng Biotechnol 2018; 6:117. [PMID: 30338257 PMCID: PMC6178918 DOI: 10.3389/fbioe.2018.00117] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/09/2018] [Accepted: 08/02/2018] [Indexed: 01/16/2023] Open
Abstract
Since the advent of systems and synthetic biology, many studies have sought to harness microbes as cell factories through genetic and metabolic engineering approaches. Yeast and filamentous fungi have been successfully harnessed to produce fine and high value-added chemical products. In this review, we present some of the most promising advances from recent years in the use of fungi for this purpose, focusing on the manipulation of fungal strains using systems and synthetic biology tools to improve metabolic flow and the flow of secondary metabolites by pathway redesign. We also review the roles of bioinformatics analysis and predictions in synthetic circuits, highlighting in silico systemic approaches to improve the efficiency of synthetic modules.
Affiliation(s)
- Leonardo Martins-Santana
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Luisa C Nora
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Ananda Sanches-Medeiros
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Gabriel L Lovate
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Murilo H A Cassiano
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
- Rafael Silva-Rocha
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
|
13
|
Madsen JGS, Rauch A, Van Hauwaert EL, Schmidt SF, Winnefeld M, Mandrup S. Integrated analysis of motif activity and gene expression changes of transcription factors. Genome Res 2018; 28:243-255. [PMID: 29233921 PMCID: PMC5793788 DOI: 10.1101/gr.227231.117] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 07/06/2017] [Accepted: 12/01/2017] [Indexed: 01/01/2023]
Abstract
The ability to predict transcription factors based on sequence information in regulatory elements is a key step in systems-level investigation of transcriptional regulation. Here, we have developed a novel tool, IMAGE, for precise prediction of causal transcription factors based on transcriptome profiling and genome-wide maps of enhancer activity. High precision is obtained by combining a near-complete database of position weight matrices (PWMs), generated by compiling public databases and systematic prediction of PWMs for uncharacterized transcription factors, with a state-of-the-art method for PWM scoring and a novel machine learning strategy, based on both enhancers and promoters, to predict the contribution of motifs to transcriptional activity. We applied IMAGE to published data obtained during 3T3-L1 adipocyte differentiation and showed that IMAGE predicts causal transcriptional regulators of this process with higher confidence than existing methods. Furthermore, we generated genome-wide maps of enhancer activity and transcripts during human mesenchymal stem cell commitment and adipocyte differentiation and used IMAGE to identify positive and negative transcriptional regulators of this process. Collectively, our results demonstrate that IMAGE is a powerful and precise method for prediction of regulators of gene expression.
Affiliation(s)
- Jesper Grud Skat Madsen
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Alexander Rauch
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Elvira Laila Van Hauwaert
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Søren Fisker Schmidt
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
- Marc Winnefeld
- Research and Development, Beiersdorf AG, 20245 Hamburg, Germany
- Susanne Mandrup
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
|
14
|
Liu B, Yang J, Li Y, McDermaid A, Ma Q. An algorithmic perspective of de novo cis-regulatory motif finding based on ChIP-seq data. Brief Bioinform 2017; 19:1069-1081. [DOI: 10.1093/bib/bbx026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 12/20/2016] [Indexed: 01/06/2023] Open
Affiliation(s)
- Bingqiang Liu
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Jinyu Yang
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Yang Li
- School of Mathematics, Shandong University, Jinan Shandong, P. R. China
- Adam McDermaid
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, USA
- Qin Ma
- Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, USA
|
15
|
Hirano Y, Ihara K, Masuda T, Yamamoto T, Iwata I, Takahashi A, Awata H, Nakamura N, Takakura M, Suzuki Y, Horiuchi J, Okuno H, Saitoe M. Shifting transcriptional machinery is required for long-term memory maintenance and modification in Drosophila mushroom bodies. Nat Commun 2016; 7:13471. [PMID: 27841260 PMCID: PMC5114576 DOI: 10.1038/ncomms13471] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Received: 01/04/2016] [Accepted: 10/06/2016] [Indexed: 01/08/2023] Open
Abstract
Accumulating evidence suggests that transcriptional regulation is required for maintenance of long-term memories (LTMs). Here we characterize global transcriptional and epigenetic changes that occur during LTM storage in the Drosophila mushroom bodies (MBs), structures important for memory. Although LTM formation requires the CREB transcription factor and its coactivator, CBP, subsequent early maintenance requires CREB and a different coactivator, CRTC. Late maintenance becomes CREB independent and instead requires the transcription factor Bx. Bx expression initially depends on CREB/CRTC activity, but later becomes CREB/CRTC independent. The timing of the CREB/CRTC early maintenance phase correlates with the time window for LTM extinction and we identify different subsets of CREB/CRTC target genes that are required for memory maintenance and extinction. Furthermore, we find that prolonging CREB/CRTC-dependent transcription extends the time window for LTM extinction. Our results demonstrate the dynamic nature of stored memory and its regulation by shifting transcription systems in the MBs.
Transcriptional regulation is necessary for maintaining long-term memories (LTM) but the mechanistic details are not completely defined. Here the authors identify transcriptional machinery and histone modifiers required for LTM maintenance in Drosophila and show that transcriptional regulation for LTM maintenance is distinct from that for LTM formation.
Affiliation(s)
- Yukinori Hirano
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan; Japan Science and Technology Agency, PRESTO, 4-4-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
- Kunio Ihara
- Center of Gene Research, Nagoya University, Huro-cho, Chikusa-ku, Nagoya 464-8602, Japan
- Tomoko Masuda
- Tokyo Metropolitan Institute of Medical Science, 2-1-6 Kamikitazawa, Setagaya, Tokyo 156-0057, Japan
- Takuya Yamamoto
- Center for iPS Cell Research and Application, Department of Reprogramming Science, Kyoto University, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto, Kyoto 606-8507, Japan; Institute for Integrated Cell-Material Sciences (WPI-iCeMS), Kyoto University, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto, Kyoto 606-8507, Japan; AMED-CREST, AMED 1-7-1 Otemach, Chiyodaku, Tokyo 100-0004, Japan
- Ikuko Iwata
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Aya Takahashi
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Hiroko Awata
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Naosuke Nakamura
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan; Kyoto Sangyo University, Motoyama, Kamigamo, Kita-ku, Kyoto City 603-8555, Japan
- Mai Takakura
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Yusuke Suzuki
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Junjiro Horiuchi
- Tokyo Metropolitan Institute of Medical Science, 2-1-6 Kamikitazawa, Setagaya, Tokyo 156-0057, Japan
- Hiroyuki Okuno
- SK Project, Medical Innovation Center, Kyoto University Graduate School of Medicine, 53 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
- Minoru Saitoe
- Tokyo Metropolitan Institute of Medical Science, 2-1-6 Kamikitazawa, Setagaya, Tokyo 156-0057, Japan
|
16
|
Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics 2016; 17:547. [PMID: 27806697 PMCID: PMC6889335 DOI: 10.1186/s12859-016-1298-9] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Received: 06/19/2016] [Accepted: 10/20/2016] [Indexed: 12/21/2022] Open
Abstract
Background: Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. Results: We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE-ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best, respectively. Conclusions: Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and in evaluating the effects of single nucleotide variations in causing disease.
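The PWM scanning step mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the algorithm of FIMO, MCAST, or any other tool named here: the toy count matrix, the pseudocount, the uniform background frequency, and the score threshold are all hypothetical values chosen for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM.

    The pseudocount and uniform background are illustrative assumptions.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, window, score) for each forward-strand window
    whose summed log-odds score is at least the threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical counts for a 4-bp motif resembling "TGAC".
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
pwm = log_odds_pwm(toy_counts)
hits = scan("AATGACGGTGACT", pwm, threshold=4.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

With this toy matrix only exact "TGAC" windows clear the threshold. Production scanners typically also score the reverse complement and convert scores to p-values, which this sketch omits.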
Affiliation(s)
- Narayan Jayaram
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Daniel Usvyat
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Andrew C R Martin
- Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
|
17
|
Adriaens ME, Prickaerts P, Chan-Seng-Yue M, van den Beucken T, Dahlmans VEH, Eijssen LM, Beck T, Wouters BG, Voncken JW, Evelo CTA. Quantitative analysis of ChIP-seq data uncovers dynamic and sustained H3K4me3 and H3K27me3 modulation in cancer cells under hypoxia. Epigenetics Chromatin 2016; 9:48. [PMID: 27822313 PMCID: PMC5090954 DOI: 10.1186/s13072-016-0090-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 08/30/2016] [Accepted: 09/02/2016] [Indexed: 01/16/2023] Open
Abstract
Background: A comprehensive assessment of the epigenetic dynamics in cancer cells is the key to understanding the molecular mechanisms underlying cancer and to improving cancer diagnostics, prognostics and treatment. By combining genome-wide ChIP-seq epigenomics and microarray transcriptomics, we studied the effects of oxygen deprivation and subsequent reoxygenation on histone 3 trimethylation of lysine 4 (H3K4me3) and lysine 27 (H3K27me3) in a breast cancer cell line, serving as a model for abnormal oxygenation in solid tumors. A priori, epigenetic markings and gene expression levels are expected not only to vary greatly between hypoxic and normoxic conditions, but also to display a large degree of heterogeneity across the cell population. Whereas ChIP-seq data are traditionally often treated as dichotomous data, the model and experiment here necessitate a quantitative, data-driven analysis of both datasets. Results: We first identified genomic regions with sustained epigenetic markings, which provided a sample-specific reference enabling quantitative ChIP-seq data analysis. Sustained H3K27me3 marking was located around centromeres and intergenic regions, while sustained H3K4me3 marking was associated with genes involved in RNA binding, translation and protein transport and localization. Dynamic marking with both H3K4me3 and H3K27me3 (hypoxia-induced bivalency) was found in CpG-rich regions at loci encoding factors that control developmental processes, congruent with observations in embryonic stem cells. Conclusions: In silico-identified epigenetically sustained and dynamic genomic regions were confirmed through ChIP-PCR in vitro, and obtained results are corroborated by published data and current insights regarding epigenetic regulation.
Affiliation(s)
- Michiel E Adriaens
- Maastricht Centre for Systems Biology - MaCSBio, Maastricht University, Maastricht, The Netherlands; Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, The Netherlands
- Peggy Prickaerts
- Department of Molecular Genetics, Maastricht University, Maastricht, The Netherlands
- Michelle Chan-Seng-Yue
- Departments of Informatics and Bio-computing, University Health Network, Toronto, ON Canada; Heart Centre Biobank, The Hospital for Sick Children, Toronto, ON Canada
- Twan van den Beucken
- Princess Margaret Cancer Centre and Campbell Family Institute for Cancer Research, University Health Network, Toronto, ON Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON Canada; Maastricht Radiation Oncology (MaastRO) Laboratory, Maastricht University, Maastricht, The Netherlands
- Vivian E H Dahlmans
- Department of Molecular Genetics, Maastricht University, Maastricht, The Netherlands
- Lars M Eijssen
- Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, The Netherlands
- Timothy Beck
- Departments of Informatics and Bio-computing, University Health Network, Toronto, ON Canada; Human Longevity Inc., San Diego, CA USA
- Bradly G Wouters
- Princess Margaret Cancer Centre and Campbell Family Institute for Cancer Research, University Health Network, Toronto, ON Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON Canada; Maastricht Radiation Oncology (MaastRO) Laboratory, Maastricht University, Maastricht, The Netherlands
- Jan Willem Voncken
- Department of Molecular Genetics, Maastricht University, Maastricht, The Netherlands
- Chris T A Evelo
- Department of Bioinformatics - BiGCaT, Maastricht University, Maastricht, The Netherlands
|
18
|
Cormier N, Kolisnik T, Bieda M. Reusable, extensible, and modifiable R scripts and Kepler workflows for comprehensive single set ChIP-seq analysis. BMC Bioinformatics 2016; 17:270. [PMID: 27377783 PMCID: PMC4932705 DOI: 10.1186/s12859-016-1125-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/11/2016] [Accepted: 06/07/2016] [Indexed: 11/10/2022] Open
Abstract
Background: There has been an enormous expansion in the use of chromatin immunoprecipitation followed by sequencing (ChIP-seq) technologies. Analysis of large-scale ChIP-seq datasets involves a complex series of steps and production of several specialized graphical outputs. A number of systems have emphasized custom development of ChIP-seq pipelines. These systems are primarily based on custom programming of a single, complex pipeline or supply libraries of modules and do not produce the full range of outputs commonly produced for ChIP-seq datasets. It is desirable to have more comprehensive pipelines, in particular ones addressing common metadata tasks, such as pathway analysis, and pipelines producing standard complex graphical outputs. It is advantageous if these are highly modular systems, available as both turnkey pipelines and individual modules, that are easily comprehensible, modifiable and extensible to allow rapid alteration in response to new analysis developments in this growing area. Furthermore, it is advantageous if these pipelines allow data provenance tracking. Results: We present a set of 20 ChIP-seq analysis software modules implemented in the Kepler workflow system; most (18/20) were also implemented as standalone, fully functional R scripts. The set consists of four full turnkey pipelines and 16 component modules. The turnkey pipelines in Kepler allow data provenance tracking. Implementation emphasized use of common R packages and widely-used external tools (e.g., MACS for peak finding), along with custom programming. This software presents comprehensive solutions and easily repurposed code blocks for ChIP-seq analysis and pipeline creation. Tasks include mapping raw reads, peak finding via MACS, summary statistics, peak location statistics, summary plots centered on the transcription start site (TSS), gene ontology, pathway analysis, and de novo motif finding, among others. Conclusions: These pipelines range from those performing a single task to those performing full analyses of ChIP-seq data. The pipelines are supplied as both Kepler workflows, which allow data provenance tracking, and, in the majority of cases, as standalone R scripts. These pipelines are designed for ease of modification and repurposing.
Affiliation(s)
- Nathan Cormier
- Department of Biochemistry and Molecular Biology, University of Calgary Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary, AB, T2N4N1, Canada
- Tyler Kolisnik
- Department of Biochemistry and Molecular Biology, University of Calgary Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary, AB, T2N4N1, Canada
- Mark Bieda
- Department of Biochemistry and Molecular Biology, University of Calgary Cumming School of Medicine, Rm HSC1151, 3330 Hospital Dr. NW, Calgary, AB, T2N4N1, Canada
|
19
|
Silva TC, Colaprico A, Olsen C, D'Angelo F, Bontempi G, Ceccarelli M, Noushmehr H. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages. F1000Res 2016; 5:1542. [PMID: 28232861 PMCID: PMC5302158 DOI: 10.12688/f1000research.8923.2] [Citation(s) in RCA: 104] [Impact Index Per Article: 11.6] [Accepted: 11/24/2016] [Indexed: 01/09/2023] Open
Abstract
Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and by harnessing several key Bioconductor packages, we describe how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiforme or GBM). This workflow introduces the following Bioconductor packages: AnnotationHub, ChIPSeeker, ComplexHeatmap, pathview, ELMER, GAIA, MINET, RTCGAToolbox, TCGAbiolinks.
Affiliation(s)
- Tiago C Silva
- Department of Genetics, Ribeirao Preto Medical School, University of Sao Paulo, Ribeirao Preto, Brazil; Department of Biomedical Sciences, Cedars-Sinai, Los Angeles, CA, USA
- Antonio Colaprico
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium
- Catharina Olsen
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium
- Fulvio D'Angelo
- Department of Science and Technology, University of Sannio, Benevento, Italy; Biogem, Istituto di Ricerche Genetiche Gaetano Salvatore, Avellino, Italy
- Gianluca Bontempi
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium; Department of Science and Technology, University of Sannio, Benevento, Italy
- Houtan Noushmehr
- Department of Genetics, Ribeirao Preto Medical School, University of Sao Paulo, Ribeirao Preto, Brazil; Department of Neurosurgery, Henry Ford Hospital, Detroit, MI, USA
|
20
|
Silva TC, Colaprico A, Olsen C, D'Angelo F, Bontempi G, Ceccarelli M, Noushmehr H. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages. F1000Res 2016. [PMID: 28232861 DOI: 10.12688/f1000research.8923.1] [Citation(s) in RCA: 144] [Impact Index Per Article: 16.0] [Indexed: 01/22/2023] Open
Abstract
Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no one comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. A need to create an integration of these different analyses was recently proposed. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and by harnessing several key Bioconductor packages, we describe how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiforme or GBM). This workflow introduces the following Bioconductor packages: AnnotationHub, ChIPSeeker, ComplexHeatmap, pathview, ELMER, GAIA, MINET, RTCGAToolbox, TCGAbiolinks.
Affiliation(s)
- Tiago C Silva
- Department of Genetics, Ribeirao Preto Medical School, University of Sao Paulo, Ribeirao Preto, Brazil; Department of Biomedical Sciences, Cedars-Sinai, Los Angeles, CA, USA
- Antonio Colaprico
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium
- Catharina Olsen
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium
- Fulvio D'Angelo
- Department of Science and Technology, University of Sannio, Benevento, Italy; Biogem, Istituto di Ricerche Genetiche Gaetano Salvatore, Avellino, Italy
- Gianluca Bontempi
- Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium; Machine Learning Group, ULB, Brussels, Belgium; Department of Science and Technology, University of Sannio, Benevento, Italy
- Houtan Noushmehr
- Department of Genetics, Ribeirao Preto Medical School, University of Sao Paulo, Ribeirao Preto, Brazil; Department of Neurosurgery, Henry Ford Hospital, Detroit, MI, USA
|
21
|
Pan D, Huang L, Zhu LJ, Zou T, Ou J, Zhou W, Wang YX. Jmjd3-Mediated H3K27me3 Dynamics Orchestrate Brown Fat Development and Regulate White Fat Plasticity. Dev Cell 2015; 35:568-583. [PMID: 26625958 DOI: 10.1016/j.devcel.2015.11.002] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Received: 10/17/2014] [Revised: 09/30/2015] [Accepted: 11/03/2015] [Indexed: 01/09/2023]
Abstract
Progression from brown preadipocytes to adipocytes engages two transcriptional programs: the expression of adipogenic genes common to both brown fat (BAT) and white fat (WAT), and the expression of BAT-selective genes. However, the dynamics of chromatin states and epigenetic enzymes involved remain poorly understood. Here we show that BAT development is selectively marked and guided by repressive H3K27me3 and is executed by its demethylase Jmjd3. We find that a significant subset of BAT-selective genes, but not common fat genes or WAT-selective genes, are demarcated by H3K27me3 in both brown and white preadipocytes. Jmjd3-catalyzed removal of H3K27me3, in part through Rreb1-mediated recruitment, is required for expression of BAT-selective genes and for development of beige adipocytes both in vitro and in vivo. Moreover, gain- and loss-of-function Jmjd3 transgenic mice show age-dependent body weight reduction and cold intolerance, respectively. Together, we identify an epigenetic mechanism governing BAT fate determination and WAT plasticity.
Affiliation(s)
- Dongning Pan
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- Lei Huang
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- Lihua J Zhu
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- Tie Zou
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- Jianhong Ou
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- William Zhou
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
- Yong-Xu Wang
- Department of Molecular, Cell and Cancer Biology and Program in Molecular Medicine, University of Massachusetts Medical School, 364 Plantation Street, Worcester, MA 01605, USA
|
22
|
Varco-Merth B, Rotwein P. Differential effects of STAT proteins on growth hormone-mediated IGF-I gene expression. Am J Physiol Endocrinol Metab 2014; 307:E847-55. [PMID: 25205818 PMCID: PMC4216947 DOI: 10.1152/ajpendo.00324.2014] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022]
Abstract
Growth hormone (GH) plays a key role in regulating somatic growth and in controlling metabolism and other physiological processes in humans and other animal species. GH acts by binding to the extracellular part of its transmembrane receptor, leading to induction of multiple intracellular signal transduction pathways that culminate in changes in gene and protein expression. A key agent in GH-stimulated growth is the latent transcription factor signal transducer and activator of transcription (STAT) 5B, one of four STAT proteins induced by the GH receptor in cultured cells and in vivo. As shown by genetic and biochemical studies, GH-activated STAT5B promotes transcription of the gene encoding the critical growth peptide, insulin-like growth factor-I (IGF-I), and natural null mutations of STAT5B in humans lead to growth failure accompanied by diminished IGF-I expression. Here we have examined the possibility that other GH-activated STATs can enhance IGF-I gene transcription, and thus potentially contribute to GH-regulated somatic growth. We find that human STAT5A is nearly identical to STAT5B in its biochemical and functional responses to GH but that STAT1 and STAT3 show a weaker profile of in vitro binding to STAT DNA elements from the IGF-I gene than STAT5B, and are less potent inducers of gene transcription through these elements. Taken together, our results offer a molecular explanation for why STAT5B is a key in vivo mediator of GH-activated IGF-I gene transcription and thus of GH-regulated somatic growth.
Affiliation(s)
- Ben Varco-Merth
- Department of Biochemistry and Molecular Biology, Oregon Health & Science University, Portland, Oregon
- Peter Rotwein
- Department of Biochemistry and Molecular Biology, Oregon Health & Science University, Portland, Oregon
|
23
|
Yan H, Evans J, Kalmbach M, Moore R, Middha S, Luban S, Wang L, Bhagwate A, Li Y, Sun Z, Chen X, Kocher JPA. HiChIP: a high-throughput pipeline for integrative analysis of ChIP-Seq data. BMC Bioinformatics 2014; 15:280. [PMID: 25128017 PMCID: PMC4152589 DOI: 10.1186/1471-2105-15-280] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Received: 03/18/2014] [Accepted: 08/11/2014] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research. RESULTS We have developed a highly integrative pipeline, termed HiChIP, for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates, for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from transcription factor (TF) binding sites and functional annotation of peak-associated genes. CONCLUSIONS HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experimental designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
Affiliation(s)
- Jean-Pierre A Kocher
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN 55905, USA.
|
24
|
Abstract
DNA methylation patterns are important for establishing cell, tissue, and organism phenotypes, but little is known about their contribution to natural human variation. To determine their contribution to variability, we have generated genome-scale DNA methylation profiles of three human populations (Caucasian-American, African-American, and Han Chinese-American) and examined the differentially methylated CpG sites. The distinctly methylated genes identified suggest an influence of DNA methylation on phenotype differences, such as susceptibility to certain diseases and pathogens, and response to drugs and environmental agents. DNA methylation differences can be partially traced back to genetic variation, suggesting that differentially methylated CpG sites serve as evolutionarily established mediators between the genetic code and phenotypic variability. Notably, one-third of the DNA methylation differences were not associated with any genetic variation, suggesting that variation in population-specific sites takes place at the genetic and epigenetic levels, highlighting the contribution of epigenetic modification to natural human variation.
|
25
|
Baty F, Rüdiger J, Miglino N, Kern L, Borger P, Brutsche M. Exploring the transcription factor activity in high-throughput gene expression data using RLQ analysis. BMC Bioinformatics 2013; 14:178. [PMID: 23742070 PMCID: PMC3686578 DOI: 10.1186/1471-2105-14-178] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 12/21/2012] [Accepted: 05/30/2013] [Indexed: 12/14/2022] Open
Abstract
Background Interpretation of gene expression microarray data in the light of external information on both columns and rows (experimental variables and gene annotations) facilitates the extraction of pertinent information hidden in these complex data. Biologists classically interpret genes of interest after retrieving functional information from a subset of genes of interest. Transcription factors play an important role in orchestrating the regulation of gene expression. Their activity can be deduced by examining the presence of putative transcription factor binding sites in the gene promoter regions. Results In this paper we present the multivariate statistical method RLQ, which aims to analyze microarray data where additional information is available on both genes and samples. As an illustrative example, we applied RLQ methodology to analyze transcription factor activity associated with the time-course effect of steroids on the growth of primary human lung fibroblasts. RLQ could successfully predict transcription factor activity, and could integrate various other sources of external information in the main frame of the analysis. The approach was validated by means of alternative statistical methods and biological validation. Conclusions RLQ provides an efficient way of extracting and visualizing structures present in a gene expression dataset by directly modeling the link between experimental variables and gene annotations.
Affiliation(s)
- Florent Baty
- Division of Pulmonary Medicine, Cantonal Hospital St. Gallen, Rorschacherstrasse 95, CH-9007 St. Gallen, Switzerland.
|
26
|
Sheffield NC, Thurman RE, Song L, Safi A, Stamatoyannopoulos JA, Lenhard B, Crawford GE, Furey TS. Patterns of regulatory activity across diverse human cell types predict tissue identity, transcription factor binding, and long-range interactions. Genome Res 2013; 23:777-88. [PMID: 23482648 PMCID: PMC3638134 DOI: 10.1101/gr.152140.112] [Citation(s) in RCA: 158] [Impact Index Per Article: 13.2] [Received: 11/17/2012] [Accepted: 03/07/2013] [Indexed: 11/24/2022]
Abstract
Regulatory elements recruit transcription factors that modulate gene expression distinctly across cell types, but the relationships among these remain elusive. To address this, we analyzed matched DNase-seq and gene expression data for 112 human samples representing 72 cell types. We first defined more than 1800 clusters of DNase I hypersensitive sites (DHSs) with similar tissue specificity of DNase-seq signal patterns. We then used these to uncover distinct associations between DHSs and promoters, CpG islands, conserved elements, and transcription factor motif enrichment. Motif analysis within clusters identified known and novel motifs in cell-type-specific and ubiquitous regulatory elements and supports a role for AP-1 regulating open chromatin. We developed a classifier that accurately predicts cell-type lineage based on only 43 DHSs and evaluated the tissue of origin for cancer cell types. A similar classifier identified three sex-specific loci on the X chromosome, including the XIST lincRNA locus. By correlating DNase I signal and gene expression, we predicted regulated genes for more than 500K DHSs. Finally, we introduce a web resource to enable researchers to use these results to explore these regulatory patterns and better understand how expression is modulated within and across human cell types.
Affiliation(s)
- Nathan C. Sheffield
- Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina 27710, USA
- Robert E. Thurman
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
- Lingyun Song
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina 27710, USA
- Alexias Safi
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina 27710, USA
- Boris Lenhard
- Bergen Center for Computational Science and Sars Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
- Department of Molecular Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; and MRC Clinical Sciences Centre, London W12 0NN, United Kingdom
- Gregory E. Crawford
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina 27710, USA
- Department of Pediatrics, Division of Medical Genetics, Duke University, Durham, North Carolina 27710, USA
- Terrence S. Furey
- Department of Genetics and Department of Biology, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599, USA
|
27
|
Penkov D, Mateos San Martín D, Fernandez-Díaz LC, Rosselló CA, Torroja C, Sánchez-Cabo F, Warnatz HJ, Sultan M, Yaspo ML, Gabrieli A, Tkachuk V, Brendolan A, Blasi F, Torres M. Analysis of the DNA-binding profile and function of TALE homeoproteins reveals their specialization and specific interactions with Hox genes/proteins. Cell Rep 2013; 3:1321-33. [PMID: 23602564 DOI: 10.1016/j.celrep.2013.03.029] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.9] [Received: 07/18/2012] [Revised: 02/19/2013] [Accepted: 03/20/2013] [Indexed: 11/28/2022] Open
Abstract
The interactions of Meis, Prep, and Pbx1 TALE homeoproteins with Hox proteins are essential for development and disease. Although Meis and Prep behave similarly in vitro, their in vivo activities remain largely unexplored. We show that Prep and Meis interact with largely independent sets of genomic sites and select different DNA-binding sequences, Prep associating mostly with promoters and housekeeping genes and Meis with promoter-remote regions and developmental genes. Hox target sequences associate strongly with Meis but not with Prep binding sites, while Pbx1 cooperates with both Prep and Meis. Accordingly, Meis1 shows strong genetic interaction with Pbx1 but not with Prep1. Meis1 and Prep1 nonetheless coregulate a subset of genes, predominantly through opposing effects. Notably, the TALE homeoprotein binding profile subdivides Hox clusters into two domains differentially regulated by Meis1 and Prep1. During evolution, Meis and Prep thus specialized their interactions but maintained significant regulatory coordination.
Affiliation(s)
- Dmitry Penkov
- Foundation FIRC Institute of Molecular Oncology at the IFOM-IEO Campus, via Adamello 16, 20139 Milan, Italy
|
28
|
A GWAS sequence variant for platelet volume marks an alternative DNM3 promoter in megakaryocytes near a MEIS1 binding site. Blood 2012; 120:4859-68. [PMID: 22972982 PMCID: PMC3520622 DOI: 10.1182/blood-2012-01-401893] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 02/02/2023] Open
Abstract
We recently identified 68 genomic loci where common sequence variants are associated with platelet count and volume. Platelets are formed in the bone marrow by megakaryocytes, which are derived from hematopoietic stem cells by a process mainly controlled by transcription factors. The homeobox transcription factor MEIS1 is uniquely transcribed in megakaryocytes and not in the other lineage-committed blood cells. By ChIP-seq, we show that 5 of the 68 loci pinpoint a MEIS1 binding event within a group of 252 MK-overexpressed genes. In one such locus in DNM3, regulating platelet volume, the MEIS1 binding site falls within a region acting as an alternative promoter that is solely used in megakaryocytes, where allelic variation dictates different levels of a shorter transcript. The importance of dynamin activity to the latter stages of thrombopoiesis was confirmed by the observation that the inhibitor Dynasore reduced murine proplatelet formation in vitro.
|
29
|
A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 2012; 7:1551-68. [DOI: 10.1038/nprot.2012.088] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Indexed: 02/03/2023]
|
30
|
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.6] [Indexed: 02/07/2023] Open
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
|
31
|
Arabidopsis circadian clock protein, TOC1, is a DNA-binding transcription factor. Proc Natl Acad Sci U S A 2012; 109:3167-72. [PMID: 22315425 DOI: 10.1073/pnas.1200355109] [Citation(s) in RCA: 385] [Impact Index Per Article: 29.6] [Indexed: 11/18/2022] Open
Abstract
The first described feedback loop of the Arabidopsis circadian clock is based on reciprocal regulation between Timing of CAB Expression 1 (TOC1) and Circadian Clock-associated 1 (CCA1)/late elongated hypocotyl (LHY). CCA1 and LHY are Myb transcription factors that bind directly to the TOC1 promoter to negatively regulate its expression. Conversely, the activity of TOC1 has remained less well characterized. Genetic data support that TOC1 is necessary for the reactivation of CCA1/LHY, but there is little description of its biochemical function. Here we show that TOC1 occupies specific genomic regions in the CCA1 and LHY promoters. Purified TOC1 binds directly to DNA through its CCT domain, which is similar to known DNA-binding domains. Chemical induction and transient overexpression of TOC1 in Arabidopsis seedlings cause repression of CCA1/LHY expression, demonstrating that TOC1 can repress direct targets, and mutation or deletion of the CCT domain prevents this repression showing that DNA-binding is necessary for TOC1 action. Furthermore, we use the Gal4/UAS system in Arabidopsis to show that TOC1 acts as a general transcriptional repressor, and that repression activity is in the pseudoreceiver domain of the protein. To identify the genes regulated by TOC1 on a genomic scale, we couple TOC1 chemical induction with microarray analysis and identify previously unexplored potential TOC1 targets and output pathways. Taken together, these results define a biochemical action for the core clock protein TOC1 and refine our perspective on how plant clocks function.
|
32
|
Barozzi I, Termanini A, Minucci S, Natoli G. Fish the ChIPs: a pipeline for automated genomic annotation of ChIP-Seq data. Biol Direct 2011; 6:51. [PMID: 21978789 PMCID: PMC3201895 DOI: 10.1186/1745-6150-6-51] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 04/28/2011] [Accepted: 10/06/2011] [Indexed: 11/21/2022] Open
Abstract
Background High-throughput sequencing is generating massive amounts of data at a pace that largely exceeds the throughput of data analysis routines. Here we introduce Fish the ChIPs (FC), a computational pipeline aimed at a broad public of users and designed to perform complete ChIP-Seq data analysis of an unlimited number of samples, thus increasing throughput and reproducibility and saving time. Results Starting from short read sequences, FC performs the following steps: 1) quality controls, 2) alignment to a reference genome, 3) peak calling, 4) genomic annotation, 5) generation of raw signal tracks for visualization on the UCSC and IGV genome browsers. FC exploits some of the fastest and most effective tools currently available. Installation on a Mac platform requires very basic computational skills, while configuration and usage are supported by a user-friendly graphical user interface. Alternatively, FC can be compiled from the source code on any Unix machine and then run with the possibility of customizing every parameter through a simple configuration text file that can be generated using a dedicated user-friendly web form. In terms of execution time, FC can be run on a desktop machine, even though the use of a computer cluster is recommended for analyses of large batches of data. FC is well suited to work with data coming from Illumina Solexa Genome Analyzers or ABI SOLiD, and its usage can potentially be extended to any sequencing platform. Conclusions Compared to existing tools, FC has two main advantages that make it suitable for a broad range of users. First, it can be installed and run by wet-lab biologists on a Mac machine. Second, it can handle an unlimited number of samples, making it convenient for large analyses. In this context, computational biologists can increase reproducibility of their ChIP-Seq data analyses while saving time for downstream analyses.
Reviewers This article was reviewed by Gavin Huttley, George Shpakovski and Sarah Teichmann.
Affiliation(s)
- Iros Barozzi
- Department of Experimental Oncology, European Institute of Oncology (IEO), IFOM-IEO Campus, Via Adamello 16, Milan, Italy.
|