1
Vorontsov IE, Eliseeva IA, Zinkevich A, Nikonov M, Abramov S, Boytsov A, Kamenets V, Kasianova A, Kolmykov S, Yevshin I, Favorov A, Medvedeva YA, Jolma A, Kolpakov F, Makeev V, Kulakovskiy I. HOCOMOCO in 2024: a rebuild of the curated collection of binding models for human and mouse transcription factors. Nucleic Acids Res 2024; 52:D154-D163. [PMID: 37971293; PMCID: PMC10767914; DOI: 10.1093/nar/gkad1077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 10/17/2023] [Accepted: 10/26/2023] [Indexed: 11/19/2023]
Abstract
We present a major update of the HOCOMOCO collection that provides DNA binding specificity patterns of 949 human transcription factors and 720 mouse orthologs. To make this release, we performed motif discovery in peak sets that originated from 14 183 ChIP-Seq experiments and reads from 2554 HT-SELEX experiments yielding more than 400 thousand candidate motifs. The candidate motifs were annotated according to their similarity to known motifs and the hierarchy of DNA-binding domains of the respective transcription factors. Next, the motifs underwent human expert curation to stratify distinct motif subtypes and remove non-informative patterns and common artifacts. Finally, the curated subset of 100 thousand motifs was supplied to the automated benchmarking to select the best-performing motifs for each transcription factor. The resulting HOCOMOCO v12 core collection contains 1443 verified position weight matrices, including distinct subtypes of DNA binding motifs for particular transcription factors. In addition to the core collection, HOCOMOCO v12 provides motif sets optimized for the recognition of binding sites in vivo and in vitro, and for annotation of regulatory sequence variants. HOCOMOCO is available at https://hocomoco12.autosome.org and https://hocomoco.autosome.org.
Affiliation(s)
- Ilya E Vorontsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Irina A Eliseeva
  - Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
- Arsenii Zinkevich
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Mikhail Nikonov
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991 Moscow, Russia
- Sergey Abramov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Alexandr Boytsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Altius Institute for Biomedical Sciences, 98121 Seattle, WA, USA
- Vasily Kamenets
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
  - Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Alexandra Kasianova
  - Skolkovo Institute of Science and Technology, 121205 Moscow, Russia
  - Institute for Information Transmission Problems of the Russian Academy of Sciences, 127051 Moscow, Russia
- Semyon Kolmykov
  - Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
- Alexander Favorov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Yulia A Medvedeva
  - Research Center of Biotechnology RAS, Russian Academy of Sciences, 119071 Moscow, Russia
- Arttu Jolma
  - Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Fedor Kolpakov
  - Department of Computational Biology, Sirius University of Science and Technology, 354340 Sirius, Krasnodar region, Russia
  - Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090 Novosibirsk, Russia
- Vsevolod J Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Moscow Institute of Physics and Technology, 141700 Dolgoprudny, Russia
  - Institute of Biochemistry and Genetics of the Ufa Federal Research Centre of the Russian Academy of Sciences, 450054 Ufa, Russia
- Ivan V Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, 142290 Pushchino, Russia
  - Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, 420008 Kazan, Russia
2
Grau J, Schmidt F, Schulz MH. Widespread effects of DNA methylation and intra-motif dependencies revealed by novel transcription factor binding models. Nucleic Acids Res 2023; 51:e95. [PMID: 37650641; PMCID: PMC10570048; DOI: 10.1093/nar/gkad693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/20/2023] [Accepted: 08/10/2023] [Indexed: 09/01/2023]
Abstract
Several studies suggested that transcription factor (TF) binding to DNA may be impaired or enhanced by DNA methylation. We present MeDeMo, a toolbox for TF motif analysis that combines information about DNA methylation with models capturing intra-motif dependencies. In a large-scale study using ChIP-seq data for 335 TFs, we identify novel TFs that show a binding behaviour associated with DNA methylation. Overall, we find that the presence of CpG methylation decreases the likelihood of binding for the majority of methylation-associated TFs. For a considerable subset of TFs, we show that intra-motif dependencies are pivotal for accurately modelling the impact of DNA methylation on TF binding. We illustrate that the novel methylation-aware TF binding models allow the prediction of differential ChIP-seq peaks and improve the genome-wide analysis of TF binding. Our work indicates that simplistic models that neglect the effect of DNA methylation on DNA binding may lead to systematic underperformance for methylation-associated TFs.
Affiliation(s)
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle 06120, Germany
- Florian Schmidt
  - Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
  - Systems Biology and Data Analytics, Genome Institute of Singapore, Singapore 13862, Singapore
  - ImmunoScape Pte Ltd, Singapore 228208, Singapore
- Marcel H Schulz
  - Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
  - German Center for Cardiovascular Research, Partner site Rhein-Main, 60590 Frankfurt am Main, Germany
  - Cardio-Pulmonary Institute, Goethe University, Frankfurt am Main, Germany
3
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113; DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023]
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Manu Saraswat
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Sara Mostafavi
  - Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
- Wyeth W Wasserman
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
4
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664; PMCID: PMC10422928; DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023]
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
5
Bang I, Lee SM, Park S, Park JY, Nong LK, Gao Y, Palsson BO, Kim D. Deep-learning optimized DEOCSU suite provides an iterable pipeline for accurate ChIP-exo peak calling. Brief Bioinform 2023; 24:7005164. [PMID: 36702751; DOI: 10.1093/bib/bbad024] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/23/2022] [Revised: 01/02/2023] [Accepted: 01/08/2023] [Indexed: 01/28/2023]
Abstract
Recognizing binding sites of DNA-binding proteins is a key factor for elucidating transcriptional regulation in organisms. ChIP-exo enables researchers to delineate genome-wide binding landscapes of DNA-binding proteins with near single base-pair resolution. However, the peak calling step hinders ChIP-exo application since the published algorithms tend to generate false-positive and false-negative predictions. Here, we report the development of DEOCSU (DEep-learning Optimized ChIP-exo peak calling SUite), a novel machine learning-based ChIP-exo peak calling suite. DEOCSU entails the deep convolutional neural network model which was trained with curated ChIP-exo peak data to distinguish the visualized data of bona fide peaks from false ones. Performance validation of the trained deep-learning model indicated its high accuracy, high precision and high recall of over 95%. Applying the new suite to both in-house and publicly available ChIP-exo datasets obtained from bacteria, eukaryotes and archaea revealed an accurate prediction of peaks containing canonical motifs, highlighting the versatility and efficiency of DEOCSU. Furthermore, DEOCSU can be executed on a cloud computing platform or the local environment. With visualization software included in the suite, adjustable options such as the threshold of peak probability, and iterable updating of the pre-trained model, DEOCSU can be optimized for users' specific needs.
Affiliation(s)
- Ina Bang
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Sang-Mok Lee
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Seojoung Park
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Joon Young Park
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Linh Khanh Nong
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
- Ye Gao
  - Department of Bioengineering, University of California San Diego, La Jolla CA 92093, USA
- Bernhard O Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla CA 92093, USA
  - Department of Pediatrics, University of California San Diego, La Jolla CA 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
- Donghyuk Kim
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
6
Boytsov A, Abramov S, Makeev VJ, Kulakovskiy IV. Positional weight matrices have sufficient prediction power for analysis of noncoding variants. F1000Res 2022; 11:33. [PMID: 35811788; PMCID: PMC9237556; DOI: 10.12688/f1000research.75471.3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 06/30/2022] [Indexed: 11/23/2022]
Abstract
The position weight matrix, also called the position-specific scoring matrix, is the commonly accepted model to quantify the specificity of transcription factor binding to DNA. Position weight matrices are used in thousands of projects and software tools in regulatory genomics, including computational prediction of the regulatory impact of single-nucleotide variants. Yet, recently Yan et al. reported that "the position weight matrices of most transcription factors lack sufficient predictive power" if applied to the analysis of regulatory variants studied with a newly developed experimental method, SNP-SELEX. Here, we re-analyze the rich experimental dataset obtained by Yan et al. and show that appropriately selected position weight matrices in fact can adequately quantify transcription factor binding to alternative alleles.
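To illustrate how a position weight matrix can quantify binding to alternative alleles, here is a minimal sketch. It assumes a log-odds matrix built from toy nucleotide counts with a uniform background and a pseudocount. The function names (`pwm_from_counts`, `best_score`, `allele_delta`), the toy motif, and the forward-strand-only scan are illustrative assumptions, not the authors' pipeline.

```python
import math

BASES = "ACGT"

def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn per-position nucleotide counts into log2-odds weights.

    `counts` is a list of dicts (one per motif position) mapping base -> count.
    The pseudocount and uniform background are illustrative assumptions.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, site):
    """Sum the weights of the bases of `site`, which must be as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_score(pwm, sequence):
    """Best forward-strand score over all windows; sequence must be >= PWM width."""
    width = len(pwm)
    return max(
        score(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )

def allele_delta(pwm, ref_seq, alt_seq):
    """Difference in best PWM score between alternative and reference alleles."""
    return best_score(pwm, alt_seq) - best_score(pwm, ref_seq)

# Toy motif "TGAC" (hypothetical counts) and a variant that disrupts its third position.
toy_counts = [{"T": 10}, {"G": 10}, {"A": 10}, {"C": 10}]
toy_pwm = pwm_from_counts(toy_counts)
delta = allele_delta(toy_pwm, "AATGACAA", "AATGTCAA")
print(f"score change (alt - ref): {delta:.2f}")
```

A negative difference indicates that the alternative allele is predicted to bind more weakly than the reference. A practical analysis would also scan the reverse complement and calibrate scores, for example by converting them to P-values.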
Affiliation(s)
- Alexandr Boytsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
  - Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
- Sergey Abramov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
  - Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
- Vsevolod J. Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
  - Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
- Ivan V. Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290, Russian Federation
7
Li Y, Ni P, Zhang S, Li G, Su Z. ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery. Bioinformatics 2020; 35:4632-4639. [PMID: 31070745; DOI: 10.1093/bioinformatics/btz290] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/03/2018] [Revised: 03/29/2019] [Accepted: 04/18/2019] [Indexed: 01/25/2023]
Abstract
Motivation: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets.
Results: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes.
Availability and implementation: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms.
Supplementary information: Supplementary materials are available at Bioinformatics online.
Affiliation(s)
- Yang Li
  - School of Mathematics, Shandong University, Jinan 250100, China
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Pengyu Ni
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Shaoqiang Zhang
  - College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan 250100, China
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Zhengchang Su
  - Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
8
Chitpin JG, Awdeh A, Perkins TJ. RECAP reveals the true statistical significance of ChIP-seq peak calls. Bioinformatics 2020; 35:3592-3598. [PMID: 30824903; PMCID: PMC6761936; DOI: 10.1093/bioinformatics/btz150] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/03/2018] [Revised: 01/18/2019] [Accepted: 02/27/2019] [Indexed: 12/29/2022]
Abstract
Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, thus the true significance or reliability of peak calls remains unknown.
Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls.
Availability and implementation: The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP.
Supplementary information: Supplementary data are available at Bioinformatics online.
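The idea of a monotone recalibration estimated from resampled null data can be sketched as follows. This is a simplified illustration, not the RECAP implementation. It assumes that a peak caller has already been run on resampled (null) data and that its reported P-values are available as a plain list. The function name `recalibrate` and the add-one smoothing are assumptions made for this example.

```python
from bisect import bisect_right

def recalibrate(observed_pvalues, null_pvalues):
    """Map each reported P-value to an empirical P-value under the null.

    The recalibrated value is the (add-one smoothed) fraction of null P-values
    that are at least as small as the reported one. The mapping is monotone:
    a smaller reported P-value never gets a larger recalibrated value.
    """
    null_sorted = sorted(null_pvalues)
    n = len(null_sorted)
    return [
        (bisect_right(null_sorted, p) + 1) / (n + 1)
        for p in observed_pvalues
    ]

# Hypothetical numbers: P-values reported on resampled (null) data and on real data.
null_p = [0.001, 0.01, 0.1, 0.5]
reported_p = [0.0001, 0.05, 0.9]
print(recalibrate(reported_p, null_p))
```

If the reported P-values are too optimistic, the null runs themselves produce many small P-values, so the recalibrated values become correspondingly larger. With real data, far more null P-values than the four used here would be needed for a usable resolution.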
Affiliation(s)
- Justin G Chitpin
  - Translational and Molecular Medicine Program, University of Ottawa, Ottawa, ON K1H8M5, Canada
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
- Aseel Awdeh
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N6N5, Canada
- Theodore J Perkins
  - Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, ON K1H8L6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N6N5, Canada
  - Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, ON K1H8M5, Canada
9
Venters BJ. Insights from resolving protein-DNA interactions at near base-pair resolution. Brief Funct Genomics 2019; 17:80-88. [PMID: 29211822; DOI: 10.1093/bfgp/elx043] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 01/09/2023]
Abstract
One of the central goals in molecular biology is to understand how cell-type-specific expression patterns arise through selective recruitment of RNA polymerase II (Pol II) to a subset of gene promoters. Pol II needs to be recruited to a precise genomic position at the proper time to produce messenger RNA from a DNA template. Ostensibly, transcription is a relatively simple cellular process; yet, experimentally measuring and then understanding the combinatorial possibilities of transcriptional regulators remain a daunting task. Since its introduction in 1985, chromatin immunoprecipitation (ChIP) has remained a key tool for investigating protein-DNA contacts in vivo. Over 30 years of intensive research using ChIP have provided numerous insights into mechanisms of gene regulation. As functional genomic technologies improve, they present new opportunities to address key biological questions. ChIP-exo is a refined version of ChIP-seq that significantly reduces background signal, while providing near base-pair mapping resolution for protein-DNA interactions. This review discusses the evolution of the ChIP assay over the years and the methodological differences between ChIP-seq, ChIP-exo and ChIP-nexus, and highlights new insights into epigenetic and transcriptional mechanisms that were uniquely enabled with the near base-pair resolution of ChIP-exo.
10
Cavalli M, Baltzer N, Umer HM, Grau J, Lemnian I, Pan G, Wallerman O, Spalinskas R, Sahlén P, Grosse I, Komorowski J, Wadelius C. Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases. Sci Rep 2019; 9:2695. [PMID: 30804403; PMCID: PMC6389883; DOI: 10.1038/s41598-019-39633-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 05/22/2018] [Accepted: 01/24/2019] [Indexed: 12/20/2022]
Abstract
Several Genome Wide Association Studies (GWAS) have reported variants associated with immune diseases. However, the identified variants are rarely the drivers of the associations and the molecular mechanisms behind the genetic contributions remain poorly understood. ChIP-seq data for TFs and histone modifications provide snapshots of protein-DNA interactions allowing the identification of heterozygous SNPs showing significant allele specific signals (AS-SNPs). AS-SNPs can change a TF binding site resulting in altered gene regulation and are primary candidates to explain associations observed in GWAS and expression studies. We identified 17,293 unique AS-SNPs across 7 lymphoblastoid cell lines. In this set of cell lines we interrogated 85% of common genetic variants in the population for potential regulatory effect and we identified 237 AS-SNPs associated with immune GWAS traits and 714 with gene expression in B cells. To elucidate possible regulatory mechanisms we integrated long-range 3D interactions data to identify putative target genes and motif predictions to identify TFs whose binding may be affected by AS-SNPs, yielding a collection of 173 AS-SNPs associated with gene expression and 60 with B cell related traits. We present a systems strategy to find functional gene regulatory variants, the TFs that bind differentially between alleles and novel strategies to detect the regulated genes.
Affiliation(s)
- Marco Cavalli
  - Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Nicholas Baltzer
  - Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden
- Husen M Umer
  - Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
- Ioana Lemnian
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
- Gang Pan
  - Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Ola Wallerman
  - Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Rapolas Spalinskas
  - Science for Life Laboratory, Division of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Pelin Sahlén
  - Science for Life Laboratory, Division of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Jan Komorowski
  - Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden
  - Institute of Computer Science, Polish Academy of Sciences, Warszawa, Poland
- Claes Wadelius
  - Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
11
Keilwagen J, Posch S, Grau J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol 2019; 20:9. [PMID: 30630522; PMCID: PMC6327544; DOI: 10.1186/s13059-018-1614-y] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 07/25/2018] [Accepted: 12/18/2018] [Indexed: 01/11/2023]
Abstract
Prediction of cell type-specific, in vivo transcription factor binding sites is one of the central challenges in regulatory genomics. Here, we present our approach that earned a shared first rank in the "ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge" in 2017. In post-challenge analyses, we benchmark the influence of different feature sets and find that chromatin accessibility and binding motifs are sufficient to yield state-of-the-art performance. Finally, we provide 682 lists of predicted peaks for a total of 31 transcription factors in 22 primary cell types and tissues and a user-friendly version of our approach, Catchitt, for download.
Affiliation(s)
- Jens Keilwagen
  - Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Erwin-Baur-Straße 27, Quedlinburg, 06484 Germany
- Stefan Posch
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
12
Tedeschi F, Rizzo P, Huong BTM, Czihal A, Rutten T, Altschmied L, Scharfenberg S, Grosse I, Becker C, Weigel D, Bäumlein H, Kuhlmann M. EFFECTOR OF TRANSCRIPTION factors are novel plant-specific regulators associated with genomic DNA methylation in Arabidopsis. The New Phytologist 2019; 221:261-278. [PMID: 30252137; PMCID: PMC6585611; DOI: 10.1111/nph.15439] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/27/2018] [Accepted: 07/01/2018] [Indexed: 05/02/2023]
Abstract
Plant-specific EFFECTORS OF TRANSCRIPTION (ET) are characterised by a variable number of highly conserved ET repeats, which are involved in zinc and DNA binding. In addition, ETs share a GIY-YIG domain, involved in DNA nicking activity. It was hypothesised that ETs might act as epigenetic regulators. Here, methylome, transcriptome and phenotypic analyses were performed to investigate the role of ET factors and their involvement in DNA methylation in Arabidopsis thaliana. Comparative DNA methylation and transcriptome analyses in flowers and seedlings of et mutants revealed ET-specific differentially expressed genes and mostly independently characteristic, ET-specific differentially methylated regions. Loss of ET function results in pleiotropic developmental defects. The accumulation of cyclobutane pyrimidine dimers after ultraviolet stress in et mutants suggests an ET function in DNA repair.
Affiliation(s)
- Francesca Tedeschi
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Paride Rizzo
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Bui Thi Mai Huong
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Andreas Czihal
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Twan Rutten
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Lothar Altschmied
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | | | - Ivo Grosse
- Department of Bioinformatics, Martin-Luther-University, 06120 Halle, Germany
| | - Claude Becker
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Gregor Mendel Institute of Molecular Plant Biology, 1030 Vienna, Austria
| | - Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Helmut Bäumlein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| | - Markus Kuhlmann
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), 06466 Seeland OT Gatersleben, Germany
| |
|
13
|
Martins-Santana L, Nora LC, Sanches-Medeiros A, Lovate GL, Cassiano MHA, Silva-Rocha R. Systems and Synthetic Biology Approaches to Engineer Fungi for Fine Chemical Production. Front Bioeng Biotechnol 2018; 6:117. [PMID: 30338257 PMCID: PMC6178918 DOI: 10.3389/fbioe.2018.00117] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 04/09/2018] [Accepted: 08/02/2018] [Indexed: 01/16/2023] Open
Abstract
Since the advent of systems and synthetic biology, many studies have sought to harness microbes as cell factories through genetic and metabolic engineering approaches. Yeast and filamentous fungi have been successfully harnessed to produce fine and high value-added chemical products. In this review, we present some of the most promising advances from recent years in the use of fungi for this purpose, focusing on the manipulation of fungal strains using systems and synthetic biology tools to improve metabolic flow and the flow of secondary metabolites by pathway redesign. We also review the roles of bioinformatics analysis and predictions in synthetic circuits, highlighting in silico systemic approaches to improve the efficiency of synthetic modules.
Affiliation(s)
- Leonardo Martins-Santana
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| | - Luisa C Nora
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| | - Ananda Sanches-Medeiros
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| | - Gabriel L Lovate
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| | - Murilo H A Cassiano
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| | - Rafael Silva-Rocha
- Systems and Synthetic Biology Laboratory, Cell and Molecular Biology Department, Ribeirão Preto Medical School, São Paulo University (FMRP-USP), Ribeirão Preto, Brazil
| |
|
14
|
Abstract
Motivation: The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the discovery power of the large amount of accumulated high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences. Results: We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real-world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Preliminary results also show that CDAUC may be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequence specificities. Availability and Implementation: CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8. Supplementary information: Supplementary data are available at Bioinformatics online.
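To make the AUC criterion named in this abstract concrete, here is a minimal Python sketch of how an AUC can be computed from motif scores on bound (positive) and unbound (negative) sequences. It is not the CDAUC algorithm, whose details are not given here; the function name `auc` and the toy scores are illustrative assumptions.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sequence
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical motif scores, for illustration only.
bound = [0.9, 0.4]
unbound = [0.5, 0.1]
print(auc(bound, unbound))  # 0.75: three of the four pairs are ranked correctly
```

Because the value depends only on the ordering of the scores, it stays constant under any parameter change that does not swap the order of a positive and a negative score, which is consistent with the piece-wise constant behaviour the abstract describes.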
Affiliation(s)
- Lin Zhu
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Hong-Bo Zhang
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
| |
|
15
|
Rossi MJ, Lai WKM, Pugh BF. Genome-wide determinants of sequence-specific DNA binding of general regulatory factors. Genome Res 2018; 28:497-508. [PMID: 29563167 PMCID: PMC5880240 DOI: 10.1101/gr.229518.117] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 08/24/2017] [Accepted: 03/05/2018] [Indexed: 01/01/2023]
Abstract
General regulatory factors (GRFs), such as Reb1, Abf1, Rap1, Mcm1, and Cbf1, positionally organize yeast chromatin through interactions with a core consensus DNA sequence. It is assumed that sequence recognition via direct base readout suffices for specificity and that spurious nonfunctional sites are rendered inaccessible by chromatin. We tested these assumptions through genome-wide mapping of GRFs in vivo and in purified biochemical systems at near–base pair (bp) resolution using several ChIP-exo–based assays. We find that computationally predicted DNA shape features (e.g., minor groove width, helix twist, base roll, and propeller twist) that are not defined by a unique consensus sequence are embedded in the nonunique portions of GRF motifs and contribute critically to sequence-specific binding. This dual source specificity occurs at GRF sites in promoter regions where chromatin organization starts. Outside of promoter regions, strong consensus sites lack the shape component and consequently lack an intrinsic ability to bind cognate GRFs, without regard to influences from chromatin. However, sites having a weak consensus and low intrinsic affinity do exist in these regions but are rendered inaccessible in a chromatin environment. Thus, GRF site-specificity is achieved through integration of favorable DNA sequence and shape readouts in promoter regions and by chromatin-based exclusion from fortuitous weak sites within gene bodies. This study further revealed a severe G/C nucleotide cross-linking selectivity inherent in all formaldehyde-based ChIP assays, which includes ChIP-seq. However, for most tested proteins, G/C selectivity did not appreciably affect binding site detection, although it does place limits on the quantitativeness of occupancy levels.
Affiliation(s)
- Matthew J Rossi
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - William K M Lai
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - B Franklin Pugh
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| |
|
16
|
Ye Z, Ma T, Kalmbach MT, Dasari S, Kocher JPA, Wang L. CircularLogo: A lightweight web application to visualize intra-motif dependencies. BMC Bioinformatics 2017; 18:269. [PMID: 28532394 PMCID: PMC5440937 DOI: 10.1186/s12859-017-1680-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/16/2016] [Accepted: 05/11/2017] [Indexed: 01/09/2023] Open
Abstract
Background: The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization. Results: We developed CircularLogo, a web-based interactive application, which is able to not only visualize the position-specific nucleotide consensus and diversity but also display the intra-motif dependencies. Applying CircularLogo to HNF6 binding sites and tRNA sequences demonstrated its ability to show intra-motif dependencies and intuitively reveal biomolecular structure. CircularLogo is implemented in JavaScript and Python based on the Django web framework. The program’s source code and user’s manual are freely available at http://circularlogo.sourceforge.net. CircularLogo web server can be accessed from http://bioinformaticstools.mayo.edu/circularlogo/index.html. Conclusion: CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
Affiliation(s)
- Zhenqing Ye
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Tao Ma
- Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
| | - Michael T Kalmbach
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Surendra Dasari
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Jean-Pierre A Kocher
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Liguo Wang
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
| |
|
17
|
Jankowsky E, Harris ME. Mapping specificity landscapes of RNA-protein interactions by high throughput sequencing. Methods 2017; 118-119:111-118. [PMID: 28263887 DOI: 10.1016/j.ymeth.2017.03.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/18/2016] [Revised: 02/07/2017] [Accepted: 03/01/2017] [Indexed: 12/28/2022] Open
Abstract
To function in a biological setting, RNA binding proteins (RBPs) have to discriminate between alternative binding sites in RNAs. This discrimination can occur in the ground state of an RNA-protein binding reaction, in its transition state, or in both. The extent by which RBPs discriminate at these reaction states defines RBP specificity landscapes. Here, we describe the HiTS-Kin and HiTS-EQ techniques, which combine kinetic and equilibrium binding experiments with high throughput sequencing to quantitatively assess substrate discrimination for large numbers of substrate variants at ground and transition states of RNA-protein binding reactions. We discuss experimental design, practical considerations and data analysis and outline how a combination of HiTS-Kin and HiTS-EQ allows the mapping of RBP specificity landscapes.
Affiliation(s)
- Eckhard Jankowsky
- Center for RNA Molecular Biology, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States; Department of Biochemistry, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States.
| | - Michael E Harris
- Department of Biochemistry, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States
| |
|
18
|
Nettling M, Treutler H, Cerquides J, Grosse I. Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies. BMC Bioinformatics 2017; 18:141. [PMID: 28249564 PMCID: PMC5333389 DOI: 10.1186/s12859-017-1495-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/29/2016] [Accepted: 01/24/2017] [Indexed: 11/23/2022] Open
Abstract
Background: Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for de-novo motif discovery can be subdivided into phylogenetic footprinting, which takes into account phylogenetic dependencies in aligned sequences of more than one species, and non-phylogenetic approaches based on sequences from only one species, which typically take into account intra-motif dependencies. It has been shown that modeling (i) phylogenetic dependencies as well as (ii) intra-motif dependencies separately improves de-novo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously. Results: Here, we present an approach for de-novo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intra-motif dependencies. We study the degree of intra-motif dependencies inferred by this approach from ChIP-seq data of 35 transcription factors. We find that significant intra-motif dependencies of orders 1 and 2 are present in all 35 datasets and that intra-motif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intra-motif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1. Conclusion: Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1495-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.
| | | | - Jesus Cerquides
- Institut d'Investigació en Intel·ligència Artificial, IIIA-CSIC, Campus UAB, Cerdanyola, Spain
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| |
|
19
|
Trenner J, Poeschl Y, Grau J, Gogol-Döring A, Quint M, Delker C. Auxin-induced expression divergence between Arabidopsis species may originate within the TIR1/AFB-AUX/IAA-ARF module. JOURNAL OF EXPERIMENTAL BOTANY 2017; 68:539-552. [PMID: 28007950 DOI: 10.1093/jxb/erw457] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 05/22/2023]
Abstract
Auxin is an essential regulator of plant growth and development, and auxin signaling components are conserved among land plants. Yet, a remarkable degree of natural variation in physiological and transcriptional auxin responses has been described among Arabidopsis thaliana accessions. As intraspecies comparisons offer only limited genetic variation, we here inspect the variation of auxin responses between A. thaliana and A. lyrata. This approach allowed the identification of conserved auxin response genes including novel genes with potential relevance for auxin biology. Furthermore, promoter divergences were analyzed for putative sources of variation. De novo motif discovery identified novel and variants of known elements with potential relevance for auxin responses, emphasizing the complex, and yet elusive, code of element combinations accounting for the diversity in transcriptional auxin responses. Furthermore, network analysis revealed correlations of interspecies differences in the expression of AUX/IAA gene clusters and classic auxin-related genes. We conclude that variation in general transcriptional and physiological auxin responses may originate substantially from functional or transcriptional variations in the TIR1/AFB, AUX/IAA, and ARF signaling network. In that respect, AUX/IAA gene expression divergence potentially reflects differences in the manner in which different species transduce identical auxin signals into gene expression responses.
Affiliation(s)
- Jana Trenner
- Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Betty-Heimann, Halle (Saale), Germany
- Department of Molecular Signal Processing, Leibniz Institute of Plant Biochemistry, Weinberg 3, Halle (Saale), Germany
| | - Yvonne Poeschl
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, Leipzig, Germany
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), Germany
| | - Andreas Gogol-Döring
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, Leipzig, Germany
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), Germany
| | - Marcel Quint
- Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Betty-Heimann, Halle (Saale), Germany
- Department of Molecular Signal Processing, Leibniz Institute of Plant Biochemistry, Weinberg 3, Halle (Saale), Germany
| | - Carolin Delker
- Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Betty-Heimann, Halle (Saale), Germany
- Department of Molecular Signal Processing, Leibniz Institute of Plant Biochemistry, Weinberg 3, Halle (Saale), Germany
| |
|
20
|
Austin RS, Hiu S, Waese J, Ierullo M, Pasha A, Wang TT, Fan J, Foong C, Breit R, Desveaux D, Moses A, Provart NJ. New BAR tools for mining expression data and exploring Cis-elements in Arabidopsis thaliana. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2016; 88:490-504. [PMID: 27401965 DOI: 10.1111/tpj.13261] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 04/14/2016] [Revised: 06/23/2016] [Accepted: 07/01/2016] [Indexed: 05/21/2023]
Abstract
Identifying sets of genes that are specifically expressed in certain tissues or in response to an environmental stimulus is useful for designing reporter constructs, generating gene expression markers, or for understanding gene regulatory networks. We have developed an easy-to-use online tool for defining a desired expression profile (a modification of our Expression Angler program), which can then be used to identify genes exhibiting patterns of expression that match this profile as closely as possible. Further, we have developed another online tool, Cistome, for predicting or exploring cis-elements in the promoters of sets of co-expressed genes identified by such a method, or by other methods. We present two use cases for these tools, which are freely available on the Bio-Analytic Resource at http://BAR.utoronto.ca.
Affiliation(s)
- Ryan S Austin
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Shu Hiu
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Jamie Waese
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Matthew Ierullo
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Asher Pasha
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Ting Ting Wang
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Jim Fan
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Curtis Foong
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Robert Breit
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Darrell Desveaux
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Alan Moses
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| | - Nicholas J Provart
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON, M5S 3B2, Canada
| |
|
21
|
Boeva V. Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells. Front Genet 2016; 7:24. [PMID: 26941778 PMCID: PMC4763482 DOI: 10.3389/fgene.2016.00024] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Received: 10/30/2015] [Accepted: 02/05/2016] [Indexed: 12/27/2022] Open
Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA and RNA associated proteins, splice sites, and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as position weight matrix and IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences. These are: identification in a set of DNA regions of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
Affiliation(s)
- Valentina Boeva
- Centre de Recherche, Institut Curie, Paris, France; INSERM, U900, Paris, France; Mines ParisTech, Fontainebleau, France; PSL Research University, Paris, France; Department of Development, Reproduction and Cancer, Institut Cochin, Paris, France; INSERM, U1016, Paris, France; Centre National de la Recherche Scientifique UMR 8104, Paris, France; Université Paris Descartes UMR-S1016, Paris, France
| |
|
22
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 PMCID: PMC4821295 DOI: 10.12688/f1000research.7408.2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Accepted: 02/29/2016] [Indexed: 11/22/2022] Open
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers, with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique make it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that the information content of a motif is not, in isolation, a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
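For readers unfamiliar with the information content mentioned in this abstract, the following Python sketch shows one common way to compute it for a position probability matrix. It assumes a uniform background and omits any small-sample correction; the function names and the toy matrix are hypothetical and are not taken from the cited study.

```python
import math


def column_ic(column):
    """Information content (bits) of one motif position, uniform background."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a four-letter alphabet


def motif_ic(ppm):
    """Total information content: the sum over all positions."""
    return sum(column_ic(col) for col in ppm)


# Hypothetical two-position motif: one fixed base, one uninformative position.
toy_ppm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(motif_ic(toy_ppm))  # 2.0 bits, all contributed by the first position
```

A high total only says that the matrix is sharply defined, not that it predicts binding well, which is in line with the abstract's caution against using information content alone as a quality measure.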
Affiliation(s)
- Caleb Kipkurui Kibet
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
| | - Philip Machanick
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
| |
|
23
|
Nettling M, Treutler H, Grau J, Keilwagen J, Posch S, Grosse I. DiffLogo: a comparative visualization of sequence motifs. BMC Bioinformatics 2015; 16:387. [PMID: 26577052 PMCID: PMC4650857 DOI: 10.1186/s12859-015-0767-x] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 04/10/2015] [Accepted: 10/08/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain. RESULTS: Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner, resulting in publication-ready figures. In the case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for comparison of DNA motifs, and on F-box domains from three different families as an example for comparison of protein motifs. CONCLUSIONS: DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Hendrik Treutler
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany.
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.
| | - Stefan Posch
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
|
24
|
Eggeling R, Roos T, Myllymäki P, Grosse I. Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data. BMC Bioinformatics 2015; 16:375. [PMID: 26552868 PMCID: PMC4640111 DOI: 10.1186/s12859-015-0797-4] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 06/16/2015] [Accepted: 10/23/2015] [Indexed: 11/29/2022] Open
Abstract
Background: Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, has been the standard model for this task for more than three decades, but its simple assumptions are increasingly called into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery. Results: To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. Conclusions: The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0797-4) contains supplementary material, which is available to authorized users.
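The independence assumption that this abstract attributes to the PWM model can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names, the pseudocount, the uniform background and the toy sites are all assumptions made for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites (hypothetical helper)."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # Each column is estimated without looking at any other column.
        pwm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pwm

def score(pwm, seq):
    """Sum of per-position log-odds; the plain sum encodes positional independence."""
    return sum(column[base] for column, base in zip(pwm, seq))

# Toy aligned sites, invented for this example only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
pwm = build_pwm(sites)
best = score(pwm, "TGACTCA")
worst = score(pwm, "ACTGAGT")
```

Because the score is a plain sum over columns, a PWM cannot reward or penalize specific combinations of nucleotides at different positions, which is the limitation that higher-order models such as those in the abstract aim to address.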
Affiliation(s)
- Ralf Eggeling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany; Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland.
| | - Teemu Roos
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland.
| | - Petri Myllymäki
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland.
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany; German Center for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
|
25
|
Keilwagen J, Grau J. Varying levels of complexity in transcription factor binding motifs. Nucleic Acids Res 2015; 43:e119. [PMID: 26116565 PMCID: PMC4605289 DOI: 10.1093/nar/gkv577] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 02/10/2015] [Revised: 05/11/2015] [Accepted: 05/21/2015] [Indexed: 11/17/2022] Open
Abstract
Binding of transcription factors to DNA is one of the keystones of gene regulation. The existence of statistical dependencies between binding site positions is widely accepted, while their relevance for computational predictions has been debated. Building probabilistic models of binding sites that may capture dependencies is still challenging, since the most successful motif discovery approaches require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that combine putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously. We find that Slim models yield a substantially better prediction performance than previous models on genomic context protein binding microarray data sets and on ChIP-seq data sets. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The observed dependency structures range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions.
Affiliation(s)
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, D-06484 Quedlinburg, Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle (Saale), Germany
|
26
|
Karnik R, Beer MA. Identification of Predictive Cis-Regulatory Elements Using a Discriminative Objective Function and a Dynamic Search Space. PLoS One 2015; 10:e0140557. [PMID: 26465884 PMCID: PMC4605740 DOI: 10.1371/journal.pone.0140557] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/09/2015] [Accepted: 09/28/2015] [Indexed: 01/06/2023] Open
Abstract
The generation of genomic binding or accessibility data from massively parallel sequencing technologies such as ChIP-seq and DNase-seq continues to accelerate. Yet state-of-the-art computational approaches for the identification of DNA binding motifs often yield motifs of weak predictive power. Here we present a novel computational algorithm called MotifSpec, designed to find predictive motifs, in contrast to over-represented sequence elements. The key distinguishing feature of this algorithm is that it uses a dynamic search space and a learned threshold to find discriminative motifs in combination with the modeling of motifs using a full PWM (position weight matrix) rather than k-mer words or regular expressions. We demonstrate that our approach finds motifs corresponding to known binding specificities in several mammalian ChIP-seq datasets, and that our PWMs classify the ChIP-seq signals with accuracy comparable to, or marginally better than motifs from the best existing algorithms. In other datasets, our algorithm identifies novel motifs where other methods fail. Finally, we apply this algorithm to detect motifs from expression datasets in C. elegans using a dynamic expression similarity metric rather than fixed expression clusters, and find novel predictive motifs.
Affiliation(s)
- Rahul Karnik
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America
| | - Michael A. Beer
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, United States of America
|
27
|
Abstract
To fully understand the regulation of gene expression, it is critical to quantitatively define whether and how RNA-binding proteins (RBPs) discriminate between alternative binding sites in RNAs. Here, we describe new methods that measure protein binding to large numbers of RNA variants, and ways to analyse and interpret data obtained by these approaches, including affinity distributions and free energy landscapes. We discuss how the new methodologies and the associated concepts enable the development of inclusive, quantitative models for RNA-protein interactions that transcend the traditional binary classification of RBPs as either specific or nonspecific.
|
28
|
Hennig J, Sattler M. Deciphering the protein-RNA recognition code: Combining large-scale quantitative methods with structural biology. Bioessays 2015; 37:899-908. [DOI: 10.1002/bies.201500033] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 12/18/2022]
Affiliation(s)
- Janosch Hennig
- Institute of Structural Biology; Helmholtz Zentrum München; München Germany
- Department Chemie; Center for Integrated Protein Science Munich at Biomolecular NMR Spectroscopy; Technische Universität München; Garching Germany
| | - Michael Sattler
- Institute of Structural Biology; Helmholtz Zentrum München; München Germany
- Department Chemie; Center for Integrated Protein Science Munich at Biomolecular NMR Spectroscopy; Technische Universität München; Garching Germany
|
29
|
Grau J, Grosse I, Keilwagen J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics 2015; 31:2595-7. [PMID: 25810428 PMCID: PMC4514923 DOI: 10.1093/bioinformatics/btv153] [Citation(s) in RCA: 194] [Impact Index Per Article: 21.6] [Received: 01/20/2015] [Accepted: 03/11/2015] [Indexed: 11/13/2022] Open
Abstract
Precision-recall (PR) and receiver operating characteristic (ROC) curves are valuable measures of classifier performance. Here, we present the R-package PRROC, which allows for computing and visualizing both PR and ROC curves. In contrast to available R-packages, PRROC allows for computing PR and ROC curves and areas under these curves for soft-labeled data using a continuous interpolation between the points of PR curves. In addition, PRROC provides a generic plot function for generating publication-quality graphics of PR and ROC curves.
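As a rough illustration of the quantity that an area under the ROC curve summarizes, the following Python sketch computes it for hard-labeled data as a pairwise ranking statistic. It does not use or reproduce the PRROC R package; the function name and the toy scores are assumptions for this example.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve as a pairwise ranking statistic.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy classifier scores and hard labels, invented for illustration.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The quadratic loop over all positive-negative pairs keeps the sketch short; for large data sets a sort-based computation would be the usual choice.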
Affiliation(s)
- Jan Grau
- Institute of Computer Science and Universitätszentrum Informatik, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| | - Ivo Grosse
- Institute of Computer Science and Universitätszentrum Informatik, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
|
30
|
Agostini F, Cirillo D, Ponti RD, Tartaglia GG. SeAMotE: a method for high-throughput motif discovery in nucleic acid sequences. BMC Genomics 2014; 15:925. [PMID: 25341390 PMCID: PMC4223730 DOI: 10.1186/1471-2164-15-925] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 05/23/2014] [Accepted: 10/16/2014] [Indexed: 01/06/2023] Open
Abstract
Background: The large amount of data produced by high-throughput sequencing poses new computational challenges. In the last decade, several tools have been developed for the identification of transcription and splicing factor binding sites. Results: Here, we introduce the SeAMotE (Sequence Analysis of Motifs Enrichment) algorithm for discovery of regulatory regions in nucleic acid sequences. SeAMotE provides (i) a robust analysis of high-throughput sequence sets, (ii) a motif search based on pattern occurrences and (iii) an easy-to-use web-server interface. We applied our method to recently published data including 351 chromatin immunoprecipitation (ChIP) and 13 crosslinking immunoprecipitation (CLIP) experiments and compared our results with those of other well-established motif discovery tools. SeAMotE shows an average accuracy of 80% in finding discriminative motifs and outperforms other methods available in literature. Conclusions: SeAMotE is a fast, accurate and flexible algorithm for the identification of sequence patterns involved in protein-DNA and protein-RNA recognition. The server can be freely accessed at http://s.tartaglialab.com/new_submission/seamote. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-925) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Gian Gaetano Tartaglia
- Gene Function and Evolution, Centre for Genomic Regulation (CRG), C/ Dr. Aiguader 88, 08003 Barcelona, Spain.
|
31
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 332] [Impact Index Per Article: 33.2] [Key Words] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA.
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Ana Carolina Dantas Machado
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Raluca Gordân
- Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA.
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
|
32
|
Keilwagen J, Grosse I, Grau J. Area under precision-recall curves for weighted and unweighted data. PLoS One 2014; 9:e92209. [PMID: 24651729 PMCID: PMC3961324 DOI: 10.1371/journal.pone.0092209] [Citation(s) in RCA: 76] [Impact Index Per Article: 7.6] [Received: 10/09/2013] [Accepted: 02/20/2014] [Indexed: 12/25/2022] Open
Abstract
Precision-recall curves are highly informative about the performance of binary classifiers, and the area under these curves is a popular scalar performance measure for comparing different classifiers. However, for many applications class labels are not provided with absolute certainty, but with some degree of confidence, often reflected by weights or soft labels assigned to data points. Computing the area under the precision-recall curve requires interpolating between adjacent supporting points, but previous interpolation schemes are not directly applicable to weighted data. Hence, even in cases where weights were available, they had to be neglected for assessing classifiers using precision-recall curves. Here, we propose an interpolation for precision-recall curves that can also be used for weighted data, and we derive conditions for classification scores yielding the maximum and minimum area under the precision-recall curve. We investigate accordances and differences of the proposed interpolation and previous ones, and we demonstrate that taking into account existing weights of test data is important for the comparison of classifiers.
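To make the idea of weighted test data concrete, the Python sketch below computes the supporting points of a precision-recall curve when each item carries a soft label. It assumes one plausible convention, namely that a weight in [0, 1] counts as that fraction of a positive and the remainder as a negative; it does not implement the interpolation proposed in the article, and all names and data are hypothetical.

```python
def weighted_pr_points(scores, weights):
    """Return (recall, precision) pairs, one per distinct score threshold.

    weights[i] in [0, 1] is the assumed confidence that item i is positive;
    weights of exactly 0 or 1 reduce to ordinary hard labels.
    """
    total_pos = sum(weights)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [], 0.0, 0.0
    for rank, i in enumerate(order):
        tp += weights[i]          # fractional true positive
        fp += 1.0 - weights[i]    # fractional false positive
        # Emit a point only after the last item in a group of tied scores.
        is_last = rank + 1 == len(order)
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((tp / total_pos, tp / (tp + fp)))
    return points

# Toy scores and soft labels, invented for illustration.
points = weighted_pr_points([0.9, 0.7, 0.4], [1.0, 0.5, 0.0])
```

Computing an area from these points still requires choosing how to interpolate between them, which is the question the article addresses.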
Affiliation(s)
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) – Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle (Saale), Germany
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle (Saale), Germany
|
33
|
Application of experimentally verified transcription factor binding sites models for computational analysis of ChIP-Seq data. BMC Genomics 2014; 15:80. [PMID: 24472686 PMCID: PMC4234207 DOI: 10.1186/1471-2164-15-80] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 06/07/2013] [Accepted: 01/25/2014] [Indexed: 02/07/2023] Open
Abstract
Background: ChIP-Seq is widely used to detect genomic segments bound by transcription factors (TFs), either directly at DNA binding sites (BSs) or indirectly via other proteins. Currently, there are many software tools implementing different approaches to identify TFBSs within ChIP-Seq peaks. However, their use for the interpretation of ChIP-Seq data is usually complicated by the absence of direct experimental verification, making it difficult both to set a threshold to avoid recognition of too many false-positive BSs, and to compare the actual performance of different models. Results: Using ChIP-Seq data for FoxA2 binding loci in mouse adult liver and human HepG2 cells, we compared FoxA binding-site predictions for four computational models of two fundamental classes: pattern matching based on an existing training set of experimentally confirmed TFBSs (oPWM and SiteGA) and de novo motif discovery (ChIPMunk and diChIPMunk). To properly select prediction thresholds for the models, we experimentally evaluated the affinity of 64 predicted FoxA BSs using EMSA, which allows sequences able to bind the TF to be distinguished reliably. As a result, we identified thousands of reliable FoxA BSs within ChIP-Seq loci from mouse liver and human HepG2 cells. The performance of conventional position weight matrix (PWM) models was found to be inferior, with the highest false positive rate. In contrast, the best recognition efficiency was achieved by the combination of SiteGA and diChIPMunk/ChIPMunk models, properly identifying FoxA BSs in up to 90% of loci for both mouse and human ChIP-Seq datasets. Conclusions: The experimental study of TF binding to oligonucleotides corresponding to predicted sites increases the reliability of computational methods for TFBS recognition in ChIP-Seq data analysis. Regarding ChIP-Seq data interpretation, basic PWMs have inferior TFBS recognition quality compared to the more sophisticated SiteGA and de novo motif discovery methods. A combination of models based on different principles allowed identification of proper TFBSs. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-80) contains supplementary material, which is available to authorized users.
|
34
|
On the use of knowledge-based potentials for the evaluation of models of protein-protein, protein-DNA, and protein-RNA interactions. ADVANCES IN PROTEIN CHEMISTRY AND STRUCTURAL BIOLOGY 2014; 94:77-120. [PMID: 24629186 DOI: 10.1016/b978-0-12-800168-4.00004-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Key Words] [Indexed: 12/11/2022]
Abstract
Proteins are the bricks and mortar of cells, playing structural and functional roles. In order to perform their function, they interact with each other as well as with other biomolecules such as DNA or RNA. Therefore, to fathom the function of a protein, we need to know its partners and the atomic details of its interactions (i.e., the structure of the complex). However, the number of protein interactions with an experimentally determined three-dimensional structure is small. Therefore, computational techniques such as homology modeling are essential for filling this gap. Protein interactions can be modeled using as templates the interactions of homologous proteins, if the structure of the complex is known, or using docking methods. In both approaches, the estimation of the quality of models is essential. There are several ways to address this problem. In this review, we focus on the use of knowledge-based potentials for the analysis of protein interactions. We describe the procedure to derive statistical potentials and split them into different energetic terms that can be used for different purposes. We extensively discuss the fields where knowledge-based potentials have been successfully applied to (1) model protein-protein, protein-DNA, and protein-RNA interactions and (2) predict binding sites (in the protein and in the DNA). Moreover, we provide ready-to-use resources for docking and benchmarking protein interactions.
|
35
|
Zhong S, He X, Bar-Joseph Z. Predicting tissue specific transcription factor binding sites. BMC Genomics 2013; 14:796. [PMID: 24238150 PMCID: PMC3898213 DOI: 10.1186/1471-2164-14-796] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 10/30/2013] [Accepted: 11/06/2013] [Indexed: 12/13/2022] Open
Abstract
Background: Studies of gene regulation often utilize genome-wide predictions of transcription factor (TF) binding sites. Most existing prediction methods are based on sequence information alone, ignoring biological contexts such as developmental stages and tissue types. Experimental methods to study in vivo binding, including ChIP-chip and ChIP-seq, can only study one transcription factor in a single cell type and under a specific condition in each experiment, and therefore cannot scale to determine the full set of regulatory interactions in mammalian transcriptional regulatory networks. Results: We developed a new computational approach, PIPES, for predicting tissue-specific TF binding. PIPES integrates in vitro protein binding microarrays (PBMs), sequence conservation and tissue-specific epigenetic (DNase I hypersensitivity) information. We demonstrate that PIPES improves over existing methods on distinguishing between in vivo bound and unbound sequences using ChIP-seq data for 11 mouse TFs. In addition, our predictions are in good agreement with current knowledge of tissue-specific TF regulation. Conclusions: We provide a systematic map of computationally predicted tissue-specific binding targets for 284 mouse TFs across 55 tissue/cell types. Such a comprehensive resource is useful for researchers studying gene regulation.
Affiliation(s)
- Ziv Bar-Joseph
- Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA.
|