1
Dresch JM, Conrad RD, Klonaros D, Drewell RA. Investigating the sequence landscape in the Drosophila initiator core promoter element using an enhanced MARZ algorithm. PeerJ 2023; 11:e15597. [PMID: 37366427 PMCID: PMC10290830 DOI: 10.7717/peerj.15597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 05/29/2023] [Indexed: 06/28/2023] Open
Abstract
The core promoter elements are important DNA sequences for the regulation of RNA polymerase II transcription in eukaryotic cells. Despite the broad evolutionary conservation of these elements, there is extensive variation in the nucleotide composition of the actual sequences. In this study, we aim to improve our understanding of the complexity of this sequence variation in the TATA box and initiator core promoter elements in Drosophila melanogaster. Using computational approaches, including an enhanced version of our previously developed MARZ algorithm that utilizes gapped nucleotide matrices, several sequence landscape features are uncovered, including an interdependency between the nucleotides in positions 2 and 5 in the initiator. Incorporating this information in an expanded MARZ algorithm improves predictive performance for the identification of the initiator element. Overall, our results demonstrate the need to carefully consider detailed sequence composition features in core promoter elements in order to make more robust and accurate bioinformatic predictions.
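An interdependency between two motif positions, such as the one reported between two initiator positions above, can be illustrated with a generic statistic. The following is a minimal sketch that uses mutual information between two alignment columns; this is a swapped-in, standard measure and not the MARZ method itself. The function name `mutual_information` and the six toy sites are hypothetical and are not data from the study.

```python
import math
from collections import Counter


def mutual_information(seqs, i, j):
    """Mutual information (in bits) between 0-based columns i and j
    of a set of equal-length, aligned sequences."""
    n = len(seqs)
    col_i = Counter(s[i] for s in seqs)
    col_j = Counter(s[j] for s in seqs)
    pairs = Counter((s[i], s[j]) for s in seqs)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Made-up aligned 6-mers (illustrative only): columns 1 and 4
# (positions 2 and 5 when counting from 1) always co-vary.
sites = ["TCAGTT", "TCAGTC", "TTAGAT", "TTAGAC", "TCATTT", "TTACAC"]

coupled = mutual_information(sites, 1, 4)      # co-varying columns
uncoupled = mutual_information(sites, 0, 4)    # column 0 is invariant
print(f"MI(2,5) = {coupled:.3f} bits, MI(1,5) = {uncoupled:.3f} bits")
```

A position-independent matrix cannot represent a nonzero value of this statistic, which is the usual motivation for gapped or dinucleotide extensions.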
2
Hansen JL, Cohen BA. A quantitative metric of pioneer activity reveals that HNF4A has stronger in vivo pioneer activity than FOXA1. Genome Biol 2022; 23:221. [PMID: 36253868 PMCID: PMC9575205 DOI: 10.1186/s13059-022-02792-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/02/2022] [Accepted: 10/11/2022] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: We and others have suggested that pioneer activity - a transcription factor's (TF's) ability to bind and open inaccessible loci - is not a qualitative trait limited to a select class of pioneer TFs. We hypothesize that most TFs display pioneering activity that depends on the TF concentration and the motif content at their target loci. RESULTS: Here, we present a quantitative in vivo measure of pioneer activity that captures the relative difference in a TF's ability to bind accessible versus inaccessible DNA. The metric is based on experiments that use CUT&Tag to measure the binding of doxycycline-inducible TFs. For each location across the genome, we determine the concentration of doxycycline required for a TF to reach half-maximal occupancy; lower concentrations reflect higher affinity. We propose that the relative difference in a TF's affinity between ATAC-seq labeled accessible and inaccessible binding sites is a measure of its pioneer activity. We estimate binding affinities at tens of thousands of genomic loci for the endodermal TFs FOXA1 and HNF4A and show that HNF4A has stronger pioneer activity than FOXA1. We show that both FOXA1 and HNF4A display higher binding affinity at inaccessible sites with more copies of their respective motifs. The quantitative analysis of binding suggests different modes of binding for FOXA1, including an anti-cooperative mode of binding at certain accessible loci. CONCLUSIONS: Our results suggest that relative binding affinities are reasonable measures of pioneer activity and support the model wherein most TFs have some degree of context-dependent pioneer activity.
Affiliation(s)
- Jeffrey L. Hansen
  - The Edison Family Center for Genome Sciences and Systems Biology, School of Medicine, Washington University in St. Louis, Saint Louis, MO, USA
  - Department of Genetics, School of Medicine, Washington University in St. Louis, Saint Louis, MO, USA
  - Medical Scientist Training Program, Washington University in St. Louis, St. Louis, MO, USA
- Barak A. Cohen
  - The Edison Family Center for Genome Sciences and Systems Biology, School of Medicine, Washington University in St. Louis, Saint Louis, MO, USA
  - Department of Genetics, School of Medicine, Washington University in St. Louis, Saint Louis, MO, USA
3
Steinhaus R, Robinson PN, Seelow D. FABIAN-variant: predicting the effects of DNA variants on transcription factor binding. Nucleic Acids Res 2022; 50:W322-W329. [PMID: 35639768 PMCID: PMC9252790 DOI: 10.1093/nar/gkac393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 04/22/2022] [Accepted: 05/06/2022] [Indexed: 12/03/2022] Open
Abstract
While great advances in predicting the effects of coding variants have been made, the assessment of non-coding variants remains challenging. This is especially problematic for variants within promoter regions which can lead to over-expression of a gene or reduce or even abolish its expression. The binding of transcription factors to the DNA can be predicted using position weight matrices (PWMs). More recently, transcription factor flexible models (TFFMs) have been introduced and shown to be more accurate than PWMs. TFFMs are based on hidden Markov models and can account for complex positional dependencies. Our new web-based application FABIAN-variant uses 1224 TFFMs and 3790 PWMs to predict whether and to which degree DNA variants affect the binding of 1387 different human transcription factors. For each variant and transcription factor, the software combines the results of different models for a final prediction of the resulting binding-affinity change. The software is written in C++ for speed but variants can be entered through a web interface. Alternatively, a VCF file can be uploaded to assess variants identified by high-throughput sequencing. The search can be restricted to variants in the vicinity of candidate genes. FABIAN-variant is available freely at https://www.genecascade.org/fabian/.
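The idea of scoring a variant's effect on binding with a position weight matrix can be sketched as follows. This is a minimal illustration of plain log-odds PWM scoring only; it does not reproduce FABIAN-variant's combination of TFFMs and PWMs. The motif counts, the uniform background, the pseudocount, and the names `pfm_to_pwm` and `score` are all assumptions made for the example.

```python
import math

BACKGROUND = 0.25  # assumed uniform base composition


def pfm_to_pwm(pfm, pseudocount=1.0):
    """Convert per-position base counts into log2-odds weights."""
    pwm = []
    for counts in pfm:
        total = sum(counts.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((counts.get(base, 0) + pseudocount) / total) / BACKGROUND)
            for base in "ACGT"
        })
    return pwm


def score(pwm, site):
    """Sum of the position-specific weights for a site of motif length."""
    return sum(column[base] for column, base in zip(pwm, site))


# Hypothetical 4-bp motif counts (not taken from any database).
pfm = [
    {"A": 18, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 19, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 19, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 19},
]
pwm = pfm_to_pwm(pfm)

ref, alt = "ACGT", "ATGT"  # a C>T change at the second motif position
delta = score(pwm, alt) - score(pwm, ref)
print(f"ref={score(pwm, ref):.2f} alt={score(pwm, alt):.2f} delta={delta:.2f}")
```

A negative `delta` suggests weaker predicted binding for the alternative allele. Because each position is scored independently, this sketch cannot capture the positional dependencies that TFFMs are designed to model.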
Affiliation(s)
- Robin Steinhaus
  - Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06030, USA
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06030, USA
- Dominik Seelow
  - Exploratory Diagnostic Sciences, Berlin Institute of Health, 10117 Berlin, Germany
  - Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353 Berlin, Germany
4
Ge W, Meier M, Roth C, Söding J. Bayesian Markov models improve the prediction of binding motifs beyond first order. NAR Genom Bioinform 2021; 3:lqab026. [PMID: 33928244 PMCID: PMC8057495 DOI: 10.1093/nargab/lqab026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2020] [Revised: 03/11/2021] [Accepted: 03/30/2021] [Indexed: 12/13/2022] Open
Abstract
Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Accurate models for predicting binding affinities are crucial for a quantitative understanding of transcriptional regulation. Motifs are commonly described by position weight matrices, which assume that each position contributes independently to the binding energy. Models that can learn dependencies between positions, for instance, induced by DNA structure preferences, have yielded markedly improved predictions for most TFs on in vivo data. However, they are more prone to overfit the data and to learn patterns merely correlated with rather than directly involved in TF binding. We present an improved, faster version of our Bayesian Markov model software, BaMMmotif2. We tested it with state-of-the-art motif discovery tools on a large collection of ChIP-seq and HT-SELEX datasets. BaMMmotif2 models of fifth-order achieved a median false-discovery-rate-averaged recall 13.6% and 12.2% higher than the next best tool on 427 ChIP-seq datasets and 164 HT-SELEX datasets, respectively, while being 8 to 1000 times faster. BaMMmotif2 models showed no signs of overtraining in cross-cell line and cross-platform tests, with similar improvements on the next-best tool. These results demonstrate that dependencies beyond first order clearly improve binding models for most TFs.
Affiliation(s)
- Wanwan Ge
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Markus Meier
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Christian Roth
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
- Johannes Söding
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
5
Blum CF, Kollmann M. Neural networks with circular filters enable data efficient inference of sequence motifs. Bioinformatics 2020; 35:3937-3943. [PMID: 30918943 PMCID: PMC6792110 DOI: 10.1093/bioinformatics/btz194] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/13/2018] [Revised: 11/15/2018] [Accepted: 03/26/2019] [Indexed: 11/13/2022] Open
Abstract
Motivation: Nucleic acids and proteins often have localized sequence motifs that enable highly specific interactions. Due to the biological relevance of sequence motifs, numerous inference methods have been developed. Recently, convolutional neural networks (CNNs) have achieved state-of-the-art performance. These methods were able to learn transcription factor binding sites from ChIP-seq data, resulting in accurate predictions on test data. However, CNNs typically distribute learned motifs across multiple filters, making them difficult to interpret. Furthermore, networks trained on small datasets often do not generalize well to new sequences. Results: Here we present circular filters, a novel convolutional architecture that convolves sequences with circularly permutated variants of the same filter. We motivate circular filters by the observation that CNNs frequently learn filters that correspond to shifted and truncated variants of the true motif. Circular filters enable learning of full-length motifs and allow easy interpretation of the learned filters. We show that circular filters improve motif inference performance over a wide range of hyperparameters as well as sequence length. Furthermore, we show that CNNs with circular filters in most cases outperform conventional CNNs at inferring DNA binding sites from ChIP-seq data. Availability and implementation: Code is available at https://github.com/christopherblum. Supplementary information: Supplementary data are available at Bioinformatics online.
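The core operation described in this abstract, scanning a sequence with every circular shift of one filter, can be sketched in plain Python. This is an illustrative toy under stated assumptions: a one-hot encoded DNA sequence, a filter stored as a list of per-position weight rows, and one response track per shift. It is not the authors' implementation and omits training and pooling; all function names are hypothetical.

```python
def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    return [[1.0 if index[base] == k else 0.0 for k in range(4)] for base in seq]


def circular_variants(filt):
    """All circular shifts of a filter given as per-position weight rows."""
    return [filt[shift:] + filt[:shift] for shift in range(len(filt))]


def convolve(x, filt):
    """Valid cross-correlation of a one-hot sequence with one filter."""
    width = len(filt)
    return [
        sum(x[pos + i][k] * filt[i][k] for i in range(width) for k in range(4))
        for pos in range(len(x) - width + 1)
    ]


def circular_response(seq, filt):
    """One response track per circular shift of the same filter."""
    x = one_hot(seq)
    return [convolve(x, variant) for variant in circular_variants(filt)]


# Toy filter that matches "ACG"; its shifts match "CGA" and "GAC".
toy_filter = one_hot("ACG")
responses = circular_response("TTCGAT", toy_filter)
for shift, track in enumerate(responses):
    print(shift, track)
```

In this toy sequence the unshifted filter finds only a partial match, while the variant shifted by one position matches "CGA" completely, which mirrors the shifted and truncated motif variants that the abstract says conventional filters tend to learn.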
Affiliation(s)
- Christopher F Blum
  - Institute for Mathematical Modeling of Biological Systems, Heinrich-Heine University of Düsseldorf, Düsseldorf, Germany
- Markus Kollmann
  - Institute for Mathematical Modeling of Biological Systems, Heinrich-Heine University of Düsseldorf, Düsseldorf, Germany
6
Zhou J, Lu Q, Xu R, Gui L, Wang H. Prediction of TF-Binding Site by Inclusion of Higher Order Position Dependencies. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1383-1393. [PMID: 30629513 DOI: 10.1109/tcbb.2019.2892124] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Most proposed methods for TF-binding site (TFBS) predictions only use low order dependencies for predictions due to the lack of efficient methods to extract higher order dependencies. In this work, we first propose a novel method to extract higher order dependencies by applying CNN on histone modification features. We then propose a novel TFBS prediction method, referred to as CNN_TF, by incorporating low order and higher order dependencies. CNN_TF is first evaluated on 13 TFs in mES cells. Results show that using higher order dependencies outperforms low order dependencies significantly on 11 TFs. This indicates that higher order dependencies are indeed more effective for TFBS predictions than low order dependencies. Further experiments show that using both low order dependencies and higher order dependencies improves performance significantly on 12 TFs, indicating the two dependency types are complementary. To evaluate the influence of cell types on prediction performance, CNN_TF was applied to five TFs in five human cell types. Even though low order dependencies and higher order dependencies show different contributions in different cell types, they are always complementary in predictions. Compared with several state-of-the-art methods, CNN_TF outperforms them by at least 5.3 percent in AUPR.
7
Zhou J, Lu Q, Gui L, Xu R, Long Y, Wang H. MTTFsite: cross-cell type TF binding site prediction by using multi-task learning. Bioinformatics 2020; 35:5067-5077. [PMID: 31161194 PMCID: PMC6954652 DOI: 10.1093/bioinformatics/btz451] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/24/2019] [Revised: 05/19/2019] [Accepted: 05/30/2019] [Indexed: 12/30/2022] Open
Abstract
Motivation: The prediction of transcription factor binding sites (TFBSs) is crucial for gene expression analysis. Supervised learning approaches for TFBS predictions require large amounts of labeled data. However, many TFs of certain cell types either do not have sufficient labeled data or do not have any labeled data. Results: In this paper, a multi-task learning framework (called MTTFsite) is proposed to address the lack of labeled data by leveraging labeled data available across cell types. The proposed MTTFsite contains a shared CNN to learn common features for all cell types and a private CNN for each cell type to learn private features. The common features are intended to help predict TFBSs for all cell types, especially those that lack labeled data. MTTFsite is evaluated on 241 cell type-TF pairs and compared with a baseline method that does not use any multi-task learning model and a fully shared multi-task model that uses only a shared CNN and does not use private CNNs. For cell types with insufficient labeled data, results show that MTTFsite performs better than the baseline method and the fully shared model on more than 89% of pairs. For cell types without any labeled data, MTTFsite outperforms the baseline method and the fully shared model on more than 80% and 93% of pairs, respectively. A novel gene expression prediction method (called TFChrome) using both MTTFsite and histone modification features is also presented. Results show that TFBSs predicted by MTTFsite alone can achieve good performance. When MTTFsite is combined with histone modification features, a significant 5.7% performance improvement is obtained. Availability and implementation: The resource and executable code are freely available at http://hlt.hitsz.edu.cn/MTTFsite/ and http://www.hitsz-hlt.com:8080/MTTFsite/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiyun Zhou
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
  - Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
- Qin Lu
  - Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
- Lin Gui
  - Department of Computer Science, University of Warwick, Coventry CV4 4AL, UK
- Ruifeng Xu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
- Yunfei Long
  - Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
- Hongpeng Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
8
Toivonen J, Das PK, Taipale J, Ukkonen E. MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. Bioinformatics 2020; 36:2690-2696. [PMID: 31999322 PMCID: PMC7203737 DOI: 10.1093/bioinformatics/btaa045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/21/2019] [Revised: 12/23/2019] [Accepted: 01/23/2020] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. RESULTS: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. AVAILABILITY AND IMPLEMENTATION: Software implementation is available from https://github.com/jttoivon/moder2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
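The difference between a PPM and a first-order (adjacent dinucleotide) model can be seen on a toy example. The sketch below fits both models by simple counting with pseudocounts and compares their log-likelihoods; it assumes pre-aligned sites and leaves out the mixture model, dimer handling, and expectation maximization that MODER2 itself uses. The site list and all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_ppm(sites, pc=0.5):
    """Position-specific base probabilities, smoothed by a pseudocount."""
    total = len(sites) + 4 * pc
    ppm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        ppm.append({b: (counts[b] + pc) / total for b in ALPHABET})
    return ppm


def fit_adm(sites, pc=0.5):
    """Position-specific conditionals P(x_i | x_{i-1}) for i >= 1."""
    adm = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prev = Counter(s[i - 1] for s in sites)
        adm.append({
            (a, b): (pairs[(a, b)] + pc) / (prev[a] + 4 * pc)
            for a in ALPHABET for b in ALPHABET
        })
    return adm


def loglik_ppm(ppm, site):
    return sum(math.log2(ppm[i][b]) for i, b in enumerate(site))


def loglik_adm(ppm, adm, site):
    """First base from the PPM, each later base conditioned on its neighbour."""
    ll = math.log2(ppm[0][site[0]])
    for i in range(1, len(site)):
        ll += math.log2(adm[i - 1][(site[i - 1], site[i])])
    return ll


# Made-up sites in which the two middle bases co-vary (CG or TA).
sites = ["ACGT", "ACGT", "ACGT", "ATAT", "ATAT", "ATAT"]
ppm, adm = fit_ppm(sites), fit_adm(sites)

for site in ("ACGT", "ACAT"):  # observed site vs. unobserved chimera
    print(site, round(loglik_ppm(ppm, site), 2), round(loglik_adm(ppm, adm, site), 2))
```

The PPM scores the chimera "ACAT" exactly like the observed "ACGT" because each column is treated independently, whereas the adjacent-dinucleotide model penalizes the unseen "CA" step.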
Affiliation(s)
- Jarkko Toivonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
- Pratyush K Das
  - Applied Tumor Genomics, Research Programs Unit, University of Helsinki, Helsinki FI-00014, Finland
- Jussi Taipale
  - Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, SE 141 83 Stockholm, Sweden
  - Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
  - Genome-Scale Biology Program, University of Helsinki, Helsinki FI-00014, Finland
- Esko Ukkonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
9
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res 2019; 46:D260-D266. [PMID: 29140473 PMCID: PMC5753243 DOI: 10.1093/nar/gkx1126] [Citation(s) in RCA: 841] [Impact Index Per Article: 168.2] [Received: 09/25/2017] [Accepted: 10/27/2017] [Indexed: 12/31/2022] Open
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups. In the 2018 release of JASPAR, the CORE collection has been expanded with 322 new PFMs (60 for vertebrates and 262 for plants) and 33 PFMs were updated (24 for vertebrates, 8 for plants and 1 for insects). These new profiles represent a 30% expansion compared to the 2016 release. In addition, we have introduced 316 TFFMs (95 for vertebrates, 218 for plants and 3 for insects). This release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. The JASPAR 2018 CORE vertebrate collection of PFMs was used to predict TF-binding sites in the human genome. The predictions are made available to the scientific community through a UCSC Genome Browser track data hub. Finally, this update comes with a new web framework with an interactive and responsive user-interface, along with new features. All the underlying data can be retrieved programmatically using a RESTful API and through the JASPAR 2018 R/Bioconductor package.
Affiliation(s)
- Aziz Khan
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Oriol Fornes
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Arnaud Stigliani
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Marius Gheorghe
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Jaime A Castro-Mondragon
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Robin van der Lee
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Adrien Bessy
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Jeanne Chèneby
  - INSERM, UMR1090 TAGC, Marseille, F-13288, France
  - Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
- Shubhada R Kulkarni
  - Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium
  - VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
- Ge Tan
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- Damir Baranasic
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- David J Arenillas
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- Albin Sandelin
  - The Bioinformatics Centre, Department of Biology and Biotech Research & Innovation Centre, University of Copenhagen, DK2200 Copenhagen N, Denmark
- Klaas Vandepoele
  - Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium
  - VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
- Boris Lenhard
  - Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
  - Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
  - Sars International Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
- Benoît Ballester
  - INSERM, UMR1090 TAGC, Marseille, F-13288, France
  - Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
- Wyeth W Wasserman
  - Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
- François Parcy
  - University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
- Anthony Mathelier
  - Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
  - Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0310 Oslo, Norway
10
Anderson AP, Jones AG. erefinder: Genome-wide detection of oestrogen response elements. Mol Ecol Resour 2019; 19:1366-1373. [PMID: 31177626 DOI: 10.1111/1755-0998.13046] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/25/2018] [Revised: 05/31/2019] [Accepted: 05/31/2019] [Indexed: 11/28/2022]
Abstract
Oestrogen response elements (EREs) are specific DNA sequences to which ligand-bound oestrogen receptors (ERs) physically bind, allowing them to act as transcription factors for target genes. Locating EREs and ER responsive regions is therefore a potentially important component of the study of oestrogen-regulated pathways. Here, we report the development of a novel software tool, erefinder, which conducts a genome-wide, sliding-window analysis of oestrogen receptor binding affinity. We demonstrate the effects of adjusting window size and highlight the program's general agreement with ChIP studies. We further provide two examples of how erefinder can be used for comparative approaches. erefinder can handle large input files, has settings to allow for broad and narrow searches, and provides the full output to allow for greater data manipulation. These features facilitate a wide range of hypothesis testing for researchers and make erefinder an excellent tool to aid in oestrogen-related research.
Affiliation(s)
- Andrew P Anderson
  - Department of Biology, Texas A&M University, College Station, TX, USA
- Adam G Jones
  - Department of Biological Sciences, University of Idaho, Moscow, ID, USA
11
Ozolinš TRS. Regulation and Control of AP-1 Binding Activity in Embryotoxicity. Methods Mol Biol 2019; 1965:375-388. [PMID: 31069687 DOI: 10.1007/978-1-4939-9182-2_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The electrophoretic mobility shift assay (EMSA) is a sensitive and relatively straightforward methodology used to detect sequence-specific DNA-protein interactions. It is the fundamental procedure of several variants that allow qualitative and quantitative assessments of protein-nucleic acid complexes. Classically, nuclear proteins and DNA are combined, and the resulting mixture is electrophoretically separated in polyacrylamide or agarose gel under native conditions. The distribution within the gel is generally detected with autoradiography of the 32P-labelled DNA. The underlying principle is that nucleic acid with protein bound to it will migrate more slowly through a gel matrix than the free nucleic acid. In this chapter, a representative protocol is described that addresses specific challenges of using whole embryos as the nuclear protein source, and the most common and informative EMSA variant, the "super-shift", is also presented. The important points are underscored, and approaches for troubleshooting are explained. References are provided for alternative methods and extensions of the basic protocol.
Affiliation(s)
- Terence R S Ozolinš
  - Department of Biomedical and Molecular Sciences, Queen's University, Kingston, ON, Canada
12
Käppel S, Melzer R, Rümpler F, Gafert C, Theißen G. The floral homeotic protein SEPALLATA3 recognizes target DNA sequences by shape readout involving a conserved arginine residue in the MADS-domain. The Plant Journal: for Cell and Molecular Biology 2018; 95:341-357. [PMID: 29744943 DOI: 10.1111/tpj.13954] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/05/2018] [Revised: 04/17/2018] [Accepted: 04/23/2018] [Indexed: 05/05/2023]
Abstract
SEPALLATA3 of Arabidopsis thaliana is a MADS-domain transcription factor (TF) and a key regulator of flower development. MADS-domain proteins bind to sequences termed 'CArG-boxes' [consensus 5'-CC(A/T)6 GG-3']. Because only a fraction of the CArG-boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG-boxes into genuine recognition sites. Here, we investigate to what extent the shape of the DNA is involved in a 'shape readout' that contributes to the binding of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to DNA probes that all contain the CArG-box motif, but differ in their predicted DNA shape. We found that binding affinity correlates well with a narrow minor groove of the DNA. Substitution of canonical bases with non-standard bases supports the hypothesis of minor groove shape readout by SEPALLATA3. Analysis of mutant SEPALLATA3 proteins further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis-regulatory elements by a plant MADS-domain TF, and by inference probably also of other TFs of this type, heavily depends on shape readout mechanisms.
Affiliation(s)
- Sandra Käppel
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Rainer Melzer
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Florian Rümpler
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Christian Gafert
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Günter Theißen
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
13
Guo Y, Tian K, Zeng H, Guo X, Gifford DK. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res 2018; 28:891-900. [PMID: 29654070 PMCID: PMC5991515 DOI: 10.1101/gr.226852.117] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 06/29/2017] [Accepted: 04/04/2018] [Indexed: 12/15/2022]
Abstract
The representation and discovery of transcription factor (TF) sequence binding specificities is critical for understanding gene regulatory networks and interpreting the impact of disease-associated noncoding genetic variants. We present a novel TF binding motif representation, the k-mer set memory (KSM), which consists of a set of aligned k-mers that are overrepresented at TF binding sites, and a new method called KMAC for de novo discovery of KSMs. We find that KSMs more accurately predict in vivo binding sites than position weight matrix (PWM) models and other more complex motif models across a large set of ChIP-seq experiments. Furthermore, KSMs outperform PWMs and more complex motif models in predicting in vitro binding sites. KMAC also identifies correct motifs in more experiments than five state-of-the-art motif discovery methods. In addition, KSM-derived features outperform both PWM and deep learning model derived sequence features in predicting differential regulatory activities of expression quantitative trait loci (eQTL) alleles. Finally, we have applied KMAC to 1600 ENCODE TF ChIP-seq data sets and created a public resource of KSM and PWM motifs. We expect that the KSM representation and KMAC method will be valuable in characterizing TF binding specificities and in interpreting the effects of noncoding genetic variations.
Affiliation(s)
- Yuchun Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Kevin Tian
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Xiaoyun Guo
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- David Kenneth Gifford
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
|
14
|
Chang YK, Zuo Z, Stormo GD. Quantitative profiling of BATF family proteins/JUNB/IRF hetero-trimers using Spec-seq. BMC Mol Biol 2018; 19:5. [PMID: 29587652 PMCID: PMC5869772 DOI: 10.1186/s12867-018-0106-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 08/29/2017] [Accepted: 03/19/2018] [Indexed: 01/13/2023] Open
Abstract
Background: BATF family transcription factors (BATF, BATF2 and BATF3) form hetero-trimers with JUNB and either IRF4 or IRF8 to regulate cell fate in T cells and dendritic cells in vivo. While each combination of the hetero-trimer has a distinct role, some degree of cross-compensation was observed. The basis for the differential actions of IRF4 and IRF8 with BATF factors and JUNB is still unknown. We propose that the differences in function between these hetero-trimers may be caused by differences in their DNA binding preferences. While all three BATF family transcription factors have similar binding preferences when binding as a hetero-dimer with JUNB, the cooperative binding of IRF4 or IRF8 to the hetero-dimer/DNA complex could change the preferences. We used Spec-seq, which allows for the efficient and accurate determination of relative affinity to a large collection of sequences in parallel, to find differences between cooperative DNA binding of IRF4, IRF8 and BATF family members.
Results: We found that without IRF binding, all three hetero-dimer pairs exhibit nearly the same binding preferences for both expected wildtype binding sites, TRE (TGA(C/G)TCA) and CRE (TGACGTCA). IRF4 and IRF8 show very similar DNA binding preferences when binding with any of the three hetero-dimers. No major change of binding preferences was found in the half-sites between different hetero-trimers. IRF proteins bind with substantially lower affinity with either a single nucleotide spacer between the IRF and BATF binding sites or with an alternative mode of binding in the opposite orientation. In addition, the preference for the CRE binding site was reduced upon binding of either IRF in all BATF–JUNB combinations.
Conclusions: The specificities of BATF, BATF2 and BATF3 are all very similar, as are their interactions with IRF4 and IRF8. Binding of IRF proteins adjacent to BATF sites increases affinity substantially compared to sequences with spacings between the sites, indicating cooperative binding through protein–protein interactions. The preference for the type of BATF binding site, TRE or CRE, is also altered when IRF proteins bind. These in vitro preferences aid in the understanding of in vivo binding activities.
Affiliation(s)
- Yiming K Chang
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Zheng Zuo
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
- Gary D Stormo
  - Department of Genetics and Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
|
15
|
Lima WR, Martins DC, Parreira KS, Scarpelli P, Santos de Moraes M, Topalis P, Hashimoto RF, Garcia CRS. Genome-wide analysis of the human malaria parasite Plasmodium falciparum transcription factor PfNF-YB shows interaction with a CCAAT motif. Oncotarget 2017; 8:113987-114001. [PMID: 29371963 PMCID: PMC5768380 DOI: 10.18632/oncotarget.23053] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 03/24/2017] [Accepted: 11/26/2017] [Indexed: 12/04/2022] Open
Abstract
Little is known about transcription factor regulation during the Plasmodium falciparum intraerythrocytic cycle. To elucidate the role of the P. falciparum (Pf)NF-YB transcription factor, we searched for target genes in the entire genome. PfNF-YB mRNA is highly expressed in late trophozoite and schizont stages relative to the ring stage. To determine the candidate genes bound by PfNF-YB, a ChIP-on-chip assay was carried out, and 297 genes were identified. Ninety-nine percent of PfNF-YB binding was to putative promoter regions of protein-coding genes, of which only 16% encode proteins of known function. Interestingly, our data reveal that PfNF-YB binding is not exclusive to a canonical CCAAT box motif. PfNF-YB binds to genes coding for proteins implicated in a range of different biological functions, such as replication protein A large subunit (DNA replication), hypoxanthine phosphoribosyltransferase (nucleic acid metabolism) and multidrug resistance protein 2 (intracellular transport).
Affiliation(s)
- Wânia Rezende Lima
  - Departamento de Fisiologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
  - Instituto de Ciências Exatas e Naturais-Medicina, Universidade Federal de Mato Grosso-Campus Rondonópolis, Mato Grosso, Brazil
- David Correa Martins
  - Centro de Matemática, Computação e Cognição, Universidade Federal do ABC, Santo André, Brazil
- Kleber Simônio Parreira
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, São Paulo, Brazil
  - Instituto de Ciências Exatas e Naturais-Medicina, Universidade Federal de Mato Grosso-Campus Rondonópolis, Mato Grosso, Brazil
- Pedro Scarpelli
  - Departamento de Fisiologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
- Miriam Santos de Moraes
  - Departamento de Fisiologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
- Pantelis Topalis
  - Institute of Molecular Biology and Biotechnology, FORTH, Hellas, Greece
- Ronaldo Fumio Hashimoto
  - Departamento de Ciência da Computação, Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, Brazil
- Célia R S Garcia
  - Departamento de Fisiologia, Instituto de Biociências, Universidade de São Paulo, São Paulo, Brazil
|
16
|
Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017; 33:514-521. [PMID: 28011774 DOI: 10.1093/bioinformatics/btw683] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/22/2016] [Accepted: 10/27/2016] [Indexed: 01/09/2023] Open
Abstract
Motivation: While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. There has been much research on how to learn these models from training data, but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has been given less attention. Moreover, motif site analysis is often concerned with how different variants in the sequence affect the sites. So far, though, the corresponding efficient software tools for motif matching have been lacking.
Results: We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q-mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants, such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and are, for multiple motif search, much faster than the basic sliding window algorithm.
Availability and Implementation: Implementations are available as a part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license (http://www.cs.helsinki.fi/group/pssmfind).
Contact: janne.h.korhonen@gmail.com.
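For orientation, the "basic sliding window algorithm" that the abstract uses as its baseline can be sketched as follows. This is a plain zero-order PWM scan, not the MOODS implementation and not the high-order matching the paper develops. The function names, the toy counts, the pseudocount and the uniform background are all illustrative assumptions.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into log2-odds scores (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Score every window of the sequence; return (start, score) for hits >= threshold.

    Assumes the sequence contains only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Made-up counts for a 3-bp motif "TGA"; not data from the paper.
    toy_counts = [{"T": 8}, {"G": 8}, {"A": 8}]
    print(scan("CCTGACC", log_odds_pwm(toy_counts), threshold=0.0))
```

The scan costs one table lookup per motif position per window, which is the cost the faster algorithms described in the abstract aim to reduce.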
Affiliation(s)
- Janne H Korhonen
  - School of Computer Science, Reykjavík University, Reykjavík, Iceland
  - Helsinki Institute for Information Technology HIIT, Helsinki, Finland
  - Department of Computer Science
- Kimmo Palin
  - Genome-Scale Biology Research Program, Research Programs Unit
- Jussi Taipale
  - Department of Biosciences and Nutrition, Karolinska Institutet, Genome Scale Biology Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Esko Ukkonen
  - Helsinki Institute for Information Technology HIIT, Helsinki, Finland
  - Department of Computer Science
|
17
|
Elmas A, Wang X, Dresch JM. The folded k-spectrum kernel: A machine learning approach to detecting transcription factor binding sites with gapped nucleotide dependencies. PLoS One 2017; 12:e0185570. [PMID: 28982128 PMCID: PMC5628859 DOI: 10.1371/journal.pone.0185570] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/11/2017] [Accepted: 09/14/2017] [Indexed: 12/22/2022] Open
Abstract
Understanding the molecular machinery involved in transcriptional regulation is central to improving our knowledge of an organism's development, disease, and evolution. The building blocks of this complex molecular machinery are an organism's genomic DNA sequence and transcription factor proteins. Despite the vast amount of sequence data now available for many model organisms, predicting where transcription factors bind, often referred to as 'motif detection' is still incredibly challenging. In this study, we develop a novel bioinformatic approach to binding site prediction. We do this by extending pre-existing SVM approaches in an unbiased way to include all possible gapped k-mers, representing different combinations of complex nucleotide dependencies within binding sites. We show the advantages of this new approach when compared to existing SVM approaches, through a rigorous set of cross-validation experiments. We also demonstrate the effectiveness of our new approach by reporting on its improved performance on a set of 127 genomic regions known to regulate gene expression along the anterio-posterior axis in early Drosophila embryos.
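To make the notion of a gapped k-mer concrete, here is a minimal sketch that counts them as features: each window of a fixed span contributes every pattern obtained by keeping k positions and masking the rest. This illustrates only the feature idea, not the folded k-spectrum kernel or the SVM training described in the paper. The function name, the `-` gap symbol and the parameter names `span` and `k` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(sequence, span, k):
    """Count gapped k-mers in a sequence.

    Each window of length `span` yields one pattern per choice of k kept
    positions; the remaining span - k positions are masked with '-'.
    """
    counts = Counter()
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        for kept in combinations(range(span), k):
            kept_positions = set(kept)
            pattern = "".join(
                base if i in kept_positions else "-"
                for i, base in enumerate(window)
            )
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # Toy input; prints patterns such as 'A-G' with their counts.
    print(gapped_kmer_counts("ACGT", span=3, k=2))
```

The number of patterns per window grows as the binomial coefficient of `span` over `k`, so feature vectors become large quickly as the span increases.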
Affiliation(s)
- Abdulkadir Elmas
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY, United States of America
- Jacqueline M. Dresch
  - Department of Mathematics and Computer Science, Clark University, Worcester, MA, United States of America
|
18
|
Smaczniak C, Muiño JM, Chen D, Angenent GC, Kaufmann K. Differences in DNA Binding Specificity of Floral Homeotic Protein Complexes Predict Organ-Specific Target Genes. Plant Cell 2017; 29:1822-1835. [PMID: 28733422 PMCID: PMC5590503 DOI: 10.1105/tpc.17.00145] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 02/21/2017] [Revised: 05/30/2017] [Accepted: 07/18/2017] [Indexed: 05/20/2023]
Abstract
Floral organ identities in plants are specified by the combinatorial action of homeotic master regulatory transcription factors. However, how these factors achieve their regulatory specificities is still largely unclear. Genome-wide in vivo DNA binding data show that homeotic MADS domain proteins recognize partly distinct genomic regions, suggesting that DNA binding specificity contributes to functional differences of homeotic protein complexes. We used in vitro systematic evolution of ligands by exponential enrichment followed by high-throughput DNA sequencing (SELEX-seq) on several floral MADS domain protein homo- and heterodimers to measure their DNA binding specificities. We show that specification of reproductive organs is associated with distinct binding preferences of a complex formed by SEPALLATA3 and AGAMOUS. Binding specificity is further modulated by different binding site spacing preferences. Combination of SELEX-seq and genome-wide DNA binding data allows differentiation between targets in specification of reproductive versus perianth organs in the flower. We validate the importance of DNA binding specificity for organ-specific gene regulation by modulating promoter activity through targeted mutagenesis. Our study shows that intrafamily protein interactions affect DNA binding specificity of floral MADS domain proteins. Differential DNA binding of MADS domain protein complexes plays a role in the specificity of target gene regulation.
Affiliation(s)
- Cezary Smaczniak
  - Laboratory of Molecular Biology, Wageningen University, Wageningen 6708PB, The Netherlands
  - Institute for Biochemistry and Biology, Potsdam University, Potsdam 14476, Germany
- Jose M Muiño
  - Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
- Dijun Chen
  - Institute for Biochemistry and Biology, Potsdam University, Potsdam 14476, Germany
- Gerco C Angenent
  - Laboratory of Molecular Biology, Wageningen University, Wageningen 6708PB, The Netherlands
  - Bioscience, Wageningen Plant Research, Wageningen 6708PB, The Netherlands
- Kerstin Kaufmann
  - Institute for Biochemistry and Biology, Potsdam University, Potsdam 14476, Germany
|
19
|
Omidi S, Zavolan M, Pachkov M, Breda J, Berger S, van Nimwegen E. Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors. PLoS Comput Biol 2017; 13:e1005176. [PMID: 28753602 PMCID: PMC5550003 DOI: 10.1371/journal.pcbi.1005176] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 10/04/2016] [Revised: 08/09/2017] [Accepted: 06/02/2017] [Indexed: 11/17/2022] Open
Abstract
Gene regulatory networks are ultimately encoded by the sequence-specific binding of transcription factors (TFs) to short DNA segments. Although it is customary to represent the binding specificity of a TF by a position-specific weight matrix (PSWM), which assumes each position within a site contributes independently to the overall binding affinity, evidence has been accumulating that there can be significant dependencies between positions. Unfortunately, methodological challenges have so far hindered the development of a practical and generally-accepted extension of the PSWM model. On the one hand, simple models that only consider dependencies between nearest-neighbor positions are easy to use in practice, but fail to account for the distal dependencies that are observed in the data. On the other hand, models that allow for arbitrary dependencies are prone to overfitting, requiring regularization schemes that are difficult to use in practice for non-experts. Here we present a new regulatory motif model, called dinucleotide weight tensor (DWT), that incorporates arbitrary pairwise dependencies between positions in binding sites, rigorously from first principles, and free from tunable parameters. We demonstrate the power of the method on a large set of ChIP-seq data-sets, showing that DWTs outperform both PSWMs and motif models that only incorporate nearest-neighbor dependencies. We also demonstrate that DWTs outperform two previously proposed methods. Finally, we show that DWTs inferred from ChIP-seq data also outperform PSWMs on HT-SELEX data for the same TF, suggesting that DWTs capture inherent biophysical properties of the interactions between the DNA binding domains of TFs and their binding sites. We make a suite of DWT tools available at dwt.unibas.ch, that allow users to automatically perform ‘motif finding’, i.e. the inference of DWT motifs from a set of sequences, binding site prediction with DWTs, and visualization of DWT ‘dilogo’ motifs.
Gene regulatory networks are ultimately encoded in constellations of short binding sites in the DNA and RNA that are recognized by regulatory factors such as transcription factors (TFs). For several decades, computational analysis of regulatory networks has relied on a model of TF sequence-specificity, the position-specific weight-matrix (PSWM), that assumes different positions in a binding site contribute independently to the total binding energy of the TF. However, in recent years evidence has been accumulating that, at least for some TFs, this assumption does not hold. Here we present a new model for the sequence-specificity of TFs, the dinucleotide weight tensor (DWT), that takes arbitrary dependencies between positions in binding sites into account and show that it consistently outperforms PSWMs on high-throughput datasets on TF binding. Moreover, in contrast to previous approaches, DWTs are directly derived from first principles within a Bayesian framework, and contain no tunable parameters. This allows them to be easily applied in practice and we make a suite of tools available for computational analysis with DWTs.
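One generic way to see whether two positions in a set of aligned binding sites depart from the independence assumption of a PSWM is to compute their mutual information. The sketch below is that diagnostic only; it is not the Bayesian DWT model from the paper. The function name and the toy sites are illustrative assumptions.

```python
import math
from collections import Counter


def mutual_information(sites, i, j):
    """Mutual information in bits between columns i and j of equal-length aligned sites."""
    n = len(sites)
    pair_counts = Counter((site[i], site[j]) for site in sites)
    left_counts = Counter(site[i] for site in sites)
    right_counts = Counter(site[j] for site in sites)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # Made-up sites: columns 0 and 2 always co-vary, column 1 is constant.
    toy_sites = ["ACA", "ACA", "TCT", "TCT"]
    print(mutual_information(toy_sites, 0, 2))
    print(mutual_information(toy_sites, 0, 1))
```

With few sites, this plug-in estimate is biased upward, so small positive values should not be read as evidence of dependency without a correction or a permutation test.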
Affiliation(s)
- Saeed Omidi
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Mihaela Zavolan
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Mikhail Pachkov
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Jeremie Breda
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Severin Berger
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
- Erik van Nimwegen
  - Biozentrum, University of Basel, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Basel, Switzerland
|
20
|
Ye Z, Ma T, Kalmbach MT, Dasari S, Kocher JPA, Wang L. CircularLogo: A lightweight web application to visualize intra-motif dependencies. BMC Bioinformatics 2017; 18:269. [PMID: 28532394 PMCID: PMC5440937 DOI: 10.1186/s12859-017-1680-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/16/2016] [Accepted: 05/11/2017] [Indexed: 01/09/2023] Open
Abstract
Background: The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization.
Result: We developed CircularLogo, a web-based interactive application, which is able to not only visualize the position-specific nucleotide consensus and diversity but also display the intra-motif dependencies. Applying CircularLogo to HNF6 binding sites and tRNA sequences demonstrated its ability to show intra-motif dependencies and intuitively reveal biomolecular structure. CircularLogo is implemented in JavaScript and Python based on the Django web framework. The program’s source code and user’s manual are freely available at http://circularlogo.sourceforge.net. CircularLogo web server can be accessed from http://bioinformaticstools.mayo.edu/circularlogo/index.html.
Conclusion: CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
Affiliation(s)
- Zhenqing Ye
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Tao Ma
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
- Michael T Kalmbach
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Surendra Dasari
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Jean-Pierre A Kocher
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Liguo Wang
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
|
21
|
Yang L, Orenstein Y, Jolma A, Yin Y, Taipale J, Shamir R, Rohs R. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol 2017; 13:910. [PMID: 28167566 PMCID: PMC5327724 DOI: 10.15252/msb.20167238] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Indexed: 12/19/2022] Open
Abstract
Transcription factors (TFs) achieve DNA‐binding specificity through contacts with functional groups of bases (base readout) and readout of structural properties of the double helix (shape readout). Currently, it remains unclear whether DNA shape readout is utilized by only a few selected TF families, or whether this mechanism is used extensively by most TF families. We resequenced data from previously published HT‐SELEX experiments, the most extensive mammalian TF–DNA binding data available to date. Using these data, we demonstrated the contributions of DNA shape readout across diverse TF families and its importance in core motif‐flanking regions. Statistical machine‐learning models combined with feature‐selection techniques helped to reveal the nucleotide position‐dependent DNA shape readout in TF‐binding sites and the TF family‐specific position dependence. Based on these results, we proposed novel DNA shape logos to visualize the DNA shape preferences of TFs. Overall, this work suggests a way of obtaining mechanistic insights into TF–DNA binding without relying on experimentally solved all‐atom structures.
Affiliation(s)
- Lin Yang
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA, USA
- Yaron Orenstein
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Arttu Jolma
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
- Yimeng Yin
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
- Jussi Taipale
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
- Ron Shamir
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Remo Rohs
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA, USA
|
22
|
Chattopadhyay A, Zandarashvili L, Luu RH, Iwahara J. Thermodynamic Additivity for Impacts of Base-Pair Substitutions on Association of the Egr-1 Zinc-Finger Protein with DNA. Biochemistry 2016; 55:6467-6474. [PMID: 27933778 DOI: 10.1021/acs.biochem.6b00757] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
The transcription factor Egr-1 specifically binds as a monomer to its 9 bp target DNA sequence, GCGTGGGCG, via three zinc fingers and plays important roles in the brain and cardiovascular systems. Using fluorescence-based competitive binding assays, we systematically analyzed the impacts of all possible single-nucleotide substitutions in the target DNA sequence and determined the change in binding free energy for each. Then, we measured the changes in binding free energy for sequences with multiple substitutions and compared them with the sum of the changes in binding free energy for each constituent single substitution. For the DNA variants with two or three nucleotide substitutions in the target sequence, we found excellent agreement between the measured and predicted changes in binding free energy. Interestingly, however, we found that this thermodynamic additivity broke down with a larger number of substitutions. For DNA sequences with four or more substitutions, the measured changes in binding free energy were significantly larger than predicted. On the basis of these results, we analyzed the occurrences of high-affinity sequences in the genome and found that the genome contains millions of such sequences that might functionally sequester Egr-1.
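The additivity test described in the abstract compares a measured change in binding free energy with the sum of the changes for each constituent single substitution. A minimal sketch of the predicted side of that comparison follows. The target sequence GCGTGGGCG is from the abstract; the function name, the lookup-table layout and every numeric value are hypothetical placeholders, not measurements from the paper.

```python
REFERENCE_SITE = "GCGTGGGCG"  # Egr-1 target sequence quoted in the abstract


def predicted_ddg(variant, reference, single_ddg):
    """Additive prediction: sum the single-substitution free-energy changes.

    `single_ddg` maps (0-based position, substituted base) to the change in
    binding free energy for that substitution alone (hypothetical layout).
    """
    if len(variant) != len(reference):
        raise ValueError("variant and reference must have the same length")
    return sum(
        single_ddg[(position, base)]
        for position, (base, ref_base) in enumerate(zip(variant, reference))
        if base != ref_base
    )


if __name__ == "__main__":
    # Placeholder values in arbitrary energy units; NOT data from the paper.
    toy_single_ddg = {(0, "A"): 1.0, (3, "A"): 0.5}
    print(predicted_ddg("ACGAGGGCG", REFERENCE_SITE, toy_single_ddg))
```

Under the abstract's findings, a prediction of this kind agreed well with measurement for two or three substitutions and underestimated the measured change for four or more.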
Affiliation(s)
- Abhijnan Chattopadhyay
  - Department of Biochemistry & Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas 77555, United States
- Levani Zandarashvili
  - Department of Biochemistry & Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas 77555, United States
- Ross H Luu
  - Department of Biochemistry & Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas 77555, United States
- Junji Iwahara
  - Department of Biochemistry & Molecular Biology, Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas 77555, United States
|
23
|
Siebert M, Söding J. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. Nucleic Acids Res 2016; 44:6055-69. [PMID: 27288444 PMCID: PMC5291271 DOI: 10.1093/nar/gkw521] [Citation(s) in RCA: 61] [Impact Index Per Article: 7.6] [Received: 04/11/2016] [Accepted: 05/29/2016] [Indexed: 01/01/2023] Open
Abstract
Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
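The idea that lower-order probabilities act as priors for higher-order ones can be illustrated for k = 1. In the sketch below, the order-0 base frequencies are added as pseudo-observations, weighted by a strength `alpha`, when estimating p(b | a). This is a simplified illustration that pools all sequence positions; it is not the BaMM software, and the function name, `alpha` and `pseudocount` are assumptions.

```python
from collections import Counter

BASES = "ACGT"


def interpolated_first_order(sequences, alpha=2.0, pseudocount=1.0):
    """Estimate order-0 p(b) and order-1 p(b | a), using p(b) as a prior for p(b | a).

    Assumes sequences contain only A, C, G and T.
    """
    mono = Counter()
    di = Counter()
    for seq in sequences:
        mono.update(seq)                # single-base counts
        di.update(zip(seq, seq[1:]))    # adjacent-pair counts
    total = sum(mono[b] for b in BASES) + 4 * pseudocount
    p0 = {b: (mono[b] + pseudocount) / total for b in BASES}
    p1 = {}
    for a in BASES:
        context_total = sum(di[(a, b)] for b in BASES)
        for b in BASES:
            # With little data for context a, the estimate falls back towards p0[b].
            p1[(a, b)] = (di[(a, b)] + alpha * p0[b]) / (context_total + alpha)
    return p0, p1


if __name__ == "__main__":
    p0, p1 = interpolated_first_order(["ACAC"])  # toy input
    print(p0["C"], p1[("A", "C")], p1[("G", "C")])
```

Contexts that are well supported by data move away from the order-0 estimate, while unseen contexts reproduce it exactly, which is how this kind of prior adapts model complexity to the amount of available data.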
Affiliation(s)
- Matthias Siebert
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
  - Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
- Johannes Söding
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
|
24
|
Aguilar-Gurrieri C, Larabi A, Vinayachandran V, Patel NA, Yen K, Reja R, Ebong IO, Schoehn G, Robinson CV, Pugh BF, Panne D. Structural evidence for Nap1-dependent H2A-H2B deposition and nucleosome assembly. EMBO J 2016; 35:1465-82. [PMID: 27225933 PMCID: PMC4931181 DOI: 10.15252/embj.201694105] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 02/12/2016] [Accepted: 04/21/2016] [Indexed: 11/25/2022] Open
Abstract
Nap1 is a histone chaperone involved in the nuclear import of H2A–H2B and nucleosome assembly. Here, we report the crystal structure of Nap1 bound to H2A–H2B together with in vitro and in vivo functional studies that elucidate the principles underlying Nap1‐mediated H2A–H2B chaperoning and nucleosome assembly. A Nap1 dimer provides an acidic binding surface and asymmetrically engages a single H2A–H2B heterodimer. Oligomerization of the Nap1–H2A–H2B complex results in burial of surfaces required for deposition of H2A–H2B into nucleosomes. Chromatin immunoprecipitation‐exonuclease (ChIP‐exo) analysis shows that Nap1 is required for H2A–H2B deposition across the genome. Mutants that interfere with Nap1 oligomerization exhibit severe nucleosome assembly defects showing that oligomerization is essential for the chaperone function. These findings establish the molecular basis for Nap1‐mediated H2A–H2B deposition and nucleosome assembly.
Affiliation(s)
- Carmen Aguilar-Gurrieri
- European Molecular Biology Laboratory, Grenoble, France; Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, Grenoble, France
| | - Amédé Larabi
- European Molecular Biology Laboratory, Grenoble, France; Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, Grenoble, France
| | - Vinesh Vinayachandran
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
| | - Nisha A Patel
- Department of Chemistry, University of Oxford, Oxford, UK
| | - Kuangyu Yen
- Department of Cell Biology, Southern Medical University, Guangzhou, China
| | - Rohit Reja
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
| | - Ima-O Ebong
- Department of Chemistry, University of Oxford, Oxford, UK
| | - Guy Schoehn
- Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, Grenoble, France; Université Grenoble-Alpes, Grenoble, France; Centre National de la Recherche Scientifique (CNRS) IBS, Grenoble, France; CEA, IBS, Grenoble, France
| | | | - B Franklin Pugh
- Center for Eukaryotic Gene Regulation, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
| | - Daniel Panne
- European Molecular Biology Laboratory, Grenoble, France; Unit for Virus Host-Cell Interactions, Univ. Grenoble Alpes-EMBL-CNRS, Grenoble, France
| |
|
25
|
|
26
|
Yang C, Chang CH. Exploring comprehensive within-motif dependence of transcription factor binding in Escherichia coli. Sci Rep 2015; 5:17021. [PMID: 26592556 PMCID: PMC4655474 DOI: 10.1038/srep17021] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/05/2015] [Accepted: 10/16/2015] [Indexed: 01/18/2023] Open
Abstract
Modeling the binding of transcription factors helps to decipher the control logic behind transcriptional regulatory networks. Position weight matrix is commonly used to describe a binding motif but assumes statistical independence between positions. Although current approaches take within-motif dependence into account for better predictive performance, these models usually rely on prior knowledge and incorporate simple positional dependence to describe binding motifs. The inability to take complex within-motif dependence into account may result in an incomplete representation of binding motifs. In this work, we applied association rule mining techniques and constructed models to explore within-motif dependence for transcription factors in Escherichia coli. Our models can reflect transcription factor-DNA recognition where the explored dependence correlates with the binding specificity. We also propose a graphical representation of the explored within-motif dependence to illustrate the final binding configurations. Understanding the binding configurations also enables us to fine-tune or design transcription factor binding sites, and we attempt to present the configurations through exploring within-motif dependence.
Affiliation(s)
- Chi Yang
- Institute of Biomedical Informatics, National Yang Ming University, Taipei, 11221, Taiwan
| | - Chuan-Hsiung Chang
- Institute of Biomedical Informatics, National Yang Ming University, Taipei, 11221, Taiwan; Center for Systems and Synthetic Biology, National Yang Ming University, Taipei, 11221, Taiwan
| |
|
27
|
Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, Zhang AW, Parcy F, Lenhard B, Sandelin A, Wasserman WW. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2015; 44:D110-5. [PMID: 26531826 PMCID: PMC4702842 DOI: 10.1093/nar/gkv1176] [Citation(s) in RCA: 727] [Impact Index Per Article: 80.8] [Received: 09/12/2015] [Accepted: 10/22/2015] [Indexed: 11/28/2022] Open
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database storing curated, non-redundant transcription factor (TF) binding profiles representing transcription factor binding preferences as position frequency matrices for multiple species in six taxonomic groups. For this 2016 release, we expanded the JASPAR CORE collection with 494 new TF binding profiles (315 in vertebrates, 11 in nematodes, 3 in insects, 1 in fungi and 164 in plants) and updated 59 profiles (58 in vertebrates and 1 in fungi). The introduced profiles represent an 83% expansion and 10% update when compared to the previous release. We updated the structural annotation of the TF DNA binding domains (DBDs) following a published hierarchical structural classification. In addition, we introduced 130 transcription factor flexible models trained on ChIP-seq data for vertebrates, which capture dinucleotide dependencies within TF binding sites. This new JASPAR release is accompanied by a new web tool to infer JASPAR TF binding profiles recognized by a given TF protein sequence. Moreover, we provide the users with a Ruby module complementing the JASPAR API to ease programmatic access and use of the JASPAR collection of profiles. Finally, we provide the JASPAR2016 R/Bioconductor data package with the data of this release.
Affiliation(s)
- Anthony Mathelier
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - David J Arenillas
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Chih-Yu Chen
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Grégoire Denay
- Laboratoire Physiologie Cellulaire & Végétale, Université Grenoble Alpes, CNRS, CEA, iRTSV, INRA, 38054 Grenoble, France
| | - Jessica Lee
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Wenqiang Shi
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Casper Shyr
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Ge Tan
- Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial College London, Du Cane Road, London W12 0NN, UK
| | - Rebecca Worsley-Hunt
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - Allen W Zhang
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| | - François Parcy
- Laboratoire Physiologie Cellulaire & Végétale, Université Grenoble Alpes, CNRS, CEA, iRTSV, INRA, 38054 Grenoble, France
| | - Boris Lenhard
- Computational Regulatory Genomics, MRC Clinical Sciences Centre, Imperial College London, Du Cane Road, London W12 0NN, UK
| | - Albin Sandelin
- The Bioinformatics Centre, Department of Biology and Biotech Research and Innovation Centre, Copenhagen University, Ole Maaloes Vej 5, DK-2200, Denmark
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, V5Z 4H4, BC, Canada
| |
|
28
|
MORPHEUS, a Webtool for Transcription Factor Binding Analysis Using Position Weight Matrices with Dependency. PLoS One 2015; 10:e0135586. [PMID: 26285209 PMCID: PMC4540572 DOI: 10.1371/journal.pone.0135586] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 04/19/2015] [Accepted: 07/24/2015] [Indexed: 12/21/2022] Open
Abstract
Transcriptional networks are central to any biological process, and changes affecting transcription factors or their binding sites in the genome are a key factor driving evolution. As more organisms are being sequenced, tools are needed to easily predict the presence and affinity of transcription factor binding sites (TFBS) from mere inspection of genomic sequences. Although many TFBS discovery algorithms exist, tools for using the DNA binding models they generate are relatively scarce, and their use is limited among the biologist community by the lack of flexible and user-friendly tools. We have developed a suite of web tools (called Morpheus) based on the proven Position Weight Matrices (PWM) formalism that can be used without any programming skills and incorporates some unique features, such as the presence of dependencies between nucleotide positions or the possibility to compute the predicted occupancy of a large regulatory region using a biophysical model. To illustrate the possibilities and simplicity of Morpheus tools in functional and evolutionary analysis, we have analysed the regulatory link between LEAFY, a key plant transcription factor involved in flower development, and its direct target gene APETALA1 during the divergence of the Brassicales clade.
|
29
|
Anderson DW, McKeown AN, Thornton JW. Intermolecular epistasis shaped the function and evolution of an ancient transcription factor and its DNA binding sites. eLife 2015; 4:e07864. [PMID: 26076233 PMCID: PMC4500092 DOI: 10.7554/elife.07864] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.0] [Received: 04/02/2015] [Accepted: 06/13/2015] [Indexed: 02/07/2023] Open
Abstract
Complexes of specifically interacting molecules, such as transcription factor proteins (TFs) and the DNA response elements (REs) they recognize, control most biological processes, but little is known concerning the functional and evolutionary effects of epistatic interactions across molecular interfaces. We experimentally characterized all combinations of genotypes in the joint protein-DNA sequence space defined by an historical transition in TF-RE specificity that occurred some 500 million years ago in the DNA-binding domain of an ancient steroid hormone receptor. We found that rampant epistasis within and between the two molecules was essential to specific TF-RE recognition and to the evolution of a novel TF-RE complex with unique derived specificity. Permissive and restrictive epistatic mutations across the TF-RE interface opened and closed potential evolutionary paths accessible by the other, making the evolution of each molecule contingent on its partner's history and allowing a molecular complex with novel specificity to evolve.
Affiliation(s)
- Dave W Anderson
- Institute of Ecology and Evolution, University of Oregon, Eugene, United States
| | - Alesia N McKeown
- Institute of Ecology and Evolution, University of Oregon, Eugene, United States
| | - Joseph W Thornton
- Department of Ecology and Evolution, University of Chicago, Chicago, United States
| |
|
30
|
Zellers RG, Drewell RA, Dresch JM. MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding. BMC Bioinformatics 2015; 16:30. [PMID: 25637281 PMCID: PMC4384306 DOI: 10.1186/s12859-014-0446-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/20/2014] [Accepted: 11/24/2014] [Indexed: 12/28/2022] Open
Abstract
Background: A key challenge in understanding the molecular mechanisms that control gene regulation is the characterization of the specificity with which transcription factor proteins bind to specific DNA sequences. A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models. Results: Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. In addition, to evaluate the ability of these matrix models to predict in vivo binding sites, we utilize a new scoring system and, in combination with established scoring methods and statistical analysis, test the performance of 32 different gapped matrices on the well characterized HUNCHBACK transcription factor in Drosophila. Conclusions: Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-014-0446-3) contains supplementary material, which is available to authorized users.
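Systematically enumerating gapped patterns over a fixed window can be sketched as follows. This is an illustration of the general idea only, not the MARZ code: the helper names, the convention of always keeping the first position, the "-" gap symbol, and the example sites are all assumptions made here.

```python
from collections import Counter
from itertools import product

def gapped_masks(width):
    """Yield every keep/skip pattern over `width` positions.

    Assumption: the first position is always kept, so that patterns
    differing only by a shift are not counted twice.
    """
    for tail in product((True, False), repeat=width - 1):
        yield (True,) + tail

def apply_mask(site, mask):
    """Replace skipped positions of `site` with '-' to form a gapped n-mer."""
    return "".join(base if keep else "-" for base, keep in zip(site, mask))

def gapped_counts(sites, mask):
    """Count how often each gapped n-mer occurs among aligned sites."""
    return Counter(apply_mask(site, mask) for site in sites)

# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACTT", "AGGT"]
for mask in gapped_masks(4):
    print(mask, dict(gapped_counts(sites, mask)))
```

Under this convention a window of width w yields 2^(w-1) patterns, and each pattern gives a count table that could be normalized into a gapped matrix.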
Affiliation(s)
- Rowan G Zellers
- Department of Computer Science, Harvey Mudd College, 301 Platt Boulevard, Claremont CA, 91711, USA; Department of Mathematics, Harvey Mudd College, 301 Platt Boulevard, Claremont CA, 91711, USA.
| | - Robert A Drewell
- Biology Department, Clark University, 950 Main Street, Worcester MA, 01610, USA.
| | - Jacqueline M Dresch
- Department of Mathematics and Statistics, Amherst College, P.O. Box 5000, Amherst MA, 01002, USA.
| |
|
31
|
Lewis DD, Villarreal FD, Wu F, Tan C. Synthetic biology outside the cell: linking computational tools to cell-free systems. Front Bioeng Biotechnol 2014; 2:66. [PMID: 25538941 PMCID: PMC4260521 DOI: 10.3389/fbioe.2014.00066] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 09/16/2014] [Accepted: 11/23/2014] [Indexed: 12/22/2022] Open
Abstract
As mathematical models become more commonly integrated into the study of biology, a common language for describing biological processes is manifesting. Many tools have emerged for the simulation of in vivo synthetic biological systems, with only a few examples of prominent work done on predicting the dynamics of cell-free synthetic systems. At the same time, experimental biologists have begun to study dynamics of in vitro systems encapsulated by amphiphilic molecules, opening the door for the development of a new generation of biomimetic systems. In this review, we explore both in vivo and in vitro models of biochemical networks with a special focus on tools that could be applied to the construction of cell-free expression systems. We believe that quantitative studies of complex cellular mechanisms and pathways in synthetic systems can yield important insights into what makes cells different from conventional chemical systems.
Affiliation(s)
- Daniel D. Lewis
- Integrative Genetics and Genomics, University of California Davis, Davis, CA, USA
- Department of Biomedical Engineering, University of California Davis, Davis, CA, USA
| | | | - Fan Wu
- Department of Biomedical Engineering, University of California Davis, Davis, CA, USA
| | - Cheemeng Tan
- Department of Biomedical Engineering, University of California Davis, Davis, CA, USA
| |
|
32
|
Stormo GD, Zuo Z, Chang YK. Spec-seq: determining protein-DNA-binding specificity by sequencing. Brief Funct Genomics 2014; 14:30-8. [PMID: 25362070 DOI: 10.1093/bfgp/elu043] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 01/23/2023] Open
Abstract
The specificity of protein-DNA interactions can be determined directly by sequencing the bound and unbound fractions in a standard binding reaction. The procedure is easy and inexpensive, and the accuracy can be high for thousands of sequences assayed in parallel. From the measurements, simple models of specificity, such as position weight matrices, can be assessed for their accuracy and more complex models developed if useful. Those may provide more accurate predictions of in vivo binding sites and can help us to understand the details of recognition. As an example, we demonstrate new information gained about the binding of lac repressor. One can apply the same method to combinations of factors that bind simultaneously to a single DNA and determine both the specificity of the individual factors and the cooperativity between them.
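As a rough illustration of how bound and unbound read counts could be turned into relative specificities, the sketch below takes the bound-to-unbound ratio of each sequence and normalizes it to a reference sequence. It assumes that this ratio is proportional to binding affinity under the assay conditions; the function name, the count dictionaries, and the example sequences are hypothetical and not taken from the paper.

```python
import math

def relative_affinities(bound, unbound, reference):
    """Return {sequence: (relative_affinity, relative_energy)}.

    relative_affinity = (bound/unbound) for the sequence divided by the
    same ratio for `reference`; relative_energy = -ln(relative_affinity),
    in units of kT (assumed convention).
    """
    ref_ratio = bound[reference] / unbound[reference]
    result = {}
    for seq, n_bound in bound.items():
        n_unbound = unbound.get(seq, 0)
        if n_bound == 0 or n_unbound == 0:
            continue  # ratio undefined without reads in both fractions
        ratio = (n_bound / n_unbound) / ref_ratio
        result[seq] = (ratio, -math.log(ratio))
    return result

# Hypothetical read counts, for illustration only.
bound = {"AATT": 400, "AACT": 100}
unbound = {"AATT": 100, "AACT": 100}
for seq, (affinity, energy) in relative_affinities(bound, unbound, "AATT").items():
    print(seq, round(affinity, 3), round(energy, 3))
```

Relative energies obtained this way can then be compared with the predictions of a position weight matrix or a more complex model.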
|
33
|
High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics 2014; 198:1329-43. [PMID: 25209146 DOI: 10.1534/genetics.114.170100] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 12/29/2022] Open
Abstract
Knowing the specificity of transcription factors is critical to understanding regulatory networks in cells. The lac repressor-operator system has been studied for many years, but not with high-throughput methods capable of determining specificity comprehensively. Details of its binding interaction and its selection of an asymmetric binding site have been controversial. We employed a new method to accurately determine relative binding affinities to thousands of sequences simultaneously, requiring only sequencing of bound and unbound fractions. An analysis of 2560 different DNA sequence variants, including both base changes and variations in operator length, provides a detailed view of lac repressor sequence specificity. We find that the protein can bind with nearly equal affinities to operators of three different lengths, but the sequence preference changes depending on the length, demonstrating alternative modes of interaction between the protein and DNA. The wild-type operator has an odd length, causing the two monomers to bind in alternative modes, making the asymmetric operator the preferred binding site. We tested two other members of the LacI/GalR protein family and find that neither can bind with high affinity to sites with alternative lengths or shows evidence of alternative binding modes. A further comparison with known and predicted motifs suggests that the lac repressor may be unique in this ability and that this may contribute to its selection.
|
34
|
Jetha K, Theißen G, Melzer R. Arabidopsis SEPALLATA proteins differ in cooperative DNA-binding during the formation of floral quartet-like complexes. Nucleic Acids Res 2014; 42:10927-42. [PMID: 25183521 PMCID: PMC4176161 DOI: 10.1093/nar/gku755] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Indexed: 01/11/2023] Open
Abstract
The SEPALLATA (SEP) genes of Arabidopsis thaliana encode MADS-domain transcription factors that specify the identity of all floral organs. The four Arabidopsis SEP genes function in a largely yet not completely redundant manner. Here, we analysed interactions of the SEP proteins with DNA. All of the proteins were capable of forming tetrameric quartet-like complexes on DNA fragments carrying two sequence elements termed CArG-boxes. Distances between the CArG-boxes for strong cooperative DNA-binding were in the range of 4-6 helical turns. However, SEP1 also bound strongly to CArG-box pairs separated by smaller or larger distances, whereas SEP2 preferred large and SEP4 preferred small inter-site distances for binding. Cooperative binding of SEP3 was comparatively weak for most of the inter-site distances tested. All SEP proteins constituted floral quartet-like complexes together with the floral homeotic proteins APETALA3 (AP3) and PISTILLATA (PI) on the target genes AP3 and SEP3. Our results suggest an important part of an explanation for why the different SEP proteins have largely, but not completely redundant functions in determining floral organ identity: they may bind to largely overlapping, but not identical sets of target genes that differ in the arrangement and spacing of the CArG-boxes in their cis-regulatory regions.
Affiliation(s)
- Khushboo Jetha
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743 Jena, Germany
| | - Günter Theißen
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743 Jena, Germany
| | - Rainer Melzer
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743 Jena, Germany; Department of Genetics, Institute of Biology, University of Leipzig, Talstraße 33, D-04103 Leipzig, Germany
| |
|
35
|
Wang J. Quality versus accuracy: result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence. BMC Bioinformatics 2014; 15:289. [PMID: 25158938 PMCID: PMC4161872 DOI: 10.1186/1471-2105-15-289] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/15/2014] [Accepted: 08/18/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Computational modeling of transcription factor (TF) sequence specificity is an important research topic in regulatory genomics. A systematic comparison of 26 algorithms to learn TF-DNA binding specificity in in vitro protein-binding microarray (PBM) data was published recently, but the quality of those examined PBMs was not evaluated completely. RESULTS: Here, new quality-control parameters, such as the principal component analysis (PCA) ellipse, are proposed to assess the data quality for either single or paired PBMs. Additionally, a biophysical model of TF-DNA interactions including adjacent dinucleotide interdependence was implemented in a new program, BayesPI2, where sparse Bayesian learning and a relevance vector machine are used to predict unknown model parameters. Then, 66 mouse TFs from the DREAM5 challenge were classified into two groups (i.e. good vs. bad) based on the paired PBM quality-control parameters. Subsequently, computational methods to model TF sequence specificity were evaluated between the two groups. CONCLUSION: Results indicate that both the algorithm performance and the predicted TF-binding energy level of a motif are significantly influenced by PBM data quality, where poor PBM data quality is linked to specific protein domains (e.g. the C2H2 DNA-binding domain). In particular, the new dinucleotide energy-dependent model (BayesPI2) offers great improvement in testing prediction accuracy over the simple energy-independent model for at least 21% of the analyzed TFs.
Affiliation(s)
- Junbai Wang
- Pathology Department, Oslo University Hospital - Norwegian Radium Hospital, Montebello, Oslo, 0310, Norway.
| |
|
36
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 332] [Impact Index Per Article: 33.2] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA.
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Ana Carolina Dantas Machado
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Raluca Gordân
- Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA.
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| |
|
37
|
Santolini M, Mora T, Hakim V. A general pairwise interaction model provides an accurate description of in vivo transcription factor binding sites. PLoS One 2014; 9:e99015. [PMID: 24926895 PMCID: PMC4057186 DOI: 10.1371/journal.pone.0099015] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 12/10/2013] [Accepted: 05/09/2014] [Indexed: 11/19/2022] Open
Abstract
The identification of transcription factor binding sites (TFBSs) on genomic DNA is of crucial importance for understanding and predicting regulatory elements in gene networks. TFBS motifs are commonly described by Position Weight Matrices (PWMs), in which each DNA base pair contributes independently to the transcription factor (TF) binding. However, this description ignores correlations between nucleotides at different positions, and is generally inaccurate: analysing fly and mouse in vivo ChIPseq data, we show that in most cases the PWM model fails to reproduce the observed statistics of TFBSs. To overcome this issue, we introduce the pairwise interaction model (PIM), a generalization of the PWM model. The model is based on the principle of maximum entropy and explicitly describes pairwise correlations between nucleotides at different positions, while being otherwise as unconstrained as possible. It is mathematically equivalent to considering a TF-DNA binding energy that depends additively on each nucleotide identity at all positions in the TFBS, like the PWM model, but also additively on pairs of nucleotides. We find that the PIM significantly improves over the PWM model, and even provides an optimal description of TFBS statistics within statistical noise. The PIM generalizes previous approaches to interdependent positions: it accounts for co-variation of two or more base pairs, and predicts secondary motifs, while outperforming multiple-motif models consisting of mixtures of PWMs. We analyse the structure of pairwise interactions between nucleotides, and find that they are sparse and dominantly located between consecutive base pairs in the flanking region of TFBS. Nonetheless, interactions between pairs of non-consecutive nucleotides are found to play a significant role in the obtained accurate description of TFBS statistics. 
The PIM is computationally tractable, and provides a general framework that should be useful for describing and predicting TFBSs beyond PWMs.
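The description of a binding energy that depends additively on single nucleotides and on pairs of nucleotides can be made concrete with a small sketch. The parameter containers `h` (per-position terms) and `J` (pair terms), the sign convention P(s) proportional to exp(-E(s)), and the toy values are assumptions made here for illustration; they are not the fitted model from the paper.

```python
import math
from itertools import product

BASES = "ACGT"

def pim_energy(site, h, J):
    """Energy = sum of per-position terms plus sum of pair terms.

    h: list of dicts, h[i][base]
    J: dict mapping (i, j) to a dict of {(base_i, base_j): coupling}
    """
    energy = sum(h[i][base] for i, base in enumerate(site))
    for (i, j), table in J.items():
        energy += table.get((site[i], site[j]), 0.0)
    return energy

def pim_distribution(length, h, J):
    """Normalized probabilities over all sites, by brute-force enumeration.

    Only feasible for short toy motifs (4**length sequences).
    """
    weights = {
        "".join(s): math.exp(-pim_energy(s, h, J))
        for s in product(BASES, repeat=length)
    }
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}

# Toy parameters: flat single-site terms and one favourable A-T coupling.
h = [{b: 0.0 for b in BASES} for _ in range(2)]
J = {(0, 1): {("A", "T"): -1.0}}
dist = pim_distribution(2, h, J)
print(round(dist["AT"], 4), round(dist["AA"], 4))
```

With `J` left empty the distribution factorizes over positions, which corresponds to the independent PWM case that the pairwise model generalizes.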
Affiliation(s)
- Marc Santolini
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
| | - Thierry Mora
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
| | - Vincent Hakim
- Laboratoire de Physique Statistique, CNRS, Université P. et M. Curie, Université D. Diderot, École Normale Supérieure, Paris, France
| |
|
38
|
Eggeling R, Gohr A, Keilwagen J, Mohr M, Posch S, Smith AD, Grosse I. On the value of intra-motif dependencies of human insulator protein CTCF. PLoS One 2014; 9:e85629. [PMID: 24465627 PMCID: PMC3899044 DOI: 10.1371/journal.pone.0085629] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 08/09/2013] [Accepted: 12/05/2013] [Indexed: 01/08/2023] Open
Abstract
The binding affinity of DNA-binding proteins such as transcription factors is mainly determined by the base composition of the corresponding binding site on the DNA strand. Most proteins do not bind only a single sequence, but rather a set of sequences, which may be modeled by a sequence motif. Algorithms for de novo motif discovery differ in their promoter models, learning approaches, and other aspects, but typically use the statistically simple position weight matrix model for the motif, which assumes statistical independence among all nucleotides. However, there is no clear justification for that assumption, leading to an ongoing debate about the importance of modeling dependencies between nucleotides within binding sites. In the past, modeling statistical dependencies within binding sites has been hampered by the problem of limited data. With the rise of high-throughput technologies such as ChIP-seq, this situation has now changed, making it possible to make use of statistical dependencies effectively. In this work, we investigate the presence of statistical dependencies in binding sites of the human enhancer-blocking insulator protein CTCF by using the recently developed model class of inhomogeneous parsimonious Markov models, which is capable of modeling complex dependencies while avoiding overfitting. These findings lead to a more detailed characterization of the CTCF binding motif, which is only poorly represented by independent nucleotide frequencies at several positions, predominantly at the 3' end.
Affiliation(s)
- Ralf Eggeling
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- André Gohr
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
- Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
- Michaela Mohr
- Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
- Stefan Posch
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- Andrew D. Smith
- Molecular and Computational Biology, University of Southern California, Los Angeles, United States of America
- Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Halle/Saale, Germany
- Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Seeland OT Gatersleben, Germany
- German Center of Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
|
39
|
Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordân R. Stability selection for regression-based models of transcription factor-DNA binding specificity. Bioinformatics 2013; 29:i117-25. [PMID: 23812975 PMCID: PMC3694650 DOI: 10.1093/bioinformatics/btt221] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 01/11/2023] Open
Abstract
Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret. Results: We propose novel regression-based models of TF–DNA binding specificity, trained using high resolution in vitro data from custom protein-binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max and Mad2) in their native genomic context. These high-throughput quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF–DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TFs with highly similar position weight matrices, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step toward better sequence-based models of individual TF–DNA binding specificity. Availability: Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this article are available in the Gene Expression Omnibus under accession number GSE47026. Contact: raluca.gordan@duke.edu
Affiliation(s)
- Fantine Mordelet
- Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA
|
40
|
Muiño JM, Smaczniak C, Angenent GC, Kaufmann K, van Dijk ADJ. Structural determinants of DNA recognition by plant MADS-domain transcription factors. Nucleic Acids Res 2013; 42:2138-46. [PMID: 24275492 PMCID: PMC3936718 DOI: 10.1093/nar/gkt1172] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Indexed: 12/25/2022] Open
Abstract
Plant MADS-domain transcription factors act as key regulators of many developmental processes. Despite the wealth of information that exists about these factors, the mechanisms by which they recognize their cognate DNA-binding site, called CArG-box (consensus CCW6GG), and how different MADS-domain proteins achieve DNA-binding specificity, are still largely unknown. We used information from in vivo ChIP-seq experiments, in vitro DNA-binding data and evolutionary conservation to address these important questions. We found that structural characteristics of the DNA play an important role in the DNA binding of plant MADS-domain proteins. The central region of the CArG-box largely resembles a structural motif called ‘A-tract’, which is characterized by a narrow minor groove and may assist bending of the DNA by MADS-domain proteins. Periodically spaced A-tracts outside the CArG-box suggest additional roles for this structure in the process of DNA binding of these transcription factors. Structural characteristics of the CArG-box not only play an important role in DNA-binding site recognition of MADS-domain proteins, but also partly explain differences in DNA-binding specificity of different members of this transcription factor family and their heteromeric complexes.
Affiliation(s)
- Jose M Muiño
- Bioscience, Plant Research International, Wageningen, PO Box 619, 6700 AP, The Netherlands
- Laboratory of Bioinformatics, Wageningen University, PO Box 569, 6700 AN Wageningen, The Netherlands
- Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin D-14195, Germany
- Laboratory of Molecular Biology, Wageningen University, Wageningen, PO Box 633, 6700 AP, The Netherlands
- Biometris, Wageningen University and Research Centre, Wageningen, PO Box 100, 6700 AC, The Netherlands
|
41
|
Siggers T, Gordân R. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Res 2013; 42:2099-111. [PMID: 24243859 PMCID: PMC3936734 DOI: 10.1093/nar/gkt1112] [Citation(s) in RCA: 151] [Impact Index Per Article: 13.7] [Indexed: 01/02/2023] Open
Abstract
Binding of proteins to particular DNA sites across the genome is a primary determinant of specificity in genome maintenance and gene regulation. DNA-binding specificity is encoded at multiple levels, from the detailed biophysical interactions between proteins and DNA, to the assembly of multi-protein complexes. At each level, variation in the mechanisms used to achieve specificity has led to difficulties in constructing and applying simple models of DNA binding. We review the complexities in protein–DNA binding found at multiple levels and discuss how they confound the idea of simple recognition codes. We discuss the impact of new high-throughput technologies for the characterization of protein–DNA binding, and how these technologies are uncovering new complexities in protein–DNA recognition. Finally, we review the concept of multi-protein recognition codes in which new DNA-binding specificities are achieved by the assembly of multi-protein complexes.
Affiliation(s)
- Trevor Siggers
- Department of Biology, Boston University, Boston, MA 02215, USA
- Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA
|
42
|
Mathelier A, Wasserman WW. The next generation of transcription factor binding site prediction. PLoS Comput Biol 2013; 9:e1003214. [PMID: 24039567 PMCID: PMC3764009 DOI: 10.1371/journal.pcbi.1003214] [Citation(s) in RCA: 124] [Impact Index Per Article: 11.3] [Received: 02/01/2013] [Accepted: 07/22/2013] [Indexed: 12/29/2022] Open
Abstract
Finding where transcription factors (TFs) bind to the DNA is of key importance to decipher gene regulation at a transcriptional level. Classically, computational prediction of TF binding sites (TFBSs) is based on basic position weight matrices (PWMs) which quantitatively score binding motifs based on the observed nucleotide patterns in a set of TFBSs for the corresponding TF. Such models make the strong assumption that each nucleotide participates independently in the corresponding DNA-protein interaction and do not account for flexible length motifs. We introduce transcription factor flexible models (TFFMs) to represent TF binding properties. Based on hidden Markov models, TFFMs are flexible, and can model both position interdependence within TFBSs and variable length motifs within a single dedicated framework. The availability of thousands of experimentally validated DNA-TF interaction sequences from ChIP-seq allows for the generation of models that perform as well as PWMs for stereotypical TFs and can improve performance for TFs with flexible binding characteristics. We present a new graphical representation of the motifs that convey properties of position interdependence. TFFMs have been assessed on ChIP-seq data sets coming from the ENCODE project, revealing that they can perform better than both PWMs and the dinucleotide weight matrix extension in discriminating ChIP-seq from background sequences. Under the assumption that ChIP-seq signal values are correlated with the affinity of the TF-DNA binding, we find that TFFM scores correlate with ChIP-seq peak signals. Moreover, using available TF-DNA affinity measurements for the Max TF, we demonstrate that TFFMs constructed from ChIP-seq data correlate with published experimentally measured DNA-binding affinities. Finally, TFFMs allow for the straightforward computation of an integrated TF occupancy score across a sequence. These results demonstrate the capacity of TFFMs to accurately model DNA-protein interactions, while providing a single unified framework suitable for the next generation of TFBS prediction.
Affiliation(s)
- Anthony Mathelier
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
- Wyeth W. Wasserman
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
|
43
|
Dresch JM, Richards M, Ay A. A primer on thermodynamic-based models for deciphering transcriptional regulatory logic. Biochimica et Biophysica Acta - Gene Regulatory Mechanisms 2013; 1829:946-53. [PMID: 23643643 DOI: 10.1016/j.bbagrm.2013.04.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/26/2012] [Revised: 04/24/2013] [Accepted: 04/25/2013] [Indexed: 11/27/2022]
Abstract
A rigorous analysis of transcriptional regulation at the DNA level is crucial to the understanding of many biological systems. Mathematical modeling has offered researchers a new approach to understanding this central process. In particular, thermodynamic-based modeling represents the most biophysically informed approach aimed at connecting DNA level regulatory sequences to the expression of specific genes. The goal of this review is to give biologists a thorough description of the steps involved in building, analyzing, and implementing a thermodynamic-based model of transcriptional regulation. The data requirements for this modeling approach are described, the derivation for a specific regulatory region is shown, and the challenges and future directions for the quantitative modeling of gene regulation are discussed.
|
44
|
Sun W, Hu X, Lim MHK, Ng CKL, Choo SH, Castro DS, Drechsel D, Guillemot F, Kolatkar PR, Jauch R, Prabhakar S. TherMos: Estimating protein-DNA binding energies from in vivo binding profiles. Nucleic Acids Res 2013; 41:5555-68. [PMID: 23595148 PMCID: PMC3675472 DOI: 10.1093/nar/gkt250] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 01/21/2023] Open
Abstract
Accurately characterizing transcription factor (TF)-DNA affinity is a central goal of regulatory genomics. Although thermodynamics provides the most natural language for describing the continuous range of TF-DNA affinity, traditional motif discovery algorithms focus instead on classification paradigms that aim to discriminate ‘bound’ and ‘unbound’ sequences. Moreover, these algorithms do not directly model the distribution of tags in ChIP-seq data. Here, we present a new algorithm named Thermodynamic Modeling of ChIP-seq (TherMos), which directly estimates a position-specific binding energy matrix (PSEM) from ChIP-seq/exo tag profiles. In cross-validation tests on seven genome-wide TF-DNA binding profiles, one of which we generated via ChIP-seq on a complex developing tissue, TherMos predicted quantitative TF-DNA binding with greater accuracy than five well-known algorithms. We experimentally validated TherMos binding energy models for Klf4 and Esrrb, using a novel protocol to measure PSEMs in vitro. Strikingly, our measurements revealed strong non-additivity at multiple positions within the two PSEMs. Among the algorithms tested, only TherMos was able to model the entire binding energy landscape of Klf4 and Esrrb. Our study reveals new insights into the energetics of TF-DNA binding in vivo and provides an accurate first-principles approach to binding energy inference from ChIP-seq and ChIP-exo data.
Affiliation(s)
- Wenjie Sun
- Computational and Systems Biology, Genome Institute of Singapore, 60 Biopolis St, Singapore 138672, Singapore
|
45
|
Abstract
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, they have been adapted to many new types of data and many different approaches have been developed to determine the parameters of the PWM. New high-throughput technologies provide a large amount of data rapidly and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also aid in determining when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, a brief history of the approaches that have been developed and the types of data that are used with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
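The scoring idea summarized in this abstract can be sketched in a few lines. The example below is a minimal illustration, assuming log2-odds weights against a uniform background and a pseudocount of 1; the function names (`build_pwm`, `score_site`, `scan`, `best_hit`) and the toy TATA-like sites are hypothetical and are not taken from the article.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log2-odds PWM (list of per-position dicts) from aligned sites."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumed uniform background
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background[b])
                    for b in BASES})
    return pwm

def score_site(pwm, site):
    """Score a candidate site as the sum of independent per-position weights."""
    return sum(column[base] for column, base in zip(pwm, site))

def scan(pwm, sequence):
    """Score every window of motif width along a sequence."""
    width = len(pwm)
    return [(i, score_site(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

def best_hit(pwm, sequence):
    """Return the (offset, score) pair of the highest-scoring window."""
    return max(scan(pwm, sequence), key=lambda hit: hit[1])

# Toy training set (hypothetical aligned sites).
toy_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
toy_pwm = build_pwm(toy_sites)
offset, score = best_hit(toy_pwm, "GGGTATAAAGGG")
```

Because the score is a plain sum over positions, the model cannot express a preference for particular nucleotide combinations, which is the case in which the abstract says the PWM must be extended.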
|
46
|
Mukherjee R, Evans P, Singh LN, Hannenhalli S. Correlated evolution of positions within mammalian cis elements. PLoS One 2013; 8:e55521. [PMID: 23408994 PMCID: PMC3568137 DOI: 10.1371/journal.pone.0055521] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/09/2012] [Accepted: 12/27/2012] [Indexed: 12/26/2022] Open
Abstract
Transcriptional regulation critically depends on proper interactions between transcription factors (TF) and their cognate DNA binding sites. The widely used model of TF-DNA binding – the Positional Weight Matrix (PWM) – presumes independence between positions within the binding site. However, there is evidence to show that the independence assumption may not always hold, and the extent of interposition dependence is not completely known. We hypothesize that the interposition dependence should partly be manifested as correlated evolution at the positions. We report a Maximum-Likelihood (ML) approach to infer correlated evolution at any two positions within a PWM, based on a multiple alignment of 5 mammalian genomes. Application to a genome-wide set of putative cis elements in human promoters reveals a prevalence of correlated evolution within cis elements. We found that the interdependence between two positions decreases with increasing distance between the positions. The interdependent positions tend to be evolutionarily more constrained and moreover, the dependence patterns are relatively similar across structurally related transcription factors. Although some of the detected mutational dependencies may be due to context-dependent genomic hyper-mutation, notably CG to TG, the majority is likely due to context-dependent preferences for specific nucleotide combinations within the cis elements. Patterns of evolution at individual nucleotide positions within mammalian TF binding sites are often significantly correlated, suggesting interposition dependence. The proposed methodology is also applicable to other classes of non-coding functional elements. A detailed investigation of mutational dependencies within specific motifs could reveal preferred nucleotide combinations that may help refine the DNA binding models.
Affiliation(s)
- Rithun Mukherjee
- Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- * E-mail: (RM); (SH)
- Perry Evans
- Department of Pathology, School of Medicine, Yale University, New Haven, Connecticut, United States of America
- Larry N. Singh
- Genetic Diseases Research Branch, NHGRI, NIH, Bethesda, Maryland, United States of America
- Sridhar Hannenhalli
- Center for Bioinformatics and Computational Biology, Department of Cell and Molecular Biology, University of Maryland, College Park, Maryland, United States of America
- * E-mail: (RM); (SH)
|
47
|
Abstract
The electrophoretic mobility shift assay (EMSA) is a sensitive, relatively straightforward methodology used to detect sequence-specific DNA-protein interactions. It is the fundamental procedure of several variants that allow qualitative and quantitative assessments of protein-nucleic acid complexes. Classically, nuclear proteins and DNA are combined and the resulting mixture is electrophoretically separated in a polyacrylamide or agarose gel under native conditions. The distribution within the gel is generally detected with autoradiography of the ³²P-labeled DNA. The underlying principle is that nucleic acid with protein bound to it will migrate more slowly through a gel matrix than the free nucleic acid. In this chapter, a representative protocol is described that addresses specific challenges of using whole embryos as the nuclear protein source, and the most common and informative EMSA variant, the "supershift," is also presented. The important points are underscored and approaches for troubleshooting are explained. References are provided for alternative methods and extensions of the basic protocol.
|
48
|
Liu LA, Bradley P. Atomistic modeling of protein-DNA interaction specificity: progress and applications. Curr Opin Struct Biol 2012; 22:397-405. [PMID: 22796087 DOI: 10.1016/j.sbi.2012.06.002] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 03/19/2012] [Accepted: 06/20/2012] [Indexed: 12/22/2022]
Abstract
An accurate, predictive understanding of protein-DNA binding specificity is crucial for the successful design and engineering of novel protein-DNA binding complexes. In this review, we summarize recent studies that use atomistic representations of interfaces to predict protein-DNA binding specificity computationally. Although methods with limited structural flexibility have proven successful at recapitulating consensus binding sequences from wild-type complex structures, conformational flexibility is likely important for design and template-based modeling, where non-native conformations need to be sampled and accurately scored. A successful application of such computational modeling techniques in the construction of the TAL-DNA complex structure is discussed. With continued improvements in energy functions, solvation models, and conformational sampling, we are optimistic that reliable and large-scale protein-DNA binding prediction and engineering is a goal within reach.
|
49
|
Zhao Y, Ruan S, Pandey M, Stormo GD. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 2012; 191:781-90. [PMID: 22505627 PMCID: PMC3389974 DOI: 10.1534/genetics.112.138685] [Citation(s) in RCA: 98] [Impact Index Per Article: 8.2] [Received: 08/22/2011] [Accepted: 04/07/2012] [Indexed: 12/27/2022] Open
Abstract
Identifying transcription factor (TF) binding sites is essential for understanding regulatory networks. The specificity of most TFs is currently modeled using position weight matrices (PWMs) that assume the positions within a binding site contribute independently to binding affinity for any site. Extensive, high-throughput quantitative binding assays let us examine, for the first time, the independence assumption for many TFs. We find that the specificity of most TFs is well fit with the simple PWM model, but in some cases more complex models are required. We introduce a binding energy model (BEM) that can include energy parameters for nonindependent contributions to binding affinity. We show that in most cases where a PWM is not sufficient, a BEM that includes energy parameters for adjacent dinucleotide contributions models the specificity very well. Having more accurate models of specificity greatly improves the interpretation of in vivo TF localization data, such as from chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments.
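To make the contrast between an additive model and one with adjacent-dinucleotide terms concrete, here is a minimal sketch. It assumes the common convention that lower energy means stronger binding; the parameter values, the dictionary layout, and the function names `pwm_energy` and `bem_energy` are invented for illustration and are not the parameterization or fitting procedure used in the paper.

```python
def pwm_energy(site, mono):
    """Additive model: one energy term per position, summed independently."""
    return sum(mono[i][base] for i, base in enumerate(site))

def bem_energy(site, mono, di):
    """Additive energy plus corrections for adjacent dinucleotides.

    `di` maps (position, dinucleotide) to an energy correction; pairs that
    are absent from the mapping contribute nothing.
    """
    energy = pwm_energy(site, mono)
    for i in range(len(site) - 1):
        energy += di.get((i, site[i:i + 2]), 0.0)
    return energy

# Hypothetical parameters for a 3-bp site with consensus "ACG".
toy_mono = [
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 1.0},
    {"A": 1.0, "C": 0.0, "G": 1.0, "T": 1.0},
    {"A": 1.0, "C": 1.0, "G": 0.0, "T": 1.0},
]
toy_di = {
    (0, "AC"): -0.5,  # this adjacent pair binds better than additivity predicts
    (1, "TT"): 0.8,   # this adjacent pair binds worse than additivity predicts
}

additive = pwm_energy("ACG", toy_mono)
with_pairs = bem_energy("ACG", toy_mono, toy_di)
penalized = bem_energy("ATT", toy_mono, toy_di)
```

With an empty `di` mapping the second function reduces to the first, which reflects the abstract's point that the simple model is a special case that suffices for most TFs.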
Affiliation(s)
- Yue Zhao
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
- Shuxiang Ruan
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
- Manishi Pandey
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
- Gary D. Stormo
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63108
|
50
|
Tan M, Yu D, Jin Y, Dou L, Li B, Wang Y, Yue J, Liang L. An information transmission model for transcription factor binding at regulatory DNA sites. Theor Biol Med Model 2012; 9:19. [PMID: 22672438 PMCID: PMC3442977 DOI: 10.1186/1742-4682-9-19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/08/2012] [Accepted: 05/17/2012] [Indexed: 11/10/2022] Open
Abstract
Background: Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With the increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been more important. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements. Results: Based on information theory, we propose a model of transcription factor binding to regulatory DNA sites. The model incorporates position interdependencies. It computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion for deciding whether a sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. Theoretical analysis and tests on both real and artificial data indicate that the model gives highly accurate predictions. Conclusions: We present a novel model of transcription factor binding to regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Affiliation(s)
- Mingfeng Tan
- Beijing Institute of Biotechnology, Beijing 100071, China
|