1
|
Oka H, Kojima T, Kato R, Ihara K, Nakano H. Construction of transcript regulation mechanism prediction models based on binding motif environment of transcription factor AoXlnR in Aspergillus oryzae. J Bioinform Comput Biol 2024; 22:2450017. [PMID: 39051143 DOI: 10.1142/s0219720024500173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/27/2024]
Abstract
DNA-binding transcription factors (TFs) play a central role in transcriptional regulation mechanisms, mainly through their specific binding to target sites on the genome and regulation of the expression of downstream genes. Therefore, a comprehensive analysis of the function of these TFs will lead to the understanding of various biological mechanisms. However, the functions of TFs in vivo are diverse and complicated, and the identified binding sites on the genome are not necessarily involved in the regulation of downstream gene expression. In this study, we investigated whether DNA structural information around the binding site of TFs can be used to predict the involvement of the binding site in the regulation of the expression of genes located downstream of the binding site. Specifically, we calculated the structural parameters based on the DNA shape around the DNA binding motif located upstream of the gene whose expression is directly regulated by one TF AoXlnR from Aspergillus oryzae, and showed that the presence or absence of expression regulation can be predicted from the sequence information with high accuracy ([Formula: see text]-1.0) by machine learning incorporating these parameters.
Collapse
Affiliation(s)
- Hiroya Oka
- Department of Applied Biosciences, Graduate School of Bioagricultural Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Takaaki Kojima
- Department of Applied Biosciences, Graduate School of Bioagricultural Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
- Department of Agrobiological Resources, Faculty of Agriculture, Meijo University, Shiogamaguchi, Tempaku Nagoya 468-8502, Japan
| | - Ryuji Kato
- Department of Basic Medicinal Sciences, Graduate School of Pharmaceutical Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
| | - Kunio Ihara
- Center for Gene Research, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Japan
| | - Hideo Nakano
- Department of Applied Biosciences, Graduate School of Bioagricultural Sciences, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Japan
| |
Collapse
|
2
|
Proft S, Leiz J, Heinemann U, Seelow D, Schmidt-Ott KM, Rutkiewicz M. Discovery of a non-canonical GRHL1 binding site using deep convolutional and recurrent neural networks. BMC Genomics 2023; 24:736. [PMID: 38049725 PMCID: PMC10696883 DOI: 10.1186/s12864-023-09830-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Accepted: 11/22/2023] [Indexed: 12/06/2023] Open
Abstract
BACKGROUND Transcription factors regulate gene expression by binding to transcription factor binding sites (TFBSs). Most models for predicting TFBSs are based on position weight matrices (PWMs), which require a specific motif to be present in the DNA sequence and do not consider interdependencies of nucleotides. Novel approaches such as Transcription Factor Flexible Models or recurrent neural networks consequently provide higher accuracies. However, it is unclear whether such approaches can uncover novel non-canonical, hitherto unexpected TFBSs relevant to human transcriptional regulation. RESULTS In this study, we trained a convolutional recurrent neural network with HT-SELEX data for GRHL1 binding and applied it to a set of GRHL1 binding sites obtained from ChIP-Seq experiments from human cells. We identified 46 non-canonical GRHL1 binding sites, which were not found by a conventional PWM approach. Unexpectedly, some of the newly predicted binding sequences lacked the CNNG core motif, so far considered obligatory for GRHL1 binding. Using isothermal titration calorimetry, we experimentally confirmed binding between the GRHL1-DNA binding domain and predicted GRHL1 binding sites, including a non-canonical GRHL1 binding site. Mutagenesis of individual nucleotides revealed a correlation between predicted binding strength and experimentally validated binding affinity across representative sequences. This correlation was neither observed with a PWM-based nor another deep learning approach. CONCLUSIONS Our results show that convolutional recurrent neural networks may uncover unanticipated binding sites and facilitate quantitative transcription factor binding predictions.
Collapse
Affiliation(s)
- Sebastian Proft
- Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany
| | - Janna Leiz
- Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany
- Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany
- Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
| | - Udo Heinemann
- Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany.
| | - Dominik Seelow
- Exploratory Diagnostic Sciences, Berlin Institute of Health, Charité - Universitätsmedizin Berlin, 10117, Berlin, Germany.
- Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 13353, Berlin, Germany.
| | - Kai M Schmidt-Ott
- Department of Nephrology and Hypertension, Hannover Medical School, 30625, Hannover, Germany.
- Department of Nephrology and Intensive Care Medicine, Charité - Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt-Universität zu Berlin, 12203, Berlin, Germany.
- Molecular and Translational Kidney Research, Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany.
| | - Maria Rutkiewicz
- Macromolecular Structure and Interaction, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 13125, Berlin, Germany
- Department of Structural Biology of Eukaryotes, Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznań, 61-704, Poland
| |
Collapse
|
3
|
Grau J, Schmidt F, Schulz MH. Widespread effects of DNA methylation and intra-motif dependencies revealed by novel transcription factor binding models. Nucleic Acids Res 2023; 51:e95. [PMID: 37650641 PMCID: PMC10570048 DOI: 10.1093/nar/gkad693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 07/20/2023] [Accepted: 08/10/2023] [Indexed: 09/01/2023] Open
Abstract
Several studies suggested that transcription factor (TF) binding to DNA may be impaired or enhanced by DNA methylation. We present MeDeMo, a toolbox for TF motif analysis that combines information about DNA methylation with models capturing intra-motif dependencies. In a large-scale study using ChIP-seq data for 335 TFs, we identify novel TFs that show a binding behaviour associated with DNA methylation. Overall, we find that the presence of CpG methylation decreases the likelihood of binding for the majority of methylation-associated TFs. For a considerable subset of TFs, we show that intra-motif dependencies are pivotal for accurately modelling the impact of DNA methylation on TF binding. We illustrate that the novel methylation-aware TF binding models allow to predict differential ChIP-seq peaks and improve the genome-wide analysis of TF binding. Our work indicates that simplistic models that neglect the effect of DNA methylation on DNA binding may lead to systematic underperformance for methylation-associated TFs.
Collapse
Affiliation(s)
- Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle 06120, Germany
| | - Florian Schmidt
- Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
- Systems Biology and Data Analytics, Genome Institute of Singapore, Singapore 13862, Singapore
- ImmunoScape Pte Ltd, Singapore 228208, Singapore
| | - Marcel H Schulz
- Goethe-University Frankfurt, Institute for Cardiovascular Regeneration, Theodor-Stern-Kai 7, 60590 Frankfurt, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken 66123, Germany
- German Center for Cardiovascular Research, Partner site Rhein-Main, 60590 Frankfurt am Main, Germany
- Cardio-Pulmonary Institute, Goethe University, Frankfurt am Main, Germany
| |
Collapse
|
4
|
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans. BMC Genomics 2023; 24:597. [PMID: 37805453 PMCID: PMC10560430 DOI: 10.1186/s12864-023-09692-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Accepted: 09/21/2023] [Indexed: 10/09/2023] Open
Abstract
BACKGROUND Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. RESULTS Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and provide candidates responsible for TFs and cell types. CONCLUSIONS Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.
Collapse
Affiliation(s)
- Saeko Tahara
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- School of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Takaho Tsuchiya
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Hirotaka Matsumoto
- School of Information and Data Sciences, Nagasaki University, 1-14, Bunkyo-Machi, Nagasaki City, Nagasaki, 852-8521, Japan
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
| | - Haruka Ozaki
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan.
| |
Collapse
|
5
|
Yin YH, Shen LC, Jiang Y, Gao S, Song J, Yu DJ. Improving the prediction of DNA-protein binding by integrating multi-scale dense convolutional network with fault-tolerant coding. Anal Biochem 2022; 656:114878. [DOI: 10.1016/j.ab.2022.114878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2022] [Revised: 08/18/2022] [Accepted: 08/23/2022] [Indexed: 11/01/2022]
|
6
|
Jin Y, Jiang J, Wang R, Qin ZS. Systematic Evaluation of DNA Sequence Variations on in vivo Transcription Factor Binding Affinity. Front Genet 2021; 12:667866. [PMID: 34567058 PMCID: PMC8458901 DOI: 10.3389/fgene.2021.667866] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 08/02/2021] [Indexed: 02/01/2023] Open
Abstract
The majority of the single nucleotide variants (SNVs) identified by genome-wide association studies (GWAS) fall outside of the protein-coding regions. Elucidating the functional implications of these variants has been a major challenge. A possible mechanism for functional non-coding variants is that they disrupted the canonical transcription factor (TF) binding sites that affect the in vivo binding of the TF. However, their impact varies since many positions within a TF binding motif are not well conserved. Therefore, simply annotating all variants located in putative TF binding sites may overestimate the functional impact of these SNVs. We conducted a comprehensive survey to study the effect of SNVs on the TF binding affinity. A sequence-based machine learning method was used to estimate the change in binding affinity for each SNV located inside a putative motif site. From the results obtained on 18 TF binding motifs, we found that there is a substantial variation in terms of a SNV’s impact on TF binding affinity. We found that only about 20% of SNVs located inside putative TF binding sites would likely to have significant impact on the TF-DNA binding.
Collapse
Affiliation(s)
- Yutong Jin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Jiahui Jiang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| | - Ruixuan Wang
- College of Environmental Sciences and Engineering, Peking University, Beijing, China
| | - Zhaohui S Qin
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, United States
| |
Collapse
|
7
|
Käppel S, Eggeling R, Rümpler F, Groth M, Melzer R, Theißen G. DNA-binding properties of the MADS-domain transcription factor SEPALLATA3 and mutant variants characterized by SELEX-seq. PLANT MOLECULAR BIOLOGY 2021; 105:543-557. [PMID: 33486697 PMCID: PMC7892521 DOI: 10.1007/s11103-020-01108-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/02/2020] [Accepted: 12/11/2020] [Indexed: 05/13/2023]
Abstract
We studied the DNA-binding profile of the MADS-domain transcription factor SEPALLATA3 and mutant variants by SELEX-seq. DNA-binding characteristics of SEPALLATA3 mutant proteins lead us to propose a novel DNA-binding mode. MIKC-type MADS-domain proteins, which function as essential transcription factors in plant development, bind as dimers to a 10-base-pair AT-rich motif termed CArG-box. However, this consensus motif cannot fully explain how the abundant family members in flowering plants can bind different target genes in specific ways. The aim of this study was to better understand the DNA-binding specificity of MADS-domain transcription factors. Also, we wanted to understand the role of a highly conserved arginine residue for binding specificity of the MADS-domain transcription factor family. Here, we studied the DNA-binding profile of the floral homeotic MADS-domain protein SEPALLATA3 by performing SELEX followed by high-throughput sequencing (SELEX-seq). We found a diverse set of bound sequences and could estimate the in vitro binding affinities of SEPALLATA3 to a huge number of different sequences. We found evidence for the preference of AT-rich motifs as flanking sequences. Whereas different CArG-boxes can act as SEPALLATA3 binding sites, our findings suggest that the preferred flanking motifs are almost always the same and thus mostly independent of the identity of the central CArG-box motif. Analysis of SEPALLATA3 proteins with a single amino acid substitution at position 3 of the DNA-binding MADS-domain further revealed that the conserved arginine residue, which has been shown to be involved in a shape readout mechanism, is especially important for the recognition of nucleotides at positions 3 and 8 of the CArG-box motif. This leads us to propose a novel DNA-binding mode for SEPALLATA3, which is different from that of other MADS-domain proteins known.
Collapse
Affiliation(s)
- Sandra Käppel
- Matthias Schleiden Institute/Genetics, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany
| | - Ralf Eggeling
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, 00014, Helsinki, Finland
- Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Sand 14, 72076, Tübingen, Germany
- Institute for Biomedical Informatics, University of Tübingen, Tübingen, Germany
| | - Florian Rümpler
- Matthias Schleiden Institute/Genetics, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany
| | - Marco Groth
- Leibniz Institute on Aging-Fritz Lipmann Institute (FLI), Core Facility DNA Sequencing, Beutenbergstraße 11, 07745, Jena, Germany
| | - Rainer Melzer
- School of Biology and Environmental Science and Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
| | - Günter Theißen
- Matthias Schleiden Institute/Genetics, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany.
| |
Collapse
|
8
|
Chiu TP, Xin B, Markarian N, Wang Y, Rohs R. TFBSshape: an expanded motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2020; 48:D246-D255. [PMID: 31665425 PMCID: PMC7145579 DOI: 10.1093/nar/gkz970] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2019] [Revised: 10/08/2019] [Accepted: 10/11/2019] [Indexed: 12/31/2022] Open
Abstract
TFBSshape (https://tfbsshape.usc.edu) is a motif database for analyzing structural profiles of transcription factor binding sites (TFBSs). The main rationale for this database is to be able to derive mechanistic insights in protein-DNA readout modes from sequencing data without available structures. We extended the quantity and dimensionality of TFBSshape, from mostly in vitro to in vivo binding and from unmethylated to methylated DNA. This new release of TFBSshape improves its functionality and launches a responsive and user-friendly web interface for easy access to the data. The current expansion includes new entries from the most recent collections of transcription factors (TFs) from the JASPAR and UniPROBE databases, methylated TFBSs derived from in vitro high-throughput EpiSELEX-seq binding assays and in vivo methylated TFBSs from the MeDReaders database. TFBSshape content has increased to 2428 structural profiles for 1900 TFs from 39 different species. The structural profiles for each TFBS entry now include 13 shape features and minor groove electrostatic potential for standard DNA and four shape features for methylated DNA. We improved the flexibility and accuracy for the shape-based alignment of TFBSs and designed new tools to compare methylated and unmethylated structural profiles of TFs and methods to derive DNA shape-preserving nucleotide mutations in TFBSs.
Collapse
Affiliation(s)
- Tsu-Pei Chiu
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Beibei Xin
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Nicholas Markarian
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Yingfei Wang
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Remo Rohs
- Quantitative and Computational Biology, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
9
|
Gheorghe M, Sandve GK, Khan A, Chèneby J, Ballester B, Mathelier A. A map of direct TF-DNA interactions in the human genome. Nucleic Acids Res 2019; 47:e21. [PMID: 30517703 PMCID: PMC6393237 DOI: 10.1093/nar/gky1210] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2018] [Revised: 10/31/2018] [Accepted: 11/20/2018] [Indexed: 12/11/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most popular assay to identify genomic regions, called ChIP-seq peaks, that are bound in vivo by transcription factors (TFs). These regions are derived from direct TF-DNA interactions, indirect binding of the TF to the DNA (through a co-binding partner), nonspecific binding to the DNA, and noise/bias/artifacts. Delineating the bona fide direct TF-DNA interactions within the ChIP-seq peaks remains challenging. We developed a dedicated software, ChIP-eat, that combines computational TF binding models and ChIP-seq peaks to automatically predict direct TF-DNA interactions. Our work culminated with predicted interactions covering >4% of the human genome, obtained by uniformly processing 1983 ChIP-seq peak data sets from the ReMap database for 232 unique TFs. The predictions were a posteriori assessed using protein binding microarray and ChIP-exo data, and were predominantly found in high quality ChIP-seq peaks. The set of predicted direct TF-DNA interactions suggested that high-occupancy target regions are likely not derived from direct binding of the TFs to the DNA. Our predictions derived co-binding TFs supported by protein-protein interaction data and defined cis-regulatory modules enriched for disease- and trait-associated SNPs. We provide this collection of direct TF-DNA interactions and cis-regulatory modules through the UniBind web-interface (http://unibind.uio.no).
Collapse
Affiliation(s)
- Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | | | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway
| | - Jeanne Chèneby
- Aix Marseille Université, INSERM, TAGC, Marseille, France
| | | | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), University of Oslo, Oslo, Norway.,Department of Cancer Genetics, Institute for Cancer Research, Radiumhospitalet, Oslo, Norway
| |
Collapse
|
10
|
Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, Bessy A, Chèneby J, Kulkarni SR, Tan G, Baranasic D, Arenillas DJ, Sandelin A, Vandepoele K, Lenhard B, Ballester B, Wasserman WW, Parcy F, Mathelier A. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res 2019; 46:D260-D266. [PMID: 29140473 PMCID: PMC5753243 DOI: 10.1093/nar/gkx1126] [Citation(s) in RCA: 904] [Impact Index Per Article: 150.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2017] [Accepted: 10/27/2017] [Indexed: 12/31/2022] Open
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups. In the 2018 release of JASPAR, the CORE collection has been expanded with 322 new PFMs (60 for vertebrates and 262 for plants) and 33 PFMs were updated (24 for vertebrates, 8 for plants and 1 for insects). These new profiles represent a 30% expansion compared to the 2016 release. In addition, we have introduced 316 TFFMs (95 for vertebrates, 218 for plants and 3 for insects). This release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. The JASPAR 2018 CORE vertebrate collection of PFMs was used to predict TF-binding sites in the human genome. The predictions are made available to the scientific community through a UCSC Genome Browser track data hub. Finally, this update comes with a new web framework with an interactive and responsive user-interface, along with new features. All the underlying data can be retrieved programmatically using a RESTful API and through the JASPAR 2018 R/Bioconductor package.
Collapse
Affiliation(s)
- Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
| | - Arnaud Stigliani
- University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Robin van der Lee
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
| | - Adrien Bessy
- University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Jeanne Chèneby
- INSERM, UMR1090 TAGC, Marseille, F-13288, France.,Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Shubhada R Kulkarni
- Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium.,VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium.,Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
| | - Ge Tan
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK.,Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
| | - Damir Baranasic
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK.,Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
| | - David J Arenillas
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
| | - Albin Sandelin
- The Bioinformatics Centre, Department of Biology and Biotech Research & Innovation Centre, University of Copenhagen, DK2200 Copenhagen N, Denmark
| | - Klaas Vandepoele
- Ghent University, Department of Plant Biotechnology and Bioinformatics, Technologiepark 927, 9052 Ghent, Belgium.,VIB Center for Plant Systems Biology, Technologiepark 927, 9052 Ghent, Belgium.,Bioinformatics Institute Ghent, Ghent University, Technologiepark 927, 9052 Ghent, Belgium
| | - Boris Lenhard
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK.,Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK.,Sars International Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
| | - Benoît Ballester
- INSERM, UMR1090 TAGC, Marseille, F-13288, France.,Aix-Marseille Université, UMR1090 TAGC, Marseille, F-13288, France
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 28th Ave W, Vancouver, BC V5Z 4H4, Canada
| | - François Parcy
- University of Grenoble Alpes, CNRS, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway.,Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0310 Oslo, Norway
| |
Collapse
|
11
|
Kulakovskiy IV, Vorontsov IE, Yevshin IS, Sharipov RN, Fedorova AD, Rumynskiy EI, Medvedeva YA, Magana-Mora A, Bajic VB, Papatsenko DA, Kolpakov FA, Makeev VJ. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res 2019; 46:D252-D259. [PMID: 29140464 PMCID: PMC5753240 DOI: 10.1093/nar/gkx1106] [Citation(s) in RCA: 573] [Impact Index Per Article: 95.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2017] [Accepted: 10/31/2017] [Indexed: 12/15/2022] Open
Abstract
We present a major update of the HOCOMOCO collection that consists of patterns describing DNA binding specificities for human and mouse transcription factors. In this release, we profited from a nearly doubled volume of published in vivo experiments on transcription factor (TF) binding to expand the repertoire of binding models, replace low-quality models previously based on in vitro data only and cover more than a hundred TFs with previously unknown binding specificities. This was achieved by systematic motif discovery from more than five thousand ChIP-Seq experiments uniformly processed within the BioUML framework with several ChIP-Seq peak calling tools and aggregated in the GTRD database. HOCOMOCO v11 contains binding models for 453 mouse and 680 human transcription factors and includes 1302 mononucleotide and 576 dinucleotide position weight matrices, which describe primary binding preferences of each transcription factor and reliable alternative binding specificities. An interactive interface and bulk downloads are available on the web: http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco11. In this release, we complement HOCOMOCO by MoLoTool (Motif Location Toolbox, http://molotool.autosome.ru) that applies HOCOMOCO models for visualization of binding sites in short DNA sequences.
Collapse
Affiliation(s)
- Ivan V Kulakovskiy
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 119991, GSP-1, Vavilova 32, Moscow, Russia.,Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia.,Center for Data-Intensive Biomedicine and Biotechnology, Skolkovo Institute of Science and Technology, 143026 Moscow, Russia
| | - Ilya E Vorontsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia
| | - Ivan S Yevshin
- BIOSOFT.RU Ltd, 630058, Russkaya 41/1, Novosibirsk, Russia
| | - Ruslan N Sharipov
- BIOSOFT.RU Ltd, 630058, Russkaya 41/1, Novosibirsk, Russia.,Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, 630090, Akad. Rzhanova 6, Novosibirsk, Russia.,Novosibirsk State University, 630090, Pirogova 2, Novosibirsk, Russia
| | - Alla D Fedorova
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234, Leninskiye Gory 1-73, Moscow, Russia
| | - Eugene I Rumynskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia.,Moscow Institute of Physics and Technology (State University), 141700, 9 Institutskiy per, Dolgoprudny, Russia
| | - Yulia A Medvedeva
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia.,Moscow Institute of Physics and Technology (State University), 141700, 9 Institutskiy per, Dolgoprudny, Russia.,Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, 119071, 2 Leninsky Ave. 33, Moscow, Russia
| | - Arturo Magana-Mora
- National Institute of Advanced Industrial Science and Technology (AIST), Com. Bio Big-Data Open Innovation Lab. (CBBD-OIL), AIST Tokyo Waterfront Main Bldg. #323, 2-3-26 Aomi, Tokyo 135-0064, Japan.,King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Saudi Arabia
| | - Dmitry A Papatsenko
- Center for Data-Intensive Biomedicine and Biotechnology, Skolkovo Institute of Science and Technology, 143026 Moscow, Russia
| | - Fedor A Kolpakov
- BIOSOFT.RU Ltd, 630058, Russkaya 41/1, Novosibirsk, Russia.,Institute of Computational Technologies, Siberian Branch of the Russian Academy of Sciences, 630090, Akad. Rzhanova 6, Novosibirsk, Russia
| | - Vsevolod J Makeev
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 119991, GSP-1, Vavilova 32, Moscow, Russia.,Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, GSP-1, Gubkina 3, Moscow, Russia.,Moscow Institute of Physics and Technology (State University), 141700, 9 Institutskiy per, Dolgoprudny, Russia
| |
Collapse
|
12
|
Zhang SW, Wang Y, Zhang XX, Wang JQ. Prediction of the RBP binding sites on lncRNAs using the high-order nucleotide encoding convolutional neural network. Anal Biochem 2019; 583:113364. [PMID: 31323206 DOI: 10.1016/j.ab.2019.113364] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2019] [Revised: 07/10/2019] [Accepted: 07/15/2019] [Indexed: 01/09/2023]
Abstract
Long non-coding RNA (lncRNA) plays an important role in cells through the interaction with RNA-binding proteins (RBPs). Finding the RBPs binding sites on the lncRNA chains can help to understand the post-transcriptional regulatory mechanism, exploring the pathogenesis of cancers and possible roles in other diseases. Although many genome-wide RBP experimental techniques can identify the RNA-protein interactions and detect the binding sites on RNA chains, they are still time-consuming, labor-intensive and cost-heavy. Thus, many computational methods have been developed to predict the RBPs sites by integrating the RNA sequence, structure and domain specific features, etc. However, current approaches that focus on predicting the RBPs binding sites on RNA chains lack a consideration of the dependencies among nucleotides. In this work, we propose a higher-order nucleotide encoding convolutional neural network-based method (namely HOCNNLB) to predict the RBPs binding sites on lncRNA chains. HOCNNLB first employs a high-order one-hot encoding strategy to encode the lncRNA sequences by considering the dependence among nucleotides, then the encoded lncRNA sequences are fed into the convolutional neural network (CNN) to predict the RBP binding sites. We evaluate HOCNNLB on 31 experimental datasets of 12 lncRNA binding proteins. The average AUC of HOCNNLB achieves 0.953, which is 0.247, 0.175 higher than that of iDeepS and DeepBind, respectively. The average accuracy is 90.2%, which is 26.8%, 19.5% higher than that of iDeepS and DeepBind, respectively. These results demonstrate that HOCNNLB can reliably predict the RBP binding sites on lncRNA chains and outperforms the state-of-the-art methods. The source code of HOCNNLB and the datasets used in this work are available at https://github.com/NWPU-903PR/HOCNNLB for academic users.
Collapse
Affiliation(s)
- Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China.
| | - Ya Wang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
| | - Xi-Xi Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
| | - Jia-Qi Wang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
| |
Collapse
|
13
|
Eggeling R. Disentangling transcription factor binding site complexity. Nucleic Acids Res 2019; 46:e121. [PMID: 30085218 PMCID: PMC6237759 DOI: 10.1093/nar/gky683] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2018] [Accepted: 07/17/2018] [Indexed: 12/15/2022] Open
Abstract
The binding motifs of many transcription factors (TFs) comprise a higher degree of complexity than a single position weight matrix model permits. Additional complexity is typically taken into account either as intra-motif dependencies via more sophisticated probabilistic models or as heterogeneities via multiple weight matrices. However, both orthogonal approaches have limitations when learning from in vivo data where binding sites of other factors in close proximity can interfere with motif discovery for the protein of interest. In this work, we demonstrate how intra-motif complexity can, purely by analyzing the statistical properties of a given set of TF-binding sites, be distinguished from complexity arising from an intermix with motifs of co-binding TFs or other artifacts. In addition, we study the related question whether intra-motif complexity is represented more effectively by dependencies, heterogeneities or variants in between. Benchmarks demonstrate the effectiveness of both methods for their respective tasks and applications on motif discovery output from recent tools detect and correct many undesirable artifacts. These results further suggest that the prevalence of intra-motif dependencies may have been overestimated in previous studies on in vivo data and should thus be reassessed.
Collapse
Affiliation(s)
- Ralf Eggeling
- Department of Computer Science, University of Helsinki, Gustaf-Hällströmin katu 2b, FIN-00140 Helsinki, Finland
| |
Collapse
|
14
|
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Although Deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they lack to consider the dependencies among nucleotides and the diverse binding lengths for different transcription factors (TFs). To address the above two limitations simultaneously, in this paper, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides, and a multi-scale convolutional layer to capture the motif features of different length. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights about the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order in the motif discovery task.
Collapse
|
15
|
Zhang Q, Shen Z, Huang DS. Modeling in-vivo protein-DNA binding by combining multiple-instance learning with a hybrid deep neural network. Sci Rep 2019; 9:8484. [PMID: 31186519 PMCID: PMC6559991 DOI: 10.1038/s41598-019-44966-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 05/15/2019] [Indexed: 01/26/2023] Open
Abstract
Modeling in-vivo protein-DNA binding is not only fundamental for further understanding of the regulatory mechanisms, but also a challenging task in computational biology. Deep-learning based methods have succeed in modeling in-vivo protein-DNA binding, but they often (1) follow the fully supervised learning framework and overlook the weakly supervised information of genomic sequences that a bound DNA sequence may has multiple TFBS(s), and, (2) use one-hot encoding to encode DNA sequences and ignore the dependencies among nucleotides. In this paper, we propose a weakly supervised framework, which combines multiple-instance learning with a hybrid deep neural network and uses k-mer encoding to transform DNA sequences, for modeling in-vivo protein-DNA binding. Firstly, this framework segments sequences into multiple overlapping instances using a sliding window, and then encodes all instances into image-like inputs of high-order dependencies using k-mer encoding. Secondly, it separately computes a score for all instances in the same bag using a hybrid deep neural network that integrates convolutional and recurrent neural networks. Finally, it integrates the predicted values of all instances as the final prediction of this bag using the Noisy-and method. The experimental results on in-vivo datasets demonstrate the superior performance of the proposed framework. In addition, we also explore the performance of the proposed framework when using k-mer encoding, and demonstrate the performance of the Noisy-and method by comparing it with other fusion methods, and find that adding recurrent layers can improve the performance of the proposed framework.
Collapse
Affiliation(s)
- Qinhu Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Zhen Shen
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
16
|
Cavalli M, Baltzer N, Umer HM, Grau J, Lemnian I, Pan G, Wallerman O, Spalinskas R, Sahlén P, Grosse I, Komorowski J, Wadelius C. Allele specific chromatin signals, 3D interactions, and motif predictions for immune and B cell related diseases. Sci Rep 2019; 9:2695. [PMID: 30804403 PMCID: PMC6389883 DOI: 10.1038/s41598-019-39633-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2018] [Accepted: 01/24/2019] [Indexed: 12/20/2022] Open
Abstract
Several Genome Wide Association Studies (GWAS) have reported variants associated to immune diseases. However, the identified variants are rarely the drivers of the associations and the molecular mechanisms behind the genetic contributions remain poorly understood. ChIP-seq data for TFs and histone modifications provide snapshots of protein-DNA interactions allowing the identification of heterozygous SNPs showing significant allele specific signals (AS-SNPs). AS-SNPs can change a TF binding site resulting in altered gene regulation and are primary candidates to explain associations observed in GWAS and expression studies. We identified 17,293 unique AS-SNPs across 7 lymphoblastoid cell lines. In this set of cell lines we interrogated 85% of common genetic variants in the population for potential regulatory effect and we identified 237 AS-SNPs associated to immune GWAS traits and 714 to gene expression in B cells. To elucidate possible regulatory mechanisms we integrated long-range 3D interactions data to identify putative target genes and motif predictions to identify TFs whose binding may be affected by AS-SNPs yielding a collection of 173 AS-SNPs associated to gene expression and 60 to B cell related traits. We present a systems strategy to find functional gene regulatory variants, the TFs that bind differentially between alleles and novel strategies to detect the regulated genes.
Collapse
Affiliation(s)
- Marco Cavalli
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Nicholas Baltzer
- Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden
| | - Husen M Umer
- Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
| | - Ioana Lemnian
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany
| | - Gang Pan
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Ola Wallerman
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
| | - Rapolas Spalinskas
- Science for Life Laboratory, Division of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
| | - Pelin Sahlén
- Science for Life Laboratory, Division of Gene Technology, KTH Royal Institute of Technology, Stockholm, Sweden
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.,German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Jan Komorowski
- Department of Cell and Molecular Biology, Computational Biology and Bioinformatics, Uppsala University, Uppsala, Sweden.,Institute of Computer Science, Polish Academy of Sciences, Warszawa, Poland
| | - Claes Wadelius
- Science for Life Laboratory, Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden.
| |
Collapse
|
17
|
Eggeling R, Grosse I, Koivisto M. Algorithms for learning parsimonious context trees. Mach Learn 2018. [DOI: 10.1007/s10994-018-5770-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
18
|
Djordjevic M, Djordjevic M, Zdobnov E. Scoring Targets of Transcription in Bacteria Rather than Focusing on Individual Binding Sites. Front Microbiol 2017; 8:2314. [PMID: 29213263 PMCID: PMC5702782 DOI: 10.3389/fmicb.2017.02314] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2017] [Accepted: 11/09/2017] [Indexed: 11/13/2022] Open
Abstract
Reliable identification of targets of bacterial regulators is necessary to understand bacterial gene expression regulation. These targets are commonly predicted by searching for high-scoring binding sites in the upstream genomic regions, which typically leads to a large number of false positives. In contrast to the common approach, here we propose a novel concept, where overrepresentation of the scoring distribution that corresponds to the entire searched region is assessed, as opposed to predicting individual binding sites. We explore two implementations of this concept, based on Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) tests, which both provide straightforward P-value estimates for predicted targets. This approach is implemented for pleiotropic bacterial regulators, including σ70 (bacterial housekeeping σ factor) target predictions, which is a classical bioinformatics problem characterized by low specificity. We show that KS based approach is both faster and more accurate, departing from the current paradigm of AD being slower, but more accurate. Moreover, KS approach leads to a significant increase in the search accuracy compared to the standard approach, while at the same time straightforwardly assigning well established P-values to each potential target. Consequently, the new KS based method proposed here, which assigns P-values to fixed length upstream regions, provides a fast and accurate approach for predicting bacterial transcription targets.
Collapse
Affiliation(s)
- Marko Djordjevic
- Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Belgrade, Serbia
| | | | - Evgeny Zdobnov
- Swiss Institute of Bioinformatics and Department of Genetic Medicine and Development, University of Geneva, Geneva, Switzerland
| |
Collapse
|
19
|
Eggeling R, Grosse I, Grau J. InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites. Bioinformatics 2017; 33:580-582. [PMID: 28035026 PMCID: PMC5408807 DOI: 10.1093/bioinformatics/btw689] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2016] [Accepted: 10/27/2016] [Indexed: 11/14/2022] Open
Abstract
Summary Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role for an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, taking into account dependencies only if justified by the data while choosing for simplicity otherwise. Availability and Implementation InMoDe is implemented in Java and is available as command line application, as application with a graphical user-interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe.
Collapse
Affiliation(s)
- Ralf Eggeling
- Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.,German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
| |
Collapse
|
20
|
|
21
|
Ye Z, Ma T, Kalmbach MT, Dasari S, Kocher JPA, Wang L. CircularLogo: A lightweight web application to visualize intra-motif dependencies. BMC Bioinformatics 2017; 18:269. [PMID: 28532394 PMCID: PMC5440937 DOI: 10.1186/s12859-017-1680-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2016] [Accepted: 05/11/2017] [Indexed: 01/09/2023] Open
Abstract
Background The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization. Result We developed CircularLogo, a web-based interactive application, which is able to not only visualize the position-specific nucleotide consensus and diversity but also display the intra-motif dependencies. Applying CircularLogo to HNF6 binding sites and tRNA sequences demonstrated its ability to show intra-motif dependencies and intuitively reveal biomolecular structure. CircularLogo is implemented in JavaScript and Python based on the Django web framework. The program’s source code and user’s manual are freely available at http://circularlogo.sourceforge.net. CircularLogo web server can be accessed from http://bioinformaticstools.mayo.edu/circularlogo/index.html. Conclusion CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
Collapse
Affiliation(s)
- Zhenqing Ye
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Tao Ma
- Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
| | - Michael T Kalmbach
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Surendra Dasari
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Jean-Pierre A Kocher
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Liguo Wang
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA. .,Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA.
| |
Collapse
|
22
|
Abstract
Protein-DNA binding plays a central role in gene regulation and by that in all processes in the living cell. Novel experimental and computational approaches facilitate better understanding of protein-DNA binding preferences via high-throughput measurement of protein binding to a large number of DNA sequences and inference of binding models from them. Here we review the state of the art in measuring protein-DNA binding in vitro, emphasizing the advantages and limitations of different technologies. In addition, we describe models for representing protein-DNA binding preferences and key computational approaches to learn those from high-throughput data. Using large experimental data sets, we test the performance of different models based on different measuring techniques. We conclude with pertinent open problems.
Collapse
|
23
|
Nettling M, Treutler H, Cerquides J, Grosse I. Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies. BMC Bioinformatics 2017; 18:141. [PMID: 28249564 PMCID: PMC5333389 DOI: 10.1186/s12859-017-1495-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2016] [Accepted: 01/24/2017] [Indexed: 11/23/2022] Open
Abstract
Background Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for de-novo motif discovery can be subdivided in phylogenetic footprinting that takes into account phylogenetic dependencies in aligned sequences of more than one species and non-phylogenetic approaches based on sequences from only one species that typically take into account intra-motif dependencies. It has been shown that modeling (i) phylogenetic dependencies as well as (ii) intra-motif dependencies separately improves de-novo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously. Results Here, we present an approach for de-novo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intra-motif dependencies. We study the degree of intra-motif dependencies inferred by this approach from ChIP-seq data of 35 transcription factors. We find that significant intra-motif dependencies of orders 1 and 2 are present in all 35 datasets and that intra-motif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intra-motif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1. Conclusion Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1495-1) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.
| | | | - Jesus Cerquides
- Institut d'Investigació en Intel ·ligència Artificial, IIIA-CSIC, Campus UAB, Cerdanyola, Spain
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle, Germany.,German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
| |
Collapse
|
24
|
A Novel Sequence-Based Feature for the Identification of DNA-Binding Sites in Proteins Using Jensen–Shannon Divergence. ENTROPY 2016. [DOI: 10.3390/e18100379] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
25
|
Orenstein Y, Wang Y, Berger B. RCK: accurate and efficient inference of sequence- and structure-based protein-RNA binding models from RNAcompete data. Bioinformatics 2016; 32:i351-i359. [PMID: 27307637 PMCID: PMC4908343 DOI: 10.1093/bioinformatics/btw259] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
MOTIVATION Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein RNA-binding. Among them, RNAcompete provides binding affinities of a specific protein to more than 240 000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state-of-the-art in sequence models, Deepbind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but is outperformed by GraphProt. Unfortunately, GraphProt cannot detect structural preferences from RNAcompete data due to the unstructured nature of the data, as noted by its developers, nor can it be tractably run on the full RNACompete dataset. RESULTS We develop RCK, an efficient, scalable algorithm that infers both sequence and structure preferences based on a new k-mer based model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and Deepbind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is also faster and uses less memory, which enables scalability. While currently on par with existing methods in in vivo binding prediction on a small scale test, we demonstrate that RCK will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale. AVAILABILITY AND IMPLEMENTATION Software and models are freely available at http://rck.csail.mit.edu/ CONTACT bab@mit.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Yuhao Wang
- Computer Science and Artificial Intelligence Laboratory
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory Math Department, MIT, Cambridge, MA, USA
| |
Collapse
|
26
|
Lis M, Walther D. The orientation of transcription factor binding site motifs in gene promoter regions: does it matter? BMC Genomics 2016; 17:185. [PMID: 26939991 PMCID: PMC4778318 DOI: 10.1186/s12864-016-2549-x] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2015] [Accepted: 02/27/2016] [Indexed: 12/23/2022] Open
Abstract
Background Gene expression is to large degree regulated by the specific binding of protein transcription factors to cis-regulatory transcription factor binding sites in gene promoter regions. Despite the identification of hundreds of binding site sequence motifs, the question as to whether motif orientation matters with regard to the gene expression regulation of the respective downstream genes appears surprisingly underinvestigated. Results We pursued a statistical approach by probing 293 reported non-palindromic transcription factor binding site and ten core promoter motifs in Arabidopsis thaliana for evidence of any relevance of motif orientation based on mapping statistics and effects on the co-regulation of gene expression of the respective downstream genes. Although positional intervals closer to the transcription start site (TSS) were found with increased frequencies of motifs exhibiting orientation preference, a corresponding effect with regard to gene expression regulation as evidenced by increased co-expression of genes harboring the favored orientation in their upstream sequence could not be established. Furthermore, we identified an intrinsic orientational asymmetry of sequence regions close to the TSS as the likely source of the identified motif orientation preferences. By contrast, motif presence irrespective of orientation was found associated with pronounced effects on gene expression co-regulation validating the pursued approach. Inspecting motif pairs revealed statistically preferred orientational arrangements, but no consistent effect with regard to arrangement-dependent gene expression regulation was evident. Conclusions Our results suggest that for the motifs considered here, either no specific orientation rendering them functional across all their instances exists with orientational requirements instead depending on gene-locus specific additional factors, or that the binding orientation of transcription factors may generally not be relevant, but rather the event of binding itself. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2549-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Monika Lis
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany.
| | - Dirk Walther
- Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany.
| |
Collapse
|