1
Vorontsov IE, Kozin I, Abramov S, Boytsov A, Jolma A, Albu M, Ambrosini G, Faltejskova K, Gralak AJ, Gryzunov N, Inukai S, Kolmykov S, Kravchenko P, Kribelbauer-Swietek JF, Laverty KU, Nozdrin V, Patel ZM, Penzar D, Plescher ML, Pour SE, Razavi R, Yang AWH, Yevshin I, Zinkevich A, Weirauch MT, Bucher P, Deplancke B, Fornes O, Grau J, Grosse I, Kolpakov FA, Makeev VJ, Hughes TR, Kulakovskiy IV. Cross-platform DNA motif discovery and benchmarking to explore binding specificities of poorly studied human transcription factors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.11.619379. [PMID: 39605530 PMCID: PMC11601219 DOI: 10.1101/2024.11.11.619379] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2024]
Abstract
A DNA sequence pattern, or "motif", is an essential representation of DNA-binding specificity of a transcription factor (TF). Any particular motif model has potential flaws due to shortcomings of the underlying experimental data and computational motif discovery algorithm. As a part of the Codebook/GRECO-BIT initiative, here we evaluated at large scale the cross-platform recognition performance of positional weight matrices (PWMs), which remain popular motif models in many practical applications. We applied ten different DNA motif discovery tools to generate PWMs from the "Codebook" data comprising 4,237 experiments from five different platforms profiling the DNA-binding specificity of 394 human proteins, focusing on understudied transcription factors of different structural families. For many of the proteins, there was no prior knowledge of a genuine motif. By benchmarking-supported human curation, we constructed an approved subset of experiments comprising about 30% of all experiments and 50% of tested TFs which displayed consistent motifs across platforms and replicates. We present the Codebook Motif Explorer (https://mex.autosome.org), a detailed online catalog of DNA motifs, including the top-ranked PWMs, and the underlying source and benchmarking data. We demonstrate that in the case of high-quality experimental data, most of the popular motif discovery tools detect valid motifs and generate PWMs, which perform well both on genomic and synthetic data. Yet, for each of the algorithms, there were problematic combinations of proteins and platforms, and the basic motif properties such as nucleotide composition and information content offered little help in detecting such pitfalls. By combining multiple PWMs in decision trees, we demonstrate how our setup can be readily adapted to train and test binding specificity models more complex than PWMs.
Overall, our study provides a rich motif catalog as a solid baseline for advanced models and highlights the power of the multi-platform multi-tool approach for reliable mapping of DNA binding specificities.
Affiliation(s)
- Ilya E Vorontsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
- Ivan Kozin
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
- Sergey Abramov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA
- Alexandr Boytsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Altius Institute for Biomedical Sciences, 98121, Seattle, WA, USA
- Arttu Jolma
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Mihai Albu
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Katerina Faltejskova
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, 160 00 Praha 6, Czech Republic
  - Computer Science Institute, Faculty of Mathematics and Physics, Charles University, 118 00 Praha 1, Czech Republic
- Antoni J Gralak
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Nikita Gryzunov
  - Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
- Sachi Inukai
  - Chugai Pharmaceutical Co., Ltd, Tokyo, 103-8324, Japan
- Semyon Kolmykov
  - Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia
- Judith F Kribelbauer-Swietek
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Kaitlin U Laverty
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Vladimir Nozdrin
  - Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
- Zain M Patel
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Dmitry Penzar
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Marie-Luise Plescher
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
- Sara E Pour
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Rozita Razavi
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Ally W H Yang
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Arsenii Zinkevich
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
- Philipp Bucher
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Bart Deplancke
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
- Fedor A Kolpakov
  - Department of Computational Biology, Sirius University of Science and Technology, 354340, Sirius, Krasnodar region, Russia
  - Bioinformatics Laboratory, Federal Research Center for Information and Computational Technologies, 630090, Novosibirsk, Russia
- Vsevolod J Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Moscow Center for Advanced Studies, 123592, Moscow, Russia
- Timothy R Hughes
  - Donnelly Centre and Department of Molecular Genetics, Toronto, ON M5S 3E1, Canada
- Ivan V Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Life Improvement by Future Technologies (LIFT) Center, 121205, Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
2
Gralak AJ, Faltejskova K, Yang AW, Steiner C, Russeil J, Grenningloh N, Inukai S, Demir M, Dainese R, Owen C, Pankevich E, Codebook/GRECO-BIT Consortium, Hughes TR, Kulakovskiy IV, Kribelbauer-Swietek JF, van Mierlo G, Deplancke B. Identification of methylation-sensitive human transcription factors using meSMiLE-seq. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.11.619598. [PMID: 39605503 PMCID: PMC11601298 DOI: 10.1101/2024.11.11.619598] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2024]
Abstract
Transcription factors (TFs) are key players in eukaryotic gene regulation, but the DNA binding specificity of many TFs remains unknown. Here, we assayed 284 mostly poorly characterized, putative human TFs using selective microfluidics-based ligand enrichment followed by sequencing (SMiLE-seq), revealing 72 new DNA binding motifs. To investigate whether some of the 158 TFs for which we did not find motifs preferably bind epigenetically modified DNA (i.e. methylated CG dinucleotides), we developed methylation-sensitive SMiLE-seq (meSMiLE-seq). This microfluidic assay simultaneously probes the affinity of a protein to methylated and unmethylated DNA, augmenting the capabilities of the original method to infer methylation-aware binding sites. We assayed 114 TFs with meSMiLE-seq and identified DNA-binding models for 48 proteins, including the known methylation-sensitive binding modes for POU5F1 and RFX5. For 11 TFs, binding to methylated DNA was preferred or resulted in the discovery of alternative, methylation-dependent motifs (e.g. PRDM13), while aversion towards methylated sequences was found for 13 TFs (e.g. USF3). Finally, we uncovered a potential role for ZHX2 as a putative binder of Z-DNA, a left-handed helical DNA structure which is adopted more frequently upon CpG methylation. Altogether, our study significantly expands the human TF codebook by identifying DNA binding motifs for 98 TFs, while providing a versatile platform to quantitatively assay the impact of DNA modifications on TF binding.
Affiliation(s)
- Antoni J. Gralak
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Katerina Faltejskova
  - Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, Prague, Czech Republic
  - Computer Science Institute, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
- Clemence Steiner
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Julie Russeil
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nadia Grenningloh
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Sachi Inukai
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Mustafa Demir
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Riccardo Dainese
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Cooper Owen
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Eugenia Pankevich
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Ivan V. Kulakovskiy
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Judith F. Kribelbauer-Swietek
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Guido van Mierlo
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - Department of Medical BioSciences, Radboud University Medical Center, 6500 HB Nijmegen, The Netherlands
- Bart Deplancke
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
3
Jolma A, Hernandez-Corchado A, Yang AW, Fathi A, Laverty KU, Brechalov A, Razavi R, Albu M, Zheng H, The Codebook Consortium, Kulakovskiy IV, Najafabadi HS, Hughes TR. GHT-SELEX demonstrates unexpectedly high intrinsic sequence specificity and complex DNA binding of many human transcription factors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.11.618478. [PMID: 39605368 PMCID: PMC11601218 DOI: 10.1101/2024.11.11.618478] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/29/2024]
Abstract
A long-standing challenge in human regulatory genomics is that transcription factor (TF) DNA-binding motifs are short and degenerate, while the genome is large. Motif scans therefore produce many false-positive binding site predictions. By surveying 179 TFs across 25 families using >1,500 cyclic in vitro selection experiments with fragmented, naked, and unmodified genomic DNA - a method we term GHT-SELEX (Genomic HT-SELEX) - we find that many human TFs possess much higher sequence specificity than anticipated. Moreover, genomic binding regions from GHT-SELEX are often surprisingly similar to those obtained in vivo (i.e. ChIP-seq peaks). We find that comparable specificity can also be obtained from motif scans, but performance is highly dependent on derivation and use of the motifs, including accounting for multiple local matches in the scans. We also observe alternative engagement of multiple DNA-binding domains within the same protein: long C2H2 zinc finger proteins often utilize modular DNA recognition, engaging different subsets of their DNA binding domain (DBD) arrays to recognize multiple types of distinct target sites, frequently evolving via internal duplication and divergence of one or more DBDs. Thus, contrary to conventional wisdom, it is common for TFs to possess sufficient intrinsic specificity to independently delineate cellular targets.
Affiliation(s)
- Arttu Jolma
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Aldo Hernandez-Corchado
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montréal, QC H3A 0G1, Canada
- Ally W.H. Yang
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Ali Fathi
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Kaitlin U. Laverty
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Rozita Razavi
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Mihai Albu
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Hong Zheng
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Ivan V. Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia and Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Hamed S. Najafabadi
  - Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montréal, QC H3A 0G1, Canada
- Timothy R. Hughes
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
4
Razavi R, Fathi A, Yellan I, Brechalov A, Laverty KU, Jolma A, Hernandez-Corchado A, Zheng H, Yang AW, Albu M, Barazandeh M, Hu C, Vorontsov IE, Patel ZM, The Codebook Consortium, Kulakovskiy IV, Bucher P, Morris Q, Najafabadi HS, Hughes TR. Extensive binding of uncharacterized human transcription factors to genomic dark matter. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.11.622123. [PMID: 39605320 PMCID: PMC11601254 DOI: 10.1101/2024.11.11.622123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Most of the human genome is thought to be non-functional, and includes large segments often referred to as "dark matter" DNA. The genome also encodes hundreds of putative and poorly characterized transcription factors (TFs). We determined genomic binding locations of 166 uncharacterized human TFs in living cells. Nearly half of them associated strongly with known regulatory regions such as promoters and enhancers, often at conserved motif matches and co-localizing with each other. Surprisingly, the other half often associated with genomic dark matter, at largely unique sites, via intrinsic sequence recognition. Dozens of these, which we term "Dark TFs", mainly bind within regions of closed chromatin. Dark TF binding sites are enriched for transposable elements, and are rarely under purifying selection. Some Dark TFs are KZNFs, which contain the repressive KRAB domain, but many are not: the Dark TFs also include known or potential pioneer TFs. Compiled literature information supports that the Dark TFs exert diverse functions ranging from early development to tumor suppression. Thus, our results shed light on a large fraction of previously uncharacterized human TFs and their unappreciated activities within the dark matter genome.
Affiliation(s)
- Rozita Razavi
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Ali Fathi
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Isaac Yellan
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Alexander Brechalov
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Kaitlin U. Laverty
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
  - Memorial Sloan Kettering Cancer Center, Rockefeller Research Laboratories, New York, NY 10065, USA
- Arttu Jolma
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Aldo Hernandez-Corchado
  - Victor P. Dahdaleh Institute of Genomic Medicine, 740 Dr. Penfield Avenue, Room 7202, Montréal, Québec, H3A 0G1, Canada
- Hong Zheng
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Ally W.H. Yang
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Mihai Albu
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Marjan Barazandeh
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Chun Hu
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Ilya E. Vorontsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Zain M. Patel
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada
- Ivan V. Kulakovskiy
  - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
- Philipp Bucher
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Quaid Morris
  - Memorial Sloan Kettering Cancer Center, Rockefeller Research Laboratories, New York, NY 10065, USA
- Hamed S. Najafabadi
  - Victor P. Dahdaleh Institute of Genomic Medicine, 740 Dr. Penfield Avenue, Room 7202, Montréal, Québec, H3A 0G1, Canada
  - Department of Human Genetics, McGill University, Montréal, Québec, H3A 0C7, Canada
- Timothy R. Hughes
  - Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1, Canada