1
Chen H, Xu Y, Ge H, Su X. DNA-Protein Binding is Dominated by Short Anchoring Elements. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2025; 12:e2414823. [PMID: 40138198 PMCID: PMC12097035 DOI: 10.1002/advs.202414823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2024] [Revised: 02/21/2025] [Indexed: 03/29/2025]
Abstract
Unveiling the complexities of gene expression regulation, the study explores the intricate DNA-binding mechanisms of transcription factors (TFs). By employing the KaScape method previously developed to measure both bound and unbound populations at thermodynamic equilibrium, "anchoring elements" (AEs), 3-4 base pair sequences, are identified in Arabidopsis WRKY and human PU.1 TFs crucial for binding affinity. Building on the BEESEM method, the study introduces the AEEscape algorithm, which advances the AE concept by creating a precise model of the position-specific k-mer binding energy landscape. This method allows for the direct identification of the dominant role of AEs from experimental data. Moreover, when integrated with genomic data, it reveals an energetic funnel around transcription factor binding sites (TFBSs), which is directly correlated with the density of AEs (AED). The findings not only offer a fresh perspective on TF-TFBS interactions but also highlight the critical role of AED in gene regulation. These insights can pave the way for innovative strategies to manipulate gene expression.
Affiliation(s)
- Hong Chen
  - State Key Laboratory of Gene Function and Modulation Research, School of Life Sciences, and Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
- Yongping Xu
  - State Key Laboratory of Gene Function and Modulation Research, School of Life Sciences, and Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
- Hao Ge
  - Beijing International Center for Mathematical Research (BICMR) and Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
- Xiao‐Dong Su
  - State Key Laboratory of Gene Function and Modulation Research, School of Life Sciences, and Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing 100871, China
2
Liu S, Gomez-Alcala P, Leemans C, Glassford WJ, Melo LA, Lu XJ, Mann RS, Bussemaker HJ. Predicting the DNA binding specificity of transcription factor mutants using family-level biophysically interpretable machine learning. bioRxiv: the preprint server for biology 2025:2024.01.24.577115. [PMID: 38352411 PMCID: PMC10862739 DOI: 10.1101/2024.01.24.577115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Sequence-specific interactions of transcription factors (TFs) with genomic DNA underlie many cellular processes. High-throughput in vitro binding assays coupled with machine learning have made it possible to accurately define such molecular recognition in a biophysically interpretable way for hundreds of TFs across many structural families, providing new avenues for predicting how the sequence preference of a TF is impacted by disease-associated mutations in its DNA binding domain. We developed a method based on a reference-free tetrahedral representation of variation in base preference within a given structural family that can be used to accurately predict the effect of mutations in the protein sequence of the TF. Using the basic helix-loop-helix (bHLH) and homeodomain families as test cases, our results demonstrate the feasibility of accurately predicting the shifts (ΔΔΔG/RT) in binding free energy associated with TF mutants by leveraging high-quality DNA binding models for sets of homologous wild-type TFs.
Affiliation(s)
- Shaoxun Liu
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Pilar Gomez-Alcala
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Christ Leemans
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- William J. Glassford
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
- Lucas A.N. Melo
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Xiang-Jun Lu
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Richard S. Mann
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Harmen J. Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
3
Mekkaoui F, Drewell RA, Dresch JM, Spratt DE. Experimental approaches to investigate biophysical interactions between homeodomain transcription factors and DNA. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2025; 1868:195074. [PMID: 39644990 PMCID: PMC11832328 DOI: 10.1016/j.bbagrm.2024.195074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 11/26/2024] [Accepted: 12/01/2024] [Indexed: 12/09/2024]
Abstract
Homeodomain transcription factors (TFs) bind to specific DNA sequences to regulate the expression of target genes. Structural work has provided insight into molecular identities and aided in unraveling structural features of these TFs. However, the detailed affinity and specificity by which these TFs bind to DNA sequences is still largely unknown. Qualitative methods, such as DNA footprinting, Electrophoretic Mobility Shift Assays (EMSAs), Systematic Evolution of Ligands by Exponential Enrichment (SELEX), Bacterial One Hybrid (B1H) systems, Surface Plasmon Resonance (SPR), and Protein Binding Microarrays (PBMs) have been widely used to investigate the biochemical characteristics of TF-DNA binding events. In addition to these qualitative methods, bioinformatic approaches have also assisted in TF binding site discovery. Here we discuss the advantages and limitations of these different approaches, as well as the benefits of utilizing more quantitative approaches, such as Mechanically Induced Trapping of Molecular Interactions (MITOMI), Microscale Thermophoresis (MST) and Isothermal Titration Calorimetry (ITC), in determining the biophysical basis of binding specificity of TF-DNA complexes and improving upon existing computational approaches aimed at affinity predictions.
Affiliation(s)
- Fadwa Mekkaoui
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
- Robert A Drewell
  - Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
- Jacqueline M Dresch
  - Biology Department, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
- Donald E Spratt
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, MA 01610, United States of America
4
Schroeder JW, Wolfe MB, Freddolino L. ShapeME: A tool and web front-end for de novo discovery of structural motifs underpinning protein-DNA interactions. bioRxiv: the preprint server for biology 2025:2025.01.28.635290. [PMID: 39975017 PMCID: PMC11838363 DOI: 10.1101/2025.01.28.635290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Determining where transcriptional regulators bind within a genome is paramount to understanding how gene expression is regulated. Historically, position weight matrices (PWMs) have been used to define the binding preferences of DNA binding proteins [1]. However, PWMs treat the identity of each base in a sequence as an independent and additive measure of binding preference, which can limit their utility [2]. Models that consider higher order interactions between nearby bases yield greater success in predicting proteins' binding to DNA, but for many proteins there is still substantial room for improvement in predicting and understanding the determinants of proteins' binding to DNA [3]. In addition to DNA sequence motifs, structural motifs (e.g., a narrow minor groove width) are important determinants of binding for some DNA-binding proteins [4]. Despite the initial success of algorithms using structural features of DNA to predict binding properties of proteins from either ChIP-seq or SELEX data [5-8], there remains a need for a de novo structural motif discovery framework which can be applied to data from a variety of experimental designs. Here, we present a unified workflow, capable of utilizing virtually any type of data representing sequence coverage or enrichment (e.g. ChIP-seq, RNA-seq, SELEX, etc.), to discover short structural motifs with explanatory power for a protein's DNA binding preference. We couple the DNAshapeR algorithm [9] with our own information-theoretic approach to de novo motif discovery, and wrap shape and sequence motif inference and model selection into a single tool called ShapeME. Application of our structural motif discovery algorithm to proteins with ChIP-seq data in ENCODE datasets reveals a subset of proteins where short structural motifs outperform the best PWM for that protein as determined from the JASPAR database, or as identified by the sequence motif elicitation tool STREME.
Our approach offers a powerful and versatile framework for inferring structural DNA binding motifs, and will complement current sequence-based motif elicitation tools in discovery of protein-DNA interaction principles. A web-based interface to ShapeME is available at https://seq2fun.dcmb.med.umich.edu/shapeme, with full source code available at https://github.com/freddolino-lab/ShapeME.
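The abstract's point that PWMs treat each base as an independent, additive contribution can be illustrated with a minimal sketch. This is not ShapeME code and does not use JASPAR or STREME data; the frequency matrix, the uniform background, and the function names `log_odds`, `score_site`, and `best_site` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical 4-position base-frequency matrix (illustrative values only,
# not taken from any database or from the paper).
FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def log_odds(freqs, background=BACKGROUND):
    """Convert per-position frequencies into log2-odds weights."""
    return [{b: math.log2(p / background) for b, p in col.items()} for col in freqs]


def score_site(pwm, site):
    """Additive PWM score: each position contributes independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM width")
    return sum(col[base] for col, base in zip(pwm, site))


def best_site(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score_site(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


pwm = log_odds(FREQS)
score, start = best_site(pwm, "TTAGCTGG")
print(f"best window starts at {start} with score {score:.2f}")
```

Because the score is a plain sum, swapping the base at one position changes the total by the same amount whatever the neighbouring bases are. That independence is the limitation the abstract attributes to PWMs, and it is what models with higher-order or shape terms relax.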
Affiliation(s)
- Jeremy W. Schroeder
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Michael B. Wolfe
  - Department of Biochemistry, University of Wisconsin - Madison, Madison, WI 53706, USA
- Lydia Freddolino
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
5
Jolma A, Laverty KU, Fathi A, Yang AW, Yellan I, Vorontsov IE, Inukai S, Kribelbauer-Swietek JF, Gralak AJ, Razavi R, Albu M, Brechalov A, Patel ZM, Nozdrin V, Meshcheryakov G, Kozin I, Abramov S, Boytsov A, The Codebook Consortium, Fornes O, Makeev VJ, Grau J, Grosse I, Bucher P, Deplancke B, Kulakovskiy IV, Hughes TR. Perspectives on Codebook: sequence specificity of uncharacterized human transcription factors. bioRxiv: the preprint server for biology 2024:2024.11.11.622097. [PMID: 39605729 PMCID: PMC11601247 DOI: 10.1101/2024.11.11.622097] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 11/29/2024]
Abstract
We describe an effort ("Codebook") to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments across multiple in vitro and in vivo assays produced motifs for just over half of the putative TFs analyzed (177, or 53%), of which most are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple assays provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. 1,421 human TFs are now associated with a DNA binding motif. Extrapolation from the Codebook benchmarking, however, suggests that many of the currently known binding motifs for well-studied TFs may inaccurately describe the TF's true sequence preferences.
Affiliation(s)
- Arttu Jolma
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Kaitlin U. Laverty
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
- Ali Fathi
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Ally W.H. Yang
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Isaac Yellan
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Ilya E. Vorontsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Sachi Inukai
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Judith F. Kribelbauer-Swietek
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Antoni J. Gralak
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Rozita Razavi
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Mihai Albu
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
- Zain M. Patel
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
- Vladimir Nozdrin
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
- Georgy Meshcheryakov
  - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
- Ivan Kozin
  - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
- Sergey Abramov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
- Alexandr Boytsov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
- Oriol Fornes
  - Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
- Vsevolod J. Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
- Jan Grau
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
- Ivo Grosse
  - Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
- Philipp Bucher
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Bart Deplancke
  - Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Ivan V. Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
- Timothy R. Hughes
  - Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
6
Song JS, Manjunath M. Predicting the molecular functions of regulatory genetic variants associated with cancer. Oncotarget 2023; 14:775-777. [PMID: 37646780 PMCID: PMC10467629 DOI: 10.18632/oncotarget.28451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Indexed: 09/01/2023] Open
Affiliation(s)
- Jun S. Song
  - Correspondence to: Jun S. Song, Department of Physics, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
7
Bhimsaria D, Rodríguez-Martínez JA, Mendez-Johnson JL, Ghoshdastidar D, Varadarajan A, Bansal M, Daniels DL, Ramanathan P, Ansari AZ. Hidden modes of DNA binding by human nuclear receptors. Nat Commun 2023; 14:4179. [PMID: 37443151 PMCID: PMC10345098 DOI: 10.1038/s41467-023-39577-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/11/2022] [Accepted: 06/19/2023] [Indexed: 07/15/2023] Open
Abstract
Human nuclear receptors (NRs) are a superfamily of ligand-responsive transcription factors that have central roles in cellular function. Their malfunction is linked to numerous diseases, and the ability to modulate their activity with synthetic ligands has yielded 16% of all FDA-approved drugs. NRs regulate distinct gene networks, however they often function from genomic sites that lack known binding motifs. Here, to annotate genomic binding sites of known and unexamined NRs more accurately, we use high-throughput SELEX to comprehensively map DNA binding site preferences of all full-length human NRs, in complex with their ligands. Furthermore, to identify non-obvious binding sites buried in DNA-protein interactomes, we develop MinSeq Find, a search algorithm based on the MinTerm concept from electrical engineering and digital systems design. The resulting MinTerm sequence set (MinSeqs) reveal a constellation of binding sites that more effectively annotate NR-binding profiles in cells. MinSeqs also unmask binding sites created or disrupted by 52,106 single-nucleotide polymorphisms associated with human diseases. By implicating druggable NRs as hidden drivers of multiple human diseases, our results not only reveal new biological roles of NRs, but they also provide a resource for drug-repurposing and precision medicine.
Affiliation(s)
- Devesh Bhimsaria
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
- Ashwin Varadarajan
  - Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Manju Bansal
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012, India
- Danette L Daniels
  - Promega Corporation, Madison, WI, 53711, USA
  - Foghorn Therapeutics, Cambridge, MA, 02139, USA
- Parameswaran Ramanathan
  - Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Aseem Z Ansari
  - Department of Chemical Biology and Therapeutics, St. Jude Children's Research Hospital, Memphis, TN, 38105, USA
8
Cain B, Webb J, Yuan Z, Cheung D, Lim HW, Kovall R, Weirauch MT, Gebelein B. Prediction of cooperative homeodomain DNA binding sites from high-throughput-SELEX data. Nucleic Acids Res 2023; 51:6055-6072. [PMID: 37114997 PMCID: PMC10325903 DOI: 10.1093/nar/gkad318] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2022] [Revised: 04/12/2023] [Accepted: 04/25/2023] [Indexed: 04/29/2023] Open
Abstract
Homeodomain proteins constitute one of the largest families of metazoan transcription factors. Genetic studies have demonstrated that homeodomain proteins regulate many developmental processes. Yet, biochemical data reveal that most bind highly similar DNA sequences. Defining how homeodomain proteins achieve DNA binding specificity has therefore been a long-standing goal. Here, we developed a novel computational approach to predict cooperative dimeric binding of homeodomain proteins using High-Throughput (HT) SELEX data. Importantly, we found that 15 of 88 homeodomain factors form cooperative homodimer complexes on DNA sites with precise spacing requirements. Approximately one third of the paired-like homeodomain proteins cooperatively bind palindromic sequences spaced 3 bp apart, whereas other homeodomain proteins cooperatively bind sites with distinct orientation and spacing requirements. Combining structural models of a paired-like factor with our cooperativity predictions identified key amino acid differences that help differentiate between cooperative and non-cooperative factors. Finally, we confirmed predicted cooperative dimer sites in vivo using available genomic data for a subset of factors. These findings demonstrate how HT-SELEX data can be computationally mined to predict cooperativity. In addition, the binding site spacing requirements of select homeodomain proteins provide a mechanism by which seemingly similar AT-rich DNA sequences can preferentially recruit specific homeodomain factors.
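The idea of a palindromic dimer site with a fixed spacer, as described in the abstract, can be sketched as a simple sequence scan. This is not the authors' method, which works from HT-SELEX enrichment data. The half-site `TAAT` is an assumed example that does not come from the abstract, "spaced 3 bp apart" is interpreted here as three arbitrary bases between the half-sites, and the function names are hypothetical.

```python
import re

HALF_SITE = "TAAT"  # assumed half-site for illustration; not given in the abstract
SPACER = 3          # assumed meaning of "spaced 3 bp apart": 3 bases between half-sites


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_spaced_palindromes(seq, half=HALF_SITE, spacer=SPACER):
    """Return start indices of half-site + N{spacer} + reverse-complement half-site."""
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(
        f"(?={half}[ACGT]{{{spacer}}}{reverse_complement(half)})"
    )
    return [m.start() for m in pattern.finditer(seq)]


print(find_spaced_palindromes("GGTAATCCGATTAGG"))
```

A scan like this only locates candidate sites by sequence. Whether a factor binds such a site cooperatively is a separate question that the paper addresses with enrichment data, not pattern matching.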
Affiliation(s)
- Brittany Cain
  - Department of Biomedical Engineering, University of Cincinnati, Cincinnati, OH 45221, USA
  - Division of Developmental Biology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, MLC 7007, Cincinnati, OH 45229, USA
- Jordan Webb
  - Department of Molecular Genetics, Biochemistry and Microbiology, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
- Zhenyu Yuan
  - Department of Molecular Genetics, Biochemistry and Microbiology, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
- David Cheung
  - Graduate Program in Molecular and Developmental Biology, Cincinnati Children's Hospital Research Foundation, Cincinnati, OH 45229, USA
- Hee-Woong Lim
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA
- Rhett A Kovall
  - Department of Molecular Genetics, Biochemistry and Microbiology, University of Cincinnati College of Medicine, Cincinnati, OH 45267, USA
- Matthew T Weirauch
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA
  - Divisions of Human Genetics, Biomedical Informatics and Developmental Biology, Center for Autoimmune Genomics and Etiology (CAGE), Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Brian Gebelein
  - Division of Developmental Biology, Cincinnati Children's Hospital Medical Center, 3333 Burnet Ave, MLC 7007, Cincinnati, OH 45229, USA
  - Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA
9
Cooper BH, Dantas Machado AC, Gan Y, Aparicio O, Rohs R. DNA binding specificity of all four Saccharomyces cerevisiae forkhead transcription factors. Nucleic Acids Res 2023; 51:5621-5633. [PMID: 37177995 PMCID: PMC10287902 DOI: 10.1093/nar/gkad372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 04/19/2023] [Accepted: 04/27/2023] [Indexed: 05/15/2023] Open
Abstract
Quantifying the nucleotide preferences of DNA binding proteins is essential to understanding how transcription factors (TFs) interact with their targets in the genome. High-throughput in vitro binding assays have been used to identify the inherent DNA binding preferences of TFs in a controlled environment isolated from confounding factors such as genome accessibility, DNA methylation, and TF binding cooperativity. Unfortunately, many of the most common approaches for measuring binding preferences are not sensitive enough for the study of moderate-to-low affinity binding sites, and are unable to detect small-scale differences between closely related homologs. The Forkhead box (FOX) family of TFs is known to play a crucial role in regulating a variety of key processes from proliferation and development to tumor suppression and aging. By using the high-sequencing depth SELEX-seq approach to study all four FOX homologs in Saccharomyces cerevisiae, we have been able to precisely quantify the contribution and importance of nucleotide positions all along an extended binding site. Essential to this process was the alignment of our SELEX-seq reads to a set of candidate core sequences determined using a recently developed tool for the alignment of enriched k-mers and a newly developed approach for the reprioritization of candidate cores.
Affiliation(s)
- Brendon H Cooper
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Ana Carolina Dantas Machado
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Yan Gan
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Oscar M Aparicio
  - Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
  - Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
  - Departments of Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
10
Rube HT, Rastogi C, Feng S, Kribelbauer JF, Li A, Becerra B, Melo LAN, Do BV, Li X, Adam HH, Shah NH, Mann RS, Bussemaker HJ. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. Nat Biotechnol 2022; 40:1520-1527. [PMID: 35606422 PMCID: PMC9546773 DOI: 10.1038/s41587-022-01307-0] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 06/06/2021] [Accepted: 04/04/2022] [Indexed: 01/02/2023]
Abstract
Protein-ligand interactions are increasingly profiled at high throughput using affinity selection and massively parallel sequencing. However, these assays do not provide the biophysical parameters that most rigorously quantify molecular interactions. Here we describe a flexible machine learning method, called ProBound, that accurately defines sequence recognition in terms of equilibrium binding constants or kinetic rates. This is achieved using a multi-layered maximum-likelihood framework that models both the molecular interactions and the data generation process. We show that ProBound quantifies transcription factor (TF) behavior with models that predict binding affinity over a range exceeding that of previous resources; captures the impact of DNA modifications and conformational flexibility of multi-TF complexes; and infers specificity directly from in vivo data such as ChIP-seq without peak calling. When coupled with an assay called KD-seq, it determines the absolute affinity of protein-ligand interactions. We also apply ProBound to profile the kinetics of kinase-substrate interactions. ProBound opens new avenues for decoding biological networks and rationally engineering protein-ligand interactions.
Affiliation(s)
- H Tomas Rube
  - Department of Bioengineering, University of California, Merced, Merced, CA, USA
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Chaitanya Rastogi
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Siqian Feng
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
- Allyson Li
  - Department of Chemistry, Columbia University, New York, NY, USA
- Basheer Becerra
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Lucas A N Melo
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Bach Viet Do
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Xiaoting Li
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Hammaad H Adam
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Neel H Shah
  - Department of Chemistry, Columbia University, New York, NY, USA
- Richard S Mann
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University, New York, NY, USA
11
Cooper BH, Chiu TP, Rohs R. Top-Down Crawl: a method for the ultra-rapid and motif-free alignment of sequences with associated binding metrics. Bioinformatics 2022; 38:5121-5123. [PMID: 36179084 PMCID: PMC9665867 DOI: 10.1093/bioinformatics/btac653] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/02/2022] [Revised: 09/21/2022] [Accepted: 09/29/2022] [Indexed: 12/24/2022] Open
Abstract
SUMMARY: Several high-throughput protein-DNA binding methods currently available produce highly reproducible measurements of binding affinity at the level of the k-mer. However, understanding where a k-mer is positioned along a binding site sequence depends on alignment. Here, we present Top-Down Crawl (TDC), an ultra-rapid tool designed for the alignment of k-mer level data in a rank-dependent and position weight matrix (PWM)-independent manner. As the framework only depends on the rank of the input, the method can accept input from many types of experiments (protein binding microarray, SELEX-seq, SMiLE-seq, etc.) without the need for specialized parameterization. Measuring the performance of the alignment using multiple linear regression with 5-fold cross-validation, we find TDC to perform as well as or better than computationally expensive PWM-based methods.
AVAILABILITY AND IMPLEMENTATION: TDC can be run online at https://topdowncrawl.usc.edu or locally as a Python package available through pip at https://pypi.org/project/TopDownCrawl.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Brendon H Cooper
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Tsu-Pei Chiu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Remo Rohs
- To whom correspondence should be addressed.
12
Barissi S, Sala A, Wieczór M, Battistini F, Orozco M. DNAffinity: a machine-learning approach to predict DNA binding affinities of transcription factors. Nucleic Acids Res 2022; 50:9105-9114. [PMID: 36018808 PMCID: PMC9458447 DOI: 10.1093/nar/gkac708] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/20/2022] [Revised: 07/21/2022] [Accepted: 08/08/2022] [Indexed: 12/24/2022] Open
Abstract
We present a physics-based machine learning approach to predict in vitro transcription factor binding affinities from structural and mechanical DNA properties directly derived from atomistic molecular dynamics simulations. The method predicts affinities obtained with techniques as different as uPBM, gcPBM and HT-SELEX with excellent performance, much better than existing algorithms. Due to its nature, the method can be extended to epigenetic variants, mismatches, mutations, or any non-coding nucleobases. When complemented with chromatin structure information, our in vitro trained method also provides good estimates of in vivo binding sites in yeast.
Affiliation(s)
- Miłosz Wieczór
- Institute for Research in Biomedicine (IRB Barcelona). The Barcelona Institute of Science and Technology. Baldiri Reixac 10–12, 08028 Barcelona, Spain; Department of Physical Chemistry. Gdansk University of Technology, 80-233 Gdańsk, Poland
- Modesto Orozco
- Correspondence may also be addressed to Modesto Orozco. Tel: +34 934 037 156;
13
Laverty KU, Jolma A, Pour SE, Zheng H, Ray D, Morris Q, Hughes TR. PRIESSTESS: interpretable, high-performing models of the sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res 2022; 50:e111. [PMID: 36018788 DOI: 10.1093/nar/gkac694] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 07/22/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022] Open
Abstract
Modelling both primary sequence and secondary structure preferences for RNA binding proteins (RBPs) remains an ongoing challenge. Current models use varied RNA structure representations and can be difficult to interpret and evaluate. To address these issues, we present a universal RNA motif-finding/scanning strategy, termed PRIESSTESS (Predictive RBP-RNA InterpretablE Sequence-Structure moTif regrESSion), that can be applied to diverse RNA binding datasets. PRIESSTESS identifies dozens of enriched RNA sequence and/or structure motifs that are subsequently reduced to a set of core motifs by logistic regression with LASSO regularization. Importantly, these core motifs are easily visualized and interpreted, and provide a measure of RBP secondary structure specificity. We used PRIESSTESS to interrogate new HTR-SELEX data for 23 RBPs with diverse RNA binding modes and captured known primary sequence and secondary structure preferences for each. Moreover, when applying PRIESSTESS to 144 RBPs across 202 RNA binding datasets, 75% showed an RNA secondary structure preference but only 10% had a preference besides unpaired bases, suggesting that most RBPs simply recognize the accessibility of primary sequences.
Affiliation(s)
- Kaitlin U Laverty
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- Arttu Jolma
- Department of Molecular Genetics, University of Toronto, Toronto, Canada; Donnelly Centre, University of Toronto, Toronto, Canada
- Sara E Pour
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
- Hong Zheng
- Donnelly Centre, University of Toronto, Toronto, Canada
- Debashish Ray
- Donnelly Centre, University of Toronto, Toronto, Canada
- Quaid Morris
- Department of Molecular Genetics, University of Toronto, Toronto, Canada; Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York, USA
- Timothy R Hughes
- Department of Molecular Genetics, University of Toronto, Toronto, Canada; Donnelly Centre, University of Toronto, Toronto, Canada
14
Zemlyanskaya EV, Dolgikh VA, Levitsky VG, Mironova V. Transcriptional regulation in plants: Using omics data to crack the cis-regulatory code. CURRENT OPINION IN PLANT BIOLOGY 2021; 63:102058. [PMID: 34098218 DOI: 10.1016/j.pbi.2021.102058] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/08/2021] [Revised: 04/15/2021] [Accepted: 04/19/2021] [Indexed: 06/12/2023]
Abstract
Innovative omics technologies, advanced bioinformatics, and machine learning methods are rapidly becoming integral tools for plant functional genomics, with tremendous recent advances made in this field. In transcriptional regulation, an initial lag in the accumulation of plant omics data relative to that of animals stimulated the development of computational methods capable of extracting maximum information from the available data sets. Recent comprehensive studies of transcription factor-binding profiles in Arabidopsis and maize and the accumulation of uniformly processed omics data in public databases have brought plant biologists into the big leagues, with many cutting-edge methods available. Here, we summarize the state-of-the-art bioinformatics approaches used to predict or infer the cis-regulatory code behind transcriptional gene regulation, focusing on their plant research applications.
Affiliation(s)
- Elena V Zemlyanskaya
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090, Russia; Novosibirsk State University, Novosibirsk, 630090, Russia.
- Vladislav A Dolgikh
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090, Russia
- Victor G Levitsky
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090, Russia; Novosibirsk State University, Novosibirsk, 630090, Russia
- Victoria Mironova
- Institute of Cytology and Genetics, Siberian Branch, Russian Academy of Sciences, Novosibirsk, 630090, Russia; Novosibirsk State University, Novosibirsk, 630090, Russia; Department of Plant Systems Physiology, Institute for Water and Wetland Research, Radboud University, Heyendaalseweg 135, 6525, AJ Nijmegen, the Netherlands.
15
Xu Y, Jiang X, Zhou Y, Ma M, Wang M, Ying B. Systematic Evolution of Ligands by Exponential Enrichment Technologies and Aptamer-Based Applications: Recent Progress and Challenges in Precision Medicine of Infectious Diseases. Front Bioeng Biotechnol 2021; 9:704077. [PMID: 34447741 PMCID: PMC8383106 DOI: 10.3389/fbioe.2021.704077] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/01/2021] [Accepted: 07/26/2021] [Indexed: 02/05/2023] Open
Abstract
Infectious diseases are considered a pressing challenge to global public health. Accurate and rapid diagnostic tools for early recognition of the pathogen, as well as individualized precision therapy, are essential for controlling the spread of infectious diseases. Aptamers, which are screened by systematic evolution of ligands by exponential enrichment (SELEX), can bind to targets with high affinity and specificity and therefore have exciting potential in both the diagnosis and treatment of infectious diseases. In this review, we provide a comprehensive overview of the latest developments in SELEX technology and focus on the applications of aptamer-based technologies in infectious diseases, such as targeted drug delivery, treatments and biosensors for diagnosis. The challenges and the future development in this field of clinical application will also be discussed.
Affiliation(s)
- Yixin Xu
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Xin Jiang
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Yanhong Zhou
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Ming Ma
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China; The First People's Hospital of Shuangliu District, Chengdu/West China (Airport) Hospital Sichuan University, Chengdu, China
- Minjin Wang
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
- Binwu Ying
- Department of Laboratory Medicine, West China Hospital, Sichuan University, Chengdu, China
16
Yan J, Qiu Y, Ribeiro Dos Santos AM, Yin Y, Li YE, Vinckier N, Nariai N, Benaglio P, Raman A, Li X, Fan S, Chiou J, Chen F, Frazer KA, Gaulton KJ, Sander M, Taipale J, Ren B. Systematic analysis of binding of transcription factors to noncoding variants. Nature 2021; 591:147-151. [PMID: 33505025 PMCID: PMC9367673 DOI: 10.1038/s41586-021-03211-0] [Citation(s) in RCA: 93] [Impact Index Per Article: 23.3] [Received: 11/28/2018] [Accepted: 12/11/2020] [Indexed: 12/30/2022]
Abstract
Many sequence variants have been linked to complex human traits and diseases [1], but deciphering their biological functions remains challenging, as most of them reside in noncoding DNA. Here we have systematically assessed the binding of 270 human transcription factors to 95,886 noncoding variants in the human genome using an ultra-high-throughput multiplex protein-DNA binding assay, termed single-nucleotide polymorphism evaluation by systematic evolution of ligands by exponential enrichment (SNP-SELEX). The resulting 828 million measurements of transcription factor-DNA interactions enable estimation of the relative affinity of these transcription factors to each variant in vitro and evaluation of the current methods to predict the effects of noncoding variants on transcription factor binding. We show that the position weight matrices of most transcription factors lack sufficient predictive power, whereas the support vector machine combined with the gapped k-mer representation shows much improved performance when assessed on results from independent SNP-SELEX experiments involving a new set of 61,020 sequence variants. We report highly predictive models for 94 human transcription factors and demonstrate their utility in genome-wide association studies and understanding of the molecular pathways involved in diverse human traits and diseases.
Affiliation(s)
- Jian Yan
- School of Medicine, Northwest University, Xi'an, China.
- Ludwig Institute for Cancer Research, La Jolla, CA, USA.
- Department of Biomedical Sciences, City University of Hong Kong, Hong Kong SAR, China.
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden.
- Yunjiang Qiu
- Ludwig Institute for Cancer Research, La Jolla, CA, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
- André M Ribeiro Dos Santos
- Ludwig Institute for Cancer Research, La Jolla, CA, USA
- Universidade Federal do Pará, Institute of Biological Sciences, Belém, Brazil
- Yimeng Yin
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden
- Department of Biochemistry, University of Cambridge, Cambridge, UK
- Yang E Li
- Ludwig Institute for Cancer Research, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Nick Vinckier
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Naoki Nariai
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Paola Benaglio
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Anugraha Raman
- Ludwig Institute for Cancer Research, La Jolla, CA, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, USA
- Xiaoyu Li
- School of Medicine, Northwest University, Xi'an, China
- Department of Biomedical Sciences, City University of Hong Kong, Hong Kong SAR, China
- Shicai Fan
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Joshua Chiou
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Fulin Chen
- School of Medicine, Northwest University, Xi'an, China
- Kelly A Frazer
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Kyle J Gaulton
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Maike Sander
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Jussi Taipale
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Solna, Sweden.
- Department of Biochemistry, University of Cambridge, Cambridge, UK.
- Genome-Scale Biology Program, University of Helsinki, Helsinki, Finland.
- Bing Ren
- Ludwig Institute for Cancer Research, La Jolla, CA, USA.
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA.
- Center for Epigenomics, University of California San Diego, La Jolla, CA, USA.
17
Asif M, Orenstein Y. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics 2020; 36:i634-i642. [PMID: 33381817 DOI: 10.1093/bioinformatics/btaa789] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/15/2022] Open
Abstract
Motivation: Transcription factor (TF) DNA-binding is a central mechanism in gene regulation. Biologists would like to know where and when these factors bind DNA. Hence, they require accurate DNA-binding models to enable binding prediction to any DNA sequence. Recent technological advancements measure the binding of a single TF to thousands of DNA sequences. One of the prevailing techniques, high-throughput SELEX, measures protein-DNA binding by high-throughput sequencing over several cycles of enrichment. Unfortunately, current computational methods to infer the binding preferences from high-throughput SELEX data do not exploit the richness of these data, and are under-using the most advanced computational technique, deep neural networks. Results: To better characterize the binding preferences of TFs from these experimental data, we developed DeepSELEX, a new algorithm to infer intrinsic DNA-binding preferences using deep neural networks. DeepSELEX takes advantage of the richness of high-throughput sequencing data and learns the DNA-binding preferences by observing the changes in DNA sequences through the experimental cycles. DeepSELEX outperforms extant methods for the task of DNA-binding inference from high-throughput SELEX data in binding prediction in vitro and is on par with the state of the art in in vivo binding prediction. Analysis of model parameters reveals it learns biologically relevant features that shed light on TFs' binding mechanism. Availability and implementation: DeepSELEX is available through github.com/OrensteinLab/DeepSELEX/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Maor Asif
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
18
Li L, Xu S, Yan H, Li X, Yazd HS, Li X, Huang T, Cui C, Jiang J, Tan W. Nucleic Acid Aptamers for Molecular Diagnostics and Therapeutics: Advances and Perspectives. Angew Chem Int Ed Engl 2020. [DOI: 10.1002/ange.202003563] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/11/2022]
Affiliation(s)
- Long Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Shujuan Xu
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- He Yan
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Xiaowei Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Hoda Safari Yazd
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Xiang Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Tong Huang
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Cheng Cui
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Institute of Cancer and Basic Medicine (IBMC), Chinese Academy of Sciences, The Cancer Hospital of the University of Chinese Academy of Sciences, Hangzhou, Zhejiang, 310022, China
- Jianhui Jiang
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Weihong Tan
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Institute of Molecular Medicine (IMM), Renji Hospital, State Key Laboratory of Oncogenes and Related Genes, Shanghai Jiao Tong University School of Medicine, and College of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
19
Li L, Xu S, Yan H, Li X, Yazd HS, Li X, Huang T, Cui C, Jiang J, Tan W. Nucleic Acid Aptamers for Molecular Diagnostics and Therapeutics: Advances and Perspectives. Angew Chem Int Ed Engl 2020; 60:2221-2231. [PMID: 32282107 DOI: 10.1002/anie.202003563] [Citation(s) in RCA: 213] [Impact Index Per Article: 42.6] [Received: 03/09/2020] [Indexed: 12/11/2022]
Abstract
The advent of SELEX (systematic evolution of ligands by exponential enrichment) technology has shown the ability to evolve artificial ligands with affinity and specificity able to meet growing clinical demand for probes that can, for example, distinguish between the target leukemia cells and other cancer cells within the matrix of heterogeneity, which characterizes cancer cells. Though antibodies are the conventional and ideal choice as a molecular recognition tool for many applications, aptamers complement the use of antibodies due to many unique advantages, such as small size, low cost, and facile chemical modification. This Minireview will focus on the novel applications of aptamers and SELEX, as well as opportunities to develop molecular tools able to meet future clinical needs in biomedicine.
Affiliation(s)
- Long Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Shujuan Xu
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA; Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- He Yan
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA; Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Xiaowei Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Hoda Safari Yazd
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Xiang Li
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Tong Huang
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA
- Cheng Cui
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China; Institute of Cancer and Basic Medicine (IBMC), Chinese Academy of Sciences, The Cancer Hospital of the University of Chinese Academy of Sciences, Hangzhou, Zhejiang, 310022, China
- Jianhui Jiang
- Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China
- Weihong Tan
- Department of Chemistry and Physiology and Functional Genomics, Center for Research at the Bio/Nano Interface, Health Cancer Center, UF Genetics Institute, McKnight Brain Institute, University of Florida, Gainesville, Florida, 32611, USA; Molecular Science and Biomedicine Laboratory (MBL), State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, College of Biology, Aptamer Engineering Center of Hunan Province, Hunan University, Changsha, 410082, China; Institute of Molecular Medicine (IMM), Renji Hospital, State Key Laboratory of Oncogenes and Related Genes, Shanghai Jiao Tong University School of Medicine, and College of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
20
Khabiri M, Freddolino L. Expression of Concern for "Deficiencies in Molecular Dynamics Simulation-Based Prediction of Protein-DNA Binding Free Energy Landscapes". J Phys Chem B 2020; 124:1115–1123. [PMID: 29741907 DOI: 10.1021/acs.jpcb.8b04187] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/17/2022]
Affiliation(s)
- Morteza Khabiri
- Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, Michigan, United States
- Lydia Freddolino
- Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, Michigan, United States
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan, United States
21
Wetzel JL, Singh M. Sharing DNA-binding information across structurally similar proteins enables accurate specificity determination. Nucleic Acids Res 2020; 48:e9. [PMID: 31777934 PMCID: PMC7028011 DOI: 10.1093/nar/gkz1087] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/24/2019] [Revised: 10/03/2019] [Accepted: 11/01/2019] [Indexed: 01/31/2023] Open
Abstract
We are now in an era where protein-DNA interactions have been experimentally assayed for thousands of DNA-binding proteins. In order to infer DNA-binding specificities from these data, numerous sophisticated computational methods have been developed. These approaches typically infer DNA-binding specificities by considering interactions for each protein independently, ignoring related and potentially valuable interaction information across other proteins that bind DNA via the same structural domain. Here we introduce a framework for inferring DNA-binding specificities by considering protein-DNA interactions for entire groups of structurally similar proteins simultaneously. We devise both constrained optimization and label propagation algorithms for this task, each balancing observations at the individual protein level against dataset-wide consistency of interaction preferences. We test our approaches on two large, independent Cys2His2 zinc finger protein-DNA interaction datasets. We demonstrate that jointly inferring specificities within each dataset individually dramatically improves accuracy, leading to increased agreement both between these two datasets and with a fixed external standard. Overall, our results suggest that sharing protein-DNA interaction information across structurally similar proteins is a powerful means to enable accurate inference of DNA-binding specificities.
Affiliation(s)
- Joshua L Wetzel
- The Lewis-Sigler Institute for Integrative Genomics, Princeton, NJ 08544, USA
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
- Mona Singh
- The Lewis-Sigler Institute for Integrative Genomics, Princeton, NJ 08544, USA
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA
22
Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, Santana-Garcia W, Tan G, Chèneby J, Ballester B, Parcy F, Sandelin A, Lenhard B, Wasserman WW, Mathelier A. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res 2020; 48:D87-D92. [PMID: 31701148 PMCID: PMC7145627 DOI: 10.1093/nar/gkz1001] [Citation(s) in RCA: 856] [Impact Index Per Article: 171.2] [Received: 09/15/2019] [Revised: 10/15/2019] [Accepted: 10/16/2019] [Indexed: 02/07/2023] Open
Abstract
JASPAR (http://jaspar.genereg.net) is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles. Moreover, we created a Q&A forum to ease the communication between the user community and JASPAR curators. Finally, we updated the genomic tracks, inference tool, and TF-binding profile similarity clusters. All the data is available through the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.
Affiliation(s)
- Oriol Fornes
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Jaime A Castro-Mondragon
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Aziz Khan
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Robin van der Lee
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Xi Zhang
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Phillip A Richmond
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Bhavi P Modi
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Solenne Correard
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Marius Gheorghe
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
| | - Damir Baranašić
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
| | - Walter Santana-Garcia
- Institut de Biologie de l’ENS (IBENS), Département de biologie, École normale supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Ge Tan
- Functional Genomics Centre Zurich, ETH Zurich, Zurich, Switzerland
| | | | | | - François Parcy
- CNRS, Univ. Grenoble Alpes, CEA, INRA, IRIG-LPCV, 38000 Grenoble, France
| | - Albin Sandelin
- The Bioinformatics Centre, Department of Biology and Biotech Research & Innovation Centre, University of Copenhagen, DK2200 Copenhagen N, Denmark
| | - Boris Lenhard
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London W12 0NN, UK
- Computational Regulatory Genomics, MRC London Institute of Medical Sciences, London W12 0NN, UK
- Sars International Centre for Marine Molecular Biology, University of Bergen, N-5008 Bergen, Norway
| | - Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children's Hospital Research Institute, University of British Columbia, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada
| | - Anthony Mathelier
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo, 0318 Oslo, Norway
- Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0310 Oslo, Norway
| |
|
23
|
Kribelbauer JF, Rastogi C, Bussemaker HJ, Mann RS. Low-Affinity Binding Sites and the Transcription Factor Specificity Paradox in Eukaryotes. Annu Rev Cell Dev Biol 2019; 35:357-379. [PMID: 31283382 DOI: 10.1146/annurev-cellbio-100617-062719] [Citation(s) in RCA: 135] [Impact Index Per Article: 22.5] [Indexed: 12/20/2022]
Abstract
Eukaryotic transcription factors (TFs) from the same structural family tend to bind similar DNA sequences, despite the ability of these TFs to execute distinct functions in vivo. The cell partly resolves this specificity paradox through combinatorial strategies and the use of low-affinity binding sites, which are better able to distinguish between similar TFs. However, because these sites have low affinity, it is challenging to understand how TFs recognize them in vivo. Here, we summarize recent findings and technological advancements that allow for the quantification and mechanistic interpretation of TF recognition across a wide range of affinities. We propose a model that integrates insights from the fields of genetics and cell biology to provide further conceptual understanding of TF binding specificity. We argue that in eukaryotes, target specificity is driven by an inhomogeneous 3D nuclear distribution of TFs and by variation in DNA binding affinity such that locally elevated TF concentration allows low-affinity binding sites to be functional.
Affiliation(s)
- Judith F Kribelbauer
- Department of Biological Sciences, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10031, USA
| | - Chaitanya Rastogi
- Department of Biological Sciences, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10031, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY 10027, USA; Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10031, USA
| | - Richard S Mann
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10031, USA; Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY 10031, USA; Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, USA
| |
|
24
|
Lai X, Stigliani A, Vachon G, Carles C, Smaczniak C, Zubieta C, Kaufmann K, Parcy F. Building Transcription Factor Binding Site Models to Understand Gene Regulation in Plants. MOLECULAR PLANT 2019; 12:743-763. [PMID: 30447332 DOI: 10.1016/j.molp.2018.10.010] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 06/29/2018] [Revised: 09/20/2018] [Accepted: 10/30/2018] [Indexed: 06/09/2023]
Abstract
Transcription factors (TFs) are key cellular components that control gene expression. They recognize specific DNA sequences, the TF binding sites (TFBSs), and thus are targeted to specific regions of the genome where they can recruit transcriptional co-factors and/or chromatin regulators to fine-tune spatiotemporal gene regulation. Therefore, the identification of TFBSs in genomic sequences and their subsequent quantitative modeling is of crucial importance for understanding and predicting gene expression. Here, we review how TFBSs can be determined experimentally, how the TFBS models can be constructed in silico, and how they can be optimized by taking into account features such as position interdependence within TFBSs, DNA shape, and/or by introducing state-of-the-art computational algorithms such as deep learning methods. In addition, we discuss the integration of context variables into the TFBS modeling, including nucleosome positioning, chromatin states, methylation patterns, 3D genome architectures, and TF cooperative binding, in order to better predict TF binding under cellular contexts. Finally, we explore the possibilities of combining the optimized TFBS model with technological advances, such as targeted TFBS perturbation by CRISPR, to better understand gene regulation, evolution, and plant diversity.
Affiliation(s)
- Xuelei Lai
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France.
| | - Arnaud Stigliani
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Gilles Vachon
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Cristel Carles
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Cezary Smaczniak
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin, Germany
| | - Chloe Zubieta
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France
| | - Kerstin Kaufmann
- Department for Plant Cell and Molecular Biology, Institute for Biology, Humboldt-Universität zu Berlin, Berlin, Germany
| | - François Parcy
- CNRS, Univ. Grenoble Alpes, CEA, INRA, BIG-LPCV, 38000 Grenoble, France.
| |
|
25
|
Zhang S, Liang Y, Wang X, Su Z, Chen Y. FisherMP: fully parallel algorithm for detecting combinatorial motifs from large ChIP-seq datasets. DNA Res 2019; 26:231-242. [PMID: 30957858 PMCID: PMC6589551 DOI: 10.1093/dnares/dsz004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/11/2018] [Accepted: 03/05/2019] [Indexed: 11/14/2022]
Abstract
Detecting binding motifs of combinatorial transcription factors (TFs) from chromatin immunoprecipitation sequencing (ChIP-seq) experiments is an important and challenging computational problem for understanding gene regulation. Although a number of motif-finding algorithms have been presented, most are either time consuming or have sub-optimal accuracy for processing large-scale datasets. In this article, we present a fully parallelized algorithm for detecting combinatorial motifs from ChIP-seq datasets by using the Fisher combined method and an OpenMP parallel design. Large-scale validations on both synthetic data and 350 ChIP-seq datasets from the ENCODE database showed that FisherMP is not only fast on large datasets but also highly accurate when compared with multiple popular methods. By using FisherMP, we successfully detected combinatorial motifs of CTCF, YY1, MAZ, STAT3 and USF2 in chromosome X, suggesting that they are functional co-players in gene regulation and chromosomal organization. Integrative and statistical analysis of these TF-binding peaks clearly demonstrates that they are not only highly coordinated with each other, but also correlated with histone modifications. FisherMP can be applied for integrative analysis of binding motifs and for predicting cis-regulatory modules from a large number of ChIP-seq datasets.
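For readers unfamiliar with the Fisher combined method named in this abstract, the following is a minimal sketch of the textbook version of that test. It is not taken from FisherMP; how FisherMP applies the statistic to motif pairs is not described here, and the function name is hypothetical.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has the closed form
        exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    which is evaluated below without external libraries.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!, built incrementally
        total += term
    return math.exp(-half_stat) * total

print(fisher_combined_pvalue([0.05, 0.05]))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form.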
Affiliation(s)
- Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
| | - Ying Liang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
| | - Xiangyun Wang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
| | - Zhengchang Su
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin, China
- Department of Bioinformatics and Genomics, the University of North Carolina at Charlotte, NC, USA
| | - Yong Chen
- Department of Biological Sciences, Center for Systems Biology, the University of Texas at Dallas, Richardson, TX, USA
| |
|
26
|
Kinney JB, McCandlish DM. Massively Parallel Assays and Quantitative Sequence-Function Relationships. Annu Rev Genomics Hum Genet 2019; 20:99-127. [PMID: 31091417 DOI: 10.1146/annurev-genom-083118-014845] [Citation(s) in RCA: 96] [Impact Index Per Article: 16.0] [Indexed: 11/09/2022]
Abstract
Over the last decade, a rich variety of massively parallel assays have revolutionized our understanding of how biological sequences encode quantitative molecular phenotypes. These assays include deep mutational scanning, high-throughput SELEX, and massively parallel reporter assays. Here, we review these experimental methods and how the data they produce can be used to quantitatively model sequence-function relationships. In doing so, we touch on a diverse range of topics, including the identification of clinically relevant genomic variants, the modeling of transcription factor binding to DNA, the functional and evolutionary landscapes of proteins, and cis-regulatory mechanisms in both transcription and mRNA splicing. We further describe a unified conceptual framework and a core set of mathematical modeling strategies that studies in these diverse areas can make use of. Finally, we highlight key aspects of experimental design and mathematical modeling that are important for the results of such studies to be interpretable and reproducible.
Affiliation(s)
- Justin B Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| | - David M McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
| |
|
27
|
Samee MAH, Bruneau BG, Pollard KS. A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs. Cell Syst 2019; 8:27-42.e6. [PMID: 30660610 PMCID: PMC6368855 DOI: 10.1016/j.cels.2018.12.001] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Received: 03/09/2018] [Revised: 08/18/2018] [Accepted: 12/03/2018] [Indexed: 12/17/2022]
Abstract
DNA shape adds specificity to sequence motifs but has not been explored systematically outside this context. We hypothesized that DNA-binding proteins (DBPs) preferentially occupy DNA with specific structures ("shape motifs") regardless of whether or not these correspond to high information content sequence motifs. We present ShapeMF, a Gibbs sampling algorithm that identifies de novo shape motifs. Using binding data from hundreds of in vivo and in vitro experiments, we show that most DBPs have shape motifs and can occupy these in the absence of sequence motifs. This "shape-only binding" is common for many DBPs and in regions co-bound by multiple DBPs. When shape and sequence motifs co-occur, they can be overlapping, flanking, or separated by consistent spacing. Finally, DBPs within the same protein family have different shape motifs, explaining their distinct genome-wide occupancy despite having similar sequence motifs. These results suggest that shape motifs not only complement sequence motifs but also facilitate recognition of DNA beyond conventionally defined sequence motifs.
Affiliation(s)
| | - Benoit G Bruneau
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Pediatrics and Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Katherine S Pollard
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Epidemiology & Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Chan-Zuckerberg Biohub, San Francisco, CA 94158, USA.
| |
|
28
|
Specificity landscapes unmask submaximal binding site preferences of transcription factors. Proc Natl Acad Sci U S A 2018; 115:E10586-E10595. [PMID: 30341220 DOI: 10.1073/pnas.1811431115] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 01/08/2023]
Abstract
We have developed Differential Specificity and Energy Landscape (DiSEL) analysis to comprehensively compare DNA-protein interactomes (DPIs) obtained by high-throughput experimental platforms and cutting edge computational methods. While high-affinity DNA binding sites are identified by most methods, DiSEL uncovered nuanced sequence preferences displayed by homologous transcription factors. Pairwise analysis of 726 DPIs uncovered homolog-specific differences at moderate- to low-affinity binding sites (submaximal sites). DiSEL analysis of variants of 41 transcription factors revealed that many disease-causing mutations result in allele-specific changes in binding site preferences. We focused on a set of highly homologous factors that have different biological roles but "read" DNA using identical amino acid side chains. Rather than direct readout, our results indicate that DNA noncontacting side chains allosterically contribute to sculpt distinct sequence preferences among closely related members of transcription factor families.
|
29
|
Sasse A, Laverty KU, Hughes TR, Morris QD. Motif models for RNA-binding proteins. Curr Opin Struct Biol 2018; 53:115-123. [PMID: 30172081 DOI: 10.1016/j.sbi.2018.08.001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 07/05/2018] [Accepted: 08/07/2018] [Indexed: 01/24/2023]
Abstract
Identifying the binding preferences of RNA-binding proteins (RBPs) is important in understanding their contribution to post-transcriptional regulation. Here, we review the current state of the art of RNA motif identification tools for RBPs. New in vivo and in vitro data sets provide sufficient statistical power to enable detection of relatively long and complex sequence and sequence-structure binding preferences, and recent computational methods are geared towards quantitative identification of these patterns. We classify methods by their motif model's representational power and describe the underlying considerations for RNA-protein interactions. All classical motif identification algorithms apply physically motivated architectures consisting of a motif and an occupancy model; we call these explicit motif models. Recent methods, such as convolutional neural networks and support vector machines, abandon the classical architecture and implicitly model RNA binding without defining a motif model. Although they achieve high accuracy on held-out data, they may be unsuitable for the ultimate goal of the field: using motifs trained on in vitro data to predict in vivo binding sites. For this task, methods need to separate intrinsic binding preferences from the cellular effects of protein and RNA concentrations, cooperativity, and competition. To tackle this problem, we advocate for the use of a 'three-layer' architecture, consisting of motif model, occupancy model, and extrinsic factor model, which enables separation and adjustment to cellular conditions.
Affiliation(s)
- Alexander Sasse
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
| | - Kaitlin U Laverty
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
| | - Timothy R Hughes
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada; Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada; Canadian Institute for Advanced Research, MaRS Centre, West Tower, 661 University Avenue, Suite 505, Toronto, ON M5G 1M1, Canada
| | - Quaid D Morris
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada; Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada; Department of Computer Science, University of Toronto, Toronto, ON M5T 3A1, Canada
| |
|
30
|
Abstract
Transcription factors (TFs) control gene expression by binding to genomic DNA in a sequence-specific manner. Mutations in TF binding sites are increasingly found to be associated with human disease, yet we currently lack robust methods to predict these sites. Here, we developed a versatile maximum likelihood framework named No Read Left Behind (NRLB) that infers a biophysical model of protein-DNA recognition across the full affinity range from a library of in vitro selected DNA binding sites. NRLB predicts human Max homodimer binding in near-perfect agreement with existing low-throughput measurements. It can capture the specificity of the p53 tetramer and distinguish multiple binding modes within a single sample. Additionally, we confirm that newly identified low-affinity enhancer binding sites are functional in vivo, and that their contribution to gene expression matches their predicted affinity. Our results establish a powerful paradigm for identifying protein binding sites and interpreting gene regulatory sequences in eukaryotic genomes.
|
31
|
Ruan S, Stormo GD. Comparison of discriminative motif optimization using matrix and DNA shape-based models. BMC Bioinformatics 2018; 19:86. [PMID: 29510689 PMCID: PMC5840810 DOI: 10.1186/s12859-018-2104-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/09/2017] [Accepted: 03/01/2018] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Transcription factor (TF) binding site specificity is commonly represented by some form of matrix model in which the positions in the binding site are assumed to contribute independently to the site's activity. The independence assumption is known to be an approximation, often a good one but sometimes poor. Alternative approaches have been developed that use k-mers (DNA "words" of length k) to account for the non-independence, and more recently DNA structural parameters have been incorporated into the models. ChIP-seq data are often used to assess the discriminatory power of motifs and to compare different models. However, to measure the improvement due to using more complex models, one must compare to optimized matrix models. RESULTS: We describe a program "Discriminative Additive Model Optimization" (DAMO) that uses positive and negative examples, as in ChIP-seq data, and finds the additive position weight matrix (PWM) that maximizes the Area Under the Receiver Operating Characteristic Curve (AUROC). We compare to a recent study where structural parameters, serving as features in a gradient boosting classifier algorithm, are shown to improve the AUROC over JASPAR position frequency matrices (PFMs). In agreement with the previous results, we find that adding structural parameters gives the largest improvement, but most of the gain can be obtained by an optimized PWM and nearly all of the gain can be obtained with a di-nucleotide extension to the PWM. CONCLUSION: To appropriately compare different models for TF binding sites, optimized models must be used. PWMs and their extensions are good representations of binding specificity for most TFs, and more complex models, including the incorporation of DNA shape features and gradient boosting classifiers, provide only moderate improvements for a few TFs.
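The objective described in this abstract, an additive PWM judged by AUROC on positive and negative examples, can be sketched as follows. This only evaluates the objective for a fixed matrix; it does not reproduce DAMO's optimization procedure, and the toy weights and sequences are made up for illustration.

```python
def additive_score(pwm, site):
    """Score a site as the sum of independent per-position weights."""
    return sum(pwm[i][base] for i, base in enumerate(site))

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which equals the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 3-position weight matrix (one dict of base weights per position)
toy_pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": -1.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": -1.0},
    {"A": -1.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
positives = ["ACG", "ACT", "TCG"]  # stand-ins for bound sequences
negatives = ["TTT", "ACA", "GGA"]  # stand-ins for unbound sequences

pos_scores = [additive_score(toy_pwm, s) for s in positives]
neg_scores = [additive_score(toy_pwm, s) for s in negatives]
print(auroc(pos_scores, neg_scores))
```

An optimizer in the spirit of the abstract would adjust the entries of the matrix to push this value toward 1; the pairwise loop is quadratic and is meant for small illustrative inputs only.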
Affiliation(s)
- Shuxiang Ruan
- Department of Genetics and Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, 63110 USA
| | - Gary D. Stormo
- Department of Genetics and Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, 63110 USA
| |
|
32
|
Rube HT, Rastogi C, Kribelbauer JF, Bussemaker HJ. A unified approach for quantifying and interpreting DNA shape readout by transcription factors. Mol Syst Biol 2018; 14:e7902. [PMID: 29472273 PMCID: PMC5822049 DOI: 10.15252/msb.20177902] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 07/28/2017] [Revised: 01/26/2018] [Accepted: 01/31/2018] [Indexed: 01/07/2023]
Abstract
Transcription factors (TFs) interpret DNA sequence by probing the chemical and structural properties of the nucleotide polymer. DNA shape is thought to enable a parsimonious representation of dependencies between nucleotide positions. Here, we propose a unified mathematical representation of the DNA sequence dependence of shape and TF binding, respectively, which simplifies and enhances analysis of shape readout. First, we demonstrate that linear models based on mononucleotide features alone account for 60-70% of the variance in minor groove width, roll, helix twist, and propeller twist. This explains why simple scoring matrices that ignore all dependencies between nucleotide positions can partially account for DNA shape readout by a TF. Adding dinucleotide features as sequence-to-shape predictors to our model, we can almost perfectly explain the shape parameters. Building on this observation, we developed a post hoc analysis method that can be used to analyze any mechanism-agnostic protein-DNA binding model in terms of shape readout. Our insights provide an alternative strategy for using DNA shape information to enhance our understanding of how cis-regulatory codes are interpreted by the cellular machinery.
Affiliation(s)
- H Tomas Rube
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Chaitanya Rastogi
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Program in Applied Physics and Applied Mathematics, Columbia University, New York, NY, USA
| | - Judith F Kribelbauer
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
| |
|
33
|
Zhang L, Martini GD, Rube HT, Kribelbauer JF, Rastogi C, FitzPatrick VD, Houtman JC, Bussemaker HJ, Pufall MA. SelexGLM differentiates androgen and glucocorticoid receptor DNA-binding preference over an extended binding site. Genome Res 2017; 28:111-121. [PMID: 29196557 PMCID: PMC5749176 DOI: 10.1101/gr.222844.117] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 03/14/2017] [Accepted: 11/22/2017] [Indexed: 11/28/2022]
Abstract
The DNA-binding interfaces of the androgen (AR) and glucocorticoid (GR) receptors are virtually identical, yet these transcription factors share only about a third of their genomic binding sites and regulate similarly distinct sets of target genes. To address this paradox, we determined the intrinsic specificities of the AR and GR DNA-binding domains using a refined version of SELEX-seq. We developed an algorithm, SelexGLM, that quantifies binding specificity over a large (31-bp) binding site by iteratively fitting a feature-based generalized linear model to SELEX probe counts. This analysis revealed that the DNA-binding preferences of AR and GR homodimers differ significantly, both within and outside the 15-bp core binding site. The relative preference between the two factors can be tuned over a wide range by changing the DNA sequence, with AR more sensitive to sequence changes than GR. The specificity of AR extends to the regions flanking the core 15-bp site, where isothermal calorimetry measurements reveal that affinity is augmented by enthalpy-driven readout of poly(A) sequences associated with narrowed minor groove width. We conclude that the increased specificity of AR is correlated with more enthalpy-driven binding than GR. The binding models help explain differences in AR and GR genomic binding and provide a biophysical rationale for how promiscuous binding by GR allows functional substitution for AR in some castration-resistant prostate cancers.
Affiliation(s)
- Liyang Zhang
- Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
| | - Gabriella D Martini
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - H Tomas Rube
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - Judith F Kribelbauer
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - Chaitanya Rastogi
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - Vincent D FitzPatrick
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - Jon C Houtman
- Department of Immunology, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, New York 10027, USA; Department of Systems Biology, Columbia University Medical Center, New York, New York 10032, USA
| | - Miles A Pufall
- Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
| |
|
34
|
Inherent limitations of probabilistic models for protein-DNA binding specificity. PLoS Comput Biol 2017; 13:e1005638. [PMID: 28686588 PMCID: PMC5521849 DOI: 10.1371/journal.pcbi.1005638] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/12/2017] [Revised: 07/21/2017] [Accepted: 06/21/2017] [Indexed: 01/10/2023]
Abstract
The specificities of transcription factors are most commonly represented with probabilistic models. These models provide a probability for each base occurring at each position within the binding site and the positions are assumed to contribute independently. The model is simple and intuitive and is the basis for many motif discovery algorithms. However, the model also has inherent limitations that prevent it from accurately representing true binding probabilities, especially for the highest affinity sites under conditions of high protein concentration. The limitations are not due to the assumption of independence between positions but rather are caused by the non-linear relationship between binding affinity and binding probability and the fact that independent normalization at each position skews the site probabilities. Generally probabilistic models are reasonably good approximations, but new high-throughput methods allow for biophysical models with increased accuracy that should be used whenever possible.
Transcription factors (TFs), a class of DNA-binding proteins, play a central role in the regulation of gene expression. TFs control the rate of transcription by binding to the genome in a sequence-specific manner. Thus, one important aspect in the study of gene regulation mechanism is to model the binding specificities of TFs, namely the features of the DNA sequences that a TF prefers to bind. Multiple models have been proposed to characterize the binding specificities of TFs, among which the class of probabilistic models is the most popular. In this study, we point out several major limitations of the well-established probabilistic model by comparing it with the biophysical model. Through simulations we demonstrate that the probabilistic model is only an approximation of the biophysical model. The latter has most of the advantages of the former, and is a more accurate representation of binding specificities. We propose a shift from the probabilistic model to the biophysical model in future studies of protein-DNA interactions.
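The non-linear relationship between binding affinity and binding probability mentioned in this abstract can be illustrated with the standard two-state occupancy formula. This is a sketch under assumptions: the logistic form, the energy units (kT), and the use of a chemical potential `mu` as a stand-in for protein concentration are textbook conventions chosen here for illustration, not details quoted from the paper.

```python
import math

def occupancy(energy, mu):
    """Equilibrium binding probability of one site.

    energy: binding energy in kT units (lower means tighter binding).
    mu: chemical potential in kT units (rises with protein concentration).
    """
    return 1.0 / (1.0 + math.exp(energy - mu))

def boltzmann_ratio(e_strong, e_weak):
    """Affinity ratio implied by the energies alone."""
    return math.exp(e_weak - e_strong)

def occupancy_ratio(e_strong, e_weak, mu):
    """Ratio of binding probabilities at a given protein level."""
    return occupancy(e_strong, mu) / occupancy(e_weak, mu)

e_strong, e_weak = 0.0, 2.0
for mu in (-10.0, 0.0, 5.0):
    print(mu, boltzmann_ratio(e_strong, e_weak), occupancy_ratio(e_strong, e_weak, mu))
```

At low `mu` the occupancy ratio tracks the affinity ratio closely, while at high `mu` both sites approach saturation and the ratio collapses toward 1, which is the regime where frequency-based models would misrepresent the strongest sites.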
|