1
|
Kabir A, Bhattarai M, Rasmussen KØ, Shehu A, Bishop AR, Alexandrov B, Usheva A. Advancing Transcription Factor Binding Site Prediction Using DNA Breathing Dynamics and Sequence Transformers via Cross Attention. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.16.575935. [PMID: 38293094 PMCID: PMC10827174 DOI: 10.1101/2024.01.16.575935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2024]
Abstract
Understanding the impact of genomic variants on transcription factor binding and gene regulation remains a key area of research, with implications for unraveling the complex mechanisms underlying various functional effects. Our study delves into the role of DNA's biophysical properties, including thermodynamic stability, shape, and flexibility in transcription factor (TF) binding. We developed a multi-modal deep learning model integrating these properties with DNA sequence data. Trained on ChIP-Seq (chromatin immunoprecipitation sequencing) data in vivo involving 690 TF-DNA binding events in human genome, our model significantly improves prediction performance in over 660 binding events, with up to 9.6% increase in AUROC metric compared to the baseline model when using no DNA biophysical properties explicitly. Further, we expanded our analysis to in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (SELEX) and Protein Binding Microarray (PBM) datasets, comparing our model with established frameworks. The inclusion of DNA breathing features consistently improved TF binding predictions across different cell lines in these datasets. Notably, for complex ChIP-Seq datasets, integrating DNABERT2 with a cross-attention mechanism provided greater predictive capabilities and insights into the mechanisms of disease-related non-coding variants found in genome-wide association studies. This work highlights the importance of DNA biophysical characteristics in TF binding and the effectiveness of multi-modal deep learning models in gene regulation studies.
Collapse
Affiliation(s)
- Anowarul Kabir
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
- Department of Computer Science, George Mason University, 4400 University Dr, 22030, VA, USA
| | - Manish Bhattarai
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
| | - Kim Ø Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, 4400 University Dr, 22030, VA, USA
| | - Alan R Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
| | - Boian Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544, NM, USA
| | - Anny Usheva
- Department of Surgery, Brown University, 69 Brown St Box 1822, 02912, RI, USA
| |
Collapse
|
2
|
Li J, Chiu TP, Rohs R. Predicting DNA structure using a deep learning method. Nat Commun 2024; 15:1243. [PMID: 38336958 PMCID: PMC10858265 DOI: 10.1038/s41467-024-45191-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Accepted: 01/17/2024] [Indexed: 02/12/2024] Open
Abstract
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA structure, also described as DNA shape, plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, DNA structural features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing an understanding of the effects of flanking regions on DNA structure in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as versatile and powerful tool for diverse DNA structure-related studies.
Collapse
Affiliation(s)
- Jinsen Li
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
| | - Tsu-Pei Chiu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
| | - Remo Rohs
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA.
- Department of Chemistry, University of Southern California, Los Angeles, CA, 90089, USA.
- Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA.
- Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA, 90089, USA.
| |
Collapse
|
3
|
Li J, Chiu TP, Rohs R. Deep DNAshape: Predicting DNA shape considering extended flanking regions using a deep learning method. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.22.563383. [PMID: 37961633 PMCID: PMC10634709 DOI: 10.1101/2023.10.22.563383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Abstract
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA shape plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k -mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, refined DNA shape features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing a deeper understanding of the effects of flanking regions on DNA shape in a target region of a sequence. Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
Collapse
|
4
|
Alexandari AM, Horton CA, Shrikumar A, Shah N, Li E, Weilert M, Pufall MA, Zeitlinger J, Fordyce PM, Kundaje A. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein-DNA binding. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.11.540401. [PMID: 37214836 PMCID: PMC10197627 DOI: 10.1101/2023.05.11.540401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
Transcription factors (TF) are proteins that bind DNA in a sequence-specific manner to regulate gene transcription. Despite their unique intrinsic sequence preferences, in vivo genomic occupancy profiles of TFs differ across cellular contexts. Hence, deciphering the sequence determinants of TF binding, both intrinsic and context-specific, is essential to understand gene regulation and the impact of regulatory, non-coding genetic variation. Biophysical models trained on in vitro TF binding assays can estimate intrinsic affinity landscapes and predict occupancy based on TF concentration and affinity. However, these models cannot adequately explain context-specific, in vivo binding profiles. Conversely, deep learning models, trained on in vivo TF binding assays, effectively predict and explain genomic occupancy profiles as a function of complex regulatory sequence syntax, albeit without a clear biophysical interpretation. To reconcile these complementary models of in vitro and in vivo TF binding, we developed Affinity Distillation (AD), a method that extracts thermodynamic affinities de-novo from deep learning models of TF chromatin immunoprecipitation (ChIP) experiments by marginalizing away the influence of genomic sequence context. Applied to neural networks modeling diverse classes of yeast and mammalian TFs, AD predicts energetic impacts of sequence variation within and surrounding motifs on TF binding as measured by diverse in vitro assays with superior dynamic range and accuracy compared to motif-based methods. Furthermore, AD can accurately discern affinities of TF paralogs. Our results highlight thermodynamic affinity as a key determinant of in vivo binding, suggest that deep learning models of in vivo binding implicitly learn high-resolution affinity landscapes, and show that these affinities can be successfully distilled using AD. This new biophysical interpretation of deep learning models enables high-throughput in silico experiments to explore the influence of sequence context and variation on both intrinsic affinity and in vivo occupancy.
Collapse
Affiliation(s)
- Amr M. Alexandari
- Department of Computer Science, Stanford University, Stanford, CA 94305
| | | | - Avanti Shrikumar
- Department of Earth System Science, Stanford University, Stanford, CA 94305
| | - Nilay Shah
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Eileen Li
- Department of Genetics, Stanford University, Stanford, CA 94305
| | - Melanie Weilert
- Stowers Institute for Medical Research, Kansas City, MO, USA
| | - Miles A. Pufall
- Department of Biochemistry, Carver College of Medicine, University of Iowa, Iowa City, Iowa 52242, USA
| | - Julia Zeitlinger
- Stowers Institute for Medical Research, Kansas City, MO, USA
- The University of Kansas Medical Center, Kansas City, KS, USA
| | - Polly M. Fordyce
- Department of Genetics, Stanford University, Stanford, CA 94305
- Department of Bioengineering, Stanford University, Stanford, CA 94305
- ChEM-H Institute, Stanford University, Stanford, CA 94305
- Chan Zuckerberg Biohub, San Francisco, CA 94110
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA 94305
- Department of Genetics, Stanford University, Stanford, CA 94305
| |
Collapse
|
5
|
Chen Z, Wei Q. Developing an Improved Survival Prediction Model for Disease Prognosis. Biomolecules 2022; 12:biom12121751. [PMID: 36551179 PMCID: PMC9775036 DOI: 10.3390/biom12121751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2022] [Revised: 11/18/2022] [Accepted: 11/23/2022] [Indexed: 11/27/2022] Open
Abstract
Machine learning has become an important research field in genetics and molecular biology. Survival analysis using machine learning can provide an important computed-aid clinical research scheme for evaluating tumor treatment options. However, the genomic features are high-dimensional, which limits the prediction performance of the survival learning model. Therefore, in this paper, we propose an improved survival prediction model using a deep forest and self-supervised learning. It uses a deep survival forest to perform adaptive learning of high-dimensional genomic data and ensure robustness. In addition, self-supervised learning, as a semi-supervised learning style, is designed to utilize unlabeled samples to improve model performance. Based on four cancer datasets from The Cancer Genome Atlas (TCGA), the experimental results show that our proposed method outperforms four advanced survival analysis methods in terms of the C-index and brier score. The developed prediction model will help doctors rethink patient characteristics' relevance to survival time and personalize treatment decisions.
Collapse
|
6
|
Boytsov A, Abramov S, Makeev VJ, Kulakovskiy IV. Positional weight matrices have sufficient prediction power for analysis of noncoding variants. F1000Res 2022; 11:33. [PMID: 35811788 PMCID: PMC9237556 DOI: 10.12688/f1000research.75471.3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 06/30/2022] [Indexed: 11/23/2022] Open
Abstract
The position weight matrix, also called the position-specific scoring matrix, is the commonly accepted model to quantify the specificity of transcription factor binding to DNA. Position weight matrices are used in thousands of projects and software tools in regulatory genomics, including computational prediction of the regulatory impact of single-nucleotide variants. Yet, recently Yan et al. reported that "the position weight matrices of most transcription factors lack sufficient predictive power" if applied to the analysis of regulatory variants studied with a newly developed experimental method, SNP-SELEX. Here, we re-analyze the rich experimental dataset obtained by Yan et al. and show that appropriately selected position weight matrices in fact can adequately quantify transcription factor binding to alternative alleles.
Collapse
Affiliation(s)
- Alexandr Boytsov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
- Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
| | - Sergey Abramov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
- Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
| | - Vsevolod J. Makeev
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
- Moscow Institute of Physics and Technology, Dolgoprudny, 141700, Russian Federation
| | - Ivan V. Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, 119991, Russian Federation
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, 142290, Russian Federation
| |
Collapse
|
7
|
Zhang Y, Mo Q, Xue L, Luo J. Evaluation of deep learning approaches for modeling transcription factor sequence specificity. Genomics 2021; 113:3774-3781. [PMID: 34534646 DOI: 10.1016/j.ygeno.2021.09.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2021] [Revised: 07/19/2021] [Accepted: 09/11/2021] [Indexed: 11/16/2022]
Abstract
As a key component of gene regulation, transcription factors (TFs) play an important role in a number of biological processes. To fully understand the underlying mechanism of TF-mediated gene regulation, it is therefore critical to accurately identify TF binding sites and predict their affinities. Recently, deep learning (DL) algorithms have achieved promising results in the prediction of DNA-TF binding, however, various deep learning architectures have not been systematically compared, and the relative merit of each architecture remains unclear. To address this problem, we applied four different deep learning architectures to SELEX-seq and HT-SELEX data, covering three species and 35 families. We evaluated and compared the performance of different deep neural models using 10-fold cross-validation. Our results indicate that the hybrid CNN + DNN model shows the best performances. We expect that our study will be broadly applicable to modeling and predicting TF binding specificity when more high-throughput affinity data are available.
Collapse
Affiliation(s)
- Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China
| | - Qi Mo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou 646000, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China; Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou 646000, China; Sichuan Key Medical Laboratory of New Drug Discovery and Druggability Evaluation, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Southwest Medical University, Luzhou 646000, China.
| |
Collapse
|
8
|
Yao Q, Ferragina P, Reshef Y, Lettre G, Bauer DE, Pinello L. Motif-Raptor: a cell type-specific and transcription factor centric approach for post-GWAS prioritization of causal regulators. Bioinformatics 2021; 37:2103-2111. [PMID: 33532840 PMCID: PMC11025460 DOI: 10.1093/bioinformatics/btab072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2020] [Revised: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Genome-wide association studies (GWASs) have identified thousands of common trait-associated genetic variants but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding on how this mechanism contributes to phenotype. RESULTS We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated. AVAILABILITY AND IMPLEMENTATION Motif-Raptor is freely available as a Python package at: https://github.com/pinellolab/MotifRaptor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Qiuming Yao
- Department of Pathology, Massachusetts General Hospital, Charlestown, MA 02129, USA
- Division of Hematology/Oncology, Boston Children’s Hospital, Boston, MA 02115, USA
- Harvard Medical School, Boston, MA 02115, USA
| | - Paolo Ferragina
- Department of Computer Science, University of Pisa, Pisa 56128, Italy
| | - Yakir Reshef
- Department of Computer Science, Harvard University, Cambridge, MA 02138, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Guillaume Lettre
- Faculty of Medicine, Université de Montréal, Montreal, Quebec H3C3J7, Canada
- Montreal Heart Institute, Montreal, Quebec H1T1C8, Canada
| | - Daniel E Bauer
- Division of Hematology/Oncology, Boston Children’s Hospital, Boston, MA 02115, USA
- Harvard Medical School, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
| | - Luca Pinello
- Department of Pathology, Massachusetts General Hospital, Charlestown, MA 02129, USA
- Harvard Medical School, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| |
Collapse
|
9
|
Yin C, Chen Z. Developing Sustainable Classification of Diseases via Deep Learning and Semi-Supervised Learning. Healthcare (Basel) 2020; 8:E291. [PMID: 32846941 PMCID: PMC7551840 DOI: 10.3390/healthcare8030291] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Revised: 08/19/2020] [Accepted: 08/20/2020] [Indexed: 01/07/2023] Open
Abstract
Disease classification based on machine learning has become a crucial research topic in the fields of genetics and molecular biology. Generally, disease classification involves a supervised learning style; i.e., it requires a large number of labelled samples to achieve good classification performance. However, in the majority of the cases, labelled samples are hard to obtain, so the amount of training data are limited. However, many unclassified (unlabelled) sequences have been deposited in public databases, which may help the training procedure. This method is called semi-supervised learning and is very useful in many applications. Self-training can be implemented using high- to low-confidence samples to prevent noisy samples from affecting the robustness of semi-supervised learning in the training process. The deep forest method with the hyperparameter settings used in this paper can achieve excellent performance. Therefore, in this work, we propose a novel combined deep learning model and semi-supervised learning with self-training approach to improve the performance in disease classification, which utilizes unlabelled samples to update a mechanism designed to increase the number of high-confidence pseudo-labelled samples. The experimental results show that our proposed model can achieve good performance in disease classification and disease-causing gene identification.
Collapse
Affiliation(s)
- Chunwu Yin
- School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China;
| | - Zhanbo Chen
- School of Information and Statistics, Guangxi University of Finance and Economics, Nanning 530003, China
- Center of Guangxi Cooperative Innovation for Education Performance Assessment, Guangxi University of Finance and Economics, Nanning 530003, China
| |
Collapse
|
10
|
Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. BIOCHIMICA ET BIOPHYSICA ACTA. GENE REGULATORY MECHANISMS 2020; 1863:194443. [PMID: 31639474 PMCID: PMC7166147 DOI: 10.1016/j.bbagrm.2019.194443] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/30/2019] [Revised: 09/21/2019] [Accepted: 10/06/2019] [Indexed: 12/14/2022]
Abstract
Transcription factors (TFs) selectively bind distinct sets of sites in different cell types. Such cell type-specific binding specificity is expected to result from interplay between the TF's intrinsic sequence preferences, cooperative interactions with other regulatory proteins, and cell type-specific chromatin landscapes. Cell type-specific TF binding events are highly correlated with patterns of chromatin accessibility and active histone modifications in the same cell type. However, since concurrent chromatin may itself be a consequence of TF binding, chromatin landscapes measured prior to TF activation provide more useful insights into how cell type-specific TF binding events became established in the first place. Here, we review the various sequence and chromatin determinants of cell type-specific TF binding specificity. We identify the current challenges and opportunities associated with computational approaches to characterizing, imputing, and predicting cell type-specific TF binding patterns. We further focus on studies that characterize TF binding in dynamic regulatory settings, and we discuss how these studies are leading to a more complex and nuanced understanding of dynamic protein-DNA binding activities. We propose that TF binding activities at individual sites can be viewed along a two-dimensional continuum of local sequence and chromatin context. Under this view, cell type-specific TF binding activities may result from either strongly favorable sequence features or strongly favorable chromatin context.
Collapse
Affiliation(s)
- Divyanshi Srivastava
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
| | - Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America.
| |
Collapse
|
11
|
Wang Z, He W, Tang J, Guo F. Identification of Highest-Affinity Binding Sites of Yeast Transcription Factor Families. J Chem Inf Model 2020; 60:1876-1883. [DOI: 10.1021/acs.jcim.9b01012] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Affiliation(s)
- Zongyu Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Wenying He
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300072, P. R. China
- Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
Collapse
|
12
|
Li G, Quan Y, Wang X, Liu R, Bie L, Gao J, Zhang HY. Trinucleotide Base Pair Stacking Free Energy for Understanding TF-DNA Recognition and the Functions of SNPs. Front Chem 2019; 6:666. [PMID: 30713839 PMCID: PMC6345724 DOI: 10.3389/fchem.2018.00666] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2018] [Accepted: 12/21/2018] [Indexed: 01/03/2023] Open
Abstract
Single nucleotide polymorphisms (SNPs) affect base pair stacking, which is the primary factor for maintaining the stability of DNA. However, the mechanism of how SNPs lead to phenotype variations is still unclear. In this work, we connected SNPs and base pair stacking by a 3-mer base pair stacking free energy matrix. The SNPs with large base pair stacking free energy differences led to phenotype variations. A molecular dynamics (MD) simulation was then applied. Our results showed that base pair stacking played an important role in the transcription factor (TF)-DNA interaction. Changes in DNA structure mainly originate from TF-DNA interactions, and with the increased base pair stacking free energy, the structure of DNA approaches its free type, although its binding affinity was increased by the SNP. In addition, quantitative models using base pair stacking features revealed that base pair stacking can be used to predict TF binding specificity. As such, our work combined knowledge from bioinformatics and structural biology and provided a new understanding of the relationship between SNPs and phenotype variations. The 3-mer base pair stacking free energy matrix is useful in high-throughput screening of SNPs and predicting TF-DNA binding affinity.
Collapse
Affiliation(s)
- Gen Li
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Yuan Quan
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Xiaocong Wang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Rong Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Lihua Bie
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Jun Gao
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Hong-Yu Zhang
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, China
| |
Collapse
|
13
|
Afek A, Tagliafierro L, Glenn OC, Lukatsky DB, Gordan R, Chiba-Falek O. Toward deciphering the mechanistic role of variations in the Rep1 repeat site in the transcription regulation of SNCA gene. Neurogenetics 2018; 19:135-144. [PMID: 29730780 DOI: 10.1007/s10048-018-0546-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2018] [Accepted: 04/25/2018] [Indexed: 12/01/2022]
Abstract
Short structural variants-variants other than single nucleotide polymorphisms-are hypothesized to contribute to many complex diseases, possibly by modulating gene expression. However, the molecular mechanisms by which noncoding short structural variants exert their effects on gene regulation have not been discovered. Here, we study simple sequence repeats (SSRs), a common class of short structural variants. Previously, we showed that repetitive sequences can directly influence the binding of transcription factors to their proximate recognition sites, a mechanism we termed non-consensus binding. In this study, we focus on the SSR termed Rep1, which was associated with Parkinson's disease (PD) and has been implicated in the cis-regulation of the PD-risk SNCA gene. We show that Rep1 acts via the non-consensus binding mechanism to affect the binding of transcription factors from the GATA and ELK families to their specific sites located right next to the Rep1 repeat. Next, we performed an expression analysis to further our understanding regarding the GATA and ELK family members that are potentially relevant for SNCA transcriptional regulation in health and disease. Our analysis indicates a potential role for GATA2, consistent with previous reports. Our study proposes non-consensus transcription factor binding as a potential mechanism through which noncoding repeat variants could exert their pathogenic effects by regulating gene expression.
Collapse
Affiliation(s)
- A Afek
- Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, 27710, USA.,Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27710, USA
| | - L Tagliafierro
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27710, USA.,Department of Neurology, Duke University Medical Center, Durham, NC, 27710, USA
| | - O C Glenn
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27710, USA.,Department of Neurology, Duke University Medical Center, Durham, NC, 27710, USA
| | - D B Lukatsky
- Department of Chemistry, Ben-Gurion University of the Negev, 8410501, Beersheba, Israel
| | - R Gordan
- Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, 27710, USA. .,Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27710, USA. .,Department of Computer Science, Duke University, Durham, NC, 27708, USA.
| | - O Chiba-Falek
- Center for Genomic and Computational Biology, Duke University Medical Center, Durham, NC, 27710, USA. .,Department of Neurology, Duke University Medical Center, Durham, NC, 27710, USA.
| |
Collapse
|
14
|
Shen N, Zhao J, Schipper JL, Zhang Y, Bepler T, Leehr D, Bradley J, Horton J, Lapp H, Gordan R. Divergence in DNA Specificity among Paralogous Transcription Factors Contributes to Their Differential In Vivo Binding. Cell Syst 2018; 6:470-483.e8. [PMID: 29605182 DOI: 10.1016/j.cels.2018.02.009] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2017] [Revised: 02/13/2018] [Accepted: 02/14/2018] [Indexed: 12/29/2022]
Abstract
Paralogous transcription factors (TFs) are oftentimes reported to have identical DNA-binding motifs, despite the fact that they perform distinct regulatory functions. Differential genomic targeting by paralogous TFs is generally assumed to be due to interactions with protein co-factors or the chromatin environment. Using a computational-experimental framework called iMADS (integrative modeling and analysis of differential specificity), we show that, contrary to previous assumptions, paralogous TFs bind differently to genomic target sites even in vitro. We used iMADS to quantify, model, and analyze specificity differences between 11 TFs from 4 protein families. We found that paralogous TFs have diverged mainly at medium- and low-affinity sites, which are poorly captured by current motif models. We identify sequence and shape features differentially preferred by paralogous TFs, and we show that the intrinsic differences in specificity among paralogous TFs contribute to their differential in vivo binding. Thus, our study represents a step forward in deciphering the molecular mechanisms of differential specificity in TF families.
Collapse
Affiliation(s)
- Ning Shen
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Department of Pharmacology and Cancer Biology, Duke University, Durham, NC 27710, USA; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
| | - Jingkang Zhao
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, USA
| | - Joshua L Schipper
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
| | - Yuning Zhang
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, NC 27708, USA
| | - Tristan Bepler
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA
| | - Dan Leehr
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA
| | - John Bradley
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA
| | - John Horton
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
| | - Hilmar Lapp
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA
| | - Raluca Gordan
- Center for Genomic and Computational Biology, Duke University, Durham, NC 27708, USA; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA; Department of Computer Science, Duke University, Durham, NC 27708, USA; Department of Molecular Genetics and Microbiology, Duke University, Durham, NC 27710, USA.
| |
Collapse
|
15
|
Comprehensive, high-resolution binding energy landscapes reveal context dependencies of transcription factor binding. Proc Natl Acad Sci U S A 2018; 115:E3702-E3711. [PMID: 29588420 PMCID: PMC5910820 DOI: 10.1073/pnas.1715888115] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Transcription factors (TFs) are primary regulators of gene expression in cells, where they bind specific genomic target sites to control transcription. Quantitative measurements of TF-DNA binding energies can improve the accuracy of predictions of TF occupancy and downstream gene expression in vivo and shed light on how transcriptional networks are rewired throughout evolution. Here, we present a sequencing-based TF binding assay and analysis pipeline (BET-seq, for Binding Energy Topography by sequencing) capable of providing quantitative estimates of binding energies for more than one million DNA sequences in parallel at high energetic resolution. Using this platform, we measured the binding energies associated with all possible combinations of 10 nucleotides flanking the known consensus DNA target interacting with two model yeast TFs, Pho4 and Cbf1. A large fraction of these flanking mutations change overall binding energies by an amount equal to or greater than consensus site mutations, suggesting that current definitions of TF binding sites may be too restrictive. By systematically comparing estimates of binding energies output by deep neural networks (NNs) and biophysical models trained on these data, we establish that dinucleotide (DN) specificities are sufficient to explain essentially all variance in observed binding behavior, with Cbf1 binding exhibiting significantly more nonadditivity than Pho4. NN-derived binding energies agree with orthogonal biochemical measurements and reveal that dynamically occupied sites in vivo are both energetically and mutationally distant from the highest affinity sites.
Collapse
|
16
|
Li RY, Di Felice R, Rohs R, Lidar DA. Quantum annealing versus classical machine learning applied to a simplified computational biology problem. NPJ QUANTUM INFORMATION 2018; 4:14. [PMID: 29652405 PMCID: PMC5891835 DOI: 10.1038/s41534-018-0060-8] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/22/2017] [Revised: 12/26/2017] [Accepted: 01/11/2018] [Indexed: 05/17/2023]
Abstract
Transcription factors regulate gene expression, but how these proteins recognize and specifically bind to their DNA targets is still debated. Machine learning models are effective means to reveal interaction mechanisms. Here we studied the ability of a quantum machine learning approach to predict binding specificity. Using simplified datasets of a small number of DNA sequences derived from actual binding affinity experiments, we trained a commercially available quantum annealer to classify and rank transcription factor binding. The results were compared to state-of-the-art classical approaches for the same simplified datasets, including simulated annealing, simulated quantum annealing, multiple linear regression, LASSO, and extreme gradient boosting. Despite technological limitations, we find a slight advantage in classification performance and nearly equal ranking performance using the quantum annealer for these fairly small training data sets. Thus, we propose that quantum annealing might be an effective method to implement machine learning for certain computational biology problems.
Collapse
Affiliation(s)
- Richard Y Li
- Department of Chemistry, University of Southern California, Los Angeles, California 90089, USA
- Computational Biology and Bioinformatics Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Center for Quantum Information Science & Technology, University of Southern California, Los Angeles, California 90089, USA
| | - Rosa Di Felice
- Computational Biology and Bioinformatics Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Physics and Astronomy, University of Southern California, Los Angeles, California 90089, USA
- Center S3, CNR Institute of Nanoscience, Via Campi 213/A, 41125 Modena, Italy
| | - Remo Rohs
- Department of Chemistry, University of Southern California, Los Angeles, California 90089, USA
- Computational Biology and Bioinformatics Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Physics and Astronomy, University of Southern California, Los Angeles, California 90089, USA
- Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Daniel A Lidar
- Department of Chemistry, University of Southern California, Los Angeles, California 90089, USA
- Center for Quantum Information Science & Technology, University of Southern California, Los Angeles, California 90089, USA
- Department of Physics and Astronomy, University of Southern California, Los Angeles, California 90089, USA
- Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089, USA
| |
Collapse
|
17
|
Chiu TP, Rao S, Mann RS, Honig B, Rohs R. Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein-DNA binding. Nucleic Acids Res 2017; 45:12565-12576. [PMID: 29040720 PMCID: PMC5716191 DOI: 10.1093/nar/gkx915] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2017] [Accepted: 09/28/2017] [Indexed: 12/16/2022] Open
Abstract
Protein–DNA binding is a fundamental component of gene regulatory processes, but it is still not completely understood how proteins recognize their target sites in the genome. Besides hydrogen bonding in the major groove (base readout), proteins recognize minor-groove geometry using positively charged amino acids (shape readout). The underlying mechanism of DNA shape readout involves the correlation between minor-groove width and electrostatic potential (EP). To probe this biophysical effect directly, rather than using minor-groove width as an indirect measure for shape readout, we developed a methodology, DNAphi, for predicting EP in the minor groove and confirmed the direct role of EP in protein–DNA binding using massive sequencing data. The DNAphi method uses a sliding-window approach to mine results from non-linear Poisson–Boltzmann (NLPB) calculations on DNA structures derived from all-atom Monte Carlo simulations. We validated this approach, which only requires nucleotide sequence as input, based on direct comparison with NLPB calculations for available crystal structures. Using statistical machine-learning approaches, we showed that adding EP as a biophysical feature can improve the predictive power of quantitative binding specificity models across 27 transcription factor families. High-throughput prediction of EP offers a novel way to integrate biophysical and genomic studies of protein–DNA binding.
Collapse
Affiliation(s)
- Tsu-Pei Chiu
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Satyanarayan Rao
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Richard S Mann
- Departments of Systems Biology and Biochemistry & Molecular Biophysics, Mortimer B. Zuckerman Institute, Columbia University, New York, NY 10032, USA
| | - Barry Honig
- Departments of Systems Biology and Biochemistry & Molecular Biophysics, Mortimer B. Zuckerman Institute, Columbia University, New York, NY 10032, USA.,Howard Hughes Medical Institute, New York, NY 10032, USA
| | - Remo Rohs
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
18
|
Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017; 33:514-521. [PMID: 28011774 DOI: 10.1093/bioinformatics/btw683] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2016] [Accepted: 10/27/2016] [Indexed: 01/09/2023] Open
Abstract
Motivation While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. There has been lots of research of how to learn these models from training data but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has been given less attention. Moreover, motif site analysis is often concerned about how different variants in the sequence affect the sites. So far, though, the corresponding efficient software tools for motif matching have been lacking. Results We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q -mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants , such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and are for multiple motif search much faster than the basic sliding window algorithm. Availability and Implementation Implementations are available as a part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license ( http://www.cs.helsinki.fi/group/pssmfind ). Contact janne.h.korhonen@gmail.com.
Collapse
Affiliation(s)
- Janne H Korhonen
- School of Computer Science, Reykjavík University, Reykjavík, Iceland.,Helsinki Institute for Information Technology HIIT, Helsinki, Finland.,Department of Computer Science
| | - Kimmo Palin
- Genome-Scale Biology Research Program, Research Programs Unit
| | - Jussi Taipale
- Department of Biosciences and Nutrition, Karolinska Institutet, Genome Scale Biology Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Esko Ukkonen
- Helsinki Institute for Information Technology HIIT, Helsinki, Finland.,Department of Computer Science
| |
Collapse
|
19
|
Abstract
Protein-DNA binding plays a central role in gene regulation and by that in all processes in the living cell. Novel experimental and computational approaches facilitate better understanding of protein-DNA binding preferences via high-throughput measurement of protein binding to a large number of DNA sequences and inference of binding models from them. Here we review the state of the art in measuring protein-DNA binding in vitro, emphasizing the advantages and limitations of different technologies. In addition, we describe models for representing protein-DNA binding preferences and key computational approaches to learn those from high-throughput data. Using large experimental data sets, we test the performance of different models based on different measuring techniques. We conclude with pertinent open problems.
Collapse
|
20
|
Yang L, Orenstein Y, Jolma A, Yin Y, Taipale J, Shamir R, Rohs R. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol 2017; 13:910. [PMID: 28167566 PMCID: PMC5327724 DOI: 10.15252/msb.20167238] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022] Open
Abstract
Transcription factors (TFs) achieve DNA‐binding specificity through contacts with functional groups of bases (base readout) and readout of structural properties of the double helix (shape readout). Currently, it remains unclear whether DNA shape readout is utilized by only a few selected TF families, or whether this mechanism is used extensively by most TF families. We resequenced data from previously published HT‐SELEX experiments, the most extensive mammalian TF–DNA binding data available to date. Using these data, we demonstrated the contributions of DNA shape readout across diverse TF families and its importance in core motif‐flanking regions. Statistical machine‐learning models combined with feature‐selection techniques helped to reveal the nucleotide position‐dependent DNA shape readout in TF‐binding sites and the TF family‐specific position dependence. Based on these results, we proposed novel DNA shape logos to visualize the DNA shape preferences of TFs. Overall, this work suggests a way of obtaining mechanistic insights into TF–DNA binding without relying on experimentally solved all‐atom structures.
Collapse
Affiliation(s)
- Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA, USA
| | - Yaron Orenstein
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Arttu Jolma
- Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
| | - Yimeng Yin
- Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
| | - Jussi Taipale
- Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA, USA
| |
Collapse
|
21
|
Dresch JM, Zellers RG, Bork DK, Drewell RA. Nucleotide Interdependency in Transcription Factor Binding Sites in the Drosophila Genome. GENE REGULATION AND SYSTEMS BIOLOGY 2016; 10:21-33. [PMID: 27330274 PMCID: PMC4907338 DOI: 10.4137/grsb.s38462] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/05/2016] [Revised: 04/17/2016] [Accepted: 04/28/2016] [Indexed: 01/14/2023]
Abstract
A long-standing objective in modern biology is to characterize the molecular components that drive the development of an organism. At the heart of eukaryotic development lies gene regulation. On the molecular level, much of the research in this field has focused on the binding of transcription factors (TFs) to regulatory regions in the genome known as cis-regulatory modules (CRMs). However, relatively little is known about the sequence-specific binding preferences of many TFs, especially with respect to the possible interdependencies between the nucleotides that make up binding sites. A particular limitation of many existing algorithms that aim to predict binding site sequences is that they do not allow for dependencies between nonadjacent nucleotides. In this study, we use a recently developed computational algorithm, MARZ, to compare binding site sequences using 32 distinct models in a systematic and unbiased approach to explore nucleotide dependencies within binding sites for 15 distinct TFs known to be critical to Drosophila development. Our results indicate that many of these proteins have varying levels of nucleotide interdependencies within their DNA recognition sequences, and that, in some cases, models that account for these dependencies greatly outperform traditional models that are used to predict binding sites. We also directly compare the ability of different models to identify the known KRUPPEL TF binding sites in CRMs and demonstrate that a more complex model that accounts for nucleotide interdependencies performs better when compared with simple models. This ability to identify TFs with critical nucleotide interdependencies in their binding sites will lead to a deeper understanding of how these molecular characteristics contribute to the architecture of CRMs and the precise regulation of transcription during organismal development.
Collapse
Affiliation(s)
- Jacqueline M. Dresch
- Department of Mathematics and Computer Science, Clark University, Worcester, MA, USA
| | - Rowan G. Zellers
- Computer Science Department, Harvey Mudd College, Claremont, CA, USA
- Mathematics Department, Harvey Mudd College, Claremont, CA, USA
| | - Daniel K. Bork
- Computer Science Department, Harvey Mudd College, Claremont, CA, USA
- Mathematics Department, Harvey Mudd College, Claremont, CA, USA
| | | |
Collapse
|
22
|
Siebert M, Söding J. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. Nucleic Acids Res 2016; 44:6055-69. [PMID: 27288444 PMCID: PMC5291271 DOI: 10.1093/nar/gkw521] [Citation(s) in RCA: 61] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2016] [Accepted: 05/29/2016] [Indexed: 01/01/2023] Open
Abstract
Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
Collapse
Affiliation(s)
- Matthias Siebert
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany Gene Center, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, 81377 Munich, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Am Fassberg 11, 37077 Göttingen, Germany
| |
Collapse
|
23
|
Boeva V. Analysis of Genomic Sequence Motifs for Deciphering Transcription Factor Binding and Transcriptional Regulation in Eukaryotic Cells. Front Genet 2016; 7:24. [PMID: 26941778 PMCID: PMC4763482 DOI: 10.3389/fgene.2016.00024] [Citation(s) in RCA: 86] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2015] [Accepted: 02/05/2016] [Indexed: 12/27/2022] Open
Abstract
Eukaryotic genomes contain a variety of structured patterns: repetitive elements, binding sites of DNA and RNA associated proteins, splice sites, and so on. Often, these structured patterns can be formalized as motifs and described using a proper mathematical model such as position weight matrix and IUPAC consensus. Two key tasks are typically carried out for motifs in the context of the analysis of genomic sequences. These are: identification in a set of DNA regions of over-represented motifs from a particular motif database, and de novo discovery of over-represented motifs. Here we describe existing methodology to perform these two tasks for motifs characterizing transcription factor binding. When applied to the output of ChIP-seq and ChIP-exo experiments, or to promoter regions of co-modulated genes, motif analysis techniques allow for the prediction of transcription factor binding events and enable identification of transcriptional regulators and co-regulators. The usefulness of motif analysis is further exemplified in this review by how motif discovery improves peak calling in ChIP-seq and ChIP-exo experiments and, when coupled with information on gene expression, allows insights into physical mechanisms of transcriptional modulation.
Collapse
Affiliation(s)
- Valentina Boeva
- Centre de Recherche, Institut CurieParis, France; INSERM, U900Paris, France; Mines ParisTechFontainebleau, France; PSL Research UniversityParis, France; Department of Development, Reproduction and Cancer, Institut CochinParis, France; INSERM, U1016Paris, France; Centre National de la Recherche Scientifique UMR 8104Paris, France; Université Paris Descartes UMR-S1016Paris, France
| |
Collapse
|
24
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 PMCID: PMC4821295 DOI: 10.12688/f1000research.7408.2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/29/2016] [Indexed: 11/22/2022] Open
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
Collapse
Affiliation(s)
- Caleb Kipkurui Kibet
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
| | - Philip Machanick
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
| |
Collapse
|
25
|
Nettling M, Treutler H, Grau J, Keilwagen J, Posch S, Grosse I. DiffLogo: a comparative visualization of sequence motifs. BMC Bioinformatics 2015; 16:387. [PMID: 26577052 PMCID: PMC4650857 DOI: 10.1186/s12859-015-0767-x] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2015] [Accepted: 10/08/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND For three decades, sequence logos are the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain. RESULTS Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner resulting in publication-ready figures. In case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for comparison of DNA motifs, and on F-box domains from three different families as example for comparison of protein motifs. CONCLUSIONS DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.
Collapse
Affiliation(s)
- Martin Nettling
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Hendrik Treutler
- Leibniz Institute of Plant Biochemistry, Halle (Saale), Germany.
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany.
| | - Stefan Posch
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
| | - Ivo Grosse
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany.
- German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany.
| |
Collapse
|
26
|
Keilwagen J, Grau J. Varying levels of complexity in transcription factor binding motifs. Nucleic Acids Res 2015; 43:e119. [PMID: 26116565 PMCID: PMC4605289 DOI: 10.1093/nar/gkv577] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2015] [Revised: 05/11/2015] [Accepted: 05/21/2015] [Indexed: 11/17/2022] Open
Abstract
Binding of transcription factors to DNA is one of the keystones of gene regulation. The existence of statistical dependencies between binding site positions is widely accepted, while their relevance for computational predictions has been debated. Building probabilistic models of binding sites that may capture dependencies is still challenging, since the most successful motif discovery approaches require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that combine putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously. We find that Slim models yield a substantially better prediction performance than previous models on genomic context protein binding microarray data sets and on ChIP-seq data sets. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The observed dependency structures range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions.
Collapse
Affiliation(s)
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, D-06484 Quedlinburg, Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle-Wittenberg, D-06099 Halle (Saale), Germany
| |
Collapse
|
27
|
Giguère S, Laviolette F, Marchand M, Tremblay D, Moineau S, Liang X, Biron É, Corbeil J. Machine learning assisted design of highly active peptides for drug discovery. PLoS Comput Biol 2015; 11:e1004074. [PMID: 25849257 PMCID: PMC4388847 DOI: 10.1371/journal.pcbi.1004074] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2014] [Accepted: 12/05/2014] [Indexed: 01/15/2023] Open
Abstract
The discovery of peptides possessing high biological activity is very challenging due to the enormous diversity for which only a minority have the desired properties. To lower cost and reduce the time to obtain promising peptides, machine learning approaches can greatly assist in the process and even partly replace expensive laboratory experiments by learning a predictor with existing data or with a smaller amount of data generation. Unfortunately, once the model is learned, selecting peptides having the greatest predicted bioactivity often requires a prohibitive amount of computational time. For this combinatorial problem, heuristics and stochastic optimization methods are not guaranteed to find adequate solutions. We focused on recent advances in kernel methods and machine learning to learn a predictive model with proven success. For this type of model, we propose an efficient algorithm based on graph theory, that is guaranteed to find the peptides for which the model predicts maximal bioactivity. We also present a second algorithm capable of sorting the peptides of maximal bioactivity. Extensive analyses demonstrate how these algorithms can be part of an iterative combinatorial chemistry procedure to speed up the discovery and the validation of peptide leads. Moreover, the proposed approach does not require the use of known ligands for the target protein since it can leverage recent multi-target machine learning predictors where ligands for similar targets can serve as initial training data. Finally, we validated the proposed approach in vitro with the discovery of new cationic antimicrobial peptides. Source code freely available at http://graal.ift.ulaval.ca/peptide-design/. Part of the complexity of drug discovery is the sheer chemical diversity to explore combined to all requirements a compound must meet to become a commercial drug. Hence, it makes sense to automate this chemical exploration endeavor in a wise, informed, and efficient fashion. Here, we focused on peptides as they have properties that make them excellent drug starting points. Machine learning techniques may replace expensive in-vitro laboratory experiments by learning an accurate model of it. However, computational models also suffer from the combinatorial explosion due to the enormous chemical diversity. Indeed, applying the model to every peptides would take an astronomical amount of computer time. Therefore, given a model, is it possible to determine, using reasonable computational time, the peptide that has the best properties and chance for success? This exact question is what motivated our work. We focused on recent advances in kernel methods and machine learning to learn a model that already had excellent results. We demonstrate that this class of model has mathematical properties that makes it possible to rapidly identify and sort the best peptides. Finally, in-vitro and in-silico results are provided to support and validate this theoretical discovery.
Collapse
Affiliation(s)
- Sébastien Giguère
- Department of Computer Science and Software Engineering, Université Laval, Québec, Canada
- * E-mail:
| | - François Laviolette
- Department of Computer Science and Software Engineering, Université Laval, Québec, Canada
| | - Mario Marchand
- Department of Computer Science and Software Engineering, Université Laval, Québec, Canada
| | - Denise Tremblay
- Department of Biochemistry, Microbiology and Bioinformatics, Université Laval, Québec, Canada
| | - Sylvain Moineau
- Department of Biochemistry, Microbiology and Bioinformatics, Université Laval, Québec, Canada
| | - Xinxia Liang
- Faculty of Pharmacy, Université Laval, Québec, Canada
| | - Éric Biron
- Faculty of Pharmacy, Université Laval, Québec, Canada
| | - Jacques Corbeil
- Department of Molecular Medicine, Université Laval, Québec, Canada
| |
Collapse
|
28
|
Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A 2015; 112:4654-9. [PMID: 25775564 DOI: 10.1073/pnas.1422023112] [Citation(s) in RCA: 161] [Impact Index Per Article: 17.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
DNA binding specificities of transcription factors (TFs) are a key component of gene regulatory processes. Underlying mechanisms that explain the highly specific binding of TFs to their genomic target sites are poorly understood. A better understanding of TF-DNA binding requires the ability to quantitatively model TF binding to accessible DNA as its basic step, before additional in vivo components can be considered. Traditionally, these models were built based on nucleotide sequence. Here, we integrated 3D DNA shape information derived with a high-throughput approach into the modeling of TF binding specificities. Using support vector regression, we trained quantitative models of TF binding specificity based on protein binding microarray (PBM) data for 68 mammalian TFs. The evaluation of our models included cross-validation on specific PBM array designs, testing across different PBM array designs, and using PBM-trained models to predict relative binding affinities derived from in vitro selection combined with deep sequencing (SELEX-seq). Our results showed that shape-augmented models compared favorably to sequence-based models. Although both k-mer and DNA shape features can encode interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space. In addition, analyzing the feature weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not revealed by the DNA sequence. As such, this work combines knowledge from structural biology and genomics, and suggests a new path toward understanding TF binding and genome function.
Collapse
|
29
|
Zellers RG, Drewell RA, Dresch JM. MARZ: an algorithm to combinatorially analyze gapped n-mer models of transcription factor binding. BMC Bioinformatics 2015; 16:30. [PMID: 25637281 PMCID: PMC4384306 DOI: 10.1186/s12859-014-0446-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2014] [Accepted: 11/24/2014] [Indexed: 12/28/2022] Open
Abstract
Background A key challenge in understanding the molecular mechanisms that control gene regulation is the characterization of the specificity with which transcription factor proteins bind to specific DNA sequences. A number of computational approaches have been developed to examine these interactions, including simple mononucleotide and dinucleotide position weight matrix models. Results Here we develop a novel, unbiased computational algorithm, MARZ, that systematically analyzes all possible gapped matrices across a fixed number of nucleotides. In addition, to evaluate the ability of these matrix models to predict in vivo binding sites, we utilize a new scoring system and, in combination with established scoring methods and statistical analysis, test the performance of 32 different gapped matrices on the well characterized HUNCHBACK transcription factor in Drosophila. Conclusions Our results indicate that in many cases gapped matrix models can outperform traditional models, but that the relative strength of the binding sites considered in the analysis can profoundly influence the predictive ability of specific models. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0446-3) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Rowan G Zellers
- Department of Computer Science, Harvey Mudd College, 301 Platt Boulevard, Claremont CA, 91711, USA. .,Department of Mathematics, Harvey Mudd College, 301 Platt Boulevard, Claremont CA, 91711, USA.
| | - Robert A Drewell
- Biology Department, Clark University, 950 Main Street, Worcester MA, 01610, USA.
| | - Jacqueline M Dresch
- Department of Mathematics and Statistics, Amherst College, P.O. Box 5000, Amherst MA, 01002, USA.
| |
Collapse
|
30
|
Abstract
Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)-DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF-DNA binding preferences. We used high-throughput protein-DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein-DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein-DNA binding preferences.
Collapse
|
31
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 332] [Impact Index Per Article: 33.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Collapse
Affiliation(s)
- Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA.
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Ana Carolina Dantas Machado
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Raluca Gordân
- Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA.
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| |
Collapse
|
32
|
Munteanu A, Ohler U, Gordân R. COUGER--co-factors associated with uniquely-bound genomic regions. Nucleic Acids Res 2014; 42:W461-7. [PMID: 24861628 PMCID: PMC4086139 DOI: 10.1093/nar/gku435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Most transcription factors (TFs) belong to protein families that share a common DNA binding domain and have very similar DNA binding preferences. However, many paralogous TFs (i.e. members of the same TF family) perform different regulatory functions and interact with different genomic regions in the cell. A potential mechanism for achieving this differential in vivo specificity is through interactions with protein co-factors. Computational tools for studying the genomic binding profiles of paralogous TFs and identifying their putative co-factors are currently lacking. Here, we present an interactive web implementation of COUGER, a classification-based framework for identifying protein co-factors that might provide specificity to paralogous TFs. COUGER takes as input two sets of genomic regions bound by paralogous TFs, and it identifies a small set of putative co-factors that best distinguish the two sets of sequences. To achieve this task, COUGER uses a classification approach, with features that reflect the DNA-binding specificities of the putative co-factors. The identified co-factors are presented in a user-friendly output page, together with information that allows the user to understand and to explore the contributions of individual co-factor features. COUGER can be run as a stand-alone tool or through a web interface: http://couger.oit.duke.edu.
Collapse
Affiliation(s)
- Alina Munteanu
- Faculty of Computer Science, Alexandru I. Cuza University, Iasi 700483, Romania Berlin Institute for Medical Systems Biology, Max Delbruck Center, 13125 Berlin, Germany
| | - Uwe Ohler
- Berlin Institute for Medical Systems Biology, Max Delbruck Center, 13125 Berlin, Germany Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA
| | - Raluca Gordân
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA
| |
Collapse
|
33
|
Siggers T, Gordân R. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Res 2013; 42:2099-111. [PMID: 24243859 PMCID: PMC3936734 DOI: 10.1093/nar/gkt1112] [Citation(s) in RCA: 151] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023] Open
Abstract
Binding of proteins to particular DNA sites across the genome is a primary determinant of specificity in genome maintenance and gene regulation. DNA-binding specificity is encoded at multiple levels, from the detailed biophysical interactions between proteins and DNA, to the assembly of multi-protein complexes. At each level, variation in the mechanisms used to achieve specificity has led to difficulties in constructing and applying simple models of DNA binding. We review the complexities in protein–DNA binding found at multiple levels and discuss how they confound the idea of simple recognition codes. We discuss the impact of new high-throughput technologies for the characterization of protein–DNA binding, and how these technologies are uncovering new complexities in protein–DNA recognition. Finally, we review the concept of multi-protein recognition codes in which new DNA-binding specificities are achieved by the assembly of multi-protein complexes.
Collapse
Affiliation(s)
- Trevor Siggers
- Department of Biology, Boston University, Boston, MA 02215, USA, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA
| | | |
Collapse
|
34
|
Zhong S, He X, Bar-Joseph Z. Predicting tissue specific transcription factor binding sites. BMC Genomics 2013; 14:796. [PMID: 24238150 PMCID: PMC3898213 DOI: 10.1186/1471-2164-14-796] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2013] [Accepted: 11/06/2013] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Studies of gene regulation often utilize genome-wide predictions of transcription factor (TF) binding sites. Most existing prediction methods are based on sequence information alone, ignoring biological contexts such as developmental stages and tissue types. Experimental methods to study in vivo binding, including ChIP-chip and ChIP-seq, can only study one transcription factor in a single cell type and under a specific condition in each experiment, and therefore cannot scale to determine the full set of regulatory interactions in mammalian transcriptional regulatory networks. RESULTS We developed a new computational approach, PIPES, for predicting tissue-specific TF binding. PIPES integrates in vitro protein binding microarrays (PBMs), sequence conservation and tissue-specific epigenetic (DNase I hypersensitivity) information. We demonstrate that PIPES improves over existing methods on distinguishing between in vivo bound and unbound sequences using ChIP-seq data for 11 mouse TFs. In addition, our predictions are in good agreement with current knowledge of tissue-specific TF regulation. CONCLUSIONS We provide a systematic map of computationally predicted tissue-specific binding targets for 284 mouse TFs across 55 tissue/cell types. Such comprehensive resource is useful for researchers studying gene regulation.
Collapse
Affiliation(s)
| | | | - Ziv Bar-Joseph
- Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA.
| |
Collapse
|
35
|
Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, Rohs R. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2013; 42:D148-55. [PMID: 24214955 PMCID: PMC3964943 DOI: 10.1093/nar/gkt1087] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
Collapse
Affiliation(s)
- Lin Yang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA, Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel, Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada and Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
| | | | | | | | | | | | | |
Collapse
|