1. Karollus A, Hingerl J, Gankin D, Grosshauser M, Klemon K, Gagneur J. Species-aware DNA language models capture regulatory elements and their evolution. Genome Biol 2024; 25:83. [PMID: 38566111 PMCID: PMC10985990 DOI: 10.1186/s13059-024-03221-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 03/20/2024] [Indexed: 04/04/2024] Open
Abstract
BACKGROUND The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, new algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. RESULTS Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery. CONCLUSIONS Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
Affiliation(s)
- Alexander Karollus
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Johannes Hingerl
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Dennis Gankin
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Martin Grosshauser
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Kristian Klemon
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
  - Institute of Human Genetics, School of Medicine and Health, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
2. Huang T, Xiao H, Tian Q, He Z, Yuan C, Lin Z, Gao X, Yao M. Identification of upstream transcription factor binding sites in orthologous genes using mixed Student’s t-test statistics. PLoS Comput Biol 2022; 18:e1009773. [PMID: 35671296 PMCID: PMC9205514 DOI: 10.1371/journal.pcbi.1009773] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/17/2021] [Revised: 06/17/2022] [Accepted: 04/30/2022] [Indexed: 11/18/2022] Open
Abstract
Background Transcription factors (TFs) regulate the transcription of DNA to messenger RNA by binding to upstream sequence motifs. Identifying the locations of known motifs in whole genomes is computationally intensive. Methodology/Principal findings This study presents a computational tool, named “Grit”, for screening TF-binding sites (TFBS) by coordinating transcription factors to their promoter sequences in orthologous genes. This tool employs a newly developed mixed Student’s t-test statistical method that detects high-scoring binding sites utilizing conservation information among species. The program performs sequence scanning at a rate of 3.2 Mbp/s on a quad-core Amazon server and has been benchmarked against well-established ChIP-Seq datasets, putting Grit amongst the top-ranked TFBS predictors. It significantly outperforms the well-known transcription factor motif scanning tools Pscan (4.8%) and FIMO (17.8%) in analyzing well-documented ChIP-Atlas human genome ChIP-Seq datasets. Significance Grit is a good alternative to currently available motif scanning tools. Locating transcription factor-binding (TF-binding) sites in the genome and identifying their function is fundamental to understanding various biological processes. Improving the performance of prediction tools is important because accurate TF-binding site prediction can save cost and time in wet-lab experiments. Also, genome-wide TF-binding site prediction can provide new insights into transcriptome regulation from a systems biology perspective. This study developed a new TF-binding site prediction tool based on a mixed Student’s t-test statistical method. The tool is amongst the top-ranked TF-binding site predictors; as such, it can help researchers in TF-binding site identification and in interpreting the transcriptional regulation mechanisms of genes.
Affiliation(s)
- Tinghua Huang
  - College of Animal Science, Yangtze University, Jingzhou, China
- Hong Xiao
  - College of Animal Science, Yangtze University, Jingzhou, China
- Qi Tian
  - College of Animal Science, Yangtze University, Jingzhou, China
- Zhen He
  - College of Animal Science, Yangtze University, Jingzhou, China
- Cheng Yuan
  - College of Animal Science, Yangtze University, Jingzhou, China
- Zezhao Lin
  - College of Animal Science, Yangtze University, Jingzhou, China
- Xuejun Gao
  - College of Animal Science, Yangtze University, Jingzhou, China
- Min Yao
  - College of Animal Science, Yangtze University, Jingzhou, China
3. Kim J, Lee KT, Lee JS, Shin J, Cui B, Yang K, Choi YS, Choi N, Lee SH, Lee JH, Bahn YS, Cho SW. Fungal brain infection modelled in a human-neurovascular-unit-on-a-chip with a functional blood-brain barrier. Nat Biomed Eng 2021; 5:830-846. [PMID: 34127820 DOI: 10.1038/s41551-021-00743-8] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 10/08/2019] [Accepted: 04/30/2021] [Indexed: 02/05/2023]
Abstract
The neurovascular unit, which consists of vascular cells surrounded by astrocytic end-feet and neurons, controls cerebral blood flow and the permeability of the blood-brain barrier (BBB) to maintain homeostasis in the neuronal milieu. Studying how some pathogens and drugs can penetrate the human BBB and disrupt neuronal homeostasis requires in vitro microphysiological models of the neurovascular unit. Here we show that the neurotropism of Cryptococcus neoformans (the most common pathogen causing fungal meningitis) and its ability to penetrate the BBB can be modelled by the co-culture of human neural stem cells, brain microvascular endothelial cells and brain vascular pericytes in a human-neurovascular-unit-on-a-chip maintained by a stepwise gravity-driven unidirectional flow and recapitulating the structural and functional features of the BBB. We found that the pathogen forms clusters of cells that penetrate the BBB without altering tight junctions, suggesting a transcytosis-mediated mechanism. The neurovascular-unit-on-a-chip may facilitate the study of the mechanisms of brain infection by pathogens, and the development of drugs for a range of brain diseases.
Affiliation(s)
- Jin Kim
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Kyung-Tae Lee
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Jong Seung Lee
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Jisoo Shin
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Baofang Cui
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Kisuk Yang
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Yi Sun Choi
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Nakwon Choi
  - Brain Science Institute, Korea Institute of Science and Technology (KIST), Seoul, Republic of Korea
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, Seoul, Republic of Korea
  - Division of Bio-Medical Science and Technology, KIST School, Korea University of Science and Technology (UST), Seoul, Republic of Korea
- Soo Hyun Lee
  - Brain Science Institute, Korea Institute of Science and Technology (KIST), Seoul, Republic of Korea
- Jae-Hyun Lee
  - Institute for Basic Science (IBS), Center for Nanomedicine, Seoul, Republic of Korea
  - Graduate Program of Nano Biomedical Engineering (NanoBME), Advanced Science Institute, Yonsei University, Seoul, Republic of Korea
- Yong-Sun Bahn
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
- Seung-Woo Cho
  - Department of Biotechnology, Yonsei University, Seoul, Republic of Korea
  - Institute for Basic Science (IBS), Center for Nanomedicine, Seoul, Republic of Korea
  - Graduate Program of Nano Biomedical Engineering (NanoBME), Advanced Science Institute, Yonsei University, Seoul, Republic of Korea
4. Thakur V, Bains S, Pathania S, Sharma S, Kaur R, Singh K. Comparative transcriptomics reveals candidate transcription factors involved in costunolide biosynthesis in medicinal plant-Saussurea lappa. Int J Biol Macromol 2020; 150:52-67. [DOI: 10.1016/j.ijbiomac.2020.01.312] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/11/2019] [Revised: 01/28/2020] [Accepted: 01/28/2020] [Indexed: 01/01/2023]
5. Jackson CA, Castro DM, Saldi GA, Bonneau R, Gresham D. Gene regulatory network reconstruction using single-cell RNA sequencing of barcoded genotypes in diverse environments. eLife 2020; 9:e51254. [PMID: 31985403 PMCID: PMC7004572 DOI: 10.7554/elife.51254] [Citation(s) in RCA: 98] [Impact Index Per Article: 19.6] [Received: 08/21/2019] [Accepted: 01/10/2020] [Indexed: 11/13/2022] Open
Abstract
Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors and target genes. Gene regulatory networks are typically constructed from gene expression data acquired following genetic perturbation or environmental stimulus. Single-cell RNA sequencing (scRNAseq) captures the gene expression state of thousands of individual cells in a single experiment, offering advantages in combinatorial experimental design, large numbers of independent measurements, and accessing the interaction between the cell cycle and environmental responses that is hidden by population-level analysis of gene expression. To leverage these advantages, we developed a method for scRNAseq in budding yeast (Saccharomyces cerevisiae). We pooled diverse transcriptionally barcoded gene deletion mutants in 11 different environmental conditions and determined their expression state by sequencing 38,285 individual cells. We benchmarked a framework for learning gene regulatory networks from scRNAseq data that incorporates multitask learning and constructed a global gene regulatory network comprising 12,228 interactions.
Affiliation(s)
- Christopher A Jackson
  - Center For Genomics and Systems Biology, New York University, New York, United States
  - Department of Biology, New York University, New York, United States
- Richard Bonneau
  - Center For Genomics and Systems Biology, New York University, New York, United States
  - Department of Biology, New York University, New York, United States
  - Courant Institute of Mathematical Sciences, Computer Science Department, New York University, New York, United States
  - Center For Data Science, New York University, New York, United States
  - Flatiron Institute, Center for Computational Biology, Simons Foundation, New York, United States
- David Gresham
  - Center For Genomics and Systems Biology, New York University, New York, United States
  - Department of Biology, New York University, New York, United States
6. Dalfovo D, Valentini S, Romanel A. Exploring functionally annotated transcriptional consensus regulatory elements with CONREL. Database (Oxford) 2020; 2020:5981331. [PMID: 33186463 PMCID: PMC7805434 DOI: 10.1093/database/baaa071] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/19/2020] [Revised: 07/03/2020] [Accepted: 08/06/2020] [Indexed: 12/31/2022]
Abstract
Understanding the interaction between human genome regulatory elements and transcription factors is fundamental to elucidate the structure of gene regulatory networks. Here we present CONREL, a web application that allows for the exploration of functionally annotated transcriptional ‘consensus’ regulatory elements at different levels of abstraction. CONREL provides an extensive collection of consensus promoters, enhancers and active enhancers for 198 cell-lines across 38 tissue types, which are also combined to provide global consensuses. In addition, 1000 Genomes Project genotype data and the ‘total binding affinity’ of thousands of transcription factor binding motifs at genomic regulatory elements is fully combined and exploited to characterize and annotate functional properties of our collection. Comparison with other available resources highlights the strengths and advantages of CONREL. CONREL can be used to explore genomic loci, specific genes or genomic regions of interest across different cell lines and tissue types. The resource is freely available at https://bcglab.cibio.unitn.it/conrel.
Affiliation(s)
- Davide Dalfovo
  - Laboratory of Bioinformatics and Computational Genomics, Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Via Sommarive 9, 38123 Trento, Italy
- Samuel Valentini
  - Laboratory of Bioinformatics and Computational Genomics, Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Via Sommarive 9, 38123 Trento, Italy
7. Bioinformatics Approaches to Gain Insights into cis-Regulatory Motifs Involved in mRNA Localization. Adv Exp Med Biol 2019; 1203:165-194. [PMID: 31811635 DOI: 10.1007/978-3-030-31434-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Messenger RNA (mRNA) is a fundamental intermediate in the expression of proteins. As an integral part of this important process, protein production can be localized by the targeting of mRNA to a specific subcellular compartment. The subcellular destination of mRNA is suggested to be governed by a region of its primary sequence or secondary structure, which consequently dictates the recruitment of trans-acting factors, such as RNA-binding proteins or regulatory RNAs, to form a messenger ribonucleoprotein particle. This molecular ensemble is requisite for precise and spatiotemporal control of gene expression. In the context of RNA localization, the description of the binding preferences of an RNA-binding protein defines a motif, and one, or more, instance of a given motif is defined as a localization element (zip code). In this chapter, we first discuss the cis-regulatory motifs previously identified as mRNA localization elements. We then describe motif representation in terms of entropy and information content and offer an overview of motif databases and search algorithms. Finally, we provide an outline of the motif topology of asymmetrically localized mRNA molecules.
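The abstract above mentions describing motifs in terms of entropy and information content. As a minimal sketch of that idea (not code from the chapter), the per-position information content of a motif can be computed as the relative entropy between each column's base frequencies and a background distribution. The uniform background of 0.25 and the three-column `example_motif` are illustrative assumptions.

```python
import math


def column_information(freqs, background=0.25):
    """Information content in bits of one motif column.

    Computes sum_b p_b * log2(p_b / q_b) against a uniform background q_b
    (an assumption of this sketch). Bases with zero frequency contribute 0.
    """
    return sum(p * math.log2(p / background) for p in freqs if p > 0)


def motif_information(pfm):
    """Per-position information content for a list of frequency columns."""
    return [column_information(column) for column in pfm]


# Hypothetical motif: each column holds P(A), P(C), P(G), P(T).
example_motif = [
    [1.0, 0.0, 0.0, 0.0],      # invariant A: maximally informative
    [0.5, 0.5, 0.0, 0.0],      # A or C with equal probability
    [0.25, 0.25, 0.25, 0.25],  # matches the background: uninformative
]

information = motif_information(example_motif)
```

With a uniform background, an invariant DNA position carries 2 bits and a position that matches the background carries 0 bits, which is the scale used on the vertical axis of a sequence logo.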
8. del Olmo Toledo V, Puccinelli R, Fordyce PM, Pérez JC. Diversification of DNA binding specificities enabled SREBP transcription regulators to expand the repertoire of cellular functions that they govern in fungi. PLoS Genet 2018; 14:e1007884. [PMID: 30596634 PMCID: PMC6329520 DOI: 10.1371/journal.pgen.1007884] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/23/2018] [Revised: 01/11/2019] [Accepted: 12/08/2018] [Indexed: 01/08/2023] Open
Abstract
The Sterol Regulatory Element Binding Proteins (SREBPs) are basic-helix-loop-helix transcription regulators that control the expression of sterol biosynthesis genes in higher eukaryotes and some fungi. Surprisingly, SREBPs do not regulate sterol biosynthesis in the ascomycete yeasts (Saccharomycotina) as this role was handed off to an unrelated transcription regulator in this clade. The SREBPs, nonetheless, expanded in fungi such as the ascomycete yeasts Candida spp., raising questions about their role and evolution in these organisms. Here we report that the fungal SREBPs diversified their DNA binding preferences concomitantly with an expansion in function. We establish that several branches of fungal SREBPs preferentially bind non-palindromic DNA sequences, in contrast to the palindromic DNA motifs recognized by most basic-helix-loop-helix proteins (including SREBPs) in higher eukaryotes. Reconstruction and biochemical characterization of the likely ancestor protein suggest that an intrinsic DNA binding promiscuity in the family was resolved by alternative mechanisms in different branches of fungal SREBPs. Furthermore, we show that two SREBPs in the human commensal yeast Candida albicans drive a transcriptional cascade that inhibits a morphological switch under anaerobic conditions. Preventing this morphological transition enhances C. albicans colonization of the mammalian intestine, the fungus' natural niche. Thus, our results illustrate how diversification in DNA binding preferences enabled the functional expansion of a family of eukaryotic transcription regulators.
Affiliation(s)
- Valentina del Olmo Toledo
  - Interdisciplinary Center for Clinical Research, University Hospital Würzburg, Würzburg, Germany
  - Institute for Molecular Infection Biology, University Würzburg, Würzburg, Germany
- Robert Puccinelli
  - Department of Genetics, Stanford University, Stanford, California, United States of America
  - Chan Zuckerberg Biohub, San Francisco, California, United States of America
- Polly M. Fordyce
  - Department of Genetics, Stanford University, Stanford, California, United States of America
  - Chan Zuckerberg Biohub, San Francisco, California, United States of America
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
  - Stanford CheM-H Institute, Stanford University, Stanford, California, United States of America
- J. Christian Pérez
  - Interdisciplinary Center for Clinical Research, University Hospital Würzburg, Würzburg, Germany
  - Institute for Molecular Infection Biology, University Würzburg, Würzburg, Germany
9. Cheng C, Tang RQ, Xiong L, Hector RE, Bai FW, Zhao XQ. Association of improved oxidative stress tolerance and alleviation of glucose repression with superior xylose-utilization capability by a natural isolate of Saccharomyces cerevisiae. Biotechnol Biofuels 2018; 11:28. [PMID: 29441126 PMCID: PMC5798184 DOI: 10.1186/s13068-018-1018-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 08/08/2017] [Accepted: 01/11/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND Saccharomyces cerevisiae wild strains generally have poor xylose-utilization capability, which is a major barrier for efficient bioconversion of lignocellulosic biomass. Laboratory adaption is commonly used to enhance xylose utilization of recombinant S. cerevisiae. Apparently, yeast cells could remodel the metabolic network for xylose metabolism. However, it still remains unclear why natural isolates of S. cerevisiae poorly utilize xylose. Here, we analyzed a unique S. cerevisiae natural isolate YB-2625 which has superior xylose metabolism capability in the presence of mixed-sugar. Comparative transcriptomic analysis was performed using S. cerevisiae YB-2625 grown in a mixture of glucose and xylose, and the model yeast strain S288C served as a control. Global gene transcription was compared at both the early mixed-sugar utilization stage and the latter xylose-utilization stage. RESULTS Genes involved in endogenous xylose-assimilation (XYL2 and XKS1), gluconeogenesis, and TCA cycle showed higher transcription levels in S. cerevisiae YB-2625 at the xylose-utilization stage, when compared to the reference strain. On the other hand, transcription factor encoding genes involved in regulation of glucose repression (MIG1, MIG2, and MIG3) as well as HXK2 displayed decreased transcriptional levels in YB-2625, suggesting the alleviation of glucose repression of S. cerevisiae YB-2625. Notably, genes encoding antioxidant enzymes (CTT1, CTA1, SOD2, and PRX1) showed higher transcription levels in S. cerevisiae YB-2625 in the xylose-utilization stage than that of the reference strain. Consistently, catalase activity of YB-2625 was 1.9-fold higher than that of S. cerevisiae S288C during the xylose-utilization stage. As a result, intracellular reactive oxygen species levels of S. cerevisiae YB-2625 were 43.3 and 58.6% lower than that of S288C at both sugar utilization stages. Overexpression of CTT1 and PRX1 in the recombinant strain S. cerevisiae YRH396 deriving from S. cerevisiae YB-2625 increased cell growth when xylose was used as the sole carbon source, leading to 13.5 and 18.1%, respectively, more xylose consumption. CONCLUSIONS Enhanced oxidative stress tolerance and relief of glucose repression are proposed to be two major mechanisms for superior xylose utilization by S. cerevisiae YB-2625. The present study provides insights into the innate regulatory mechanisms underlying xylose utilization in wild-type S. cerevisiae, which benefits the rapid development of robust yeast strains for lignocellulosic biorefineries.
Affiliation(s)
- Cheng Cheng
  - School of Life Science and Biotechnology, Dalian University of Technology, Dalian 116024, China
- Rui-Qi Tang
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Liang Xiong
  - School of Life Science and Biotechnology, Dalian University of Technology, Dalian 116024, China
- Ronald E. Hector
  - Bioenergy Research Unit, National Center for Agricultural Utilization Research, USDA-ARS, Peoria, IL, USA
- Feng-Wu Bai
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Xin-Qing Zhao
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
10. Grassi E, Mariella E, Forneris M, Marotta F, Catapano M, Molineris I, Provero P. A functional strategy to characterize expression Quantitative Trait Loci. Hum Genet 2017; 136:1477-1487. [PMID: 29101457 DOI: 10.1007/s00439-017-1849-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/10/2017] [Accepted: 09/20/2017] [Indexed: 02/08/2023]
Abstract
The study of genetic variation has been revolutionized by the advent of high-throughput technologies able to determine the complete genomic sequence of thousands of individuals. Understanding the functional relevance of variants is, however, still a difficult task, especially when focusing on non-coding variants. Most of the variants associated with disease by Genome-Wide Association Studies (GWAS) are indeed non-coding, and presumably exert their effects by altering gene regulation. Expression Quantitative Trait Loci (eQTL) studies represent an important step in understanding the functional relevance of regulatory variants. We propose a new strategy to detect and characterize eQTLs, based on the effect of variants on the Total Binding Affinity (TBA) profiles of regulatory regions. Using a large dataset of coupled genome and expression data, we show that TBA-based inference allows the identification of eQTLs not revealed by traditional methods and helps in their interpretation in terms of altered transcription factor binding.
Affiliation(s)
- Elena Grassi
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
- Elisa Mariella
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
- Mattia Forneris
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Federico Marotta
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
- Marika Catapano
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
  - Department of Medical and Molecular Genetics, King's College, London, UK
- Ivan Molineris
  - Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute IRCCS, Milan, Italy
- Paolo Provero
  - Department of Molecular Biotechnology and Health Sciences, Molecular Biotechnology Center, University of Turin, Turin, Italy
  - Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute IRCCS, Milan, Italy
11. Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 277] [Impact Index Per Article: 34.6] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
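One widely used family of alignment-free methods compares sequences through their k-mer (word) count profiles rather than through residue-by-residue alignment. The sketch below illustrates that general idea only; it is not taken from any specific tool in the review, and the choice of `k=3` and of cosine distance are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(seq_a, seq_b, k=3):
    """Cosine distance between the k-mer count vectors of two sequences.

    0.0 means identical k-mer composition; 1.0 means no shared k-mers.
    """
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least k characters")
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    return 1.0 - dot / (norm_a * norm_b)
```

Because only word counts are compared, the measure is insensitive to rearrangements that preserve k-mer composition, which is one reason such methods scale to whole genomes where alignment is costly.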
Affiliation(s)
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
- Susana Vinga
  - IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas Almeida
  - Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
- Wojciech M Karlowski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
12. Bussemaker HJ, Causton HC, Fazlollahi M, Lee E, Muroff I. Network-based approaches that exploit inferred transcription factor activity to analyze the impact of genetic variation on gene expression. Curr Opin Syst Biol 2017; 2:98-102. [PMID: 28691107 DOI: 10.1016/j.coisb.2017.04.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/26/2022]
Abstract
Over the past decade, a number of methods have emerged for inferring protein-level transcription factor activities in individual samples based on prior information about the structure of the gene regulatory network. We discuss how this has enabled new methods for dissecting trans-acting mechanisms that underpin genetic variation in gene expression.
Affiliation(s)
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY 10027
  - Department of Systems Biology, Columbia University, New York, NY 10032
- Helen C Causton
  - Department of Pathology and Cell Biology, Columbia University Medical Center, New York, NY 10032
- Mina Fazlollahi
  - Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029
- Eunjee Lee
  - Department of Genetics and Genomic Sciences, Mount Sinai School of Medicine, New York, NY 10029
- Ivor Muroff
  - Department of Biological Sciences, Columbia University, New York, NY 10027
13. Barros de Souza R, Silva RK, Ferreira DS, de Sá Leitão Paiva Junior S, de Barros Pita W, de Morais Junior MA. Magnesium ions in yeast: setting free the metabolism from glucose catabolite repression. Metallomics 2016; 8:1193-1203. [DOI: 10.1039/c6mt00157b] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
14. Riley TR, Lazarovici A, Mann RS, Bussemaker HJ. Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE. eLife 2015; 4:e06397. [PMID: 26701911 PMCID: PMC4758951 DOI: 10.7554/elife.06397] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 01/08/2015] [Accepted: 12/20/2015] [Indexed: 01/26/2023] Open
Abstract
Transcription factors are crucial regulators of gene expression. Accurate quantitative definition of their intrinsic DNA binding preferences is critical to understanding their biological function. High-throughput in vitro technology has recently been used to deeply probe the DNA binding specificity of hundreds of eukaryotic transcription factors, yet algorithms for analyzing such data have not yet fully matured. Here, we present a general framework (FeatureREDUCE) for building sequence-to-affinity models based on a biophysically interpretable and extensible model of protein-DNA interaction that can account for dependencies between nucleotides within the binding interface or multiple modes of binding. When training on protein binding microarray (PBM) data, we use robust regression and modeling of technology-specific biases to infer specificity models of unprecedented accuracy and precision. We provide quantitative validation of our results by comparing to gold-standard data when available.
Affiliation(s)
- Todd R Riley
  - Department of Biological Sciences, Columbia University, New York, United States
  - Department of Systems Biology, Columbia University, New York, United States
  - Department of Biology, University of Massachusetts Boston, Boston, United States
- Allan Lazarovici
  - Department of Biological Sciences, Columbia University, New York, United States
  - Department of Electrical Engineering, Columbia University, New York, United States
- Richard S Mann
  - Department of Systems Biology, Columbia University, New York, United States
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, United States
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, United States
  - Department of Systems Biology, Columbia University, New York, United States
|
15
|
Grassi E, Zapparoli E, Molineris I, Provero P. Total Binding Affinity Profiles of Regulatory Regions Predict Transcription Factor Binding and Gene Expression in Human Cells. PLoS One 2015; 10:e0143627. [PMID: 26599758 PMCID: PMC4658012 DOI: 10.1371/journal.pone.0143627] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 06/12/2015] [Accepted: 11/07/2015] [Indexed: 11/29/2022] Open
Abstract
Transcription factors regulate gene expression by binding regulatory DNA. Understanding the rules governing such binding is an essential step in describing the network of regulatory interactions, and its pathological alterations. We show that describing regulatory regions in terms of their profile of total binding affinities for transcription factors leads to increased predictive power compared to methods based on the identification of discrete binding sites. This applies both to the prediction of transcription factor binding as revealed by ChIP-seq experiments and to the prediction of gene expression through RNA-seq. Further significant improvements in predictive power are obtained when regulatory regions are defined based on chromatin states inferred from histone modification data.
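The contrast drawn in this abstract, between a total-affinity profile and the calling of discrete binding sites, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy two-column matrix `TOY_PWM`, the uniform background of 0.25, the forward-strand-only scan, and the function names are all assumptions made for illustration.

```python
import math

# Hypothetical toy position weight matrix: one dict of base probabilities per position.
# Entries are assumed strictly positive (e.g. after adding pseudocounts).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def window_ratio(pwm, window, background=0.25):
    """Likelihood ratio of one window under the PWM versus a uniform background."""
    ratio = 1.0
    for column, base in zip(pwm, window):
        ratio *= column[base] / background
    return ratio


def window_ratios(pwm, sequence):
    """Ratios for every window of motif width along the forward strand."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    return [
        window_ratio(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


def total_binding_affinity(pwm, sequence):
    """Log of the summed ratios over all windows, so weak sites also contribute."""
    return math.log(sum(window_ratios(pwm, sequence)))


def count_discrete_sites(pwm, sequence, threshold):
    """Discrete alternative: count only windows whose ratio exceeds a cutoff."""
    return sum(1 for ratio in window_ratios(pwm, sequence) if ratio > threshold)
```

In this sketch, a region containing two strong windows gets a higher total affinity than a region with one, without any threshold having to be chosen. The discrete count, by contrast, depends entirely on the cutoff.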
Affiliation(s)
- Elena Grassi
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Ettore Zapparoli
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Ivan Molineris
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
- Paolo Provero
  - Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Turin, Italy
  - Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, Milan, Italy
|
16
|
MORPHEUS, a Webtool for Transcription Factor Binding Analysis Using Position Weight Matrices with Dependency. PLoS One 2015; 10:e0135586. [PMID: 26285209 PMCID: PMC4540572 DOI: 10.1371/journal.pone.0135586] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/19/2015] [Accepted: 07/24/2015] [Indexed: 12/21/2022] Open
Abstract
Transcriptional networks are central to any biological process and changes affecting transcription factors or their binding sites in the genome are a key factor driving evolution. As more organisms are being sequenced, tools are needed to easily predict transcription factor binding sites (TFBS) presence and affinity from mere inspection of genomic sequences. Although many TFBS discovery algorithms exist, tools for using the DNA binding models they generate are relatively scarce and their use is limited among the biologist community by the lack of flexible and user-friendly tools. We have developed a suite of web tools (called Morpheus) based on the proven Position Weight Matrices (PWM) formalism that can be used without any programming skills and incorporates some unique features such as the presence of dependencies between nucleotide positions or the possibility to compute the predicted occupancy of a large regulatory region using a biophysical model. To illustrate the possibilities and simplicity of Morpheus tools in functional and evolutionary analysis, we have analysed the regulatory link between LEAFY, a key plant transcription factor involved in flower development, and its direct target gene APETALA1 during the divergence of Brassicales clade.
|
17
|
Wang J, Malecka A, Trøen G, Delabie J. Comprehensive genome-wide transcription factor analysis reveals that a combination of high affinity and low affinity DNA binding is needed for human gene regulation. BMC Genomics 2015; 16 Suppl 7:S12. [PMID: 26099425 PMCID: PMC4474539 DOI: 10.1186/1471-2164-16-s7-s12] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 11/26/2022] Open
Abstract
Background High-throughput in vivo protein-DNA interaction experiments are currently widely used in gene regulation studies. Hitherto, comprehensive data analysis remains a challenge and for that reason most computational methods only consider the top few hundred or thousand strongest protein binding sites whereas weak protein binding sites are completely ignored. Results A new biophysical model of protein-DNA interactions, BayesPI2+, was developed to address the above-mentioned challenges. BayesPI2+ can be run in either a serial computation model or a parallel ensemble learning framework. BayesPI2+ allowed us to analyze all binding sites of the transcription factors, including weak binding that cannot be analyzed by other models. It is evaluated in both synthetic and real in vivo protein-DNA binding experiments. Analysing ESR1 and SPIB in breast carcinoma and activated B cell-like diffuse large B-cell lymphoma cell lines, respectively, revealed that the concerted binding to high and low affinity sites correlates best with gene expression. Conclusions BayesPI2+ allows us to analyze transcription factor binding on a larger scale than hitherto achieved. By this analysis, we were able to demonstrate that genes are regulated by concerted binding to high and low affinity binding sites. The program and output results are publicly available at: http://folk.uio.no/junbaiw/BayesPI2Plus.
|
18
|
Wang J. Quality versus accuracy: result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence. BMC Bioinformatics 2014; 15:289. [PMID: 25158938 PMCID: PMC4161872 DOI: 10.1186/1471-2105-15-289] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/15/2014] [Accepted: 08/18/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Computational modeling of transcription factor (TF) sequence specificity is an important research topic in regulatory genomics. A systematic comparison of 26 algorithms to learn TF-DNA binding specificity in in vitro protein-binding microarray (PBM) data was published recently, but the quality of those examined PBMs was not evaluated completely. RESULTS Here, new quality-control parameters, such as the principal component analysis (PCA) ellipse, are proposed to assess the data quality for either single or paired PBMs. Additionally, a biophysical model of TF-DNA interactions including adjacent dinucleotide interdependence was implemented in a new program, BayesPI2, where sparse Bayesian learning and relevance vector machine are used to predict unknown model parameters. Then, 66 mouse TFs from the DREAM5 challenge were classified into two groups (i.e. good vs. bad) based on the paired PBM quality-control parameters. Subsequently, computational methods to model TF sequence specificity were evaluated between the two groups. CONCLUSION Results indicate that both the algorithm performance and the predicted TF-binding energy-level of a motif are significantly influenced by PBM data quality, where poor PBM data quality is linked to specific protein domains (e.g. C2H2 DNA-binding domain). In particular, the new dinucleotide energy-dependent model (BayesPI2) offers great improvement in testing prediction accuracy over the simple energy-independent model, for at least 21% of the analyzed TFs.
Affiliation(s)
- Junbai Wang
  - Pathology Department, Oslo University Hospital - Norwegian Radium Hospital, Montebello, Oslo, 0310, Norway.
|
19
|
Glenwinkel L, Wu D, Minevich G, Hobert O. TargetOrtho: a phylogenetic footprinting tool to identify transcription factor targets. Genetics 2014; 197:61-76. [PMID: 24558259 PMCID: PMC4012501 DOI: 10.1534/genetics.113.160721] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 01/23/2014] [Accepted: 02/09/2014] [Indexed: 11/18/2022] Open
Abstract
The identification of the regulatory targets of transcription factors is central to our understanding of how transcription factors fulfill their many key roles in development and homeostasis. DNA-binding sites have been uncovered for many transcription factors through a number of experimental approaches, but it has proven difficult to use this binding site information to reliably predict transcription factor target genes in genomic sequence space. Using the nematode Caenorhabditis elegans and other related nematode species as a starting point, we describe here a bioinformatic pipeline that identifies potential transcription factor target genes from genomic sequences. Among the key features of this pipeline is the use of sequence conservation of transcription-factor-binding sites in related species. Rather than using aligned genomic DNA sequences from the genomes of multiple species as a starting point, TargetOrtho scans related genome sequences independently for matches to user-provided transcription-factor-binding motifs, assigns motif matches to adjacent genes, and then determines whether orthologous genes in different species also contain motif matches. We validate TargetOrtho by identifying previously characterized targets of three different types of transcription factors in C. elegans, and we use TargetOrtho to identify novel target genes of the Collier/Olf/EBF transcription factor UNC-3 in C. elegans ventral nerve cord motor neurons. We have also implemented the use of TargetOrtho in Drosophila melanogaster using conservation among five species in the D. melanogaster species subgroup for target gene discovery.
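The alignment-free steps described in this abstract (scan each genome independently, assign motif matches to genes, then check whether orthologs also carry a match) can be sketched in Python as follows. This is a simplified illustration and not the TargetOrtho code. The function names, the dictionary-based inputs, the regular-expression motif, and the restriction to one strand of pre-extracted upstream sequences are all assumptions.

```python
import re


def genes_with_motif(upstream_by_gene, motif_regex):
    """Return the genes whose upstream sequence contains at least one motif match."""
    pattern = re.compile(motif_regex)
    return {gene for gene, seq in upstream_by_gene.items() if pattern.search(seq)}


def conserved_targets(reference_upstream, other_upstream, orthologs, motif_regex,
                      min_species=1):
    """Keep reference genes with a motif match whose orthologs also have one.

    reference_upstream: {gene: upstream sequence} for the reference species
    other_upstream:     {species: {gene: upstream sequence}} for related species
    orthologs:          {reference gene: {species: orthologous gene}}
    Returns {reference gene: sorted list of species supporting the match}.
    """
    reference_hits = genes_with_motif(reference_upstream, motif_regex)
    # Each related genome is scanned on its own; no sequence alignment is used.
    hits_by_species = {
        species: genes_with_motif(upstream, motif_regex)
        for species, upstream in other_upstream.items()
    }
    conserved = {}
    for gene in reference_hits:
        supporting = sorted(
            species
            for species, ortholog in orthologs.get(gene, {}).items()
            if ortholog in hits_by_species.get(species, set())
        )
        if len(supporting) >= min_species:
            conserved[gene] = supporting
    return conserved
```

Because each species is scanned separately, a motif match counts as conserved even when it has moved within the upstream region. An alignment-based comparison would not necessarily detect such a match.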
Affiliation(s)
- Lori Glenwinkel
  - Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Gregory Minevich
  - Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
- Oliver Hobert
  - Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University Medical Center, New York, New York 10032
|
20
|
Siewert E, Kechris KJ. Modeling considerations for using expression data from multiple species. Stat Med 2013; 32:4057-70. [DOI: 10.1002/sim.5850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2012] [Accepted: 04/17/2013] [Indexed: 11/10/2022]
Affiliation(s)
- Elizabeth Siewert
  - Department of Biostatistics and Informatics, Colorado School of Public Health; University of Colorado Denver; Denver Colorado U.S.A
  - Statistically Speaking Consulting; Wylie Texas U.S.A
- Katerina J. Kechris
  - Department of Biostatistics and Informatics, Colorado School of Public Health; University of Colorado Denver; Denver Colorado U.S.A
|
21
|
Piro RM, Molineris I, Di Cunto F, Eils R, König R. Disease-gene discovery by integration of 3D gene expression and transcription factor binding affinities. Bioinformatics 2012; 29:468-75. [PMID: 23267172 DOI: 10.1093/bioinformatics/bts720] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 01/19/2023]
Abstract
MOTIVATION The computational evaluation of candidate genes for hereditary disorders is a non-trivial task. Several excellent methods for disease-gene prediction have been developed in the past 2 decades, exploiting widely differing data sources to infer disease-relevant functional relationships between candidate genes and disorders. We have shown recently that spatially mapped, i.e. 3D, gene expression data from the mouse brain can be successfully used to prioritize candidate genes for human Mendelian disorders of the central nervous system. RESULTS We improved our previous work 2-fold: (i) we demonstrate that condition-independent transcription factor binding affinities of the candidate genes' promoters are relevant for disease-gene prediction and can be integrated with our previous approach to significantly enhance its predictive power; and (ii) we define a novel similarity measure, termed Relative Intensity Overlap, for both 3D gene expression patterns and binding affinity profiles that better exploits their disease-relevant information content. Finally, we present novel disease-gene predictions for eight loci associated with different syndromes of unknown molecular basis that are characterized by mental retardation.
Affiliation(s)
- Rosario M Piro
  - Department of Theoretical Bioinformatics, German Cancer Research Center (Deutsches Krebsforschungszentrum, DKFZ), University of Heidelberg, Im 69120 Heidelberg, Germany.
|
22
|
Burger A, Walczak AM, Wolynes PG. Influence of decoys on the noise and dynamics of gene expression. Phys Rev E Stat Nonlin Soft Matter Phys 2012; 86:041920. [PMID: 23214628 DOI: 10.1103/physreve.86.041920] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/10/2012] [Indexed: 06/01/2023]
Abstract
Many transcription factors bind to DNA with a remarkable lack of specificity, so that regulatory binding sites compete with an enormous number of nonregulatory "decoy" sites. For an autoregulated gene, we show decoy sites decrease noise in the number of unbound proteins to a Poisson limit that results from binding and unbinding. This noise buffering is optimized for a given protein concentration when decoys have a 1/2 probability of being occupied. Decoys linearly increase the time to approach steady state and exponentially increase the time to switch epigenetically between bistable states.
Affiliation(s)
- Anat Burger
  - Center for Theoretical Biological Physics, University of California San Diego, La Jolla, California, USA
|
23
|
Singh RK, Gonzalez M, Kabbaj MHM, Gunjan A. Novel E3 ubiquitin ligases that regulate histone protein levels in the budding yeast Saccharomyces cerevisiae. PLoS One 2012; 7:e36295. [PMID: 22570702 PMCID: PMC3343073 DOI: 10.1371/journal.pone.0036295] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Received: 12/09/2011] [Accepted: 03/29/2012] [Indexed: 02/02/2023] Open
Abstract
Core histone proteins are essential for packaging the genomic DNA into chromatin in all eukaryotes. Since multiple genes encode these histone proteins, there is potential for generating more histones than what is required for chromatin assembly. The positively charged histones have a very high affinity for negatively charged molecules such as DNA, and any excess of histone proteins results in deleterious effects on genomic stability and cell viability. Hence, histone levels are known to be tightly regulated via transcriptional, posttranscriptional and posttranslational mechanisms. We have previously elucidated the posttranslational regulation of histone protein levels by the ubiquitin-proteasome pathway involving the E2 ubiquitin conjugating enzymes Ubc4/5 and the HECT (Homologous to E6-AP C-Terminus) domain containing E3 ligase Tom1 in the budding yeast. Here we report the identification of four additional E3 ligases containing the RING (Really Interesting New Gene) finger domains that are involved in the ubiquitylation and subsequent degradation of excess histones in yeast. These E3 ligases are Pep5, Snt2 as well as two previously uncharacterized Open Reading Frames (ORFs) YKR017C and YDR266C that we have named Hel1 and Hel2 (for Histone E3 Ligases) respectively. Mutants lacking these E3 ligases are sensitive to histone overexpression as they fail to degrade excess histones and accumulate high levels of endogenous histones on histone chaperones. Co-immunoprecipitation assays showed that these E3 ligases interact with the major E2 enzyme Ubc4 that is involved in the degradation related ubiquitylation of histones. Using mutagenesis we further demonstrate that the RING domains of Hel1, Hel2 and Snt2 are required for histone regulation. Lastly, mutants corresponding to Hel1, Hel2 and Pep5 are sensitive to replication inhibitors. 
Overall, our results highlight the importance of posttranslational histone regulatory mechanisms that employ multiple E3 ubiquitin ligases to ensure excess histone degradation and thus contribute to the maintenance of genomic stability.
Affiliation(s)
- Rakesh Kumar Singh
  - Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, Florida, United States of America
- Melanie Gonzalez
  - Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, Florida, United States of America
- Marie-Helene Miquel Kabbaj
  - Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, Florida, United States of America
- Akash Gunjan
  - Department of Biomedical Sciences, College of Medicine, Florida State University, Tallahassee, Florida, United States of America
|
24
|
Kaplan T, Biggin MD. Quantitative models of the mechanisms that control genome-wide patterns of animal transcription factor binding. Methods Cell Biol 2012; 110:263-83. [PMID: 22482953 DOI: 10.1016/b978-0-12-388403-9.00011-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/28/2022]
Abstract
Animal transcription factors drive complex spatial and temporal patterns of gene expression during development by binding to a wide array of genomic regions. While the in vivo DNA binding landscape and in vitro DNA binding affinities of many such proteins have been characterized, our understanding of the forces that determine where, when, and the extent to which these transcription factors bind DNA in cells remains primitive. In this chapter, we describe computational thermodynamic models that predict the genome-wide DNA binding landscape of transcription factors in vivo and evaluate the contribution of biophysical determinants, such as protein-protein interactions and chromatin accessibility, on DNA occupancy. We show that predictions based only on DNA sequence and in vitro DNA affinity data achieve a mild correlation (r=0.4) with experimental measurements of in vivo DNA binding. However, by incorporating direct measurements of DNA accessibility in chromatin, it is possible to obtain much higher accuracy (r=0.6-0.9) for various transcription factors across known target genes. Thus, a combination of experimental DNA accessibility data and computational modeling of transcription factor DNA binding may be sufficient to predict the binding landscape of any animal transcription factor with reasonable accuracy.
Affiliation(s)
- Tommy Kaplan
  - Department of Molecular and Cell Biology, California Institute of Quantitative Biosciences, University of California, Berkeley, California, USA; School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
|
25
|
Berger N, Dubreucq B, Roudier F, Dubos C, Lepiniec L. Transcriptional regulation of Arabidopsis LEAFY COTYLEDON2 involves RLE, a cis-element that regulates trimethylation of histone H3 at lysine-27. Plant Cell 2011; 23:4065-78. [PMID: 22080598 PMCID: PMC3246333 DOI: 10.1105/tpc.111.087866] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.9] [Received: 05/31/2011] [Revised: 10/11/2011] [Accepted: 10/30/2011] [Indexed: 05/17/2023]
Abstract
LEAFY COTYLEDON2 (LEC2) is a master regulator of seed development in Arabidopsis thaliana. In vegetative organs, LEC2 expression is negatively regulated by Polycomb Repressive Complex2 (PRC2) that catalyzes histone H3 Lys 27 trimethylation (H3K27me3) and plays a crucial role in developmental phase transitions. To characterize the cis-regulatory elements involved in the transcriptional regulation of LEC2, molecular dissections and functional analyses of the promoter region were performed in vitro, both in yeast and in planta. Two cis-activating elements and a cis-repressing element (RLE) that is required for H3K27me3 marking were characterized. Remarkably, insertion of the RLE cis-element into pF3H, an unrelated promoter, is sufficient for repressing its transcriptional activity in different tissues. Besides improving our understanding of LEC2 regulation, this study provides important new insights into the mechanisms underlying H3K27me3 deposition and PRC2 recruitment at a specific locus in plants.
Affiliation(s)
- Nathalie Berger
  - Institut Jean-Pierre Bourgin, Unité Mixte de Recherche 1318 Institut National de la Recherche Agronomique–Agro-ParisTech, Saclay Plant Sciences, 78026 Versailles cedex, France
- Bertrand Dubreucq
  - Institut Jean-Pierre Bourgin, Unité Mixte de Recherche 1318 Institut National de la Recherche Agronomique–Agro-ParisTech, Saclay Plant Sciences, 78026 Versailles cedex, France
- François Roudier
  - Institut de Biologie de l'Ecole Normale Supérieure, Centre National de la Recherche Scientifique Unité Mixte de Recherche 8197–Institut National de la Santé et de la Recherche Médicale U1024, 75230 Paris cedex 05, France
- Christian Dubos
  - Institut Jean-Pierre Bourgin, Unité Mixte de Recherche 1318 Institut National de la Recherche Agronomique–Agro-ParisTech, Saclay Plant Sciences, 78026 Versailles cedex, France
- Loïc Lepiniec
  - Institut Jean-Pierre Bourgin, Unité Mixte de Recherche 1318 Institut National de la Recherche Agronomique–Agro-ParisTech, Saclay Plant Sciences, 78026 Versailles cedex, France
|
26
|
Cheng C, Min R, Gerstein M. TIP: a probabilistic method for identifying transcription factor target genes from ChIP-seq binding profiles. Bioinformatics 2011; 27:3221-7. [PMID: 22039215 DOI: 10.1093/bioinformatics/btr552] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Indexed: 12/22/2022]
Abstract
MOTIVATION ChIP-seq and ChIP-chip experiments have been widely used to identify transcription factor (TF) binding sites and target genes. Conventionally, a fairly 'simple' approach is employed for target gene identification e.g. finding genes with binding sites within 2 kb of a transcription start site (TSS). However, this does not take into account the number of sites upstream of the TSS, their exact positioning or the fact that different TFs appear to act at different characteristic distances from the TSS. RESULTS Here we propose a probabilistic model called target identification from profiles (TIP) that quantitatively measures the regulatory relationships between TFs and target genes. For each TF, our model builds a characteristic, averaged profile of binding around the TSS and then uses this to weight the sites associated with a given gene, providing a continuous-valued 'regulatory' score relating each TF and potential target. Moreover, the score can readily be turned into a ranked list of target genes and an estimate of significance, which is useful for case-dependent downstream analysis. CONCLUSION We show the advantages of TIP by comparing it to the 'simple' approach on several representative datasets, using motif occurrence and relationship to knock-out experiments as metrics of validation. Moreover, we show that the probabilistic model is not as sensitive to various experimental parameters (including sequencing depth and peak-calling method) as the simple approach; in fact, the lesser dependence on sequencing depth potentially utilizes the result of a ChIP-seq experiment in a more 'cost-effective' manner. CONTACT mark.gerstein@yale.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
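The profile-weighting idea summarized in this abstract can be sketched in a few lines of Python. This is a loose illustration under stated assumptions, not the published TIP method. The binned input format, the function names, and the plain weighted sum are hypothetical, and the significance estimate mentioned in the abstract is omitted.

```python
def average_profile(signal_by_gene):
    """Average binding signal per position around the TSS, normalised to sum to 1.

    signal_by_gene maps each gene to a list of binding signal values, one per
    position bin relative to its TSS (all lists are assumed to be equally long).
    """
    width = len(next(iter(signal_by_gene.values())))
    totals = [0.0] * width
    for signal in signal_by_gene.values():
        for i, value in enumerate(signal):
            totals[i] += value
    grand_total = sum(totals)
    if grand_total == 0:
        raise ValueError("no binding signal in any gene")
    return [total / grand_total for total in totals]


def regulatory_scores(signal_by_gene):
    """Weight each gene's signal by the TF's characteristic profile and sum it."""
    weights = average_profile(signal_by_gene)
    return {
        gene: sum(w * s for w, s in zip(weights, signal))
        for gene, signal in signal_by_gene.items()
    }


def rank_targets(signal_by_gene):
    """Genes ordered from highest to lowest continuous regulatory score."""
    scores = regulatory_scores(signal_by_gene)
    return sorted(scores, key=scores.get, reverse=True)
```

Unlike a fixed window such as "any site within 2 kb of the TSS", the weights here come from the data for each TF. Binding at positions where that factor typically binds therefore counts for more than binding elsewhere.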
Affiliation(s)
- Chao Cheng
  - Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
|
27
|
Reineke AR, Bornberg-Bauer E, Gu J. Evolutionary divergence and limits of conserved non-coding sequence detection in plant genomes. Nucleic Acids Res 2011; 39:6029-43. [PMID: 21470961 PMCID: PMC3152334 DOI: 10.1093/nar/gkr179] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 06/25/2010] [Revised: 02/22/2011] [Accepted: 03/15/2011] [Indexed: 12/17/2022] Open
Abstract
The discovery of regulatory motifs embedded in upstream regions of plants is a particularly challenging bioinformatics task. Previous studies have shown that motifs in plants are short compared with those found in vertebrates. Furthermore, plant genomes have undergone several diversification mechanisms such as genome duplication events which impact the evolution of regulatory motifs. In this article, a systematic phylogenomic comparison of upstream regions is conducted to further identify features of the plant regulatory genomes, the component of genomes regulating gene expression, to enable future de novo discoveries. The findings highlight differences in upstream region properties between major plant groups and the effects of divergence times and duplication events. First, clear differences in upstream region evolution can be detected between monocots and dicots, thus suggesting that a separation of these groups should be made when searching for novel regulatory motifs, particularly since universal motifs such as the TATA box are rare. Second, investigating the decay rate of significantly aligned regions suggests that a divergence time of ~100 mya sets a limit for reliable conserved non-coding sequence (CNS) detection. Insights presented here will set a framework to help identify embedded motifs of functional relevance by understanding the limits of bioinformatics detection for CNSs.
Affiliation(s)
- Jenny Gu
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse 1, 48149, Münster, Germany
|
28
|
Benson CC, Zhou Q, Long X, Miano JM. Identifying functional single nucleotide polymorphisms in the human CArGome. Physiol Genomics 2011; 43:1038-48. [PMID: 21771879 DOI: 10.1152/physiolgenomics.00098.2011] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Indexed: 01/22/2023] Open
Abstract
Regulatory SNPs (rSNPs) reside primarily within the nonprotein coding genome and are thought to disturb normal patterns of gene expression by altering DNA binding of transcription factors. Nevertheless, despite the explosive rise in SNP association studies, there is little information as to the function of rSNPs in human disease. Serum response factor (SRF) is a widely expressed DNA-binding transcription factor that has variable affinity to at least 1,216 permutations of a 10 bp transcription factor binding site (TFBS) known as the CArG box. We developed a robust in silico bioinformatics screening method to evaluate sequences around RefSeq genes for conserved CArG boxes. Utilizing a predetermined phastCons threshold score, we identified 8,252 strand-specific CArGs within an 8 kb window around the transcription start site of 5,213 genes, including all previously defined SRF target genes. We then interrogated this CArG dataset for the presence of previously annotated common polymorphisms. We found a total of 118 unique CArG boxes harboring a SNP within the 10 bp CArG sequence and 1,130 CArG boxes with SNPs located just outside the CArG element. Gel shift and luciferase reporter assays validated SRF binding and functional activity of several new CArG boxes. Importantly, SNPs within or just outside the CArG box often resulted in altered SRF binding and activity. Collectively, these findings demonstrate a powerful approach to computationally define rSNPs in the human CArGome and provide a foundation for similar analyses of other TFBS. Such information may find utility in genetic association studies of human disease where little insight is known regarding the functionality of rSNPs.
Affiliation(s)
- Craig C Benson
  - University of Rochester Medical Center, Rochester, NY, USA
|
29
|
Xie D, Chen CC, He X, Cao X, Zhong S. Towards an evolutionary model of transcription networks. PLoS Comput Biol 2011; 7:e1002064. [PMID: 21695281 PMCID: PMC3111474 DOI: 10.1371/journal.pcbi.1002064] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 12/10/2010] [Accepted: 04/08/2011] [Indexed: 11/18/2022] Open
Abstract
DNA evolution models made invaluable contributions to comparative genomics, although it seemed formidable to include non-genomic features into these models. In order to build an evolutionary model of transcription networks (TNs), we had to forfeit the substitution model used in DNA evolution and to start from modeling the evolution of the regulatory relationships. We present a quantitative evolutionary model of TNs, subjecting the phylogenetic distance and the evolutionary changes of cis-regulatory sequence, gene expression and network structure to one probabilistic framework. Using the genome sequences and gene expression data from multiple species, this model can predict regulatory relationships between a transcription factor (TF) and its target genes in all species, and thus identify TN re-wiring events. Applying this model to analyze the pre-implantation development of three mammalian species, we identified the conserved and re-wired components of the TNs downstream of a set of TFs including Oct4, Gata3/4/6, cMyc and nMyc. Evolutionary events on the DNA sequence that led to turnover of TF binding sites were identified, including the birth of an Oct4 binding site by a 2nt deletion. In contrast to recent reports of large interspecies differences of TF binding sites and gene expression patterns, the interspecies difference in TF-target relationship is much smaller. The data showed increasing conservation levels from genomic sequences to TF-DNA interaction, gene expression, TN, and finally to morphology, suggesting that evolutionary changes are larger at molecular levels and smaller at functional levels. The data also showed that evolutionarily older TFs are more likely to have conserved target genes, whereas younger TFs tend to have larger re-wiring rates.
DNA evolution models made invaluable contributions to comparative genomic studies. Still lacking is an evolutionary model of transcription networks (TNs). To develop such a model, we had to forfeit the substitution model used in DNA evolution and to start from modeling the evolution of the regulatory relationships, and then subject the phylogenetic distance and the multi-species DNA sequence and gene expression data to one probabilistic framework. This model enabled us to infer the evolutionary changes of transcriptional regulatory relationships. Applying this model to analyze three yeast species, we found the anaerobic phenotype in two species was associated with the evolutionary loss of a larger cis-regulatory motif than previously thought. Analyzing three mammalian species, we found increasing conservation levels from genomic sequences to transcription factor-DNA interaction, gene expression, TN, and finally to morphology, suggesting that evolutionary changes are larger at molecular levels and smaller at functional levels. We also found that evolutionarily younger TFs are more likely to regulate different target genes in different species.
Affiliation(s)
- Dan Xie
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Chieh-Chun Chen
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Xin He
- Department of Biochemistry and Biophysics, University of California, San Francisco, California, United States of America
- Xiaoyi Cao
- Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Sheng Zhong
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Center for Biophysics and Computational Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- * E-mail:
|
30
|
Hemberg M, Kreiman G. Conservation of transcription factor binding events predicts gene expression across species. Nucleic Acids Res 2011; 39:7092-102. [PMID: 21622661 PMCID: PMC3167604 DOI: 10.1093/nar/gkr404] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/26/2022] Open
Abstract
Recent technological advances have made it possible to determine the genome-wide binding sites of transcription factors (TFs). Comparisons across species have suggested a relatively low degree of evolutionary conservation of experimentally defined TF binding events (TFBEs). Using binding data for six different TFs in hepatocytes and embryonic stem cells from human and mouse, we demonstrate that evolutionary conservation of TFBEs within orthologous proximal promoters is closely linked to function, defined as expression of the target genes. We show that (i) there is a significantly higher degree of conservation of TFBEs when the target gene is expressed in both species; (ii) there is increased conservation of binding events for groups of TFs compared to individual TFs; and (iii) conserved TFBEs have a greater impact on the expression of their target genes than non-conserved ones. These results link conservation of structural elements (TFBEs) to conservation of function (gene expression) and suggest a higher degree of functional conservation than implied by previous studies.
Affiliation(s)
- Martin Hemberg
- Children's Hospital Boston, Program in Biophysics and Program in Neuroscience, Harvard Medical School, 300 Longwood Avenue, Boston, MA 02115, USA
|
31
|
Li XY, Thomas S, Sabo PJ, Eisen MB, Stamatoyannopoulos JA, Biggin MD. The role of chromatin accessibility in directing the widespread, overlapping patterns of Drosophila transcription factor binding. Genome Biol 2011; 12:R34. [PMID: 21473766 PMCID: PMC3218860 DOI: 10.1186/gb-2011-12-4-r34] [Citation(s) in RCA: 156] [Impact Index Per Article: 11.1] [Received: 03/19/2011] [Accepted: 04/07/2011] [Indexed: 12/11/2022] Open
Abstract
Background In Drosophila embryos, many biochemically and functionally unrelated transcription factors bind quantitatively to highly overlapping sets of genomic regions, with much of the lowest levels of binding being incidental, non-functional interactions on DNA. The primary biochemical mechanisms that drive these genome-wide occupancy patterns have yet to be established. Results Here we use data resulting from the DNaseI digestion of isolated embryo nuclei to provide a biophysical measure of the degree to which proteins can access different regions of the genome. We show that the in vivo binding patterns of 21 developmental regulators are quantitatively correlated with DNA accessibility in chromatin. Furthermore, we find that levels of factor occupancy in vivo correlate much more with the degree of chromatin accessibility than with occupancy predicted from in vitro affinity measurements using purified protein and naked DNA. Within accessible regions, however, the intrinsic affinity of the factor for DNA does play a role in determining net occupancy, with even weak affinity recognition sites contributing. Finally, we show that programmed changes in chromatin accessibility between different developmental stages correlate with quantitative alterations in factor binding. Conclusions Based on these and other results, we propose a general mechanism to explain the widespread, overlapping DNA binding by animal transcription factors. In this view, transcription factors are expressed at sufficiently high concentrations in cells such that they can occupy their recognition sequences in highly accessible chromatin without the aid of physical cooperative interactions with other proteins, leading to highly overlapping, graded binding of unrelated factors.
Affiliation(s)
- Xiao-Yong Li
- Genomics Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road MS 84-171, Berkeley, CA 94720, USA
|
32
|
Wang J. Computational study of associations between histone modification and protein-DNA binding in yeast genome by integrating diverse information. BMC Genomics 2011; 12:172. [PMID: 21457549 PMCID: PMC3082246 DOI: 10.1186/1471-2164-12-172] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 10/08/2010] [Accepted: 04/01/2011] [Indexed: 01/06/2023] Open
Abstract
Background In parallel with the rapid development of high-throughput technologies, in vivo and in vitro experiments for genome-wide identification of protein-DNA interactions have been developed. Nevertheless, a few questions remain in the field, such as how to distinguish true protein-DNA binding (functional binding) from non-specific protein-DNA binding (non-functional binding). Previous research tackled the problem through integrated analysis of multiple available sources. However, few systematic studies have been carried out to examine the possible relationships between histone modification and protein-DNA binding. Here this issue was investigated by using publicly available histone modification data in yeast. Results Two separate histone modification datasets were studied, at both the open reading frame (ORF) and the promoter region of binding targets for 37 yeast transcription factors. Both results revealed a distinct histone modification pattern between the functional protein-DNA binding sites and non-functional ones for almost half of all TFs tested. This difference is much stronger at the ORF than at the promoter region. In addition, a protein-histone modification interaction pathway can only be inferred from the functional protein binding targets. Conclusions Overall, the results suggest that histone modification information can be used to distinguish the functional protein-DNA binding from the non-functional, and that the regulation of various proteins is controlled by the modification of different histone lysines such as the protein-specific histone modification levels.
Affiliation(s)
- Junbai Wang
- Department of Pathology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway.
|
33
|
Moyroud E, Minguet EG, Ott F, Yant L, Posé D, Monniaux M, Blanchet S, Bastien O, Thévenon E, Weigel D, Schmid M, Parcy F. Prediction of regulatory interactions from genome sequences using a biophysical model for the Arabidopsis LEAFY transcription factor. Plant Cell 2011; 23:1293-306. [PMID: 21515819 PMCID: PMC3101549 DOI: 10.1105/tpc.111.083329] [Citation(s) in RCA: 127] [Impact Index Per Article: 9.1] [Received: 01/24/2011] [Revised: 03/22/2011] [Accepted: 04/01/2011] [Indexed: 05/18/2023]
Abstract
Despite great advances in sequencing technologies, generating functional information for nonmodel organisms remains a challenge. One solution lies in an improved ability to predict genetic circuits based on primary DNA sequence in combination with detailed knowledge of regulatory proteins that have been characterized in model species. Here, we focus on the LEAFY (LFY) transcription factor, a conserved master regulator of floral development. Starting with biochemical and structural information, we built a biophysical model describing LFY DNA binding specificity in vitro that accurately predicts in vivo LFY binding sites in the Arabidopsis thaliana genome. Applying the model to other plant species, we could follow the evolution of the regulatory relationship between LFY and the AGAMOUS (AG) subfamily of MADS box genes and show that this link predates the divergence between monocots and eudicots. Remarkably, our model succeeds in detecting the connection between LFY and AG homologs despite extensive variation in binding sites. This demonstrates that the cis-element fluidity recently observed in animals also exists in plants, but the challenges it poses can be overcome with predictions grounded in a biophysical model. Therefore, our work opens new avenues to deduce the structure of regulatory networks from mere inspection of genomic sequences.
Affiliation(s)
- Edwige Moyroud
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Eugenio Gómez Minguet
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Felix Ott
- Max Planck Institute for Developmental Biology, Department of Molecular Biology, 72076 Tuebingen, Germany
- Levi Yant
- Max Planck Institute for Developmental Biology, Department of Molecular Biology, 72076 Tuebingen, Germany
- David Posé
- Max Planck Institute for Developmental Biology, Department of Molecular Biology, 72076 Tuebingen, Germany
- Marie Monniaux
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Sandrine Blanchet
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Olivier Bastien
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Emmanuel Thévenon
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Detlef Weigel
- Max Planck Institute for Developmental Biology, Department of Molecular Biology, 72076 Tuebingen, Germany
- Markus Schmid
- Max Planck Institute for Developmental Biology, Department of Molecular Biology, 72076 Tuebingen, Germany
- François Parcy
- Laboratoire de Physiologie Cellulaire Végétale, Unité Mixte de Recherche 5168, Centre National de la Recherche Scientifique, Commissariat à l’Énergie Atomique, Institut National de la Recherche Agronomique, Université Joseph Fourier Grenoble I, 38054 Grenoble, France
- Address correspondence to
|
34
|
Geeven G, Macgillavry HD, Eggers R, Sassen MM, Verhaagen J, Smit AB, de Gunst MCM, van Kesteren RE. LLM3D: a log-linear modeling-based method to predict functional gene regulatory interactions from genome-wide expression data. Nucleic Acids Res 2011; 39:5313-27. [PMID: 21422075 PMCID: PMC3141251 DOI: 10.1093/nar/gkr139] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 12/11/2022] Open
Abstract
All cellular processes are regulated by condition-specific and time-dependent interactions between transcription factors and their target genes. While in simple organisms, e.g. bacteria and yeast, a large amount of experimental data is available to support functional transcription regulatory interactions, in mammalian systems reconstruction of gene regulatory networks still heavily depends on the accurate prediction of transcription factor binding sites. Here, we present a new method, log-linear modeling of 3D contingency tables (LLM3D), to predict functional transcription factor binding sites. LLM3D combines gene expression data, gene ontology annotation and computationally predicted transcription factor binding sites in a single statistical analysis, and offers a methodological improvement over existing enrichment-based methods. We show that LLM3D successfully identifies novel transcriptional regulators of the yeast metabolic cycle, and correctly predicts key regulators of mouse embryonic stem cell self-renewal more accurately than existing enrichment-based methods. Moreover, in a clinically relevant in vivo injury model of mammalian neurons, LLM3D identified peroxisome proliferator-activated receptor γ (PPARγ) as a neuron-intrinsic transcriptional regulator of regenerative axon growth. In conclusion, LLM3D provides a significant improvement over existing methods in predicting functional transcription regulatory interactions in the absence of experimental transcription factor binding data.
Affiliation(s)
- Geert Geeven
- Department of Mathematics, Faculty of Sciences, VU University, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands
|
35
|
Molineris I, Grassi E, Ala U, Di Cunto F, Provero P. Evolution of promoter affinity for transcription factors in the human lineage. Mol Biol Evol 2011; 28:2173-83. [PMID: 21335606 DOI: 10.1093/molbev/msr027] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 12/22/2022] Open
Abstract
Changes in gene regulation are believed to play an important role in the evolution of animals. It has been suggested that changes in cis-regulatory regions are responsible for many or most of the anatomical and behavioral differences between humans and apes. However, the study of the evolution of cis-regulatory regions is made problematic by the degeneracy of transcription factor (TF) binding sites and the shuffling of their positions. In this work, we use the predicted total affinity of a promoter for a large collection of TFs as the basis for studying the evolution of cis-regulatory regions in mammals. We introduce the human specificity of a promoter, measuring the divergence between the affinity profile of a human promoter and its orthologous promoters in other mammals. The promoters of genes involved in functional categories such as neural processes and signal transduction, among others, have higher human specificity compared with the rest of the genome. Clustering of the human-specific affinities (HSAs) of neural genes reveals patterns of promoter evolution associated with functional categories such as synaptic transmission and brain development and to diseases such as bipolar disorder and autism.
Affiliation(s)
- Ivan Molineris
- Department of Genetics, Biology and Biochemistry, Molecular Biotechnology Center, University of Turin, Turin, Italy
|
36
|
Kaplan T, Li XY, Sabo PJ, Thomas S, Stamatoyannopoulos JA, Biggin MD, Eisen MB. Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early Drosophila development. PLoS Genet 2011; 7:e1001290. [PMID: 21304941 PMCID: PMC3033374 DOI: 10.1371/journal.pgen.1001290] [Citation(s) in RCA: 139] [Impact Index Per Article: 9.9] [Received: 11/04/2010] [Accepted: 01/01/2011] [Indexed: 01/01/2023] Open
Abstract
Transcription factors that drive complex patterns of gene expression during animal development bind to thousands of genomic regions, with quantitative differences in binding across bound regions mediating their activity. While we now have tools to characterize the DNA affinities of these proteins and to precisely measure their genome-wide distribution in vivo, our understanding of the forces that determine where, when, and to what extent they bind remains primitive. Here we use a thermodynamic model of transcription factor binding to evaluate the contribution of different biophysical forces to the binding of five regulators of early embryonic anterior-posterior patterning in Drosophila melanogaster. Predictions based on DNA sequence and in vitro protein-DNA affinities alone achieve a correlation of ∼0.4 with experimental measurements of in vivo binding. Incorporating cooperativity and competition among the five factors, and accounting for spatial patterning by modeling binding in every nucleus independently, had little effect on prediction accuracy. A major source of error was the prediction of binding events that do not occur in vivo, which we hypothesized reflected reduced accessibility of chromatin. To test this, we incorporated experimental measurements of genome-wide DNA accessibility into our model, effectively restricting predicted binding to regions of open chromatin. This dramatically improved our predictions to a correlation of 0.6-0.9 for various factors across known target genes. Finally, we used our model to quantify the roles of DNA sequence, accessibility, and binding competition and cooperativity. Our results show that, in regions of open chromatin, binding can be predicted almost exclusively by the sequence specificity of individual factors, with a minimal role for protein interactions. 
We suggest that a combination of experimentally determined chromatin accessibility data and simple computational models of transcription factor binding may be used to predict the binding landscape of any animal transcription factor with significant precision.
Affiliation(s)
- Tommy Kaplan
- Department of Molecular and Cell Biology, California Institute of Quantitative Biosciences, University of California Berkeley, Berkeley, California, United States of America
- Xiao-Yong Li
- Howard Hughes Medical Institute, University of California Berkeley, Berkeley, California, United States of America
- Peter J. Sabo
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Sean Thomas
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Mark D. Biggin
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Michael B. Eisen
- Department of Molecular and Cell Biology, California Institute of Quantitative Biosciences, University of California Berkeley, Berkeley, California, United States of America
- Howard Hughes Medical Institute, University of California Berkeley, Berkeley, California, United States of America
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
|
37
|
Tietjen JR, Donato LJ, Bhimisaria D, Ansari AZ. Sequence-specificity and energy landscapes of DNA-binding molecules. Methods Enzymol 2011; 497:3-30. [PMID: 21601080 DOI: 10.1016/b978-0-12-385075-1.00001-9] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
Abstract
A central goal of biology is to understand how transcription factors target and regulate specific genes and networks to control cell fate and function. An equally important goal of synthetic biology, chemical biology, and personalized medicine is to devise molecules that can regulate genes and networks in a programmable manner. To achieve these goals, it is necessary to chart the sequence specificity of natural and engineered DNA-binding molecules. Cognate site identification (CSI) is now achieved via unbiased, high-throughput platforms that interrogate an entire sequence space bound by typical DNA-binding molecules. Analysis of these comprehensive specificity profiles is facilitated through the use of sequence-specificity landscapes (SSLs). SSLs reveal new modes of sequence cognition and overcome the limitations of current approaches that yield amalgamated "consensus" motifs. The landscapes also reveal the impact of nonconserved flanking sequences on binding to cognate sites. SSLs also serve as comprehensive binding energy landscapes that provide insights into the energetic thresholds at which natural and engineered molecules function within cells. Furthermore, applying the CSI binding data to genomic sequence (genomescapes) provides a powerful tool for identification of potential in vivo binding sites of a given DNA ligand, and can provide insight into differential regulation of gene networks. These tools can be directly applied to the design and development of synthetic therapeutic molecules and to expand our knowledge of the basic principles of molecular recognition.
Affiliation(s)
- Joshua R Tietjen
- Department of Biochemistry, The Genome Center of Wisconsin, University of Wisconsin-Madison, Madison, Wisconsin, USA
|
38
|
Venkataram S, Fay JC. Is transcription factor binding site turnover a sufficient explanation for cis-regulatory sequence divergence? Genome Biol Evol 2010; 2:851-8. [PMID: 21068212 PMCID: PMC2997565 DOI: 10.1093/gbe/evq066] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 12/30/2022] Open
Abstract
The molecular evolution of cis-regulatory sequences is not well understood. Comparisons of closely related species show that cis-regulatory sequences contain a large number of sites constrained by purifying selection. In contrast, there are a number of examples from distantly related species where cis-regulatory sequences retain little to no sequence similarity but drive similar patterns of gene expression. Binding site turnover, whereby the gain of a redundant binding site enables loss of a previously functional site, is one model by which cis-regulatory sequences can diverge without a concurrent change in function. To determine whether cis-regulatory sequence divergence is consistent with binding site turnover, we examined binding site evolution within orthologous intergenic sequences from 14 yeast species defined by their syntenic relationships with adjacent coding sequences. Both local and global alignments show that nearly all distantly related orthologous cis-regulatory sequences have no significant level of sequence similarity but are enriched for experimentally identified binding sites. Yet, a significant proportion of experimentally identified binding sites that are conserved in closely related species are absent in distantly related species and so cannot be explained by binding site turnover. Depletion of binding sites depends on the transcription factor but is detectable for a quarter of all transcription factors examined. Our results imply that binding site turnover is not a sufficient explanation for cis-regulatory sequence evolution.
|
39
|
Fuellen G. Evolution of gene regulation--on the road towards computational inferences. Brief Bioinform 2010; 12:122-31. [PMID: 20702596 DOI: 10.1093/bib/bbq060] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
If fragments of DNA are transcribed (expressed), they deserve to be called (parts of) a gene. Whether transcription takes place depends on the 'gene regulatory network'. This network is defined as the complex interplay of the sequence, biochemical modifications and structure of the chromosomal DNA with the regulatory proteins/RNA (transcription factors, co-factors, regulating RNA and the transcriptional apparatus itself). Gene regulatory networks play a role in various stages of development as well as in the maintenance of the organism; in this review we will concentrate on the former. Their evolutionary reconstruction is daunting (to say the least), and bioinformatics tools are in their infancy. However, gain of understanding offers a reward beyond itself, since evolutionary considerations can enable discoveries in the first place, e.g. the computational identification of conserved transcription factor binding sites. We discuss the evolution of gene regulation in the context of the 'Genetic Theory of Morphological Evolution' as described by Carroll, identifying those parts of the theory that are relevant for bioinformatics, and their implications. We discuss the important question of how bioinformatics analysis results on the evolution of gene regulation may be validated. Finally, we briefly exemplify use of the UCSC genome browser, exploiting its pre-computed alignments to describe the evolution of gene regulation.
Affiliation(s)
- Georg Fuellen
- Institute for Biostatistics and Informatics in Medicine and Ageing Research-IBIMA, University of Rostock, Medical Faculty, Ernst-Heydemann-Str. 8, 18057 Rostock, Germany.
|
40
|
MER41 repeat sequences contain inducible STAT1 binding sites. PLoS One 2010; 5:e11425. [PMID: 20625510 PMCID: PMC2897888 DOI: 10.1371/journal.pone.0011425] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.9] [Received: 12/18/2009] [Accepted: 06/02/2010] [Indexed: 11/19/2022] Open
Abstract
Chromatin immunoprecipitation combined with massively parallel sequencing methods (ChIP-seq) is becoming the standard approach to study interactions of transcription factors (TF) with genomic sequences. Using public STAT1 ChIP-seq data sets as an example, we present novel approaches for the interpretation of ChIP-seq data. We compare recently developed approaches to determine STAT1 binding sites from ChIP-seq data. Assessing the content of the established consensus sequence for STAT1 binding sites, we find that the usage of “negative control” ChIP-seq data fails to provide substantial advantages. We derive a single refined probabilistic model of STAT1 binding sequences from these ChIP-seq data. Contrary to previous claims, we find no evidence that STAT1 binds to multiple distinct motifs upon interferon-gamma stimulation in vivo. While a large majority of genomic sites with high ChIP-seq signal is associated with a nucleotide sequence resembling a STAT1 binding site, only a very small subset of the over 5 million potential STAT1 binding sites in the human genome is covered by ChIP-seq data. Furthermore, a surprisingly large fraction of the ChIP-seq signal (5%) is absorbed by a small family of repetitive sequences (MER41). The observation of the binding of activated STAT1 protein to a specific repetitive element bolsters similar reports concerning p53 and other TFs, and strengthens the notion of an involvement of repeats in gene regulation. Incidentally, MER41 repeats are specific to primates; consequently, regulatory mechanisms in the IFN-STAT pathway might fundamentally differ between primates and rodents. On the methodological side, the presence of large numbers of nearly identical binding sites in repetitive sequences may lead to wrong conclusions about intrinsic binding preferences of TFs, as illustrated by the spacing analysis of STAT1 tandem motifs. Therefore, ChIP-seq data should be analyzed independently within repetitive and non-repetitive sequences.
|
41
|
Kiełbasa SM, Klein H, Roider HG, Vingron M, Blüthgen N. TransFind--predicting transcriptional regulators for gene sets. Nucleic Acids Res 2010; 38:W275-80. [PMID: 20511592 PMCID: PMC2896106 DOI: 10.1093/nar/gkq438] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 02/03/2023] Open
Abstract
The analysis of putative transcription factor binding sites in promoter regions of coregulated genes makes it possible to infer the transcription factors that underlie observed changes in gene expression. While such analyses constitute a central component of the in silico characterization of transcriptional regulatory networks, there is still a lack of simple-to-use web servers that combine state-of-the-art prediction methods with phylogenetic analysis and appropriate multiple-testing-corrected statistics and that return results within a short time. With these aims in mind, we developed TransFind, which is freely available at http://transfind.sys-bio.net/.
Affiliation(s)
- Szymon M Kiełbasa
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany.
|
42
|
Gordân R, Narlikar L, Hartemink AJ. Finding regulatory DNA motifs using alignment-free evolutionary conservation information. Nucleic Acids Res 2010; 38:e90. [PMID: 20047961 PMCID: PMC2847231 DOI: 10.1093/nar/gkp1166] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 08/16/2009] [Revised: 10/30/2009] [Accepted: 11/23/2009] [Indexed: 01/01/2023] Open
Abstract
As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do.
Affiliation(s)
- Raluca Gordân
- Department of Computer Science, Duke University, Box 90129, Durham, NC 27708, USA
|
43
|
Zhou X, Sumazin P, Rajbhandari P, Califano A. A systems biology approach to transcription factor binding site prediction. PLoS One 2010; 5:e9878. [PMID: 20360861 PMCID: PMC2845628 DOI: 10.1371/journal.pone.0009878] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 02/01/2010] [Accepted: 03/02/2010] [Indexed: 11/18/2022]
Abstract
BACKGROUND The elucidation of mammalian transcriptional regulatory networks holds great promise for both basic and translational research and remains one of the greatest challenges to systems biology. Recent reverse engineering methods deduce regulatory interactions from large-scale mRNA expression profiles and cross-species conserved regulatory regions in DNA. Technical challenges faced by these methods include distinguishing between direct and indirect interactions, associating transcription regulators with predicted transcription factor binding sites (TFBSs), identifying non-linearly conserved binding sites across species, and providing realistic accuracy estimates. METHODOLOGY/PRINCIPAL FINDINGS We address these challenges by closely integrating proven methods for regulatory network reverse engineering from mRNA expression data, linearly and non-linearly conserved regulatory region discovery, and TFBS evaluation and discovery. Using an extensive test set of high-likelihood interactions, which we collected in order to provide realistic prediction-accuracy estimates, we show that a careful integration of these methods leads to significant improvements in prediction accuracy. To verify our methods, we biochemically validated TFBS predictions made for both transcription factors (TFs) and co-factors; we validated binding site predictions made using a known E2F1 DNA-binding motif on E2F1 predicted promoter targets, known E2F1 and JUND motifs on JUND predicted promoter targets, and a de novo discovered motif for BCL6 on BCL6 predicted promoter targets. Finally, to demonstrate accuracy of prediction using an external dataset, we showed that sites matching predicted motifs for ZNF263 are significantly enriched in recent ZNF263 ChIP-seq data.
CONCLUSIONS/SIGNIFICANCE Using an integrative framework, we were able to address technical challenges faced by state of the art network reverse engineering methods, leading to significant improvement in direct-interaction detection and TFBS-discovery accuracy. We estimated the accuracy of our framework on a human B-cell specific test set, which may help guide future methodological development.
Affiliation(s)
- Xiang Zhou
- Department of Biomedical Informatics (DBMI), Columbia University, New York, New York, United States of America
- Pavel Sumazin
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Presha Rajbhandari
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Andrea Califano
- Department of Biomedical Informatics (DBMI), Columbia University, New York, New York, United States of America
- Center for Computational Biology and Bioinformatics (C2B2), Columbia University, New York, New York, United States of America
- Herbert Irving Comprehensive Cancer Center, Columbia University, New York, New York, United States of America
|
44
|
Kim J, Cunningham R, James B, Wyder S, Gibson JD, Niehuis O, Zdobnov EM, Robertson HM, Robinson GE, Werren JH, Sinha S. Functional characterization of transcription factor motifs using cross-species comparison across large evolutionary distances. PLoS Comput Biol 2010; 6:e1000652. [PMID: 20126523 PMCID: PMC2813253 DOI: 10.1371/journal.pcbi.1000652] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 05/29/2009] [Accepted: 12/18/2009] [Indexed: 11/19/2022]
Abstract
We address the problem of finding statistically significant associations between cis-regulatory motifs and functional gene sets, in order to understand the biological roles of transcription factors. We develop a computational framework for this task, whose features include a new statistical score for motif scanning, the use of different scores for predicting targets of different motifs, and new ways to deal with redundancies among significant motif–function associations. This framework is applied to the recently sequenced genome of the jewel wasp, Nasonia vitripennis, making use of the existing knowledge of motifs and gene annotations in another insect genome, that of the fruitfly. The framework uses cross-species comparison to improve the specificity of its predictions, and does so without relying upon non-coding sequence alignment. It is therefore well suited for comparative genomics across large evolutionary divergences, where existing alignment-based methods are not applicable. We also apply the framework to find motifs associated with socially regulated gene sets in the honeybee, Apis mellifera, using comparisons with Nasonia, a solitary species, to identify honeybee-specific associations. We develop a computational pipeline for predicting the functions of transcription factor motifs, through DNA sequence analysis. The pipeline is applied to the newly sequenced genome of the jewel wasp, Nasonia vitripennis. It exploits the wealth of molecular data available in another insect species, the fruitfly Drosophila melanogaster, and uses cross-species comparison to its advantage. Our main contribution is to show how this can be done despite the large evolutionary divergence between the two species. The methodology presented here may be applied more generally to other scenarios (genomes) where comparative regulatory genomics must deal with large evolutionary divergences.
Affiliation(s)
- Jaebum Kim
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Ryan Cunningham
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Brian James
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Stefan Wyder
- Department of Genetic Medicine and Development, University of Geneva Medical School, and Swiss Institute of Bioinformatics, Geneva, Switzerland
- Joshua D. Gibson
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Oliver Niehuis
- Department of Biology, University of Osnabrück, Osnabrück, Germany
- Evgeny M. Zdobnov
- Department of Genetic Medicine and Development, University of Geneva Medical School, and Swiss Institute of Bioinformatics, Geneva, Switzerland
- Hugh M. Robertson
- Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Gene E. Robinson
- Department of Entomology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- John H. Werren
- Department of Biology, University of Rochester, Rochester, New York, United States of America
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
- Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
|
45
|
Prevalence of transcription promoters within archaeal operons and coding sequences. Mol Syst Biol 2009; 5:285. [PMID: 19536208 PMCID: PMC2710873 DOI: 10.1038/msb.2009.42] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.0] [Received: 11/20/2008] [Accepted: 05/13/2009] [Indexed: 01/21/2023]
Abstract
Despite the knowledge of complex prokaryotic-transcription mechanisms, generalized rules, such as the simplified organization of genes into operons with well-defined promoters and terminators, have had a significant role in systems analysis of regulatory logic in both bacteria and archaea. Here, we have investigated the prevalence of alternate regulatory mechanisms through genome-wide characterization of transcript structures of approximately 64% of all genes, including putative non-coding RNAs in Halobacterium salinarum NRC-1. Our integrative analysis of transcriptome dynamics and protein-DNA interaction data sets showed widespread environment-dependent modulation of operon architectures, transcription initiation and termination inside coding sequences, and extensive overlap in 3' ends of transcripts for many convergently transcribed genes. A significant fraction of these alternate transcriptional events correlate with binding locations of 11 transcription factors and regulators (TFs) inside operons and annotated genes, events usually considered spurious or non-functional. Using experimental validation, we illustrate the prevalence of overlapping genomic signals in archaeal transcription, casting doubt on the general perception of rigid boundaries between coding sequences and regulatory elements.
|
46
|
Sebestyén E, Nagy T, Suhai S, Barta E. DoOPSearch: a web-based tool for finding and analysing common conserved motifs in the promoter regions of different chordate and plant genes. BMC Bioinformatics 2009; 10 Suppl 6:S6. [PMID: 19534755 PMCID: PMC2697653 DOI: 10.1186/1471-2105-10-s6-s6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
Background The comparative genomic analysis of a large number of orthologous promoter regions of the chordate and plant genes from the DoOP databases shows thousands of conserved motifs. Most of these motifs differ from any known transcription factor binding site (TFBS). To identify common conserved motifs, we need a specific tool to be able to search amongst them. Since conserved motifs from the DoOP databases are linked to genes, the result of such a search can give a list of genes that are potentially regulated by the same transcription factor(s). Results We have developed a new tool called DoOPSearch for the analysis of the conserved motifs in the promoter regions of chordate or plant genes. We used the orthologous promoters of the DoOP database to extract thousands of conserved motifs from different taxonomic groups. The advantage of this approach is that different sets of conserved motifs might be found depending on how broad the taxonomic coverage of the underlying orthologous promoter sequence collection is (consider e.g. primates vs. mammals or Brassicaceae vs. Viridiplantae). The DoOPSearch tool allows the users to search these motif collections or the promoter regions of DoOP with user supplied query sequences or any of the conserved motifs from the DoOP database. To find overrepresented gene ontologies, the gene lists obtained can be analysed further using a modified version of the GeneMerge program. Conclusion We present here a comparative genomics based promoter analysis tool. Our system is based on a unique collection of conserved promoter motifs characteristic of different taxonomic groups. We offer both a command line and a web-based tool for searching in these motif collections using user specified queries. These can be either short promoter sequences or consensus sequences of known transcription factor binding sites. 
The GeneMerge analysis of the search results allows the user to identify statistically overrepresented Gene Ontology terms that might provide a clue on the function of the motifs and genes.
Affiliation(s)
- Endre Sebestyén
- Agricultural Research Institute of the Hungarian Academy of Sciences, Martonvásár, Brunszvik u, 2, H-2462, Hungary.
|
47
|
Abstract
MOTIVATION Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. RESULTS This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. AVAILABILITY AND IMPLEMENTATION The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenjie Fu
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
48
|
Hazelett DJ, Lakeland DL, Weiss JB. Affinity Density: a novel genomic approach to the identification of transcription factor regulatory targets. Bioinformatics 2009; 25:1617-24. [PMID: 19401399 PMCID: PMC2732317 DOI: 10.1093/bioinformatics/btp282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/22/2023]
Abstract
Methods: A new method was developed for identifying novel transcription factor regulatory targets based on calculating Local Affinity Density. Techniques from the signal-processing field were used, in particular the Hann digital filter, to calculate the relative binding affinity of different regions based on previously published in vitro binding data. To illustrate this approach, the complete genomes of Drosophila melanogaster and D. pseudoobscura were analyzed for binding sites of the homeodomain protein Tinman, an essential heart development gene in both Drosophila and mouse. The significant binding regions were identified relative to genomic background and assigned to putative target genes. Valid candidates common to both species of Drosophila were selected as a test of conservation. Results: The new method was more sensitive than cluster searches for conserved binding motifs with respect to positive identification of known Tinman targets. Our Local Affinity Density method also identified a significantly greater proportion of Tinman-coexpressed genes than equivalent, optimized cluster searching. In addition, this new method predicted a significantly greater than expected number of genes with previously published RNAi phenotypes in the heart. Availability: Algorithms were implemented in Python, LISP, R and maxima, using MySQL to access locally mirrored sequence data from Ensembl (D. melanogaster release 4.3) and flybase (D. pseudoobscura). All code is licensed under GPL and freely available at http://www.ohsu.edu/cellbio/dev_biol_prog/affinitydensity/. Contact: hazelett@ohsu.edu
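The idea of smoothing per-position binding affinities with a Hann filter can be sketched as follows. This is only an illustration under stated assumptions: the per-position affinity scores are taken as given, the window is normalized to sum to one, and the names `hann_window` and `affinity_density` are hypothetical. It is not the authors' published implementation.

```python
import math


def hann_window(size: int) -> list[float]:
    """Symmetric Hann window: w[k] = 0.5 * (1 - cos(2*pi*k / (size - 1)))."""
    if size < 3 or size % 2 == 0:
        raise ValueError("size must be an odd integer >= 3")
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / (size - 1))) for k in range(size)]


def affinity_density(scores: list[float], window_size: int) -> list[float]:
    """Smooth per-position affinity scores with a normalized Hann window.
    Positions beyond the sequence ends are treated as zero affinity."""
    window = hann_window(window_size)
    total = sum(window)
    weights = [w / total for w in window]
    half = window_size // 2
    density = []
    for i in range(len(scores)):
        acc = 0.0
        for k, weight in enumerate(weights):
            j = i + k - half  # position covered by this window tap
            if 0 <= j < len(scores):
                acc += weight * scores[j]
        density.append(acc)
    return density


if __name__ == "__main__":
    # A single strong site in the middle of an otherwise empty region.
    print(affinity_density([0, 0, 0, 1, 0, 0, 0], 5))
```

A region with several moderate-affinity sites close together yields a higher smoothed value than an isolated site of the same strength, which is the intuition behind scoring regions by density rather than by individual motif matches.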
Affiliation(s)
- Dennis J Hazelett
- Integrative Biosciences, Oregon Health and Science University, 611 SW Campus Drive, Portland, OR 97239, USA.
|
49
|
Kechris K, Li H. c-REDUCE: incorporating sequence conservation to detect motifs that correlate with expression. BMC Bioinformatics 2008; 9:506. [PMID: 19040743 PMCID: PMC2626603 DOI: 10.1186/1471-2105-9-506] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/21/2008] [Accepted: 11/28/2008] [Indexed: 11/25/2022]
Abstract
BACKGROUND Computational methods for characterizing novel transcription factor binding sites search for sequence patterns or "motifs" that appear repeatedly in genomic regions of interest. Correlation-based motif finding strategies are used to identify motifs that correlate with expression data and do not rely on promoter sequences from a pre-determined set of genes. RESULTS In this work, we describe a method for predicting motifs that combines the correlation-based strategy with phylogenetic footprinting, where motifs are identified by evaluating orthologous sequence regions from multiple species. Our method, c-REDUCE, can account for variability at a motif position inferred from evolutionary information. c-REDUCE has been tested on ChIP-chip data for yeast transcription factors and on gene expression data in Drosophila. CONCLUSION Our results indicate that utilizing sequence conservation information in addition to correlation-based methods improves the identification of known motifs.
Affiliation(s)
- Katerina Kechris
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, 4200 East Ninth Avenue, B-119, Denver, CO 80262, USA
- Hao Li
- Department of Biochemistry and Biophysics, UCSF, 1700 4th Street, San Francisco, CA 94143, USA
- Center for Theoretical Biology, Peking University, Beijing 100871, PR China
|