1
Majdandzic A, Rajesh C, Koo PK. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol 2023; 24:109. [PMID: 37161475 PMCID: PMC10169356 DOI: 10.1186/s13059-023-02956-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/04/2022] [Accepted: 04/28/2023] [Indexed: 05/11/2023] Open
Abstract
Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
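The abstract does not spell out the correction, so the following is only a minimal sketch under a stated assumption: because a one-hot encoded position always sums to one, the gradient component shared by all four nucleotide channels at a position carries no information about which nucleotide matters, and one plausible correction is to subtract the per-position mean across channels. The function names and the random stand-in for gradients are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as an (L, 4) one-hot array with channel order A, C, G, T."""
    alphabet = "ACGT"
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        x[i, alphabet.index(base)] = 1.0
    return x

def correct_gradients(grads: np.ndarray) -> np.ndarray:
    """Assumed correction: remove the per-position mean across the 4 nucleotide channels.

    grads has shape (L, 4) or (N, L, 4); the last axis indexes nucleotides.
    """
    return grads - grads.mean(axis=-1, keepdims=True)

# Stand-in for gradients of a model output with respect to the one-hot input.
rng = np.random.default_rng(0)
x = one_hot("ACGTTGCA")
grads = rng.normal(size=x.shape)
corrected = correct_gradients(grads)
```

After this step the four channel values at each position sum to zero, so applying the function a second time changes nothing.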
Affiliation(s)
- Antonio Majdandzic
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Chandana Rajesh
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
2
Mourad R. Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences. BMC Bioinformatics 2023; 24:186. [PMID: 37147561 PMCID: PMC10163727 DOI: 10.1186/s12859-023-05303-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Accepted: 04/25/2023] [Indexed: 05/07/2023] Open
Abstract
MOTIVATION Genome-wide association studies have systematically identified thousands of single nucleotide polymorphisms (SNPs) associated with complex genetic diseases. However, the majority of those SNPs were found in non-coding genomic regions, which hinders understanding of the underlying causal mechanisms. Predicting molecular processes from the DNA sequence is a promising approach to understanding the role of those non-coding SNPs. In recent years, deep learning has been successfully applied to regulatory sequence prediction using supervised learning. Supervised learning requires DNA sequences associated with functional data for training, the amount of which is strongly limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequence is increasing exponentially owing to ongoing large sequencing projects, but in most cases without functional data. RESULTS To alleviate the limitations of supervised learning, we propose a paradigm shift to semi-supervised learning, which exploits not only labeled sequences (e.g. the human genome with a ChIP-seq experiment) but also unlabeled sequences available in much larger amounts (e.g. from other species without a ChIP-seq experiment, such as chimpanzee). Our approach is flexible and can be plugged into any neural architecture, including shallow and deep networks, and shows strong predictive performance improvements over supervised learning in most cases (up to [Formula: see text]). AVAILABILITY AND IMPLEMENTATION https://forgemia.inra.fr/raphael.mourad/deepgnn
Affiliation(s)
- Raphaël Mourad
  - MIAT, INRAE, 31320, Castanet-Tolosan, France
  - University of Toulouse, UPS, 31062, Toulouse, France
3
Lee NK, Tang Z, Toneyan S, Koo PK. EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations. Genome Biol 2023; 24:105. [PMID: 37143118 PMCID: PMC10161416 DOI: 10.1186/s13059-023-02941-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Accepted: 04/17/2023] [Indexed: 05/06/2023] Open
Abstract
Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.
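To make the idea of evolution-inspired augmentation concrete, the sketch below applies two illustrative transformations to a one-hot encoded sequence: random point mutation and reverse complementation. It is an assumption-laden illustration, not the EvoAug package's API; the function names, the `mutate_frac` parameter, and the choice of these two transformations are hypothetical. In the workflow the abstract describes, a model would be trained on augmented sequences like these and then fine-tuned on the original, non-transformed data.

```python
import numpy as np

def random_mutation(x: np.ndarray, mutate_frac: float, rng: np.random.Generator) -> np.ndarray:
    """Overwrite a random fraction of positions in an (L, 4) one-hot array with random nucleotides.

    A drawn nucleotide may equal the original one, so the number of actual changes
    can be smaller than the number of selected positions.
    """
    x = x.copy()
    length = x.shape[0]
    n_mut = int(round(mutate_frac * length))
    positions = rng.choice(length, size=n_mut, replace=False)
    new_bases = rng.integers(0, 4, size=n_mut)
    x[positions] = 0.0
    x[positions, new_bases] = 1.0
    return x

def reverse_complement(x: np.ndarray) -> np.ndarray:
    """Reverse complement of an (L, 4) one-hot array with channel order A, C, G, T.

    Reversing the channel axis swaps A<->T and C<->G; reversing the position axis
    reverses the sequence.
    """
    return x[::-1, ::-1].copy()

rng = np.random.default_rng(0)
x = np.eye(4, dtype=np.float32)[[0, 0, 1, 2]]  # the sequence "AACG"
augmented = random_mutation(x, mutate_frac=0.25, rng=rng)
rc = reverse_complement(x)  # one-hot encoding of "CGTT"
```

Both functions return copies, so the original array stays available for the fine-tuning stage.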
Affiliation(s)
- Nicholas Keone Lee
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Ziqi Tang
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Shushan Toneyan
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
4
Li Z, Kuo CC, Ticconi F, Shaigan M, Gehrmann J, Gusmao EG, Allhoff M, Manolov M, Zenke M, Costa IG. RGT: a toolbox for the integrative analysis of high throughput regulatory genomics data. BMC Bioinformatics 2023; 24:79. [PMID: 36879236 PMCID: PMC9990262 DOI: 10.1186/s12859-023-05184-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Accepted: 02/13/2023] [Indexed: 03/08/2023] Open
Abstract
BACKGROUND Massive amounts of data are produced by combining next-generation sequencing with complex biochemistry techniques to characterize regulatory genomics profiles, such as protein-DNA interaction and chromatin accessibility. Interpretation of such high-throughput data typically requires different computational methods. However, existing tools are usually developed for a specific task, which makes it challenging to analyze the data in an integrative manner. RESULTS Here we describe the Regulatory Genomics Toolbox (RGT), a computational library for the integrative analysis of regulatory genomics data. RGT provides different functionalities to handle genomic signals and regions. On top of these, we developed several tools for distinct downstream analyses, including prediction of transcription factor binding sites using ATAC-seq data, identification of differential peaks from ChIP-seq data, detection of triple-helix-mediated RNA and DNA interactions, visualization, and identification of associations between distinct regulatory factors. CONCLUSION We present RGT, a framework that facilitates the customization of computational methods to analyze genomic data for specific regulatory genomics problems. RGT is a comprehensive and flexible Python package for analyzing high-throughput regulatory genomics data and is available at https://github.com/CostaLab/reg-gen . The documentation is available at https://reg-gen.readthedocs.io.
Affiliation(s)
- Zhijian Li
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Chao-Chung Kuo
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Fabio Ticconi
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Mina Shaigan
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Julia Gehrmann
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Eduardo Gade Gusmao
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Manuel Allhoff
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Martin Manolov
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
- Martin Zenke
  - Department of Cell Biology, Institute of Biomedical Engineering, RWTH Aachen University Medical School, 52074, Aachen, Germany
  - Helmholtz Institute for Biomedical Engineering, RWTH Aachen University, 52074, Aachen, Germany
  - Department of Hematology, Oncology, Hemostaseology, and Stem Cell Transplantation, Faculty of Medicine, RWTH Aachen University, 52074, Aachen, Germany
- Ivan G Costa
  - Institute for Computational Genomics, Medical Faculty, RWTH Aachen University, 52074, Aachen, Germany
  - Joint Research Center for Computational Biomedicine, RWTH Aachen University Hospital, 52074, Aachen, Germany
5
Ramos-Rodríguez M, Pérez-González B, Pasquali L. The β-Cell Genomic Landscape in T1D: Implications for Disease Pathogenesis. Curr Diab Rep 2021; 21:1. [PMID: 33387073 PMCID: PMC7778620 DOI: 10.1007/s11892-020-01370-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/19/2020] [Indexed: 11/15/2022]
Abstract
PURPOSE OF REVIEW Type 1 diabetes (T1D) develops as a consequence of a combination of genetic predisposition and environmental factors. Combined, these events trigger an autoimmune disease that results in progressive loss of pancreatic β cells, leading to insulin deficiency. This article reviews the current knowledge on the genetics of T1D, with a specific focus on genetic variation in pancreatic islet regulatory networks and its implications for T1D risk and disease development. RECENT FINDINGS Accumulating evidence suggests an active role of β cells in T1D pathogenesis. Based on this observation, several studies have aimed at mapping T1D risk variants acting at the β-cell level. Such studies have uncovered T1D risk loci shared with type 2 diabetes (T2D) and T1D risk variants potentially interfering with β-cell responses to external stimuli. The characterization of regulatory genomics maps of disease-relevant states and cell types can be used to elucidate the mechanistic role of β cells in the pathogenesis of T1D.
Affiliation(s)
- Mireia Ramos-Rodríguez
  - Endocrine Regulatory Genomics, Department of Experimental & Health Sciences, University Pompeu Fabra, 08003, Barcelona, Spain
- Beatriz Pérez-González
  - Endocrine Regulatory Genomics, Department of Experimental & Health Sciences, University Pompeu Fabra, 08003, Barcelona, Spain
- Lorenzo Pasquali
  - Endocrine Regulatory Genomics, Department of Experimental & Health Sciences, University Pompeu Fabra, 08003, Barcelona, Spain
6
Barral A, Rollan I, Sanchez-Iranzo H, Jawaid W, Badia-Careaga C, Menchero S, Gomez MJ, Torroja C, Sanchez-Cabo F, Göttgens B, Manzanares M, Sainz de Aja J. Nanog regulates Pou3f1 expression at the exit from pluripotency during gastrulation. Biol Open 2019; 8:bio046367. [PMID: 31791948 PMCID: PMC6899006 DOI: 10.1242/bio.046367] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 07/17/2019] [Accepted: 10/23/2019] [Indexed: 12/22/2022] Open
Abstract
Pluripotency is regulated by a network of transcription factors that maintain early embryonic cells in an undifferentiated state while allowing them to proliferate. NANOG is a critical factor for maintaining pluripotency and its role in primordial germ cell differentiation has been well described. However, Nanog is expressed during gastrulation across all the posterior epiblast, and only later in development is its expression restricted to primordial germ cells. In this work, we unveiled a previously unknown mechanism by which Nanog specifically represses genes involved in the anterior epiblast lineage. Analysis of transcriptional data from both embryonic stem cells and gastrulating mouse embryos revealed Pou3f1 expression to be negatively correlated with that of Nanog during the early stages of differentiation. We have functionally demonstrated Pou3f1 to be a direct target of NANOG by using a dual transgene system for the controlled expression of Nanog. Use of Nanog null ES cells further demonstrated a role for Nanog in repressing a subset of anterior neural genes. Deletion of a NANOG binding site (BS) located nine kilobases downstream of the transcription start site of Pou3f1 revealed this BS to have a specific role in the regionalization of the expression of this gene in the embryo. Our results indicate an active role for Nanog in inhibiting neural regulatory networks by repressing Pou3f1 at the onset of gastrulation. This article has an associated First Person interview with the joint first authors of the paper.
Affiliation(s)
- Antonio Barral
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Isabel Rollan
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Hector Sanchez-Iranzo
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Wajid Jawaid
  - Wellcome-Medical Research Council Cambridge Stem Cell Institute, Cambridge CB2 0AW, UK
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 0AW, UK
- Claudio Badia-Careaga
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Sergio Menchero
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Manuel J Gomez
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Carlos Torroja
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Fatima Sanchez-Cabo
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
- Berthold Göttgens
  - Wellcome-Medical Research Council Cambridge Stem Cell Institute, Cambridge CB2 0AW, UK
  - Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge CB2 0AW, UK
- Miguel Manzanares
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
  - Centro de Biologia Molecular Severo Ochoa, CSIC-UAM, Madrid 28049, Spain
- Julio Sainz de Aja
  - Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid 28029, Spain
7
Chen H, Lareau C, Andreani T, Vinyard ME, Garcia SP, Clement K, Andrade-Navarro MA, Buenrostro JD, Pinello L. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol 2019; 20:241. [PMID: 31739806 PMCID: PMC6859644 DOI: 10.1186/s13059-019-1854-5] [Citation(s) in RCA: 147] [Impact Index Per Article: 29.4] [Received: 06/04/2019] [Accepted: 10/03/2019] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Recent innovations in single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges. scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), leads to inherent data sparsity (1-10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10-45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level. RESULTS We present a benchmarking framework that is applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms. Methods for processing and featurizing scATAC-seq data were compared by their ability to discriminate cell types when combined with common unsupervised clustering approaches. We rank the evaluated methods and discuss computational challenges associated with scATAC-seq analysis, including inherently sparse data, determination of features, peak calling, the effects of sequencing coverage and noise, and clustering performance. Running times and memory requirements are also discussed. CONCLUSIONS This reference summary of scATAC-seq methods offers recommendations for best practices with consideration for both the non-expert user and the methods developer. Despite variation across methods and datasets, SnapATAC, Cusanovich2018, and cisTopic outperform other methods in separating cell populations of different coverages and noise levels in both synthetic and real datasets. Notably, SnapATAC is the only method able to analyze a large dataset (> 80,000 cells).
Affiliation(s)
- Huidong Chen
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, 02129, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Caleb Lareau
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, 02138, USA
- Tommaso Andreani
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, 02129, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, 02115, USA
  - Faculty of Biology, Computational Biology and Data Mining Lab, Johannes Gutenberg University of Mainz, 55128, Mainz, Germany
- Michael E Vinyard
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, 02129, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, 02142, USA
- Sara P Garcia
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
- Kendell Clement
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, 02129, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
- Miguel A Andrade-Navarro
  - Faculty of Biology, Computational Biology and Data Mining Lab, Johannes Gutenberg University of Mainz, 55128, Mainz, Germany
- Jason D Buenrostro
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, 02138, USA
- Luca Pinello
  - Molecular Pathology Unit, Massachusetts General Hospital Research Institute, Charlestown, MA, 02129, USA
  - Center for Cancer Research, Massachusetts General Hospital, Charlestown, MA, 02129, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA
8
Siebenthall KT, Miller CP, Vierstra JD, Mathieu J, Tretiakova M, Reynolds A, Sandstrom R, Rynes E, Haugen E, Johnson A, Nelson J, Bates D, Diegel M, Dunn D, Frerker M, Buckley M, Kaul R, Zheng Y, Himmelfarb J, Ruohola-Baker H, Akilesh S. Integrated epigenomic profiling reveals endogenous retrovirus reactivation in renal cell carcinoma. EBioMedicine 2019; 41:427-42. [PMID: 30827930 DOI: 10.1016/j.ebiom.2019.01.063] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/26/2018] [Revised: 01/30/2019] [Accepted: 01/31/2019] [Indexed: 02/08/2023] Open
Abstract
Background Transcriptional dysregulation drives cancer formation but the underlying mechanisms are still poorly understood. Renal cell carcinoma (RCC) is the most common malignant kidney tumor, which canonically activates the hypoxia-inducible transcription factor (HIF) pathway. Despite intensive study, novel therapeutic strategies to target RCC have been difficult to develop. Since the RCC epigenome is relatively understudied, we sought to elucidate key mechanisms underpinning the tumor phenotype and its clinical behavior. Methods We performed genome-wide chromatin accessibility (DNase-seq) and transcriptome profiling (RNA-seq) on paired tumor/normal samples from 3 patients undergoing nephrectomy for removal of RCC. We incorporated publicly available data on HIF binding (ChIP-seq) in an RCC cell line. We performed integrated analyses of these high-resolution, genome-scale datasets together with larger transcriptomic data available through The Cancer Genome Atlas (TCGA). Findings Though HIF transcription factors play a cardinal role in RCC oncogenesis, we found that numerous transcription factors with an RCC-selective expression pattern also demonstrated evidence of HIF binding near their gene body. Examination of chromatin accessibility profiles revealed that some of these transcription factors influenced the tumor's regulatory landscape, notably the stem cell transcription factor POU5F1 (OCT4). Elevated POU5F1 transcript levels were correlated with advanced tumor stage and poorer overall survival in RCC patients. Unexpectedly, we discovered a HIF-pathway-responsive promoter embedded within an endogenous retroviral long terminal repeat (LTR) element at the transcriptional start site of the PSOR1C3 long non-coding RNA gene upstream of POU5F1. RNA transcripts are induced from this promoter and read through PSOR1C3 into POU5F1, producing a novel POU5F1 transcript isoform. Rather than being unique to the POU5F1 locus, we found that HIF binds to several other transcriptionally active LTR elements genome-wide, correlating with broad gene expression changes in RCC. Interpretation Integrated transcriptomic and epigenomic analysis of matched tumor and normal tissues from even a small number of primary patient samples revealed remarkably convergent shared regulatory landscapes. Several transcription factors appear to act downstream of HIF, including the potent stem cell transcription factor POU5F1. Dysregulated expression of POU5F1 is part of a larger pattern of gene expression changes in RCC that may be induced by HIF-dependent reactivation of dormant promoters embedded within endogenous retroviral LTRs.
9
Vandel J, Cassan O, Lèbre S, Lecellier CH, Bréhélin L. Probing transcription factor combinatorics in different promoter classes and in enhancers. BMC Genomics 2019; 20:103. [PMID: 30709337 PMCID: PMC6359851 DOI: 10.1186/s12864-018-5408-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/21/2018] [Accepted: 12/26/2018] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND In eukaryotic cells, transcription factors (TFs) are thought to act in a combinatorial way, by competing and collaborating to regulate common target genes. However, several questions remain regarding the conservation of these combinations among different gene classes, regulatory regions and cell types. RESULTS We propose a new approach named TFcoop to infer the TF combinations involved in the binding of a target TF in a particular cell type. TFcoop aims to predict the binding sites of the target TF from the nucleotide content of the sequences and the binding affinities of all identified cooperating TFs. The set of cooperating TFs and the model parameters are learned from ChIP-seq data of the target TF. We used TFcoop to investigate the TF combinations involved in the binding of 106 TFs in 41 cell types and in four regulatory regions: promoters of mRNAs, lncRNAs and pri-miRNAs, and enhancers. We first show that TFcoop is accurate and outperforms simple PWM methods for predicting TF binding sites. Next, analysis of the learned models sheds light on important properties of TF combinations in different promoter classes and in enhancers. First, we show that combinations governing TF binding in enhancers are more cell-type specific than those governing binding in promoters. Second, for a given TF and cell type, we observe that TF combinations differ between promoters and enhancers, but are similar for promoters of mRNAs, lncRNAs and pri-miRNAs. Analysis of the TFs cooperating with the different targets shows over-representation of pioneer TFs and a clear preference for TFs with binding motif composition similar to that of the target. Lastly, our models accurately distinguish promoters associated with specific biological processes. CONCLUSIONS TFcoop appears to be an accurate approach for studying TF combinations. Its use on ENCODE and FANTOM data allowed us to discover important properties of human TF combinations in different promoter classes and in enhancers. The R code for learning a TFcoop model and for reproducing the main experiments described in the paper is available in an R Markdown file at https://gite.lirmm.fr/brehelin/TFcoop .
Affiliation(s)
- Jimmy Vandel
  - LIRMM, Univ. Montpellier, CNRS, Montpellier, France
  - IBC, CNRS, Univ. Montpellier, Montpellier, France
- Océane Cassan
  - LIRMM, Univ. Montpellier, CNRS, Montpellier, France
  - IBC, CNRS, Univ. Montpellier, Montpellier, France
- Sophie Lèbre
  - IBC, CNRS, Univ. Montpellier, Montpellier, France
  - IMAG, Univ. Montpellier, CNRS, Montpellier, France
  - Univ. Paul Valery Montpellier, Montpellier, France
- Charles-Henri Lecellier
  - IBC, CNRS, Univ. Montpellier, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Laurent Bréhélin
  - LIRMM, Univ. Montpellier, CNRS, Montpellier, France
  - IBC, CNRS, Univ. Montpellier, Montpellier, France
10
Abstract
Although sequenced insect genomes now number in the hundreds, little is known about gene regulatory sequences in any species other than the well-studied Drosophila melanogaster. We provide here a detailed protocol for using SCRMshaw, a computational method for predicting cis-regulatory modules (CRMs, also "enhancers") in sequenced insect genomes. SCRMshaw is effective for CRM discovery throughout the range of holometabolous insects and potentially in even more diverged species, with true-positive prediction rates of 75% or better. The minimal requirements for using SCRMshaw are a genome sequence and training data in the form of known Drosophila CRMs; a comprehensive set of the latter can be obtained from the SCRMshaw download site. For basic applications, a user with only modest computational know-how can run SCRMshaw on a desktop computer. SCRMshaw can be run with a single, narrow set of training data to predict CRMs regulating a specific pattern of gene expression, or with multiple sets of training data covering a broad range of CRM activities to provide an initial rough regulatory annotation of a complete, newly sequenced genome.
Affiliation(s)
- Majid Kazemian
  - Departments of Biochemistry and Computer Science, Purdue University, West Lafayette, IN, USA
- Marc S Halfon
  - Departments of Biochemistry, Biomedical Informatics, and Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY, USA
  - NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY, USA
  - Department of Molecular and Cellular Biology and Program in Cancer Genetics, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
11
Kvon EZ. Using transgenic reporter assays to functionally characterize enhancers in animals. Genomics 2015; 106:185-92. [PMID: 26072435 DOI: 10.1016/j.ygeno.2015.06.007] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 02/22/2015] [Revised: 05/11/2015] [Accepted: 06/09/2015] [Indexed: 11/21/2022]
Abstract
Enhancers, or cis-regulatory modules, play an instructive role in regulating gene expression during animal development and in response to the environment. Despite their importance, we have only an incomplete map of enhancers in the genome, and our understanding of the mechanisms governing their function is still limited. Recent advances in genomics have provided powerful tools to generate genome-wide maps of potential enhancers. However, most of these methods are based on indirect measures of enhancer activity and have to be followed by functional testing. Animal transgenesis has been a valuable method to functionally test and characterize enhancers in vivo. In this review I discuss how different transgenic strategies are used to characterize enhancers in model organisms, focusing on studies in Drosophila and mouse. I further discuss recent large-scale transgenic efforts to systematically identify and catalog enhancers, and highlight the challenges and future directions in the field.