1. Raun N, Jones SG, Kerr O, Keung C, Butler EF, Alka K, Krupski JD, Reid-Taylor RA, Ibrahim V, Williams M, Top D, Kramer JM. Trithorax regulates long-term memory in Drosophila through epigenetic maintenance of mushroom body metabolic state and translation capacity. PLoS Biol 2025; 23:e3003004. [PMID: 39869640 PMCID: PMC11835295 DOI: 10.1371/journal.pbio.3003004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 02/18/2025] [Accepted: 01/06/2025] [Indexed: 01/29/2025]
Abstract
The role of epigenetics and chromatin in the maintenance of postmitotic neuronal cell identities is not well understood. Here, we show that the histone methyltransferase Trithorax (Trx) is required in postmitotic memory neurons of the Drosophila mushroom body (MB) to enable their capacity for long-term memory (LTM), but not short-term memory (STM). Using MB-specific RNA-, ChIP-, and ATAC-sequencing, we find that Trx maintains homeostatic expression of several non-canonical MB-enriched transcripts, including the orphan nuclear receptor Hr51, and the metabolic enzyme lactate dehydrogenase (Ldh). Through these key targets, Trx facilitates a metabolic state characterized by high lactate levels in MBγ neurons. This metabolic state supports a high capacity for protein translation, a process that is essential for LTM, but not STM. These data suggest that Trx, a classic regulator of cell type specification during development, has additional functions in maintaining underappreciated aspects of postmitotic neuron identity, such as metabolic state. Our work supports a body of evidence suggesting that a high capacity for energy metabolism is an essential cell identity characteristic for neurons that mediate LTM.
Affiliation(s)
- Nicholas Raun
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Spencer G. Jones
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Olivia Kerr
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Crystal Keung
  - Department of Physiology and Pharmacology, University of Western Ontario, London, Canada
- Emily F. Butler
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Kumari Alka
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Jonathan D. Krupski
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Robert A. Reid-Taylor
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Veyan Ibrahim
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- MacKayla Williams
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
- Deniz Top
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
  - Department of Cell Biology, University of Alberta, Edmonton, Canada
- Jamie M. Kramer
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Canada
  - Department of Physiology and Pharmacology, University of Western Ontario, London, Canada
2. Labani M, Beheshti A, O’Brien TA. GENet: A Graph-Based Model Leveraging Histone Marks and Transcription Factors for Enhanced Gene Expression Prediction. Genes (Basel) 2024; 15:938. [PMID: 39062717 PMCID: PMC11275947 DOI: 10.3390/genes15070938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Revised: 07/16/2024] [Accepted: 07/17/2024] [Indexed: 07/28/2024]
Abstract
Understanding the regulatory mechanisms of gene expression is a crucial objective in genomics. Although the DNA sequence near the transcription start site (TSS) offers valuable insights, recent methods suggest that analyzing only the surrounding DNA may not suffice to accurately predict gene expression levels. We developed GENet (Gene Expression Network from Histone and Transcription Factor Integration), a novel approach that integrates essential regulatory signals from transcription factors and histone modifications into a graph-based model. GENet extends beyond simple DNA sequence analysis by incorporating additional layers of genetic control, which are vital for determining gene expression. Our method markedly enhances the prediction of mRNA levels compared to previous models that depend solely on DNA sequence data. The results underscore the significance of including comprehensive regulatory information in gene expression studies. GENet emerges as a promising tool for researchers, with potential applications extending from fundamental biological research to the development of medical therapies.
Affiliation(s)
- Mahdieh Labani
  - School of Computing, Macquarie University, Sydney 2109, Australia
- Amin Beheshti
  - School of Computing, Macquarie University, Sydney 2109, Australia
- Tracey A. O’Brien
  - School of Computing, Macquarie University, Sydney 2109, Australia
  - Cancer Institute NSW, Sydney 2065, Australia
  - School of Clinical Medicine, Medicine & Health, University of New South Wales (UNSW), Sydney 2052, Australia
3. Pianfetti E, Lovino M, Ficarra E, Martignetti L. MiREx: mRNA levels prediction from gene sequence and miRNA target knowledge. BMC Bioinformatics 2023; 24:443. [PMID: 37993778 PMCID: PMC10666312 DOI: 10.1186/s12859-023-05560-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 11/06/2023] [Indexed: 11/24/2023]
Abstract
Messenger RNA (mRNA) has an essential role in the protein production process. Predicting mRNA expression levels accurately is crucial for understanding gene regulation, and various models (statistical and neural network-based) have been developed for this purpose. A few models predict mRNA expression levels from the DNA sequence, exploiting the DNA sequence and gene features (e.g., number of exons/introns, gene length). Other models include information about long-range interaction molecules (i.e., enhancers/silencers) and transcriptional regulators as predictive features, such as transcription factors (TFs) and small RNAs (e.g., microRNAs - miRNAs). Recently, a convolutional neural network (CNN) model, called Xpresso, has been proposed for mRNA expression level prediction leveraging the promoter sequence and mRNAs' half-life features (gene features). To push forward the mRNA level prediction, we present miREx, a CNN-based tool that includes information about miRNA targets and expression levels in the model. Indeed, each miRNA can target specific genes, and the model exploits this information to guide the learning process. In detail, not all miRNAs are included, only a selected subset with the highest impact on the model. MiREx has been evaluated on four cancer primary sites from the genomics data commons (GDC) database: lung, kidney, breast, and corpus uteri. Results show that mRNA level prediction benefits from selected miRNA targets and expression information. Future model developments could include other transcriptional regulators or be trained with proteomics data to infer protein levels.
Affiliation(s)
- Elena Pianfetti
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Marta Lovino
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Elisa Ficarra
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Loredana Martignetti
  - Institut Curie, Rue d'Ulm 26, Paris, 75005, France
  - Inserm U900, Paris, France
  - CBIO-Centre for Computational Biology, Paris, France
  - PSL Research University, Paris, France
4. Singh B, Kumar S, Elangovan A, Vasht D, Arya S, Duc NT, Swami P, Pawar GS, Raju D, Krishna H, Sathee L, Dalal M, Sahoo RN, Chinnusamy V. Phenomics based prediction of plant biomass and leaf area in wheat using machine learning approaches. Front Plant Sci 2023; 14:1214801. [PMID: 37448870 PMCID: PMC10337996 DOI: 10.3389/fpls.2023.1214801] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/30/2023] [Accepted: 06/07/2023] [Indexed: 07/15/2023]
Abstract
Introduction: Phenomics has emerged as an important tool to bridge the genotype-phenotype gap. Dissecting complex traits such as highly dynamic plant growth, and quantifying its component traits over different growth phases of the plant, will greatly help dissect the genetic basis of biomass production. Models have recently been developed to predict biomass from RGB images. However, it is very challenging to find a model that performs stably across experiments. In this study, we recorded RGB and NIR images of wheat germplasm and Recombinant Inbred Lines (RILs) of Raj3765xHD2329, and examined the use of multimodal images from RGB and NIR sensors together with machine learning models to predict biomass and leaf area non-invasively.
Results: The image-based traits (i-Traits), containing geometric features, RGB-based indices, RGB colour classes and NIR features, were categorized into architectural traits and physiological traits. In total, 77 i-Traits were selected for prediction of biomass and leaf area, consisting of 35 architectural and 42 physiological traits. We have shown that different biomass-related traits such as fresh weight, dry weight and shoot area can be predicted accurately from RGB and NIR images using 16 machine learning models. We applied the models to two consecutive years of experiments and found that measurement accuracies were similar, suggesting the generalized nature of the models. Results showed that all biomass-related traits could be estimated with about 90% accuracy, but the performance of the BLASSO model was relatively stable and high across all traits and experiments. The R² of BLASSO for fresh weight prediction was 0.96 (both years' experiments), for dry weight prediction was 0.90 (Experiment 1) and 0.93 (Experiment 2), and for shoot area prediction was 0.96 (Experiment 1) and 0.93 (Experiment 2). Also, the RMSRE of BLASSO for fresh weight prediction was 0.53 (Experiment 1) and 0.24 (Experiment 2), for dry weight prediction was 0.85 (Experiment 1) and 0.25 (Experiment 2), and for shoot area prediction was 0.59 (Experiment 1) and 0.53 (Experiment 2).
Discussion: Based on the quantification power analysis of i-Traits, the determinants of biomass accumulation were identified; they include both architectural and physiological traits. The best predictor i-Trait for fresh weight and dry weight prediction was Area_SV, and for shoot area prediction it was projected shoot area. These results will help identify the major determinants of biomass accumulation and dissect their genetic basis. In addition, non-invasive, high-throughput estimation of plant growth at different phenological stages can identify hitherto undiscovered genes for biomass production and support their deployment in crop improvement to break the yield plateau.
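For readers comparing the two metrics quoted in this abstract, the sketch below shows how a coefficient of determination and a root mean square relative error are commonly computed. It assumes the textbook definitions (1 minus residual over total sum of squares, and the square root of the mean squared relative error); the abstract does not state which exact formulas the authors used, and the function names and toy values are illustrative only.

```python
import math


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmsre(observed, predicted):
    """Root mean square relative error (assumed definition); observed values must be non-zero."""
    rel_sq = [((p - o) / o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(rel_sq) / len(rel_sq))


# Hypothetical fresh-weight measurements and model predictions (not data from the paper).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 33.0, 38.0]

print(round(r_squared(observed, predicted), 4))
print(round(rmsre(observed, predicted), 4))
```

Because RMSRE divides each error by the observed value, it penalizes errors on small plants more heavily than R² does, which is one reason the two metrics can rank experiments differently.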
Affiliation(s)
- Biswabiplab Singh
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Sudhir Kumar
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Allimuthu Elangovan
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Devendra Vasht
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Sunny Arya
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Nguyen Trung Duc
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
  - Vietnam National University of Agriculture, Hanoi, Vietnam
- Pooja Swami
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Godawari Shivaji Pawar
  - Division of Agricultural Botany, Vasantrao Naik Marathwada Krishi Vidyapeeth, Parbhani, India
- Dhandapani Raju
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Hari Krishna
  - Division of Genetics, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Lekshmy Sathee
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
- Monika Dalal
  - ICAR-National Institute for Plant Biotechnology, New Delhi, India
- Rabi Narayan Sahoo
  - Division of Agricultural Physics, ICAR-Indian Agricultural Research Institute, New Delhi, India
- Viswanathan Chinnusamy
  - Division of Plant Physiology and Nanaji Deshmukh Plant Phenomics Centre (NDPPC), Indian Council of Agricultural Research (ICAR)-Indian Agricultural Research Institute, New Delhi, India
5. Hecker D, Behjati Ardakani F, Karollus A, Gagneur J, Schulz MH. The adapted Activity-By-Contact model for enhancer-gene assignment and its application to single-cell data. Bioinformatics 2023; 39:btad062. [PMID: 36708003 PMCID: PMC9931646 DOI: 10.1093/bioinformatics/btad062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 12/05/2022] [Accepted: 01/26/2023] [Indexed: 01/29/2023]
Abstract
MOTIVATION: Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model, which scores enhancer-gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene's perspective, and does not account for all the candidate target genes of an enhancer. In addition, the ABC model requires two types of assays to measure enhancer activity, which limits its applicability. Moreover, no available implementation allows for integration with transcription factor (TF) binding information or for efficient analysis of single-cell data.
RESULTS: We demonstrate that the ABC score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalized ABC model with TF binding information and illustrated an analysis of a single-cell ATAC-seq dataset of the human heart, where we were able to characterize cell type-specific regulatory interactions and predict gene expression based on TF affinities. All executed processing steps are incorporated into our new computational pipeline STARE.
AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/schulzlab/STARE.
CONTACT: marcel.schulz@em.uni-frankfurt.de.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
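The scoring idea in this abstract can be illustrated with a small sketch. The first function follows the commonly cited form of the ABC score (activity times contact, normalized over all candidate enhancers of a gene). The second is only one possible reading of "adapting the enhancer activity according to the number of contacts", in which an enhancer's activity is split across its candidate genes in proportion to contact; it is an assumption for illustration and is not taken from the STARE code. All names and toy values are hypothetical.

```python
from collections import defaultdict


def abc_scores(activity, contact):
    """Activity x contact for each enhancer-gene pair, normalized per gene."""
    gene_total = defaultdict(float)
    for (enhancer, gene), freq in contact.items():
        gene_total[gene] += activity[enhancer] * freq
    return {
        (enhancer, gene): activity[enhancer] * freq / gene_total[gene]
        for (enhancer, gene), freq in contact.items()
        if gene_total[gene] > 0
    }


def adapted_abc_scores(activity, contact):
    """Assumed variant: scale each enhancer's activity by its contact share per gene."""
    enhancer_contacts = defaultdict(float)
    for (enhancer, _gene), freq in contact.items():
        enhancer_contacts[enhancer] += freq
    weighted = {}
    for (enhancer, gene), freq in contact.items():
        if enhancer_contacts[enhancer] > 0:
            share = freq / enhancer_contacts[enhancer]
            weighted[(enhancer, gene)] = activity[enhancer] * share * freq
    gene_total = defaultdict(float)
    for (_enhancer, gene), value in weighted.items():
        gene_total[gene] += value
    return {
        pair: value / gene_total[pair[1]]
        for pair, value in weighted.items()
        if gene_total[pair[1]] > 0
    }


# Toy input: enhancer e1 contacts two genes, enhancer e2 contacts only g1.
activity = {"e1": 4.0, "e2": 2.0}
contact = {("e1", "g1"): 0.5, ("e2", "g1"): 0.5, ("e1", "g2"): 0.5}

plain = abc_scores(activity, contact)
adapted = adapted_abc_scores(activity, contact)
print(plain[("e1", "g1")], adapted[("e1", "g1")])
```

In this toy case the plain score gives e1 two thirds of the regulatory credit for g1, while the adapted variant gives e1 and e2 equal credit because e1's activity is shared with g2. That is the enhancer-side perspective the abstract says the original model lacks.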
Affiliation(s)
- Dennis Hecker
  - Institute of Cardiovascular Regeneration, Goethe University Hospital
  - Cardio-Pulmonary Institute, Goethe University
  - German Centre for Cardiovascular Research, Partner site Rhine-Main, Frankfurt am Main 60590
- Fatemeh Behjati Ardakani
  - Institute of Cardiovascular Regeneration, Goethe University Hospital
  - Cardio-Pulmonary Institute, Goethe University
  - German Centre for Cardiovascular Research, Partner site Rhine-Main, Frankfurt am Main 60590
- Alexander Karollus
  - School of Computation, Information and Technology, Technical University of Munich, Garching 85748
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching 85748
  - Institute of Human Genetics, Technical University of Munich, Munich 81675
  - Computational Health Center, Helmholtz Center Munich, Neuherberg 85764
  - Munich Data Science Institute, Technical University of Munich, Garching 85748, Germany
- Marcel H Schulz
  - Institute of Cardiovascular Regeneration, Goethe University Hospital
  - Cardio-Pulmonary Institute, Goethe University
  - German Centre for Cardiovascular Research, Partner site Rhine-Main, Frankfurt am Main 60590
6. Kang Y, Jung WJ, Brent MR. Predicting which genes will respond to transcription factor perturbations. G3 (Bethesda) 2022; 12:jkac144. [PMID: 35666184 PMCID: PMC9339286 DOI: 10.1093/g3journal/jkac144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2022] [Accepted: 05/25/2022] [Indexed: 11/13/2022]
Abstract
The ability to predict which genes will respond to the perturbation of a transcription factor serves as a benchmark for our systems-level understanding of transcriptional regulatory networks. In previous work, machine learning models have been trained to predict static gene expression levels in a biological sample by using data from the same or similar samples, including data on their transcription factor binding locations, histone marks, or DNA sequence. We report on a different challenge: training machine learning models to predict which genes will respond to the perturbation of a transcription factor without using any data from the perturbed cells. We find that existing transcription factor location data (ChIP-seq) from human cells have very little detectable utility for predicting which genes will respond to perturbation of a transcription factor. Features of genes, including their preperturbation expression level and expression variation, are very useful for predicting responses to perturbation of any transcription factor. This shows that some genes are poised to respond to transcription factor perturbations and others are resistant, shedding light on why it has been so difficult to predict responses from binding locations. Certain histone marks, including H3K4me1 and H3K4me3, have some predictive power when located downstream of the transcription start site. However, the predictive power of histone marks is much less than that of gene expression level and expression variation. Sequence-based or epigenetic properties of genes strongly influence their tendency to respond to direct transcription factor perturbations, partially explaining the oft-noted difficulty of predicting responsiveness from transcription factor binding location data. These molecular features are largely reflected in and summarized by the gene's expression level and expression variation. Code is available at https://github.com/BrentLab/TFPertRespExplainer.
Affiliation(s)
- Yiming Kang
  - Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
  - Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
- Wooseok J Jung
  - Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
  - Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
- Michael R Brent
  - Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
  - Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
7. Schmidt F, Marx A, Baumgarten N, Hebel M, Wegner M, Kaulich M, Leisegang M, Brandes R, Göke J, Vreeken J, Schulz M. Integrative analysis of epigenetics data identifies gene-specific regulatory elements. Nucleic Acids Res 2021; 49:10397-10418. [PMID: 34508352 PMCID: PMC8501997 DOI: 10.1093/nar/gkab798] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/11/2020] [Revised: 08/01/2021] [Accepted: 09/07/2021] [Indexed: 12/19/2022]
Abstract
Understanding how epigenetic variation in non-coding regions is involved in distal gene-expression regulation is an important problem. Regulatory regions can be associated to genes using large-scale datasets of epigenetic and expression data. However, for regions of complex epigenomic signals and enhancers that regulate many genes, it is difficult to understand these associations. We present StitchIt, an approach to dissect epigenetic variation in a gene-specific manner for the detection of regulatory elements (REMs) without relying on peak calls in individual samples. StitchIt segments epigenetic signal tracks over many samples to generate the location and the target genes of a REM simultaneously. We show that this approach leads to a more accurate and refined REM detection compared to standard methods even on heterogeneous datasets, which are challenging to model. Also, StitchIt REMs are highly enriched in experimentally determined chromatin interactions and expression quantitative trait loci. We validated several newly predicted REMs using CRISPR-Cas9 experiments, thereby demonstrating the reliability of StitchIt. StitchIt is able to dissect regulation in superenhancers and predicts thousands of putative REMs that go unnoticed using peak-based approaches suggesting that a large part of the regulome might be uncharted water.
Affiliation(s)
- Florian Schmidt
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland University, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Laboratory of Systems Biology and Data Analytics, Genome Institute of Singapore, 60 Biopolis Street, 138672 Singapore, Singapore
- Alexander Marx
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland University, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Graduate School of Computer Science, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Nina Baumgarten
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research (DZHK), Partner site RheinMain, 60590 Frankfurt am Main, Germany
- Marie Hebel
  - Institute of Biochemistry II, Goethe University Frankfurt - Medical Faculty, University Hospital, 60590 Frankfurt am Main, Germany
- Martin Wegner
  - Institute of Biochemistry II, Goethe University Frankfurt - Medical Faculty, University Hospital, 60590 Frankfurt am Main, Germany
- Manuel Kaulich
  - Institute of Biochemistry II, Goethe University Frankfurt - Medical Faculty, University Hospital, 60590 Frankfurt am Main, Germany
  - Frankfurt Cancer Institute, Goethe University, 60590 Frankfurt am Main, Germany
- Matthias S Leisegang
  - German Center for Cardiovascular Research (DZHK), Partner site RheinMain, 60590 Frankfurt am Main, Germany
  - Institute for Cardiovascular Physiology, Goethe University, 60590 Frankfurt am Main, Germany
- Ralf P Brandes
  - German Center for Cardiovascular Research (DZHK), Partner site RheinMain, 60590 Frankfurt am Main, Germany
  - Institute for Cardiovascular Physiology, Goethe University, 60590 Frankfurt am Main, Germany
- Jonathan Göke
  - Laboratory of Computational Transcriptomics, Genome Institute of Singapore, 60 Biopolis Street, 138672 Singapore, Singapore
- Jilles Vreeken
  - CISPA Helmholtz Center for Information Security, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland University, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Marcel H Schulz
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland University, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
  - Institute for Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
  - German Center for Cardiovascular Research (DZHK), Partner site RheinMain, 60590 Frankfurt am Main, Germany
8. Agarwal V, Shendure J. Predicting mRNA Abundance Directly from Genomic Sequence Using Deep Convolutional Neural Networks. Cell Rep 2020; 31:107663. [PMID: 32433972 DOI: 10.1016/j.celrep.2020.107663] [Citation(s) in RCA: 120] [Impact Index Per Article: 30.0] [Received: 07/25/2018] [Revised: 06/11/2019] [Accepted: 04/28/2020] [Indexed: 01/06/2023]
Abstract
Algorithms that accurately predict gene structure from primary sequence alone were transformative for annotating the human genome. Can we also predict the expression levels of genes based solely on genome sequence? Here, we sought to apply deep convolutional neural networks toward that goal. Surprisingly, a model that includes only promoter sequences and features associated with mRNA stability explains 59% and 71% of variation in steady-state mRNA levels in human and mouse, respectively. This model, termed Xpresso, more than doubles the accuracy of alternative sequence-based models and isolates rules as predictive as models relying on chromatin immunoprecipitation sequencing (ChIP-seq) data. Xpresso recapitulates genome-wide patterns of transcriptional activity, and its residuals can be used to quantify the influence of enhancers, heterochromatic domains, and microRNAs. Model interpretation reveals that promoter-proximal CpG dinucleotides strongly predict transcriptional activity. Looking forward, we propose cell-type-specific gene-expression predictions based solely on primary sequences as a grand challenge for the field.
Affiliation(s)
- Vikram Agarwal
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Calico Life Sciences LLC, South San Francisco, CA 94080, USA
- Jay Shendure
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, Seattle, WA 98195, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
9. Wang T, Guo Y, Liu S, Zhang C, Cui T, Ding K, Wang P, Wang X, Wang Z. KLF4, a Key Regulator of a Transitive Triplet, Acts on the TGF-β Signaling Pathway and Contributes to High-Altitude Adaptation of Tibetan Pigs. Front Genet 2021; 12:628192. [PMID: 33936161 PMCID: PMC8082500 DOI: 10.3389/fgene.2021.628192] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/11/2020] [Accepted: 03/10/2021] [Indexed: 11/13/2022]
Abstract
Tibetan pigs are a native mammalian species on the Tibetan Plateau that has evolved distinct physiological traits that allow it to tolerate high-altitude hypoxic environments. However, the genetic mechanism underlying this adaptation remains elusive. Here, based on multitissue transcriptional data from high-altitude Tibetan pigs and low-altitude Rongchang pigs, we performed a weighted correlation network analysis (WGCNA) and identified key modules related to these tissues. Complex network analysis and bioinformatics analysis were integrated to identify key genes and three-node network motifs. We found that among the six tissues (muscle, liver, heart, spleen, kidneys, and lungs), the lungs may be the key organ for Tibetan pigs to adapt to a hypoxic environment. In the lung tissue of Tibetan pigs, we identified the KLF4, BCL6B, EGR1, EPAS1, SMAD6, SMAD7, KDR, ATOH8, and CCN1 genes as potential regulators of hypoxia adaptation. We found that the KLF4 and EGR1 genes might simultaneously regulate the BCL6B gene, forming a KLF4-EGR1-BCL6B complex. This complex, dominated by KLF4, may enhance the hypoxia tolerance of Tibetan pigs by mediating the TGF-β signaling pathway. The complex may also affect the PI3K-Akt signaling pathway, which plays an important role in angiogenesis caused by hypoxia. Therefore, we postulate that the KLF4-EGR1-BCL6B complex may help Tibetan pigs survive better in hypoxic environments. Although further molecular experiments and independent large-scale studies are needed to verify our findings, these findings may provide new details of the regulatory architecture of hypoxia-adaptive genes and are valuable for understanding the genetic mechanism of hypoxic adaptation in mammals.
Collapse
Affiliation(s)
- Tao Wang
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
| | - Yuanyuan Guo
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
| | - Shengwei Liu
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
| | - Chaoxin Zhang
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
| | - Tongyan Cui
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
| | - Kun Ding
- College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, China
| | - Peng Wang
- HeiLongJiang Provincial Husbandry Department, Harbin, China
- Xibiao Wang
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China
- Zhipeng Wang
- College of Animal Science and Technology, Northeast Agricultural University, Harbin, China.,Bioinformatics Center, Northeast Agricultural University, Harbin, China
|
10
|
Scherer M, Schmidt F, Lazareva O, Walter J, Baumbach J, Schulz MH, List M. Machine learning for deciphering cell heterogeneity and gene regulation. Nature Computational Science 2021; 1:183-191. [PMID: 38183187 DOI: 10.1038/s43588-021-00038-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 10/15/2020] [Accepted: 02/08/2021] [Indexed: 12/14/2022]
Abstract
Epigenetics studies inheritable and reversible modifications of DNA that allow cells to control gene expression throughout their development and in response to environmental conditions. In computational epigenomics, machine learning is applied to study various epigenetic mechanisms genome wide. Its aim is to expand our understanding of cell differentiation, that is their specialization, in health and disease. Thus far, most efforts focus on understanding the functional encoding of the genome and on unraveling cell-type heterogeneity. Here, we provide an overview of state-of-the-art computational methods and their underlying statistical concepts, which range from matrix factorization and regularized linear regression to deep learning methods. We further show how the rise of single-cell technology leads to new computational challenges and creates opportunities to further our understanding of epigenetic regulation.
Affiliation(s)
- Michael Scherer
- Department of Genetics/Epigenetics, Saarland University, Saarbrücken, Germany
- Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Graduate School of Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Olga Lazareva
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Jörn Walter
- Computational Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
- Jan Baumbach
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
- Computational BioMedicine Lab, Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Marcel H Schulz
- Institute of Cardiovascular Regeneration, University Hospital and Goethe University Frankfurt, Frankfurt, Germany
- Markus List
- Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.
|
11
|
Aflakparast M, Geeven G, de Gunst MCM. Bayesian mixture regression analysis for regulation of Pluripotency in ES cells. BMC Bioinformatics 2020; 21:3. [PMID: 31898480 PMCID: PMC6941360 DOI: 10.1186/s12859-019-3331-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/04/2019] [Accepted: 12/17/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Observed levels of gene expression strongly depend on both activity of DNA binding transcription factors (TFs) and chromatin state through different histone modifications (HMs). In order to recover the functional relationship between local chromatin state, TF binding and observed levels of gene expression, regression methods have proven to be useful tools. They have been successfully applied to predict mRNA levels from genome-wide experimental data and they provide insight into context-dependent gene regulatory mechanisms. However, heterogeneity arising from gene-set specific regulatory interactions is often overlooked. RESULTS: We show that regression models that predict gene expression by using experimentally derived ChIP-seq profiles of TFs can be significantly improved by mixture modelling. In order to find biologically relevant gene clusters, we employ a Bayesian allocation procedure which allows us to integrate additional biological information such as three-dimensional nuclear organization of chromosomes and gene function. The data integration procedure involves transforming the additional data into gene similarity values. We propose a generic similarity measure that is especially suitable for situations where the additional data are of both continuous and discrete type, and compare its performance with similar measures in the context of mixture modelling. CONCLUSIONS: We applied the proposed method to data from mouse embryonic stem cells (ESC). We find that including additional data results in mixture components that exhibit biologically meaningful gene clusters, and provides valuable insight into the heterogeneity of the regulatory interactions.
Affiliation(s)
- Mehran Aflakparast
- Department of Mathematics, Vrije Universiteit Amsterdam, De Boelelaan 1081a, Amsterdam, 1081 HV, The Netherlands.
- Geert Geeven
- Hubrecht Institute-KNAW, University Medical Centre Utrecht, Uppsalalaan 8, Utrecht, 3584CT, The Netherlands
- Mathisca C M de Gunst
- Department of Mathematics, Vrije Universiteit Amsterdam, De Boelelaan 1081a, Amsterdam, 1081 HV, The Netherlands
|
12
|
Schmidt F, Schulz MH. On the problem of confounders in modeling gene expression. Bioinformatics 2019; 35:711-719. [PMID: 30084962 PMCID: PMC6530814 DOI: 10.1093/bioinformatics/bty674] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/18/2018] [Revised: 06/21/2018] [Accepted: 08/02/2018] [Indexed: 01/01/2023] Open
Abstract
Motivation: Modeling of Transcription Factor (TF) binding from both ChIP-seq and chromatin accessibility data has become prevalent in computational biology. Several models have been proposed to generate new hypotheses on transcriptional regulation. However, there is no distinct approach to derive TF binding scores from ChIP-seq and open chromatin experiments. Here, we review biases of various scoring approaches and their effects on the interpretation and reliability of predictive gene expression models. Results: We generated predictive models for gene expression using ChIP-seq and DNase1-seq data from DEEP and ENCODE. Via randomization experiments, we identified confounders in TF gene scores derived from both ChIP-seq and DNase1-seq data. We reviewed correction approaches for both data types, which reduced the influence of identified confounders without harm to model performance. Also, our analyses highlighted further quality control measures, in addition to model performance, that may help to assure model reliability and to avoid misinterpretation in future studies. Availability and implementation: The software used in this study is available online at https://github.com/SchulzLab/TEPIC. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Florian Schmidt
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany.,Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany.,Graduate School for Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
- Marcel H Schulz
- High-throughput Genomics and Systems Biology, Cluster of Excellence on Multimodal Computing and Interaction, Saarland Informatics Campus, Saarbrücken, Germany.,Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
|
13
|
The spatial binding model of the pioneer factor Oct4 with its target genes during cell reprogramming. Comput Struct Biotechnol J 2019; 17:1226-1233. [PMID: 31921389 PMCID: PMC6944736 DOI: 10.1016/j.csbj.2019.09.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 02/19/2019] [Revised: 09/05/2019] [Accepted: 09/07/2019] [Indexed: 12/18/2022] Open
Abstract
Understanding how a pioneer factor regulates the genes it binds is crucial for improving the efficiency of TF-mediated reprogramming. Because Oct4 is the only factor that cannot be substituted by other POU members, a quantitative model describing its spatial binding pattern with its target genes is urgently needed. The dynamic profiles of Oct4 binding showed that the major wave occurs at the intermediate stage of cell reprogramming (from day 7 to day 15) and that promoters are the preferred target regions. The Oct4-binding distributions show significant chromosome bias: the overall enrichment on chromosomes 1–11 is higher than that on the others. The dramatic events of TF-mediated reprogramming are mainly concentrated on autosomes. We also found that the spatial binding ability of Oct4 can be represented quantitatively using three peak parameters (height, width and distance). The dynamic changes in Oct4 binding demonstrated that width plays a more important role in regulating the expression of target genes. Finally, multivariate linear regression was used to establish the spatial binding model of Oct4 binding. The evaluation results confirmed that height and width are positively correlated with gene expression, and that additive interaction terms of height and width optimize model performance better than multiplicative terms. The best average coefficient of determination of the improved model reached 81.38%. Our study provides new insights into the cooperative regulation of the spatial binding pattern of pioneer factors in cell reprogramming.
|
14
|
Read DF, Cook K, Lu YY, Le Roch KG, Noble WS. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS Comput Biol 2019; 15:e1007329. [PMID: 31509524 PMCID: PMC6756558 DOI: 10.1371/journal.pcbi.1007329] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 02/11/2019] [Revised: 09/23/2019] [Accepted: 08/12/2019] [Indexed: 12/02/2022] Open
Abstract
Empirical evidence suggests that the malaria parasite Plasmodium falciparum employs a broad range of mechanisms to regulate gene transcription throughout the organism's complex life cycle. To better understand this regulatory machinery, we assembled a rich collection of genomic and epigenomic data sets, including information about transcription factor (TF) binding motifs, patterns of covalent histone modifications, nucleosome occupancy, GC content, and global 3D genome architecture. We used these data to train machine learning models to discriminate between high-expression and low-expression genes, focusing on three distinct stages of the red blood cell phase of the Plasmodium life cycle. Our results highlight the importance of histone modifications and 3D chromatin architecture in Plasmodium transcriptional regulation and suggest that AP2 transcription factors may play a limited regulatory role, perhaps operating in conjunction with epigenetic factors.
Affiliation(s)
- David F. Read
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Kate Cook
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Yang Y. Lu
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Karine G. Le Roch
- Department of Molecular, Cell and Systems Biology, University of California, Riverside, California, United States of America
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
|
15
|
Feng ZX, Li QZ, Meng JJ. Modeling the relationship of diverse genomic signatures to gene expression levels with the regulation of long-range enhancer-promoter interactions. Biophysics Reports 2019. [DOI: 10.1007/s41048-019-0089-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/29/2022] Open
|
16
|
Zhao Y, Schaafsma E, Cheng C. Applications of ENCODE data to Systematic Analyses via Data Integration. Curr Opin Syst Biol 2019; 11:57-64. [PMID: 31011690 DOI: 10.1016/j.coisb.2018.08.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/26/2022]
Abstract
Large-scale genomic data have been utilized to generate unprecedented biological findings and new hypotheses. To delineate functional elements in the human genome, the Encyclopedia of DNA Elements (ENCODE) project has generated an enormous amount of genomic data, yielding around 7,000 data profiles in different cell and tissue types. In this article, we reviewed the systematic analyses that have integrated ENCODE data with other data sources to reveal new biological insights, ranging from human genome annotation to the identification of new candidate drugs. These analyses demonstrate the critical impact of ENCODE data on basic biology and translational research.
Affiliation(s)
- Yanding Zhao
- Department of Biomedical Data Science, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756.,Department of Molecular and Systems Biology, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756
- Evelien Schaafsma
- Department of Biomedical Data Science, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756.,Department of Molecular and Systems Biology, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756
- Chao Cheng
- Department of Biomedical Data Science, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756.,Department of Molecular and Systems Biology, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756.,Norris Cotton Cancer Center, The Geisel School of Medicine at Dartmouth College, One Medical Center Dr., Dartmouth-Hitchcock Medical Center, Lebanon, NH, United States, 03756
|
17
|
Lu R, Rogan PK. Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations. F1000Res 2018; 7:1933. [PMID: 31001412 PMCID: PMC6464064 DOI: 10.12688/f1000research.17363.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Accepted: 12/05/2018] [Indexed: 10/12/2023] Open
Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets. Methods: Genes with correlated expression patterns across 53 tissues and TF targets were respectively identified from Bray-Curtis Similarity and TF knockdown experiments. Corresponding promoter sequences were reduced to DNase I-accessible intervals; TFBSs were then identified within these intervals using information theory-based position weight matrices for each TF (iPWMs) and clustered. Features from information-dense TFBS clusters predicted these genes with machine learning classifiers, which were evaluated for accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed in silico to examine their impact on cluster densities and the regulatory states of target genes. Results: We initially chose the glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, to test this approach. SLC25A32 and TANK were found to exhibit the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the largest area under the Receiver Operating Characteristic (ROC) curve in detecting such genes. Target gene prediction was confirmed using siRNA knockdown of TFs, which was found to be more accurate than those predicted after CRISPR/CAS9 inactivation. In silico mutation analyses of TFBSs also revealed that one or more information-dense TFBS clusters in promoters are required for accurate target gene prediction.
Conclusions: Machine learning based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
Affiliation(s)
- Ruipeng Lu
- Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada
- Peter K. Rogan
- Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada
- Biochemistry, University of Western Ontario, London, Ontario, N6A 5C1, Canada
- Cytognomix, London, Ontario, N5X 3X5, Canada
|
18
|
Lu R, Rogan PK. Transcription factor binding site clusters identify target genes with similar tissue-wide expression and buffer against mutations. F1000Res 2018; 7:1933. [PMID: 31001412 PMCID: PMC6464064 DOI: 10.12688/f1000research.17363.2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Accepted: 03/28/2019] [Indexed: 12/20/2022] Open
Abstract
Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets using Machine Learning (ML). Methods: Bray-Curtis Similarity was used to identify genes with correlated expression patterns across 53 tissues. TF targets from knockdown experiments were also analyzed by this approach to set up the ML framework. TFBSs were selected within DNase I-accessible intervals of corresponding promoter sequences using information theory-based position weight matrices (iPWMs) for each TF. Features from information-dense clusters of TFBSs were input to ML classifiers which predict these gene targets along with their accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed in silico to examine their impact on TFBS clustering and predict changes in gene regulation. Results: The glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, was selected to test this approach. SLC25A32 and TANK exhibited the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the best performance in detecting such genes, based on Area Under the Receiver Operating Characteristic curve (ROC). TF target gene prediction was confirmed using siRNA knockdown, which was more accurate than CRISPR/CAS9 inactivation. TFBS mutation analyses revealed that accurate target gene prediction required at least 1 information-dense TFBS cluster.
Conclusions: ML based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
Affiliation(s)
- Ruipeng Lu
- Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada
- Peter K. Rogan
- Computer Science, University of Western Ontario, London, Ontario, N6A 5B7, Canada
- Biochemistry, University of Western Ontario, London, Ontario, N6A 5C1, Canada
- Cytognomix, London, Ontario, N5X 3X5, Canada
|
19
|
Ng FSL, Ruau D, Wernisch L, Göttgens B. A graphical model approach visualizes regulatory relationships between genome-wide transcription factor binding profiles. Brief Bioinform 2018; 19:162-173. [PMID: 27780826 PMCID: PMC5496675 DOI: 10.1093/bib/bbw102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/08/2015] [Indexed: 11/16/2022] Open
Abstract
Integrated analysis of multiple genome-wide transcription factor (TF)-binding profiles will be vital to advance our understanding of the global impact of TF binding. However, existing methods for measuring similarity in large numbers of chromatin immunoprecipitation assays with sequencing (ChIP-seq), such as correlation, mutual information or enrichment analysis, are limited in their ability to display functionally relevant TF relationships. In this study, we propose the use of graphical models to determine conditional independence between TFs and show that network visualization provides a promising alternative to distinguish ‘direct’ versus ‘indirect’ TF interactions. We applied four algorithms to measure ‘direct’ dependence to a compendium of 367 mouse haematopoietic TF ChIP-seq samples and obtained a consensus network known as a ‘TF association network’ where edges in the network corresponded to likely causal pairwise relationships between TFs. The ‘TF association network’ illustrates the role of TFs in developmental pathways, is reminiscent of combinatorial TF regulation, corresponds to known protein–protein interactions and indicates substantial TF-binding reorganization in leukemic cell types. With the rapid increase in TF ChIP-seq data sets, the approach presented here will be a powerful tool to study transcriptional programmes across a wide range of biological systems.
Affiliation(s)
- Felicia S L Ng
- Department of Haematology, Wellcome Trust and MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Hills Road, Cambridge, UK
- David Ruau
- Department of Haematology, Wellcome Trust and MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Hills Road, Cambridge, UK
- Lorenz Wernisch
- Department of Haematology, Wellcome Trust and MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Hills Road, Cambridge, UK
- Berthold Göttgens
- Department of Haematology, Wellcome Trust and MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Hills Road, Cambridge, UK
- Corresponding author: Berthold Gottgens, Department of Haematology, Wellcome Trust and MRC Cambridge Stem Cell Institute & Cambridge Institute for Medical Research, Hills Road, Cambridge CB2 0XY, UK. Tel: 01223-336829; Fax: 01223-762670; E-mail:
|
20
|
Zhang LQ, Li QZ. Estimating the effects of transcription factors binding and histone modifications on gene expression levels in human cells. Oncotarget 2018; 8:40090-40103. [PMID: 28454114 PMCID: PMC5522221 DOI: 10.18632/oncotarget.16988] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 12/15/2016] [Accepted: 03/11/2017] [Indexed: 12/22/2022] Open
Abstract
Transcription factors and histone modifications are vital for the regulation of gene expression. Hence, to estimate the effects of transcription factor binding and histone modifications on gene expression, we construct a statistical model from genome-wide binding data for 15 transcription factors, 10 histone modification profiles and DNase-I hypersensitivity data in three mammalian cell lines. Remarkably, our results show that POLR2A and H3K36me3 can highly and consistently predict gene expression in the three cell lines. In addition, H3K4me3, H3K27me3 and H3K9ac are more reliable predictors than other histone modifications in human embryonic stem cells. Moreover, genome-wide statistical redundancies exist within and between transcription factors and histone modifications, and these phenomena may be caused by the regulation mechanism. On further study, we find that even though transcription factors and histone modifications have similar effects on genome-wide gene expression levels, their effects on predictive ability differ for genes in independent biological processes.
Affiliation(s)
- Lu-Qiang Zhang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
- Qian-Zhong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, China
|
21
|
Abstract
Transcription is regulated by transcription factor (TF) binding at promoters and distal regulatory elements and histone modifications that control the accessibility of these elements. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become the standard assay for identifying genome-wide protein-DNA interactions in vitro and in vivo. As large-scale ChIP-seq data sets have been collected for different TFs and histone modifications, their potential to predict gene expression can be used to test hypotheses about the mechanisms of gene regulation. In addition, complementary functional genomics assays provide a global view of chromatin accessibility and long-range cis-regulatory interactions that are being combined with TF binding and histone remodeling to study the regulation of gene expression. Thus, ChIP-seq analysis is now widely integrated with other functional genomics assays to better understand gene regulatory mechanisms. In this review, we discuss advances and challenges in integrating ChIP-seq data to identify context-specific chromatin states associated with gene activity. We describe the overall computational design of integrating ChIP-seq data with other functional genomics assays. We also discuss the challenges of extending these methods to low-input ChIP-seq assays and related single-cell assays.
Affiliation(s)
- Ali Mortazavi
- Corresponding author: Ali Mortazavi, Department of Developmental and Cell Biology, 2300 Biological Sciences 3, University of California, Irvine, CA 92697, USA. Tel: (949)824-6762; E-mail:
|
22
|
Kehl T, Schneider L, Schmidt F, Stöckel D, Gerstner N, Backes C, Meese E, Keller A, Schulz MH, Lenhof HP. RegulatorTrail: a web service for the identification of key transcriptional regulators. Nucleic Acids Res 2017; 45:W146-W153. [PMID: 28472408 PMCID: PMC5570139 DOI: 10.1093/nar/gkx350] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 02/10/2017] [Revised: 04/07/2017] [Accepted: 04/20/2017] [Indexed: 12/14/2022] Open
Abstract
Transcriptional regulators such as transcription factors and chromatin modifiers play a central role in most biological processes. Alterations in their activities have been observed in many diseases, e.g. cancer. Hence, it is of utmost importance to evaluate and assess the effects of transcriptional regulators on natural and pathogenic processes. Here, we present RegulatorTrail, a web service that provides rich functionality for the identification and prioritization of key transcriptional regulators that have a strong impact on, e.g. pathological processes. RegulatorTrail offers eight methods that use regulator binding information in combination with transcriptomic or epigenomic data to infer the most influential regulators. Our web service not only provides an intuitive web interface, but also a well-documented RESTful API that allows for a straightforward integration into third-party workflows. The presented case studies highlight the capabilities of our web service and demonstrate its potential for the identification of influential regulators: we successfully identified regulators that might explain the increased malignancy in metastatic melanoma compared to primary tumors, as well as important regulators in macrophages. RegulatorTrail is freely accessible at: https://regulatortrail.bioinf.uni-sb.de/.
Affiliation(s)
- Tim Kehl
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Lara Schneider
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Florian Schmidt
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Cluster of Excellence Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarland University, Saarbrücken, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Daniel Stöckel
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Nico Gerstner
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Christina Backes
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Eckart Meese
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Human Genetics, Saarland University, 66421 Homburg, Germany
- Andreas Keller
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Marcel H Schulz
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
- Cluster of Excellence Multimodal Computing and Interaction, Saarland Informatics Campus, 66123 Saarland University, Saarbrücken, Germany
- Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Hans-Peter Lenhof
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66123 Saarbrücken, Germany
| |
Collapse
|
23
|
Integrated analysis and transcript abundance modelling of H3K4me3 and H3K27me3 in developing secondary xylem. Sci Rep 2017; 7:3370. [PMID: 28611454 PMCID: PMC5469831 DOI: 10.1038/s41598-017-03665-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 02/03/2017] [Accepted: 05/02/2017] [Indexed: 01/10/2023] Open
Abstract
Despite the considerable contribution of xylem development (xylogenesis) to plant biomass accumulation, its epigenetic regulation is poorly understood. Furthermore, the relative contributions of histone modifications to transcriptional regulation are not well studied in plants. We investigated the biological relevance of H3K4me3 and H3K27me3 in secondary xylem development using ChIP-seq and their association with transcript levels among other histone modifications in woody and herbaceous models. In developing secondary xylem of the woody model Eucalyptus grandis, H3K4me3 and H3K27me3 genomic spans were distinctly associated with xylogenesis-related processes, with (late) lignification pathways enriched for putative bivalent domains, but not early secondary cell wall polysaccharide deposition. H3K27me3-occupied genes, of which 753 (~31%) are novel targets, were enriched for transcriptional regulation and flower development and had significant preferential expression in roots. Linear regression models of the ChIP-seq profiles predicted ~50% of transcript abundance measured with strand-specific RNA-seq, confirmed in a parallel analysis in Arabidopsis where integration of seven additional histone modifications each contributed smaller proportions of unique information to the predictive models. This study uncovers the biological importance of histone modification antagonism and genomic span in xylogenesis and quantifies for the first time the relative correlations of histone modifications with transcript abundance in plants.
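The phrase "predicted ~50% of transcript abundance" refers to the proportion of variance explained (R²) by a linear model. The following is a minimal sketch of that idea, not the authors' pipeline: the function names `fit_ols` and `r_squared` are hypothetical, and the per-gene signal and expression values are invented toy numbers used only for illustration.

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def r_squared(rows, y, beta):
    """Fraction of variance in y explained by the fitted linear model."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - q) ** 2 for t, q in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Toy data (invented): per-gene (H3K4me3 signal, H3K27me3 signal) and log expression
signals = [(5.0, 0.5), (4.0, 1.0), (0.5, 4.0), (1.0, 3.5), (3.0, 2.0), (0.2, 5.0)]
expression = [6.1, 5.0, 0.4, 1.2, 3.3, 0.1]

beta = fit_ols(signals, expression)
print("coefficients:", beta)
print("R^2:", r_squared(signals, expression, beta))
```

In this reading, an R² near 0.5 would correspond to the roughly 50% figure quoted in the abstract.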
|
24
|
Song L, Huang SSC, Wise A, Castanon R, Nery JR, Chen H, Watanabe M, Thomas J, Bar-Joseph Z, Ecker JR. A transcription factor hierarchy defines an environmental stress response network. Science 2017; 354:354/6312/aag1550. [PMID: 27811239 PMCID: PMC5217750 DOI: 10.1126/science.aag1550] [Citation(s) in RCA: 343] [Impact Index Per Article: 42.9] [Received: 05/16/2016] [Accepted: 09/28/2016] [Indexed: 12/17/2022]
Abstract
Environmental stresses are universally encountered by microbes, plants, and animals. Yet systematic studies of stress-responsive transcription factor (TF) networks in multicellular organisms have been limited. The phytohormone abscisic acid (ABA) influences the expression of thousands of genes, allowing us to characterize complex stress-responsive regulatory networks. Using chromatin immunoprecipitation sequencing, we identified genome-wide targets of 21 ABA-related TFs to construct a comprehensive regulatory network in Arabidopsis thaliana. Determinants of dynamic TF binding and a hierarchy among TFs were defined, illuminating the relationship between differential gene expression patterns and ABA pathway feedback regulation. By extrapolating regulatory characteristics of observed canonical ABA pathway components, we identified a new family of transcriptional regulators modulating ABA and salt responsiveness and demonstrated their utility to modulate plant resilience to osmotic stress.
Affiliation(s)
- Liang Song
  - Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Shao-Shan Carol Huang
  - Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Aaron Wise
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Rosa Castanon
  - Genomic Analysis Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Joseph R Nery
  - Genomic Analysis Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Huaming Chen
  - Genomic Analysis Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Marina Watanabe
  - Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Jerushah Thomas
  - Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Ziv Bar-Joseph
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Joseph R Ecker
  - Plant Biology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
  - Genomic Analysis Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037, USA
  - Howard Hughes Medical Institute (HHMI), Salk Institute for Biological Studies, La Jolla, CA 92037, USA
|
25
|
Schmidt F, Gasparoni N, Gasparoni G, Gianmoena K, Cadenas C, Polansky JK, Ebert P, Nordström K, Barann M, Sinha A, Fröhler S, Xiong J, Dehghani Amirabad A, Behjati Ardakani F, Hutter B, Zipprich G, Felder B, Eils J, Brors B, Chen W, Hengstler JG, Hamann A, Lengauer T, Rosenstiel P, Walter J, Schulz MH. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res 2017; 45:54-66. [PMID: 27899623 PMCID: PMC5224477 DOI: 10.1093/nar/gkw1061] [Citation(s) in RCA: 83] [Impact Index Per Article: 10.4] [Received: 05/27/2016] [Revised: 10/18/2016] [Accepted: 10/24/2016] [Indexed: 12/21/2022] Open
Abstract
The binding and contribution of transcription factors (TF) to cell specific gene expression is often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. Thus, it is important to develop computational methods for accurate TF binding prediction in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, to predict TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, Histone-Marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength. Using machine learning, we find low affinity binding sites to improve our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to a good prediction performance. In our application, gene-based scores computed by TEPIC with one open-chromatin assay nearly reach the quality of several TF ChIP-seq data sets. Finally, these scores correctly predict known transcriptional regulators as illustrated by the application to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.
Affiliation(s)
- Florian Schmidt
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Nina Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Gilles Gasparoni
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Kathrin Gianmoena
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Cristina Cadenas
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Julia K Polansky
  - Experimental Rheumatology, German Rheumatism Research Centre, Berlin, 10117, Germany
- Peter Ebert
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Karl Nordström
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Matthias Barann
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Anupam Sinha
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Sebastian Fröhler
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jieyi Xiong
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Azim Dehghani Amirabad
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Fatemeh Behjati Ardakani
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Barbara Hutter
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Gideon Zipprich
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Bärbel Felder
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Jürgen Eils
  - Data Management and Genomics IT, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Benedikt Brors
  - Applied Bioinformatics, Deutsches Krebsforschungszentrum, Heidelberg, 69120, Germany
- Wei Chen
  - Berlin Institute for Medical Systems Biology, Max-Delbrück Center for Molecular Medicine, Berlin, 13092, Germany
- Jan G Hengstler
  - Leibniz Research Centre for Working Environment and Human Factors IfADo, Dortmund, 44139, Germany
- Alf Hamann
  - International Max Planck Research School for Computer Science, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Thomas Lengauer
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
- Philip Rosenstiel
  - Institute of Clinical Molecular Biology, Christian-Albrechts-University, Kiel, 24105, Germany
- Jörn Walter
  - Department of Genetics, University of Saarland, Saarbrücken, 66123, Germany
- Marcel H Schulz
  - Cluster of Excellence for Multimodal Computing and Interaction, Saarland Informatics Campus, Saarland University, Saarbrücken, 66123, Germany
  - Computational Biology & Applied Algorithmics, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, 66123, Germany
|
26
|
Budden DM, Crampin EJ. Distributed gene expression modelling for exploring variability in epigenetic function. BMC Bioinformatics 2016; 17:446. [PMID: 27816056 PMCID: PMC5097851 DOI: 10.1186/s12859-016-1313-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2016] [Accepted: 10/25/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predictive gene expression modelling is an important tool in computational biology due to the volume of high-throughput sequencing data generated by recent consortia. However, the scope of previous studies has been restricted to a small set of cell-lines or experimental conditions due to an inability to leverage distributed processing architectures for large, sharded data-sets. RESULTS We present a distributed implementation of gene expression modelling using the MapReduce paradigm and prove that performance improves as a linear function of available processor cores. We then leverage the computational efficiency of this framework to explore the variability of epigenetic function across fifty histone modification data-sets from a variety of cancerous and non-cancerous cell-lines. CONCLUSIONS We demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched -omics data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity.
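One way to picture how a regression model can be fitted over sharded data with MapReduce is sketched below. This is an illustrative assumption, not the authors' implementation: each shard is mapped to a small tuple of sufficient statistics, the tuples are reduced by addition, and the model is solved once from the totals. The function names (`map_shard`, `reduce_stats`, `solve`, `fit_sharded`) and the toy shards are hypothetical.

```python
from functools import reduce


def map_shard(shard):
    """Map step: summarise one shard of (signal, expression) pairs."""
    n = len(shard)
    sx = sum(x for x, _ in shard)
    sy = sum(y for _, y in shard)
    sxx = sum(x * x for x, _ in shard)
    sxy = sum(x * y for x, y in shard)
    return (n, sx, sy, sxx, sxy)


def reduce_stats(a, b):
    """Reduce step: sufficient statistics from two shards simply add."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Closed-form simple linear regression from the pooled statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def fit_sharded(shards):
    """Fit expression ~ signal across shards without pooling the raw data."""
    return solve(reduce(reduce_stats, map(map_shard, shards)))


# Toy shards (invented): each holds (histone signal, expression) pairs
shards = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0)], [(3.0, 7.0), (4.0, 9.0)]]
print(fit_sharded(shards))
```

Because the map step is independent per shard, the built-in `map` could be replaced by a parallel map over worker processes or machines; only the five-number summaries need to be communicated.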
Affiliation(s)
- David M Budden
  - Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, Cambridge, 02139, USA
  - Systems Biology Laboratory, Melbourne School of Engineering, the University of Melbourne, Parkville, 3010, Australia
- Edmund J Crampin
  - Systems Biology Laboratory, Melbourne School of Engineering, the University of Melbourne, Parkville, 3010, Australia
  - ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, Parkville, 3010, Australia
  - Department of Mathematics and Statistics, the University of Melbourne, Parkville, 3010, Australia
  - School of Medicine, the University of Melbourne, Parkville, 3010, Australia
|
27
|
Su WX, Li QZ, Zhang LQ, Fan GL, Wu CY, Yan ZH, Zuo YC. Gene expression classification using epigenetic features and DNA sequence composition in the human embryonic stem cell line H1. Gene 2016; 592:227-234. [DOI: 10.1016/j.gene.2016.07.059] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 02/18/2016] [Revised: 06/20/2016] [Accepted: 07/23/2016] [Indexed: 01/01/2023]
|
28
|
E2F1 Orchestrates Transcriptomics and Oxidative Metabolism in Wharton's Jelly-Derived Mesenchymal Stem Cells from Growth-Restricted Infants. PLoS One 2016; 11:e0163035. [PMID: 27631473 PMCID: PMC5025055 DOI: 10.1371/journal.pone.0163035] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/26/2016] [Accepted: 09/01/2016] [Indexed: 12/31/2022] Open
Abstract
Wharton's jelly-derived Mesenchymal Stem Cells (MSCs) isolated from newborns with intrauterine fetal growth restriction were previously shown to exert anabolic features including insulin hypersensitivity. Here, we extend these observations and demonstrate that MSCs from small for gestational age (SGA) individuals have decreased mitochondrial oxygen consumption rates. Comparing normally grown and SGA MSCs using next generation sequencing studies, we measured global transcriptomic and epigenetic profiles and identified E2F1 as an over-expressed transcription factor regulating oxidative metabolism in the SGA group. We further show that E2F1 regulates the differential transcriptome found in SGA derived MSCs and is associated with the activating histone marks H3K27ac and H3K4me3. One of the key genes regulated by E2F1 was found to be the fatty acid elongase ELOVL2, a gene involved in the endogenous synthesis of docosahexaenoic acid (DHA). Finally, we shed light on how the E2F1-ELOVL2 pathway may alter oxidative respiration in the SGA condition by contributing to the maintenance of cellular metabolic homeostasis.
|
29
|
Narang V, Ramli MA, Singhal A, Kumar P, de Libero G, Poidinger M, Monterola C. Automated Identification of Core Regulatory Genes in Human Gene Regulatory Networks. PLoS Comput Biol 2015; 11:e1004504. [PMID: 26393364 PMCID: PMC4578944 DOI: 10.1371/journal.pcbi.1004504] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 04/10/2015] [Accepted: 08/11/2015] [Indexed: 12/20/2022] Open
Abstract
Human gene regulatory networks (GRN) can be difficult to interpret due to a tangle of edges interconnecting thousands of genes. We constructed a general human GRN from extensive transcription factor and microRNA target data obtained from public databases. In a subnetwork of this GRN that is active during estrogen stimulation of MCF-7 breast cancer cells, we benchmarked automated algorithms for identifying core regulatory genes (transcription factors and microRNAs). Among these algorithms, we identified the K-core decomposition, PageRank and betweenness centrality algorithms as the most effective for discovering core regulatory genes in the network, as evaluated by the previously known roles of these genes in MCF-7 biology and by their ability to explain the up- or down-regulated expression status of up to 70% of the remaining genes. Finally, we validated the use of the K-core algorithm for organizing the GRN in an easier-to-interpret layered hierarchy where more influential regulatory genes percolate towards the inner layers. The integrated human gene and miRNA network and software used in this study are provided as supplementary materials (S1 Data) accompanying this manuscript.
A gene regulatory network (GRN) represents how some genes encoding regulatory molecules such as transcription factors or microRNAs regulate the expression of other genes. Researchers commonly study GRNs involved in a specific biological process with the aim of identifying a few important regulatory genes. In higher organisms such as humans, a regulatory gene regulates multiple target genes and correspondingly any gene is regulated by multiple regulatory genes. Due to such multiplicity of interactions, a GRN usually resembles a tangled hairball wherein it is difficult to identify the few most influential regulatory genes. In this study, we show that network analysis algorithms such as K-core, PageRank and betweenness centrality are useful for identifying a few important or core regulatory genes in a GRN, and the K-core algorithm is also useful for organizing regulatory genes in a hierarchical layered structure where the most influential genes in a GRN are found within the innermost layer or core. These few core regulatory genes determine to a large extent the expression status of the remaining genes in the network. We illustrate a pragmatic application of this technique to GRNs reconstructed from genome-wide gene expression measurements in the MCF-7 human breast cancer cell line.
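The layered hierarchy described above can be illustrated with a minimal K-core decomposition sketch. It is not the software distributed with the paper: it assumes the network is treated as an undirected graph stored as a symmetric adjacency dictionary, and the node names and edges are invented. The function name `core_numbers` is hypothetical.

```python
def core_numbers(adj):
    """Return {node: core number} by repeatedly peeling the lowest-degree node."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Pick the node with the fewest remaining neighbours
        v = min(remaining, key=lambda u: degree[u])
        # The core level never decreases as peeling proceeds
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Toy network (invented): A-D are mutually connected, E links to A and B, F only to E
toy_grn = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "B", "F"},
    "F": {"E"},
}

layers = core_numbers(toy_grn)
for node in sorted(layers, key=lambda n: (-layers[n], n)):
    print(node, layers[node])
```

Nodes with the highest core number form the innermost layer; in the toy network these are the four mutually connected nodes, with E and F in successively outer layers.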
|
30
|
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 2015; 11:e1004418. [PMID: 26291518 PMCID: PMC4546298 DOI: 10.1371/journal.pcbi.1004418] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 11/19/2014] [Accepted: 06/29/2015] [Indexed: 11/19/2022] Open
Abstract
Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as a model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding than SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a “TF-agnostic” predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF-agnostic and likely describes the general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF-agnostic model for identifying functional regulatory regions in potentially any sequenced genome.
Identification of transcription factor binding sites based on sequence motifs is typically accompanied by a high false positive rate. Increasing evidence suggests that there are many other factors besides DNA sequence that may affect the binding and interaction of TFs with DNA. Through the integration of sequence motif, chromatin state, and DNA structure properties, we show that TF binding can be better predicted. Moreover, considering chromatin state and DNA structure properties simultaneously yields a significant improvement. While the binding of some TFs can be readily predicted using either chromatin state information or DNA structure, other TFs need both. Thus, our findings provide insights on how different histone modifications and DNA structure properties may influence the binding of a particular TF and thus how TFs regulate gene expression. These features are referred to as sequence “intrinsic properties” because they can be predicted from sequences alone. These intrinsic properties can be used to build a TF binding prediction model that has a similar performance to considering all features. Moreover, the intrinsic property model allows TFBS predictions not only across TFs, but also across DNA-binding domain families that are present in most eukaryotes, suggesting that the model likely can be used across species.
|
31
|
Budden DM, Hurley DG, Crampin EJ. Modelling the conditional regulatory activity of methylated and bivalent promoters. Epigenetics Chromatin 2015; 8:21. [PMID: 26097508 PMCID: PMC4474576 DOI: 10.1186/s13072-015-0013-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 04/10/2015] [Accepted: 06/10/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Predictive modelling of gene expression is a powerful framework for the in silico exploration of transcriptional regulatory interactions through the integration of high-throughput -omics data. A major limitation of previous approaches is their inability to handle conditional interactions that emerge when genes are subject to different regulatory mechanisms. Although chromatin immunoprecipitation-based histone modification data are often used as proxies for chromatin accessibility, the association between these variables and expression often depends upon the presence of other epigenetic markers (e.g. DNA methylation or histone variants). These conditional interactions are poorly handled by previous predictive models and reduce the reliability of downstream biological inference. RESULTS We have previously demonstrated that integrating both transcription factor and histone modification data within a single predictive model is rendered ineffective by their statistical redundancy. In this study, we evaluate four proposed methods for quantifying gene-level DNA methylation levels and demonstrate that inclusion of these data in predictive modelling frameworks is also subject to this critical limitation in data integration. Based on the hypothesis that statistical redundancy in epigenetic data is caused by conditional regulatory interactions within a dynamic chromatin context, we construct a new gene expression model which is the first to improve prediction accuracy by unsupervised identification of latent regulatory classes. We show that DNA methylation and H2A.Z histone variant data can be interpreted in this way to identify and explore the signatures of silenced and bivalent promoters, substantially improving genome-wide predictions of mRNA transcript abundance and downstream biological inference across multiple cell lines. 
CONCLUSIONS Previous models of gene expression have been applied successfully to several important problems in molecular biology, including the discovery of transcription factor roles, identification of regulatory elements responsible for differential expression patterns and comparative analysis of the transcriptome across distant species. Our analysis supports our hypothesis that statistical redundancy in epigenetic data is partially due to conditional relationships between these regulators and gene expression levels. This analysis provides insight into the heterogeneous roles of H3K4me3 and H3K27me3 in the presence of the H2A.Z histone variant (implicated in cancer progression) and how these signatures change during lineage commitment and carcinogenesis.
Affiliation(s)
- David M Budden
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
  - NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia
- Daniel G Hurley
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Edmund J Crampin
  - Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
  - NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia
  - ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, 3010 Parkville, Australia
  - Department of Mathematics and Statistics, The University of Melbourne, 3010 Parkville, Australia
  - School of Medicine, The University of Melbourne, 3010 Parkville, Australia
|
32
|
Abstract
Despite the rapid accumulation of tumor-profiling data and transcription factor (TF) ChIP-seq profiles, efforts integrating TF binding with the tumor-profiling data to understand how TFs regulate tumor gene expression are still limited. To systematically search for cancer-associated TFs, we comprehensively integrated 686 ENCODE ChIP-seq profiles representing 150 TFs with 7484 TCGA tumor data in 18 cancer types. For efficient and accurate inference on gene regulatory rules across a large number and variety of datasets, we developed an algorithm, RABIT (regression analysis with background integration). In each tumor sample, RABIT tests whether the TF target genes from ChIP-seq show strong differential regulation after controlling for background effect from copy number alteration and DNA methylation. When multiple ChIP-seq profiles are available for a TF, RABIT prioritizes the most relevant ChIP-seq profile in each tumor. In each cancer type, RABIT further tests whether the TF expression and somatic mutation variations are correlated with differential expression patterns of its target genes across tumors. Our predicted TF impact on tumor gene expression is highly consistent with the knowledge from cancer-related gene databases and reveals many previously unidentified aspects of transcriptional regulation in tumor progression. We also applied RABIT on RNA-binding protein motifs and found that some alternative splicing factors could affect tumor-specific gene expression by binding to target gene 3'UTR regions. Thus, RABIT (rabit.dfci.harvard.edu) is a general platform for predicting the oncogenic role of gene expression regulators.
|
33
|
Ignatieva EV, Podkolodnaya OA, Orlov YL, Vasiliev GV, Kolchanov NA. Regulatory genomics: Combined experimental and computational approaches. RUSS J GENET+ 2015. [DOI: 10.1134/s1022795415040067] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/23/2022]
|
34
|
Weber D, Heisig J, Kneitz S, Wolf E, Eilers M, Gessler M. Mechanisms of epigenetic and cell-type specific regulation of Hey target genes in ES cells and cardiomyocytes. J Mol Cell Cardiol 2015; 79:79-88. [DOI: 10.1016/j.yjmcc.2014.11.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 08/20/2014] [Revised: 10/07/2014] [Accepted: 11/06/2014] [Indexed: 01/20/2023]
|
35
|
Budden DM, Hurley DG, Cursons J, Markham JF, Davis MJ, Crampin EJ. Predicting expression: the complementary power of histone modification and transcription factor binding data. Epigenetics Chromatin 2014; 7:36. [PMID: 25489339 PMCID: PMC4258808 DOI: 10.1186/1756-8935-7-36] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 07/25/2014] [Accepted: 11/05/2014] [Indexed: 01/01/2023] Open
Abstract
Background Transcription factors (TFs) and histone modifications (HMs) play critical roles in gene expression by regulating mRNA transcription. Modelling frameworks have been developed to integrate high-throughput omics data, with the aim of elucidating the regulatory logic that results from the interactions of DNA, TFs and HMs. These models have yielded an unexpected and poorly understood result: that TFs and HMs are statistically redundant in explaining mRNA transcript abundance at a genome-wide level. Results We constructed predictive models of gene expression by integrating RNA-sequencing, TF and HM chromatin immunoprecipitation sequencing and DNase I hypersensitivity data for two mammalian cell types. All models identified genome-wide statistical redundancy both within and between TFs and HMs, as previously reported. To investigate potential explanations, groups of genes were constructed for ontology-classified biological processes. Predictive models were constructed for each process to explore the distribution of statistical redundancy. We found significant variation in the predictive capacity of TFs and HMs across these processes and demonstrated the predictive power of HMs to be inversely proportional to process enrichment for housekeeping genes. Conclusions It is well established that the roles played by TFs and HMs are not functionally redundant. Instead, we attribute the statistical redundancy reported in this and previous genome-wide modelling studies to the heterogeneous distribution of HMs across chromatin domains. Furthermore, we conclude that statistical redundancy between individual TFs can be readily explained by nucleosome-mediated cooperative binding. This could possibly help the cell confer regulatory robustness by rejecting signalling noise and allowing control via multiple pathways. 
Collapse
Affiliation(s)
- David M Budden
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia ; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia
- Daniel G Hurley
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Joseph Cursons
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- John F Markham
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia ; The Walter and Eliza Hall Institute of Medical Research, Department of Medical Biology, The University of Melbourne, 3010 Parkville, Australia
- Melissa J Davis
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia
- Edmund J Crampin
- Systems Biology Laboratory, Melbourne School of Engineering, The University of Melbourne, 3010 Parkville, Australia ; NICTA Victoria Research Laboratory, The University of Melbourne, 3010 Parkville, Australia ; The Walter and Eliza Hall Institute of Medical Research, Department of Medical Biology, The University of Melbourne, 3010 Parkville, Australia ; ARC Centre of Excellence in Convergent Bio-Nano Science and Technology, 3010 Parkville, Australia ; Department of Mathematics and Statistics, The University of Melbourne, 3010 Parkville, Australia ; School of Medicine, The University of Melbourne, 3010 Parkville, Australia
36
Angelini C, Costa V. Understanding gene regulatory mechanisms by integrating ChIP-seq and RNA-seq data: statistical solutions to biological problems. Front Cell Dev Biol 2014; 2:51. [PMID: 25364758 PMCID: PMC4207007 DOI: 10.3389/fcell.2014.00051] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 05/29/2014] [Accepted: 09/01/2014] [Indexed: 11/15/2022] Open
Abstract
The availability of omic data produced from international consortia, as well as from worldwide laboratories, is offering the possibility both to answer long-standing questions in biomedicine/molecular biology and to formulate novel hypotheses to test. However, the impact of such data is not fully exploited due to a limited availability of multi-omic data integration tools and methods. In this paper, we discuss the interplay between gene expression and epigenetic markers/transcription factors. We show how integrating ChIP-seq and RNA-seq data can help to elucidate gene regulatory mechanisms. In particular, we discuss the two following questions: (i) Can transcription factor occupancies or histone modification data predict gene expression? (ii) Can ChIP-seq and RNA-seq data be used to infer gene regulatory networks? We propose potential directions for statistical data integration. We discuss the importance of incorporating underestimated aspects (such as alternative splicing and long-range chromatin interactions). We also highlight the lack of data benchmarks and the need to develop tools for data integration from a statistical viewpoint, designed in the spirit of reproducible research.
Affiliation(s)
- Claudia Angelini
- Istituto per le Applicazioni del Calcolo "M. Picone" - CNR Napoli, Italy ; Computational and Biology Open Laboratory (ComBOlab) Napoli, Italy
- Valerio Costa
- Computational and Biology Open Laboratory (ComBOlab) Napoli, Italy ; Institute of Genetics and Biophysics "A. Buzzati-Traverso" - CNR Napoli, Italy
37
Budden DM, Hurley DG, Crampin EJ. Predictive modelling of gene expression from transcriptional regulatory elements. Brief Bioinform 2014; 16:616-28. [PMID: 25231769 DOI: 10.1093/bib/bbu034] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 07/02/2014] [Accepted: 08/20/2014] [Indexed: 12/15/2022] Open
Abstract
Predictive modelling of gene expression provides a powerful framework for exploring the regulatory logic underpinning transcriptional regulation. Recent studies have demonstrated the utility of such models in identifying dysregulation of gene and miRNA expression associated with abnormal patterns of transcription factor (TF) binding or nucleosomal histone modifications (HMs). Despite the growing popularity of such approaches, a comparative review of the various modelling algorithms and feature extraction methods is lacking. We define and compare three methods of quantifying pairwise gene-TF/HM interactions and discuss their suitability for integrating the heterogeneous chromatin immunoprecipitation (ChIP)-seq binding patterns exhibited by TFs and HMs. We then construct log-linear and ϵ-support vector regression models from various mouse embryonic stem cell (mESC) and human lymphoblastoid (GM12878) data sets, considering both ChIP-seq- and position weight matrix- (PWM)-derived in silico TF-binding. The two algorithms are evaluated both in terms of their modelling prediction accuracy and ability to identify the established regulatory roles of individual TFs and HMs. Our results demonstrate that TF-binding and HMs are highly predictive of gene expression as measured by mRNA transcript abundance, irrespective of algorithm or cell type selection and considering both ChIP-seq and PWM-derived TF-binding. As we encourage other researchers to explore and develop these results, our framework is implemented using open-source software and made available as a preconfigured bootable virtual environment.
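To make the log-linear formulation mentioned in this abstract concrete, the following is a minimal sketch and not the authors' framework. It assumes each gene is described by a vector of non-negative TF/HM association scores and fits log(y + p) = b0 + sum_j b_j * log(x_j + p) by ordinary least squares. The pseudocount p, the function names and the pure-Python solver are illustrative assumptions; the ϵ-support vector regression variant is not shown.

```python
import math
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_log_linear(
    features: Sequence[Sequence[float]],
    expression: Sequence[float],
    pseudocount: float = 1.0,  # assumed value, avoids log(0)
) -> List[float]:
    """Least-squares fit of log(y + p) = b0 + sum_j b_j * log(x_j + p).

    Returns [b0, b1, ..., bk] via the normal equations (X^T X) b = X^T t.
    """
    rows = [[1.0] + [math.log(x + pseudocount) for x in f] for f in features]
    target = [math.log(y + pseudocount) for y in expression]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(
    coeffs: Sequence[float],
    feature_row: Sequence[float],
    pseudocount: float = 1.0,
) -> float:
    """Predicted expression for one gene, back-transformed from log space."""
    log_y = coeffs[0] + sum(
        b * math.log(x + pseudocount) for b, x in zip(coeffs[1:], feature_row)
    )
    return math.exp(log_y) - pseudocount
```

In this sketch the sign and magnitude of each fitted coefficient b_j play the role of an activating or repressive contribution of the corresponding TF or HM; with real data a numerical library and cross-validation would be the more sensible choice.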
38
O'Connor TR, Bailey TL. Creating and validating cis-regulatory maps of tissue-specific gene expression regulation. Nucleic Acids Res 2014; 42:11000-10. [PMID: 25200088 PMCID: PMC4176179 DOI: 10.1093/nar/gku801] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/23/2022] Open
Abstract
Predicting which genomic regions control the transcription of a given gene is a challenge. We present a novel computational approach for creating and validating maps that associate genomic regions (cis-regulatory modules–CRMs) with genes. The method infers regulatory relationships that explain gene expression observed in a test tissue using widely available genomic data for ‘other’ tissues. To predict the regulatory targets of a CRM, we use cross-tissue correlation between histone modifications present at the CRM and expression at genes within 1 Mbp of it. To validate cis-regulatory maps, we show that they yield more accurate models of gene expression than carefully constructed control maps. These gene expression models predict observed gene expression from transcription factor binding in the CRMs linked to that gene. We show that our maps are able to identify long-range regulatory interactions and improve substantially over maps linking genes and CRMs based on either the control maps or a ‘nearest neighbor’ heuristic. Our results also show that it is essential to include CRMs predicted in multiple tissues during map-building, that H3K27ac is the most informative histone modification, and that CAGE is the most informative measure of gene expression for creating cis-regulatory maps.
Affiliation(s)
- Timothy R O'Connor
- Institute for Molecular Bioscience, The University of Queensland, Brisbane 4072, Queensland, Australia
- Timothy L Bailey
- Institute for Molecular Bioscience, The University of Queensland, Brisbane 4072, Queensland, Australia
39
Noderer WL, Flockhart RJ, Bhaduri A, Diaz de Arce AJ, Zhang J, Khavari PA, Wang CL. Quantitative analysis of mammalian translation initiation sites by FACS-seq. Mol Syst Biol 2014; 10:748. [PMID: 25170020 PMCID: PMC4299517 DOI: 10.15252/msb.20145136] [Citation(s) in RCA: 137] [Impact Index Per Article: 12.5] [Indexed: 01/03/2023] Open
Abstract
An approach combining fluorescence-activated cell sorting and high-throughput DNA sequencing (FACS-seq) was employed to determine the efficiency of start codon recognition for all possible translation initiation sites (TIS) utilizing AUG start codons. Using FACS-seq, we measured translation from a genetic reporter library representing all 65,536 possible TIS sequences spanning the −6 to +5 positions. We found that the motif RYMRMVAUGGC enhanced start codon recognition and translation efficiency. However, dinucleotide interactions, which cannot be conveyed by a single motif, were also important for modeling TIS efficiency. Our dataset combined with modeling allowed us to predict genome-wide translation initiation efficiency for all mRNA transcripts. Additionally, we screened somatic TIS mutations associated with tumorigenesis to identify candidate driver mutations consistent with known tumor expression patterns. Finally, we implemented a quantitative leaky scanning model to predict alternative initiation sites that produce truncated protein isoforms and compared predictions with ribosome footprint profiling data. The comprehensive analysis of the TIS sequence space enables quantitative predictions of translation initiation based on genome sequence.
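The abstract does not spell out the leaky scanning model, so the following is only a sketch of the general idea under simplifying assumptions: each AUG is recognized independently with a fixed efficiency between 0 and 1, ribosomes that skip a site keep scanning 3', and reinitiation is ignored. The function name and the efficiency values are hypothetical and are not taken from the paper's data.

```python
from typing import List, Sequence


def leaky_scanning_fractions(efficiencies: Sequence[float]) -> List[float]:
    """Fraction of scanning ribosomes that initiate at each AUG, 5' to 3'.

    efficiencies[i] is the assumed probability that a ribosome reaching
    AUG number i initiates there rather than scanning past it.
    """
    fractions = []
    remaining = 1.0  # share of ribosomes that are still scanning
    for eff in efficiencies:
        if not 0.0 <= eff <= 1.0:
            raise ValueError("efficiency must lie in [0, 1]")
        fractions.append(remaining * eff)
        remaining *= 1.0 - eff
    return fractions


if __name__ == "__main__":
    # Made-up efficiencies for three successive AUGs in one transcript.
    for site, share in enumerate(leaky_scanning_fractions([0.5, 0.8, 1.0]), 1):
        print(f"AUG {site}: {share:.2f} of initiations")
```

Under these assumptions a weak upstream site lets a substantial share of ribosomes reach downstream AUGs, which is how leaky scanning can give rise to truncated isoforms.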
Affiliation(s)
- William L Noderer
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
- Ross J Flockhart
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
- Aparna Bhaduri
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA ; The Program in Cancer Biology, Stanford University School of Medicine, Stanford, CA, USA
- Jiajing Zhang
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA
- Paul A Khavari
- The Program in Epithelial Biology, Stanford University School of Medicine, Stanford, CA, USA ; Veterans Affairs Palo Alto Healthcare System, Palo Alto, CA, USA
- Clifford L Wang
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
40
Ling MHT, Poh CL. A predictor for predicting Escherichia coli transcriptome and the effects of gene perturbations. BMC Bioinformatics 2014; 15:140. [PMID: 24884349 PMCID: PMC4038595 DOI: 10.1186/1471-2105-15-140] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 11/28/2013] [Accepted: 05/09/2014] [Indexed: 11/24/2022] Open
Abstract
Background: A means to predict the effects of gene over-expression, knockouts, and environmental stimuli in silico is useful for system biologists to develop and test hypotheses. Several studies had predicted the expression of all Escherichia coli genes from sequences and reported a correlation of 0.301 between predicted and actual expression. However, these do not allow biologists to study the effects of gene perturbations on the native transcriptome.
Results: We developed a predictor to predict transcriptome-scale gene expression from a small number (n = 59) of known gene expressions using gene co-expression network, which can be used to predict the effects of over-expressions and knockdowns on E. coli transcriptome. In terms of transcriptome prediction, our results show that the correlation between predicted and actual expression value is 0.467, which is similar to the microarray intra-array variation (p-value = 0.348), suggesting that intra-array variation accounts for a substantial portion of the transcriptome prediction error. In terms of predicting the effects of gene perturbation(s), our results suggest that the expression of 83% of the genes affected by perturbation can be predicted within 40% of error and the correlation between predicted and actual expression values among the affected genes to be 0.698. With the ability to predict the effects of gene perturbations, we demonstrated that our predictor has the potential to estimate the effects of varying gene expression level on the native transcriptome.
Conclusion: We present a potential means to predict an entire transcriptome and a tool to estimate the effects of gene perturbations for E. coli, which will aid biologists in hypothesis development. This study forms the baseline for future work in using gene co-expression network for gene expression prediction.
Affiliation(s)
- Maurice H T Ling
- School of Chemical and Biomedical Engineering, Nanyang Technological University, Nanyang Ave, Singapore, Singapore.
41
Abstract
Genomic information is encoded on a wide range of distance scales, ranging from tens of base pairs to megabases. We developed a multiscale framework to analyze and visualize the information content of genomic signals. Different types of signals, such as GC content or DNA methylation, are characterized by distinct patterns of signal enrichment or depletion across scales spanning several orders of magnitude. These patterns are associated with a variety of genomic annotations, including genes, nuclear lamina associated domains, and repeat elements. By integrating the information across all scales, as compared to using any single scale, we demonstrate improved prediction of gene expression from Polymerase II chromatin immunoprecipitation sequencing (ChIP-seq) measurements and we observed that gene expression differences in colorectal cancer are not most strongly related to gene body methylation, but rather to methylation patterns that extend beyond the single-gene scale.
42
Chen H, Lonardi S, Zheng J. Deciphering histone code of transcriptional regulation in malaria parasites by large-scale data mining. Comput Biol Chem 2014; 50:3-10. [PMID: 24581698 DOI: 10.1016/j.compbiolchem.2014.01.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 12/23/2013] [Indexed: 10/25/2022]
Abstract
Histone modifications play a major role in the regulation of gene expression. Accumulated evidence has shown that histone modifications mediate biological processes such as transcription cooperatively. This has led to the hypothesis of 'histone code' which suggests that combinations of different histone modifications correspond to unique chromatin states and have distinct functions. In this paper, we propose a framework based on association rule mining to discover the potential regulatory relations between histone modifications and gene expression in Plasmodium falciparum. Our approach can output rules with statistical significance. Some of the discovered rules are supported by literature of experimental results. Moreover, we have also discovered de novo rules which can guide further research in epigenetic regulation of transcription. Based on our association rules we build a model to predict gene expression, which outperforms a published Bayesian network model for gene expression prediction by histone modifications. The results of our study reveal mechanisms for histone modifications to regulate transcription in large-scale. Among our findings, the cooperation among histone modifications provides new evidence for the hypothesis of histone code. Furthermore, the rules output by our method can be used to predict the change of gene expression.
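As a rough illustration of what an association rule between histone modifications and expression looks like, the sketch below computes the two standard rule statistics, support and confidence, on a toy data set. It is not the authors' framework: the gene-to-mark assignments are invented, the label names `expr_high` and `expr_low` are hypothetical, and no statistical significance test is performed.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        raise ValueError("at least one transaction is required")
    wanted = frozenset(items)
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(
    transactions: List[Transaction],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    base = support(transactions, a)
    if base == 0.0:
        return 0.0  # the rule never fires in this data set
    return support(transactions, a | c) / base


# Invented toy data: one "transaction" per gene, listing the marks assumed
# to be present at that gene plus a discretized expression label.
TOY_GENES: List[Transaction] = [
    frozenset({"H3K4me3", "H3K9ac", "expr_high"}),
    frozenset({"H3K4me3", "H3K9ac", "expr_high"}),
    frozenset({"H3K4me3", "expr_low"}),
    frozenset({"H3K9me3", "expr_low"}),
    frozenset({"H3K4me3", "H3K9ac", "expr_low"}),
]

if __name__ == "__main__":
    rule_if = {"H3K4me3", "H3K9ac"}
    rule_then = {"expr_high"}
    print("support:", support(TOY_GENES, rule_if | rule_then))
    print("confidence:", confidence(TOY_GENES, rule_if, rule_then))
```

A rule-mining procedure would enumerate many candidate antecedents and keep those whose support and confidence exceed chosen thresholds; combinations of marks that score higher than either mark alone are the kind of cooperative pattern the abstract refers to.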
Affiliation(s)
- Haifen Chen
- School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore.
- Stefano Lonardi
- Department of Computer Science and Engineering, University of California Riverside, 900 University Avenue, Riverside, CA 92521, USA.
- Jie Zheng
- School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore; Genome Institute of Singapore, A*STAR (Agency for Science, Technology, and Research), Biopolis, Singapore 138672, Singapore.
43
Comoglio F, Paro R. Combinatorial modeling of chromatin features quantitatively predicts DNA replication timing in Drosophila. PLoS Comput Biol 2014; 10:e1003419. [PMID: 24465194 PMCID: PMC3900380 DOI: 10.1371/journal.pcbi.1003419] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 07/15/2013] [Accepted: 11/18/2013] [Indexed: 01/14/2023] Open
Abstract
In metazoans, each cell type follows a characteristic, spatio-temporally regulated DNA replication program. Histone modifications (HMs) and chromatin binding proteins (CBPs) are fundamental for a faithful progression and completion of this process. However, no individual HM is strictly indispensable for origin function, suggesting that HMs may act combinatorially in analogy to the histone code hypothesis for transcriptional regulation. In contrast to gene expression, however, the relationship between combinations of chromatin features and DNA replication timing has not yet been demonstrated. Here, by exploiting a comprehensive data collection consisting of 95 CBPs and HMs, we investigated their combinatorial potential for the prediction of DNA replication timing in Drosophila using quantitative statistical models. We found that while combinations of CBPs exhibit moderate predictive power for replication timing, pairwise interactions between HMs lead to accurate predictions genome-wide that can be locally further improved by CBPs. Independent feature importance and model analyses led us to derive a simplified, biologically interpretable model of the relationship between chromatin landscape and replication timing reaching 80% of the full model accuracy using six model terms. Finally, we show that pairwise combinations of HMs are able to predict differential DNA replication timing across different cell types. All in all, our work provides support for the existence of combinatorial HM patterns for DNA replication and reveals cell-type independent key elements thereof, whose experimental investigation might help to elucidate the regulatory mode of this fundamental cellular process.
Affiliation(s)
- Federico Comoglio
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Renato Paro
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Faculty of Science, University of Basel, Basel, Switzerland