1
Cheng JH, Zheng C, Yamada R, Okada D. Visualization of the landscape of the read alignment shape of ATAC-seq data using Hellinger distance metric. Genes Cells 2024; 29:5-16. [PMID: 37989133 DOI: 10.1111/gtc.13082] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/08/2023] [Revised: 10/25/2023] [Accepted: 10/28/2023] [Indexed: 11/23/2023]
Abstract
Assay for Transposase-Accessible Chromatin using high-throughput sequencing (ATAC-seq) is a popular technique that uses next-generation sequencing (NGS) to measure chromatin accessibility and identify open chromatin regions. While read alignment shape information of NGS data, combined with intensity information, has been used in various bioinformatics methods, few studies have focused on pure shape information alone. In this study, we investigated what types of ATAC-seq read alignment shapes are observed for the promoter region and whether pure shape information is related to other gene features. We introduced a novel concept and pipeline for handling the pure shape information of NGS data as probability distributions and quantifying their dissimilarities by information theory. Based on this concept, we demonstrate that the pure shape information of ATAC-seq data is correlated with chromatin openness and some gene characteristics. On the other hand, the results suggest that the pure shape information of ATAC-seq read alignments is unlikely to contain additional information that explains differences in RNA expression. Our study suggests that viewing the read alignment shape of NGS data as probability distributions enables us to capture the characteristics of the genome-wide landscape of such data in a non-parametric manner.
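The idea of treating a read alignment shape as a probability distribution and comparing shapes with the Hellinger distance can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the five-bin toy coverage vectors, and the choice to normalize raw per-bin coverage are all assumptions made for the example.

```python
import math


def to_distribution(coverage):
    """Normalize per-bin read coverage so it sums to 1 (keeps shape, drops intensity)."""
    total = sum(coverage)
    if total <= 0:
        raise ValueError("coverage must contain at least one positive value")
    return [c / total for c in coverage]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical promoter coverage profiles (toy values, not real data).
narrow = to_distribution([0, 1, 8, 1, 0])
scaled = to_distribution([0, 10, 80, 10, 0])  # same shape, ten times the depth
broad = to_distribution([2, 2, 2, 2, 2])

print(hellinger(narrow, scaled))  # same shape, so the distance is zero
print(hellinger(narrow, broad))   # different shapes, so the distance is positive
```

Because each profile is normalized before comparison, two regions with the same shape but different read depth are at distance zero, which is what "pure shape information" means in this sketch. The distance is bounded between 0 and 1.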
Affiliation(s)
- Jian Hao Cheng
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Cheng Zheng
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Ryo Yamada
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Daigo Okada
  - Center for Genomics Medicine, Graduate School of Medicine, Kyoto University, Kyoto, Japan
2
Jalili V, Cremona MA, Palluzzi F. Rescuing biologically relevant consensus regions across replicated samples. BMC Bioinformatics 2023; 24:240. [PMID: 37286963 PMCID: PMC10246347 DOI: 10.1186/s12859-023-05340-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 05/16/2023] [Indexed: 06/09/2023] Open
Abstract
BACKGROUND Protein-DNA binding sites of ChIP-seq experiments are identified where the binding affinity is significant based on a given threshold. The choice of the threshold is a trade-off between conservative region identification and discarding weak, but true binding sites. RESULTS We rescue weak binding sites using MSPC, which efficiently exploits replicates to lower the threshold required to identify a site while keeping a low false-positive rate, and we compare it to IDR, a widely used post-processing method for identifying highly reproducible peaks across replicates. We observe several master transcription regulators (e.g., SP1 and GATA3) and HDAC2-GATA1 regulatory networks on rescued regions in the K562 cell line. CONCLUSIONS We argue the biological relevance of weak binding sites and the information they add when rescued by MSPC. An implementation of the proposed extended MSPC methodology and the scripts to reproduce the performed analysis are freely available at https://genometric.github.io/MSPC/; MSPC is distributed as a command-line application and an R package available from Bioconductor (https://doi.org/doi:10.18129/B9.bioc.rmspc).
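How replicates can lower the threshold needed to call a weak site can be illustrated with a combined p-value. The sketch below uses Fisher's method as an assumed combining rule; the abstract does not state which statistic or thresholds MSPC applies, so the function names and the threshold value are hypothetical and this is not MSPC's API.

```python
import math


def chi2_sf_even_dof(x, dof):
    """Chi-square survival function, closed form valid for even degrees of freedom."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method (statistic has 2k dof)."""
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_dof(statistic, 2 * len(pvalues))


def is_rescued(replicate_pvalues, stringent_threshold):
    """Hypothetical rule: keep a site if its combined evidence passes the threshold."""
    return fisher_combined_pvalue(replicate_pvalues) <= stringent_threshold


# A peak with p = 1e-5 in each of two replicates fails a 1e-8 cutoff on its own,
# but the combined evidence from both replicates passes it.
print(is_rescued([1e-5], 1e-8))
print(is_rescued([1e-5, 1e-5], 1e-8))
```

The example assumes the replicate p-values are independent, which is the condition under which Fisher's statistic follows a chi-square distribution.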
Affiliation(s)
- Vahid Jalili
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Marzia A Cremona
  - Department of Operations and Decision Systems, Université Laval, Quebec, Canada.
  - CHU de Québec - Université Laval Research Center, Quebec, Canada.
- Fernando Palluzzi
  - Department of Brain and Behavioral Sciences, Università di Pavia, Pavia, Italy.
3
The Hypersaline Archaeal Histones HpyA and HstA Are DNA Binding Proteins That Defy Categorization According to Commonly Used Functional Criteria. mBio 2023; 14:e0344922. [PMID: 36779711 PMCID: PMC10128011 DOI: 10.1128/mbio.03449-22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/14/2023] Open
Abstract
Histone proteins are found across diverse lineages of Archaea, many of which package DNA and form chromatin. However, previous research has led to the hypothesis that the histone-like proteins of high-salt-adapted archaea, or halophiles, function differently. The sole histone protein encoded by the model halophilic species Halobacterium salinarum, HpyA, is nonessential and expressed at levels too low to enable genome-wide DNA packaging. Instead, HpyA mediates the transcriptional response to salt stress. Here we compare the features of genome-wide binding of HpyA to those of HstA, the sole histone of another model halophile, Haloferax volcanii. hstA, like hpyA, is a nonessential gene. To better understand HpyA and HstA functions, protein-DNA binding data (chromatin immunoprecipitation sequencing [ChIP-seq]) of these halophilic histones are compared to publicly available ChIP-seq data from DNA binding proteins across all domains of life, including transcription factors (TFs), nucleoid-associated proteins (NAPs), and histones. These analyses demonstrate that HpyA and HstA bind the genome infrequently in discrete regions, which is similar to TFs but unlike NAPs, which bind a much larger genomic fraction. However, unlike TFs that typically bind in intergenic regions, HpyA and HstA binding sites are located in both coding and intergenic regions. The genome-wide dinucleotide periodicity known to facilitate histone binding was undetectable in the genomes of both species. Instead, TF-like and histone-like binding sequence preferences were detected for HstA and HpyA, respectively. Taken together, these data suggest that halophilic archaeal histones are unlikely to facilitate genome-wide chromatin formation and that their function defies categorization as a TF, NAP, or histone. IMPORTANCE Most cells in eukaryotic species, from yeast to humans, possess histone proteins that pack and unpack DNA in response to environmental cues. These essential proteins regulate genes necessary for important cellular processes, including development and stress protection. Although the histone fold domain originated in the domain of life Archaea, the function of archaeal histone-like proteins is not well understood relative to those of eukaryotes. We recently discovered that, unlike histones of eukaryotes, histones in hypersaline-adapted archaeal species do not package DNA and can act as transcription factors (TFs) to regulate stress response gene expression. However, the function of histones across species of hypersaline-adapted archaea still remains unclear. Here, we compare hypersaline histone function to a variety of DNA binding proteins across the tree of life, revealing histone-like behavior in some respects and specific transcriptional regulatory function in others.
4
Cremona MA, Chiaromonte F. Probabilistic K-means with local alignment for clustering and motif discovery in functional data. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2156522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Affiliation(s)
- Marzia A. Cremona
  - Dept. of Operations and Decision Systems, Université Laval, CHU de Québec – Université Laval Research Center
- Francesca Chiaromonte
  - Dept. of Statistics, The Pennsylvania State University, Inst. of Economics and EMbeDS, Sant’Anna School of Advanced Studies
5
Heyl F, Backofen R. StoatyDive: Evaluation and classification of peak profiles for sequencing data. Gigascience 2021; 10:giab045. [PMID: 34143874 PMCID: PMC8212874 DOI: 10.1093/gigascience/giab045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/23/2020] [Revised: 11/26/2020] [Accepted: 05/14/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND The prediction of binding sites (peak-calling) is a common task in the data analysis of methods such as cross-linking immunoprecipitation in combination with high-throughput sequencing (CLIP-Seq). The predicted binding sites are often further analyzed to predict sequence motifs or structure patterns. When looking at a typical result of such high-throughput experiments, the obtained peak profiles differ largely on a genomic level. Thus, a tool is missing that evaluates and classifies the predicted peaks on the basis of their shapes. We hereby present StoatyDive, a tool that can be used to filter for specific peak profile shapes of sequencing data such as CLIP. FINDINGS With StoatyDive we are able to classify peak profile shapes from CLIP-seq data of the histone stem-loop-binding protein (SLBP). We compare the results to existing tools and show that StoatyDive finds more distinct peak shape clusters for CLIP data. Furthermore, we present StoatyDive's capabilities as a quality control tool and as a filter to pick different shapes based on biological or technical questions for other CLIP data from different RNA binding proteins with different biological functions and numbers of RNA recognition motifs. We finally show that proteins involved in splicing, such as RBM22 and U2AF1, have potentially sharper-shaped peaks than other RNA binding proteins. CONCLUSION StoatyDive finally fills the demand for a peak shape clustering tool for CLIP-Seq data that fine-tunes downstream analysis steps such as structure or sequence motif predictions and that acts as a quality control.
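One simple way to put a number on how "sharp" a peak profile is, in the spirit of evaluating peaks by shape, is the coefficient of variation of the per-position coverage. This is an illustrative assumption only: the abstract does not describe StoatyDive's scoring, and the function name and toy profiles below are hypothetical.

```python
import math


def coefficient_of_variation(profile):
    """Population standard deviation of a coverage profile divided by its mean."""
    if not profile:
        raise ValueError("profile must not be empty")
    mean = sum(profile) / len(profile)
    if mean == 0:
        raise ValueError("profile must have a non-zero mean")
    variance = sum((x - mean) ** 2 for x in profile) / len(profile)
    return math.sqrt(variance) / mean


# Toy per-position read counts across two peaks of equal total coverage.
sharp_peak = [0, 0, 10, 0, 0]
flat_peak = [2, 2, 2, 2, 2]

print(coefficient_of_variation(sharp_peak))  # high value: coverage concentrated at one position
print(coefficient_of_variation(flat_peak))   # zero: coverage spread evenly
```

Dividing by the mean makes the score independent of overall read depth, so peaks can be ranked or filtered by shape rather than by height.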
Affiliation(s)
- Florian Heyl
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
  - Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Schaenzlestr. 18, 79104 Freiburg, Germany
6
Eicher T, Chan J, Luu H, Machiraju R, Mathé EA. Self-organizing maps with variable neighborhoods facilitate learning of chromatin accessibility signal shapes associated with regulatory elements. BMC Bioinformatics 2021; 22:35. [PMID: 33516170 PMCID: PMC7847148 DOI: 10.1186/s12859-021-03976-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2020] [Accepted: 01/21/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Assigning chromatin states genome-wide (e.g. promoters and enhancers) is commonly performed to improve functional interpretation of these states. However, computational methods to assign chromatin state suffer from the following drawbacks: they typically require data from multiple assays, which may not be practically feasible to obtain, and they depend on peak calling algorithms, which require careful parameterization and often exclude the majority of the genome. To address these drawbacks, we propose a novel learning technique built upon the Self-Organizing Map (SOM), Self-Organizing Map with Variable Neighborhoods (SOM-VN), to learn a set of representative shapes from a single, genome-wide, chromatin accessibility dataset to associate with a chromatin state assignment in which a particular regulatory element (RE) is prevalent. These shapes can then be used to assign chromatin state using our workflow. RESULTS We validate the performance of the SOM-VN workflow on 14 different samples of varying quality, namely one assay each of A549 and GM12878 cell lines and two each of H1 and HeLa cell lines, primary B-cells, and brain, heart, and stomach tissue. We show that SOM-VN learns shapes that are (1) non-random, (2) associated with known chromatin states, (3) generalizable across sets of chromosomes, and (4) associated with magnitude and multimodality. We compare the accuracy of SOM-VN chromatin states against the Clustering Aggregation Tool (CAGT), an unsupervised method that learns chromatin accessibility signal shapes but does not associate these shapes with REs, and we show that overall precision and recall are increased when learning shapes using SOM-VN as compared to CAGT. We further compare enhancer state assignments from SOM-VN in signals above a set threshold to enhancer state assignments from Predicting Enhancers from ATAC-seq Data (PEAS), a deep learning method that assigns enhancer chromatin states to peaks. We show that the precision-recall area under the curve for the assignment of enhancer states is comparable to PEAS. CONCLUSIONS Our work shows that the SOM-VN workflow can learn relationships between REs and chromatin accessibility signal shape, which is an important step toward the goal of assigning and comparing enhancer state across multiple experiments and phenotypic states.
Affiliation(s)
- Tara Eicher
  - Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA
  - Department of Computer Science and Engineering, The Ohio State University College of Engineering, 2015 Neil Avenue, Columbus, OH, 43210, USA
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, National Institute of Health, 9800 Medical Center Dr., Rockville, MD, 20892, USA
- Jany Chan
  - Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA
- Han Luu
  - Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA
- Raghu Machiraju
  - Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA.
  - Department of Computer Science and Engineering, The Ohio State University College of Engineering, 2015 Neil Avenue, Columbus, OH, 43210, USA.
  - Department of Pathology, The Ohio State University College of Medicine, 1645 Neil Ave, Columbus, OH, 43210, USA.
  - Translational Data Analytics Institute, The Ohio State University, 1760 Neil Ave., Columbus, OH, 43210, USA.
- Ewy A Mathé
  - Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W. 9th Avenue, Columbus, OH, 43210, USA.
  - Division of Preclinical Innovation, National Center for Advancing Translational Sciences, National Institute of Health, 9800 Medical Center Dr., Rockville, MD, 20892, USA.
7
Khilji S, Hamed M, Chen J, Li Q. Dissecting myogenin-mediated retinoid X receptor signaling in myogenic differentiation. Commun Biol 2020; 3:315. [PMID: 32555436 PMCID: PMC7303199 DOI: 10.1038/s42003-020-1043-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/12/2019] [Accepted: 05/21/2020] [Indexed: 11/18/2022] Open
Abstract
Deciphering the molecular mechanisms underpinning myoblast differentiation is a critical step in developing the best strategy to promote muscle regeneration in patients suffering from muscle-related diseases. We have previously established that a retinoid X receptor (RXR)-selective agonist, bexarotene, enhances the differentiation and fusion of myoblasts through a direct regulation of MyoD expression, coupled with an augmentation of myogenin protein. Here, we found that RXR signaling associates with the distribution of myogenin at poised enhancers and a distinct E-box motif. We also found an association of myogenin with rexinoid-responsive gene expression and identified an epigenetic signature related to histone acetyltransferase p300. Moreover, RXR signaling augments residue-specific histone acetylation at enhancers co-occupied by p300 and myogenin. Thus, genomic distribution of transcriptional regulators is an important designate for identifying novel targets as well as developing therapeutics that modulate the epigenetic landscape in a selective manner to promote muscle regeneration.
Affiliation(s)
- Saadia Khilji
  - Department of Cellular and Molecular Medicine and Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
- Munerah Hamed
  - Department of Cellular and Molecular Medicine and Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
- Jihong Chen
  - Department of Pathology and Laboratory Medicine, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
- Qiao Li
  - Department of Cellular and Molecular Medicine and Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.
  - Department of Pathology and Laboratory Medicine, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.
8
Witmer K, Fraschka SA, Vlachou D, Bártfai R, Christophides GK. An epigenetic map of malaria parasite development from host to vector. Sci Rep 2020; 10:6354. [PMID: 32286373 PMCID: PMC7156373 DOI: 10.1038/s41598-020-63121-5] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 11/19/2019] [Accepted: 03/24/2020] [Indexed: 12/23/2022] Open
Abstract
The malaria parasite replicates asexually in the red blood cells of its vertebrate host employing epigenetic mechanisms to regulate gene expression in response to changes in its environment. We used chromatin immunoprecipitation followed by sequencing in conjunction with RNA sequencing to create an epigenomic and transcriptomic map of the developmental transition from asexual blood stages to male and female gametocytes and to ookinetes in the rodent malaria parasite Plasmodium berghei. Across the developmental stages examined, heterochromatin protein 1 associates with variantly expressed gene families localised at subtelomeric regions and variant gene expression based on heterochromatic silencing is observed only in some genes. Conversely, the euchromatin mark histone 3 lysine 9 acetylation (H3K9ac) is abundant in non-heterochromatic regions across all developmental stages. H3K9ac presents a distinct pattern of enrichment around the start codon of ribosomal protein genes in all stages but male gametocytes. Additionally, H3K9ac occupancy positively correlates with transcript abundance in all stages but female gametocytes suggesting that transcription in this stage is independent of H3K9ac levels. This finding together with known mRNA repression in female gametocytes suggests a multilayered mechanism operating in female gametocytes in preparation for fertilization and zygote development, coinciding with parasite transition from host to vector.
Affiliation(s)
- Kathrin Witmer
  - Department of Life Sciences, Imperial College London, SW7 2AZ, London, UK.
- Sabine A Fraschka
  - Department of Molecular Biology, Radboud University, 6525 GA, Nijmegen, The Netherlands.
  - Institute of Medical Genetics and Applied Genomics, University of Tübingen, 72076, Tübingen, Germany
- Dina Vlachou
  - Department of Life Sciences, Imperial College London, SW7 2AZ, London, UK
- Richárd Bártfai
  - Department of Molecular Biology, Radboud University, 6525 GA, Nijmegen, The Netherlands
9
Yamada N, Lai WKM, Farrell N, Pugh BF, Mahony S. Characterizing protein-DNA binding event subtypes in ChIP-exo data. Bioinformatics 2019; 35:903-913. [PMID: 30165373 DOI: 10.1093/bioinformatics/bty703] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 02/28/2018] [Revised: 07/14/2018] [Accepted: 08/23/2018] [Indexed: 01/21/2023] Open
Abstract
MOTIVATION Regulatory proteins associate with the genome either by directly binding cognate DNA motifs or via protein-protein interactions with other regulators. Each recruitment mechanism may be associated with distinct motifs and may also result in distinct characteristic patterns in high-resolution protein-DNA binding assays. For example, the ChIP-exo protocol precisely characterizes protein-DNA crosslinking patterns by combining chromatin immunoprecipitation (ChIP) with 5' → 3' exonuclease digestion. Since different regulatory complexes will result in different protein-DNA crosslinking signatures, analysis of ChIP-exo tag enrichment patterns should enable detection of multiple protein-DNA binding modes for a given regulatory protein. However, current ChIP-exo analysis methods either treat all binding events as being of a uniform type or rely on motifs to cluster binding events into subtypes. RESULTS To systematically detect multiple protein-DNA interaction modes in a single ChIP-exo experiment, we introduce the ChIP-exo mixture model (ChExMix). ChExMix probabilistically models the genomic locations and subtype memberships of binding events using both ChIP-exo tag distribution patterns and DNA motifs. We demonstrate that ChExMix achieves accurate detection and classification of binding event subtypes using in silico mixed ChIP-exo data. We further demonstrate the unique analysis abilities of ChExMix using a collection of ChIP-exo experiments that profile the binding of key transcription factors in MCF-7 cells. In these data, ChExMix identifies possible recruitment mechanisms of FoxA1 and ERα, thus demonstrating that ChExMix can effectively stratify ChIP-exo binding events into biologically meaningful subtypes. AVAILABILITY AND IMPLEMENTATION ChExMix is available from https://github.com/seqcode/chexmix. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Naomi Yamada
  - Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
- William K M Lai
  - Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
- Nina Farrell
  - Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
- B Franklin Pugh
  - Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, USA
10
Cremona MA, Xu H, Makova KD, Reimherr M, Chiaromonte F, Madrigal P. Functional data analysis for computational biology. Bioinformatics 2019; 35:3211-3213. [PMID: 30668667 PMCID: PMC6736445 DOI: 10.1093/bioinformatics/btz045] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/07/2018] [Revised: 01/01/2019] [Accepted: 01/17/2019] [Indexed: 12/25/2022] Open
Abstract
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marzia A Cremona
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Hongyan Xu
  - Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, GA, USA
- Kateryna D Makova
  - Department of Biology, The Pennsylvania State University, University Park, PA, USA
  - Center for Medical Genomics, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Matthew Reimherr
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
- Francesca Chiaromonte
  - Department of Statistics, The Pennsylvania State University, University Park, PA, USA
  - Institute of Economics, Sant’Anna School of Advanced Studies, EMbeDS Economics and Management in the era of Data Science, Pisa, Italy
- Pedro Madrigal
  - Wellcome Trust – MRC Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
  - Department of Haematology, University of Cambridge, Cambridge, UK
11
PREDICTD PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition. Nat Commun 2018; 9:1402. [PMID: 29643364 PMCID: PMC5895786 DOI: 10.1038/s41467-018-03635-9] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 04/10/2017] [Accepted: 03/02/2018] [Indexed: 11/24/2022] Open
Abstract
The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called “tensor decomposition” to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics. Assays to characterize the epigenome and interrogate chromatin state genome wide have so far been performed in a selected set of conditions. Here, Durham et al. develop a computational method based on tensor decomposition to impute missing experiments in collections of epigenomics experiments.
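The comparison by mean squared error, and the observation that combining two imputation methods can improve on either, can be illustrated with a toy calculation. The sketch assumes the combination is a plain per-position average, which the abstract does not specify; the signal values and function names are hypothetical and are not PREDICTD or ChromImpute output.

```python
def mean_squared_error(truth, predicted):
    """Average squared difference between an observed and an imputed signal track."""
    if not truth or len(truth) != len(predicted):
        raise ValueError("tracks must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)


def average_tracks(track_a, track_b):
    """Combine two imputed tracks by averaging them position by position."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must be of equal length")
    return [(a + b) / 2.0 for a, b in zip(track_a, track_b)]


# Toy signal values at four genomic positions.
observed = [1.0, 2.0, 3.0, 4.0]
imputed_a = [1.5, 2.5, 2.5, 4.5]  # stand-in for one method's output
imputed_b = [0.5, 2.5, 3.5, 3.5]  # stand-in for another method's output
combined = average_tracks(imputed_a, imputed_b)

print(mean_squared_error(observed, imputed_a))
print(mean_squared_error(observed, imputed_b))
print(mean_squared_error(observed, combined))
```

Averaging helps here because the two methods err in opposite directions at most positions. In general the error of the average is never larger than the mean of the two individual errors, although it can still exceed the error of the better method.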
12
Knoedler JR, Subramani A, Denver RJ. The Krüppel-like factor 9 cistrome in mouse hippocampal neurons reveals predominant transcriptional repression via proximal promoter binding. BMC Genomics 2017; 18:299. [PMID: 28407733 PMCID: PMC5390390 DOI: 10.1186/s12864-017-3640-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 09/17/2016] [Accepted: 03/17/2017] [Indexed: 12/22/2022] Open
Abstract
Background Krüppel-like factor 9 (Klf9) is a zinc finger transcription factor that functions in neural cell differentiation, but little is known about its genomic targets or mechanism of action in neurons. Results We used the mouse hippocampus-derived neuronal cell line HT22 to identify genes regulated by Klf9, and we validated our findings in mouse hippocampus. We engineered HT22 cells to express a Klf9 transgene under control of the tetracycline repressor, and used RNA sequencing to identify genes modulated by Klf9. We found 217 genes repressed and 21 induced by Klf9. We also engineered HT22 cells to co-express biotin ligase and a Klf9 fusion protein containing an N-terminal biotin ligase recognition peptide. Using chromatin-streptavidin precipitation (ChSP) sequencing we identified 3,514 genomic regions where Klf9 associated. Seventy-five percent of these were within 1 kb of transcription start sites, and Klf9 associated in chromatin with 60% of the repressed genes. We analyzed the promoters of several repressed genes containing Klf9 ChSP peaks using transient transfection reporter assays and found that Klf9 repressed promoter activity, which was abolished after mutation of Sp/Klf-like motifs. Knockdown or knockout of Klf9 in HT22 cells caused dysregulation of Klf9 target genes. Chromatin immunoprecipitation assays showed that Klf9 associated in chromatin from mouse hippocampus with genes identified by ChSP sequencing on HT22 cells, and expression of Klf9 target genes was dysregulated in the hippocampus of neonatal Klf9-null mice. Gene ontology analysis revealed that Klf9 genomic targets include genes involved in cytoskeletal remodeling, Wnt signaling and inflammation. Conclusions We have identified genomic targets of Klf9 in hippocampal neurons and created a foundation for future studies on how it functions in chromatin, and regulates neuronal morphology and survival across the lifespan.
Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3640-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Joseph R Knoedler
  - Neuroscience Graduate Program, The University of Michigan, Ann Arbor, MI, 48109, USA.
  - Current address: Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, 94305, USA
- Arasakumar Subramani
  - Department of Molecular, Cellular and Developmental Biology, The University of Michigan, 3065C Kraus Natural Science Building, Ann Arbor, MI, 48109, USA
- Robert J Denver
  - Neuroscience Graduate Program, The University of Michigan, Ann Arbor, MI, 48109, USA.
  - Department of Molecular, Cellular and Developmental Biology, The University of Michigan, 3065C Kraus Natural Science Building, Ann Arbor, MI, 48109, USA.
13
Huang BFF, Boutros PC. The parameter sensitivity of random forests. BMC Bioinformatics 2016; 17:331. [PMID: 27586051 PMCID: PMC5009551 DOI: 10.1186/s12859-016-1228-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 12/15/2015] [Accepted: 08/26/2016] [Indexed: 02/07/2023] Open
Abstract
Background The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter-sensitivity of RFs in computational genomic studies. We address this gap here. Results We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter optimization significantly improves upon the default parameters. Conclusions Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by tuning RF models away from their default parameter settings. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Barbara F F Huang
- Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada
- Paul C Boutros
- Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Canada
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada
- MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, M5G 0A3, Canada
|
14
|
Campos-Sánchez R, Cremona MA, Pini A, Chiaromonte F, Makova KD. Integration and Fixation Preferences of Human and Mouse Endogenous Retroviruses Uncovered with Functional Data Analysis. PLoS Comput Biol 2016; 12:e1004956. [PMID: 27309962 PMCID: PMC4911145 DOI: 10.1371/journal.pcbi.1004956] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Received: 01/15/2016] [Accepted: 04/29/2016] [Indexed: 01/24/2023]
Abstract
Endogenous retroviruses (ERVs), the remnants of retroviral infections in the germ line, occupy ~8% and ~10% of the human and mouse genomes, respectively, and affect their structure, evolution, and function. Yet we still have a limited understanding of how the genomic landscape influences integration and fixation of ERVs. Here we conducted a genome-wide study of the most recently active ERVs in the human and mouse genome. We investigated 826 fixed and 1,065 in vitro HERV-Ks in human, and 1,624 fixed and 242 polymorphic ETns, as well as 3,964 fixed and 1,986 polymorphic IAPs, in mouse. We quantitated >40 human and mouse genomic features (e.g., non-B DNA structure, recombination rates, and histone modifications) in ±32 kb of these ERVs' integration sites and in control regions, and analyzed them using Functional Data Analysis (FDA) methodology. In one of the first applications of FDA in genomics, we identified genomic scales and locations at which these features display their influence, and how they work in concert, to provide signals essential for integration and fixation of ERVs. The investigation of ERVs of different evolutionary ages (young in vitro and polymorphic ERVs, older fixed ERVs) allowed us to disentangle integration vs. fixation preferences. As a result of these analyses, we built a comprehensive model explaining the uneven distribution of ERVs along the genome. We found that ERVs integrate in late-replicating AT-rich regions with abundant microsatellites, mirror repeats, and repressive histone marks. Regions favoring fixation are depleted of genes and evolutionarily conserved elements, and have low recombination rates, reflecting the effects of purifying selection and ectopic recombination removing ERVs from the genome. In addition to providing these biological insights, our study demonstrates the power of exploiting multiple scales and localization with FDA. These powerful techniques are expected to be applicable to many other genomic investigations.
Affiliation(s)
- Rebeca Campos-Sánchez
- Genetics Graduate Program, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Marzia A. Cremona
- MOX—Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
- Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
- Alessia Pini
- MOX—Modeling and Scientific Computing, Department of Mathematics, Politecnico di Milano, Milano, Italy
- Francesca Chiaromonte
- Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
- Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Kateryna D. Makova
- Center for Medical Genomics, The Huck Institutes of the Life Sciences, Penn State University, University Park, Pennsylvania, United States of America
- Department of Biology, Penn State University, University Park, Pennsylvania, United States of America
|