1
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021; 22:288. [PMID: 34635147 PMCID: PMC8504070 DOI: 10.1186/s13059-021-02506-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 03/16/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022]
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Affiliation(s)
- Xinzhou Ge
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Yiling Elaine Chen
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Dongyuan Song
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- MeiLu McDermott
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
  - The Quantitative and Computational Biology section, University of Southern California, Los Angeles, 90089, CA, USA
- Kyla Woyshner
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Antigoni Manousopoulou
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Ning Wang
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Wei Li
  - Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
- Leo D Wang
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095, CA, USA
2
ChIP-GSM: Inferring active transcription factor modules to predict functional regulatory elements. PLoS Comput Biol 2021; 17:e1009203. [PMID: 34292930 PMCID: PMC8330942 DOI: 10.1371/journal.pcbi.1009203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2020] [Revised: 08/03/2021] [Accepted: 06/20/2021] [Indexed: 11/19/2022]
Abstract
Transcription factors (TFs) often function as a module including both master factors and mediators binding at cis-regulatory regions to modulate nearby gene transcription. ChIP-seq profiling of multiple TFs makes it feasible to infer functional TF modules. However, when inferring TF modules based on co-localization of ChIP-seq peaks, often many weak binding events are missed, especially for mediators, resulting in incomplete identification of modules. To address this problem, we develop a ChIP-seq data-driven Gibbs Sampler to infer Modules (ChIP-GSM) using a Bayesian framework that integrates ChIP-seq profiles of multiple TFs. ChIP-GSM samples read counts of module TFs iteratively to estimate the binding potential of a module to each region and, across all regions, estimates the module abundance. Using inferred module-region probabilistic bindings as feature units, ChIP-GSM then employs logistic regression to predict active regulatory elements. Validation of ChIP-GSM predicted regulatory regions on multiple independent datasets sharing the same context confirms the advantage of using TF modules for predicting regulatory activity. In a case study of K562 cells, we demonstrate that the ChIP-GSM inferred modules form as groups, activate gene expression at different time points, and mediate diverse functional cellular processes. Hence, ChIP-GSM infers biologically meaningful TF modules and improves the prediction accuracy of regulatory region activities.
3
Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics 2021; 37:596-602. [PMID: 32991679 DOI: 10.1093/bioinformatics/btaa845] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/05/2020] [Accepted: 09/14/2020] [Indexed: 01/30/2023]
Abstract
MOTIVATION: The discovery of sequence motifs mediating DNA-protein binding usually implies the determination of binding sites using high-throughput sequencing and peak calling. The determination of peaks, however, depends strongly on data quality and is susceptible to noise. RESULTS: Here, we present a novel approach to reliably identify transcription factor-binding motifs from ChIP-Seq data without peak detection. By evaluating the distributions of sequencing reads around the different k-mers in the genome, we are able to identify binding motifs in ChIP-Seq data that yield no results in traditional pipelines. AVAILABILITY AND IMPLEMENTATION: NoPeak is published under the GNU General Public License and available as a standalone console-based Java application at https://github.com/menzel/nopeak. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michael Menzel
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
- Sabine Hurka
  - Institute for Insect Biotechnology, Justus Liebig University, Giessen 35392, Germany
- Stefan Glasenhardt
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
- Andreas Gogol-Döring
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
4
Zheng A, Lamkin M, Qiu Y, Ren K, Goren A, Gymrek M. A flexible ChIP-sequencing simulation toolkit. BMC Bioinformatics 2021; 22:201. [PMID: 33879052 PMCID: PMC8056602 DOI: 10.1186/s12859-021-04097-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/08/2021] [Accepted: 03/22/2021] [Indexed: 11/17/2022]
Abstract
Background: A major challenge in evaluating quantitative ChIP-seq analyses, such as peak calling and differential binding, is a lack of reliable ground truth data. Accurate simulation of ChIP-seq data can mitigate this challenge, but existing frameworks are either too cumbersome to apply genome-wide or unable to model a number of important experimental conditions in ChIP-seq. Results: We present ChIPs, a toolkit for rapidly simulating ChIP-seq data using statistical models of key experimental steps. We demonstrate how ChIPs can be used for a range of applications, including benchmarking analysis tools and evaluating the impact of various experimental parameters. ChIPs is implemented as a standalone command-line program written in C++ and is available from https://github.com/gymreklab/chips. Conclusions: ChIPs is an efficient ChIP-seq simulation framework that generates realistic datasets over a flexible range of experimental conditions. It can serve as an important component in various ChIP-seq analyses where ground truth data are needed. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04097-5.
Affiliation(s)
- An Zheng
  - Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Michael Lamkin
  - Department of Bioengineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Yutong Qiu
  - Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
  - School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, USA
- Kevin Ren
  - Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA
- Alon Goren
  - Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
- Melissa Gymrek
  - Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
  - Department of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, USA
5
Subkhankulova T, Naumenko F, Tolmachov OE, Orlov YL. Novel ChIP-seq simulating program with superior versatility: isChIP. Brief Bioinform 2020; 22:6035271. [PMID: 33320934 DOI: 10.1093/bib/bbaa352] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/11/2020] [Revised: 10/18/2020] [Accepted: 11/03/2020] [Indexed: 12/13/2022]
Abstract
Chromatin immunoprecipitation followed by next-generation sequencing (ChIP-seq) is recognized as an extremely powerful tool to study the interaction of numerous transcription factors and other chromatin-associated proteins with DNA. The core problem in the optimization of the ChIP-seq protocol and the subsequent computational data analysis is that a 'true' pattern of binding events for a given protein factor is unknown. Computer simulation of the ChIP-seq process based on an 'a priori known binding template' can contribute to drastically reducing the number of wet-lab experiments and ultimately help achieve radical optimization of the entire processing pipeline. We present a newly developed ChIP-sequencing simulation algorithm implemented in the novel software, in silico ChIP-seq (isChIP). We demonstrate that isChIP closely approximates real ChIP-seq protocols and is able to model data similar to those obtained from experimental sequencing. We validated isChIP using publicly available datasets generated for well-characterized transcription factors Oct4 and Sox2. Although the novel software is compatible with the Illumina protocols by default, it can also successfully perform simulations with a number of alternative sequencing platforms such as Roche 454, Ion Torrent and SOLiD, as well as model ChIP-exo. The versatility of isChIP was demonstrated through modelling a wide range of binding events, including those of transcription factors and chromatin modifiers. We also performed a comparative analysis against a few existing ChIP-seq simulators and showed the fundamental superiority of our model. Due to its ability to utilize known binding templates, isChIP can potentially be employed to help investigators choose the most appropriate analytical software through benchmarking of available ChIP-seq programs and to optimize the experimental parameters of the ChIP-seq protocol. isChIP software is freely available at https://github.com/fnaumenko/isChIP.
Affiliation(s)
- Yuriy L Orlov
- Digital Health Institute, I.M. Sechenov First Moscow State Medical University (Sechenov University), and Senior Scientist at Agrarian and Technological Institute, Peoples' Friendship University of Russia (RUDN University), Russia
6
Todd CD, Deniz Ö, Taylor D, Branco MR. Functional evaluation of transposable elements as enhancers in mouse embryonic and trophoblast stem cells. eLife 2019; 8:e44344. [PMID: 31012843 PMCID: PMC6544436 DOI: 10.7554/elife.44344] [Citation(s) in RCA: 87] [Impact Index Per Article: 17.4] [Received: 12/12/2018] [Accepted: 04/20/2019] [Indexed: 12/18/2022]
Abstract
Transposable elements (TEs) are thought to have helped establish gene regulatory networks. Both the embryonic and extraembryonic lineages of the early mouse embryo have seemingly co-opted TEs as enhancers, but there is little evidence that they play significant roles in gene regulation. Here we tested a set of long terminal repeat TE families for roles as enhancers in mouse embryonic and trophoblast stem cells. Epigenomic and transcriptomic data suggested that a large number of TEs helped to establish tissue-specific gene expression programmes. Genetic editing of individual TEs confirmed a subset of these regulatory relationships. However, a wider survey via CRISPR interference of RLTR13D6 elements in embryonic stem cells revealed that only a minority play significant roles in gene regulation. Our results suggest that a subset of TEs are important for gene regulation in early mouse development, and highlight the importance of functional experiments when evaluating gene regulatory roles of TEs.
Affiliation(s)
- Christopher D Todd
  - Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Genomic Health, Life Sciences Institute, Queen Mary University of London, London, United Kingdom
- Özgen Deniz
  - Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - Centre for Genomic Health, Life Sciences Institute, Queen Mary University of London, London, United Kingdom
- Darren Taylor
  - Centre for Genomic Health, Life Sciences Institute, Queen Mary University of London, London, United Kingdom
- Miguel R Branco
  - Centre for Genomic Health, Life Sciences Institute, Queen Mary University of London, London, United Kingdom