1
Abstract
Aims: To develop a robust and more accurate method for identifying transcription factor binding sites (TFBS) that regulate gene expression.
Background: Deep neural networks (DNNs) have shown promising growth in solving complex machine learning problems and are steadily replacing conventional techniques in computer vision, signal processing, healthcare, and genomics. Understanding DNA sequences is a crucial task in healthcare and regulatory genomics. For DNA motif prediction, choosing the right dataset with a sufficient number of input sequences is crucial to designing an effective model.
Objective: To design a new algorithm that works across different datasets while improving TFBS prediction performance.
Methods: With the help of Layerwise Relevance Propagation, the proposed algorithm identifies the invariant features with adaptive noise patterns.
Results: Performance is compared with standard as well as recent methods using various metrics, and a significant improvement is noted.
Conclusion: Identifying the invariant and robust features in DNA sequences can increase classification performance.
Affiliation(s)
- Kanu Geete
  - Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
- Manish Pandey
  - Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, India
2
Zhang Q, Zhu L, Huang DS. High-Order Convolutional Neural Network Architecture for Predicting DNA-Protein Binding Sites. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1184-1192. [PMID: 29993783 DOI: 10.1109/tcbb.2018.2819660]
Abstract
Although deep learning algorithms have outperformed conventional methods in predicting the sequence specificities of DNA-protein binding, they fail to consider the dependencies among nucleotides and the diverse binding lengths of different transcription factors (TFs). To address these two limitations simultaneously, we propose a high-order convolutional neural network architecture (HOCNN), which employs a high-order encoding method to build high-order dependencies among nucleotides and a multi-scale convolutional layer to capture motif features of different lengths. The experimental results on real ChIP-seq datasets show that the proposed method outperforms the state-of-the-art deep learning method (DeepBind) in the motif discovery task. In addition, we provide further insights into the importance of introducing additional convolutional kernels and the degeneration problem of importing high-order encoding in the motif discovery task.
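As a rough illustration of what a high-order encoding can look like, the sketch below one-hot encodes overlapping k-mers instead of single nucleotides, so each input column carries information about neighbouring bases. The abstract does not specify the exact HOCNN encoding; the function name `high_order_one_hot`, the overlapping-window scheme, and the all-zero handling of ambiguous bases are assumptions made for this example.

```python
from itertools import product


def high_order_one_hot(seq, order=2, alphabet="ACGT"):
    """One-hot encode the overlapping k-mers (k = order) of a DNA sequence.

    Hypothetical helper: an assumed encoding, not the published HOCNN one.
    Returns one row per k-mer position, each row of length 4**order.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=order)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    rows = []
    for start in range(len(seq) - order + 1):
        row = [0] * len(kmers)
        kmer = seq[start:start + order]
        if kmer in index:  # k-mers with ambiguous bases (e.g. N) stay all-zero
            row[index[kmer]] = 1
        rows.append(row)
    return rows


# A 5-base sequence yields 4 overlapping dinucleotides, each a 16-dim vector.
encoded = high_order_one_hot("ACGTA", order=2)
```

With `order=1` this reduces to the usual four-channel one-hot encoding, which makes it easy to compare the two inputs on the same model.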
3
Zhang H, Zhu L, Huang DS. DiscMLA: An Efficient Discriminative Motif Learning Algorithm over High-Throughput Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1810-1820. [PMID: 27164602 DOI: 10.1109/tcbb.2016.2561930]
Abstract
Transcription factors (TFs) can activate or suppress gene expression by binding to specific sites and hence are crucial regulatory elements for transcription. Recently, a series of discriminative motif finders has been tailored to offer a promising strategy for harnessing the power of large quantities of accumulated high-throughput experimental data. However, in order to achieve high speed, these algorithms have to sacrifice accuracy by employing simplified statistical models during the searching process. In this paper, we propose a novel approach named Discriminative Motif Learning via AUC (DiscMLA) to discover motifs on high-throughput datasets. Unlike previous approaches, DiscMLA optimizes a more comprehensive criterion (AUC) during motif searching. In addition, based on an experimental observation of motif identification on large-scale datasets, some novel procedures are designed to accelerate DiscMLA. The experimental results on 52 real-world datasets demonstrate that our approach substantially outperforms previous methods on discriminative motif learning problems. DiscMLA's stability, discriminability, and validity will help to exploit high-throughput datasets and answer many fundamental biological questions.
4
Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004]
5
Abstract
Motivation: The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the large amount of accumulated high-throughput binding data. However, they have to sacrifice accuracy for speed and can fail to fully utilize the information in the input sequences. Results: We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real-world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Meanwhile, preliminary results show that CDAUC may also be useful for improving the interpretability of convolutional kernels generated by the emerging deep learning approaches for predicting TF sequence specificities. Availability and Implementation: CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8. Supplementary information: Supplementary data are available at Bioinformatics online.
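To make the AUC criterion concrete, the following minimal sketch computes AUC as the fraction of (positive, negative) sequence pairs that a motif score ranks correctly, counting ties as half. It is a generic pairwise definition, not the CDAUC implementation; the function name `pairwise_auc` and the toy scores are assumptions for illustration.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Hypothetical helper; ties between a positive and a negative score
    count as half a correctly ranked pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy motif scores for bound (positive) and unbound (negative) sequences.
auc_value = pairwise_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

Because the value depends only on how the scores are ordered, changing a single motif parameter leaves the AUC unchanged until some pair swaps order, which is consistent with the piece-wise constant sub-problems mentioned in the abstract.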
Affiliation(s)
- Lin Zhu
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
- Hong-Bo Zhang
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, Department of College of Electronics and Information Engineering, Tongji University, Shanghai, China
6
Zhu L, Zhang HB, Huang DS. LMMO: A Large Margin Approach for Refining Regulatory Motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:913-925. [PMID: 28391205 DOI: 10.1109/tcbb.2017.2691325]
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, they usually have to sacrifice accuracy and may fail to fully leverage the potential of large datasets. Recently, it has been demonstrated that the motifs identified by DMDs can be significantly improved by maximizing the area under the receiver-operating characteristic curve (AUC), a metric widely used in the literature to rank the performance of elicited motifs. However, existing approaches for motif refinement directly maximize the non-convex and discontinuous AUC itself, which is known to be difficult and may lead to suboptimal solutions. In this paper, we propose Large Margin Motif Optimizer (LMMO), a large-margin-type algorithm for refining regulatory motifs. By relaxing the AUC cost function with the surrogate convex hinge loss, we show that the resultant learning problem can be cast as an instance of difference-of-convex (DC) programs, and we solve it iteratively using the constrained concave-convex procedure (CCCP). To further save computational time, we combine LMMO with existing techniques for improving the scalability of large-margin-type algorithms, such as the cutting plane method. Experimental evaluations on synthetic and real data illustrate the performance of the proposed approach. The code of LMMO is freely available at: https://github.com/ekffar/LMMO.
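The relaxation described in this abstract can be written out roughly as follows. This is the generic pairwise hinge surrogate for AUC, given only as a sketch; the abstract does not state the paper's exact objective, and the symbols $f_\theta$ (motif score with parameters $\theta$), $P$ and $N$ (index sets of positive and negative sequences), and the unit margin are assumptions introduced here.

```latex
\mathrm{AUC}(\theta)
  = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N}
    \mathbb{1}\big[\, f_\theta(x_i) > f_\theta(x_j) \,\big]

L_{\mathrm{hinge}}(\theta)
  = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N}
    \max\big(0,\; 1 - \big(f_\theta(x_i) - f_\theta(x_j)\big)\big)
```

Each hinge term is at least 1 whenever a pair is ranked incorrectly, so $L_{\mathrm{hinge}}(\theta) \ge 1 - \mathrm{AUC}(\theta)$: lowering the surrogate pushes up a lower bound on the AUC while replacing the discontinuous indicator with a function that is convex in the score difference.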
7
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Received: 12/15/2016] [Accepted: 05/02/2017]
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, most existing DMD methods, because of computational expense, adopt approximate schemes that greatly restrict the search space, leading to a significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of the resultant motifs. Meanwhile, by exploiting the connection between the DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Affiliation(s)
- Hongbo Zhang
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- Lin Zhu
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
- De-Shuang Huang
  - Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
8
Abstract
Motivation: Generating accurate transcription factor (TF) binding site motifs from data generated using next-generation sequencing, especially ChIP-seq, is challenging. The challenge arises because a typical experiment reports a large number of sequences bound by a TF, and each sequence is relatively long. Most traditional motif finders are slow in handling such an enormous amount of data. To overcome this limitation, tools have been developed that trade accuracy for speed by using heuristic discrete search strategies or limited optimization of identified seed motifs. However, such strategies may not fully use the information in input sequences to generate motifs. Such motifs often form good seeds and can be further improved with appropriate scoring functions and rapid optimization. Results: We report a tool named discriminative motif optimizer (DiMO). DiMO takes a seed motif along with a positive and a negative database and improves the motif based on a discriminative strategy. We use the area under the receiver-operating characteristic curve (AUC) as a measure of the discriminating power of motifs and a strategy based on perceptron training that maximizes AUC rapidly in a discriminative manner. Using DiMO on a large test set of 87 TFs from human, Drosophila and yeast, we show that it is possible to significantly improve motifs identified by nine motif finders. The motifs are generated/optimized using training sets and evaluated on test sets. The AUC is improved for almost 90% of the TFs on test sets and the magnitude of increase is up to 39%. Availability and Implementation: DiMO is available at http://stormo.wustl.edu/DiMO
Affiliation(s)
- Ronak Y Patel
  - Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108, USA
9
Wang D, Tapan S. MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC Systems Biology 2012; 6 Suppl 2:S4. [PMID: 23282090 PMCID: PMC3521183 DOI: 10.1186/1752-0509-6-s2-s4]
Abstract
Background: Computational approaches for finding DNA regulatory motifs in promoter sequences are useful to biologists because they reduce experimental costs and speed up the discovery of de novo binding sites. It is important for rule-based or clustering-based motif searching schemes to effectively and efficiently evaluate the similarity between a k-mer (a k-length subsequence) and a motif model, without assuming the independence of nucleotides in motif models and without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. It is also beneficial to use a priori knowledge in developing advanced searching tools. Results: This paper presents a new scoring function, termed MISCORE, for functional motif characterization and evaluation. MISCORE is free from: (i) any assumption on model dependency; and (ii) the use of a Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations against the well-known Maximum a Posteriori (MAP) score and Information Content (IC) have shown that MISCORE has promising capabilities to separate and recognize functional DNA motifs and their instances from non-functional ones. Conclusions: MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It allows a priori known motif models to be embedded for computing motif-to-motif similarity, which is an advantage over IC and the MAP score. In addition to these merits, MISCORE can automatically filter out some repetitive k-mers from a motif model because compositional complexity is built into the function. Consequently, the merits of MISCORE in terms of both motif signal modeling power and computational efficiency make it well suited to the development of computational motif discovery tools.
Affiliation(s)
- Dianhui Wang
  - Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia
10
Nandi S, Ioshikhes I. Optimizing the GATA-3 position weight matrix to improve the identification of novel binding sites. BMC Genomics 2012; 13:416. [PMID: 22913572 PMCID: PMC3481455 DOI: 10.1186/1471-2164-13-416] [Received: 11/08/2011] [Accepted: 08/02/2012]
Abstract
Background: Identifying binding sites for transcription factors is a key component of gene regulatory network analysis. This is often done using position-weight matrices (PWMs). Because of the importance of in silico mapping of tentative binding sites, we previously developed an approach for PWM optimization that substantially improves the accuracy of such mapping. Results: The present work applies the optimization algorithm to the existing PWM for the GATA-3 transcription factor and builds a new di-nucleotide PWM. The existing PWM is based on experimental data adopted from Jaspar. The optimized PWM substantially improves the sensitivity and specificity of the TF mapping compared to the conventional applications. The refined PWM also facilitates in silico identification of novel binding sites that are supported by experimental data. We also describe uncommon positioning of binding motifs for several T-cell lineage specific factors in human promoters. Conclusion: Our proposed di-nucleotide PWM approach outperforms the conventional mono-nucleotide PWM approach with respect to GATA-3. Therefore our new di-nucleotide PWM provides new insight into plausible transcriptional regulatory interactions in human promoters.
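For readers unfamiliar with PWM-based mapping, here is a minimal sketch of the conventional mono-nucleotide approach that the abstract uses as its baseline: build a log-odds matrix from aligned sites, then slide it along a sequence. The aligned sites are toy GATA-like strings invented for this example (not Jaspar data), and the helper names, the pseudocount of 1, and the uniform background of 0.25 are all assumptions. The paper's optimization procedure and di-nucleotide model are not reproduced.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((c / total) / background)
                    for b, c in counts.items()})
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds for one window of the sequence."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score_window(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Toy, made-up aligned sites; real use would start from curated data.
toy_sites = ["AGATAA", "TGATAA", "AGATAG", "AGATAA"]
toy_pwm = build_pwm(toy_sites)
top_score, top_start = best_hit(toy_pwm, "CCAGATAACC")
```

A di-nucleotide variant would index each column by adjacent base pairs (16 symbols instead of 4), which lets the matrix capture dependencies between neighbouring positions.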
Affiliation(s)
- Soumyadeep Nandi
  - Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- Ilya Ioshikhes
  - Ottawa Institute of Systems Biology and Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
11
Modular insulators: genome wide search for composite CTCF/thyroid hormone receptor binding sites. PLoS One 2010; 5:e10119. [PMID: 20404925 PMCID: PMC2852416 DOI: 10.1371/journal.pone.0010119] [Received: 09/10/2009] [Accepted: 03/18/2010]
Abstract
The conserved 11 zinc-finger protein CTCF is involved in several transcriptional mechanisms, including insulation and enhancer blocking. We had previously identified two composite elements consisting of a CTCF and a TR binding site at the chicken lysozyme and the human c-myc genes. Using these it has been demonstrated that thyroid hormone mediates the relief of enhancer blocking even though CTCF remains bound to its binding site. Here we wished to determine whether CTCF and TR combined sites are representative of a general feature of the genome, and whether such sites are functional in regulating enhancer blocking. Genome wide analysis revealed that about 18% of the CTCF regions harbored at least one of the four different palindromic or repeated sequence arrangements typical for the binding of TR homodimers or TR/RXR heterodimers. Functional analysis of 10 different composite elements of thyroid hormone responsive genes was performed using episomal constructs. The episomal system allowed recapitulating CTCF mediated enhancer blocking function to be dependent on poly (ADP)-ribose modification and to mediate histone deacetylation. Furthermore, thyroid hormone sensitive enhancer blocking could be shown for one of these new composite elements. Remarkably, not only did the regulation of enhancer blocking require functional TR binding, but also the basal enhancer blocking activity of CTCF was dependent on the binding of the unliganded TR. Thus, a number of composite CTCF/TR binding sites may represent a subset of other modular CTCF composite sites, such as groups of multiple CTCF sites or of CTCF/Oct4, CTCF/Kaiso or CTCF/Yy1 combinations.
12
Le T, Altman T, Gardiner K. HIGEDA: a hierarchical gene-set genetics based algorithm for finding subtle motifs in biological sequences. Bioinformatics 2010; 26:302-9. [PMID: 19996163 DOI: 10.1093/bioinformatics/btp676]
Abstract
Motivation: Identification of motifs in biological sequences is a challenging problem because such motifs are often short, degenerate, and may contain gaps. Most algorithms that have been developed for motif-finding use the expectation-maximization (EM) algorithm iteratively. Although EM algorithms can converge quickly, they depend strongly on initialization parameters and can converge to local sub-optimal solutions. In addition, they cannot generate gapped motifs. The effectiveness of EM algorithms in motif finding can be improved by incorporating methods that choose different sets of initial parameters to enable escape from local optima, and that allow gapped alignments within motif models. Results: We have developed HIGEDA, an algorithm that uses the hierarchical gene-set genetic algorithm (HGA) with EM to initiate and search for the best parameters for the motif model. In addition, HIGEDA can identify gapped motifs using a position weight matrix and dynamic programming to generate an optimal gapped alignment of the motif model with sequences from the dataset. We show that HIGEDA outperforms MEME and other motif-finding algorithms on both DNA and protein sequences. Availability and Implementation: Source code and test datasets are available for download at http://ouray.cudenver.edu/~tnle/, implemented in C++ and supported on Linux and MS Windows.
Affiliation(s)
- Thanh Le
  - Department of Computer Science and Engineering, Computational Biosciences Program, University of Colorado, Denver, CO, USA
13
Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature 2009; 462:65-70. [PMID: 19890324 DOI: 10.1038/nature08531] [Received: 07/21/2009] [Accepted: 09/22/2009]
Abstract
Development requires the establishment of precise patterns of gene expression, which are primarily controlled by transcription factors binding to cis-regulatory modules. Although transcription factor occupancy can now be identified at genome-wide scales, decoding this regulatory landscape remains a daunting challenge. Here we used a novel approach to predict spatio-temporal cis-regulatory activity based only on in vivo transcription factor binding and enhancer activity data. We generated a high-resolution atlas of cis-regulatory modules describing their temporal and combinatorial occupancy during Drosophila mesoderm development. The binding profiles of cis-regulatory modules with characterized expression were used to train support vector machines to predict five spatio-temporal expression patterns. In vivo transgenic reporter assays demonstrate the high accuracy of these predictions and reveal an unanticipated plasticity in transcription factor binding leading to similar expression. This data-driven approach does not require previous knowledge of transcription factor sequence affinity, function or expression, making it widely applicable.
14
Jordan JJ, Menendez D, Inga A, Nourredine M, Bell D, Resnick MA. Noncanonical DNA motifs as transactivation targets by wild type and mutant p53. PLoS Genet 2008; 4:e1000104. [PMID: 18714371 PMCID: PMC2518093 DOI: 10.1371/journal.pgen.1000104] [Received: 01/20/2008] [Accepted: 05/22/2008]
Abstract
Sequence-specific binding by the human p53 master regulator is critical to its tumor suppressor activity in response to environmental stresses. p53 binds as a tetramer to two decameric half-sites separated by 0–13 nucleotides (nt), originally defined by the consensus RRRCWWGYYY (n = 0–13) RRRCWWGYYY. To better understand the role of sequence, organization, and level of p53 on transactivation at target response elements (REs) by wild type (WT) and mutant p53, we deconstructed the functional p53 canonical consensus sequence using budding yeast and human cell systems. Contrary to early reports on binding in vitro, small increases in distance between decamer half-sites greatly reduce p53 transactivation, as demonstrated for the natural TIGER RE. This was confirmed with human cell extracts using a newly developed, semi–in vitro microsphere binding assay. These results contrast with the synergistic increase in transactivation from a pair of weak, full-site REs in the MDM2 promoter that are separated by an evolutionarily conserved 17 bp spacer. Surprisingly, there can be substantial transactivation at noncanonical ½-(a single decamer) and ¾-sites, some of which were originally classified as biologically relevant canonical consensus sequences including PIDD and Apaf-1. p53 family members p63 and p73 yielded similar results. Efficient transactivation from noncanonical elements requires tetrameric p53, and the presence of the carboxy terminal, non-specific DNA binding domain enhanced transactivation from noncanonical sequences. Our findings demonstrate that RE sequence, organization, and level of p53 can strongly impact p53-mediated transactivation, thereby changing the view of what constitutes a functional p53 target. Importantly, inclusion of ½- and ¾-site REs greatly expands the p53 master regulatory network.
Within human cells, the tumor suppressor p53 is the central node of regulation required to elicit multiple biological responses that include cell cycle arrest and death in response to stress or DNA damage, where mutations in p53 are a hallmark of cancer. As a master regulatory gene, p53 controls the action of target genes within its network by directly interacting with a widely accepted consensus DNA binding sequence, composed of two decamer ½-sites that can be separated by up to 13 bases. While mismatches from consensus sequence are frequent, the canonical consensus sequence places a limitation upon the organization and number of target genes within the p53 transcriptional network. Using yeast and human cell systems, our goal was to further understand how the DNA sequence, DNA organization, and level of p53 expression might influence the inclusion of genes within the p53 regulatory network. We found that increases in spacer beyond a few bases greatly reduce responsiveness to p53. Importantly, we established that p53 can function from noncanonical sequences comprising only a decamer ½-site or a ¾-site. These findings further define and expand the universe of potential downstream target genes which may be regulated by p53 and bring further diversity into the p53 regulatory network.
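The canonical consensus quoted in this entry, RRRCWWGYYY (n = 0–13) RRRCWWGYYY, can be turned into a simple exact-match scanner using the standard IUPAC codes (R = A or G, W = A or T, Y = C or T). The sketch below is only an illustration of that consensus: the helper names and test sequences are made up, and it allows no mismatches, so it would miss the mismatched and noncanonical ½- and ¾-sites that the study highlights.

```python
import re

# Standard IUPAC nucleotide codes used by the consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "W": "[AT]", "Y": "[CT]"}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex fragment."""
    return "".join(IUPAC[ch] for ch in consensus)


HALF_SITE = iupac_to_regex("RRRCWWGYYY")
# Two half-sites separated by a 0-13 nt spacer (shortest spacer preferred).
FULL_SITE = re.compile(f"({HALF_SITE})([ACGT]{{0,13}}?)({HALF_SITE})")


def find_full_sites(seq):
    """Return (start, spacer_length) for each non-overlapping full site."""
    return [(m.start(), len(m.group(2)))
            for m in FULL_SITE.finditer(seq.upper())]


# Made-up sequences: adjacent half-sites, then the same pair with a 3 nt spacer.
adjacent = find_full_sites("TTGGGCATGTCCAGACATGCCTTT")
spaced = find_full_sites("gggcatgtccTTTagacatgcct")
```

Reporting the spacer length alongside each hit is a deliberate choice here, since the abstract reports that even small increases in spacer length greatly reduce transactivation.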
Affiliation(s)
- Jennifer J. Jordan
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
  - Curriculum in Genetics and Molecular Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Daniel Menendez
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
- Alberto Inga
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
  - Unit of Molecular Mutagenesis and DNA Repair, National Institute for Cancer Research, IST, Genoa, Italy
- Maher Nourredine
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
- Douglas Bell
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
- Michael A. Resnick
  - Laboratory of Molecular Genetics, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, North Carolina, United States of America
  - Curriculum in Genetics and Molecular Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
15
Li L, Bass RL, Liang Y. fdrMotif: identifying cis-elements by an EM algorithm coupled with false discovery rate control. Bioinformatics 2008; 24:629-36. [PMID: 18296465 DOI: 10.1093/bioinformatics/btn009]
Abstract
Motivation: Most de novo motif identification methods optimize the motif model first and then separately test the statistical significance of the motif score. In the first stage, a motif abundance parameter needs to be specified or modeled. In the second stage, a Z-score or P-value is used as the test statistic. Error rates under multiple comparisons are not fully considered. Methodology: We propose a simple but novel approach, fdrMotif, that selects as many binding sites as possible while controlling a user-specified false discovery rate (FDR). Unlike existing iterative methods, fdrMotif combines model optimization [e.g. position weight matrix (PWM)] and significance testing at each step. By monitoring the proportion of binding sites selected in many sets of background sequences, fdrMotif controls the FDR in the original data. The model is then updated using an expectation (E)- and maximization (M)-like procedure. We propose a new normalization procedure in the E-step for updating the model. This process is repeated until either the model converges or the number of iterations exceeds a maximum. Results: Simulation studies suggest that our normalization procedure assigns larger weights to the binding sites than do two other commonly used normalization procedures. Furthermore, fdrMotif requires only a user-specified FDR and an initial PWM. When tested on 542 high confidence experimental p53 binding loci, fdrMotif identified 569 p53 binding sites in 505 (93.2%) sequences. In comparison, MEME identified more binding sites but in fewer ChIP sequences than fdrMotif. When tested on 500 sets of simulated 'ChIP' sequences with embedded known p53 binding sites, fdrMotif, compared to MEME, has higher sensitivity with similar positive predictive value. Furthermore, fdrMotif is robust to noise: it selected nearly identical binding sites in data adulterated with 50% added background sequences and the unadulterated data. We suggest that fdrMotif represents an improvement over MEME.
AVAILABILITY C code can be found at: http://www.niehs.nih.gov/research/resources/software/fdrMotif/.
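The idea of selecting as many sites as possible under a user-specified FDR can be sketched as follows. This is an assumed, simplified reading of the abstract and not the fdrMotif code: the FDR at a score threshold is estimated as the mean number of background sites passing the threshold divided by the number of real sites passing it, and the lowest threshold meeting the target is chosen. The function names and toy scores are hypothetical, and the EM-like model update is omitted.

```python
def estimated_fdr(real_scores, background_sets, threshold):
    """Estimate FDR as mean background hits divided by real-data hits."""
    real_hits = sum(s >= threshold for s in real_scores)
    if real_hits == 0:
        return 0.0
    background_hits = [sum(s >= threshold for s in bg)
                       for bg in background_sets]
    mean_background = sum(background_hits) / len(background_sets)
    return min(1.0, mean_background / real_hits)


def pick_threshold(real_scores, background_sets, target_fdr):
    """Lowest score threshold whose estimated FDR meets the target.

    Scanning thresholds from low to high keeps as many sites as possible.
    Returns None if no threshold satisfies the target.
    """
    for threshold in sorted(set(real_scores)):
        if estimated_fdr(real_scores, background_sets, threshold) <= target_fdr:
            return threshold
    return None


# Toy site scores from the real data and from two background sets.
real = [9, 8, 7, 3, 2]
backgrounds = [[3, 1, 1, 1, 1], [2, 2, 1, 1, 1]]
chosen = pick_threshold(real, backgrounds, target_fdr=0.2)
```

In this toy case a threshold of 2 would admit all five sites but with an estimated FDR of 0.3, so the scan moves up to 3, where four sites remain and the estimate drops to 0.125.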
Affiliation(s)
- Leping Li
  - Biostatistics Branch, National Institute of Environmental Health Sciences, NIH, DHHS, Research Triangle Park, NC 27709, USA