Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lladser ME, Betterton MD, Knight R. Multiple pattern matching: a Markov chain approach. J Math Biol 2007;56:51-92. [PMID: 17668213 DOI: 10.1007/s00285-007-0109-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 01/01/2007] [Revised: 05/15/2007] [Indexed: 10/23/2022]

Cited by Other Article(s)

Prosperi M, Marini S, Boucher C. Fast and exact quantification of motif occurrences in biological sequences. BMC Bioinformatics 2021;22:445. [PMID: 34537012 PMCID: PMC8449872 DOI: 10.1186/s12859-021-04355-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/04/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022] Open

Abstract

BACKGROUND

Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical for large motif sets. Approximate formulae, e.g. based on the compound Poisson distribution, are faster, but reliable p-value calculation remains challenging. Here, we introduce 'motif_prob', a fast implementation of an exact formula for the motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, which is usually impractical, making it feasible and a candidate to replace currently employed heuristics.

RESULTS

We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13-31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50-1000× faster) than MoSDi, even when using its fast compound Poisson approximation (60-120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the source code are available under the MIT license at https://github.com/DataIntellSystLab/motif_prob.
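The dependence on GC content mentioned above can be illustrated with a minimal sketch. It is not the motif_prob method and computes only the expected count, not the full count distribution. It assumes an i.i.d. background in which G and C each have probability gc/2 and A and T each have probability (1 - gc)/2; the function name `expected_count` and the example motif are hypothetical.

```python
def expected_count(motif: str, gc: float, genome_len: int) -> float:
    """Expected number of (possibly overlapping) occurrences of `motif`
    in an i.i.d. random sequence of length `genome_len`.

    Assumption: P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    """
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    # Probability that the motif starts at one fixed position.
    p = 1.0
    for base in motif:
        p *= base_prob[base]
    # Linearity of expectation over all possible start positions.
    positions = max(genome_len - len(motif) + 1, 0)
    return positions * p


if __name__ == "__main__":
    motif = "GCGCGCGCGCGCG"  # hypothetical GC-rich 13-mer
    for gc in (0.35, 0.50, 0.65):
        print(gc, expected_count(motif, gc, 5_000_000))
```

Under these assumptions the same observed count of a GC-rich motif can be unremarkable in a GC-rich genome and strongly enriched in an AT-rich one, which is why a background-aware p-value is needed when comparing bacteria.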

CONCLUSIONS

The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.


Char IG, Lladser ME. Stochastic Analysis of Minimal Automata Growth for Generalized Strings. Methodol Comput Appl Probab 2020. [DOI: 10.1007/s11009-019-09706-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]

Nuel G. Moments of the Count of a Regular Expression in a Heterogeneous Random Sequence. Methodol Comput Appl Probab 2019. [DOI: 10.1007/s11009-019-09700-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]

Nielsen MM, Tataru P, Madsen T, Hobolth A, Pedersen JS. Regmex: a statistical tool for exploring motifs in ranked sequence lists from genomics experiments. Algorithms Mol Biol 2018;13:17. [PMID: 30555524 PMCID: PMC6286601 DOI: 10.1186/s13015-018-0135-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 01/23/2018] [Accepted: 12/01/2018] [Indexed: 12/23/2022] Open

Abstract

Background

Motif analysis methods have long been central to studying the biological function of nucleotide sequences. Functional genomics experiments extend their potential. They typically generate sequence lists ranked by an experimentally acquired functional property such as gene expression or protein binding affinity. Current motif discovery tools suffer from limitations in searching large motif spaces, and thus more complex motifs may not be included. There is thus a need for motif analysis methods that are tailored for analyzing specific complex motifs motivated by biological questions and hypotheses rather than acting as a screen-based motif finding tool.

Methods

We present Regmex (REGular expression Motif EXplorer), which offers several methods to identify overrepresented motifs in ranked lists of sequences. Regmex uses regular expressions to define motifs or families of motifs and embedded Markov models to calculate exact p-values for motif observations in sequences. Biases in motif distributions across ranked sequence lists are evaluated using random walks, Brownian bridges, or modified rank-based statistics. A modular setup and fast analytic p-value evaluations make Regmex applicable to diverse and potentially large-scale motif analysis problems.

Results

We demonstrate use cases of combined motifs on simulated data and on expression data from microRNA transfection experiments. We confirm previously obtained results and demonstrate the usability of Regmex to test a specific hypothesis about the relative location of microRNA seed sites and U-rich motifs. We further compare the tool with an existing motif discovery tool and show increased sensitivity.

Conclusions

Regmex is a useful and flexible tool for analyzing motif hypotheses that relate to large data sets in functional genomics. The method is available as an R package (https://github.com/muhligs/regmex).

Electronic supplementary material

The online version of this article (10.1186/s13015-018-0135-2) contains supplementary material, which is available to authorized users.


Analytic Combinatorics for Computing Seeding Probabilities. Algorithms 2018. [DOI: 10.3390/a11010003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]

Martin DEK, Coleman DA. Distribution of Clump Statistics for a Collection of Words. J Appl Probab 2016. [DOI: 10.1239/jap/1324046018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]

Régnier M, Furletova E, Yakovlev V, Roytberg M. Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models. Algorithms Mol Biol 2015;9:25. [PMID: 25648087 PMCID: PMC4307674 DOI: 10.1186/s13015-014-0025-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/22/2013] [Accepted: 11/09/2014] [Indexed: 12/02/2022] Open

Abstract

Background

Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability of finding at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are assumed to be of the same length.
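The P-value defined above can be made concrete with a small sketch. This is not the SufPref algorithm: it is a plain dynamic program over the last m-1 letters of the text, and it assumes an i.i.d. (Bernoulli) letter model rather than a Markov model or HMM, with overlapping occurrences counted. The function name `pvalue_at_least` and its parameters are hypothetical, and the state space grows exponentially in the word length, so it only suits short words.

```python
from collections import defaultdict


def pvalue_at_least(words, probs, n, s):
    """P(at least s occurrences of any word in `words`) in a random text
    of length n whose letters are i.i.d. with probabilities `probs`.

    Assumptions: all words share one length m; overlaps are counted.
    """
    words = set(words)
    m = len(next(iter(words)))
    if any(len(w) != m for w in words):
        raise ValueError("all words must have the same length")
    # State: (last up to m-1 letters, occurrence count capped at s).
    dist = {("", 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (ctx, k), p in dist.items():
            for letter, p_letter in probs.items():
                window = ctx + letter
                hit = len(window) == m and window in words
                new_k = min(s, k + (1 if hit else 0))
                new_ctx = window[-(m - 1):] if m > 1 else ""
                nxt[(new_ctx, new_k)] += p * p_letter
        dist = nxt
    # Counts are capped at s, so k == s means "s or more occurrences".
    return sum(p for (_, k), p in dist.items() if k == s)


if __name__ == "__main__":
    uniform = {b: 0.25 for b in "ACGT"}
    print(pvalue_at_least(["ACG", "TTT"], uniform, n=100, s=5))
```

Capping the count at S keeps the number of states bounded, because once S occurrences have been seen the exact count no longer affects the P-value.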

Results

We present a novel algorithm, SufPref, that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from the pattern. The edges are associated with the prefix and suffix relations between overlaps. An originality of our data structure is that the pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of the SufPref data structure leads to significant improvements in space and time complexity compared to existing algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as a web server and as a standalone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern, or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare the sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two, and HMMs. The experiments show that the methods have approximately the same quality.

Electronic supplementary material

The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users.


Approximation of sojourn-times via maximal couplings: motif frequency distributions. J Math Biol 2014;69:147-82. [DOI: 10.1007/s00285-013-0690-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2013] [Revised: 05/11/2013] [Indexed: 10/26/2022]

Tataru P, Sand A, Hobolth A, Mailund T, Pedersen CNS. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology 2013;2:1282-95. [PMID: 24833225 PMCID: PMC4009796 DOI: 10.3390/biology2041282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 06/28/2013] [Revised: 10/08/2013] [Accepted: 11/05/2013] [Indexed: 11/24/2022]

Prosperi MCF, Prosperi L, Gray RR, Salemi M. On counting the frequency distribution of string motifs in molecular sequences. Int J Biomath 2012. [DOI: 10.1142/s1793524512500556] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]

Kennedy R, Lladser ME, Wu Z, Zhang C, Yarus M, De Sterck H, Knight R. Natural and artificial RNAs occupy the same restricted region of sequence space. RNA 2010;16:280-9. [PMID: 20032164 PMCID: PMC2811657 DOI: 10.1261/rna.1923210] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 05/07/2023]

Pudlo P. Large deviations and full Edgeworth expansions for finite Markov chains with applications to the analysis of genomic sequences. ESAIM Probab Stat 2010. [DOI: 10.1051/ps/2009008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]

Marschall T, Rahmann S. Efficient exact motif discovery. Bioinformatics 2009;25:i356-64. [PMID: 19478010 PMCID: PMC2687942 DOI: 10.1093/bioinformatics/btp188] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Indexed: 01/04/2023] Open

Abstract

Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif.

Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis.

Availability and Implementation: The method has been implemented in Java. It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/

Contact: tobias.marschall@tu-dortmund.de; sven.rahmann@tu-dortmund.de
