1. Bennett I, Martin DEK, Lahiri SN. Fitting sparse Markov models through a collapsed Gibbs sampler. Comput Stat 2022. [DOI: 10.1007/s00180-022-01310-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/16/2022]
2. Bai X, Ren J, Sun F. MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences. J Mol Biol 2022; 434:167586. [PMID: 35427634 PMCID: PMC10433695 DOI: 10.1016/j.jmb.2022.167586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/16/2022] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/23/2022]
Abstract
Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences and many studies reported high classification accuracy. Such models are usually trained based on sequences in several training classes in the hope of accurately classifying unknown sequences into these classes. However, when deploying the classification models on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequence belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence measuring the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves state-of-the-art performance, demonstrating the generality of MLR-OOD to various types of microbial data sets. It is also shown that MLR-OOD is robust to the GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
Affiliation(s)
- Xin Bai: Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jie Ren: Google Research, Brain Team, USA
- Fengzhu Sun: Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
3. An S, Ren J, Sun F, Wan L. A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses. J Comput Biol 2022; 29:839-856. [PMID: 35451885 PMCID: PMC9419963 DOI: 10.1089/cmb.2021.0604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of free parameters. In the seminal article by Bühlmann and Wyner, the variable length Markov chain (VLMC) model was proposed to embed the full-order MC in a sparse structured context tree. In the key tree-pruning procedure of their proposed context algorithm, the word count-based statistic for each branch was defined and compared with a fixed cutoff threshold calculated from a common chi-square distribution to prune the branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, resulting in non-negligible effects on the distribution of the statistic of interest. We demonstrate that the inferred context tree based on the original context algorithm by Bühlmann and Wyner, which uses a fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm using an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). The VLMC-C is founded on the consistent branch-specific mixed chi-square distributions calculated based on the asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distribution using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B for both context tree reconstruction accuracy and model compression capacity.
Affiliation(s)
- Shaokun An: Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
- Jie Ren: Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA
- Fengzhu Sun: Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, USA
- Lin Wan: Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
4. Bai X, Ren J, Fan Y, Sun F. KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate. Bioinformatics 2021; 37:759-766. [PMID: 33119059 PMCID: PMC8599924 DOI: 10.1093/bioinformatics/btaa912] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2020] [Revised: 09/11/2020] [Accepted: 10/14/2020] [Indexed: 01/09/2023]
Abstract
MOTIVATION The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance. RESULTS To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini-Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI. AVAILABILITY AND IMPLEMENTATION Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xin Bai: Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA
- Jie Ren: Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA
- Yingying Fan: Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA
- Fengzhu Sun: Quantitative and Computational Biology Program, Department of Biological Sciences, Los Angeles, CA 90089, USA
5. Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data. Quantitative Biology 2020; 8:143-154. [PMID: 34262790 DOI: 10.1007/s40484-020-0200-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
Abstract
Background Markov chains (MC) have been widely used to model molecular sequences. The estimations of MC transition matrix and confidence intervals of the transition probabilities from long sequence data have been intensively studied in the past decades. In next generation sequencing (NGS), a large amount of short reads are generated. These short reads can overlap and some regions of the genome may not be sequenced resulting in a new type of data. Based on NGS data, the transition probabilities of MC can be estimated by moment estimators. However, the classical asymptotic distribution theory for MC transition probability estimators based on long sequences is no longer valid. Methods In this study, we present the asymptotic distributions of several statistics related to MC based on NGS data. We show that, after scaling by the effective coverage d defined in a previous study by the authors, these statistics based on NGS data approximate to the same distributions as the corresponding statistics for long sequences. Results We apply the asymptotic properties of these statistics for finding the theoretical confidence regions for MC transition probabilities based on NGS short reads data. We validate our theoretical confidence intervals using both simulated data and real data sets, and compare the results with those by the parametric bootstrap method. Conclusions We find that the asymptotic distributions of these statistics and the theoretical confidence intervals of transition probabilities based on NGS data given in this study are highly accurate, providing a powerful tool for NGS data analysis.
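
The abstract of entry 5 refers to estimating Markov chain transition probabilities from many short reads rather than one long sequence. As a rough illustration only, the sketch below pools adjacent-letter counts over a set of reads and row-normalises them into first-order transition probabilities. It is not the authors' moment estimator: it ignores read overlap, the effective coverage d, and the confidence intervals that are the subject of the paper. The function names and the toy reads are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four letters


def transition_counts(reads):
    """Pool counts of adjacent letter pairs (a, b) over all reads."""
    counts = Counter()
    for read in reads:
        for a, b in zip(read, read[1:]):
            counts[(a, b)] += 1
    return counts


def transition_matrix(reads):
    """Row-normalise the pooled pair counts into estimates of P(b | a)."""
    counts = transition_counts(reads)
    matrix = {}
    for a in ALPHABET:
        row_total = sum(counts[(a, b)] for b in ALPHABET)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0)
            for b in ALPHABET
        }
    return matrix


reads = ["ACGTAC", "GTACGA", "TACGTT"]  # toy reads, not real data
P = transition_matrix(reads)
```

Here `P["G"]["T"]` is the estimated probability of seeing T immediately after G. Pairs never cross read boundaries, which is the one feature of short-read data this sketch does respect.
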
6. Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Affiliation(s)
- Jie Ren: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Xin Bai: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
- Yang Young Lu: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Kujin Tang: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Ying Wang: Department of Automation, Xiamen University, Xiamen, Fujian, China
- Gesine Reinert: Department of Statistics, University of Oxford, Oxford, United Kingdom
- Fengzhu Sun: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
7. Bai X, Tang K, Ren J, Waterman M, Sun F. Optimal choice of word length when comparing two Markov sequences using a χ²-statistic. BMC Genomics 2017; 18:732. [PMID: 28984181 PMCID: PMC5629589 DOI: 10.1186/s12864-017-4020-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/29/2022]
Abstract
Background Alignment-free sequence comparison using counts of word patterns (grams, k-tuples) has become an active research topic due to the large amount of sequence data from the new sequencing technologies. Genome sequences are frequently modelled by Markov chains and the likelihood ratio test or the corresponding approximate χ²-statistic has been suggested to compare two sequences. However, it is not known how to best choose the word length k in such studies. Results We develop an optimal strategy to choose k by maximizing the statistical power of detecting differences between two sequences. Let the orders of the Markov chains for the two sequences be r1 and r2, respectively. We show through both simulations and theoretical studies that the optimal k = max(r1, r2) + 1 for both long sequences and next generation sequencing (NGS) read data. The orders of the Markov chains may be unknown and several methods have been developed to estimate the orders of Markov chains based on both long sequences and NGS reads. We study the power loss of the statistics when the estimated orders are used. It is shown that the power loss is minimal for some of the estimators of the orders of Markov chains. Conclusion Our studies provide guidelines on choosing the optimal word length for the comparison of Markov sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-4020-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xin Bai: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
- Kujin Tang: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Jie Ren: Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Michael Waterman: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China; Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Fengzhu Sun: Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China; Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
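
The word-length rule stated in the abstract of entry 7, k = max(r1, r2) + 1, is simple enough to show directly. The sketch below applies the rule and then counts overlapping words of that length in a toy sequence. The function names and the example orders are hypothetical, and the χ²-statistic itself is not implemented here.

```python
from collections import Counter


def optimal_word_length(r1, r2):
    """Word length k = max(r1, r2) + 1, the choice reported as optimal in the abstract."""
    return max(r1, r2) + 1


def kmer_counts(sequence, k):
    """Count overlapping words of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Assumed Markov orders of the two sequences being compared (illustrative values).
k = optimal_word_length(1, 2)
counts = kmer_counts("ACGTACGT", k)
```

With orders 1 and 2 the rule gives k = 3, so `counts` holds the six overlapping 3-words of the toy sequence.
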
8. Arapis AN, Makri FS, Psillakis ZM. On the length and the position of the minimum sequence containing all runs of ones in a Markovian binary sequence. Stat Probab Lett 2016. [DOI: 10.1016/j.spl.2016.03.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
9.
Abstract
Given a realisation of a Markov chain, one can count the numbers of state transitions of each type. One can ask how many realisations are there with these transition counts and the same initial state. Whittle (1955) has answered this question, by finding an explicit though complicated formula, and has also shown that each realisation is equally likely. In the analysis of DNA sequences which comprise letters from the set {A, C, G, T}, it is often useful to count the frequency of a pattern, say ACGCT, in a long sequence and compare this with the expected frequency for all sequences having the same start letter and the same transition counts (or ‘dinucleotide counts' as they are called in the molecular biology literature). To date, no exact method exists; this paper rectifies that deficiency.
11. Ren J, Song K, Deng M, Reinert G, Cannon CH, Sun F. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics 2016; 32:993-1000. [PMID: 26130573 PMCID: PMC6169497 DOI: 10.1093/bioinformatics/btv395] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 03/11/2015] [Revised: 03/11/2015] [Accepted: 06/25/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. RESULTS Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. AVAILABILITY AND IMPLEMENTATION Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html CONTACT fsun@usc.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jie Ren: Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA
- Kai Song: School of Mathematical Sciences, Peking University, Beijing, China
- Minghua Deng: School of Mathematical Sciences, Peking University, Beijing, China
- Gesine Reinert: Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
- Charles H Cannon: Department of Biological Sciences, Texas Tech University, TX 79409-3131, USA; Xishuangbanna Tropical Botanic Garden, Chinese Academy of Sciences, Yunnan, China
- Fengzhu Sun: Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA, USA; Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
12. Skewes AD, Welch RD. A Markovian analysis of bacterial genome sequence constraints. PeerJ 2013; 1:e127. [PMID: 24010012 PMCID: PMC3757466 DOI: 10.7717/peerj.127] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/27/2013] [Accepted: 07/18/2013] [Indexed: 11/20/2022]
Abstract
The arrangement of nucleotides within a bacterial chromosome is influenced by numerous factors. The degeneracy of the third codon within each reading frame allows some flexibility of nucleotide selection; however, the third nucleotide in the triplet of each codon is at least partly determined by the preceding two. This is most evident in organisms with a strong G + C bias, as the degenerate codon must contribute disproportionately to maintaining that bias. Therefore, a correlation exists between the first two nucleotides and the third in all open reading frames. If the arrangement of nucleotides in a bacterial chromosome is represented as a Markov process, we would expect that the correlation would be completely captured by a second-order Markov model and an increase in the order of the model (e.g., third-, fourth-…order) would not capture any additional uncertainty in the process. In this manuscript, we present the results of a comprehensive study of the Markov property that exists in the DNA sequences of 906 bacterial chromosomes. All of the 906 bacterial chromosomes studied exhibit a statistically significant Markov property that extends beyond second-order, and therefore cannot be fully explained by codon usage. An unrooted tree containing all 906 bacterial chromosomes based on their transition probability matrices of third-order shares ∼25% similarity to a tree based on sequence homologies of 16S rRNA sequences. This congruence to the 16S rRNA tree is greater than for trees based on lower-order models (e.g., second-order), and higher-order models result in diminishing improvements in congruence. A nucleotide correlation most likely exists within every bacterial chromosome that extends past three nucleotides. This correlation places significant limits on the number of nucleotide sequences that can represent probable bacterial chromosomes. 
Transition matrix usage is largely conserved by taxa, indicating that this property is likely inherited; however, some important exceptions exist that may indicate the convergent evolution of some bacteria.
Affiliation(s)
- Aaron D Skewes: Department of Biology, Syracuse University, Syracuse, NY, United States; Department of Mathematics, Syracuse University, Syracuse, NY, United States
13. Silva CQD. Hidden Markov models applied to a subsequence of the Xylella fastidiosa genome. Genet Mol Biol 2003. [DOI: 10.1590/s1415-47572003000400018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 11/22/2022]
15.
Abstract
In this paper, we give an overview about the different results existing on the statistical distribution of word counts in a Markovian sequence of letters. Results concerning the number of overlapping occurrences, the number of renewals and the number of clumps will be presented. Counts of single words and also multiple words are considered. Most of the results are approximations as the length of the sequence tends to infinity. We will see that Gaussian approximations switch to (compound) Poisson approximations for rare words. Modeling DNA sequences or proteins by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
Affiliation(s)
- S Schbath: Institut National de la Recherche Agronomique, Unité de Biométrie, Jouy-en-Josas, France
16. von Haeseler A, Schöniger M. Evolution of DNA or amino acid sequences with dependent sites. J Comput Biol 1998; 5:149-63. [PMID: 9541878 DOI: 10.1089/cmb.1998.5.149] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/07/2023]
Abstract
A framework is outlined to study the evolution of DNA or amino acid sequences, if sequence sites do not evolve independently. The units of evolution are nonoverlapping subsequences of length l. Each subsequence evolves independently of the others, but within a subsequence the sequences show a Markov order one dependency. We describe an algorithm to mimic the evolution of such sequences. The influence of dependencies between sites on distance estimates and the reliability of tree reconstruction methods is investigated. We show that an inappropriate model of sequence evolution in the tree reconstruction process will lead to a nonempty Felsenstein zone. Finally, we describe a method to infer l from sequence data. Examples from the evolution of DNA sequences as well as from amino acids are given.
17. Reddy BV, Pandit MW. A statistical analytical approach to decipher information from biological sequences: application to murine splice-site analysis and prediction. J Biomol Struct Dyn 1995; 12:785-801. [PMID: 7779300 DOI: 10.1080/07391102.1995.10508776] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/27/2023]
Abstract
A simple statistical approach for the analysis of biological sequences, such as splice-sites, promoter regions, helices and extended structure forming regions or any other sequence dependent functional entities in proteins, is presented. The approach has been proved useful to develop a method for prediction of such entities in newly available sequences. We first search for invariant sequence features of each functional entity from the experimentally available sequences and identify a set of 'like' sequences with similar sequence features. In the next step, concrete features of sequence entities in terms of occurrences of smaller subsequences are identified at various positions which are used as a knowledge base to select potential functional entities from the identified 'like' sequences. The third step consists of refinement of this pattern learning, statistical improvements of the knowledge base weight matrices, and finally its application to predict functional entities in newly available sequences. Such an analysis is operationally described for murine splice-site predictions. Regions comprising -30 to +30 nucleotides from the splice-junction at the murine splice-sites (donors and acceptors), reported earlier, were analyzed. Invariant sequence-specific features in terms of monomer frequency average were used to identify splice-site-like sequences in the EMBL murine DNA sequence data base. The frequencies of occurrence of mono-, di-, tri- and tetranucleotides in the known splice-sites were studied in comparison with the splice-site-like sequences; the significant differences in their occurrences were extracted as statistical knowledge coded in weight matrices for computer to identify potential splice-sites. The algorithm was refined and a method was developed to predict potential splice-sites in a given murine DNA; the analysis was also extended to human DNA. 
The success rate of the method to predict correct splice-sites in these species is found to be 80% and 85%, respectively. The major strength of this method lies in reducing significantly the number of false positives which are normally picked up in such analysis.
Affiliation(s)
- B V Reddy: Centre for Cellular and Molecular Biology, Hyderabad, India
18. Schbath S, Prum B, de Turckheim E. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J Comput Biol 1995; 2:417-37. [PMID: 8521272 DOI: 10.1089/cmb.1995.2.417] [Citation(s) in RCA: 69] [Impact Index Per Article: 2.4] [Indexed: 01/31/2023]
Abstract
Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimation. We consider here different Markov chain models, either with stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotides counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are analyzed in different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
Affiliation(s)
- S Schbath: INRA, Département de Biométrie et Intelligence Artificielle, Jouy-en-Josas, France
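
The abstract of entry 18 describes the difference T(W) between the observed count of a word W and its estimate under a Markov model. A minimal sketch of that difference follows, assuming the standard plug-in estimate under the model of order len(W) - 2, namely N(w1..w_{k-1}) N(w2..w_k) / N(w2..w_{k-1}). The choice of model order, the function names, and the toy sequence are assumptions made for illustration; the conditional variance needed to turn T(W) into a standard normal statistic is not computed here.

```python
from collections import Counter


def count_words(sequence, k):
    """Count overlapping occurrences of every word of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def expected_count(sequence, word):
    """Plug-in estimate of the count of `word` under a Markov model of order len(word) - 2."""
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least three letters")
    shorter = count_words(sequence, k - 1)
    core = count_words(sequence, k - 2)[word[1:-1]]
    if core == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / core


def difference(sequence, word):
    """T(W): observed count of the word minus its estimated expected count."""
    observed = count_words(sequence, len(word))[word]
    return observed - expected_count(sequence, word)


sequence = "ACGACGACG"  # toy sequence, not real data
t_acg = difference(sequence, "ACG")
```

For a 3-word this conditions on dinucleotide counts, so a word is flagged only when its frequency is not already explained by its two overlapping 2-words.
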
19. Goldman N. Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res 1993; 21:2487-91. [PMID: 8506142 PMCID: PMC309551 DOI: 10.1093/nar/21.10.2487] [Citation(s) in RCA: 46] [Impact Index Per Article: 1.5] [Indexed: 01/31/2023]
Abstract
The chaos game representation (CGR) is a scatter plot derived from a DNA sequence, with each point of the plot corresponding to one base of the sequence. If the DNA sequence were a random collection of bases, the CGR would be a uniformly filled square; conversely, any patterns visible in the CGR represent some pattern (information) in the DNA sequence. In this paper, patterns previously observed in a variety of DNA sequences are explained solely in terms of nucleotide, dinucleotide and trinucleotide frequencies.
Affiliation(s)
- N Goldman: Laboratory of Mathematical Biology, National Institute for Medical Research, London, UK
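
The chaos game representation described in the abstract of entry 19 places one point per base. A minimal sketch of the usual construction is given below: each new point is the midpoint between the previous point and the corner of the unit square assigned to the current base. The corner assignment and the starting point at the centre of the square are conventions assumed here, not details taken from the abstract, and the function name is hypothetical.

```python
# Assumed corner convention for the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return one (x, y) point per base: the midpoint of the previous point and the base's corner."""
    x, y = start
    points = []
    for base in sequence:
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


points = cgr_points("AG")
```

Because each step halves the distance to a corner, the last k bases determine which sub-square of side 2^-k a point falls in, which is why k-nucleotide frequencies show up as visible patterns in the plot.
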
20.
Abstract
Penny et al. have written that "The most fundamental criterion for a scientific method is that the data must, in principle, be able to reject the model. Hardly any [phylogenetic] tree-reconstruction methods meet this simple requirement." The ability to reject models is of such great importance because the results of all phylogenetic analyses depend on their underlying models--to have confidence in the inferences, it is necessary to have confidence in the models. In this paper, a test statistic suggested by Cox is employed to test the adequacy of some statistical models of DNA sequence evolution used in the phylogenetic inference method introduced by Felsenstein. Monte Carlo simulations are used to assess significance levels. The resulting statistical tests provide an objective and very general assessment of all the components of a DNA substitution model; more specific versions of the test are devised to test individual components of a model. In all cases, the new analyses have the additional advantage that values of phylogenetic parameters do not have to be assumed in order to perform the tests.
Affiliation(s)
- N Goldman: Department of Zoology, University of Cambridge, UK