1
|
Markić I, Štula M, Zorić M, Stipaničev D. Entropy-Based Approach in Selection Exact String-Matching Algorithms. ENTROPY (BASEL, SWITZERLAND) 2020; 23:E31. [PMID: 33379282 PMCID: PMC7824336 DOI: 10.3390/e23010031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/14/2020] [Revised: 12/19/2020] [Accepted: 12/22/2020] [Indexed: 11/16/2022]
Abstract
The string-matching paradigm is applied in every branch of computer science and in science in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution can be determined, and for maximum efficiency, the goal is to minimize resource usage. Guided by the fact that standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions, and without touching the problematics of computer power consumption or memory, which also depend on the algorithm type and the techniques used in algorithm development, we have developed a methodology which enables researchers to choose an efficient algorithm for a specific domain. The efficiency of string-searching algorithms is usually observed independently of the domain texts being searched. This research paper aims to present the idea that algorithm efficiency depends on the properties of the searched string and the properties of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through the character comparison count metric. The character comparison count metric is a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and provides, for a specific domain, a ranking of algorithms according to the patterns' entropy.
The proposed approach is limited to on-line exact string-matching problems based on information entropy for a search pattern. Meticulous empirical testing illustrates the implementation of the methodology and supports its soundness.
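The two quantities this abstract relies on, pattern entropy and character comparison count, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain Shannon entropy over the pattern's character frequencies and uses brute-force matching as a stand-in algorithm. The function names `shannon_entropy` and `naive_search_comparisons` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pattern):
    """Shannon entropy (bits per character) of the pattern's character frequencies."""
    counts = Counter(pattern)
    n = len(pattern)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def naive_search_comparisons(text, pattern):
    """Brute-force exact matching (assumed stand-in algorithm).

    Returns the list of match positions and the total number of
    character comparisons performed.
    """
    matches, comparisons = [], 0
    for i in range(len(text) - len(pattern) + 1):
        j = 0
        while j < len(pattern):
            comparisons += 1  # one text/pattern character comparison
            if text[i + j] != pattern[j]:
                break
            j += 1
        if j == len(pattern):
            matches.append(i)
    return matches, comparisons
```

Counting comparisons rather than timing the search keeps the measure independent of hardware and implementation details, which is the property the abstract emphasizes.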
Affiliation(s)
- Ivan Markić
- Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
| | - Maja Štula
- Department of Electronics and Computing, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
| | - Marija Zorić
- IT Department, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
| | - Darko Stipaničev
- Department of Electronics and Computing, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia
|
2
|
Humphrey S, Kerr A, Rattray M, Dive C, Miller CJ. A model of k-mer surprisal to quantify local sequence information content surrounding splice regions. PeerJ 2020; 8:e10063. [PMID: 33194378 PMCID: PMC7648452 DOI: 10.7717/peerj.10063] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/27/2020] [Accepted: 09/08/2020] [Indexed: 12/22/2022] Open
Abstract
Molecular sequences carry information. Analysis of sequence conservation between homologous loci is a proven approach with which to explore the information content of molecular sequences. This is often done using multiple sequence alignments to support comparisons between homologous loci. These methods therefore rely on sufficient underlying sequence similarity with which to construct a representative alignment. Here we describe a method using a formal metric of information, surprisal, to analyse biological sub-sequences without alignment constraints. We applied our model to the genomes of five different species to reveal similar patterns across a panel of eukaryotes. As the surprisal of a sub-sequence is inversely proportional to its occurrence within the genome, the optimal size of the sub-sequences was selected for each species under consideration. With the model optimized, we found a strong correlation between surprisal and CG dinucleotide usage. The utility of our model was tested by examining the sequences of genes known to undergo splicing. We demonstrate that our model can identify biological features of interest such as known donor and acceptor sites. Analysis across all annotated coding exon junctions in Homo sapiens reveals the information content of coding exons to be greater than the surrounding intron regions, a consequence of increased suppression of the CG dinucleotide in intronic space. Sequences within coding regions proximal to exon junctions exhibited novel patterns within DNA and coding mRNA that are not a function of the encoded amino acid sequence. Our findings are consistent with the presence of secondary information encoding features such as DNA and RNA binding sites, multiplexed through the coding sequence and independent of the information required to define the corresponding amino-acid sequence. 
We conclude that surprisal provides a complementary methodology with which to locate regions of interest in the genome, particularly in situations that lack an appropriate multiple sequence alignment.
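As a rough illustration of the surprisal metric described above, the sketch below assigns each k-mer the value -log2(p), where p is its relative frequency among all overlapping k-mers of the sequence. This is the textbook definition only; the paper's actual model, its per-species choice of k, and any smoothing are not given here, and the function name `kmer_surprisal` is hypothetical.

```python
import math
from collections import Counter


def kmer_surprisal(sequence, k):
    """Map each k-mer to its surprisal in bits, -log2(relative frequency).

    Frequencies are estimated from all overlapping k-mers of the
    sequence itself (an assumption made for this sketch).
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: -math.log2(n / total) for kmer, n in counts.items()}
```

Rare k-mers receive high surprisal and common ones low surprisal, so a profile of these values along a sequence highlights locally unusual regions without requiring an alignment.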
Affiliation(s)
- Sam Humphrey
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Alastair Kerr
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Magnus Rattray
- Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
| | - Caroline Dive
- CRUK Manchester Institute Cancer Biomarker Centre, The University of Manchester, Manchester, United Kingdom
- CRUK Manchester Institute, CRUK Lung Cancer Centre of Excellence, Manchester, United Kingdom
| | - Crispin J. Miller
- Computational Biology Group, CRUK Beatson Institute, Glasgow, United Kingdom
- Institute of Cancer Sciences, University of Glasgow, Glasgow, United Kingdom
|
3
|
Abstract
Today massive amounts of sequenced metagenomic and metatranscriptomic data from different ecological niches and environmental locations are available. Scientific progress depends critically on methods that allow extracting useful information from the various types of sequence data. Here, we will first discuss types of information contained in the various flavours of biological sequence data, and how this information can be interpreted to increase our scientific knowledge and understanding. We argue that a mechanistic understanding of biological systems analysed from different perspectives is required to consistently interpret experimental observations, and that this understanding is greatly facilitated by the generation and analysis of dynamic mathematical models. We conclude that, in order to construct mathematical models and to test mechanistic hypotheses, time-series data are of critical importance. We review diverse techniques to analyse time-series data and discuss various approaches by which time-series of biological sequence data have been successfully used to derive and test mechanistic hypotheses. Analysing the bottlenecks of current strategies in the extraction of knowledge and understanding from data, we conclude that combined experimental and theoretical efforts should be implemented as early as possible during the planning phase of individual experiments and scientific research projects. This article is part of the theme issue ‘Integrative research perspectives on marine conservation’.
Affiliation(s)
- Ovidiu Popa
- Institute of Quantitative and Theoretical Biology, CEPLAS, Heinrich-Heine University Düsseldorf, Germany
| | - Ellen Oldenburg
- Institute of Quantitative and Theoretical Biology, CEPLAS, Heinrich-Heine University Düsseldorf, Germany
| | - Oliver Ebenhöh
- Institute of Quantitative and Theoretical Biology, CEPLAS, Heinrich-Heine University Düsseldorf, Germany; Cluster of Excellence on Plant Sciences, CEPLAS, Heinrich-Heine University Düsseldorf, Germany
|
4
|
Náprstek J, Fischer C. Maximum Entropy Probability Density Principle in Probabilistic Investigations of Dynamic Systems. ENTROPY (BASEL, SWITZERLAND) 2018; 20:e20100790. [PMID: 33265878 PMCID: PMC7512353 DOI: 10.3390/e20100790] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/21/2018] [Revised: 10/11/2018] [Accepted: 10/11/2018] [Indexed: 06/12/2023]
Abstract
In this study, we consider a method for investigating the stochastic response of a nonlinear dynamical system affected by a random seismic process. We present the solution of the probability density of a single/multiple-degree of freedom (SDOF/MDOF) system with several statically stable equilibrium states and with possible jumps of the snap-through type. The system is a Hamiltonian system with weak damping excited by a system of non-stationary Gaussian white noise. The solution based on the Gibbs principle of the maximum entropy of probability could potentially be implemented in various branches of engineering. The search for the extreme of the Gibbs entropy functional is formulated as a constrained optimization problem. The secondary constraints follow from the Fokker-Planck equation (FPE) for the system considered or from the system of ordinary differential equations for the stochastic moments of the response derived from the relevant FPE. In terms of the application type, this strategy is most suitable for SDOF/MDOF systems containing polynomial type nonlinearities. Thus, the solution links up with the customary formulation of the finite elements discretization for strongly nonlinear continuous systems.
|
5
|
Corso G, Prado TDL, Lima GZDS, Kurths J, Lopes SR. Quantifying entropy using recurrence matrix microstates. CHAOS (WOODBURY, N.Y.) 2018; 28:083108. [PMID: 30180629 DOI: 10.1063/1.5042026] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/29/2018] [Accepted: 07/16/2018] [Indexed: 05/28/2023]
Abstract
We conceive a new recurrence quantifier for time series based on the concept of information entropy, in which the probabilities are associated with the presence of microstates defined on the recurrence matrix as small binary submatrices. The new methodology to compute the entropy of a time series has advantages compared to the traditional entropies defined in the literature, namely, a good correlation with the maximum Lyapunov exponent of the system and a weak dependence on the vicinity threshold parameter. Furthermore, the new method works adequately even for small segments of data, bringing consistent results for short and long time series. In a case where long time series are available, the new methodology can be employed to obtain high precision results since it does not demand large computational times related to the analysis of the entire time series or recurrence matrices, as is the case of other traditional entropy quantifiers. The method is applied to discrete and continuous systems.
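The idea of microstates on a recurrence matrix can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: the series is scalar, two points recur when they lie within `eps` of each other, and every contiguous `size` x `size` block of the recurrence matrix is enumerated rather than sampled. The function name and parameters are hypothetical.

```python
import math
from collections import Counter


def recurrence_microstate_entropy(series, eps, size=2):
    """Entropy (in nats) of the distribution of small binary submatrices.

    rec[i][j] is 1 when |series[i] - series[j]| <= eps. Each contiguous
    size-by-size block of rec is one microstate; exhaustive enumeration
    of the blocks is an assumption made for this sketch.
    """
    n = len(series)
    rec = [[1 if abs(a - b) <= eps else 0 for b in series] for a in series]
    counts = Counter()
    for i in range(n - size + 1):
        for j in range(n - size + 1):
            micro = tuple(rec[i + di][j + dj]
                          for di in range(size) for dj in range(size))
            counts[micro] += 1
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

With `size=2` there are at most 16 distinct microstates, so the entropy is bounded above by ln 16. A constant series produces a single microstate and an entropy of zero.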
Affiliation(s)
- Gilberto Corso
- Departamento de Biofísica e Farmacologia, Universidade Federal do Rio Grande do Norte, Natal 59078-970, Brazil
| | - Thiago de Lima Prado
- Instituto de Engenharia, Ciência e Tecnologia, Universidade Federal dos Vales do Jequitinhonha e Mucuri, Janaúba 39440-000, Brazil
| | | | - Jürgen Kurths
- Potsdam Institute for Climate Impact Research, Telegraphenberg A 31, 14473 Potsdam, Germany
| | - Sergio Roberto Lopes
- Potsdam Institute for Climate Impact Research, Telegraphenberg A 31, 14473 Potsdam, Germany
|
6
|
Pizzi C, Ornamenti M, Spangaro S, Rombo SE, Parida L. Efficient Algorithms for Sequence Analysis with Entropic Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:117-128. [PMID: 28113780 DOI: 10.1109/tcbb.2016.2620143] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 06/06/2023]
Abstract
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster, make it possible to analyse longer sequences, even for high degrees of resolution, than state-of-the-art algorithms.
|
7
|
Clustering of giant virus-DNA based on variations in local entropy. Viruses 2014; 6:2259-67. [PMID: 24887142 PMCID: PMC4074927 DOI: 10.3390/v6062259] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/05/2014] [Revised: 05/19/2014] [Accepted: 05/21/2014] [Indexed: 11/17/2022] Open
Abstract
We present a method for clustering genomic sequences based on variations in local entropy. We have analyzed the distributions of the block entropies of viruses and plant genomes. A distinct pattern for viruses and plant genomes is observed. These distributions, which describe the local entropic variability of the genomes, are used for clustering the genomes based on the Jensen-Shannon (JS) distances. The analysis of the JS distances between all genomes that infect the chlorella algae shows the host specificity of the viruses. We illustrate the efficacy of this entropy-based clustering technique by the segregation of plant and virus genomes into separate bins.
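The pipeline this abstract outlines (local entropies, their distribution, then a Jensen-Shannon distance between distributions) can be sketched roughly as follows. Several choices here are assumptions rather than details from the paper: local entropy is taken as single-symbol Shannon entropy over non-overlapping windows, and the JS distance is taken as the square root of the base-2 JS divergence. Function names are hypothetical.

```python
import math
from collections import Counter


def window_entropies(sequence, window):
    """Shannon entropy (bits) of the symbol composition of each non-overlapping window."""
    values = []
    for start in range(0, len(sequence) - window + 1, window):
        counts = Counter(sequence[start:start + window])
        values.append(-sum((c / window) * math.log2(c / window)
                           for c in counts.values()))
    return values


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions given as dicts.

    Computed as the square root of the base-2 JS divergence, so the
    result lies in [0, 1].
    """
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m
        return sum(v * math.log2(v / m[k]) for k, v in a.items() if v > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p) + 0.5 * kl(q)))
```

Binning the window entropies of each genome into a normalized histogram yields the distributions to compare; the resulting pairwise distance matrix can then be passed to any standard clustering routine.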
|
8
|
|
9
|
On the fractal geometry of DNA by the binary image analysis. Bull Math Biol 2013; 75:1544-70. [PMID: 23760660 DOI: 10.1007/s11538-013-9859-9] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 10/09/2012] [Accepted: 05/21/2013] [Indexed: 12/15/2022]
Abstract
The multifractal analysis of binary images of DNA is studied in order to define a methodological approach to the classification of DNA sequences. This method is based on the computation of some multifractality parameters on a suitable binary image of DNA, which takes into account the nucleotide distribution. The binary image of DNA is obtained by a dot-plot (recurrence plot) of the indicator matrix. The fractal geometry of these images is characterized by fractal dimension (FD), lacunarity, and succolarity. These parameters are compared with some other coefficients such as complexity and Shannon information entropy. It will be shown that the complexity parameters are more or less equivalent to FD, while the parameters of multifractality have different values in the sense that sequences with higher FD might have lower lacunarity and/or succolarity. In particular, the genome of Drosophila melanogaster has been considered by focusing on the chromosome 3r, which shows the highest fractality with a corresponding higher level of complexity. We will single out some results on the nucleotide distribution in 3r with respect to complexity and fractality. In particular, we will show that sequences with higher FD also have a higher frequency distribution of guanine, while low FD is characterized by the higher presence of adenine.
|
10
|
Royer L, Reimann M, Stewart AF, Schroeder M. Network compression as a quality measure for protein interaction networks. PLoS One 2012; 7:e35729. [PMID: 22719828 PMCID: PMC3377704 DOI: 10.1371/journal.pone.0035729] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 09/10/2011] [Accepted: 03/24/2012] [Indexed: 11/18/2022] Open
Abstract
With the advent of large-scale protein interaction studies, there is much debate about data quality. Can different noise levels in the measurements be assessed by analyzing network structure? Because proteomic regulation is inherently co-operative, modular and redundant, it is inherently compressible when represented as a network. Here we propose that network compression can be used to compare false positive and false negative noise levels in protein interaction networks. We validate this hypothesis by first confirming the detrimental effect of false positives and false negatives. Second, we show that gold standard networks are more compressible. Third, we show that compressibility correlates with co-expression, co-localization, and shared function. Fourth, we also observe correlation with better protein tagging methods, physiological expression in contrast to over-expression of tagged proteins, and smart pooling approaches for yeast two-hybrid screens. Overall, this new measure is a proxy for both sensitivity and specificity and gives complementary information to standard measures such as average degree and clustering coefficients.
Affiliation(s)
- Loic Royer
- Bioinformatics, Biotec TU Dresden, Dresden, Germany
|
11
|
Bielińska-Wąż D. Graphical and numerical representations of DNA sequences: statistical aspects of similarity. JOURNAL OF MATHEMATICAL CHEMISTRY 2011; 49:2345. [PMID: 32214591 PMCID: PMC7087963 DOI: 10.1007/s10910-011-9890-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 02/18/2011] [Accepted: 07/22/2011] [Indexed: 05/10/2023]
Abstract
New approaches aiming at a detailed similarity/dissimilarity analysis of DNA sequences are formulated. Several corrections that enrich the information which may be derived from the alignment methods are proposed. The corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). As a consequence, different aspects of similarity, as for example asymmetry of the gene structure, may be studied either using new similarity measures associated with four-component spectral representation of the DNA sequences or using alignment methods with corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of β-globin gene across species. The studies are supplemented by detailed similarity studies for histones H1 and H4 coding sequences. The data are described according to the latest version of the EMBL database. The work is supplemented by a concise review of the state-of-the-art graphical representations of DNA sequences.
Affiliation(s)
- Dorota Bielińska-Wąż
- Instytut Fizyki, Uniwersytet Mikołaja Kopernika, Grudziądzka 5, 87-100 Toruń, Poland
|
12
|
Bose R, Chouhan S. Alternate measure of information useful for DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 83:051918. [PMID: 21728582 DOI: 10.1103/physreve.83.051918] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 08/27/2010] [Revised: 03/07/2011] [Indexed: 05/31/2023]
Abstract
We propose an alternate measure of information, called superinformation, which has been found to be very effective for analyzing the coding and noncoding regions of the DNA. This superinformation is actually a measure of the "randomness of randomness." It has been found to be highly accurate in classifying coding and noncoding regions of human DNA. In the proposed method, no prior training is required. This technique exhibits higher accuracy than previously reported techniques in distinguishing between the coding and the noncoding portions of the DNA. Superinformation can also be used to analyze the untranslated regions in various genes.
Affiliation(s)
- Ranjan Bose
- Department of Electrical Engineering, IIT Delhi, Hauz Khas, New Delhi, India
|
13
|
Lesne A, Blanc JL, Pezard L. Entropy estimation of very short symbolic sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2009; 79:046208. [PMID: 19518313 DOI: 10.1103/physreve.79.046208] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 12/19/2008] [Indexed: 05/27/2023]
Abstract
While entropy per unit time is a meaningful index to quantify the dynamic features of experimental time series, its estimation is often hampered in practice by the finite length of the data. We here investigate the performance of entropy estimation procedures, relying either on block entropies or Lempel-Ziv complexity, when only very short symbolic sequences are available. Heuristic analytical arguments point at the influence of temporal correlations on the bias and statistical fluctuations, and put forward a reduced effective sequence length suitable for error estimation. Numerical studies are conducted using, as benchmarks, the wealth of different dynamic regimes generated by the family of logistic maps and stochastic evolutions generated by a Markov chain of tunable correlation time. Practical guidelines and validity criteria are proposed. For instance, block entropy leads to a dramatic overestimation for sequences of low entropy, whereas it outperforms Lempel-Ziv complexity at high entropy. As a general result, the quality of entropy estimation is sensitive to the sequence temporal correlation hence self-consistently depends on the entropy value itself, thus promoting a two-step procedure. Lempel-Ziv complexity is to be preferred in the first step and remains the best estimator for highly correlated sequences.
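The two estimator families compared in this abstract can be illustrated with a short sketch. It assumes the plug-in block entropy over overlapping length-L blocks and the Lempel-Ziv (1976) exhaustive-history parsing for the complexity count; the bias corrections and the two-step procedure the authors recommend are not reproduced. Function names are hypothetical.

```python
import math
from collections import Counter


def block_entropy(s, L):
    """Plug-in Shannon entropy (bits) of the overlapping length-L blocks of s."""
    blocks = [s[i:i + L] for i in range(len(s) - L + 1)]
    counts = Counter(blocks)
    n = len(blocks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def lz76_complexity(s):
    """Number of phrases in the Lempel-Ziv (1976) parsing of s.

    Each phrase is the shortest substring starting at position i that
    does not already occur in the text preceding its own last character.
    """
    i, phrases, n = 0, 0, len(s)
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases
```

For example, the binary string `0001101001000101` parses into the six phrases `0 | 001 | 10 | 100 | 1000 | 101`. With short sequences the number of distinct length-L blocks quickly becomes comparable to the number of observed blocks, which is one reason the plug-in block entropy becomes unreliable there.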
Affiliation(s)
- Annick Lesne
- Institut des Hautes Etudes Scientifiques, Le Bois-Marie, F-91440 Bures-sur-Yvette, France
|
14
|
Vinga S, Almeida JS. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 2007; 8:393. [PMID: 17939871 PMCID: PMC2238722 DOI: 10.1186/1471-2105-8-393] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 05/10/2007] [Accepted: 10/16/2007] [Indexed: 11/18/2022] Open
Abstract
Background: In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition therein explored was based on the Rényi entropy of probability density estimation (pdf) using Parzen's window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles using the new pdf estimations to refine the density estimation of motifs. Results: The new methodology enables two results. On the one hand, it shows that the entropic profiles are directly related to the statistical significance of motifs, allowing the study of under- and over-representation of segments. On the other hand, by spanning the parameters of the kernel function it is possible to extract important information about the scale of each conserved DNA region. The computational applications, developed in Matlab m-code, the corresponding binary executables and additional material and examples are made publicly available at . Conclusion: The ability to detect local conservation from a scale-independent representation of symbolic sequences is particularly relevant for biological applications where conserved motifs occur in multiple, overlapping scales, with significant future applications in the recognition of foreign genomic material and inference of motif structures.
Affiliation(s)
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal.
|
15
|
Vaillant C, Audit B, Thermes C, Arnéodo A. Formation and positioning of nucleosomes: effect of sequence-dependent long-range correlated structural disorder. THE EUROPEAN PHYSICAL JOURNAL. E, SOFT MATTER 2006; 19:263-77. [PMID: 16477390 DOI: 10.1140/epje/i2005-10053-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 08/11/2005] [Accepted: 01/20/2006] [Indexed: 05/06/2023]
Abstract
The understanding of the long-range correlations (LRC) observed in DNA sequences is still an open and very challenging problem. In this paper, we start by reviewing recent results obtained when exploring the scaling properties of eucaryotic, eubacterial and archaeal genomic sequences using the space-scale decomposition provided by the wavelet transform (WT). These results suggest that the existence of LRC up to distances of approximately 20-30 kbp is the signature of the nucleosomal structure and dynamics of the chromatin fiber. Actually, the LRC are mainly observed in the DNA bending profiles obtained when using some structural coding of the DNA sequences that accounts for the fluctuations of the local double-helix curvature within the nucleosome complex. Because of the approximate planarity of nucleosomal DNA loops, we then study the influence of the LRC structural disorder on the thermodynamical properties of 2D elastic chains submitted locally to mechanical/topological constraint as loops. The equilibrium properties of the one-loop system are derived numerically and analytically in the quite realistic weak-disorder limit. The LRC are shown to favor the spontaneous formation of small loops: the larger the LRC, the smaller the size of the loop. We further investigate the dynamical behavior of such a loop using the mean first passage time (MFPT) formalism. We show that the typical short-time loop dynamics is superdiffusive in the presence of LRC. For displacements larger than the loop size, we use large-deviation theory to derive a LRC-dependent anomalous-diffusion rule that accounts for the lack of disorder self-averaging. Potential biological implications on DNA loops involved in nucleosome positioning and dynamics in eucaryotic chromatin are discussed.
Affiliation(s)
- C Vaillant
- Institut Bernouilli, EPFL, 1015, Lausanne, Switzerland
|
16
|
Larsabal E, Danchin A. Genomes are covered with ubiquitous 11 bp periodic patterns, the "class A flexible patterns". BMC Bioinformatics 2005; 6:206. [PMID: 16120222 PMCID: PMC1242344 DOI: 10.1186/1471-2105-6-206] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 06/22/2005] [Accepted: 08/24/2005] [Indexed: 11/17/2022] Open
Abstract
Background: The genomes of prokaryotes and lower eukaryotes display a very strong 11 bp periodic bias in the distribution of their nucleotides. This bias is present throughout a given genome, both in coding and non-coding sequences. Until now this bias remained of unknown origin. Results: Using a technique for analysis of auto-correlations based on linear projection, we identified the sequences responsible for the bias. Prokaryotic and lower eukaryotic genomes are covered with ubiquitous patterns that we termed "class A flexible patterns". Each pattern is composed of up to ten conserved nucleotides or dinucleotides distributed into a discontinuous motif. Each occurrence spans a region up to 50 bp in length. They belong to what we named the "flexible pattern" type, in that there is some limited fluctuation in the distances between the nucleotides composing each occurrence of a given pattern. When taken together, these patterns cover up to half of the genome in the majority of prokaryotes. They generate the previously recognized 11 bp periodic bias. Conclusion: Judging from the structure of the patterns, we suggest that they may define a dense network of protein interaction sites in chromosomes.
Affiliation(s)
- Etienne Larsabal
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
| | - Antoine Danchin
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
|
17
|
Dehnert M, Helm WE, Hütt MT. Information theory reveals large-scale synchronisation of statistical correlations in eukaryote genomes. Gene 2005; 345:81-90. [PMID: 15716116 DOI: 10.1016/j.gene.2004.11.026] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Received: 09/15/2004] [Revised: 10/18/2004] [Accepted: 11/09/2004] [Indexed: 11/20/2022]
Abstract
We study short-range correlations in DNA sequences with methods from information theory and statistics. We find a persisting degree of identity between the correlation patterns of different chromosomes of a species. Except for the case of human and chimpanzee, inter-species differences in this correlation pattern allow robust species distinction: in a clustering tree based upon the correlation curves on the level of individual chromosomes, distinct clusters for the individual species are found. This capacity of distinguishing species persists, even when the length of the underlying sequences is drastically reduced. In comparison to the standard tool for studying symbol correlations in DNA sequences, namely the mutual information function, we find that an autoregressive model for higher order Markov processes significantly improves species distinction due to an implicit subtraction of random background.
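The mutual information function mentioned above as the standard tool can be sketched as follows: for a distance k, it is the mutual information between the symbol at position i and the symbol at position i + k, estimated from pair counts. This is a minimal plug-in estimate with no finite-size correction, and the function name is hypothetical. The autoregressive model the authors propose is not shown.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Plug-in mutual information (bits) between symbols k positions apart."""
    pairs = Counter(zip(seq, seq[k:]))  # counts of (symbol at i, symbol at i + k)
    n = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c   # marginal of the first symbol
        right[b] += c  # marginal of the second symbol
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in pairs.items()
    )
```

Evaluating this for k = 1, 2, 3, ... gives a correlation curve per chromosome. Curves of this kind are what a clustering tree would be built on.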
Affiliation(s)
- Manuel Dehnert
- Bioinformatics Group, Department of Biology, Darmstadt University of Technology, D-64287 Darmstadt, Germany
|
18
|
Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol 2004; 231:377-88. [PMID: 15501469 DOI: 10.1016/j.jtbi.2004.06.030] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Received: 05/21/2004] [Accepted: 06/30/2004] [Indexed: 11/20/2022]
Abstract
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Renyi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique to evaluate sequence global randomness without some of the former method's drawbacks. The asymptotic behaviour of this new measure was analytically deduced and the calculation of entropies for several synthetic and experimental biological sequences was performed. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences have shown a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provide additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
Affiliation(s)
- Susana Vinga
- Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, R. Qta. Grande 6, 2780-156 Oeiras, Portugal.
|
19
|
Audit B, Vaillant C, Arnéodo A, d'Aubenton-Carafa Y, Thermes C. Wavelet Analysis of DNA Bending Profiles reveals Structural Constraints on the Evolution of Genomic Sequences. J Biol Phys 2004; 30:33-81. [PMID: 23345861 PMCID: PMC3456503 DOI: 10.1023/b:jobp.0000016438.86794.8e] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Analyses of genomic DNA sequences have shown in previous works that base pairs are correlated at large distances with scale-invariant statistical properties. We show in the present study that these correlations between nucleotides (letters) result in fact from long-range correlations (LRC) between sequence-dependent DNA structural elements (words) involved in the packaging of DNA in chromatin. Using the wavelet transform technique, we perform a comparative analysis of the DNA text and of the corresponding bending profiles generated with curvature tables based on nucleosome positioning data. This exploration through the optics of the so-called "wavelet transform microscope" reveals a characteristic scale of 100-200 bp that separates two regimes of different LRC. We focus here on the existence of LRC in the small-scale regime (≲ 200 bp). Analysis of genomes in the three kingdoms reveals that this regime is specifically associated with the presence of nucleosomes. Indeed, small-scale LRC are observed in eukaryotic genomes and to a lesser extent in archaeal genomes, in contrast with their absence in eubacterial genomes. Similarly, this regime is observed in eukaryotic but not in bacterial viral DNA genomes. There is one exception for genomes of Poxviruses, the only animal DNA viruses that do not replicate in the cell nucleus and do not present small-scale LRC. Furthermore, no small-scale LRC are detected in the genomes of all examined RNA viruses, with one exception in the case of retroviruses. Altogether, these results strongly suggest that small-scale LRC are a signature of the nucleosomal structure. Finally, we discuss possible interpretations of these small-scale LRC in terms of the mechanisms that govern the positioning, the stability and the dynamics of the nucleosomes along the DNA chain.
This paper is mainly devoted to a pedagogical presentation of the theoretical concepts and physical methods which are well suited to perform a statistical analysis of genomic sequences. We review the results obtained with the so-called wavelet-based multifractal analysis when investigating the DNA sequences of various organisms in the three kingdoms. Some of these results have been announced in B. Audit et al. [1, 2].
Affiliation(s)
- Benjamin Audit
- Centre de Recherche Paul Pascal, avenue Schweitzer, 33600 Pessac, France
|
20
|
Nikolaou C, Almirantis Y. Mutually symmetric and complementary triplets: differences in their use distinguish systematically between coding and non-coding genomic sequences. J Theor Biol 2003; 223:477-87. [PMID: 12875825 DOI: 10.1016/s0022-5193(03)00123-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The general property of asymmetry in word use in meaningful texts written in a variety of languages motivates a quantification of the differences in the use of mutually symmetric triplets in genomic sequences. When this is done in the three reading frames, high values found for one of them are used as an indication that the sequence is coding for a protein. Moreover, a similar quantification of the differences in the use of complementary triplets is introduced, again with predictive power for the coding character of a sequence. This method reflects the non-equivalence between the sense and anti-sense strands of a coding segment. In both approaches, "linguistic asymmetry" in coding sequences is related to the form of the genetic code and to the bias in codon usage and amino acid use skews.
Affiliation(s)
- Christoforos Nikolaou
- National Research Center for Physical Sciences Demokritos, Institute of Biology, 15310 Athens, Greece
|
21
|
Fukushima A, Ikemura T, Kinouchi M, Oshima T, Kudo Y, Mori H, Kanaya S. Periodicity in prokaryotic and eukaryotic genomes identified by power spectrum analysis. Gene 2002; 300:203-11. [PMID: 12468102 DOI: 10.1016/s0378-1119(02)00850-8] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.2] [Indexed: 11/16/2022]
Abstract
We used a power spectrum method to identify periodic patterns in nucleotide sequence, and characterized nucleotide sequences that confer periodicities to prokaryotic and eukaryotic genomes. A 10-bp periodicity was prevalent in hyperthermophilic bacteria and archaebacteria, and an 11-bp periodicity was prevalent in eubacteria. The 10-bp periodicity was also prevalent in eukaryotes such as the worm Caenorhabditis elegans. Additionally, in the worm genome, a 68-bp periodicity in chromosome I, a 59-bp periodicity in chromosome II, and a 94-bp periodicity in chromosome III were found. In human chromosomes 21 and 22, approximately 167- or 84-bp periodicity was detected along the entire length of these chromosomes. Because the 167-bp is identical to the length of DNA that forms two complete helical turns in nucleosome organization, we speculated that the respective sequences may correspond to arrays of a special compact form of nucleosomes clustered in specific regions of the human chromosomes. This periodic element contained a high frequency of TGG. TGG-rich sequences are known to form a specific subset of folded DNA structures, and therefore, the sequences might have potential to form specific higher order structures related to the clustered occurrence of a specific form of the speculated nucleosomes.
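The abstract does not spell out the spectral estimator that was used, so the following Python sketch only illustrates the general idea of detecting a periodicity through a power spectrum. The function name `power_at_period`, the use of mean-subtracted base indicator sequences, and the normalisation by sequence length are all assumptions made for this example.

```python
import cmath


def power_at_period(seq: str, period: float, alphabet: str = "ACGT") -> float:
    """Spectral power of seq at frequency 1/period, summed over the
    indicator sequences of each base (hypothetical helper)."""
    n = len(seq)
    total = 0.0
    for base in alphabet:
        # Subtract the mean so that base composition alone adds no power.
        mean = seq.count(base) / n
        # Single-frequency discrete Fourier transform of the indicator sequence.
        s = sum(
            ((1.0 if ch == base else 0.0) - mean)
            * cmath.exp(-2j * cmath.pi * i / period)
            for i, ch in enumerate(seq)
        )
        total += abs(s) ** 2 / n
    return total
```

Scanning `period` over a range of values and looking for peaks is one simple way to expose a 10-bp or 11-bp signal; a full analysis would more likely use a fast Fourier transform over all frequencies at once.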
Affiliation(s)
- Atsushi Fukushima
- Graduate School of Biological Sciences, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara 630-0101, Japan
|
22
|
Anh VV, Lau KS, Yu ZG. Recognition of an organism from fragments of its complete genome. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2002; 66:031910. [PMID: 12366155 DOI: 10.1103/physreve.66.031910] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Received: 05/13/2002] [Revised: 06/13/2002] [Indexed: 05/23/2023]
Abstract
This paper considers the problem of matching a fragment to an organism using its complete genome. Our method is based on the probability measure representation of a genome. We first demonstrate that these probability measures can be modeled as recurrent iterated function systems (RIFS) consisting of four contractive similarities. Our hypothesis is that the multifractal characteristics of the probability measure of a complete genome, as captured by the RIFS, are preserved in its reasonably long fragments. We compute the RIFS of fragments of various lengths and random starting points, and compare them with that of the original sequence for recognition using the Euclidean distance. A demonstration on five randomly selected organisms supports the above hypothesis.
Affiliation(s)
- V V Anh
- Centre in Statistical Science and Industrial Mathematics, Queensland University of Technology, P. O. Box 2434, Brisbane Q4001, Australia.
|
23
|
Nikolaou C, Almirantis Y. A study of the middle-scale nucleotide clustering in DNA sequences of various origin and functionality, by means of a method based on a modified standard deviation. J Theor Biol 2002; 217:479-92. [PMID: 12234754 DOI: 10.1006/jtbi.2002.3045] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]
Abstract
The deviation from randomness in the distribution of nucleotides in genomic sequences is quantified and studied, using a modified standard deviation (MSD). This method implies a "per block" computation of the standard deviation of the nucleotide frequencies of occurrence, using local means (means taken in a neighborhood of each block). This quantity may serve as a scale-dependent measure of the nucleotide clustering. In the present work, the meso-scale of tens of nucleotides is principally explored, by means of suitably adjusted filter parameters. This length scale is of an order of magnitude not directly affected by the grammar and syntax rules of the protein-coding procedure, remaining shorter than the scale of appearance of large-scale characteristics of the genome. MSD has been found to distinguish systematically between sequences of different origin and functionality. The most near-random are found to be coding sequences of prokaryotes, while in intronic and intergenic regions of eukaryotic genomes, extended clustering of similar nucleotides is observed. The distributions of MSD values of large collections of sequences are found to be in most cases characteristic of their biological role and origin. Protein- and non-coding, prokaryotic and eukaryotic DNA as well as promoter, rRNA, viral and organelle sequences have been examined. The presented results corroborate a recently proposed model for genome evolution. The method is also applied for an assessment of the annotation of ORFs taken from the complete genome of Saccharomyces cerevisiae.
Affiliation(s)
- Christoforos Nikolaou
- Institute of Biology, National Research Center for Physical Sciences, "Demokritos" 15310, Athens, Greece
|
24
|
Holste D, Grosse I, Herzel H. Statistical analysis of the DNA sequence of human chromosome 22. PHYSICAL REVIEW E 2001; 64:041917. [PMID: 11690062 DOI: 10.1103/physreve.64.041917] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.1] [Received: 04/19/2001] [Indexed: 11/07/2022]
Abstract
We study statistical patterns in the DNA sequence of human chromosome 22, the first completely sequenced human chromosome. We find that (i) the 33.4 × 10^6 nucleotide long human chromosome exhibits long-range power-law correlations over more than four orders of magnitude, (ii) the entropies H(n) of the frequency distribution of oligonucleotides of length n (n-mers) grow sublinearly with increasing n, indicating the presence of higher-order correlations for all of the studied lengths 1 ≤ n ≤ 10, and (iii) the generalized entropies H(n)(q) of n-mers decrease monotonically with increasing q and the decay of H(n)(q) with q becomes steeper with increasing n ≤ 10, indicating that the frequency distribution of oligonucleotides becomes increasingly nonuniform as the length n increases. We investigate to what degree known biological features may explain the observed statistical patterns. We find that (iv) the presence of interspersed repeats may cause the sublinear increase of H(n) with n, and that (v) the presence of monomeric tandem repeats as well as the suppression of CG dinucleotides may cause the observed decay of H(n)(q) with q.
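To make the n-mer entropy H(n) concrete, here is a minimal Python sketch of a plug-in Shannon estimate over overlapping n-mers. The function name `block_entropy` is hypothetical, and the sketch applies none of the finite-sample corrections that a study of real chromosomes would need.

```python
from collections import Counter
from math import log2


def block_entropy(seq: str, n: int) -> float:
    """Shannon entropy (in bits) of the empirical distribution of
    overlapping n-mers in seq (hypothetical helper)."""
    if n < 1 or n > len(seq):
        raise ValueError("n must satisfy 1 <= n <= len(seq)")
    # Count every overlapping window of length n.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

For a sequence of independent, identically distributed symbols H(n) would grow linearly as n times H(1), so comparing `block_entropy(seq, n)` with `n * block_entropy(seq, 1)` is one simple way to see the sublinear growth that signals correlations.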
Affiliation(s)
- D Holste
- Department of Theoretical Biophysics, Humboldt University Berlin, Invalidenstrasse 42, D-10115, Berlin, Germany
|
25
|
Yu ZG, Anh V, Lau KS. Measure representation and multifractal analysis of complete genomes. PHYSICAL REVIEW E 2001; 64:031903. [PMID: 11580363 DOI: 10.1103/physreve.64.031903] [Citation(s) in RCA: 85] [Impact Index Per Article: 3.7] [Received: 10/31/2000] [Revised: 05/01/2001] [Indexed: 11/07/2022]
Abstract
This paper introduces the notion of measure representation of DNA sequences. Spectral analysis and multifractal analysis are then performed on the measure representations of a large number of complete genomes. The main aim of this paper is to discuss the multifractal property of the measure representation and the classification of bacteria. From the measure representations and the values of the D(q) spectra and related C(q) curves, it is concluded that these complete genomes are not random sequences. In fact, the spectral analyses performed indicate that these measure representations, considered as time series, exhibit strong long-range correlation. Here the long-range correlation is for the K-strings with dictionary ordering, and it is different from the base pair correlations introduced by other authors. For substrings with length K=8, the D(q) spectra of all organisms studied are multifractal-like and sufficiently smooth for the C(q) curves to be meaningful. With decreasing K, the multifractality lessens. The C(q) curves of all bacteria resemble a classical phase transition at a critical point. But the "analogous" phase transitions of chromosomes of nonbacterial organisms are different. Apart from chromosome 1 of C. elegans, they exhibit the shape of a double-peaked specific heat function. A classification of genomes of bacteria by assigning to each sequence a point in two-dimensional space (D(-1),D1) and in three-dimensional space (D(-1),D1,D(-2)) was given. Bacteria that are close phylogenetically are also close in the spaces (D(-1),D1) and (D(-1),D1,D(-2)).
Affiliation(s)
- Z G Yu
- Centre in Statistical Science and Industrial Mathematics, Queensland University of Technology, GPO Box 2434, Brisbane, Q 4001, Australia.
|
26
|
Yu ZG, Anh VV, Wang B. Correlation property of length sequences based on global structure of the complete genome. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2001; 63:011903. [PMID: 11304283 DOI: 10.1103/physreve.63.011903] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 06/26/2000] [Revised: 08/28/2000] [Indexed: 05/23/2023]
Abstract
This paper considers three kinds of length sequences of the complete genome. Detrended fluctuation analysis, spectral analysis, and the mean distance spanned within time L are used to discuss the correlation property of these sequences. The values of the exponents obtained with these methods for the three kinds of length sequences of bacteria indicate that long-range correlations exist in most of these sequences. The correlations have a rich variety of behaviors including the presence of anti-correlations. Furthermore, using the exponent gamma, it is found that these correlations are all linear (gamma = 1.0 ± 0.03). It is also found that these sequences exhibit 1/f noise in some interval of frequency (f>1). The length of this interval of frequency depends on the length of the sequence. The shape of the periodogram in f>1 exhibits some periodicity. The period seems to depend on the length and the complexity of the length sequence.
Affiliation(s)
- Z G Yu
- Centre in Statistical Science and Industrial Mathematics, Queensland University of Technology, Brisbane, Australia
|
27
|
Abstract
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimations of the entropy of the source are not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is in the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the redundancy observed. The findings are related to numerical and biochemical experiments with random polypeptides.
Affiliation(s)
- O Weiss
- Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstr. 43, Berlin, D-10115, Germany
|
28
|
Lobzin VV, Chechetkin VR. Order and correlations in genomic DNA sequences. The spectral approach. Uspekhi Fizicheskikh Nauk 2000; 170:57. [DOI: 10.3367/ufnr.0170.200001c.0057] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 11/01/2022]
|
29
|
Wackerbauer R, Schmidt T. Symbolic dynamics of jejunal motility in the irritable bowel. CHAOS (WOODBURY, N.Y.) 1999; 9:805-811. [PMID: 12779876 DOI: 10.1063/1.166454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 05/24/2023]
Abstract
Different studies of the irritable bowel syndrome (IBS) by conventional analysis of jejunal motility report conflicting results. Therefore, our aim is to quantify the jejunal contraction activity by symbolic dynamics in order to discriminate between IBS and control subjects. Contraction amplitudes during fasting motility (phase II) are analyzed for 30 IBS and 30 healthy subjects. On the basis of a particular scale-independent discretization of the contraction amplitudes with respect to the median, IBS patients are characterized by increased block entropy as well as increased mean contraction amplitude. At a further, more elementary level of analysis, these differences can be reduced to specific contraction patterns within the time series, namely the fact that successive large contraction amplitudes are less ordered in IBS than in controls. These significant differences in jejunal motility may point to an altered control of the gut in IBS, although further studies on a representative number of patients have to be done for a validation of these findings. (c) 1999 American Institute of Physics.
Affiliation(s)
- Renate Wackerbauer
- Max-Planck-Institute for Physics of Complex Systems, 01187-Dresden, Germany
|
30
|
Freund J, Ebeling W, Rateitschak K. Self-similar sequences and universal scaling of dynamical entropies. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 1996; 54:5561-5566. [PMID: 9965741 DOI: 10.1103/physreve.54.5561] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 05/22/2023]
|
31
|
Allegrini P, Barbi M, Grigolini P, West BJ. Dynamical model for DNA sequences. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 1995; 52:5281-5296. [PMID: 9964027 DOI: 10.1103/physreve.52.5281] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 05/22/2023]
|
32
|
Mantegna RN, Buldyrev SV, Goldberger AL, Havlin S, Peng CK, Simons M, Stanley HE. Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 1995; 52:2939-50. [PMID: 9963739 DOI: 10.1103/physreve.52.2939] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.0] [Indexed: 05/16/2023]
Abstract
We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that for the three chromosomes we studied the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior while the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of the zeroth- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
Affiliation(s)
- R N Mantegna
- Center for Polymer Studies and Department of Physics, Boston University, Massachusetts 02215, USA
|