1
|
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. BIOLOGY 2023; 12:849. [PMID: 37372134 PMCID: PMC10295253 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
Collapse
Affiliation(s)
- Pedro Bernaola-Galván
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain; (P.B.-G.); (P.C.)
| | - Pedro Carpena
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain; (P.B.-G.); (P.C.)
| | - Cristina Gómez-Martín
- Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands;
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
| | - Jose L. Oliver
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
| |
Collapse
|
2
|
Carpena P, Bernaola-Galván PA, Gómez-Extremera M, Coronado AV. Transforming Gaussian correlations. Applications to generating long-range power-law correlated time series with arbitrary distribution. CHAOS (WOODBURY, N.Y.) 2020; 30:083140. [PMID: 32872793 DOI: 10.1063/5.0013986] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/16/2020] [Accepted: 08/06/2020] [Indexed: 06/11/2023]
Abstract
The observable outputs of many complex dynamical systems consist of time series exhibiting autocorrelation functions of great diversity of behaviors, including long-range power-law autocorrelation functions, as a signature of interactions operating at many temporal or spatial scales. Often, numerical algorithms able to generate correlated noises reproducing the properties of real time series are used to study and characterize such systems. Typically, many of those algorithms produce a Gaussian time series. However, the real, experimentally observed time series are often non-Gaussian and may follow distributions with a diversity of behaviors concerning the support, the symmetry, or the tail properties. It is always possible to transform a correlated Gaussian time series into a time series with a different marginal distribution, but the question is how this transformation affects the behavior of the autocorrelation function. Here, we study analytically and numerically how the Pearson's correlation of two Gaussian variables changes when the variables are transformed to follow a different destination distribution. Specifically, we consider bounded and unbounded distributions, symmetric and non-symmetric distributions, and distributions with different tail properties from decays faster than exponential to heavy-tail cases including power laws, and we find how these properties affect the correlation of the final variables. We extend these results to a Gaussian time series, which are transformed to have a different marginal distribution, and show how the autocorrelation function of the final non-Gaussian time series depends on the Gaussian correlations and on the final marginal distribution. As an application of our results, we propose how to generalize standard algorithms producing a Gaussian power-law correlated time series in order to create a synthetic time series with an arbitrary distribution and controlled power-law correlations. Finally, we show a practical example of this algorithm by generating time series mimicking the marginal distribution and the power-law tail of the autocorrelation function of real time series: the absolute returns of stock prices.
Collapse
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071 Málaga, Spain
| | - Pedro A Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071 Málaga, Spain
| | - Manuel Gómez-Extremera
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071 Málaga, Spain
| | - Ana V Coronado
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
3
|
Gómez-Martín C, Lebrón R, Oliver JL, Hackenberg M. Prediction of CpG Islands as an Intrinsic Clustering Property Found in Many Eukaryotic DNA Sequences and Its Relation to DNA Methylation. Methods Mol Biol 2019; 1766:31-47. [PMID: 29605846 DOI: 10.1007/978-1-4939-7768-0_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
The promoter region of around 70% of all genes in the human genome is overlapped by a CpG island (CGI). CGIs have known functions in the transcription initiation and outstanding compositional features like high G+C content and CpG ratios when compared to the bulk DNA. We have shown before that CGIs manifest as clusters of CpGs in mammalian genomes and can therefore be detected using clustering methods. These techniques have several advantages over sliding window approaches which apply compositional properties as thresholds. In this protocol we show how to determine local (CpG islands) and global (distance distribution) clustering properties of CG dinucleotides and how to generalize this analysis to any k-mer or combinations of it. In addition, we illustrate how to easily cross the output of a CpG island prediction algorithm with our methylation database to detect differentially methylated CGIs. The analysis is given in a step-by-step protocol and all necessary programs are implemented into a virtual machine or, alternatively, the software can be downloaded and easily installed.
Collapse
Affiliation(s)
- Cristina Gómez-Martín
- Department of Genetics, Faculty of Science, University of Granada, Granada, Spain.,Lab. de Bioinformática, Centro de Investigación Biomédica, PTS, Instituto de Biotecnología, Granada, Spain
| | - Ricardo Lebrón
- Department of Genetics, Faculty of Science, University of Granada, Granada, Spain.,Lab. de Bioinformática, Centro de Investigación Biomédica, PTS, Instituto de Biotecnología, Granada, Spain
| | - José L Oliver
- Department of Genetics, Faculty of Science, University of Granada, Granada, Spain.,Lab. de Bioinformática, Centro de Investigación Biomédica, PTS, Instituto de Biotecnología, Granada, Spain
| | - Michael Hackenberg
- Department of Genetics, Faculty of Science, University of Granada, Granada, Spain. .,Lab. de Bioinformática, Centro de Investigación Biomédica, PTS, Instituto de Biotecnología, Granada, Spain.
| |
Collapse
|
4
|
Lebrón R, Gómez-Martín C, Carpena P, Bernaola-Galván P, Barturen G, Hackenberg M, Oliver JL. NGSmethDB 2017: enhanced methylomes and differential methylation. Nucleic Acids Res 2017; 45:D97-D103. [PMID: 27794041 PMCID: PMC5210667 DOI: 10.1093/nar/gkw996] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2016] [Revised: 10/08/2016] [Accepted: 10/14/2016] [Indexed: 12/27/2022] Open
Abstract
The 2017 update of NGSmethDB stores whole genome methylomes generated from short-read data sets obtained by bisulfite sequencing (WGBS) technology. To generate high-quality methylomes, stringent quality controls were integrated with third-part software, adding also a two-step mapping process to exploit the advantages of the new genome assembly models. The samples were all profiled under constant parameter settings, thus enabling comparative downstream analyses. Besides a significant increase in the number of samples, NGSmethDB now includes two additional data-types, which are a valuable resource for the discovery of methylation epigenetic biomarkers: (i) differentially methylated single-cytosines; and (ii) methylation segments (i.e. genome regions of homogeneous methylation). The NGSmethDB back-end is now based on MongoDB, a NoSQL hierarchical database using JSON-formatted documents and dynamic schemas, thus accelerating sample comparative analyses. Besides conventional database dumps, track hubs were implemented, which improved database access, visualization in genome browsers and comparative analyses to third-part annotations. In addition, the database can be also accessed through a RESTful API. Lastly, a Python client and a multiplatform virtual machine allow for program-driven access from user desktop. This way, private methylation data can be compared to NGSmethDB without the need to upload them to public servers. Database website: http://bioinfo2.ugr.es/NGSmethDB.
Collapse
Affiliation(s)
- Ricardo Lebrón
- Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
- Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
| | - Cristina Gómez-Martín
- Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
- Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
| | - Pedro Carpena
- Department of Applied Physics II, Universidad de Málaga, 29071 Málaga, Spain
| | | | - Guillermo Barturen
- Genetics of Complex Diseases Group, GENyO, Pfizer-University of Granada-Junta de Andalucía Center for Genomics and Oncological Research, 18100-Granada, Spain
| | - Michael Hackenberg
- Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
- Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
| | - José L Oliver
- Department of Genetics, Faculty of Science, University of Granada, Campus de Fuentenueva s/n, 18071-Granada, Spain
- Laboratory of Bioinformatics, Centro de Investigación Biomédica, PTS, Avda. del Conocimiento s/n, 18100-Granada, Spain
| |
Collapse
|
5
|
Carpena P, Bernaola-Galván PA, Carretero-Campos C, Coronado AV. Probability distribution of intersymbol distances in random symbolic sequences: Applications to improving detection of keywords in texts and of amino acid clustering in proteins. Phys Rev E 2016; 94:052302. [PMID: 27967154 DOI: 10.1103/physreve.94.052302] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2016] [Indexed: 11/07/2022]
Abstract
Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
Collapse
Affiliation(s)
- Pedro Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Pedro A Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Concepción Carretero-Campos
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| | - Ana V Coronado
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, 29071, Málaga, Spain
| |
Collapse
|
6
|
Provata A, Nicolis C, Nicolis G. Complexity measures for the evolutionary categorization of organisms. Comput Biol Chem 2014; 53 Pt A:5-14. [PMID: 25216557 DOI: 10.1016/j.compbiolchem.2014.08.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2014] [Indexed: 01/17/2023]
Abstract
Complexity measures are used to compare the genomic characteristics of five organisms belonging to distinct classes spanning the evolutionary tree: higher eukaryotes, amoebae, unicellular eukaryotes and bacteria. The comparisons are undertaken using the full four-letter alphabet and the coarse grained two-letter alphabets AG-CT and AT-CG. We show that the conditional probability matrix for the four-letter and AT-CG alphabet is markedly asymmetric in eukaryotes while it is nearly symmetric in bacterial genomes. Spatial asymmetry is revealed in the four-letter alphabet, signifying that the probability fluxes are nonvanishing and thus the reading sense of a sequence is irreversible for all organisms. Calculations of the block entropy and excess entropy demonstrate that the human genome accommodates better all possible block configurations, especially for long blocks. With respect to point-to-point details and to spatial arrangement of blocks the exit distance distributions from a particular letter demonstrate long distance characteristics in the eukaryotic sequences for all three alphabets, while the bacterial (prokaryotic) genomes deviate indicating short range characteristics. Overall, the conditional probability, the fluxes, the block entropy content and the exit distance distributions can be used as markers, discriminating between eukaryotic and prokaryotic DNA, allowing in many cases to discern details related to finer classes. In all cases the reduction from four letters to two masks some important statistical and spatial properties, with the AT-CG alphabet having higher ability of discrimination than the AG-CT one. In particular, the AT-CG alphabet reduction accentuates the CpG related properties (conditional probabilities w32, long ranged exit distance distribution for A and T nucleotides), but masks sequence asymmetry and irreversibility in all examined organisms.
Collapse
Affiliation(s)
- A Provata
- Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15310 Athens, Greece.
| | - C Nicolis
- Institut Royal Météorogique de Belgique, 3 Avenue Circulaire, 1180 Bruxelles, Belgium.
| | - G Nicolis
- Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, C.P. 231, 1050 Bruxelles, Belgium.
| |
Collapse
|
7
|
Dios F, Barturen G, Lebrón R, Rueda A, Hackenberg M, Oliver JL. DNA clustering and genome complexity. Comput Biol Chem 2014; 53 Pt A:71-8. [PMID: 25182383 DOI: 10.1016/j.compbiolchem.2014.08.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2014] [Indexed: 01/08/2023]
Abstract
Early global measures of genome complexity (power spectra, the analysis of fluctuations in DNA walks or compositional segmentation) uncovered a high degree of complexity in eukaryotic genome sequences. The main evolutionary mechanisms leading to increases in genome complexity (i.e. gene duplication and transposon proliferation) can all potentially produce increases in DNA clustering. To quantify such clustering and provide a genome-wide description of the formed clusters, we developed GenomeCluster, an algorithm able to detect clusters of whatever genome element identified by chromosome coordinates. We obtained a detailed description of clusters for ten categories of human genome elements, including functional (genes, exons, introns), regulatory (CpG islands, TFBSs, enhancers), variant (SNPs) and repeat (Alus, LINE1) elements, as well as DNase hypersensitivity sites. For each category, we located their clusters in the human genome, then quantifying cluster length and composition, and estimated the clustering level as the proportion of clustered genome elements. In average, we found a 27% of elements in clusters, although a considerable variation occurs among different categories. Genes form the lowest number of clusters, but these are the longest ones, both in bp and the average number of components, while the shortest clusters are formed by SNPs. Functional and regulatory elements (genes, CpG islands, TFBSs, enhancers) show the highest clustering level, as compared to DNase sites, repeats (Alus, LINE1) or SNPs. Many of the genome elements we analyzed are known to be composed of clusters of low-level entities. In addition, we found here that the clusters generated by GenomeCluster can be in turn clustered into high-level super-clusters. The observation of 'clusters-within-clusters' parallels the 'domains within domains' phenomenon previously detected through global statistical methods in eukaryotic sequences, and reveals a complex human genome landscape dominated by hierarchical clustering.
Collapse
Affiliation(s)
- Francisco Dios
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
| | - Guillermo Barturen
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
| | - Ricardo Lebrón
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
| | - Antonio Rueda
- Plataforma Andaluza de Genómica y Bioinformática (GBPA), Edificio INSUR, Calle Albert Einstein, 41092 Sevilla, Spain
| | - Michael Hackenberg
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain
| | - José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, 18071 Granada, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, 18100 Granada, Spain.
| |
Collapse
|
8
|
Provata A, Nicolis C, Nicolis G. DNA viewed as an out-of-equilibrium structure. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2014; 89:052105. [PMID: 25353737 DOI: 10.1103/physreve.89.052105] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/14/2013] [Indexed: 05/02/2023]
Abstract
The complexity of the primary structure of human DNA is explored using methods from nonequilibrium statistical mechanics, dynamical systems theory, and information theory. A collection of statistical analyses is performed on the DNA data and the results are compared with sequences derived from different stochastic processes. The use of χ^{2} tests shows that DNA can not be described as a low order Markov chain of order up to r=6. Although detailed balance seems to hold at the level of a binary alphabet, it fails when all four base pairs are considered, suggesting spatial asymmetry and irreversibility. Furthermore, the block entropy does not increase linearly with the block size, reflecting the long-range nature of the correlations in the human genomic sequences. To probe locally the spatial structure of the chain, we study the exit distances from a specific symbol, the distribution of recurrence distances, and the Hurst exponent, all of which show power law tails and long-range characteristics. These results suggest that human DNA can be viewed as a nonequilibrium structure maintained in its state through interactions with a constantly changing environment. Based solely on the exit distance distribution accounting for the nonequilibrium statistics and using the Monte Carlo rejection sampling method, we construct a model DNA sequence. This method allows us to keep both long- and short-range statistical characteristics of the native DNA data. The model sequence presents the same characteristic exponents as the natural DNA but fails to capture spatial correlations and point-to-point details.
Collapse
Affiliation(s)
- A Provata
- Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15310 Athens, Greece and Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, CP. 231, 1050 Bruxelles, Belgium
| | - C Nicolis
- Institut Royal Météorologique de Belgique, 3 Avenue Circulaire, 1180 Bruxelles, Belgium
| | - G Nicolis
- Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, CP. 231, 1050 Bruxelles, Belgium
| |
Collapse
|
9
|
Ré MA, Azad RK. Generalization of entropy based divergence measures for symbolic sequence analysis. PLoS One 2014; 9:e93532. [PMID: 24728338 PMCID: PMC3984095 DOI: 10.1371/journal.pone.0093532] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2013] [Accepted: 03/04/2014] [Indexed: 11/26/2022] Open
Abstract
Entropy based measures have been frequently used in symbolic sequence analysis. A symmetrized and smoothed form of Kullback-Leibler divergence or relative entropy, the Jensen-Shannon divergence (JSD), is of particular interest because of its sharing properties with families of other divergence measures and its interpretability in different domains including statistical physics, information theory and mathematical statistics. The uniqueness and versatility of this measure arise because of a number of attributes including generalization to any number of probability distributions and association of weights to the distributions. Furthermore, its entropic formulation allows its generalization in different statistical frameworks, such as, non-extensive Tsallis statistics and higher order Markovian statistics. We revisit these generalizations and propose a new generalization of JSD in the integrated Tsallis and Markovian statistical framework. We show that this generalization can be interpreted in terms of mutual information. We also investigate the performance of different JSD generalizations in deconstructing chimeric DNA sequences assembled from bacterial genomes including that of E. coli, S. enterica typhi, Y. pestis and H. influenzae. Our results show that the JSD generalizations bring in more pronounced improvements when the sequences being compared are from phylogenetically proximal organisms, which are often difficult to distinguish because of their compositional similarity. While small but noticeable improvements were observed with the Tsallis statistical JSD generalization, relatively large improvements were observed with the Markovian generalization. In contrast, the proposed Tsallis-Markovian generalization yielded more pronounced improvements relative to the Tsallis and Markovian generalizations, specifically when the sequences being compared arose from phylogenetically proximal organisms.
Collapse
Affiliation(s)
- Miguel A. Ré
- Departamento de Ciencias Básicas, CIII - Facultad Regional Córdoba, Universidad Tecnológica Nacional, Córdoba, Argentina
- Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Córdoba, Argentina
| | - Rajeev K. Azad
- Department of Biological Sciences, University of North Texas, Denton, Texas, United States of America
- Department of Mathematics, University of North Texas, Denton, Texas, United States of America
- * E-mail:
| |
Collapse
|
10
|
Implications of human genome structural heterogeneity: functionally related genes tend to reside in organizationally similar genomic regions. BMC Genomics 2014; 15:252. [PMID: 24684786 PMCID: PMC4234528 DOI: 10.1186/1471-2164-15-252] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2012] [Accepted: 03/21/2014] [Indexed: 01/30/2023] Open
Abstract
Background In an earlier study, we hypothesized that genomic segments with different sequence
organization patterns (OPs) might display functional specificity despite their
similar GC content. Here we tested this hypothesis by dividing the human genome
into 100 kb segments, classifying these segments into five compositional
groups according to GC content, and then characterizing each segment within the
five groups by oligonucleotide counting (k-mer analysis; also referred to as
compositional spectrum analysis, or CSA), to examine the distribution of sequence
OPs in the segments. We performed the CSA on the entire DNA, i.e., its coding and
non-coding parts the latter being much more abundant in the genome than the
former. Results We identified 38 OP-type clusters of segments that differ in their compositional
spectrum (CS) organization. Many of the segments that shared the same OP type were
enriched with genes related to the same biological processes (developmental,
signaling, etc.), components of biochemical complexes, or organelles. Thirteen
OP-type clusters showed significant enrichment in genes connected to specific
gene-ontology terms. Some of these clusters seemed to reflect certain events
during periods of horizontal gene transfer and genome expansion, and subsequent
evolution of genomic regions requiring coordinated regulation. Conclusions There may be a tendency for genes that are involved in the same biological
process, complex or organelle to use the same OP, even at a distance of ~
100 kb from the genes. Although the intergenic DNA is non-coding, the general
pattern of sequence organization (e.g., reflected in over-represented
oligonucleotide “words”) may be important and were protected, to some
extent, in the course of evolution.
Collapse
|
11
|
Palivonaite R, Lukoseviciute K, Ragulskis M. Algebraic segmentation of short nonstationary time series based on evolutionary prediction algorithms. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2013.05.013] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
12
|
Bernaola-Galván P, Oliver J, Hackenberg M, Coronado A, Ivanov P, Carpena P. Segmentation of time series with long-range fractal correlations. THE EUROPEAN PHYSICAL JOURNAL. B 2012; 85:211. [PMID: 23645997 PMCID: PMC3643524 DOI: 10.1140/epjb/e2012-20969-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
Collapse
Affiliation(s)
| | - J.L. Oliver
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
| | - M. Hackenberg
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
| | - A.V. Coronado
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| | - P.Ch. Ivanov
- Harvard Medical School, Division of Sleep Medicine, Brigham & Women’s Hospital, 02115 Boston, MA, USA
- Department of Physics and Center for Polymer Studies, Boston University, 2215 Boston, MA, USA
- Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
| | - P. Carpena
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
13
|
Clustering of DNA words and biological function: A proof of principle. J Theor Biol 2012; 297:127-36. [DOI: 10.1016/j.jtbi.2011.12.024] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2011] [Revised: 12/20/2011] [Accepted: 12/21/2011] [Indexed: 02/08/2023]
|
14
|
Carretero-Campos C, Bernaola-Galván P, Ch. Ivanov P, Carpena P. Phase transitions in the first-passage time of scale-invariant correlated processes. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2012; 85:011139. [PMID: 22400544 PMCID: PMC3518899 DOI: 10.1103/physreve.85.011139] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/29/2011] [Revised: 11/30/2011] [Indexed: 05/31/2023]
Abstract
A key quantity describing the dynamics of complex systems is the first-passage time (FPT). The statistical properties of FPT depend on the specifics of the underlying system dynamics. We present a unified approach to account for the diversity of statistical behaviors of FPT observed in real-world systems. We find three distinct regimes, separated by two transition points, with fundamentally different behavior for FPT as a function of increasing strength of the correlations in the system dynamics: stretched exponential, power-law, and saturation regimes. In the saturation regime, the average length of FPT diverges proportionally to the system size, with important implications for understanding electronic delocalization in one-dimensional correlated-disordered systems.
Collapse
Affiliation(s)
| | | | - Plamen Ch. Ivanov
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02212, USA
- Harvard Medical School and Division of Sleep Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA
- Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
| | - Pedro Carpena
- Departamento de Física Aplicada II, Universidad de Málaga, E-29071 Málaga, Spain
| |
Collapse
|
15
|
On parameters of the human genome. J Theor Biol 2011; 288:92-104. [DOI: 10.1016/j.jtbi.2011.07.021] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2011] [Revised: 06/28/2011] [Accepted: 07/21/2011] [Indexed: 02/06/2023]
|
16
|
Camargo S, Duarte Queirós SM, Anteneodo C. Nonparametric segmentation of nonstationary time series. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 84:046702. [PMID: 22181302 DOI: 10.1103/physreve.84.046702] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/29/2011] [Revised: 08/09/2011] [Indexed: 05/31/2023]
Abstract
The nonstationary evolution of observable quantities in complex systems can frequently be described as a juxtaposition of quasistationary spells. Given that standard theoretical and data analysis approaches usually rely on the assumption of stationarity, it is important to detect in real time series intervals holding that property. With that aim, we introduce a segmentation algorithm based on a fully nonparametric approach. We illustrate its applicability through the analysis of real time series presenting diverse degrees of nonstationarity, thus showing that this segmentation procedure generalizes and allows one to uncover features unresolved by previous proposals based on the discrepancy of low order statistical moments only.
Collapse
Affiliation(s)
- S Camargo
- Departamento de Física, PUC-Rio, Rio de Janeiro, Brazil
| | | | | |
Collapse
|