1
Moya A, Oliver JL, Verdú M, Delaye L, Arnau V, Bernaola-Galván P, de la Fuente R, Díaz W, Gómez-Martín C, González FM, Latorre A, Lebrón R, Román-Roldán R. Driven progressive evolution of genome sequence complexity in Cyanobacteria. Sci Rep 2020; 10:19073. [PMID: 33149190 PMCID: PMC7643063 DOI: 10.1038/s41598-020-76014-4] [Received: 05/09/2020] [Accepted: 10/22/2020]
Abstract
Progressive evolution, or the tendency towards increasing complexity, is a controversial issue in biology, whose resolution entails a proper measurement of complexity. Genomes are the best entities to address this challenge, as they encode the historical information of a species' biotic and environmental interactions. As a case study, we have measured genome sequence complexity in the ancient phylum Cyanobacteria. To arrive at an appropriate measure of genome sequence complexity, we have chosen metrics that do not decipher biological functionality but that show strong phylogenetic signal. Using a ridge regression of those metrics against root-to-tip distance, we detected positive trends towards higher complexity in three of them. Lastly, we applied three standard tests to detect whether progressive evolution is passive or driven: the minimum, ancestor-descendant, and sub-clade tests. These results provide evidence for driven progressive evolution at the genome level in the phylum Cyanobacteria.
Affiliation(s)
- Andrés Moya
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain.
  - Foundation for the Promotion of Sanitary and Biomedical Research of Valencian Community (FISABIO), 46020, Valencia, Spain.
  - CIBER in Epidemiology and Public Health, 28029, Madrid, Spain.
- José L Oliver
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Miguel Verdú
  - Centro de Investigaciones sobre Desertificación, Consejo Superior de Investigaciones Científicas (CSIC), University of València and Generalitat Valenciana, 46113, Valencia, Spain
- Luis Delaye
  - Department of Genetic Engineering, CINVESTAV, 36821, Irapuato, Mexico
- Vicente Arnau
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
- Pedro Bernaola-Galván
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071, Málaga, Spain
- Rebeca de la Fuente
  - Institute for Cross-Disciplinary Physics and Complex Systems (IFISC), Consejo Superior de Investigaciones Científicas (CSIC) and University of Balearic Islands, 07122, Palma de Mallorca, Spain
- Wladimiro Díaz
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
- Cristina Gómez-Martín
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Amparo Latorre
  - Institute of Integrative Systems Biology (I2Sysbio), University of València and Consejo Superior de Investigaciones Científicas (CSIC), 46980, Valencia, Spain
  - Foundation for the Promotion of Sanitary and Biomedical Research of Valencian Community (FISABIO), 46020, Valencia, Spain
  - CIBER in Epidemiology and Public Health, 28029, Madrid, Spain
- Ricardo Lebrón
  - Department of Genetics, Faculty of Sciences, University of Granada, 18071, Granada, Spain
  - Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, 18100, Granada, Spain
- Ramón Román-Roldán
  - Department of Applied Physics, University of Granada, 18071, Granada, Spain
2
Luque-Escamilla PL, Martínez-Aroza J, Oliver JL, Gómez-Lopera JF, Román-Roldán R. Compositional searching of CpG islands in the human genome. Phys Rev E Stat Nonlin Soft Matter Phys 2005; 71:061925. [PMID: 16089783 DOI: 10.1103/physreve.71.061925] [Received: 10/21/2004] [Revised: 01/31/2005]
Abstract
We report on an entropic edge detector based on the local calculation of the Jensen-Shannon divergence with application to the search for CpG islands. CpG islands are pieces of the genome related to gene expression and cell differentiation, and thus to cancer formation. Searching for these CpG islands is a major task in genetics and bioinformatics. Some algorithms have been proposed in the literature, based on moving statistics in a sliding window, but the window size may greatly influence the results. The local use of Jensen-Shannon divergence is a completely different strategy: the nucleotide composition inside the islands is different from that in their environment, so a statistical distance (the Jensen-Shannon divergence) between the composition of two adjacent windows may be used as a measure of their dissimilarity. Sliding this double window over the entire sequence allows us to segment it compositionally. The fusion of those segments into greater ones that satisfy certain identification criteria must be achieved in order to obtain the definitive results. We find that the local use of Jensen-Shannon divergence is very suitable in processing DNA sequences for searching for compositionally different structures such as CpG islands, compared with other algorithms in the literature.
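The double-window idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed window width `w`, and the use of single-nucleotide composition are assumptions, and the later fusion of segments under identification criteria is not covered.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol composition of seq."""
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in Counter(seq).values())

def jsd_two_windows(left, right):
    """Length-weighted Jensen-Shannon divergence between two adjacent windows."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    return (shannon_entropy(left + right)
            - (n_l / n) * shannon_entropy(left)
            - (n_r / n) * shannon_entropy(right))

def jsd_profile(seq, w):
    """Slide a double window of half-width w (hypothetical parameter) along seq.

    Element k of the result compares seq[k:k+w] with seq[k+w:k+2w],
    i.e. it scores a candidate edge at position k + w.
    """
    return [jsd_two_windows(seq[i - w:i], seq[i:i + w])
            for i in range(w, len(seq) - w + 1)]

# Toy sequence with one compositional edge at position 20.
profile = jsd_profile("A" * 20 + "G" * 20, w=10)
edge = max(range(len(profile)), key=profile.__getitem__) + 10
```

Peaks in the profile mark positions where the composition to the left and right differs most, which is where candidate segment borders would be placed.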
Affiliation(s)
- Pedro Luis Luque-Escamilla
- Department of Engineering and Mining Mechanics, University of Jaén, Escuela Politécnica Superior, Campus Las Lagunillas s/n, 23071 Jaén, Spain
3
Abstract
We present a coding measure which is based on the statistical properties of the stop codons, and that is able to estimate accurately the variation of coding content along an anonymous sequence. As the stop codons play the same role in all the genomes (with very few exceptions), the measure turns out to be species-independent. We show results both for prokaryotic and for eukaryotic genomes, indicating, first, the accuracy of the measure, and, second, that better prediction is achieved if the measure is applied on homogeneous, isochore-like sequences than if it is applied following the standard moving window approach. Finally, we discuss some of the possible applications of the measure.
Affiliation(s)
- P Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, Málaga, Spain.
4
Abstract
Here we present a study of statistical correlations among different positions in DNA sequences and their implications by directly using the autocorrelation function. Such an analysis is possible now because of the availability of large sequences or even complete genomes of many organisms. After describing the way in which the autocorrelation function can be applied to DNA-sequence analysis, we show that long-range correlations, implying scale independence, appear in several bacterial genomes as well as in long human chromosome contigs. The source for such correlations in bacteria, which may extend up to 60 kb in Bacillus subtilis, may be related to massive lateral transfer of compositionally biased genes from other genomes. In the human genome, correlations extend for more than five decades and may be related to the evolution of the 'neogenome', a modern evolutionary acquisition composed of GC-rich isochores displaying long-range correlations and scale invariance.
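As a rough illustration of how an autocorrelation function can be applied to a DNA sequence, the sketch below maps each base to a binary indicator and estimates the correlation at a given lag. The G+C indicator mapping and the particular estimator are assumptions chosen for simplicity; the abstract does not specify which numerical encoding the authors used.

```python
def gc_indicator(seq):
    """Map a DNA string to 1.0 for G or C and 0.0 otherwise (assumed encoding)."""
    return [1.0 if base in "GC" else 0.0 for base in seq.upper()]

def autocorrelation(x, lag):
    """Estimate C(lag) = <x_i * x_(i+lag)> - <x>^2 for a numeric series x."""
    n_pairs = len(x) - lag
    if n_pairs <= 0:
        raise ValueError("lag must be smaller than the series length")
    mean = sum(x) / len(x)
    pair_mean = sum(x[i] * x[i + lag] for i in range(n_pairs)) / n_pairs
    return pair_mean - mean * mean

# A strictly periodic toy sequence: correlation alternates in sign with the lag.
signal = gc_indicator("GA" * 50)
c1 = autocorrelation(signal, 1)
c2 = autocorrelation(signal, 2)
```

For a real genome one would evaluate the function over many lags and inspect how slowly it decays; a power-law decay over several decades is what the abstract refers to as long-range correlation.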
Affiliation(s)
- P Bernaola-Galván
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, Málaga, Spain.
5
Abstract
The human genome is a mosaic of isochores, which are long DNA segments (>>300 kbp) relatively homogeneous in G+C. Human isochores were first identified by density-gradient ultracentrifugation of bulk DNA, and differ in important features, e.g. genes are found predominantly in the GC-richest isochores. Here, we use a reliable segmentation method to partition the longest contigs in the human genome draft sequence into long homogeneous genome regions (LHGRs), thereby revealing the isochore structure of the human genome. The advantages of the isochore maps presented here are: (1) sequence heterogeneities at different scales are shown in the same plot; (2) pair-wise compositional differences between adjacent regions are all statistically significant; (3) isochore boundaries are accurately defined to single base pair resolution; and (4) both gradual and abrupt isochore boundaries are simultaneously revealed. Taking advantage of the wide sample of genome sequence analyzed, we investigate the correspondence between LHGRs and true human isochores revealed through DNA centrifugation. LHGRs show many of the typical isochore features, mainly size distribution, G+C range, and proportions of the isochore classes. The relative density of genes, Alu and long interspersed nuclear element repeats and the different types of single nucleotide polymorphisms on LHGRs also coincide with expectations in true isochores. Potential applications of isochore maps range from the improvement of gene-finding algorithms to the prediction of linkage disequilibrium levels in association studies between marker genes and complex traits. The coordinates for the LHGRs identified in all the contigs longer than 2 Mb in the human genome sequence are available at the online resource on isochore mapping: http://bioinfo2.ugr.es/isochores.
Affiliation(s)
- José L Oliver
- Departamento de Genética, Instituto de Biotecnología, Universidad de Granada, Granada, Spain.
6
Grosse I, Bernaola-Galván P, Carpena P, Román-Roldán R, Oliver J, Stanley HE. Analysis of symbolic sequences using the Jensen-Shannon divergence. Phys Rev E Stat Nonlin Soft Matter Phys 2002; 65:041905. [PMID: 12005871 DOI: 10.1103/physreve.65.041905] [Received: 12/22/2000] [Revised: 08/08/2001]
Abstract
We study statistical properties of the Jensen-Shannon divergence D, which quantifies the difference between probability distributions, and which has been widely applied to analyses of symbolic sequences. We present three interpretations of D in the framework of statistical physics, information theory, and mathematical statistics, and obtain approximations of the mean, the variance, and the probability distribution of D in random, uncorrelated sequences. We present a segmentation method based on D that is able to segment a nonstationary symbolic sequence into stationary subsequences, and apply this method to DNA sequences, which are known to be nonstationary on a wide range of different length scales.
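A segmentation of the kind described here can be sketched as a recursive search for the cut point that maximizes D. The version below is a simplified illustration under stated assumptions: it accepts a cut when `len(seq) * D` exceeds a fixed, hypothetical `threshold`, whereas the abstract describes deriving approximations of the distribution of D, which would normally drive a proper significance test. It also assumes the sequence contains only A, C, G and T.

```python
from math import log2

ALPHABET = "ACGT"  # assumption: no ambiguity codes such as N

def entropy(counts, total):
    """Shannon entropy (bits) from a list of symbol counts."""
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def best_cut(seq):
    """Return (position, D) for the cut that maximizes the Jensen-Shannon divergence."""
    n = len(seq)
    total = [seq.count(b) for b in ALPHABET]
    h_all = entropy(total, n)
    left = [0] * len(ALPHABET)
    best_pos, best_d = 0, 0.0
    for i in range(1, n):
        left[ALPHABET.index(seq[i - 1])] += 1
        right = [t - l for t, l in zip(total, left)]
        d = h_all - (i / n) * entropy(left, i) - ((n - i) / n) * entropy(right, n - i)
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d

def segment(seq, threshold, offset=0):
    """Recursively split seq and return the sorted list of cut positions.

    `threshold` on len(seq) * D is a hypothetical stand-in for a significance test.
    """
    if len(seq) < 2:
        return []
    pos, d = best_cut(seq)
    if pos == 0 or len(seq) * d < threshold:
        return []
    return (segment(seq[:pos], threshold, offset)
            + [offset + pos]
            + segment(seq[pos:], threshold, offset + pos))

cuts = segment("A" * 30 + "C" * 30 + "G" * 30, threshold=10)
```

Each accepted cut yields two subsequences that are segmented again, so the procedure stops on its own once no remaining piece contains a sufficiently strong compositional change.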
Affiliation(s)
- Ivo Grosse
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
7
Abstract
Analytical DNA ultracentrifugation revealed that eukaryotic genomes are mosaics of isochores: long DNA segments (>>300 kb on average) relatively homogeneous in G+C. Important genome features are dependent on this isochore structure, e.g. genes are found predominantly in the GC-richest isochore classes. However, no reliable method is available to rigorously partition the genome sequence into relatively homogeneous regions of different composition, thereby revealing the isochore structure of chromosomes at the sequence level. Homogeneous regions are currently ascertained by plain statistics on moving windows of arbitrary length, or simply by eye on G+C plots. On the contrary, the entropic segmentation method is able to divide a DNA sequence into relatively homogeneous, statistically significant domains. An early version of this algorithm only produced domains having an average length far below the typical isochore size. Here we show that an improved segmentation method, specifically intended to determine the most statistically significant partition of the sequence at each scale, is able to identify the boundaries between long homogeneous genome regions displaying the typical features of isochores. The algorithm precisely locates classes II and III of the human major histocompatibility complex region, two well-characterized isochores at the sequence level, the boundary between them being the first isochore boundary experimentally characterized at the sequence level. The analysis is then extended to a collection of human large contigs. The relatively homogeneous regions we find show many of the features (G+C range, relative proportion of isochore classes, size distribution, and relationship with gene density) of the isochores identified through DNA centrifugation. Isochore chromosome maps, with many potential applications in genomics, are then drawn for all the completely sequenced eukaryotic genomes available.
Affiliation(s)
- J L Oliver
- Departamento de Genética, Instituto de Biotecnología, Universidad de Granada, E-18071, Granada, Spain.
8
Bernaola-Galván P, Grosse I, Carpena P, Oliver JL, Román-Roldán R, Stanley HE. Finding borders between coding and noncoding DNA regions by an entropic segmentation method. Phys Rev Lett 2000; 85:1342-1345. [PMID: 10991547 DOI: 10.1103/physrevlett.85.1342] [Received: 08/02/1999]
Abstract
We present a new computational approach to finding borders between coding and noncoding DNA. This approach has two features: (i) DNA sequences are described by a 12-letter alphabet that captures the differential base composition at each codon position, and (ii) the search for the borders is carried out by means of an entropic segmentation method which uses only the general statistical properties of coding DNA. We find that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets. Our results appear to be more accurate than those obtained with moving windows in the discrimination of coding from noncoding DNA.
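One plausible reading of the 12-letter alphabet is that each nucleotide is labelled with its position within a codon, giving 4 bases times 3 positions. The sketch below illustrates that encoding; the symbol format (`"A0"`, `"T1"`, and so on) and the `frame` parameter are assumptions made for the example, not details taken from the abstract.

```python
from collections import Counter

def to_12_letter(seq, frame=0):
    """Label each base with its codon position (0, 1 or 2) relative to `frame`."""
    return [f"{base}{(i - frame) % 3}" for i, base in enumerate(seq.upper())]

def composition(symbols):
    """Relative frequency of each symbol in the recoded sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return {symbol: count / total for symbol, count in counts.items()}

recoded = to_12_letter("ATGAAA")
freqs = composition(recoded)
```

In coding DNA the base composition differs between the three codon positions, so the 12-symbol frequencies are uneven across positions, while in noncoding DNA the three positions look alike. A segmentation that compares 12-symbol compositions can therefore react to coding borders that a 4-letter composition would miss.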
Affiliation(s)
- P Bernaola-Galván
- Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215, USA
9
Abstract
MOTIVATION: DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS: Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented in a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.
Affiliation(s)
- J L Oliver
- Department of Genetics, Faculty of Sciences, University of Granada, Spain.
10
Bernaola-Galván P, Román-Roldán R, Oliver JL. Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 1996; 53:5181-5189. [PMID: 9964850 DOI: 10.1103/physreve.53.5181]
11
Abstract
A new method to determine entropic profiles in DNA sequences is presented. It is based on the chaos-game representation (CGR) of gene structure, a technique which produces a fractal-like picture of DNA sequences. First, the CGR image was divided into squares 4^(-m) in size (m being the desired resolution), and the point density counted. Second, appropriate intervals were adjusted, and then a histogram of densities was prepared. Third, Shannon's formula was applied to the probability-distribution histogram, thus obtaining a new entropic estimate for DNA sequences, the histogram entropy, a measurement that goes with the level of constraints on the DNA sequence. Lastly, the entropic profile for the sequence was drawn, by considering the entropies at each resolution level, thus providing a way to summarize the complexity of large genomic regions or even entire genomes at different resolution levels. The application of the method to DNA sequences reveals that entropic profiles obtained in this way, as opposed to previously published ones, clearly discriminate between random and natural DNA sequences. Entropic profiles also show a different degree of variability within and between genomes. The results of these analyses are discussed in relation both to the genome compartmentalization in vertebrates and to the differential action of compositional and/or functional constraints on DNA sequences.
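The steps listed in this abstract can be sketched roughly as below. Several details are assumptions made for the example: the corner assignment of the four bases follows a common CGR convention, each distinct cell count is treated as its own histogram bin (the abstract only says that "appropriate intervals were adjusted"), and empty cells are included in the histogram.

```python
from collections import Counter
from math import log2

# Assumed corner layout of the unit square for the chaos game.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Chaos-game points: move halfway from the current point to each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def histogram_entropy(seq, m):
    """Entropy of the distribution of point densities over a 2^m x 2^m grid."""
    side = 2 ** m  # each cell has area 4^(-m)
    cell_counts = Counter(
        (min(int(x * side), side - 1), min(int(y * side), side - 1))
        for x, y in cgr_points(seq)
    )
    n_cells = side * side
    # Histogram: how many cells hold each observed density (empty cells included).
    density_hist = Counter(cell_counts.values())
    density_hist[0] = n_cells - len(cell_counts)
    return -sum(k / n_cells * log2(k / n_cells)
                for k in density_hist.values() if k > 0)

def entropic_profile(seq, max_m):
    """Histogram entropy at each resolution level m = 1 .. max_m."""
    return [histogram_entropy(seq, m) for m in range(1, max_m + 1)]

h_uniform = histogram_entropy("ACGT", 1)   # one point per cell
h_skewed = histogram_entropy("A" * 10, 1)  # all points in a single cell
```

Plotting `entropic_profile` against m gives one curve per sequence, which is the kind of multi-resolution summary the abstract describes.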
Affiliation(s)
- J L Oliver
- Department of Genetics, Faculty of Sciences, University of Granada, Spain