1
|
Saeed M. Fractal genomics of SOD1 evolution. Immunogenetics 2020; 72:439-445. [PMID: 33237378 DOI: 10.1007/s00251-020-01184-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2020] [Accepted: 10/28/2020] [Indexed: 10/22/2022]
Abstract
To understand the fundamental processes of gene evolution, such as the impact of point mutations and segmental duplications on statistical topography, superoxide dismutase-1 (SOD1) orthologous sequences (n = 50) are studied. These demonstrate scale-invariant self-similarity patterns and long-range correlations (LRCs), indicating fractal organization. Phylogenetic hierarchies change when SOD1 orthologs are grouped according to fractal measures, indicating that statistical topographies can be used to study gene evolution. Sliding window k-mer analysis shows that the majority of k-mers across all SOD1 orthologs are unique, with very few duplications. Orthologs from simpler species contribute minimally (< 1% of k-mers) to more complex species. Both simple and complex random processes fail to produce significant matching k-mer sequences for SOD1 orthologs. Point mutations causing amyotrophic lateral sclerosis do not impact the fractal organization of human SOD1. Hence, SOD1 did not evolve by a patchwork of repetitive sequences modified by point mutations. Moreover, fractal and other methods described here can be used to study the origin and evolution of genomes.
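The sliding window k-mer analysis mentioned in this abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the function names `kmer_counts`, `unique_fraction` and `shared_fraction` are hypothetical, the window is assumed to advance one base at a time, and "contribution" of one ortholog to another is assumed to mean the share of distinct k-mers found in both.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer seen by a width-k window sliding one base at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_fraction(seq, k):
    """Fraction of distinct k-mers that occur exactly once in seq."""
    counts = kmer_counts(seq, k)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def shared_fraction(query, reference, k):
    """Fraction of the distinct k-mers of query that also occur in reference.

    This is an assumed stand-in for how much one sequence 'contributes'
    k-mers to another; the paper may define the measure differently.
    """
    q = set(kmer_counts(query, k))
    r = set(kmer_counts(reference, k))
    return len(q & r) / len(q) if q else 0.0
```

For the toy sequence `ACGTACGT` with k = 4, the 4-mer `ACGT` occurs twice and the other three 4-mers once, so `unique_fraction` returns 0.75.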
|
2
|
Mandal S, Roychowdhury T, Chirom K, Bhattacharya A, Brojen Singh RK. Complex multifractal nature in Mycobacterium tuberculosis genome. Sci Rep 2017; 7:46395. [PMID: 28440326 PMCID: PMC5404331 DOI: 10.1038/srep46395] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/25/2016] [Accepted: 03/15/2017] [Indexed: 11/08/2022]
Abstract
The multifractal and long-range correlation (C(r)) properties of strings such as nucleotide sequences can be useful parameters for identifying underlying patterns and variations. In this study, C(r) and the multifractal singularity function f(α) have been used to study variations in the genomes of the pathogenic bacterium Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α), reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into subgroups based on sensitivity to drugs: DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow a one-parameter scaling law. The richness in complexity of each subgroup can be quantified by multifractal parameters, which display a pattern in which XDR isolates have the highest values and drug sensitive isolates the lowest. Therefore C(r) and multifractal functions can be useful parameters for analysis of genomic sequences.
Affiliation(s)
- Saurav Mandal
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
- Keilash Chirom
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
- Alok Bhattacharya
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
- School of Life Sciences, Jawaharlal Nehru University, New Delhi-110067, India
- R. K. Brojen Singh
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi-110067, India
|
3
|
Aurell E, Innocenti N, Zhou HJ. The bulk and the tail of minimal absent words in genome sequences. Phys Biol 2016; 13:026004. [PMID: 27043075 DOI: 10.1088/1478-3975/13/2/026004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Minimal absent words (MAWs) of a genomic sequence are subsequences that are themselves absent but whose subwords are all present in the sequence. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the concept of the cores of a MAW, which are sequences present in the genome and closest to the given MAW. We show that in E. faecalis, E. coli and yeast the cores of the longest MAWs, which exist in two or more copies, are located in highly conserved regions, the most prominent example being ribosomal RNAs. We also show that while the distribution of the cores of long MAWs is roughly uniform over these genomes on a coarse-grained level, on a more detailed level it is strongly enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5' UTRs. This indicates that MAWs and associated MAW cores correspond to fine-tuned evolutionary relationships, and suggests that they can be more widely used as markers for genomic complexity.
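The definition of a minimal absent word given in this abstract can be made concrete with a small brute-force sketch. It is an illustration only, not the authors' method: the function names are hypothetical, the enumeration is exponential in word length and only suitable for toy inputs, and single letters missing from the sequence are assumed to count as MAWs of length 1. It relies on the fact that a word has all of its proper subwords present exactly when its longest proper prefix and longest proper suffix are both present.

```python
from itertools import product


def factors(seq, max_len):
    """Set of all substrings of seq with length 1..max_len."""
    return {seq[i:i + n]
            for n in range(1, max_len + 1)
            for i in range(len(seq) - n + 1)}


def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of seq up to length max_len (toy sizes only)."""
    present = factors(seq, max_len)
    maws = set()
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if word in present:
                continue  # the word occurs, so it is not absent
            # Every proper subword lies inside word[:-1] or word[1:], so
            # checking those two is enough to establish minimality.
            if n == 1 or (word[:-1] in present and word[1:] in present):
                maws.add(word)
    return maws
```

For the toy sequence `AAC` over the two-letter alphabet `AC`, the sketch returns `{"CA", "CC", "AAA"}`: each of these is absent, while all of their shorter subwords occur in `AAC`.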
Affiliation(s)
- Erik Aurell
- Department of Computational Biology, KTH Royal Institute of Technology, AlbaNova University Center, SE-10691 Stockholm, Sweden. Department of Information and Computer Science, Aalto University, FI-02150 Espoo, Finland
|
4
|
Colliva A, Pellegrini R, Testori A, Caselle M. Ising-model description of long-range correlations in DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys 2015; 91:052703. [PMID: 26066195 DOI: 10.1103/physreve.91.052703] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 09/29/2014] [Indexed: 06/04/2023]
Abstract
We model long-range correlations of nucleotides in the human DNA sequence using the long-range one-dimensional (1D) Ising model. We show that, for distances between 10^3 and 10^6 bp, the correlations show a universal behavior and may be described by the non-mean-field limit of the long-range 1D Ising model. This allows us to make some testable hypotheses on the nature of the interaction between distant portions of the DNA chain which led to the DNA structure that we observe today in higher eukaryotes.
Affiliation(s)
- A Colliva
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
- R Pellegrini
- Physics Department, Swansea University, Singleton Park, Swansea SA2 8PP, UK
- A Testori
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
- M Caselle
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
|
5
|
Wu ZB. Analysis of correlation structures in the Synechocystis PCC6803 genome. Comput Biol Chem 2014; 53 Pt A:49-58. [PMID: 25199594 DOI: 10.1016/j.compbiolchem.2014.08.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 07/11/2014] [Indexed: 11/26/2022]
Abstract
Transfer of nucleotide strings in the Synechocystis sp. PCC6803 genome is investigated to exhibit periodic and non-periodic correlation structures by using the recurrence plot method and the phase space reconstruction technique. The periodic correlation structures are generated by periodic transfer of several substrings in long periodic or non-periodic nucleotide strings embedded in the coding regions of genes. The non-periodic correlation structures are generated by non-periodic transfer of several substrings covering or overlapping with the coding regions of genes. In the periodic and non-periodic transfer, some gaps divide the long nucleotide strings into the substrings and prevent their global transfer. Most of the gaps are either the replacement of one base or the insertion/reduction of one base. In the reconstructed phase space, the points generated from two or three steps for the continuous iterative transfer via the second maximal distance can be fitted by two lines. This partly reveals an intrinsic dynamics in the transfer of nucleotide strings. A comparison of relative positions and lengths shows that the substrings involved in the non-periodic correlation structures are almost identical to the mobile elements annotated in the genome. The basic results on the correlation structures therefore also apply to the mobile elements.
Affiliation(s)
- Zuo-Bing Wu
- State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics, Chinese Academy of Sciences, Beijing 100190, China.
|
6
|
Abstract
The existence of fractal sets of DNA sequences has long been suspected on the basis of statistical analyses of genome data. In this article we identify for the first time explicitly the GA-sequences as a class of fractal genomic sequences that are easy to recognize and to extract, and are scattered densely throughout the chromosomes of a large number of genomes from different species and kingdoms including the human genome. Their existence and their fractality may have significant consequences for our understanding of the origin and evolution of genomes. Furthermore, as universal and natural markers they may be used to chart and explore the non-coding regions.
|
7
|
Koroteev MV, Miller J. Scale-free duplication dynamics: a model for ultraduplication. Phys Rev E Stat Nonlin Soft Matter Phys 2011; 84:061919. [PMID: 22304128 DOI: 10.1103/physreve.84.061919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 08/17/2010] [Revised: 07/04/2011] [Indexed: 05/31/2023]
Abstract
Empirical studies of the genome-wide length distribution of duplicated sequences have revealed an algebraic tail common to nearly all clades. The decay of the tail is often well approximated by a single exponent that takes values within a limited range. We propose and study here scale-free duplication dynamics, a class of model for genome sequence evolution that generates the observed shapes of this distribution. A transition between self-similar and non-self-similar regimes is exhibited. Our model accounts plausibly for the observed form of the algebraic tail, which is not produced by standard models for generating long-range sequence correlations.
Affiliation(s)
- M V Koroteev
- Physics and Biology Unit, Okinawa Institute of Science and Technology Suzaki 12-22, Uruma, Okinawa 904-2234, Japan
|
8
|
Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments. PLoS One 2011; 6:e18464. [PMID: 21779315 PMCID: PMC3136455 DOI: 10.1371/journal.pone.0018464] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 09/17/2010] [Accepted: 03/08/2011] [Indexed: 01/25/2023]
Abstract
Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential distribution expected from local effects of point mutation and selection on localised function; however, the observed distributions show substantial deviation from exponential form – they are roughly algebraic instead – suggesting a novel kind of long-distance correlation that must be non-local in origin.
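The contrast between exponential and algebraic length distributions drawn in this abstract can be written out in a short worked form. This is a sketch under a simplifying assumption that is not taken from the paper: adjacent sites of an aligned pair are taken to match independently with a fixed probability $p$, so that a run of exactly $\ell$ matches ends at the first mismatch. The symbols $p$, $\ell_0$ and $\alpha$ are introduced here for illustration only.

```latex
% Independent sites: geometric (exponential) decay of match length
P_{\mathrm{exp}}(\ell) = (1-p)\,p^{\ell}
                       = (1-p)\,e^{-\ell/\ell_0},
\qquad \ell_0 = -\frac{1}{\ln p}

% Observed form reported in the abstract: roughly algebraic decay
P_{\mathrm{alg}}(\ell) \propto \ell^{-\alpha}
```

Under this assumption the first form is a straight line on a log-linear plot with characteristic length $\ell_0$, whereas an algebraic tail is a straight line on a log-log plot and has no characteristic length.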
|
9
|
Woo HJ, Wallqvist A. Nonequilibrium phase transitions associated with DNA replication. Phys Rev Lett 2011; 106:060601. [PMID: 21405451 DOI: 10.1103/physrevlett.106.060601] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 08/26/2010] [Indexed: 05/30/2023]
Abstract
Thermodynamics governing the synthesis of DNA and RNA strands under a template is considered analytically and applied to the population dynamics of competing replicators. We find a nonequilibrium phase transition for high values of polymerase fidelity in a single replicator, where the two phases correspond to stationary states with higher elongation velocity and lower error rate than the other. At the critical point, the susceptibility linking velocity to thermodynamic force diverges. The overall behavior closely resembles the liquid-vapor phase transition in equilibrium. For a population of self-replicating macromolecules, Eigen's error catastrophe transition precedes this thermodynamic phase transition during starvation. For a given thermodynamic force, the fitness of replicators increases with increasing polymerase fidelity above a threshold.
Affiliation(s)
- Hyung-June Woo
- Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, USA.
|
10
|
Kuhn A, Dehnert M, Helm WE, Hütt MT. Statistical evidence for ancestral correlation patterns. Biosystems 2010; 100:215-24. [DOI: 10.1016/j.biosystems.2010.03.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2009] [Revised: 12/15/2009] [Accepted: 03/16/2010] [Indexed: 10/19/2022]
|
11
|
Chen HD, Fan WL, Kong SG, Lee HC. Universal global imprints of genome growth and evolution--equivalent length and cumulative mutation density. PLoS One 2010; 5:e9844. [PMID: 20418954 PMCID: PMC2854691 DOI: 10.1371/journal.pone.0009844] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 11/04/2009] [Accepted: 02/08/2010] [Indexed: 11/19/2022]
Abstract
Background: Segmental duplication is widely held to be an important mode of genome growth and evolution. Yet how this would affect the global structure of genomes has been little discussed. Methods/Principal Findings: Here, we show that equivalent length, or L_e, a quantity determined by the variance of the fluctuating part of the distribution of the k-mer frequencies in a genome, characterizes the latter's global structure. We computed the L_e values of 865 complete chromosomes and found that they have nearly universal but (k-dependent) values. The differences among the L_e of a chromosome and those of its coding and non-coding parts were found to be slight. Conclusions: We verified that these non-trivial results are natural consequences of a genome growth model characterized by random segmental duplication and random point mutation, but not of any model whose dominant growth mechanism is not segmental duplication. Our study also indicates that genomes have a nearly universal cumulative "point" mutation density of about 0.73 mutations per site that is compatible with the relatively low mutation rates of (1-5) × 10^-3/site/Mya previously determined by sequence comparison for the human and E. coli genomes.
Affiliation(s)
- Hong-Da Chen
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan
- Department of Physics, National Central University, Chungli, Taiwan
- Wen-Lang Fan
- Department of Physics, National Central University, Chungli, Taiwan
- Genomic Research Center, Academia Sinica, Taipei, Taiwan
- Sing-Guan Kong
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan
- Department of Physics, National Central University, Chungli, Taiwan
- Hoong-Chien Lee
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan
- Department of Physics, National Central University, Chungli, Taiwan
- Cathay Medical Research Institute, Cathay General Hospital, Taipei, Taiwan
- National Center for Theoretical Science, Hsinchu, Taiwan
|
12
|
Kong SG, Fan WL, Chen HD, Hsu ZT, Zhou N, Zheng B, Lee HC. Inverse symmetry in complete genomes and whole-genome inverse duplication. PLoS One 2009; 4:e7553. [PMID: 19898631 PMCID: PMC2771390 DOI: 10.1371/journal.pone.0007553] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 03/20/2009] [Accepted: 07/22/2009] [Indexed: 12/18/2022]
Abstract
The cause of symmetry is usually subtle, and its study often leads to a deeper understanding of the bearer of the symmetry. To gain insight into the dynamics driving the growth and evolution of genomes, we conducted a comprehensive study of textual symmetries in 786 complete chromosomes. We focused on symmetry based on our belief that, in spite of their extreme diversity, genomes must share common dynamical principles and mechanisms that drive their growth and evolution, and that the most robust footprints of such dynamics are symmetry related. We found that while complement and reverse symmetries are essentially absent in genomic sequences, inverse-complement plus reverse-symmetry is prevalent in complex patterns in most chromosomes, a vast majority of which have near maximum global inverse symmetry. We also discovered relations that can quantitatively account for the long observed but unexplained phenomenon of k-mer skews in genomes. Our results suggest segmental and whole-genome inverse duplications are important mechanisms in genome growth and evolution, probably because they are efficient means by which the genome can exploit its double-stranded structure to enrich its code-inventory.
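The three textual symmetries named in this abstract (reverse, complement, and inverse, i.e. reverse plus complement) can be sketched as transforms on k-mers. The `asymmetry` score below is a hypothetical illustration and not the measure used in the paper: it is assumed here to be the summed absolute difference between the count of each k-mer and the count of its transformed partner, normalised to lie between 0 (perfectly symmetric) and 1.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(word):
    """Read the word backwards."""
    return word[::-1]


def complement(word):
    """Swap A with T and C with G, keeping the reading direction."""
    return word.translate(COMPLEMENT)


def inverse(word):
    """Reverse complement: the same stretch read on the opposite strand."""
    return complement(word)[::-1]


def asymmetry(seq, k, transform):
    """Hypothetical score: 0 if every k-mer is as frequent as its partner."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Include partners that never occur so their missing counts register.
    words = set(counts) | {transform(w) for w in counts}
    diff = sum(abs(counts[w] - counts[transform(w)]) for w in words)
    return diff / (2 * total)
```

As a small check, `AACGTT` is its own reverse complement, so `asymmetry("AACGTT", 2, inverse)` is 0.0, while the same call with `reverse` or `complement` gives 0.6.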
Affiliation(s)
- Sing-Guan Kong
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan, Republic of China
- Wen-Lang Fan
- Department of Physics, National Central University, Chungli, Taiwan, Republic of China
- Hong-Da Chen
- Department of Physics, National Central University, Chungli, Taiwan, Republic of China
- Zi-Ting Hsu
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan, Republic of China
- Nengji Zhou
- Institute of Modern Physics, Zhejiang University, Hangzhou, Zhejiang, China
- National Center for Theoretical Science, Hsinchu, Taiwan, Republic of China
- Bo Zheng
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan, Republic of China
- Hoong-Chien Lee
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan, Republic of China
- Department of Physics, National Central University, Chungli, Taiwan, Republic of China
- Institute of Modern Physics, Zhejiang University, Hangzhou, Zhejiang, China
- National Center for Theoretical Science, Hsinchu, Taiwan, Republic of China
|
13
|
Kong SG, Fan WL, Chen HD, Wigger J, Torda AE, Lee HC. Quantitative measure of randomness and order for complete genomes. Phys Rev E Stat Nonlin Soft Matter Phys 2009; 79:061911. [PMID: 19658528 DOI: 10.1103/physreve.79.061911] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/07/2008] [Revised: 04/14/2009] [Indexed: 05/28/2023]
Abstract
We propose an order index, φ, which gives a quantitative measure of randomness and order of complete genomic sequences. It maps genomes to a number from 0 (random and of infinite length) to 1 (fully ordered) and applies regardless of sequence length. The 786 complete genomic sequences in GenBank were found to have φ values in a very narrow range, φ_g = 0.031 (+0.028/-0.015). We show this implies that genomes are halfway toward being completely random, or, at the "edge of chaos." We further show that artificial "genomes" converted from literary classics have φ values that almost exactly coincide with φ_g, but sequences of low information content do not. We infer that φ_g represents a high information-capacity "fixed point" in sequence space, and that genomes are driven to it by the dynamics of a robust growth and evolution process. We show that a growth process characterized by random segmental duplication can robustly drive genomes to the fixed point.
Affiliation(s)
- Sing-Guan Kong
- Department of Physics, Graduate Institute of Biophysics, National Central University, Chungli, Taiwan 32001, Republic of China
|
14
|
Saakian DB. Evolution models with base substitutions, insertions, deletions, and selection. Phys Rev E Stat Nonlin Soft Matter Phys 2008; 78:061920. [PMID: 19256881 DOI: 10.1103/physreve.78.061920] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/10/2008] [Revised: 09/11/2008] [Indexed: 05/27/2023]
Abstract
The evolution model with a parallel mutation-selection scheme is solved for the case when selection is accompanied by base substitutions, insertions, and deletions. The fitness is assumed to be either a single-peak function (i.e., having one finite discontinuity) or a smooth function of the Hamming distance from the reference sequence. The mean fitness is calculated exactly in the large-genome limit. In the case of insertions and deletions the evolution characteristics depend on the choice of reference sequence.
Affiliation(s)
- D B Saakian
- Institute of Physics, Academia Sinica, Nankang, Taipei 11529, Taiwan.
|
15
|
Messer PW, Bundschuh R, Vingron M, Arndt PF. Effects of long-range correlations in DNA on sequence alignment score statistics. J Comput Biol 2007; 14:655-68. [PMID: 17683266 DOI: 10.1089/cmb.2007.r008] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
Long-range correlations in genomic base composition are a ubiquitous statistical feature among many eukaryotic genomes. In this article, these correlations are shown to substantially influence the statistics of sequence alignment scores. Using a Gaussian approximation to model the correlated score landscape, we calculate the corrections to the scale parameter λ of the extreme value distribution of alignment scores. Our approximate analytic results are supported by a detailed numerical study based on a simple algorithm to efficiently generate long-range correlated random sequences. We find both the mean and the exponential tail of the score distribution for long-range correlated sequences to be substantially shifted compared to random sequences with independent nucleotides. The significance of measured alignment scores will therefore change upon incorporation of the correlations in the null model. We discuss the magnitude of this effect in a biological context.
|
16
|
Martignetti L, Caselle M. Universal power law behaviors in genomic sequences and evolutionary models. Phys Rev E Stat Nonlin Soft Matter Phys 2007; 76:021902. [PMID: 17930060 DOI: 10.1103/physreve.76.021902] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 02/20/2007] [Indexed: 05/25/2023]
Abstract
We study the length distribution of a particular class of DNA sequences known as 5' untranslated region (5' UTR) exons. These exons belong to the messenger RNA of protein coding genes, but they are not coding (they are located upstream of the coding portion of the mRNA) and are thus less constrained from an evolutionary point of view. We show that in both mice and humans these exons show a very clean power law decay in their length distribution and suggest a simple evolutionary model, which may explain this finding. We conjecture that this power law behavior could indeed be a general feature of higher eukaryotes.
Affiliation(s)
- Loredana Martignetti
- Dipartimento di Fisica Teorica, Università di Torino and INFN, Via Pietro Giuria 1, I-10125 Torino, Italy.
|
17
|
Provata A, Oikonomou T. Power law exponents characterizing human DNA. Phys Rev E Stat Nonlin Soft Matter Phys 2007; 75:056102. [PMID: 17677128 DOI: 10.1103/physreve.75.056102] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 02/07/2006] [Revised: 02/09/2007] [Indexed: 05/16/2023]
Abstract
The size distributions of all known coding and noncoding DNA sequences are studied in all human chromosomes. In a unified approach, both introns and intergenic regions are treated as noncoding regions. The distributions of noncoding segments P_nc(S) of size S present long tails P_nc(S) ~ S^(-1-μ_nc), with exponents μ_nc ranging between 0.71 (for chromosome 13) and 1.2 (for chromosome 19). On the contrary, the exponential, short-range decay terms dominate in the distributions of coding (exon) segments P_c(S) in all chromosomes. Aiming to address the emergence of these statistical features, minimal, stochastic, mean-field models are proposed, based on randomly aggregating DNA strings with duplication, influx and outflux of genomic segments. These minimal models produce both the short-range statistics in the coding and the observed power law and fractal statistics in the noncoding DNA. The minimal models also demonstrate that although the two systems (coding and noncoding) coexist, alternating on the same linear chain, they act independently: the coding as a closed, equilibrium system and the noncoding as an open, out-of-equilibrium one.
Affiliation(s)
- A Provata
- Institute of Physical Chemistry, National Center for Scientific Research Demokritos, 15310 Athens, Greece.
|
18
|
Zhao F, Yang H, Wang B. Complexities of human promoter sequences. J Theor Biol 2007; 247:645-9. [PMID: 17482648 DOI: 10.1016/j.jtbi.2007.03.035] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 09/21/2006] [Revised: 03/28/2007] [Accepted: 03/29/2007] [Indexed: 10/23/2022]
Abstract
By means of the diffusion entropy approach, we detect the scale-invariance characteristics embedded in the 4737 human promoter sequences. The exponent for the scale-invariance lies in a wide range, [0.3, 0.9], centered at δ_c = 0.66. The distribution of the exponent can be separated into left and right branches with respect to the maximum. The left and right branches are asymmetric and can each be fitted exactly with a Gaussian form, with different widths.
Affiliation(s)
- Fangcui Zhao
- College of Life Science and Bioengineering, Beijing University of Technology, Beijing 100022, China
|
19
|
Messer PW, Arndt PF. CorGen--measuring and generating long-range correlations for DNA sequence analysis. Nucleic Acids Res 2006; 34:W692-5. [PMID: 16845099 PMCID: PMC1538783 DOI: 10.1093/nar/gkl234] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
CorGen is a web server that measures long-range correlations in the base composition of DNA and generates random sequences with the same correlation parameters. Long-range correlations are characterized by a power-law decay of the auto correlation function of the GC-content. The widespread presence of such correlations in eukaryotic genomes calls for their incorporation into accurate null models of eukaryotic DNA in computational biology. For example, the score statistics of sequence alignment and the performance of motif finding algorithms are significantly affected by the presence of genomic long-range correlations. We use an expansion-randomization dynamics to efficiently generate the correlated random sequences. The server is available at http://corgen.molgen.mpg.de.
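The quantity this abstract refers to, the autocorrelation function of the GC-content, can be estimated directly from a sequence. The sketch below is not CorGen's implementation: the function name `gc_autocorrelation` is hypothetical, each base is assumed to be coded as 1 for G or C and 0 otherwise, and the global mean is subtracted before averaging over all pairs of sites a distance r apart. A power-law decay of the resulting values with r would be the signature of long-range correlations; fitting that decay is left out.

```python
def gc_autocorrelation(seq, max_lag):
    """Empirical autocorrelation C(r) of the GC indicator for r = 1..max_lag."""
    # Code each base as 1.0 (G or C) or 0.0 (anything else).
    x = [1.0 if base in ("G", "C") else 0.0 for base in seq.upper()]
    n = len(x)
    if n == 0:
        return {}
    mean = sum(x) / n
    corr = {}
    for r in range(1, max_lag + 1):
        pairs = n - r
        if pairs <= 0:
            break  # the sequence is too short for this lag
        # Average product of mean-subtracted values at sites i and i + r.
        corr[r] = sum((x[i] - mean) * (x[i + r] - mean)
                      for i in range(pairs)) / pairs
    return corr
```

For a strictly alternating toy sequence such as `"GA" * 50`, the estimate is -0.25 at lag 1 and +0.25 at lag 2, reflecting perfect short-range periodicity instead of a power-law decay.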
Affiliation(s)
- Philipp W Messer
- Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195 Berlin, Germany.
|
20
|
Salerno W, Havlak P, Miller J. Scale-invariant structure of strongly conserved sequence in genomic intersections and alignments. Proc Natl Acad Sci U S A 2006; 103:13121-5. [PMID: 16924100 PMCID: PMC1559763 DOI: 10.1073/pnas.0605735103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]
Abstract
A power-law distribution of the length of perfectly conserved sequence from mouse/human whole-genome intersection and alignment is exhibited. Spatial correlations of these elements within the mouse genome are studied. It is argued that these power-law distributions and correlations are comprised in part by functional noncoding sequence and ought to be accounted for in estimating the statistical significance of apparent sequence conservation. These inter-genomic correlations of conservation are placed in the context of previously observed intra-genomic correlations, and their possible origins and consequences are discussed.
Affiliation(s)
- Paul Havlak
- Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
- Jonathan Miller
- Department of Biochemistry and Molecular Biology and Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
|
21
|
Dehnert M, Helm WE, Hütt MT. Informational structure of two closely related eukaryotic genomes. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 74:021913. [PMID: 17025478 DOI: 10.1103/physreve.74.021913] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 01/23/2006] [Indexed: 05/12/2023]
Abstract
Attempts to identify a species on the basis of its DNA sequence on purely statistical grounds have been formulated for more than a decade. The most prominent of such genome signatures relies on neighborhood correlations (i.e., dinucleotide frequencies) and, consequently, attributes species identification to mechanisms operating on the dinucleotide level (e.g., neighbor-dependent mutations). For the examples of Mus musculus and Rattus norvegicus we analyze short- and intermediate-range statistical correlations in DNA sequences. These correlation profiles are computed for all chromosomes of the two species. We find that with increasing range of correlations the capacity to distinguish between the species on the basis of this correlation profile is getting better and requires ever shorter sequence segments for obtaining a full species separation. This finding suggests that distinctive traits within the sequence are situated beyond the level of few nucleotides. The large-scale statistical patterning of DNA sequences on which such genome signatures are based is thus substantially determined by mobile elements (e.g., transposons and retrotransposons). The study and interspecies comparison of such correlation profiles can, therefore, reveal features of retrotransposition, segmental duplications, and other processes of genome evolution.
Affiliation(s)
- Manuel Dehnert
- Computational Systems Biology, School of Engineering and Science, International University Bremen, Campus Ring 1, D-28759 Bremen, Germany
|
22
|
Shih CT. Characteristic length scale of electric transport properties of genomes. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 74:010903. [PMID: 16907054 DOI: 10.1103/physreve.74.010903] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/04/2005] [Indexed: 05/11/2023]
Abstract
A tight-binding model together with a statistical method are used to investigate the relation between the sequence-dependent electric transport properties and the sequences of protein-coding regions of complete genomes. A correlation parameter Ω is defined to analyze the relation. For some particular propagation length w_max, the transport behaviors of the coding and noncoding sequences are very different and the correlation reaches its maximal value Ω_max. w_max and Ω_max are characteristic values for each species. A possible reason for the difference between the features of transport properties in the coding and noncoding regions is the mechanism of DNA damage repair processes together with natural selection.
Affiliation(s)
- C T Shih
- Department of Physics, Tunghai University, Taichung, Taiwan
|
23
|
Saakian DB, Muñoz E, Hu CK, Deem MW. Quasispecies theory for multiple-peak fitness landscapes. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 73:041913. [PMID: 16711842 PMCID: PMC4474369 DOI: 10.1103/physreve.73.041913] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 09/15/2005] [Revised: 12/13/2005] [Indexed: 05/09/2023]
Abstract
We use a path integral representation to solve the Eigen and Crow-Kimura molecular evolution models for the case of multiple fitness peaks with arbitrary fitness and degradation functions. In the general case, we find that the solution to these molecular evolution models can be written as the optimum of a fitness function, with constraints enforced by Lagrange multipliers and with a term accounting for the entropy of the spreading population in sequence space. The results for the Eigen model are applied to consider virus or cancer proliferation under the control of drugs or the immune system.
Affiliation(s)
- David B Saakian
- Institute of Physics, Academia Sinica, Nankang, Taipei 11529, Taiwan
|