1
|
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. BIOLOGY 2023; 12:849. [PMID: 37372134 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
Collapse
Affiliation(s)
- Pedro Bernaola-Galván
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
| | - Pedro Carpena
- Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
| | - Cristina Gómez-Martín
- Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
| | - Jose L Oliver
- Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
| |
Collapse
|
2
|
Dai Y, Pracana R, Holland PWH. Divergent genes in gerbils: prevalence, relation to GC-biased substitution, and phenotypic relevance. BMC Evol Biol 2020; 20:134. [PMID: 33076817 PMCID: PMC7574485 DOI: 10.1186/s12862-020-01696-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2020] [Accepted: 09/29/2020] [Indexed: 11/25/2022] Open
Abstract
Background Two gerbil species, sand rat (Psammomys obesus) and Mongolian jird (Meriones unguiculatus), can become obese and show signs of metabolic dysregulation when maintained on standard laboratory diets. The genetic basis of this phenotype is unknown. Recently, genome sequencing has uncovered very unusual regions of high guanine and cytosine (GC) content scattered across the sand rat genome, most likely generated by extreme and localized biased gene conversion. A key pancreatic transcription factor PDX1 is encoded by a gene in the most extreme GC-rich region, is remarkably divergent and exhibits altered biochemical properties. Here, we ask if gerbils have proteins in addition to PDX1 that are aberrantly divergent in amino acid sequence, whether they have also become divergent due to GC-biased nucleotide changes, and whether these proteins could plausibly be connected to metabolic dysfunction exhibited by gerbils. Results We analyzed ~ 10,000 proteins with 1-to-1 orthologues in human and rodents and identified 50 proteins that accumulated unusually high levels of amino acid change in the sand rat and 41 in Mongolian jird. We show that more than half of the aberrantly divergent proteins are associated with GC biased nucleotide change and many are in previously defined high GC regions. We highlight four aberrantly divergent gerbil proteins, PDX1, INSR, MEDAG and SPP1, that may plausibly be associated with dietary metabolism. Conclusions We show that through the course of gerbil evolution, many aberrantly divergent proteins have accumulated in the gerbil lineage, and GC-biased nucleotide substitution rather than positive selection is the likely cause of extreme divergence in more than half of these. Some proteins carry putatively deleterious changes that could be associated with metabolic and physiological phenotypes observed in some gerbil species. We propose that these animals provide a useful model to study the ‘tug-of-war’ between natural selection and the excessive accumulation of deleterious substitutions mutations through biased gene conversion.
Collapse
Affiliation(s)
- Yichen Dai
- Department of Zoology, University of Oxford, 11a Mansfield Road, Oxford, OX1 3SZ, UK
| | - Rodrigo Pracana
- Department of Zoology, University of Oxford, 11a Mansfield Road, Oxford, OX1 3SZ, UK
| | - Peter W H Holland
- Department of Zoology, University of Oxford, 11a Mansfield Road, Oxford, OX1 3SZ, UK.
| |
Collapse
|
3
|
Costantini M. An overview on genome organization of marine organisms. Mar Genomics 2015; 24 Pt 1:3-9. [PMID: 25899406 DOI: 10.1016/j.margen.2015.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2015] [Revised: 03/17/2015] [Accepted: 03/17/2015] [Indexed: 11/16/2022]
Abstract
In this review we will concentrate on some general genome features of marine organisms and their evolution, ranging from vertebrate to invertebrates until unicellular organisms. Before genome sequencing, the ultracentrifugation in CsCl led to high resolution of mammalian DNA (without seeing at the sequence). The analytical profile of human DNA showed that the vertebrate genome is a mosaic of isochores, typically megabase-size DNA segments that belong in a small number of families characterized by different GC levels. The recent availability of a number of fully sequenced genomes allowed mapping very precisely the isochores, based on DNA sequences. Since isochores are tightly linked to biological properties such as gene density, replication timing and recombination, the new level of detail provided by the isochore map helped the understanding of genome structure, function and evolution. This led the current level of knowledge and to further insights.
Collapse
Affiliation(s)
- Maria Costantini
- Department of Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Naples, Italy.
| |
Collapse
|
4
|
Elhaik E, Graur D. A comparative study and a phylogenetic exploration of the compositional architectures of mammalian nuclear genomes. PLoS Comput Biol 2014; 10:e1003925. [PMID: 25375262 PMCID: PMC4222635 DOI: 10.1371/journal.pcbi.1003925] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2014] [Accepted: 09/18/2014] [Indexed: 11/18/2022] Open
Abstract
For the past four decades the compositional organization of the mammalian genome posed a formidable challenge to molecular evolutionists attempting to explain it from an evolutionary perspective. Unfortunately, most of the explanations adhered to the "isochore theory," which has long been rebutted. Recently, an alternative compositional domain model was proposed depicting the human and cow genomes as composed mostly of short compositionally homogeneous and nonhomogeneous domains and a few long ones. We test the validity of this model through a rigorous sequence-based analysis of eleven completely sequenced mammalian and avian genomes. Seven attributes of compositional domains are used in the analyses: (1) the number of compositional domains, (2) compositional domain-length distribution, (3) density of compositional domains, (4) genome coverage by the different domain types, (5) degree of fit to a power-law distribution, (6) compositional domain GC content, and (7) the joint distribution of GC content and length of the different domain types. We discuss the evolution of these attributes in light of two competing phylogenetic hypotheses that differ from each other in the validity of clade Euarchontoglires. If valid, the murid genome compositional organization would be a derived state and exhibit a high similarity to that of other mammals. If invalid, the murid genome compositional organization would be closer to an ancestral state. We demonstrate that the compositional organization of the murid genome differs from those of primates and laurasiatherians, a phenomenon previously termed the "murid shift," and in many ways resembles the genome of opossum. We find no support to the "isochore theory." Instead, our findings depict the mammalian genome as a tapestry of mostly short homogeneous and nonhomogeneous domains and few long ones thus providing strong evidence in favor of the compositional domain model and seem to invalidate clade Euarchontoglires.
Collapse
Affiliation(s)
- Eran Elhaik
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
- * E-mail:
| | - Dan Graur
- Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
| |
Collapse
|
5
|
Nepomnyashchaya YN, Artemov AV, Roumiantsev SA, Roumyantsev AG, Zhavoronkov A. Non-invasive prenatal diagnostics of aneuploidy using next-generation DNA sequencing technologies, and clinical considerations. Clin Chem Lab Med 2014; 51:1141-54. [PMID: 23023923 DOI: 10.1515/cclm-2012-0281] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2012] [Accepted: 08/29/2012] [Indexed: 02/02/2023]
Abstract
Rapidly developing next-generation sequencing (NGS) technologies produce a large amount of data across the whole human genome and allow a large number of DNA samples to be analyzed simultaneously. Screening cell-free fetal DNA (cffDNA) obtained from maternal blood using NGS technologies has provided new opportunities for non-invasive prenatal diagnosis (NIPD) of fetal aneuploidies. One of the major challenges to the analysis of fetal abnormalities is the development of accurate and reliable algorithms capable of analyzing large numbers of short sequence reads. Several such algorithms have recently been developed. Here, we provide a review of recent NGS-based NIPD methods as well as the available algorithms for short-read sequence analysis. We furthermore introduce the practical application of these algorithms for the detection of different types of fetal aneuploidies, and compare the performance, cost and complexity of each approach for clinical deployment. Our review identifies several main technologies and trends in NGS-based NIPD. The main considerations for clinical development for NIPD and screening tests using DNA sequencing are: accuracy, intellectual property, cost and the ability to screen for a wide range of chromosomal abnormalities and genetic defects. The cost of the diagnostic test depends on the sequencing method, diagnostic algorithm and volume of the tests. If the cost of sequencing equipment and reagents remains at or around current levels, targeted approaches for sequencing-based aneuploidy testing and SNP-based methods are preferred.
Collapse
Affiliation(s)
- Yana N Nepomnyashchaya
- Bioinformatics and Medical Information Technology Laboratory, Federal Clinical Research Center of Pediatric Hematology, Oncology and Immunology, Moscow, Russian Federation
| | | | | | | | | |
Collapse
|
6
|
Provata A, Nicolis C, Nicolis G. DNA viewed as an out-of-equilibrium structure. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2014; 89:052105. [PMID: 25353737 DOI: 10.1103/physreve.89.052105] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/14/2013] [Indexed: 05/02/2023]
Abstract
The complexity of the primary structure of human DNA is explored using methods from nonequilibrium statistical mechanics, dynamical systems theory, and information theory. A collection of statistical analyses is performed on the DNA data and the results are compared with sequences derived from different stochastic processes. The use of χ^{2} tests shows that DNA can not be described as a low order Markov chain of order up to r=6. Although detailed balance seems to hold at the level of a binary alphabet, it fails when all four base pairs are considered, suggesting spatial asymmetry and irreversibility. Furthermore, the block entropy does not increase linearly with the block size, reflecting the long-range nature of the correlations in the human genomic sequences. To probe locally the spatial structure of the chain, we study the exit distances from a specific symbol, the distribution of recurrence distances, and the Hurst exponent, all of which show power law tails and long-range characteristics. These results suggest that human DNA can be viewed as a nonequilibrium structure maintained in its state through interactions with a constantly changing environment. Based solely on the exit distance distribution accounting for the nonequilibrium statistics and using the Monte Carlo rejection sampling method, we construct a model DNA sequence. This method allows us to keep both long- and short-range statistical characteristics of the native DNA data. The model sequence presents the same characteristic exponents as the natural DNA but fails to capture spatial correlations and point-to-point details.
Collapse
Affiliation(s)
- A Provata
- Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15310 Athens, Greece and Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, CP. 231, 1050 Bruxelles, Belgium
| | - C Nicolis
- Institut Royal Météorologique de Belgique, 3 Avenue Circulaire, 1180 Bruxelles, Belgium
| | - G Nicolis
- Interdisciplinary Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, Campus Plaine, CP. 231, 1050 Bruxelles, Belgium
| |
Collapse
|
7
|
A genetic program theory of aging using an RNA population model. Ageing Res Rev 2014; 13:46-54. [PMID: 24263168 DOI: 10.1016/j.arr.2013.11.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2013] [Accepted: 11/08/2013] [Indexed: 12/11/2022]
Abstract
Aging is a common characteristic of multicellular eukaryotes. Copious hypotheses have been proposed to explain the mechanisms of aging, but no single theory is generally acceptable. In this article, we refine the RNA population gene activating model (Lv et al., 2003) based on existing reports as well as on our own latest findings. We propose the RNA population model as a genetic theory of aging. The new model can also be applied to differentiation and tumorigenesis and could explain the biological significance of non-coding DNA, RNA, and repetitive sequence DNA. We provide evidence from the literature as well as from our own findings for the roles of repetitive sequences in gene activation. In addition, we predict several phenomena related to aging and differentiation based on this model.
Collapse
|
8
|
A single cell level based method for copy number variation analysis by low coverage massively parallel sequencing. PLoS One 2013; 8:e54236. [PMID: 23372689 PMCID: PMC3553135 DOI: 10.1371/journal.pone.0054236] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2012] [Accepted: 12/10/2012] [Indexed: 02/02/2023] Open
Abstract
Copy number variations (CNVs), a common genomic mutation associated with various diseases, are important in research and clinical applications. Whole genome amplification (WGA) and massively parallel sequencing have been applied to single cell CNVs analysis, which provides new insight for the fields of biology and medicine. However, the WGA-induced bias significantly limits sensitivity and specificity for CNVs detection. Addressing these limitations, we developed a practical bioinformatic methodology for CNVs detection at the single cell level using low coverage massively parallel sequencing. This method consists of GC correction for WGA-induced bias removal, binary segmentation algorithm for locating CNVs breakpoints, and dynamic threshold determination for final signals filtering. Afterwards, we evaluated our method with seven test samples using low coverage sequencing (4∼9.5%). Four single-cell samples from peripheral blood, whose karyotypes were confirmed by whole genome sequencing analysis, were acquired. Three other test samples derived from blastocysts whose karyotypes were confirmed by SNP-array analysis were also recruited. The detection results for CNVs of larger than 1 Mb were highly consistent with confirmed results reaching 99.63% sensitivity and 97.71% specificity at base-pair level. Our study demonstrates the potential to overcome WGA-bias and to detect CNVs (>1 Mb) at the single cell level through low coverage massively parallel sequencing. It highlights the potential for CNVs research on single cells or limited DNA samples and may prove as a promising tool for research and clinical applications, such as pre-implantation genetic diagnosis/screening, fetal nucleated red blood cells research and cancer heterogeneity analysis.
Collapse
|
9
|
Bernaola-Galván P, Oliver J, Hackenberg M, Coronado A, Ivanov P, Carpena P. Segmentation of time series with long-range fractal correlations. THE EUROPEAN PHYSICAL JOURNAL. B 2012; 85:211. [PMID: 23645997 PMCID: PMC3643524 DOI: 10.1140/epjb/e2012-20969-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
Collapse
Affiliation(s)
| | - J.L. Oliver
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
| | - M. Hackenberg
- Dpto. de Genética, Inst. de Biotecnología, Universidad de Granada, 18071 Granada, Spain
| | - A.V. Coronado
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| | - P.Ch. Ivanov
- Harvard Medical School, Division of Sleep Medicine, Brigham & Women’s Hospital, 02115 Boston, MA, USA
- Department of Physics and Center for Polymer Studies, Boston University, 2215 Boston, MA, USA
- Institute of Solid State Physics, Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
| | - P. Carpena
- Dpto. de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
10
|
Clustering of DNA words and biological function: A proof of principle. J Theor Biol 2012; 297:127-36. [DOI: 10.1016/j.jtbi.2011.12.024] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2011] [Revised: 12/20/2011] [Accepted: 12/21/2011] [Indexed: 02/08/2023]
|
11
|
Xu Y, Ma QD, Schmitt DT, Bernaola-Galván P, Ivanov PC. Effects of coarse-graining on the scaling behavior of long-range correlated and anti-correlated signals. PHYSICA A 2011; 390:4057-4072. [PMID: 25392599 PMCID: PMC4226277 DOI: 10.1016/j.physa.2011.05.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
We investigate how various coarse-graining (signal quantization) methods affect the scaling properties of long-range power-law correlated and anti-correlated signals, quantified by the detrended fluctuation analysis. Specifically, for coarse-graining in the magnitude of a signal, we consider (i) the Floor, (ii) the Symmetry and (iii) the Centro-Symmetry coarse-graining methods. We find that for anti-correlated signals coarse-graining in the magnitude leads to a crossover to random behavior at large scales, and that with increasing the width of the coarse-graining partition interval Δ, this crossover moves to intermediate and small scales. In contrast, the scaling of positively correlated signals is less affected by the coarse-graining, with no observable changes when Δ < 1, while for Δ > 1 a crossover appears at small scales and moves to intermediate and large scales with increasing Δ. For very rough coarse-graining (Δ > 3) based on the Floor and Symmetry methods, the position of the crossover stabilizes, in contrast to the Centro-Symmetry method where the crossover continuously moves across scales and leads to a random behavior at all scales; thus indicating a much stronger effect of the Centro-Symmetry compared to the Floor and the Symmetry method. For coarse-graining in time, where data points are averaged in non-overlapping time windows, we find that the scaling for both anti-correlated and positively correlated signals is practically preserved. The results of our simulations are useful for the correct interpretation of the correlation and scaling properties of symbolic sequences.
Collapse
Affiliation(s)
- Yinlin Xu
- Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
- College of Physics Science and Technology, Nanjing Normal University, Nanjing 210097, China
| | - Qianli D.Y. Ma
- Harvard Medical School and Division of Sleep Medicine, Brigham & Women’s Hospital, Boston, MA 02215, USA
- College of Geography and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
| | - Daniel T. Schmitt
- Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
| | | | - Plamen Ch. Ivanov
- Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215, USA
- Harvard Medical School and Division of Sleep Medicine, Brigham & Women’s Hospital, Boston, MA 02215, USA
- Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| |
Collapse
|
12
|
On parameters of the human genome. J Theor Biol 2011; 288:92-104. [DOI: 10.1016/j.jtbi.2011.07.021] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2011] [Revised: 06/28/2011] [Accepted: 07/21/2011] [Indexed: 02/06/2023]
|
13
|
Zhang W, Wu W, Lin W, Zhou P, Dai L, Zhang Y, Huang J, Zhang D. Deciphering heterogeneity in pig genome assembly Sscrofa9 by isochore and isochore-like region analyses. PLoS One 2010; 5:e13303. [PMID: 20948965 PMCID: PMC2952626 DOI: 10.1371/journal.pone.0013303] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2010] [Accepted: 09/15/2010] [Indexed: 11/18/2022] Open
Abstract
Background The isochore, a large DNA sequence with relatively small GC variance, is one of the most important structures in eukaryotic genomes. Although the isochore has been widely studied in humans and other species, little is known about its distribution in pigs. Principal Findings In this paper, we construct a map of long homogeneous genome regions (LHGRs), i.e., isochores and isochore-like regions, in pigs to provide an intuitive version of GC heterogeneity in each chromosome. The LHGR pattern study not only quantifies heterogeneities, but also reveals some primary characteristics of the chromatin organization, including the followings: (1) the majority of LHGRs belong to GC-poor families and are in long length; (2) a high gene density tends to occur with the appearance of GC-rich LHGRs; and (3) the density of LINE repeats decreases with an increase in the GC content of LHGRs. Furthermore, a portion of LHGRs with particular GC ranges (50%–51% and 54%–55%) tend to have abnormally high gene densities, suggesting that biased gene conversion (BGC), as well as time- and energy-saving principles, could be of importance to the formation of genome organization. Conclusion This study significantly improves our knowledge of chromatin organization in the pig genome. Correlations between the different biological features (e.g., gene density and repeat density) and GC content of LHGRs provide a unique glimpse of in silico gene and repeats prediction.
Collapse
Affiliation(s)
- Wenqian Zhang
- Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
| | - Wenwu Wu
- Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
| | - Wenchao Lin
- Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
| | - Pengfang Zhou
- Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
| | - Li Dai
- Bioinformatics Center, College of Life Science, Northwest A&F University, Xianyang, Shaanxi, China
| | - Yang Zhang
- Investigation Group of Molecular Virology, Immunology, Oncology and Systems Biology, and Bioinformatics Center, College of Veterinary Medicine, Northwest A&F University, Xianyang, Shaanxi, China
| | - Jingfei Huang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
- * E-mail: (DZ); (JH)
| | - Deli Zhang
- Investigation Group of Molecular Virology, Immunology, Oncology and Systems Biology, and Bioinformatics Center, College of Veterinary Medicine, Northwest A&F University, Xianyang, Shaanxi, China
- * E-mail: (DZ); (JH)
| |
Collapse
|
14
|
Elhaik E, Graur D, Josić K, Landan G. Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm. Nucleic Acids Res 2010; 38:e158. [PMID: 20571085 PMCID: PMC2926622 DOI: 10.1093/nar/gkq532] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022] Open
Abstract
It has been suggested that the mammalian genome is composed mainly of long compositionally homogeneous domains. Such domains are frequently identified using recursive segmentation algorithms based on the Jensen–Shannon divergence. However, a common difficulty with such methods is deciding when to halt the recursive partitioning and what criteria to use in deciding whether a detected boundary between two segments is real or not. We demonstrate that commonly used halting criteria are intrinsically biased, and propose IsoPlotter, a parameter-free segmentation algorithm that overcomes such biases by using a simple dynamic halting criterion and tests the homogeneity of the inferred domains. IsoPlotter was compared with an alternative segmentation algorithm, DJS, using two sets of simulated genomic sequences. Our results show that IsoPlotter was able to infer both long and short compositionally homogeneous domains with low GC content dispersion, whereas DJS failed to identify short compositionally homogeneous domains and sequences with low compositional dispersion. By segmenting the human genome with IsoPlotter, we found that one-third of the genome is composed of compositionally nonhomogeneous domains and the remaining is a mixture of many short compositionally homogeneous domains and relatively few long ones.
Collapse
Affiliation(s)
- Eran Elhaik
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
| | | | | | | |
Collapse
|
15
|
Fan HC, Quake SR. Sensitivity of noninvasive prenatal detection of fetal aneuploidy from maternal plasma using shotgun sequencing is limited only by counting statistics. PLoS One 2010; 5:e10439. [PMID: 20454671 PMCID: PMC2862719 DOI: 10.1371/journal.pone.0010439] [Citation(s) in RCA: 125] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2010] [Accepted: 04/06/2010] [Indexed: 11/17/2022] Open
Abstract
We recently demonstrated noninvasive detection of fetal aneuploidy by shotgun sequencing cell-free DNA in maternal plasma using next-generation high throughput sequencer. However, GC bias introduced by the sequencer placed a practical limit on the sensitivity of aneuploidy detection. In this study, we describe a method to computationally remove GC bias in short read sequencing data by applying weight to each sequenced read based on local genomic GC content. We show that sensitivity is limited only by counting statistics and that sensitivity can be increased to arbitrary precision in sample containing arbitrarily small fraction of fetal DNA simply by sequencing more DNA molecules. High throughput shotgun sequencing of maternal plasma DNA should therefore enable noninvasive diagnosis of any type of fetal aneuploidy.
Collapse
Affiliation(s)
- H Christina Fan
- Department of Bioengineering, Stanford University and Howard Hughes Medical Institute, Stanford, California, United States of America
| | | |
Collapse
|
16
|
Expansión clónica y caracterización genómica del proceso de integración del virus linfotrópico humano tipo I en la leucemia/linfoma de células T en adultos. BIOMEDICA 2009. [DOI: 10.7705/biomedica.v29i2.24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
17
|
Keith JM, Adams P, Stephen S, Mattick JS. Delineating slowly and rapidly evolving fractions of the Drosophila genome. J Comput Biol 2008; 15:407-30. [PMID: 18435570 DOI: 10.1089/cmb.2007.0173] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.
Collapse
Affiliation(s)
- Jonathan M Keith
- School of Mathematical Sciences, Queensland University of Technology, Brisbane, Queensland, Australia.
| | | | | | | |
Collapse
|
18
|
Schmidt T, Frishman D. Assignment of isochores for all completely sequenced vertebrate genomes using a consensus. Genome Biol 2008; 9:R104. [PMID: 18590563 PMCID: PMC2481423 DOI: 10.1186/gb-2008-9-6-r104] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2008] [Revised: 05/22/2008] [Accepted: 06/30/2008] [Indexed: 11/16/2022] Open
Abstract
A new consensus isochore assignment method and a database of isochore maps for all completely sequenced vertebrate genomes are presented. We show that although the currently available isochore mapping methods agree on the isochore classification of about two-thirds of the human DNA, they produce significantly different results with regard to the location of isochore boundaries and isochore length distribution. We present a new consensus isochore assignment method based on majority voting and provide IsoBase, a comprehensive on-line database of isochore maps for all completely sequenced vertebrate genomes.
Collapse
Affiliation(s)
- Thorsten Schmidt
- Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, D-85350 Freising, Germany
| | | |
Collapse
|
19
|
Oliver JL, Bernaola-Galván P, Hackenberg M, Carpena P. Phylogenetic distribution of large-scale genome patchiness. BMC Evol Biol 2008; 8:107. [PMID: 18405379 PMCID: PMC2397391 DOI: 10.1186/1471-2148-8-107] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2007] [Accepted: 04/11/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The phylogenetic distribution of large-scale genome structure (i.e. mosaic compositional patchiness) has been explored mainly by analytical ultracentrifugation of bulk DNA. However, with the availability of large, good-quality chromosome sequences, and the recently developed computational methods to directly analyze patchiness on the genome sequence, an evolutionary comparative analysis can be carried out at the sequence level. RESULTS The local variations in the scaling exponent of the Detrended Fluctuation Analysis are used here to analyze large-scale genome structure and directly uncover the characteristic scales present in genome sequences. Furthermore, through shuffling experiments of selected genome regions, computationally-identified, isochore-like regions were identified as the biological source for the uncovered large-scale genome structure. The phylogenetic distribution of short- and large-scale patchiness was determined in the best-sequenced genome assemblies from eleven eukaryotic genomes: mammals (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris), birds (Gallus gallus), fishes (Danio rerio), invertebrates (Drosophila melanogaster and Caenorhabditis elegans), plants (Arabidopsis thaliana) and yeasts (Saccharomyces cerevisiae). We found large-scale patchiness of genome structure, associated with in silico determined, isochore-like regions, throughout this wide phylogenetic range. CONCLUSION Large-scale genome structure is detected by directly analyzing DNA sequences in a wide range of eukaryotic chromosome sequences, from human to yeast. In all these genomes, large-scale patchiness can be associated with the isochore-like regions, as directly detected in silico at the sequence level.
Collapse
Affiliation(s)
- José L Oliver
- Dpto de Genética, Facultad de Ciencias, Universidad de Granada, Spain.
| | | | | | | |
Collapse
|
20
|
Abstract
Whole-genome comparisons among mammalian and other eukaryotic organisms have revealed that they contain large quantities of conserved non-protein-coding sequence. Although some of the functions of this non-coding DNA have been identified, there remains a large quantity of conserved genomic sequence that is of no known function. Moreover, the task of delineating the conserved sequences is non-trivial, particularly when some sequences are conserved in only a small number of lineages. Sequence segmentation is a statistical technique for identifying putative functional elements in genomes based on atypical sequence characteristics, such as conservation levels relative to other genomes, GC content, SNP frequency, and potentially many others. The publicly available program changept and associated programs use Bayesian multiple change-point analysis to delineate classes of genomic segments with similar characteristics, potentially representing new classes of non-coding RNAs (contact web site: http://silmaril.math.sci.qut.edu.au/~keith/) .
Collapse
|
21
|
Haiminen N, Mannila H. Discovering isochores by least-squares optimal segmentation. Gene 2007; 394:53-60. [PMID: 17389148 DOI: 10.1016/j.gene.2007.01.028] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2006] [Revised: 01/16/2007] [Accepted: 01/22/2007] [Indexed: 10/23/2022]
Abstract
The isochore structure of a genome is observable by variation in the G+C (guanine and cytosine) content within and between the chromosomes. Describing the isochore structure of vertebrate genomes is a challenging task, and many computational methods have been developed and applied to it. Here we apply a well-known least-squares optimal segmentation algorithm to isochore discovery. The algorithm finds the best division of the sequence into k pieces, such that the segments are internally as homogeneous as possible. We show how this simple segmentation method can be applied to isochore discovery using as input the G+C content of sliding windows on the sequence. To evaluate the performance of this segmentation technique on isochore detection, we present results from segmenting previously studied isochore regions of the human genome. Detailed results on the MHC locus, on parts of chromosomes 21 and 22, and on a 100 Mb region from chromosome 1 are similar to previously suggested isochore structures. We also give results on segmenting all 22 autosomal human chromosomes. An advantage of this technique is that oversegmentation of G+C rich regions can generally be avoided. This is because the technique concentrates on greater global, instead of smaller local, differences in the sequence composition. The effect is further emphasized by a log-transformation of the data that lowers the high variance that is observed in G+C rich regions. We conclude that the least-squares optimal segmentation method is computationally efficient and yields results close to previous biologically motivated isochore structures.
Collapse
MESH Headings
- Algorithms
- Chromosomes, Human/genetics
- Chromosomes, Human, Pair 1/genetics
- Chromosomes, Human, Pair 21/genetics
- Chromosomes, Human, Pair 22/genetics
- Chromosomes, Human, Pair 6/genetics
- GC Rich Sequence
- Genome, Human
- Genomics/statistics & numerical data
- Humans
- Isochores/chemistry
- Isochores/genetics
- Least-Squares Analysis
- Major Histocompatibility Complex
Collapse
Affiliation(s)
- Niina Haiminen
- HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland.
| | | |
Collapse
|
22
|
Melodelima C, Gautier C, Piau D. A markovian approach for the prediction of mouse isochores. J Math Biol 2007; 55:353-64. [PMID: 17486342 DOI: 10.1007/s00285-007-0087-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2006] [Revised: 03/01/2007] [Indexed: 10/23/2022]
Abstract
Hidden Markov models (HMMs) are effective tools to detect series of statistically homogeneous structures, but they are not well suited to analyse complex structures. For example, the duration of stay in a state of a HMM must follow a geometric law. Numerous other methodological difficulties are encountered when using HMMs to segregate genes from transposons or retroviruses, or to determine the isochore classes of genes. The aim of this paper is to analyse these methodological difficulties, and to suggest new tools for the exploration of genome data. We show that HMMs can be used to analyse complex gene structures with bell-shaped length distribution by using convolution of geometric distributions. Thus, we have introduced macros-states to model the distributions of the lengths of the regions. Our study shows that simple HMM could be used to model the isochore organisation of the mouse genome. This potential use of markovian models to help in data exploration has been underestimated until now.
Collapse
Affiliation(s)
- Christelle Melodelima
- UMR 5558 CNRS Biométrie et Biologie Evolutive, Université Claude Bernard Lyon 1, 43 boulevard du 11 Novembre 1818, 69622 Villeurbanne Cedex, France.
| | | | | |
Collapse
|
23
|
Carpena P, Bernaola-Galván P, Coronado AV, Hackenberg M, Oliver JL. Identifying characteristic scales in the human genome. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:032903. [PMID: 17500745 DOI: 10.1103/physreve.75.032903] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/01/2006] [Indexed: 05/15/2023]
Abstract
The scale-free, long-range correlations detected in DNA sequences contrast with characteristic lengths of genomic elements, being particularly incompatible with the isochores (long, homogeneous DNA segments). By computing the local behavior of the scaling exponent alpha of detrended fluctuation analysis (DFA), we discriminate between sequences with and without true scaling, and we find that no single scaling exists in the human genome. Instead, human chromosomes show a common compositional structure with two characteristic scales, the large one corresponding to the isochores and the other to small and medium scale genomic elements.
Collapse
Affiliation(s)
- P Carpena
- Departamento de Física Aplicada II, Universidad de Málaga, 29071 Málaga, Spain
| | | | | | | | | |
Collapse
|
24
|
Abstract
Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/ approximately uqjkeith/) or upon request to the author.
Collapse
Affiliation(s)
- Jonathan M Keith
- Department of Mathematics, University of Queensland, Brisbane, Australia.
| |
Collapse
|
25
|
A computational prediction of isochores based on hidden Markov models. Gene 2006; 385:41-9. [PMID: 17020791 DOI: 10.1016/j.gene.2006.04.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2005] [Revised: 03/17/2006] [Accepted: 04/03/2006] [Indexed: 11/30/2022]
Abstract
Mammalian genomes are organised into a mosaic of regions (in general more than 300 kb in length), with differing, relatively homogeneous G+C contents. The G+C content is the basic characteristic of isochores, but they have also been associated with many other biological properties. For instance, the genes are more compact and their density is highest in G+C rich isochores. Various ways of locating isochores in the human genome have been developed, but such methods use only the base composition of the DNA sequences. The present paper proposes a new method, based on a hidden Markov model, which takes into account several of the biological properties associated with the isochore structure of a genome. This method leads to good segmentation of the human genome into isochores, and also permits a new analysis of the known heterogeneity of G+C rich isochores: most (60%) of the G+C poor genes embedded in G+C rich isochores have UTR sequences characteristic of G+C rich genes. This genomic feature is discussed in the context of both evolution and genome function.
Collapse
|
26
|
Abstract
Isochores are large DNA segments (>>300 kb on average) that are characterized by an internal variation in GC well below the full variation seen in the mammalian genome. Precisely defining in terms of size and composition as well as mapping the isochores on human chromosomes have, however, remained largely unsolved problems. Here we used a very simple approach to segment the human chromosomes de novo, based on assessments of GC and its variation within and between adjacent regions. We obtain a complete coverage of the human genome (neglecting the remaining gaps) by approximately 3200 isochores, which may be visualized as the ultimate chromosomal bands. Isochores visibly belong to five families characterized by different GC levels, as expected from previous investigations. Since we previously showed that isochores are tightly linked to basic biological properties such as gene density, replication timing, and recombination, the new level of detail provided by the isochore map will help the understanding of genome structure, function, and evolution.
Collapse
Affiliation(s)
- Maria Costantini
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy
| | - Oliver Clay
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy
| | - Fabio Auletta
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy
| | - Giorgio Bernardi
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, 80121 Naples, Italy
- Corresponding author.E-mail ; fax 39 081 2455807
| |
Collapse
|
27
|
del Carmen Seleme M, Vetter MR, Cordaux R, Bastone L, Batzer MA, Kazazian HH. Extensive individual variation in L1 retrotransposition capability contributes to human genetic diversity. Proc Natl Acad Sci U S A 2006; 103:6611-6. [PMID: 16618923 PMCID: PMC1458931 DOI: 10.1073/pnas.0601324103] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Despite being scarce in the human genome, active L1 retrotransposons continue to play a significant role in its evolution. Because of their recent expansion, many L1s are not fixed in humans, and, when present, their mobilization potential can vary among individuals. Previously, we showed that the great majority of retrotransposition events in humans are caused by highly active, or hot, L1s. Here, in four populations of diverse geographic origins (160 haploid genomes), we investigated the degree of sequence polymorphism of three hot L1s and the extent of individual variation in mobilization capability of their allelic variants. For each locus, we found one previously uncharacterized allele in every three to five genomes, including some with nonsense and insertion/deletion mutations. Single or multiple nucleotide substitutions drastically affected the retrotransposition efficiency of some alleles. One-third of elements were no longer hot, and these so-called cool alleles substantially increased the range of individual susceptibility to retrotransposition events. Adding the activity of the three elements in each individual resulted in a surprising degree of variation in mobilization capability, ranging from 0% to 390% of a reference L1. These data suggest that individual variation in retrotransposition potential makes an important contribution to human genetic diversity.
Collapse
Affiliation(s)
| | | | - Richard Cordaux
- Department of Biological Sciences, Biological Computation and Visualization Center, Center for Bio-Modular Multi-Scale Systems, Louisiana State University, Baton Rouge, LA 70803
| | - Laurel Bastone
- Division of Biostatistics, University of Pennsylvania School of Medicine, Philadelphia, PA 19104; and
| | - Mark A. Batzer
- Department of Biological Sciences, Biological Computation and Visualization Center, Center for Bio-Modular Multi-Scale Systems, Louisiana State University, Baton Rouge, LA 70803
| | - Haig H. Kazazian
- Department of Genetics
- To whom correspondence should be addressed. E-mail:
| |
Collapse
|
28
|
Zhang CT, Gao F, Zhang R. Segmentation algorithm for DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2005; 72:041917. [PMID: 16383430 DOI: 10.1103/physreve.72.041917] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/07/2005] [Indexed: 05/05/2023]
Abstract
A new measure, to quantify the difference between two probability distributions, called the quadratic divergence, has been proposed. Based on the quadratic divergence, a new segmentation algorithm to partition a given genome or DNA sequence into compositionally distinct domains is put forward. The new algorithm has been applied to segment the 24 human chromosome sequences, and the boundaries of isochores for each chromosome were obtained. Compared with the results obtained by using the entropic segmentation algorithm based on the Jensen-Shannon divergence, both algorithms resulted in all identical coordinates of segmentation points. An explanation of the equivalence of the two segmentation algorithms is presented. The new algorithm has a number of advantages. Particularly, it is much simpler and faster than the entropy-based method. Therefore, the new algorithm is more suitable for analyzing long genome sequences, such as human and other newly sequenced eukaryotic genome sequences.
Collapse
Affiliation(s)
- Chun-Ting Zhang
- Department of Physics, Tianjin University, Tianjin 300072, China.
| | | | | |
Collapse
|
29
|
Simpson CL, Hansen VK, Sham PC, Collins A, Powell JF, Al-Chalabi A. MaGIC: a program to generate targeted marker sets for genome-wide association studies. Biotechniques 2005; 37:996-9. [PMID: 15597550 DOI: 10.2144/04376bin03] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
High-throughput genotyping technologies such as DNA pooling and DNA microarrays mean that whole-genome screens are now practical for complex disease gene discovery using association studies. Because it is currently impractical to use all available markers, a subset is typically selected on the basis of required saturation density. Restricting markers to those within annotated genomic features of interest (e.g., genes or exons) or within feature-rich regions, reduces workload and cost while retaining much information. We have designed a program (MaGIC) that exploits genome assembly data to create lists of markers correlated with other genomic features. Marker lists are generated at a user-defined spacing and can target features with a user-defined density. Maps are in base pairs or linkage disequilibrium units (LDUs) as derived from the International HapMap data, which is useful for association studies and fine-mapping. Markers may be selected on the basis of heterozygosity and source database, and single nucleotide polymorphism (SNP) markers may additionally be selected on the basis of validation status. The import function means the method can be used for any genomic features such as housekeeping genes, long interspersed elements (LINES), or Alu repeats in humans, and is also functional for other species with equivalent data. The program and source code is freely available at http://cogent.iop.kcl.ac.uk/MaGIC.cogx.
Collapse
|
30
|
Hackenberg M, Bernaola-Galván P, Carpena P, Oliver JL. The Biased Distribution of Alus in Human Isochores Might Be Driven by Recombination. J Mol Evol 2005; 60:365-77. [PMID: 15871047 DOI: 10.1007/s00239-004-0197-2] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2004] [Accepted: 10/01/2004] [Indexed: 11/30/2022]
Abstract
Alu retrotransposons do not show a homogeneous distribution over the human genome but have a higher density in GC-rich (H) than in AT-rich (L) isochores. However, since they preferentially insert into the L isochores, the question arises: What is the evolutionary mechanism that shifts the Alu density maximum from L to H isochores? To disclose the role played by each of the potential mechanisms involved in such biased distribution, we carried out a genome-wide analysis of the density of the Alus as a function of their evolutionary age, isochore membership, and intron vs. intergene location. Since Alus depend on the retrotransposase encoded by the LINE1 elements, we also studied the distribution of LINE1 to provide a complete evolutionary scenario. We consecutively check, and discard, the contributions of the Alu/LINE1 competition for retrotransposase, compositional matching pressure, and Alu overrepresentation in introns. In analyzing the role played by unequal recombination, we scan the genome for Alu trimers, a direct product of Alu-Alu recombination. Through computer simulations, we show that such trimers are much more frequent than expected, the observed/expected ratio being higher in L than in H isochores. This result, together with the known higher selective disadvantage of recombination products in H isochores, points to Alu-Alu recombination as the main agent provoking the density shift of Alus toward the GC-rich parts of the genome. Two independent pieces of evidence-the lower evolutionary divergence shown by recently inserted Alu subfamilies and the higher frequency of old stand-alone Alus in L isochores-support such a conclusion. Other evolutionary factors, such as population bottlenecks during primate speciation, may have accelerated the fast accumulation of Alus in GC-rich isochores.
Collapse
Affiliation(s)
- Michael Hackenberg
- Departamento de Genética, Facultad de Ciencias, Universidad de Granada, Spain
| | | | | | | |
Collapse
|
31
|
Cohen N, Dagan T, Stone L, Graur D. GC composition of the human genome: in search of isochores. Mol Biol Evol 2005; 22:1260-72. [PMID: 15728737 DOI: 10.1093/molbev/msi115] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The isochore theory, proposed nearly three decades ago, depicts the mammalian genome as a mosaic of long, fairly homogeneous genomic regions that are characterized by their guanine and cytosine (GC) content. The human genome, for instance, was claimed to consist of five distinct isochore families: L1, L2, H1, H2, and H3, with GC contents of <37%, 37%-42%, 42%-47%, 47%-52%, and >52%, respectively. In this paper, we address the question of the validity of the isochore theory through a rigorous sequence-based analysis of the human genome. Toward this end, we adopt a set of six attributes that are generally claimed to characterize isochores and statistically test their veracity against the available draft sequence of the complete human genome. By the selection criteria used in this study: distinctiveness, homogeneity, and minimal length of 300 kb, we identify 1,857 genomic segments that warrant the label "isochore." These putative isochores are nonuniformly scattered throughout the genome and cover about 41% of the human genome. We found that a four-family model of putative isochores is the most parsimonious multi-Gaussian model that can be fitted to the empirical data. These families, however, are GC poor, with mean GC contents of 35%, 38%, 41%, and 48% and do not resemble the five isochore families in the literature. Moreover, due to large overlaps among the families, it is impossible to classify genomic segments into isochore families reliably, according to compositional properties alone. These findings undermine the utility of the isochore theory and seem to indicate that the theory may have reached the limits of its usefulness as a description of genomic compositional structures.
Collapse
Affiliation(s)
- Netta Cohen
- School of Computing, University of Leeds, Leeds, United Kingdom
| | | | | | | |
Collapse
|
32
|
Bernaola-Galván P, Oliver JL, Carpena P, Clay O, Bernardi G. Quantifying intrachromosomal GC heterogeneity in prokaryotic genomes. Gene 2004; 333:121-33. [PMID: 15177687 DOI: 10.1016/j.gene.2004.02.042] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2003] [Revised: 11/14/2003] [Accepted: 02/10/2004] [Indexed: 11/15/2022]
Abstract
The sequencing of prokaryotic genomes covering a wide taxonomic range has sparked renewed interest in intrachromosomal compositional (GC) heterogeneity, largely in view of lateral transfers. We present here a brief overview of some methods for visualizing and quantifying GC variation in prokaryotes. We used these methods to examine heterogeneity levels in sequenced prokaryotes, for a range of scales or stringencies. Some species are consistently homogeneous, whereas others are markedly heterogeneous in comparison, in particular Aeropyrum pernix, Xylella fastidiosa, Mycoplasma genitalium, Enterococcus faecalis, Bacillus subtilis, Pyrobaculum aerophilum, Vibrio vulnificus chromosome I, Deinococcus radiodurans chromosome II and Halobacterium. As we discuss here, the wide range of heterogeneities calls for reexamination of an accepted belief, namely that the endogenous DNA of bacteria and archaea should typically exhibit low intrachromosomal GC contrasts. Supplementary results for all species analyzed are available at our website: http://bioinfo2.ugr.es/prok.
Collapse
|
33
|
Zhang R, Zhang CT. Isochore Structures in the Genome of the Plant Arabidopsis thaliana. J Mol Evol 2004; 59:227-38. [PMID: 15486696 DOI: 10.1007/s00239-004-2617-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2003] [Accepted: 02/10/2004] [Indexed: 10/26/2022]
Abstract
Arabidopsis thaliana is an important model system for the study of plant biology. We have analyzed the complete genome sequences of Arabidopsis by using a newly developed windowless method for the GC content computation, the cumulative GC profile. It is shown that the Arabidopsis genome is organized into a mosaic structure of isochores. All the centromeric regions are located in GC-rich isochores, called centromere-isochores, which are characterized by a high GC content but low gene and T-DNA insertion densities. This characteristic distinguishes centromere-isochores from the other class of GC-rich isochores, called GC-isochores, which have high gene and T-DNA insertion densities. Consequently, 15 isochores have been identified, i.e., 7 AT-isochores, 3 GC-isochores, and 5 centromere-isochores. The genes in centromere-isochores, which have the highest GC content, have much shorter intron lengths and lower intron numbers, compared to those of the other two types. There is also considerable difference in the numbers and lengths of transposable elements (TEs) between AT and GC-isochores, i.e., the TE number (length) of AT-isochores is 6.3 (7.3) times that of GC-isochores. It is generally believed that TEs are accumulated in the regions surrounding the centromeres. However, within these TE-rich regions, there are regions of extremely low TE numbers (TE deserts), which correspond to the positions of centromere-isochores. In addition, a heterochromatic knob is located at the boundary of an AT-isochore. Furthermore, we show that the differences in GC content among isochores are mainly due to the GC content variation of introns, the third codon positions and intergenic regions.
Collapse
Affiliation(s)
- Ren Zhang
- Department of Epidemiology and Biostatistics, Tianjin Cancer Institute and Hospital, 300060 Tianjin, China
| | | |
Collapse
|
34
|
Oliver JL, Carpena P, Hackenberg M, Bernaola-Galván P. IsoFinder: computational prediction of isochores in genome sequences. Nucleic Acids Res 2004; 32:W287-92. [PMID: 15215396 PMCID: PMC441537 DOI: 10.1093/nar/gkh399] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2004] [Revised: 03/04/2004] [Accepted: 03/25/2004] [Indexed: 11/13/2022] Open
Abstract
Isochores are long genome segments homogeneous in G+C. Here, we describe an algorithm (IsoFinder) running on the web (http://bioinfo2.ugr.es/IsoF/isofinder.html) able to predict isochores at the sequence level. We move a sliding pointer from left to right along the DNA sequence. At each position of the pointer, we compute the mean G+C values to the left and to the right of the pointer. We then determine the position of the pointer for which the difference between left and right mean values (as measured by the t-statistic) reaches its maximum. Next, we determine the statistical significance of this potential cutting point, after filtering out short-scale heterogeneities below 3 kb by applying a coarse-graining technique. Finally, the program checks whether this significance exceeds a probability threshold. If so, the sequence is cut at this point into two subsequences; otherwise, the sequence remains undivided. The procedure continues recursively for each of the two resulting subsequences created by each cut. This leads to the decomposition of a chromosome sequence into long homogeneous genome regions (LHGRs) with well-defined mean G+C contents, each significantly different from the G+C contents of the adjacent LHGRs. Most LHGRs can be identified with Bernardi's isochores, given their correlation with biological features such as gene density, SINE and LINE (short, long interspersed repetitive elements) densities, recombination rate or single nucleotide polymorphism variability. The resulting isochore maps are available at our web site (http://bioinfo2.ugr.es/isochores/), and also at the UCSC Genome Browser (http://genome.cse.ucsc.edu/).
Collapse
Affiliation(s)
- José L Oliver
- Departamento de Genética, Instituto de Biotecnología, Facultad de Ciencias, Universidad de Granada, Spain.
| | | | | | | |
Collapse
|
35
|
Han JS, Szak ST, Boeke JD. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 2004; 429:268-74. [PMID: 15152245 DOI: 10.1038/nature02536] [Citation(s) in RCA: 371] [Impact Index Per Article: 18.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2004] [Accepted: 03/30/2004] [Indexed: 11/08/2022]
Abstract
LINE-1 (L1) elements are the most abundant autonomous retrotransposons in the human genome, accounting for about 17% of human DNA. The L1 retrotransposon encodes two proteins, open reading frame (ORF)1 and the ORF2 endonuclease/reverse transcriptase. L1 RNA and ORF2 protein are difficult to detect in mammalian cells, even in the context of overexpression systems. Here we show that inserting L1 sequences on a transcript significantly decreases RNA expression and therefore protein expression. This decreased RNA concentration does not result from major effects on the transcription initiation rate or RNA stability. Rather, the poor L1 expression is primarily due to inadequate transcriptional elongation. Because L1 is an abundant and broadly distributed mobile element, the inhibition of transcriptional elongation by L1 might profoundly affect expression of endogenous human genes. We propose a model in which L1 affects gene expression genome-wide by acting as a 'molecular rheostat' of target genes. Bioinformatic data are consistent with the hypothesis that L1 can serve as an evolutionary fine-tuner of the human transcriptome.
Collapse
Affiliation(s)
- Jeffrey S Han
- Department of Molecular Biology and Genetics and High Throughput Biology Center, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
| | | | | |
Collapse
|
36
|
Wen SY, Zhang CT. Identification of isochore boundaries in the human genome using the technique of wavelet multiresolution analysis. Biochem Biophys Res Commun 2004; 311:215-22. [PMID: 14575716 DOI: 10.1016/j.bbrc.2003.09.198] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Incorporated with the Z curve method, the technique of wavelet multiresolution (also known as multiscale) analysis has been proposed to identify the boundaries of isochores in the human genome. The human MHC sequence and the longest contigs of human chromosomes 21 and 22 are used as examples. The boundary between the isochores of Class III and Class II in the MHC sequence has been detected and found to be situated at the position 2,490,368bp. This result is in good agreement with the experimental evidence. An isochore with a length of about 7Mb in chromosome 21 has been identified and found to be gene- and Alu-poor. We have also found that the G+C content of chromosome 21 is more homogeneous than that of chromosome 22. Compared with the window-based methods, the present method has the highest resolution for identifying the boundaries of isochores, even at a scale of single base. Compared with the entropic segmentation method, the present method has the merits of more intuitiveness and less calculations. The important conclusion drawn in this study is that the segmentation points, at which the G+C content undergoes relatively dramatic changes, do exist in the human genome. These 'singularity' points may be considered to be candidates of isochore boundaries in the human genome. The method presented is a general one and can be used to analyze any other genomes.
Collapse
|
37
|
Hon LS, Jain AN. Compositional structure of repetitive elements is quantitatively related to co-expression of gene pairs. J Mol Biol 2003; 332:305-10. [PMID: 12948482 DOI: 10.1016/s0022-2836(03)00926-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Abstract
A sequence similarity metric operating on 10 kb upstream regions of gene pairs quantitatively predicts a portion of co-variation of expression of gene pairs in large-scale gene expression studies in human tumors and tumor-derived cell lines. The signal on which the metric depends most strongly originates in the compositional structure of repetitive genomic sequences (particularly Alu elements) present in these upstream regions. This effect is completely separable from effects of isochore composition on gene expression. The results implicate repetitive elements with some functional role in transcriptional regulation of the specific genes in whose promoter regions they reside and lend credence to suggestions that the general phenomenon of repetitive element insertions may be a fundamental evolutionary mechanism for modulating gene transcription.
Collapse
Affiliation(s)
- Lawrence S Hon
- Cancer Research Institute, University of California, 2340 Sutter Street S-336, Box 0128, San Francisco, CA 94143-0128, USA
| | | |
Collapse
|
38
|
Abstract
We present a coding measure which is based on the statistical properties of the stop codons, and that is able to estimate accurately the variation of coding content along an anonymous sequence. As the stop codons play the same role in all the genomes (with very few exceptions) the measure turns out to be species-independent. We show results both for prokaryotic and for eukaryotic genomes, indicating, first, the accuracy of the measure, and, second, that better prediction is achieved if the measure is applied on homogeneous, isochore-like sequences than if it is applied following the standard moving window approach. Finally, we discuss on some of the possible applications of the measure.
Collapse
Affiliation(s)
- P Carpena
- Departamento de Física Aplicada II, E.T.S.I. de Telecomunicación, Universidad de Málaga, Malaga, Spain.
| | | | | | | |
Collapse
|
39
|
Abstract
Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the null hypothesis (the mean value of a group of GC contents remains the same along the sequence) may or may not be rejected for an isochore sequence, depending on the subwindow sizes at which GC contents are sampled, and the window size within which group members are defined. The segmentation analysis shows that there are stronger indications of GC content changes at isochore borders than within an isochore. These analyses support the notion of isochore sequences, but reject the assumption that isochore sequences are homogeneous at the base level. An isochore sequence may pass a homogeneity test when GC content fluctuations at smaller length scales are ignored or averaged out.
Collapse
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore - LIJ Research Institute, 350 Community Drive, Manhasset, NY 10030, USA.
| |
Collapse
|