1
Shaukat MA, Nguyen TT, Hsu EB, Yang S, Bhatti A. Comparative study of encoded and alignment-based methods for virus taxonomy classification. Sci Rep 2023; 13:18662. [PMID: 37907535 PMCID: PMC10618506 DOI: 10.1038/s41598-023-45461-0] [Received: 07/04/2023] [Accepted: 10/19/2023]
Abstract
The emergence of viruses and their variants has made virus taxonomy more important than ever before in controlling the spread of diseases. Understanding virus taxonomy can aid the creation of efficient treatments and cures that target particular virus properties. Alignment-based methods are commonly used for this task, but are computationally expensive and time-consuming, especially when dealing with large datasets or when detecting new virus variants is time sensitive. An alternative approach, the encoded method, has been developed that does not require prior sequence alignment and provides faster results. However, each encoded method has its own claimed accuracy. Therefore, careful evaluation and comparison of the performance of different encoded methods are essential to identify the most accurate and reliable approach for virus taxonomy classification. This study aims to address this issue by providing a comprehensive and comparative analysis of the potential of encoded methods for virus classification and phylogenetics. We compared the vectors generated for each encoded method using distance metrics to determine their similarity to alignment-based methods. The results and their validation show that the K-merNV encoded method, followed by CgrDft, performs similarly to state-of-the-art multi-sequence alignment methods. This is the first study to incorporate and compare encoded methods, and it will help future research make more informed decisions when selecting a suitable method for virus taxonomy.
Affiliation(s)
- Muhammad Arslan Shaukat: Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
- Thanh Thi Nguyen: Faculty of Information Technology, Monash University, Victoria, Australia
- Edbert B Hsu: Department of Emergency Medicine, Johns Hopkins University, Maryland, USA
- Samuel Yang: Department of Emergency Medicine, Stanford University, California, USA
- Asim Bhatti: Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Victoria, Australia
2
Li DJ. Distributional features of triplet codons in genomes underlie the diversification of life. Biosystems 2022; 217:104681. [DOI: 10.1016/j.biosystems.2022.104681] [Received: 08/21/2021] [Revised: 04/04/2022] [Accepted: 04/07/2022]
3
Moqtaderi Z, Brown S, Bender W. Genome-wide oscillations in G + C density and sequence conservation. Genome Res 2021; 31:2050-2057. [PMID: 34649930 PMCID: PMC8559709 DOI: 10.1101/gr.274332.120] [Received: 11/13/2020] [Accepted: 09/01/2021]
Abstract
Eukaryotic genomes typically show a uniform G + C content among chromosomes, but on smaller scales, many species have a G + C density that fluctuates with a characteristic wavelength. This oscillation is evident in many insect species, with wavelengths ranging between 700 bp and 4 kb. Measures of evolutionary conservation oscillate in phase with G + C content, with conserved regions having higher G + C. Loci with large regulatory regions show more regular oscillations; coding sequences and heterochromatic regions show little or no oscillation. There is little oscillation in vertebrate genomes in regions with densely distributed mobile repetitive elements. However, species with few repeats show oscillation in both G + C density and sequence conservation. These oscillations may reflect optimal spacing of cis-regulatory elements.
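As a rough illustration of the kind of profile in which such oscillations would appear, the sketch below computes G + C density in sliding windows. The function name `gc_profile`, the window size, and the step are illustrative assumptions, not the parameters used by the authors.

```python
def gc_profile(seq, window, step):
    """Return the G + C fraction of each window of length `window`, sliding by `step`."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        profile.append(gc / window)
    return profile


# Toy sequence (hypothetical): three non-overlapping windows of 4 bases each.
print(gc_profile("GGCCAATTGCAT", window=4, step=4))
```

On a real chromosome the resulting profile could then be passed to a spectral method to look for a characteristic wavelength; that step is omitted here.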
Affiliation(s)
- Zarmik Moqtaderi: Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, Massachusetts 02115, USA
- Susan Brown: Department of Biology, Kansas State University, Manhattan, Kansas 66506, USA
- Welcome Bender: Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, Massachusetts 02115, USA
4
Touati R, Messaoudi I, Oueslati AE, Lachiri Z. Distinguishing between intra-genomic helitron families using time-frequency features and random forest approaches. Biomed Signal Process Control 2019. [DOI: 10.1016/j.bspc.2019.101579]
5
Mitaku S, Sawada R. Biological meaning of "habitable zone" in nucleotide composition space. Biophys Physicobiol 2018; 15:75-85. [PMID: 29892513 PMCID: PMC5992858 DOI: 10.2142/biophysico.15.0_75] [Received: 09/20/2017] [Accepted: 03/17/2018]
Abstract
Organisms generally display two contrasting properties: large biodiversity and a uniform state of "life". In this study, we focused on the question of how genome sequences describe "life" where a large number of biomolecules are harmonized. We analyzed the whole genome sequence of 2664 organisms, paying attention to the nucleotide composition which is an intensive parameter from the genome sequence. The results showed that all organisms were plotted in narrow regions of the nucleotide composition space of the first and second letters of the codon. Since all genome sequences overlap irrespective of the living environment, it can be called a "habitable zone". The habitable zone deviates by 500 times the standard deviation from the nucleotide composition expected from the random sequence, indicating that unexpectedly rare sequences are realized. Furthermore, we found that the habitable zones at the first and second letters of the codon serve as the background mechanisms for the functional network of biological systems. The habitable zone at the second letter of the codon controls the formation of transmembrane regions and the habitable zone at the first letter controls the formation of molecular recognition unit. These analyses showed that the habitable zone of the nucleotide composition space and the exquisite arrangement of amino acids in the codon table are conjugated to form biological systems. Finally, we discussed the evolution of the higher order of genome sequences.
Affiliation(s)
- Shigeki Mitaku: Emeritus Professor of Nagoya University, Kokubunji, Tokyo 185-0021, Japan
- Ryusuke Sawada: Division of System Cohort, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan
6
Danchin A, Sekowska A, Noria S. Functional Requirements in the Program and the Cell Chassis for Next-Generation Synthetic Biology. Synth Biol (Oxf) 2018. [DOI: 10.1002/9783527688104.ch5]
Affiliation(s)
- Antoine Danchin: Institute of Cardiometabolism and Nutrition; 47 boulevard de l'Hôpital Paris 75013 France
- Agnieszka Sekowska: Institute of Cardiometabolism and Nutrition; 47 boulevard de l'Hôpital Paris 75013 France
- Stanislas Noria: Fondation Fourmentin-Guilbert; 2 avenue du Pavé Neuf Noisy le Grand 93160 France
7
Oueslati AE, Messaoudi I, Lachiri Z, Ellouze N. A new way to visualize DNA's base succession: the Caenorhabditis elegans chromosome landscapes. Med Biol Eng Comput 2015; 53:1165-76. [PMID: 26003183 DOI: 10.1007/s11517-015-1304-9] [Received: 03/05/2014] [Accepted: 05/03/2015]
Abstract
In eukaryotic genomes, genetic diseases are generally associated with tandem repeats, which seem to appear frequently. In this paper, we describe a wavelet transform technique that provides a new way to represent the succession of DNA bases as DNA progression images. These images offer DNA landscapes for visualizing and following periodicities through genomes. We investigated a structural coding technique, the Pnuc. Then, using time-frequency representations, we illustrated the existence and superposition of periodicities in some biological features, their locations, and the different ways in which they appear. The representations showed that a periodicity can sometimes occur alone but is generally combined with others. In the Caenorhabditis elegans chromosome, these associations of periodicities create a precise structural image of biological features such as CeRep, Helitrons, repeats and satellites.
Affiliation(s)
- Afef Elloumi Oueslati: Laboratoire Signal, Image et Technologies de l'information, Département de Génie Electrique, Ecole Nationale d'Ingénieurs de Tunis, BP 37, Campus Universitaire, Le Belvédère, 1002, Tunis Cedex, Tunisia
- Imen Messaoudi: Laboratoire Signal, Image et Technologies de l'information, Département de Génie Electrique, Ecole Nationale d'Ingénieurs de Tunis, BP 37, Campus Universitaire, Le Belvédère, 1002, Tunis Cedex, Tunisia
- Zied Lachiri: Département de Génie Physique et Instrumentation, Institut National des Sciences Appliquées et de Technologie, BP 676, Centre Urbain, 1080, Tunis Cedex, Tunisia
- Noureddine Ellouze: Laboratoire Signal, Image et Technologies de l'information, Département de Génie Electrique, Ecole Nationale d'Ingénieurs de Tunis, BP 37, Campus Universitaire, Le Belvédère, 1002, Tunis Cedex, Tunisia
8
Hoang T, Yin C, Zheng H, Yu C, Lucy He R, Yau SST. A new method to cluster DNA sequences using Fourier power spectrum. J Theor Biol 2015; 372:135-45. [PMID: 25747773 PMCID: PMC7094126 DOI: 10.1016/j.jtbi.2015.02.026] [Received: 09/12/2014] [Revised: 01/15/2015] [Accepted: 02/23/2015]
Abstract
A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods.
Highlights: We propose to use Fourier power spectrum to cluster genes and genomes. We construct mathematical moments from the power spectrum. We perform phylogenetic analysis of genes and genomes based on moments.
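The pipeline described in this abstract (binary indicator sequences, DFT, power spectra, moments, fixed-length vectors) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the moment normalization (dividing the sum of spectrum powers by n to the power j) and the choice of three moment orders are assumptions made here for simplicity, and the naive O(n^2) DFT is only suitable for short toy sequences.

```python
import cmath
import math


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where `base` occurs, 0.0 elsewhere."""
    return [1.0 if c == base else 0.0 for c in seq.upper()]


def power_spectrum(signal):
    """Naive discrete Fourier transform power spectrum |X(k)|^2 for k = 0..n-1."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def moment_vector(seq, orders=(1, 2, 3)):
    """Fixed-length feature vector: one moment per base and per order.

    Assumed normalization (not taken from the paper): sum of PS(k)^j over
    k >= 1, divided by n^j, so that sequences of different lengths map to
    vectors of the same dimension.
    """
    n = len(seq)
    features = []
    for base in "ACGT":
        ps = power_spectrum(indicator(seq, base))[1:]  # drop the zero-frequency term
        for j in orders:
            features.append(sum(p ** j for p in ps) / n ** j)
    return features


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy sequences of different lengths (hypothetical data) give comparable vectors.
a = moment_vector("ACGTACGTAC")
b = moment_vector("ACGTACGTACGTTTGA")
print(len(a), len(b), euclidean(a, b))
```

The resulting distances could then be fed to any standard clustering routine; the key property illustrated is that the vector length depends only on the number of moment orders, not on the sequence length.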
Affiliation(s)
- Tung Hoang: Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Changchuan Yin: Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Hui Zheng: Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
- Chenglong Yu: Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
- Rong Lucy He: Department of Biological Sciences, Chicago State University, Chicago, IL, USA
- Stephen S-T Yau: Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
9
Messaoudi I, Elloumi-Oueslati A, Lachiri Z. Building Specific Signals from Frequency Chaos Game and Revealing Periodicities Using a Smoothed Fourier Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:863-877. [PMID: 26356859 DOI: 10.1109/tcbb.2014.2315991]
Abstract
Investigating the roles and functions of DNA within genomes is becoming a primary focus of genomic research. Thus, the research works are moving towards cooperation between different scientific disciplines which aims at facilitating the interpretation of genetic information. In order to characterize the DNA of living organisms, signal processing tools appear to be very suitable for such study. However, a DNA sequence must be converted into a numerical sequence before processing; which defines the concept of DNA coding. In line with this, we propose a new one dimensional model based on the chaos game representation theory called Frequency Chaos Game Signal: FCGS. Then, we perform a Smoothed Fourier Transform to enhance hidden periodicities in the C.elegans DNA sequences. Through this study, we demonstrate the performance of our coding approach in highlighting characteristic periodicities. Indeed, several periodicities are shown to be involved in the 1D spectra and the 2D spectrograms of FCGSs. To investigate further about the contribution of our method in the enhancement of characteristic spectral attributes, a comparison with a range of binary indicators is established.
10
Valenzuela CY. The structure of selective dinucleotide interactions and periodicities in D melanogaster mtDNA. Biol Res 2014; 47:18. [PMID: 25027717 PMCID: PMC4101722 DOI: 10.1186/0717-6287-47-18] [Received: 04/25/2014] [Accepted: 04/26/2014]
Abstract
BACKGROUND We found a strong selective 3-sites periodicity of deviations from randomness of the dinucleotide (DN) distribution, where both bases of DN were separated by 1, 2, K sites in prokaryotes and mtDNA. Three main aspects are studied. I) the specific 3 K-sites periodic structure of the 16 DN. II) to discard the possibility that the periodicity was produced by the highly nonrandom interactive association of contiguous bases, by studying the interaction of non-contiguous bases, the first one chosen each I sites and the second chosen J sites downstream. III) the difference between this selective periodicity of association (distance to randomness) of the four bases with the described fixed periodicities of base sequences. RESULTS I) The 16 pairs presented a consistent periodicity in the strength of association of both bases of the pairs; the most deviated pairs are those where G and C are involved and the least deviated ones are those where A and T are involved. II) we found significant non-random interactions when the first nucleotide is chosen every I sites and the second J sites downstream until I=J=76. III) we showed conclusive differences between these internucleotide association periodicities and sequence periodicities. CONCLUSIONS This relational selective periodicity is different from sequence periodicities and indicates that any base strongly interacts with the bases of the residual genome; this interaction and periodicity is highly structured and systematic for every pair of bases. This interaction should be destroyed in few generations by recurrent mutation; it is only compatible with the Synthetic Theory of Evolution and agrees with the Wright's adaptive landscape conception and evolution by shifting balanced adaptive peaks.
11
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Received: 08/25/2013] [Accepted: 12/17/2013]
Abstract
Background The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high throughput-sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results We observe that the proportion of non-singletons k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values for k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read lengths much slower than linear. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
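To make the singleton versus non-singleton distinction concrete, the following sketch counts k-mers in a toy sequence and reports the fraction of k-mer positions whose k-mer occurs more than once. Counting by position on the forward strand only is an assumption made here for illustration; the paper's exact counting convention is not stated in the abstract, and the function name is hypothetical.

```python
from collections import Counter


def non_singleton_fraction(seq, k):
    """Fraction of k-mer positions whose k-mer appears more than once in `seq`.

    Assumption: forward strand only, overlapping k-mers, counted by position.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0  # sequence shorter than k
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Toy sequence (hypothetical): "ACGT" occurs twice, so some 4-mers are non-singletons,
# while every 5-mer is unique.
toy = "ACGTACGTTT"
for k in (4, 5):
    print(k, non_singleton_fraction(toy, k))
```

In this toy case the fraction drops to zero as k grows from 4 to 5, which mirrors in miniature the decrease with increasing k that the abstract reports for the human genome.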
Affiliation(s)
- Wentian Li: The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA
12
Xing YQ, Liu GQ, Zhao XJ, Cai L. An analysis and prediction of nucleosome positioning based on information content. Chromosome Res 2013; 21:63-74. [PMID: 23435498 DOI: 10.1007/s10577-013-9338-z] [Received: 11/05/2012] [Revised: 01/20/2013] [Accepted: 01/24/2013]
Abstract
Nucleosome positioning plays a key role in the regulation of many biological processes. In this study, the statistical difference of information content was investigated in nucleosome and linker DNA regions across eukaryotic organisms. By analyzing the information redundancy, D_k, in Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans genomes, the short-range dominance of nucleotide correlation in nucleosome and linker DNA regions was confirmed. Significant difference of the D_k value between the nucleosome and linker DNA regions was also found. The underlying reason for many successful oligonucleotide-based predictions of nucleosome positioning in eukaryotic model organisms may be attributed to the short-range dominance of nucleotide correlation in the nucleosome and linker DNA regions. When applying power spectrum analysis to the nucleosome and linker DNA regions, some obvious differences in sequence periodic signals were observed. The parameter F_k was introduced to describe particular base correlation. Furthermore, the support vector machine combining F_k was used to classify nucleosome and linker DNA regions in Homo sapiens, Oryzias latipes, C. elegans, Candida albicans, and S. cerevisiae. Independent test demonstrated that a good performance can be achieved by using this algorithm. This result further revealed that base correlation information has an important role in nucleosome positioning.
Affiliation(s)
- Yong-qiang Xing: School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
13
Glunčić M, Paar V. Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res 2012; 41:e17. [PMID: 22977183 PMCID: PMC3592446 DOI: 10.1093/nar/gks721]
Abstract
The main feature of global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of complete set of a K-string ensemble which enables a new method of direct mapping of symbolic DNA sequence into frequency domain, with straightforward identification of repeats as peaks in GRM diagram. In this way, we obtain very fast, efficient and highly automatized repeat finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use, in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of Period 3 pattern in exons, implementation of ‘magnifying glass’ effect, identification of complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).
Affiliation(s)
- Matko Glunčić: Faculty of Science, University of Zagreb, Bijenička 32 and Croatian Academy of Sciences and Arts, Zrinski trg 11, 10000 Zagreb, Croatia
14
Wavelet analysis of DNA walks on the human and chimpanzee MAGE/CSAG-palindromes. Genomics Proteomics & Bioinformatics 2012; 10:230-6. [PMID: 23084779 PMCID: PMC5054716 DOI: 10.1016/j.gpb.2012.07.004] [Received: 11/24/2011] [Revised: 02/18/2012] [Accepted: 03/02/2012]
Abstract
The palindrome is one class of symmetrical duplications with reverse complementary characters, which is widely distributed in many organisms. Graphical representation of DNA sequence provides a simple way of viewing and comparing various genomic structures. Through 3-D DNA walk analysis, the similarity and differences in nucleotide composition, as well as the evolutionary relationship between human and chimpanzee MAGE/CSAG-palindromes, can be clearly revealed. Further wavelet analysis indicated that duplicated segments have irregular patterns compared to their surrounding sequences. However, sequence similarity analysis suggests that there is possible common ancestor between human and chimpanzee MAGE/CSAG-palindromes. Based on the specific distribution and orientation of the repeated sequences, a simple possible evolutionary model of the palindromes is suggested, which may help us to better understand the evolutionary course of the genes and the symmetrical sequences.
15
Nunes MCS, Wanner EF, Weber G. Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome. BMC Genomics 2011; 12 Suppl 4:S4. [PMID: 22369134 PMCID: PMC3287587 DOI: 10.1186/1471-2164-12-s4-s4]
Abstract
Background Fourier transforms and their associated power spectra are used for detecting periodicities and protein-coding genes and are generally regarded as a well-established technique. Many of the periodicities which have been found with this method are quite well understood, such as the periodicity of 3 nt, which is associated with codon usage. But what is the origin of the peculiar frequency multiples k/21 which were reported for a tiny section of chromosome 2 in P. falciparum? Are these present in other chromosomes and perhaps in related organisms? And how should we interpret fractional periodicities in genomes? Results We applied the binary indicator power spectrum to all chromosomes of P. falciparum, and found that the frequency overtones k/21 are present only in non-coding sections. We did not find such frequency overtones in any other related genomes. Furthermore, the frequency overtones were identified as artifacts of the way the genome is encoded into a numerical sequence, that is, they are frequency aliases. When a different encoding of the sequence is chosen, the overtones do not appear. In view of these results, we revisited early applications of this technique to proteins where frequency overtones were reported. Conclusions Some authors hinted recently at the possibility of mapping artifacts and frequency aliases in power spectra. However, in the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency which is used for gene detection. This shows that, although this is a well-known technique with a long history of application in proteins, few researchers seem to be aware of the problems represented by frequency aliases.
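For readers unfamiliar with the binary indicator power spectrum mentioned above, the sketch below sums the squared DFT magnitudes of the four base indicator sequences at a single frequency index and checks the 1/3 frequency on a toy codon repeat. The function name and the toy sequence are assumptions for illustration; the sketch does not reproduce the aliasing analysis of the paper.

```python
import cmath


def indicator_power(seq, k):
    """Sum over A, C, G, T of |X_b(k)|^2, the binary indicator power at DFT index k."""
    seq = seq.upper()
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT of the 0/1 indicator of `base`: only positions holding `base` contribute.
        coeff = sum(cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, c in enumerate(seq) if c == base)
        total += abs(coeff) ** 2
    return total


# Toy sequence (hypothetical): a perfect 3-nt repeat concentrates power at frequency 1/3.
seq = "ATG" * 10
print(indicator_power(seq, len(seq) // 3))  # index n/3 corresponds to frequency 1/3
print(indicator_power(seq, 7))              # an unrelated index carries essentially no power
```

Because the result depends on how bases are mapped to numbers, a different encoding can move or remove peaks, which is the kind of artifact the abstract attributes to frequency aliases.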
Affiliation(s)
- Miriam C S Nunes: Department of Biological Sciences, Federal University of Ouro Preto, 35400-000 Ouro Preto, MG, Brazil
16
Xing Y, Zhao X, Cai L. Prediction of nucleosome occupancy in Saccharomyces cerevisiae using position-correlation scoring function. Genomics 2011; 98:359-66. [DOI: 10.1016/j.ygeno.2011.07.008] [Received: 04/19/2011] [Revised: 07/16/2011] [Accepted: 07/26/2011]
17
Grishkevich V, Hashimshony T, Yanai I. Core promoter T-blocks correlate with gene expression levels in C. elegans. Genome Res 2011; 21:707-17. [PMID: 21367940 PMCID: PMC3083087 DOI: 10.1101/gr.113381.110] [Received: 07/28/2010] [Accepted: 02/17/2011]
Abstract
Core promoters mediate transcription initiation by the integration of diverse regulatory signals encoded in the proximal promoter and enhancers. It has been suggested that genes under simple regulation may have low-complexity permissive promoters. For these genes, the core promoter may serve as the principal regulatory element; however, the mechanism by which this occurs is unclear. We report here a periodic poly-thymine motif, which we term T-blocks, enriched in occurrences within core promoter forward strands in Caenorhabditis elegans. An increasing number of T-blocks on either strand is associated with increasing nucleosome eviction. Strikingly, only forward strand T-blocks are correlated with expression levels, whereby genes with ≥6 T-blocks have fivefold higher expression levels than genes with ≤3 T-blocks. We further demonstrate that differences in T-block numbers between strains predictably affect expression levels of orthologs. Highly expressed genes and genes in operons tend to have a large number of T-blocks, as well as the previously characterized SL1 motif involved in trans-splicing. The presence of T-blocks thus correlates with low nucleosome occupancy and the precision of a trans-splicing motif, suggesting its role at both the DNA and RNA levels. Collectively, our results suggest that core promoters may tune gene expression levels through the occurrences of T-blocks, independently of the spatio-temporal regulation mediated by the proximal promoter.
Affiliation(s)
- Tamar Hashimshony: Department of Biology, Technion–Israel Institute of Technology, Haifa 32000, Israel
- Itai Yanai: Department of Biology, Technion–Israel Institute of Technology, Haifa 32000, Israel
18
Liu H, Duan X, Yu S, Sun X. Analysis of nucleosome positioning determined by DNA helix curvature in the human genome. BMC Genomics 2011; 12:72. [PMID: 21269520 PMCID: PMC3037905 DOI: 10.1186/1471-2164-12-72] [Received: 06/08/2010] [Accepted: 01/27/2011]
Abstract
Background Nucleosome positioning has an important role in gene regulation. However, dynamic positioning in vivo casts doubt on the reliability of predictions based on DNA sequence characteristics. What role does sequence-dependent positioning play? In this paper, using a curvature profile model, nucleosomes are predicted in the human genome and patterns of nucleosomes near some key sites are investigated. Results Curvature profiling revealed that in the vicinity of a transcription start site, there is also a nucleosome-free region. Near transcription factor binding sites, curvature profiling showed a trough, indicating nucleosome depletion. The trough of the curvature profile corresponds well to the high binding scores of transcription factors. Moreover, our analysis suggests that nucleosome positioning has a selective protection role. Target sites of miRNAs are occupied by nucleosomes, while single nucleotide polymorphism sites are depleted of nucleosomes. Conclusions The results indicate that DNA sequences play an important role in nucleosome positioning, and the positioning is important not only in gene regulation, but also in genetic variation and miRNA functions.
Collapse
Affiliation(s)
- Hongde Liu
- State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China
|
19
|
Chen K, Wang L, Yang M, Liu J, Xin C, Hu S, Yu J. Sequence signatures of nucleosome positioning in Caenorhabditis elegans. Genomics Proteomics & Bioinformatics 2010; 8:92-102. [PMID: 20691394 PMCID: PMC5054450 DOI: 10.1016/s1672-0229(10)60010-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/13/2022]
Abstract
Our recent investigation in the protist Trichomonas vaginalis suggested a DNA sequence periodicity with a unit length of 120.9 nt, which represents a sequence signature for nucleosome positioning. We have now extended our observation to higher eukaryotes and identified a similar periodicity of 175 nt in length in Caenorhabditis elegans. In the process of defining the sequence compositional characteristics, we found that the 10.5-nt periodicity, the sequence signature of the DNA double helix, may not be sufficient for cross-nucleosome positioning but provides essential guiding rails to facilitate positioning. We further dissected nucleosome-protected sequences and identified a strong positive purine (AG) gradient from the 5′-end to the 3′-end, and also learnt that the nucleosome-enriched regions are GC-rich compared with the nucleosome-free sequences, as purine content is positively correlated with GC content. Sequence characterization allowed us to develop a hidden Markov model (HMM) algorithm for decoding nucleosome positioning computationally, and based on a set of training data from the fifth chromosome of C. elegans, our algorithm predicted 60%-70% of the well-positioned nucleosomes, which is 15%-20% higher than random positioning. We concluded that nucleosomes are not randomly positioned on DNA sequences and yet bind to different genome regions with variable stability, that well-positioned nucleosomes leave sequence signatures on DNA, and that statistical positioning of nucleosomes across the genome can be decoded computationally based on these sequence signatures.
Affiliation(s)
- Kaifu Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
|
20
|
Tanaka Y, Yamashita R, Suzuki Y, Nakai K. Effects of Alu elements on global nucleosome positioning in the human genome. BMC Genomics 2010; 11:309. [PMID: 20478020 PMCID: PMC2878307 DOI: 10.1186/1471-2164-11-309] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Received: 02/25/2009] [Accepted: 05/17/2010] [Indexed: 11/14/2022] Open
Abstract
Background Understanding the genome sequence-specific positioning of nucleosomes is essential to understand various cellular processes, such as transcriptional regulation and replication. As a typical example, the 10-bp periodicity of AA/TT and GC dinucleotides has been reported in several species, but it is still unclear whether this feature can be observed in the whole genomes of all eukaryotes. Results With Fourier analysis, we found that this is not the case: 84-bp and 167-bp periodicities are prevalent in primates. The 167-bp periodicity is intriguing because it is almost equal to the sum of the lengths of a nucleosomal unit and its linker region. After masking Alu elements, these periodicities were greatly diminished. Next, using two independent large-scale sets of nucleosome mapping data, we analyzed the distribution of nucleosomes in the vicinity of Alu elements and showed that (1) there are one or two fixed slot(s) for nucleosome positioning within the Alu element and (2) the positioning of neighboring nucleosomes seems to be in phase, more or less, with the presence of Alu elements. Furthermore, (3) these effects of Alu elements on nucleosome positioning are consistent with inactivation of promoter activity in Alu elements. Conclusions Our discoveries suggest that the principle governing nucleosome positioning differs greatly across species and that the Alu family is an important factor in primate genomes.
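The Fourier-based periodicity search described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the binary AA/TT indicator signal, and the toy sequence are assumptions made for illustration, and a genome-scale analysis would use an FFT library rather than the direct sum shown here.

```python
import cmath

def dinucleotide_signal(seq, dinucs=("AA", "TT")):
    """Binary indicator: 1 where one of the given dinucleotides starts, else 0."""
    seq = seq.upper()
    return [1 if seq[i:i + 2] in dinucs else 0 for i in range(len(seq) - 1)]

def power_at_period(signal, period):
    """Power of the mean-centred signal at one period (in bp), by direct Fourier sum."""
    n = len(signal)
    mean = sum(signal) / n
    freq = 1.0 / period
    coeff = sum((x - mean) * cmath.exp(-2j * cmath.pi * freq * i)
                for i, x in enumerate(signal))
    return abs(coeff) ** 2 / n

def strongest_period(seq, periods):
    """Return the candidate period with the highest spectral power."""
    signal = dinucleotide_signal(seq)
    return max(periods, key=lambda p: power_at_period(signal, p))

# Hypothetical toy input: an A-rich stretch repeated every 10 bp.
toy = "AAAAACGCGC" * 30
best = strongest_period(toy, range(2, 21))
```

On the toy input, `best` is the planted 10-bp period. Real genomic signals are far noisier, so candidate peaks would need to be compared against a shuffled or masked background, as the abstract does by masking Alu elements.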
Affiliation(s)
- Yoshiaki Tanaka
- Department of Medical Genome Sciences, University of Tokyo, Minato-ku, Japan
|
21
|
Abstract
In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method outputs coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
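Two of the features listed in this abstract, the stop codon count (i) and the product of purine probabilities over the three triplet positions (ii), can be computed as in the sketch below. This is an illustration only: the function names are hypothetical, and the way UFM combines its five features into a score is not given in the abstract and is not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
PURINES = {"A", "G"}

def codons(seq, frame=0):
    """Split a sequence into complete triplets, starting at the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def stop_codon_count(seq, frame=0):
    """Number of stop codons read in the given frame."""
    return sum(1 for c in codons(seq, frame) if c in STOP_CODONS)

def purine_probability_product(seq, frame=0):
    """Product of the purine (A or G) frequencies at triplet positions 1, 2 and 3."""
    triplets = codons(seq, frame)
    if not triplets:
        return 0.0
    product = 1.0
    for pos in range(3):
        purines = sum(1 for c in triplets if c[pos] in PURINES)
        product *= purines / len(triplets)
    return product
```

For example, `"ATGGCATAA"` read in frame 0 gives the triplets ATG, GCA and TAA, hence one stop codon and a purine product of (2/3) x (1/3) x 1 = 2/9.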
Affiliation(s)
- Nicolas Carels
- Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
|
22
|
Blinowska KJ, Trzaskowski B, Kaminski M, Kus R. Multivariate autoregressive model for a study of phylogenetic diversity. Gene 2009; 435:104-18. [PMID: 19393180 DOI: 10.1016/j.gene.2009.01.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 09/24/2008] [Revised: 12/17/2008] [Accepted: 01/05/2009] [Indexed: 12/01/2022]
Abstract
We present a computationally effective model to parameterize DNA sequences in a way that comprehensively describes their auto- and cross-correlation structure. The approach is based on a four-channel Multivariate Autoregressive Model (MVAR). The model was applied to a study of genes from the globin family for 6 vertebrate species. First, the sequences were coded as four signals (corresponding to the nucleotides), which were fitted to a four-channel MVAR. From the correlation matrices the vectors of model coefficients were calculated as functions of the nucleotide distance. The between-chromosome and inter-species differences were best distinguished in the cross-coefficients binding different nucleotide sequences. For clustering purposes different metrics were tested and then two clustering procedures (Nearest Neighbor and UPGMA) were applied. The clustering trees and consensus trees were constructed for exons, introns and whole genes. The results were in agreement with the known dependencies between the chromosomes of the globin family. The orthologous genes for different species were grouped together. Inside these groups the phylogenetically close organisms were localized in proximity.
Affiliation(s)
- K J Blinowska
- Department of Biomedical Physics, Warsaw University, Poland.
|
23
|
Babbitt GA, Kim Y. Inferring natural selection on fine-scale chromatin organization in yeast. Mol Biol Evol 2008; 25:1714-27. [PMID: 18515262 DOI: 10.1093/molbev/msn127] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 12/30/2022] Open
Abstract
Despite its potential role in the evolution of complex phenotypes, the detection of negative (purifying) and positive selection on noncoding regulatory sequence has been elusive because of the inherent difficulty in predicting the functional consequences of mutations on noncoding sequence. Because the functioning of regulatory sequence depends upon both chromatin configuration and cis-regulatory factor binding, we investigate the idea that the functional conservation of regulatory regions should be associated with the conservation of sequence-dependent bending properties of DNA that determine its affinity for the nucleosome. Recent advances in the computational prediction of sequence-dependent affinity to nucleosomes provide an opportunity to distinguish between neutral and nonneutral evolution of fine-scale chromatin organization. Here, a statistical test is presented for detecting evolutionary conservation and/or adaptive evolution of nucleosome affinity from interspecies comparisons of DNA sequences. Local nucleosome affinities of homologous sequences were calculated using 2 recently published methods. A randomization test was applied to sites of mutation to evaluate the similarity of DNA-nucleosome affinity between several closely related species of Saccharomyces yeast. For most of the genes we analyzed, the conservation of local nucleosome affinity was detected at a few distinct locations in the upstream noncoding region. Our results also demonstrate that different patterns of chromatin evolution have shaped DNA-nucleosome interaction at the core promoters of TATA-containing and TATA-less genes and that elevated purifying selection has maintained low affinity for nucleosome in the core promoters of the latter group. Across the entire yeast genome, DNA-nucleosome interaction was also discovered to be significantly more conserved in TATA-less genes compared with TATA-containing genes.
Affiliation(s)
- G A Babbitt
- Center for Evolutionary Functional Genomics, The Biodesign Institute, Arizona State University, USA.
|
24
|
Shelenkov A, Korotkov A, Korotkov E. MMsat—a database of potential micro- and minisatellites. Gene 2008; 409:53-60. [DOI: 10.1016/j.gene.2007.11.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2007] [Revised: 10/08/2007] [Accepted: 11/16/2007] [Indexed: 11/28/2022]
|
25
|
Revisiting the relationship between compositional sequence complexity and periodicity. Comput Biol Chem 2007; 32:17-28. [PMID: 17983838 DOI: 10.1016/j.compbiolchem.2007.09.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 12/01/2006] [Revised: 05/21/2007] [Accepted: 09/03/2007] [Indexed: 11/20/2022]
Abstract
BACKGROUND Given a big sequence fragment or a set of functionally related sequences we consider two problems of a sequence analysis associated with the given sequence(s). The first problem is to measure sequence complexity (repetitiveness, compactness) to estimate how informative the set as a whole is. Usually an obtained measure should be compared with an appropriate random background calculated using permutation of the given sequences. We propose a novel and effective approach for background information measurement instead of the usual sequence reshuffling. The second problem is to detect a periodic bias to determine if it is one of the set features. Sequence periodicity, when sometimes one has in mind hidden periodicity, is a very basic genomic property. The sequence period of 3, which is considered to characterize coding sequences, and period 10-11, which may be due to the alternation of hydrophobic and hydrophilic amino acids, DNA curvature, and bendability were discovered and described. Searching for periodical biases brought significant results in the study of sequence-dependent nucleosome positioning: nucleosomal sites carry hidden period of about 10.4 bases. RESULTS Calculated differences between genomic sequences and background showed high biological relevancy of the method that we proposed in this study. Our algorithm was applied to a few natural and artificial datasets. We constructed a simple "periodic" dataset by replacement of every tenth dinucleotide in each sequence of a trial set by the same dinucleotide "CC". We showed that the method reveals the introduced periodicity and that this periodical pattern carries higher information than in uninterrupted subsequences. An application of the method to the nucleosomal dataset revealed a weak pseudo-periodicity of 10.4 nucleotides confirming previous knowledge. An application of the method to Escherichia coli datasets revealed the well-known periodicity of 3 bp as a genic attribute, a secondary genic period slightly larger than 11 bp, and an intergenic period a bit smaller than 11 bp. CONCLUSIONS We reported a novel compositional complexity-based method for sequence analysis. We found that the difference between the sequence complexity of a natural sequence and of background is especially high for a set consisting exclusively of coding sequences. Hidden periodicities were found with no need of any preliminary assumptions regarding a composition of periodic elements. We illustrated the power of the method by studying the sets with known weak periodic properties: a nucleosomal database and sets of different regions of E. coli. We showed that the method conveniently indicated all kinds of periodicity and related features in these sets of DNA sequences.
|
26
|
Ma BG. How to describe genes: Enlightenment from the quaternary number system. Biosystems 2007; 90:20-7. [PMID: 16945479 DOI: 10.1016/j.biosystems.2006.06.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 05/13/2005] [Revised: 06/15/2006] [Accepted: 06/19/2006] [Indexed: 11/17/2022]
Abstract
As an open problem, computational gene identification has been widely studied, and many gene finders (software) become available today. However, little attention has been given to the problem of describing the common features of known genes in databanks to transform raw data into human understandable knowledge. In this paper, we draw attention to the task of describing genes and propose a trial implementation by treating DNA sequences as quaternary numbers. Under such a treatment, the common features of genes can be represented by a "position weight function", the core concept for a number system. In principle, the "position weight function" can be any real-valued function. In this paper, by approximating the function using trigonometric functions, some characteristic parameters indicating single nucleotide periodicities were obtained for the bacteria Escherichia coli K12's genome and the eukaryote yeast's genome. As a byproduct of this approach, a single-nucleotide-level measure is derived that complements codon-based indexes in describing the coding quality and expression level of an open reading frame (ORF). The ideas presented here have the potential to become a general methodology for biological sequence analysis.
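The idea of reading a DNA sequence as a quaternary number with a pluggable "position weight function" can be sketched as follows. The digit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions for illustration; the abstract does not state which mapping or which fitted weight function the paper uses.

```python
# Assumed digit mapping; the paper's actual assignment is not given in the abstract.
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def standard_weight(position, length):
    """Ordinary base-4 positional weight: the leftmost digit is the most significant."""
    return 4 ** (length - 1 - position)

def sequence_value(seq, weight=standard_weight):
    """Sum of digit * weight(position) over the sequence, for any real-valued weight function."""
    seq = seq.upper()
    return sum(DIGITS[base] * weight(i, len(seq)) for i, base in enumerate(seq))
```

With the standard weight, `sequence_value("ACGT")` is 0·64 + 1·16 + 2·4 + 3·1 = 27. Passing a different `weight`, for instance a trigonometric function of position, gives the generalized form the abstract alludes to.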
Affiliation(s)
- Bin-Guang Ma
- College of Chemistry and Chemical Engineering, Suzhou University, Suzhou 215006, PR China.
|
27
|
Johnson SM, Tan FJ, McCullough HL, Riordan DP, Fire AZ. Flexibility and constraint in the nucleosome core landscape of Caenorhabditis elegans chromatin. Genome Res 2006; 16:1505-16. [PMID: 17038564 PMCID: PMC1665634 DOI: 10.1101/gr.5560806] [Citation(s) in RCA: 146] [Impact Index Per Article: 7.7] [Indexed: 11/25/2022]
Abstract
Nucleosome positions within the chromatin landscape are known to serve as a major determinant of DNA accessibility to transcription factors and other interacting components. To delineate nucleosomal patterns in a model genetic organism, Caenorhabditis elegans, we have carried out a genome-wide analysis in which DNA fragments corresponding to nucleosome cores were liberated using an enzyme (micrococcal nuclease) with a strong preference for cleavage in non-nucleosomal regions. Sequence analysis of 284,091 putative nucleosome cores obtained in this manner from a mixed-stage population of C. elegans reveals a combined picture of flexibility and constraint in nucleosome positioning. As has previously been observed in studies of individual loci in diverse biological systems, we observe areas in the genome where nucleosomes can adopt a wide variety of positions in a given region, areas with little or no nucleosome coverage, and areas where nucleosomes reproducibly adopt a specific positional pattern. In addition to illuminating numerous aspects of chromatin structure for C. elegans, this analysis provides a reference from which to begin an investigation of relationships between the nucleosomal pattern, chromosomal architecture, and lineage-based gene activity on a genome-wide scale.
Affiliation(s)
- Steven M. Johnson
- Department of Pathology, Stanford University School of Medicine, Stanford, California 94305-5324, USA
- Frederick J. Tan
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, 21218, USA
- Heather L. McCullough
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5324, USA
- Daniel P. Riordan
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5324, USA
- Andrew Z. Fire
- Department of Pathology, Stanford University School of Medicine, Stanford, California 94305-5324, USA
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5324, USA
- Corresponding author. Fax: (650) 724-9070
|
28
|
Li W, Miramontes P. Large-scale oscillation of structure-related DNA sequence features in human chromosome 21. Phys Rev E Stat Nonlin Soft Matter Phys 2006; 74:021912. [PMID: 17025477 DOI: 10.1103/physreve.74.021912] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 03/24/2006] [Indexed: 05/12/2023]
Abstract
Human chromosome 21 is the only chromosome in the human genome that exhibits oscillation of the (G+C) content with a cycle length of hundreds of kilobases (kb) (approximately 500 kb near the right telomere). We aim to establish the existence of a similar periodicity in structure-related sequence features in order to relate this (G+C)% oscillation to other biological phenomena. The following quantities are shown to oscillate with the same 500 kb periodicity in human chromosome 21: binding energy calculated by two sets of dinucleotide-based thermodynamic parameters, AA/TT and AAA/TTT bi- and tri-nucleotide density, 5'-TA-3' dinucleotide density, and signal for 10- or 11-base periodicity of AA/TT or AAA/TTT. These intrinsic quantities are related to structural features of the double helix of DNA molecules, such as base-pair binding, untwisting or unwinding, stiffness, and a putative tendency for nucleosome formation.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, New York 11030, USA.
|
29
|
Fire A, Alcazar R, Tan F. Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans. Genetics 2006; 173:1259-73. [PMID: 16648589 PMCID: PMC1526662 DOI: 10.1534/genetics.106.057364] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.6] [Received: 02/17/2006] [Accepted: 04/21/2006] [Indexed: 11/18/2022] Open
Abstract
We describe a surprising long-range periodicity that underlies a substantial fraction of C. elegans genomic sequence. Extended segments (up to several hundred nucleotides) of the C. elegans genome show a strong bias toward occurrence of AA/TT dinucleotides along one face of the helix while little or no such constraint is evident on the opposite helical face. Segments with this characteristic periodicity are highly overrepresented in intron sequences and are associated with a large fraction of genes with known germline expression in C. elegans. In addition to altering the path and flexibility of DNA in vitro, sequences of this character have been shown by others to constrain DNA::nucleosome interactions, potentially producing a structure that could resist the assembly of highly ordered (phased) nucleosome arrays that have been proposed as a precursor to heterochromatin. We propose a number of ways that the periodic occurrence of An/Tn clusters could reflect evolution and function of genes that express in the germ cell lineage of C. elegans.
Affiliation(s)
- Andrew Fire
- Department of Pathology, Stanford University School of Medicine, Stanford, California 94305-5324, USA.
|
30
|
Moreno-Herrero F, Seidel R, Johnson SM, Fire A, Dekker NH. Structural analysis of hyperperiodic DNA from Caenorhabditis elegans. Nucleic Acids Res 2006; 34:3057-66. [PMID: 16738142 PMCID: PMC1474062 DOI: 10.1093/nar/gkl397] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.8] [Indexed: 12/20/2022] Open
Abstract
Several bioinformatics studies have identified an unexpected but remarkably prevalent approximately 10 bp periodicity of AA/TT dinucleotides (hyperperiodicity) in certain regions of the Caenorhabditis elegans genome. Although the relevant C.elegans DNA segments share certain sequence characteristics with bent DNAs from other sources (e.g. trypanosome mitochondria), the nematode sequences exhibit a much more extensive and defined hyperperiodicity. Given the presence of hyperperiodic structures in a number of critical C.elegans genes, the physical characteristics of hyperperiodic DNA are of considerable interest. In this work, we demonstrate that several hyperperiodic DNA segments from C.elegans exhibit structural anomalies using high-resolution atomic force microscopy (AFM) and gel electrophoresis. Our quantitative analysis of AFM images reveals that hyperperiodic DNA adopts a significantly smaller mean square end-to-end distance, hence a more compact coil structure, compared with non-periodic DNA of similar length. While molecules remain capable of adopting both bent and straight (rod-like) configurations, indicating that their flexibility is still retained, examination of the local curvatures along the DNA contour length reveals that the decreased mean square end-to-end distance can be attributed to the presence of long-scale intrinsic bending in hyperperiodic DNA. Such bending is not detected in non-periodic DNA. Similar studies of shorter, nucleosome-length DNAs that survived micrococcal nuclease digestion show that sequence hyperperiodicity in short segments can likewise induce strong intrinsic bending. It appears, therefore, that regions of the C.elegans genome display a significant correlation between DNA sequence and unusual mechanical properties.
Affiliation(s)
- Steven M. Johnson
- Department of Pathology, Stanford University School of Medicine, 300 Pasteur Drive, Room L235, Stanford, CA 94305-5324, USA
- Department of Genetics, Stanford University School of Medicine, 300 Pasteur Drive, Room L235, Stanford, CA 94305-5324, USA
- Andrew Fire
- Department of Pathology, Stanford University School of Medicine, 300 Pasteur Drive, Room L235, Stanford, CA 94305-5324, USA
- Department of Genetics, Stanford University School of Medicine, 300 Pasteur Drive, Room L235, Stanford, CA 94305-5324, USA
- Nynke H. Dekker
- To whom correspondence should be addressed. Tel: +31 0 15 278 3219; Fax: +31 0 15 278 1202
|
31
|
|
32
|
Tian YX, Chen C, Zou XY, Tan XC, Cai PX, Mo JY. Study on Fractal Characteristics of the Coding Sequences in DNA. Chin J Chem 2006. [DOI: 10.1002/cjoc.200690081] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/08/2023]
|
33
|
Cao Y, Tung WW, Gao JB. Recurrence time statistics: versatile tools for genomic DNA sequence analysis. Proc IEEE Comput Syst Bioinform Conf 2006:40-51. [PMID: 16447998 DOI: 10.1109/csb.2004.1332415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.
|
34
|
Cao Y, Tung WW, Gao JB, Qi Y. Recurrence time statistics: versatile tools for genomic DNA sequence analysis. J Bioinform Comput Biol 2005; 3:677-96. [PMID: 16108089 DOI: 10.1142/s0219720005001235] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 08/25/2004] [Revised: 11/05/2004] [Accepted: 12/10/2004] [Indexed: 11/18/2022]
Abstract
With the completion of the human and a few model organisms' genomes, and with the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time-based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Our method requires approximately 6N bytes of memory and a computational time of order N log N to extract all the repeat-related and periodic or quasi-periodic features from a sequence of length N without any prior knowledge of the consensus sequence of those features, and hence enables us to carry out sequence analysis on the whole genomic scale by a PC.
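A recurrence-time histogram of the general kind this abstract refers to can be sketched as below. The definition used here, the distance between successive occurrences of the same k-mer, is an assumption chosen for illustration; the abstract does not spell out the authors' exact definition, their codon index, or their N log N algorithm, and none of those is reproduced.

```python
from collections import Counter

def recurrence_times(seq, k):
    """Histogram of distances between successive occurrences of each k-mer.

    Assumed definition for illustration: for every k-mer occurrence after the
    first, record how many positions have passed since its previous occurrence.
    """
    seq = seq.upper()
    last_seen = {}
    times = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            times[i - last_seen[word]] += 1
        last_seen[word] = i
    return times
```

On a perfect tandem repeat such as `"ATGATGATG"` with `k=3`, every recurrence falls at distance 3, so the histogram has a single peak there. In real sequence, a peak at 3 and its multiples would be the kind of signal associated with coding regions.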
Affiliation(s)
- Yinhe Cao
- Biosieve, 1026 Springfield Drive, Campbell, CA 95008, USA.
|
35
|
Larsabal E, Danchin A. Genomes are covered with ubiquitous 11 bp periodic patterns, the "class A flexible patterns". BMC Bioinformatics 2005; 6:206. [PMID: 16120222 PMCID: PMC1242344 DOI: 10.1186/1471-2105-6-206] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 06/22/2005] [Accepted: 08/24/2005] [Indexed: 11/17/2022] Open
Abstract
Background The genomes of prokaryotes and lower eukaryotes display a very strong 11 bp periodic bias in the distribution of their nucleotides. This bias is present throughout a given genome, both in coding and non-coding sequences. Until now this bias remained of unknown origin. Results Using a technique for analysis of auto-correlations based on linear projection, we identified the sequences responsible for the bias. Prokaryotic and lower eukaryotic genomes are covered with ubiquitous patterns that we termed "class A flexible patterns". Each pattern is composed of up to ten conserved nucleotides or dinucleotides distributed into a discontinuous motif. Each occurrence spans a region up to 50 bp in length. They belong to what we named the "flexible pattern" type, in that there is some limited fluctuation in the distances between the nucleotides composing each occurrence of a given pattern. When taken together, these patterns cover up to half of the genome in the majority of prokaryotes. They generate the previously recognized 11 bp periodic bias. Conclusion Judging from the structure of the patterns, we suggest that they may define a dense network of protein interaction sites in chromosomes.
Affiliation(s)
- Etienne Larsabal
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
- Antoine Danchin
- Unité de Génétique des Génomes Bactériens, Institut Pasteur, URA CNRS 2171, 28, rue du Docteur Roux, 75724 Paris Cedex 15, France
|
36
|
Li W, Holste D. Universal 1/f noise, crossovers of scaling exponents, and chromosome-specific patterns of guanine-cytosine content in DNA sequences of the human genome. Phys Rev E Stat Nonlin Soft Matter Phys 2005; 71:041910. [PMID: 15903704 DOI: 10.1103/physreve.71.041910] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 01/28/2004] [Revised: 10/28/2004] [Indexed: 05/02/2023]
Abstract
Spatial fluctuations of guanine and cytosine base content (GC%) are studied by spectral analysis for the complete set of human genomic DNA sequences. We find that (i) 1/f^α decay is universally observed in the power spectra of all 24 chromosomes, and (ii) the exponent α ≈ 1 extends to about 10^7 bases, one order of magnitude longer than has previously been observed. We further find that (iii) almost all human chromosomes exhibit a crossover from α1 ≈ 1 (1/f^α1) at lower frequency to α2 < 1 (1/f^α2) at higher frequency, typically occurring at around 30,000-100,000 bases, while (iv) the crossover in this frequency range is virtually absent in human chromosome 22. In addition to the universal 1/f^α noise in power spectra, we find (v) several lines of evidence for chromosome-specific correlation structures, including a 500,000 base long oscillation in human chromosome 21. The universal 1/f^α spectrum in the human genome is further substantiated by a resistance to reduction in variance of guanine and cytosine content when the window size is increased.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, North Shore LIJ Institute for Medical Research, 350 Community Drive, Manhasset, NY 11030, USA.
|
37
|
Li W, Holste D. An unusual 500,000 bases long oscillation of guanine and cytosine content in human chromosome 21. Comput Biol Chem 2004; 28:393-9. [PMID: 15556480 DOI: 10.1016/j.compbiolchem.2004.09.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 09/17/2004] [Revised: 09/30/2004] [Accepted: 09/30/2004] [Indexed: 01/09/2023]
Abstract
An oscillation with a period of around 500 kb in guanine and cytosine content (GC%) is observed in the DNA sequence of human chromosome 21. This oscillation is localized in the rightmost one-eighth region of the chromosome, from 43.5 Mb to 46.5 Mb. Five cycles of oscillation are observed in this region, with six GC-rich peaks and five GC-poor valleys. The GC-poor valleys comprise regions with low density of CpG islands and, alternating between the two DNA strands, low gene density regions. Consequently, the long-range oscillation of GC% results in spacing patterns of both CpG island density and, to a lesser extent, gene density.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, North Shore LIJ Institute for Medical Research, 350 Community Drive, Manhasset, NY 11030, USA.
|
38
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2003. [PMCID: PMC2448450 DOI: 10.1002/cfg.228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
|
39
|
Fukushima A, Ikemura T, Kanaya S. Comparative Genome Analysis Focused on Periodicity from Prokaryote to Higher Eukaryote Genomes Based on Power Spectrum. Journal of Computer Chemistry, Japan 2003. [DOI: 10.2477/jccj.2.95] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/23/2023]
|