51
Shakya DK, Saxena R, Sharma SN. An adaptive window length strategy for eukaryotic CDS prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1241-1252. [PMID: 24384711 DOI: 10.1109/tcbb.2013.76] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 06/03/2023]
Abstract
Signal processing-based algorithms for identification of coding sequences (CDS) in eukaryotes are non-data driven and exploit the presence of three-base periodicity in these regions for their detection. Three-base periodicity is commonly detected using the short-time Fourier transform (STFT), which uses a window function of fixed length. As the length of the protein coding and noncoding regions varies widely, the identification accuracy of STFT-based algorithms is poor. In this paper, a novel signal processing-based algorithm is developed by enabling window length adaptation in the STFT of DNA sequences to improve the identification of three-base periodicity. The length of the window function has been made adaptive in coding regions to maximize the magnitude of the period-3 measure, whereas in the noncoding regions, the window length is tailored to minimize this measure. Simulation results on benchmark data sets demonstrate the advantage of this algorithm when compared with other non-data-driven methods for CDS prediction.
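As a point of reference for the fixed-window baseline that this abstract contrasts against, the following Python sketch computes a period-3 measure over a sliding window. It is not the authors' adaptive algorithm: the binary indicator mapping, the rectangular window, and the names `period3_power` and `sliding_period3` are assumptions made for illustration only.

```python
import cmath
import math

BASES = "ACGT"


def period3_power(window: str) -> float:
    """Sum over A, C, G, T of |DFT at frequency 1/3|^2 of each base's 0/1 indicator."""
    total = 0.0
    for base in BASES:
        acc = 0j
        for n, symbol in enumerate(window):
            if symbol == base:
                # Contribution of this position to the DFT bin at 1/3 cycles per base.
                acc += cmath.exp(-2j * math.pi * n / 3)
        total += abs(acc) ** 2
    return total


def sliding_period3(seq: str, window_len: int, step: int = 3) -> list:
    """Return (start, period-3 power) for every fixed-length window along seq."""
    return [
        (start, period3_power(seq[start:start + window_len]))
        for start in range(0, len(seq) - window_len + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence (hypothetical): a codon-like repeat flanked by homopolymer runs.
    demo = "A" * 30 + "ATG" * 10 + "A" * 30
    for start, power in sliding_period3(demo, window_len=30, step=30):
        print(start, round(power, 3))
```

On the toy sequence, the two homopolymer windows give a measure near zero and the window covering the repeat gives a large one. An adaptive scheme in the spirit of the abstract would vary `window_len` along the sequence instead of fixing it.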
Affiliation(s)
- Rajiv Saxena
- Jaypee University of Engineering and Technology, Raghogarh, Guna
52
Detecting the borders between coding and non-coding DNA regions in prokaryotes based on recursive segmentation and nucleotide doublets statistics. BMC Genomics 2012; 13 Suppl 8:S19. [PMID: 23282225 PMCID: PMC3535712 DOI: 10.1186/1471-2164-13-s8-s19] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
BACKGROUND Detecting the borders between coding and non-coding regions is an essential step in genome annotation, and information entropy measures are useful for describing the signals in genome sequences. However, the accuracy of previous border-finding methods based on entropy segmentation still needs to be improved. METHODS In this study, we first applied a new recursive entropic segmentation method to DNA sequences to obtain preliminary significant cuts. A 22-symbol alphabet is used to capture the differential composition of nucleotide doublets and stop codon patterns along three phases in both DNA strands. This process requires no prior training datasets. RESULTS Compared with previous segmentation methods, the experimental results on three bacterial genomes, Rickettsia prowazekii, Borrelia burgdorferi and E. coli, show that our approach improves the accuracy of finding the borders between coding and non-coding regions in DNA sequences. CONCLUSIONS This paper presents a new segmentation method for prokaryotes based on Jensen-Rényi divergence with a 22-symbol alphabet. For three bacterial genomes, compared with the A12_JR method, our method raised the accuracy of finding the borders between protein coding and non-coding regions in DNA sequences.
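The core step of an entropic segmentation, choosing the cut that maximizes a Jensen-Rényi divergence between the two resulting segments, can be sketched as follows. This is a simplified illustration, not the paper's method: it uses the plain 4-letter nucleotide alphabet instead of the 22-symbol alphabet, length-proportional weights, and an order `alpha = 0.5`, all of which are assumptions. The function names are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha=0.5):
    """Renyi entropy of order alpha (alpha != 1) in nats."""
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def distribution(segment, alphabet="ACGT"):
    """Relative frequency of each alphabet symbol in a non-empty segment."""
    counts = Counter(segment)
    return [counts[s] / len(segment) for s in alphabet]


def jensen_renyi(left, right, alpha=0.5):
    """Length-weighted Jensen-Renyi divergence between two segments."""
    total = len(left) + len(right)
    w_l, w_r = len(left) / total, len(right) / total
    p_l, p_r = distribution(left), distribution(right)
    mix = [w_l * a + w_r * b for a, b in zip(p_l, p_r)]
    return (renyi_entropy(mix, alpha)
            - w_l * renyi_entropy(p_l, alpha)
            - w_r * renyi_entropy(p_r, alpha))


def best_cut(seq, alpha=0.5, margin=1):
    """Index i that maximizes the divergence between seq[:i] and seq[i:]."""
    candidates = range(margin, len(seq) - margin + 1)
    return max(candidates, key=lambda i: jensen_renyi(seq[:i], seq[i:], alpha))


if __name__ == "__main__":
    print(best_cut("A" * 20 + "C" * 20))
```

A recursive variant would call `best_cut` again on each side of an accepted cut and stop once the divergence falls below a significance threshold; that stopping rule is not specified in the abstract and is left out here.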
53
Rushdi A, Tuqan J, Strohmer T. Map-invariant spectral analysis for the identification of DNA periodicities. EURASIP Journal on Bioinformatics & Systems Biology 2012; 2012:16. [PMID: 23067324 PMCID: PMC3751961 DOI: 10.1186/1687-4153-2012-16] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/31/2012] [Accepted: 09/06/2012] [Indexed: 11/10/2022]
Abstract
Many signal processing based methods for finding hidden periodicities in DNA sequences have primarily focused on assigning numerical values to the symbolic DNA sequence and then applying spectral analysis tools such as the short-time discrete Fourier transform (ST-DFT) to locate these repeats. The key results pertaining to this approach are however obtained using a very specific symbolic to numerical map, namely the so-called Voss representation. An important research problem is to therefore quantify the sensitivity of these results to the choice of the symbolic to numerical map. In this article, a novel algebraic approach to the periodicity detection problem is presented and provides a natural framework for studying the role of the symbolic to numerical map in finding these repeats. More specifically, we derive a new matrix-based expression of the DNA spectrum that comprises most of the widely used mappings in the literature as special cases, shows that the DNA spectrum is in fact invariable under all these mappings, and generates a necessary and sufficient condition for the invariance of the DNA spectrum to the symbolic to numerical map. Furthermore, the new algebraic framework decomposes the periodicity detection problem into several fundamental building blocks that are totally independent of each other. Sophisticated digital filters and/or alternate fast data transforms such as the discrete cosine and sine transforms can therefore be always incorporated in the periodicity detection scheme regardless of the choice of the symbolic to numerical map. Although the newly proposed framework is matrix based, identification of these periodicities can be achieved at a low computational cost.
Affiliation(s)
- Ahmad Rushdi
- Department of Electrical and Computer Engineering, University of California, Davis, CA 95616, USA; now with Cisco Systems, Inc., San Jose, CA 95134, USA.
54
Glunčić M, Paar V. Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res 2012; 41:e17. [PMID: 22977183 PMCID: PMC3592446 DOI: 10.1093/nar/gks721] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 01/22/2023]
Abstract
The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. The efficacy is due to the use of the complete set of a K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of the Period 3 pattern in exons, implementation of a ‘magnifying glass’ effect, identification of a complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is convenient to use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).
Affiliation(s)
- Matko Glunčić
- Faculty of Science, University of Zagreb, Bijenička 32 and Croatian Academy of Sciences and Arts, Zrinski trg 11, 10000 Zagreb, Croatia.
55
Wavelet analysis of DNA walks on the human and chimpanzee MAGE/CSAG-palindromes. Genomics, Proteomics & Bioinformatics 2012; 10:230-6. [PMID: 23084779 PMCID: PMC5054716 DOI: 10.1016/j.gpb.2012.07.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/24/2011] [Revised: 02/18/2012] [Accepted: 03/02/2012] [Indexed: 11/22/2022]
Abstract
The palindrome is one class of symmetrical duplications with reverse complementary characters, which is widely distributed in many organisms. Graphical representation of DNA sequence provides a simple way of viewing and comparing various genomic structures. Through 3-D DNA walk analysis, the similarities and differences in nucleotide composition, as well as the evolutionary relationship between human and chimpanzee MAGE/CSAG-palindromes, can be clearly revealed. Further wavelet analysis indicated that duplicated segments have irregular patterns compared to their surrounding sequences. However, sequence similarity analysis suggests that there is a possible common ancestor between human and chimpanzee MAGE/CSAG-palindromes. Based on the specific distribution and orientation of the repeated sequences, a simple possible evolutionary model of the palindromes is suggested, which may help us to better understand the evolutionary course of the genes and the symmetrical sequences.
56
Abstract
The distributions of codons in the DNA sequence of Escherichia coli K-12 are studied by using several statistical methods of analysis. Codons corresponding to the amino acids leucine, alanine and isoleucine are considered. The pair distributions of the codons as a function of the pair separation are evaluated and are seen to decay exponentially. The exponential decay constants have a linear relation with the numbers of the codons, indicating that the codons are randomly distributed in the sequence. The pair correlation and power spectral methods also show similar statistical behavior of codons in the sequence, with the exception that there appear very small peaks about the frequency f=0.286 in the power spectra of the amino acids leucine, alanine and isoleucine. Such a frequency reflects a periodicity of about 3.5 amino acids and a general helical structure of the proteins of the bacterium.
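The step from the spectral peak to the stated periodicity is the reciprocal of the frequency, assuming the frequency is measured in cycles per codon position (a unit the abstract implies but does not state):

```latex
T = \frac{1}{f} = \frac{1}{0.286} \approx 3.5 \ \text{amino acids per period}
```

For comparison, an ideal α-helix has about 3.6 residues per turn, which is the kind of helical structure the abstract alludes to.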
Affiliation(s)
- Su-Long Nyeo
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, Republic of China
- I-Ching Yang
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, Republic of China
57
Rivard SR, Mailloux JG, Beguenane R, Bui HT. Design of high-performance parallelized gene predictors in MATLAB. BMC Res Notes 2012; 5:183. [PMID: 22490084 PMCID: PMC3444342 DOI: 10.1186/1756-0500-5-183] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/07/2011] [Accepted: 04/10/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND This paper proposes a method of implementing parallel gene prediction algorithms in MATLAB. The proposed designs are based on either Goertzel's algorithm or on FFTs and have been implemented using varying amounts of parallelism on a central processing unit (CPU) and on a graphics processing unit (GPU). FINDINGS Results show that an implementation using a straightforward approach can require over 4.5 h to process 15 million base pairs (bps) whereas a properly designed one could perform the same task in less than five minutes. In the best case, a GPU implementation can yield these results in 57 s. CONCLUSIONS The present work shows how parallelism can be used in MATLAB for gene prediction in very large DNA sequences to produce results that are over 270 times faster than a conventional approach. This is significant as MATLAB is typically overlooked due to its apparent slow processing time even though it offers a convenient environment for bioinformatics. From a practical standpoint, this work proposes two strategies for accelerating genome data processing which rely on different parallelization mechanisms. Using a CPU, the work shows that direct access to the MEX function increases execution speed and that the PARFOR construct should be used in order to take full advantage of the parallelizable Goertzel implementation. When the target is a GPU, the work shows that data needs to be segmented into manageable sizes within the GFOR construct before processing in order to minimize execution time.
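Goertzel's algorithm, one of the two building blocks named in this abstract, evaluates a single DFT bin with a second-order recurrence instead of a full FFT. The sketch below is plain Python rather than the paper's MATLAB/MEX code, and it shows no parallelism; the function names and the binary indicator mapping are assumptions for illustration.

```python
import math


def goertzel_power(samples, k):
    """|X[k]|^2 of the len(samples)-point DFT, computed with Goertzel's recurrence."""
    n = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def period3_power_goertzel(window):
    """Period-3 power of a DNA window: Goertzel at bin n/3, summed over the four bases."""
    n = len(window) - len(window) % 3  # trim so that bin n/3 is an integer
    if n == 0:
        return 0.0
    k = n // 3
    return sum(
        goertzel_power([1.0 if c == base else 0.0 for c in window[:n]], k)
        for base in "ACGT"
    )


if __name__ == "__main__":
    print(period3_power_goertzel("ATG" * 10))
```

Because each window is processed independently, a loop over windows is the natural unit to distribute across workers, which is the role the abstract assigns to MATLAB's PARFOR construct.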
Affiliation(s)
- Sylvain Robert Rivard
- Département des sciences appliquées, Université du Québec à Chicoutimi, 555 blvd de l'Université, Chicoutimi, QC G7H 2B1, Canada.
58
Ramachandran P, Lu WS, Antoniou A. Filter-based methodology for the location of hot spots in proteins and exons in DNA. IEEE Trans Biomed Eng 2012; 59:1598-609. [PMID: 22410955 DOI: 10.1109/tbme.2012.2190512] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 11/08/2022]
Abstract
The so-called receiver operating characteristic technique is used as a tool in an optimization procedure for the improvement and assessment of a filter-based methodology for the location of hot spots in protein sequences and exons in DNA sequences. By optimizing the characteristic values of the nucleotides, high efficiency as well as improved accuracy can be achieved relative to results obtained with the electron-ion interaction potentials. On the other hand, by using the proposed filter-based methodology with binary sequences, improved accuracy can be achieved although the efficiency is somewhat compromised relative to that achieved using the optimized characteristic values. Extensive experimental results, evaluated using measures such as the g-mean, the Matthews correlation coefficient, and the chi-square statistic, show that the filter-based methodology performs much better than existing techniques using the short-time discrete Fourier transform, particularly in applications where short exons are involved.
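Two of the evaluation measures named in this abstract, the g-mean and the Matthews correlation coefficient, have standard definitions that can be computed from per-position labels. The sketch below assumes a binary exon/non-exon label per nucleotide, which is an assumption about how the predictions are scored; the function names are hypothetical.

```python
import math


def confusion_counts(predicted, actual):
    """Return (tp, fp, tn, fn) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return tp, fp, tn, fn


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def matthews_cc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when a marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    actual = [1, 1, 1, 1, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 0, 0, 0, 1]
    counts = confusion_counts(predicted, actual)
    print(counts, g_mean(*counts), matthews_cc(*counts))
```

Both measures account for true negatives, so they are less easily inflated than raw accuracy when non-coding positions greatly outnumber coding ones.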
59
Calvete O, González J, Betrán E, Ruiz A. Segmental duplication, microinversion, and gene loss associated with a complex inversion breakpoint region in Drosophila. Mol Biol Evol 2012; 29:1875-89. [PMID: 22328714 DOI: 10.1093/molbev/mss067] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 01/01/2023]
Abstract
Chromosomal inversions are usually portrayed as simple two-breakpoint rearrangements changing gene order but not gene number or structure. However, increasing evidence suggests that inversion breakpoints may often have a complex structure and entail gene duplications with potential functional consequences. Here, we used a combination of different techniques to investigate the breakpoint structure and the functional consequences of a complex rearrangement fixed in Drosophila buzzatii and comprising two tandemly arranged inversions sharing the middle breakpoint: 2m and 2n. By comparing the sequence in the breakpoint regions between D. buzzatii (inverted chromosome) and D. mojavensis (noninverted chromosome), we corroborate the breakpoint reuse at the molecular level and infer that inversion 2m was associated with a duplication of a ~13 kb segment and likely generated by staggered breaks plus repair by nonhomologous end joining. The duplicated segment contained the gene CG4673, involved in nuclear transport, and its two nested genes CG5071 and CG5079. Interestingly, we found that other than the inversion and the associated duplication, both breakpoints suffered additional rearrangements, that is, the proximal breakpoint experienced a microinversion event associated at both ends with a 121-bp long duplication that contains a promoter. As a consequence of all these different rearrangements, CG5079 has been lost from the genome, CG5071 is now a single copy nonnested gene, and CG4673 has a transcript ~9 kb shorter and seems to have acquired a more complex gene regulation. Our results illustrate the complex effects of chromosomal rearrangements and highlight the need of complementing genomic approaches with detailed sequence-level and functional analyses of breakpoint regions if we are to fully understand genome structure, function, and evolutionary dynamics.
Affiliation(s)
- Oriol Calvete
- Departament de Genètica i de Microbiologia, Facultat de Biociències, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, Spain
60
Nunes MCS, Wanner EF, Weber G. Origin of multiple periodicities in the Fourier power spectra of the Plasmodium falciparum genome. BMC Genomics 2011; 12 Suppl 4:S4. [PMID: 22369134 PMCID: PMC3287587 DOI: 10.1186/1471-2164-12-s4-s4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 01/17/2023]
Abstract
Background Fourier transforms and their associated power spectra are used for detecting periodicities and protein-coding genes and are generally regarded as a well-established technique. Many of the periodicities which have been found with this method are quite well understood, such as the periodicity of 3 nt which is associated with codon usage. But what is the origin of the peculiar frequency multiples k/21 which were reported for a tiny section of chromosome 2 in P. falciparum? Are these present in other chromosomes and perhaps in related organisms? And how should we interpret fractional periodicities in genomes? Results We applied the binary indicator power spectrum to all chromosomes of P. falciparum, and found that the frequency overtones k/21 are present only in non-coding sections. We did not find such frequency overtones in any other related genomes. Furthermore, the frequency overtones were identified as artifacts of the way the genome is encoded into a numerical sequence, that is, they are frequency aliases. When a different way of encoding the sequence is chosen, the overtones do not appear. In view of these results, we revisited early applications of this technique to proteins where frequency overtones were reported. Conclusions Some authors hinted recently at the possibility of mapping artifacts and frequency aliases in power spectra. However, in the case of P. falciparum the frequency aliases are particularly strong and can mask the 1/3 frequency which is used for gene detection. This shows that, although this is a well-known technique with a long history of application in proteins, few researchers seem to be aware of the problems represented by frequency aliases.
Affiliation(s)
- Miriam C S Nunes
- Department of Biological Sciences, Federal University of Ouro Preto, 35400-000 Ouro Preto, MG, Brazil
61
Derrien T, Vaysse A, André C, Hitte C. Annotation of the domestic dog genome sequence: finding the missing genes. Mamm Genome 2011; 23:124-31. [PMID: 22076420 DOI: 10.1007/s00335-011-9372-0] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 09/05/2011] [Accepted: 10/23/2011] [Indexed: 12/20/2022]
Abstract
There are over 350 genetically distinct breeds of domestic dog that present considerable variation in morphology, physiology, and disease susceptibility. The genome sequence of the domestic dog was assembled and released in 2005, providing an estimated 20,000 protein-coding genes that are a great asset to the scientific community that uses the dog system as a genetic biomedical model and for comparative and evolutionary studies. Although the canine gene set had been predicted using a combination of ab initio methods, homology studies, motif analysis, and similarity-based programs, it still requires a deep annotation of noncoding genes, alternative splicing, pseudogenes, regulatory regions, and gain and loss events. Such analyses could benefit from new sequencing technologies (RNA-Seq) to better exploit the advantages of the canine genetic system in tracking disease genes. Here, we review the catalog of canine protein-coding genes and the search for missing genes, and we propose rationales for an accurate identification of noncoding genes through next-generation sequencing.
Affiliation(s)
- Thomas Derrien
- Institut de Génétique et Développement de Rennes, CNRS-UMR6061, Université de Rennes 1, 2 av Pr. Léon Bernard, 35043 Rennes, France
62
Abbasi O, Rostami A, Karimian G. Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform. BMC Bioinformatics 2011; 12:430. [PMID: 22050630 PMCID: PMC3306003 DOI: 10.1186/1471-2105-12-430] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 06/14/2011] [Accepted: 11/03/2011] [Indexed: 11/18/2022]
Abstract
Background The identification of protein coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. In this paper, a new method is presented for the identification of exonic regions in DNA sequences. This method is based on the cross-correlation technique, which can identify periodic regions in DNA sequences. Results The method reduces the dependence of identification accuracy on window length. The proposed algorithm is applied to different eukaryotic datasets and the output results are compared with those of other established methods. The proposed method increased the accuracy of exon detection by 4% to 41% relative to the most common digital signal processing methods for exon prediction. Conclusions We demonstrated that periodic signals can be estimated using cross-correlation. In addition, the discrete wavelet transform (DWT) can minimise noise while maintaining the signal. The proposed algorithm, which combines cross-correlation and DWT, significantly increases the accuracy of exonic region identification.
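The noise-suppression half of this approach can be illustrated with a one-level discrete wavelet transform and soft thresholding of the detail coefficients. The abstract does not state the wavelet family, the decomposition depth, or the threshold rule, so the Haar wavelet, the single level, and the caller-supplied threshold below are all assumptions; the cross-correlation stage is not shown and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One-level Haar DWT: return (approximation, detail) coefficient lists."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clamping at zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def haar_denoise(signal, threshold):
    """Suppress small detail coefficients and reconstruct the signal."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, [soft_threshold(d, threshold) for d in detail])


if __name__ == "__main__":
    print(haar_denoise([1.0, 3.0, 2.0, 2.0], threshold=0.5))
```

With a threshold of zero the signal is reconstructed unchanged, and with a very large threshold each pair of samples collapses to its mean, so the threshold sets how much fine-scale fluctuation is treated as noise.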
Affiliation(s)
- Omid Abbasi
- School of Engineering-Emerging Technologies, University of Tabriz, Tabriz 5166614761, Iran
63
Abstract
Novel methods for identifying a new type of DNA latent periodicity, called latent profile periodicity or latent profility, are used to search for periodic structures in genes. These methods reveal two distinct levels of organization of genetic information encoding. It is shown that latent profility in genes may correlate with specific structural features of their encoded proteins.
Affiliation(s)
- Maria Chaley
- Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st., 4, 142290 Pushchino, Russia.
64
Trotta E. The 3-base periodicity and codon usage of coding sequences are correlated with gene expression at the level of transcription elongation. PLoS One 2011; 6:e21590. [PMID: 21738721 PMCID: PMC3125259 DOI: 10.1371/journal.pone.0021590] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 03/01/2011] [Accepted: 06/03/2011] [Indexed: 11/18/2022]
Abstract
Background Gene transcription is regulated by DNA transcriptional regulatory elements, promoters and enhancers that are located outside the coding regions. Here, we examine the characteristic 3-base periodicity of the coding sequences and analyse its correlation with the genome-wide transcriptional profile of yeast. Principal Findings The analysis of coding sequences by a new class of indices proposed here identified two different sources of 3-base periodicity: the codon frequency and the codon sequence. In exponentially growing yeast cells, the codon-frequency component of periodicity accounts for 71.9% of the variability of the cellular mRNA by a strong association with the density of elongating mRNA polymerase II complexes. The mRNA abundance explains most of the correlation between the codon-frequency component of periodicity and protein levels. Furthermore, pyrimidine-ending codons of the four-fold degenerate small amino acids alanine, glycine and valine are associated with genes with double the transcription rate of those associated with purine-ending codons. Conclusions We demonstrate that the 3-base periodicity of coding sequences is higher than expected by the codon usage frequency (CUF) and that its components, associated with codon bias and amino acid composition, are correlated with gene expression, principally at the level of transcription elongation. This indicates a role of codon sequences in maximising the transcription efficiency in exponentially growing yeast cells. Moreover, the results contrast with the common Darwinian explanation that attributes the codon bias to translational selection by an adjustment of synonymous codon frequencies to the most abundant isoaccepting tRNA. Here, we show that selection on codon bias likely acts at both the transcriptional and translational level and that codon usage and the relative abundance of tRNA could drive each other in order to synergistically optimize the efficiency of gene expression.
Affiliation(s)
- Edoardo Trotta
- Institute of Translational Pharmacology, Consiglio Nazionale delle Ricerche, Roma, Italy.
65
Sahu SS, Panda G. Identification of protein-coding regions in DNA sequences using a time-frequency filtering approach. Genomics, Proteomics & Bioinformatics 2011; 9:45-55. [PMID: 21641562 PMCID: PMC5054166 DOI: 10.1016/s1672-0229(11)60007-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/19/2010] [Accepted: 10/31/2010] [Indexed: 11/13/2022]
Abstract
Accurate identification of protein-coding regions (exons) in DNA sequences has been a challenging task in bioinformatics. Particularly the coding regions have a 3-base periodicity, which forms the basis of all exon identification methods. Many signal processing tools and techniques have been applied successfully for the identification task but still improvement in this direction is needed. In this paper, we have introduced a new promising model-independent time-frequency filtering technique based on S-transform for accurate identification of the coding regions. The S-transform is a powerful linear time-frequency representation useful for filtering in time-frequency domain. The potential of the proposed technique has been assessed through simulation study and the results obtained have been compared with the existing methods using standard datasets. The comparative study demonstrates that the proposed method outperforms its counterparts in identifying the coding regions.
Affiliation(s)
- Sitanshu Sekhar Sahu
- Department of Electronics and Communication Engineering, National Institute of Technology, Rourkela, India.
66
Wang L, Stein LD. Localizing triplet periodicity in DNA and cDNA sequences. BMC Bioinformatics 2010; 11:550. [PMID: 21059240 PMCID: PMC2992068 DOI: 10.1186/1471-2105-11-550] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 06/03/2010] [Accepted: 11/08/2010] [Indexed: 01/23/2023]
Abstract
Background The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans. Results Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can better define the boundary of a TP region than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal to noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with a 6 bp MWT. Conclusions MWT can provide more precise TP boundaries than STFT and the boundaries can be further refined by bigger scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
Affiliation(s)
- Liya Wang
- Cold Spring Harbor Laboratory, Williams #5, Cold Spring Harbor, NY 11724, USA.
67
Chen B, Ji P. Visualization of the protein-coding regions with a self adaptive spectral rotation approach. Nucleic Acids Res 2010; 39:e3. [PMID: 20947567 PMCID: PMC3017620 DOI: 10.1093/nar/gkq891] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
Identifying protein-coding regions in DNA sequences is an active issue in computational biology. In this study, we present a self adaptive spectral rotation (SASR) approach, which visualizes coding regions in DNA sequences, based on investigation of the Triplet Periodicity property, without any preceding training process. It is proposed to help with rough prediction of coding regions when the extra information needed for training by other outstanding methods is unavailable. In this approach, at each position in the DNA sequence, a Fourier spectrum is calculated from the posterior subsequence. Following the spectra, a random walk in the complex plane is generated as the SASR's graphic output. Applications of the SASR on real DNA data show that patterns in the graphic output reveal locations of the coding regions and the frame shifts between them: arcs indicate coding regions, stable points indicate non-coding regions and corners’ shapes reveal frame shifts. Tests on a genomic data set from Saccharomyces cerevisiae reveal that the graphic patterns for coding and non-coding regions differ to a great extent, so that the coding regions can be visually distinguished. Meanwhile, a time cost test shows that the SASR can be easily implemented with a computational complexity of O(N).
Collapse
Affiliation(s)
- Bo Chen
- Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
|
68
|
Gupta R, Sarthi D, Mittal A, Singh K. A novel signal processing measure to identify exact and inexact tandem repeat patterns in DNA sequences. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2010:43596. [PMID: 17713591 PMCID: PMC3171338 DOI: 10.1155/2007/43596] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 09/06/2006] [Revised: 11/20/2006] [Accepted: 12/07/2006] [Indexed: 01/07/2023]
Abstract
The identification and analysis of repetitive patterns are active areas of biological and computational research. Tandem repeats in telomeres play a role in cancer and hypervariable trinucleotide tandem repeats are linked to over a dozen major neurodegenerative genetic disorders. In this paper, we present an algorithm to identify the exact and inexact repeat patterns in DNA sequences based on the orthogonal exactly periodic subspace decomposition technique. Using the new measure, our algorithm resolves problems such as whether the repeat pattern is of period P or its multiple (i.e., 2P, 3P, etc.), and several other problems that were present in previous signal-processing-based algorithms. We present an efficient algorithm of O(NL(w) log L(w)), where N is the length of the DNA sequence and L(w) is the window length, for identifying repeats. The algorithm operates in two stages. In the first stage, each nucleotide is analyzed separately for periodicity, and in the second stage, the periodic information of each nucleotide is combined to identify the tandem repeats. Datasets having exact and inexact repeats were used for the experiments. The experimental results show the effectiveness of the approach.
Affiliation(s)
- Ravi Gupta
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
- Divya Sarthi
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
- Ankush Mittal
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
- Kuldip Singh
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
|
69
|
Shrimal S, Bhattacharya S, Bhattacharya A. Serum-dependent selective expression of EhTMKB1-9, a member of Entamoeba histolytica B1 family of transmembrane kinases. PLoS Pathog 2010; 6:e1000929. [PMID: 20532220 PMCID: PMC2880585 DOI: 10.1371/journal.ppat.1000929] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.2] [Received: 09/14/2009] [Accepted: 04/28/2010] [Indexed: 11/29/2022] Open
Abstract
Entamoeba histolytica transmembrane kinases (EhTMKs) can be grouped into six distinct families on the basis of motifs and sequences. Analysis of the E. histolytica genome revealed the presence of 35 EhTMKB1 members on the basis of sequence identity (≥95%). Only six homologs were full length containing an extracellular domain, a transmembrane segment and an intracellular kinase domain. Reverse transcription followed by polymerase chain reaction (RT-PCR) of the kinase domain was used to generate a library of expressed sequences. Sequencing of randomly picked clones from this library revealed that about 95% of the clones were identical with a single member, EhTMKB1-9, in proliferating cells. On serum starvation, the relative number of EhTMKB1-9 derived sequences decreased with concomitant increase in the sequences derived from another member, EhTMKB1-18. The change in their relative expression was quantified by real time PCR. Northern analysis and RNase protection assay were used to study the temporal nature of EhTMKB1-9 expression after serum replenishment of starved cells. The results showed that the expression of EhTMKB1-9 was sinusoidal. Specific transcriptional induction of EhTMKB1-9 upon serum replenishment was further confirmed by reporter gene (luciferase) expression and the upstream sequence responsible for serum responsiveness was identified. EhTMKB1-9 is one of the first examples of an inducible gene in Entamoeba. The protein encoded by this member was functionally characterized. The recombinant kinase domain of EhTMKB1-9 displayed protein kinase activity. It is likely to have dual specificity as judged from its sensitivity to different kinase inhibitors. Immuno-localization showed EhTMKB1-9 to be a surface protein which decreased on serum starvation and got relocalized on serum replenishment. Cell lines expressing either EhTMKB1-9 without kinase domain, or EhTMKB1-9 antisense RNA, showed decreased cellular proliferation and target cell killing. 
Our results suggest that E. histolytica TMKs of B1 family are functional kinases likely to be involved in serum response and cellular proliferation. The presence of a vast array of putative transmembrane kinase genes suggests an extensive network of signaling systems in E. histolytica, particularly the ability to perceive signals from the extracellular environment and transduce these intracellularly. However, it has been very difficult to work with these molecules due to the presence of a large number of homologs. It is also not clear if these molecules are indeed protein kinases, as no kinase activity has yet been shown associated with these molecules. In this report, we show that EhTMKB1-9 is a protein kinase and it is one of the early serum-induced genes. It is a predominant EhTMKB1 molecule that is expressed in proliferating cells and its expression is modulated by serum. Cells containing a reduced level of EhTMKB1-9 or high level of a mutant protein result in decreased proliferation, target cell killing and adherence. The results presented in this report suggest that EhTMKB1-9 is an important signaling molecule likely to be involved in E. histolytica proliferation and virulence. We have also identified a serum starvation induced response where expression of EhTMKB1-18 was found to be induced.
Affiliation(s)
- Shiteshu Shrimal
- School of Life Sciences, Jawaharlal Nehru University, New Delhi, India
- Sudha Bhattacharya
- School of Environmental Sciences, Jawaharlal Nehru University, New Delhi, India
- Alok Bhattacharya
- School of Life Sciences, Jawaharlal Nehru University, New Delhi, India
- School of Information Technology, Jawaharlal Nehru University, New Delhi, India
|
70
|
Tanaka Y, Yamashita R, Suzuki Y, Nakai K. Effects of Alu elements on global nucleosome positioning in the human genome. BMC Genomics 2010; 11:309. [PMID: 20478020 PMCID: PMC2878307 DOI: 10.1186/1471-2164-11-309] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Received: 02/25/2009] [Accepted: 05/17/2010] [Indexed: 11/14/2022] Open
Abstract
Background Understanding the genome sequence-specific positioning of nucleosomes is essential to understand various cellular processes, such as transcriptional regulation and replication. As a typical example, the 10-bp periodicity of AA/TT and GC dinucleotides has been reported in several species, but it is still unclear whether this feature can be observed in the whole genomes of all eukaryotes. Results With Fourier analysis, we found that this is not the case: 84-bp and 167-bp periodicities are prevalent in primates. The 167-bp periodicity is intriguing because it is almost equal to the sum of the lengths of a nucleosomal unit and its linker region. After masking Alu elements, these periodicities were greatly diminished. Next, using two independent large-scale sets of nucleosome mapping data, we analyzed the distribution of nucleosomes in the vicinity of Alu elements and showed that (1) there are one or two fixed slot(s) for nucleosome positioning within the Alu element and (2) the positioning of neighboring nucleosomes seems to be in phase, more or less, with the presence of Alu elements. Furthermore, (3) these effects of Alu elements on nucleosome positioning are consistent with inactivation of promoter activity in Alu elements. Conclusions Our discoveries suggest that the principle governing nucleosome positioning differs greatly across species and that the Alu family is an important factor in primate genomes.
Affiliation(s)
- Yoshiaki Tanaka
- Department of Medical Genome Sciences, University of Tokyo, Minato-ku, Japan
|
71
|
Abstract
Background This paper compares the most common digital signal processing methods of exon prediction in eukaryotes, and also proposes a technique for noise suppression in exon prediction. The specimen used here, which has relevance in medical research, has been taken from the public genomic database GenBank. Methods Here exon prediction has been done using the digital signal processing methods, viz. the binary method, the EIIP (electron-ion interaction pseudopotential) method and filter methods. Under the filter method, two filter designs, and two approaches using these two designs, have been tried. The discrete wavelet transform has been used for de-noising of the exon plots. Results Results of exon prediction based on the methods mentioned above, which give values closest to the ones found in the NCBI database, are given here. The exon plot de-noised using the discrete wavelet transform is also given. Conclusion Alterations to the proven methods, as made by the authors, improve the performance of exon prediction algorithms. It has also been shown that the discrete wavelet transform is an effective tool for de-noising which can be used with exon prediction algorithms.
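As a rough illustration of the binary method named in this abstract, the following sketch maps a DNA string to four 0/1 indicator sequences (one per base) and sums their spectral power at frequency 1/3 over a sliding window. The function name `period3_power`, the window length of 351 and the step of 3 are assumptions chosen for illustration and are not taken from the paper; the EIIP mapping, the filter designs and the wavelet de-noising step are not shown.

```python
import cmath


def period3_power(seq, window=351, step=3):
    """Sliding-window spectral power at frequency 1/3 (hypothetical helper).

    Each base gets a 0/1 indicator sequence; the DFT coefficient at period 3
    is computed per base and the squared magnitudes are summed.
    """
    seq = seq.upper()
    # At frequency 1/3 the DFT twiddle factor depends only on position mod 3.
    twiddle = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of this base's indicator sequence at period 3.
            coeff = sum(twiddle[i % 3] for i, b in enumerate(chunk) if b == base)
            total += abs(coeff) ** 2
        scores.append((start, total))
    return scores


if __name__ == "__main__":
    periodic = "ATG" * 200   # perfect three-base periodicity
    flat = "A" * 600         # no period-3 structure
    print(period3_power(periodic)[0], period3_power(flat)[0])
```

Windows with a strong period-3 component give large scores, while a sequence with no positional bias gives a score near zero; thresholding the resulting curve is one simple way to propose candidate exon boundaries.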
Affiliation(s)
- Tina P George
- Department of Electronics and Instrumentation, College of Engineering, Kidangoor, Kottayam, Kerala, India.
|
72
|
Abstract
In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in 1st and 2nd position of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower or equal to 5%. The method releases coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
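Two of the ingredients listed in this abstract, stop codon frequency and purine probability at each codon position, can be computed with a few lines of code. The sketch below is only an illustration of those two statistics for a single reading frame; the function name `frame_features` is hypothetical, and the actual UFM scoring (including the GC3 vs. GC2 regression term) is not reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
PURINES = {"A", "G"}


def frame_features(seq, frame=0):
    """Return (stop codon frequency, purine fraction per codon position).

    Illustrative helper only; it does not compute the UFM score itself.
    """
    seq = seq.upper()
    # Split the sequence into complete codons starting at the given frame.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        raise ValueError("sequence too short for the requested frame")
    n = len(codons)
    stop_freq = sum(c in STOP_CODONS for c in codons) / n
    purine_by_pos = [sum(c[p] in PURINES for c in codons) / n for p in range(3)]
    return stop_freq, purine_by_pos


if __name__ == "__main__":
    for frame in range(3):
        print(frame, frame_features("ATGGCCGAAGGTAAAGCTTAA", frame))
```

Comparing the three frames of each strand on these statistics is one simple way to see why a coding frame tends to stand out: it has few internal stop codons and a position-dependent purine bias.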
Affiliation(s)
- Nicolas Carels
- Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
|
73
|
Rè M, Pesole G, Horner DS. Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics. BMC Bioinformatics 2009; 10:282. [PMID: 19737408 PMCID: PMC2758873 DOI: 10.1186/1471-2105-10-282] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/04/2008] [Accepted: 09/08/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The conservation of sequences between related genomes has long been recognised as an indication of functional significance and recognition of sequence homology is one of the principal approaches used in the annotation of newly sequenced genomes. In the context of recent findings that the number of non-coding transcripts in higher organisms is likely to be much higher than previously imagined, discrimination between conserved coding and non-coding sequences is a topic of considerable interest. Additionally, it should be considered desirable to discriminate between coding and non-coding conserved sequences without recourse to the use of sequence similarity searches of protein databases as such approaches exclude the identification of novel conserved proteins without characterized homologs and may be influenced by the presence in databases of sequences which are erroneously annotated as coding. RESULTS Here we present a machine learning-based approach for the discrimination of conserved coding sequences. Our method calculates various statistics related to the evolutionary dynamics of two aligned sequences. These features are considered by a Support Vector Machine which designates the alignment coding or non-coding with an associated probability score. CONCLUSION We show that our approach is both sensitive and accurate with respect to comparable methods and illustrate several situations in which it may be applied, including the identification of conserved coding regions in genome sequences and the discrimination of coding from non-coding cDNA sequences.
Affiliation(s)
- Matteo Rè
- Dipartimento di Scienze Biomolecolari e Biotecnologie, Università degli Studi di Milano, Via Celoria 26, 20133 Milano, Italia.
|
74
|
Hongxia Zhou, Liping Du, Hong Yan. Detection of Tandem Repeats in DNA Sequences Based on Parametric Spectral Estimation. IEEE Trans Inf Technol Biomed 2009; 13:747-55. [DOI: 10.1109/titb.2008.920626] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
|
75
|
A hybrid technique for the periodicity characterization of genomic sequence data. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2009:924601. [PMID: 19365578 DOI: 10.1155/2009/924601] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 05/29/2008] [Revised: 10/13/2008] [Accepted: 01/21/2009] [Indexed: 11/17/2022]
Abstract
Many studies of biological sequence data have examined sequence structure in terms of periodicity, and various methods for measuring periodicity have been suggested for this purpose. This paper compares two such methods, autocorrelation and the Fourier transform, using synthetic periodic sequences, and explains the differences in periodicity estimates produced by each. A hybrid autocorrelation-integer period discrete Fourier transform is proposed that combines the advantages of both techniques. Collectively, this representation and a recently proposed variant on the discrete Fourier transform offer alternatives to the widely used autocorrelation for the periodicity characterization of sequence data. Finally, these methods are compared for various tetramers of interest in C. elegans chromosome I.
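The two periodicity measures compared in this abstract can be sketched side by side on a 0/1 indicator sequence for a chosen word (for example a tetramer). This is a minimal illustration under stated assumptions: the helper names are hypothetical, the Fourier measure is evaluated at an integer period after truncating the sequence to a whole number of periods, and the paper's hybrid autocorrelation-integer period transform itself is not reproduced.

```python
import cmath


def indicator(seq, word):
    """1 where `word` starts at position i, else 0 (hypothetical helper)."""
    k = len(word)
    return [1 if seq[i:i + k] == word else 0 for i in range(len(seq) - k + 1)]


def autocorrelation(x, lag):
    """Unnormalised autocorrelation: co-occurrences separated by `lag`."""
    return sum(a * b for a, b in zip(x, x[lag:]))


def integer_period_power(x, period):
    """Fourier power at an exact integer period.

    The sequence is truncated to a whole number of periods so that the
    analysed frequency falls exactly on a DFT bin.
    """
    n = len(x) - len(x) % period
    if n == 0:
        raise ValueError("sequence shorter than one period")
    coeff = sum(x[i] * cmath.exp(-2j * cmath.pi * i / period) for i in range(n))
    return abs(coeff) ** 2 / n


if __name__ == "__main__":
    x = indicator("ACGT" * 50, "A")
    for p in (3, 4, 5):
        print(p, autocorrelation(x, p), integer_period_power(x, p))
```

On a strictly periodic input both measures peak at the true period; autocorrelation also responds at every multiple of that period, which is one source of the differing estimates the abstract refers to.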
|
76
|
Identification and chromosomal localization of one locus of Leishmania (L.) major related with resistance to itraconazole. Parasitol Res 2009; 105:471-8. [PMID: 19322586 DOI: 10.1007/s00436-009-1418-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 10/22/2008] [Accepted: 03/12/2009] [Indexed: 10/21/2022]
Abstract
Ergosterol is an important compound responsible for maintaining the integrity and fluidity of Leishmania spp. membranes. Starting from an overexpression/selection method, our group has isolated and mapped nine different loci of Leishmania (L.) major related to resistance against two inhibitors of the ergosterol biosynthesis pathway, terbinafine (TBF) and itraconazole (ITZ). Individual functional analyses after overexpression induction of these loci in the presence of TBF and/or ITZ [or the ITZ analog ketoconazole (CTZ)] have shown low but significant levels of resistance after transfection into L. major wild-type parasites. In this work, we show the insert mapping and chromosomal identification of one of these loci (cosItz2). Functional analysis experiments, combined with chromosomal localization by comparison against a genomic database, allowed us to identify two prospective gene-protein systems not related to ergosterol biosynthesis and capable of conferring resistance to ITZ-CTZ on wild-type cells after transfection. We expect that this approach can open new insights for a better understanding of the mechanisms of ITZ-CTZ action and resistance in Leishmania, resulting in new strategies for leishmaniasis treatment.
|
77
|
Frenkel FE, Korotkov EV. Using triplet periodicity of nucleotide sequences for finding potential reading frame shifts in genes. DNA Res 2009; 16:105-14. [PMID: 19261626 PMCID: PMC2671204 DOI: 10.1093/dnares/dsp002] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022] Open
Abstract
We introduce a novel approach for the detection of possible mutations leading to a reading frame (RF) shift in a gene. Deletions and insertions in DNA coding regions are considerable events for genes because an RF shift results in modifications of an extensive region of the amino acid sequence coded by a gene. The suggested method is based on the phenomenon of triplet periodicity (TP) in coding regions of genes and its relative resistance to substitutions in the DNA sequence. We attempted to extend 326 933 regions of continuous TP found in genes from the KEGG databank by considering possible insertions and deletions. In total, we revealed 824 genes where such an extension was possible and statistically significant. Then we generated amino acid sequences according to active (KEGG's) and hypothetically ancient RFs in order to find confirmation of a shift at the protein level. Consequently, 64 sequences have protein similarities only for the ancient RF, 176 only for the active RF, 3 for both and 581 have no protein similarity at all. We aimed to reveal a lower bound for the number of genes in which a shift between RF and TP is possible. Further ways to increase the number of revealed RF shifts are discussed.
Affiliation(s)
- F E Frenkel
- Bioengineering Centre of RAS, 60-letiya Oktyabrya prosp., 7/1, Moscow, Russia.
|
78
|
Xing C, Bitzer DL, Alexander WE, Vouk MA, Stomp AM. Identification of protein-coding sequences using the hybridization of 18S rRNA and mRNA during translation. Nucleic Acids Res 2008; 37:591-601. [PMID: 19073698 PMCID: PMC2632891 DOI: 10.1093/nar/gkn917] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/23/2022] Open
Abstract
We introduce a new approach in this article to distinguish protein-coding sequences from non-coding sequences utilizing a period-3, free energy signal that arises from the interactions of the 3′-terminal nucleotides of the 18S rRNA with mRNA. We extracted the special features of the amplitude and the phase of the period-3 signal in protein-coding regions, which are not found in non-coding regions, and used them to distinguish protein-coding sequences from non-coding sequences. We tested on all the experimental genes from Saccharomyces cerevisiae and Schizosaccharomyces pombe. The identification was consistent with the corresponding information from GenBank, and produced better performance compared to existing methods that use a period-3 signal. The primary tests on some fly, mouse and human genes suggest that our method is applicable to higher eukaryotic genes. The tests on pseudogenes indicated that most pseudogenes have no period-3 signal. Some exploration of the 3′-tail of 18S rRNA and pattern analysis of protein-coding sequences further supported our assumption that the 3′-tail of 18S rRNA has a role of synchronization throughout the translation elongation process. This, in turn, can be utilized for the identification of protein-coding sequences.
Affiliation(s)
- Chuanhua Xing
- Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695-7911, USA.
|
79
|
Li E, Reich CI, Olsen GJ. A whole-genome approach to identifying protein binding sites: promoters in Methanocaldococcus (Methanococcus) jannaschii. Nucleic Acids Res 2008; 36:6948-58. [PMID: 18981048 PMCID: PMC2602779 DOI: 10.1093/nar/gkm499] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022] Open
Abstract
We have adapted an electrophoretic mobility shift assay (EMSA) to isolate genomic DNA fragments that bind the archaeal transcription initiation factors TATA-binding protein (TBP) and transcription factor B (TFB) to perform a genome-wide search for promoters. Mobility-shifted fragments were cloned, tested for their ability to compete with known promoter-containing fragments for a limited concentration of transcription factors, and sequenced. We applied the method to search for promoters in the genome of Methanocaldococcus jannaschii. Selection was most efficient for promoters of tRNA genes and genes for several presumed small non-coding RNAs (ncRNA). Protein-coding gene promoters were dramatically underrepresented relative to their frequency in the genome. The repeated isolation of these genomic regions was partially rectified by including a hybridization-based screening. Sequence alignment of the affinity-selected promoters revealed previously identified TATA box, BRE, and the putative initiator element. In addition, the conserved bases immediately upstream and downstream of the BRE and TATA box suggest that the composition and structure of archaeal natural promoters are more complicated.
Affiliation(s)
- Enhu Li
- Division of Biology, California Institute of Technology, Pasadena, CA 91125, USA
|
80
|
Paar V, Pavin N, Basar I, Rosandić M, Gluncić M, Paar N. Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats. BMC Bioinformatics 2008; 9:466. [PMID: 18980673 PMCID: PMC2661002 DOI: 10.1186/1471-2105-9-466] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 03/12/2008] [Accepted: 11/03/2008] [Indexed: 11/28/2022] Open
Abstract
Background Identification of approximate tandem repeats is an important task of broad significance and still remains a challenging problem of computational genomics. Often there is no single best approach to periodicity detection and a combination of different methods may improve the prediction accuracy. Discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats. Results We used a method based on DFT, with mapping of the symbolic sequence into a numerical sequence, to identify and study alphoid higher order repeats (HOR). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, having a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations. Conclusion DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
Affiliation(s)
- Vladimir Paar
- Faculty of Science, University of Zagreb, Bijenicka 32, Zagreb, Croatia.
|
81
|
Galimov AR, Kruglov AA, Bol'sheva NL, Iurkevich OI, Lipin'sh DI, Mufazalov IA, Kuprash DV, Nedospasov SA. [Chromosomal localization and molecular organization of human genomic fragment containing TNF/LT locus in transgenic mice]. Mol Biol (Mosk) 2008; 42:629-38. [PMID: 18856063 DOI: 10.1134/s0026893308040201] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022]
Abstract
Molecular organization, copy number and chromosomal localization of a human TNF/LT locus fragment were determined in the genomes of two transgenic mouse lines. The genome of the first line contains two copies, organized in a head-to-tail manner and localized to chromosome 8 by karyotyping; the single transgene copy of the second line is observed on chromosome 5. These mice could serve as a valuable model for studying the physiological functions of both human tumor necrosis factor and lymphotoxin.
|
82
|
Lin MF, Deoras AN, Rasmussen MD, Kellis M. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. PLoS Comput Biol 2008; 4:e1000067. [PMID: 18421375 PMCID: PMC2291194 DOI: 10.1371/journal.pcbi.1000067] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.5] [Received: 10/22/2007] [Accepted: 03/20/2008] [Indexed: 01/22/2023] Open
Abstract
Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human.
Affiliation(s)
- Michael F. Lin
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Ameya N. Deoras
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Matthew D. Rasmussen
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
- Manolis Kellis
- Broad Institute of MIT and Harvard University, Cambridge, Massachusetts, United States of America
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America
|
83
|
Mena-Chalco JP, Carrer H, Zana Y, Cesar RM. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:198-207. [PMID: 18451429 DOI: 10.1109/tcbb.2007.70259] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Indexed: 05/26/2023]
Abstract
An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods, based on the occurrence of specific patterns of nucleotides at coding regions, have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically pre-defined window length required for a local analysis of a DNA region. We introduce a method, based on a modified Gabor-wavelet transform (MGWT), for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods using eukaryote datasets. The results show that the MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors, but also makes available a tool for detailed exploration of the nucleotide occurrence.
Affiliation(s)
- Jesús P Mena-Chalco
- Departmento de Ciencia da Computação, Instituto de Matemática e Estatística de Universidade de São Paulo, Rua do Matão, Cidade Universitária, São Paulo, SP, Brasil.
|
84
|
Abstract
Fast-sequencing throughput methods have increased the number of completely sequenced bacterial genomes to about 400 by December 2006, with the number increasing rapidly. These include several strains. In silico methods of comparative genomics are of use in categorizing and phylogenetically sorting these bacteria. Various word-based tools have been used for quantifying the similarities and differences between entire genomes. The simple di-nucleotide frequency comparison, codon specificity and k-mer repeat detection are among some of the well-known methods. In this paper, we show that the Mutual Information function, which is a measure of correlations and a concept from Information Theory, is very effective in determining the similarities and differences among genome sequences of various strains of bacteria such as the plant pathogen Xylella fastidiosa, marine Cyanobacteria Prochlorococcus marinus or animal and human pathogens such as species of Ehrlichia and Legionella. The short-range three-base periodicity, small sequence repeats and long-range correlations taken together constitute a genome signature that can be used as a technique for identifying new bacterial strains with the help of strains already catalogued in the database. There have been several applications of using the Mutual Information function as a measure of correlations in genomics but this is the first whole genome analysis done to detect strain similarities and differences.
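The Mutual Information function mentioned in this abstract measures how much knowing the nucleotide at position i tells you about the nucleotide at position i + k. A minimal sketch follows, assuming the standard definition I(k) = Σ p_ab(k) log2[p_ab(k) / (p_a p_b)] with marginals estimated from the same pairs; the function name is hypothetical and the genome-signature comparison described in the abstract is not reproduced.

```python
import math
from collections import Counter


def mutual_information(seq, k):
    """Mutual information (in bits) between symbols separated by distance k."""
    seq = seq.upper()
    # Count ordered pairs (symbol at i, symbol at i + k).
    pairs = Counter(zip(seq, seq[k:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (a, b), count in pairs.items():
        left[a] += count
        right[b] += count
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / total
        mi += p_ab * math.log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return mi


if __name__ == "__main__":
    seq = "ATGGCCGAAGGTAAAGCTGCTGAAACCGGT" * 20
    # A profile over k; in coding DNA one would look for a period-3 pattern.
    print([round(mutual_information(seq, k), 3) for k in range(1, 10)])
```

Plotting I(k) against k for two genomes and comparing the curves is the kind of word-free comparison the abstract describes; the sketch only provides the per-distance value.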
Affiliation(s)
- D Swati
- Department of Physics, MMV, Banaras Hindu University, Varanasi 221005, India.
|
85
|
Dias FC, Ruiz JC, Lopes WCZ, Squina FM, Renzi A, Cruz AK, Tosi LRO. Organization of H locus conserved repeats in Leishmania (Viannia) braziliensis correlates with lack of gene amplification and drug resistance. Parasitol Res 2007; 101:667-76. [PMID: 17393181 DOI: 10.1007/s00436-007-0528-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 01/08/2007] [Accepted: 03/14/2007] [Indexed: 11/27/2022]
Abstract
Resistance to antimonials is a major problem when treating visceral leishmaniasis in India and has already been described for New World parasites. Clinical response to meglumine antimoniate in patients infected with parasites of the Viannia sub-genus can be widely variable, suggesting the presence of mechanisms of drug resistance. In this work, we have compared L. major and L. braziliensis mutants selected in different drugs. The cross-resistance profiles of some cell lines resembled those of mutants bearing H locus amplicons. However, amplified episomal molecules were exclusively detected in L. major mutants. The analysis of the L. braziliensis H region revealed a strong conservation of gene synteny. The typical intergenic repeats that are believed to mediate the amplification of the H locus in species of the Leishmania sub-genus are partially conserved in the Viannia species. The conservation of these non-coding elements in equivalent positions in both species is indicative of their relevance within this locus. The absence of amplicons in L. braziliensis suggests that this species may not favour extra-chromosomal gene amplification as a source of phenotypic heterogeneity and fitness maintenance in changing environments.
Affiliation(s)
- Fabricio C Dias
- Departamento de Biologia Celular e Molecular e Bioagentes Patogênicos, Faculdade de Medicina de Ribeirão Preto, Universidade de São Paulo, 14049-900, Ribeirão Preto, Sao Paulo, Brazil
|
86
|
Saeys Y, Rouzé P, Van de Peer Y. In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists. Bioinformatics 2007; 23:414-20. [PMID: 17204465 DOI: 10.1093/bioinformatics/btl639] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022]
Abstract
MOTIVATION Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA http://bioinformatics.psb.ugent.be/.
Affiliation(s)
- Yvan Saeys
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Technologiepark 927, B-9052 Ghent, Belgium.
|
87
|
Pinho AJ, Neves AJR, Afreixo V, Bastos CAC, Ferreira PJSG. A three-state model for DNA protein-coding regions. IEEE Trans Biomed Eng 2006; 53:2148-55. [PMID: 17073319 DOI: 10.1109/tbme.2006.879477] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
It is known that the protein-coding regions of DNA are usually characterized by a three-base periodicity. In this paper, we exploit this property, studying a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain on its own, is the observation that the entropy associated to each of the three base positions of a codon differs and that this variation is not the same among the organisms analyzed.
Affiliation(s)
- Armando J Pinho
- Signal Processing Laboratory, DETI/IEETA, University of Aveiro, 3810-193 Aveiro, Portugal.
|
88
|
Law NF, Cheng KO, Siu WC. On relationship of Z-curve and Fourier approaches for DNA coding sequence classification. Bioinformation 2006; 1:242-6. [PMID: 17597898 PMCID: PMC1891701 DOI: 10.6026/97320630001242] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 10/20/2006] [Accepted: 11/02/2006] [Indexed: 11/23/2022]
Abstract
Z-curve features are one of the popular features used in exon/intron classification. We showed that although both Z-curve and Fourier approaches are based on detecting 3-periodicity in coding regions, there are significant differences in their spectral formulation. From the spectral formulation of the Z-curve, we obtained three modified sequences that characterize different biological properties. Spectral analysis on the modified sequences showed a much more prominent 3-periodicity peak in coding regions than the Fourier approach. For long sequences, prominent peaks at 2Pi/3 are observed at coding regions, whereas for short sequences, clearly discernible peaks are still visible. Better classification can be obtained using spectral features derived from the modified sequences.
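The 3-periodicity peak mentioned in this abstract can be illustrated with a small spectral calculation. The sketch below uses one common formulation of the Fourier approach, which is an assumption here rather than the paper's exact definition: each base gets a binary indicator sequence, and the squared magnitudes of their DFT coefficients at frequency 1/3 (angular frequency 2π/3) are summed. The name `period3_power` and the division by sequence length are illustrative choices.

```python
import cmath


def period3_power(seq):
    """Sum over bases of |DFT coefficient at frequency 1/3|^2, divided by length.

    A high value suggests a strong period-3 component, as is typical of
    protein-coding regions; a value near zero suggests none.
    """
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT of the binary indicator sequence of this base, evaluated at 2*pi/3
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, s in enumerate(seq)
            if s == base
        )
        total += abs(coeff) ** 2
    return total / n
```

In practice this measure is usually evaluated in a sliding window along the sequence, and the choice of window length is one of the issues several entries in this list address.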
Affiliation(s)
- Ngai-Fong Law
- Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong.
|
89
|
Jehan Z, Vallinayagam S, Tiwari S, Pradhan S, Singh L, Suresh A, Reddy HM, Ahuja Y, Jesudasan RA. Novel noncoding RNA from human Y distal heterochromatic block (Yq12) generates testis-specific chimeric CDC2L2. Genome Res 2006; 17:433-40. [PMID: 17095710 PMCID: PMC1832090 DOI: 10.1101/gr.5155706] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 11/24/2022]
Abstract
The human Y chromosome, because it is enriched in repetitive DNA, has been very intractable to genetic and molecular analyses. There is no previous evidence for developmental stage- and testis-specific transcription from the male-specific region of the Y (MSY). Here, we present evidence for the first time for a developmental stage- and testis-specific transcription from MSY distal heterochromatic block. We isolated two novel RNAs, which localize to Yq12 in multiple copies, show testis-specific expression, and lack active X-homologs. Experimental evidence shows that one of the above Yq12 noncoding RNAs (ncRNAs) trans-splices with CDC2L2 mRNA from chromosome 1p36.3 locus to generate a testis-specific chimeric beta sv13 isoform. This 67-nt 5'UTR provided by the Yq12 transcript contains within it a Y box protein-binding CCAAT motif, indicating translational regulation of the beta sv13 isoform in testis. This is also the first report of trans-splicing between a Y chromosomal and an autosomal transcript.
Affiliation(s)
- Zeenath Jehan
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Shrish Tiwari
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Suman Pradhan
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Lalji Singh
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Amritha Suresh
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Hemakumar M. Reddy
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Y.R. Ahuja
- Genetics Unit, Vasavi Medical and Research Centre, Hyderabad, India, AP, India
- Rachel A. Jesudasan
- Centre for Cellular and Molecular Biology, Uppal Road Hyderabad–500 007, AP, India
- Corresponding author. E-mail ; fax 91-40-27160311
|
90
|
Nagarajan V, Kaushik N, Murali B, Zhang C, Lakhera S, Elasri MO, Deng Y. A Fourier transformation based method to mine peptide space for antimicrobial activity. BMC Bioinformatics 2006; 7 Suppl 2:S2. [PMID: 17118141 PMCID: PMC1683563 DOI: 10.1186/1471-2105-7-s2-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Background Naturally occurring antimicrobial peptides are currently being explored as potential candidate peptide drugs. Since antimicrobial peptides are part of the innate immune system of every living organism, it is possible to discover new candidate peptides using the available genomic and proteomic data. High throughput computational techniques could also be used to virtually scan the entire peptide space for discovering new candidate antimicrobial peptides. Result We have identified a unique indexing method based on biologically distinct characteristic features of known antimicrobial peptides. Analysis of the entries in the antimicrobial peptide databases, based on our indexing method, using the Fourier transformation technique revealed a distinct peak in their power spectrum. We have developed a method to mine the genomic and proteomic data for the presence of peptides with potential antimicrobial activity, by looking for this distinct peak. We also used the Euclidean metric to rank the potential antimicrobial activity of the peptides. We have parallelized our method so that virtually any given protein space could be data mined in search of antimicrobial peptides. Conclusion The results show that the Fourier transform based method with the property based coding strategy could be used to scan the peptide space for discovering new potential antimicrobial peptides.
Affiliation(s)
- Vijayaraj Nagarajan
- Department of Biological Sciences, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Navodit Kaushik
- Department of Computer Science, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Beddhu Murali
- Department of Computer Science, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Chaoyang Zhang
- Department of Computer Science, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Sanyogita Lakhera
- Department of Mathematics, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Mohamed O Elasri
- Department of Biological Sciences, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
- Youping Deng
- Department of Biological Sciences, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
|
91
|
Masoom H, Datta S, Asif A, Cunningham L, Wu G. A Fast Algorithm for Detecting Frame Shifts in DNA sequences. 2006 IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND BIOINFORMATICS AND COMPUTATIONAL BIOLOGY 2006. [DOI: 10.1109/cibcb.2006.330971] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 07/19/2023]
|
92
|
Dutta S, Singhal P, Agrawal P, Tomer R, Kritee K, Khurana E, Jayaram B. A physicochemical model for analyzing DNA sequences. J Chem Inf Model 2006; 46:78-85. [PMID: 16426042 DOI: 10.1021/ci050119x] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/30/2022]
Abstract
In search of an ab initio model to characterize DNA sequences as genes and nongenes, we examined some physicochemical properties of each trinucleotide (codon), which could accomplish this task. We constructed three-dimensional vectors for each double-helical trinucleotide sequence considering hydrogen-bonding energy, stacking energy, and a third parameter, which we provisionally identified with DNA-protein interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for gene and nongene regions to make a distinction feasible, if the underlying model has some merits. An analysis of 331 prokaryotic genomes comprising a total of 294 786 experimentally verified genes (nonoverlapping) and an equal number of nongenes presents a proof of concept of the model without the need for further parametrization. Also, initial analyses on Saccharomyces cerevisiae and Arabidopsis thaliana suggest that the methodology is extendable to eukaryotes. The physicochemical model (ChemGenome1.0) introduced has the potential to be developed into a gene-finding algorithm and, more pressingly, could be employed for an independent assessment of the annotation of DNA sequences.
Affiliation(s)
- Samrat Dutta
- Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi
|
93
|
Cao Y, Tung WW, Gao JB. Recurrence time statistics: versatile tools for genomic DNA sequence analysis. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2006:40-51. [PMID: 16447998 DOI: 10.1109/csb.2004.1332415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.
|
94
|
Gao J, Qi Y, Cao Y, Tung WW. Protein coding sequence identification by simultaneously characterizing the periodic and random features of DNA sequences. J Biomed Biotechnol 2006; 2005:139-46. [PMID: 16046819 PMCID: PMC1184046 DOI: 10.1155/jbb.2005.139] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.6] [Indexed: 12/04/2022]
Abstract
Most codon indices used today are based on highly biased nonrandom usage of codons in coding regions. The background of a coding or noncoding DNA sequence, however, is fairly random, and can be characterized as a random fractal. When a gene-finding algorithm incorporates multiple sources of information about coding regions, it becomes more successful. It is thus highly desirable to develop new and efficient codon indices by simultaneously characterizing the fractal and periodic features of a DNA sequence. In this paper, we describe a novel way of achieving this goal. The efficiency of the new codon index is evaluated by studying all of the 16 yeast chromosomes. In particular, we show that the method automatically and correctly identifies which of the three reading frames is the one that contains a gene.
Affiliation(s)
- Jianbo Gao
- Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32611-6200, USA.
|
95
|
Lu Q, Hao P, Curcin V, He W, Li YY, Luo QM, Guo YK, Li YX. KDE Bioscience: platform for bioinformatics analysis workflows. J Biomed Inform 2005; 39:440-50. [PMID: 16260186 PMCID: PMC7106075 DOI: 10.1016/j.jbi.2005.09.001] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 07/18/2005] [Revised: 08/19/2005] [Accepted: 09/05/2005] [Indexed: 11/16/2022]
Abstract
Bioinformatics is a dynamic research area in which a large number of algorithms and programs have been developed rapidly and independently without much consideration so far of the need for standardization. The lack of such common standards combined with unfriendly interfaces make it difficult for biologists to learn how to use these tools and to translate the data formats from one to another. Consequently, the construction of an integrative bioinformatics platform to facilitate biologists' research is an urgent and challenging task. KDE Bioscience is a java-based software platform that collects a variety of bioinformatics tools and provides a workflow mechanism to integrate them. Nucleotide and protein sequences from local flat files, web sites, and relational databases can be entered, annotated, and aligned. Several home-made or 3rd-party viewers are built-in to provide visualization of annotations or alignments. KDE Bioscience can also be deployed in client-server mode where simultaneous execution of the same workflow is supported for multiple users. Moreover, workflows can be published as web pages that can be executed from a web browser. The power of KDE Bioscience comes from the integrated algorithms and data sources. With its generic workflow mechanism other novel calculations and simulations can be integrated to augment the current sequence analysis functions. Because of this flexible and extensible architecture, KDE Bioscience makes an ideal integrated informatics environment for future bioinformatics or systems biology research.
Affiliation(s)
- Qiang Lu
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
|
96
|
Jin J. Identification of protein coding regions of rice genes using alternative spectral rotation measure and linear discriminant analysis. GENOMICS PROTEOMICS & BIOINFORMATICS 2005; 2:167-73. [PMID: 15862117 PMCID: PMC5172472 DOI: 10.1016/s1672-0229(04)02022-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/24/2022]
Abstract
An improved method, called the Alternative Spectral Rotation (ASR) measure, for predicting protein coding regions in rice DNA has been developed. The method is based on the Spectral Rotation (SR) measure proposed by Kotlar and Lavner, and its accuracy is higher than that of the SR measure and the Spectral Content (SC) measure proposed by Tiwari et al. In order to increase the identification accuracy, we chose three different coding characters, namely the asymmetric, purine, and stop-codon variables, as parameters, and a satisfactory result was obtained using Linear Discriminant Analysis (LDA).
Affiliation(s)
- Jiao Jin
- Department of Statistics and Financial Mathematics, School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China.
|
97
|
Balakirev ES, Chechetkin VR, Lobzin VV, Ayala FJ. Entropy and GC Content in the beta-esterase gene cluster of the Drosophila melanogaster subgroup. Mol Biol Evol 2005; 22:2063-72. [PMID: 15972847 DOI: 10.1093/molbev/msi197] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
We perform spectral entropy and GC content analyses in the beta-esterase gene cluster, including the Est-6 gene and the psiEst-6 putative pseudogene, in seven species of the Drosophila melanogaster species subgroup. psiEst-6 combines features of functional and nonfunctional genes. The spectral entropies show distinctly lower structural ordering for psiEst-6 than for Est-6 in all species studied. Our observations agree with previous results for D. melanogaster and provide additional support to our hypothesis that after the duplication event Est-6 retained the esterase-coding function and its role during copulation, while psiEst-6 lost that function but now operates in conjunction with Est-6 as an intergene. Entropy accumulation is not a completely random process for either gene. Structural entropy is nucleotide dependent. The relative normalized deviations for structural entropy are higher for G than for C nucleotides. The entropy values are similar for Est-6 and psiEst-6 in the case of A and T but are lower for Est-6 in the case of G and C. The GC content in synonymous positions is uniformly higher in Est-6 than in psiEst-6, which agrees with the reduced GC content generally observed in pseudogenes and nonfunctional sequences. The observed differences in entropy and GC content reflect an evolutionary shift associated with the process of pseudogenization and subsequent functional divergence of psiEst-6 and Est-6 after the duplication event.
Affiliation(s)
- Evgeniy S Balakirev
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
|
98
|
Ruvinsky A, Eskesen ST, Eskesen FN, Hurst LD. Can codon usage bias explain intron phase distributions and exon symmetry? J Mol Evol 2005; 60:99-104. [PMID: 15696372 DOI: 10.1007/s00239-004-0032-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 01/28/2004] [Accepted: 08/31/2004] [Indexed: 10/25/2022]
Abstract
More introns exist between codons (phase 0) than between the first and the second bases (phase 1) or between the second and the third base (phase 2) within the codon. Many explanations have been suggested for this excess of phase 0. It has, for example, been argued to reflect an ancient utility for introns in separating exons that code for separate protein modules. There may, however, be a simple, alternative explanation. Introns typically require, for correct splicing, particular nucleotides immediately 5' in exons (typically a G) and immediately 3' in the following exon (also often a G). Introns therefore tend to be found between particular nucleotide pairs (e.g., G|G pairs) in the coding sequence. If, owing to bias in usage of different codons, these pairs are especially common at phase 0, then intron phase biases may have a trivial explanation. Here we take codon usage frequencies for a variety of eukaryotes and use these to generate random sequences. We then ask about the phase of putative intron insertion sites. Importantly, in all simulated data sets intron phase distribution is biased in favor of phase 0. In many cases the bias is of the magnitude observed in real data and can be attributed to codon usage bias. It is also known that exons may carry either the same phase (symmetric) or different phases (asymmetric) at the opposite ends. We simulated a distribution of different types of exons using frequencies of introns observed in real genes assuming random combination of intron phases at the opposite sides of exons. Surprisingly the simulated pattern was quite similar to that observed. In the simulants we typically observe a prevalence of symmetric exons carrying phase 0 at both ends, which is common for eukaryotic genes. However, at least in some species, the extent of the bias in favor of symmetric (0,0) exons is not as great in simulants as in real genes. 
These results emphasize the need to construct a biologically relevant null model of successful intron insertion.
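The phase bookkeeping described in this abstract can be made concrete with a short sketch. It counts candidate intron insertion sites (by default G|G pairs) in a coding sequence and classifies each by phase, using the definitions given above. The function name `count_site_phases` is hypothetical, and the sequence is assumed to start at the first base of a codon; this is only an illustration, not the authors' simulation procedure.

```python
def count_site_phases(cds, left="G", right="G"):
    """Count candidate intron sites (left|right base pairs) by phase.

    Assumes cds starts at the first base of a codon.
    Phase 0: the site lies between two codons.
    Phase 1: between the first and second base of a codon.
    Phase 2: between the second and third base of a codon.
    """
    counts = {0: 0, 1: 0, 2: 0}
    for i in range(len(cds) - 1):
        if cds[i] == left and cds[i + 1] == right:
            # The site sits after base i, i.e. at offset i + 1 from the start
            counts[(i + 1) % 3] += 1
    return counts
```

Applying this to sequences generated from a species' codon usage table would give the kind of simulated phase distribution the abstract compares against real genes.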
Affiliation(s)
- A Ruvinsky
- Institute for Genetics and Bioinformatics, University of New England, Armidale 2351, NSW, Australia.
|
99
|
Nikolaou C, Almirantis Y. Measuring the coding potential of genomic sequences through a combination of triplet occurrence patterns and RNY preference. J Mol Evol 2005; 59:309-16. [PMID: 15553086 DOI: 10.1007/s00239-004-2626-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
The distribution of n-tuplet frequencies is shown to strongly correlate with functionality when examining a genomic sequence in a reading-frame specific manner. The approach described herein applies a coarse-graining procedure, which is able to reveal aspects of triplet usage that are related to protein coding, while at the same time remaining species independent, based on a simple summation of suitable triplet occurrence measures. These quantities are ratios of simple frequencies to suitable mononucleotide-frequency products promoting the incidence of the RNY motif, preferred in the most widely used codons. A significant distinction of coding and noncoding sequences is achieved.
Affiliation(s)
- Christoforos Nikolaou
- Institute of Biology, National Research Center for Physical Sciences Demokritos, Athens, Greece
|
100
|
Eskesen ST, Eskesen FN, Kinghorn B, Ruvinsky A. Periodicity of DNA in exons. BMC Mol Biol 2004; 5:12. [PMID: 15315715 PMCID: PMC516030 DOI: 10.1186/1471-2199-5-12] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Received: 11/05/2003] [Accepted: 08/18/2004] [Indexed: 01/29/2023]
Abstract
Background The periodic pattern of DNA in exons is a known phenomenon. It was suggested that one of the initial causes of periodicity could be the universal (RNY)n pattern (R = A or G, Y = C or U, N = any base) of ancient RNA. Two major questions were addressed in this paper. Firstly, the cause of DNA periodicity, which was investigated by comparisons between real and simulated coding sequences. Secondly, quantification of DNA periodicity was made using an evolutionary algorithm, which was not previously used for such purposes. Results We have shown that simulated coding sequences, which were composed using codon usage frequencies only, demonstrate DNA periodicity very similar to that observed in real exons. It was also found that DNA periodicity disappears in the simulated sequences when the frequencies of codons become equal. Frequencies of the nucleotides (and the dinucleotide AG) at each location along phase 0 exons were calculated for C. elegans, D. melanogaster and H. sapiens. Two models were used to fit these data, with the key objective of describing periodicity. Both of the models showed that the best-fit curves closely matched the actual data points. The first dynamic period determination model consistently generated a value, which was very close to the period equal to 3 nucleotides. The second fixed period model, as expected, kept the period exactly equal to 3 and did not detract from its goodness of fit. Conclusions The conclusion can be drawn that DNA periodicity in exons is determined by codon usage frequencies. It is essential to differentiate between DNA periodicity itself, and the length of the period equal to 3. Periodicity itself is a result of certain combinations of codons with different frequencies typical for a species. The length of period equal to 3, instead, is caused by the triplet nature of genetic code.
The models and evolutionary algorithm used for characterising DNA periodicity are proven to be an effective tool for describing the periodicity pattern in a species, when a number of exons in the same phase are analysed.
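The central observation of this abstract, that periodicity follows from unequal codon frequencies and vanishes when they are equal, can be reproduced in a toy simulation. The sketch below is only illustrative: the codon weights are hypothetical (RNY codons are made five times more frequent, not taken from any species' usage table), stop codons are not excluded, and the `periodicity_score` measure is an ad hoc choice rather than the models fitted in the paper.

```python
import random
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def simulate_cds(weights, n_codons, rng):
    """Draw codons independently according to the given usage weights."""
    return "".join(rng.choices(CODONS, weights=weights, k=n_codons))


def position_frequencies(seq):
    """Frequency of each base at codon positions 0, 1 and 2."""
    freqs = []
    for pos in range(3):
        counts = Counter(seq[pos::3])
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in "ACGT"})
    return freqs


def periodicity_score(seq):
    """Largest spread of any base's frequency across the three codon positions."""
    f = position_frequencies(seq)
    return max(max(p[b] for p in f) - min(p[b] for p in f) for b in "ACGT")


rng = random.Random(0)
# Equal codon frequencies: no position-specific signal is expected.
uniform = simulate_cds([1.0] * 64, 20000, rng)
# Hypothetical bias: codons matching R-N-Y are weighted 5x (illustrative only).
biased_weights = [5.0 if c[0] in "AG" and c[2] in "CT" else 1.0 for c in CODONS]
biased = simulate_cds(biased_weights, 20000, rng)
```

Comparing `periodicity_score(uniform)` with `periodicity_score(biased)` shows the position-specific signal appearing only under biased codon usage, while the period itself is fixed at 3 by construction, in line with the distinction the abstract draws.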
Affiliation(s)
- Stephen T Eskesen
- Institute of Genetics and Bioinformatics, University of New England, Armidale, NSW, Australia
- Brian Kinghorn
- Institute of Genetics and Bioinformatics, University of New England, Armidale, NSW, Australia
- Anatoly Ruvinsky
- Institute of Genetics and Bioinformatics, University of New England, Armidale, NSW, Australia
|