1
Yadav Y, Sharma SN, Shakya DK. Detection of Tandem Repeats in DNA Sequences Using Short-Time Ramanujan Fourier Transform. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1583-1591. [PMID: 33493119 DOI: 10.1109/tcbb.2021.3053656] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
Tandem repeats in genomic sequences are characterized by two or more contiguous copies of a pattern of nucleotides. The role of these repeats as molecular markers is well established in various genetic disorders, human evolution studies, DNA forensics and intron retention. In this work, a computational method has been developed for the extraction of both exact and approximate tandem repeats. The proposed algorithm uses the Ramanujan Fourier Transform (RFT) to identify periodicities in DNA sequences. Since RFT estimates the period directly, rather than inferring it from the signal's spectrum, it provides more sensitive and faster detection of tandem repeats than other popular computational methods.
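The idea of estimating a period directly with Ramanujan sums can be illustrated with a minimal sketch. This is not the short-time RFT algorithm of the paper: the function names `ramanujan_sum` and `period_energies`, the binary indicator encoding of a single base, and the shift-summed energy score are all assumptions made for illustration.

```python
from math import cos, gcd, pi


def ramanujan_sum(q, n):
    """c_q(n): sum of cos(2*pi*k*n/q) over 1 <= k <= q with gcd(k, q) == 1."""
    total = sum(cos(2 * pi * k * n / q) for k in range(1, q + 1) if gcd(k, q) == 1)
    return round(total)  # Ramanujan sums are always integers


def period_energies(seq, base, periods):
    """Score each candidate period q by correlating an indicator signal with c_q.

    The indicator is 1 where seq has `base` and 0 elsewhere (an assumed encoding).
    Squared correlations are summed over all q cyclic shifts of c_q, so the
    score does not depend on where the repeat starts.
    """
    x = [1 if ch == base else 0 for ch in seq]
    energies = {}
    for q in periods:
        c = [ramanujan_sum(q, n) for n in range(q)]
        energies[q] = sum(
            sum(x[n] * c[(n - s) % q] for n in range(len(x))) ** 2
            for s in range(q)
        )
    return energies


# Toy input: eight exact copies of a 3-nucleotide unit.
energies = period_energies("ACG" * 8, "A", periods=[2, 3, 4, 6])
best_period = max(energies, key=energies.get)
```

On this toy input the only candidate with nonzero energy is q = 3. In general a signal of period P can also show energy at the divisors of P, so a practical detector would need to combine the responding values of q rather than take a single maximum.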
2
Genovese LM, Mosca MM, Pellegrini M, Geraci F. Dot2dot: accurate whole-genome tandem repeats discovery. Bioinformatics 2019; 35:914-922. [PMID: 30165507 PMCID: PMC6419916 DOI: 10.1093/bioinformatics/bty747] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/30/2018] [Revised: 08/03/2018] [Accepted: 08/24/2018] [Indexed: 01/18/2023] Open
Abstract
MOTIVATION: Large-scale sequencing projects have confirmed the hypothesis that eukaryotic DNA is rich in repetitions whose functional role needs to be elucidated. In particular, tandem repeats (TRs) (i.e. short, almost identical sequences that lie adjacent to each other) have been associated with many cellular processes and are also involved in several genetic disorders. The need for comprehensive lists of TRs for association studies and the absence of a computational model able to capture their variability have revived research on discovery algorithms. RESULTS: Building upon the idea that sequence similarities can be easily displayed using graphical methods, we formalized the structure that TRs induce in dot-plot matrices where a sequence is compared with itself. Leveraging the observation that a compact representation of these matrices can be built and searched in linear time, we developed Dot2dot: an accurate algorithm fast enough to be suitable for whole-genome discovery of TRs. Experiments on five manually curated collections of TRs have shown that Dot2dot is more accurate than other established methods, and completes the analysis of the biggest known reference genome in about one day on a standard PC. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are freely available upon paper acceptance at the URL: https://github.com/Gege7177/Dot2dot. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
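The structure that a tandem repeat induces in a self-comparison dot-plot can be sketched as follows: a repeat with unit length p produces a run of matches on the diagonal at offset p, because seq[i] equals seq[i + p] throughout the repeat. The sketch below is not Dot2dot's algorithm (which relies on a compact matrix representation and tolerates approximate copies); the function name `diagonal_runs` and the exact-match rule are assumptions.

```python
def diagonal_runs(seq, offset, min_len=1):
    """Return (start, length) for each maximal run where seq[i] == seq[i + offset].

    Each run corresponds to a line segment on the dot-plot diagonal that lies
    `offset` cells away from the main diagonal.
    """
    runs = []
    run_start = None
    limit = len(seq) - offset
    for i in range(limit + 1):  # one extra step flushes a run that reaches the end
        is_match = i < limit and seq[i] == seq[i + offset]
        if is_match and run_start is None:
            run_start = i
        elif not is_match and run_start is not None:
            if i - run_start >= min_len:
                runs.append((run_start, i - run_start))
            run_start = None
    return runs


# Four copies of "ACG" flanked by unrelated bases.
runs = diagonal_runs("GTACGACGACGACGTT", offset=3, min_len=3)
```

Here `runs` is `[(2, 9)]`: the run starts where the repeat starts, and its length plus the offset gives the 12-nucleotide extent of the repeat.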
Affiliation(s)
- Marco M Mosca
  - Department of Computer Science, University of Liverpool, Liverpool, UK
- Marco Pellegrini
  - Institute for Informatics and Telematics, CNR, Pisa, Italy; Laboratory of Integrative Systems Medicine (LISM), Institute of Informatics and Telematics and Institute of Clinical Physiology, Pisa, Italy
- Filippo Geraci
  - Institute for Informatics and Telematics, CNR, Pisa, Italy
3
Sharma SD, Saxena R, Sharma SN. Tandem repeats detection in DNA sequences using Kaiser window based adaptive S-transform. Bio-Algorithms and Med-Systems 2017. [DOI: 10.1515/bams-2017-0014] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
In computational biology, the development of algorithms for the identification of tandem repeats in DNA sequences is a challenging problem. Identifying tandem repeats is helpful in gene annotation, forensics, and the study of human evolution. In this work, a signal processing algorithm based on the adaptive S-transform with a Kaiser window has been proposed for the detection of exact and approximate tandem repeats. Use of the Kaiser window helped in identifying short as well as long tandem repeats. Thus, the limitation of the earlier S-transform based algorithm, which identified only microsatellites, has been alleviated by this more versatile algorithm. The superiority of this algorithm has been established by comparative simulation studies with other reported methods.
4
Sharma SD, Saxena R, Sharma SN. Identification of Microsatellites in DNA Using Adaptive S-Transform. IEEE J Biomed Health Inform 2015; 19:1097-105. [DOI: 10.1109/jbhi.2014.2330901] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/07/2022]
5
Pugacheva V, Frenkel F, Korotkov E. Investigation of phase shifts for different period lengths in the genomes of C. elegans, D. melanogaster and S. cerevisiae. Comput Biol Chem 2014; 51:12-21. [PMID: 24840641 DOI: 10.1016/j.compbiolchem.2014.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2013] [Revised: 03/31/2014] [Accepted: 03/31/2014] [Indexed: 11/26/2022]
Abstract
We describe a new mathematical method for finding highly diverged short tandem repeats containing a single indel. The method involves comparison of two frequency matrices: one for the subsequence before the shift and one for the subsequence after it. The measure of comparison is based on matrix similarity. The approach was applied to the genomes of Caenorhabditis elegans, Drosophila melanogaster and Saccharomyces cerevisiae, which were searched for tandem repeats with repeat lengths of 2 to 11 nucleotides, excluding 3, 6 and 9 nucleotides. The number of phase shift regions found in these genomes was approximately 2.2 × 10^4, 1.5 × 10^4 and 1.7 × 10^2, respectively. Type I error was less than 5%. The mean length of fuzzy periodicity and phase shift regions was about 220 nucleotides. The regions of fuzzy periodicity having a single insertion or deletion occupy substantial parts of the genomes: 5%, 3% and 0.3%, respectively. Fewer than 10% of these regions have been detected previously. That is, the number of such regions in the genomes of C. elegans, D. melanogaster and S. cerevisiae is dramatically higher than has been revealed by any known method. We suppose that some of the regions of fuzzy periodicity found could be regions for protein binding.
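The comparison of frequency matrices before and after a phase shift can be sketched under strong simplifying assumptions. The abstract does not specify the similarity measure, so the sketch below uses a plain dot product between count matrices and assumes the indel position is already known; the names `phase_matrix` and `best_phase_shift` are hypothetical and this is not the authors' statistical procedure.

```python
from collections import Counter


def phase_matrix(seq, period, start=0):
    """Count each base per column, where column = (global position) mod period."""
    cols = [Counter() for _ in range(period)]
    for i, ch in enumerate(seq):
        cols[(start + i) % period][ch] += 1
    return cols


def best_phase_shift(left, right, period):
    """Return the cyclic column shift that makes `right` most similar to `left`.

    Similarity is an assumed dot product of base counts, column by column.
    """
    def similarity(shift):
        return sum(
            left[c][b] * right[(c + shift) % period][b]
            for c in range(period)
            for b in "ACGT"
        )
    return max(range(period), key=similarity)


# Period-4 repeat with a single inserted base at position 20.
seq = "ACGT" * 5 + "A" + "ACGT" * 5
cut = 20
left = phase_matrix(seq[:cut], 4, start=0)
right = phase_matrix(seq[cut + 1:], 4, start=cut + 1)
shift = best_phase_shift(left, right, 4)
```

A shift of 1 indicates that the periodicity after the insertion is displaced by one position relative to the periodicity before it, which is the signature of a single-base insertion.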
Affiliation(s)
- Felix Frenkel
  - Bioengineering Centre of Russian Academy of Science, Moscow 117312, Russia
- Eugene Korotkov
  - Bioengineering Centre of Russian Academy of Science, Moscow 117312, Russia; National Research Nuclear University "MEPhI", Moscow 115409, Russia
6
de Ridder C, Kourie D, Watson B, Fourie T, Reyneke P. Fine-tuning the search for microsatellites. Journal of Discrete Algorithms 2013. [DOI: 10.1016/j.jda.2012.12.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/27/2022]
7
Song NY, Hong Yan. Autoregressive and Iterative Hidden Markov Models for Periodicity Detection and Solenoid Structure Recognition in Protein Sequences. IEEE J Biomed Health Inform 2013; 17:436-41. [DOI: 10.1109/jbhi.2012.2235852] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
8
9
Liang T, Fan X, Li Q, Li SYR. Detection of dispersed short tandem repeats using reversible jump Markov chain Monte Carlo. Nucleic Acids Res 2012; 40:e147. [PMID: 22753023 PMCID: PMC3479165 DOI: 10.1093/nar/gks644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Tandem repeats occur frequently in biological sequences. They are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case where an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. We construct a probabilistic generative model for the tandem repeats, where the sequence pattern is represented by a motif matrix. A Bayesian approach is adopted to fit this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution in order to infer both the motif matrix of the tandem repeats and the location of the repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic and real data show that this new approach is powerful in detecting dispersed short tandem repeats. As far as we know, it is the first work to adopt RJMCMC algorithms in the detection of tandem repeats.
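The role of a motif matrix in a generative model of tandem repeats can be sketched by scoring a segment as back-to-back copies of the motif against a uniform background. This sketch only evaluates a fixed, hand-written motif; the MCMC and RJMCMC inference described in the abstract is not shown, and the function name `tandem_log_odds`, the uniform background of 0.25 and the example probabilities are all assumptions.

```python
from math import log


def tandem_log_odds(segment, motif, background=0.25):
    """Log-likelihood ratio: tandem copies of `motif` versus i.i.d. background.

    `motif` is a list of dicts, one per motif position, mapping each base to
    its probability at that position (every base needs a nonzero probability).
    """
    width = len(motif)
    return sum(log(motif[i % width][ch] / background) for i, ch in enumerate(segment))


# Hypothetical motif matrix favouring the dinucleotide "AC".
motif = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
repeat_score = tandem_log_odds("ACACACAC", motif)
other_score = tandem_log_odds("GTGTGTGT", motif)
```

A positive score favours the repeat model and a negative score favours the background, so `repeat_score` is positive and `other_score` is negative for this motif.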
Affiliation(s)
- Tong Liang
  - Department of Information Engineering and Department of Statistics, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Xiaodan Fan
  - Department of Information Engineering and Department of Statistics, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. To whom correspondence should be addressed. Tel: +852 3943 7930; Fax: +852 2603 5188
- Qiwei Li
  - Department of Information Engineering and Department of Statistics, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Shuo-yen R. Li
  - Department of Information Engineering and Department of Statistics, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
10
Glunčić M, Paar V. Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm. Nucleic Acids Res 2012; 41:e17. [PMID: 22977183 PMCID: PMC3592446 DOI: 10.1093/nar/gks721] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 01/22/2023] Open
Abstract
The main feature of the global repeat map (GRM) algorithm (www.hazu.hr/grm/software/win/grm2012.exe) is its ability to identify a broad variety of repeats of unbounded length that can be arbitrarily distant in sequences as large as human chromosomes. Its efficacy is due to the use of a complete K-string ensemble, which enables a new method of direct mapping of a symbolic DNA sequence into the frequency domain, with straightforward identification of repeats as peaks in the GRM diagram. In this way, we obtain a very fast, efficient and highly automated repeat-finding tool. The method is robust to substitutions and insertions/deletions, as well as to various complexities of the sequence pattern. We present several case studies of GRM use in order to illustrate its capabilities: identification of α-satellite tandem repeats and higher order repeats (HORs), identification of Alu dispersed repeats and of Alu tandems, identification of the Period 3 pattern in exons, implementation of the ‘magnifying glass’ effect, identification of a complex HOR pattern, identification of inter-tandem transitional dispersed repeat sequences and identification of long segmental duplications. The GRM algorithm is convenient for use, in particular, in cases of large repeat units, of highly mutated and/or complex repeats, and of global repeat maps for large genomic sequences (chromosomes and genomes).
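The notion of repeats appearing as peaks when K-strings are mapped to a length domain can be sketched with a histogram of distances between consecutive occurrences of identical k-mers. This is an illustrative simplification and not the published GRM computation; the function name `kmer_distance_histogram` and the choice of k are assumptions.

```python
from collections import Counter


def kmer_distance_histogram(seq, k):
    """Histogram of distances between consecutive occurrences of each k-mer.

    A tandem repeat with unit length p contributes many distances equal to p,
    which shows up as a peak at p in the histogram.
    """
    last_seen = {}
    hist = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in last_seen:
            hist[i - last_seen[kmer]] += 1
        last_seen[kmer] = i
    return hist


# Five exact copies of an 8-nucleotide unit.
hist = kmer_distance_histogram("ACGTTGCA" * 5, k=4)
peak_distance, peak_count = hist.most_common(1)[0]
```

For this toy input every recorded distance equals the unit length of 8, so the histogram has a single peak there.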
Affiliation(s)
- Matko Glunčić
  - Faculty of Science, University of Zagreb, Bijenička 32 and Croatian Academy of Sciences and Arts, Zrinski trg 11, 10000 Zagreb, Croatia.
11
Grover A, Aishwarya V, Sharma PC. Searching microsatellites in DNA sequences: approaches used and tools developed. Physiology and Molecular Biology of Plants: An International Journal of Functional Plant Biology 2012; 18:11-9. [PMID: 23573036 PMCID: PMC3550526 DOI: 10.1007/s12298-011-0098-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 05/12/2023]
Abstract
Genomic activities and evolutionary changes associated with microsatellite instability have led to a renewed focus on microsatellite research. In the last decade, a number of microsatellite mining tools have been introduced based on different computational approaches. The choice is generally made between slow but exhaustive dynamic programming based approaches and fast but incomplete heuristic methods. Tools based on stochastic approaches are more popular due to their simplicity and added ornamental features. We have performed a comparative evaluation of the relative efficiency of some microsatellite search tools with their default settings. The graphical user interface, the statistical analysis of the output and the ability to mine imperfect repeats are the most important criteria in selecting a tool for a particular investigation. However, none of the available tools alone provides complete and accurate information about microsatellites, and a lot depends on the discretion of the user.
Affiliation(s)
- Atul Grover
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, Sector 16C Dwarka, New Delhi, 110075 India
  - Molecular Biology and Genetic Engineering Laboratory, Defence Institute of Bio Energy Research, Goraparao, Haldwani, 263139 India
- Veenu Aishwarya
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, Sector 16C Dwarka, New Delhi, 110075 India
  - Division of Hematology/Oncology, Department of Medicine, University of Pennsylvania School of Medicine, Philadelphia, PA USA
- P. C. Sharma
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, Sector 16C Dwarka, New Delhi, 110075 India
12
Wang DD, Yan H. The relationship between periodic dinucleotides and the nucleosomal DNA deformation revealed by normal mode analysis. Phys Biol 2011; 8:066004. [DOI: 10.1088/1478-3975/8/6/066004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
13
Li Q, Fan X, Liang T, Li SR. An MCMC algorithm for detecting short adjacent repeats shared by multiple sequences. Bioinformatics 2011; 27:1772-9. [DOI: 10.1093/bioinformatics/btr287] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open