1
|
Behboudi R, Nouri-Baygi M, Naghibzadeh M. RPTRF: A rapid perfect tandem repeat finder tool for DNA sequences. Biosystems 2023; 226:104869. [PMID: 36858110 DOI: 10.1016/j.biosystems.2023.104869] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 01/23/2023] [Accepted: 02/23/2023] [Indexed: 03/02/2023]
Abstract
The sequencing of eukaryotic genomes has shown that tandem repeats are abundant in their sequences. In addition to affecting some cellular processes, tandem repeats in the genome may be associated with specific diseases and have been the key to resolving criminal cases. Any tool developed for detecting tandem repeats must be accurate, fast, and useable in thousands of laboratories worldwide, including those with not very advanced computing capabilities. The proposed method, the Rapid Perfect Tandem Repeat Finder (RPTRF), minimizes the need for excess character comparison processing by indexing the input file and significantly helps to accelerate and prepare the output without artifacts by using an interval tree in the filtering section. The experiments demonstrated that the RPTRF is very fast in discovering all perfect tandem repeats of all categories of any genomic sequences. Although the detection of imperfect TRs is not the focus of the RPTRF, comparisons show that it even outperforms some other tools (in five selected gold standards) designed explicitly for this purpose. The implemented tool and how to use it are available on GitHub.
Collapse
Affiliation(s)
- Reza Behboudi
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mostafa Nouri-Baygi
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran.
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| |
Collapse
|
2
|
Developing an ultra-efficient microsatellite discoverer to find structural differences between SARS-CoV-1 and Covid-19. INFORMATICS IN MEDICINE UNLOCKED 2020; 19:100356. [PMID: 32501423 PMCID: PMC7241407 DOI: 10.1016/j.imu.2020.100356] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2020] [Revised: 05/20/2020] [Accepted: 05/20/2020] [Indexed: 12/24/2022] Open
Abstract
Motivation Recently, the outbreak of Coronavirus-Covid-19 has forced the World Health Organization to declare a pandemic status. A genome sequence is the core of this virus which interferes with the normal activities of its counterparts within humans. Analysis of its genome may provide clues toward the proper treatment of patients and the design of new drugs and vaccines. Microsatellites are composed of short genome subsequences which are successively repeated many times in the same direction. They are highly variable in terms of their building blocks, number of repeats, and their locations in the genome sequences. This mutability property has been the source of many diseases. Usually the host genome is analyzed to diagnose possible diseases in the victim. In this research, the focus is concentrated on the attacker's genome for discovery of its malicious properties. Results The focus of this research is the microsatellites of both SARS and Covid-19. An accurate and highly efficient computer method for identifying all microsatellites in the genome sequences is discovered and implemented, and it is used to find all microsatellites in the Coronavirus-Covid-19 and SARS2003. The Microsatellite discovery is based on an efficient indexing technique called K-Mer Hash Indexing. The method is called Fast Microsatellite Discovery (FMSD) and it is used for both SARS and Covid-19. A table composed of all microsatellites is reported. There are many differences between SARS and Covid-19, but there is an outstanding difference which requires further investigation. Availability FMSD is freely available at https://gitlab.com/FUM_HPCLab/fmsd_project, implemented in C on Linux-Ubuntu system. Software related contact: hossein_savari@mail.um.ac.ir.
Collapse
|
3
|
Makhortykh SA, Kulikova LI, Pankratov AN, Tetuev RK. Generalized Spectral-Analytical Method and Its Applications in Image Analysis and Pattern Recognition Problems. PATTERN RECOGNITION AND IMAGE ANALYSIS 2019. [DOI: 10.1134/s1054661819040102] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
|
4
|
Chaley M, Kutyrkin V. Stochastic models for description of structural-statistical properties in DNA sequences. J Theor Biol 2019; 496:110126. [PMID: 31866393 DOI: 10.1016/j.jtbi.2019.110126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2019] [Revised: 12/02/2019] [Accepted: 12/18/2019] [Indexed: 10/25/2022]
Abstract
New stochastic models based on a notion of stochastic codon are proposed. These models, presented by special random strings, describe practical structural-statistical properties which are peculiar to coding DNA both from prokaryotic and eukaryotic genomes. In such the case coding regions are considered as the realizations of random strings. The models introduced explain existence of latent profile periodicity with a period which is not only equal to but also multiplied of three in the coding regions. For the sequences with latent profile period multiplied of three, but not equal to three, the proposed models ensure existence of special property of 3-regularity in these sequences which is practically recognized in all coding sequences of the genomes analyzed. Feasibility of the stochastic models proposed was tested in numerical experiments with binary reencoded paragraphs of literary texts (in English and Italian languages), used as analog of DNA coding regions.
Collapse
Affiliation(s)
- Maria Chaley
- Institute of Mathematical Problems of Biology RAS - Branch of Keldysh Institute of Applied Mathematics RAS, Professor Vitkevich St.,1, 142290 Pushchino, Russia.
| | - Vladimir Kutyrkin
- Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st.,5, 105005 Moscow, Russia.
| |
Collapse
|
5
|
Abstract
Whether the source is autonomous car, robotic vacuum cleaner, or a quadcopter, signals from sensors tend to have some hidden patterns that repeat themselves. For example, typical GPS traces from a smartphone contain periodic trajectories such as “home, work, home, work, ⋯”. Our goal in this study was to automatically reverse engineer such signals, identify their periodicity, and then use it to compress and de-noise these signals. To do so, we present a novel method of using algorithms from the field of pattern matching and text compression to represent the “language” in such signals. Common text compression algorithms are less tailored to handle such strings. Moreover, they are lossless, and cannot be used to recover noisy signals. To this end, we define the recursive run-length encoding (RRLE) method, which is a generalization of the well known run-length encoding (RLE) method. Then, we suggest lossy and lossless algorithms to compress and de-noise such signals. Unlike previous results, running time and optimality guarantees are proved for each algorithm. Experimental results on synthetic and real data sets are provided. We demonstrate our system by showing how it can be used to turn commercial micro air-vehicles into autonomous robots. This is by reverse engineering their unpublished communication protocols and using a laptop or on-board micro-computer to control them. Our open source code may be useful for both the community of millions of toy robots users, as well as for researchers that may extend it for further protocols.
Collapse
|
6
|
Jelovic AM, Mitic NS, Eshafah S, Beljanski MV. Finding Statistically Significant Repeats in Nucleic Acids and Proteins. J Comput Biol 2017; 25:375-387. [PMID: 29272145 DOI: 10.1089/cmb.2017.0046] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
DNA repeats have great importance for biological research and a large number of tools for determining repeats have been developed. Herein we define a method for extracting a statistically significant subset of a determined set of repeats. Our aim was to identify a subset of repeats in the input sequences that are not expected to occur with a number of their appearances in a random sequence of the same length. It is expected that results obtained in such manner would reduce the quantity of processed material and could thereby represent a more important biological signal. With DNA, RNA, and protein sequences serving as input material, we also examined the possibility of statistical filtering of repeats in sequences over an arbitrary alphabet. A new method for selecting statistically significant repeats from a set of determined repeats has been defined. The proposed method was tested on a large number of randomly generated sequences. The application of the method on biological sequences revealed that for some viruses, shorter repeats are more statistically significant than longer ones because of their frequent appearance, whereas for bacteria, the majority of identified repeats are statistically significant.
Collapse
Affiliation(s)
- Ana M Jelovic
- 1 Faculty of Transport and Traffic Engineering, University of Belgrade , Belgrade, Serbia .,2 Faculty of Mathematics, University of Belgrade , Belgrade, Serbia
| | - Nenad S Mitic
- 2 Faculty of Mathematics, University of Belgrade , Belgrade, Serbia
| | - Samira Eshafah
- 2 Faculty of Mathematics, University of Belgrade , Belgrade, Serbia
| | - Milos V Beljanski
- 3 Institute of General and Physical Chemistry , Bio-Lab, Belgrade, Serbia
| |
Collapse
|
7
|
Lizárraga E, Blesa MJ, Blum C, Raidl GR. Large neighborhood search for the most strings with few bad columns problem. Soft comput 2017. [DOI: 10.1007/s00500-016-2379-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
8
|
Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis. Algorithms Mol Biol 2012; 7:10. [PMID: 22551152 PMCID: PMC3402988 DOI: 10.1186/1748-7188-7-10] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2012] [Accepted: 05/02/2012] [Indexed: 01/06/2023] Open
Abstract
Background Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
Collapse
|
9
|
Freschi V, Bogliolo A. A lossy compression technique enabling duplication-aware sequence alignment. Evol Bioinform Online 2012; 8:171-80. [PMID: 22518086 PMCID: PMC3327517 DOI: 10.4137/ebo.s9131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not take into account complex mutation events involving more than one residue at the time, since they are not compliant with the underlying assumption of statistical independence of adjacent residues. As a consequence, the presence of tandem repeats in sequences under comparison may impair the biological significance of the resulting alignment. Although solutions have been proposed, repeat-aware sequence alignment is still considered to be an open problem and new efficient and effective methods have been advocated. The present paper describes an alternative lossy compression scheme for genomic sequences which iteratively collapses repeats of increasing length. The resulting approximate representations do not contain tandem duplications, while retaining enough information for making their comparison even more significant than the edit distance between the original sequences. This allows us to exploit traditional alignment algorithms directly on the compressed sequences. Results confirm the validity of the proposed approach for the problem of duplication-aware sequence alignment.
Collapse
Affiliation(s)
- Valerio Freschi
- Department of Base Sciences and Fundamentals, University of Urbino, Italy
| | | |
Collapse
|
10
|
Matroud AA, Hendy MD, Tuffley CP. NTRFinder: a software tool to find nested tandem repeats. Nucleic Acids Res 2011; 40:e17. [PMID: 22121222 PMCID: PMC3273788 DOI: 10.1093/nar/gkr1070] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We introduce the software tool NTRFinder to search for a complex repetitive structure in DNA we call a nested tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose that NTRs can be used as phylogenetic and population markers. We have tested our algorithm on both real and simulated data, and present some real NTRs of interest. NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/.
Collapse
Affiliation(s)
- Atheer A Matroud
- Institute of Fundamental Sciences, Massey University, Private Bag 11 222, Palmerston North 4442, New Zealand.
| | | | | |
Collapse
|
11
|
Freschi V, Bogliolo A. A monte carlo method for assessing the quality of duplication-aware alignment algorithms. Evol Bioinform Online 2011; 7:31-40. [PMID: 21698090 PMCID: PMC3118696 DOI: 10.4137/ebo.s6662] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
The increasing availability of high throughput sequencing technologies poses several challenges concerning the analysis of genomic data. Within this context, duplication-aware sequence alignment taking into account complex mutation events is regarded as an important problem, particularly in light of recent evolutionary bioinformatics researches that highlighted the role of tandem duplications as one of the most important mutation events. Traditional sequence comparison algorithms do not take into account these events, resulting in poor alignments in terms of biological significance, mainly because of their assumption of statistical independence among contiguous residues. Several duplication-aware algorithms have been proposed in the last years which differ either for the type of duplications they consider or for the methods adopted to identify and compare them. However, there is no solution which clearly outperforms the others and no methods exist for assessing the reliability of the resulting alignments. This paper proposes a Monte Carlo method for assessing the quality of duplication-aware alignment algorithms and for driving the choice of the most appropriate alignment technique to be used in a specific context. The applicability and usefulness of the proposed approach are demonstrated on a case study, namely, the comparison of alignments based on edit distance with or without repeat masking.
Collapse
Affiliation(s)
- Valerio Freschi
- DiSBeF-Department of Base Sciences and Fundamentals, University of Urbino, Italy
| | | |
Collapse
|
12
|
Ilie L, Navarro G, Tinta L. The longest common extension problem revisited and applications to approximate string searching. ACTA ACUST UNITED AC 2010. [DOI: 10.1016/j.jda.2010.08.004] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
13
|
Pellegrini M, Renda ME, Vecchio A. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. ACTA ACUST UNITED AC 2010; 26:i358-66. [PMID: 20529928 PMCID: PMC2881393 DOI: 10.1093/bioinformatics/btq209] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]
Abstract
Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events. Results: We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences. Availability: TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it. Contact:marco.pellegrini@iit.cnr.it Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Marco Pellegrini
- CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa, Italy.
| | | | | |
Collapse
|
14
|
Sokol D, Atagun F. TRedD--a database for tandem repeats over the edit distance. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2010; 2010:baq003. [PMID: 20624712 PMCID: PMC2911838 DOI: 10.1093/database/baq003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
A ‘tandem repeat’ in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats are common in the genomes of both eukaryotic and prokaryotic organisms. They are significant markers for human identity testing, disease diagnosis, sequence homology and population studies. In this article, we describe a new database, TRedD, which contains the tandem repeats found in the human genome. The database is publicly available online, and the software for locating the repeats is also freely available. The definition of tandem repeats used by TRedD is a new and innovative definition based upon the concept of ‘evolutive tandem repeats’. In addition, we have developed a tool, called TandemGraph, to graphically depict the repeats occurring in a sequence. This tool can be coupled with any repeat finding software, and it should greatly facilitate analysis of results. Database URL:http://tandem.sci.brooklyn.cuny.edu/
Collapse
Affiliation(s)
- Dina Sokol
- Department of Computer and Information Science, Brooklyn College of the City University of New York, 2900 Bedford Avenue, Brooklyn, NY 11210, USA.
| | | |
Collapse
|
15
|
Gupta R, Sarthi D, Mittal A, Singh K. A novel signal processing measure to identify exact and inexact tandem repeat patterns in DNA sequences. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2010:43596. [PMID: 17713591 PMCID: PMC3171338 DOI: 10.1155/2007/43596] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/06/2006] [Revised: 11/20/2006] [Accepted: 12/07/2006] [Indexed: 01/07/2023]
Abstract
The identification and analysis of repetitive patterns are active areas of biological and computational research. Tandem repeats in telomeres play a role in cancer and hypervariable trinucleotide tandem repeats are linked to over a dozen major neurodegenerative genetic disorders. In this paper, we present an algorithm to identify the exact and inexact repeat patterns in DNA sequences based on orthogonal exactly periodic subspace decomposition technique. Using the new measure our algorithm resolves the problems like whether the repeat pattern is of period P or its multiple (i.e., 2P, 3P, etc.), and several other problems that were present in previous signal-processing-based algorithms. We present an efficient algorithm of O(NL(w) log L(w)), where N is the length of DNA sequence and L(w) is the window length, for identifying repeats. The algorithm operates in two stages. In the first stage, each nucleotide is analyzed separately for periodicity, and in the second stage, the periodic information of each nucleotide is combined together to identify the tandem repeats. Datasets having exact and inexact repeats were taken up for the experimental purpose. The experimental result shows the effectiveness of the approach.
Collapse
Affiliation(s)
- Ravi Gupta
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
| | - Divya Sarthi
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
| | - Ankush Mittal
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
| | - Kuldip Singh
- Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttaranchal 247 667, India
| |
Collapse
|
16
|
Rouchka EC. Database of exact tandem repeats in the Zebrafish genome. BMC Genomics 2010; 11:347. [PMID: 20515480 PMCID: PMC2901318 DOI: 10.1186/1471-2164-11-347] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2009] [Accepted: 06/01/2010] [Indexed: 11/23/2022] Open
Abstract
Background Sequencing of the approximately 1.7 billion bases of the zebrafish genome is currently underway. To date, few high resolution genetic maps exist for the zebrafish genome, based mainly on single nucleotide polymorphisms (SNPs) and short microsatellite repeats. The desire to construct a higher resolution genetic map led to the construction of a database of tandemly repeating elements within the zebrafish Zv8 assembly. Description Exact tandem repeats with a repeat length of at least three bases and a copy number of at least 10 were reported. Repeats with a total length of 250 or fewer bases and their flanking regions were masked for known vertebrate repeats. Optimal primer pairs were computationally designed in the regions flanking the detected repeats. This database of exact tandem repeats can then be used as a resource by molecular biologists with interests in experimentally testing VNTRs within a zebrafish population. Conclusions A total of 116,915 repeats with a base length of at least three nucleotides were detected. The longest of these was a 54-base repeat with fourteen tandem copies. A significant number of repeats with a base length of 18, 24, 27 and 30 were detected, many with potentially novel proline-rich coding regions. Detection of exact tandem repeats in the zebrafish genome leads to a wealth of information regarding potential polymorphic sites for VNTRs. The association of many of these repeats with potentially novel yet similar coding regions yields an exciting potential for disease associated genes. A web interface for querying repeats is available at http://bioinformatics.louisville.edu/zebrafish/. This portal allows for users to search for a repeats of a selected base size from any valid specified region within the 25 linkage groups.
Collapse
Affiliation(s)
- Eric C Rouchka
- Department of Computer Engineering and Computer Science, Speed School of Engineering, University of Louisville, Duthie Center, Room 208, Louisville, KY, USA.
| |
Collapse
|
17
|
Jorda J, Kajava AV. T-REKS: identification of Tandem REpeats in sequences with a K-meanS based algorithm. ACTA ACUST UNITED AC 2009; 25:2632-8. [PMID: 19671691 DOI: 10.1093/bioinformatics/btp482] [Citation(s) in RCA: 131] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Over the last years a number of evidences have been accumulated about high incidence of tandem repeats in proteins carrying fundamental biological functions and being related to a number of human diseases. At the same time, frequently, protein repeats are strongly degenerated during evolution and, therefore, cannot be easily identified. To solve this problem, several computer programs which were based on different algorithms have been developed. Nevertheless, our tests showed that there is still room for improvement of methods for accurate and rapid detection of tandem repeats in proteins. RESULTS We developed a new program called T-REKS for ab initio identification of the tandem repeats. It is based on clustering of lengths between identical short strings by using a K-means algorithm. Benchmark of the existing programs and T-REKS on several sequence datasets is presented. Our program being linked to the Protein Repeat DataBase opens the way for large-scale analysis of protein tandem repeats. T-REKS can also be applied to the nucleotide sequences. AVAILABILITY The algorithm has been implemented in JAVA, the program is available upon request at http://bioinfo.montp.cnrs.fr/?r=t-reks. Protein Repeat DataBase generated by using T-REKS is accessible at http://bioinfo.montp.cnrs.fr/?r=repeatDB.
Collapse
Affiliation(s)
- Julien Jorda
- Centre de Recherches de Biochimie Macromoléculaire UMR 5237, CNRS, University of Montpellier 1 and 2, Montpellier, France.
| | | |
Collapse
|
18
|
Chaley MB, Nazipova NN, Kutyrkin VA. Statistical methods for detecting latent periodicity patterns in biological sequences: The case of small-size samples. PATTERN RECOGNITION AND IMAGE ANALYSIS 2009. [DOI: 10.1134/s1054661809020217] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
19
|
Merkel A, Gemmell N. Detecting short tandem repeats from genome data: opening the software black box. Brief Bioinform 2008; 9:355-66. [PMID: 18621747 DOI: 10.1093/bib/bbn028] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Short tandem repeats, specifically microsatellites, are widely used genetic markers, associated with human genetic diseases, and play an important role in various regulatory mechanisms and evolution. Despite their importance, much is yet unknown about their mutational dynamics. The increasing availability of genome data has led to several in silico studies of microsatellite evolution which have produced a vast range of algorithms and software for tandem repeat detection. Documentation of these tools is often sparse, or provided in a format that is impenetrable to most biologists without informatics background. This article introduces the major concepts behind repeat detecting software essential for informed tool selection. We reflect on issues such as parameter settings and program bias, as well as redundancy filtering and efficiency using examples from the currently available range of programs, to provide an integrated comparison and practical guide to microsatellite detecting programs.
Collapse
Affiliation(s)
- Angelika Merkel
- School of Biological Sciences, University of Canterbury, Private Bag 4800, Christchurch 8041, New Zealand.
| | | |
Collapse
|
20
|
Wexler Y, Yakhini Z, Kashi Y, Geiger D. Finding approximate tandem repeats in genomic sequences. J Comput Biol 2008; 12:928-42. [PMID: 16201913 DOI: 10.1089/cmb.2005.12.928] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
An efficient algorithm is presented for detecting approximate tandem repeats in genomic sequences. The algorithm is based on a flexible statistical model which allows a wide range of definitions of approximate tandem repeats. The ideas and methods underlying the algorithm are described and its effectiveness on genomic data is demonstrated.
Collapse
Affiliation(s)
- Ydo Wexler
- Computer Science Department, Technion, Technion Campus, Haifa, 32000, Israel.
| | | | | | | |
Collapse
|
21
|
Domaniç NO, Preparata FP. A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. J Comput Biol 2008; 14:873-91. [PMID: 17803368 DOI: 10.1089/cmb.2007.0018] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022] Open
Abstract
An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.
Collapse
Affiliation(s)
- Nevzat Onur Domaniç
- Department of Computer Science, Brown University, Providence, Rhode Island 02912, USA
| | | |
Collapse
|
22
|
Shelenkov AA, Skryabin KG, Korotkov EV. Classification analysis of a latent dinucleotide periodicity of plant genomes. RUSS J GENET+ 2008. [DOI: 10.1134/s1022795408010134] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
23
|
Model of perfect tandem repeat with random pattern and empirical homogeneity testing poly-criteria for latent periodicity revelation in biological sequences. Math Biosci 2008; 211:186-204. [DOI: 10.1016/j.mbs.2007.10.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2007] [Revised: 10/19/2007] [Accepted: 10/26/2007] [Indexed: 11/23/2022]
|
24
|
Newman AM, Cooper JB. XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics 2007; 8:382. [PMID: 17931424 PMCID: PMC2233649 DOI: 10.1186/1471-2105-8-382] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2007] [Accepted: 10/11/2007] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Biological sequence repeats arranged in tandem patterns are widespread in DNA and proteins. While many software tools have been designed to detect DNA tandem repeats (TRs), useful algorithms for identifying protein TRs with varied levels of degeneracy are still needed. RESULTS To address limitations of current repeat identification methods, and to provide an efficient and flexible algorithm for the detection and analysis of TRs in protein sequences, we designed and implemented a new computational method called XSTREAM. Running time tests confirm the practicality of XSTREAM for analyses of multi-genome datasets. Each of the key capabilities of XSTREAM (e.g., merging, nesting, long-period detection, and TR architecture modeling) are demonstrated using anecdotal examples, and the utility of XSTREAM for identifying TR proteins was validated using data from a recently published paper. CONCLUSION We show that XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy. Because of these useful features, XSTREAM has significant potential for the discovery of naturally-evolved modular proteins with applications for engineering novel biostructural and biomimetic materials, and identifying new vaccine and diagnostic targets.
Collapse
Affiliation(s)
- Aaron M Newman
- Biomolecular Science and Engineering Program, University of California, Santa Barbara, CA 93106, USA.
| | | |
Collapse
|
25
|
Leclercq S, Rivals E, Jarne P. Detecting microsatellites within genomes: significant variation among algorithms. BMC Bioinformatics 2007; 8:125. [PMID: 17442102 PMCID: PMC1876248 DOI: 10.1186/1471-2105-8-125] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2006] [Accepted: 04/18/2007] [Indexed: 11/25/2022] Open
Abstract
Background Microsatellites are short, tandemly-repeated DNA sequences which are widely distributed among genomes. Their structure, role and evolution can be analyzed based on exhaustive extraction from sequenced genomes. Several dedicated algorithms have been developed for this purpose. Here, we compared the detection efficiency of five of them (TRF, Mreps, Sputnik, STAR, and RepeatMasker). Results Our analysis was first conducted on the human X chromosome, and microsatellite distributions were characterized by microsatellite number, length, and divergence from a pure motif. The algorithms work with user-defined parameters, and we demonstrate that the parameter values chosen can strongly influence microsatellite distributions. The five algorithms were then compared by fixing parameters settings, and the analysis was extended to three other genomes (Saccharomyces cerevisiae, Neurospora crassa and Drosophila melanogaster) spanning a wide range of size and structure. Significant differences for all characteristics of microsatellites were observed among algorithms, but not among genomes, for both perfect and imperfect microsatellites. Striking differences were detected for short microsatellites (below 20 bp), regardless of motif. Conclusion Since the algorithm used strongly influences empirical distributions, studies analyzing microsatellite evolution based on a comparison between empirical and theoretical size distributions should therefore be considered with caution. We also discuss why a typological definition of microsatellites limits our capacity to capture their genomic distributions.
Collapse
Affiliation(s)
- Sébastien Leclercq
- LIRMM, UMR 5506 CNRS – Université de Montpellier II, 161 rue Ada, Montpellier, France
- CEFE, UMR 5175 CNRS – Université de Montpellier II, 1919 route de Mende, Montpellier, France
| | - Eric Rivals
- LIRMM, UMR 5506 CNRS – Université de Montpellier II, 161 rue Ada, Montpellier, France
| | - Philippe Jarne
- CEFE, UMR 5175 CNRS – Université de Montpellier II, 1919 route de Mende, Montpellier, France
| |
Collapse
|
26
|
Paar V, Basar I, Rosandić M, Glunčić M. Consensus higher order repeats and frequency of string distributions in human genome. Curr Genomics 2007; 8:93-111. [PMID: 18660848 PMCID: PMC2435359 DOI: 10.2174/138920207780368169] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2007] [Revised: 01/26/2007] [Accepted: 01/30/2007] [Indexed: 02/01/2023] Open
Abstract
Key string algorithm (KSA) could be viewed as robust computational generalization of restriction enzyme method. KSA enables robust and effective identification and structural analyzes of any given genomic sequences, like in the case of NCBI assembly for human genome. We have developed a method, using total frequency distribution of all r-bp key strings in dependence on the fragment length l, to determine the exact size of all repeats within the given genomic sequence, both of monomeric and HOR type. Subsequently, for particular fragment lengths equal to each of these repeat sizes we compute the partial frequency distribution of r-bp key strings; the key string with highest frequency is a dominant key string, optimal for segmentation of a given genomic sequence into repeat units. We illustrate how a wide class of 3-bp key strings leads to a key-string-dependent periodic cell which enables a simple identification and consensus length determinations of HORs, or any other highly convergent repeat of monomeric or HOR type, both tandem or dispersed. We illustrated KSA application for HORs in human genome and determined consensus HORs in the Build 35.1 assembly. In the next step we compute suprachromosomal family classification and CENP-B box / pJalpha distributions for HORs. In the case of less convergent repeats, like for example monomeric alpha satellite (20-40% divergence), we searched for optimal compact key string using frequency method and developed a concept of composite key string (GAAAC--CTTTG) or flexible relaxation (28 bp key string) which provides both monomeric alpha satellites as well as alpha monomer segmentation of internal HOR structure. This method is convenient also for study of R-strand (direct) / S-strand (reverse complement) alpha monomer alternations. Using KSA we identified 16 alternating regions of R-strand and S-strand monomers in one contig in choromosome 7. Use of CENP-B box and/or pJalpha motif as key string is suitable both for identification of HORs and monomeric pattern as well as for studies of CENP-B box / pJalpha distribution. As an example of application of KSA to sequences outside of HOR regions we present our finding of a tandem with highly convergent 3434-bp Long monomer in chromosome 5 (divergence less then 0.3%).
Collapse
Affiliation(s)
- Vladimir Paar
- Faculty of Science, University of Zagreb, Bijenička 32, 10000 Zagreb, Croatia
| | - Ivan Basar
- Faculty of Science, University of Zagreb, Bijenička 32, 10000 Zagreb, Croatia
| | - Marija Rosandić
- Department of Internal Medicine,
University Hospital Rebro, Kišpatićeva 12, 10000 Zagreb, Croatia
| | - Matko Glunčić
- Faculty of Science, University of Zagreb, Bijenička 32, 10000 Zagreb, Croatia
| |
Collapse
|
27
|
Abstract
MOTIVATION A tandem repeat in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats occur in the genomes of both eukaryotic and prokaryotic organisms. They are important in numerous fields including disease diagnosis, mapping studies, human identity testing (DNA fingerprinting), sequence homology and population studies. Although tandem repeats have been used by biologists for many years, there are few tools available for performing an exhaustive search for all tandem repeats in a given sequence. RESULTS In this paper we describe an efficient algorithm for finding all tandem repeats within a sequence, under the edit distance measure. The contributions of this paper are two-fold: theoretical and practical. We present a precise definition for tandem repeats over the edit distance and an efficient, deterministic algorithm for finding these repeats. AVAILABILITY The algorithm has been implemented in C++, and the software is available upon request and can be used at http://www.sci.brooklyn.cuny.edu/~sokol/trepeats. The use of this tool will assist biologists in discovering new ways that tandem repeats affect both the structure and function of DNA and protein molecules.
Collapse
Affiliation(s)
- Dina Sokol
- Department of Computer and Information Science, Brooklyn College of the City University of New York, Brooklyn, NY, USA.
| | | | | |
Collapse
|
28
|
Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of amino acid latent periodicity within 94 protein families. J Comput Biol 2006; 13:946-64. [PMID: 16761920 DOI: 10.1089/cmb.2006.13.946] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Here, we have applied information decomposition, cyclic profile alignment, and noise decomposition techniques to search for latent repeats within protein families of various functions. We have identified 94 protein families with a family-specific periodicity. In each case, the periodic element was found in greater than 70% of family members. Latent periodicity profiles with specific length and signature were obtained in each case. The possible relationship between the periodic elements thus identified and the evolutionary development of the protein families are discussed with specific reference to the possibility that there is a correlation between the periodic elements and protein function.
Collapse
Affiliation(s)
- Vera P Turutina
- Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya, Moscow
| | | | | | | | | |
Collapse
|
29
|
Boeva V, Regnier M, Papatsenko D, Makeev V. Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics 2006; 22:676-84. [PMID: 16403795 DOI: 10.1093/bioinformatics/btk032] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023] Open
Abstract
MOTIVATION Genomic sequences are highly redundant and contain many types of repetitive DNA. Fuzzy tandem repeats (FTRs) are of particular interest. They are found in regulatory regions of eukaryotic genes and are reported to interact with transcription factors. However, accurate assessment of FTR occurrences in different genome segments requires specific algorithm for efficient FTR identification and classification. RESULTS We have obtained formulas for P-values of FTR occurrence and developed an FTR identification algorithm implemented in TandemSWAN software. Using TandemSWAN we compared the structure and the occurrence of FTRs with short period length (up to 24 bp) in coding and non-coding regions including UTRs, heterochromatic, intergenic and enhancer sequences of Drosophila melanogaster and Drosophila pseudoobscura. Tandems with period three and its multiples were found in coding segments, whereas FTRs with periods multiple of six are overrepresented in all non-coding segment. Periods equal to 5-7 and 11-14 were characteristic of the enhancer regions and other non-coding regions close to genes. AVAILABILITY TandemSWAN web page, stand-alone version and documentation can be found at http://bioinform.genetika.ru/projects/swan/www/ SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Valentina Boeva
- Department of Bioengineering and Bioinformatics, Moscow State University Moscow, Russia.
| | | | | | | |
Collapse
|
30
|
Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of latent periodicity in amino acid sequences of protein families. BIOCHEMISTRY. BIOKHIMIIA 2006; 71:18-31. [PMID: 16457614 DOI: 10.1134/s0006297906010032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/06/2023]
Abstract
For detection of the latent periodicity of the protein families responsible for various biological functions, methods of information decomposition, cyclic profile alignment, and the method of noise decomposition have been used. The latent periodicity, being specific to a particular family, is recognized in 94 of 110 analyzed protein families. Family specific periodicity was found for more than 70% of amino acid sequences in each of these families. Based on such sequences the characteristic profile of the latent periodicity has been deduced for each family. Possible relationship between the recognized latent periodicity, evolution of proteins, and their structural organization is discussed.
Collapse
Affiliation(s)
- V P Turutina
- Bioengineering Center, Russian Academy of Sciences, Moscow, Russia
| | | | | | | | | |
Collapse
|
31
|
Theoretical and Practical Improvements on the RMQ-Problem, with Applications to LCA and LCE. COMBINATORIAL PATTERN MATCHING 2006. [DOI: 10.1007/11780441_5] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
|
32
|
Shelenkov A, Skryabin K, Korotkov E. Search and Classification of Potential Minisatellite Sequences from Bacterial Genomes. DNA Res 2006; 13:89-102. [PMID: 16980713 DOI: 10.1093/dnares/dsl004] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
We used the method of Information Decomposition developed by us to identify the latent dinucleotide periodicity regions in bacterial genomes. The number of potential minisatellite sequences obtained at high level of statistical significance was 454. Then we classified the periodicity matrices and obtained 45 classes. We used the other new method developed by us--Modified Profile Analysis--to reveal more periodic sequences in the presence of indels using the classes obtained. The number of sequences found by combination of these two methods was 3949. Most of them cannot be revealed by other methods including dynamic programming and Fourier transformation.
Collapse
Affiliation(s)
- Andrew Shelenkov
- Bioengineering Centre of Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, 117312 Moscow, Russia.
| | | | | |
Collapse
|
33
|
Reneker J, Shyu CR. Refined repetitive sequence searches utilizing a fast hash function and cross species information retrievals. BMC Bioinformatics 2005; 6:111. [PMID: 15869708 PMCID: PMC1131890 DOI: 10.1186/1471-2105-6-111] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2004] [Accepted: 05/03/2005] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Searching for small tandem/disperse repetitive DNA sequences streamlines many biomedical research processes. For instance, whole genomic array analysis in yeast has revealed 22 PHO-regulated genes. The promoter regions of all but one of them contain at least one of the two core Pho4p binding sites, CACGTG and CACGTT. In humans, microsatellites play a role in a number of rare neurodegenerative diseases such as spinocerebellar ataxia type 1 (SCA1). SCA1 is a hereditary neurodegenerative disease caused by an expanded CAG repeat in the coding sequence of the gene. In bacterial pathogens, microsatellites are proposed to regulate expression of some virulence factors. For example, bacteria commonly generate intra-strain diversity through phase variation which is strongly associated with virulence determinants. A recent analysis of the complete sequences of the Helicobacter pylori strains 26695 and J99 has identified 46 putative phase-variable genes among the two genomes through their association with homopolymeric tracts and dinucleotide repeats. Life scientists are increasingly interested in studying the function of small sequences of DNA. However, current search algorithms often generate thousands of matches -- most of which are irrelevant to the researcher. RESULTS We present our hash function as well as our search algorithm to locate small sequences of DNA within multiple genomes. Our system applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. We discuss our incorporation of the Gene Ontology (GO) database into these algorithms. We conduct an exhaustive time analysis of our system for various repetitive sequence lengths. For instance, a search for eight bases of sequence within 3.224 GBases on 49 different chromosomes takes 1.147 seconds on average. To illustrate the relevance of the search results, we conduct a search with and without added annotation terms for the yeast Pho4p binding sites, CACGTG and CACGTT. Also, a cross-species search is presented to illustrate how potential hidden correlations in genomic data can be quickly discerned. The findings in one species are used as a catalyst to discover something new in another species. These experiments also demonstrate that our system performs well while searching multiple genomes -- without the main memory constraints present in other systems. CONCLUSION We present a time-efficient algorithm to locate small segments of DNA and concurrently to search the annotation data accompanying the sequence. Genome-wide searches for short sequences often return hundreds of hits. Our experiments show that subsequently searching the annotation data can refine and focus the results for the user. Our algorithms are also space-efficient in terms of main memory requirements. Source code is available upon request.
Collapse
Affiliation(s)
- Jeff Reneker
- Department of Computer Science, University of Missouri, Columbia, USA
- Department of Health Management & Informatics, University of Missouri, Columbia, USA
| | - Chi-Ren Shyu
- Department of Computer Science, University of Missouri, Columbia, USA
- Department of Health Management & Informatics, University of Missouri, Columbia, USA
| |
Collapse
|
34
|
Reneker J, Shyu CR, Zeng P, Polacco JC, Gassmann W. ACMES: fast multiple-genome searches for short repeat sequences with concurrent cross-species information retrieval. Nucleic Acids Res 2004; 32:W649-53. [PMID: 15215469 PMCID: PMC441593 DOI: 10.1093/nar/gkh455] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
We have developed a web server for the life sciences community to use to search for short repeats of DNA sequence of length between 3 and 10,000 bases within multiple species. This search employs a unique and fast hash function approach. Our system also applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. Furthermore, we have incorporated a part of the Gene Ontology database into our information retrieval algorithms to broaden the coverage of the search. Our web server and tutorial can be found at http://acmes.rnet.missouri.edu.
Collapse
Affiliation(s)
- Jeff Reneker
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
| | | | | | | | | |
Collapse
|
35
|
Kolpakov R, Bana G, Kucherov G. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res 2003; 31:3672-8. [PMID: 12824391 PMCID: PMC169196 DOI: 10.1093/nar/gkg617] [Citation(s) in RCA: 218] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes, it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for a fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.
Collapse
Affiliation(s)
- Roman Kolpakov
- French-Russian Institute for Informatics and Applied Mathematics, Moscow University, 119899 Moscow, Russia
| | | | | |
Collapse
|
36
|
Huang CH, Rajasekaran S. Parallel pattern identification in biological sequences on clusters. IEEE Trans Nanobioscience 2003; 2:29-34. [PMID: 15382420 DOI: 10.1109/tnb.2003.810165] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Tandem repeats are ubiquitous sequence features in both prokaryotic and eukaryotic genomes. They are known to cause several inherited neurological diseases in humans. Identifying these patterns is a highly computation-intensive process. Previous parallel implementations use straightforward domain decomposition based on existing sequential algorithms and rely on parallel machines with low-latency interconnection network and fast hardware support for processor synchronization. Our research exploits the superior cost effectiveness and flexibility achieved through low-cost clusters to speed up biological computations by designing communication-efficient parallel algorithms for pattern identification. This paper presents a low communication-overhead parallel algorithm for pattern identification in biological sequences. Given a biological sequence of length n and a pattern of length m, we conclude an algorithm with five computation/communication phases, each requiring O(n) computation time and only O(p) message units. The low communication overhead of the algorithm is essential in achieving reasonable speedups on clusters, where the inter-processor communication latency is usually higher.
Collapse
Affiliation(s)
- Chun-Hsi Huang
- Department of Computer Science and Engineering, U-155, University of Connecticut, Storrs, CT 06269, USA.
| | | |
Collapse
|