1
|
Kinney N, Titus-Glover K, Wren JD, Varghese RT, Michalak P, Liao H, Anandakrishnan R, Pulenthiran A, Kang L, Garner HR. CAGm: a repository of germline microsatellite variations in the 1000 genomes project. Nucleic Acids Res 2019; 47:D39-D45. [PMID: 30329086 PMCID: PMC6323891 DOI: 10.1093/nar/gky969] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2018] [Revised: 10/04/2018] [Accepted: 10/05/2018] [Indexed: 12/14/2022] Open
Abstract
The human genome harbors an abundance of repetitive DNA; however, its function continues to be debated. Microsatellites-a class of short tandem repeat-are established as an important source of genetic variation. Array length variants are common among microsatellites and affect gene expression; but, efforts to understand the role and diversity of microsatellite variation has been hampered by several challenges. Without adequate depth, both long-read and short-read sequencing may not detect the variants present in a sample; additionally, large sample sizes are needed to reveal the degree of population-level polymorphism. To address these challenges we present the Comparative Analysis of Germline Microsatellites (CAGm): a database of germline microsatellites from 2529 individuals in the 1000 genomes project. A key novelty of CAGm is the ability to aggregate microsatellite variation by population, ethnicity (super population) and gender. The database provides advanced searching for microsatellites embedded in genes and functional elements. All data can be downloaded as Microsoft Excel spreadsheets. Two use-case scenarios are presented to demonstrate its utility: a mononucleotide (A) microsatellite at the BAT-26 locus and a dinucleotide (CA) microsatellite in the coding region of FGFRL1. CAGm is freely available at http://www.cagmdb.org/.
Collapse
Affiliation(s)
- Nicholas Kinney
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
| | - Kyle Titus-Glover
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | - Jonathan D Wren
- Arthritis and Clinical Immunology Research Program, Division of Genomics and Data Sciences Oklahoma Medical Research Foundation, Oklahoma City, OK 73104, USA
- Department of Biochemistry and Molecular Biology, University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA
| | - Robin T Varghese
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
| | - Pawel Michalak
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
- One Health Research Center, Virginia-Maryland College of Veterinary Medicine, 1410 Prices Fork Rd, Blacksburg, VA 24060, USA
- Institute of Evolution,University of Haifa, Abba Khoushy Ave 199, Haifa, 3498838, Israel
| | - Han Liao
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
| | - Ramu Anandakrishnan
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
| | - Arichanah Pulenthiran
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
| | - Lin Kang
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
| | - Harold R Garner
- Edward Via College of Osteopathic Medicine, 2265 Kraft Drive, Blacksburg, VA 24060, USA
- Gibbs Cancer Center & Research Institute, 101 E Wood St., Spartanburg, SC 29303, USA
| |
Collapse
|
2
|
Chaley M, Kutyrkin V, Tulbasheva G, Teplukhina E, Nazipova N. HeteroGenome: database of genome periodicity. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2014; 2014:bau040. [PMID: 24857969 PMCID: PMC4038257 DOI: 10.1093/database/bau040] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
We present the first release of the HeteroGenome database collecting latent periodicity regions in genomes. Tandem repeats and highly divergent tandem repeats along with the regions of a new type of periodicity, known as profile periodicity, have been collected for the genomes of Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. We obtained data with the aid of a spectral-statistical approach to search for reliable latent periodicity regions (with periods up to 2000 bp) in DNA sequences. The original two-level mode of data presentation (a broad view of the region of latent periodicity and a second level indicating conservative fragments of its structure) was further developed to enable us to obtain the estimate, without redundancy, that latent periodicity regions make up ∼10% of the analyzed genomes. Analysis of the quantitative and qualitative content of located periodicity regions on all chromosomes of the analyzed organisms revealed dominant characteristic types of periodicity in the genomes. The pattern of density distribution of latent periodicity regions on chromosome unambiguously characterizes each chromosome in genome. Database URL:http://www.jcbi.ru/lp_baze/
Collapse
Affiliation(s)
- Maria Chaley
- Laboratory of Bioinformatics, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st. 4, 142290 Pushchino, Russia and Department of Computational Mathematics and Mathematical Physics, Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st., 5, 105005 Moscow, Russia
| | - Vladimir Kutyrkin
- Laboratory of Bioinformatics, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st. 4, 142290 Pushchino, Russia and Department of Computational Mathematics and Mathematical Physics, Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st., 5, 105005 Moscow, Russia
| | - Gayane Tulbasheva
- Laboratory of Bioinformatics, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st. 4, 142290 Pushchino, Russia and Department of Computational Mathematics and Mathematical Physics, Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st., 5, 105005 Moscow, Russia
| | - Elena Teplukhina
- Laboratory of Bioinformatics, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st. 4, 142290 Pushchino, Russia and Department of Computational Mathematics and Mathematical Physics, Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st., 5, 105005 Moscow, Russia
| | - Nafisa Nazipova
- Laboratory of Bioinformatics, Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st. 4, 142290 Pushchino, Russia and Department of Computational Mathematics and Mathematical Physics, Moscow State Technical University n.a. N.E. Bauman, the 2nd Baumanskaya st., 5, 105005 Moscow, Russia
| |
Collapse
|
3
|
Schaper E, Kajava AV, Hauser A, Anisimova M. Repeat or not repeat?--Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res 2012; 40:10005-17. [PMID: 22923522 PMCID: PMC3488214 DOI: 10.1093/nar/gks726] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Tandem repeats (TRs) represent one of the most prevalent features of genomic sequences. Due to their abundance and functional significance, a plethora of detection tools has been devised over the last two decades. Despite the longstanding interest, TR detection is still not resolved. Our large-scale tests reveal that current detectors produce different, often nonoverlapping inferences, reflecting characteristics of the underlying algorithms rather than the true distribution of TRs in genomic data. Our simulations show that the power of detecting TRs depends on the degree of their divergence, and repeat characteristics such as the length of the minimal repeat unit and their number in tandem. To reconcile the diverse predictions of current algorithms, we propose and evaluate several statistical criteria for measuring the quality of predicted repeat units. In particular, we propose a model-based phylogenetic classifier, entailing a maximum-likelihood estimation of the repeat divergence. Applied in conjunction with the state of the art detectors, our statistical classification scheme for inferred repeats allows to filter out false-positive predictions. Since different algorithms appear to specialize at predicting TRs with certain properties, we advise applying multiple detectors with subsequent filtering to obtain the most complete set of genuine repeats.
Collapse
Affiliation(s)
- Elke Schaper
- Computer Science Department, ETH Zürich, Universitätsstrasse 6, CH-8092 Zürich, Switzerland.
| | | | | | | |
Collapse
|
4
|
To detect and analyze sequence repeats whatever be their origin. Methods Mol Biol 2012; 859:69-90. [PMID: 22367866 DOI: 10.1007/978-1-61779-603-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/11/2023]
Abstract
The development of numerous programs for the identification of mobile elements raises the issue of the founding concepts that are shared in their design. This is necessary for at least three reasons. First, the cost of designing, developing, debugging, and maintaining software could present a danger of distracting biologists from their main bioanalysis tasks that require a lot of energy. Some key concepts on exact repeats are always underlying the search for genomic repeats and we recall the most important ones. All along the chapter, we try to select practical tools that may help the design of new identification pipelines. Second, the huge increase of sequence production capacities requires to use the most efficient data structures and algorithms to scale up tools in front of the data deluge. This paper provides an up-to-date glimpse on the art of string indexing and string matching. Third, there exists a growing knowledge on the architecture of mobile elements built from literature and the analysis of results generated by these pipelines. Besides data management which has led to the discovery of new families or new elements of a family, the community has an increasing need in knowledge management tools in order to compare, validate, or simply keep trace of mobile element models. We end the paper with first considerations on what could help the near future of such research on models.
Collapse
|
5
|
Bun1 C, Ziccardi W, Doering J, Putonti C. MilP: The Monomer Identification and Isolation Program. Evol Bioinform Online 2012. [PMCID: PMC3382395 DOI: 10.4137/ebo.s9248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Repetitive elements within genomic DNA are both functionally and evolutionarily informative. Discovering these sequences ab initio is computationally challenging, compounded by the fact that selection on these repeats is often relaxed; thus sequence identity between repetitive elements can vary significantly. Here we present a new application, the Monomer Identification and Isolation Program (MiIP), which provides functionality to both search for a particular repeat as well as discover repetitive elements within a larger genomic sequence. To compare MiIP’s performance with other repeat detection tools, analysis was conducted for synthetic sequences as well as several α21-II clones and HC21 BAC sequences. The primary benefit of MiIP is the fact that it is a single tool capable of searching for both known monomeric sequences as well as discovering the occurrence of repeats ab initio, per the user’s required sensitivity of the search. Furthermore, the report functionality helps easily facilitate subsequent phylogenetic analysis.
Collapse
Affiliation(s)
- Christopher Bun1
- Department of Computer Science, Loyola University Chicago, 820 N Michigan Avenue, Chicago, IL 60611 USA
- Current Affiliation: Department of Computer Science, University of Chicago, 1100 E 58th Street, Chicago, IL 60637 USA
| | - William Ziccardi
- Department of Biology, Loyola University Chicago, 1032 W Sheridan Road, Chicago, IL 60660 USA
| | - Jeffrey Doering
- Department of Biology, Loyola University Chicago, 1032 W Sheridan Road, Chicago, IL 60660 USA
| | - Catherine Putonti
- Department of Computer Science, Loyola University Chicago, 820 N Michigan Avenue, Chicago, IL 60611 USA
- Department of Biology, Loyola University Chicago, 1032 W Sheridan Road, Chicago, IL 60660 USA
- Bioinformatics Program, Loyola University Chicago, 1032 W Sheridan Road, Chicago, IL 60660 USA
| |
Collapse
|