1
Whole Exome-Sequencing of Pooled Genomic DNA Samples to Detect Quantitative Trait Loci in Esotropia and Exotropia of Strabismus in Japanese. Life (Basel) 2021; 12:life12010041. [PMID: 35054434 PMCID: PMC8777842 DOI: 10.3390/life12010041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/12/2021] [Revised: 11/30/2021] [Accepted: 12/23/2021] [Indexed: 11/16/2022]
Abstract
BACKGROUND Esotropia and exotropia are two major phenotypes of comitant strabismus. It remains controversial whether esotropia and exotropia would share common genetic backgrounds. In this study, we used a quantitative trait locus (QTL)-sequencing pipeline for diploid plants to screen for susceptibility loci of strabismus in whole exome sequencing of pooled genomic DNAs of individuals. METHODS Pooled genomic DNA (2.5 ng each) of 20 individuals in three groups, Japanese patients with esotropia and exotropia, and normal members in the families, was sequenced twice after exome capture, and the first and second sets of data in each group were combined to increase the read depth. The SNP index, as the ratio of variant genotype reads to all reads, and Δ(SNP index) values, as the difference of SNP index between two groups, were calculated by sliding window analysis with a 4 Mb window size and 10 kb slide size. The rows of 200 "N"s were inserted as a putative 200-b spacer between every adjoining locus to depict Δ(SNP index) plots on each chromosome. SNP positions with depth < 20 as well as SNP positions with SNP index of <0.3 were excluded. RESULTS After the exclusion of SNPs, 12,242 SNPs in esotropia/normal group and 12,108 SNPs in exotropia/normal group remained. The patterns of the Δ(SNP index) plots on each chromosome appeared different between esotropia/normal group and exotropia/normal group. When the consecutive groups of SNPs on each chromosome were set at three patterns: SNPs in each cytogenetic band, 50 consecutive sliding SNPs, and SNPs in 4 Mb window size with 10 kb slide size, p values (Wilcoxon signed rank test) and Q values (false discovery rate) in a few loci as Manhattan plots showed significant differences in comparison between the Δ(SNP index) in the esotropia/normal group and exotropia/normal group. 
CONCLUSIONS The pooled DNA sequencing and QTL mapping approach for plants could provide an overview of the genetic background on each chromosome and would suggest different genetic backgrounds for the two major phenotypes of comitant strabismus, esotropia and exotropia.
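The SNP index and Δ(SNP index) described in the METHODS above can be illustrated with a minimal Python sketch. The `Snp` record, the function names, and the toy read counts are hypothetical. The filter is one reading of the abstract (drop a position if either pool has depth < 20, or if both pools have an SNP index < 0.3) and is an assumption rather than the authors' exact rule.

```python
from dataclasses import dataclass

@dataclass
class Snp:
    pos: int        # position on the chromosome (bp)
    alt_reads: int  # reads carrying the variant allele in the pool
    depth: int      # all reads covering the position in the pool

def snp_index(snp: Snp) -> float:
    """Ratio of variant-allele reads to all reads at one position."""
    return snp.alt_reads / snp.depth

def delta_snp_index(case, control, min_depth=20, min_index=0.3):
    """Per-position difference of SNP index between two pools.

    Assumed filter: skip positions with depth < min_depth in either pool,
    and positions where both pools have an SNP index < min_index.
    """
    control_by_pos = {s.pos: s for s in control}
    delta = {}
    for c in case:
        n = control_by_pos.get(c.pos)
        if n is None:
            continue
        if c.depth < min_depth or n.depth < min_depth:
            continue
        ci, ni = snp_index(c), snp_index(n)
        if ci < min_index and ni < min_index:
            continue
        delta[c.pos] = ci - ni
    return delta

def sliding_mean(delta, chrom_len, window=4_000_000, step=10_000):
    """Mean delta per window; windows without any SNP are skipped."""
    means = []
    for start in range(0, chrom_len, step):
        values = [d for pos, d in delta.items() if start <= pos < start + window]
        if values:
            means.append((start, sum(values) / len(values)))
    return means

# Toy pools (hypothetical counts), with a tiny window so the output is readable.
case = [Snp(100, 15, 30), Snp(200, 5, 40), Snp(300, 10, 10)]
control = [Snp(100, 12, 30), Snp(200, 4, 40), Snp(300, 5, 10)]
delta = delta_snp_index(case, control)
windows = sliding_mean(delta, chrom_len=400, window=200, step=100)
```

In this toy input only position 100 passes both filters. Position 200 is dropped for low SNP index in both pools, and position 300 for low depth.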
2
Cechova M. Probably Correct: Rescuing Repeats with Short and Long Reads. Genes (Basel) 2020; 12:48. [PMID: 33396198 PMCID: PMC7823596 DOI: 10.3390/genes12010048] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/09/2020] [Revised: 12/23/2020] [Accepted: 12/24/2020] [Indexed: 02/07/2023] Open
Abstract
Ever since the introduction of high-throughput sequencing following the human genome project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50-69%) is repetitive. As a result, a sizable proportion of sequencing reads is multi-mapping, i.e., without a unique placement in the genome. The two key parameters for whether or not a read is multi-mapping are the read length and genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from "telomere to telomere". Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, disorienting the aligners and assemblers both in accuracy and speed. Here, I review the proposed and implemented solutions to the repeat resolution and the multi-mapping read problem, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect the shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.
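The dependence of multi-mapping on read length can be shown with a small Python sketch. The toy genome, the 7-bp repeat, and the function name are hypothetical, and exact substring matching stands in for a real aligner.

```python
from collections import Counter

def multi_mapping_fraction(genome: str, read_len: int) -> float:
    """Fraction of error-free reads of a given length that occur at
    more than one position in the genome (exact matches only)."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    multi = sum(1 for read in reads if counts[read] > 1)
    return multi / len(reads)

# Toy genome: the 7-bp repeat GATTACA appears twice with different flanks.
repeat = "GATTACA"
genome = "AAAC" + repeat + "CCGT" + repeat + "TTGA"

short = multi_mapping_fraction(genome, 4)   # reads fit inside the repeat
exact = multi_mapping_fraction(genome, 7)   # only the repeat itself is ambiguous
longer = multi_mapping_fraction(genome, 8)  # every read reaches unique flank
```

Once a read is longer than the repeat, each read overlaps unique flanking sequence and the ambiguous fraction drops to zero in this toy case.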
Affiliation(s)
- Monika Cechova
- Genetics and Reproductive Biotechnologies, Veterinary Research Institute, Central European Institute of Technology (CEITEC), 621 00 Brno, Czech Republic
3
Mokveld T, Linthorst J, Al-Ars Z, Holstege H, Reinders M. CHOP: haplotype-aware path indexing in population graphs. Genome Biol 2020; 21:65. [PMID: 32160922 PMCID: PMC7066762 DOI: 10.1186/s13059-020-01963-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/06/2019] [Accepted: 02/18/2020] [Indexed: 12/20/2022] Open
Abstract
The practical use of graph-based reference genomes depends on the ability to align reads to them. Performing substring queries to paths through these graphs lies at the core of this task. The combination of increasing pattern length and encoded variations inevitably leads to a combinatorial explosion of the search space. Instead of heuristic filtering or pruning steps to reduce the complexity, we propose CHOP, a method that constrains the search space by exploiting haplotype information, bounding the search space to the number of haplotypes so that a combinatorial explosion is prevented. We show that CHOP can be applied to large and complex datasets, by applying it on a graph-based representation of the human genome encoding all 80 million variants reported by the 1000 Genomes Project.
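The difference between indexing every path and indexing only haplotype-supported paths can be sketched in Python. This illustrates the general idea only and is not CHOP's actual algorithm. The site list, the haplotype strings, and both function names are hypothetical.

```python
from itertools import product

def all_path_kmers(sites, k):
    """k-mers spelled by every allele combination (unconstrained paths)."""
    kmers = set()
    for combo in product(*sites):
        seq = "".join(combo)
        kmers.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return kmers

def haplotype_kmers(haplotypes, k):
    """k-mers spelled only by the observed haplotype sequences."""
    kmers = set()
    for seq in haplotypes:
        kmers.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return kmers

# Toy region: seven positions, three of them biallelic (2^3 = 8 possible paths).
sites = [("A",), ("C", "T"), ("G",), ("A", "G"), ("T",), ("C", "A"), ("G",)]
# Only two of the eight possible paths are observed as haplotypes.
haplotypes = ["ACGATCG", "ATGGTAG"]

unconstrained = all_path_kmers(sites, 5)
constrained = haplotype_kmers(haplotypes, 5)
```

The unconstrained set grows with the number of allele combinations per window, while the constrained set is bounded by the number of haplotypes times the number of windows.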
Affiliation(s)
- Tom Mokveld
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
- Jasper Linthorst
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
- Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, Amsterdam, 1081 BT The Netherlands
- Zaid Al-Ars
- Computer Engineering, Delft University of Technology, Mekelweg 4, Delft, 2628 CD The Netherlands
- Henne Holstege
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
- Department of Clinical Genetics, VU University Medical Center, Van der Boechorststraat 7, Amsterdam, 1081 BT The Netherlands
- Marcel Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, Delft, 2628 XE The Netherlands
4
Li R, Fu W, Su R, Tian X, Du D, Zhao Y, Zheng Z, Chen Q, Gao S, Cai Y, Wang X, Li J, Jiang Y. Towards the Complete Goat Pan-Genome by Recovering Missing Genomic Segments From the Reference Genome. Front Genet 2019; 10:1169. [PMID: 31803240 PMCID: PMC6874019 DOI: 10.3389/fgene.2019.01169] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 10/12/2018] [Accepted: 10/23/2019] [Indexed: 01/08/2023] Open
Abstract
It is broadly expected that next generation sequencing will ultimately generate a complete genome as is the latest goat reference genome (ARS1), which is considered to be one of the most continuous assemblies in livestock. However, the rich diversity of worldwide goat breeds indicates that a genome from one individual would be insufficient to represent the whole genomic contents of goats. By comparing nine de novo assemblies from seven sibling species of domestic goat with ARS1 and using resequencing and transcriptome data from goats for verification, we identified a total of 38.3 Mb sequences that were absent in ARS1. The pan-sequences contain genic fractions with considerable expression. Using the pan-genome (ARS1 together with the pan-sequences) as a reference genome, variation calling efficacy can be appreciably improved. A total of 56,657 spurious SNPs per individual were repressed and 24,414 novel SNPs per individual on average were recovered as a result of better reads mapping quality. The transcriptomic mapping rate was also increased by ∼1.15%. Our study demonstrated that comparing de novo assemblies from closely related species is an efficient and reliable strategy for finding missing sequences from the reference genome and could be applicable to other species. Pan-genome can serve as an improved reference genome in animals for a better exploration of the underlying genomic variations and could increase the probability of finding genotype-phenotype associations assessed by a comprehensive variation database containing much more differences between individuals. We have constructed a goat pan-genome web interface for data visualization (http://animal.nwsuaf.edu.cn/panGoat).
Affiliation(s)
- Ran Li
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Weiwei Fu
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Rui Su
- College of Animal Science, Inner Mongolia Agricultural University, Hohhot, China
- Xiaomeng Tian
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Duo Du
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Yue Zhao
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Zhuqing Zheng
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Qiuming Chen
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Shan Gao
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Yudong Cai
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Xihong Wang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
- Jinquan Li
- College of Animal Science, Inner Mongolia Agricultural University, Hohhot, China
- Yu Jiang
- Key Laboratory of Animal Genetics, Breeding and Reproduction of Shaanxi Province, College of Animal Science and Technology, Northwest A&F University, Yangling, China
5
Brain-enriched MicroRNA-184 is downregulated in older adults with major depressive disorder: A translational study. J Psychiatr Res 2019; 111:110-120. [PMID: 30716647 DOI: 10.1016/j.jpsychires.2019.01.019] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 07/13/2018] [Revised: 11/21/2018] [Accepted: 01/18/2019] [Indexed: 11/23/2022]
Abstract
Changes in microRNA (miRNA) expression have been described in major depressive disorder in young and middle-aged adults. However, no study has evaluated miRNA expression in older adults with major depression (or late-life depression [LLD]). Our primary aim was to evaluate the expression of miRNAs in subjects with LLD. We first evaluated the miRNA expression using next-generation sequencing (NGS) and then we validated the miRNAs found in NGS in an independent sample of LLD patients, using RT-qPCR. A Drosophila melanogaster model was used to evaluate the impact of changes in miRNA expression on behavior. NGS analysis showed that hsa-miR-184 (log2 fold change = -4.21, p = 1.2 × 10⁻³) and hsa-miR-1-3p (log2 fold change = -3.45, p = 1.3 × 10⁻²) were significantly downregulated in LLD compared to the control group. RT-qPCR validated the downregulation of hsa-miR-184 (p < 0.001), but not that of hsa-miR-1-3p. The knockout flies of the ortholog of hsa-miR-184 showed significantly reduced locomotor activity at 21-24 d.p.e. (p = 0.04) and worse memory retention at 21-24 d.p.e. (24 h post-stimulus, p = 0.02) compared to control flies. Our results demonstrated that subjects with LLD have significant downregulation of hsa-miR-184. Moreover, the knockout of hsa-miR-184 in flies led to depressive-like behaviors, which were more pronounced in older flies.
6
Srivastava K, Wollenberg KR, Flegel WA. The phylogeny of 48 alleles, experimentally verified at 21 kb, and its application to clinical allele detection. J Transl Med 2019; 17:43. [PMID: 30744658 PMCID: PMC6371619 DOI: 10.1186/s12967-019-1791-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/04/2019] [Accepted: 02/04/2019] [Indexed: 01/19/2023] Open
Abstract
Background Sequence information generated from next generation sequencing is often computationally phased using haplotype-phasing algorithms. Utilizing experimentally derived allele or haplotype information improves this prediction, as routinely used in HLA typing. We recently established a large dataset of long ERMAP alleles, which code for protein variants in the Scianna blood group system. We propose the phylogeny of this set of 48 alleles and identify evolutionary steps to derive the observed alleles. Methods The nucleotide sequence of > 21 kb each was used for all physically confirmed 48 ERMAP alleles that we previously published. Full-length sequences were aligned and variant sites were extracted manually. The Bayesian coalescent algorithm implemented in BEAST v1.8.3 was used to estimate a coalescent phylogeny for these variants and the allelic ancestral states at the internal nodes of the phylogeny. Results The phylogenetic analysis allowed us to identify the evolutionary relationships among the 48 ERMAP alleles, predict 4243 potential ancestral alleles and calculate a posterior probability for each of these unobserved alleles. Some of them coincide with observed alleles that are extant in the population. Conclusions Our proposed strategy places known alleles in a phylogenetic framework, allowing us to describe as-yet-undiscovered alleles. In this new approach, which relies heavily on the accuracy of the alleles used for the phylogenetic analysis, an expanded set of predicted alleles can be used to infer alleles when large genotype data are analyzed, as typically generated by high-throughput sequencing. The alleles identified by studies like ours may be utilized in designing of microarray technologies, imputing of genotypes and mapping of next generation sequencing data.
Affiliation(s)
- Kshitij Srivastava
- Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
- Kurt R Wollenberg
- Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA
- Willy A Flegel
- Laboratory Services Section, Department of Transfusion Medicine, NIH Clinical Center, National Institutes of Health, Bethesda, MD, 20892, USA
7
Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs. Sci Rep 2018; 8:10862. [PMID: 30022108 PMCID: PMC6052005 DOI: 10.1038/s41598-018-29190-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/06/2018] [Accepted: 06/27/2018] [Indexed: 12/29/2022] Open
Abstract
Dogs are excellent animal models for human disease. They have extensive veterinary histories, pedigrees, and a unique genetic system due to breeding practices. Despite these advantages, one factor limiting their usefulness is the canine genome reference (CGR), which was assembled using a single purebred Boxer. Although a common practice, this results in many high-quality reads remaining unmapped. To address this, whole-genome sequence data from three breeds, Border Collie (n = 26), Bearded Collie (n = 7), and Entlebucher Sennenhund (n = 8), were analyzed to identify novel, non-CGR genomic contigs using the previously validated pseudo-de novo assembly pipeline. We identified 256,957 novel contigs, and paired-end relationships together with BLAT scores provided 126,555 (49%) high-quality contigs with genomic coordinates containing 4.6 Mb of novel sequence absent from the CGR. These contigs close 12,503 known gaps, including 2.4 Mb containing partially missing sequences for 11.5% of Ensembl, 16.4% of RefSeq and 12.2% of canFam3.1+ CGR annotated genes and 1,748 unmapped contigs containing 2,366 novel gene variants. Examples for six disease-associated genes (SCARF2, RD3, COL9A3, FAM161A, RASGRP1 and DLX6) containing gaps or alternate splice variants missing from the CGR are also presented. These findings from non-reference breeds support the need for improvement of the current Boxer-only CGR to avoid missing important biological information. The inclusion of the missing gene sequences into the CGR will facilitate identification of putative disease mutations across diverse breeds and phenotypes.
8
Shirota M, Kinoshita K. Discrepancies between human DNA, mRNA and protein reference sequences and their relation to single nucleotide variants in the human population. Database: The Journal of Biological Databases and Curation 2016; 2016:baw124. [PMID: 27589963 PMCID: PMC5009343 DOI: 10.1093/database/baw124] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/05/2016] [Accepted: 08/04/2016] [Indexed: 01/24/2023]
Abstract
The protein coding sequences of the human reference genome GRCh38, RefSeq mRNA and UniProt protein databases are sometimes inconsistent with each other, due to polymorphisms in the human population, but the overall landscape of the discordant sequences has not been clarified. In this study, we comprehensively listed the discordant bases and regions between the GRCh38, RefSeq and UniProt reference sequences, based on the genomic coordinates of GRCh38. We observed that the RefSeq sequences are more likely to represent the major alleles than GRCh38 and UniProt, by assigning the alternative allele frequencies of the discordant bases. Since some reference sequences have minor alleles, functional and structural annotations may be performed based on rare alleles in the human population, thereby biasing these analyses. Some of the differences between the RefSeq and GRCh38 account for biological differences due to known RNA-editing sites. The definitions of the coding regions are frequently complicated by possible micro-exons within introns and by SNVs with large alternative allele frequencies near exon–intron boundaries. The mRNA or protein regions missing from GRCh38 were mainly due to small deletions, and these sequences need to be identified. Taken together, our results clarify overall consistency and remaining inconsistency between the reference sequences.
Affiliation(s)
- Matsuyuki Shirota
- Graduate School of Medicine, Tohoku University, Sendai, Miyagi 9808575, Japan
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan
- Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan
- Kengo Kinoshita
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 9808575, Japan
- Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi 9808579, Japan
- Institute for Development, Aging and Cancer, Tohoku University, Sendai, Miyagi 9808575, Japan
9
van der Weide RH, Simonis M, Hermsen R, Toonen P, Cuppen E, de Ligt J. The Genomic Scrapheap Challenge; Extracting Relevant Data from Unmapped Whole Genome Sequencing Reads, Including Strain Specific Genomic Segments, in Rats. PLoS One 2016; 11:e0160036. [PMID: 27501045 PMCID: PMC4976967 DOI: 10.1371/journal.pone.0160036] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/26/2016] [Accepted: 07/12/2016] [Indexed: 01/17/2023] Open
Abstract
Unmapped next-generation sequencing reads are typically ignored, although they contain biologically relevant information. We systematically analyzed unmapped reads from whole genome sequencing of 33 inbred rat strains. High quality reads were selected and enriched for biologically relevant sequences; similarity-based analysis revealed clustering similar to previously reported phylogenetic trees. Our results demonstrate that on average 20% of all unmapped reads harbor sequences that can be used to improve reference genomes and generate hypotheses on potential genotype-phenotype relationships. Analysis pipelines would benefit from incorporating the described methods and reference genomes would benefit from inclusion of the genomic segments obtained through these efforts.
Affiliation(s)
- Robin H. van der Weide
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Division of Gene Regulation, The Netherlands Cancer Institute, Amsterdam, The Netherlands
- Marieke Simonis
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Roel Hermsen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Pim Toonen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Edwin Cuppen
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
- Joep de Ligt
- Hubrecht Institute, Royal Netherlands Academy of Arts and Sciences (KNAW), University Medical Centre Utrecht, Utrecht, The Netherlands
10
Faber-Hammond JJ, Brown KH. Anchored pseudo-de novo assembly of human genomes identifies extensive sequence variation from unmapped sequence reads. Hum Genet 2016; 135:727-40. [PMID: 27061184 PMCID: PMC4899208 DOI: 10.1007/s00439-016-1667-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/19/2016] [Accepted: 03/29/2016] [Indexed: 01/08/2023]
Abstract
The completion of the human genome reference (HGR) marked the beginning of the genomics era, yet despite its utility, its universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2-5% of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals, with ~40% showing high sequence complexity. Genomic coordinates were generated for 99.9%, with 52.5% exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments, and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10-20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine.
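The "exports unmappable reads" step of such a pipeline can be sketched in Python by checking the unmapped bit (0x4) of the SAM FLAG field. This is a minimal illustration on made-up SAM records and not the authors' pipeline. The function name and the sample reads are hypothetical.

```python
def unmapped_reads(sam_lines):
    """Yield (read name, sequence) for SAM records whose FLAG has the
    0x4 bit set, i.e. reads the aligner could not place."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])     # column 2 of a SAM record is FLAG
        if flag & 0x4:
            yield fields[0], fields[9]  # QNAME and SEQ columns

# Made-up single-end records: one header line, two mapped reads, one unmapped.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCCT\tIIIII",
    "read3\t16\tchr2\t50\t30\t5M\t*\t0\t0\tTTGAC\tIIIII",
]
exported = list(unmapped_reads(sam))
unmapped_share = len(exported) / 3  # three alignment records in the toy input
```

The exported reads would then be passed to a de novo assembler per individual, a step this sketch does not cover.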
Affiliation(s)
- Joshua J Faber-Hammond
- Department of Biology, Portland State University, 1719 SW 10th Ave., SRTC 246, Portland, 97207-0751, USA
- Kim H Brown
- Department of Biology, Portland State University, 1719 SW 10th Ave., SRTC 246, Portland, 97207-0751, USA
11
Genomic leftovers: identifying novel microsatellites, over-represented motifs and functional elements in the human genome. Sci Rep 2016; 6:27722. [PMID: 27278669 PMCID: PMC4899811 DOI: 10.1038/srep27722] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/15/2015] [Accepted: 05/23/2016] [Indexed: 01/29/2023] Open
Abstract
The human genome is 99% complete. This study contributes to filling the 1% gap by enriching previously unknown repeat regions called microsatellites (MST). We devised a Global MST Enrichment (GME) kit to enrich and next-generation sequence 2 colorectal cell lines and 16 normal human samples to illustrate its utility in identifying contigs from reads that do not map to the genome reference. The analysis of these samples yielded 790 novel extra-referential concordant contigs that are observed in more than one sample. We searched for evidence of functional elements in the concordant contigs in two ways: (1) BLAST-ing each contig against normal RNA-Seq samples, and (2) checking for predicted functional elements using GlimmerHMM. Of the 790 concordant contigs, 37 had an exact match to at least one RNA-Seq read; 15 aligned to more than 100 RNA-Seq reads. Of the 249 concordant contigs predicted by GlimmerHMM to have functional elements, 6 had at least one exact RNA-Seq match. BLAST-ing these novel contigs against all publicly available sequences confirmed that they were found in human and chimpanzee BAC and FOSMID clones sequenced as part of the original human genome project. These extra-referential contigs predominantly contained pentameric repeats, especially two motifs: AATGG and GTGGA.
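Detecting pentameric repeats such as AATGG and GTGGA in a contig can be sketched with a regular expression that measures the longest tandem run of each motif. The function name and the toy contig are hypothetical, and the study's own repeat-calling method is not described in the abstract.

```python
import re

def longest_tandem_run(seq: str, motif: str) -> int:
    """Number of motif copies in the longest uninterrupted tandem run."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    return max(
        (len(match.group(0)) // len(motif) for match in pattern.finditer(seq)),
        default=0,
    )

# Toy contig: three tandem copies of AATGG followed by a single GTGGA.
contig = "CCAATGGAATGGAATGGTTGTGGA"
aatgg_run = longest_tandem_run(contig, "AATGG")
gtgga_run = longest_tandem_run(contig, "GTGGA")
absent_run = longest_tandem_run(contig, "CCCCC")
```

A real screen would also consider rotations and the reverse complement of each motif, which this sketch leaves out.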