951
Wang W, Wei Z, Lam TW, Wang J. Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions. Sci Rep 2011; 1:55. [PMID: 22355574 PMCID: PMC3216542 DOI: 10.1038/srep00055] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.5] [Received: 06/15/2011] [Accepted: 07/25/2011] [Indexed: 01/30/2023]
Abstract
The rapid development of next generation sequencing (NGS) technology provides a new chance to extend the scale and resolution of genomic research. How to efficiently map millions of short reads to the reference genome and how to make accurate SNP calls are two major challenges in taking full advantage of NGS. In this article, we reviewed the current software tools for mapping and SNP calling, and evaluated their performance on samples from The Cancer Genome Atlas (TCGA) project. We found that BWA and Bowtie are better than the other alignment tools in comprehensive performance for the Illumina platform, while NovoalignCS showed the best overall performance for SOLiD. Furthermore, we showed that next-generation sequencing platforms have significantly lower coverage and poorer SNP-calling performance in the CpG islands, promoter and 5'-UTR regions of the genome. NGS experiments targeting these regions should have higher sequencing depth than normal genomic regions.
Affiliation(s)
- Weixin Wang
- Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
952
Marroni F, Pinosio S, Di Centa E, Jurman I, Boerjan W, Felice N, Cattonaro F, Morgante M. Large-scale detection of rare variants via pooled multiplexed next-generation sequencing: towards next-generation Ecotilling. Plant J 2011; 67:736-45. [PMID: 21554453 DOI: 10.1111/j.1365-313x.2011.04627.x] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Indexed: 05/03/2023]
Abstract
Common variants, such as those identified by genome-wide association scans, explain only a small proportion of trait variation. Growing evidence suggests that rare functional variants, which are usually missed by genome-wide association scans, play an important role in determining the phenotype. We used pooled multiplexed next-generation sequencing and a customized analysis workflow to detect mutations in five candidate genes for lignin biosynthesis in 768 pooled Populus nigra accessions. We identified a total of 36 non-synonymous single nucleotide polymorphisms, one of which causes a premature stop codon. The most common variant was estimated to be present in 672 of the 1536 tested chromosomes, while the rarest was estimated to occur only once in 1536 chromosomes. Comparison with individual Sanger sequencing in a selected sub-sample confirmed that variants are identified with high sensitivity and specificity, and that the variant frequency was estimated accurately. This proposed method for identification of rare polymorphisms allows accurate detection of variation in many individuals, and is cost-effective compared to individual sequencing.
953
Missirian V, Comai L, Filkov V. Statistical mutation calling from sequenced overlapping DNA pools in TILLING experiments. BMC Bioinformatics 2011. [PMID: 21756356 DOI: 10.1186/1471-2105-12-287] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND TILLING (Targeting induced local lesions IN genomes) is an efficient reverse genetics approach for detecting induced mutations in pools of individuals. Combined with the high-throughput of next-generation sequencing technologies, and the resolving power of overlapping pool design, TILLING provides an efficient and economical platform for functional genomics across thousands of organisms. RESULTS We propose a probabilistic method for calling TILLING-induced mutations, and their carriers, from high throughput sequencing data of overlapping population pools, where each individual occurs in two pools. We assign a probability score to each sequence position by applying Bayes' Theorem to a simplified binomial model of sequencing error and expected mutations, taking into account the coverage level. We test the performance of our method on variable quality, high-throughput sequences from wheat and rice mutagenized populations. CONCLUSIONS We show that our method effectively discovers mutations in large populations with sensitivity of 92.5% and specificity of 99.8%. It also outperforms existing SNP detection methods in detecting real mutations, especially at higher levels of coverage variability across sequenced pools, and in lower quality short reads sequence data. The implementation of our method is available from: http://www.cs.ucdavis.edu/filkov/CAMBa/.
Affiliation(s)
- Victor Missirian
- Department of Computer Science, UC Davis, 1 Shields Ave., Davis, CA 95616, USA
954
Missirian V, Comai L, Filkov V. Statistical mutation calling from sequenced overlapping DNA pools in TILLING experiments. BMC Bioinformatics 2011; 12:287. [PMID: 21756356 PMCID: PMC3150297 DOI: 10.1186/1471-2105-12-287] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Received: 11/01/2010] [Accepted: 07/14/2011] [Indexed: 11/17/2022]
Abstract
Background TILLING (Targeting induced local lesions IN genomes) is an efficient reverse genetics approach for detecting induced mutations in pools of individuals. Combined with the high-throughput of next-generation sequencing technologies, and the resolving power of overlapping pool design, TILLING provides an efficient and economical platform for functional genomics across thousands of organisms. Results We propose a probabilistic method for calling TILLING-induced mutations, and their carriers, from high throughput sequencing data of overlapping population pools, where each individual occurs in two pools. We assign a probability score to each sequence position by applying Bayes' Theorem to a simplified binomial model of sequencing error and expected mutations, taking into account the coverage level. We test the performance of our method on variable quality, high-throughput sequences from wheat and rice mutagenized populations. Conclusions We show that our method effectively discovers mutations in large populations with sensitivity of 92.5% and specificity of 99.8%. It also outperforms existing SNP detection methods in detecting real mutations, especially at higher levels of coverage variability across sequenced pools, and in lower quality short reads sequence data. The implementation of our method is available from: http://www.cs.ucdavis.edu/filkov/CAMBa/.
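The scoring idea in this abstract (Bayes' Theorem applied to a binomial model of sequencing error versus expected mutation, given the coverage) can be illustrated with a minimal single-pool sketch. This is not the authors' CAMBa implementation: the function names, the error rate, the expected mutant read fraction and the prior below are all hypothetical values chosen for illustration, and the published method additionally combines evidence from the two pools in which each individual occurs.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def mutation_posterior(alt_reads: int, depth: int, error_rate: float,
                       mutant_fraction: float, prior: float) -> float:
    """Posterior probability that a position carries a real mutation.

    Two competing hypotheses (both simplifications, assumed here):
      - mutation: alt reads ~ Binomial(depth, mutant_fraction)
      - error only: alt reads ~ Binomial(depth, error_rate)
    """
    like_mut = binom_pmf(alt_reads, depth, mutant_fraction)
    like_err = binom_pmf(alt_reads, depth, error_rate)
    numerator = prior * like_mut
    denominator = numerator + (1.0 - prior) * like_err
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical pool of 64 diploid individuals with one heterozygous carrier:
# roughly 1 in 128 reads is expected to show the mutant base.
high = mutation_posterior(alt_reads=16, depth=2000, error_rate=0.001,
                          mutant_fraction=1 / 128, prior=1e-3)
low = mutation_posterior(alt_reads=2, depth=2000, error_rate=0.001,
                         mutant_fraction=1 / 128, prior=1e-3)
```

Because depth enters both likelihoods, the same count of variant reads is scored differently at different coverage levels, which is the property the abstract highlights.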
Affiliation(s)
- Victor Missirian
- Department of Computer Science, UC Davis, 1 Shields Ave., Davis, CA 95616, USA
955
Girard SL, Gauthier J, Noreau A, Xiong L, Zhou S, Jouan L, Dionne-Laporte A, Spiegelman D, Henrion E, Diallo O, Thibodeau P, Bachand I, Bao JYJ, Tong AHY, Lin CH, Millet B, Jaafari N, Joober R, Dion PA, Lok S, Krebs MO, Rouleau GA. Increased exonic de novo mutation rate in individuals with schizophrenia. Nat Genet 2011; 43:860-3. [PMID: 21743468 DOI: 10.1038/ng.886] [Citation(s) in RCA: 295] [Impact Index Per Article: 21.1] [Received: 02/25/2011] [Accepted: 06/15/2011] [Indexed: 12/17/2022]
Abstract
Schizophrenia is a severe psychiatric disorder that profoundly affects cognitive, behavioral and emotional processes. The wide spectrum of symptoms and clinical variability in schizophrenia suggest a complex genetic etiology, which is consistent with the numerous loci thus far identified by linkage, copy number variation and association studies. Although schizophrenia heritability may be as high as ∼80%, the genes responsible for much of this heritability remain to be identified. Here we sequenced the exomes of 14 schizophrenia probands and their parents. We identified 15 de novo mutations (DNMs) in eight probands, which is significantly more than expected considering the previously reported DNM rate. In addition, 4 of the 15 identified DNMs are nonsense mutations, which is more than what is expected by chance. Our study supports the notion that DNMs may account for some of the heritability reported for schizophrenia while providing a list of genes possibly involved in disease pathogenesis.
Affiliation(s)
- Simon L Girard
- Centre of Excellence in Neuromics of Université de Montréal, Centre Hospitalier de l'Université de Montréal Research Center, Montréal, Québec, Canada
956
Margraf RL, Durtschi JD, Dames S, Pattison DC, Stephens JE, Voelkerding KV. Variant identification in multi-sample pools by Illumina Genome Analyzer sequencing. J Biomol Tech 2011; 22:74-84. [PMID: 21738440 PMCID: PMC3121147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Multi-sample pooling and Illumina Genome Analyzer (GA) sequencing allows high throughput sequencing of multiple samples to determine population sequence variation. A preliminary experiment, using the RET proto-oncogene as a model, predicted ≤ 30 samples could be pooled to reliably detect singleton variants without requiring additional confirmation testing. This report used 30 and 50 sample pools to test the hypothesized pooling limit and also to test recent protocol improvements, Illumina GAIIx upgrades, and longer read chemistry. The SequalPrep(TM) method was used to normalize amplicons before pooling. For comparison, a single 'control' sample was run in a different flow cell lane. Data was evaluated by variant read percentages and the subtractive correction method which utilizes the control sample. In total, 59 variants were detected within the pooled samples, which included all 47 known true variants. The 15 known singleton variants due to Sanger sequencing had an average of 1.62 ± 0.26% variant reads for the 30 pool (expected 1.67% for a singleton variant [unique variant within the pool]) and 1.01 ± 0.19% for the 50 pool (expected 1%). The 76 base read lengths had higher error rates than shorter read lengths (33 and 50 base reads), which eliminated the distinction of true singleton variants from background error. This report demonstrated pooling limits from 30 up to 50 samples (depending on error rates and coverage), for reliable singleton variant detection. The presented pooling protocols and analysis methods can be used for variant discovery in other genes, facilitating molecular diagnostic test design and interpretation.
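The expected singleton percentages quoted in this abstract (1.67% for the 30-sample pool, 1% for the 50-sample pool) follow from simple arithmetic, sketched below. The sketch assumes diploid samples, a heterozygous singleton and equal contribution of every sample to the pool; the function name is illustrative and does not come from the paper.

```python
def expected_singleton_percent(n_samples: int, ploidy: int = 2) -> float:
    """Expected percentage of variant reads for one variant allele in a pool.

    A singleton is one variant chromosome among n_samples * ploidy
    chromosomes, assuming every sample contributes equally to the pool.
    """
    return 100.0 / (n_samples * ploidy)


pool_30 = expected_singleton_percent(30)  # one allele in 60 chromosomes
pool_50 = expected_singleton_percent(50)  # one allele in 100 chromosomes
```

The observed means reported in the abstract (1.62 ± 0.26% and 1.01 ± 0.19%) sit close to these expectations, so a singleton is only distinguishable while the background error rate stays well below 1/(2N).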
Affiliation(s)
- Rebecca L. Margraf
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- Jacob D. Durtschi
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- Shale Dames
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- David C. Pattison
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- Jack E. Stephens
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- Karl V. Voelkerding
- ARUP Institute for Clinical & Experimental Pathology®, Salt Lake City, Utah
- Department of Pathology, University of Utah School of Medicine, Salt Lake City, Utah
957
Tsai H, Howell T, Nitcher R, Missirian V, Watson B, Ngo KJ, Lieberman M, Fass J, Uauy C, Tran RK, Khan AA, Filkov V, Tai TH, Dubcovsky J, Comai L. Discovery of rare mutations in populations: TILLING by sequencing. Plant Physiol 2011; 156:1257-68. [PMID: 21531898 PMCID: PMC3135940 DOI: 10.1104/pp.110.169748] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Received: 11/20/2010] [Accepted: 04/28/2011] [Indexed: 05/19/2023]
Abstract
Discovery of rare mutations in populations requires methods, such as TILLING (for Targeting Induced Local Lesions in Genomes), for processing and analyzing many individuals in parallel. Previous TILLING protocols employed enzymatic or physical discrimination of heteroduplexed from homoduplexed target DNA. Using mutant populations of rice (Oryza sativa) and wheat (Triticum durum), we developed a method based on Illumina sequencing of target genes amplified from multidimensionally pooled templates representing 768 individuals per experiment. Parallel processing of sequencing libraries was aided by unique tracer sequences and barcodes allowing flexibility in the number and pooling arrangement of targeted genes, species, and pooling scheme. Sequencing reads were processed and aligned to the reference to identify possible single-nucleotide changes, which were then evaluated for frequency, sequencing quality, intersection pattern in pools, and statistical relevance to produce a Bayesian score with an associated confidence threshold. Discovery was robust both in rice and wheat using either bidimensional or tridimensional pooling schemes. The method compared favorably with other molecular and computational approaches, providing high sensitivity and specificity.
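The "intersection pattern in pools" mentioned in this abstract can be illustrated for the bidimensional case with a toy sketch. It assumes the 768 individuals are laid out on a hypothetical 24 x 32 grid, with one pool per row and one pool per column; the actual pooling layout, the function names and the numbering scheme are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Set, Tuple


def candidate_carriers(row_pools_hit: Set[int],
                       col_pools_hit: Set[int]) -> List[Tuple[int, int]]:
    """Grid positions whose row pool AND column pool both show the variant."""
    return sorted((r, c) for r in row_pools_hit for c in col_pools_hit)


def individual_id(row: int, col: int, n_cols: int) -> int:
    """Map a grid position to a running individual number (row-major order)."""
    return row * n_cols + col


N_ROWS, N_COLS = 24, 32  # hypothetical layout: 24 * 32 = 768 individuals

# A variant seen only in row pool 3 and column pool 17 points to one individual.
hits = candidate_carriers({3}, {17})
carrier = individual_id(*hits[0], n_cols=N_COLS)
```

When the same variant shows up in several row and column pools, the intersection becomes ambiguous, which is one reason a third pooling dimension or a statistical score can help.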
Affiliation(s)
- Luca Comai
- Department of Plant Biology and Genome Center (H.T., T.H., B.W., K.J.N., M.L., R.K.T., A.A.K., L.C.), Department of Plant Sciences (R.N., C.U., T.H.T., J.D.), Department of Computer Sciences (V.M., V.F.), Bioinformatics Core, Genome Center (J.F.), and United States Department of Agriculture Agricultural Research Service, Crops Pathology and Genetics Research Unit (T.H.T.), University of California, Davis, California 95616
958
Ekblom R, Galindo J. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity (Edinb) 2011; 107:1-15. [PMID: 21139633 PMCID: PMC3186121 DOI: 10.1038/hdy.2010.152] [Citation(s) in RCA: 642] [Impact Index Per Article: 45.9] [Received: 03/04/2010] [Revised: 09/10/2010] [Accepted: 11/02/2010] [Indexed: 11/09/2022]
Abstract
As most biologists are probably aware, technological advances in molecular biology during the last few years have opened up possibilities to rapidly generate large-scale sequencing data from non-model organisms at a reasonable cost. In an era when virtually any study organism can 'go genomic', it is worthwhile to review how this may impact molecular ecology. The first studies to put next generation sequencing (NGS) to the test in ecologically well-characterized species without previous genome information were published in 2007 and the beginning of 2008. Since then several studies have followed in their footsteps, and a large number are undoubtedly under way. This review focuses on how NGS has been, and can be, applied to ecological, population genetic and conservation genetic studies of non-model species, for which there are no (or very limited) genomic resources. Our aim is to draw attention to the various possibilities that are opening up using the new technologies, but we also highlight some of the pitfalls and drawbacks with these methods. We will try to provide a snapshot of the current state of the art for this rapidly advancing and expanding field of research and give some likely directions for future developments.
Affiliation(s)
- R Ekblom
- Department of Animal and Plant Sciences, University of Sheffield, UK.
959
Hornsey M, Loman N, Wareham DW, Ellington MJ, Pallen MJ, Turton JF, Underwood A, Gaulton T, Thomas CP, Doumith M, Livermore DM, Woodford N. Whole-genome comparison of two Acinetobacter baumannii isolates from a single patient, where resistance developed during tigecycline therapy. J Antimicrob Chemother 2011; 66:1499-503. [PMID: 21565804 DOI: 10.1093/jac/dkr168] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.2] [Indexed: 02/06/2023]
Abstract
OBJECTIVES The whole genomes of two Acinetobacter baumannii isolates recovered from a single patient were sequenced to gain insight into the nature and extent of genomic plasticity in this important nosocomial pathogen over the course of a short infection. The first, AB210, was recovered before tigecycline therapy and was susceptible to this agent; the second, AB211, was recovered after therapy and was resistant. METHODS DNA from AB210 was sequenced by 454 GS FLX pyrosequencing according to the standard protocol for whole-genome shotgun sequencing, producing ∼250 bp fragment reads. AB211 was shotgun sequenced using the Illumina Genetic Analyzer to produce fragment reads of exactly 36 bp. Single nucleotide polymorphisms (SNPs) and large deletions detected in AB211 in relation to AB210 were confirmed by PCR and DNA sequencing. RESULTS Automated gene prediction detected 3850 putative coding sequences (CDSs). Sequence analysis demonstrated the presence of plasmids pAB0057 and pACICU2 in both isolates. Eighteen putative SNPs were detected between the pre- and post-therapy isolates, AB210 and AB211. Three contigs in AB210 were not covered by reads in AB211, representing three deletions of ∼15, 44 and 17 kb. CONCLUSIONS This study demonstrates that significant differences were detectable between two bacterial isolates recovered 1 week apart from the same patient, and reveals the potential of whole-genome sequencing as a tool for elucidating the processes responsible for changes in antibiotic susceptibility profiles.
Affiliation(s)
- Michael Hornsey
- Antimicrobial Research Group, Centre for Immunology & Infectious Disease, Blizard Institute, Barts and The London, Queen Mary's School of Medicine and Dentistry, 4 Newark Street, London, E1 2AT, UK.
960
Deng X. SeqGene: a comprehensive software solution for mining exome- and transcriptome-sequencing data. BMC Bioinformatics 2011; 12:267. [PMID: 21714929 PMCID: PMC3148209 DOI: 10.1186/1471-2105-12-267] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 01/14/2011] [Accepted: 06/29/2011] [Indexed: 12/15/2022]
Abstract
BACKGROUND The popularity of massively parallel exome and transcriptome sequencing projects demands new data mining tools with a comprehensive set of features to support a wide range of analysis tasks. RESULTS SeqGene, a new data mining tool, supports mutation detection and annotation, dbSNP and 1000 Genome data integration, RNA-Seq expression quantification, mutation and coverage visualization, allele specific expression (ASE), differentially expressed genes (DEGs) identification, copy number variation (CNV) analysis, and gene expression quantitative trait loci (eQTLs) detection. We also developed novel methods for testing the association between SNP and expression and identifying genotype-controlled DEGs. We showed that the results generated from SeqGene compare favourably to other existing methods in our case studies. CONCLUSION SeqGene is designed as a general-purpose software package. It supports both paired-end reads and single reads generated on most sequencing platforms; it runs on all major types of computers; it supports arbitrary genome assemblies for arbitrary organisms; and it scales well to support both large and small scale sequencing projects. The software homepage is http://seqgene.sourceforge.net.
Affiliation(s)
- Xutao Deng
- Bioinformatics Core Facility, Department of Molecular Medicine, Beckman Research Institute, City of Hope Medical Center, Duarte, CA 91010, USA.
961
Kim SY, Lohmueller KE, Albrechtsen A, Li Y, Korneliussen T, Tian G, Grarup N, Jiang T, Andersen G, Witte D, Jorgensen T, Hansen T, Pedersen O, Wang J, Nielsen R. Estimation of allele frequency and association mapping using next-generation sequencing data. BMC Bioinformatics 2011; 12:231. [PMID: 21663684 PMCID: PMC3212839 DOI: 10.1186/1471-2105-12-231] [Citation(s) in RCA: 127] [Impact Index Per Article: 9.1] [Received: 02/24/2011] [Accepted: 06/11/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost effective approach is to use medium or low-coverage data (e.g., < 15X). However, SNP calling and allele frequency estimation in such studies is associated with substantial statistical uncertainty because of varying coverage and high error rates. RESULTS We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data. CONCLUSIONS Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.
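The core idea of this abstract, estimating the allele frequency by integrating over each individual's genotype uncertainty instead of calling genotypes first, can be sketched with a small expectation-maximization loop. This is an illustrative stand-in rather than the authors' code: it assumes Hardy-Weinberg proportions, takes per-individual genotype likelihoods as ready-made input, and the function name, starting value and tolerance are arbitrary choices.

```python
from typing import Sequence, Tuple

# Likelihood of the reads given 0, 1 or 2 copies of the alternate allele.
GenotypeLikelihoods = Tuple[float, float, float]


def estimate_allele_frequency(likelihoods: Sequence[GenotypeLikelihoods],
                              init: float = 0.2, n_iter: int = 200,
                              tol: float = 1e-10) -> float:
    """EM estimate of the alternate-allele frequency from genotype likelihoods."""
    f = init
    for _ in range(n_iter):
        expected_alt = 0.0
        for l0, l1, l2 in likelihoods:
            # E-step: posterior over genotypes under a Hardy-Weinberg prior.
            p0 = (1.0 - f) ** 2 * l0
            p1 = 2.0 * f * (1.0 - f) * l1
            p2 = f * f * l2
            total = p0 + p1 + p2
            expected_alt += (p1 + 2.0 * p2) / total
        # M-step: expected alternate alleles divided by total chromosomes.
        f_new = expected_alt / (2.0 * len(likelihoods))
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    return f


# With unambiguous likelihoods the estimate equals the simple allele count.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
freq_certain = estimate_allele_frequency(certain)

# Low-coverage individuals contribute partial evidence instead of a hard call.
uncertain = [(0.6, 0.4, 0.0), (0.5, 0.5, 0.0), (0.1, 0.8, 0.1), (0.9, 0.1, 0.0)]
freq_uncertain = estimate_allele_frequency(uncertain)
```

With uninformative likelihoods the estimate simply stays at the starting value, which reflects that such data carry no information about the frequency.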
Affiliation(s)
- Su Yeon Kim
- Departments of Integrative Biology and Statistics, UC Berkeley, Berkeley, CA 94720, USA.
962
Bowne SJ, Humphries MM, Sullivan LS, Kenna PF, Tam LCS, Kiang AS, Campbell M, Weinstock GM, Koboldt DC, Ding L, Fulton RS, Sodergren EJ, Allman D, Millington-Ward S, Palfi A, McKee A, Blanton SH, Slifer S, Konidari I, Farrar GJ, Daiger SP, Humphries P. A dominant mutation in RPE65 identified by whole-exome sequencing causes retinitis pigmentosa with choroidal involvement. Eur J Hum Genet 2011; 19:1074-81. [PMID: 21654732 DOI: 10.1038/ejhg.2011.86] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Indexed: 02/06/2023]
Abstract
Linkage testing using Affymetrix 6.0 SNP Arrays mapped the disease locus in TCD-G, an Irish family with autosomal dominant retinitis pigmentosa (adRP), to an 8.8 Mb region on 1p31. Of 50 known genes in the region, 11 candidates, including RPE65 and PDE4B, were sequenced using di-deoxy capillary electrophoresis. Simultaneously, a subset of family members was analyzed using Agilent SureSelect All Exome capture, followed by sequencing on an Illumina GAIIx platform. Candidate gene and exome sequencing resulted in the identification of an Asp477Gly mutation in exon 13 of the RPE65 gene tracking with the disease in TCD-G. All coding exons of genes not sequenced to sufficient depth by next generation sequencing were sequenced by di-deoxy sequencing. No other potential disease-causing variants were found to segregate with disease in TCD-G. The Asp477Gly mutation was not present in Irish controls, but was found in a second Irish family provisionally diagnosed with choroideremia, bringing the combined maximum two-point LOD score to 5.3. Mutations in RPE65 are a known cause of recessive Leber congenital amaurosis (LCA) and recessive RP, but no dominant mutations have been reported. Protein modeling suggests that the Asp477Gly mutation may destabilize protein folding, and mutant RPE65 protein migrates marginally faster on SDS-PAGE, compared with wild type. Gene therapy for LCA patients with RPE65 mutations has shown great promise, raising the possibility of related therapies for dominant-acting mutations in this gene.
Affiliation(s)
- Sara J Bowne
- Human Genetics Center, The University of Texas Health Science Center, Houston, TX, USA
963
Erlich Y. Blood ties: chimerism can mask twin discordance in high-throughput sequencing. Twin Res Hum Genet 2011; 14:137-43. [PMID: 21425895 DOI: 10.1375/twin.14.2.137] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 01/16/2023]
Abstract
Twin studies have long provided a means to separate the contributions of genetic and environmental factors. A recent pioneering report by Baranzini et al. presented an analysis of the complete genomes and epigenomes of a monozygotic (MZ) twin pair discordant for multiple sclerosis. This failed to find any difference between the twins, raising doubts regarding the value of whole-genome twin studies for defining disease susceptibility alleles. However, the study was carried out with DNA extracted from blood. In many cases, the hematopoietic lineages of MZ twins are chimeric due to twin-to-twin exchange of hematopoietic stem cells during embryogenesis. We therefore wondered how chimerism might impact the ability to identify genetic differences. We inferred the blood chimerism rates and profiles of more than 30 discordant twin cases from a wide variety of medical conditions. We found that the genotype compositions of the twins were highly similar. We then benchmarked the performance of SNP callers to detect discordant variations using high-throughput sequencing data. Our analysis revealed that chimerism patterns, well within the range normally observed in MZ twins, greatly reduce the sensitivity of SNP calls. This raises questions regarding any conclusions of genomic homogeneity that might be drawn from studies of blood-derived twin DNA.
Affiliation(s)
- Yaniv Erlich
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, United States of America.
964
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011. [PMID: 21478889 DOI: 10.1038/ng.806.a] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 04/16/2023]
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
Affiliation(s)
- Mark A DePristo
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.
965
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011; 43:491-8. [PMID: 21478889 PMCID: PMC3083463 DOI: 10.1038/ng.806] [Citation(s) in RCA: 8233] [Impact Index Per Article: 588.1] [Received: 08/27/2010] [Accepted: 03/17/2011] [Indexed: 02/07/2023]
Abstract
Recent advances in sequencing technology make it possible to comprehensively catalog genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious, and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (i) initial read mapping; (ii) local realignment around indels; (iii) base quality score recalibration; (iv) SNP discovery and genotyping to find all potential variants; and (v) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We here discuss the application of these tools, instantiated in the Genome Analysis Toolkit, to deep whole-genome, whole-exome capture and multi-sample low-pass (∼4×) 1000 Genomes Project datasets.
Affiliation(s)
- Mark A DePristo
- Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA.
966
Erlich Y, Edvardson S, Hodges E, Zenvirt S, Thekkat P, Shaag A, Dor T, Hannon GJ, Elpeleg O. Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res 2011; 21:658-64. [PMID: 21487076 DOI: 10.1101/gr.117143.110] [Citation(s) in RCA: 153] [Impact Index Per Article: 10.9] [Indexed: 11/24/2022]
Abstract
Whole exome sequencing has become a pivotal methodology for rapid and cost-effective detection of pathogenic variations in Mendelian disorders. A major challenge of this approach is determining the causative mutation from a substantial number of bystander variations that do not play any role in the disease etiology. Current strategies to analyze variations have mainly relied on genetic and functional arguments such as mode of inheritance, conservation, and loss of function prediction. Here, we demonstrate that disease-network analysis provides an additional layer of information to stratify variations even in the presence of incomplete sequencing coverage, a known limitation of exome sequencing. We studied a case of Hereditary Spastic Paraparesis (HSP) in a single inbred Palestinian family. HSP is a group of neuropathological disorders that are characterized by abnormal gait and spasticity of the lower limbs. Forty-five loci have been associated with HSP and lesions in 20 genes have been documented to induce the disorder. We used whole exome sequencing and homozygosity mapping to create a list of possible candidates. After exhausting the genetic and functional arguments, we stratified the remaining candidates according to their similarity to the previously known disease genes. Our analysis implicated the causative mutation in the motor domain of KIF1A, a gene that has not yet been associated with HSP and that functions in anterograde axonal transportation. Our strategy can be useful for a large class of disorders that are characterized by locus heterogeneity, particularly when studying disorders in single families.
Affiliation(s)
- Yaniv Erlich
- Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA.
|
967
|
Hird SM, Brumfield RT, Carstens BC. PRGmatic: an efficient pipeline for collating genome‐enriched second‐generation sequencing data using a ‘provisional‐reference genome’. Mol Ecol Resour 2011; 11:743-8. [DOI: 10.1111/j.1755-0998.2011.03005.x] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Sarah M. Hird
- Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Robb T. Brumfield
- Museum of Natural Science, Louisiana State University, Baton Rouge, LA 70803, USA
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
- Bryan C. Carstens
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA 70803, USA
|
968
|
Legendre M, Santini S, Rico A, Abergel C, Claverie JM. Breaking the 1000-gene barrier for Mimivirus using ultra-deep genome and transcriptome sequencing. Virol J 2011; 8:99. [PMID: 21375749 PMCID: PMC3058096 DOI: 10.1186/1743-422x-8-99] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2011] [Accepted: 03/04/2011] [Indexed: 11/30/2022] Open
Abstract
Background: Mimivirus, a giant dsDNA virus infecting Acanthamoeba, is the prototype of the Mimiviridae family, the latest addition to the family of the nucleocytoplasmic large DNA viruses (NCLDVs). Its 1.2 Mb-genome was initially predicted to encode 917 genes. A subsequent RNA-Seq analysis precisely mapped many transcript boundaries and identified 75 new genes. Findings: We now report a much deeper analysis using the SOLiD™ technology combining RNA-Seq of the Mimivirus transcriptome during the infectious cycle (202.4 Million reads), and a complete genome re-sequencing (45.3 Million reads). This study corrected the genome sequence and identified several single nucleotide polymorphisms. Our results also provided clear evidence of previously overlooked transcription units, including an important RNA polymerase subunit distantly related to Euryarchea homologues. The total Mimivirus gene count is now 1018, 11% greater than the original annotation. Conclusions: This study highlights the huge progress brought about by ultra-deep sequencing for the comprehensive annotation of virus genomes, opening the door to a complete one-nucleotide resolution level description of their transcriptional activity, and to the realistic modeling of the viral genome expression at the ultimate molecular level. This work also illustrates the need to go beyond bioinformatics-only approaches for the annotation of short protein and non-coding genes in viral genomes.
Affiliation(s)
- Matthieu Legendre
- Structural & genomic Information Laboratory (CNRS, UPR2589), Mediterranean Institute of Microbiology, Aix-Marseille Université, 163 Avenue de Luminy, Case 934, FR-13288 Marseille, France.
|
969
|
Lupton MK, Proitsi P, Danillidou M, Tsolaki M, Hamilton G, Wroe R, Pritchard M, Lord K, Martin BM, Kloszewska I, Soininen H, Mecocci P, Vellas B, Harold D, Hollingworth P, Lovestone S, Powell JF. Deep sequencing of the Nicastrin gene in pooled DNA, the identification of genetic variants that affect risk of Alzheimer's disease. PLoS One 2011; 6:e17298. [PMID: 21364883 PMCID: PMC3045431 DOI: 10.1371/journal.pone.0017298] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2010] [Accepted: 01/27/2011] [Indexed: 11/18/2022] Open
Abstract
Nicastrin is an obligatory component of γ-secretase, the enzyme complex that leads to the production of Aβ fragments central to the pathogenesis of Alzheimer's disease (AD). Analyses of the effects of common variation in this gene on risk for late onset AD have been inconclusive. We investigated the effect of rare variation in the coding regions of the Nicastrin gene in a cohort of AD patients and matched controls using an innovative pooling approach and next generation sequencing. Five SNPs were identified and validated by individual genotyping from 311 cases and 360 controls. Association analysis identified a non-synonymous rare SNP (N417Y) with a statistically higher frequency in cases compared to controls in the Greek population (OR 3.994, CI 1.105–14.439, p = 0.035). This finding warrants further investigation in a larger cohort and adds weight to the hypothesis that rare variation explains some of the genetic heritability still to be identified in Alzheimer's disease.
Affiliation(s)
- Michelle K. Lupton
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Petroula Proitsi
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Makrina Danillidou
- 3rd Department of Neurology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Magda Tsolaki
- 3rd Department of Neurology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Gillian Hamilton
- Medical Genetics, Molecular Medicine Centre, Western General Hospital, University of Edinburgh, Edinburgh, United Kingdom
- Richard Wroe
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Megan Pritchard
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Kathryn Lord
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Belinda M. Martin
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- Iwona Kloszewska
- Department of Old Age Psychiatry and Psychotic Disorders, Medical University of Lodz, Lodz, Poland
- Hilkka Soininen
- Department of Neurology, University of Eastern Finland and Kuopio University Hospital, Kuopio, Finland
- Patrizia Mecocci
- Section of Gerontology and Geriatrics, Department of Clinical and Experimental Medicine, University of Perugia, Perugia, Italy
- Bruno Vellas
- Department of Internal and Geriatrics Medicine, Hôpitaux de Toulouse, Toulouse, France
- Denise Harold
- Department of Psychological Medicine and Neurology, MRC Centre for Neuropsychiatric Genetics and Genomics, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Paul Hollingworth
- Department of Psychological Medicine and Neurology, MRC Centre for Neuropsychiatric Genetics and Genomics, School of Medicine, Cardiff University, Cardiff, United Kingdom
- Simon Lovestone
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
- John F. Powell
- MRC Centre for Neurodegeneration Research, Institute of Psychiatry, King's College London, London, United Kingdom
|
970
|
Edmonson MN, Zhang J, Yan C, Finney RP, Meerzaman DM, Buetow KH. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics 2011; 27:865-6. [PMID: 21278191 DOI: 10.1093/bioinformatics/btr032] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Summary: Bambino is a variant detector and graphical alignment viewer for next-generation sequencing data in the SAM/BAM format, which is capable of pooling data from multiple source files. The variant detector takes advantage of SAM-specific annotations, and produces detailed output suitable for genotyping and identification of somatic mutations. The assembly viewer can display reads in the context of either a user-provided or automatically generated reference sequence, retrieve genome annotation features from a UCSC genome annotation database, display histograms of non-reference allele frequencies, and predict protein-coding changes caused by SNPs. Availability: Bambino is written in platform-independent Java and available from https://cgwb.nci.nih.gov/goldenPath/bamview/documentation/index.html, along with documentation and example data. Bambino may be launched online via Java Web Start or downloaded and run locally.
Affiliation(s)
- Michael N Edmonson
- National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
|
971
|
Bowne SJ, Sullivan LS, Koboldt DC, Ding L, Fulton R, Abbott RM, Sodergren EJ, Birch DG, Wheaton DH, Heckenlively JR, Liu Q, Pierce EA, Weinstock GM, Daiger SP. Identification of disease-causing mutations in autosomal dominant retinitis pigmentosa (adRP) using next-generation DNA sequencing. Invest Ophthalmol Vis Sci 2011; 52:494-503. [PMID: 20861475 DOI: 10.1167/iovs.10-6180] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Purpose: To determine whether massively parallel next-generation DNA sequencing offers rapid and efficient detection of disease-causing mutations in patients with monogenic inherited diseases. Retinitis pigmentosa (RP) is a challenging application for this technology because it is a monogenic disease in individuals and families but is highly heterogeneous in patient populations. RP has multiple patterns of inheritance, with mutations in many genes for each inheritance pattern and numerous, distinct, disease-causing mutations at each locus; further, many RP genes have not been identified yet. Methods: Next-generation sequencing was used to identify mutations in pairs of affected individuals from 21 families with autosomal dominant RP, selected from a cohort of families without mutations in "common" RP genes. One thousand amplicons targeting 249,267 unique bases of 46 candidate genes were sequenced with the 454GS FLX Titanium (Roche Diagnostics, Indianapolis, IN) and GAIIx (Illumina/Solexa, San Diego, CA) platforms. Results: An average sequence depth of 70× and 125× was obtained for the 454GS FLX and GAIIx platforms, respectively. More than 9000 sequence variants were identified and analyzed, to assess the likelihood of pathogenicity. One hundred twelve of these were selected as likely candidates and tested for segregation with traditional di-deoxy capillary electrophoresis sequencing of additional family members and control subjects. Five disease-causing mutations (24%) were identified in the 21 families. Conclusion: This project demonstrates that next-generation sequencing is an effective approach for detecting novel, rare mutations causing heterogeneous monogenic disorders such as RP. With the addition of this technology, disease-causing mutations can now be identified in 65% of autosomal dominant RP cases.
Affiliation(s)
- Sara J Bowne
- Human Genetics Center, The University of Texas Health Science Center, Houston, Texas 77030, USA
|
972
|
Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V, Futschik A, Kosiol C, Schlötterer C. PoPoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One 2011; 6:e15925. [PMID: 21253599 PMCID: PMC3017084 DOI: 10.1371/journal.pone.0015925] [Citation(s) in RCA: 421] [Impact Index Per Article: 30.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2010] [Accepted: 11/30/2010] [Indexed: 11/19/2022] Open
Abstract
Recent statistical analyses suggest that sequencing of pooled samples provides a cost effective approach to determine genome-wide population genetic parameters. Here we introduce PoPoolation, a toolbox specifically designed for the population genetic analysis of sequence data from pooled individuals. PoPoolation calculates estimates of θ(Watterson), θ(π), and Tajima's D that account for the bias introduced by pooling and sequencing errors, as well as divergence between species. Results of genome-wide analyses can be graphically displayed in a sliding window plot. PoPoolation is written in Perl and R and it builds on commonly used data formats. Its source code can be downloaded from http://code.google.com/p/popoolation/. Furthermore, we evaluate the influence of mapping algorithms, sequencing errors, and read coverage on the accuracy of population genetic parameter estimates from pooled data.
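For orientation, the estimators named in this abstract can be illustrated in their classical, individual-sequence form. The sketch below is an assumption-laden toy: it computes the uncorrected Watterson's θ and θ(π) from a small made-up alignment and does not reproduce PoPoolation's corrections for pooling or sequencing error. The function names and the alignment are hypothetical.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns in which more than one allele is observed."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Classical Watterson estimator: S / a_1, with a_1 = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a1


def theta_pi(seqs):
    """Mean number of pairwise differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return diffs / len(pairs)


# Hypothetical four-sequence alignment, used only for illustration.
toy_alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]
print(watterson_theta(toy_alignment), theta_pi(toy_alignment))
```

Tajima's D compares these two quantities; in pooled data both would need the bias corrections the abstract describes before such a comparison is meaningful.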
Affiliation(s)
- Robert Kofler
- Institute of Population Genetics, Vetmeduni Vienna, Vienna, Austria
- Nicola De Maio
- Institute of Population Genetics, Vetmeduni Vienna, Vienna, Austria
- Ram Vinay Pandey
- Institute of Population Genetics, Vetmeduni Vienna, Vienna, Austria
- Viola Nolte
- Institute of Population Genetics, Vetmeduni Vienna, Vienna, Austria
- Carolin Kosiol
- Institute of Population Genetics, Vetmeduni Vienna, Vienna, Austria
|
973
|
Vallania FLM, Druley TE, Ramos E, Wang J, Borecki I, Province M, Mitra RD. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res 2010; 20:1711-8. [PMID: 21041413 DOI: 10.1101/gr.109157.110] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Pooled-DNA sequencing strategies enable fast, accurate, and cost-effective detection of rare variants, but current approaches are not able to accurately identify short insertions and deletions (indels), despite their pivotal role in genetic disease. Furthermore, the sensitivity and specificity of these methods depend on arbitrary, user-selected significance thresholds, whose optimal values change from experiment to experiment. Here, we present a combined experimental and computational strategy that combines a synthetically engineered DNA library inserted in each run and a new computational approach named SPLINTER that detects and quantifies short indels and substitutions in large pools. SPLINTER integrates information from the synthetic library to select the optimal significance thresholds for every experiment. We show that SPLINTER detects indels (up to 4 bp) and substitutions in large pools with high sensitivity and specificity, accurately quantifies variant frequency (r = 0.999), and compares favorably with existing algorithms for the analysis of pooled sequencing data. We applied our approach to analyze a cohort of 1152 individuals, identifying 48 variants and validating 14 of 14 (100%) predictions by individual genotyping. Thus, our strategy provides a novel and sensitive method that will speed the discovery of novel disease-causing rare variants.
Affiliation(s)
- Francesco L M Vallania
- Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63108, USA
|
974
|
Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: accurate indel calls from short-read data. Genome Res 2010; 21:961-73. [PMID: 20980555 DOI: 10.1101/gr.112326.110] [Citation(s) in RCA: 323] [Impact Index Per Article: 21.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.
Affiliation(s)
- Cornelis A Albers
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, United Kingdom.
|
975
|
Bansal V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 2010; 26:i318-24. [PMID: 20529923 PMCID: PMC2881398 DOI: 10.1093/bioinformatics/btq214] [Citation(s) in RCA: 129] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Motivation: Next-generation sequencing technologies have enabled the sequencing of several human genomes in their entirety. However, the routine resequencing of complete genomes remains infeasible. The massive capacity of next-generation sequencers can be harnessed for sequencing specific genomic regions in hundreds to thousands of individuals. Sequencing-based association studies are currently limited by the low level of multiplexing offered by sequencing platforms. Pooled sequencing represents a cost-effective approach for studying rare variants in large populations. To utilize the power of DNA pooling, it is important to accurately identify sequence variants from pooled sequencing data. Detection of rare variants from pooled sequencing represents a different challenge than detection of variants from individual sequencing. Results: We describe a novel statistical approach, CRISP [Comprehensive Read analysis for Identification of Single Nucleotide Polymorphisms (SNPs) from Pooled sequencing] that is able to identify both rare and common variants by using two approaches: (i) comparing the distribution of allele counts across multiple pools using contingency tables and (ii) evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. Information about the distribution of reads between the forward and reverse strands and the size of the pools is also incorporated within this framework to filter out false variants. Validation of CRISP on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer demonstrates that it can detect 80–85% of SNPs identified using individual sequencing while achieving a low false discovery rate (3–5%). Comparison with previous methods for pooled SNP detection demonstrates the significantly lower false positive and false negative rates for CRISP. 
Availability: Implementation of this method is available at http://polymorphism.scripps.edu/∼vbansal/software/CRISP/ Contact:vbansal@scripps.edu
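The second idea in this abstract, evaluating the probability of observing multiple non-reference base calls from sequencing errors alone, can be illustrated with a plain binomial tail. This is a minimal sketch under stated assumptions (independent errors at a single fixed rate per read); it is not CRISP's actual model, and the function name and the example numbers are hypothetical.

```python
from math import comb


def prob_at_least_k_errors(n_reads, k_alt, error_rate):
    """P(X >= k_alt) for X ~ Binomial(n_reads, error_rate).

    Assumes each read independently shows a non-reference base with
    probability error_rate when no true variant is present.
    """
    below = sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k_alt)
    )
    return 1.0 - below


# Hypothetical pool: 1000 reads at a site, 20 non-reference calls, 1% error rate.
p_value = prob_at_least_k_errors(1000, 20, 0.01)
print(p_value)
```

A small tail probability suggests the non-reference calls are unlikely to be errors alone; the abstract's first approach, comparing allele counts across pools with contingency tables, is a separate test not shown here.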
Affiliation(s)
- Vikas Bansal
- Scripps Genomic Medicine, Scripps Translational Science Institute, La Jolla, CA 92037, USA.
|
976
|
Collins SC, Bray SM, Suhl JA, Cutler DJ, Coffee B, Zwick ME, Warren ST. Identification of novel FMR1 variants by massively parallel sequencing in developmentally delayed males. Am J Med Genet A 2010; 152A:2512-20. [PMID: 20799337 PMCID: PMC2946449 DOI: 10.1002/ajmg.a.33626] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Abstract
Fragile X syndrome (FXS), the most common inherited form of developmental delay, is typically caused by CGG-repeat expansion in FMR1. However, little attention has been paid to sequence variants in FMR1. Through the use of pooled-template massively parallel sequencing, we identified 130 novel FMR1 sequence variants in a population of 963 developmentally delayed males without CGG-repeat expansion mutations. Among these, we identified a novel missense change, p.R138Q, which alters a conserved residue in the nuclear localization signal of FMRP. We have also identified three promoter mutations in this population, all of which significantly reduce in vitro levels of FMR1 transcription. Additionally, we identified 10 noncoding variants of possible functional significance in the introns and 3'-untranslated region of FMR1, including two predicted splice site mutations. These findings greatly expand the catalog of known FMR1 sequence variants and suggest that FMR1 sequence variants may represent an important cause of developmental delay.
Affiliation(s)
- Stephen C. Collins
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Steven M. Bray
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Joshua A. Suhl
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- David J. Cutler
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Bradford Coffee
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Michael E. Zwick
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Stephen T. Warren
- Department of Human Genetics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Departments of Biochemistry and Pediatrics, Emory University School of Medicine, Atlanta, GA 30322, USA
|
977
|
Abstract
MicroRNAs (miRNAs) are small noncoding RNAs that regulate gene expression and have been implicated in the pathogenesis of cancer. In this study, we applied next generation sequencing techniques to comprehensively assess miRNA expression, identify genetic variants of miRNA genes, and screen for alterations in miRNA binding sites in a patient with acute myeloid leukemia. RNA sequencing of leukemic myeloblasts or CD34(+) cells pooled from healthy donors showed that 472 miRNAs were expressed, including 7 novel miRNAs, some of which displayed differential expression. Sequencing of all known miRNA genes revealed several novel germline polymorphisms but no acquired mutations in the leukemia genome. Analysis of the sequence of the 3'-untranslated regions (UTRs) of all coding genes identified a single somatic mutation in the 3'-UTR of TNFAIP2, a known target of the PML-RARα oncogene. This mutation resulted in translational repression of a reporter gene in a Dicer-dependent fashion. This study represents the first complete characterization of the "miRNAome" in a primary human cancer and suggests that generation of miRNA binding sites in the UTR regions of genes is another potential mechanism by which somatic mutations can affect gene expression.
|
978
|
Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 2010; 11:685-96. [PMID: 20847746 DOI: 10.1038/nrg2841] [Citation(s) in RCA: 778] [Impact Index Per Article: 51.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
|
979
|
Ding L, Wendl MC, Koboldt DC, Mardis ER. Analysis of next-generation genomic data in cancer: accomplishments and challenges. Hum Mol Genet 2010; 19:R188-96. [PMID: 20843826 DOI: 10.1093/hmg/ddq391] [Citation(s) in RCA: 95] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023] Open
Abstract
The application of next-generation sequencing technology has produced a transformation in cancer genomics, generating large data sets that can be analyzed in different ways to answer a multitude of questions about the genomic alterations associated with the disease. Analytical approaches can discover focused mutations such as substitutions and small insertion/deletions, large structural alterations and copy number events. As our capacity to produce such data for multiple cancers of the same type improves, so do the demands to analyze multiple tumor genomes simultaneously. For example, pathway-based analyses that provide the full mutational impact on cellular protein networks and correlation analyses aimed at revealing causal relationships between genomic alterations and clinical presentations are both enabled. As the repertoire of data grows to include mRNA-seq, non-coding RNA-seq and methylation for multiple genomes, our challenge will be to intelligently integrate data types and genomes to produce a coherent picture of the genetic basis of cancer.
Affiliation(s)
- Li Ding
- Department of Genetics, The Genome Center at Washington University School of Medicine, 4444 Forest Park Blvd., St Louis, MO 63108, USA
|
980
|
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010. [PMID: 20644199 DOI: 10.1101/gr.107524.110] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/14/2023]
Abstract
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
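The MapReduce pattern mentioned in this abstract, applied to the coverage calculator it cites as an example, can be sketched in a few lines. This is a generic illustration in plain Python and does not use or imitate the GATK API; the read intervals, function names, and region bounds are all hypothetical.

```python
from functools import reduce

# Hypothetical aligned reads as half-open (start, end) intervals on one contig.
reads = [(0, 5), (2, 7), (4, 9)]


def depth_at(locus, reads):
    """Map step: number of reads overlapping a single locus."""
    return sum(1 for start, end in reads if start <= locus < end)


def mean_coverage(reads, region_start, region_end):
    """Map every locus to its depth, then reduce the depths to a mean."""
    per_locus = map(lambda locus: depth_at(locus, reads),
                    range(region_start, region_end))
    total = reduce(lambda acc, depth: acc + depth, per_locus, 0)
    return total / (region_end - region_start)


print(mean_coverage(reads, 0, 10))
```

Because each locus is mapped independently and the reduction is a simple sum, the work could be split across regions and the partial sums combined, which is the kind of parallelization the abstract attributes to separating analysis logic from data management.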
Affiliation(s)
- Aaron McKenna
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
|
981
|
Margraf RL, Durtschi JD, Dames S, Pattison DC, Stephens JE, Mao R, Voelkerding KV. Multi-sample pooling and Illumina Genome Analyzer sequencing methods to determine gene sequence variation for database development. J Biomol Tech 2010; 21:126-140. [PMID: 20808642 PMCID: PMC2922832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
Determination of sequence variation within a genetic locus to develop clinically relevant databases is critical for molecular assay design and clinical test interpretation, so multisample pooling for Illumina Genome Analyzer (GA) sequencing was investigated using the RET proto-oncogene as a model. Samples were Sanger-sequenced for RET exons 10, 11, and 13-16. Ten samples with 13 known unique variants ("singleton variants" within the pool) and seven common changes were amplified and then equimolar-pooled before sequencing on a single flow cell lane, generating 36-base reads. For comparison, a single "control" sample was run in a different lane. After alignment, a 24-base quality score-screening threshold and 3′ read end trimming of three bases yielded low background error rates with a 27% decrease in aligned read coverage. Sequencing data were evaluated using an established variant detection method (percent variant reads), by the presented subtractive correction method, and with SNPSeeker software. In total, 41 variants (of which 23 were singleton variants) were detected in the 10-sample pool data, which included all Sanger-identified variants. The 23 singleton variants were detected near the expected 5% allele frequency (average 5.17% ± 0.90% variant reads), well above the highest background error (1.25%). Based on background error rates, read coverage, simulated 30, 40, and 50 sample pool data, expected singleton allele frequencies within pools, and variant detection methods, ≥30 samples (which demonstrated a minimum 1% variant reads for singletons) could be pooled to reliably detect singleton variants by GA sequencing.
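The "expected 5% allele frequency" for a singleton in a 10-sample pool follows from simple arithmetic, sketched here. The sketch assumes diploid samples, a heterozygous singleton, and perfectly equimolar pooling; the function name is hypothetical, and observed read fractions will scatter around these values.

```python
def expected_singleton_fraction(n_samples, ploidy=2):
    """Expected fraction of reads carrying a variant present on exactly one
    chromosome in an equimolar pool of n_samples individuals."""
    return 1.0 / (ploidy * n_samples)


# Pool sizes mentioned in the abstract: 10 sequenced, 30/40/50 simulated.
for n in (10, 30, 40, 50):
    print(n, f"{100 * expected_singleton_fraction(n):.2f}%")
```

Under these assumptions the expected fractions are 5.00%, 1.67%, 1.25%, and 1.00%, which is why background error rates become the limiting factor as pool size grows.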
Affiliation(s)
- Rebecca L Margraf
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, Utah 84108, USA.
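The expected singleton allele frequencies mentioned in the abstract above can be illustrated with simple arithmetic. The following is a minimal sketch, assuming each pooled sample is diploid, the singleton variant is heterozygous in exactly one sample, and pooling is perfectly equimolar; the function name `expected_singleton_fraction` is hypothetical and this is not the authors' subtractive correction method.

```python
def expected_singleton_fraction(n_samples: int, ploidy: int = 2) -> float:
    """Expected fraction of variant reads for a single heterozygous carrier.

    Assumption (not from the source): one variant allele among
    ploidy * n_samples alleles in an ideal equimolar pool.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return 1.0 / (ploidy * n_samples)


# Pool sizes discussed in the abstract: the real 10-sample pool and the
# simulated 30-, 40-, and 50-sample pools.
for n in (10, 30, 40, 50):
    print(f"{n:>2} samples: {expected_singleton_fraction(n):.2%} expected variant reads")
```

Under these assumptions a 10-sample pool gives 1/20 = 5%, which matches the expected 5% allele frequency quoted in the abstract, and a 50-sample pool gives 1/100 = 1%.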
|
983
|
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010; 20:1297-303. [PMID: 20644199 DOI: 10.1101/gr.107524.110] [Citation(s) in RCA: 18669] [Impact Index Per Article: 1244.6] [Indexed: 02/07/2023]
Abstract
Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Affiliation(s)
- Aaron McKenna
- Program in Medical and Population Genetics, The Broad Institute of Harvard and MIT, Cambridge, Massachusetts 02142, USA
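The MapReduce idea behind a coverage calculator, as described in the abstract above, can be sketched in a few lines. This is a toy illustration only: the GATK itself is a Java framework, and nothing below uses or mirrors its API. The `Read` tuple, `map_read`, `reduce_counts`, and `coverage` are hypothetical names, and reads are assumed to be ungapped intervals on a single reference sequence.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, Tuple

# Hypothetical toy read: (0-based start position, read length).
Read = Tuple[int, int]


def map_read(read: Read) -> Counter:
    """Map step: emit a count of 1 for every reference position the read covers."""
    start, length = read
    return Counter(range(start, start + length))


def reduce_counts(acc: Counter, part: Counter) -> Counter:
    """Reduce step: merge one read's counts into the running per-position depth."""
    acc.update(part)
    return acc


def coverage(reads: Iterable[Read]) -> Counter:
    """Per-position depth, computed as reduce(map(reads))."""
    return reduce(reduce_counts, map(map_read, reads), Counter())


reads = [(0, 4), (2, 4), (3, 2)]
depth = coverage(reads)
print(sorted(depth.items()))
```

Because the map step looks at one read at a time and the reduce step only merges counts, the per-read work could in principle be split across threads or machines, which is the property the abstract attributes to separating analysis calculations from data management.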
|
984
|
Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform 2010; 11:484-98. [PMID: 20519329 DOI: 10.1093/bib/bbq016] [Citation(s) in RCA: 101] [Impact Index Per Article: 6.7] [Indexed: 12/16/2022] Open
Abstract
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35-250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
Affiliation(s)
- Daniel C Koboldt
- The Genome Center at Washington University, St. Louis, Missouri 63108, USA.
|
985
|
Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010; 11:473-83. [PMID: 20460430 DOI: 10.1093/bib/bbq015] [Citation(s) in RCA: 419] [Impact Index Per Article: 27.9] [Indexed: 12/18/2022] Open
Abstract
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
Affiliation(s)
- Heng Li
- Broad Institute, Cambridge, MA 02142, USA.
|
986
|
Huss M. Introduction into the analysis of high-throughput-sequencing based epigenome data. Brief Bioinform 2010; 11:512-23. [PMID: 20457755 DOI: 10.1093/bib/bbq014] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Indexed: 12/15/2022] Open
Abstract
Sequencing-based approaches now allow high-resolution, genome-scale investigation of cellular epigenetic landscapes. For example, mapping of open chromatin regions, post-translational histone modifications and DNA methylation across a whole genome is now feasible, and new non-coding regulatory RNAs can be sensitively identified via RNA sequencing. The resulting large-scale data sets promise to contribute towards a more precise and complete understanding of gene regulation and to yield insights into the interplay between genomes and the environment. In this article, I review some of the conceptual issues and currently available software tools for the analysis of sequencing-based whole-genome epigenetics data.
|
987
|
Robison K. Application of second-generation sequencing to cancer genomics. Brief Bioinform 2010; 11:524-34. [DOI: 10.1093/bib/bbq013] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.6] [Indexed: 12/13/2022] Open
|
988
|
Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, Liu Y, Weinstock GM, Wheeler DA, Gibbs RA, Yu F. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome Res 2009; 20:273-80. [PMID: 20019143 DOI: 10.1101/gr.096388.109] [Citation(s) in RCA: 123] [Impact Index Per Article: 7.7] [Indexed: 11/24/2022]
Abstract
Accurate identification of genetic variants from next-generation sequencing (NGS) data is essential for immediate large-scale genomic endeavors such as the 1000 Genomes Project, and is crucial for further genetic analysis based on the discoveries. The key challenge in single nucleotide polymorphism (SNP) discovery is to distinguish true individual variants (occurring at a low frequency) from sequencing errors (often occurring at frequencies orders of magnitude higher). Therefore, knowledge of the error probabilities of base calls is essential. We have developed Atlas-SNP2, a computational tool that detects and accounts for systematic sequencing errors caused by context-related variables in a logistic regression model learned from training data sets. Subsequently, it estimates the posterior error probability for each substitution through a Bayesian formula that integrates prior knowledge of the overall sequencing error probability and the estimated SNP rate with the results from the logistic regression model for the given substitutions. The estimated posterior SNP probability can be used to distinguish true SNPs from sequencing errors. Validation results show that Atlas-SNP2 achieves a false-positive rate of lower than 10%, with an approximately 5% or lower false-negative rate.
Affiliation(s)
- Yufeng Shen
- The Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
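The Bayesian step described in the abstract above, which weighs a prior SNP rate against the chance that an observed substitution is a sequencing error, can be illustrated with the generic two-hypothesis form of Bayes' rule. This is a sketch under stated assumptions, not the Atlas-SNP2 formula: the function `posterior_snp`, its parameters, and the example numbers are all hypothetical, and the two likelihoods stand in for whatever an error model (such as a logistic regression) would supply.

```python
def posterior_snp(prior_snp: float, lik_given_snp: float, lik_given_error: float) -> float:
    """Posterior probability that an observed substitution is a true SNP.

    Generic Bayes' rule over two hypotheses (true SNP vs. sequencing error):
        P(SNP | data) = P(data | SNP) * P(SNP)
                        / (P(data | SNP) * P(SNP) + P(data | error) * (1 - P(SNP)))
    All inputs are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= prior_snp <= 1.0:
        raise ValueError("prior_snp must be a probability")
    numerator = lik_given_snp * prior_snp
    denominator = numerator + lik_given_error * (1.0 - prior_snp)
    if denominator == 0.0:
        raise ValueError("both hypotheses have zero probability for this data")
    return numerator / denominator


# Illustrative numbers only: a rare SNP prior and data that are far more
# likely under the SNP hypothesis than under the error hypothesis.
p = posterior_snp(prior_snp=0.001, lik_given_snp=0.9, lik_given_error=0.001)
print(f"posterior SNP probability: {p:.4f}")
```

The sketch shows why error probabilities matter: with a low prior SNP rate, even data that strongly favour a real variant can leave the posterior well below certainty unless the error likelihood is very small.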
|
989
|
Johnson ML, Lara N, Kamel MA. How genomics has informed our understanding of the pathogenesis of osteoporosis. Genome Med 2009; 1:84. [PMID: 19735586 PMCID: PMC2768991 DOI: 10.1186/gm84] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 12/12/2022] Open
Abstract
Osteoporosis is a skeletal disorder characterized by compromised bone strength that predisposes a person to an increased risk of fracture. Osteoporosis is a complex trait that involves multiple genes, environmental factors, and gene-gene and gene-environment interactions. Twin and family studies have indicated that between 25% and 85% of the variation in bone mass and other skeletal phenotypes is heritable, but our knowledge of the underlying genes is limited. Bone mineral density is the most common assessment for diagnosing osteoporosis and is the most often used quantitative value in the design of genetic studies. In recent years, our understanding of the pathophysiology of osteoporosis has been greatly facilitated by advances brought about by the Human Genome Project. Genetic approaches ranging from family studies of monogenic traits to association studies with candidate genes, to whole-genome scans in both humans and animals have identified a small number of genes that contribute to the heritability of bone mass. Studies with transgenic and knockout mouse models have revealed major new insights into the biology of many of these identified genes, but much more needs to be learned. Ultimately, we hope that by revealing the underlying genetics and biology driving the pathophysiology of osteoporosis, new and effective treatment can be developed to combat and possibly cure this devastating disease. Here we review the rapidly evolving field of the genomics of osteoporosis with a focus on important gene discoveries, new biological/physiological paradigms that are emerging, and many of the unanswered questions and hurdles yet to be overcome in the field.
Affiliation(s)
- Mark L Johnson
- Department of Oral Biology, University of Missouri - Kansas City School of Dentistry, 650 East 25th Street, Kansas City, MO 64108, USA.
|