Reference Citation Analysis

Cited by Other Article(s)

Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016;17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022]

Pal S, Aluru S. In search of perfect reads. BMC Bioinformatics 2015;16 Suppl 17:S7. [PMID: 26679555 PMCID: PMC4674851 DOI: 10.1186/1471-2105-16-s17-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/06/2023]

Abstract

BACKGROUND

Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources.

METHODS

We propose a different approach to handling erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are indeed error-free. The ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast method, based on k-spectrum analysis, to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency depending on the needs of the downstream application.
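
The abstract does not spell out the filter itself, so the following is only a minimal sketch of what a k-spectrum based read filter can look like, not the authors' method. It assumes a read is predicted perfect when every one of its k-mers occurs at least `min_count` times in the dataset. The function names, the toy reads, and the values of `k` and `min_count` are all hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def predict_perfect(reads, k, min_count):
    """Keep reads whose k-mers all occur at least min_count times.

    min_count plays the role of a stringency setting (an assumption of
    this sketch): raising it discards more reads.
    """
    spectrum = build_spectrum(reads, k)
    return [
        read for read in reads
        if len(read) >= k
        and all(spectrum[km] >= min_count for km in kmers(read, k))
    ]


# Toy data: three identical reads plus one read with a single substitution.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
kept = predict_perfect(reads, k=4, min_count=3)
print(kept)
```

With these toy reads, the substituted read introduces k-mers that occur only once, so it is the only read filtered out; lowering `min_count` to 1 keeps every read.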

RESULTS

Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with a precision of more than 90%. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found that the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: we show that if an error correction algorithm, Reptile, which relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead uses reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%.

CONCLUSIONS

Thanks to continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached a point where we can envision simply filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application, which in turn yields a new class of error-correcting algorithms as a by-product.

Saha S, Rajasekaran S. EC: an efficient error correction algorithm for short reads. BMC Bioinformatics 2015;16 Suppl 17:S2. [PMID: 26678663 PMCID: PMC4674864 DOI: 10.1186/1471-2105-16-s17-s2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 08/30/2023]

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015;4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022]

Outbred genome sequencing and CRISPR/Cas9 gene editing in butterflies. Nat Commun 2015;6:8212. [PMID: 26354079 PMCID: PMC4568561 DOI: 10.1038/ncomms9212] [Citation(s) in RCA: 125] [Impact Index Per Article: 12.5] [Received: 12/25/2014] [Accepted: 07/29/2015] [Indexed: 12/22/2022]

Iyer S, Casey E, Bouzek H, Kim M, Deng W, Larsen BB, Zhao H, Bumgarner RE, Rolland M, Mullins JI. Comparison of Major and Minor Viral SNPs Identified through Single Template Sequencing and Pyrosequencing in Acute HIV-1 Infection. PLoS One 2015;10:e0135903. [PMID: 26317928 PMCID: PMC4552882 DOI: 10.1371/journal.pone.0135903] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 10/15/2014] [Accepted: 07/27/2015] [Indexed: 01/03/2023]

Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015;31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022]

Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015;15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Received: 05/27/2014] [Indexed: 02/02/2023]

Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015;10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022]

Abstract

Motivation

Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads while preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.
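
The claim that 100× coverage at a 1% error rate gives, on average, one erroneous base call per genome position is simple arithmetic, sketched below. The second function adds a figure not stated in the abstract: the chance that at least one covering read is wrong at a given position, under the simplifying assumptions (ours, not the authors') that errors are independent and that exactly `coverage` reads cover the position. Function names are hypothetical.

```python
def expected_errors_per_position(coverage, error_rate):
    """Expected number of erroneous base calls covering one genome position."""
    return coverage * error_rate


def prob_at_least_one_error(coverage, error_rate):
    """Probability that at least one covering base call is wrong.

    Assumes independent errors and exactly `coverage` reads at the position.
    """
    return 1.0 - (1.0 - error_rate) ** coverage


# 100x coverage with the advertised 1% error rate.
print(expected_errors_per_position(100, 0.01))
print(round(prob_at_least_one_error(100, 0.01), 3))
```

Under these assumptions the expectation is one error per position, while roughly 63% of positions carry at least one erroneous call.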

Results

We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared with using the original reads with SOAPdenovo for the data sets investigated.
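
For readers unfamiliar with the N50 statistic used above, the following sketch computes it under the standard definition: the largest length L such that contigs of length at least L together contain at least half of the assembled bases. The contig lengths are made-up illustrative numbers, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Walks the contigs from longest to shortest and returns the length at
    which the running total first reaches half of all assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical assemblies of the same 300 bases, before and after correction.
before = [100, 80, 60, 40, 20]
after = [200, 60, 40]
print(n50(before), n50(after), n50(after) / n50(before))
```

Because both toy assemblies contain the same number of bases, the ratio of the two N50 values is a direct measure of the gain in contiguity, which is how an improvement factor like the one reported would be read.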

Availability

QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Contact

gmarcais@umd.edu.

Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors using K-mer tries. Bioinformatics 2015;31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022]

Craveiro SR, Inglis PW, Togawa RC, Grynberg P, Melo FL, Ribeiro ZMA, Ribeiro BM, Báo SN, Castro MEB. The genome sequence of Pseudoplusia includens single nucleopolyhedrovirus and an analysis of p26 gene evolution in the baculoviruses. BMC Genomics 2015;16:127. [PMID: 25765042 PMCID: PMC4346127 DOI: 10.1186/s12864-015-1323-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 08/01/2014] [Accepted: 02/04/2015] [Indexed: 11/10/2022]

Abstract

BACKGROUND

Pseudoplusia includens single nucleopolyhedrovirus (PsinSNPV-IE) is a baculovirus recently identified in our laboratory, with high pathogenicity to the soybean looper, Chrysodeixis includens (Lepidoptera: Noctuidae) (Walker, 1858). In Brazil, the C. includens caterpillar is an emerging pest and has caused significant losses in soybean and cotton crops. The PsinSNPV genome was determined and the phylogeny of the p26 gene within the family Baculoviridae was investigated.

RESULTS

The complete genome of PsinSNPV was sequenced (Roche 454 GS FLX Titanium platform), annotated and compared with other Alphabaculoviruses, displaying a genome apparently different from other baculoviruses so far sequenced. The circular double-stranded DNA genome is 139,132 bp in length, with a GC content of 39.3%, and contains 141 open reading frames (ORFs). PsinSNPV possesses the 37 conserved baculovirus core genes, 102 genes found in other baculoviruses and 2 unique ORFs. Two baculovirus repeat ORF (bro) homologs, bro-a (Psin33) and bro-b (Psin69), were identified and compared with Chrysodeixis chalcites nucleopolyhedrovirus (ChchNPV) and Trichoplusia ni single nucleopolyhedrovirus (TnSNPV) bro genes and showed high similarity, suggesting that these genes may be derived from an ancestor common to these viruses. The homologous repeats (hrs) are absent from the PsinSNPV genome, which is also the case in ChchNPV and TnSNPV. Two p26 gene homologs (p26a and p26b) were found in the PsinSNPV genome. P26 is thought to be required for optimal virion occlusion in the occlusion bodies (OBs), but its function is not well characterized. The P26 phylogenetic tree suggests that this gene was obtained from three independent acquisition events within the Baculoviridae family. The presence of a signal peptide only in the PsinSNPV p26a/ORF-20 homolog indicates distinct functions for the two P26 proteins.

CONCLUSIONS

PsinSNPV has a genomic sequence apparently different from other baculoviruses sequenced so far. The complete genome sequence of PsinSNPV will provide a valuable resource, contributing to studies on its molecular biology and functional genomics, and will promote the development of this virus as an effective bioinsecticide.

Salehi F, Baronio R, Idrogo-Lam R, Vu H, Hall LV, Kaiser P, Lathrop RH. CHOPER filters enable rare mutation detection in complex mutagenesis populations by next-generation sequencing. PLoS One 2015;10:e0116877. [PMID: 25692681 PMCID: PMC4333345 DOI: 10.1371/journal.pone.0116877] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/04/2014] [Accepted: 12/08/2014] [Indexed: 01/12/2023]

Schulz MH, Weese D, Holtgrewe M, Dimitrova V, Niu S, Reinert K, Richard H. Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 2015;30:i356-63. [PMID: 25161220 PMCID: PMC4147893 DOI: 10.1093/bioinformatics/btu440] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022]

Affiliation(s)

Marcel H Schulz, David Weese, Manuel Holtgrewe, Viktoria Dimitrova, Sijia Niu, Knut Reinert, Hugues Richard
'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France

Kopylova E, Noé L, Da Silva C, Berthelot JF, Alberti A, Aury JM, Touzet H. Deciphering metatranscriptomic data. Methods Mol Biol 2015;1269:279-91. [PMID: 25577385 DOI: 10.1007/978-1-4939-2291-8_17] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/24/2022]

Rare biosphere exploration using high-throughput sequencing: research progress and perspectives. Conserv Genet 2014. [DOI: 10.1007/s10592-014-0678-9] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/27/2022]

Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J, Harmsen D. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS One 2014;9:e107014. [PMID: 25198770 PMCID: PMC4157817 DOI: 10.1371/journal.pone.0107014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 04/30/2014] [Accepted: 08/07/2014] [Indexed: 12/28/2022]

Abstract

De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast-evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria such as computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage, where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also in their requirements for computing power. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determine which assembler suits best.
The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.

Ahola V, Lehtonen R, Somervuo P, Salmela L, Koskinen P, Rastas P, Välimäki N, Paulin L, Kvist J, Wahlberg N, Tanskanen J, Hornett EA, Ferguson LC, Luo S, Cao Z, de Jong MA, Duplouy A, Smolander OP, Vogel H, McCoy RC, Qian K, Chong WS, Zhang Q, Ahmad F, Haukka JK, Joshi A, Salojärvi J, Wheat CW, Grosse-Wilde E, Hughes D, Katainen R, Pitkänen E, Ylinen J, Waterhouse RM, Turunen M, Vähärautio A, Ojanen SP, Schulman AH, Taipale M, Lawson D, Ukkonen E, Mäkinen V, Goldsmith MR, Holm L, Auvinen P, Frilander MJ, Hanski I. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nat Commun 2014;5:4737. [PMID: 25189940 PMCID: PMC4164777 DOI: 10.1038/ncomms5737] [Citation(s) in RCA: 158] [Impact Index Per Article: 14.4] [Received: 04/11/2014] [Accepted: 07/17/2014] [Indexed: 12/30/2022]

Affiliation(s)

Virpi Ahola [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2]
Rainer Lehtonen [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [3] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland [4] Center of Excellence in Cancer Genetics, University of Helsinki, FI-00014 Helsinki, Finland [5] [6]
Panu Somervuo [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
Leena Salmela Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
Patrik Koskinen [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Pasi Rastas Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
Niko Välimäki [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
Lars Paulin Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Jouni Kvist Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Niklas Wahlberg Department of Biology, University of Turku, FI-20014 Turku, Finland
Jaakko Tanskanen [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
Emily A Hornett [1] Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK [2] Department of Biology, Pennsylvania State University, Pennsylvania 16802, USA
Laura C Ferguson Department of Zoology, University of Oxford, Oxford OX1 3PS, UK
Shiqi Luo College of Life Sciences, Peking University, Beijing 100871, P.R. China
Zijuan Cao College of Life Sciences, Peking University, Beijing 100871, P.R. China
Maaike A de Jong [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] School of Biological Sciences, University of Bristol, Bristol BS8 1UG, UK
Anne Duplouy Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
Olli-Pekka Smolander Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Heiko Vogel Department of Entomology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
Rajiv C McCoy Department of Biology, Stanford University, Stanford, California 94305, USA
Kui Qian Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Wong Swee Chong Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
Qin Zhang BioMediTech, University of Tampere, FI-33520 Tampere, Finland
Freed Ahmad Department of Information Technology, University of Turku, FI-20014 Turku, Finland
Jani K Haukka BioMediTech, University of Tampere, FI-33520 Tampere, Finland
Aruj Joshi BioMediTech, University of Tampere, FI-33520 Tampere, Finland
Jarkko Salojärvi Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
Christopher W Wheat Department of Zoology, Stockholm University, SE-10691 Stockholm, Sweden
Ewald Grosse-Wilde Department of Evolutionary Neuroethology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
Daniel Hughes [1] European Bioinformatics Institute, Hinxton CB10 1SD, UK [2] Baylor College of Medicine, Human Genome Sequencing Center, Houston, Texas 77030-3411, USA
Riku Katainen [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
Esa Pitkänen [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
Johannes Ylinen Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
Robert M Waterhouse [1] Department of Genetic Medicine and Development, University of Geneva Medical School & Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland [2] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA [3] The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
Mikko Turunen Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland
Anna Vähärautio [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Department of Pathology, University of Helsinki, FI-00014 Helsinki, Finland [3] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
Sami P Ojanen Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
Alan H Schulman [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
Minna Taipale [1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
Daniel Lawson European Bioinformatics Institute, Hinxton CB10 1SD, UK
Esko Ukkonen Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
Veli Mäkinen Department of Computer Science & Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
Marian R Goldsmith Department of Biological Sciences, University of Rhode Island, Kingston, Rhode Island 02881-0816, USA
Liisa Holm [1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
Petri Auvinen [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
Mikko J Frilander [1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
Ilkka Hanski Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland

Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014;16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022]

Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics 2014;30:3506-14. [PMID: 25165095 PMCID: PMC4253826 DOI: 10.1093/bioinformatics/btu538] [Citation(s) in RCA: 476] [Impact Index Per Article: 43.3] [Indexed: 11/12/2022]

Lim EC, Müller J, Hagmann J, Henz SR, Kim ST, Weigel D. Trowel: a fast and accurate error correction module for Illumina sequencing reads. Bioinformatics 2014;30:3264-5. [PMID: 25075116 DOI: 10.1093/bioinformatics/btu513] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022]

Greenfield P, Duesing K, Papanicolaou A, Bauer DC. Blue: correcting sequencing errors using consensus and context. Bioinformatics 2014;30:2723-32. [PMID: 24919879 DOI: 10.1093/bioinformatics/btu368] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Indexed: 11/14/2022]

Abstract

MOTIVATION

Bioinformatics tools, such as assemblers and aligners, are expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of stand-alone tools whose sole purpose is to detect and remove sequencing errors. A good error-correcting tool would be a transparent component in a bioinformatics pipeline, simply taking sequence data in any of the standard formats and producing a higher quality version of the same data containing far fewer errors. It should not only be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), but it has to be both fast enough and scalable enough to be usable on the large datasets being produced by current sequencing technologies, and work on data derived from both haploid and diploid organisms.

RESULTS

This article presents Blue, an error-correction algorithm based on k-mer consensus and context. Blue can correct substitution, deletion and insertion errors, as well as uncalled bases. It accepts both FASTQ and FASTA formats, and corrects quality scores for corrected bases. Blue also maintains the pairing of reads, both within a file and between pairs of files, making it compatible with downstream tools that depend on read pairing. Blue is memory efficient, scalable and faster than other published tools, and usable on large sequencing datasets. On the tests undertaken, Blue also proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors. One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as a small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes.
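The decoupling of the consensus table from the reads being corrected can be illustrated with a small sketch. This is not Blue's algorithm, which the abstract does not spell out; it is a deliberately naive k-mer spectrum corrector that handles single substitutions only. The function names, the toy reads, and the values of `k` and `min_count` are all assumptions made for illustration.

```python
from collections import Counter

def kmer_table(reads, k):
    """Count every k-mer seen in the reads that supply the consensus."""
    table = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            table[read[i:i + k]] += 1
    return table

def weak_kmers(read, table, k, min_count):
    """Return start offsets of k-mers whose count falls below min_count."""
    return [i for i in range(len(read) - k + 1)
            if table[read[i:i + k]] < min_count]

def correct_substitution(read, table, k, min_count):
    """Try single-base substitutions; return the first variant with no weak k-mers."""
    if not weak_kmers(read, table, k, min_count):
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if not weak_kmers(candidate, table, k, min_count):
                return candidate
    return read  # left unchanged when no single substitution repairs it

# The consensus table is built from one dataset (hypothetical short reads) ...
consensus_reads = ["ACGTACGGTCA", "ACGTACGGTCA", "CGTACGGTCAT", "CGTACGGTCAT"]
table = kmer_table(consensus_reads, k=5)

# ... and applied to a read from a different dataset (hypothetical longer read
# carrying one substituted base).
long_read = "ACGTACGCTCAT"
fixed = correct_substitution(long_read, table, k=5, min_count=2)
```

Because `kmer_table` and `correct_substitution` share nothing but the table, the reads that build the consensus and the reads that get corrected can come from different sequencing runs, which is the property the abstract highlights.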

AVAILABILITY AND IMPLEMENTATION

The code for Blue and its related tools is available from http://www.bioinformatics.csiro.au/Blue. These programs are written in C# and run natively under Windows and under Mono on Linux.


Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. FRONTIERS IN PLANT SCIENCE 2014;5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]

Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014;15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023] Open

Alkio M, Jonas U, Declercq M, Van Nocker S, Knoche M. Transcriptional dynamics of the developing sweet cherry (Prunus avium L.) fruit: sequencing, annotation and expression profiling of exocarp-associated genes. HORTICULTURE RESEARCH 2014;1:11. [PMID: 26504533 PMCID: PMC4591669 DOI: 10.1038/hortres.2014.11] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/20/2013] [Accepted: 01/17/2014] [Indexed: 05/24/2023]

Mbandi SK, Hesse U, Rees DJG, Christoffels A. A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads. Front Genet 2014;5:17. [PMID: 24575122 PMCID: PMC3921913 DOI: 10.3389/fgene.2014.00017] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2013] [Accepted: 01/19/2014] [Indexed: 11/13/2022] Open

Istvánek J, Jaros M, Krenek A, Řepková J. Genome assembly and annotation for red clover (Trifolium pratense; Fabaceae). AMERICAN JOURNAL OF BOTANY 2014;101:327-37. [PMID: 24500806 DOI: 10.3732/ajb.1300340] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]

Heo Y, Wu XL, Chen D, Ma J, Hwu WM. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 2014;30:1354-62. [PMID: 24451628 DOI: 10.1093/bioinformatics/btu030] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]

El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]

El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013;9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open

Ferrarini M, Moretto M, Ward JA, Šurbanovski N, Stevanović V, Giongo L, Viola R, Cavalieri D, Velasco R, Cestaro A, Sargent DJ. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics 2013;14:670. [PMID: 24083400 PMCID: PMC3853357 DOI: 10.1186/1471-2164-14-670] [Citation(s) in RCA: 112] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2013] [Accepted: 09/26/2013] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome.

RESULTS

Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from the Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously.

CONCLUSIONS

This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.


Iyer S, Bouzek H, Deng W, Larsen B, Casey E, Mullins JI. Quality score based identification and correction of pyrosequencing errors. PLoS One 2013;8:e73015. [PMID: 24039850 PMCID: PMC3764156 DOI: 10.1371/journal.pone.0073015] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2012] [Accepted: 07/22/2013] [Indexed: 12/26/2022] Open

Matullo G, Di Gaetano C, Guarrera S. Next generation sequencing and rare genetic variants: from human population studies to medical genetics. ENVIRONMENTAL AND MOLECULAR MUTAGENESIS 2013;54:518-532. [PMID: 23922201 DOI: 10.1002/em.21799] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/20/2013] [Revised: 05/31/2013] [Accepted: 06/09/2013] [Indexed: 06/02/2023]

Deng W, Maust BS, Westfall DH, Chen L, Zhao H, Larsen BB, Iyer S, Liu Y, Mullins JI. Indel and Carryforward Correction (ICC): a new analysis approach for processing 454 pyrosequencing data. Bioinformatics 2013;29:2402-9. [PMID: 23900188 DOI: 10.1093/bioinformatics/btt434] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]

Ilie L, Molnar M. RACER: Rapid and accurate correction of errors in reads. Bioinformatics 2013;29:2490-3. [PMID: 23853064 DOI: 10.1093/bioinformatics/btt407] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Le HS, Schulz MH, McCauley BM, Hinman VF, Bar-Joseph Z. Probabilistic error correction for RNA sequencing. Nucleic Acids Res 2013;41:e109. [PMID: 23558750 PMCID: PMC3664804 DOI: 10.1093/nar/gkt215] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet 2013;14:157-67. [PMID: 23358380 DOI: 10.1038/nrg3367] [Citation(s) in RCA: 262] [Impact Index Per Article: 21.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 2012. [PMID: 23202746 DOI: 10.1093/bioinformatics/bts690] [Citation(s) in RCA: 180] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]

Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH. Megraft: a software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Res Microbiol 2012;163:407-12. [PMID: 22824070 DOI: 10.1016/j.resmic.2012.07.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2012] [Accepted: 06/26/2012] [Indexed: 12/31/2022]

Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, Yokosawa J, Zelikovsky A, Khudyakov Y. Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinformatics 2012;13 Suppl 10:S6. [PMID: 22759430 PMCID: PMC3382444 DOI: 10.1186/1471-2105-13-s10-s6] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open

Abstract

BACKGROUND

Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.

RESULTS

In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
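The general idea behind frequency thresholding of haplotypes can be sketched as follows. The abstract names the ET method but does not describe how its threshold is derived, so this is only a generic illustration under assumed inputs: the function name, the fixed `min_count` cutoff, and the toy amplicon reads are hypothetical and do not reproduce the published algorithm.

```python
from collections import Counter

def threshold_haplotypes(reads, min_count):
    """Keep distinct sequences seen at least min_count times and renormalise."""
    counts = Counter(reads)
    kept = {seq: n for seq, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    # Frequencies are recomputed over the surviving haplotypes only.
    return {seq: n / total for seq, n in kept.items()}

# Toy amplicon reads: two well-supported haplotypes and one singleton,
# which this sketch treats as a likely sequencing error.
reads = ["ACGT"] * 6 + ["ACGA"] * 3 + ["TCGT"]
freqs = threshold_haplotypes(reads, min_count=2)
```

A fixed count cutoff is the simplest possible choice; a calibrated method would presumably set the cutoff from the observed error profile of the platform and amplicon.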

CONCLUSIONS

Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.


Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 2012;7:e37135. [PMID: 22675423 PMCID: PMC3365034 DOI: 10.1371/journal.pone.0037135] [Citation(s) in RCA: 2086] [Impact Index Per Article: 160.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2012] [Accepted: 04/13/2012] [Indexed: 12/14/2022] Open

Abstract

The ability to efficiently and accurately determine genotypes is a keystone technology in modern genetics, crucial to studies ranging from clinical diagnostics, to genotype-phenotype association, to reconstruction of ancestry and the detection of selection. To date, high capacity, low cost genotyping has been largely achieved via “SNP chip” microarray-based platforms which require substantial prior knowledge of both genome sequence and variability, and once designed are suitable only for those targeted variable nucleotide sites. This method introduces substantial ascertainment bias and inherently precludes detection of rare or population-specific variants, a major source of information for both population history and genotype-phenotype association. Recent developments in reduced-representation genome sequencing experiments on massively parallel sequencers (commonly referred to as RAD-tag or RADseq) have brought direct sequencing to the problem of population genotyping, but increased cost and procedural and analytical complexity have limited their widespread adoption. Here, we describe a complete laboratory protocol, including a custom combinatorial indexing method, and accompanying software tools to facilitate genotyping across large numbers (hundreds or more) of individuals for a range of markers (hundreds to hundreds of thousands). Our method requires no prior genomic knowledge and achieves per-site and per-individual costs below that of current SNP chip technology, while requiring similar hands-on time investment, comparable amounts of input DNA, and downstream analysis times on the order of hours. Finally, we provide empirical results from the application of this method to both genotyping in a laboratory cross and in wild populations. Because of its flexibility, this modified RADseq approach promises to be applicable to a diversity of biological questions in a wide range of organisms.


Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012;14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Archer J, Baillie G, Watson SJ, Kellam P, Rambaut A, Robertson DL. Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II. BMC Bioinformatics 2012;13:47. [PMID: 22443413 PMCID: PMC3359224 DOI: 10.1186/1471-2105-13-47] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2011] [Accepted: 03/23/2012] [Indexed: 01/23/2023] Open

Abstract

Background

Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified.

Results

Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches, median error rates of 0.10 and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions, median error rates within the 454 data (0.3 and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004 and 0.006%, respectively). In agreement with previous observations, these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions, both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants.

Conclusion

We have demonstrated, using Segminator II, that it is possible to distinguish platform specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.


Studholme DJ. Deep sequencing of small RNAs in plants: applied bioinformatics. Brief Funct Genomics 2011;11:71-85. [PMID: 22184332 DOI: 10.1093/bfgp/elr039] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Ranwez V, Harispe S, Delsuc F, Douzery EJP. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS One 2011;6:e22594. [PMID: 21949676 PMCID: PMC3174933 DOI: 10.1371/journal.pone.0022594] [Citation(s) in RCA: 441] [Impact Index Per Article: 31.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2011] [Accepted: 06/29/2011] [Indexed: 01/15/2023] Open

Philippe N, Salson M, Lecroq T, Léonard M, Commes T, Rivals E. Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 2011;12:242. [PMID: 21682852 PMCID: PMC3163563 DOI: 10.1186/1471-2105-12-242] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2010] [Accepted: 06/17/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput calls for new algorithmic solutions to query large read collections efficiently.

RESULTS

Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
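The kind of query quoted above can be illustrated with a plain hash-based inverted index. This is not the Gk arrays structure, whose layout the abstract does not describe and whose stated advantage is a much smaller memory footprint; the sketch only shows what the query returns. The function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to {read id: number of occurrences in that read}."""
    index = defaultdict(dict)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            index[kmer][read_id] = index[kmer].get(read_id, 0) + 1
    return index

def reads_containing(index, kmer, exactly_once=False):
    """Return sorted ids of reads containing kmer (at least once, or exactly once)."""
    hits = index.get(kmer, {})
    if exactly_once:
        return sorted(r for r, n in hits.items() if n == 1)
    return sorted(hits)

# Toy collection: read 0 holds ACGT twice, read 1 once, read 2 not at all.
reads = ["ACGTACGT", "TTACGTTT", "GGGGCCCC"]
index = build_kmer_index(reads, k=4)
at_least_once = reads_containing(index, "ACGT")
exactly_once = reads_containing(index, "ACGT", exactly_once=True)
```

A dictionary of dictionaries like this stores every k-mer string explicitly, so its memory use grows quickly with the number of reads, which is the cost a dedicated read-indexing structure is meant to avoid.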

CONCLUSIONS

Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.


Ranwez V, Harispe S, Delsuc F, Douzery EJP. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS One 2011. [PMID: 21949676 DOI: 10.1371/journal.pone.0022594] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/24/2023] Open