151. Horn F, Linde J, Mattern DJ, Walther G, Guthke R, Brakhage AA, Valiante V. Draft Genome Sequence of the Fungus Penicillium brasilianum MG11. Genome Announcements 2015; 3:e00724-15. [PMID: 26337871 PMCID: PMC4559720 DOI: 10.1128/genomea.00724-15] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 05/26/2015] [Accepted: 07/24/2015] [Indexed: 02/02/2023]
Abstract
The genus Penicillium belongs to the phylum Ascomycota and includes a variety of fungal species important for food and drug production. We report the draft genome sequence of Penicillium brasilianum MG11. This strain was isolated from soil, and it was reported to produce different secondary metabolites.
Affiliation(s)
- Fabian Horn
  - Department of Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
- Jörg Linde
  - Department of Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
- Derek J Mattern
  - Department of Molecular and Applied Microbiology, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
- Grit Walther
  - National Center for Invasive Mycoses, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
- Reinhard Guthke
  - Department of Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
- Axel A Brakhage
  - Department of Molecular and Applied Microbiology, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
  - Friedrich Schiller University, Institute for Microbiology, Jena, Germany
- Vito Valiante
  - Department of Molecular and Applied Microbiology, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
  - Leibniz Junior Research Group-Biobricks of Microbial Natural Product Syntheses, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI), Jena, Germany
152. Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022]
153. Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Received: 05/27/2014] [Indexed: 02/02/2023]
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
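The two-filter idea in this abstract can be illustrated with a minimal Python sketch. It is not Lighter's implementation: the `BloomFilter` class, the SHA-256-based hashing, the filter size, and the rule for promoting a k-mer to the second filter are all simplifying assumptions made for illustration (the abstract does not specify how k-mers are judged "likely to be correct").

```python
import hashlib
import random


class BloomFilter:
    """Tiny Bloom filter backed by a bytearray (illustrative, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(read, k):
    """Yield every k-mer of a read, left to right."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def build_filters(reads, k, alpha, seed=0):
    """Return (sample, trusted) Bloom filters; alpha is the sampling fraction."""
    rng = random.Random(seed)
    sample, trusted = BloomFilter(), BloomFilter()
    # Pass 1: keep each k-mer occurrence with probability alpha, no counting.
    for read in reads:
        for km in kmers(read, k):
            if rng.random() < alpha:
                sample.add(km)
    # Pass 2 (assumed simplification): a k-mer present in the sample is
    # treated as likely correct and stored in the second filter.
    for read in reads:
        for km in kmers(read, k):
            if km in sample:
                trusted.add(km)
    return sample, trusted
```

Because frequent k-mers have many chances to be sampled and rare (often erroneous) k-mers have few, lowering `alpha` as sequencing depth grows keeps the number of stored k-mers, and hence the filter size, roughly constant, which is the property the abstract describes.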
154. Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015; 10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022]
Abstract
Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available. Results: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu. Contact: gmarcais@umd.edu.
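The claim that 100× coverage at a 1% error rate gives, on average, one erroneous base call per genome position is simple arithmetic. The short Python sketch below works it through; the function names are hypothetical, and the second function additionally assumes that errors occur independently across reads, which the abstract does not state.

```python
def expected_errors_per_position(coverage, error_rate):
    """Mean number of erroneous base calls covering one genome position."""
    return coverage * error_rate


def prob_at_least_one_error(coverage, error_rate):
    """Chance that at least one covering read is wrong at a position,
    assuming independent errors (an assumption of this sketch)."""
    return 1.0 - (1.0 - error_rate) ** coverage


mean_errors = expected_errors_per_position(100, 0.01)
p_any_error = prob_at_least_one_error(100, 0.01)
```

With these inputs the mean is one error per position, while under the independence assumption a little under two thirds of positions would carry at least one erroneous call.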
Affiliation(s)
- Guillaume Marçais
  - IPST, University of Maryland, College Park, MD, USA
- Aleksey Zimin
  - IPST, University of Maryland, College Park, MD, USA
155. Sahl JW, Schupp JM, Rasko DA, Colman RE, Foster JT, Keim P. Phylogenetically typing bacterial strains from partial SNP genotypes observed from direct sequencing of clinical specimen metagenomic data. Genome Med 2015; 7:52. [PMID: 26136847 PMCID: PMC4487561 DOI: 10.1186/s13073-015-0176-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 02/19/2015] [Accepted: 05/15/2015] [Indexed: 12/30/2022]
Abstract
We describe an approach for genotyping bacterial strains from low coverage genome datasets, including metagenomic data from complex samples. Sequence reads from unknown samples are aligned to a reference genome where the allele states of known SNPs are determined. The Whole Genome Focused Array SNP Typing (WG-FAST) pipeline can identify unknown strains with much less read data than is needed for genome assembly. To test WG-FAST, we resampled SNPs from real samples to understand the relationship between low coverage metagenomic data and accurate phylogenetic placement. WG-FAST can be downloaded from https://github.com/jasonsahl/wgfast.
Affiliation(s)
- Jason W. Sahl
  - Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
  - Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
- James M. Schupp
  - Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
- David A. Rasko
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD USA
- Rebecca E. Colman
  - Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
- Jeffrey T. Foster
  - Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
  - Current address: Department of Molecular, Cellular & Biomedical Sciences, University of New Hampshire, Durham, NH USA
- Paul Keim
  - Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
  - Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
156. Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors using K-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022]
157.
Abstract
BFC is a free, fast and easy-to-use sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm but still maintains a speed comparable to implementations based on greedy methods. In evaluations on real data, BFC appears to correct more errors with fewer overcorrections in comparison to existing tools. It particularly does well in suppressing systematic sequencing errors, which helps to improve the base accuracy of de novo assemblies. AVAILABILITY AND IMPLEMENTATION: https://github.com/lh3/bfc CONTACT: hengli@broadinstitute.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Heng Li
  - Medical Population Genetics Program, Broad Institute, Cambridge, MA 02142, USA
158. Insights from the metagenome of an acid salt lake: the role of biology in an extreme depositional environment. PLoS One 2015; 10:e0122869. [PMID: 25923206 PMCID: PMC4414474 DOI: 10.1371/journal.pone.0122869] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 09/12/2014] [Accepted: 02/24/2015] [Indexed: 12/31/2022]
Abstract
The extremely acidic brine lakes of the Yilgarn Craton of Western Australia are home to some of the most biologically challenging waters on Earth. In this study, we employed metagenomic shotgun sequencing to generate a microbial profile of the depositional environment associated with the sulfur-rich sediments of one such lake. Of the 1.5 M high-quality reads generated, 0.25 M were mapped to protein features, which in turn provide new insights into the metabolic function of this community. In particular, 45 diverse genes associated with sulfur metabolism were identified, the majority of which were linked to either the conversion of sulfate to adenylylsulfate and the subsequent production of sulfide from sulfite or the oxidation of sulfide, elemental sulfur, and thiosulfate via the sulfur oxidation (Sox) system. This is the first metagenomic study of an acidic, hypersaline depositional environment, and we present evidence for a surprisingly high level of microbial diversity. Our findings also illuminate the possibility that we may be meaningfully underestimating the effects of biology on the chemistry of these sulfur-rich sediments, thereby influencing our understanding of past geobiological conditions that may have been present on Earth as well as early Mars.
159. Horn F, Üzüm Z, Möbius N, Guthke R, Linde J, Hertweck C. Draft Genome Sequences of Symbiotic and Nonsymbiotic Rhizopus microsporus Strains CBS 344.29 and ATCC 62417. Genome Announcements 2015; 3:e01370-14. [PMID: 25614557 PMCID: PMC4319578 DOI: 10.1128/genomea.01370-14] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 11/18/2014] [Accepted: 12/09/2014] [Indexed: 12/13/2022]
Abstract
Specific Rhizopus microsporus pathovars harbor bacterial endosymbionts (Burkholderia rhizoxinica) for the production of a phytotoxin. Here, we present the draft genome sequences of two R. microsporus strains, one symbiotic (ATCC 62417), and one endosymbiont-free (CBS 344.29). The gene predictions were supported by RNA sequencing (RNA-seq) data. The functional annotation sets the basis for comparative analyses.
Affiliation(s)
- Fabian Horn
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Zerrin Üzüm
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Nadine Möbius
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Reinhard Guthke
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Jörg Linde
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Christian Hertweck
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
160. Horn F, Habel A, Scharf DH, Dworschak J, Brakhage AA, Guthke R, Hertweck C, Linde J. Draft Genome Sequence and Gene Annotation of the Entomopathogenic Fungus Verticillium hemipterigenum. Genome Announcements 2015; 3:e01439-14. [PMID: 25614560 PMCID: PMC4319583 DOI: 10.1128/genomea.01439-14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2014] [Accepted: 12/09/2014] [Indexed: 11/20/2022]
Abstract
Verticillium hemipterigenum (anamorph Torrubiella hemipterigena) is an entomopathogenic fungus and produces a broad range of secondary metabolites. Here, we present the draft genome sequence of the fungus, including gene structure and functional annotation. Genes were predicted incorporating RNA-Seq data and functionally annotated to provide the basis for further genome studies.
Affiliation(s)
- Fabian Horn
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Andreas Habel
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Daniel H Scharf
  - Molecular and Applied Microbiology, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Jan Dworschak
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Axel A Brakhage
  - Molecular and Applied Microbiology, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Reinhard Guthke
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Christian Hertweck
  - Biomolecular Chemistry, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
- Jörg Linde
  - Systems Biology/Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
161. Marinier E, Brown DG, McConkey BJ. Pollux: platform independent error correction of single and mixed genomes. BMC Bioinformatics 2015; 16:10. [PMID: 25592313 PMCID: PMC4307147 DOI: 10.1186/s12859-014-0435-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 08/21/2014] [Accepted: 12/17/2014] [Indexed: 12/13/2022]
Abstract
Background: Second-generation sequencers generate millions of relatively short, but error-prone, reads. These errors make sequence assembly and other downstream projects more challenging. Correcting these errors improves the quality of assemblies and projects which benefit from error-free reads. Results: We have developed a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, we locate and correct insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, we correct 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, and 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times rarer than successfully corrected errors. Furthermore, we show that the quality of assemblies improves when reads are corrected by our software. Conclusions: Pollux is highly effective at correcting errors across platforms, and consistently performs as well as or better than currently available error correction software. Pollux provides general-purpose error correction and may be used in applications with or without assembly.
Affiliation(s)
- Eric Marinier
  - David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
- Daniel G Brown
  - David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada.
- Brendan J McConkey
  - Department of Biology, University of Waterloo, 200 University Ave W, N2L3G1 Waterloo, Canada.
162. Addisalem AB, Esselink GD, Bongers F, Smulders MJM. Genomic sequencing and microsatellite marker development for Boswellia papyrifera, an economically important but threatened tree native to dry tropical forests. AoB Plants 2015; 7:plu086. [PMID: 25573702 PMCID: PMC4433549 DOI: 10.1093/aobpla/plu086] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 07/11/2014] [Accepted: 12/08/2014] [Indexed: 06/01/2023]
Abstract
Microsatellite (or simple sequence repeat, SSR) markers are highly informative DNA markers often used in conservation genetic research. Next-generation sequencing enables efficient development of large numbers of SSR markers at lower costs. Boswellia papyrifera is an economically important tree species used for frankincense production, an aromatic resinous gum exudate from bark. It grows in dry tropical forests in Africa and is threatened by a lack of rejuvenation. To help guide conservation efforts for this endangered species, we conducted an analysis of its genomic DNA sequences using Illumina paired-end sequencing. The genome size was estimated at 705 Mb per haploid genome. The reads contained one microsatellite repeat per 5.7 kb. Based on a subset of these repeats, we developed 46 polymorphic SSR markers that amplified 2-12 alleles in 10 genotypes. This set included 30 trinucleotide repeat markers, four tetranucleotide repeat markers, six pentanucleotide markers and six hexanucleotide repeat markers. Several markers were cross-transferable to Boswellia pirrotae and B. popoviana. In addition, retrotransposons were identified, the reads were assembled and several contigs were identified with similarity to genes of the terpene and terpenoid backbone synthesis pathways, which form the major constituents of the bark resin.
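To make the notion of tri- to hexanucleotide repeats concrete, here is a minimal Python sketch that scans a sequence for perfect tandem repeats with a regular expression. It is not the pipeline used in the study: the function name `find_ssrs`, the motif length range, and the minimum of four repeat units are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_motif=3, max_motif=6, min_repeats=4):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    Limitation of this sketch: a homopolymer run such as AAAAAAAAAAAA is
    reported with motif AAA, because only motif length is constrained.
    """
    # The lazy quantifier tries the shortest motif first; \1 then requires
    # at least (min_repeats - 1) further copies of that motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


example_hits = find_ssrs("TTACGACGACGACGTT")
```

For the example sequence, the scan reports a single trinucleotide repeat, `ACG` repeated four times starting at offset 2. Real marker development would also need flanking-sequence checks for primer design, which this sketch omits.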
Affiliation(s)
- A B Addisalem
  - Wageningen UR Plant Breeding, Wageningen University and Research Center, PO Box 386, NL-6700 AJ Wageningen, The Netherlands
  - Center for Ecosystem Studies, Forest Ecology and Forest Management Group, Wageningen University and Research Center, PO Box 47, NL-6700 AA Wageningen, The Netherlands
  - Wondo Genet College of Forestry and Natural Resources, PO Box 128, Shashemene, Ethiopia
- G Danny Esselink
  - Wageningen UR Plant Breeding, Wageningen University and Research Center, PO Box 386, NL-6700 AJ Wageningen, The Netherlands
- F Bongers
  - Center for Ecosystem Studies, Forest Ecology and Forest Management Group, Wageningen University and Research Center, PO Box 47, NL-6700 AA Wageningen, The Netherlands
- M J M Smulders
  - Wageningen UR Plant Breeding, Wageningen University and Research Center, PO Box 386, NL-6700 AJ Wageningen, The Netherlands
163. Computational and Statistical Analyses of Insertional Polymorphic Endogenous Retroviruses in a Non-Model Organism. Computation 2014. [DOI: 10.3390/computation2040221] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
164. Melsted P, Halldórsson BV. KmerStream: streaming algorithms for k-mer abundance estimation. Bioinformatics 2014; 30:3541-7. [PMID: 25355787 DOI: 10.1093/bioinformatics/btu713] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 12/30/2022]
Abstract
MOTIVATION: Several applications in bioinformatics, such as genome assemblers and error correction methods, rely on counting and keeping track of k-mers (substrings of length k). Histograms of k-mer frequencies can give valuable insight into the underlying distribution and indicate the error rate and genome size sampled in the sequencing experiment. RESULTS: We present KmerStream, a streaming algorithm for estimating the number of distinct k-mers present in high-throughput sequencing data. The algorithm runs in time linear in the size of the input, and the space requirement is logarithmic in the size of the input. We derive a simple model that allows us to estimate the error rate of the sequencing experiment, as well as the genome size, using only the aggregate statistics reported by KmerStream. As an application we show how KmerStream can be used to compute the error rate of a DNA sequencing experiment. We run KmerStream on a set of 2656 whole genome sequenced individuals and compare the error rate to quality values reported by the sequencing equipment. We discover that while the quality values alone are largely reliable as a predictor of error rate, there is considerable variability in the error rates between sequencing runs, even when accounting for reported quality values.
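For reference, the quantities this abstract talks about (k-mer counts, the number of distinct k-mers, and the frequency histogram) can be computed exactly on small inputs with the Python sketch below. This is an exact, memory-hungry baseline and not the KmerStream streaming estimator; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each abundance value to the number of distinct k-mers having it."""
    return Counter(counts.values())


counts = kmer_counts(["ACGTA", "CGTAC"], 3)
n_distinct = len(counts)
histogram = abundance_histogram(counts)
```

In real data, the abundance-1 bin of such a histogram is typically dominated by k-mers created by sequencing errors. Exact counting needs memory proportional to the number of distinct k-mers, which is the cost a streaming estimate avoids.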
Affiliation(s)
- Páll Melsted
  - Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland, deCODE Genetics/Amgen, Reykjavík, Iceland and School of Science and Engineering, Reykjavík University, Reykjavík, Iceland
- Bjarni V Halldórsson
  - Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland, deCODE Genetics/Amgen, Reykjavík, Iceland and School of Science and Engineering, Reykjavík University, Reykjavík, Iceland
165. Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022]
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
166. Lim EC, Müller J, Hagmann J, Henz SR, Kim ST, Weigel D. Trowel: a fast and accurate error correction module for Illumina sequencing reads. Bioinformatics 2014; 30:3264-5. [PMID: 25075116 DOI: 10.1093/bioinformatics/btu513] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The ability to accurately read the order of nucleotides in DNA and RNA is fundamental for modern biology. Errors in next-generation sequencing can lead to many artifacts, from erroneous genome assemblies to mistaken inferences about RNA editing. Uneven coverage in datasets also contributes to false corrections. RESULTS: We introduce Trowel, a massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short read sequencing applications. The latency in the data path has been significantly reduced because of efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size, read length and fragment size. AVAILABILITY AND IMPLEMENTATION: Trowel is written in C++ and is provided under the General Public License v3.0 (GPLv3). It is available at http://trowel-ec.sourceforge.net. CONTACT: euncheon.lim@tue.mpg.de or weigel@tue.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eun-Cheon Lim
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Jonas Müller
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Jörg Hagmann
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Stefan R Henz
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Sang-Tae Kim
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Detlef Weigel
  - Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
167. Drezen E, Rizk G, Chikhi R, Deltel C, Lemaitre C, Peterlongo P, Lavenier D. GATB: Genome Assembly & Analysis Tool Box. Bioinformatics 2014; 30:2959-61. [PMID: 24990603 PMCID: PMC4184257 DOI: 10.1093/bioinformatics/btu406] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Indexed: 11/15/2022]
Abstract
Motivation: Efficient and fast next-generation sequencing (NGS) algorithms are essential to analyze the terabytes of data generated by the NGS machines. A serious bottleneck can be the design of such algorithms, as they require sophisticated data structures and advanced hardware implementation. Results: We propose an open-source library dedicated to genome assembly and analysis to speed up the process of developing efficient software. The library is based on a recent optimized de Bruijn graph implementation allowing complex genomes to be processed on desktop computers using fast algorithms with low memory footprints. Availability and implementation: The GATB library is written in C++ and is available at the following Web site http://gatb.inria.fr under the A-GPL license. Contact: lavenier@irisa.fr Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Erwan Drezen
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Guillaume Rizk
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Rayan Chikhi
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Charles Deltel
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Claire Lemaitre
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Pierre Peterlongo
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
- Dominique Lavenier
  - INRIA/IRISA/GenScale, Campus de Beaulieu, 35042 Rennes Cedex, France and Department of Computer Science and Engineering, Pennsylvania State University, PA 16802, USA
168. Janin L, Schulz-Trieglaff O, Cox AJ. BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics 2014; 30:2796-801. [PMID: 24950811 DOI: 10.1093/bioinformatics/btu387] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Indexed: 01/01/2023]
Abstract
MOTIVATION: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input. RESULTS: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, in which we can search for 1, 10, 100, 1000 and a million 30-mers in 3, 8, 14, 45 and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization. AVAILABILITY AND IMPLEMENTATION: BEETL-fastq is part of the BEETL library, available as a github repository at github.com/BEETL/BEETL.
Affiliation(s)
- Lilian Janin
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
| | - Ole Schulz-Trieglaff
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
| | - Anthony J Cox
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
| |
|
169
|
Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. Frontiers in Plant Science 2014; 5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]
Abstract
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science in recent years by enabling the production of large volumes of sequence data at a drastically lower price per base than traditional sequencing methods. The recent and ongoing developments in the field allow research questions in plant-microbe biology to be addressed that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed, and methods that help to overcome these shortcomings are described. A focus is on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant-associated microbiota to demonstrate the value of the new methods.
Affiliation(s)
- Claudia Knief
- Institute of Crop Science and Resource Conservation, Molecular Biology of the Rhizosphere, Faculty of Agriculture, University of Bonn, Bonn, Germany
| |
|
170
|
Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014; 15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023]
Abstract
BACKGROUND: Current-generation sequencing technologies are able to produce low-cost, high-throughput reads. However, the produced reads are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in 454 sequencing reads. RESULTS: We present HECTOR, a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. In this algorithm, for the first time we have investigated a novel homopolymer spectrum based approach to handle homopolymer insertions or deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR, in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector which is based on multiple sequence alignment, and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR demonstrates comparable correction quality to Coral, but runs 3.7× faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low. CONCLUSION: Our homopolymer spectrum based approach is theoretically capable of processing homopolymer-length errors of arbitrary length, with linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector which is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
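To make the notion of a homopolymer-length error concrete, the sketch below run-length encodes two reads that differ only in how long their homopolymer runs are. It is an illustration of the error type named in the abstract, not HECTOR's correction algorithm; the function names and the example reads are hypothetical.

```python
from itertools import groupby

def homopolymer_runs(seq):
    """Run-length encode a read as (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]

def collapse(seq):
    """Reduce every homopolymer run to a single base."""
    return "".join(base for base, _ in homopolymer_runs(seq))

# Hypothetical reads: the second has one C deleted and one T inserted.
true_read = "ACCCGTT"
erroneous_read = "ACCGTTT"

true_runs = homopolymer_runs(true_read)
error_runs = homopolymer_runs(erroneous_read)

# Positions where the base agrees but the run length differs.
length_errors = [
    (base, expected, observed)
    for (base, expected), (_, observed) in zip(true_runs, error_runs)
    if expected != observed
]
print(length_errors)
```

In this representation the two reads share the same collapsed sequence, and the insertion and deletion show up purely as differences in run length.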
Affiliation(s)
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany.
| |
|
171
|
Yu YW, Yorukoglu D, Berger B. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2014; 8394:385-399. [PMID: 28825060 DOI: 10.1007/978-3-319-05269-4_31] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/18/2022]
Abstract
It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. AVAILABILITY: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
Affiliation(s)
- Y William Yu
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| | - Deniz Yorukoglu
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| | - Bonnie Berger
- Massachusetts Institute of Technology, Cambridge MA 02139, USA http://people.csail.mit.edu/bab/
| |
|
172
|
Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014; 23:1679-700. [DOI: 10.1111/mec.12680] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 11/04/2013] [Revised: 01/17/2014] [Accepted: 01/22/2014] [Indexed: 12/17/2022]
Affiliation(s)
- Xiaofan Zhou
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| | - Antonis Rokas
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| |
|
173
|
Genomic sequence and experimental tractability of a new decapod shrimp model, Neocaridina denticulata. Mar Drugs 2014; 12:1419-37. [PMID: 24619275 PMCID: PMC3967219 DOI: 10.3390/md12031419] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Received: 01/16/2014] [Revised: 02/23/2014] [Accepted: 02/28/2014] [Indexed: 12/14/2022]
Abstract
The speciose Crustacea is the largest subphylum of arthropods on the planet after the Insecta. To date, however, the only publicly available sequenced crustacean genome is that of the water flea, Daphnia pulex, a member of the Branchiopoda. While Daphnia is a well-established ecotoxicological model, a previous study showed that one-third of the genes contained in its genome are lineage-specific and could not be identified in any other metazoan genome. To better understand the genomic evolution of crustaceans and arthropods, we have sequenced the genome of a novel shrimp model, Neocaridina denticulata, and tested its experimental malleability. A library of 170-bp nominal fragment size was constructed from DNA of a starved single adult and sequenced using the Illumina HiSeq2000 platform. Core eukaryotic genes, the mitochondrial genome, developmental patterning genes (such as Hox) and microRNA processing pathway genes are all present in this animal, suggesting it has not undergone massive genomic loss. Comparison with the published genome of Daphnia pulex has allowed us to reveal 3750 genes that are indeed specific to the lineage containing malacostracans and branchiopods, rather than Daphnia-specific (E-value: 10⁻⁶). We also show the experimental tractability of N. denticulata, which, together with the genomic resources presented here, makes it an ideal model for a wide range of further aquacultural, developmental, ecotoxicological, food safety, genetic, hormonal, physiological and reproductive research, allowing better understanding of the evolution of crustaceans and other arthropods.
|
174
|
Roy RS, Bhattacharya D, Schliep A. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics 2014; 30:1950-7. [DOI: 10.1093/bioinformatics/btu132] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/12/2022]
|
175
|
Heo Y, Wu XL, Chen D, Ma J, Hwu WM. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 2014; 30:1354-62. [PMID: 24451628 DOI: 10.1093/bioinformatics/btu030] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Rapid advances in next-generation sequencing (NGS) technology have led to an exponential increase in the amount of genomic information. However, NGS reads contain far more errors than data from traditional sequencing methods, and downstream genomic analysis results can be improved by correcting the errors. Unfortunately, all the previous error correction methods required a large amount of memory, making them unsuitable for processing reads from large genomes with commodity computers. RESULTS: We present a novel algorithm that produces accurate correction results with much less memory compared with previous solutions. The algorithm, named BLoom-filter-based Error correction Solution for high-throughput Sequencing reads (BLESS), uses a single minimum-sized Bloom filter, and is also able to tolerate a higher false-positive rate, thus allowing us to correct errors with a 40× memory usage reduction on average compared with previous methods. Meanwhile, BLESS can extend reads like DNA assemblers to correct errors at the end of reads. Evaluations using real and simulated reads showed that BLESS could generate more accurate results than existing solutions. After errors were corrected using BLESS, 69% of initially unaligned reads could be aligned correctly. Additionally, de novo assembly results became 50% longer with 66% fewer assembly errors. AVAILABILITY AND IMPLEMENTATION: Freely available at http://sourceforge.net/p/bless-ec CONTACT: dchen@illinois.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
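The data structure named in this abstract can be sketched in a few lines. The example below is a generic Bloom filter holding k-mers that occur at least a threshold number of times; it is not BLESS's implementation. The class, the SHA-256 based hashing, the filter size and the multiplicity threshold are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

class BloomFilter:
    """Fixed-size bit array probed by several hash functions (toy version)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Hypothetical reads; the second carries a substitution in its last base.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
k, min_count = 4, 2  # assumed k-mer size and multiplicity threshold

counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
frequent = {kmer for kmer, n in counts.items() if n >= min_count}

bloom = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in frequent:
    bloom.add(kmer)

print(sorted(frequent))
```

The filter stores only bits, not the k-mers themselves, which is where the memory saving comes from; the price is an occasional false positive on lookup.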
Affiliation(s)
- Yun Heo
- Department of Electrical and Computer Engineering, Department of Bioengineering and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
|
176
|
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. RESULTS: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. CONCLUSION: Read mappability as measured by the proportion of singletons increases steadily up to a length scale of around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
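The quantity tracked in this abstract, the proportion of non-singleton k-mers as a function of k, can be computed on a toy sequence as follows. This is a minimal sketch under stated assumptions: the proportion is taken over k-mer positions on the forward strand only, which may differ from the paper's exact definition, and the function name and the ten-base sequence are hypothetical.

```python
from collections import Counter

def non_singleton_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs more than once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)

# Hypothetical toy "genome" containing a repeated ACGT motif.
toy_genome = "ACGTACGTAA"

fractions = {k: non_singleton_fraction(toy_genome, k) for k in (2, 4, 6)}
for k, fraction in fractions.items():
    print(k, round(fraction, 3))
```

On this toy input the fraction falls as k grows, the same direction of effect the abstract reports for the human reference genome at a much larger scale.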
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA.
| |
|
177
|
Anvar SY, Khachatryan L, Vermaat M, van Galen M, Pulyakhina I, Ariyurek Y, Kraaijeveld K, den Dunnen JT, de Knijff P, ’t Hoen PAC, Laros JFJ. Determining the quality and complexity of next-generation sequencing data without a reference genome. Genome Biol 2014; 15:555. [PMID: 25514851 PMCID: PMC4298064 DOI: 10.1186/s13059-014-0555-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 10/06/2014] [Accepted: 11/27/2014] [Indexed: 01/22/2023]
Abstract
We describe an open-source kPAL package that facilitates an alignment-free assessment of the quality and comparability of sequencing datasets by analyzing k-mer frequencies. We show that kPAL can detect technical artefacts such as high duplication rates, library chimeras, contamination and differences in library preparation protocols. kPAL also successfully captures the complexity and diversity of microbiomes and provides a powerful means to study changes in microbial communities. Together, these features make kPAL an attractive and broadly applicable tool to determine the quality and comparability of sequence libraries even in the absence of a reference sequence. kPAL is freely available at https://github.com/LUMC/kPAL.
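An alignment-free comparison based on k-mer frequencies, as this abstract describes, can be sketched by building a normalised k-mer profile per library and measuring how far apart two profiles are. The code below does not use kPAL's API; the function names, the Euclidean distance and the example libraries are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

def kmer_profile(reads, k):
    """Relative frequency of every possible DNA k-mer across a set of reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts[kmer] / total for kmer in all_kmers}

def profile_distance(profile_a, profile_b):
    """Euclidean distance between two profiles over the same k-mer set."""
    return math.sqrt(sum((profile_a[kmer] - profile_b[kmer]) ** 2
                         for kmer in profile_a))

# Hypothetical libraries: one mixed-composition, one dominated by poly-A/T.
library_a = ["ACGTACGT", "ACGTTGCA"]
library_b = ["AAAAAAAA", "AAAATTTT"]

profile_a = kmer_profile(library_a, k=2)
profile_b = kmer_profile(library_b, k=2)

print(profile_distance(profile_a, profile_a))
print(profile_distance(profile_a, profile_b))
```

No reference genome enters the calculation; two libraries are compared only through their own k-mer content.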
Affiliation(s)
- Seyed Yahya Anvar
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
| | - Lusine Khachatryan
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Martijn Vermaat
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Michiel van Galen
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
| | - Irina Pulyakhina
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Yavuz Ariyurek
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
| | - Ken Kraaijeveld
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
- Department of Ecological Science, VU University Amsterdam, Amsterdam, The Netherlands
| | - Johan T den Dunnen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
- Department of Clinical Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Peter de Knijff
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Peter AC ’t Hoen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
| | - Jeroen FJ Laros
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
| |
|
178
|
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. Next Generation Sequencing Technologies and Challenges in Sequence Assembly 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/12/2023]
|
179
|
Rødland EA. Compact representation of k-mer de Bruijn graphs for genome read assembly. BMC Bioinformatics 2013; 14:313. [PMID: 24152242 PMCID: PMC4015147 DOI: 10.1186/1471-2105-14-313] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 05/02/2013] [Accepted: 10/14/2013] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Processing of reads from high-throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory-efficient data structure. RESULTS: The FM-index, which is based on the Burrows-Wheeler transform, is an efficient data structure providing a searchable index of all substrings from a set of strings, and is used to compactly represent full genomes for use in mapping reads to a genome: the memory required to store it is of the same order of magnitude as the strings themselves. However, reads from high-throughput sequencing mostly have high coverage and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bits of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each different (k-1)-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently. CONCLUSIONS: The kFM-index could replace more memory-demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).
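The graph that this abstract refers to, with (k-1)-mers as vertices and k-mers as edges, can be built naively as shown below, together with the memory figure the abstract quotes (5 bits per vertex plus 0.5 to 1 bit of overhead). This is a plain dictionary-based sketch, not the kFM-index itself; the function names and example reads are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Vertices are (k-1)-mers; each distinct k-mer adds one directed edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].add(suffix)
            graph.setdefault(suffix, set())  # make sure sink vertices exist
    return graph

def estimated_bits(num_vertices, overhead_bits_per_vertex):
    """Memory estimate using the per-vertex figures quoted in the abstract."""
    return num_vertices * (5 + overhead_bits_per_vertex)

# Hypothetical overlapping reads; shared k-mers are stored only once.
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_graph(reads, k=4)

low = estimated_bits(len(graph), 0.5)
high = estimated_bits(len(graph), 1.0)
print(len(graph), low, high)
```

Because overlapping reads contribute the same k-mers repeatedly, the number of vertices grows with the number of distinct (k-1)-mers rather than with coverage, which is why a per-vertex bit budget is the relevant measure.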
Affiliation(s)
- Einar Andreas Rødland
- Center for Cancer Biomedicine & Department of Informatics, University of Oslo, 0316 Oslo, Norway.
| |
|
180
|
Guo Y, Ye F, Sheng Q, Clark T, Samuels DC. Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform 2013; 15:879-89. [PMID: 24067931 DOI: 10.1093/bib/bbt069] [Citation(s) in RCA: 118] [Impact Index Per Article: 9.8] [Indexed: 01/15/2023]
Abstract
Advances in next-generation sequencing (NGS) technologies have greatly improved our ability to detect genomic variants for biomedical research. In particular, NGS technologies have been recently applied with great success to the discovery of mutations associated with the growth of various tumours and in rare Mendelian diseases. The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is quality control of the sequencing data. In this review, we discuss the proper quality control procedures and parameters for Illumina technology-based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling. Monitoring quality control metrics at each of the three stages of NGS data provides unique and independent evaluations of data quality from differing perspectives. Properly conducting quality control protocols at all three stages and correctly interpreting the quality control results are crucial to ensure a successful and meaningful study.
|