51
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Collapse
|
52
|
Abstract
BACKGROUND Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources. METHODS We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed in all reads, and a large percentage of reads are indeed error-free. Ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast k-spectrum analysis based method to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency depending upon the downstream application need. RESULTS Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with more than 90% precision rate. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: we show that an error correction algorithm, Reptile, which rely on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead use reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%. CONCLUSIONS Thanks to the continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision just filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application, this in turn gives a new class of error correcting algorithms as a by product.
Collapse
|
53
|
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitation of the NGS technologies, there could be errors in the reads. The error rate of the reads can be reduced with trimming and by correcting the erroneous bases of the reads. It helps to achieve high quality data and the computational complexity of many biological applications will be greatly reduced if the reads are first corrected. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. Real reads that we have employed in our performance evaluation are Illumina-generated short reads of various lengths. Six experimental datasets we have utilized are taken from sequence and read archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probabilities. Conclusions Error correction is a vital problem in biology especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors also. Software availability The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
Collapse
|
54
|
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015; 4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. FINDINGS We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. CONCLUSIONS Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
Collapse
|
55
|
Outbred genome sequencing and CRISPR/Cas9 gene editing in butterflies. Nat Commun 2015; 6:8212. [PMID: 26354079 PMCID: PMC4568561 DOI: 10.1038/ncomms9212] [Citation(s) in RCA: 125] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2014] [Accepted: 07/29/2015] [Indexed: 12/22/2022] Open
Abstract
Butterflies are exceptionally diverse but their potential as an experimental system has been limited by the difficulty of deciphering heterozygous genomes and a lack of genetic manipulation technology. Here we use a hybrid assembly approach to construct high-quality reference genomes for Papilio xuthus (contig and scaffold N50: 492 kb, 3.4 Mb) and Papilio machaon (contig and scaffold N50: 81 kb, 1.15 Mb), highly heterozygous species that differ in host plant affiliations, and adult and larval colour patterns. Integrating comparative genomics and analyses of gene expression yields multiple insights into butterfly evolution, including potential roles of specific genes in recent diversification. To functionally test gene function, we develop an efficient (up to 92.5%) CRISPR/Cas9 gene editing method that yields obvious phenotypes with three genes, Abdominal-B, ebony and frizzled. Our results provide valuable genomic and technological resources for butterflies and unlock their potential as a genetic model system. Butterflies are a promising system to study the genetics and evolution of morphological diversification, yet genomic and technological resources are limited. Here, the authors sequence genomes of two Papilio butterflies and develop a CRISPR/Cas9 gene editing method for these species.
Collapse
|
56
|
Iyer S, Casey E, Bouzek H, Kim M, Deng W, Larsen BB, Zhao H, Bumgarner RE, Rolland M, Mullins JI. Comparison of Major and Minor Viral SNPs Identified through Single Template Sequencing and Pyrosequencing in Acute HIV-1 Infection. PLoS One 2015; 10:e0135903. [PMID: 26317928 PMCID: PMC4552882 DOI: 10.1371/journal.pone.0135903] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2014] [Accepted: 07/27/2015] [Indexed: 01/03/2023] Open
Abstract
Massively parallel sequencing (MPS) technologies, such as 454-pyrosequencing, allow for the identification of variants in sequence populations at lower levels than consensus sequencing and most single-template Sanger sequencing experiments. We sought to determine if the greater depth of population sampling attainable using MPS technology would allow detection of minor variants in HIV founder virus populations very early in infection in instances where Sanger sequencing detects only a single variant. We compared single nucleotide polymorphisms (SNPs) during acute HIV-1 infection from 32 subjects using both single template Sanger and 454-pyrosequencing. Pyrosequences from a median of 2400 viral templates per subject and encompassing 40% of the HIV-1 genome, were compared to a median of five individually amplified near full-length viral genomes sequenced using Sanger technology. There was no difference in the consensus nucleotide sequences over the 3.6kb compared in 84% of the subjects infected with single founders and 33% of subjects infected with multiple founder variants: among the subjects with disagreements, mismatches were found in less than 1% of the sites evaluated (of a total of nearly 117,000 sites across all subjects). The majority of the SNPs observed only in pyrosequences were present at less than 2% of the subject’s viral sequence population. These results demonstrate the utility of the Sanger approach for study of early HIV infection and provide guidance regarding the design, utility and limitations of population sequencing from variable template sources, and emphasize parameters for improving the interpretation of massively parallel sequencing data to address important questions regarding target sequence evolution.
Collapse
Affiliation(s)
- Shyamala Iyer
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Eleanor Casey
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Heather Bouzek
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Moon Kim
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Wenjie Deng
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Brendan B. Larsen
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Hong Zhao
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Roger E. Bumgarner
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
| | - Morgane Rolland
- US Military HIV Research Program, WRAIR, Silver Spring, MD, 20910, United States of America
- Henry Jackson Foundation for the Advancement of Military Medicine, Inc., Bethesda, MD, 20817, United States of America
| | - James I. Mullins
- Department of Microbiology, University of Washington, Seattle, WA, 98195, United States of America
- Department of Medicine, University of Washington, Seattle, WA, 98195, United States of America
- Department of Laboratory Medicine, Seattle, WA, 98195, United States of America
- * E-mail:
| |
Collapse
|
57
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
|
58
|
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Indexed: 02/02/2023] Open
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
Collapse
|
59
|
Marçais G, Yorke JA, Zimin A. QuorUM: An Error Corrector for Illumina Reads. PLoS One 2015; 10:e0130821. [PMID: 26083032 PMCID: PMC4471408 DOI: 10.1371/journal.pone.0130821] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 05/26/2015] [Indexed: 11/18/2022] Open
Abstract
Motivation Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 × coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available. Results We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated. Availability QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu. Contact gmarcais@umd.edu.
Collapse
Affiliation(s)
- Guillaume Marçais
- IPST, University of Maryland, College Park, MD, USA
- * E-mail: (AZ), (GM)
| | | | - Aleksey Zimin
- IPST, University of Maryland, College Park, MD, USA
- * E-mail: (AZ), (GM)
| |
Collapse
|
60
|
Sheikhizadeh S, de Ridder D. ACE: accurate correction of errors usingK-mer tries. Bioinformatics 2015; 31:3216-8. [DOI: 10.1093/bioinformatics/btv332] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2014] [Accepted: 05/22/2015] [Indexed: 11/13/2022] Open
|
61
|
Craveiro SR, Inglis PW, Togawa RC, Grynberg P, Melo FL, Ribeiro ZMA, Ribeiro BM, Báo SN, Castro MEB. The genome sequence of Pseudoplusia includens single nucleopolyhedrovirus and an analysis of p26 gene evolution in the baculoviruses. BMC Genomics 2015; 16:127. [PMID: 25765042 PMCID: PMC4346127 DOI: 10.1186/s12864-015-1323-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2014] [Accepted: 02/04/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Pseudoplusia includens single nucleopolyhedrovirus (PsinSNPV-IE) is a baculovirus recently identified in our laboratory, with high pathogenicity to the soybean looper, Chrysodeixis includens (Lepidoptera: Noctuidae) (Walker, 1858). In Brazil, the C. includens caterpillar is an emerging pest and has caused significant losses in soybean and cotton crops. The PsinSNPV genome was determined and the phylogeny of the p26 gene within the family Baculoviridae was investigated. RESULTS The complete genome of PsinSNPV was sequenced (Roche 454 GS FLX - Titanium platform), annotated and compared with other Alphabaculoviruses, displaying a genome apparently different from other baculoviruses so far sequenced. The circular double-stranded DNA genome is 139,132 bp in length, with a GC content of 39.3 % and contains 141 open reading frames (ORFs). PsinSNPV possesses the 37 conserved baculovirus core genes, 102 genes found in other baculoviruses and 2 unique ORFs. Two baculovirus repeat ORFs (bro) homologs, bro-a (Psin33) and bro-b (Psin69), were identified and compared with Chrysodeixis chalcites nucleopolyhedrovirus (ChchNPV) and Trichoplusia ni single nucleopolyhedrovirus (TnSNPV) bro genes and showed high similarity, suggesting that these genes may be derived from an ancestor common to these viruses. The homologous repeats (hrs) are absent from the PsinSNPV genome, which is also the case in ChchNPV and TnSNPV. Two p26 gene homologs (p26a and p26b) were found in the PsinSNPV genome. P26 is thought to be required for optimal virion occlusion in the occlusion bodies (OBs), but its function is not well characterized. The P26 phylogenetic tree suggests that this gene was obtained from three independent acquisition events within the Baculoviridae family. The presence of a signal peptide only in the PsinSNPV p26a/ORF-20 homolog indicates distinct function between the two P26 proteins. CONCLUSIONS PsinSNPV has a genomic sequence apparently different from other baculoviruses sequenced so far. The complete genome sequence of PsinSNPV will provide a valuable resource, contributing to studies on its molecular biology and functional genomics, and will promote the development of this virus as an effective bioinsecticide.
Collapse
Affiliation(s)
- Saluana R Craveiro
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
- Departamento de Biologia Celular, Universidade de Brasília-UnB, Brasília, DF, Brazil.
| | - Peter W Inglis
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
| | - Roberto C Togawa
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
| | - Priscila Grynberg
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
| | - Fernando L Melo
- Departamento de Biologia Celular, Universidade de Brasília-UnB, Brasília, DF, Brazil.
| | - Zilda Maria A Ribeiro
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
| | - Bergmann M Ribeiro
- Departamento de Biologia Celular, Universidade de Brasília-UnB, Brasília, DF, Brazil.
| | - Sônia N Báo
- Departamento de Biologia Celular, Universidade de Brasília-UnB, Brasília, DF, Brazil.
| | - Maria Elita B Castro
- Embrapa Recursos Genéticos e Biotecnologia, Parque Estação Biológica, W5 Norte Final, 70770-917, Brasília, DF, Brazil.
| |
Collapse
|
62
|
Salehi F, Baronio R, Idrogo-Lam R, Vu H, Hall LV, Kaiser P, Lathrop RH. CHOPER filters enable rare mutation detection in complex mutagenesis populations by next-generation sequencing. PLoS One 2015; 10:e0116877. [PMID: 25692681 PMCID: PMC4333345 DOI: 10.1371/journal.pone.0116877] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2014] [Accepted: 12/08/2014] [Indexed: 01/12/2023] Open
Abstract
Next-generation sequencing (NGS) has revolutionized genetics and enabled the accurate identification of many genetic variants across many genomes. However, detection of biologically important low-frequency variants within genetically heterogeneous populations remains challenging, because they are difficult to distinguish from intrinsic NGS sequencing error rates. Approaches to overcome these limitations are essential to detect rare mutations in large cohorts, virus or microbial populations, mitochondria heteroplasmy, and other heterogeneous mixtures such as tumors. Modifications in library preparation can overcome some of these limitations, but are experimentally challenging and restricted to skilled biologists. This paper describes a novel quality filtering and base pruning pipeline, called Complex Heterogeneous Overlapped Paired-End Reads (CHOPER), designed to detect sequence variants in a complex population with high sequence similarity derived from All-Codon-Scanning (ACS) mutagenesis. A novel fast alignment algorithm, designed for the specified application, has O(n) time complexity. CHOPER was applied to a p53 cancer mutant reactivation study derived from ACS mutagenesis. Relative to error filtering based on Phred quality scores, CHOPER improved accuracy by about 13% while discarding only half as many bases. These results are a step toward extending the power of NGS to the analysis of genetically heterogeneous populations.
Collapse
Affiliation(s)
- Faezeh Salehi
- Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
- Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
- * E-mail: (FS); (PK)
| | - Roberta Baronio
- Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
- Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
| | - Ryan Idrogo-Lam
- Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
| | - Huy Vu
- Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
| | - Linda V. Hall
- Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
- Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
| | - Peter Kaiser
- Department of Biological Chemistry, University of California Irvine, Irvine, CA, 92697, United States of America
- Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
- Chao Family Comprehensive Cancer Center, University of California Irvine, Irvine, CA, 92697, United States of America
- * E-mail: (FS); (PK)
| | - Richard H. Lathrop
- Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States of America
- Institute for Genomics and Bioinformatics, University of California Irvine, Irvine, CA, 92697, United States of America
- Chao Family Comprehensive Cancer Center, University of California Irvine, Irvine, CA, 92697, United States of America
| |
Collapse
|
63
|
Schulz MH, Weese D, Holtgrewe M, Dimitrova V, Niu S, Reinert K, Richard H. Fiona: a parallel and automatic strategy for read error correction. ACTA ACUST UNITED AC 2015; 30:i356-63. [PMID: 25161220 PMCID: PMC4147893 DOI: 10.1093/bioinformatics/btu440] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Motivation: Automatic error correction of high-throughput sequencing data can have a dramatic impact on the amount of usable base pairs and their quality. It has been shown that the performance of tasks such as de novo genome assembly and SNP calling can be dramatically improved after read error correction. While a large number of methods specialized for correcting substitution errors as found in Illumina data exist, few methods for the correction of indel errors, common to technologies like 454 or Ion Torrent, have been proposed. Results: We present Fiona, a new stand-alone read error–correction method. Fiona provides a new statistical approach for sequencing error detection and optimal error correction and estimates its parameters automatically. Fiona is able to correct substitution, insertion and deletion errors and can be applied to any sequencing technology. It uses an efficient implementation of the partial suffix array to detect read overlaps with different seed lengths in parallel. We tested Fiona on several real datasets from a variety of organisms with different read lengths and compared its performance with state-of-the-art methods. Fiona shows a constantly higher correction accuracy over a broad range of datasets from 454 and Ion Torrent sequencers, without compromise in speed. Conclusion: Fiona is an accurate parameter-free read error–correction method that can be run on inexpensive hardware and can make use of multicore parallelization whenever available. Fiona was implemented using the SeqAn library for sequence analysis and is publicly available for download at http://www.seqan.de/projects/fiona. Contact: mschulz@mmci.uni-saarland.de or hugues.richard@upmc.fr Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Marcel H Schulz
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - David Weese
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - Manuel Holtgrewe
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - Viktoria Dimitrova
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - Sijia Niu
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - Knut Reinert
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| | - Hugues Richard
- 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France 'Multimodal Computing and Interaction', Saarland University & Department for Computational Biology and Applied Computing, Max Planck Institute for Informatics, Saarbrücken, 66123 Saarland, Germany, Ray and Stephanie Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, 15206 PA, USA, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany, Université Pierre et Marie Curie, UMR7238, CNRS-UPMC, Paris, France and CNRS, UMR7238, Laboratory of Computational and Quantitative Biology, Paris, France
| |
Collapse
|
64
|
Kopylova E, Noé L, Da Silva C, Berthelot JF, Alberti A, Aury JM, Touzet H. Deciphering metatranscriptomic data. Methods Mol Biol 2015; 1269:279-91. [PMID: 25577385 DOI: 10.1007/978-1-4939-2291-8_17] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
Metatranscriptomic data contributes another piece of the puzzle to understanding the phylogenetic structure and function of a community of organisms. High-quality total RNA is a bountiful mixture of ribosomal, transfer, messenger and other noncoding RNAs, where each family of RNA is vital to answering questions concerning the hidden microbial world. Software tools designed for deciphering metatranscriptomic data fall under two main categories: the first is to reassemble millions of short nucleotide fragments produced by high-throughput sequencing technologies into the original full-length transcriptomes for all organisms within a sample, and the second is to taxonomically classify the organisms and determine their individual functional roles within a community. Species identification is mainly established using the ribosomal RNA genes, whereas the behavior and functionality of a community is revealed by the messenger RNA of the expressed genes. Numerous chemical and computational methods exist to separate families of RNA prior to conducting further downstream analyses, primarily suitable for isolating mRNA or rRNA from a total RNA sample. In this chapter, we demonstrate a computational technique for filtering rRNA from total RNA using the software SortMeRNA. Additionally, we propose a post-processing pipeline using the latest software tools to conduct further studies on the filtered data, including the reconstruction of mRNA transcripts for functional analyses and phylogenetic classification of a community using the ribosomal RNA.
Collapse
|
65
|
Rare biosphere exploration using high-throughput sequencing: research progress and perspectives. CONSERV GENET 2014. [DOI: 10.1007/s10592-014-0678-9] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
66
|
Jünemann S, Prior K, Albersmeier A, Albaum S, Kalinowski J, Goesmann A, Stoye J, Harmsen D. GABenchToB: a genome assembly benchmark tuned on bacteria and benchtop sequencers. PLoS One 2014; 9:e107014. [PMID: 25198770 PMCID: PMC4157817 DOI: 10.1371/journal.pone.0107014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Accepted: 08/07/2014] [Indexed: 12/28/2022] Open
Abstract
De novo genome assembly is the process of reconstructing a complete genomic sequence from countless small sequencing reads. Due to the complexity of this task, numerous genome assemblers have been developed to cope with different requirements and the different kinds of data provided by sequencers within the fast evolving field of next-generation sequencing technologies. In particular, the recently introduced generation of benchtop sequencers, like Illumina's MiSeq and Ion Torrent's Personal Genome Machine (PGM), popularized the easy, fast, and cheap sequencing of bacterial organisms to a broad range of academic and clinical institutions. With a strong pragmatic focus, here, we give a novel insight into the line of assembly evaluation surveys as we benchmark popular de novo genome assemblers based on bacterial data generated by benchtop sequencers. Therefore, single-library assemblies were generated, assembled, and compared to each other by metrics describing assembly contiguity and accuracy, and also by practice-oriented criteria as for instance computing time. In addition, we extensively analyzed the effect of the depth of coverage on the genome assemblies within reasonable ranges and the k-mer optimization problem of de Bruijn Graph assemblers. Our results show that, although both MiSeq and PGM allow for good genome assemblies, they require different approaches. They not only pair with different assembler types, but also affect assemblies differently regarding the depth of coverage where oversampling can become problematic. Assemblies vary greatly with respect to contiguity and accuracy but also by the requirement on the computing power. Consequently, no assembler can be rated best for all preconditions. Instead, the given kind of data, the demands on assembly quality, and the available computing infrastructure determines which assembler suits best. The data sets, scripts and all additional information needed to replicate our results are freely available at ftp://ftp.cebitec.uni-bielefeld.de/pub/GABenchToB.
Collapse
Affiliation(s)
- Sebastian Jünemann
- Department for Periodontology, University of Münster, Münster, Germany
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Karola Prior
- Department for Periodontology, University of Münster, Münster, Germany
| | - Andreas Albersmeier
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Stefan Albaum
- Bioinformatics Resource Facility, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jörn Kalinowski
- Technology Platform Genomics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Alexander Goesmann
- Bioinformatics and Systems Biology, Justus-Liebig-Univeristy Gießen, Gießen, Germany
| | - Jens Stoye
- Institute for Bioinformatics, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Genome Informatics Group, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Dag Harmsen
- Department for Periodontology, University of Münster, Münster, Germany
| |
Collapse
|
67
|
Ahola V, Lehtonen R, Somervuo P, Salmela L, Koskinen P, Rastas P, Välimäki N, Paulin L, Kvist J, Wahlberg N, Tanskanen J, Hornett EA, Ferguson LC, Luo S, Cao Z, de Jong MA, Duplouy A, Smolander OP, Vogel H, McCoy RC, Qian K, Chong WS, Zhang Q, Ahmad F, Haukka JK, Joshi A, Salojärvi J, Wheat CW, Grosse-Wilde E, Hughes D, Katainen R, Pitkänen E, Ylinen J, Waterhouse RM, Turunen M, Vähärautio A, Ojanen SP, Schulman AH, Taipale M, Lawson D, Ukkonen E, Mäkinen V, Goldsmith MR, Holm L, Auvinen P, Frilander MJ, Hanski I. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nat Commun 2014; 5:4737. [PMID: 25189940 PMCID: PMC4164777 DOI: 10.1038/ncomms5737] [Citation(s) in RCA: 158] [Impact Index Per Article: 14.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2014] [Accepted: 07/17/2014] [Indexed: 12/30/2022] Open
Abstract
Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) of the Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a widely recognized model species in metapopulation biology and eco-evolutionary research, which has the putative ancestral karyotype of n=31. Using a phylogenetic analyses of Nymphalidae and of other Lepidoptera, combined with orthologue-level comparisons of chromosomes, we conclude that the ancestral lepidopteran karyotype has been n=31 for at least 140 My. We show that fusion chromosomes have retained the ancestral chromosome segments and very few rearrangements have occurred across the fusion sites. The same, shortest ancestral chromosomes have independently participated in fusion events in species with smaller karyotypes. The short chromosomes have higher rearrangement rate than long ones. These characteristics highlight distinctive features of the evolutionary dynamics of butterflies and moths. Butterflies and moths (Lepidoptera) vary in chromosome number. Here, the authors sequence the genome of the Glanville fritillary butterfly, Melitaea cinxia, show it has the ancestral lepidopteran karyotype and provide insight into how chromosomal fusions have shaped karyotype evolution in butterflies and moths.
Collapse
Affiliation(s)
- Virpi Ahola
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2]
| | - Rainer Lehtonen
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [3] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland [4] Center of Excellence in Cancer Genetics, University of Helsinki, FI-00014 Helsinki, Finland [5] [6]
| | - Panu Somervuo
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
| | - Leena Salmela
- Department of Computer Science &Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
| | - Patrik Koskinen
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
| | - Pasi Rastas
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| | - Niko Välimäki
- 1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
| | - Lars Paulin
- Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
| | - Jouni Kvist
- Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
| | - Niklas Wahlberg
- Department of Biology, University of Turku, FI-20014 Turku, Finland
| | - Jaakko Tanskanen
- 1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
| | - Emily A Hornett
- 1] Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, UK [2] Department of Biology, Pennsylvania State University, Pennsylvania 16802, USA
| | | | - Shiqi Luo
- College of Life Sciences, Peking University, Beijing 100871, P.R. China
| | - Zijuan Cao
- College of Life Sciences, Peking University, Beijing 100871, P.R. China
| | - Maaike A de Jong
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] School of Biological Sciences, University of Bristol, Bristol BS8 1UG, UK
| | - Anne Duplouy
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| | | | - Heiko Vogel
- Department of Entomology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
| | - Rajiv C McCoy
- Department of Biology, Stanford University, Stanford, California 94305, USA
| | - Kui Qian
- Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
| | - Wong Swee Chong
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| | - Qin Zhang
- BioMediTech, University of Tampere, FI-33520 Tampere, Finland
| | - Freed Ahmad
- Department of Information Technology, University of Turku, FI-20014 Turku, Finland
| | - Jani K Haukka
- BioMediTech, University of Tampere, FI-33520 Tampere, Finland
| | - Aruj Joshi
- BioMediTech, University of Tampere, FI-33520 Tampere, Finland
| | - Jarkko Salojärvi
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| | | | - Ewald Grosse-Wilde
- Department of Evolutionary Neuroethology, Max Planck Institute for Chemical Ecology, D-07745 Jena, Germany
| | - Daniel Hughes
- 1] European Bioinformatics Institute, Hinxton CB10 1SD, UK [2] Baylor College of Medicine, Human Genome Sequencing Center, Houston, Texas 77030-3411, USA
| | - Riku Katainen
- 1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
| | - Esa Pitkänen
- 1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biomedicine, University of Helsinki, FI-00014 Helsinki, Finland
| | - Johannes Ylinen
- Department of Computer Science &Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
| | - Robert M Waterhouse
- 1] Department of Genetic Medicine and Development, University of Geneva Medical School &Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland [2] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA [3] The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
| | - Mikko Turunen
- Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland
| | - Anna Vähärautio
- 1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Department of Pathology, University of Helsinki, FI-00014 Helsinki, Finland [3] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
| | - Sami P Ojanen
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| | - Alan H Schulman
- 1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2] Biotechnology and Food Research, MTT Agrifood Research Finland, FI-31600 Jokioinen, Finland
| | - Minna Taipale
- 1] Genome-Scale Biology Research Program, University of Helsinki, FI-00014 Helsinki, Finland [2] Science for Life Laboratory, Department of Biosciences and Nutrition, Karolinska Institutet, SE-14183 Stockholm, Sweden
| | - Daniel Lawson
- European Bioinformatics Institute, Hinxton CB10 1SD, UK
| | - Esko Ukkonen
- Department of Computer Science &Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
| | - Veli Mäkinen
- Department of Computer Science &Helsinki Institute for Information Technology HIIT, University of Helsinki, FI-00014 Helsinki, Finland
| | - Marian R Goldsmith
- Department of Biological Sciences, University of Rhode Island, Kingston, Rhode Island 02881-0816, USA
| | - Liisa Holm
- 1] Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland [2] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [3]
| | - Petri Auvinen
- 1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
| | - Mikko J Frilander
- 1] Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland [2]
| | - Ilkka Hanski
- Department of Biosciences, University of Helsinki, FI-00014 Helsinki, Finland
| |
Collapse
|
68
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
Collapse
|
69
|
Abstract
Motivation: PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, however, at the expense of prohibitive running times and considerable amounts of disk and memory space. Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. In comparison, LoRDEC is at least six times faster and requires at least 93% less memory or disk space than available tools, while achieving comparable accuracy. Availability and implementaion: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec. Contact:lordec@lirmm.fr. Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
| | - Eric Rivals
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France
| |
Collapse
|
70
|
Lim EC, Müller J, Hagmann J, Henz SR, Kim ST, Weigel D. Trowel: a fast and accurate error correction module for Illumina sequencing reads. ACTA ACUST UNITED AC 2014; 30:3264-5. [PMID: 25075116 DOI: 10.1093/bioinformatics/btu513] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION The ability to accurately read the order of nucleotides in DNA and RNA is fundamental for modern biology. Errors in next-generation sequencing can lead to many artifacts, from erroneous genome assemblies to mistaken inferences about RNA editing. Uneven coverage in datasets also contributes to false corrections. RESULT We introduce Trowel, a massively parallelized and highly efficient error correction module for Illumina read data. Trowel both corrects erroneous base calls and boosts base qualities based on the k-mer spectrum. With high-quality k-mers and relevant base information, Trowel achieves high accuracy for different short read sequencing applications.The latency in the data path has been significantly reduced because of efficient data access and data structures. In performance evaluations, Trowel was highly competitive with other tools regardless of coverage, genome size read length and fragment size. AVAILABILITY AND IMPLEMENTATION Trowel is written in C++ and is provided under the General Public License v3.0 (GPLv3). It is available at http://trowel-ec.sourceforge.net. CONTACT euncheon.lim@tue.mpg.de or weigel@tue.mpg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Eun-Cheon Lim
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Jonas Müller
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Jörg Hagmann
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Stefan R Henz
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Sang-Tae Kim
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| | - Detlef Weigel
- Department of Molecular Biology, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
| |
Collapse
|
71
|
Greenfield P, Duesing K, Papanicolaou A, Bauer DC. Blue: correcting sequencing errors using consensus and context. ACTA ACUST UNITED AC 2014; 30:2723-32. [PMID: 24919879 DOI: 10.1093/bioinformatics/btu368] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Bioinformatics tools, such as assemblers and aligners, are expected to produce more accurate results when given better quality sequence data as their starting point. This expectation has led to the development of stand-alone tools whose sole purpose is to detect and remove sequencing errors. A good error-correcting tool would be a transparent component in a bioinformatics pipeline, simply taking sequence data in any of the standard formats and producing a higher quality version of the same data containing far fewer errors. It should not only be able to correct all of the types of errors found in real sequence data (substitutions, insertions, deletions and uncalled bases), but it has to be both fast enough and scalable enough to be usable on the large datasets being produced by current sequencing technologies, and work on data derived from both haploid and diploid organisms. RESULTS This article presents Blue, an error-correction algorithm based on k-mer consensus and context. Blue can correct substitution, deletion and insertion errors, as well as uncalled bases. It accepts both FASTQ and FASTA formats, and corrects quality scores for corrected bases. Blue also maintains the pairing of reads, both within a file and between pairs of files, making it compatible with downstream tools that depend on read pairing. Blue is memory efficient, scalable and faster than other published tools, and usable on large sequencing datasets. On the tests undertaken, Blue also proved to be generally more accurate than other published algorithms, resulting in more accurately aligned reads and the assembly of longer contigs containing fewer errors. One significant feature of Blue is that its k-mer consensus table does not have to be derived from the set of reads being corrected. This decoupling makes it possible to correct one dataset, such as small set of 454 mate-pair reads, with the consensus derived from another dataset, such as Illumina reads derived from the same DNA sample. Such cross-correction can greatly improve the quality of small (and expensive) sets of long reads, leading to even better assemblies and higher quality finished genomes. AVAILABILITY AND IMPLEMENTATION The code for Blue and its related tools are available from http://www.bioinformatics.csiro.au/Blue. These programs are written in C# and run natively under Windows and under Mono on Linux.
Collapse
Affiliation(s)
- Paul Greenfield
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Konsta Duesing
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Alexie Papanicolaou
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| | - Denis C Bauer
- CSIRO Computational Informatics, School of IT, University of Sydney, CSIRO Animal, Food and Health Sciences, Sydney, NSW 2113, and CSIRO Ecosystem Sciences, Canberra, ACT 2601, Australia
| |
Collapse
|
72
|
Knief C. Analysis of plant microbe interactions in the era of next generation sequencing technologies. FRONTIERS IN PLANT SCIENCE 2014; 5:216. [PMID: 24904612 PMCID: PMC4033234 DOI: 10.3389/fpls.2014.00216] [Citation(s) in RCA: 120] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/15/2014] [Accepted: 04/30/2014] [Indexed: 05/18/2023]
Abstract
Next generation sequencing (NGS) technologies have impressively accelerated research in biological science during the last years by enabling the production of large volumes of sequence data to a drastically lower price per base, compared to traditional sequencing methods. The recent and ongoing developments in the field allow addressing research questions in plant-microbe biology that were not conceivable just a few years ago. The present review provides an overview of NGS technologies and their usefulness for the analysis of microorganisms that live in association with plants. Possible limitations of the different sequencing systems, in particular sources of errors and bias, are critically discussed and methods are disclosed that help to overcome these shortcomings. A focus will be on the application of NGS methods in metagenomic studies, including the analysis of microbial communities by amplicon sequencing, which can be considered as a targeted metagenomic approach. Different applications of NGS technologies are exemplified by selected research articles that address the biology of the plant associated microbiota to demonstrate the worth of the new methods.
Collapse
Affiliation(s)
- Claudia Knief
- Institute of Crop Science and Resource Conservation—Molecular Biology of the Rhizosphere, Faculty of Agriculture, University of BonnBonn, Germany
| |
Collapse
|
73
|
Wirawan A, Harris RS, Liu Y, Schmidt B, Schröder J. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics 2014; 15:131. [PMID: 24885381 PMCID: PMC4023493 DOI: 10.1186/1471-2105-15-131] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2013] [Accepted: 04/24/2014] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Current-generation sequencing technologies are able to produce low-cost, high-throughput reads. However, the produced reads are imperfect and may contain various sequencing errors. Although many error correction methods have been developed in recent years, none explicitly targets homopolymer-length errors in the 454 sequencing reads. RESULTS We present HECTOR, a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. In this algorithm, for the first time we have investigated a novel homopolymer spectrum based approach to handle homopolymer insertions or deletions, which are the dominant sequencing errors in 454 pyrosequencing reads. We have evaluated the performance of HECTOR, in terms of correction quality, runtime and parallel scalability, using both simulated and real pyrosequencing datasets. This performance has been further compared to that of Coral, a state-of-the-art error corrector which is based on multiple sequence alignment and Acacia, a recently published error corrector for amplicon pyrosequences. Our evaluations reveal that HECTOR demonstrates comparable correction quality to Coral, but runs 3.7× faster on average. In addition, HECTOR performs well even when the coverage of the dataset is low. CONCLUSION Our homopolymer spectrum based approach is theoretically capable of processing arbitrary-length homopolymer-length errors, with a linear time complexity. HECTOR employs a multi-threaded design based on a master-slave computing model. Our experimental results show that HECTOR is a practical 454 pyrosequencing read error corrector which is competitive in terms of both correction quality and speed. The source code and all simulated data are available at: http://hector454.sourceforge.net.
Collapse
Affiliation(s)
- Adrianto Wirawan
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz, Germany.
| | | | | | | | | |
Collapse
|
74
|
Alkio M, Jonas U, Declercq M, Van Nocker S, Knoche M. Transcriptional dynamics of the developing sweet cherry (Prunus avium L.) fruit: sequencing, annotation and expression profiling of exocarp-associated genes. HORTICULTURE RESEARCH 2014; 1:11. [PMID: 26504533 PMCID: PMC4591669 DOI: 10.1038/hortres.2014.11] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/20/2013] [Accepted: 01/17/2014] [Indexed: 05/24/2023]
Abstract
The exocarp, or skin, of fleshy fruit is a specialized tissue that protects the fruit, attracts seed dispersing fruit eaters, and has large economical relevance for fruit quality. Development of the exocarp involves regulated activities of many genes. This research analyzed global gene expression in the exocarp of developing sweet cherry (Prunus avium L., 'Regina'), a fruit crop species with little public genomic resources. A catalog of transcript models (contigs) representing expressed genes was constructed from de novo assembled short complementary DNA (cDNA) sequences generated from developing fruit between flowering and maturity at 14 time points. Expression levels in each sample were estimated for 34 695 contigs from numbers of reads mapping to each contig. Contigs were annotated functionally based on BLAST, gene ontology and InterProScan analyses. Coregulated genes were detected using partitional clustering of expression patterns. The results are discussed with emphasis on genes putatively involved in cuticle deposition, cell wall metabolism and sugar transport. The high temporal resolution of the expression patterns presented here reveals finely tuned developmental specialization of individual members of gene families. Moreover, the de novo assembled sweet cherry fruit transcriptome with 7760 full-length protein coding sequences and over 20 000 other, annotated cDNA sequences together with their developmental expression patterns is expected to accelerate molecular research on this important tree fruit crop.
Collapse
Affiliation(s)
- Merianne Alkio
- Institute of Horticultural Production Systems, Leibniz Universität Hannover, D-30419 Hannover, Germany
| | - Uwe Jonas
- Institute of Horticultural Production Systems, Leibniz Universität Hannover, D-30419 Hannover, Germany
| | - Myriam Declercq
- Institute of Horticultural Production Systems, Leibniz Universität Hannover, D-30419 Hannover, Germany
| | - Steven Van Nocker
- Department of Horticulture, Michigan State University, East Lansing, MI 48824-1325, USA
| | - Moritz Knoche
- Institute of Horticultural Production Systems, Leibniz Universität Hannover, D-30419 Hannover, Germany
| |
Collapse
|
75
|
Mbandi SK, Hesse U, Rees DJG, Christoffels A. A glance at quality score: implication for de novo transcriptome reconstruction of Illumina reads. Front Genet 2014; 5:17. [PMID: 24575122 PMCID: PMC3921913 DOI: 10.3389/fgene.2014.00017] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2013] [Accepted: 01/19/2014] [Indexed: 11/13/2022] Open
Abstract
Downstream analyses of short-reads from next-generation sequencing platforms are often preceded by a pre-processing step that removes uncalled and wrongly called bases. Standard approaches rely on their associated base quality scores to retain the read or a portion of it when the score is above a predefined threshold. It is difficult to differentiate sequencing error from biological variation without a reference using quality scores. The effects of quality score based trimming have not been systematically studied in de novo transcriptome assembly. Using RNA-Seq data produced from Illumina, we teased out the effects of quality score based filtering or trimming on de novo transcriptome reconstruction. We showed that assemblies produced from reads subjected to different quality score thresholds contain truncated and missing transfrags when compared to those from untrimmed reads. Our data supports the fact that de novo assembling of untrimmed data is challenging for de Bruijn graph assemblers. However, our results indicates that comparing the assemblies from untrimmed and trimmed read subsets can suggest appropriate filtering parameters and enable selection of the optimum de novo transcriptome assembly in non-model organisms.
Collapse
Affiliation(s)
- Stanley Kimbung Mbandi
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape Bellville, South Africa
| | - Uljana Hesse
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape Bellville, South Africa
| | - D Jasper G Rees
- Biotechnology Platform, Agricultural Research Council Onderstepoort, South Africa
| | - Alan Christoffels
- South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape Bellville, South Africa
| |
Collapse
|
76
|
Istvánek J, Jaros M, Krenek A, Řepková J. Genome assembly and annotation for red clover (Trifolium pratense; Fabaceae). AMERICAN JOURNAL OF BOTANY 2014; 101:327-37. [PMID: 24500806 DOI: 10.3732/ajb.1300340] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
PREMISE OF THE STUDY Red clover (Trifolium pratense) is an important forage plant from the legume family with great importance in agronomy and livestock nourishment. Nevertheless, assembling its medium-sized genome presents a challenge, given current hardware and software possibilities. Next-generation sequencing technologies enable us to generate large amounts of sequence data at low cost. In this study, the genome assembly and red clover genome features are presented. METHODS First, assembly software was assessed using data sets from a closely related species to find the best possible combination of assembler plus error correction program to assemble the red clover genome. The newly sequenced genome was characterized by repetitive content, number of protein-coding and nonprotein-coding genes, and gene families and functions. Genome features were also compared with those of other sequenced plant species. KEY RESULTS Abyss with Echo correction was used for de novo assembly of the red clover genome. The presented assembly comprises ∼314.6 Mbp. In contrast to leguminous species with comparable genome sizes, the genome of T. pratense contains a larger repetitive portion and more abundant retrotransposons and DNA transposons. Overall, 47 398 protein-coding genes were annotated from 64 761 predicted genes. Comparative analysis revealed several gene families that are characteristic for T. pratense. Resistance genes, leghemoglobins, and nodule-specific cystein-rich peptides were identified and compared with other sequenced species. CONCLUSIONS The presented red clover genomic data constitute a resource for improvement through molecular breeding and for comparison to other sequenced plant species.
Collapse
Affiliation(s)
- Jan Istvánek
- Department of Experimental Biology, Faculty of Science, Masaryk University, Brno, Czech Republic
| | | | | | | |
Collapse
|
77
|
Heo Y, Wu XL, Chen D, Ma J, Hwu WM. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. ACTA ACUST UNITED AC 2014; 30:1354-62. [PMID: 24451628 DOI: 10.1093/bioinformatics/btu030] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
MOTIVATION Rapid advances in next-generation sequencing (NGS) technology have led to exponential increase in the amount of genomic information. However, NGS reads contain far more errors than data from traditional sequencing methods, and downstream genomic analysis results can be improved by correcting the errors. Unfortunately, all the previous error correction methods required a large amount of memory, making it unsuitable to process reads from large genomes with commodity computers. RESULTS We present a novel algorithm that produces accurate correction results with much less memory compared with previous solutions. The algorithm, named BLoom-filter-based Error correction Solution for high-throughput Sequencing reads (BLESS), uses a single minimum-sized Bloom filter, and is also able to tolerate a higher false-positive rate, thus allowing us to correct errors with a 40× memory usage reduction on average compared with previous methods. Meanwhile, BLESS can extend reads like DNA assemblers to correct errors at the end of reads. Evaluations using real and simulated reads showed that BLESS could generate more accurate results than existing solutions. After errors were corrected using BLESS, 69% of initially unaligned reads could be aligned correctly. Additionally, de novo assembly results became 50% longer with 66% fewer assembly errors. AVAILABILITY AND IMPLEMENTATION Freely available at http://sourceforge.net/p/bless-ec CONTACT dchen@illinois.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yun Heo
- Department of Electrical and Computer Engineering, Department of Bioengineering and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | | | | | | | | |
Collapse
|
78
|
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. NEXT GENERATION SEQUENCING TECHNOLOGIES AND CHALLENGES IN SEQUENCE ASSEMBLY 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
79
|
El-Metwally S, Hamza T, Zakaria M, Helmy M. Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 2013; 9:e1003345. [PMID: 24348224 PMCID: PMC3861042 DOI: 10.1371/journal.pcbi.1003345] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Decoding DNA symbols using next-generation sequencers was a major breakthrough in genomic research. Despite the many advantages of next-generation sequencers, e.g., the high-throughput sequencing rate and relatively low cost of sequencing, the assembly of the reads produced by these sequencers still remains a major challenge. In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages: preprocessing filtering, a graph construction process, a graph simplification process, and postprocessing filtering. Here we discuss them as a framework of four stages for data analysis and processing and survey variety of techniques, algorithms, and software tools used during each stage. We also discuss the challenges that face current assemblers in the next-generation environment to determine the current state-of-the-art. We recommend a layered architecture approach for constructing a general assembler that can handle the sequences generated by different sequencing platforms.
Collapse
Affiliation(s)
- Sara El-Metwally
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Taher Hamza
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Magdi Zakaria
- Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Mohamed Helmy
- Botany Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
- Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Cairo, Egypt
| |
Collapse
|
80
|
Ferrarini M, Moretto M, Ward JA, Šurbanovski N, Stevanović V, Giongo L, Viola R, Cavalieri D, Velasco R, Cestaro A, Sargent DJ. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics 2013; 14:670. [PMID: 24083400 PMCID: PMC3853357 DOI: 10.1186/1471-2164-14-670] [Citation(s) in RCA: 112] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2013] [Accepted: 09/26/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Second generation sequencing has permitted detailed sequence characterisation at the whole genome level of a growing number of non-model organisms, but the data produced have short read-lengths and biased genome coverage leading to fragmented genome assemblies. The PacBio RS long-read sequencing platform offers the promise of increased read length and unbiased genome coverage and thus the potential to produce genome sequence data of a finished quality containing fewer gaps and longer contigs. However, these advantages come at a much greater cost per nucleotide and with a perceived increase in error-rate. In this investigation, we evaluated the performance of the PacBio RS sequencing platform through the sequencing and de novo assembly of the Potentilla micrantha chloroplast genome. RESULTS Following error-correction, a total of 28,638 PacBio RS reads were recovered with a mean read length of 1,902 bp totalling 54,492,250 nucleotides and representing an average depth of coverage of 320× the chloroplast genome. The dataset covered the entire 154,959 bp of the chloroplast genome in a single contig (100% coverage) compared to seven contigs (90.59% coverage) recovered from an Illumina data, and revealed no bias in coverage of GC rich regions. Post-assembly the data were largely concordant with the Illumina data generated and allowed 187 ambiguities in the Illumina data to be resolved. The additional read length also permitted small differences in the two inverted repeat regions to be assigned unambiguously. CONCLUSIONS This is the first report to our knowledge of a chloroplast genome assembled de novo using PacBio sequence data. The PacBio RS data generated here were assembled into a single large contig spanning the P. micrantha chloroplast genome, with a higher degree of accuracy than an Illumina dataset generated at a much greater depth of coverage, due to longer read lengths and lower GC bias in the data. The results we present suggest PacBio data will be of immense utility for the development of genome sequence assemblies containing fewer unresolved gaps and ambiguities and a significantly smaller number of contigs than could be produced using short-read sequence data alone.
Collapse
Affiliation(s)
- Marco Ferrarini
- Research and Innovation Centre, Fondazione Edmund Mach, Via E, Mach 1, 38010 San Michele all'Adige, Italy.
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
81
|
Iyer S, Bouzek H, Deng W, Larsen B, Casey E, Mullins JI. Quality score based identification and correction of pyrosequencing errors. PLoS One 2013; 8:e73015. [PMID: 24039850 PMCID: PMC3764156 DOI: 10.1371/journal.pone.0073015] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2012] [Accepted: 07/22/2013] [Indexed: 12/26/2022] Open
Abstract
Massively-parallel DNA sequencing using the 454/pyrosequencing platform allows in-depth probing of diverse sequence populations, such as within an HIV-1 infected individual. Analysis of this sequence data, however, remains challenging due to the shorter read lengths relative to that obtained by Sanger sequencing as well as errors introduced during DNA template amplification and during pyrosequencing. The ability to distinguish real variation from pyrosequencing errors with high sensitivity and specificity is crucial to interpreting sequence data. We introduce a new algorithm, CorQ (Correction through Quality), which utilizes the inherent base quality in a sequence-specific context to correct for homopolymer and non-homopolymer insertion and deletion (indel) errors. CorQ also takes uneven read mapping into account for correcting pyrosequencing miscall errors and it identifies and corrects carry forward errors. We tested the ability of CorQ to correctly call SNPs on a set of pyrosequences derived from ten viral genomes from an HIV-1 infected individual, as well as on six simulated pyrosequencing datasets generated using non-zero error rates to emulate errors introduced by PCR. When combined with the AmpliconNoise error correction method developed to remove ambiguities in signal intensities, we attained a 97% reduction in indel errors, a 98% reduction in carry forward errors, and >97% specificity of SNP detection. When compared to four other error correction methods, AmpliconNoise+CorQ performed at equal or higher SNP identification specificity, but the sensitivity of SNP detection was consistently higher (>98%) than other methods tested. This combined procedure will therefore permit examination of complex genetic populations with improved accuracy.
Collapse
Affiliation(s)
- Shyamala Iyer
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
| | - Heather Bouzek
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
| | - Wenjie Deng
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
| | - Brendan Larsen
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
| | - Eleanor Casey
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
| | - James I. Mullins
- Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- * E-mail:
| |
Collapse
|
82
|
Matullo G, Di Gaetano C, Guarrera S. Next generation sequencing and rare genetic variants: from human population studies to medical genetics. ENVIRONMENTAL AND MOLECULAR MUTAGENESIS 2013; 54:518-532. [PMID: 23922201 DOI: 10.1002/em.21799] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/20/2013] [Revised: 05/31/2013] [Accepted: 06/09/2013] [Indexed: 06/02/2023]
Abstract
The allelic frequency spectrum emerging from several Next Generation Sequencing (NGS) projects is revealing important details about evolutionary and demographic forces that shaped the human genome. Herein, we discuss some of the achievements of the use of low-frequency and rare variants from NGS studies. The majority of variants that affect protein-coding regions are recent and rare. Often, the novel rare variants are enriched for deleterious alleles and are population-specific, making them suitable for the study of disease susceptibility. To investigate this kind of variation and its effects in association studies, very large sample sizes will be necessary to achieve sufficient statistical power. Moreover, as these variants are typically population-specific, the replication of disease associations across populations could be very difficult due to population stratification. Therefore, the design of experiments focusing on the identification of rare variants and their effects should be carefully planned. Although several successes have already been achieved through NGS for genetic epidemiology, pharmacogenetic and clinical purposes, with improvements of the sequencing technology and decreased costs, further advances are expected in the near future.
Collapse
Affiliation(s)
- Giuseppe Matullo
- Dipartimento di Scienze Mediche, Università di Torino, Torino, Italy.
| | | | | |
Collapse
|
83
|
Deng W, Maust BS, Westfall DH, Chen L, Zhao H, Larsen BB, Iyer S, Liu Y, Mullins JI. Indel and Carryforward Correction (ICC): a new analysis approach for processing 454 pyrosequencing data. ACTA ACUST UNITED AC 2013; 29:2402-9. [PMID: 23900188 DOI: 10.1093/bioinformatics/btt434] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
MOTIVATION Pyrosequencing technology provides an important new approach to more extensively characterize diverse sequence populations and detect low frequency variants. However, the promise of this technology has been difficult to realize, as careful correction of sequencing errors is crucial to distinguish rare variants (∼1%) in an infected host with high sensitivity and specificity. RESULTS We developed a new approach, referred to as Indel and Carryforward Correction (ICC), to cluster sequences without substitutions and locally correct only indel and carryforward sequencing errors within clusters to ensure that no rare variants are lost. ICC performs sequence clustering in the order of (i) homopolymer indel patterns only, (ii) indel patterns only and (iii) carryforward errors only, without the requirement of a distance cutoff value. Overall, ICC removed 93-95% of sequencing errors found in control datasets. On pyrosequencing data from a PCR fragment derived from 15 HIV-1 plasmid clones mixed at various frequencies as low as 0.1%, ICC achieved the highest sensitivity and similar specificity compared with other commonly used error correction and variant calling algorithms. AVAILABILITY AND IMPLEMENTATION Source code is freely available for download at http://indra.mullins.microbiol.washington.edu/ICC. It is implemented in Perl and supported on Linux, Mac OS X and MS Windows.
Collapse
Affiliation(s)
- Wenjie Deng
- Department of Microbiology, University of Washington School of Medicine, Seattle, WA 98195, USA
| | | | | | | | | | | | | | | | | |
Collapse
|
84
|
Abstract
MOTIVATION High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. RESULTS We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. AVAILABILITY RACER is freely available for non-commercial use at www.csd.uwo.ca/∼ilie/RACER/.
Collapse
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7 London, ON, Canada
| | | |
Collapse
|
85
|
Le HS, Schulz MH, McCauley BM, Hinman VF, Bar-Joseph Z. Probabilistic error correction for RNA sequencing. Nucleic Acids Res 2013; 41:e109. [PMID: 23558750 PMCID: PMC3664804 DOI: 10.1093/nar/gkt215] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Sequencing of RNAs (RNA-Seq) has revolutionized the field of transcriptomics, but the reads obtained often contain errors. Read error correction can have a large impact on our ability to accurately assemble transcripts. This is especially true for de novo transcriptome analysis, where a reference genome is not available. Current read error correction methods, developed for DNA sequence data, cannot handle the overlapping effects of non-uniform abundance, polymorphisms and alternative splicing. Here we present SEquencing Error CorrEction in Rna-seq data (SEECER), a hidden Markov Model (HMM)–based method, which is the first to successfully address these problems. SEECER efficiently learns hundreds of thousands of HMMs and uses these to correct sequencing errors. Using human RNA-Seq data, we show that SEECER greatly improves on previous methods in terms of quality of read alignment to the genome and assembly accuracy. To illustrate the usefulness of SEECER for de novo transcriptome studies, we generated new RNA-Seq data to study the development of the sea cucumber Parastichopus parvimensis. Our corrected assembled transcripts shed new light on two important stages in sea cucumber development. Comparison of the assembled transcripts to known transcripts in other species has also revealed novel transcripts that are unique to sea cucumber, some of which we have experimentally validated. Supporting website: http://sb.cs.cmu.edu/seecer/.
Collapse
Affiliation(s)
- Hai-Son Le
- Machine Learning Department, Carnegie Mellon University, 5000 Forbes Avenue Pittsburgh, PA 15217, USA
| | | | | | | | | |
Collapse
|
86
|
Abstract
Advances in sequencing technologies and increased access to sequencing services have led to renewed interest in sequence and genome assembly. Concurrently, new applications for sequencing have emerged, including gene expression analysis, discovery of genomic variants and metagenomics, and each of these has different needs and challenges in terms of assembly. We survey the theoretical foundations that underlie modern assembly and highlight the options and practical trade-offs that need to be considered, focusing on how individual features address the needs of specific applications. We also review key software and the interplay between experimental design and efficacy of assembly.
Collapse
Affiliation(s)
- Niranjan Nagarajan
- Computational and Systems Biology, Genome Institute of Singapore, 138672 Singapore
| | | |
Collapse
|
87
|
Liu Y, Schröder J, Schmidt B. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. ACTA ACUST UNITED AC 2012. [PMID: 23202746 DOI: 10.1093/bioinformatics/bts690] [Citation(s) in RCA: 180] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
MOTIVATION The imperfect sequence data produced by next-generation sequencing technologies have motivated the development of a number of short-read error correctors in recent years. The majority of methods focus on the correction of substitution errors, which are the dominant error source in data produced by Illumina sequencing technology. Existing tools either score high in terms of recall or precision but not consistently high in terms of both measures. RESULTS In this article, we present Musket, an efficient multistage k-mer-based corrector for Illumina short-read data. We use the k-mer spectrum approach and introduce three correction techniques in a multistage workflow: two-sided conservative correction, one-sided aggressive correction and voting-based refinement. Our performance evaluation results, in terms of correction quality and de novo genome assembly measures, reveal that Musket is consistently one of the top performing correctors. In addition, Musket is multi-threaded using a master-slave model and demonstrates superior parallel scalability compared with all other evaluated correctors as well as a highly competitive overall execution time. AVAILABILITY Musket is available at http://musket.sourceforge.net.
Collapse
Affiliation(s)
- Yongchao Liu
- Institut für Informatik, Johannes Gutenberg Universität Mainz, Mainz 55099, Germany.
| | | | | |
Collapse
|
88
|
Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH. Megraft: a software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Res Microbiol 2012; 163:407-12. [PMID: 22824070 DOI: 10.1016/j.resmic.2012.07.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2012] [Accepted: 06/26/2012] [Indexed: 12/31/2022]
Abstract
Metagenomic libraries represent subsamples of the total DNA found at a study site and offer unprecedented opportunities to study ecological and functional aspects of microbial communities. To examine the depth of a community sequencing effort, rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in the metagenome is usually performed. The fragmentary, non-overlapping nature of SSU sequences in metagenomic libraries poses a problem for this analysis, however. We introduce a software package - Megraft - that grafts SSU fragments onto full-length SSU sequences, accounting for observed and unobserved variability, for accurate assessment of species richness and sequencing depth in metagenomics endeavors.
Collapse
Affiliation(s)
- Johan Bengtsson
- Institute of Neuroscience and Physiology, The Sahlgrenska Academy, University of Gothenburg, Medicinaregatan 11, 405 30 Göteborg, Sweden.
| | | | | | | | | | | | | | | | | | | |
Collapse
|
89
|
Skums P, Dimitrova Z, Campo DS, Vaughan G, Rossi L, Forbi JC, Yokosawa J, Zelikovsky A, Khudyakov Y. Efficient error correction for next-generation sequencing of viral amplicons. BMC Bioinformatics 2012; 13 Suppl 10:S6. [PMID: 22759430 PMCID: PMC3382444 DOI: 10.1186/1471-2105-13-s10-s6] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing. RESULTS In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones. CONCLUSIONS Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses.The implementations of the algorithms and data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.
Collapse
Affiliation(s)
- Pavel Skums
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - Zoya Dimitrova
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - David S Campo
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - Gilberto Vaughan
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - Livia Rossi
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - Joseph C Forbi
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| | - Jonny Yokosawa
- Laboratório de Virologia, Instituto de Ciências Biomédicas, Universidade Federal de Uberlândia, Uberlândia 38400-902, Brazil
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, 34 Peachtree str., 30303, Atlanta, GA, USA
| | - Yury Khudyakov
- Laboratory of Molecular Epidemiology and Bioinformatics, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road NE, 30333 Atlanta, GA, USA
| |
Collapse
|
90
|
Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 2012; 7:e37135. [PMID: 22675423 PMCID: PMC3365034 DOI: 10.1371/journal.pone.0037135] [Citation(s) in RCA: 2086] [Impact Index Per Article: 160.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2012] [Accepted: 04/13/2012] [Indexed: 12/14/2022] Open
Abstract
The ability to efficiently and accurately determine genotypes is a keystone technology in modern genetics, crucial to studies ranging from clinical diagnostics, to genotype-phenotype association, to reconstruction of ancestry and the detection of selection. To date, high capacity, low cost genotyping has been largely achieved via “SNP chip” microarray-based platforms which require substantial prior knowledge of both genome sequence and variability, and once designed are suitable only for those targeted variable nucleotide sites. This method introduces substantial ascertainment bias and inherently precludes detection of rare or population-specific variants, a major source of information for both population history and genotype-phenotype association. Recent developments in reduced-representation genome sequencing experiments on massively parallel sequencers (commonly referred to as RAD-tag or RADseq) have brought direct sequencing to the problem of population genotyping, but increased cost and procedural and analytical complexity have limited their widespread adoption. Here, we describe a complete laboratory protocol, including a custom combinatorial indexing method, and accompanying software tools to facilitate genotyping across large numbers (hundreds or more) of individuals for a range of markers (hundreds to hundreds of thousands). Our method requires no prior genomic knowledge and achieves per-site and per-individual costs below that of current SNP chip technology, while requiring similar hands-on time investment, comparable amounts of input DNA, and downstream analysis times on the order of hours. Finally, we provide empirical results from the application of this method to both genotyping in a laboratory cross and in wild populations. Because of its flexibility, this modified RADseq approach promises to be applicable to a diversity of biological questions in a wide range of organisms.
Collapse
Affiliation(s)
- Brant K Peterson
- Department of Organismic & Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts, United States of America.
| | | | | | | | | |
Collapse
|
91
|
Yang X, Chockalingam SP, Aluru S. A survey of error-correction methods for next-generation sequencing. Brief Bioinform 2012; 14:56-66. [DOI: 10.1093/bib/bbs015] [Citation(s) in RCA: 177] [Impact Index Per Article: 13.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
92
|
Archer J, Baillie G, Watson SJ, Kellam P, Rambaut A, Robertson DL. Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II. BMC Bioinformatics 2012; 13:47. [PMID: 22443413 PMCID: PMC3359224 DOI: 10.1186/1471-2105-13-47] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2011] [Accepted: 03/23/2012] [Indexed: 01/23/2023] Open
Abstract
Background Next generation sequencing provides detailed insight into the variation present within viral populations, introducing the possibility of treatment strategies that are both reactive and predictive. Current software tools, however, need to be scaled up to accommodate for high-depth viral data sets, which are often temporally or spatially linked. In addition, due to the development of novel sequencing platforms and chemistries, each with implicit strengths and weaknesses, it will be helpful for researchers to be able to routinely compare and combine data sets from different platforms/chemistries. In particular, error associated with a specific sequencing process must be quantified so that true biological variation may be identified. Results Segminator II was developed to allow for the efficient comparison of data sets derived from different sources. We demonstrate its usage by comparing large data sets from 12 influenza H1N1 samples sequenced on both the 454 Life Sciences and Illumina platforms, permitting quantification of platform error. For mismatches median error rates at 0.10 and 0.12%, respectively, suggested that both platforms performed similarly. For insertions and deletions median error rates within the 454 data (at 0.3 and 0.2%, respectively) were significantly higher than those within the Illumina data (0.004 and 0.006%, respectively). In agreement with previous observations these higher rates were strongly associated with homopolymeric stretches on the 454 platform. Outside of such regions both platforms had similar indel error profiles. Additionally, we apply our software to the identification of low frequency variants. Conclusion We have demonstrated, using Segminator II, that it is possible to distinguish platform specific error from biological variation using data derived from two different platforms. We have used this approach to quantify the amount of error present within the 454 and Illumina platforms in relation to genomic location as well as location on the read. Given that next generation data is increasingly important in the analysis of drug-resistance and vaccine trials, this software will be useful to the pathogen research community. A zip file containing the source code and jar file is freely available for download from http://www.bioinf.manchester.ac.uk/segminator/.
Collapse
Affiliation(s)
- John Archer
- Computational and Evolutionary Biology, Faculty of Life Sciences, University of Manchester, Manchester, UK.
| | | | | | | | | | | |
Collapse
|
93
|
Studholme DJ. Deep sequencing of small RNAs in plants: applied bioinformatics. Brief Funct Genomics 2011; 11:71-85. [PMID: 22184332 DOI: 10.1093/bfgp/elr039] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Small RNAs, including microRNA and short-interfering RNAs, play important roles in plants. In recent years, developments in sequencing technology have enabled the large-scale discovery of sRNAs in various cells, tissues and developmental stages and in response to various stresses. This review describes the bioinformatics challenges to analysing these large datasets of short-RNA sequences and some of the solutions to those challenges.
Collapse
|
94
|
MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS One 2011; 6:e22594. [PMID: 21949676 PMCID: PMC3174933 DOI: 10.1371/journal.pone.0022594] [Citation(s) in RCA: 441] [Impact Index Per Article: 31.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2011] [Accepted: 06/29/2011] [Indexed: 01/15/2023] Open
Abstract
Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment. We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence. MACSE is distributed as an open-source java file executable with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse.
Collapse
|
95
|
Philippe N, Salson M, Lecroq T, Léonard M, Commes T, Rivals E. Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 2011; 12:242. [PMID: 21682852 PMCID: PMC3163563 DOI: 10.1186/1471-2105-12-242] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2010] [Accepted: 06/17/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High Throughput Sequencing (HTS) is now heavily exploited for genome (re-) sequencing, metagenomics, epigenomics, and transcriptomics and requires different, but computer intensive bioinformatic analyses. When a reference genome is available, mapping reads on it is the first step of this analysis. Read mapping programs owe their efficiency to the use of involved genome indexing data structures, like the Burrows-Wheeler transform. Recent solutions index both the genome, and the k-mers of the reads using hash-tables to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires to determine the sub-collection of reads that are related to a given sequence, which is done by searching for some k-mers in the reads. Currently, many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains broadly unexplored. However, the increase in sequence throughput urges for new algorithmic solutions to query large read collections efficiently. RESULTS Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast, while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq). CONCLUSIONS Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. The Gk arrays provide a flexible brick to design innovative programs that mine efficiently genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under Cecill (GPL compliant) license from http://www.atgc-montpellier.fr/ngs/.
Collapse
Affiliation(s)
- Nicolas Philippe
- LIRMM, UMR 5506, CNRS and Université de Montpellier 2, CC 477, 161 rue Ada, 34095 Montpellier, France
| | | | | | | | | | | |
Collapse
|
96
|
Ranwez V, Harispe S, Delsuc F, Douzery EJP. MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. PLoS One 2011. [PMID: 21949676 DOI: 10.1371/journal.pone.00022594] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/24/2023] Open
Abstract
Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment.We present an algorithm that has the same space and time complexity as the classical Needleman-Wunsch algorithm while accommodating sequencing errors and other biological deviations from the coding frame. The resulting pairwise coding sequence alignment method was extended to a multiple sequence alignment (MSA) algorithm implemented in a program called MACSE (Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons). MACSE is the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.MACSE is distributed as an open-source java file executable with freely available source code and can be used via a web interface at: http://mbb.univ-montp2.fr/macse.
Collapse
Affiliation(s)
- Vincent Ranwez
- Institut des Sciences de l'Evolution, UMR5554-CNRS, Université Montpellier II, Montpellier, France.
| | | | | | | |
Collapse
|