Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Medvedev P, Scott E, Kakaradov B, Pevzner P. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 2011;27:i137-41. [PMID: 21685062 PMCID: PMC3117386 DOI: 10.1093/bioinformatics/btr208] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

For:	Medvedev P, Scott E, Kakaradov B, Pevzner P. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 2011;27:i137-41. [PMID: 21685062 PMCID: PMC3117386 DOI: 10.1093/bioinformatics/btr208] [Citation(s) in RCA: 85] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Number

Cited by Other Article(s)

Pearson A, Lladser ME. On latent idealized models in symbolic datasets: unveiling signals in noisy sequencing data. J Math Biol 2023;87:26. [PMID: 37428265 DOI: 10.1007/s00285-023-01961-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/29/2020] [Revised: 06/19/2023] [Accepted: 06/25/2023] [Indexed: 07/11/2023]

Cheng C, Fei Z, Xiao P. Methods to improve the accuracy of next-generation sequencing. Front Bioeng Biotechnol 2023;11:982111. [PMID: 36741756 PMCID: PMC9895957 DOI: 10.3389/fbioe.2023.982111] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Accepted: 01/11/2023] [Indexed: 01/21/2023] Open

Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022;49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]

Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022;21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open

Abstract

Next-Generation Sequencing has produced incredible amounts of short-reads sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both the de novo assembly and error correction methods utilize the overlaps between reads data, a concern is that the will the sequencing errors bring up negative effects on genome assemblies also affect the compression of the NGS data. This work addresses two problems: how current error correction algorithms can enable the compression algorithms to make the sequence data much more compact, and whether the sequence-modified reads by the error-correction algorithms will lead to quality improvement for de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in the reference-free compression, hence can greatly improve the compression performance. Extensive test on practical collections of multiple short-read sets does confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes furthermore. The error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, it is still an open question that how to combine appropriate error correction methods with an assembly algorithm so that the assembly performance can be always significantly improved.

Collapse

Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;PP:1-1. [PMID: 34529572 DOI: 10.1109/tcbb.2021.3113131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]

Liao X, Li M, Luo J, Zou Y, Wu FX, Luo F, Wang J. EPGA-SC : A Framework for de novo Assembly of Single-Cell Sequencing Reads. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:1492-1503. [PMID: 31603794 DOI: 10.1109/tcbb.2019.2945761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Ciccolella S, Patterson M, Bonizzoni P, Della Vedova G. Effective Clustering for Single Cell Sequencing Cancer Data. IEEE J Biomed Health Inform 2021;25:4068-4078. [PMID: 34003758 DOI: 10.1109/jbhi.2021.3081380] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Li HD, Zhang W, Luo Y, Wang J. IsoDetect: Detection of Splice Isoforms from Third Generation Long Reads Based on Short Feature Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200316101205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]

Convex hulls in hamming space enable efficient search for similarity and clustering of genomic sequences. BMC Bioinformatics 2020;21:482. [PMID: 33375937 PMCID: PMC7772912 DOI: 10.1186/s12859-020-03811-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Accepted: 10/13/2020] [Indexed: 12/09/2022] Open

Abstract

Background

In molecular epidemiology, comparison of intra-host viral variants among infected persons is frequently used for tracing transmissions in human population and detecting viral infection outbreaks. Application of Ultra-Deep Sequencing (UDS) immensely increases the sensitivity of transmission detection but brings considerable computational challenges when comparing all pairs of sequences. We developed a new population comparison method based on convex hulls in hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods.

Results

The convex hull in hamming space is a data structure that provides information on: (1) average hamming distance within the set, (2) average hamming distance between two sets; (3) closeness centrality of each sequence; and (4) lower and upper bound of all the pairwise distances among the members of two sets. This filtering strategy rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on sequence heterogeneity of the studied populations in real and simulated datasets, suggesting the possibility of using clustering methods to improve the performance. To address this issue, we developed a new clustering algorithm, k-hulls, that reduces heterogeneity of the convex hull. This efficient algorithm is an extension of the k-means algorithm and can be used with any type of categorical data. It is 6.8-times more accurate than k-mode, a previously developed clustering algorithm for categorical data.

Conclusions

CHD is a fast and efficient filtering strategy for massively reducing the computational burden of pairwise comparison among large samples of sequences, and thus, aiding the calculation of transmission links among infected individuals using threshold-based methods. In addition, the convex hull efficiently obtains important summary metrics for intra-host viral populations.

Collapse

Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020;20:1140-1150. [PMID: 28968737 DOI: 10.1093/bib/bbx098] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023] Open

Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019;20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open

Abstract

BACKGROUND

Long-read sequencing has shown the promises to overcome the short length limitations of second-generation sequencing by providing more complete assembly. However, the computation of the long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads.

METHODS

In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences.ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base.

RESULTS

ParLECH outperforms latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% bases of an E. coli PacBio dataset with the reference genome, proving its accuracy.

CONCLUSION

ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.

Collapse

Ge J, Meng J, Guo N, Wei Y, Balaji P, Feng S. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci 2019;12:99-108. [PMID: 31734873 DOI: 10.1007/s12539-019-00348-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Revised: 08/19/2019] [Accepted: 10/25/2019] [Indexed: 11/25/2022]

Current challenges and solutions of de novo assembly. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0166-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]

Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019;20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022] Open

Ershov V, Tarasov A, Lapidus A, Korobeynikov A. IonHammer: Homopolymer-Space Hamming Clustering for IonTorrent Read Error Correction. J Comput Biol 2019;26:124-127. [DOI: 10.1089/cmb.2018.0152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018;7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022] Open

Kaisers W, Schwender H, Schaal H. Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities. Int J Mol Sci 2018;19:E3687. [PMID: 30469355 PMCID: PMC6274891 DOI: 10.3390/ijms19113687] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023] Open

Tsyvina V, Campo DS, Sims S, Zelikovsky A, Khudyakov Y, Skums P. Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants. BMC Bioinformatics 2018;19:360. [PMID: 30343669 PMCID: PMC6196405 DOI: 10.1186/s12859-018-2333-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023] Open

Shlemov A, Bankevich S, Bzikadze A, Turchaninova MA, Safonova Y, Pevzner PA. Reconstructing Antibody Repertoires from Error-Prone Immunosequencing Reads. THE JOURNAL OF IMMUNOLOGY 2017;199:3369-3380. [PMID: 28978691 DOI: 10.4049/jimmunol.1700485] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/27/2017] [Accepted: 08/24/2017] [Indexed: 12/16/2022]

Lee B, Moon T, Yoon S, Weissman T. DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 2017;12:e0181463. [PMID: 28749987 PMCID: PMC5531809 DOI: 10.1371/journal.pone.0181463] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2017] [Accepted: 06/30/2017] [Indexed: 11/29/2022] Open

Malhotra R, Jha M, Poss M, Acharya R. A random forest classifier for detecting rare variants in NGS data from viral populations. Comput Struct Biotechnol J 2017;15:388-395. [PMID: 28819548 PMCID: PMC5548337 DOI: 10.1016/j.csbj.2017.07.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2017] [Revised: 07/01/2017] [Accepted: 07/03/2017] [Indexed: 11/28/2022] Open

Mysara M, Njima M, Leys N, Raes J, Monsieurs P. From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data. Gigascience 2017;6:1-10. [PMID: 28369460 PMCID: PMC5466709 DOI: 10.1093/gigascience/giw017] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2016] [Accepted: 12/27/2016] [Indexed: 01/09/2023] Open

Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics 2017;33:1324-1330. [PMID: 28453674 PMCID: PMC5408799 DOI: 10.1093/bioinformatics/btw832] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2016] [Revised: 12/21/2016] [Accepted: 12/27/2016] [Indexed: 12/21/2022] Open

Abstract

Motivation

Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k -mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k -mers, or even better, to build a histogram of k -mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k -mer histogram from large volumes of sequencing data is a challenging task.

Results

Here, we present ntCard, a streaming algorithm for estimating the frequencies of k -mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k -mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications.

Availability and Implementation

ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard.

Contact

hmohamadi@bcgsc.ca or ibirol@bcgsc.ca.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017;33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open

Draft Genome Sequence of Staphylococcus hominis BHG17 Isolated from Wild Bar-Headed Goose (Anser indicus) Feces. GENOME ANNOUNCEMENTS 2017;5:5/5/e01552-16. [PMID: 28153901 PMCID: PMC5289687 DOI: 10.1128/genomea.01552-16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]

From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016;118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open

Akogwu I, Wang N, Zhang C, Gong P. A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis. Hum Genomics 2016;10 Suppl 2:20. [PMID: 27461106 PMCID: PMC4965716 DOI: 10.1186/s40246-016-0068-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Open

Abstract

BACKGROUND

Innumerable opportunities for new genomic research have been stimulated by advancement in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the complication of distinction between true biological variants and sequence error alterations during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strength and weakness as well as limitations of some newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods with respect to specific NGS datasets.

METHODS

Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positive, false negative, recall, precision, gain, and F-score) for assessing the correction quality of each method.

RESULTS

Results from computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred to a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.

CONCLUSIONS

This study demonstrates that various factors such as coverage depth, read length, and genome size may influence performance of individual k-spectrum-based error correction methods. Thus, efforts have to be paid in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.

Collapse

Use of Multiple Sequencing Technologies To Produce a High-Quality Genome of the Fungus Pseudogymnoascus destructans, the Causative Agent of Bat White-Nose Syndrome. GENOME ANNOUNCEMENTS 2016;4:4/3/e00445-16. [PMID: 27365344 PMCID: PMC4929507 DOI: 10.1128/genomea.00445-16] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]

Mamun AA, Pal S, Rajasekaran S. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 2016;32:2783-90. [PMID: 27283950 DOI: 10.1093/bioinformatics/btw345] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2016] [Accepted: 05/25/2016] [Indexed: 01/30/2023] Open

Draft Genome Sequence of Bacillus megaterium BHG1.1, a Strain Isolated from Bar-Headed Goose (Anser indicus) Feces on the Qinghai-Tibet Plateau. GENOME ANNOUNCEMENTS 2016;4:4/3/e00317-16. [PMID: 27174262 PMCID: PMC4866837 DOI: 10.1128/genomea.00317-16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]

The A, C, G, and T of Genome Assembly. BIOMED RESEARCH INTERNATIONAL 2016;2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]

Bremges A, Singer E, Woyke T, Sczyrba A. MeCorS: Metagenome-enabled error correction of single cell sequencing reads. Bioinformatics 2016;32:2199-201. [PMID: 27153586 PMCID: PMC4937190 DOI: 10.1093/bioinformatics/btw144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2015] [Accepted: 03/09/2016] [Indexed: 11/12/2022] Open

Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016;18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open

Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review ofde novostand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]

Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016;17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open

Pal S, Aluru S. In search of perfect reads. BMC Bioinformatics 2015;16 Suppl 17:S7. [PMID: 26679555 PMCID: PMC4674851 DOI: 10.1186/1471-2105-16-s17-s7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open

Abstract

BACKGROUND

Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources.

METHODS

We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed in all reads, and a large percentage of reads are indeed error-free. Ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast k-spectrum analysis based method to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency depending upon the downstream application need.

RESULTS

Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with more than 90% precision rate. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: we show that an error correction algorithm, Reptile, which rely on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead use reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%.

CONCLUSIONS

Thanks to the continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision just filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application, this in turn gives a new class of error correcting algorithms as a by product.

Collapse

Saha S, Rajasekaran S. EC: an efficient error correction algorithm for short reads. BMC Bioinformatics 2015;16 Suppl 17:S2. [PMID: 26678663 PMCID: PMC4674864 DOI: 10.1186/1471-2105-16-s17-s2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/30/2023] Open

Safonova Y, Bonissone S, Kurpilyansky E, Starostina E, Lapidus A, Stinson J, DePalatis L, Sandoval W, Lill J, Pevzner PA. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics 2015;31:i53-61. [PMID: 26072509 PMCID: PMC4542777 DOI: 10.1093/bioinformatics/btv238] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open

Affiliation(s)

Yana Safonova Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Stefano Bonissone Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Eugene Kurpilyansky Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Ekaterina Starostina Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Alla Lapidus Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Jeremy Stinson Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Laura DePalatis Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Wendy Sandoval Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Jennie Lill Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Pavel A Pevzner Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA

Collapse

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015;4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022] Open

Sahl JW, Del Franco M, Pournaras S, Colman RE, Karah N, Dijkshoorn L, Zarrilli R. Phylogenetic and genomic diversity in isolates from the globally distributed Acinetobacter baumannii ST25 lineage. Sci Rep 2015;5:15188. [PMID: 26462752 PMCID: PMC4604477 DOI: 10.1038/srep15188] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2015] [Accepted: 09/22/2015] [Indexed: 02/06/2023] Open

Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015;31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open

Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015;15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2014] [Indexed: 02/02/2023] Open

Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, Zook JM. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front Genet 2015. [PMID: 26217378 PMCID: PMC4493402 DOI: 10.3389/fgene.2015.00235] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open

Sahl JW, Schupp JM, Rasko DA, Colman RE, Foster JT, Keim P. Phylogenetically typing bacterial strains from partial SNP genotypes observed from direct sequencing of clinical specimen metagenomic data. Genome Med 2015;7:52. [PMID: 26136847 PMCID: PMC4487561 DOI: 10.1186/s13073-015-0176-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2015] [Accepted: 05/15/2015] [Indexed: 12/30/2022] Open

Computational and Statistical Analyses of Insertional Polymorphic Endogenous Retroviruses in a Non-Model Organism. COMPUTATION 2014. [DOI: 10.3390/computation2040221] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]

Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014;16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open

Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 2014;9:e101271. [PMID: 25062443 PMCID: PMC4111482 DOI: 10.1371/journal.pone.0101271] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2013] [Accepted: 06/04/2014] [Indexed: 11/19/2022] Open

Roy RS, Bhattacharya D, Schliep A. Turtle: Identifying frequent k -mers with cache-efficient algorithms. Bioinformatics 2014;30:1950-7. [DOI: 10.1093/bioinformatics/btu132] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open