1
Pearson A, Lladser ME. On latent idealized models in symbolic datasets: unveiling signals in noisy sequencing data. J Math Biol 2023; 87:26. [PMID: 37428265 DOI: 10.1007/s00285-023-01961-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2020] [Revised: 06/19/2023] [Accepted: 06/25/2023] [Indexed: 07/11/2023]
Abstract
Data taking values on discrete sample spaces are the embodiment of modern biological research. "Omics" experiments based on high-throughput sequencing produce millions of symbolic outcomes in the form of reads (i.e., DNA sequences of a few dozen to a few hundred nucleotides). Unfortunately, these intrinsically non-numerical datasets often deviate dramatically from natural assumptions a practitioner might make, and the possible sources of this deviation are usually poorly characterized. This contrasts with numerical datasets where Gaussian-type errors are often well-justified. To overcome this hurdle, we introduce the notion of latent weight, which measures the largest expected fraction of samples from a probabilistic source that conform to a model in a class of idealized models. We examine various properties of latent weights, which we specialize to the class of exchangeable probability distributions. As proof of concept, we analyze DNA methylation data from the 22 human autosome pairs. Contrary to what is usually assumed in the literature, we provide strong evidence that highly specific methylation patterns are overrepresented at some genomic locations when latent weights are taken into account.
Affiliation(s)
- Antony Pearson
  - Department of Applied Mathematics, University of Colorado Boulder, Boulder, CO, USA
- Manuel E Lladser
  - Department of Applied Mathematics, University of Colorado Boulder, Boulder, CO, USA
2
Cheng C, Fei Z, Xiao P. Methods to improve the accuracy of next-generation sequencing. Front Bioeng Biotechnol 2023; 11:982111. [PMID: 36741756 PMCID: PMC9895957 DOI: 10.3389/fbioe.2023.982111] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/30/2022] [Accepted: 01/11/2023] [Indexed: 01/21/2023] Open
Abstract
Next-generation sequencing (NGS) is present in all fields of life science; it has greatly promoted the development of basic research and is gradually being applied in clinical diagnosis. However, the cost and throughput advantages of next-generation sequencing are offset by large tradeoffs with respect to read length and accuracy. Specifically, its high error rate makes it extremely difficult to detect SNPs or low-abundance mutations, limiting its clinical applications, such as pharmacogenomics studies primarily based on SNPs and early clinical diagnosis primarily based on low-abundance mutations. Currently, Sanger sequencing is still considered to be the gold standard due to its high accuracy, so the results of next-generation sequencing require verification by Sanger sequencing in clinical practice. In order to maintain high-quality next-generation sequencing data, a variety of improvements at the levels of template preparation, sequencing strategy and data processing have been developed. This study summarized the general procedures of next-generation sequencing platforms, highlighting the improvements involved in eliminating errors at each step. Furthermore, the challenges and future development of next-generation sequencing in clinical application were discussed.
3
Genome sequence assembly algorithms and misassembly identification methods. Mol Biol Rep 2022; 49:11133-11148. [PMID: 36151399 DOI: 10.1007/s11033-022-07919-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/20/2022] [Accepted: 09/05/2022] [Indexed: 10/14/2022]
Abstract
The sequence assembly algorithms have rapidly evolved with the vigorous growth of genome sequencing technology over the past two decades. Assembly mainly uses the iterative expansion of overlap relationships between sequences to construct the target genome. The assembly algorithms can be typically classified into several categories, such as the Greedy strategy, Overlap-Layout-Consensus (OLC) strategy, and de Bruijn graph (DBG) strategy. In particular, due to the rapid development of third-generation sequencing (TGS) technology, some prevalent assembly algorithms have been proposed to generate high-quality chromosome-level assemblies. However, due to the genome complexity, the length of short reads, and the high error rate of long reads, contigs produced by assembly may contain misassemblies adversely affecting downstream data analysis. Therefore, several read-based and reference-based methods for misassembly identification have been developed to improve assembly quality. This work primarily reviewed the development of DNA sequencing technologies and summarized sequencing data simulation methods, sequencing error correction methods, various mainstream sequence assembly algorithms, and misassembly identification methods. A large amount of computation makes the sequence assembly problem more challenging, and therefore, it is necessary to develop more efficient and accurate assembly algorithms and alternative algorithms.
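To make the de Bruijn graph (DBG) strategy mentioned above concrete, the following is a minimal sketch, assuming error-free reads and a fixed k. The function name `build_de_bruijn` and the toy reads are illustrative and do not come from the reviewed work; real assemblers add error handling, reverse complements, and graph simplification.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read.

    Every k-mer in a read contributes one edge from its prefix to its suffix.
    Repeated k-mers produce repeated edges, which act as a crude coverage count.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy input: two overlapping reads.
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
```

Walking the edges from `AC` spells out the merged sequence of the two reads, which is the intuition behind DBG-based assembly.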
4
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Next-generation sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error correction methods utilize the overlaps between reads, a concern is whether the sequencing errors that negatively affect genome assemblies also affect the compression of NGS data. This work addresses two problems: how current error correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence can greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes further. Error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkable ones. However, how to combine appropriate error correction methods with an assembly algorithm so that assembly performance is always significantly improved remains an open question.
Affiliation(s)
- Tao Tang
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
  - School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
- Gyorgy Hutvagner
  - School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
- Wenjian Wang
  - School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
- Jinyan Li
  - Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
5
Leinonen M, Salmela L. Extraction of long k-mers using spaced seeds. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; PP:1-1. [PMID: 34529572 DOI: 10.1109/tcbb.2021.3113131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/13/2023]
Abstract
The extraction of k-mers from reads is an important task in many bioinformatics applications, such as all DNA sequence analysis methods based on de Bruijn graphs. These methods tend to be more accurate when the used k-mers are unique in the analyzed DNA, and thus the use of longer k-mers is preferred. When the read lengths of short read sequencing technologies increase, the error rate will become the determining factor for the largest possible value of k. Here we propose LoMeX which uses spaced seeds to extract long k-mers accurately even in the presence of sequencing errors. Our experiments show that LoMeX can extract long k-mers from current Illumina reads with a similar or higher recall than a standard k-mer counting tool. Furthermore, our experiments on simulated data show that when the read length further increases enabling even longer k-mers, the performance of standard k-mer counters declines, whereas LoMeX still extracts long k-mers successfully.
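The idea of tolerating sequencing errors with spaced seeds can be sketched as follows. This is an illustrative simplification and not the LoMeX implementation: the seed pattern, the grouping by masked key, and the per-position majority vote are all assumptions made for the example. A seed is a string of `1` (care) and `0` (don't care) positions; k-mers that agree on every care position fall into the same group, and errors at don't-care positions are voted out.

```python
from collections import Counter, defaultdict


def masked_key(kmer, seed):
    """Keep the characters at '1' (care) positions of the seed."""
    return "".join(c for c, s in zip(kmer, seed) if s == "1")


def consensus_kmers(reads, seed):
    """Group k-mers by masked key and take a per-position majority vote."""
    k = len(seed)
    groups = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            groups[masked_key(kmer, seed)].append(kmer)
    return {
        key: "".join(Counter(column).most_common(1)[0][0] for column in zip(*kmers))
        for key, kmers in groups.items()
    }


# Hypothetical toy input: the third read has an error at a don't-care position.
seed = "1101011"
reads = ["ACGTACG", "ACGTACG", "ACTTACG"]
consensus = consensus_kmers(reads, seed)
```

An error that lands on a care position would put the k-mer in a different group, which is why practical tools typically combine several seed patterns.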
6
Liao X, Li M, Luo J, Zou Y, Wu FX, Luo F, Wang J. EPGA-SC: A Framework for de novo Assembly of Single-Cell Sequencing Reads. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1492-1503. [PMID: 31603794 DOI: 10.1109/tcbb.2019.2945761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/10/2023]
Abstract
Assembling genomes from single-cell sequencing data is essential for single-cell studies. However, single-cell assemblies are challenging due to (i) the highly non-uniform read coverage and (ii) the elevated levels of sequencing errors and chimeric reads. Although several assemblers for single-cell data have been proposed in recent years, most of them fail to construct correct long contigs. In this study, we present a new framework called EPGA-SC for de novo assembly of single-cell sequencing reads. The EPGA assembler has designed strategies to solve the problems caused by sequencing errors, sequencing biases, and repetitive regions. However, the extremely unbalanced coverage and richer error types prevent EPGA from achieving high performance on single-cell sequencing data. We therefore designed EPGA-SC based on EPGA. The main innovations of EPGA-SC are as follows: (i) classifying reads to reduce the proportion of false reads; (ii) using multiple sets of high-precision paired-end reads generated from the high-precision assemblies produced by other assemblers such as SPAdes to overcome the impact of sequencing biases and repetitive regions; and (iii) developing novel algorithms for removing chimeric errors and extending contigs. We test EPGA-SC with seven datasets. The experimental results show that EPGA-SC can generate better assemblies than most current tools in most cases in terms of MAX contig, N50, NG50, NA50, and NGA50.
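For readers unfamiliar with the N50 metric named above, here is a minimal sketch using the standard definition: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and example lengths are illustrative; NG50 follows the same procedure but uses an estimated genome size in place of the assembly total.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths: total is 290, and 80 + 70 = 150 covers half.
example_lengths = [80, 70, 50, 40, 30, 20]
result = n50(example_lengths)  # 70
```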
7
Ciccolella S, Patterson M, Bonizzoni P, Della Vedova G. Effective Clustering for Single Cell Sequencing Cancer Data. IEEE J Biomed Health Inform 2021; 25:4068-4078. [PMID: 34003758 DOI: 10.1109/jbhi.2021.3081380] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/08/2022]
Abstract
Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes inference difficult, and sometimes infeasible, using current approaches and tools. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells) and to infer the tree from this reduced-size instance. In this work, we present a new clustering procedure, called celluloid, aimed at clustering such categorical vector or matrix data, here representing SCS instances. We show that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. We demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and considerably raising the upper bound on the size of SCS instances that can be solved in practice. Our approach, celluloid (clustering single cell sequencing data around centroids), is available at https://github.com/AlgoLab/celluloid/ under an MIT license, as well as on the Python Package Index (PyPI) at https://pypi.org/project/celluloid-clust/.
8
Heo Y, Manikandan G, Ramachandran A, Chen D. Comprehensive Evaluation of Error-Correction Methodologies for Genome Sequencing Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
9
Li HD, Zhang W, Luo Y, Wang J. IsoDetect: Detection of Splice Isoforms from Third Generation Long Reads Based on Short Feature Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200316101205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
Background:
Transcriptome annotation is the basis for understanding gene structures and analysing gene expression. The transcriptome annotation of many organisms such as humans is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms due to their long length, which exceeds the length of most isoforms. One limitation of current TGS reads-based isoform detection methods is that they are exclusively based on sequence reads, without incorporating the sequence information of annotated isoforms.
Objective:
We aim to develop a method to detect isoforms by incorporating annotated isoforms.
Methods:
Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequences at exon-exon junctions are extracted from annotated isoforms as “short feature sequences”, which are used to distinguish splice isoforms. Second, we align these feature sequences to long reads and partition long reads into groups that contain the same set of feature sequences, thereby avoiding the pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For the long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Therefore, our method can detect not only known but also novel isoforms.
Result:
Tested on two datasets from Calypte anna and Zebra Finch, IsoDetect shows higher speed and good accuracy compared with four existing methods.
Conclusion:
IsoDetect may become a promising method for isoform detection.
Affiliation(s)
- Hong-Dong Li
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Wenjing Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Yuwen Luo
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
10
Convex hulls in hamming space enable efficient search for similarity and clustering of genomic sequences. BMC Bioinformatics 2020; 21:482. [PMID: 33375937 PMCID: PMC7772912 DOI: 10.1186/s12859-020-03811-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2020] [Accepted: 10/13/2020] [Indexed: 12/09/2022] Open
Abstract
Background: In molecular epidemiology, comparison of intra-host viral variants among infected persons is frequently used for tracing transmissions in human populations and detecting viral infection outbreaks. Application of Ultra-Deep Sequencing (UDS) immensely increases the sensitivity of transmission detection but brings considerable computational challenges when comparing all pairs of sequences. We developed a new population comparison method based on convex hulls in Hamming space. We applied this method to a large set of UDS samples obtained from unrelated cases infected with hepatitis C virus (HCV) and compared its performance with three previously published methods. Results: The convex hull in Hamming space is a data structure that provides information on: (1) average Hamming distance within the set; (2) average Hamming distance between two sets; (3) closeness centrality of each sequence; and (4) lower and upper bounds of all the pairwise distances among the members of two sets. This filtering strategy rapidly and correctly removes 96.2% of all pairwise HCV sample comparisons, outperforming all previous methods. The convex hull distance (CHD) algorithm showed variable performance depending on sequence heterogeneity of the studied populations in real and simulated datasets, suggesting the possibility of using clustering methods to improve the performance. To address this issue, we developed a new clustering algorithm, k-hulls, that reduces heterogeneity of the convex hull. This efficient algorithm is an extension of the k-means algorithm and can be used with any type of categorical data. It is 6.8 times more accurate than k-mode, a previously developed clustering algorithm for categorical data.
Conclusions: CHD is a fast and efficient filtering strategy for massively reducing the computational burden of pairwise comparison among large samples of sequences, thus aiding the calculation of transmission links among infected individuals using threshold-based methods. In addition, the convex hull efficiently obtains important summary metrics for intra-host viral populations.
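The summary quantities listed in the abstract (average between-set Hamming distance and the lower and upper bounds on pairwise distances) can be illustrated with a brute-force baseline. This sketch is not the convex hull data structure from the paper, which is designed to avoid exactly this all-pairs computation; the function names and sample sequences are assumptions for illustration only.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def between_set_summary(set_a, set_b):
    """Mean, minimum, and maximum Hamming distance over all cross-set pairs."""
    distances = [hamming(a, b) for a, b in product(set_a, set_b)]
    return {
        "mean": sum(distances) / len(distances),
        "min": min(distances),
        "max": max(distances),
    }


# Hypothetical intra-host variant samples from two cases.
sample_a = ["ACGT", "ACGA"]
sample_b = ["ACGT", "TCGA"]
summary = between_set_summary(sample_a, sample_b)
```

A threshold-based transmission test would compare the `min` value against a cutoff, so a cheap lower bound that already exceeds the cutoff lets a pair of samples be skipped without computing every distance.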
11
Olson ND, Treangen TJ, Hill CM, Cepeda-Espinoza V, Ghurye J, Koren S, Pop M. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Brief Bioinform 2020; 20:1140-1150. [PMID: 28968737 DOI: 10.1093/bib/bbx098] [Citation(s) in RCA: 86] [Impact Index Per Article: 17.2] [Received: 05/02/2017] [Revised: 07/13/2017] [Indexed: 01/09/2023] Open
Abstract
Metagenomic samples are snapshots of complex ecosystems at work. They comprise hundreds of known and unknown species, contain multiple strain variants and vary greatly within and across environments. Many microbes found in microbial communities are not easily grown in culture making their DNA sequence our only clue into their evolutionary history and biological function. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. Current methods have made significant strides in reconstructing DNA segments comprising operons, tandem gene arrays and syntenic blocks. Shorter, higher-throughput sequencing technologies have become the de facto standard in the field. Sequencers are now able to generate billions of short reads in only a few days. Multiple metagenomic assembly strategies, pipelines and assemblers have appeared in recent years. Owing to the inherent complexity of metagenome assembly, regardless of the assembly algorithm and sequencing method, metagenome assemblies contain errors. Recent developments in assembly validation tools have played a pivotal role in improving metagenomics assemblers. Here, we survey recent progress in the field of metagenomic assembly, provide an overview of key approaches for genomic and metagenomic assembly validation and demonstrate the insights that can be derived from assemblies through the use of assembly validation strategies. We also discuss the potential for impact of long-read technologies in metagenomics. We conclude with a discussion of future challenges and opportunities in the field of metagenomic assembly and validation.
12
Das AK, Goswami S, Lee K, Park SJ. A hybrid and scalable error correction algorithm for indel and substitution errors of long reads. BMC Genomics 2019; 20:948. [PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/23/2023] Open
Abstract
BACKGROUND Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. METHODS In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority vote to rectify each substituted error base. RESULTS ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, demonstrating its accuracy. CONCLUSION ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
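The widest path (maximum min-coverage path) named in the abstract can be found with a Dijkstra-style search that maximizes the smallest edge weight along the path. The sketch below is a generic single-machine version under assumed inputs, not ParLECH's distributed implementation: the graph is a plain dictionary, the node names are placeholders for (k-1)-mers, and the edge weights stand in for k-mer coverage.

```python
import heapq


def widest_path(graph, source, target):
    """Return (bottleneck, path) for the path whose minimum edge weight is largest.

    graph maps a node to a list of (neighbor, coverage) pairs.
    Returns (0, []) when the target cannot be reached.
    """
    best = {source: float("inf")}
    parent = {}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for neighbor, coverage in graph.get(node, []):
            candidate = min(width, coverage)
            if candidate > best.get(neighbor, 0):
                best[neighbor] = candidate
                parent[neighbor] = node
                heapq.heappush(heap, (-candidate, neighbor))
    if target not in best:
        return 0, []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]


# Hypothetical graph: the route through "a" has bottleneck 5, through "b" only 2.
graph = {"s": [("a", 5), ("b", 9)], "a": [("t", 8)], "b": [("t", 2)]}
bottleneck, path = widest_path(graph, "s", "t")
```

Preferring the bottleneck over the sum of coverages avoids paths that pass through a single weakly supported edge, which is more likely to be a sequencing error.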
Affiliation(s)
- Arghya Kusum Das
  - Department of Computer Science and Software Engineering, University of Wisconsin at Platteville, Platteville, WI USA
- Sayan Goswami
  - School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
- Kisung Lee
  - School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
- Seung-Jong Park
  - School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, Baton Rouge, Baton Rouge, LA USA
13
Ge J, Meng J, Guo N, Wei Y, Balaji P, Feng S. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci 2019; 12:99-108. [PMID: 31734873 DOI: 10.1007/s12539-019-00348-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/08/2019] [Revised: 08/19/2019] [Accepted: 10/25/2019] [Indexed: 11/25/2022]
Abstract
Counting the abundance of all the distinct kmers in biological sequence data is a fundamental step in bioinformatics. Its applications include de novo genome assembly, error correction, etc. With the development of sequencing technology, the sequence data in a single project can reach Petabyte-scale or Terabyte-scale nucleotides. Counting kmer abundance in such data exceeds the memory and computing capacity of a single computing node, and processing it efficiently on a high-performance computing cluster is a challenge. As such, we propose SWAPCounter, a highly scalable distributed approach for kmer counting. This approach is embedded with an MPI streaming I/O module for loading huge data sets at high speed, and a counting bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter performs competitively with two other tools, KMC2 and MSPKmerCounter, in a shared-memory environment. Moreover, SWAPCounter also shows the highest scalability under strong scaling experiments. In our experiment on the Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as baseline) when processing 4 TB of sequence data from 1000 Genomes. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
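A counting Bloom filter, the data structure mentioned above, trades exactness for memory: each item increments a few hashed counters, and the minimum of those counters is an upper bound on the item's true count. The following single-process sketch is illustrative only and is not taken from SWAPCounter; the class name, the default sizes, and the SHA-256 based hashing are assumptions made for the example.

```python
import hashlib


class CountingBloomFilter:
    """Approximate multiset: estimates never fall below the true count."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive several counter positions by salting one hash function.
        positions = set()
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def add(self, item):
        for index in self._indexes(item):
            self.counters[index] += 1

    def estimate(self, item):
        return min(self.counters[index] for index in self._indexes(item))


def count_kmers(reads, k, cbf):
    """Feed every k-mer of every read into the filter."""
    for read in reads:
        for i in range(len(read) - k + 1):
            cbf.add(read[i:i + k])


# Hypothetical toy input: "ACG" occurs twice, "CGA" and "GAC" once each.
cbf = CountingBloomFilter()
count_kmers(["ACGACG"], 3, cbf)
```

Collisions between different k-mers can inflate an estimate, so the counter array has to be sized relative to the number of distinct k-mers expected.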
Affiliation(s)
- Jianqiu Ge
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
- Jintao Meng
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
- Ning Guo
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
- Yanjie Wei
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
- Pavan Balaji
  - Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, 60439-4844, USA
- Shengzhong Feng
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
14
15
Manekar SC, Sathe SR. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art. Curr Genomics 2019; 20:2-15. [PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/23/2018] [Revised: 10/05/2018] [Accepted: 10/24/2018] [Indexed: 12/24/2022] Open
Abstract
Background In bioinformatics, estimation of k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. Objective In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. Methods Principally, the miscounts/error rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. Results The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has higher memory requirements than KmerGenie. Conclusion The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
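The quantities compared in this assessment can be made concrete with an exact, in-memory computation. The sketch below is not one of the evaluated streaming estimators; it is a small reference calculation, with an assumed function name and toy read, showing what F0 (number of distinct k-mers), f1 (number of k-mers seen exactly once), and the abundance histogram mean.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return exact k-mer counts and the abundance histogram.

    histogram[c] is the number of distinct k-mers that occur exactly c times.
    """
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    histogram = dict(Counter(counts.values()))
    return counts, histogram


# Hypothetical toy read: k-mers are ACG, CGA, GAC, ACG, CGT.
counts, histogram = kmer_histogram(["ACGACGT"], k=3)
F0 = len(counts)           # distinct k-mers
f1 = histogram.get(1, 0)   # singletons
```

Exact counting like this needs memory proportional to the number of distinct k-mers, which is the cost that streaming estimators are designed to avoid.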
Affiliation(s)
- Swati C Manekar
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
- Shailesh R Sathe
  - Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
16
Ershov V, Tarasov A, Lapidus A, Korobeynikov A. IonHammer: Homopolymer-Space Hamming Clustering for IonTorrent Read Error Correction. J Comput Biol 2019; 26:124-127. [DOI: 10.1089/cmb.2018.0152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Affiliation(s)
- Vasily Ershov
  - Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
- Artem Tarasov
  - European Molecular Biology Laboratory, Heidelberg, Germany
- Alla Lapidus
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Anton Korobeynikov
  - Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
  - Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
17
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience 2018; 7:5140149. [PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/25/2017] [Accepted: 10/16/2018] [Indexed: 11/25/2022] Open
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values, and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
Affiliation(s)
- Swati C Manekar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| | - Shailesh R Sathe
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440 010, India
| |
|
18
|
Kaisers W, Schwender H, Schaal H. Hierarchical Clustering of DNA k-mer Counts in RNAseq Fastq Files Identifies Sample Heterogeneities. Int J Mol Sci 2018; 19:E3687. [PMID: 30469355 PMCID: PMC6274891 DOI: 10.3390/ijms19113687] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/05/2018] [Accepted: 11/15/2018] [Indexed: 01/14/2023] Open
Abstract
We apply hierarchical clustering (HC) of DNA k-mer counts on multiple Fastq files. The tree structures produced by HC may reflect experimental groups and thereby indicate experimental effects, but clustering of preparation groups indicates the presence of batch effects. Hence, HC of DNA k-mer counts may serve as a diagnostic device. To provide a simple, applicable tool, we implemented sequential analysis of Fastq reads with low memory usage in an R package (seqTools) available on Bioconductor. The approach is validated by analysis of Fastq file batches containing RNAseq data. Analysis of three Fastq batches downloaded from ArrayExpress indicated experimental effects. Analysis of RNAseq data from two cell types (dermal fibroblasts and Jurkat cells) sequenced in our facility indicates the presence of batch effects. The observed batch effects were also present in reads mapped to the human genome and in reads filtered for high quality (Phred > 30). We propose that hierarchical clustering of DNA k-mer counts provides an unspecific diagnostic tool for RNAseq experiments. Further exploration is required once samples are identified as outliers in HC-derived trees.
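The idea of clustering samples by their k-mer counts can be sketched in a few lines. The example below is an illustration only: it is written in Python rather than R, it does not use the seqTools package, and the Manhattan distance, the average-linkage rule, the function names, and the toy samples are all assumptions made here rather than choices documented in the abstract.

```python
from collections import Counter
from itertools import product


def kmer_profile(reads, k):
    """Relative frequency of every DNA k-mer, in a fixed order, across the reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def manhattan(u, v):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))


def average_linkage(profiles):
    """Naive agglomerative clustering; returns the merges in order.

    Each merge is (members_a, members_b, distance), where the members
    are tuples of sample names.
    """
    clusters = {(name,): [vec] for name, vec in profiles.items()}
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        best = None
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
                d = sum(manhattan(x, y) for x, y in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges


# Hypothetical samples: s1 and s2 are AT-rich, s3 is GC-rich.
samples = {
    "s1": ["AAAATTTT", "ATATATAT"],
    "s2": ["AAATTTTA", "TATATATA"],
    "s3": ["GGGGCCCC", "GCGCGCGC"],
}
profiles = {name: kmer_profile(reads, 2) for name, reads in samples.items()}
merges = average_linkage(profiles)
```

In this toy case the two AT-rich samples merge first and the GC-rich sample joins last. In a real experiment, a sample that joins the tree late or groups with the wrong batch would be the kind of outlier the abstract suggests investigating further.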
Affiliation(s)
- Wolfgang Kaisers
- Department of Anaesthesiology, HELIOS University Hospital Wuppertal, University of Witten/Herdecke, Heusnerstr. 40, 42283 Wuppertal, Germany.
- Institut für Virologie, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
| | - Holger Schwender
- Mathematisches Institut, Heinrich-Heine-Universität Düsseldorf, 40225 Düsseldorf, Germany.
| | - Heiner Schaal
- Institut für Virologie, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany.
| |
|
19
|
Tsyvina V, Campo DS, Sims S, Zelikovsky A, Khudyakov Y, Skums P. Fast estimation of genetic relatedness between members of heterogeneous populations of closely related genomic variants. BMC Bioinformatics 2018; 19:360. [PMID: 30343669 PMCID: PMC6196405 DOI: 10.1186/s12859-018-2333-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/27/2023] Open
Abstract
Background Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences from viral datasets sampled from infected individuals, and studying the evolution of viruses or immune repertoires by analyzing networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naïve algorithms for extracting such sequence families are impractical in light of the massive size of modern NGS datasets. Results In this paper, we present a fast and scalable k-mer-based framework to perform such sequence similarity queries efficiently, which specifically targets data produced by deep sequencing of heterogeneous populations such as viruses. It shows better filtering quality and time performance than other tools. The tool is freely available for download at https://github.com/vyacheslav-tsivina/signature-sj Conclusion The proposed tool allows for efficient detection of genetic relatedness between genomic samples produced by deep sequencing of heterogeneous populations. It should be especially useful for analysis of relatedness of genomes of viruses with unevenly distributed variable genomic regions, such as HIV and HCV. In the future, we envision that, besides applications in molecular epidemiology, the tool can also be adapted to immunosequencing and metagenomics data.
Affiliation(s)
- Viachaslau Tsyvina
- Computer Science Department, Georgia State University, 25 Park Place NE, Atlanta, 30303, GA, USA.
| | - David S Campo
- Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, 30333, GA, USA
| | - Seth Sims
- Computer Science Department, Georgia State University, 25 Park Place NE, Atlanta, 30303, GA, USA; Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, 30333, GA, USA
| | - Alex Zelikovsky
- Computer Science Department, Georgia State University, 25 Park Place NE, Atlanta, 30303, GA, USA
| | - Yury Khudyakov
- Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, 30333, GA, USA
| | - Pavel Skums
- Computer Science Department, Georgia State University, 25 Park Place NE, Atlanta, 30303, GA, USA; Molecular Epidemiology and Bioinformatics Laboratory, Division of Viral Hepatitis, Centers for Disease Control and Prevention, 1600 Clifton Road, Atlanta, 30333, GA, USA
| |
|
20
|
Shlemov A, Bankevich S, Bzikadze A, Turchaninova MA, Safonova Y, Pevzner PA. Reconstructing Antibody Repertoires from Error-Prone Immunosequencing Reads. The Journal of Immunology 2017; 199:3369-3380. [PMID: 28978691 DOI: 10.4049/jimmunol.1700485] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 04/27/2017] [Accepted: 08/24/2017] [Indexed: 12/16/2022]
Abstract
Transforming error-prone immunosequencing datasets into Ab repertoires is a fundamental problem in immunogenomics, and a prerequisite for studies of immune responses. Although various repertoire reconstruction algorithms were released in the last 3 years, it remains unclear how to benchmark them and how to assess the accuracy of the reconstructed repertoires. We describe an accurate IgReC algorithm for constructing Ab repertoires from high-throughput immunosequencing datasets and a new framework for assessing the quality of reconstructed repertoires. Surprisingly, Ab repertoires constructed by IgReC from barcoded immunosequencing datasets in the blind mode (without using information about unique molecular identifiers) improved upon the repertoires constructed by the state-of-the-art tools that use barcoding. This finding suggests that IgReC may alleviate the need to generate repertoires using the barcoding technology (the workhorse of current immunogenomics efforts) because our computational approach to error correction of immunosequencing data is nearly as powerful as the experimental approach based on barcoding.
Affiliation(s)
- Alexander Shlemov
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg University, St. Petersburg, Russia 199034
| | - Sergey Bankevich
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg University, St. Petersburg, Russia 199034
| | - Andrey Bzikadze
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg University, St. Petersburg, Russia 199034
| | - Maria A Turchaninova
- Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russia 117997
| | - Yana Safonova
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg University, St. Petersburg, Russia 199034; Information Theory and Applications Center, University of California, San Diego, La Jolla, CA 92093
| | - Pavel A Pevzner
- Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg University, St. Petersburg, Russia 199034; Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093
| |
|
21
|
Lee B, Moon T, Yoon S, Weissman T. DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing. PLoS One 2017; 12:e0181463. [PMID: 28749987 PMCID: PMC5531809 DOI: 10.1371/journal.pone.0181463] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 03/20/2017] [Accepted: 06/30/2017] [Indexed: 11/29/2022] Open
Abstract
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
Affiliation(s)
- Byunghan Lee
- Electrical and Computer Engineering, Seoul National University, Seoul, Korea
| | - Taesup Moon
- College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
| | - Sungroh Yoon
- Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
- Neurology and Neurological Sciences, Stanford University, Stanford, California, United States of America
| | - Tsachy Weissman
- Electrical Engineering, Stanford University, Stanford, California, United States of America
| |
|
22
|
Malhotra R, Jha M, Poss M, Acharya R. A random forest classifier for detecting rare variants in NGS data from viral populations. Comput Struct Biotechnol J 2017; 15:388-395. [PMID: 28819548 PMCID: PMC5548337 DOI: 10.1016/j.csbj.2017.07.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/14/2017] [Revised: 07/01/2017] [Accepted: 07/03/2017] [Indexed: 11/28/2022] Open
Abstract
We propose a random forest classifier for distinguishing rare variants from sequencing errors in next-generation sequencing (NGS) data from viral populations. The method uses counts of k-mers of varying lengths from the reads of a viral population to train a random forest classifier, called MultiRes, that classifies k-mers as erroneous or as rare variants. Our algorithm is rooted in concepts from signal processing and uses a frame-based representation of k-mers. Frames are sets of non-orthogonal basis functions that were traditionally used in signal processing for noise removal. We define discrete spatial signals for genomes and sequenced reads, and show that k-mers of a given size constitute a frame. We evaluate MultiRes on simulated and real viral population datasets, which consist of many low-frequency variants, and compare it to the error detection methods used in correction tools known in the literature. MultiRes produces 4 to 500 times fewer false-positive k-mer predictions than other methods, which is essential for accurate estimation of viral population diversity and for de novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs) and fewer false-positive SNPs, while detecting a higher number of rare variants than other variant calling methods for viral populations. The software is freely available on GitHub at https://github.com/raunaq-m/MultiRes.
Affiliation(s)
- Raunaq Malhotra
- The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA
| | - Manjari Jha
- The School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA, 16802, USA
| | - Mary Poss
- Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
| | - Raj Acharya
- School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
| |
|
23
|
Mysara M, Njima M, Leys N, Raes J, Monsieurs P. From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data. Gigascience 2017; 6:1-10. [PMID: 28369460 PMCID: PMC5466709 DOI: 10.1093/gigascience/giw017] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 06/08/2016] [Accepted: 12/27/2016] [Indexed: 01/09/2023] Open
Abstract
The development of high-throughput sequencing technologies has provided microbial ecologists with an efficient approach to assess bacterial diversity at an unseen depth, particularly with the recent advances in the Illumina MiSeq sequencing platform. However, analyzing such high-throughput data poses important computational challenges, requiring specialized bioinformatics solutions at different stages of the processing pipeline, such as assembly of paired-end reads, chimera removal, correction of sequencing errors, and clustering of those sequences into Operational Taxonomic Units (OTUs). Individual algorithms grappling with each of those challenges have been combined into various bioinformatics pipelines, such as mothur, QIIME, LotuS, and USEARCH. Using a set of well-described bacterial mock communities, state-of-the-art pipelines for Illumina MiSeq amplicon sequencing data are benchmarked in terms of the number of sequences retained, computational cost, error rate, and quality of the OTUs. In addition, a new pipeline called OCToPUS is introduced, which makes an optimal combination of different algorithms. Huge variability is observed between the different pipelines with respect to the monitored performance parameters, where in general the number of retained reads is found to be inversely proportional to the quality of the reads. By contrast, OCToPUS achieves the lowest error rate, the minimum number of spurious OTUs, and the closest correspondence to the existing community, while retaining the largest number of reads compared to other pipelines. The newly introduced pipeline translates Illumina MiSeq amplicon sequencing data into high-quality and reliable OTUs, with improved performance and accuracy compared to the currently existing pipelines.
Affiliation(s)
- Mohamed Mysara
- Unit of Microbiology, Belgian Nuclear Research Centre (SCK-CEN), Boeretang 200, 2400 Mol, Belgium; Department of Bio-Engineering Sciences, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussel, Belgium; VIB Center for the Biology of Disease, VIB, Herestraat 49 - box 1028, 3000 Leuven, Belgium; Department of Microbiology and Immunology, REGA institute, Herestraat 49 - box 1028, 3000 Leuven, Belgium
| | - Mercy Njima
- Unit of Microbiology, Belgian Nuclear Research Centre (SCK-CEN), Boeretang 200, 2400 Mol, Belgium
| | - Natalie Leys
- Unit of Microbiology, Belgian Nuclear Research Centre (SCK-CEN), Boeretang 200, 2400 Mol, Belgium
| | - Jeroen Raes
- Department of Bio-Engineering Sciences, Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussel, Belgium; VIB Center for the Biology of Disease, VIB, Herestraat 49 - box 1028, 3000 Leuven, Belgium; Department of Microbiology and Immunology, REGA institute, Herestraat 49 - box 1028, 3000 Leuven, Belgium
| | - Pieter Monsieurs
- Unit of Microbiology, Belgian Nuclear Research Centre (SCK-CEN), Boeretang 200, 2400 Mol, Belgium
| |
|
24
|
Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics 2017; 33:1324-1330. [PMID: 28453674 PMCID: PMC5408799 DOI: 10.1093/bioinformatics/btw832] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 10/31/2016] [Revised: 12/21/2016] [Accepted: 12/27/2016] [Indexed: 12/21/2022] Open
Abstract
Motivation Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. Results Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, with the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using a similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. Availability and Implementation ntCard is written in C++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. Contact hmohamadi@bcgsc.ca or ibirol@bcgsc.ca.
Supplementary information Supplementary data are available at Bioinformatics online.
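The sampling step described in this abstract can be illustrated with a deliberately simplified sketch: hash every k-mer, keep only the hash values that fall in a fixed fraction of the hash space, and scale the sampled distinct count back up. This is not ntCard's estimator. BLAKE2b stands in for ntHash, the multiplicity table and statistical model are omitted, and the function names, the `sample_bits` parameter, and the simulated genome are assumptions made for illustration.

```python
import hashlib
import random


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b stands in for ntHash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def estimate_distinct(reads, k, sample_bits=3):
    """Estimate the number of distinct k-mers from a hash-based sample.

    Only hash values whose low `sample_bits` bits are zero are kept
    (about 1 in 2**sample_bits), and the sampled distinct count is
    multiplied by 2**sample_bits.
    """
    mask = (1 << sample_bits) - 1
    sampled = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            h = kmer_hash(read[i:i + k])
            if (h & mask) == 0:
                sampled.add(h)
    return len(sampled) << sample_bits


# Hypothetical data: a random 20 kb "genome" cut into overlapping 100 bp reads.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20000))
reads = [genome[i:i + 100] for i in range(0, len(genome) - 99, 10)]
exact = len({genome[i:i + 15] for i in range(len(genome) - 14)})
approx = estimate_distinct(reads, k=15)
```

Because the same k-mer always hashes to the same value, repeated k-mers across overlapping reads are sampled consistently, so the scaled count approximates the distinct total while storing only about one eighth of the hash values.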
Affiliation(s)
- Hamid Mohamadi
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Faculty of Science, University of British Columbia, Vancouver, BC, Canada
| | - Hamza Khan
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Faculty of Science, University of British Columbia, Vancouver, BC, Canada
| | - Inanc Birol
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Faculty of Science, University of British Columbia, Vancouver, BC, Canada
| |
|
25
|
Zhao L, Chen Q, Li W, Jiang P, Wong L, Li J. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 2017; 33:3844-3851. [PMID: 28205674 DOI: 10.1093/bioinformatics/btx089] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 09/01/2016] [Accepted: 02/14/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Liang Zhao
- School of Computing and Electronic Information, Guangxi University, Nanning, China
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Qingfeng Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Wencui Li
- Taihe Hospital, Hubei University of Medicine, Hubei, China
| | - Peng Jiang
- School of Computing and Electronic Information, Guangxi University, Nanning, China
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Jinyan Li
- Advanced Analytics Institute and Centre for Health Technologies, University of Technology Sydney, Broadway, NSW, Australia
| |
|
26
|
Draft Genome Sequence of Staphylococcus hominis BHG17 Isolated from Wild Bar-Headed Goose (Anser indicus) Feces. Genome Announcements 2017; 5:5/5/e01552-16. [PMID: 28153901 PMCID: PMC5289687 DOI: 10.1128/genomea.01552-16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
Staphylococcus hominis belongs to a group of coagulase-negative staphylococci and is an opportunistic pathogen, usually found on the skin and mucous membranes. Studies involving S. hominis isolated from wild birds are scarce. Here, we report a 2.365-Mb draft genome sequence of S. hominis BHG17, isolated from the feces of a bar-headed goose.
|
27
|
From next-generation resequencing reads to a high-quality variant data set. Heredity (Edinb) 2016; 118:111-124. [PMID: 27759079 DOI: 10.1038/hdy.2016.102] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 04/24/2016] [Revised: 09/03/2016] [Accepted: 09/06/2016] [Indexed: 12/11/2022] Open
Abstract
Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the produced data contain subtle but complex types of errors, biases and uncertainties that impose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
|
28
|
Akogwu I, Wang N, Zhang C, Gong P. A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis. Hum Genomics 2016; 10 Suppl 2:20. [PMID: 27461106 PMCID: PMC4965716 DOI: 10.1186/s40246-016-0068-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Innumerable opportunities for new genomic research have been stimulated by advances in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the difficulty of distinguishing true biological variants from sequencing errors during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strengths, weaknesses, and limitations of some of the newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods with respect to specific NGS datasets. METHODS Six k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positives, false negatives, recall, precision, gain, and F-score) for assessing the correction quality of each method. RESULTS Results from computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred on a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets. CONCLUSIONS This study demonstrates that various factors such as coverage depth, read length, and genome size may influence the performance of individual k-spectrum-based error correction methods. Thus, care must be taken in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.
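The metrics listed in this abstract can be computed from three counts. The sketch below uses the definitions commonly found in the error-correction literature, where a true positive is an erroneous base that was fixed, a false positive is a correct base that was wrongly changed, and a false negative is an erroneous base left unfixed. It is an assumption that ECET uses exactly these formulas, in particular gain = (TP - FP) / (TP + FN); the function name and the example counts are hypothetical.

```python
def correction_metrics(tp, fp, fn):
    """Precision, recall, F-score and gain from error-correction counts.

    tp: erroneous bases correctly fixed
    fp: correct bases wrongly altered
    fn: erroneous bases left uncorrected
    Ratios with an empty denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Gain drops below zero when a tool introduces more errors than it removes.
    gain = (tp - fp) / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "gain": gain,
    }


# Hypothetical counts for one corrected dataset.
m = correction_metrics(tp=80, fp=10, fn=20)
```

Unlike the F-score, gain can be negative, which makes it a convenient single number for flagging a corrector that does more harm than good on a given dataset.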
Affiliation(s)
- Isaac Akogwu
- School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Nan Wang
- School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Chaoyang Zhang
- School of Computing, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
| |
|
29
|
Use of Multiple Sequencing Technologies To Produce a High-Quality Genome of the Fungus Pseudogymnoascus destructans, the Causative Agent of Bat White-Nose Syndrome. Genome Announcements 2016; 4:4/3/e00445-16. [PMID: 27365344 PMCID: PMC4929507 DOI: 10.1128/genomea.00445-16] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Indexed: 12/19/2022]
Abstract
White-nose syndrome has recently emerged as one of the most devastating wildlife diseases recorded, causing widespread mortality in numerous bat species throughout eastern North America. Here, we present an improved reference genome of the fungal pathogen Pseudogymnoascus destructans for use in comparative genomic studies.
|
30
|
Mamun AA, Pal S, Rajasekaran S. KCMBT: a k-mer Counter based on Multiple Burst Trees. Bioinformatics 2016; 32:2783-90. [PMID: 27283950 DOI: 10.1093/bioinformatics/btw345] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 01/25/2016] [Accepted: 05/25/2016] [Indexed: 01/30/2023] Open
Abstract
MOTIVATION A massive number of bioinformatics applications require counting of k-length substrings in genetically important long strings. A k-mer counter generates the frequencies of each k-length substring in genome sequences. Genome assembly, repeat detection, multiple sequence alignment, error detection and many other related applications use a k-mer counter as a building block. Very fast and efficient algorithms are necessary to count k-mers in large data sets to be useful in such applications. RESULTS We propose a novel trie-based algorithm for this k-mer counting problem. We compare our algorithm, k-mer Counter based on Multiple Burst Trees (KCMBT), with all available well-known algorithms. Our experimental results show that KCMBT is around 30% faster than the previous best-performing algorithm, KMC2, on a human genome dataset. As another example, our algorithm is around six times faster than Jellyfish2. Overall, KCMBT is 20-30% faster than KMC2 on five benchmark data sets when both algorithms were run using multiple threads. AVAILABILITY AND IMPLEMENTATION KCMBT is freely available on GitHub: (https://github.com/abdullah009/kcmbt_mt). CONTACT rajasek@engr.uconn.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
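To make the notion of trie-based k-mer counting concrete, the sketch below stores k-mers in a plain nested-dictionary trie, with one level per base and counts at the leaves. It is only a toy: KCMBT uses multiple burst tries and multithreading, none of which is reproduced here, and the function names and the example sequence are hypothetical.

```python
def trie_count(sequence, k):
    """Count k-mers by walking a nested-dict trie, one level per base."""
    root = {}
    for i in range(len(sequence) - k + 1):
        node = root
        # Descend through the first k-1 bases, creating nodes as needed.
        for base in sequence[i:i + k - 1]:
            node = node.setdefault(base, {})
        # The last base indexes the count stored at the leaf level.
        last = sequence[i + k - 1]
        node[last] = node.get(last, 0) + 1
    return root


def trie_items(node, k, prefix=""):
    """Yield (kmer, count) pairs from the trie in lexicographic order."""
    if len(prefix) == k - 1:
        for base in sorted(node):
            yield prefix + base, node[base]
    else:
        for base in sorted(node):
            yield from trie_items(node[base], k, prefix + base)


# Hypothetical input sequence.
seq = "ACGTACGTTT"
counts = dict(trie_items(trie_count(seq, 3), 3))
```

A trie shares common prefixes between k-mers and yields them in sorted order without a separate sorting pass, which is one reason trie variants are attractive for this problem.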
Affiliation(s)
- Abdullah-Al Mamun
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - Soumitra Pal
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - Sanguthevar Rajasekaran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| |
|
31
|
Draft Genome Sequence of Bacillus megaterium BHG1.1, a Strain Isolated from Bar-Headed Goose (Anser indicus) Feces on the Qinghai-Tibet Plateau. Genome Announcements 2016; 4:4/3/e00317-16. [PMID: 27174262 PMCID: PMC4866837 DOI: 10.1128/genomea.00317-16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
Abstract
Bacillus megaterium is a soil-inhabiting Gram-positive bacterium that is routinely used in industrial applications for recombinant protein production and bioremediation. Studies involving Bacillus megaterium isolated from waterfowl are scarce. Here, we report a 6.26-Mbp draft genome sequence of Bacillus megaterium BHG1.1, which was isolated from feces of a bar-headed goose.
|
32
|
The A, C, G, and T of Genome Assembly. BIOMED RESEARCH INTERNATIONAL 2016; 2016:6329217. [PMID: 27247941 PMCID: PMC4877455 DOI: 10.1155/2016/6329217] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 09/28/2015] [Accepted: 12/22/2015] [Indexed: 11/18/2022]
Abstract
Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.
|
33
|
Bremges A, Singer E, Woyke T, Sczyrba A. MeCorS: Metagenome-enabled error correction of single cell sequencing reads. Bioinformatics 2016; 32:2199-201. [PMID: 27153586 PMCID: PMC4937190 DOI: 10.1093/bioinformatics/btw144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 12/20/2015] [Accepted: 03/09/2016] [Indexed: 11/12/2022] Open
Abstract
We present a new tool, MeCorS, to correct chimeric reads and sequencing errors in Illumina data generated from single amplified genomes (SAGs). It uses sequence information derived from accompanying metagenome sequencing to accurately correct errors in SAG reads, even from ultra-low coverage regions. In evaluations on real data, we show that MeCorS outperforms BayesHammer, the most widely used state-of-the-art approach. MeCorS performs particularly well in correcting chimeric reads, which greatly improves both accuracy and contiguity of de novo SAG assemblies. AVAILABILITY AND IMPLEMENTATION https://github.com/metagenomics/MeCorS CONTACT abremges@cebitec.uni-bielefeld.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Andreas Bremges
- Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany; U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
| | - Esther Singer
- U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
| | - Tanja Woyke
- U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
| | - Alexander Sczyrba
- Center for Biotechnology and Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany; U.S. Department of Energy Joint Genome Institute, Walnut Creek, CA 94598, USA
| |
|
34
|
Sameith K, Roscito JG, Hiller M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief Bioinform 2016; 18:1-8. [PMID: 26868358 PMCID: PMC5221426 DOI: 10.1093/bib/bbw003] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 10/22/2015] [Revised: 01/02/2016] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
Affiliation(s)
- Katrin Sameith
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Juliana G Roscito
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
| | - Michael Hiller
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
- Corresponding author. Michael Hiller. Max Planck Institute of Molecular Cell Biology and Genetics & Max Planck Institute for the Physics of Complex Systems, 01307 Dresden, Germany. E-mail:
| |
|
35
|
Alic AS, Tomas A, Medina I, Blanquer I. MuffinEc: Error correction for de Novo assembly via greedy partitioning and sequence alignment. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2015.09.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/25/2022]
|
36
|
Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WILEY INTERDISCIPLINARY REVIEWS: COMPUTATIONAL MOLECULAR SCIENCE 2016. [DOI: 10.1002/wcms.1239] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Indexed: 12/14/2022]
Affiliation(s)
- Andy S. Alic
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
| | - David Ruzafa
- Departamento de Química Física e Instituto de Biotecnología, Facultad de Ciencias; Universidad de Granada; Granada Spain
| | - Joaquin Dopazo
- Department of Computational Genomics; Príncipe Felipe Research Centre (CIPF); Valencia Spain
- CIBER de Enfermedades Raras (CIBERER); Valencia Spain
- Functional Genomics Node (INB) at CIPF; Valencia Spain
| | - Ignacio Blanquer
- Institute of Instrumentation for Molecular Imaging (I3M); Universitat Politècnica de València; València Spain
- Biomedical Imaging Research Group GIBI 2; Polytechnic University Hospital La Fe; Valencia Spain
| |
|
37
|
Laehnemann D, Borkhardt A, McHardy AC. Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction. Brief Bioinform 2016; 17:154-79. [PMID: 26026159 PMCID: PMC4719071 DOI: 10.1093/bib/bbv029] [Citation(s) in RCA: 190] [Impact Index Per Article: 21.1] [Received: 03/03/2015] [Revised: 04/09/2015] [Indexed: 12/23/2022] Open
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
|
38
|
Abstract
BACKGROUND Continued advances in next generation short-read sequencing technologies are increasing throughput and read lengths, while driving down error rates. Taking advantage of the high coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high coverage sequence data requires significant computing resources. METHODS We propose a different approach to handle erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed in all reads, and a large percentage of reads are indeed error-free. The ability to predict such perfect reads can significantly impact the run-time complexity of applications. We present a simple and fast k-spectrum analysis based method to identify error-free reads. The filtration process to identify and weed out erroneous reads can be customized at several levels of stringency depending upon the downstream application need. RESULTS Our experiments show that if around 80% of the reads in a dataset are perfect, then our method retains almost 99.9% of them with more than 90% precision rate. Though filtering out reads identified as erroneous by our method reduces the average coverage by about 7%, we found the remaining reads provide as uniform a coverage as the original dataset. We demonstrate the effectiveness of our approach on an example downstream application: we show that if an error correction algorithm, Reptile, which relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases, instead uses reads predicted to be perfect by our method to correct the other reads, the overall accuracy improves further by up to 10%.
CONCLUSIONS Thanks to the continuous technological improvements, the coverage and accuracy of reads from dominant sequencing platforms have now reached an extent where we can envision just filtering out reads with errors, thus making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm as a downstream application; this in turn gives a new class of error-correcting algorithms as a by-product.
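The idea of predicting error-free reads from the k-spectrum can be illustrated with a small Python sketch. The rule used here (keep a read only if every one of its k-mers occurs at least `min_count` times in the dataset) is an assumed simplification for illustration; the abstract does not give the actual decision rule or its stringency levels, and the names `kmer_spectrum`, `predicted_perfect` and `filter_reads` are hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurrence across all reads (the k-spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def predicted_perfect(read, spectrum, k, min_count):
    """True if every k-mer in the read occurs at least min_count times.

    Reads shorter than k contain no k-mers and are therefore kept.
    """
    return all(
        spectrum[read[i:i + k]] >= min_count
        for i in range(len(read) - k + 1)
    )


def filter_reads(reads, k=4, min_count=2):
    """Keep only reads predicted to be error-free under the assumed rule."""
    spectrum = kmer_spectrum(reads, k)
    return [r for r in reads if predicted_perfect(r, spectrum, k, min_count)]


# Three identical reads plus one read with a single substitution (A -> T).
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
kept = filter_reads(reads, k=4, min_count=2)
```

In this toy input the substituted read introduces k-mers that occur only once, so it is dropped while the three consistent reads are kept. Raising `min_count` makes the filter stricter, at the cost of discarding more reads from low-coverage regions.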
|
39
|
Abstract
Background In highly parallel next-generation sequencing (NGS) techniques millions to billions of short reads are produced from a genomic sequence in a single run. Due to the limitation of the NGS technologies, there could be errors in the reads. The error rate of the reads can be reduced with trimming and by correcting the erroneous bases of the reads. It helps to achieve high quality data and the computational complexity of many biological applications will be greatly reduced if the reads are first corrected. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results We have done extensive and rigorous experiments that reveal that EC is indeed an effective, scalable, and efficient error correction tool. The real reads that we have employed in our performance evaluation are Illumina-generated short reads of various lengths. The six experimental datasets we have utilized are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some of the bases of the simulated reads are changed to other bases with some probabilities. Conclusions Error correction is a vital problem in biology especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate the possibility of employing the techniques introduced in this research paper to handle insertion and deletion errors also. Software availability The implementation is freely available for non-commercial purposes. It can be downloaded from: http://engr.uconn.edu/~rajasek/EC.zip.
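The read simulation procedure described in this abstract (substrings drawn from random positions of a reference, with some bases substituted at random) can be sketched as follows. This is an illustrative Python version under stated assumptions: a single uniform per-base `error_rate`, substitutions chosen uniformly among the three other bases, and a hypothetical function name `simulate_reads`. The abstract does not specify the probabilities the authors used.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Return (start, read) pairs sampled from the reference with substitution errors."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    for _ in range(n_reads):
        # Pick a start position so that the read fits inside the reference.
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases, chosen uniformly.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append((start, "".join(read)))
    return reads


reference = "ACGTTGCAAGGCTTAACCGGATCGATCGTA"
reads = simulate_reads(reference, n_reads=5, read_len=10, error_rate=0.01)
```

Returning the start position alongside each read keeps the ground truth available, which is what makes simulated data useful for scoring an error corrector.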
|
40
|
Safonova Y, Bonissone S, Kurpilyansky E, Starostina E, Lapidus A, Stinson J, DePalatis L, Sandoval W, Lill J, Pevzner PA. IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics 2015; 31:i53-61. [PMID: 26072509 PMCID: PMC4542777 DOI: 10.1093/bioinformatics/btv238] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 12/18/2022] Open
Abstract
The analysis of concentrations of circulating antibodies in serum (antibody repertoire) is a fundamental, yet poorly studied, problem in immunoinformatics. The two current approaches to the analysis of antibody repertoires [next generation sequencing (NGS) and mass spectrometry (MS)] present difficult computational challenges since antibodies are not directly encoded in the germline but are extensively diversified by somatic recombination and hypermutations. Therefore, the protein database required for the interpretation of spectra from circulating antibodies is custom for each individual. Although such a database can be constructed via NGS, the reads generated by NGS are error-prone and even a single nucleotide error precludes identification of a peptide by the standard proteomics tools. Here, we present the IgRepertoireConstructor algorithm that performs error-correction of immunosequencing reads and uses mass spectra to validate the constructed antibody repertoires. AVAILABILITY AND IMPLEMENTATION IgRepertoireConstructor is open source and freely available as a C++ and Python program running on all Unix-compatible platforms. The source code is available from http://bioinf.spbau.ru/igtools. CONTACT ppevzner@ucsd.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yana Safonova
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Stefano Bonissone
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Eugene Kurpilyansky
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Ekaterina Starostina
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Alla Lapidus
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Jeremy Stinson
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Laura DePalatis
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Wendy Sandoval
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Jennie Lill
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| | - Pavel A Pevzner
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia, Algorithmic Biology Laboratory, St. Petersburg Academic University, St. Petersburg, Russia, Bioinformatics Program, University of California, San Diego, CA, USA, Genentech, South San Francisco, CA, USA and Department of Computer Science and Engineering, University of California, San Diego, CA, USA
| |
|
41
|
Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. Gigascience 2015; 4:48. [PMID: 26500767 PMCID: PMC4615873 DOI: 10.1186/s13742-015-0089-y] [Citation(s) in RCA: 329] [Impact Index Per Article: 32.9] [Received: 06/01/2015] [Accepted: 10/09/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation sequencing of cellular RNA (RNA-seq) is rapidly becoming the cornerstone of transcriptomic analysis. However, sequencing errors in the already short RNA-seq reads complicate bioinformatics analyses, in particular alignment and assembly. Error correction methods have been highly effective for whole-genome sequencing (WGS) reads, but are unsuitable for RNA-seq reads, owing to the variation in gene expression levels and alternative splicing. FINDINGS We developed a k-mer based method, Rcorrector, to correct random sequencing errors in Illumina RNA-seq reads. Rcorrector uses a De Bruijn graph to compactly represent all trusted k-mers in the input reads. Unlike WGS read correctors, which use a global threshold to determine trusted k-mers, Rcorrector computes a local threshold at every position in a read. CONCLUSIONS Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.
|
42
|
Sahl JW, Del Franco M, Pournaras S, Colman RE, Karah N, Dijkshoorn L, Zarrilli R. Phylogenetic and genomic diversity in isolates from the globally distributed Acinetobacter baumannii ST25 lineage. Sci Rep 2015; 5:15188. [PMID: 26462752 PMCID: PMC4604477 DOI: 10.1038/srep15188] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Received: 05/12/2015] [Accepted: 09/22/2015] [Indexed: 02/06/2023] Open
Abstract
Acinetobacter baumannii is a globally distributed nosocomial pathogen that has gained interest due to its resistance to most currently used antimicrobials. Whole genome sequencing (WGS) and phylogenetics have begun to reveal the global genetic diversity of this pathogen. The evolution of A. baumannii has largely been defined by recombination, punctuated by the emergence and proliferation of defined clonal lineages. In this study we sequenced seven genomes from the sequence type (ST)25 lineage and compared them to 12 ST25 genomes deposited in public databases. A recombination analysis identified multiple genomic regions that are homoplasious in the ST25 phylogeny, indicating active or historical recombination. Genes associated with antimicrobial resistance were differentially distributed between ST25 genomes, which matched our laboratory-based antimicrobial susceptibility typing. Differences were also observed in biofilm formation between ST25 isolates, which were demonstrated to produce significantly more extensive biofilm than an isolate from the ST1 clonal lineage. These results demonstrate that within A. baumannii, even a fairly recently derived monophyletic lineage can still exhibit significant genotypic and phenotypic diversity. These results have implications for associating outbreaks with sequence typing as well as understanding mechanisms behind the global propagation of successful A. baumannii lineages.
Affiliation(s)
- Jason W. Sahl
- Translational Genomics Research Institute, Flagstaff, AZ, USA
- Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
| | | | - Spyros Pournaras
- Department of Microbiology, Medical School, University of Athens, Athens, Greece
| | | | - Nabil Karah
- Department of Molecular Biology, Umeå University, Umeå, Sweden
| | - Lenie Dijkshoorn
- Department of Infectious Diseases, Leiden University Medical Centre, Leiden, The Netherlands
| | - Raffaele Zarrilli
- Department of Public Health, University of Naples “Federico II”, Naples, Italy
| |
|
43
|
Allam A, Kalnis P, Solovyev V. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 2015; 31:3421-8. [DOI: 10.1093/bioinformatics/btv415] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.9] [Received: 11/11/2014] [Accepted: 07/08/2015] [Indexed: 11/12/2022] Open
|
44
|
Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol 2015; 15:509. [PMID: 25398208 PMCID: PMC4248469 DOI: 10.1186/s13059-014-0509-9] [Citation(s) in RCA: 150] [Impact Index Per Article: 15.0] [Received: 05/27/2014] [Indexed: 02/02/2023] Open
Abstract
Lighter is a fast, memory-efficient tool for correcting sequencing errors. Lighter avoids counting k-mers. Instead, it uses a pair of Bloom filters, one holding a sample of the input k-mers and the other holding k-mers likely to be correct. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy.
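The two-filter design mentioned in this abstract can be sketched in Python as below. The `BloomFilter` class is a generic textbook construction, and the sampling step follows the abstract (each k-mer occurrence is kept with probability `sample_fraction`). The rule used to fill the second filter, plain membership in the first, is a deliberate simplification assumed for illustration and is not Lighter's actual test; all names and default sizes here are hypothetical.

```python
import hashlib
import random


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a counter.
        for salt in range(self.n_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every k-length window of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def build_filters(reads, k, sample_fraction, n_bits=1 << 16, n_hashes=3, seed=0):
    """Fill a sampled filter, then a second filter using a simplified trust rule."""
    rng = random.Random(seed)
    sampled = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < sample_fraction:  # subsample k-mer occurrences
                sampled.add(kmer)
    trusted = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in sampled:  # simplified stand-in for the real trust test
                trusted.add(kmer)
    return sampled, trusted


reads = ["ACGTACGT", "TTGACCA"]
sampled, trusted = build_filters(reads, k=4, sample_fraction=1.0)
```

The intuition behind sampling is that a k-mer seen many times is very likely to be sampled at least once, while a k-mer seen once (typical of a sequencing error) is likely to be missed, so the filter separates the two groups without storing any counts.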
|
45
|
Olson ND, Lund SP, Colman RE, Foster JT, Sahl JW, Schupp JM, Keim P, Morrow JB, Salit ML, Zook JM. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Front Genet 2015. [PMID: 26217378 PMCID: PMC4493402 DOI: 10.3389/fgene.2015.00235] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Indexed: 12/30/2022] Open
Abstract
Innovations in sequencing technologies have allowed biologists to make incredible advances in understanding biological systems. As experience grows, researchers increasingly recognize that analyzing the wealth of data provided by these new sequencing platforms requires careful attention to detail for robust results. Thus far, much of the scientific community's focus for use in bacterial genomics has been on evaluating genome assembly algorithms and rigorously validating assembly program performance. Missing, however, is a focus on critical evaluation of variant callers for these genomes. Variant calling is essential for comparative genomics as it yields insights into nucleotide-level organismal differences. Variant calling is a multistep process with a host of potential error sources that may lead to incorrect variant calls. Identifying and resolving these incorrect calls is critical for bacterial genomics to advance. The goal of this review is to provide guidance on validating algorithms and pipelines used in variant calling for bacterial genomics. First, we will provide an overview of the variant calling procedures and the potential sources of error associated with the methods. We will then identify appropriate datasets for use in evaluating algorithms and describe statistical methods for evaluating algorithm performance. As variant calling moves from basic research to the applied setting, standardized methods for performance evaluation and reporting are required; it is our hope that this review provides the groundwork for the development of these standards.
Affiliation(s)
- Nathan D Olson
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Steven P Lund
- Statistical Engineering Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Rebecca E Colman
- Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA
| | - Jeffrey T Foster
- Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
| | - Jason W Sahl
- Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA; Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
| | - James M Schupp
- Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA
| | - Paul Keim
- Division of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ, USA; Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ, USA
| | - Jayne B Morrow
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Marc L Salit
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA; Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Justin M Zook
- Biosystems and Biomaterials Division, Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| |
|
46
|
Sahl JW, Schupp JM, Rasko DA, Colman RE, Foster JT, Keim P. Phylogenetically typing bacterial strains from partial SNP genotypes observed from direct sequencing of clinical specimen metagenomic data. Genome Med 2015; 7:52. [PMID: 26136847 PMCID: PMC4487561 DOI: 10.1186/s13073-015-0176-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 02/19/2015] [Accepted: 05/15/2015] [Indexed: 12/30/2022] Open
Abstract
We describe an approach for genotyping bacterial strains from low coverage genome datasets, including metagenomic data from complex samples. Sequence reads from unknown samples are aligned to a reference genome where the allele states of known SNPs are determined. The Whole Genome Focused Array SNP Typing (WG-FAST) pipeline can identify unknown strains with much less read data than is needed for genome assembly. To test WG-FAST, we resampled SNPs from real samples to understand the relationship between low coverage metagenomic data and accurate phylogenetic placement. WG-FAST can be downloaded from https://github.com/jasonsahl/wgfast.
Affiliation(s)
- Jason W. Sahl
- Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
- Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
| | - James M. Schupp
- Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
| | - David A. Rasko
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD USA
| | - Rebecca E. Colman
- Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
| | - Jeffrey T. Foster
- Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
- Current address: Department of Molecular, Cellular & Biomedical Sciences, University of New Hampshire, Durham, NH USA
| | - Paul Keim
- Department of Pathogen Genomics, Translational Genomics Research Institute, Flagstaff, AZ USA
- Center for Microbial Genetics and Genomics, Northern Arizona University, Flagstaff, AZ 86011 USA
| |
|
47
|
Computational and Statistical Analyses of Insertional Polymorphic Endogenous Retroviruses in a Non-Model Organism. Computation 2014. [DOI: 10.3390/computation2040221] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
|
48
|
Molnar M, Ilie L. Correcting Illumina data. Brief Bioinform 2014; 16:588-99. [PMID: 25183248 DOI: 10.1093/bib/bbu029] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/23/2014] [Accepted: 08/02/2014] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.
|
49
|
Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 2014; 9:e101271. [PMID: 25062443 PMCID: PMC4111482 DOI: 10.1371/journal.pone.0101271] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Received: 10/14/2013] [Accepted: 06/04/2014] [Indexed: 11/19/2022] Open
Abstract
K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
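To make the trade-off described above concrete, here is a minimal Python sketch of k-mer counting with a Count-Min Sketch. It is an illustration of the general data structure only, not khmer's code or API: the class name `CountMinSketch`, the helper `count_kmers`, the table dimensions, and the use of salted BLAKE2b hashes as the row hash functions are all assumptions.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: counts may be overestimated, never underestimated."""

    def __init__(self, width=1000, depth=4):
        self.width = width   # counters per row
        self.depth = depth   # number of rows, one hash function each
        self.tables = [[0] * width for _ in range(depth)]

    def _indexes(self, kmer):
        data = kmer.encode("ascii")
        for row in range(self.depth):
            # A different salt per row gives a different hash function per row.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        for row, col in self._indexes(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.tables[row][col] for row, col in self._indexes(kmer))


def count_kmers(sequence, k, sketch):
    """Add every overlapping k-mer of sequence to the sketch."""
    for i in range(len(sequence) - k + 1):
        sketch.add(sequence[i:i + k])
    return sketch


sketch = count_kmers("ACGTACGTAC", 4, CountMinSketch(width=64, depth=3))
acgt_count = sketch.count("ACGT")
```

Only the counter tables are kept, so the k-mers themselves cannot be enumerated from the sketch, and a query for a k-mer that was never inserted may return a nonzero value. Both points match the limitations stated in the abstract.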
Affiliation(s)
- Qingpeng Zhang
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
| | - Jason Pell
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
| | - Rosangela Canino-Koning
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
| | - Adina Chuang Howe
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
- Department of Plant, Soil, and Microbial Sciences, Michigan State University, East Lansing, Michigan, United States of America
| | - C. Titus Brown
- Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
| |
|
50
|
Roy RS, Bhattacharya D, Schliep A. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics 2014; 30:1950-7. [DOI: 10.1093/bioinformatics/btu132] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Indexed: 11/12/2022] Open
|