1
No A, Hernaez M, Ochoa I. CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores. J Bioinform Comput Biol 2020; 18:2050031. PMID: 32938284. DOI: 10.1142/s0219720020500316.
Abstract
The amount of sequencing data is growing at a fast pace due to a rapid revolution in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, account for a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural way to further decrease the size of sequencing data is to apply lossy compression to the quality scores. Lossy compression may result in a loss of precision; however, it has been shown that, when operating at certain rates, lossy compression can achieve variant-calling performance similar to that achieved with the losslessly compressed (i.e., original) data. We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor for quality scores with the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively, reconstructing the set of quality scores with lower distortion at each step. This characteristic is particularly useful in sequencing-data compression, since the encoder generally does not know the most appropriate compression rate, e.g., the highest rate that does not degrade variant-calling accuracy. CROMqs thus avoids compressing the data at multiple rates, saving time. In addition to this property, we show that CROMqs obtains rate-distortion performance comparable to the state-of-the-art lossy compressors. Moreover, it achieves variant-calling performance comparable to that of the lossless data while reducing size by more than 50%.
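CROMqs itself encodes with random orthogonal matrices; the successive-refinement property it relies on can be illustrated with a much simpler bit-plane sketch (an illustrative toy under assumed 6-bit qualities, not the paper's algorithm): the encoder emits quality values one bit plane at a time, and each additional plane the decoder consumes lowers the distortion.

```python
# Illustrative bit-plane successive refinement for quality scores (0-63).
# This is NOT CROMqs's random-orthogonal-matrix code; it only demonstrates
# the "encode once, decode at many rates" property the abstract describes.

def encode_bitplanes(qualities, bits=6):
    """Emit one bit plane per refinement step, MSB first."""
    return [[(q >> b) & 1 for q in qualities] for b in range(bits - 1, -1, -1)]

def decode_prefix(planes, num_planes, bits=6):
    """Reconstruct from the first `num_planes` planes; the unknown low bits
    are filled with the midpoint of the remaining uncertainty interval."""
    values = [0] * len(planes[0])
    for i, plane in enumerate(planes[:num_planes]):
        b = bits - 1 - i
        for j, bit in enumerate(plane):
            values[j] |= bit << b
    remaining = bits - num_planes
    midpoint = (1 << remaining) >> 1 if remaining > 0 else 0
    return [v + midpoint for v in values]

if __name__ == "__main__":
    quals = [40, 38, 2, 35, 40, 12, 33, 40]
    planes = encode_bitplanes(quals)       # compress once, at full rate
    for k in range(1, 7):                  # the decoder refines iteratively
        rec = decode_prefix(planes, k)
        mse = sum((a - b) ** 2 for a, b in zip(quals, rec)) / len(quals)
        print(f"{k} plane(s): MSE = {mse:.2f}")
```

Consuming all six planes reproduces the original values exactly; stopping earlier trades rate for distortion, which is the behavior the paper formalizes.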
Affiliation(s)
- Albert No
- Electronic and Electrical Engineering, Hongik University, 94 Wausan-ro, Mapo-gu, Seoul 04066, Korea
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W Gregory Dr, Urbana, IL 61801, USA
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1308 W Main Street, Urbana, IL 61801, USA
2
Guerrini V, Louza FA, Rosone G. Metagenomic analysis through the extended Burrows-Wheeler transform. BMC Bioinformatics 2020; 21:299. PMID: 32938362. PMCID: PMC7493373. DOI: 10.1186/s12859-020-03628-w.
Abstract
Background: The development of Next Generation Sequencing (NGS) has had a major impact on the study of genetic sequences. Among the problems that researchers in the field face, one of the most challenging is the taxonomic classification of metagenomic reads, i.e., identifying the microorganisms present in a sample collected directly from the environment. The analysis of environmental samples (metagenomes) is particularly important for determining the microbial composition of different ecosystems, and it is used in a wide variety of fields: for instance, metagenomic studies in agriculture can help in understanding the interactions between plants and microbes, while in ecology they can provide valuable insights into the functions of environmental communities.

Results: In this paper, we describe a new lightweight alignment-free and assembly-free framework for metagenomic classification that compares each unknown sequence in the sample to a collection of known genomes. We take advantage of the combinatorial properties of an extension of the Burrows-Wheeler transform, and we sequentially scan the required data structures, so that we can analyze unknown sequences from large collections using little internal memory. The tool LiME (Lightweight Metagenomics via eBWT) is available at https://github.com/veronicaguerrini/LiME.

Conclusions: In order to assess the reliability of our approach, we ran several experiments on NGS data from two simulated metagenomes provided in benchmarking analyses and on a real metagenome from the Human Microbiome Project. The results on the simulated data show that LiME is competitive with widely used taxonomic classifiers. It achieves high levels of precision and specificity (e.g., 99.9% of the positive control reads are correctly assigned, and the percentage of classified reads in the negative control is less than 0.01%) while keeping a high sensitivity. On the real metagenome, we show that LiME delivers classification results comparable to those of Magic-BLAST. Overall, the experiments confirm the effectiveness of our method and its high accuracy even in negative control samples.
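As a concrete illustration of the transform underlying LiME, the following sketch builds a multi-string BWT of a toy read collection by explicit suffix sorting. This is a simplification: real tools build the eBWT without materializing every suffix, and the per-read '$' terminator with read-index tie-breaking is a common stand-in rather than the exact omega-order definition.

```python
# Minimal multi-string BWT of a read collection, a simplified stand-in for
# the extended Burrows-Wheeler transform (eBWT) used by LiME. For
# illustration only; production tools never materialize all suffixes.

def multistring_bwt(reads):
    # Terminate each read with '$' (smaller than any base) and break ties
    # between equal suffixes by read index, as BCR-style constructions do.
    texts = [r + "$" for r in reads]
    suffixes = [(t[j:], i, j) for i, t in enumerate(texts) for j in range(len(t))]
    suffixes.sort(key=lambda s: (s[0], s[1]))
    # Each BWT symbol is the character cyclically preceding its suffix.
    return "".join(texts[i][j - 1] for (_, i, j) in suffixes)

if __name__ == "__main__":
    print(multistring_bwt(["ACGT", "ACGA", "CGTA"]))
```

Reads sharing substrings end up adjacent in the sorted suffix list, which is the combinatorial property LiME's sequential scans exploit.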
Affiliation(s)
- Veronica Guerrini
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
- Felipe A Louza
- Faculty of Electrical Engineering, Federal University of Uberlândia, Uberlândia, Brazil
- Giovanna Rosone
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
3
Prezza N, Pisanti N, Sciortino M, Rosone G. Variable-order reference-free variant discovery with the Burrows-Wheeler Transform. BMC Bioinformatics 2020; 21:260. PMID: 32938358. PMCID: PMC7493873. DOI: 10.1186/s12859-020-03586-3.
Abstract
Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph-based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k; (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read; and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs, and it was slow and memory-consuming because it required heavy additional data structures (namely, the suffix and LCP arrays) besides the BWT.

Results: In this paper, we introduce a new algorithm, and the corresponding tool ebwt2InDel, that (i) extends the framework of [Prezza et al., AMB 2019] to detect INDELs as well, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel.

Conclusions: Results on a synthetic dataset covered at 30x (human chromosome 1) show that our tool is able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore report results on larger (real) human whole-genome sequencing experiments; in these cases as well, our tool exhibits much higher sensitivity than the state-of-the-art tool.
Affiliation(s)
- Nicola Prezza
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
- Nadia Pisanti
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
- Marinella Sciortino
- Dipartimento di Matematica e Informatica, Università di Palermo, Via Archirafi, 34, Palermo, Italy
- Giovanna Rosone
- Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo, 3, Pisa, Italy
4
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. DOI: 10.1142/s0219720019400110.
Abstract
Many bioinformatics tools rely heavily on k-mer dictionaries to describe the composition of sequences and to allow faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers, and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which are thus difficult to compress. Here, we present an application that improves the compressibility of quality values while preserving the information needed for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
5
Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. PMID: 31757199. PMCID: PMC6873394. DOI: 10.1186/s12859-019-2883-5.
Abstract
Motivation: Current NGS techniques are becoming exponentially cheaper. As a result, genomic data are growing exponentially, a growth that storage capacity has not matched, making compression a necessity. Most of the entropy of NGS data lies in the quality values associated with each read, and those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, change (smooth) quality scores to improve compressibility without altering the important information they carry for downstream analyses like SNP calling.

Results: We use the FM-index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision in SNP-calling pipelines while reducing the entropy of the quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources.

Availability: https://github.com/yhhshb/yalff
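The smoothing idea can be sketched in a few lines. In this toy version, a plain Python set stands in for YALFF's FM-indexed dictionary, and the window length k and replacement quality are arbitrary illustrative choices, not the tool's defaults.

```python
# Quartz/YALFF-style quality smoothing, sketched with a plain set in place
# of an FM-index dictionary. A base is "supported" if at least one k-mer
# covering it occurs in the trusted dictionary; supported bases receive a
# constant replacement quality, so the quality string compresses far better.

def smooth_qualities(seq, qual, trusted_kmers, k=4, replacement="I"):
    supported = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in trusted_kmers:
            for j in range(i, i + k):
                supported[j] = True
    return "".join(replacement if s else q for q, s in zip(qual, supported))

if __name__ == "__main__":
    trusted = {"ACGT", "CGTA", "GTAC"}            # toy trusted dictionary
    seq, qual = "ACGTACGG", "FGHD#B@A"
    print(smooth_qualities(seq, qual, trusted))   # 'IIIIII@A'
```

Bases whose context is not in the dictionary keep their original quality, which is how such schemes preserve the evidence SNP callers need at genuinely variant positions.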
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
6
Kimura K, Koike A. Parallel computation of the Burrows-Wheeler transform of short reads using prefix parallelism. IEEE/ACM Trans Comput Biol Bioinform 2019; 16:3-13. PMID: 29994538. DOI: 10.1109/tcbb.2018.2837749.
Abstract
The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential utilities, such as efficient and sensitive variation analysis against multiple reference genome sequences, because it does not depend on any particular reference genome sequence, unlike conventional mapping-based methods. However, since the amount of read data is generally much larger than the size of the reference sequence, computing the BWT of reads is not easy, and this hampers the development of potential applications. To alleviate this problem, a new method for computing the BWT of reads in parallel is proposed. The BWT, corresponding to a sorted list of suffixes of reads, is constructed incrementally by successively including longer and longer suffixes. The working data is divided into more than 10,000 "blocks" corresponding to sublists of suffixes with the same prefixes. Thousands of groups of blocks can be processed in parallel while making exclusive writes and concurrent reads into a shared memory. Reads and writes are basically sequential, and the read concurrency is limited to two. Thus, a fine-grained parallelism, referred to as prefix parallelism, is expected to work efficiently. The time complexity for processing n reads of length l is O(nl²). On actual biological DNA sequence data of about 100 Gbp with a read length of 100 bp (base pairs), a tentative implementation of the proposed method took less than an hour on a single-node computer, i.e., it was about three times faster than one of the fastest programs developed so far.
7
Voges J, Fotouhi A, Ostermann J, Külekci MO. A two-level scheme for quality score compression. J Comput Biol 2018; 25:1141-1151. PMID: 30059248. DOI: 10.1089/cmb.2018.0065.
Abstract
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable better management of computational resources; thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for deeper analysis of the data, so it might be necessary to keep them; in addition to lossy compression, this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows users to work with lossy quality scores in routine analysis without sacrificing the ability to recover the original quality scores when further investigation is required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are compressed separately, such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows users to extract the original quality scores. Experiments on real data reveal that downstream analysis with the lossy part (spending only 0.49 bits per quality score on average) shows competitive performance, and that the total space usage, including the compressed second dimension, is comparable to that of competing lossless schemes.
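The paper's tuple decomposition is its own contribution; the toy split below into a coarse lossy symbol plus a lossless residual only illustrates the general two-level idea (the bucket width is an arbitrary assumption, not QScomp's mapping).

```python
# A two-level quality representation in the spirit of QScomp's hierarchy:
# a lossy first dimension plus a residual second dimension that restores
# the exact value. Illustrative only; QScomp's actual decomposition differs.

BUCKET = 8  # coarseness of the lossy first dimension (arbitrary choice)

def split(q):
    """Quality -> (lossy coarse symbol, lossless residual)."""
    return q // BUCKET, q % BUCKET

def lossy_value(coarse):
    """Decode from the first level only: midpoint of the bucket."""
    return coarse * BUCKET + BUCKET // 2

def exact_value(coarse, residual):
    """Decode both levels: the original quality is recovered exactly."""
    return coarse * BUCKET + residual

if __name__ == "__main__":
    for q in (2, 17, 33, 40):
        c, r = split(q)
        assert exact_value(c, r) == q
        print(q, "->", (c, r), "lossy:", lossy_value(c))
```

Routine analysis reads only the coarse stream; the residual stream is decoded on the rare occasions the original scores are needed, mirroring the workflow the abstract describes.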
Affiliation(s)
- Jan Voges
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
- Ali Fotouhi
- Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey
- Jörn Ostermann
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
8
Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017; 18:183-194. PMID: 26966283. DOI: 10.1093/bib/bbw011.
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation, and though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets and simulated data, we analyze the accuracy of variant calling on both the original data and the lossily compressed data. We show that lossy compression can significantly alleviate storage requirements while maintaining variant-calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant-calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
- Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Rachel Goldfeder
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Euan Ashley
- Department of Medicine, Stanford University, Stanford, CA, USA; Stanford Center for Inherited Cardiovascular Disease, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA
9
Kimura K, Koike A. Analysis of genomic rearrangements by using the Burrows-Wheeler transform of short-read data. BMC Bioinformatics 2015; 16(Suppl 18):S5. PMID: 26678411. PMCID: PMC4708002. DOI: 10.1186/1471-2105-16-s18-s5.
Abstract
Background: The potential utility of the Burrows-Wheeler transform (BWT) of a large amount of short-read data ("reads") has not been fully studied. The BWT basically serves as a lossless dictionary of reads, unlike the heuristic and lossy reads-to-genome mapping results conventionally obtained in the first step of sequence analysis. Thus, it can naturally be expected to lead to the development of sensitive methods for the analysis of short-read data. Recently, one of the most active areas of research in sequence analysis has been the sensitive detection of rare genomic rearrangements from whole-genome sequencing (WGS) data of heterogeneous cancer samples. This study addresses the application of the BWT of reads to the analysis of genomic rearrangements.

Results: A new method for the sensitive detection of genomic rearrangements using the BWT of reads is proposed, consisting of three steps: first, breakpoint regions, which contain breakpoints and are joined together by rearrangement, are predicted from the distribution of so-called discordant pairs using a conjugate-gradient-type method; second, reads partially matching the breakpoint regions are collected from the BWT of reads; and third, breakpoints are detected as branching points among the collected reads, and their precise positions are determined. The method was experimentally implemented, and its performance (i.e., sensitivity and specificity) was evaluated using simulated data with known artificial rearrangements. It was also applied to publicly available real biological WGS data from cancer patients, and the detection results were compared with published results.

Conclusions: Serving as a lossless dictionary of reads, the BWT of short reads enables sensitive analysis of genomic rearrangements in heterogeneous cancer-genome samples when used in conjunction with breakpoint-region predictions based on a conjugate gradient method.
10
Wandelt S, Leser U. Sequence factorization with multiple references. PLoS One 2015; 10:e0139000. PMID: 26422374. PMCID: PMC4589410. DOI: 10.1371/journal.pone.0139000.
Abstract
The success of high-throughput sequencing has led to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology for dealing with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained much interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., with respect to main-memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence (1) the size of the factorization, (2) the time for factorization, and (3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main-memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
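A minimal single-reference factorization, the base case that the paper generalizes to multiple references, can be sketched as a greedy longest-match search. The naive quadratic scan below is for clarity only; practical factorizers use suffix-array or hash indexes, and the min_match threshold is an arbitrary assumption.

```python
# Greedy referential factorization of a target against one reference: the
# target is emitted as (ref_pos, length) copy factors plus single-character
# literals. Single-reference sketch of the idea the paper studies.

def factorize(reference, target, min_match=4):
    factors, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(reference)):   # naive longest-match search
            l = 0
            while (i + l < len(target) and j + l < len(reference)
                   and reference[j + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len >= min_match:
            factors.append(("copy", best_pos, best_len))
            i += best_len
        else:
            factors.append(("lit", target[i]))
            i += 1
    return factors

def reconstruct(reference, factors):
    return "".join(reference[f[1]:f[1] + f[2]] if f[0] == "copy" else f[1]
                   for f in factors)

if __name__ == "__main__":
    ref, tgt = "ACGTACGTGGAT", "ACGTTCGTGGAA"
    fs = factorize(ref, tgt)
    assert reconstruct(ref, fs) == tgt
    print(fs)   # few long copies + literals where the sequences diverge
```

With multiple references, the search space for copy factors grows, which is exactly the size/time/memory trade-off the paper's 30 setups quantify.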
Affiliation(s)
- Sebastian Wandelt
- Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
- Ulf Leser
- Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
11
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 2015; 16:288. PMID: 26370285. PMCID: PMC4570262. DOI: 10.1186/s12859-015-0709-7.
Abstract
Background: The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip.

Results: We present a novel reference-free method designed to compress data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which allows higher compression rates to be obtained without losing information pertinent to downstream analyses.

Conclusions: LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq, and metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL License and available for download at http://gatb.inria.fr/software/leon/.
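The data structure at LEON's core, a Bloom filter holding the k-mer set of the reads, can be sketched as follows; the hash construction and sizes here are illustrative assumptions, not LEON's actual choices.

```python
# Minimal Bloom filter over a read set's k-mers, in the spirit of LEON's
# probabilistic de Bruijn graph. Membership queries are cheap and one-sided:
# false positives are possible, false negatives are not.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

if __name__ == "__main__":
    k = 5
    bf = BloomFilter()
    read = "ACGTACGTAC"
    for i in range(len(read) - k + 1):
        bf.add(read[i:i + k])              # insert the read's k-mers
    print("ACGTA" in bf)                   # True: never a false negative
    print("TTTTT" in bf)                   # almost surely False
```

Because the filter stores the graph's nodes in a few bits each, a read can then be represented by an anchoring k-mer plus the bifurcation choices needed to walk the graph, which is the encoding the abstract describes.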
Affiliation(s)
- Gaëtan Benoit
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
- Claire Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
- Erwan Drezen
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
- Thibault Dayris
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- Raluca Uricaru
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- University of Bordeaux, CBiB, Bordeaux, F-33000, France.
- Guillaume Rizk
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
12
Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z. Light-weight reference-based compression of FASTQ data. BMC Bioinformatics 2015; 16:188. PMID: 26051252. PMCID: PMC4459677. DOI: 10.1186/s12859-015-0628-7.
Abstract
Background: The exponential growth of next-generation sequencing (NGS) data has posed big challenges to data storage, management, and archiving. Data compression is one of the effective solutions, and reference-based compression strategies can typically achieve compression ratios superior to those of methods not relying on any reference.

Results: This paper presents a lossless light-weight reference-based compression algorithm, LW-FQZip, to compress FASTQ data. The three components of any given input, i.e., metadata, short reads, and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently. In particular, well-designed incremental and run-length-limited encoding schemes are used to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to quickly map them against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with general-purpose compression algorithms like LZMA. LW-FQZip was evaluated on eight real-world NGS datasets and achieved compression ratios in the range 0.111-0.201, comparable or superior to other state-of-the-art lossless NGS data compression algorithms.

Conclusions: LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to the state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip.
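As an illustration of the run-length-limited idea mentioned for the quality stream, here is a plain capped-run RLE sketch; the cap value and the pair representation are illustrative assumptions, and the paper's actual coder is more elaborate.

```python
# Run-length encoding with a capped run length, sketching the kind of
# run-length-limited scheme applied to quality score strings.

MAX_RUN = 255  # run-length limit, so each count fits in one byte

def rle_encode(s):
    out, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i] and j - i < MAX_RUN:
            j += 1
        out.append((s[i], j - i))  # (symbol, run length <= MAX_RUN)
        i = j
    return out

def rle_decode(pairs):
    return "".join(ch * n for ch, n in pairs)

if __name__ == "__main__":
    qual = "IIIIIIIIIIFFFF##IIII"
    enc = rle_encode(qual)
    assert rle_decode(enc) == qual
    print(enc)   # [('I', 10), ('F', 4), ('#', 2), ('I', 4)]
```

Quality strings from real instruments contain long runs of identical symbols, which is why a run-length stage pays off before the general-purpose back-end compressor.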
Affiliation(s)
- Yongpeng Zhang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Linsen Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Yanli Yang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
- Xiao Yang
- The Broad Institute, Cambridge, MA, 02142, USA.
- Shan He
- School of Computer Science, University of Birmingham, Birmingham, B15 2TT, UK.
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
13
Malysa G, Hernaez M, Ochoa I, Rao M, Ganesan K, Weissman T. QVZ: lossy compression of quality values. Bioinformatics 2015; 31:3122-3129. PMID: 26026138. DOI: 10.1093/bioinformatics/btv330.
Abstract
Motivation: Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed, and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values present in genomic data files (e.g., FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit.

Results: The proposed algorithm QVZ exhibits better rate-distortion performance than previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better genotyping performance, in the sense that, for a similar rate, genotyping results closer to those achieved with the original quality values are obtained.

Availability and implementation: QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz.

Contact: mhernaez@stanford.edu, gmalysa@stanford.edu, or iochoa@stanford.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
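A one-dimensional Lloyd quantizer under mean squared error is the simplest relative of the quantizers QVZ computes; the sketch below omits QVZ's Markov-context conditioning and its support for arbitrary quasi-convex distortions, and the level count is an arbitrary assumption.

```python
# 1-D Lloyd (k-means) quantizer for quality values under MSE, a toy
# stand-in for the context-dependent quantizers a tool like QVZ designs.

def lloyd_quantizer(values, levels=4, iters=50):
    lo, hi = min(values), max(values)
    # Start with evenly spaced reconstruction levels.
    centers = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for v in values:                       # assign to nearest center
            nearest = min(range(levels), key=lambda c: (v - centers[c]) ** 2)
            buckets[nearest].append(v)
        # Move each center to the mean of its bucket (keep empty ones).
        centers = [sum(b) / len(b) if b else centers[c]
                   for c, b in enumerate(buckets)]
    return centers

def quantize(v, centers):
    return min(centers, key=lambda c: (v - c) ** 2)

if __name__ == "__main__":
    quals = [2, 2, 6, 11, 25, 27, 33, 38, 40, 40, 40, 41]
    centers = lloyd_quantizer(quals)
    print([round(quantize(q, centers), 1) for q in quals])
```

Replacing each quality with its reconstruction level shrinks the alphabet, and hence the rate; QVZ's contribution is choosing such quantizers per context so that a user-specified distortion, not just MSE, is minimized.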
Affiliation(s)
- Greg Malysa
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Milind Rao
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Karthik Ganesan
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
14
Yu YW, Yorukoglu D, Peng J, Berger B. Quality score compression improves genotyping accuracy. Nat Biotechnol 2015; 33:240-243. PMID: 25748910. PMCID: PMC4439189. DOI: 10.1038/nbt.3170.
Affiliation(s)
- Y. William Yu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
- Deniz Yorukoglu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Jian Peng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
15
Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. Bioinformatics 2015; 31:1389-1395. PMID: 25536966. DOI: 10.1093/bioinformatics/btu844.
Abstract
Motivation: High-coverage sequencing data have significant, yet hard-to-exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot easily be captured in the (relatively small) main memory. The more promising solutions to this problem are disk-based; the better of these, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage.

Results: We propose ORCOM (overlapping reads compression with minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of the conceptually simple and easily parallelizable idea of minimizers to obtain a compression ratio of 0.317 bits per base, allowing the 134.0 Gbp dataset to fit into only 5.31 GB of space.

Availability and implementation: http://sun.aei.polsl.pl/orcom, under a free license.

Contact: sebastian.deorowicz@polsl.pl

Supplementary information: Supplementary data are available at Bioinformatics online.
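The minimizer idea at the heart of this approach is easy to sketch: bin reads by their lexicographically smallest p-mer, so overlapping reads tend to land in the same bucket. Canonical-orientation handling and the tool's actual parameters are omitted; p and the toy reads are illustrative assumptions.

```python
# Computing read minimizers and binning reads by them, the core idea behind
# disk-based clustering of overlapping reads. Illustrative sketch only.

from collections import defaultdict

def minimizer(read, p=8):
    """Lexicographically smallest p-mer of the read."""
    return min(read[i:i + p] for i in range(len(read) - p + 1))

def bin_reads(reads, p=8):
    bins = defaultdict(list)
    for r in reads:
        bins[minimizer(r, p)].append(r)
    return bins

if __name__ == "__main__":
    reads = ["ACGTACGTACGTAAAA", "CGTACGTACGTAAAAC", "TTTTGGGGCCCCAAAA"]
    for m, group in bin_reads(reads).items():
        print(m, "->", group)   # overlapping reads share a minimizer bin
```

Because two reads that overlap substantially usually share their smallest p-mer, each disk bucket holds highly similar reads, and the redundancy becomes local enough to compress with modest memory.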
Affiliation(s)
- Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
- Łukasz Roguski
- Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland; Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
16
Zhou J, Ji Z, Zhu Z, He S. Compression of next-generation sequencing quality scores using memetic algorithm. BMC Bioinformatics 2014; 15(Suppl 15):S10. PMID: 25474747. PMCID: PMC4271560. DOI: 10.1186/1471-2105-15-s15-s10.
Abstract
Background: The exponential growth of next-generation sequencing (NGS) DNA data poses great challenges to data storage and transmission. Although many compression algorithms have been proposed for the DNA reads in NGS data, few methods are designed specifically to handle the quality scores.

Results: In this paper we present a memetic algorithm (MA) based compressor for NGS quality score data, MMQSC. The algorithm extracts raw quality score sequences from FASTQ-formatted files and designs a compression codebook using MA-based multimodal optimization. The input data are then compressed in a substitutional manner. Experimental results on five representative NGS datasets show that MMQSC obtains a higher compression ratio than other state-of-the-art methods. In particular, although MMQSC is a lossless reference-free compression algorithm, it obtains an average compression ratio of 22.82% on the experimental datasets.

Conclusions: The proposed MMQSC compresses NGS quality score data effectively. It can be utilized to improve the overall compression ratio on FASTQ-formatted files.
17
Holt J, McMillan L. Merging of multi-string BWTs with applications. Bioinformatics 2014; 30:3524-3531.
Abstract
Motivation: The throughput of genomic sequencing has increased to the point that it is overrunning the rate of downstream analysis. This, along with the desire to revisit old data, has led to a situation where large quantities of raw, and nearly impenetrable, sequence data are rapidly filling the hard drives of modern biology labs. These datasets can be compressed via a multi-string variant of the Burrows-Wheeler Transform (BWT), which provides the side benefit of searches for arbitrary k-mers within the raw data as well as the ability to reconstitute arbitrary reads as needed. We propose a method for merging such datasets for both increased compression and downstream analysis.

Results: We present a novel algorithm that merges multi-string BWTs in O(LCS · N) time, where LCS is the length of the longest common substring between any of the inputs and N is the total length of all inputs combined (number of symbols), using O(N log F) bits, where F is the number of multi-string BWTs merged. This merged multi-string BWT is also shown to be more compressible than the input multi-string BWTs taken separately. Additionally, we explore some uses of a merged multi-string BWT for bioinformatics applications.
Affiliation(s)
- James Holt
- Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA
- Leonard McMillan
- Department of Computer Science, 201 S. Columbia St. UNC-CH, Chapel Hill, NC 27599, USA
18
Janin L, Schulz-Trieglaff O, Cox AJ. BEETL-fastq: a searchable compressed archive for DNA reads. Bioinformatics 2014; 30:2796-2801. PMID: 24950811. DOI: 10.1093/bioinformatics/btu387.
Abstract
Motivation: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI, and large international collaborations such as the Cancer Genome Atlas, can accumulate many terabytes of data in this format. Compression tools such as gzip are often used to reduce the storage burden but have the disadvantage that the data must be decompressed before they can be used. Here, we present BEETL-fastq, a tool that not only compresses FASTQ-formatted DNA reads more compactly than gzip but also permits rapid search for k-mer queries within the archived sequences. Importantly, the full FASTQ record of each matching read or read pair is returned, allowing the search results to be piped directly to any of the many standard tools that accept FASTQ data as input.

Results: We show that 6.6 terabytes of human reads in FASTQ format can be transformed into 1.7 terabytes of indexed files, from which we can search for 1, 10, 100, 1,000, and a million 30-mers in 3, 8, 14, 45, and 567 s, respectively, plus 20 ms per output read. Useful applications of the search capability are highlighted, including the genotyping of structural variant breakpoints and 'in silico pull-down' experiments in which only the reads that cover a region of interest are selectively extracted for the purposes of variant calling or visualization.

Availability and implementation: BEETL-fastq is part of the BEETL library, available as a GitHub repository at github.com/BEETL/BEETL.
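The k-mer search that such archives support rests on BWT backward search. The sketch below counts pattern occurrences with linear-scan ranks over a tiny BWT; real indexes replace these scans with sampled rank structures, and the toy text is an assumption for illustration.

```python
# Backward search over a BWT: the mechanism that lets a BWT-based archive
# answer k-mer queries without decompression. Linear-scan ranks for clarity.

def bwt(text):
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def backward_search(bwt_str, pattern):
    """Return the number of occurrences of `pattern` in the original text."""
    first_col = sorted(bwt_str)
    # C[c]: number of symbols in the text strictly smaller than c.
    C = {c: first_col.index(c) for c in set(bwt_str)}
    rank = lambda c, i: bwt_str[:i].count(c)   # occurrences of c in bwt[0:i]
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):                # extend the match leftwards
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    b = bwt("ACGTACGTAC")
    for kmer in ("ACGT", "CGTA", "GGGG"):
        print(kmer, backward_search(b, kmer))  # 2, 2, 0
```

Each step narrows an interval of the sorted suffixes, so query time depends on the pattern length rather than the archive size, which is what makes million-query workloads like those reported here feasible.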
Affiliation(s)
- Lilian Janin
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
- Ole Schulz-Trieglaff
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
- Anthony J Cox
- Computational Biology Group, Illumina Cambridge Ltd., Little Chesterford, Essex CB10 1XL, UK
19
Cánovas R, Moffat A, Turpin A. Lossy compression of quality scores in genomic data. Bioinformatics 2014; 30:2130-2136. DOI: 10.1093/bioinformatics/btu183.
20
Yu YW, Yorukoglu D, Berger B. Traversing the k-mer landscape of NGS read datasets for quality score sparsification. In: Research in Computational Molecular Biology (RECOMB 2014), Lecture Notes in Computer Science 2014; 8394:385-399. PMID: 28825060. DOI: 10.1007/978-3-319-05269-4_31.
Abstract
It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms other de novo quality score compression methods in compression ratio and speed while maintaining SNP-calling accuracy. Surprisingly, RQS also improves SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets.

Availability: An implementation of our method, RQS, is available for download at http://rqs.csail.mit.edu/.
Affiliation(s)
- Y William Yu
- Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Deniz Yorukoglu
- Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Bonnie Berger
- Massachusetts Institute of Technology, Cambridge, MA 02139, USA