1
|
Sun H, Zheng Y, Xie H, Ma H, Zhong C, Yan M, Liu X, Wang G. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping. Bioinformatics 2024; 40:btae323. [PMID: 38759114 PMCID: PMC11139522 DOI: 10.1093/bioinformatics/btae323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2024] [Revised: 04/22/2024] [Accepted: 05/16/2024] [Indexed: 05/19/2024] Open
Abstract
MOTIVATION The quality scores data (QSD) account for 70% in compressed FastQ files obtained from the short and long reads sequencing technologies. Designing effective compressors for QSD that counterbalance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills the above requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression processes, as well as a parallel four-level run-length prediction mapping model to enhance compression ratio. Besides, the PQSDC algorithm is also designed to be highly concurrent using multicore CPU clusters. RESULTS We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, including 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio, and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for QSD parallel compression, which balances storage cost, time consumption, and memory occupation primely. AVAILABILITY AND IMPLEMENTATION The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
Collapse
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China
| | - Huidong Ma
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Cheng Zhong
- Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
| | - Meng Yan
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| | - Gang Wang
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
| |
Collapse
|
2
|
Chen S, Chen Y, Wang Z, Qin W, Zhang J, Nand H, Zhang J, Li J, Zhang X, Liang X, Xu M. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front Genet 2023; 14:1260531. [PMID: 37811144 PMCID: PMC10552150 DOI: 10.3389/fgene.2023.1260531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 09/07/2023] [Indexed: 10/10/2023] Open
Abstract
With the increasing throughput of modern sequencing instruments, the cost of storing and transmitting sequencing data has also increased dramatically. Although many tools have been developed to compress sequencing data, there is still a need to develop a compressor with a higher compression ratio. We present a two-step framework for compressing sequencing data in this paper. The first step is to repack original data into a binary stream, while the second step is to compress the stream with a LZMA encoder. We develop a new strategy to encode the original file into a LZMA highly compressed stream. In addition an FPGA-accelerated of LZMA was implemented to speedup the second step. As a demonstration, we present repaq as a lossless non-reference compressor of FASTQ format files. We introduced a multifile redundancy elimination method, which is very useful for compressing paired-end sequencing data. According to our test results, the compression ratio of repaq is much higher than other FASTQ compressors. For some deep sequencing data, the compression ratio of repaq can be higher than 25, almost four times of Gzip. The framework presented in this paper can also be applied to develop new tools for compressing other sequencing data. The open-source code of repaq is available at: https://github.com/OpenGene/repaq.
Collapse
Affiliation(s)
- Shifu Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Yaru Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
| | | | - Wenjian Qin
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Jing Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
| | - Heera Nand
- Xilinx Inc., San Jose, CA, United States
| | | | - Jun Li
- HaploX Biotechnology, Shenzhen, Guangdong, China
| | - Xiaoni Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
| | | | - Mingyan Xu
- HaploX Biotechnology, Shenzhen, Guangdong, China
| |
Collapse
|
3
|
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023] Open
Abstract
The Sequence Alignment/Map (SAM) format file is the text file used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry today, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper is devoted to retrieving knowledge in the broad field of SAM. First, the format of SAM is introduced to understand the overall process of the sequencing analysis. Then, existing work is systematically classified in accordance with generation, compression and application, and the involved SAM tools are specifically mined. Lastly, a summary and some thoughts on future directions are provided.
Collapse
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Yongshun Gong
- School of Software, Shandong University, 250100, Jinan, China
| | - Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
| |
Collapse
|
4
|
Chen H, Chen J, Lu Z, Wang R. CMIC: an efficient quality score compressor with random access functionality. BMC Bioinformatics 2022; 23:294. [PMID: 35870880 PMCID: PMC9308261 DOI: 10.1186/s12859-022-04837-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 07/13/2022] [Indexed: 12/02/2022] Open
Abstract
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality score is a key and difficult aspect of FASTQ file compression. Throughout the literature, we found that the majority of the current quality score compression methods do not support random access. Based on the above consideration, it is reasonable to investigate a lossless quality score compressor with a high compression rate, a fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive and random access supported compressor for lossless compression of quality score sequences. CMIC is an acronym of the four steps (classification, mapping, indexing and compression) in the paper. Its framework consists of the following four parts: classification, mapping, indexing, and compression. The experimental results show that our compressor has good performance in terms of compression rates on all the tested datasets. The file sizes are reduced by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors on most of the tested cases. In terms of random access speed, the CMIC is faster than the LCQS, which provides a random access function for compressed quality scores. Conclusions CMIC is a compressor that is especially designed for quality score sequences, which has good performance in terms of compression rate, compression speed, decompression speed, and random access speed. The CMIC can be obtained in the following way: https://github.com/Humonex/Cmic. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04837-1.
Collapse
|
5
|
Rivara-Espasandín M, Balestrazzi L, Dufort y Álvarez G, Ochoa I, Seroussi G, Smircich P, Sotelo-Silveira J, Martín Á. Nanopore quality score resolution can be reduced with little effect on downstream analysis. BIOINFORMATICS ADVANCES 2022; 2:vbac054. [PMID: 36699360 PMCID: PMC9710687 DOI: 10.1093/bioadv/vbac054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Revised: 07/13/2022] [Accepted: 08/08/2022] [Indexed: 01/28/2023]
Abstract
Motivation The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files. Results We polished de novo assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, under various coverage level scenarios and various quality score quantizers. In all cases, we found that the quantization of quality scores causes little difference (or even sometimes improves) on the results obtained with the original (non-quantized) data. This suggests that the precision that is currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores. Availability and supplementary information Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.
Collapse
Affiliation(s)
- Martín Rivara-Espasandín
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Departamento de Genética, Facultad de Medicina, Universidad de la República, 11800 Montevideo, Uruguay
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
| | - Lucía Balestrazzi
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Sección Bioinformática, Unidad de Genómica Evolutiva, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - Guillermo Dufort y Álvarez
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| | - Idoia Ochoa
- Electrical and Electronics Department, Tecnun, University of Navarra, 20018 San Sebastián, Spain
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA
| | - Gadiel Seroussi
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
- Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| | - Pablo Smircich
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
- Laboratorio de Interacciones Moleculares, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - José Sotelo-Silveira
- Departamento de Genómica, Instituto de Investigaciones Biológicas Clemente Estable, 11600 Montevideo, Uruguay
- Departamento de Biología Celular y Molecular, Sección Biología Celular, Facultad de Ciencias, Universidad de la República, 11400 Montevideo, Uruguay
| | - Álvaro Martín
- Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay
| |
Collapse
|
6
|
Bonfield JK. CRAM 3.1: advances in the CRAM file format. Bioinformatics 2022; 38:1497-1503. [PMID: 34999766 PMCID: PMC8896640 DOI: 10.1093/bioinformatics/btac010] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2021] [Revised: 12/14/2021] [Accepted: 01/04/2022] [Indexed: 02/04/2023] Open
Abstract
MOTIVATION CRAM has established itself as a high compression alternative to the BAM file format for DNA sequencing data. We describe updates to further improve this on modern sequencing instruments. RESULTS With Illumina data CRAM 3.1 is 7-15% smaller than the equivalent CRAM 3.0 file, and 50-70% smaller than the corresponding BAM file. Long-read technology shows more modest compression due to the presence of high-entropy signals. AVAILABILITY AND IMPLEMENTATION The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate OpenSource HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- James K Bonfield
- Informatics and Digital Solutions, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| |
Collapse
|
7
|
Huo H, Liu P, Wang C, Jiang H, Vitter JS. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 2022; 38:335-343. [PMID: 34524416 DOI: 10.1093/bioinformatics/btab655] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2021] [Revised: 08/12/2021] [Accepted: 09/10/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and tables REF and Rγ, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66% points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times. For extracting records in FASTQ files, our method uses 2.86-14.88% points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an 710071, China
| | - Pengfei Liu
- Department of Computer Science, Xidian University, Xi'an 710071, China
| | - Chenhui Wang
- Department of Computer Science, Xidian University, Xi'an 710071, China
| | - Hongbo Jiang
- Department of Computer Science, Xidian University, Xi'an 710071, China
| | | |
Collapse
|
8
|
Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. [PMID: 34623374 DOI: 10.1093/bioinformatics/btab696] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2020] [Revised: 09/29/2021] [Accepted: 10/05/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provided it as a Docker image for ease of installation and execution to help users easily install and run. We compared our method with existing major FASTQ compression tools using benchmark datasets. We also included new long-read sequencing data in this validation. As a result, FastqCLS outperformed in terms of compression ratios for storing long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Dohyeon Lee
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
| | - Giltae Song
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
| |
Collapse
|
9
|
Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1958-1969. [PMID: 31869798 DOI: 10.1109/tcbb.2019.2959273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
The dropping cost of sequencing human DNA has allowed for fast development of several projects around the world generating huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study we address the compression of SAM files, the standard output files for DNA alignment. We specifically study lossy compression techniques used for quality values reported in the SAM file and analyze the impact of such lossy techniques on the CRAM format. We present a series of experiments using a data set corresponding to individual NA12878 with three different fold coverages. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
Collapse
|
10
|
Yu R, Yang W, Wang S. Performance evaluation of lossy quality compression algorithms for RNA-seq data. BMC Bioinformatics 2020; 21:321. [PMID: 32689929 PMCID: PMC7372835 DOI: 10.1186/s12859-020-03658-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Accepted: 07/13/2020] [Indexed: 11/29/2022] Open
Abstract
Background Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. Results In this study, we evaluated the impacts of lossy quality value compression on common RNA-seq data analysis pipelines including expression quantification, transcriptome assembly, and short variants detection using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction on the overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impacts to RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while the effects of lossy compression on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines, were not observed. Moreover, regardless of using either STAR or HISAT2 as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when STAR-based pipeline was used. Our results also show that the impacts of lossy quality value compression depend on the compression algorithms being used and the compression levels if the algorithm supports setting of multiple compression levels. Conclusions Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate the data storage and transmission burdens. However, care should be taken on the selection of compression tools and levels based on the requirements of the downstream analysis pipelines to avoid introducing undesirable adverse effects on the analysis results.
Collapse
|
11
|
Yu R, Yang W. ScaleQC: a scalable lossy to lossless solution for NGS data compression. Bioinformatics 2020; 36:4551-4559. [DOI: 10.1093/bioinformatics/btaa543] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2019] [Revised: 04/25/2020] [Accepted: 05/20/2020] [Indexed: 12/30/2022] Open
Abstract
Abstract
Motivation
Per-base quality values in Next Generation Sequencing data take a significant portion of storage even after compression. Lossy compression technologies could further reduce the space used by quality values. However, in many applications, lossless compression is still desired. Hence, sequencing data in multiple file formats have to be prepared for different applications.
Results
We developed a scalable lossy to lossless compression solution for quality values named ScaleQC (Scalable Quality value Compression). ScaleQC is able to provide the so-called bit-stream level scalability that the losslessly compressed bit-stream by ScaleQC can be further truncated to lower data rates without incurring an expensive transcoding operation. Despite its scalability, ScaleQC still achieves comparable compression performance at both lossless and lossy data rates compared to the existing lossless or lossy compressors.
Availability and implementation
ScaleQC has been integrated with SAMtools as a special quality value encoding mode for CRAM. Its source codes can be obtained from our integrated SAMtools (https://github.com/xmuyulab/samtools) with dependency on integrated HTSlib (https://github.com/xmuyulab/htslib).
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Rongshan Yu
- Digital Fujian Institute of Healthcare and Biomedical Big Data, School of Informatics, Xiamen University, Xiamen 316005, China
- Aginome Scientific, Xiamen 316005, China
| | | |
Collapse
|
12
|
No A, Hernaez M, Ochoa I. CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores. J Bioinform Comput Biol 2020; 18:2050031. [PMID: 32938284 DOI: 10.1142/s0219720020500316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The amount of sequencing data is growing at a fast pace due to a rapid revolution in sequencing technologies. Quality scores, which indicate the reliability of each of the called nucleotides, take a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural solution to further decrease the size of the sequencing data is to apply lossy compression to the quality scores. Lossy compression may result in a loss in precision, however, it has been shown that when operating at some specific rates, lossy compression can achieve performance on variant calling similar to that achieved with the losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor designed for the quality scores with the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively. The decoder can reconstruct the set of quality scores at each step with reduced distortion each time. This characteristic is specifically useful in sequencing data compression, since the encoder does not generally know what the most appropriate rate of compression is, e.g. for not degrading variant calling accuracy. CROMqs avoids the need of having to compress the data at multiple rates, hence incurring time savings. In addition to this property, we show that CROMqs obtains a comparable rate-distortion performance to the state-of-the-art lossy compressors. Moreover, we also show that it achieves a comparable performance on variant calling to that of the lossless compressed data while achieving more than 50% reduction in size.
Collapse
Affiliation(s)
- Albert No
- Electronic and Electrical Engineering, Hongik University, 94 Wausan-ro, Mapo-gu, Seoul 04066, Korea
| | - Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W Gregory Dr, Urbana, IL 61801, USA
| | - Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1308 W Main Street, Urbana, IL 61801, USA
| |
Collapse
|
13
|
Shokrof M, Abouelhoda M. IonCRAM: a reference-based compression tool for ion torrent sequence files. BMC Bioinformatics 2020; 21:397. [PMID: 32907531 PMCID: PMC7487613 DOI: 10.1186/s12859-020-03726-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2020] [Accepted: 08/31/2020] [Indexed: 12/29/2022] Open
Abstract
Background Ion Torrent is one of the major next generation sequencing (NGS) technologies and it is frequently used in medical research and diagnosis. The built-in software for the Ion Torrent sequencing machines delivers the sequencing results in the BAM format. In addition to the usual SAM/BAM fields, the Ion Torrent BAM file includes technology-specific flow signal data. The flow signals occupy a big portion of the BAM file (about 75% for the human genome). Compressing SAM/BAM into CRAM format significantly reduces the space needed to store the NGS results. However, the tools for generating the CRAM formats are not designed to handle the flow signals. This missing feature has motivated us to develop a new program to improve the compression of the Ion Torrent files for long term archiving. Results In this paper, we present IonCRAM, the first reference-based compression tool to compress Ion Torrent BAM files for long term archiving. For the BAM files, IonCRAM could achieve a space saving of about 43%. This space saving is superior to what achieved with the CRAM format by about 8–9%. Conclusions Reducing the space consumption of NGS data reduces the cost of storage and data transfer. Therefore, developing efficient compression software for clinical NGS data goes beyond the computational interest; as it ultimately contributes to the overall cost reduction of the clinical test. The space saving achieved by our tool is a practical step in this direction. The tool is open source and available at Code Ocean, github, and http://ioncram.saudigenomeproject.com.
Collapse
Affiliation(s)
- Moustafa Shokrof
- Faculty of Computer Science, University of California at Davis, Davis, CA, USA
| | - Mohamed Abouelhoda
- King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia. .,Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia. .,Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, University Square, Giza, Egypt.
| |
Collapse
|
14
|
Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020; 35:2066-2074. [PMID: 30407482 DOI: 10.1093/bioinformatics/bty936] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2018] [Revised: 11/04/2018] [Accepted: 11/07/2018] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amount of reads data, and algorithms have been specially designed to contract the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective reads compression approach. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These excellent performances are mainly attributed to the exploit of the redundancy of the repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China.,School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
| | - Marcel E Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia.,St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
| |
Collapse
|
15
|
Hernandez-Lopez AA, Alberti C, Mattavelli M. Toward a Dynamic Threshold for Quality Score Distortion in Reference-Based Alignment. J Comput Biol 2020; 27:288-300. [PMID: 31891532 DOI: 10.1089/cmb.2019.0333] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
The intrinsic high-entropy sequence metadata, known as quality scores, is largely the cause of the substantial size of sequence data files. Yet, there is no consensus on a viable reduction of the resolution of the quality score scale, arguably because of collateral side effects. In this article, we leverage on the penalty functions of HISAT2 aligner to rebin the quality score scale in such a way as to avoid any impact on sequence alignment, identifying alongside a distortion threshold for "safe" quality score representation. We tested our findings on whole-genome and RNA-seq data, and contrasted the results with three methods for lossy compression of the quality scores.
Collapse
Affiliation(s)
| | | | - Marco Mattavelli
- École Polytechnique Fédérale de Lausanne, EPFL, Lausanne, Switzerland
| |
Collapse
|
16
|
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. [DOI: 10.1142/s0219720019400110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Many bioinformatics tools heavily rely on [Formula: see text]-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive [Formula: see text]-mer dictionaries are very memory-inefficient, requiring very large amount of storage space to save each [Formula: see text]-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input [Formula: see text]-mers and its application to the compression of quality scores in FASTQ files. Most of the entropies of sequencing data lie in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant [Formula: see text]-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality value. Availability: The software is freely available at https://github.com/yhhshb/yalff .
Collapse
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
| | - Matteo Comin
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
| |
Collapse
|
17
|
Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. [PMID: 31757199 PMCID: PMC6873394 DOI: 10.1186/s12859-019-2883-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Accepted: 05/07/2019] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data unfortunately not followed by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated to each read. Those values are often more diversified than necessary. Because of that, many tools such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analysis like SNP calling. RESULTS We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers and an effective smoothing algorithm to maintain high precision for SNP calling pipelines, while reducing quality scores entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality scores compression by smoothing leading to improved compressibility of FASTQ files. The succinct k-mers dictionary allows YALFF to run on consumer computers with only 5.7 GB of available free RAM. YALFF smoothing algorithm can improve genotyping accuracy while using less resources. AVAILABILITY https://github.com/yhhshb/yalff.
Collapse
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
| | - Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
| |
Collapse
|
18
|
Bonfield JK, McCarthy SA, Durbin R. Crumble: reference free lossy compression of sequence quality values. Bioinformatics 2019; 35:337-339. [PMID: 29992288 PMCID: PMC6330002 DOI: 10.1093/bioinformatics/bty608] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2018] [Accepted: 07/09/2018] [Indexed: 02/01/2023] Open
Abstract
Motivation The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results On the Syndip test set, a 17 fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2 to 7.4 fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation Crumble is OpenSource and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- James K Bonfield
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
| | - Shane A McCarthy
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.,Department of Genetics, University of Cambridge, Cambridge, UK
| | - Richard Durbin
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK.,Department of Genetics, University of Cambridge, Cambridge, UK
| |
Collapse
|
19
|
Fischer-Hwang I, Ochoa I, Weissman T, Hernaez M. Denoising of Aligned Genomic Data. Sci Rep 2019; 9:15067. [PMID: 31636330 PMCID: PMC6803637 DOI: 10.1038/s41598-019-51418-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Accepted: 09/25/2019] [Indexed: 12/30/2022] Open
Abstract
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .
Collapse
Affiliation(s)
- Irena Fischer-Hwang
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA.
| | - Idoia Ochoa
- University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, 61801, USA
| | - Tsachy Weissman
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA
| | - Mikel Hernaez
- University of Illinois Urbana-Champaign, Carl R. Woese Institute for Genomic Biology, Urbana, 61801, USA.
| |
Collapse
|
20
|
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Collapse
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| | - Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
| | - Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
| |
Collapse
|
21
|
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. [PMID: 29444237 DOI: 10.1093/bioinformatics/btx639] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 10/06/2017] [Indexed: 12/30/2022] Open
Abstract
Motivation New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact schandak@stanford.edu. Supplementary information Supplementary material are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
Collapse
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
22
|
Voges J, Fotouhi A, Ostermann J, Külekci MO. A Two-Level Scheme for Quality Score Compression. J Comput Biol 2018; 25:1141-1151. [PMID: 30059248 DOI: 10.1089/cmb.2018.0065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023] Open
Abstract
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable a better management of computational resources. Thus, in practice, and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data. Hence, it might be necessary to keep them; in addition to lossy compression this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows the users to work with lossy quality scores in routine analysis, without sacrificing the capability of reaching the original quality scores when further investigations are required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are separately compressed such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the users to extract the original quality scores. Experiments on real data reveal that the downstream analysis with the lossy part-spending only 0.49 bits per quality score on average-shows a competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.
Collapse
Affiliation(s)
- Jan Voges
- 1 Institut für Informationsverarbeitung, Leibniz Universität Hannover , Hannover, Germany
| | - Ali Fotouhi
- 2 Electronics and Communication Engineering Department, Istanbul Technical University , Istanbul, Turkey
| | - Jörn Ostermann
- 1 Institut für Informationsverarbeitung, Leibniz Universität Hannover , Hannover, Germany
| | | |
Collapse
|
23
|
Voges J, Ostermann J, Hernaez M. CALQ: compression of quality values of aligned sequencing data. Bioinformatics 2018; 34:1650-1658. [PMID: 29186284 PMCID: PMC5946873 DOI: 10.1093/bioinformatics/btx737] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2017] [Revised: 10/23/2017] [Accepted: 11/22/2017] [Indexed: 12/30/2022] Open
Abstract
Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as good as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation CALQ is written in C ++ and can be downloaded from https://github.com/voges/calq. Contact voges@tnt.uni-hannover.de or mhernaez@illinois.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jan Voges
- Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
| | - Jörn Ostermann
- Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
| | - Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, IL, USA
| |
Collapse
|
24
|
Paridaens T, Van Wallendael G, De Neve W, Lambert P. AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality. Bioinformatics 2018; 34:425-433. [PMID: 29028894 DOI: 10.1093/bioinformatics/btx607] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2017] [Accepted: 09/22/2017] [Indexed: 01/09/2023] Open
Abstract
Motivation The past decade has seen the introduction of new technologies that significantly lowered the cost of genome sequencing. As a result, the amount of genomic data that must be stored and transmitted is increasing exponentially. To mitigate storage and transmission issues, we introduce a framework for lossless compression of quality scores. Results This article proposes AQUa, an adaptive framework for lossless compression of quality scores. To compress these quality scores, AQUa makes use of a configurable set of coding tools, extended with a Context-Adaptive Binary Arithmetic Coding scheme. When benchmarking AQUa against generic single-pass compressors, file sizes are reduced by up to 38.49% when comparing with GNU Gzip and by up to 6.48% when comparing with 7-Zip at the Ultra Setting, while still providing support for random access. When comparing AQUa with the purpose-built, single-pass, and state-of-the-art compressor SCALCE, which does not support random access, file sizes are reduced by up to 21.14%. When comparing AQUa with the purpose-built, dual-pass, and state-of-the-art compressor QVZ, which does not support random access, file sizes are larger by 6.42-33.47%. However, for one test file, the file size is 0.38% smaller, illustrating the strength of our single-pass compression framework. This work has been spurred by the current activity on genomic information representation (MPEG-G) within the ISO/IEC SC29/WG11 technical committee. Availability and implementation The software is available on Github: https://github.com/tparidae/AQUa. Contact tom.paridaens@ugent.be.
Collapse
Affiliation(s)
- Tom Paridaens
- Department of Electronics and Information Systems, IDLab, Ghent University - IMEC, Ghent, Belgium
| | - Glenn Van Wallendael
- Department of Electronics and Information Systems, IDLab, Ghent University - IMEC, Ghent, Belgium
| | - Wesley De Neve
- Department of Electronics and Information Systems, IDLab, Ghent University - IMEC, Ghent, Belgium.,Center for Biotech Data Science, Ghent University Global Campus, Songdo, Incheon 305-701, Republic of Korea.,Image and Video Systems Lab, KAIST, Daejeon 305-732, Republic of Korea
| | - Peter Lambert
- Department of Electronics and Information Systems, IDLab, Ghent University - IMEC, Ghent, Belgium
| |
Collapse
|
25
|
Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017; 18:183-194. [PMID: 26966283 DOI: 10.1093/bib/bbw011] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2015] [Indexed: 12/30/2022] Open
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
Collapse
Affiliation(s)
- Idoia Ochoa
- Electrical Engineering department, 350 Serra Mall, Stanford, CA, USA
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Rachel Goldfeder
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Euan Ashley
- Department of Medicine, Stanford University, Stanford, CA, USA.,Stanford Center for Inherited Cardiovascular Disease, Stanford University, Stanford, CA, USA.,Department of Genetics, Stanford University, Stanford, CA, USA
| |
Collapse
|
26
|
Sarkar H, Patro R. Quark enables semi-reference-based compression of RNA-seq data. Bioinformatics 2017; 33:3380-3386. [DOI: 10.1093/bioinformatics/btx428] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2016] [Accepted: 06/29/2017] [Indexed: 12/19/2022] Open
Affiliation(s)
- Hirak Sarkar
- Department of Computer Science, Stony Brook University Stony Brook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University Stony Brook, NY, USA
| |
Collapse
|
27
|
Cánovas R, Moffat A, Turpin A. CSAM: Compressed SAM format. Bioinformatics 2016; 32:3709-3716. [PMID: 27540265 DOI: 10.1093/bioinformatics/btw543] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2016] [Revised: 08/13/2016] [Accepted: 08/15/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Next generation sequencing machines produce vast amounts of genomic data. For the data to be useful, it is essential that it can be stored and manipulated efficiently. This work responds to the combined challenge of compressing genomic data, while providing fast access to regions of interest, without necessitating decompression of whole files. RESULTS We describe CSAM (Compressed SAM format), a compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files, as well as supporting fast access to the compressed information. They generate more compact lossless representations than BAM, which is currently the preferred lossless compressed SAM-equivalent format; and are self-contained, that is, they do not depend on any external resources to compress or decompress SAM files. AVAILABILITY AND IMPLEMENTATION An implementation is available at https://github.com/rcanovas/libCSAM CONTACT: canovas-ba@lirmm.frSupplementary Information: Supplementary data is available at Bioinformatics online.
Collapse
Affiliation(s)
- Rodrigo Cánovas
- L.I.R.M.M. and Institut Biologie Computationnelle, Université de Montpellier, Montpellier Cedex 5, CNRS F-34392, France.,Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
| | - Alistair Moffat
- Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
| | - Andrew Turpin
- Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
| |
Collapse
|
28
|
Ochoa I, No A, Hernaez M, Weissman T. CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores. PROCEEDINGS. INFORMATION THEORY WORKSHOP 2016; 2016:121-125. [PMID: 29806047 DOI: 10.1109/itw.2016.7606808] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the data are comprised of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. As a result, lossy compression of the quality scores has recently been proposed to alleviate the storage costs. Further, it has been shown that lossy compression, at some specific rates, can achieve a performance on variant calling similar to that achieved with the lossless compressed data. We propose CROMqs, a new lossy compressor for the quality scores with the property of "infinitesimal successive refinability". This property allows the decoder to decompress the data iteratively without the need of agreeing with the encoder on a specific rate prior to compression. This characteristic is particularly amenable in practice, as in most cases the appropriate rate at which the lossy compressor should operate can not be established prior to compression. Further, this property can be of interest in scenarios involving streaming of genomic data. CROMqs is the first infinitesimal successive refinement lossy compressor for the quality scores in the literature, and we show that it obtains a comparable rate-distortion performance to previously proposed algorithms. Moreover, we also show that CROMqs achieves a comparable performance on variant calling to that of the lossless compressed data.
Collapse
Affiliation(s)
- I Ochoa
- Department of Electrical Engineering, Stanford University, Stanford CA 94305
| | - A No
- Department of Electrical Engineering, Stanford University, Stanford CA 94305
| | - M Hernaez
- Department of Electrical Engineering, Stanford University, Stanford CA 94305
| | - T Weissman
- Department of Electrical Engineering, Stanford University, Stanford CA 94305
| |
Collapse
|
29
|
Greenfield DL, Stegle O, Rrustemi A. GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics 2016; 32:3124-3132. [DOI: 10.1093/bioinformatics/btw385] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2015] [Accepted: 06/15/2016] [Indexed: 12/30/2022] Open
|
30
|
Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Denoising of Quality Scores for Boosted Inference and Reduced Storage. PROCEEDINGS. DATA COMPRESSION CONFERENCE 2016; 2016:251-260. [PMID: 29098178 PMCID: PMC5663231 DOI: 10.1109/dcc.2016.92] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the raw data are comprised of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate the storage costs, but reducing the noise in the quality scores has remained largely unexplored. This raw data is processed in order to identify variants; these genetic variants are used in important applications, such as medical decision making. Thus improving the performance of the variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores and we demonstrate improved inference with this denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work as well as a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
Collapse
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
| | - Rachel Goldfeder
- Department of Medicine, Stanford University, Stanford, CA, 94305
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
| | - Euan Ashley
- Department of Medicine, Stanford University, Stanford, CA, 94305
| |
Collapse
|
31
|
Hernaez M, Ochoa I, Weissman T. A cluster-based approach to compression of Quality Scores. PROCEEDINGS. DATA COMPRESSION CONFERENCE 2016; 2016:261-270. [PMID: 29057318 DOI: 10.1109/dcc.2016.49] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
Collapse
Affiliation(s)
- Mikel Hernaez
- Department of Electrical Engineering, Stanford University
| | - Idoia Ochoa
- Department of Electrical Engineering, Stanford University
| | | |
Collapse
|
32
|
Alberti C, Daniels N, Hernaez M, Voges J, Goldfeder RL, Hernandez-Lopez AA, Mattavelli M, Berger B. An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. PROCEEDINGS. DATA COMPRESSION CONFERENCE 2016; 2016:221-230. [PMID: 28845445 PMCID: PMC5568552 DOI: 10.1109/dcc.2016.39] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
This paper provides the specification and an initial validation of an evaluation framework for the comparison of lossy compressors of genome sequencing quality values. The goal is to define reference data, test sets, tools and metrics that shall be used to evaluate the impact of lossy compression of quality values on human genome variant calling. The functionality of the framework is validated referring to two state-of-the-art genomic compressors. This work has been spurred by the current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.
Collapse
Affiliation(s)
- Claudio Alberti
- École Polytechnique Fédérale de Lausanne EPFL - SCI-STI-MM Lausanne, VD Switzerland
| | - Noah Daniels
- Massachusetts Institute of Technology CSAIL & Mathematics. Cambridge, MA
| | | | - Jan Voges
- Institut fuer Informationsverarbeitung (TNT) Leibniz Universität Hannover, Germany
| | | | | | - Marco Mattavelli
- École Polytechnique Fédérale de Lausanne EPFL - SCI-STI-MM Lausanne, VD Switzerland
| | - Bonnie Berger
- Massachusetts Institute of Technology CSAIL & Mathematics. Cambridge, MA
| |
Collapse
|
33
|
Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 2015; 14:1630002. [PMID: 26846812 DOI: 10.1142/s0219720016300021] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Advances in high throughput sequencing technologies and reduction in cost of sequencing have led to exponential growth in high throughput DNA sequence data. This growth has posed challenges such as storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genome and reads compression. Algorithms are categorized as referential or reference free. Experimental results and comparative analysis of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
Collapse
Affiliation(s)
- Muhammad Sardaraz
- 1 Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Muhammad Tahir
- 1 Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Ataul Aziz Ikram
- 2 Department of Electrical Engineering, National University, Islamabad 44000, Pakistan
| |
Collapse
|
34
|
Xie X, Zhou S, Guan J. CoGI: Towards Compressing Genomes as an Image. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1275-1285. [PMID: 26671800 DOI: 10.1109/tcbb.2015.2430331] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Genomic science is now facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transferring. It is desirable to compress data to reduce storage and transferring cost, and thus to boost data distribution and utilization efficiency. Up to now, a number of algorithms / tools have been developed for compressing genomic sequences. Unlike the existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (the abbreviation of Compressing Genomes as an Image) for genome compression, which transforms the genomic sequences to a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based compressor or a reference-free compressor. For the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors GReEn and RLZ-opt in both compression ratio and compression efficiency. It also achieves comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM--one state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip--a general-purpose and widely-used compressor, in both compression speed and compression ratio. So, CoGI can serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://admis.fudan.edu.cn/projects/cogi.htm.
Collapse
|
35
|
Ochoa I, Hernaez M, Weissman T. Aligned genomic data compression via improved modeling. J Bioinform Comput Biol 2015; 12:1442002. [PMID: 25395305 DOI: 10.1142/s0219720014420025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoneys MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
Collapse
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
| | | | | |
Collapse
|
36
|
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 2015; 16:288. [PMID: 26370285 PMCID: PMC4570262 DOI: 10.1186/s12859-015-0709-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2015] [Accepted: 08/17/2015] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. RESULTS We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses. CONCLUSIONS LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. LEON is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.
Collapse
Affiliation(s)
- Gaëtan Benoit
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Claire Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | | | - Erwan Drezen
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Thibault Dayris
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
| | - Raluca Uricaru
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- University of Bordeaux, CBiB, Bordeaux, F-33000, France.
| | - Guillaume Rizk
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| |
Collapse
|
37
|
Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 2015; 31:2770-7. [PMID: 25910696 PMCID: PMC4547610 DOI: 10.1093/bioinformatics/btv248] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2014] [Revised: 04/11/2015] [Accepted: 04/20/2015] [Indexed: 01/24/2023] Open
Abstract
MOTIVATION The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. RESULTS We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. AVAILABILITY AND IMPLEMENTATION Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/∼ckingsf/software/mince. CONTACT carlk@cs.cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA and
| | - Carl Kingsford
- Department Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
| |
Collapse
|
38
|
Malysa G, Hernaez M, Ochoa I, Rao M, Ganesan K, Weissman T. QVZ: lossy compression of quality values. Bioinformatics 2015; 31:3122-9. [PMID: 26026138 DOI: 10.1093/bioinformatics/btv330] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 04/09/2015] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values presented in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit. RESULTS The proposed algorithm QVZ exhibits better rate-distortion performance than the previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better performance in the genotyping than data compressed with previously proposed algorithms, in the sense that for a similar rate, a genotyping closer to that achieved with the original quality values is obtained. AVAILABILITY AND IMPLEMENTATION QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz. CONTACT mhernaez@stanford.edu or gmalysa@stanford.edu or iochoa@stanford.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Greg Malysa
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Idoia Ochoa
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Milind Rao
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Karthik Ganesan
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
| |
Collapse
|
39
|
Yu YW, Yorukoglu D, Peng J, Berger B. Quality score compression improves genotyping accuracy. Nat Biotechnol 2015; 33:240-3. [PMID: 25748910 PMCID: PMC4439189 DOI: 10.1038/nbt.3170] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Affiliation(s)
- Y. William Yu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| | - Deniz Yorukoglu
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
| | - Jian Peng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA; Department of Mathematics at MIT
| |
Collapse
|
40
|
Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics 2015; 31:1920-8. [PMID: 25649622 PMCID: PMC4481695 DOI: 10.1093/bioinformatics/btv071] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2014] [Accepted: 01/29/2015] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. RESULTS We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. AVAILABILITY AND IMPLEMENTATION Source code and binaries freely available for download at http://www.cs.cmu.edu/∼ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.
Collapse
Affiliation(s)
- Carl Kingsford
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| | - Rob Patro
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| |
Collapse
|
41
|
Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. ACTA ACUST UNITED AC 2014; 31:1389-95. [PMID: 25536966 DOI: 10.1093/bioinformatics/btu844] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 12/17/2014] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. RESULTS We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gbp dataset into only 5.31 GB of space. AVAILABILITY AND IMPLEMENTATION http://sun.aei.polsl.pl/orcom under a free license. CONTACT sebastian.deorowicz@polsl.pl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| | - Sebastian Deorowicz
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| | - Łukasz Roguski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| |
Collapse
|