1
Sun H, Zheng Y, Xie H, Ma H, Zhong C, Yan M, Liu X, Wang G. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping. Bioinformatics 2024; 40:btae323. PMID: 38759114; PMCID: PMC11139522; DOI: 10.1093/bioinformatics/btae323.
Abstract
MOTIVATION Quality scores data (QSD) account for 70% of compressed FASTQ files obtained from short- and long-read sequencing technologies. Designing effective compressors for QSD that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills the above requirements well. PQSDC is based on two core components: a parallel sequence-partition model designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model to enhance the compression ratio. In addition, the PQSDC algorithm is designed to be highly concurrent on multicore CPU clusters. RESULTS We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to the baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well. AVAILABILITY AND IMPLEMENTATION The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
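Editor's note: the run-length prediction idea at the heart of this abstract can be illustrated with a minimal Python sketch. This is not the authors' four-level model; the 0/rank mapping below is an illustrative assumption showing why repeat-heavy quality strings become easier to entropy-code after mapping.

    # Minimal sketch of run-length prediction mapping for quality scores.
    # Illustrative only; PQSDC's actual model is a four-level parallel scheme.
    def map_runs(qualities):
        """Map each character to 0 if it repeats its predecessor,
        otherwise to its 1-based rank in the quality alphabet."""
        rank = {c: i + 1 for i, c in enumerate(sorted(set(qualities)))}
        out, prev = [], None
        for c in qualities:
            out.append(0 if c == prev else rank[c])
            prev = c
        return out

    print(map_runs("IIIIIHHIIIII#IIII"))  # runs collapse to zeros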
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China
- Huidong Ma
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Cheng Zhong
- Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
- Meng Yan
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Gang Wang
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
2
Chen S, Chen Y, Wang Z, Qin W, Zhang J, Nand H, Zhang J, Li J, Zhang X, Liang X, Xu M. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front Genet 2023; 14:1260531. PMID: 37811144; PMCID: PMC10552150; DOI: 10.3389/fgene.2023.1260531.
Abstract
With the increasing throughput of modern sequencing instruments, the cost of storing and transmitting sequencing data has also increased dramatically. Although many tools have been developed to compress sequencing data, there is still a need for a compressor with a higher compression ratio. We present a two-step framework for compressing sequencing data in this paper. The first step repacks the original data into a binary stream, while the second step compresses the stream with an LZMA encoder. We develop a new strategy to encode the original file into a highly compressible stream for the LZMA encoder. In addition, an FPGA-accelerated LZMA implementation was developed to speed up the second step. As a demonstration, we present repaq, a lossless, reference-free compressor for FASTQ files. We introduce a multifile redundancy elimination method, which is very useful for compressing paired-end sequencing data. According to our test results, the compression ratio of repaq is much higher than that of other FASTQ compressors. For some deep sequencing data, the compression ratio of repaq can exceed 25, almost four times that of gzip. The framework presented in this paper can also be applied to develop new tools for compressing other sequencing data. The open-source code of repaq is available at: https://github.com/OpenGene/repaq.
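Editor's note: the two-step framework can be sketched with Python's standard lzma module: first repack the data into a compact binary stream, then hand that stream to an LZMA encoder. The 2-bit base packing below is a stand-in assumption for repaq's encoder, which is considerably more elaborate.

    # Step 1: repack bases into a 2-bit binary stream (illustrative stand-in).
    # Step 2: compress the packed stream with an LZMA encoder.
    import lzma

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_2bit(seq):
        buf = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for j, base in enumerate(seq[i:i + 4]):
                byte |= CODE[base] << (2 * j)
            buf.append(byte)
        return bytes(buf)

    reads = "ACGTACGTGGGTTTACGATCGATCGG" * 1000
    packed = pack_2bit(reads)
    print(len(reads), len(packed), len(lzma.compress(packed, preset=9)))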
Affiliation(s)
- Shifu Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Yaru Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Wenjian Qin
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Jing Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Heera Nand
- Xilinx Inc., San Jose, CA, United States
- Jun Li
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Xiaoni Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Mingyan Xu
- HaploX Biotechnology, Shenzhen, Guangdong, China
3
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. PMID: 37668049; DOI: 10.1093/bib/bbad320.
Abstract
The Sequence Alignment/Map (SAM) format is the standard text format for recording alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper surveys the broad field of SAM. First, the SAM format is introduced in the context of the overall sequencing-analysis workflow. Then, existing work is systematically classified into generation, compression and application, and the relevant SAM tools are examined in detail. Lastly, a summary and some thoughts on future directions are provided.
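Editor's note: for quick orientation, a SAM alignment line carries eleven mandatory tab-separated fields. The record below is a hypothetical example, and the sketch only illustrates the field layout the review describes.

    # Parsing the 11 mandatory SAM fields from one alignment line
    # (header lines begin with '@' and are skipped in real files).
    line = "r001\t99\tchr1\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\tIIIIIIIIIIIIIIIII"
    f = line.split("\t")
    record = {
        "QNAME": f[0], "FLAG": int(f[1]), "RNAME": f[2], "POS": int(f[3]),
        "MAPQ": int(f[4]), "CIGAR": f[5], "RNEXT": f[6], "PNEXT": int(f[7]),
        "TLEN": int(f[8]), "SEQ": f[9], "QUAL": f[10],
    }
    print(record["RNAME"], record["POS"], record["CIGAR"])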
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
- School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
4
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. PMID: 36476810; PMCID: PMC10204111; DOI: 10.1038/s41576-022-00551-z.
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that the analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
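Editor's note: one of the data-sketching techniques the review refers to can be shown in a few lines. A bottom-k MinHash signature trades exactness for memory, estimating the Jaccard similarity of two k-mer sets from small signatures; the parameters (k = 8, 64-hash signatures) are illustrative assumptions.

    # Bottom-k MinHash: estimate Jaccard similarity from fixed-size signatures.
    import hashlib

    def kmers(seq, k=8):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def signature(items, size=64):
        h = sorted(int(hashlib.blake2b(x.encode(), digest_size=8).hexdigest(), 16)
                   for x in items)
        return h[:size]

    a = kmers("ACGTTGCA" * 40 + "GGGTTTCCC")
    b = kmers("ACGTTGCA" * 40 + "CCCAAAGGG")
    sa, sb = set(signature(a)), set(signature(b))
    union = set(sorted(sa | sb)[:64])
    print("estimated Jaccard:", len(union & sa & sb) / len(union))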
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
5
Chen H, Chen J, Lu Z, Wang R. CMIC: an efficient quality score compressor with random access functionality. BMC Bioinformatics 2022; 23:294. PMID: 35870880; PMCID: PMC9308261; DOI: 10.1186/s12859-022-04837-1.
Abstract
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality scores is a key and difficult aspect of file compression, and the majority of current quality score compression methods do not support random access. It is therefore worthwhile to investigate a lossless quality score compressor with a high compression rate, fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive, random-access-supported compressor for lossless compression of quality score sequences. CMIC is an acronym of its four processing steps: classification, mapping, indexing and compression. The experimental results show that our compressor performs well in terms of compression rate on all the tested datasets, reducing file sizes by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors in most of the tested cases. In terms of random-access speed, CMIC is faster than LCQS, the existing compressor that provides random access to compressed quality scores. Conclusions CMIC is a compressor especially designed for quality score sequences, with good performance in terms of compression rate, compression speed, decompression speed, and random-access speed. CMIC can be obtained at https://github.com/Humonex/Cmic. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04837-1.
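Editor's note: random access over compressed quality scores generally rests on block-wise compression plus an offset index. The sketch below, using zlib blocks, is a simplified assumption and not CMIC's actual classification/mapping pipeline.

    # Block-wise compression with an offset index for random access.
    import zlib

    def compress_blocks(lines, block=4):
        blob, index = bytearray(), []
        for i in range(0, len(lines), block):
            c = zlib.compress("\n".join(lines[i:i + block]).encode())
            index.append((len(blob), len(c)))   # (offset, length) per block
            blob += c
        return bytes(blob), index

    def get_line(blob, index, n, block=4):
        off, ln = index[n // block]             # jump straight to one block
        return zlib.decompress(blob[off:off + ln]).decode().split("\n")[n % block]

    quals = ["IIIHH%dGGIII" % (i % 10) for i in range(20)]
    blob, idx = compress_blocks(quals)
    print(get_line(blob, idx, 13))              # decompresses one block only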
6
Niu Y, Ma M, Li F, Liu X, Shi G. ACO: lossless quality score compression based on adaptive coding order. BMC Bioinformatics 2022; 23:219. PMID: 35672665; PMCID: PMC9175485; DOI: 10.1186/s12859-022-04712-z.
Abstract
Background With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing is dropping rapidly, leading to exponential growth of genome data. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging. Results In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlated trajectory implied by the sequencing process. Combined with adaptive arithmetic coding and an improved in-context strategy, ACO achieves state-of-the-art quality score compression with moderate complexity for next-generation sequencing (NGS) data. Conclusions This competence makes ACO a strong candidate tool for quality score compression; it has been adopted by AVS (the Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/ACO.
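Editor's note: the benefit of adaptive context modelling feeding an arithmetic coder can be approximated without implementing the coder itself, since an adaptive arithmetic coder spends close to -log2 p(symbol | context) bits per symbol. The order-1 context and Laplace smoothing below are deliberate simplifications, not ACO's model.

    # Approximate code length under an adaptive order-1 context model.
    from collections import defaultdict
    from math import log2

    def adaptive_bits(stream, alphabet=64):
        counts = defaultdict(lambda: defaultdict(int))
        bits, prev = 0.0, None
        for s in stream:
            ctx = counts[prev]
            p = (ctx[s] + 1) / (sum(ctx.values()) + alphabet)  # Laplace smoothing
            bits += -log2(p)          # what an arithmetic coder would spend
            ctx[s] += 1               # update the model adaptively
            prev = s
        return bits

    qs = "IIIIIHHHIIIIGGIIIIHHH" * 200
    print(adaptive_bits(qs) / 8, "bytes (approx.)")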
Affiliation(s)
- Yi Niu
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- The Pengcheng Lab, Shenzhen 518055, China
- Mingming Ma
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Fu Li
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Guangming Shi
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
7
Huo H, Liu P, Wang C, Jiang H, Vitter JS. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 2022; 38:335-343. PMID: 34524416; DOI: 10.1093/bioinformatics/btab655.
Abstract
MOTIVATION Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and the tables REF and Rγ, to achieve minimal space usage and fast retrieval on compressed FASTQ files. Experiments conducted on real, publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16, 1.44-35.57 and 1.3-55.4 times, respectively. For extracting records from FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
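Editor's note: the Burrows-Wheeler transform underlying such indexes can be written in a few lines. The naive rotation sort below is for illustration only; production indexes build the BWT via suffix arrays and add wavelet trees and succinct structures on top.

    # Naive Burrows-Wheeler transform: sort all rotations, take the last column.
    def bwt(text):
        text += "$"                              # unique end-of-string sentinel
        rots = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(r[-1] for r in rots)

    print(bwt("GATTACA"))  # clusters similar contexts, aiding compression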
Affiliation(s)
- Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Pengfei Liu
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Chenhui Wang
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Hongbo Jiang
- Department of Computer Science, Xidian University, Xi'an 710071, China
8
Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374; DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Moreover, long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS, provided as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools on benchmark datasets, including new long-read sequencing data. As a result, FastqCLS outperformed the existing tools in compression ratio for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dohyeon Lee
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
9
Cho M, No A. FCLQC: fast and concurrent lossless quality scores compressor. BMC Bioinformatics 2021; 22:606. PMID: 34930110; PMCID: PMC8686598; DOI: 10.1186/s12859-021-04516-7.
Abstract
BACKGROUND Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: it can take thousands of seconds to compress a 1 GB file. There are also desirable features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. RESULTS This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors at compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a compression-ratio degradation of at most 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving compression ratios better by at least 2.08% (4.69% on average). Moreover, the speed of random-access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads. CONCLUSION The superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.
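Editor's note: the concurrency pattern is straightforward to sketch: split the quality stream into chunks and compress them in parallel workers, so throughput scales with cores. FCLQC itself is written in Rust; the Python/zlib version below is only an assumption-laden illustration of the idea.

    # Parallel chunk compression; throughput grows with worker count.
    import zlib
    from multiprocessing import Pool

    def compress_chunk(chunk):
        return zlib.compress(chunk, 9)

    if __name__ == "__main__":
        data = b"\n".join(b"IIIIHHHGGIIII" for _ in range(100000))
        step = max(1, len(data) // 8)
        chunks = [data[i:i + step] for i in range(0, len(data), step)]
        with Pool(8) as pool:
            parts = pool.map(compress_chunk, chunks)
        print(len(data), "->", sum(len(p) for p in parts))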
Affiliation(s)
- Minhyeok Cho
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea.
10
Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1958-1969. PMID: 31869798; DOI: 10.1109/tcbb.2019.2959273.
Abstract
The dropping cost of sequencing human DNA has allowed for the fast development of several projects around the world generating huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study, we address the compression of SAM files, the standard output files for DNA alignment. We specifically study lossy compression techniques used for the quality values reported in the SAM file and analyze the impact of such lossy techniques on the CRAM format. We present a series of experiments using a dataset corresponding to individual NA12878 at three different coverage depths. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
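Editor's note: Illumina binning, one of the lossy techniques compared in the study, maps the full Phred scale onto eight representative values. The table below follows the commonly cited 8-level scheme; treat the exact boundaries as an assumption rather than a normative specification.

    # Quality binning in the spirit of Illumina's 8-level scheme.
    BINS = [(2, 9, 6), (10, 19, 15), (20, 24, 22), (25, 29, 27),
            (30, 34, 33), (35, 39, 37), (40, 93, 40)]  # (low, high, representative)

    def bin_quality(q):
        if q <= 1:
            return q                  # no-call / lowest scores kept as-is
        for low, high, rep in BINS:
            if low <= q <= high:
                return rep
        raise ValueError(q)

    print([bin_quality(q) for q in [38, 12, 40, 27, 2, 33]])
    # fewer distinct values -> markedly smaller compressed files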
11
Chandak S, Tatwawadi K, Ochoa I, Hernaez M, Weissman T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 2020; 35:2674-2676. PMID: 30535063; DOI: 10.1093/bioinformatics/bty1015.
Abstract
MOTIVATION High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. RESULTS In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole-genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
12
Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020; 35:2066-2074. PMID: 30407482; DOI: 10.1093/bioinformatics/bty936.
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroups those that have the same minimizer. Within each subgroup, a contig is constructed. Pairs of contigs derived from the subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole-genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the read-order-preserving mode. These excellent results are mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
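Editor's note: a minimal (w, k)-minimizer sketch shows why reads sharing minimizers are good candidates for the same subgroup: the smallest k-mer in each window is stable under shifts, so overlapping reads tend to share it. The window and k sizes below are illustrative.

    # (w, k)-minimizers: smallest k-mer in each window of w consecutive k-mers.
    def minimizers(seq, w=5, k=7):
        kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}

    read_a = "ACGTGGTCAACGTTTGACGAAGT"
    read_b = "GGTCAACGTTTGACGAAGTCCAT"
    print(minimizers(read_a) & minimizers(read_b))  # shared -> likely overlap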
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China
- School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
- St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
13
Fu J, Ke B, Dong S. LCQS: an efficient lossless compression tool of quality scores with random access functionality. BMC Bioinformatics 2020; 21:109. PMID: 32183707; PMCID: PMC7079445; DOI: 10.1186/s12859-020-3428-7.
Abstract
Background Advanced sequencing machines have dramatically sped up the generation of genomic data, making efficient compression of sequencing data urgent and significant. Compression of quality scores, the most difficult part of the standard sequencing data format FASTQ, has become a conundrum in the development of FASTQ compression. Existing lossless compressors of quality scores mainly utilize patterns generated by specific sequencers and complex context-modeling techniques to address the problem of low compression ratio. However, these compressors suffer from weak robustness, meaning unstable or even failed results on some sequencing files, and from slow compression speed. Meanwhile, some compressors attempt to construct a fine-grained index structure to speed up random-access decompression, but do so at the sacrifice of compression speed and at the expense of large index files, which makes them inefficient and impractical. Therefore, an efficient lossless compressor of quality scores with strong robustness, a high compression ratio, and fast compression and random-access decompression speed is urgently needed and of great significance. Results In this paper, based on the idea of maximizing the use of hardware resources, we propose LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results reveal that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. Furthermore, LCQS presents strong robustness on all the test datasets, with compression speedups of up to 29.1x, file-size reductions of up to 28.78%, and random-access decompression speedups of up to 2.1x. LCQS also exhibits strong scalability: compression speed increases almost linearly with the size of the input dataset. Conclusion The ability to handle all different kinds of quality scores and the superiority in compression ratio and compression speed make LCQS an efficient and advanced lossless quality score compressor, along with its strength of fast random-access decompression. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial usage.
Affiliation(s)
- Jiabing Fu
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Bixin Ke
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Shoubin Dong
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
14
Hernandez-Lopez AA, Alberti C, Mattavelli M. Toward a Dynamic Threshold for Quality Score Distortion in Reference-Based Alignment. J Comput Biol 2020; 27:288-300. PMID: 31891532; DOI: 10.1089/cmb.2019.0333.
Abstract
The intrinsic high-entropy sequence metadata, known as quality scores, is largely the cause of the substantial size of sequence data files. Yet, there is no consensus on a viable reduction of the resolution of the quality score scale, arguably because of collateral side effects. In this article, we leverage the penalty functions of the HISAT2 aligner to rebin the quality score scale in such a way as to avoid any impact on sequence alignment, identifying along the way a distortion threshold for "safe" quality score representation. We tested our findings on whole-genome and RNA-seq data, and contrasted the results with three methods for lossy compression of the quality scores.
Affiliation(s)
- Marco Mattavelli
- École Polytechnique Fédérale de Lausanne, EPFL, Lausanne, Switzerland
15
Abstract
The amount of data produced by modern sequencing instruments that needs to be stored is huge. It is therefore not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect, and the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. The drawbacks of the proposed method are its large memory and time requirements.
16
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. DOI: 10.1142/s0219720019400110.
Abstract
Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which are therefore difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
17
Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. PMID: 31757199; PMCID: PMC6873394; DOI: 10.1186/s12859-019-2883-5.
Abstract
MOTIVATION Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data that is unfortunately not matched by growth in storage, making compression a necessity. Most of the entropy of NGS data lies in the quality values associated with each read, and those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses like SNP calling. RESULTS We use the FM-index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision for SNP-calling pipelines while reducing the entropy of quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. AVAILABILITY https://github.com/yhhshb/yalff.
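Editor's note: the smoothing principle can be sketched as follows: when every k-mer covering a base appears in a trusted dictionary, the base is corroborated and its quality can be flattened to one representative value. The toy set and k = 4 below stand in for YALFF's FM-indexed dictionary.

    # Dictionary-based quality smoothing (toy stand-in for an FM-indexed set).
    def smooth(seq, qual, trusted, k=4, rep="I"):
        out = list(qual)
        for i in range(len(seq)):
            cover = [seq[j:j + k]
                     for j in range(max(0, i - k + 1), min(i, len(seq) - k) + 1)]
            if cover and all(km in trusted for km in cover):
                out[i] = rep          # corroborated base -> flatten its quality
        return "".join(out)

    trusted = {"ACGT", "CGTA", "GTAC", "TACG"}
    print(smooth("ACGTACGT", "5#9I2B8+", trusted))  # smoother strings compress better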
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
18
Bonfield JK, McCarthy SA, Durbin R. Crumble: reference free lossy compression of sequence quality values. Bioinformatics 2019; 35:337-339. PMID: 29992288; PMCID: PMC6330002; DOI: 10.1093/bioinformatics/bty608.
Abstract
Motivation The bulk of the space taken up by NGS sequencing CRAM files consists of per-base quality values, most of which are unnecessary for variant calling, offering an opportunity for space saving. Results On the Syndip test set, a 17-fold reduction in the quality-storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- James K Bonfield
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Shane A McCarthy
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
- Richard Durbin
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
19
Fischer-Hwang I, Ochoa I, Weissman T, Hernaez M. Denoising of Aligned Genomic Data. Sci Rep 2019; 9:15067. PMID: 31636330; PMCID: PMC6803637; DOI: 10.1038/s41598-019-51418-z.
Abstract
Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines and is increasingly being used in clinical settings to aid medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole-genome sequencing (WGS) datasets. In the WGS dataset, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.
Affiliation(s)
- Irena Fischer-Hwang
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA.
- Idoia Ochoa
- University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, 61801, USA
- Tsachy Weissman
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA
- Mikel Hernaez
- University of Illinois Urbana-Champaign, Carl R. Woese Institute for Genomic Biology, Urbana, 61801, USA
20
Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2019; 34:2748-2756. PMID: 29617939; DOI: 10.1093/bioinformatics/bty205.
Abstract
Motivation The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on specific needs. Results FaStore in lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis of the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, which is especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains while preserving the essential genomic information and without affecting variant calling performance. Availability and implementation FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lukasz Roguski
- Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain
- Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
21
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
22
Yang R, Chen X, Ochoa I. MassComp, a lossless compressor for mass spectrometry data. BMC Bioinformatics 2019; 20:368. PMID: 31262247; PMCID: PMC6604446; DOI: 10.1186/s12859-019-2962-7.
Abstract
BACKGROUND Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has increased significantly in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate their dissemination. RESULTS We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general-purpose compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. CONCLUSIONS The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to lessen the storage burden and facilitate the exchange and dissemination of omics data.
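Editor's note: a generic illustration of why numeric specialization pays off for (m/z)-intensity pairs: delta-encoding the monotone m/z axis leaves small, repetitive residuals that a downstream coder compresses far better than raw text. This sketch is an assumption-level illustration, not MassComp's actual encoder.

    # Delta-encode fixed-point m/z values; intensities stored as floats.
    import struct
    import zlib

    pairs = [(100.0 + 0.01 * i, 200.0 / (1 + i % 7)) for i in range(1000)]
    mz = [int(round(m * 1000)) for m, _ in pairs]           # fixed-point m/z
    deltas = [mz[0]] + [b - a for a, b in zip(mz, mz[1:])]  # small residuals
    payload = struct.pack("<%di" % len(deltas), *deltas)
    payload += struct.pack("<%df" % len(pairs), *(i for _, i in pairs))
    print(len(payload), "->", len(zlib.compress(payload, 9)))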
Affiliation(s)
- Ruochen Yang
- Electrical Engineering Department, University of Southern California, Los Angeles, CA USA
- Xi Chen
- Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Idoia Ochoa
- Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
23
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. PMID: 29444237; DOI: 10.1093/bioinformatics/btx639.
Abstract
Motivation Next-generation sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves a 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact schandak@stanford.edu. Supplementary information Supplementary material is available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
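Editor's note: hash-based reordering can be sketched by bucketing reads on the hash of a fixed-length anchor substring, so reads from nearby genomic positions tend to end up adjacent. The anchor choice and CRC hash are illustrative assumptions; the paper's hashed substring indices are more refined.

    # Bucket reads by a hashed anchor substring, then emit buckets in order.
    import zlib
    from collections import defaultdict

    def reorder(reads, anchor=8):
        buckets = defaultdict(list)
        for r in reads:
            buckets[zlib.crc32(r[:anchor].encode())].append(r)
        return [r for key in sorted(buckets) for r in buckets[key]]

    reads = ["ACGTACGTTTGAC", "GGGTCCATTACGA", "ACGTACGTAAGTC", "GGGTCCATGGGCA"]
    print(reorder(reads))  # similar reads become adjacent, boosting compression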
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
24
Voges J, Fotouhi A, Ostermann J, Külekci MO. A Two-Level Scheme for Quality Score Compression. J Comput Biol 2018; 25:1141-1151. PMID: 30059248; DOI: 10.1089/cmb.2018.0065.
Abstract
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable better management of computational resources; thus, in practice and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data, so it might be necessary to keep them; in addition to lossy compression, this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows users to work with lossy quality scores in routine analysis without sacrificing the capability of reaching the original quality scores when further investigation is required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are compressed separately, such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the user to extract the original quality scores. Experiments on real data reveal that downstream analysis with the lossy part, spending only 0.49 bits per quality score on average, shows competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.
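Editor's note: the two-level idea can be captured in a tiny sketch: each quality score becomes a (coarse, residual) tuple, with the coarse stream serving as the lossy working view and the residual stream restoring exactness on demand. The uniform bin width of 8 is an illustrative assumption, not QScomp's decomposition.

    # Two-level representation: lossy coarse bins plus exact residuals.
    def decompose(qs, width=8):
        return [(q // width, q % width) for q in qs]

    def reconstruct(pairs, width=8):
        return [b * width + r for b, r in pairs]

    qs = [38, 12, 40, 27, 2, 33]
    pairs = decompose(qs)
    print([b for b, _ in pairs])       # first level: lossy view for routine work
    assert reconstruct(pairs) == qs    # second level restores original scores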
Affiliation(s)
- Jan Voges
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
- Ali Fotouhi
- Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey
- Jörn Ostermann
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
25
Voges J, Ostermann J, Hernaez M. CALQ: compression of quality values of aligned sequencing data. Bioinformatics 2018; 34:1650-1658. PMID: 29186284; PMCID: PMC5946873; DOI: 10.1093/bioinformatics/btx737.
Abstract
Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the precision and recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe better average variant calling performance than with the original data, while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling recall and precision for most of the analyzed datasets. Availability and implementation CALQ is written in C++ and can be downloaded from https://github.com/voges/calq. Contact voges@tnt.uni-hannover.de or mhernaez@illinois.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jan Voges, Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
- Jörn Ostermann, Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
- Mikel Hernaez, Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, IL, USA
26. Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017;18:183-194. PMID: 26966283; DOI: 10.1093/bib/bbw011.
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation, and though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets and simulated data, we analyze how accurate the output of variant calling is, both for the original data and for data previously compressed lossily. We show that lossy compression can significantly alleviate storage requirements while maintaining variant-calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant-calling performance superior to that obtained with the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
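The evaluation methodology reduces to a small harness: call variants once with the original quality scores and once with the lossily reconstructed ones, then score each call set against a gold standard. A minimal sketch follows (illustrative only; the study runs full variant-calling pipelines, and the toy call sets below are made-up data):

```python
def precision_recall(calls: set, truth: set) -> tuple[float, float]:
    """Score a variant call set against a gold-standard set."""
    tp = len(calls & truth)
    precision = tp / len(calls) if calls else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Variants keyed as (chromosome, position, ref allele, alt allele).
gold = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
original_calls = {("chr1", 1000, "A", "G"), ("chr1", 3000, "G", "A")}
lossy_calls = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}

for name, calls in [("original", original_calls), ("lossy", lossy_calls)]:
    p, r = precision_recall(calls, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```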
Affiliation(s)
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Rachel Goldfeder, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Euan Ashley, Department of Medicine; Stanford Center for Inherited Cardiovascular Disease; and Department of Genetics, Stanford University, Stanford, CA, USA
27.
Abstract
There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery, including DNA-sequencing technologies and analysis algorithms, need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized, allowing it to be targeted with much greater therapeutic precision.
Affiliation(s)
- Euan A Ashley, Center for Inherited Cardiovascular Disease, Falk Cardiovascular Research Building, Stanford Medicine, 870 Quarry Road, Stanford, California 94305, USA
28. Long R, Hernaez M, Ochoa I, Weissman T. GeneComp, a new reference-based compressor for SAM files. Proceedings, Data Compression Conference 2017;2017:330-339. PMID: 29046896; PMCID: PMC5641594; DOI: 10.1109/dcc.2017.76.
Abstract
The affordability of DNA sequencing has led to unprecedented volumes of genomic data, which must be stored, processed, and analyzed. The most popular format for aligned genomic data is the SAM format, which contains information such as alignments, quality values, and more. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files that can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor, GeneComp, allows the user to perform lossy compression of the quality scores, which have been shown to occupy more than half of the losslessly compressed file. We show that GeneComp overall achieves better compression ratios than previously proposed algorithms when working in lossless mode.
Affiliation(s)
- Reggy Long, Department of Electrical Engineering, Stanford University, CA
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, CA
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, CA; Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, CA
29. Ochoa I, No A, Hernaez M, Weissman T. CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores. Proceedings, Information Theory Workshop 2016;2016:121-125. PMID: 29806047; DOI: 10.1109/itw.2016.7606808.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the data comprise nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. As a result, lossy compression of the quality scores has recently been proposed to alleviate storage costs. Further, it has been shown that lossy compression, at some specific rates, can achieve variant-calling performance similar to that achieved with losslessly compressed data. We propose CROMqs, a new lossy compressor for quality scores with the property of "infinitesimal successive refinability". This property allows the decoder to decompress the data iteratively without having to agree with the encoder on a specific rate prior to compression. This characteristic is particularly appealing in practice, as in most cases the appropriate rate at which the lossy compressor should operate cannot be established prior to compression. The property can also be of interest in scenarios involving streaming of genomic data. CROMqs is the first infinitesimal successive refinement lossy compressor for quality scores in the literature, and we show that it obtains rate-distortion performance comparable to previously proposed algorithms. Moreover, we show that CROMqs achieves variant-calling performance comparable to that of losslessly compressed data.
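Successive refinability can be made concrete with a toy bitplane coder, a sketch under assumed parameters rather than CROMqs's actual algorithm: the encoder emits refinement layers from most to least significant bit, and the decoder may stop after any prefix of layers and still reconstruct an approximation.

```python
# Toy successive-refinement scheme (illustrative only).
BITS = 6  # Phred scores 0..63 fit in 6 bits

def encode_bitplanes(scores: list[int]) -> list[list[int]]:
    """One refinement layer per bit position, MSB first."""
    return [[(q >> b) & 1 for q in scores] for b in range(BITS - 1, -1, -1)]

def decode_prefix(planes: list[list[int]], n_planes: int) -> list[int]:
    """Reconstruct from only the first n_planes refinement layers."""
    approx = [0] * len(planes[0])
    for b, plane in zip(range(BITS - 1, -1, -1), planes[:n_planes]):
        approx = [a | (bit << b) for a, bit in zip(approx, plane)]
    return approx

scores = [40, 37, 12, 2]
planes = encode_bitplanes(scores)
print(decode_prefix(planes, 2))  # coarse approximation: [32, 32, 0, 0]
print(decode_prefix(planes, 6))  # exact: [40, 37, 12, 2]
```

Because every prefix of the layer stream is a valid description, the rate never has to be negotiated up front, which is the practical appeal the abstract highlights.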
Affiliation(s)
- I Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- A No, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- M Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- T Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
30. Greenfield DL, Stegle O, Rrustemi A. GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics 2016;32:3124-3132. DOI: 10.1093/bioinformatics/btw385.
31. Roguski Ł, Ribeca P. CARGO: effective format-free compressed storage of genomic information. Nucleic Acids Res 2016;44:e114. PMID: 27131376; PMCID: PMC4937321; DOI: 10.1093/nar/gkw318.
Abstract
The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework for automatically generating software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.
Affiliation(s)
- Łukasz Roguski, Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain; Experimental and Health Sciences, Universitat Pompeu Fabra, Carrer Doctor Aiguader 88, Barcelona 08003, Spain
- Paolo Ribeca, Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain; Integrative Biology, The Pirbright Institute, Ash Road, Pirbright, Woking, GU24 0NF, United Kingdom
32. Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Denoising of Quality Scores for Boosted Inference and Reduced Storage. Proceedings, Data Compression Conference 2016;2016:251-260. PMID: 29098178; PMCID: PMC5663231; DOI: 10.1109/dcc.2016.92.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the raw data comprise nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of quality scores has recently been proposed to alleviate storage costs, but reducing the noise in the quality scores has remained largely unexplored. The raw data are processed to identify variants, and these genetic variants are used in important applications such as medical decision making; improving the performance of variant calling by reducing the noise contained in the quality scores is therefore important. We propose a denoising scheme that reduces the noise of the quality scores, and we demonstrate improved inference with the denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, so significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work, as well as a Supplement with all the results, is available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
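A minimal sketch of the denoise-then-measure idea (the median filter below is an assumed stand-in, not the paper's denoiser): smoothing a read's quality track lowers its empirical entropy, which is exactly what makes subsequent lossless compression cheaper.

```python
import math
from collections import Counter

def median3(xs: list[int]) -> list[int]:
    """Toy denoiser: a width-3 median filter that removes isolated
    dips in a read's quality track (illustrative stand-in only)."""
    out = list(xs)
    for i in range(1, len(xs) - 1):
        out[i] = sorted(xs[i - 1 : i + 2])[1]
    return out

def entropy_bits(xs: list[int]) -> float:
    """Empirical zero-order entropy in bits per symbol."""
    counts, n = Counter(xs), len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

read = [38, 38, 11, 38, 38, 37, 9, 37, 37]
smoothed = median3(read)
print(entropy_bits(read), entropy_bits(smoothed))  # entropy drops
```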
Affiliation(s)
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Rachel Goldfeder, Department of Medicine, Stanford University, Stanford, CA, 94305
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Euan Ashley, Department of Medicine, Stanford University, Stanford, CA, 94305
33. Hernaez M, Ochoa I, Weissman T. A cluster-based approach to compression of quality scores. Proceedings, Data Compression Conference 2016;2016:261-270. PMID: 29057318; DOI: 10.1109/dcc.2016.49.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Storing and sharing these large volumes of data has become a major bottleneck in the discovery and analysis of the genetic variants used for medical inference. As such, lossless compression of the data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine's reliability when calling a particular base pair. Thus, to further improve compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data are used for genetic variant discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on variant callers. Previously proposed algorithms do not perform well under all metrics and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, assuming all quality score sequences come from a mixture of Markov models, and then quantizes the quality scores based on those Markov models. Each quantizer targets a specific distortion to optimize overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms previously proposed methods under all analyzed distortion metrics. This suggests that the effect the proposed algorithm has on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced in the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
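The clustering step can be illustrated with a toy mixture of two fixed Markov models (all models, thresholds, and names below are assumptions; the paper learns the mixture and the per-cluster quantizers from the data): each read is assigned to the model under which its symbol transitions are most likely.

```python
import math

# Per-cluster transition probabilities over a binarized alphabet
# (hypothetical values; real models would be learned, e.g., via EM).
TRANS = {
    "smooth": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9},
    "noisy":  {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5},
}

def to_symbols(read: list[int], threshold: int = 20) -> list[int]:
    """Map quality scores to a small alphabet (here: low/high)."""
    return [1 if q >= threshold else 0 for q in read]

def log_likelihood(symbols: list[int], trans: dict) -> float:
    """First-order Markov log-likelihood of the symbol sequence."""
    return sum(math.log(trans[(a, b)]) for a, b in zip(symbols, symbols[1:]))

def assign_cluster(read: list[int]) -> str:
    symbols = to_symbols(read)
    return max(TRANS, key=lambda k: log_likelihood(symbols, TRANS[k]))

print(assign_cluster([38, 38, 37, 38, 36, 38]))  # -> "smooth"
print(assign_cluster([38, 5, 33, 4, 35, 6]))     # -> "noisy"
```

Once reads are grouped this way, each cluster can be quantized with its own distortion target, which is what drives the overall rate-distortion gain the abstract reports.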
Affiliation(s)
- Mikel Hernaez, Department of Electrical Engineering, Stanford University
- Idoia Ochoa, Department of Electrical Engineering, Stanford University
34. Alberti C, Daniels N, Hernaez M, Voges J, Goldfeder RL, Hernandez-Lopez AA, Mattavelli M, Berger B. An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. Proceedings, Data Compression Conference 2016;2016:221-230. PMID: 28845445; PMCID: PMC5568552; DOI: 10.1109/dcc.2016.39.
Abstract
This paper provides the specification and an initial validation of an evaluation framework for comparing lossy compressors of genome sequencing quality values. The goal is to define the reference data, test sets, tools, and metrics that shall be used to evaluate the impact of lossy compression of quality values on human genome variant calling. The functionality of the framework is validated with reference to two state-of-the-art genomic compressors. This work has been spurred by the current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.
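The framework's spirit, fixing data, tools, and metrics up front so compressors can be compared reproducibly, can be sketched as a declarative specification plus the summary statistic it reports (every name below is illustrative, not the committee's actual specification):

```python
# Hypothetical experiment specification in the spirit of the framework.
EVALUATION_SPEC = {
    "reference": "GRCh37",
    "test_sets": ["NA12878 WGS", "NA12878 WES"],
    "pipelines": ["bwa-mem + GATK HaplotypeCaller"],
    "gold_standard": "GIAB high-confidence calls",
    "metrics": ["bits per quality value", "precision", "recall", "F-score"],
}

def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure summarizing variant-calling accuracy."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_score(0.98, 0.95))  # single-number summary for one compressor
```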
Affiliation(s)
- Claudio Alberti, École Polytechnique Fédérale de Lausanne, EPFL SCI-STI-MM, Lausanne, VD, Switzerland
- Noah Daniels, Massachusetts Institute of Technology, CSAIL & Mathematics, Cambridge, MA
- Jan Voges, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Germany
- Marco Mattavelli, École Polytechnique Fédérale de Lausanne, EPFL SCI-STI-MM, Lausanne, VD, Switzerland
- Bonnie Berger, Massachusetts Institute of Technology, CSAIL & Mathematics, Cambridge, MA