1
Sun H, Zheng Y, Xie H, Ma H, Zhong C, Yan M, Liu X, Wang G. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping. Bioinformatics 2024; 40:btae323. PMID: 38759114; PMCID: PMC11139522; DOI: 10.1093/bioinformatics/btae323.
Abstract
MOTIVATION Quality scores data (QSD) account for 70% of compressed FASTQ files obtained from short- and long-read sequencing technologies. Designing effective compressors for QSD that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which fulfills the above requirements well. PQSDC is based on two core components: a parallel sequence-partition model designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model to enhance the compression ratio. In addition, the PQSDC algorithm is designed to be highly concurrent on multicore CPU clusters. RESULTS We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets, comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared to the baselines, the maximum improvement of PQSDC reaches 7.06% in average compression ratio and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, the maximum improvement of PQSDC reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well. AVAILABILITY AND IMPLEMENTATION The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
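Editor's note: the run-length prediction idea at the heart of this abstract can be illustrated with a minimal Python sketch. This is not the authors' four-level model; the 0/rank mapping below is an illustrative assumption showing why repeat-heavy quality strings become easier to entropy-code after mapping.

    # Minimal sketch of run-length prediction mapping for quality scores.
    # Illustrative only; PQSDC's actual model is a four-level parallel scheme.
    def map_runs(qualities):
        """Map each character to 0 if it repeats its predecessor,
        otherwise to its 1-based rank in the quality alphabet."""
        rank = {c: i + 1 for i, c in enumerate(sorted(set(qualities)))}
        out, prev = [], None
        for c in qualities:
            out.append(0 if c == prev else rank[c])
            prev = c
        return out

    print(map_runs("IIIIIHHIIIII#IIII"))  # runs collapse to zeros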
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China
- Huidong Ma
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Cheng Zhong
- Key Laboratory of Parallel, Distributed and Intelligent of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
- Meng Yan
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Gang Wang
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
2
Chen S, Chen Y, Wang Z, Qin W, Zhang J, Nand H, Zhang J, Li J, Zhang X, Liang X, Xu M. Efficient sequencing data compression and FPGA acceleration based on a two-step framework. Front Genet 2023; 14:1260531. PMID: 37811144; PMCID: PMC10552150; DOI: 10.3389/fgene.2023.1260531.
Abstract
With the increasing throughput of modern sequencing instruments, the cost of storing and transmitting sequencing data has also increased dramatically. Although many tools have been developed to compress sequencing data, there is still a need for a compressor with a higher compression ratio. We present a two-step framework for compressing sequencing data in this paper. The first step repacks the original data into a binary stream, while the second step compresses the stream with an LZMA encoder. We develop a new strategy to encode the original file into a highly compressible stream for the LZMA encoder. In addition, an FPGA-accelerated LZMA implementation was developed to speed up the second step. As a demonstration, we present repaq, a lossless, reference-free compressor for FASTQ files. We introduce a multifile redundancy elimination method, which is very useful for compressing paired-end sequencing data. According to our test results, the compression ratio of repaq is much higher than that of other FASTQ compressors. For some deep sequencing data, the compression ratio of repaq can exceed 25, almost four times that of gzip. The framework presented in this paper can also be applied to develop new tools for compressing other sequencing data. The open-source code of repaq is available at: https://github.com/OpenGene/repaq.
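Editor's note: the two-step framework can be sketched with Python's standard lzma module: first repack the data into a compact binary stream, then hand that stream to an LZMA encoder. The 2-bit base packing below is a stand-in assumption for repaq's encoder, which is considerably more elaborate.

    # Step 1: repack bases into a 2-bit binary stream (illustrative stand-in).
    # Step 2: compress the packed stream with an LZMA encoder.
    import lzma

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_2bit(seq):
        buf = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for j, base in enumerate(seq[i:i + 4]):
                byte |= CODE[base] << (2 * j)
            buf.append(byte)
        return bytes(buf)

    reads = "ACGTACGTGGGTTTACGATCGATCGG" * 1000
    packed = pack_2bit(reads)
    print(len(reads), len(packed), len(lzma.compress(packed, preset=9)))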
Affiliation(s)
- Shifu Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Yaru Chen
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Wenjian Qin
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Jing Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Heera Nand
- Xilinx Inc., San Jose, CA, United States
- Jun Li
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Xiaoni Zhang
- HaploX Biotechnology, Shenzhen, Guangdong, China
- Mingyan Xu
- HaploX Biotechnology, Shenzhen, Guangdong, China
3
Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023; 24:bbad320. PMID: 37668049; DOI: 10.1093/bib/bbad320.
Abstract
The Sequence Alignment/Map (SAM) format is the standard text format for recording alignment information. Alignment is the core of sequencing analysis, and downstream tasks accept mapping results for further processing. Given the rapid development of the sequencing industry, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper surveys the broad field of SAM. First, the SAM format is introduced in the context of the overall sequencing-analysis workflow. Then, existing work is systematically classified into generation, compression and application, and the relevant SAM tools are examined in detail. Lastly, a summary and some thoughts on future directions are provided.
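Editor's note: for quick orientation, a SAM alignment line carries eleven mandatory tab-separated fields. The record below is a hypothetical example, and the sketch only illustrates the field layout the review describes.

    # Parsing the 11 mandatory SAM fields from one alignment line
    # (header lines begin with '@' and are skipped in real files).
    line = "r001\t99\tchr1\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\tIIIIIIIIIIIIIIIII"
    f = line.split("\t")
    record = {
        "QNAME": f[0], "FLAG": int(f[1]), "RNAME": f[2], "POS": int(f[3]),
        "MAPQ": int(f[4]), "CIGAR": f[5], "RNEXT": f[6], "PNEXT": int(f[7]),
        "TLEN": int(f[8]), "SEQ": f[9], "QUAL": f[10],
    }
    print(record["RNAME"], record["POS"], record["CIGAR"])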
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Yongshun Gong
- School of Software, Shandong University, 250100, Jinan, China
- Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410086, Changsha, China
4
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. PMID: 36476810; PMCID: PMC10204111; DOI: 10.1038/s41576-022-00551-z.
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that the analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
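Editor's note: one of the data-sketching techniques the review refers to can be shown in a few lines. A bottom-k MinHash signature trades exactness for memory, estimating the Jaccard similarity of two k-mer sets from small signatures; the parameters (k = 8, 64-hash signatures) are illustrative assumptions.

    # Bottom-k MinHash: estimate Jaccard similarity from fixed-size signatures.
    import hashlib

    def kmers(seq, k=8):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def signature(items, size=64):
        h = sorted(int(hashlib.blake2b(x.encode(), digest_size=8).hexdigest(), 16)
                   for x in items)
        return h[:size]

    a = kmers("ACGTTGCA" * 40 + "GGGTTTCCC")
    b = kmers("ACGTTGCA" * 40 + "CCCAAAGGG")
    sa, sb = set(signature(a)), set(signature(b))
    union = set(sorted(sa | sb)[:64])
    print("estimated Jaccard:", len(union & sa & sb) / len(union))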
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
5
Chen H, Chen J, Lu Z, Wang R. CMIC: an efficient quality score compressor with random access functionality. BMC Bioinformatics 2022; 23:294. PMID: 35870880; PMCID: PMC9308261; DOI: 10.1186/s12859-022-04837-1.
Abstract
Background Over the past few decades, the emergence and maturation of new technologies have substantially reduced the cost of genome sequencing. As a result, the amount of genomic data that needs to be stored and transmitted has grown exponentially. For the standard sequencing data format, FASTQ, compression of the quality scores is a key and difficult aspect of file compression, and the majority of current quality score compression methods do not support random access. It is therefore worthwhile to investigate a lossless quality score compressor with a high compression rate, fast compression and decompression speed, and support for random access. Results In this paper, we propose CMIC, an adaptive, random-access-supported compressor for lossless compression of quality score sequences. CMIC is an acronym of its four processing steps: classification, mapping, indexing and compression. The experimental results show that our compressor performs well in terms of compression rate on all the tested datasets, reducing file sizes by up to 21.91% when compared with LCQS. In terms of compression speed, CMIC is better than all other compressors in most of the tested cases. In terms of random-access speed, CMIC is faster than LCQS, the existing compressor that provides random access to compressed quality scores. Conclusions CMIC is a compressor especially designed for quality score sequences, with good performance in terms of compression rate, compression speed, decompression speed, and random-access speed. CMIC can be obtained at https://github.com/Humonex/Cmic. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04837-1.
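Editor's note: random access over compressed quality scores generally rests on block-wise compression plus an offset index. The sketch below, using zlib blocks, is a simplified assumption and not CMIC's actual classification/mapping pipeline.

    # Block-wise compression with an offset index for random access.
    import zlib

    def compress_blocks(lines, block=4):
        blob, index = bytearray(), []
        for i in range(0, len(lines), block):
            c = zlib.compress("\n".join(lines[i:i + block]).encode())
            index.append((len(blob), len(c)))   # (offset, length) per block
            blob += c
        return bytes(blob), index

    def get_line(blob, index, n, block=4):
        off, ln = index[n // block]             # jump straight to one block
        return zlib.decompress(blob[off:off + ln]).decode().split("\n")[n % block]

    quals = ["IIIHH%dGGIII" % (i % 10) for i in range(20)]
    blob, idx = compress_blocks(quals)
    print(get_line(blob, idx, 13))              # decompresses one block only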
6
Niu Y, Ma M, Li F, Liu X, Shi G. ACO: lossless quality score compression based on adaptive coding order. BMC Bioinformatics 2022; 23:219. PMID: 35672665; PMCID: PMC9175485; DOI: 10.1186/s12859-022-04712-z.
Abstract
Background With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing is dropping rapidly, leading to exponential growth of genome data. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging. Results In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main objective of ACO is to traverse the quality scores adaptively along the most correlated trajectory implied by the sequencing process. Combined with adaptive arithmetic coding and an improved in-context strategy, ACO achieves state-of-the-art quality score compression with moderate complexity for next-generation sequencing (NGS) data. Conclusions This competence makes ACO a strong candidate tool for quality score compression; it has been adopted by AVS (the Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/ACO.
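Editor's note: the benefit of adaptive context modelling feeding an arithmetic coder can be approximated without implementing the coder itself, since an adaptive arithmetic coder spends close to -log2 p(symbol | context) bits per symbol. The order-1 context and Laplace smoothing below are deliberate simplifications, not ACO's model.

    # Approximate code length under an adaptive order-1 context model.
    from collections import defaultdict
    from math import log2

    def adaptive_bits(stream, alphabet=64):
        counts = defaultdict(lambda: defaultdict(int))
        bits, prev = 0.0, None
        for s in stream:
            ctx = counts[prev]
            p = (ctx[s] + 1) / (sum(ctx.values()) + alphabet)  # Laplace smoothing
            bits += -log2(p)          # what an arithmetic coder would spend
            ctx[s] += 1               # update the model adaptively
            prev = s
        return bits

    qs = "IIIIIHHHIIIIGGIIIIHHH" * 200
    print(adaptive_bits(qs) / 8, "bytes (approx.)")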
Affiliation(s)
- Yi Niu
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- The Pengcheng Lab, Shenzhen 518055, China
- Mingming Ma
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Fu Li
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Guangming Shi
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
7
Huo H, Liu P, Wang C, Jiang H, Vitter JS. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 2022; 38:335-343. PMID: 34524416; DOI: 10.1093/bioinformatics/btab655.
Abstract
MOTIVATION Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data. These data are generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and the tables REF and Rγ, to achieve minimal space usage and fast retrieval on compressed FASTQ files. Experiments conducted on real, publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16, 1.44-35.57 and 1.3-55.4 times, respectively. For extracting records from FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
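Editor's note: the Burrows-Wheeler transform underlying such indexes can be written in a few lines. The naive rotation sort below is for illustration only; production indexes build the BWT via suffix arrays and add wavelet trees and succinct structures on top.

    # Naive Burrows-Wheeler transform: sort all rotations, take the last column.
    def bwt(text):
        text += "$"                              # unique end-of-string sentinel
        rots = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(r[-1] for r in rots)

    print(bwt("GATTACA"))  # clusters similar contexts, aiding compression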
Affiliation(s)
- Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Pengfei Liu
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Chenhui Wang
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Hongbo Jiang
- Department of Computer Science, Xidian University, Xi'an 710071, China
8
Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374; DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Moreover, long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model for reducing FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS, provided as a Docker image for ease of installation and execution. We compared our method with existing major FASTQ compression tools on benchmark datasets, including new long-read sequencing data. As a result, FastqCLS outperformed the existing tools in compression ratio for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dohyeon Lee
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
9
Cho M, No A. FCLQC: fast and concurrent lossless quality scores compressor. BMC Bioinformatics 2021; 22:606. PMID: 34930110; PMCID: PMC8686598; DOI: 10.1186/s12859-021-04516-7.
Abstract
BACKGROUND Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: it can take thousands of seconds to compress a 1 GB file. There are also desirable features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality. RESULTS This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors at compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC shows at least a 31x compression speed improvement in all settings, with a compression-ratio degradation of at most 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving compression ratios better by at least 2.08% (4.69% on average). Moreover, the speed of random-access decompression also outperforms the others. The concurrency of FCLQC is implemented using Rust; the performance gain increases near-linearly with the number of threads. CONCLUSION The superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is freely available for non-commercial usage.
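Editor's note: the concurrency pattern is straightforward to sketch: split the quality stream into chunks and compress them in parallel workers, so throughput scales with cores. FCLQC itself is written in Rust; the Python/zlib version below is only an assumption-laden illustration of the idea.

    # Parallel chunk compression; throughput grows with worker count.
    import zlib
    from multiprocessing import Pool

    def compress_chunk(chunk):
        return zlib.compress(chunk, 9)

    if __name__ == "__main__":
        data = b"\n".join(b"IIIIHHHGGIIII" for _ in range(100000))
        step = max(1, len(data) // 8)
        chunks = [data[i:i + step] for i in range(0, len(data), step)]
        with Pool(8) as pool:
            parts = pool.map(compress_chunk, chunks)
        print(len(data), "->", sum(len(p) for p in parts))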
Affiliation(s)
- Minhyeok Cho
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea.
10
Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1958-1969. PMID: 31869798; DOI: 10.1109/tcbb.2019.2959273.
Abstract
The dropping cost of sequencing human DNA has allowed for the fast development of several projects around the world generating huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study, we address the compression of SAM files, the standard output files for DNA alignment. We specifically study lossy compression techniques used for the quality values reported in the SAM file and analyze the impact of such lossy techniques on the CRAM format. We present a series of experiments using a dataset corresponding to individual NA12878 at three different coverage depths. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
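Editor's note: Illumina binning, one of the lossy techniques compared in the study, maps the full Phred scale onto eight representative values. The table below follows the commonly cited 8-level scheme; treat the exact boundaries as an assumption rather than a normative specification.

    # Quality binning in the spirit of Illumina's 8-level scheme.
    BINS = [(2, 9, 6), (10, 19, 15), (20, 24, 22), (25, 29, 27),
            (30, 34, 33), (35, 39, 37), (40, 93, 40)]  # (low, high, representative)

    def bin_quality(q):
        if q <= 1:
            return q                  # no-call / lowest scores kept as-is
        for low, high, rep in BINS:
            if low <= q <= high:
                return rep
        raise ValueError(q)

    print([bin_quality(q) for q in [38, 12, 40, 27, 2, 33]])
    # fewer distinct values -> markedly smaller compressed files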
11
Chandak S, Tatwawadi K, Ochoa I, Hernaez M, Weissman T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 2020; 35:2674-2676. PMID: 30535063; DOI: 10.1093/bioinformatics/bty1015.
Abstract
MOTIVATION High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to fully exploit the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression. RESULTS In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long-read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole-genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. AVAILABILITY AND IMPLEMENTATION SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
12
Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020; 35:2066-2074. PMID: 30407482; DOI: 10.1093/bioinformatics/bty936.
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to reduce the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroups those that have the same minimizer. Within each subgroup, a contig is constructed. Pairs of contigs derived from the subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole-genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the read-order-preserving mode. These excellent results are mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
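Editor's note: a minimal (w, k)-minimizer sketch shows why reads sharing minimizers are good candidates for the same subgroup: the smallest k-mer in each window is stable under shifts, so overlapping reads tend to share it. The window and k sizes below are illustrative.

    # (w, k)-minimizers: smallest k-mer in each window of w consecutive k-mers.
    def minimizers(seq, w=5, k=7):
        kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}

    read_a = "ACGTGGTCAACGTTTGACGAAGT"
    read_b = "GGTCAACGTTTGACGAAGTCCAT"
    print(minimizers(read_a) & minimizers(read_b))  # shared -> likely overlap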
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China
- School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
- St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
13
Fu J, Ke B, Dong S. LCQS: an efficient lossless compression tool of quality scores with random access functionality. BMC Bioinformatics 2020; 21:109. PMID: 32183707; PMCID: PMC7079445; DOI: 10.1186/s12859-020-3428-7.
Abstract
Background Advanced sequencing machines have dramatically sped up the generation of genomic data, making efficient compression of sequencing data urgent and significant. Compression of quality scores, the most difficult part of the standard sequencing data format FASTQ, has become a conundrum in the development of FASTQ compression. Existing lossless compressors of quality scores mainly utilize patterns generated by specific sequencers and complex context-modeling techniques to address the problem of low compression ratio. However, these compressors suffer from weak robustness, meaning unstable or even failed results on some sequencing files, and from slow compression speed. Meanwhile, some compressors attempt to construct a fine-grained index structure to speed up random-access decompression, but do so at the sacrifice of compression speed and at the expense of large index files, which makes them inefficient and impractical. Therefore, an efficient lossless compressor of quality scores with strong robustness, a high compression ratio, and fast compression and random-access decompression speed is urgently needed and of great significance. Results In this paper, based on the idea of maximizing the use of hardware resources, we propose LCQS, a lossless compression tool specialized for quality scores. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results reveal that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. Furthermore, LCQS presents strong robustness on all the test datasets, with compression speedups of up to 29.1x, file-size reductions of up to 28.78%, and random-access decompression speedups of up to 2.1x. LCQS also exhibits strong scalability: compression speed increases almost linearly with the size of the input dataset. Conclusion The ability to handle all different kinds of quality scores and the superiority in compression ratio and compression speed make LCQS an efficient and advanced lossless quality score compressor, along with its strength of fast random-access decompression. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial usage.
Affiliation(s)
- Jiabing Fu
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Bixin Ke
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Shoubin Dong
- School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou 510006, China
- Communication & Computer Network Lab of Guangdong, South China University of Technology, Wushan Road, Guangzhou 510006, China
14
Hernandez-Lopez AA, Alberti C, Mattavelli M. Toward a Dynamic Threshold for Quality Score Distortion in Reference-Based Alignment. J Comput Biol 2020; 27:288-300. PMID: 31891532; DOI: 10.1089/cmb.2019.0333.
Abstract
The intrinsic high-entropy sequence metadata, known as quality scores, is largely the cause of the substantial size of sequence data files. Yet, there is no consensus on a viable reduction of the resolution of the quality score scale, arguably because of collateral side effects. In this article, we leverage the penalty functions of the HISAT2 aligner to rebin the quality score scale in such a way as to avoid any impact on sequence alignment, identifying along the way a distortion threshold for "safe" quality score representation. We tested our findings on whole-genome and RNA-seq data, and contrasted the results with three methods for lossy compression of the quality scores.
Affiliation(s)
- Marco Mattavelli
- École Polytechnique Fédérale de Lausanne, EPFL, Lausanne, Switzerland
15
Abstract
The amount of data produced by modern sequencing instruments that needs to be stored is huge. It is therefore not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect, and the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. The drawbacks of the proposed method are its large memory and time requirements.
16
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. DOI: 10.1142/s0219720019400110.
Abstract
Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which are therefore difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
17
Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. PMID: 31757199; PMCID: PMC6873394; DOI: 10.1186/s12859-019-2883-5.
Abstract
MOTIVATION Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data that is unfortunately not matched by growth in storage, making compression a necessity. Most of the entropy of NGS data lies in the quality values associated with each read, and those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses like SNP calling. RESULTS We use the FM-index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision for SNP-calling pipelines while reducing the entropy of quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. AVAILABILITY https://github.com/yhhshb/yalff.
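Editor's note: the smoothing principle can be sketched as follows: when every k-mer covering a base appears in a trusted dictionary, the base is corroborated and its quality can be flattened to one representative value. The toy set and k = 4 below stand in for YALFF's FM-indexed dictionary.

    # Dictionary-based quality smoothing (toy stand-in for an FM-indexed set).
    def smooth(seq, qual, trusted, k=4, rep="I"):
        out = list(qual)
        for i in range(len(seq)):
            cover = [seq[j:j + k]
                     for j in range(max(0, i - k + 1), min(i, len(seq) - k) + 1)]
            if cover and all(km in trusted for km in cover):
                out[i] = rep          # corroborated base -> flatten its quality
        return "".join(out)

    trusted = {"ACGT", "CGTA", "GTAC", "TACG"}
    print(smooth("ACGTACGT", "5#9I2B8+", trusted))  # smoother strings compress better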
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
18
Bonfield JK, McCarthy SA, Durbin R. Crumble: reference free lossy compression of sequence quality values. Bioinformatics 2019; 35:337-339. PMID: 29992288; PMCID: PMC6330002; DOI: 10.1093/bioinformatics/bty608.
Abstract
Motivation The bulk of the space taken up by NGS sequencing CRAM files consists of per-base quality values, most of which are unnecessary for variant calling, offering an opportunity for space saving. Results On the Syndip test set, a 17-fold reduction in the quality-storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- James K Bonfield
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Shane A McCarthy
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
- Richard Durbin
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Department of Genetics, University of Cambridge, Cambridge, UK
19
Fischer-Hwang I, Ochoa I, Weissman T, Hernaez M. Denoising of Aligned Genomic Data. Sci Rep 2019; 9:15067. PMID: 31636330; PMCID: PMC6803637; DOI: 10.1038/s41598-019-51418-z.
Abstract
Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines and is increasingly being used in clinical settings to aid medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual-chromosome and whole-genome sequencing (WGS) datasets. In the WGS dataset, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.
Affiliation(s)
- Irena Fischer-Hwang
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA.
- Idoia Ochoa
- University of Illinois Urbana-Champaign, Department of Electrical and Computer Engineering, Urbana, 61801, USA
- Tsachy Weissman
- Stanford University, Department of Electrical Engineering, Stanford, 94305, USA
- Mikel Hernaez
- University of Illinois Urbana-Champaign, Carl R. Woese Institute for Genomic Biology, Urbana, 61801, USA
20
Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2019; 34:2748-2756. PMID: 29617939; DOI: 10.1093/bioinformatics/bty205.
Abstract
Motivation The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on specific needs. Results FaStore in lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis of the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, which is especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains while preserving the essential genomic information and without affecting variant calling performance. Availability and implementation FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lukasz Roguski
- Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain
- Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
21
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
22
Yang R, Chen X, Ochoa I. MassComp, a lossless compressor for mass spectrometry data. BMC Bioinformatics 2019; 20:368. PMID: 31262247; PMCID: PMC6604446; DOI: 10.1186/s12859-019-2962-7.
Abstract
BACKGROUND Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has increased significantly in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate their dissemination. RESULTS We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general-purpose compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp. CONCLUSIONS The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to lessen the storage burden and facilitate the exchange and dissemination of omics data.
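Editor's note: a generic illustration of why numeric specialization pays off for (m/z)-intensity pairs: delta-encoding the monotone m/z axis leaves small, repetitive residuals that a downstream coder compresses far better than raw text. This sketch is an assumption-level illustration, not MassComp's actual encoder.

    # Delta-encode fixed-point m/z values; intensities stored as floats.
    import struct
    import zlib

    pairs = [(100.0 + 0.01 * i, 200.0 / (1 + i % 7)) for i in range(1000)]
    mz = [int(round(m * 1000)) for m, _ in pairs]           # fixed-point m/z
    deltas = [mz[0]] + [b - a for a, b in zip(mz, mz[1:])]  # small residuals
    payload = struct.pack("<%di" % len(deltas), *deltas)
    payload += struct.pack("<%df" % len(pairs), *(i for _, i in pairs))
    print(len(payload), "->", len(zlib.compress(payload, 9)))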
Affiliation(s)
- Ruochen Yang
- Electrical Engineering Department, University of Southern California, Los Angeles, CA USA
- Xi Chen
- Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Idoia Ochoa
- Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
23
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. PMID: 29444237; DOI: 10.1093/bioinformatics/btx639.
Abstract
Motivation Next-generation sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves a 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact schandak@stanford.edu. Supplementary information Supplementary material is available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
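Editor's note: hash-based reordering can be sketched by bucketing reads on the hash of a fixed-length anchor substring, so reads from nearby genomic positions tend to end up adjacent. The anchor choice and CRC hash are illustrative assumptions; the paper's hashed substring indices are more refined.

    # Bucket reads by a hashed anchor substring, then emit buckets in order.
    import zlib
    from collections import defaultdict

    def reorder(reads, anchor=8):
        buckets = defaultdict(list)
        for r in reads:
            buckets[zlib.crc32(r[:anchor].encode())].append(r)
        return [r for key in sorted(buckets) for r in buckets[key]]

    reads = ["ACGTACGTTTGAC", "GGGTCCATTACGA", "ACGTACGTAAGTC", "GGGTCCATGGGCA"]
    print(reorder(reads))  # similar reads become adjacent, boosting compression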
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
24
Voges J, Fotouhi A, Ostermann J, Külekci MO. A Two-Level Scheme for Quality Score Compression. J Comput Biol 2018; 25:1141-1151. PMID: 30059248; DOI: 10.1089/cmb.2018.0065.
Abstract
Previous studies on quality score compression can be classified into two main lines: lossy schemes and lossless schemes. Lossy schemes enable better management of computational resources; thus, in practice and for preliminary analyses, bioinformaticians may prefer to work with a lossy quality score representation. However, the original quality scores might be required for a deeper analysis of the data, so it might be necessary to keep them; in addition to lossy compression, this requires lossless compression as well. We developed a space-efficient hierarchical representation of quality scores, QScomp, which allows users to work with lossy quality scores in routine analysis without sacrificing the capability of reaching the original quality scores when further investigation is required. Each quality score is represented by a tuple through a novel decomposition. The first and second dimensions of these tuples are compressed separately, such that the first-level compression is a lossy scheme. The compressed information of the second dimension allows the user to extract the original quality scores. Experiments on real data reveal that downstream analysis with the lossy part, spending only 0.49 bits per quality score on average, shows competitive performance, and that the total space usage with the inclusion of the compressed second dimension is comparable to the performance of competing lossless schemes.
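Editor's note: the two-level idea can be captured in a tiny sketch: each quality score becomes a (coarse, residual) tuple, with the coarse stream serving as the lossy working view and the residual stream restoring exactness on demand. The uniform bin width of 8 is an illustrative assumption, not QScomp's decomposition.

    # Two-level representation: lossy coarse bins plus exact residuals.
    def decompose(qs, width=8):
        return [(q // width, q % width) for q in qs]

    def reconstruct(pairs, width=8):
        return [b * width + r for b, r in pairs]

    qs = [38, 12, 40, 27, 2, 33]
    pairs = decompose(qs)
    print([b for b, _ in pairs])       # first level: lossy view for routine work
    assert reconstruct(pairs) == qs    # second level restores original scores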
Affiliation(s)
- Jan Voges
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
- Ali Fotouhi
- Electronics and Communication Engineering Department, Istanbul Technical University, Istanbul, Turkey
- Jörn Ostermann
- Institut für Informationsverarbeitung, Leibniz Universität Hannover, Hannover, Germany
25
Voges J, Ostermann J, Hernaez M. CALQ: compression of quality values of aligned sequencing data. Bioinformatics 2018; 34:1650-1658. PMID: 29186284; PMCID: PMC5946873; DOI: 10.1093/bioinformatics/btx737.
Abstract
Motivation Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the precision and recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe better average variant calling performance than with the original data, while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling recall and precision for most of the analyzed datasets. Availability and implementation CALQ is written in C++ and can be downloaded from https://github.com/voges/calq. Contact voges@tnt.uni-hannover.de or mhernaez@illinois.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jan Voges, Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
- Jörn Ostermann, Fakultät für Elektrotechnik und Informatik, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Hannover, Germany
- Mikel Hernaez, Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana-Champaign, IL, USA
26. Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017;18:183-194. PMID: 26966283; DOI: 10.1093/bib/bbw011.
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation, and though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets and simulated data, we analyze how accurate the output of variant calling is, both for the original data and for data previously compressed lossily. We show that lossy compression can significantly alleviate storage requirements while maintaining variant-calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant-calling performance superior to that obtained with the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
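The evaluation methodology reduces to a small harness: call variants once with the original quality scores and once with the lossily reconstructed ones, then score each call set against a gold standard. A minimal sketch follows (illustrative only; the study runs full variant-calling pipelines, and the toy call sets below are made-up data):

```python
def precision_recall(calls: set, truth: set) -> tuple[float, float]:
    """Score a variant call set against a gold-standard set."""
    tp = len(calls & truth)
    precision = tp / len(calls) if calls else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

# Variants keyed as (chromosome, position, ref allele, alt allele).
gold = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}
original_calls = {("chr1", 1000, "A", "G"), ("chr1", 3000, "G", "A")}
lossy_calls = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")}

for name, calls in [("original", original_calls), ("lossy", lossy_calls)]:
    p, r = precision_recall(calls, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```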
Affiliation(s)
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Rachel Goldfeder, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Euan Ashley, Department of Medicine; Stanford Center for Inherited Cardiovascular Disease; and Department of Genetics, Stanford University, Stanford, CA, USA
27.
Abstract
There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery, including DNA-sequencing technologies and analysis algorithms, need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized, allowing it to be targeted with much greater therapeutic precision.
Affiliation(s)
- Euan A Ashley, Center for Inherited Cardiovascular Disease, Falk Cardiovascular Research Building, Stanford Medicine, 870 Quarry Road, Stanford, California 94305, USA
28. Long R, Hernaez M, Ochoa I, Weissman T. GeneComp, a new reference-based compressor for SAM files. Proceedings, Data Compression Conference 2017;2017:330-339. PMID: 29046896; PMCID: PMC5641594; DOI: 10.1109/dcc.2017.76.
Abstract
The affordability of DNA sequencing has led to unprecedented volumes of genomic data, which must be stored, processed, and analyzed. The most popular format for aligned genomic data is the SAM format, which contains information such as alignments, quality values, and more. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files that can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor, GeneComp, allows the user to perform lossy compression of the quality scores, which have been shown to occupy more than half of the losslessly compressed file. We show that GeneComp overall achieves better compression ratios than previously proposed algorithms when working in lossless mode.
Affiliation(s)
- Reggy Long, Department of Electrical Engineering, Stanford University, CA
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, CA
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, CA; Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, CA
29. Ochoa I, No A, Hernaez M, Weissman T. CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores. Proceedings, Information Theory Workshop 2016;2016:121-125. PMID: 29806047; DOI: 10.1109/itw.2016.7606808.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the data comprise nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. As a result, lossy compression of the quality scores has recently been proposed to alleviate storage costs. Further, it has been shown that lossy compression, at some specific rates, can achieve variant-calling performance similar to that achieved with losslessly compressed data. We propose CROMqs, a new lossy compressor for quality scores with the property of "infinitesimal successive refinability". This property allows the decoder to decompress the data iteratively without having to agree with the encoder on a specific rate prior to compression. This characteristic is particularly appealing in practice, as in most cases the appropriate rate at which the lossy compressor should operate cannot be established prior to compression. The property can also be of interest in scenarios involving streaming of genomic data. CROMqs is the first infinitesimal successive refinement lossy compressor for quality scores in the literature, and we show that it obtains rate-distortion performance comparable to previously proposed algorithms. Moreover, we show that CROMqs achieves variant-calling performance comparable to that of losslessly compressed data.
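Successive refinability can be made concrete with a toy bitplane coder, a sketch under assumed parameters rather than CROMqs's actual algorithm: the encoder emits refinement layers from most to least significant bit, and the decoder may stop after any prefix of layers and still reconstruct an approximation.

```python
# Toy successive-refinement scheme (illustrative only).
BITS = 6  # Phred scores 0..63 fit in 6 bits

def encode_bitplanes(scores: list[int]) -> list[list[int]]:
    """One refinement layer per bit position, MSB first."""
    return [[(q >> b) & 1 for q in scores] for b in range(BITS - 1, -1, -1)]

def decode_prefix(planes: list[list[int]], n_planes: int) -> list[int]:
    """Reconstruct from only the first n_planes refinement layers."""
    approx = [0] * len(planes[0])
    for b, plane in zip(range(BITS - 1, -1, -1), planes[:n_planes]):
        approx = [a | (bit << b) for a, bit in zip(approx, plane)]
    return approx

scores = [40, 37, 12, 2]
planes = encode_bitplanes(scores)
print(decode_prefix(planes, 2))  # coarse approximation: [32, 32, 0, 0]
print(decode_prefix(planes, 6))  # exact: [40, 37, 12, 2]
```

Because every prefix of the layer stream is a valid description, the rate never has to be negotiated up front, which is the practical appeal the abstract highlights.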
Affiliation(s)
- I Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- A No, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- M Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- T Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA 94305
30. Greenfield DL, Stegle O, Rrustemi A. GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics 2016;32:3124-3132. DOI: 10.1093/bioinformatics/btw385.
31. Roguski Ł, Ribeca P. CARGO: effective format-free compressed storage of genomic information. Nucleic Acids Res 2016;44:e114. PMID: 27131376; PMCID: PMC4937321; DOI: 10.1093/nar/gkw318.
Abstract
The recent super-exponential growth in the amount of sequencing data generated worldwide has brought techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework for automatically generating software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.
Affiliation(s)
- Łukasz Roguski, Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain; Experimental and Health Sciences, Universitat Pompeu Fabra, Carrer Doctor Aiguader 88, Barcelona 08003, Spain
- Paolo Ribeca, Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain; Integrative Biology, The Pirbright Institute, Ash Road, Pirbright, Woking, GU24 0NF, United Kingdom
32. Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Denoising of Quality Scores for Boosted Inference and Reduced Storage. Proceedings, Data Compression Conference 2016;2016:251-260. PMID: 29098178; PMCID: PMC5663231; DOI: 10.1109/dcc.2016.92.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the raw data comprise nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of quality scores has recently been proposed to alleviate storage costs, but reducing the noise in the quality scores has remained largely unexplored. The raw data are processed to identify variants, and these genetic variants are used in important applications such as medical decision making; improving the performance of variant calling by reducing the noise contained in the quality scores is therefore important. We propose a denoising scheme that reduces the noise of the quality scores, and we demonstrate improved inference with the denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, so significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work, as well as a Supplement with all the results, is available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
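A minimal sketch of the denoise-then-measure idea (the median filter below is an assumed stand-in, not the paper's denoiser): smoothing a read's quality track lowers its empirical entropy, which is exactly what makes subsequent lossless compression cheaper.

```python
import math
from collections import Counter

def median3(xs: list[int]) -> list[int]:
    """Toy denoiser: a width-3 median filter that removes isolated
    dips in a read's quality track (illustrative stand-in only)."""
    out = list(xs)
    for i in range(1, len(xs) - 1):
        out[i] = sorted(xs[i - 1 : i + 2])[1]
    return out

def entropy_bits(xs: list[int]) -> float:
    """Empirical zero-order entropy in bits per symbol."""
    counts, n = Counter(xs), len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

read = [38, 38, 11, 38, 38, 37, 9, 37, 37]
smoothed = median3(read)
print(entropy_bits(read), entropy_bits(smoothed))  # entropy drops
```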
Affiliation(s)
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Rachel Goldfeder, Department of Medicine, Stanford University, Stanford, CA, 94305
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA, 94305
- Euan Ashley, Department of Medicine, Stanford University, Stanford, CA, 94305
33. Hernaez M, Ochoa I, Weissman T. A cluster-based approach to compression of quality scores. Proceedings, Data Compression Conference 2016;2016:261-270. PMID: 29057318; DOI: 10.1109/dcc.2016.49.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Storing and sharing these large volumes of data has become a major bottleneck in the discovery and analysis of the genetic variants used for medical inference. As such, lossless compression of the data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine's reliability when calling a particular base pair. Thus, to further improve compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data are used for genetic variant discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on variant callers. Previously proposed algorithms do not perform well under all metrics and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, assuming all quality score sequences come from a mixture of Markov models, and then quantizes the quality scores based on those Markov models. Each quantizer targets a specific distortion to optimize overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms previously proposed methods under all analyzed distortion metrics. This suggests that the effect the proposed algorithm has on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced in the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
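The clustering step can be illustrated with a toy mixture of two fixed Markov models (all models, thresholds, and names below are assumptions; the paper learns the mixture and the per-cluster quantizers from the data): each read is assigned to the model under which its symbol transitions are most likely.

```python
import math

# Per-cluster transition probabilities over a binarized alphabet
# (hypothetical values; real models would be learned, e.g., via EM).
TRANS = {
    "smooth": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9},
    "noisy":  {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5},
}

def to_symbols(read: list[int], threshold: int = 20) -> list[int]:
    """Map quality scores to a small alphabet (here: low/high)."""
    return [1 if q >= threshold else 0 for q in read]

def log_likelihood(symbols: list[int], trans: dict) -> float:
    """First-order Markov log-likelihood of the symbol sequence."""
    return sum(math.log(trans[(a, b)]) for a, b in zip(symbols, symbols[1:]))

def assign_cluster(read: list[int]) -> str:
    symbols = to_symbols(read)
    return max(TRANS, key=lambda k: log_likelihood(symbols, TRANS[k]))

print(assign_cluster([38, 38, 37, 38, 36, 38]))  # -> "smooth"
print(assign_cluster([38, 5, 33, 4, 35, 6]))     # -> "noisy"
```

Once reads are grouped this way, each cluster can be quantized with its own distortion target, which is what drives the overall rate-distortion gain the abstract reports.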
Affiliation(s)
- Mikel Hernaez, Department of Electrical Engineering, Stanford University
- Idoia Ochoa, Department of Electrical Engineering, Stanford University
34. Alberti C, Daniels N, Hernaez M, Voges J, Goldfeder RL, Hernandez-Lopez AA, Mattavelli M, Berger B. An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. Proceedings, Data Compression Conference 2016;2016:221-230. PMID: 28845445; PMCID: PMC5568552; DOI: 10.1109/dcc.2016.39.
Abstract
This paper provides the specification and an initial validation of an evaluation framework for comparing lossy compressors of genome sequencing quality values. The goal is to define the reference data, test sets, tools, and metrics that shall be used to evaluate the impact of lossy compression of quality values on human genome variant calling. The functionality of the framework is validated with reference to two state-of-the-art genomic compressors. This work has been spurred by the current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.
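The framework's spirit, fixing data, tools, and metrics up front so compressors can be compared reproducibly, can be sketched as a declarative specification plus the summary statistic it reports (every name below is illustrative, not the committee's actual specification):

```python
# Hypothetical experiment specification in the spirit of the framework.
EVALUATION_SPEC = {
    "reference": "GRCh37",
    "test_sets": ["NA12878 WGS", "NA12878 WES"],
    "pipelines": ["bwa-mem + GATK HaplotypeCaller"],
    "gold_standard": "GIAB high-confidence calls",
    "metrics": ["bits per quality value", "precision", "recall", "F-score"],
}

def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure summarizing variant-calling accuracy."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_score(0.98, 0.95))  # single-number summary for one compressor
```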
Affiliation(s)
- Claudio Alberti, École Polytechnique Fédérale de Lausanne, EPFL SCI-STI-MM, Lausanne, VD, Switzerland
- Noah Daniels, Massachusetts Institute of Technology, CSAIL & Mathematics, Cambridge, MA
- Jan Voges, Institut für Informationsverarbeitung (TNT), Leibniz Universität Hannover, Germany
- Marco Mattavelli, École Polytechnique Fédérale de Lausanne, EPFL SCI-STI-MM, Lausanne, VD, Switzerland
- Bonnie Berger, Massachusetts Institute of Technology, CSAIL & Mathematics, Cambridge, MA