1. Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 2025; 22:692-697. PMID: 40205174; DOI: 10.1038/s41592-025-02625-2.
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate that it can align genes, plasmids or entire sequencing experiments against all bacteria sequenced up to 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
Affiliation(s)
- Karel Břinda: Inria, Irisa, Univ. Rennes, Rennes, France; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Simone Pignotti: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Kamil Salikhov: LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
- Rayan Chikhi: Institut Pasteur, Univ. Paris Cité, G5 Sequence Bioinformatics, Paris, France
- Zamin Iqbal: EMBL-EBI, Hinxton, UK; Milner Centre for Evolution, University of Bath, Bath, UK
- Michael Baym: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
2. Kowalski TM, Grabowski S. PgRC2: engineering the compression of sequencing reads. Bioinformatics 2025; 41:btaf101. PMID: 40037801; PMCID: PMC11908645; DOI: 10.1093/bioinformatics/btaf101.
Abstract
SUMMARY The FASTQ format remains at the heart of high-throughput sequencing. Despite advances in specialized FASTQ compressors, they are still imperfect in terms of practical performance tradeoffs. We present a multi-threaded version of Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of approximating the shortest common superstring over high-quality reads. Redundancy in the obtained string is efficiently removed by using a compact temporary representation. The current version, v2.0, preserves the compression ratio of the previous one, reducing the compression (resp. decompression) time by a factor of 8-9 (resp. 2-2.5) on a 14-core/28-thread machine. AVAILABILITY AND IMPLEMENTATION PgRC 2.0 can be downloaded from https://github.com/kowallus/PgRC and https://zenodo.org/records/14882486 (10.5281/zenodo.14882486).
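To make the pseudogenome idea concrete, here is a minimal Python sketch of a greedy shortest-common-superstring approximation, the concept PgRC builds on. This is a toy illustration only: PgRC's actual algorithm and data structures differ, and all names and example reads below are invented for exposition.

```python
# Toy "pseudogenome" construction: greedily merge reads by their longest
# suffix-prefix overlaps to approximate a shortest common superstring.
# NOT PgRC's actual algorithm, only the underlying concept.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads: list[str]) -> str:
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    while len(reads) > 1:
        best = (0, 0, 1)  # (overlap length, i, j)
        for i in range(len(reads)):
            for j in range(len(reads)):
                if i != j:
                    k = overlap(reads[i], reads[j])
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)] + [merged]
    return reads[0]

print(greedy_superstring(["ACGTAC", "GTACGG", "ACGGTT"]))  # ACGTACGGTT
```

The all-pairs scan here is quadratic; production compressors replace it with indexed overlap search.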
Affiliation(s)
- Tomasz M Kowalski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
- Szymon Grabowski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
3. Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. PMID: 37646787; PMCID: PMC11082048; DOI: 10.1089/cmb.2023.0094.
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
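As a concrete illustration of the classical construction this survey covers, the following sketch computes (w, k)-minimizers by brute force. Function names and parameters are ours, not the survey's; real implementations use rolling hashes and monotone queues instead of rescanning each window.

```python
# Classical (w, k)-minimizer: in every window of w consecutive k-mers,
# keep the lexicographically smallest one (ties broken leftmost).

def minimizer_sketch(seq: str, w: int, k: int) -> list[tuple[int, str]]:
    """Return the distinct (position, k-mer) minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        pos = start + min(range(w), key=lambda i: window[i])
        if not picked or picked[-1][0] != pos:  # skip repeated picks
            picked.append((pos, kmers[pos]))
    return picked

print(minimizer_sketch("ACGTTGCATGACG", w=4, k=3))
```

Because adjacent windows usually share their minimum, consecutive windows tend to pick the same k-mer, which is what makes the sketch small.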
Affiliation(s)
- Hongyu Zheng: Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Guillaume Marçais: Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford: Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
4. Sun H, Zheng Y, Xie H, Ma H, Liu X, Wang G. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering. BMC Bioinformatics 2023; 24:454. PMID: 38036969; PMCID: PMC10691058; DOI: 10.1186/s12859-023-05566-9.
Abstract
BACKGROUND Genomic sequencing reads compressors are essential for balancing the generation speed of high-throughput sequencing short reads, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short-read compressors rarely exploit big-memory systems or the duplicated information between diverse sequencing files to achieve a higher compression ratio and conserve storage space. RESULTS We take compression ratio as the optimization objective and propose PMFFRC, a large-scale genomic sequencing short-read data compression optimizer built on novel memory modeling and redundant-read clustering. By cascading PMFFRC, on 982 GB of FASTQ-format sequencing data, with 274 GB and 3.3 billion short reads, the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve 77.89%, 77.56%, 73.51%, and 29.36% average maximum compression ratio gains, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS PMFFRC makes rational use of the compression server's large memory and effectively reduces the storage footprint of sequencing reads, relieving basic storage infrastructure costs and community data-sharing overhead. Our work furnishes a novel solution for improving sequencing-read compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, freely available at https://github.com/fahaihi/PMFFRC.
Affiliation(s)
- Hui Sun: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Yingfeng Zheng: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Haonan Xie: Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning, China
- Huidong Ma: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Xiaoguang Liu: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Gang Wang: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
5. Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. PMID: 35848773; DOI: 10.1093/bfgp/elac016.
Abstract
Next-generation sequencing has produced enormous amounts of short-read sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both de novo assembly and error-correction methods exploit the overlaps between reads, a natural concern is whether the sequencing errors that degrade genome assemblies also hurt the compression of NGS data. This work addresses two questions: whether current error-correction algorithms can enable compression algorithms to make the sequence data much more compact, and whether the reads modified by error-correction algorithms lead to quality improvements in de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in a collection of multiple read sets and then compress them simultaneously, yielding a further compression improvement after error correction. We use examples to illustrate that accurate error-correction algorithms can significantly reduce the number of mismatched nucleotides in reference-free compression and hence greatly improve compression performance. Extensive tests on practical collections of multiple short-read sets confirm that compression performance on the error-corrected data (whose size is unchanged) significantly outperforms that on the original data, and that the file-reordering idea contributes further gains. Error correction of the original reads also improved the quality of the resulting genome assemblies, sometimes remarkably. However, it remains an open question how to combine appropriate error-correction methods with an assembly algorithm so that assembly performance is always significantly improved.
Affiliation(s)
- Tao Tang: Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia; School of Modern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, Jiangsu 210003, China
- Gyorgy Hutvagner: School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia
- Wenjian Wang: School of Computer and Information Technology, Shanxi University, Shanxi Road, Shanxi 030006, China
- Jinyan Li: Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, NSW 2007, Australia
6.
Abstract
The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today's genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.
7. Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374; DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous amount of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Moreover, long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model that reduces FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provide it as a Docker image for easy installation and execution. We compared our method with major existing FASTQ compression tools using benchmark datasets, including new long-read sequencing data in the validation. As a result, FastqCLS outperformed the other tools in compression ratio for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dohyeon Lee: School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song: School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
8. Liu Y, Zhang X, Zou Q, Zeng X. Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers. Bioinformatics 2021; 37:1604-1606. PMID: 33112385; DOI: 10.1093/bioinformatics/btaa915.
Abstract
SUMMARY Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minirmd. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
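The strand-awareness highlighted above can be illustrated with a canonical-key trick: keying each read by the lexicographic minimum of itself and its reverse complement collapses strand-flipped copies. This minimal sketch covers only exact duplicates; minirmd's multi-round minimizer clustering for near-duplicates is substantially more involved.

```python
# Strand-aware exact-duplicate removal via canonical read keys.

COMP = str.maketrans("ACGT", "TGCA")

def canonical(read: str) -> str:
    rc = read.translate(COMP)[::-1]  # reverse complement
    return min(read, rc)

def remove_duplicates(reads: list[str]) -> list[str]:
    seen, unique = set(), []
    for r in reads:
        key = canonical(r)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

reads = ["ACGTT", "AACGT", "ACGTT"]  # AACGT is the reverse complement of ACGTT
print(remove_duplicates(reads))       # ['ACGTT']
```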
Affiliation(s)
- Yuansheng Liu: College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
- Xiaocai Zhang: Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia
- Quan Zou: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Xiangxiang Zeng: College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
9. Liu Y, Li J. Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. PLoS Comput Biol 2021; 17:e1009229. PMID: 34280186; PMCID: PMC8321399; DOI: 10.1371/journal.pcbi.1009229.
Abstract
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition named the Hamming-shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference among the reads, enabling a code-minimized transfer of every child read for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved 10-30% greater file size reduction compared with the best compression results of existing algorithms. As future work, the separation and connectivity degrees of these giant graphs could serve as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
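A toy predicate conveys the kind of edge this graph links: two reads are connected when one, shifted by a small offset, matches the other with few mismatches. The weight below (shift plus mismatches) is an illustrative stand-in, not the paper's exact edge-weighting scheme.

```python
# Illustrative Hamming-shifting edge test between two reads.

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def edge_weight(r1: str, r2: str, max_shift: int, max_mismatch: int):
    """Smallest (shift + mismatches) linking r1 and r2, or None."""
    best = None
    for s in range(max_shift + 1):
        for a, b in ((r1[s:], r2), (r2[s:], r1)):  # try both shift directions
            n = min(len(a), len(b))
            d = hamming(a[:n], b[:n])
            if d <= max_mismatch:
                w = s + d
                best = w if best is None else min(best, w)
    return best

print(edge_weight("ACGTACGT", "CGTACGTA", max_shift=2, max_mismatch=1))  # 1
```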
Affiliation(s)
- Yuansheng Liu: Data Science Institute, University of Technology Sydney, Sydney, Australia
- Jinyan Li: Data Science Institute, University of Technology Sydney, Sydney, Australia
10. Tang T, Li J. Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases. J Bioinform Comput Biol 2021; 19:2050048. PMID: 33472569; DOI: 10.1142/s0219720020500481.
Abstract
FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, these data sets are currently compressed one by one, without consideration of the inter-similarity between them, which could otherwise be exploited to enhance the performance of de novo compression. We show that clustering these data sets into similar sub-groups for group-by-group compression greatly improves compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) of every read in each data set, and to use these k-mers as features, with their frequencies in every data set as feature values, transforming each of these huge data sets into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. As a large number of common k-mers with similar feature values between two data sets implies a large proportion of overlapping reads shared between them, merging similar data sets creates immense sequence redundancy that boosts compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97 GB).
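The feature construction described above reduces to a few lines: every read contributes its k-minimizer, and each data set becomes a vector of k-minimizer frequencies. This sketch uses invented toy data and names; the paper's clustering and merging machinery sits on top of vectors like these.

```python
# Turn read sets into k-minimizer frequency vectors for clustering.

from collections import Counter

def k_minimizer(read: str, k: int) -> str:
    """Lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def feature_vector(reads: list[str], k: int, features: list[str]) -> list[float]:
    counts = Counter(k_minimizer(r, k) for r in reads)
    total = sum(counts.values())
    return [counts[f] / total for f in features]

dataset_a = ["ACGTTA", "TTACGA", "GGACGT"]
dataset_b = ["TTTTGA", "GATTTT"]
features = sorted({k_minimizer(r, 3) for r in dataset_a + dataset_b})
print(features)
print(feature_vector(dataset_a, 3, features))
print(feature_vector(dataset_b, 3, features))
```

Data sets whose vectors land close together share many reads and compress well as a group.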
Affiliation(s)
- Tao Tang: Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia
- Jinyan Li: Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia
11. Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020; 21:249. PMID: 32943081; PMCID: PMC7499882; DOI: 10.1186/s13059-020-02135-8.
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost.
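To see what compaction means here, the naive route is sketched below: k-mers are nodes, and maximal non-branching paths are merged into single "unitig" strings. Bifrost's contribution is precisely to avoid this route by building the compacted graph directly and in parallel; this toy version goes through the uncompacted graph for clarity and ignores isolated cycles.

```python
# Naive de Bruijn graph compaction into unitigs (illustration only).

def compacted_dbg(reads: list[str], k: int) -> list[str]:
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    succ = {km: [km[1:] + c for c in "ACGT" if km[1:] + c in kmers] for km in kmers}
    pred = {km: [c + km[:-1] for c in "ACGT" if c + km[:-1] in kmers] for km in kmers}
    unitigs, used = [], set()
    for km in sorted(kmers):
        # start a unitig only at a node that does not extend a unique path
        if km in used or (len(pred[km]) == 1 and len(succ[pred[km][0]]) == 1):
            continue
        path = [km]
        while len(succ[path[-1]]) == 1 and len(pred[succ[path[-1]][0]]) == 1:
            path.append(succ[path[-1]][0])
        used.update(path)
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))
    return unitigs

print(compacted_dbg(["ACGTACGG", "TACGGT"], k=4))  # ['ACGGT', 'ACGTACG']
```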
Affiliation(s)
- Guillaume Holley: Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
- Páll Melsted: Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
12. Kowalski TM, Grabowski S. PgRC: pseudogenome-based read compressor. Bioinformatics 2020; 36:2082-2089. PMID: 31893286; DOI: 10.1093/bioinformatics/btz919.
Abstract
MOTIVATION The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. RESULTS We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression. AVAILABILITY AND IMPLEMENTATION PgRC can be downloaded from https://github.com/kowallus/PgRC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tomasz M Kowalski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
- Szymon Grabowski: Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
13. Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020; 35:2066-2074. PMID: 30407482; DOI: 10.1093/bioinformatics/bty936.
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to shrink these datasets for efficient storage and transmission. Reordering reads according to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that share the same minimizer. Within each subgroup, a contig is constructed. Pairs of contigs derived from the subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between the two contigs. This merging process is repeated until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole-genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gains over the best state-of-the-art algorithm. Our method also achieved a 10% reduction in compressed file size compared with the best algorithm under the read-order-preserving mode. These excellent performances are mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuansheng Liu: Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu: Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E Dinger: Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia; St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li: Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
14. Kockan C, Zhu K, Dokmai N, Karpov N, Kulekci MO, Woodruff DP, Sahinalp SC. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat Methods 2020; 17:295-301. PMID: 32132732; DOI: 10.1038/s41592-020-0761-8.
Abstract
Genome-wide association studies (GWAS), especially on rare diseases, may necessitate the exchange of sensitive genomic data between multiple institutions. Since genomic data sharing is often infeasible due to privacy concerns, cryptographic methods, such as secure multiparty computation (SMC) protocols, have been developed with the aim of offering privacy-preserving collaborative GWAS. Unfortunately, the computational overhead of these methods remains prohibitive for human-genome-scale data. Here we introduce SkSES (https://github.com/ndokmai/sgx-genome-variants-search), a hardware-software hybrid approach for privacy-preserving collaborative GWAS that improves the running time of the most advanced cryptographic protocols by two orders of magnitude. The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors, in particular Intel's SGX. To overcome the severe memory limitation of TEEs, SkSES employs novel 'sketching' algorithms that maintain essential statistical information on genomic variants in input VCF files. By additionally incorporating efficient data compression and population stratification reduction methods, SkSES identifies the top k genomic variants in a cohort quickly, accurately and in a privacy-preserving manner.
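The abstract does not spell out which sketches SkSES uses, so the following is purely a representative example of the family: a count-min sketch, which estimates per-item counts in fixed space with one-sided error and so fits the tiny memory of an enclave. All names and parameters are ours.

```python
# Count-min sketch: fixed-memory approximate counting (illustration of the
# general "sketching" idea, not necessarily SkSES's exact structure).

import hashlib

class CountMinSketch:
    def __init__(self, width: int = 1024, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:  # never underestimates
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for variant in ["chr1:12345:A>G"] * 7 + ["chr2:999:C>T"] * 3:
    cms.add(variant)
print(cms.estimate("chr1:12345:A>G"))  # >= 7, usually exactly 7
```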
Affiliation(s)
- Can Kockan: Department of Computer Science, Indiana University, Bloomington, IN, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Kaiyuan Zhu: Department of Computer Science, Indiana University, Bloomington, IN, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Natnatee Dokmai: Department of Computer Science, Indiana University, Bloomington, IN, USA
- Nikolai Karpov: Department of Computer Science, Indiana University, Bloomington, IN, USA
- M Oguzhan Kulekci: Informatics Institute, Istanbul Technical University, Istanbul, Turkey
- David P Woodruff: Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- S Cenk Sahinalp: Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
15.
Abstract
The amount of data produced by modern sequencing instruments that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect, and the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. The drawbacks of the proposed method are large memory and time requirements.
16. Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. PMID: 31757199; PMCID: PMC6873394; DOI: 10.1186/s12859-019-2883-5.
Abstract
MOTIVATION Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data that is unfortunately not matched by a corresponding growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores to improve compressibility without altering the important information they carry for downstream analyses like SNP calling. RESULTS We use the FM-index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, together with an effective smoothing algorithm that maintains high precision for SNP calling pipelines while reducing the entropy of the quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing that leads to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. AVAILABILITY https://github.com/yhhshb/yalff.
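The core smoothing idea can be sketched compactly: positions covered by a k-mer found in a trusted dictionary get their quality replaced by one constant symbol, shrinking the entropy of the quality stream. In this toy version a plain Python set stands in for YALFF's FM-indexed dictionary, and all names and data are illustrative.

```python
# Dictionary-based quality smoothing (toy version).

def smooth_qualities(seq: str, qual: str, trusted: set, k: int, q: str = "I") -> str:
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in trusted:          # k-mer is trusted:
            for j in range(i, i + k):        # flatten the qualities it covers
                covered[j] = True
    return "".join(q if c else orig for c, orig in zip(covered, qual))

trusted = {"ACGT", "CGTA"}
print(smooth_qualities("ACGTAGG", "FF:FF#:", trusted, k=4))
# -> 'IIIII#:'  (positions covered by trusted 4-mers flattened to 'I')
```

Runs of a single symbol compress far better than the original noisy qualities, while untrusted positions keep their scores for the SNP caller.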
Affiliation(s)
- Yoshihiro Shibuya: Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy; Laboratoire d'Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic, 5 bd Descartes, Champs-sur-Marne, France
- Matteo Comin: Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
17. Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2019; 34:2748-2756. PMID: 29617939; DOI: 10.1093/bioinformatics/bty205.
Abstract
Motivation The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lukasz Roguski: Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain; Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Idoia Ochoa: Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA
- Mikel Hernaez: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA
- Sebastian Deorowicz: Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
18.
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez: Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin: Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman: Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa: Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
19. Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights 2019; 13:1177932218821373. PMID: 30792576; PMCID: PMC6376532; DOI: 10.1177/1177932218821373.
Abstract
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness such potential. To reach that goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads, and competitive compression ratios for shorter reads, compared with the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
Affiliation(s)
- Aníbal Guerra: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela; Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza: Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
20.
Abstract
Background: Biological sequence data have increased at a rapid rate due to advancements in sequencing technologies and the reduced cost of sequencing. The huge increase in these data presents significant challenges to researchers. In addition to meaningful analysis, data storage is also a challenge: the increase in data production is outpacing storage capacity. Data compression is used to reduce the size of data and thus reduces storage requirements as well as transmission cost over the internet.
Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format.
Method: The proposed algorithm uses bit manipulation and dictionary-based compression for base compression. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding.
Results: The proposed algorithm is validated with experimental results on real datasets. The results are compared with both general-purpose and specialized compression programs.
Conclusion: The proposed algorithm produces a better compression ratio in a time comparable to other algorithms.
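The Huffman step named in the Method can be sketched in a few lines: a generic Huffman coder over a quality string, shorter codes for more frequent symbols. FCompress's actual implementation details (base packing, header dictionaries) are not shown in this abstract, so this is only the textbook building block.

```python
# Generic Huffman coding of a quality-score string.

import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    # heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))
    return heap[0][2]

quals = "IIIIFFFF:::,#"
codes = huffman_codes(quals)
encoded = "".join(codes[q] for q in quals)
print(codes, len(encoded), "bits vs", 8 * len(quals))
```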
Affiliation(s)
- Muhammad Sardaraz: Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
- Muhammad Tahir: Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
21. Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. PMID: 29444237; DOI: 10.1093/bioinformatics/btx639.
Abstract
MOTIVATION Next-generation sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. RESULTS We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on the fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps in understanding its performance in practice. The algorithm compresses only the read sequences, works with unaligned FASTQ files, and does not require a reference. CONTACT schandak@stanford.edu. SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
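A minimal sketch of the reordering idea: reads sharing a hashed substring (here, their minimizer, used as the anchor) are grouped so that genomically close, overlapping reads end up adjacent, which a downstream compressor then exploits. HARC's actual index and reordering heuristics are considerably richer; names and example reads below are invented.

```python
# Hash-based read reordering: bucket reads by a shared substring anchor.

from collections import defaultdict

def reorder_reads(reads: list[str], k: int = 4) -> list[str]:
    buckets = defaultdict(list)
    for r in reads:
        anchor = min(r[i:i + k] for i in range(len(r) - k + 1))
        buckets[anchor].append(r)
    out = []
    for anchor in sorted(buckets):
        # within a bucket, sort by the anchor's position to align overlaps
        out.extend(sorted(buckets[anchor], key=lambda r: -r.find(anchor)))
    return out

reads = ["GGTACGTT", "TACGTTAA", "CGTTAAGC"]
print(reorder_reads(reads))  # overlapping reads land next to each other
```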
Affiliation(s)
- Shubham Chandak: Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi: Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman: Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
22. Wang R, Li J, Bai Y, Zang T, Wang Y. BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs. PeerJ 2018; 6:e5611. PMID: 30364599; PMCID: PMC6197042; DOI: 10.7717/peerj.5611.
Abstract
Dramatic increases in data produced by next-generation sequencing (NGS) technologies demand data compression tools for saving storage space. However, effective and efficient data compression for genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, the first to compress genome sequencing data with dynamic de Bruijn graphs after bucketing the data. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, a feature that effectively reduces storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.
Affiliation(s)
- Rongjie Wang: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, HeiLongJiang, China
- Junyi Li: School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
- Yang Bai: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, HeiLongJiang, China
- Tianyi Zang: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, HeiLongJiang, China
- Yadong Wang: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, HeiLongJiang, China
23. Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. J Comput Biol 2018; 25:825-836. PMID: 30011247; DOI: 10.1089/cmb.2018.0068.
Abstract
The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
Affiliation(s)
- Guillaume Holley: Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Roland Wittler: Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Jens Stoye: Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Faraz Hach: School of Computing Science, Simon Fraser University, Burnaby, Canada; Department of Urologic Sciences, University of British Columbia, Vancouver, Canada; Vancouver Prostate Centre, Vancouver, Canada
24. Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 2018; 9:566. PMID: 29422526; PMCID: PMC5805770; DOI: 10.1038/s41467-017-02480-6.
Abstract
The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that, among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method approximates well. Our method significantly improves the compression performance of alternatives without compromising speed. The increase in high-throughput sequencing (HTS) data warrants compression methods that facilitate better storage and communication. Here, Ginart et al. introduce Assembltrie, a reference-free compression tool that is guaranteed to achieve optimality for error-free reads.
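The payoff of the trie representation is easy to see in miniature: reads that share a prefix path share its storage. The sketch below is a plain read trie with invented data; Assembltrie's compact tries and optimality machinery go far beyond this toy.

```python
# Minimal read trie: shared prefixes among reads are stored once.

class TrieNode:
    def __init__(self):
        self.children: dict[str, "TrieNode"] = {}
        self.is_read_end = False

def insert(root: TrieNode, read: str) -> None:
    node = root
    for base in read:
        node = node.children.setdefault(base, TrieNode())
    node.is_read_end = True

def count_nodes(node: TrieNode) -> int:
    return 1 + sum(count_nodes(c) for c in node.children.values())

root = TrieNode()
for read in ["ACGTAC", "ACGTGG", "ACGT"]:
    insert(root, read)
# 16 characters of reads stored in far fewer trie edges:
print(count_nodes(root) - 1)  # 8 edges: shared ACGT, then AC and GG branches
```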
25. Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z. LW-FQZip 2: a parallelized reference-based compression of FASTQ files. BMC Bioinformatics 2017; 18:179. PMID: 28320326; PMCID: PMC5359991; DOI: 10.1186/s12859-017-1588-x.
Abstract
BACKGROUND The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decreasing disk cost, and the storage of huge sequencing data has become a bottleneck for downstream analyses. Data compression is considered a solution to reduce the dependency on storage, so efficient sequencing data compression methods are in high demand. RESULTS In this article, we present a lossless reference-based compression method, LW-FQZip 2, targeted at FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 is equipped with a light-weight mapping model, a bitwise prediction-by-partial-matching model, arithmetic coding, and multi-threading parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated from various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs. CONCLUSIONS This competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2.
Affiliation(s)
- Zhi-An Huang: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Zhenkun Wen: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Qingjin Deng: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ying Chu: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yiwen Sun: School of Medicine, Shenzhen University, Shenzhen 518060, China
- Zexuan Zhu: College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
26. Paridaens T, Van Wallendael G, De Neve W, Lambert P. AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality. Bioinformatics 2017; 33:1464-1472. DOI: 10.1093/bioinformatics/btx001.
27. Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. Lecture Notes in Computer Science 2017. DOI: 10.1007/978-3-319-56970-3_4.
28. Cánovas R, Moffat A, Turpin A. CSAM: Compressed SAM format. Bioinformatics 2016; 32:3709-3716. PMID: 27540265; DOI: 10.1093/bioinformatics/btw543.
Abstract
MOTIVATION Next-generation sequencing machines produce vast amounts of genomic data. For the data to be useful, it is essential that they can be stored and manipulated efficiently. This work responds to the combined challenge of compressing genomic data while providing fast access to regions of interest, without necessitating decompression of whole files. RESULTS We describe CSAM (Compressed SAM format), a compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files and support fast access to the compressed information. They generate more compact lossless representations than BAM, currently the preferred lossless compressed SAM-equivalent format, and are self-contained; that is, they do not depend on any external resources to compress or decompress SAM files. AVAILABILITY AND IMPLEMENTATION An implementation is available at https://github.com/rcanovas/libCSAM. CONTACT canovas-ba@lirmm.fr. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rodrigo Cánovas: L.I.R.M.M. and Institut Biologie Computationnelle, Université de Montpellier, Montpellier Cedex 5, CNRS F-34392, France; Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
- Alistair Moffat: Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
- Andrew Turpin: Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia
29. Comparison of high-throughput sequencing data compression tools. Nat Methods 2016; 13:1005-1008. PMID: 27776113; DOI: 10.1038/nmeth.4037.
Abstract
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
31. Greenfield DL, Stegle O, Rrustemi A. GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics 2016; 32:3124-3132. DOI: 10.1093/bioinformatics/btw385.
32. Roguski Ł, Ribeca P. CARGO: effective format-free compressed storage of genomic information. Nucleic Acids Res 2016; 44:e114. PMID: 27131376; PMCID: PMC4937321; DOI: 10.1093/nar/gkw318.
Abstract
The recent super-exponential growth in the amount of sequencing data generated worldwide has put techniques for compressed storage into focus. Most available solutions, however, are strictly tied to specific bioinformatics formats, sometimes inheriting suboptimal design choices from them; this hinders flexible and effective data sharing. Here, we present CARGO (Compressed ARchiving for GenOmics), a high-level framework to automatically generate software systems optimized for the compressed storage of arbitrary types of large genomic data collections. Straightforward applications of our approach to FASTQ and SAM archives require a few lines of code, produce solutions that match and sometimes outperform specialized format-tailored compressors, and scale well to multi-TB datasets. All CARGO software components can be freely downloaded for academic and non-commercial use from http://bio-cargo.sourceforge.net.
Collapse
Affiliation(s)
- Łukasz Roguski
- Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain.
- Experimental and Health Sciences, Universitat Pompeu Fabra, Carrer Doctor Aiguader 88, Barcelona 08003, Spain
| | - Paolo Ribeca
- Algorithm Development, Centro Nacional de Análisis Genómico, Carrer Baldiri i Reixac 4, Barcelona 08028, Spain.
- Integrative Biology, The Pirbright Institute, Ash Road, Pirbright, Woking, GU24 0NF, United Kingdom
| |
Collapse
|
33
|
Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 2015; 14:1630002. [PMID: 26846812 DOI: 10.1142/s0219720016300021] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Advances in high-throughput sequencing technologies and the falling cost of sequencing have led to exponential growth in high-throughput DNA sequence data. This growth has posed challenges for the storage, retrieval, and transmission of sequencing data, and data compression is used to cope with them. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genomes and sequencing reads. Algorithms are categorized as referential or reference-free. Experimental results and a comparative analysis of the various methods are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
Collapse
Affiliation(s)
- Muhammad Sardaraz
- Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Muhammad Tahir
- Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Ataul Aziz Ikram
- Department of Electrical Engineering, National University, Islamabad 44000, Pakistan
| |
Collapse
|
34
|
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 2015; 16:288. [PMID: 26370285 PMCID: PMC4570262 DOI: 10.1186/s12859-015-0709-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2015] [Accepted: 08/17/2015] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND The data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools such as the widely used gzip. RESULTS We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses. CONCLUSIONS LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq and metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license and available for download at http://gatb.inria.fr/software/leon/.
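A minimal sketch of the core encoding idea follows, under simplifying assumptions: an exact k-mer set stands in for LEON's Bloom filter (so the false positives a real Bloom filter introduces, which LEON must handle, are ignored), reads are over the ACGT alphabet, and each read is stored as its anchoring k-mer plus the base chosen at every bifurcation of the graph walk.

K = 5

def build_graph(reads):
    """Collect all k-mers of the reads (a real implementation would
    insert them into a Bloom filter rather than an exact set)."""
    kmers = set()
    for r in reads:
        for i in range(len(r) - K + 1):
            kmers.add(r[i:i + K])
    return kmers

def encode(read, kmers):
    """Encode a read as (anchoring k-mer, bifurcation letters)."""
    anchor, bifurcations, node = read[:K], [], read[:K]
    for base in read[K:]:
        succs = [b for b in "ACGT" if node[1:] + b in kmers]
        if len(succs) > 1:          # the graph branches: store the choice
            bifurcations.append(base)
        node = node[1:] + base
    return anchor, bifurcations

def decode(anchor, bifurcations, kmers, length):
    """Rebuild the read by walking the graph from the anchor."""
    read, node, it = anchor, anchor, iter(bifurcations)
    while len(read) < length:
        succs = [b for b in "ACGT" if node[1:] + b in kmers]
        base = next(it) if len(succs) > 1 else succs[0]
        read += base
        node = node[1:] + base
    return read

# Toy usage:
# reads = ["ACGTACGTGACA", "CGTACGTGACAT"]
# kmers = build_graph(reads)
# a, bifs = encode(reads[0], kmers)
# assert decode(a, bifs, kmers, len(reads[0])) == reads[0]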
Collapse
Affiliation(s)
- Gaëtan Benoit
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Claire Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | | | - Erwan Drezen
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Thibault Dayris
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
| | - Raluca Uricaru
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- University of Bordeaux, CBiB, Bordeaux, F-33000, France.
| | - Guillaume Rizk
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| |
Collapse
|
35
|
Kowalski T, Grabowski S, Deorowicz S. Indexing Arbitrary-Length k-Mers in Sequencing Reads. PLoS One 2015; 10:e0133198. [PMID: 26182400 PMCID: PMC4504488 DOI: 10.1371/journal.pone.0133198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2015] [Accepted: 06/24/2015] [Indexed: 11/25/2022] Open
Abstract
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work of Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction, and the analysis of reads from RNA-seq experiments.
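The query interface can be illustrated with a deliberately naive sketch: concatenate the reads into a pseudogenome, build a plain suffix array over it, and answer k-mer counting and locating queries by binary search. PgSA itself uses a far more compact, overlap-aware representation of the reads; the Python code below conveys only the query mechanics.

def suffix_array(text):
    """Naive O(n^2 log n) construction; adequate for a toy example."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lower_bound(text, sa, target):
    """First suffix-array position whose suffix is >= target."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(target)] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def occurrences(text, sa, kmer):
    """Start positions of all occurrences of kmer; its len() is the count."""
    lo = lower_bound(text, sa, kmer)
    hi = lower_bound(text, sa, kmer + "~")  # '~' sorts after A, C, G, T
    return sorted(sa[i] for i in range(lo, hi))

# Toy usage: reads joined by a separator form the "pseudogenome".
reads = ["ACGTACG", "GTACGTT", "TTACGGA"]
pg = "$".join(reads)
sa = suffix_array(pg)
print(occurrences(pg, sa, "ACG"))  # positions of ACG within pg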
Collapse
Affiliation(s)
- Tomasz Kowalski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| |
Collapse
|
36
|
Deorowicz S, Danek A, Niemiec M. GDC 2: Compression of large collections of genomes. Sci Rep 2015; 5:11565. [PMID: 26108279 PMCID: PMC4479802 DOI: 10.1038/srep11565] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2015] [Accepted: 05/28/2015] [Indexed: 01/18/2023] Open
Abstract
Falling prices of high-throughput genome sequencing are changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress, and genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the necessity of storing and transferring huge amounts of genomic data. In this paper we deal with the problem of compressing large collections of complete genomic sequences. We propose an algorithm that compresses a collection of 1092 human diploid genomes about 9,500-fold, a result about 4 times better than that offered by other existing compressors. Moreover, our algorithm is fast, processing data at 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows complete genomic collections to be stored at low cost; for example, the examined collection of 1092 human genomes needs only about 700 MB when compressed, compared with about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.
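The referential principle underlying such compressors can be sketched as a greedy encoder that emits (reference position, length) matches plus literal bases. GDC 2 adds, among other things, a second compression level exploiting similarity across the whole collection, which this toy Python sketch (with hypothetical names) omits.

MIN_MATCH = 8

def index_reference(ref):
    """Hash every MIN_MATCH-mer of the reference to its positions."""
    idx = {}
    for i in range(len(ref) - MIN_MATCH + 1):
        idx.setdefault(ref[i:i + MIN_MATCH], []).append(i)
    return idx

def encode(target, ref, idx):
    """Greedily emit (ref_position, length) matches or literal bases."""
    out, i = [], 0
    while i < len(target):
        seed, best = target[i:i + MIN_MATCH], None
        for p in idx.get(seed, []):
            l = MIN_MATCH
            while i + l < len(target) and p + l < len(ref) and target[i + l] == ref[p + l]:
                l += 1
            if best is None or l > best[1]:
                best = (p, l)
        if best:
            out.append(best)          # match against the reference
            i += best[1]
        else:
            out.append(target[i])     # literal base
            i += 1
    return out

def decode(tokens, ref):
    """Replay matches and literals to recover the target sequence."""
    return "".join(ref[t[0]:t[0] + t[1]] if isinstance(t, tuple) else t
                   for t in tokens)

# Toy usage (hypothetical sequences):
# ref = "ACGTACGTTACGGA"; idx = index_reference(ref)
# tokens = encode("ACGTACGTTACGGATT", ref, idx)
# assert decode(tokens, ref) == "ACGTACGTTACGGATT"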
Collapse
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | - Agnieszka Danek
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | | |
Collapse
|
37
|
|