1. Roy S, Mukhopadhyay A. A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression. Gene 2024; 907:148235. [PMID: 38342250] [DOI: 10.1016/j.gene.2024.148235]
Abstract
Next-generation sequencing (NGS) technology generates massive amounts of genome sequence data, and the volume grows rapidly over time. As a result, there is a growing need for efficient compression algorithms to facilitate the processing, storage, transmission, and analysis of large-scale genome sequences. Over the past 31 years, numerous state-of-the-art compression algorithms have been developed. The performance of any compression algorithm is measured by three main metrics: compression ratio, time, and memory usage. Existing k-mer hash-indexing systems take longer because they make decisions based on trial compression results. In this paper, we propose a two-phase reference genome compression algorithm using optimal k-mer length (RGCOK). Reference-based compression takes advantage of the inter-similarity between chromosomes of the same species. RGCOK achieves this by finding the optimal k-mer length for matching, using a randomization method and hashing. The performance of RGCOK was evaluated on three different benchmark data sets: severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Homo sapiens, and sequences of other species, using an Amazon AWS virtual cloud machine. Experiments showed that RGCOK finds the optimal k-mer length in around 45.28 min, whereas the corresponding time for the existing state-of-the-art algorithms HiRGC, SCCG, and HRCM ranges from 58 min to 8.97 h.
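To make this concrete, the sketch below shows the generic reference-based k-mer indexing idea the abstract builds on: hash every k-mer of the reference, then greedily extend seed hits in the target into longer matches. It is an illustration under assumed simplifications (fixed k, greedy extension, naive tuple encoding), not RGCOK itself.

```python
from collections import defaultdict

def build_kmer_index(ref: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def greedy_match(target: str, ref: str, k: int):
    """Encode the target as (ref_pos, length) matches plus literal characters."""
    index = build_kmer_index(ref, k)
    out, i = [], 0
    while i <= len(target) - k:
        best_len, best_pos = 0, -1
        for p in index.get(target[i:i + k], []):
            l = k
            while i + l < len(target) and p + l < len(ref) and target[i + l] == ref[p + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, p
        if best_len:
            out.append(("match", best_pos, best_len))
            i += best_len
        else:
            out.append(("literal", target[i]))
            i += 1
    out.extend(("literal", c) for c in target[i:])
    return out

print(greedy_match("ACGTACGTTT", "ACGTACGTAA", 4))
# [('match', 0, 8), ('literal', 'T'), ('literal', 'T')]
```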
Affiliation(s)
- Subhankar Roy, Department of Computer Science and Engineering, Academy of Technology, Adisaptagram, Hooghly 712121, West Bengal, India
- Anirban Mukhopadhyay, Department of Computer Science and Engineering, University of Kalyani, Kalyani, Nadia 741235, West Bengal, India

2. Lu Z, Guo L, Chen J, Wang R. Reference-based genome compression using the longest matched substrings with parallelization consideration. BMC Bioinformatics 2023; 24:369. [PMID: 37777730] [PMCID: PMC10544193] [DOI: 10.1186/s12859-023-05500-z]
Abstract
BACKGROUND For decades, researchers have devoted themselves to accelerating genome sequencing and reducing its cost, and they have made great strides in both areas, making it easier to study and analyze genome data. However, efficiently storing and transmitting the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Research on genome data compression algorithms that enable efficient representation of genome data has therefore gradually attracted their attention. Meanwhile, since current computing devices have multiple cores, making full use of them through parallel processing is also an important consideration in designing genome compression algorithms. RESULTS We propose an algorithm (LMSRGC) based on reference genome sequences, which uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMSs) for the compression of genome data in FASTA format. The proposed algorithm exploits the characteristics of the SA and the LCP array to select all appropriate LMSs between the genome sequence to be compressed and the reference genome sequence, and then uses these LMSs to compress the target genome sequence. To speed up the algorithm, we use GPUs to parallelize the construction of the SA, while using multiple threads to parallelize the creation of the LCP array and the filtering of LMSs. CONCLUSIONS Experiment results demonstrate that our algorithm is competitive with the current state-of-the-art algorithms in compression ratio and compression time.
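The SA/LCP machinery used by LMSRGC can be illustrated compactly: build one suffix array over reference and target joined by a separator, compute the LCP array with Kasai's algorithm, and read off the longest match from adjacent suffixes that come from different sequences. The quadratic suffix-array construction below is an assumed simplification for clarity; LMSRGC constructs the SA on GPUs.

```python
def suffix_array(s: str):
    # Naive O(n^2 log n) construction, fine for a sketch.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s: str, sa):
    """Kasai's algorithm: lcp[r] = longest common prefix of sa[r] and sa[r-1]."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

def longest_match(ref: str, tgt: str):
    """Longest substring of tgt occurring in ref: adjacent SA entries from
    different sides of the separator with the maximum LCP value."""
    s = ref + "#" + tgt              # '#' assumed absent from both sequences
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    best = (0, -1, -1)               # (length, ref_pos, tgt_pos)
    for r in range(1, len(s)):
        a, b = sa[r - 1], sa[r]
        if (a < len(ref)) != (b < len(ref)):
            ref_pos = a if a < len(ref) else b
            tgt_pos = (b if a < len(ref) else a) - len(ref) - 1
            if lcp[r] > best[0]:
                best = (lcp[r], ref_pos, tgt_pos)
    return best

print(longest_match("ACGTACGTAA", "GGACGTACGT"))   # (8, 0, 2)
```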
Affiliation(s)
- Zhiwen Lu, School of Information, Yunnan University, Kunming, China
- Lu Guo, Yunnan Physical Science and Sports Professional College, Kunming, China
- Jianhua Chen, School of Information, Yunnan University, Kunming, China
- Rongshu Wang, School of Information, Yunnan University, Kunming, China

3. Deorowicz S, Danek A, Li H. AGC: compact representation of assembled genomes with fast queries and updates. Bioinformatics 2023; 39:btad097. [PMID: 36864624] [PMCID: PMC9994791] [DOI: 10.1093/bioinformatics/btad097]
Abstract
MOTIVATION High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. RESULTS Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly reduced cost and improved accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data. AVAILABILITY AND IMPLEMENTATION The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sebastian Deorowicz, Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland
- Agnieszka Danek, Department of Algorithmics and Software, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland
- Heng Li, Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA

4. Yao H, Hu G, Liu S, Fang H, Ji Y. SparkGC: Spark based genome compression for large collections of genomes. BMC Bioinformatics 2022; 23:297. [PMID: 35879669] [PMCID: PMC9310413] [DOI: 10.1186/s12859-022-04825-5]
Abstract
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate enormous amounts of genomic data, which moreover continue to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on a single worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.
Affiliation(s)
- Haichang Yao, School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023, China
- Guangyong Hu, School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023, China
- Shangdong Liu, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Houzhi Fang, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Yimu Ji, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Jiangsu HPC and Intelligent Processing Engineer Research Center, Nanjing 210003, China; Institute of High Performance Computing and Bigdata, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

5. A Hybrid Data-Differencing and Compression Algorithm for the Automotive Industry. Entropy 2022; 24:e24050574. [PMID: 35626459] [PMCID: PMC9140898] [DOI: 10.3390/e24050574]
Abstract
We propose an innovative delta-differencing algorithm that combines software-updating methods with LZ77 data compression. This software-updating method relates to server-side software that creates binary delta files and to client-side software that performs software-update installations. The proposed algorithm creates binary-differencing streams that are already compressed from the initial phase. We present a software-updating method suitable for OTA software updates, along with the method's basic strategies for achieving better performance in terms of speed, compression ratio, or a combination of both. A comparison with publicly available solutions is provided. Our test results show that our method, Keops, can outperform an LZMA (Lempel–Ziv–Markov chain algorithm) based binary-differencing solution in compression ratio by more than 3% in two cases while being two to five times faster in decompression. We also show experimentally that the difference between Keops and other competing delta creators increases when larger history buffers are used. In one case, we achieve a delta rate three times better than the competing delta rates.
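As a rough, hedged illustration of the copy/add structure that delta-differencing formats share (the abstract does not specify the Keops format, so the opcode layout below is invented), here is a toy delta encoder and decoder built on Python's difflib:

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Describe `new` as copy-from-old and add-literal operations."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))     # reuse a run of old bytes
        elif tag in ("replace", "insert"):
            ops.append(("add", new[j1:j2]))       # ship literal new bytes
        # 'delete' needs no op: those old bytes are simply never copied
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)

old = b"firmware v1: boot loader, kernel, applications"
new = b"firmware v2: boot loader, kernel+patch, applications"
assert apply_delta(old, make_delta(old, new)) == new
```

In a real updater the ops list would then be serialized and handed to an LZ-style compressor, which is precisely the combination the abstract describes.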

6. Tang T, Li J. Comparative studies on the high-performance compression of SARS-CoV-2 genome collections. Brief Funct Genomics 2022; 21:103-112. [PMID: 34889452] [DOI: 10.1093/bfgp/elab041]
Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is mutating rapidly worldwide. The mutated strains have been promptly sequenced by labs around the world, accumulating a huge number of viral genome sequences that are open to the public for biomedical research such as mRNA vaccine design and drug recommendation. Transmitting these millions of genome sequences without compression is inefficient. In this study, we benchmark the performance of reference-free and reference-based compression algorithms on SARS-CoV-2 genome collections extracted from NCBI. Experimental results show that reference-based two-level compression is the most suitable approach, achieving the best compression ratios of 1019.33-fold for compressing 132,372 genomes and 949.73-fold for compressing 416,238 genomes. This enormous file-size reduction and efficient decompression have enabled a 5-min download and decompression of 100,000 SARS-CoV-2 genomes. As compression of datasets containing such large numbers of genomes has seldom been explored before, our comparative analysis of the state-of-the-art compression algorithms provides practical guidance for selecting compression tools and their parameters, such as reference genomes, to compress viral genome databases with similar characteristics. We also suggest a genome clustering approach using multiple references for better compression. It is anticipated that the increased availability of SARS-CoV-2 genome datasets will make biomedical research more productive.
Affiliation(s)
- Tao Tang, School of Modern Posts, Nanjing University of Posts and Telecommunications; Data Science Institute, Faculty of Engineering and IT, University of Technology Sydney, 15 Broadway, NSW 2007, Australia
- Jinyan Li, Data Science Institute, Faculty of Engineering and IT, University of Technology Sydney, 15 Broadway, NSW 2007, Australia

7.
Abstract
BACKGROUND Genomes within the same species show high similarity, which is exploited by specialized multiple-genome compressors. The existing algorithms and tools are, however, targeted at large, e.g., mammalian, genomes, and their performance on bacterial strains is rather moderate. RESULTS In this work, we propose MBGC, a specialized genome compressor that makes use of the specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, as well as careful management of a reference buffer in a multi-threaded implementation. Our tool is not only compression-efficient but also fast. On a collection of 168,311 bacterial genomes totalling 587 GB, we achieve a compression ratio of approximately a factor of 1,265 and compression (respectively, decompression) speed of ~1,580 MB/s (respectively, 780 MB/s) using 8 hardware threads on a computer with a 14-core/28-thread CPU and a fast SSD, being almost 3 times more succinct and more than 6 times faster in compression than the next best competitor.
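The reverse-complemented matching mentioned above is easy to state in code. This is a generic sketch, not MBGC's implementation; the example sequences are arbitrary.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]

def strand_of_match(chunk: str, ref: str):
    """Does `chunk` occur in the reference directly or reverse-complemented?"""
    if chunk in ref:
        return "direct"
    if revcomp(chunk) in ref:
        return "reverse-complement"
    return None

print(strand_of_match("TTACG", "ACGTACGTAA"))   # reverse-complement
```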
Affiliation(s)
- Szymon Grabowski, Institute of Applied Computer Science, Lodz University of Technology, ul. Stefanowskiego 18, 90-537 Lodz, Poland
- Tomasz M Kowalski, Institute of Applied Computer Science, Lodz University of Technology, ul. Stefanowskiego 18, 90-537 Lodz, Poland

8.
Abstract
Motivation The size of a genome graph—the space required to store the nodes, node labels and edges—affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs. Results We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel–Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit. Availability The RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yutong Qiu, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Carl Kingsford, Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA

9. ER-index: a referential index for encrypted genomic databases. Information Systems 2021. [DOI: 10.1016/j.is.2020.101668]

10. Silva M, Pratas D, Pinho AJ. Efficient DNA sequence compression with neural networks. GigaScience 2020; 9:giaa119. [PMID: 33179040] [PMCID: PMC7657843] [DOI: 10.1093/gigascience/giaa119]
Abstract
BACKGROUND The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7 to 3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS GeCo3 is a genomic sequence compressor with a neural-network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
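To convey the flavour of mixing several context models, the toy below blends two per-symbol probability streams with likelihood-proportional weights and reports the resulting code length in bits. GeCo3 replaces such fixed-rule weighting with a small neural network, so the update rule, eta, and the example streams are all illustrative assumptions.

```python
import math

def mixed_code_length(prob_streams, eta=1.0):
    """Bits to encode a symbol stream when several models are mixed with
    likelihood-proportional weights (eta = 1 gives a Bayesian mixture).
    prob_streams[m][t] is model m's probability of the symbol seen at step t."""
    n_models = len(prob_streams)
    weights = [1.0 / n_models] * n_models
    bits = 0.0
    for t in range(len(prob_streams[0])):
        p_mix = sum(w * ps[t] for w, ps in zip(weights, prob_streams))
        bits += -math.log2(p_mix)                    # ideal code length
        weights = [w * ps[t] ** eta for w, ps in zip(weights, prob_streams)]
        total = sum(weights)
        weights = [w / total for w in weights]       # renormalize
    return bits

good = [0.9] * 100     # model that predicts the observed symbols well
bad = [0.3] * 100      # weaker model
print(mixed_code_length([good, bad]))   # ~16.2 bits; the good model alone needs ~15.2
```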
Affiliation(s)
- Milton Silva, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Armando J Pinho, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal; Department of Electronics, Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal

11. Liu Y, Wong L, Li J. Allowing mutations in maximal matches boosts genome compression performance. Bioinformatics 2020; 36:4675-4681. [PMID: 33118018] [DOI: 10.1093/bioinformatics/btaa572]
Abstract
MOTIVATION A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs inside a maximal match, it breaks the match into shorter segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match that is allowed to contain mutations. RESULTS We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover the mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/memRGC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
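The core idea of letting a match run across isolated substitutions can be sketched in a few lines. This toy is only an illustration of mutation-containing matches; memRGC's coprime double-window k-mer search and its exact extension policy are more elaborate.

```python
def extend_through_mismatches(ref: str, tgt: str, r: int, t: int,
                              length: int, max_mut: int = 3):
    """Grow an exact seed (ref[r:r+length] == tgt[t:t+length]) rightwards,
    tolerating isolated substitutions that are followed by renewed agreement."""
    muts = 0
    while r + length + 1 < len(ref) and t + length + 1 < len(tgt) and muts < max_mut:
        if ref[r + length] == tgt[t + length]:
            length += 1                               # exact extension
        elif ref[r + length + 1] == tgt[t + length + 1]:
            muts += 1                                 # accept one substitution
            length += 1
        else:
            break                                     # two disagreements: stop
    return length, muts

ref = "ACGTACGTACGT"
tgt = "ACGTACCTACGT"      # single substitution at offset 6
print(extend_through_mismatches(ref, tgt, 0, 0, 4))   # (11, 1)
```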
Affiliation(s)
- Yuansheng Liu, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Limsoon Wong, School of Computing, National University of Singapore, Singapore 117417, Singapore
- Jinyan Li, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia

12. Shi W, Chen J, Luo M, Chen M. High efficiency referential genome compression algorithm. Bioinformatics 2019; 35:2058-2065. [PMID: 30407493] [DOI: 10.1093/bioinformatics/bty934]
Abstract
MOTIVATION With the development and increasingly widespread application of next-generation sequencing (NGS) technologies, genome sequencing has become faster and cheaper, creating a massive amount of genome sequence data that still grows at an explosive rate. The time and cost of transmitting, storing, processing, and analyzing these genetic data have become bottlenecks that hinder the development of genetics and biomedicine. Although there are many general-purpose data compression algorithms, they are not effective for genome sequences because they cannot exploit the inherent characteristics of genome sequence data. Therefore, developing a fast and efficient compression algorithm specific to genome data is an important and pressing issue. RESULTS We have developed a referential lossless genome data compression algorithm with better performance than previous algorithms. Using a carefully designed mechanism for selecting between matching strategies, the advantages of local matching and global matching are combined to improve the description efficiency of the matched substrings. The effects of the length and the position of matched substrings on compression efficiency are jointly taken into consideration. The proposed algorithm can compress the FASTA data of complete human genomes, each of which is about 3 GB, in about 18 min. The compressed file sizes range from a few megabytes to about forty megabytes. The average compression ratio is higher than that of the state-of-the-art genome compression algorithms, and the time complexity is of the same order as that of the best-known algorithms. AVAILABILITY AND IMPLEMENTATION https://github.com/jhchen5/SCCG. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wei Shi, School of Information, Yunnan University, Kunming, China
- Jianhua Chen, School of Information, Yunnan University, Kunming, China
- Mao Luo, School of Information, Yunnan University, Kunming, China
- Min Chen, Information Security College, Yunnan Police College, Kunming, China

13. Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: a systematic literature review. PLoS One 2020; 15:e0232942. [PMID: 32453750] [PMCID: PMC7250429] [DOI: 10.1371/journal.pone.0232942]
Abstract
The recent decreases in the cost and time required to sequence and assemble complete genomes have created an increased demand for data storage. As a consequence, several strategies for compressing assembled biological data have been created. Vertical compression tools implement strategies that take advantage of the high similarity between multiple assembled genomic sequences to achieve better compression. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of phases of preprocessing, transformation, and data encoding. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45,726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: to present a lossless vertical compression tool; to use the information contained in other sequences for the compression; to be able to manipulate genomic sequences in FASTA format; and to require no prior knowledge. Although we extracted compression performance results, they were not compared, as the tools did not use a standardized evaluation protocol. We conclude that the field lacks a well-defined evaluation protocol to be applied by each tool.
Affiliation(s)
- Kelvin V. Kredens, Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Juliano V. Martins, Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Osmar B. Dordal, Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil
- Mauri Ferrandin, Department of Control, Automation and Computing Engineering, Universidade Federal de Santa Catarina (UFSC), Blumenau, Brazil
- Roberto H. Herai, Graduate Program in Health Sciences, School of Medicine, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Paraná, Brazil
- Edson E. Scalabrin, Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Bráulio C. Ávila, Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil

14. HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data. BioMed Research International 2019; 2019:3108950. [PMID: 31915686] [PMCID: PMC6930768] [DOI: 10.1155/2019/3108950]
Abstract
With the maturity of genome sequencing technology, huge numbers of sequence reads as well as assembled genomes are being generated. With this explosive growth, the storage and transmission of genomic data face enormous challenges. FASTA, one of the main storage formats for genome sequences, is widely used in GenBank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement: for example, compression ratio and speed are not high or robust enough, and memory consumption is not ideal. Therefore, improving the efficiency, robustness, and practicability of genomic data compression is of great significance for further reducing the storage and transmission cost of genomic data and for promoting the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method able to compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. A large number of experiments fully evaluated the performance of HRCM. Experimental verification shows that HRCM is superior to the best-known methods in genome batch compression. Moreover, HRCM's memory consumption is relatively low, and it can be deployed on standard PCs.

15. Tang T, Liu Y, Zhang B, Su B, Li J. Sketch distance-based clustering of chromosomes for large genome database compression. BMC Genomics 2019; 20:978. [PMID: 31888458] [PMCID: PMC6939838] [DOI: 10.1186/s12864-019-6310-0]
Abstract
BACKGROUND The rapid development of next-generation sequencing technologies enables sequencing genomes at low cost. The dramatically increasing amount of sequencing data has raised crucial needs for efficient compression algorithms. Reference-based compression algorithms have exhibited outstanding performance in compressing single genomes. However, for the more challenging and more useful problem of compressing a large collection of n genomes, straightforward application of these reference-based algorithms suffers from a series of issues, such as difficult reference selection and remarkable performance variation. RESULTS We propose an efficient clustering-based reference selection algorithm for reference-based compression within separate clusters of the n genomes. This method clusters the genomes into subsets of highly similar genomes using MinHash sketch distance and uses the centroid sequence of each cluster as the reference genome for an outstanding reference-based compression of the remaining genomes in each cluster. A final reference is then selected from these reference genomes for the compression of the remaining reference genomes. Our method significantly improved the performance of the state-of-the-art compression algorithms on large-scale human and rice genome databases containing thousands of genome sequences. The compression ratio gain can reach up to 20-30% in most cases for the datasets from NCBI, the 1000 Genomes Project, and the 3000 Rice Genomes Project. The best improvement boosts the performance from 351.74-fold compression to 443.51-fold. CONCLUSIONS The compression ratio of reference-based compression on large-scale genome datasets can be improved via reference selection by applying appropriate data preprocessing and clustering methods. Our algorithm provides an efficient way to compress a large genome database.
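A minimal MinHash sketch distance of the kind used for the clustering step can be written compactly; the k-mer length, sketch size, and hash function below are illustrative choices, not the paper's parameters.

```python
import hashlib
import random

def minhash_sketch(seq: str, k: int = 12, size: int = 64):
    """Bottom-`size` sketch: the smallest hash values over all k-mers."""
    hashes = {int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                             digest_size=8).digest(), "big")
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:size])

def sketch_distance(a: set, b: set, size: int = 64) -> float:
    """One minus the bottom-k Jaccard estimate (a Mash-style distance)."""
    bottom = set(sorted(a | b)[:size])
    return 1.0 - len(bottom & a & b) / len(bottom)

random.seed(7)
s1 = "".join(random.choice("ACGT") for _ in range(5000))
s2 = s1[:2500] + "AAAA" + s1[2500:]        # small insertion
print(sketch_distance(minhash_sketch(s1), minhash_sketch(s1)))   # 0.0
print(sketch_distance(minhash_sketch(s1), minhash_sketch(s2)))   # near 0: highly similar
```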
Affiliation(s)
- Tao Tang, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, Sydney, NSW 2007, Australia
- Yuansheng Liu, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, Sydney, NSW 2007, Australia
- Buzhong Zhang, School of Computer and Information, Anqing Normal University, Anqing 246401, China
- Benyue Su, School of Computer and Information, Anqing Normal University, Anqing 246401, China
- Jinyan Li, Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, Sydney, NSW 2007, Australia

16. Navarro G, Sepúlveda V, Marín M, González S. Compressed filesystem for managing large genome collections. Bioinformatics 2019; 35:4120-4128. [DOI: 10.1093/bioinformatics/btz192]
Abstract
Motivation
Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, these repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be usable only for archival: it must allow direct access to the sequences, ideally in transparent form, so that applications do not need to be rewritten.
Results
We present a highly compressed filesystem that specializes in storing large collections of genomes and reads. The system obtains orders-of-magnitude compression by using Relative Lempel-Ziv, which exploits the high similarity between genomes of the same species. The filesystem transparently stores the files in compressed form, intercepting the applications' system calls without the need to modify the applications. A client/server variant of the system stores the compressed files on a server, while the client's filesystem transparently retrieves and updates the data from the server. The data between client and server are also transferred in compressed form, which saves an order of magnitude in network time.
Availability and implementation
The C++ source code of our implementation is available for download in https://github.com/vsepulve/relz_fs.
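For readers unfamiliar with Relative Lempel-Ziv, the toy parser below factorizes a target into (position, length) phrases against a fixed reference by greedy longest-prefix matching. Real RLZ implementations index the reference with a suffix array or FM-index instead of rescanning it, and the literal-phrase convention here is an assumption.

```python
def rlz_parse(target: str, ref: str):
    """Greedy RLZ: emit (ref_pos, length) phrases; a character absent from
    the reference becomes a literal phrase (char, 0)."""
    phrases, i = [], 0
    while i < len(target):
        length, pos = 0, -1
        lo, hi = 1, len(target) - i
        while lo <= hi:                    # binary search on match length:
            mid = (lo + hi) // 2           # "prefix occurs in ref" is monotone
            p = ref.find(target[i:i + mid])
            if p >= 0:
                length, pos = mid, p
                lo = mid + 1
            else:
                hi = mid - 1
        if length == 0:
            phrases.append((target[i], 0))
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases

print(rlz_parse("ACGTTTACGA", "ACGTACGT"))
# [(0, 4), (3, 1), (3, 4), (0, 1)]
```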
Affiliation(s)
- Gonzalo Navarro, CeBiB (Center for Biotechnology and Bioengineering), Santiago, Chile; Department of Computer Science, University of Chile, Santiago, Chile
- Víctor Sepúlveda, CeBiB (Center for Biotechnology and Bioengineering), Santiago, Chile
- Mauricio Marín, CeBiB (Center for Biotechnology and Bioengineering), Santiago, Chile; DIINF, University of Santiago, Santiago, Chile
- Senén González, CeBiB (Center for Biotechnology and Bioengineering), Santiago, Chile

17.
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez, Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin, Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa, Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA

18. Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights 2019; 13:1177932218821373. [PMID: 30792576] [PMCID: PMC6376532] [DOI: 10.1177/1177932218821373]
Abstract
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle the storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not been able to harness such potential yet. To reach that goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme consists of a referential compressor based on the combination of local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads, and competitive compression ratios for shorter reads, when compared to the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
Affiliation(s)
- Aníbal Guerra, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela; Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza, Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela

19. Bianchi L, Liò P. Opportunities for community awareness platforms in personal genomics and bioinformatics education. Brief Bioinform 2018; 18:1082-1090. [PMID: 27580620] [DOI: 10.1093/bib/bbw078]
Abstract
Precision and personalized medicine will be increasingly based on the integration of various types of information, particularly electronic health records and genome sequences. The availability of cheap genome sequencing services and information interoperability will increase the role of online bioinformatics analysis. Being on the Internet poses constant threats to security and privacy. While we are connected and sharing information, websites and internet services collect various types of personal data, with or without the user's consent. It is likely that genomics will merge with the internet culture of connectivity. This process will increase incidental findings, exposure and vulnerability. Here we discuss the social vulnerability owing to the combined security and privacy weaknesses of genome data and the Internet. This urges more effort in education and social awareness of how biomedical data are analysed and transferred through the internet, and of how inferential methods can integrate information from different sources. We propose that digital social platforms, used for raising collective awareness in different fields, could be developed for collaborative and bottom-up efforts in education. In this context, bioinformaticians could play a meaningful role in mitigating the future risk of a digital-genomic divide.

20. Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. Entropy 2018; 20:e20060393. [PMID: 33265483] [PMCID: PMC7512912] [DOI: 10.3390/e20060393]
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare the information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we directly compare two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions: the NCD measures how similar two strings are (in terms of information content), while the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. To compute both, we use a state-of-the-art DNA sequence compressor that we benchmark against some top compressors in different compression modes. We then apply the compressor to DNA sequences of different scales and natures, first using synthetic sequences and then real DNA sequences. The latter include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely the observation and confirmation, across whole genomes, of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
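Both measures are defined over compressed sizes: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and the NRC of x given y is C(x||y) / (|x| log2 |A|), where C(x||y) is the size of x compressed using only models built from y and A is the alphabet. The sketch below computes the NCD with zlib standing in for the DNA-specific compressor used in the paper; a general-purpose compressor only approximates the measure coarsely, and the sequences are synthetic examples.

```python
import zlib

def c(x: bytes) -> int:
    """Approximate information content by compressed size."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

x = b"ACGT" * 500
y = b"ACGT" * 450 + b"GGCC" * 50
print(round(ncd(x, x), 3))   # near 0: a copy adds almost no information
print(round(ncd(x, y), 3))   # larger: y contains information absent from x
```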

23. Deorowicz S, Grabowski S, Ochoa I, Hernaez M, Weissman T. Comment on: 'ERGC: an efficient referential genome compression algorithm'. Bioinformatics 2016; 32:1115-1117. [PMID: 26615213] [DOI: 10.1093/bioinformatics/btv704]
Abstract
MOTIVATION Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors. RESULTS We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of the human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (reference or target) contains mixed-case letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp. CONTACT sebastian.deorowicz@polsl.pl, iochoa@stanford.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sebastian Deorowicz, Institute of Informatics, Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland
- Szymon Grabowski, Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Poland
- Idoia Ochoa, Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
- Mikel Hernaez, Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
- Tsachy Weissman, Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA

24. Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 2016; 14:1630002. [PMID: 26846812] [DOI: 10.1142/s0219720016300021]
Abstract
Advances in high-throughput sequencing technologies and reductions in the cost of sequencing have led to exponential growth in high-throughput DNA sequence data. This growth has posed challenges such as the storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genome and read compression. Algorithms are categorized as referential or reference-free. Experimental results and comparative analyses of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
Affiliation(s)
- Muhammad Sardaraz, Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
- Muhammad Tahir, Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
- Ataul Aziz Ikram, Department of Electrical Engineering, National University, Islamabad 44000, Pakistan

25. Wandelt S, Leser U. Sequence Factorization with Multiple References. PLoS One 2015; 10:e0139000. [PMID: 26422374] [PMCID: PMC4589410] [DOI: 10.1371/journal.pone.0139000]
Abstract
The success of high-throughput sequencing has led to an increasing number of projects that sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between the input sequence and a reference sequence, have gained much interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., regarding main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings that greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
Affiliation(s)
- Sebastian Wandelt, Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
- Ulf Leser, Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany