1
Wang L, Ding R, He S, Wang Q, Zhou Y. A Pipeline for Constructing Reference Genomes for Large Cohort-Specific Metagenome Compression. Microorganisms 2023; 11:2560. PMID: 37894218; PMCID: PMC10609127; DOI: 10.3390/microorganisms11102560.
Abstract
Metagenomic data compression is increasingly important as metagenomic projects face larger data volumes per sample and growing sample counts. Reference-based compression is a promising way to achieve a high compression ratio. However, existing microbial reference genome databases are unsuitable as direct compression references because of their large size and redundancy, and different metagenomic cohorts often have distinct microbial compositions. We present a novel pipeline that generates simplified, cohort-tailored reference genomes, enabling reference-based compression of metagenomic data. We constructed customized reference genomes, ranging from 2.4 to 3.9 GB, for 29 real metagenomic datasets and evaluated their compression performance. Reference-based compression achieved a compression ratio of over 20 for human whole-genome data and up to 33.8 across all samples, a 4.5-fold improvement over standard Gzip compression. Our method provides new insights into reference-based metagenomic data compression and has broad application potential for faster and cheaper data transfer, storage, and analysis.
Affiliation(s)
- Linqi Wang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China
- Renpeng Ding
- MGI Tech, Shenzhen 518083, China
- Shixu He
- MGI Tech, Shenzhen 518083, China
- Qinyu Wang
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China
- Yan Zhou
- State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai 200438, China
- MGI Tech, Shenzhen 518083, China
2
Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023; 39:i252-i259. PMID: 37387170; DOI: 10.1093/bioinformatics/btad225.
Abstract
MOTIVATION The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants with scalability to collections of up to 10,000 eukaryotic samples. RESULTS Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6-fold improvement in construction time over other compressed methods of comparable index size. A PAC query can require as little as a single random access and be performed in constant time in favorable instances. Using limited computational resources, we built PAC for very large collections, including 32,000 human RNA-seq samples in 5 days and the entire GenBank bacterial genome collection in a single day, for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also demonstrated PAC's ability to query 500,000 transcript sequences in less than an hour. AVAILABILITY AND IMPLEMENTATION PAC's open-source software is available at https://github.com/Malfoy/PAC.
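Approximate membership query structures of the kind PAC builds on are typically assembled from Bloom filters over k-mer sets. As a minimal illustrative sketch (not PAC's actual implementation), a Bloom filter answering k-mer presence queries can look like this:

```python
import hashlib

class BloomFilter:
    # Bit array plus several hash functions; queries may return false
    # positives but never false negatives.
    def __init__(self, size_bits=4096, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def kmers(seq, k):
    # Slide a window of length k over the sequence.
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

dataset = "ACGTACGGTTAACGTC"
index = BloomFilter()
for km in kmers(dataset, k=5):
    index.add(km)

print("ACGTA" in index)  # k-mer present in the dataset -> True
print("GGGGG" in index)  # absent -> False with high probability
```

Structures like PAC then organize many such filters (one per dataset) so a query k-mer can be tested against thousands of samples at once.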
Affiliation(s)
- Camille Marchet
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Antoine Limasset
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
3
Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. PMID: 36476810; PMCID: PMC10204111; DOI: 10.1038/s41576-022-00551-z.
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences and to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand sequencing throughput at lower cost is that analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators, and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and to set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
4
SFQ: Constructing and Querying a Succinct Representation of FASTQ Files. Electronics 2022. DOI: 10.3390/electronics11111783.
Abstract
A large and ever-increasing quantity of high-throughput sequencing (HTS) data is stored in FASTQ files. Various compression methods are used to mitigate storage and transmission costs, from the still-prevalent general-purpose Gzip to state-of-the-art specialized methods. However, all existing methods for FASTQ file compression require a decompression stage before the HTS data can be used, which is particularly costly for random access to specific records in FASTQ files. We propose the sFASTQ format, a succinct representation of FASTQ files that can be used without decompression (i.e., records can be retrieved and listed online) and that supports random access to individual records. The sFASTQ format can be searched on disk, which eliminates the need for additional memory resources. The searchable sFASTQ archive is comparable in size to the corresponding Gzip file. The sFASTQ format outputs (interleaved) FASTQ records to the STDOUT stream. We provide SFQ, software for constructing and using the sFASTQ format, which supports variable-length reads, pairing of records, and both lossless and lossy compression of quality scores.
5
Huo H, Liu P, Wang C, Jiang H, Vitter JS. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 2022; 38:335-343. PMID: 34524416; DOI: 10.1093/bioinformatics/btab655.
Abstract
MOTIVATION Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures, and the tables REF and Rγ, to achieve minimal space usage and fast retrieval on compressed FASTQ files. Experiments on real, publicly available datasets from various sequencing instruments demonstrate that the proposed index substantially outperforms existing state-of-the-art solutions. For count, locate, and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16, 1.44-35.57, and 1.3-55.4 times, respectively. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION The software is available on GitHub: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Pengfei Liu
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Chenhui Wang
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Hongbo Jiang
- Department of Computer Science, Xidian University, Xi'an 710071, China
6
Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374; DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous amount of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of raw FASTQ sequencing data, but these remain suboptimal. Long-read sequencing has become dominant in genomics, whereas most existing compression methods focus only on short-read sequencing. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model to reduce FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS, provided as a Docker image for ease of installation and execution. We compared our method with major existing FASTQ compression tools on benchmark datasets, including new long-read sequencing data in this validation. FastqCLS outperformed the other tools in compression ratio for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dohyeon Lee
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song
- School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
7
Huo H, Chen X, Guo X, Vitter JS. Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2394-2408. PMID: 31985436; DOI: 10.1109/tcbb.2020.2968323.
Abstract
In this paper, we focus on the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection G of t sequences S_i of length n each, we can represent G succinctly in 2nH_k(T) + O(n' log log n) + o(qn') + o(tn) bits using O(tn² + qn') time, where H_k(T) is the kth-order empirical entropy of the sequence T ∈ G used as the reference sequence, n' is the total number of variations between T and the sequences in G, and q is a small fixed constant. We can restore any substring S[sp, ..., sp + len - 1] of length len of S ∈ G in O(n'_s + len·(log n)²/log log n) time, and report all positions where a pattern P of length m occurs in G in O(m·t + occ·t·(log n)²/log log n) time. In addition, we propose a dynamic programming method to find the variations between T and the sequences in G in a space-efficient way, with which we can build succinct structures that enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
8
Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1958-1969. PMID: 31869798; DOI: 10.1109/tcbb.2019.2959273.
Abstract
The dropping cost of sequencing human DNA has allowed fast development of several projects around the world generating huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study, we address the compression of SAM files, the standard output format for DNA alignment. We specifically study lossy compression techniques for the quality values reported in the SAM file and analyze the impact of such techniques on the CRAM format. We present a series of experiments using a dataset corresponding to individual NA12878 at three different fold coverages. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON, and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
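Quality-value binning of the kind compared in this study reduces the alphabet of Phred scores so that downstream entropy coders compress better. A minimal sketch, assuming a hypothetical 8-level bin table loosely modeled on Illumina-style binning (the boundaries and representatives below are illustrative, not the paper's):

```python
# Hypothetical 8-level bin table: (low, high, representative score).
# Boundaries are an assumption for illustration only.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

def bin_quality(qual, offset=33):
    """Map each Phred+33 quality character to its bin representative."""
    out = []
    for ch in qual:
        q = ord(ch) - offset  # recover the numeric Phred score
        for lo, hi, rep in BINS:
            if lo <= q <= hi:
                out.append(chr(rep + offset))
                break
    return "".join(out)

# Fewer distinct symbols -> lower entropy -> better compression.
print(bin_quality("IIIFFF##"))  # 'I' = Q40, 'F' = Q37, '#' = Q2
```

Dynamic binning, as the abstract describes, differs in that bin boundaries are chosen from the data rather than fixed in advance.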
9
Tahir M, Sardaraz M, Mehmood Z, Khan MS. ESREEM: Efficient Short Reads Error Estimation Computational Model for Next-generation Genome Sequencing. Curr Bioinform 2021. DOI: 10.2174/1574893615999200614171832.
Abstract
Aims:
To assess the error profile in NGS data generated by high-throughput sequencing machines.
Background:
Short-read sequencing data from Next Generation Sequencing (NGS) are currently being generated by a number of research projects. Depicting the errors produced by NGS platforms and expressing accurate genetic variation from reads are two inter-dependent phases. This has high significance in various analyses, such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. Systematic and random errors show a characteristic incidence profile for each sequencing platform, i.e., Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver enormous volumes of data along with errors. Some of these errors may mimic genuine biological signals, i.e., mutations, and may consequently distort the results. Various independent applications have been proposed to correct sequencing errors. Systematic analysis of these algorithms shows that state-of-the-art models are missing.
Objective:
In this paper, an efficient error estimation computational model called ESREEM is proposed to assess the error rates in NGS data.
Methods:
The proposed model builds on the observation that there exists a linear regression association between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with a Hidden Markov Model (HMM).
Results:
The proposed model is evaluated on several benchmark datasets and the results are compared with state-of-the-art algorithms.
Conclusions:
Experimental results show that the proposed model efficiently estimates errors and runs in less time compared to others.
Affiliation(s)
- Muhammad Tahir
- Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
- Muhammad Sardaraz
- Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
- Zahid Mehmood
- Department of Software Engineering, University of Engineering and Technology, Taxila, Pakistan
- Muhammad Saud Khan
- Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock, Pakistan
10
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. PMID: 33328168; PMCID: PMC7849385; DOI: 10.1101/gr.260604.119.
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached the petabyte scale, repositories do not allow one to perform online sequence searches; yet such a feature would be highly useful to investigators. Toward this goal, several computational approaches have been introduced in the last few years to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Affiliation(s)
- Camille Marchet
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
- Simon J Puglisi
- Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
- Paul Medvedev
- Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Mikaël Salson
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
- Rayan Chikhi
- Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
11
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 2020; 36:i177-i185. PMID: 32657392; PMCID: PMC7355249; DOI: 10.1093/bioinformatics/btaa487.
Abstract
MOTIVATION In this work, we present REINDEER, a novel computational method that indexes sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. REINDEER then constructs and indexes monotigs, which, in a nutshell, are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
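The abundance information REINDEER records amounts to per-dataset k-mer counts, and the monotig idea groups k-mers of similar abundance so one stored count can represent many k-mers. A simplified sketch of both ideas (the naive counting and grouping here are illustrative, not REINDEER's actual de Bruijn graph and monotig construction):

```python
from collections import Counter

def kmer_abundances(reads, k):
    """Count every k-mer occurrence across a set of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTACGT", "CGTACGTA"]
ab = kmer_abundances(reads, k=4)

# Group k-mers sharing the same abundance, so a single stored count
# can stand for the whole group (a crude stand-in for monotigs).
groups = {}
for km, c in ab.items():
    groups.setdefault(c, []).append(km)

print(ab["ACGT"])      # -> 3
print(sorted(groups))  # -> [2, 3]  (the distinct abundance levels)
```

Real indexes replace the hash table with succinct structures, but the count-then-group pattern is the core of abundance-aware k-mer indexing.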
Affiliation(s)
- Camille Marchet
- CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
- Zamin Iqbal
- European Bioinformatics Institute, Cambridge CB10 1SD, UK
- Daniel Gautheret
- CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette 91190, France
- Mikaël Salson
- CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
- Rayan Chikhi
- Institut Pasteur, CNRS, C3BI – USR 3756, 75015 Paris, France
12
Zhou H, Guo S, Sun Y, Wang H, Zhang M, Li Y. Screening the Action Targets of Enterovirus 71 in Human SH-SY5Y Cells Using RNA Sequencing Data. Viral Immunol 2019; 32:170-178. PMID: 31063043; DOI: 10.1089/vim.2018.0137.
Abstract
Hand, foot, and mouth disease (HFMD) is a common infection in children younger than five years of age. HFMD is mainly induced by coxsackievirus A16 and enterovirus 71 (EV71). EV71-associated HFMD often has serious neurological complications. The purpose of this study was to reveal the mechanisms of action of EV71 on neurons. SH-SY5Y cells transfected or untransfected with EV71 were sequenced. After data preprocessing, differentially expressed genes (DEGs) were screened using the limma package in R, and clustering analysis was then performed using the ComplexHeatmap package in R. The DAVID tool was used for DEG enrichment analysis. Protein-protein interactions (PPIs) were predicted using the STRING database, and PPI networks were then constructed using Cytoscape software. After pathways involving the key PPI network nodes were enriched, pathway deviation scores were calculated. Clustering analysis was also conducted for these pathways. There were 978 DEGs in the transfected samples. Upregulated TNF was enriched in the NF-kappa B signaling pathway. Among the top 20 nodes in the PPI network, CDK1, STAT3, CCND1, TNF, and MYC had the highest degrees. A total of 28 pathways were enriched for the top 20 nodes, including Epstein-Barr virus infection (p = 3.78E-06), proteoglycans in cancer (p = 4.96E-06), and melanoma (p = 1.99E-05). In addition, clustering analysis showed that these pathways could clearly differentiate the two groups of samples. EV71 may affect neurons by mediating CDK1, STAT3, CCND1, TNF, and MYC, indicating that these genes are promising targets for preventing the neuronal complications of HFMD.
Affiliation(s)
- Hong Zhou
- The Respiratory Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Shuzhen Guo
- The Respiratory Medicine, The Center for Disease Control and Prevention of Harbin, Harbin, China
- Yu Sun
- The Respiratory Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Hui Wang
- The Respiratory Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Meiling Zhang
- The Respiratory Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- Yanjun Li
- The Respiratory Medicine, The First Affiliated Hospital of Harbin Medical University, Harbin, China
13
Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights 2019; 13:1177932218821373. PMID: 30792576; PMCID: PMC6376532; DOI: 10.1177/1177932218821373.
Abstract
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle storage capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness such potential. To reach that goal, an efficient encoding model to represent the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor based on combining local alignments with binary encoding optimized for long reads. Here we present the algorithms and performance tests developed for our read compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads, and competitive compression ratios for shorter reads, compared to the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
Affiliation(s)
- Aníbal Guerra
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
14
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors, and stop codons can be misread by tRNAs, which also contributes to codon usage in rare cases. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index of codon adaptation that accounts for background mutation bias (the Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. They are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production; the reanalysis disproves that claim.
15
ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data. G3 (Bethesda) 2017; 7:3839-3848. PMID: 29079682; PMCID: PMC5714481; DOI: 10.1534/g3.117.300271.
Abstract
Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large and can be reduced without loss of information. Each HTS file, either in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among the 44,603,541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited in NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry with the SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed ARSDA software that implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth but also dramatically reduces time in downstream data analysis. Instead of matching the 497,027 identical reads separately against the B. subtilis genome, one only needs to match the read once. ARSDA includes functions that take advantage of HTS data in the new sequence format for downstream analyses such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
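The read-collapsing idea behind the FASTA+ format is simple to sketch: identical reads become one record whose header carries the copy number. A minimal illustration (the header naming below is an assumption in the spirit of SeqID_NumCopy, not ARSDA's exact output):

```python
from collections import Counter

def collapse_reads(reads):
    """Map each distinct read sequence to its copy number."""
    return Counter(reads)

def write_fasta_plus(counts):
    """Emit one record per distinct read, SeqID_NumCopy style
    (illustrative header naming)."""
    lines = []
    for i, (seq, n) in enumerate(sorted(counts.items()), start=1):
        lines.append(f">Read{i}_{n}")  # header carries the copy number
        lines.append(seq)
    return "\n".join(lines)

reads = ["ACGT", "ACGT", "ACGT", "TTGA"]
fa = write_fasta_plus(collapse_reads(reads))
print(fa)
# >Read1_3
# ACGT
# >Read2_1
# TTGA
```

A downstream aligner then maps each distinct sequence once and multiplies by the copy number, which is exactly the time saving the abstract describes.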
16
Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z. LW-FQZip 2: a parallelized reference-based compression of FASTQ files. BMC Bioinformatics 2017; 18:179. PMID: 28320326; PMCID: PMC5359991; DOI: 10.1186/s12859-017-1588-x.
Abstract
BACKGROUND The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decreasing cost of disk storage, and the storage of huge sequencing datasets has become a bottleneck for downstream analyses. Data compression is considered a solution to reduce the dependency on storage, so efficient sequencing-data compression methods are in high demand. RESULTS In this article, we present a lossless reference-based compression method, named LW-FQZip 2, targeted at FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 is equipped with a light-weight mapping model, a bitwise prediction by partial matching model, arithmetic coding, and multi-threading parallelism. LW-FQZip 2 was evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs. CONCLUSIONS This competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2 .
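The mapping idea behind reference-based FASTQ compression can be illustrated with a toy encoder that stores each read as a reference position plus a short mismatch list. All names here are hypothetical and this is not LW-FQZip 2's actual model, which uses a light-weight mapping scheme with prediction and arithmetic coding:

```python
def encode_against_reference(read, ref):
    """Encode a read as (position, [(offset, base)]) against a reference.

    Toy stand-in for reference-based compression: an exact or near-exact
    match costs only a position plus a short mismatch list.
    """
    best = None
    for pos in range(len(ref) - len(read) + 1):
        mismatches = [(i, b) for i, b in enumerate(read) if ref[pos + i] != b]
        if best is None or len(mismatches) < len(best[1]):
            best = (pos, mismatches)
    return best

def decode(pos, mismatches, length, ref):
    """Rebuild the read from the reference slice plus the mismatch list."""
    read = list(ref[pos:pos + length])
    for i, b in mismatches:
        read[i] = b
    return "".join(read)

ref = "ACGTACGTTGCA"
pos, mm = encode_against_reference("ACGTTGGA", ref)
assert decode(pos, mm, 8, ref) == "ACGTTGGA"
```

Real tools replace the brute-force scan with an index and feed the position/mismatch streams to an entropy coder, but the storage trade-off is the same.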
Affiliation(s)
- Zhi-An Huang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Zhenkun Wen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Qingjin Deng
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Ying Chu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Yiwen Sun
- School of Medicine, Shenzhen University, Shenzhen, 518060 China
| | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| |
|
17
|
Dolle DD, Liu Z, Cotten M, Simpson JT, Iqbal Z, Durbin R, McCarthy SA, Keane TM. Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. Genome Res 2016; 27:300-309. [PMID: 27986821 PMCID: PMC5287235 DOI: 10.1101/gr.211748.116] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2016] [Accepted: 12/14/2016] [Indexed: 01/04/2023]
Abstract
We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population-scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing, nor do they take advantage of shared genetic variation. In recent years, the Burrows–Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, finding that 3.2 Mbp with population support was lost in the transition from GRCh37 to GRCh38, while 13.7 Mbp was added. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
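The compression benefit of a population BWT comes from identical and near-identical reads sorting together, so repeated sequence collapses into long character runs. A minimal sorted-rotations sketch, illustrative only; production BWTs over billions of reads use far more efficient construction algorithms:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' as sentinel)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

# Repetitive input yields long runs in the BWT output, so run-length
# coding becomes more effective as more identical reads are added.
print(bwt("ACACACAC"))  # long runs of C and A
```

The FM-index built on top of the same transform is what allows the k-mer and nonreference queries described in the abstract without decompressing the reads.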
Affiliation(s)
- Dirk D Dolle
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Zhicheng Liu
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.,European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Matthew Cotten
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Jared T Simpson
- Ontario Institute for Cancer Research, Toronto, Ontario M5G 0A3, Canada.,Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3G4, Canada
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, Oxford OX3 7BN, United Kingdom
| | - Richard Durbin
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Shane A McCarthy
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
| | - Thomas M Keane
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.,European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom
| |
|
18
|
Xie X, Zhou S, Guan J. CoGI: Towards Compressing Genomes as an Image. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1275-1285. [PMID: 26671800 DOI: 10.1109/tcbb.2015.2430331] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Genomic science is now facing an explosive increase of data owing to the rapid development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer costs, and thus to boost data distribution and utilization efficiency. Up to now, a number of algorithms/tools have been developed for compressing genomic sequences. Unlike the existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (short for Compressing Genomes as an Image) for genome compression, which transforms genomic sequences into a two-dimensional binary image (or bitmap), then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based compressor or a reference-free compressor. For the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves a comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM, a state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. So, CoGI can serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://admis.fudan.edu.cn/projects/cogi.htm.
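The core transformation can be sketched in a few lines: encode each base as two bits and lay the bits out as rows of a binary image, which a bitmap coder can then partition into rectangles. The 2-bit coding below is an assumption for illustration, not necessarily CoGI's exact mapping:

```python
# Assumed 2-bit base coding; CoGI's actual mapping may differ.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def genome_to_bitmap(seq, width):
    """Flatten a DNA sequence into bit rows forming a binary 'image'."""
    bits = [b for base in seq for b in BASE_BITS[base]]
    bits += [0] * (-len(bits) % width)          # zero-pad the last row
    return [bits[i:i + width] for i in range(0, len(bits), width)]

for row in genome_to_bitmap("ACGT", 4):
    print(row)
```

Once the sequence is a bitmap, image-style coders (here, rectangular partition coding) can exploit two-dimensional regularity that one-dimensional string models miss.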
|
19
|
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, Rizk G. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinformatics 2015; 16:288. [PMID: 26370285 PMCID: PMC4570262 DOI: 10.1186/s12859-015-0709-7] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2015] [Accepted: 08/17/2015] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools such as the widely used gzip. RESULTS We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which allows higher compression rates to be obtained without losing information pertinent to downstream analyses. CONCLUSIONS LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq, and metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open source software, distributed under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/.
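The core data structure can be sketched as follows: insert every kmer of the read set into a Bloom filter, which stores the de Bruijn graph nodes probabilistically in fixed space at the cost of occasional false positives. This is a minimal illustrative sketch, not LEON's implementation:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter (illustrative; LEON uses a tuned variant)."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[p] for p in self._positions(item))

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# Store the de Bruijn graph nodes (kmers) of the read set in the filter;
# a read can then be re-encoded as an anchoring kmer plus a traversal.
bloom = Bloom()
for read in ["ACGTAC", "CGTACG"]:
    for km in kmers(read, 4):
        bloom.add(km)

assert "ACGT" in bloom
```

Because the graph is probabilistic, decoding tolerates false-positive edges by recording bifurcation choices explicitly, which is exactly the anchor-plus-bifurcations encoding the abstract describes.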
Affiliation(s)
- Gaëtan Benoit
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Claire Lemaitre
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | | | - Erwan Drezen
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| | - Thibault Dayris
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
| | - Raluca Uricaru
- University of Bordeaux, CNRS/LaBRI, Talence, F-33405, France.
- University of Bordeaux, CBiB, Bordeaux, F-33000, France.
| | - Guillaume Rizk
- INRIA/IRISA/GenScale, Campus de Beaulieu, Rennes, 35042, France.
| |
|
20
|
Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics 2015; 31:1920-8. [PMID: 25649622 PMCID: PMC4481695 DOI: 10.1093/bioinformatics/btv071] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2014] [Accepted: 01/29/2015] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. RESULTS We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. AVAILABILITY AND IMPLEMENTATION Source code and binaries freely available for download at http://www.cs.cmu.edu/∼ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.
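The path-encoding idea can be sketched with a toy de Bruijn graph: a read is stored as its anchor plus only the bases chosen at bifurcations, since non-branching extensions are implied by the graph. A minimal sketch under assumed names, not the authors' implementation (which couples the graph with context-dependent arithmetic coding):

```python
from collections import defaultdict

def build_graph(sequences, k):
    """de Bruijn graph: (k-1)-mer node -> set of possible next bases."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            graph[seq[i:i + k - 1]].add(seq[i + k - 1])
    return graph

def encode_read(read, graph, k):
    """Store the anchor plus only the bases chosen at bifurcations."""
    anchor, choices = read[:k - 1], []
    node = anchor
    for base in read[k - 1:]:
        if len(graph[node]) > 1:        # bifurcation: record the choice
            choices.append(base)
        node = node[1:] + base
    return anchor, choices

def decode_read(anchor, choices, length, graph):
    """Walk the graph, consuming a recorded choice only at bifurcations."""
    read, node, it = anchor, anchor, iter(choices)
    while len(read) < length:
        nxt = graph[node]
        base = next(it) if len(nxt) > 1 else next(iter(nxt))
        read += base
        node = node[1:] + base
    return read

refs = ["ACGTACGA", "ACGTTACG"]
graph = build_graph(refs, 4)
anchor, choices = encode_read("ACGTACGA", graph, 4)
assert decode_read(anchor, choices, 8, graph) == "ACGTACGA"
```

Even with a poorly matched reference the scheme degrades gracefully: more nodes branch, so more choices are recorded, but the read is still recoverable, which mirrors the robustness claim in the abstract.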
Affiliation(s)
- Carl Kingsford
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| | - Rob Patro
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| |
|