1
Sun H, Zheng Y, Xie H, Ma H, Zhong C, Yan M, Liu X, Wang G. PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping. Bioinformatics 2024;40:btae323. PMID: 38759114; PMCID: PMC11139522; DOI: 10.1093/bioinformatics/btae323.
Abstract
MOTIVATION Quality scores data (QSD) account for about 70% of a compressed FastQ file produced by short- and long-read sequencing technologies. Designing QSD compressors that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents PQSDC, a novel parallel lossless QSD-dedicated compression algorithm that meets these requirements well. PQSDC is based on two core components: a parallel sequences-partition model that reduces peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model that improves compression ratio. In addition, PQSDC is designed to be highly concurrent on multicore CPU clusters. RESULTS We evaluated PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, the maximum improvement of PQSDC over the baselines reaches 7.06% in average compression ratio and 8.01% in weighted average compression ratio. During compression and decompression, the maximum total time savings are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%. (2) For long reads, the maximum improvement reaches 12.51% and 13.42% in average and weighted average compression ratio, respectively; the maximum total time savings are 53.51% and 72.53%, and the maximum average memory savings are 19.44% and 17.42%. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is comparatively insensitive to the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation. AVAILABILITY AND IMPLEMENTATION The PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.
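To give a flavor of the run-length idea underlying such QSD transforms, here is a minimal, self-contained Python sketch. It illustrates generic run-length mapping only, not the authors' four-level prediction mapping model, whose specifics are defined in the paper.

```python
# Minimal sketch of run-length mapping for a quality-score string.
# Illustrates the general idea behind run-length-based QSD transforms;
# it is NOT PQSDC's four-level prediction mapping.

def run_length_map(qualities: str):
    """Collapse a quality string into (symbol, run_length) pairs."""
    runs = []
    prev, count = qualities[0], 1
    for ch in qualities[1:]:
        if ch == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = ch, 1
    runs.append((prev, count))
    return runs

def run_length_unmap(runs):
    """Invert run_length_map, restoring the original string losslessly."""
    return "".join(ch * n for ch, n in runs)

qs = "FFFFF:FFFFF,FF:::FFF"
runs = run_length_map(qs)
assert run_length_unmap(runs) == qs
print(runs)  # [('F', 5), (':', 1), ('F', 5), (',', 1), ('F', 2), (':', 3), ('F', 3)]
```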
Affiliation(s)
- Hui Sun
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Yingfeng Zheng
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Haonan Xie
- Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China
- Huidong Ma
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Cheng Zhong
- Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
- Meng Yan
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Xiaoguang Liu
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
- Gang Wang
- Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China
2
Lescroart J, Bonilla-Sánchez A, Napolitano C, Buitrago-Torres DL, Ramírez-Chaves HE, Pulido-Santacruz P, Murphy WJ, Svardal H, Eizirik E. Extensive Phylogenomic Discordance and the Complex Evolutionary History of the Neotropical Cat Genus Leopardus. Mol Biol Evol 2023;40:msad255. PMID: 37987559; PMCID: PMC10701098; DOI: 10.1093/molbev/msad255.
Abstract
Even in the genomics era, the phylogeny of the Neotropical small felids of the genus Leopardus remains contentious. We used whole-genome resequencing data to construct a time-calibrated consensus phylogeny of this group, quantify phylogenomic discordance, test for interspecies introgression, and assess patterns of genetic diversity and demographic history. We infer that the Leopardus radiation started in the Early Pliocene as an initial speciation burst, followed by another in its subgenus Oncifelis during the Early Pleistocene. Our findings challenge the long-held notion that the ocelot (Leopardus pardalis) and margay (L. wiedii) are sister species and instead indicate that the margay is most closely related to the enigmatic Andean cat (L. jacobita), whose whole-genome data are reported here for the first time. In addition, we found that the newly sampled Andean tiger cat (L. tigrinus pardinoides) population from Colombia associates closely with Central American tiger cats (L. tigrinus oncilla). Genealogical discordance was largely attributable to incomplete lineage sorting, yet was augmented by strong gene flow between the ocelot and the ancestral branch of Oncifelis, as well as between Geoffroy's cat (L. geoffroyi) and the southern tiger cat (L. guttulus). Contrasting demographic trajectories have led to disparate levels of current genomic diversity, with a nearly tenfold difference in heterozygosity between the Andean cat and the ocelot, spanning the entire range of variability found in extant felids. Our analyses improve our understanding of the speciation history and diversity patterns of this felid radiation, and highlight the benefits to phylogenomic inference of embracing the many heterogeneous signals scattered across the genome.
Affiliation(s)
- Jonas Lescroart
- Department of Biology, University of Antwerp, Antwerp, Belgium
- School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil
- Alejandra Bonilla-Sánchez
- School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil
- Faculty of Exact and Natural Sciences, University of Antioquia, Medellín, Colombia
- Constanza Napolitano
- Department of Biological Sciences and Biodiversity, University of Los Lagos, Osorno, Chile
- Institute of Ecology and Biodiversity, Concepción, Chile
- Cape Horn International Center, Puerto Williams, Chile
- Andean Cat Alliance, Villa Carlos Paz, Argentina
- Diana L Buitrago-Torres
- School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil
- Héctor E Ramírez-Chaves
- Department of Biological Sciences, University of Caldas, Manizales, Colombia
- Centro de Museos, Museo de Historia Natural, University of Caldas, Manizales, Colombia
- William J Murphy
- Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, USA
- Interdisciplinary Program in Genetics & Genomics, Texas A&M University, College Station, TX, USA
- Hannes Svardal
- Department of Biology, University of Antwerp, Antwerp, Belgium
- Naturalis Biodiversity Center, Leiden, Netherlands
- Eduardo Eizirik
- School of Health and Life Sciences, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil
- Instituto Pró-Carnívoros, Atibaia, Brazil
3
Halfmeyer I, Bartolomaeus T, Popp B, Radtke M, Helms T, Hentschel J, Popp D, Jamra RA. Approach to Cohort-Wide Re-Analysis of Exome Data in 1000 Individuals with Neurodevelopmental Disorders. Genes (Basel) 2022;14:30. PMID: 36672771; PMCID: PMC9858523; DOI: 10.3390/genes14010030.
Abstract
The re-analysis of nondiagnostic exome sequencing (ES) data has the potential to increase diagnostic yield in individuals with rare diseases, but its implementation in the daily routines of laboratories is limited by restricted capacities. Here, we describe a systematic approach to re-analysing the ES data of a cohort of 1040 diagnostic and nondiagnostic samples. We applied a strict filter cascade to reveal the most promising single-nucleotide variants (SNVs) across the whole cohort, which left an average of 0.77 variants per individual for manual evaluation. This variant set revealed seven novel diagnoses (0.8% of all nondiagnostic cases) and two secondary findings. Thirteen additional variants, identified by a scientific approach prior to this re-analysis, were also present in this variant set. This resulted in a total increase in diagnostic yield of 2.3%. The filter cascade was optimised during the course of the study and finally reached a sensitivity of 85%. With the filter cascade in place, our re-analysis took 20 h and enables a workflow that can be used repeatedly. This work is intended as a practical recommendation for other laboratories wishing to introduce a resource-efficient re-analysis strategy into their clinical routine.
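As an illustration of what a strict filter cascade looks like in code, the following sketch narrows a candidate variant list through successive predicates. All field names and thresholds here are hypothetical placeholders, not the authors' actual criteria.

```python
# Illustrative sketch of a strict variant filter cascade, in the spirit of the
# re-analysis workflow described above. Field names and thresholds are
# hypothetical, chosen only to show the successive-narrowing structure.

def filter_cascade(variants):
    """Apply successive filters; each stage narrows the candidate set."""
    stages = [
        lambda v: v["gnomad_af"] < 0.001,               # rare in population databases
        lambda v: v["impact"] in {"HIGH", "MODERATE"},  # likely deleterious
        lambda v: v["gene_disease_associated"],         # gene linked to a phenotype
        lambda v: v["genotype_quality"] >= 30,          # technically reliable call
    ]
    remaining = variants
    for stage in stages:
        remaining = [v for v in remaining if stage(v)]
    return remaining  # ideally small enough for manual evaluation

variants = [
    {"gnomad_af": 0.0001, "impact": "HIGH", "gene_disease_associated": True,
     "genotype_quality": 55},
    {"gnomad_af": 0.2, "impact": "LOW", "gene_disease_associated": False,
     "genotype_quality": 99},
]
print(len(filter_cascade(variants)))  # 1
```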
Affiliation(s)
- Insa Halfmeyer
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Tobias Bartolomaeus
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Bernt Popp
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Center of Functional Genomics, Berlin Institute of Health at Charité, Universitätsmedizin Berlin, Hessische Straße 4A, 10115 Berlin, Germany
- Maximilian Radtke
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Tobias Helms
- Limbus Medical Technologies GmbH, Neuer Markt 9/10, 18055 Rostock, Germany
- Julia Hentschel
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Denny Popp
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
- Rami Abou Jamra
- Institute of Human Genetics, University of Leipzig Medical Center, 04103 Leipzig, Germany
4
Wiewiórka M, Szmurło A, Stankiewicz P, Gambin T. Cloud-native distributed genomic pileup operations. Bioinformatics 2022;39:btac804. PMID: 36515465; PMCID: PMC9848050; DOI: 10.1093/bioinformatics/btac804.
Abstract
MOTIVATION Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay, since straightforward pileup implementations process all base calls from all alignments sequentially. A distributed version of the algorithm, on the other hand, faces the intrinsic challenge of splitting read-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. RESULTS Here, we present a scalable, distributed and efficient implementation of a pileup algorithm suitable for deployment in cloud computing environments. In particular, we implemented: (i) a custom data-partitioning algorithm optimized to work with alignment reads; (ii) a novel approach to processing alignment events from sequencing reads using MD tags; (iii) source-code micro-optimizations for recurrent operations; and (iv) a modular structure of the algorithm. Our approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5× faster) and memory usage (up to 2× less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can easily be deployed using any managed Kubernetes or Hadoop service available in public clouds, such as Microsoft Azure, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range joins and coverage calculations, our package provides end users with a unified SQL interface for convenient, interactive analyses of population-scale genomic data. AVAILABILITY AND IMPLEMENTATION https://biodatageeks.github.io/sequila/.
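The core aggregation that a pileup computes can be sketched in a few lines. The toy below runs on a single node and ignores CIGAR/MD handling, partitioning, and all the distributed machinery that is SeQuiLa's actual contribution.

```python
# Toy single-node pileup: per-reference-position counts of base calls across
# all overlapping reads. Assumes gapless alignments for simplicity.

from collections import defaultdict

def pileup(reads):
    """reads: iterable of (start_position, sequence) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts

reads = [(100, "ACGT"), (102, "GTTA"), (103, "TTAC")]
result = pileup(reads)
for pos in sorted(result):
    print(pos, dict(result[pos]))
# position 102 -> {'G': 2}, position 103 -> {'T': 3}, ...
```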
Affiliation(s)
- Paweł Stankiewicz
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
5
Niu Y, Ma M, Li F, Liu X, Shi G. ACO: lossless quality score compression based on adaptive coding order. BMC Bioinformatics 2022;23:219. PMID: 35672665; PMCID: PMC9175485; DOI: 10.1186/s12859-022-04712-z.
Abstract
Background With the rapid development of high-throughput sequencing technology, the cost of whole-genome sequencing has dropped rapidly, leading to exponential growth of genome data. How to efficiently compress the DNA data generated by large-scale genome projects has become an important factor restricting the further development of the DNA sequencing industry. Although the compression of DNA bases has improved significantly in recent years, the compression of quality scores remains challenging. Results In this paper, by reinvestigating the inherent correlations between the quality scores and the sequencing process, we propose a novel lossless quality score compressor based on adaptive coding order (ACO). The main idea of ACO is to traverse the quality scores adaptively along the most correlated trajectory according to the sequencing process. In combination with adaptive arithmetic coding and an improved in-context strategy, ACO achieves state-of-the-art quality score compression with moderate complexity on next-generation sequencing (NGS) data. Conclusions This competence makes ACO a candidate tool for quality score compression; it has been adopted by AVS (Audio Video coding Standard Workgroup of China) and is freely available at https://github.com/Yoniming/ACO.
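For intuition on the adaptive-modeling half of this design, the sketch below estimates the code length an adaptive order-1 context model would spend on a quality string. The traversal order and context design that make ACO effective are its own contribution and are not reproduced here.

```python
# Sketch of an adaptive per-context symbol model of the kind an arithmetic
# coder consumes: probabilities adapt as symbols are (de)coded in the same
# order on both sides, so no model needs to be transmitted.

import math
from collections import defaultdict

def adaptive_code_length(symbols, alphabet_size=64):
    """Estimated bits if each symbol is coded with an adaptive order-1 model."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits, prev = 0.0, None
    for s in symbols:
        # Laplace-smoothed probability of s given the previous symbol
        p = (counts[prev][s] + 1) / (totals[prev] + alphabet_size)
        bits += -math.log2(p)
        counts[prev][s] += 1   # update the model exactly as a decoder would
        totals[prev] += 1
        prev = s
    return bits

qs = "IIIIIHHIIIIGGHHIIII"
print(f"{adaptive_code_length(qs):.1f} bits for {len(qs)} symbols")
```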
Affiliation(s)
- Yi Niu
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- The Pengcheng Lab, Shenzhen 518055, China
- Mingming Ma
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Fu Li
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Guangming Shi
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
6
Bonfield JK. CRAM 3.1: advances in the CRAM file format. Bioinformatics 2022;38:1497-1503. PMID: 34999766; PMCID: PMC8896640; DOI: 10.1093/bioinformatics/btac010.
Abstract
MOTIVATION CRAM has established itself as a high-compression alternative to the BAM file format for DNA sequencing data. We describe updates that further improve it on modern sequencing instruments. RESULTS With Illumina data, CRAM 3.1 files are 7-15% smaller than the equivalent CRAM 3.0 files, and 50-70% smaller than the corresponding BAM files. Long-read technology shows more modest gains due to the presence of high-entropy signals. AVAILABILITY AND IMPLEMENTATION The CRAM 3.0 specification is freely available from https://samtools.github.io/hts-specs/CRAMv3.pdf. The CRAM 3.1 improvements are available in a separate open-source HTScodecs library from https://github.com/samtools/htscodecs, and have been incorporated into HTSlib. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
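Assuming a samtools build recent enough to include the HTScodecs-backed CRAM 3.1 encoder, re-encoding a BAM file can be scripted as below; the file names and reference path are placeholders.

```python
# Minimal sketch of converting BAM to CRAM 3.1 by shelling out to samtools.
# Assumes samtools supports CRAM 3.1 (via htscodecs); adjust paths as needed.

import subprocess

def bam_to_cram31(bam_path: str, cram_path: str, reference_fasta: str):
    """Re-encode a BAM file as CRAM version 3.1 against a reference."""
    subprocess.run(
        [
            "samtools", "view",
            "-T", reference_fasta,       # reference needed for CRAM encoding
            "-O", "cram,version=3.1",    # request the 3.1 container format
            "-o", cram_path,
            bam_path,
        ],
        check=True,
    )

bam_to_cram31("sample.bam", "sample.cram", "GRCh38.fa")
```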
Affiliation(s)
- James K Bonfield
- Informatics and Digital Solutions, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
7
Huo H, Liu P, Wang C, Jiang H, Vitter JS. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics 2022;38:335-343. PMID: 34524416; DOI: 10.1093/bioinformatics/btab655.
Abstract
MOTIVATION Ultrahigh-throughput next-generation sequencing instruments continue to generate vast amounts of genomic data, generally stored in FASTQ format. Two important simultaneous goals are space-efficient compressed storage of the genomic data and fast query performance. Toward that end, we introduce compressed indexing to store and retrieve FASTQ files. RESULTS We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures and the tables REF and Rγ, to achieve minimal space usage and fast retrieval on compressed FASTQ files. Experiments conducted on real, publicly available datasets from various sequencing instruments demonstrate that our index substantially outperforms existing state-of-the-art solutions. For count, locate and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides speedups of 70-167.16 times, 1.44-35.57 times and 1.3-55.4 times, respectively. For extracting records from FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has the additional advantage that it can readily be adapted to work as a general-purpose text index; experiments show that it performs very well in practice. AVAILABILITY AND IMPLEMENTATION The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
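For readers unfamiliar with the first ingredient, a naive Burrows-Wheeler transform fits in a few lines of Python. Production indexes such as CIndex build it via suffix arrays and pair it with succinct rank/select structures; this quadratic toy is for intuition only.

```python
# Naive Burrows-Wheeler transform: last column of the sorted rotation matrix.
# O(n^2 log n); real FM-index construction uses suffix arrays instead.

def bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of text, with a unique sentinel appended."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # "annb$aa" -- runs of equal symbols compress well
```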
Affiliation(s)
- Hongwei Huo
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Pengfei Liu
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Chenhui Wang
- Department of Computer Science, Xidian University, Xi'an 710071, China
- Hongbo Jiang
- Department of Computer Science, Xidian University, Xi'an 710071, China
8
Cho M, No A. FCLQC: fast and concurrent lossless quality scores compressor. BMC Bioinformatics 2021;22:606. PMID: 34930110; PMCID: PMC8686598; DOI: 10.1186/s12859-021-04516-7.
Abstract
BACKGROUND Advances in sequencing technology have drastically reduced sequencing costs, and as a result the amount of sequencing data is increasing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially the quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: it can take thousands of seconds to compress a 1 GB file. There are also desirable features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random-access functionality. RESULTS This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time based on concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors at compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC shows at least a 31x compression-speed improvement in all settings, with a compression-ratio degradation of at most 13.58% (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC shows 3x faster compression while achieving compression ratios better by at least 2.08% (4.69% on average). Moreover, its random-access decompression speed also outperforms the others. The concurrency of FCLQC is implemented in Rust; the performance gain increases near-linearly with the number of threads. CONCLUSION The superior compression and decompression speed makes FCLQC a practical lossless quality score compressor candidate for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is free for non-commercial usage.
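The combination of concurrency and random access typically comes from block-wise compression plus a block index, which can be sketched as follows. This is a conceptual model only; FCLQC's actual Rust implementation and file format differ.

```python
# Sketch of block-wise parallel compression with a random-access index:
# split the stream into blocks, compress blocks concurrently, and record
# compressed sizes so any block can later be decompressed alone.

import zlib
from concurrent.futures import ProcessPoolExecutor

BLOCK = 1 << 16  # 64 KiB blocks; a trade-off between ratio and access granularity

def compress_blocks(data: bytes):
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ProcessPoolExecutor() as pool:
        compressed = list(pool.map(zlib.compress, blocks))
    index = [len(c) for c in compressed]  # lets a reader seek straight to block k
    return compressed, index

def read_block(compressed, k):
    """Random access: decompress only the k-th block."""
    return zlib.decompress(compressed[k])

if __name__ == "__main__":
    data = b"IIIIIHHHHIIII\n" * 50000
    compressed, index = compress_blocks(data)
    assert read_block(compressed, 1) == data[BLOCK:2 * BLOCK]
```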
Affiliation(s)
- Minhyeok Cho
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
9
Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021;18:1958-1969. PMID: 31869798; DOI: 10.1109/tcbb.2019.2959273.
Abstract
The dropping cost of sequencing human DNA has enabled the fast development of several projects around the world that generate huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study we address the compression of SAM files, the standard output format for DNA alignment. We specifically study lossy compression techniques for the quality values reported in the SAM file and analyze the impact of such techniques on the CRAM format. We present a series of experiments using a dataset corresponding to individual NA12878 at three different fold coverages. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
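As an example of the simplest of these techniques, the sketch below applies 8-level binning in the style of Illumina's scheme. The bin edges follow the commonly cited mapping, but should be verified against instrument documentation before use.

```python
# Sketch of quality-score binning: replace each Phred score by a bin
# representative, shrinking the alphabet so downstream compressors do better.
# Bin edges below follow the commonly cited Illumina 8-level scheme (assumed).

ILLUMINA_BINS = [  # (low, high, representative)
    (2, 9, 6), (10, 19, 15), (20, 24, 22), (25, 29, 27),
    (30, 34, 33), (35, 39, 37), (40, 93, 40),
]

def bin_quality(q: int) -> int:
    if q < 2:
        return q  # 0/1 (e.g. no-call) kept as-is
    for low, high, rep in ILLUMINA_BINS:
        if low <= q <= high:
            return rep
    raise ValueError(f"Phred score out of range: {q}")

phred = [38, 32, 12, 2, 40]
print([bin_quality(q) for q in phred])  # [37, 33, 15, 6, 40]
```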
10
Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience 2021;10:giab008. PMID: 33590861; PMCID: PMC7931819; DOI: 10.1093/gigascience/giab008.
Abstract
BACKGROUND SAMtools and BCFtools are widely used programs for processing and analysing high-throughput sequencing data. They include tools for file format conversion and manipulation, sorting, querying, statistics, variant calling, and effect analysis amongst other methods. FINDINGS The first version appeared online 12 years ago and has been maintained and further developed ever since, with many new features and improvements added over the years. The SAMtools and BCFtools packages represent a unique collection of tools that have been used in numerous other software projects and countless genomic pipelines. CONCLUSION Both SAMtools and BCFtools are freely available on GitHub under the permissive MIT licence, free for both non-commercial and commercial use. Both packages have been installed >1 million times via Bioconda. The source code and documentation are available from https://www.htslib.org.
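Beyond the command line, the same HTSlib machinery is scriptable; for instance, the pysam bindings expose samtools-style operations from Python. A small sketch (file and contig names are placeholders; requires `pip install pysam`):

```python
# Scripting samtools-style operations through pysam, the Python HTSlib bindings.

import pysam

pysam.sort("-o", "sorted.bam", "input.bam")   # same effect as `samtools sort`
pysam.index("sorted.bam")                     # same effect as `samtools index`

with pysam.AlignmentFile("sorted.bam", "rb") as bam:
    # region query, as `samtools view sorted.bam chr1:10000-10100` would do
    for read in bam.fetch("chr1", 10000, 10100):
        print(read.query_name, read.mapping_quality)
```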
Affiliation(s)
- Petr Danecek
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- James K Bonfield
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Jennifer Liddle
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- John Marshall
- Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Switchback Road, Glasgow G61 1QH, UK
- Valeriu Ohan
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Martin O Pollard
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Andrew Whitwham
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Thomas Keane
- EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
- Shane A McCarthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Robert M Davies
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA 02215, USA
- Department of Biomedical Informatics, Harvard Medical School, 10 Shattuck Street, Boston, MA 02215, USA
11
Yu R, Yang W, Wang S. Performance evaluation of lossy quality compression algorithms for RNA-seq data. BMC Bioinformatics 2020;21:321. PMID: 32689929; PMCID: PMC7372835; DOI: 10.1186/s12859-020-03658-4.
Abstract
Background Recent advances in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless methods. Many lossy compression algorithms have been developed for, and evaluated using, DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. Results In this study, we evaluated the impact of lossy quality value compression on common RNA-seq analysis pipelines, including expression quantification, transcriptome assembly, and short-variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression can effectively improve RNA-seq data compression; in some cases, lossy algorithms achieved a further 1.2-3x reduction in overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression can affect the results of some RNA-seq processing pipelines, and hence its impact on RNA-seq studies cannot be ignored. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while no effects were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent with the STAR-based pipeline. Our results also show that the impact of lossy quality value compression depends on the compression algorithm used and, where an algorithm supports multiple compression levels, on the level chosen. Conclusions Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate data storage and transmission burdens. However, compression tools and levels should be selected carefully, based on the requirements of the downstream analysis pipelines, to avoid introducing undesirable adverse effects on the analysis results.
12
Yu R, Yang W. ScaleQC: a scalable lossy to lossless solution for NGS data compression. Bioinformatics 2020;36:4551-4559. DOI: 10.1093/bioinformatics/btaa543.
Abstract
MOTIVATION Per-base quality values in next-generation sequencing data take a significant portion of storage even after compression. Lossy compression technologies could further reduce the space used by quality values. However, in many applications, lossless compression is still desired; hence sequencing data in multiple file formats have to be prepared for different applications. RESULTS We developed a scalable lossy to lossless compression solution for quality values named ScaleQC (Scalable Quality value Compression). ScaleQC provides bit-stream-level scalability: the losslessly compressed bit-stream can be truncated to lower data rates without an expensive transcoding operation. Despite its scalability, ScaleQC achieves compression performance comparable to existing lossless and lossy compressors at both lossless and lossy data rates. AVAILABILITY AND IMPLEMENTATION ScaleQC has been integrated with SAMtools as a special quality value encoding mode for CRAM. Its source code can be obtained from our integrated SAMtools (https://github.com/xmuyulab/samtools) with a dependency on the integrated HTSlib (https://github.com/xmuyulab/htslib). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
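Bit-plane coding is one classic way to obtain the truncatable, lossy-to-lossless property described above; the sketch below demonstrates the idea, though ScaleQC's actual codec is considerably more sophisticated.

```python
# Sketch of bit-plane coding: planes are emitted from most to least
# significant, so dropping the tail planes still decodes to a coarser
# reconstruction, and keeping all planes is lossless.

NBITS = 6  # Phred scores 0..63 fit in 6 bit planes

def encode_planes(scores):
    """Return bit planes, most significant first."""
    return [[(s >> b) & 1 for s in scores] for b in range(NBITS - 1, -1, -1)]

def decode_planes(planes, n):
    """Reconstruct from however many leading planes survived truncation."""
    scores = [0] * n
    for depth, plane in enumerate(planes):
        shift = NBITS - 1 - depth
        for i, bit in enumerate(plane):
            scores[i] |= bit << shift
    return scores

scores = [37, 33, 12, 2, 40]
planes = encode_planes(scores)
print(decode_planes(planes, len(scores)))      # lossless: [37, 33, 12, 2, 40]
print(decode_planes(planes[:3], len(scores)))  # truncated: [32, 32, 8, 0, 40]
```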
Affiliation(s)
- Rongshan Yu
- Digital Fujian Institute of Healthcare and Biomedical Big Data, School of Informatics, Xiamen University, Xiamen 316005, China
- Aginome Scientific, Xiamen 316005, China
13
Abstract
Artificial intelligence (AI) is having an increasing impact on the field of pathology, as computational techniques allow computers to perform tasks previously performed by people. Here, we offer a simple and practical guide to AI methods used in pathology, such as digital image analysis, next-generation sequencing, and natural language processing. We not only provide a comprehensive review but also discuss the relevant history and future directions of AI in pathology. We additionally provide a short tabular dictionary of AI terminology to help practicing pathologists and researchers understand this field.
14
No A, Hernaez M, Ochoa I. CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores. J Bioinform Comput Biol 2020;18:2050031. PMID: 32938284; DOI: 10.1142/s0219720020500316.
Abstract
The amount of sequencing data is growing at a fast pace due to a rapid revolution in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, take a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural solution for further decreasing the size of sequencing data is to apply lossy compression to the quality scores. Lossy compression may result in a loss of precision; however, it has been shown that, when operating at certain rates, lossy compression can achieve variant-calling performance similar to that achieved with the losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor designed for quality scores with the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively, reconstructing the set of quality scores with reduced distortion at each step. This characteristic is particularly useful in sequencing data compression, since the encoder generally does not know the most appropriate compression rate, e.g. one that does not degrade variant-calling accuracy. CROMqs avoids the need to compress the data at multiple rates, thereby incurring time savings. In addition to this property, we show that CROMqs obtains rate-distortion performance comparable to the state-of-the-art lossy compressors. Moreover, we show that it achieves variant-calling performance comparable to that of the losslessly compressed data while reducing the size by more than 50%.
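Successive refinement can be illustrated with residual re-quantization: each stage encodes the error left by the previous one, so decoding can stop after any stage with monotonically decreasing distortion. CROMqs achieves the property with random orthogonal projections; the toy below conveys only the refinement behaviour.

```python
# Sketch of successive refinement via residual re-quantization. Each stage
# quantizes the remaining error with a finer step; the final step of 1 makes
# the scheme lossless for integer inputs.

def refine_stages(values, steps=(16, 4, 1)):
    """Yield per-stage quantized residuals for successively finer step sizes."""
    recon = [0] * len(values)
    for step in steps:
        stage = [round((v - r) / step) for v, r in zip(values, recon)]
        recon = [r + q * step for r, q in zip(recon, stage)]
        yield stage  # what the encoder would emit for this stage

def reconstruct(stages, n, steps=(16, 4, 1)):
    recon = [0] * n
    for stage, step in zip(stages, steps):
        recon = [r + q * step for r, q in zip(recon, stage)]
    return recon

values = [37, 33, 12, 2, 40]
stages = list(refine_stages(values))
print(reconstruct(stages[:1], 5))  # coarse estimate after one stage
print(reconstruct(stages, 5))      # exact: [37, 33, 12, 2, 40]
```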
Affiliation(s)
- Albert No
- Electronic and Electrical Engineering, Hongik University, 94 Wausan-ro, Mapo-gu, Seoul 04066, Korea
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W Gregory Dr, Urbana, IL 61801, USA
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1308 W Main Street, Urbana, IL 61801, USA
15
Buels R, Dider S, Diesh C, Robinson J, Holmes I. Cram-JS: reference-based decompression in node and the browser. Bioinformatics 2019;35:4451-4452. PMID: 31099383; PMCID: PMC6821367; DOI: 10.1093/bioinformatics/btz384.
Abstract
MOTIVATION The CRAM format addresses rising DNA storage costs for short-read sequencing by aligning reads to a reference genome and encoding the resulting alignment with Huffman, subexponential, Elias gamma, rANS, gzip and other codes. The CRAM codec is complex, and until now, there has been no JavaScript implementation. RESULTS We have developed a JavaScript library, Cram-JS, that natively reads and decompresses the CRAM format on-the-fly. The library is used in the JBrowse and IGV-JS genome browsers and can readily be used by other JavaScript applications, in the web browser or in Node. AVAILABILITY AND IMPLEMENTATION Cram-JS is written to the ES-6 standard and is available from the GitHub repository at https://github.com/GMOD/cram-js.
Affiliation(s)
- Robert Buels
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Shihab Dider
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Colin Diesh
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Ian Holmes
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
16
Abstract
The amount of data produced by modern sequencing instruments that needs to be stored is huge, so it is not surprising that much work has been done on specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and even the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It builds on ideas from the prediction by partial matching (PPM) and dynamic Markov coder (DMC) algorithms known from general-purpose compression. Its compression ratios are often tens of percent better than those of state-of-the-art tools. The drawbacks of the proposed method are its large memory and time requirements.
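A stripped-down flavour of the context-modelling idea: predict the next base from counts gathered in a fixed-order context. Real PPM adds escape symbols and variable-order blending, and DMC grows a state machine; neither is reproduced in this toy.

```python
# Toy order-2 context model for DNA, in the spirit of PPM: predict the next
# base from counts seen in the same 2-base context, with Laplace smoothing.

from collections import defaultdict

class Order2Model:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context: str, base: str) -> float:
        ctx = self.counts[context]
        return (ctx[base] + 1) / (sum(ctx.values()) + 4)  # 4-letter alphabet

    def update(self, context: str, base: str):
        self.counts[context][base] += 1

model = Order2Model()
seq = "ACGTACGTACGAACGT"
for i in range(2, len(seq)):
    ctx, base = seq[i - 2:i], seq[i]
    p = model.prob(ctx, base)   # a coder would spend -log2(p) bits here
    model.update(ctx, base)
print(model.prob("AC", "G"))    # high after repeated "ACG" occurrences: 0.625
```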
17
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA