1
Czech E, Millar TR, White T, Jeffery B, Miles A, Tallman S, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. bioRxiv 2024:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Background Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
Affiliation(s)
- Eric Czech
  - Related Sciences
- Timothy R. Millar
  - The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
  - Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
  - Tom White Consulting Ltd.
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Alistair Miles
  - Wellcome Sanger Institute
- Sam Tallman
  - Genomics England
- Shadi Zabad
  - School of Computer Science, McGill University, Montreal, QC, Canada
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
2
Luo X, Chen Y, Liu L, Ding L, Li Y, Li S, Zhang Y, Zhu Z. GSC: efficient lossless compression of VCF files with fast query. Gigascience 2024; 13:giae046. [PMID: 39028587 PMCID: PMC11258903 DOI: 10.1093/gigascience/giae046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 05/16/2024] [Accepted: 06/22/2024] [Indexed: 07/21/2024]
Abstract
BACKGROUND With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives. FINDINGS To address the aforementioned issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showcased exceptional performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were higher by 26.9% to 82.4% over GBC and GTC on the datasets, respectively. In lossless compression scenarios, a mode not supported by either GBC or GTC, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than general-purpose tools like gzip, zstd, and BCFtools. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times, with GSC being 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds that were equivalent to its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies. CONCLUSION GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.
Affiliation(s)
- Xiaolong Luo
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yuxin Chen
  - BGI Research, Wuhan 430074, China
  - BGI Research, Shenzhen 518083, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Ling Liu
  - Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
- Lulu Ding
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
- Yuxiang Li
  - BGI Research, Wuhan 430074, China
  - BGI Research, Shenzhen 518083, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Shengkang Li
  - BGI Research, Wuhan 430074, China
  - BGI Research, Shenzhen 518083, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Yong Zhang
  - BGI Research, Wuhan 430074, China
  - BGI Research, Shenzhen 518083, China
  - Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Zexuan Zhu
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
3
Genovese G, Rockweiler NB, Gorman BR, Bigdeli TB, Pato MT, Pato CN, Ichihara K, McCarroll SA. BCFtools/liftover: an accurate and comprehensive tool to convert genetic variants across genome assemblies. Bioinformatics 2024; 40:btae038. [PMID: 38261650 PMCID: PMC10832354 DOI: 10.1093/bioinformatics/btae038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/07/2023] [Accepted: 01/18/2024] [Indexed: 01/25/2024]
Abstract
MOTIVATION Many genetics studies report results tied to genomic coordinates of a legacy genome assembly. However, as assemblies are updated and improved, researchers are faced with either realigning raw sequence data using the updated coordinate system or converting legacy datasets to the updated coordinate system to be able to combine results with newer datasets. Currently available tools to perform the conversion of genetic variants have numerous shortcomings, including poor support for indels and multi-allelic variants, that lead to a higher rate of variants being dropped or incorrectly converted. As a result, many researchers continue to work with and publish using legacy genomic coordinates. RESULTS Here we present BCFtools/liftover, a tool to convert genomic coordinates across genome assemblies for variants encoded in the variant call format, with improved support for indels represented by different reference alleles across genome assemblies and full support for multi-allelic variants. It further supports updates to variant annotation fields whenever the reference allele changes across genome assemblies. The tool has the lowest rate of variants being dropped, with an order of magnitude fewer indels dropped or incorrectly converted, and is an order of magnitude faster than other tools typically used for the same task. It is particularly suited for converting variant callsets from large cohorts to novel telomere-to-telomere assemblies as well as summary statistics from genome-wide association studies tied to legacy genome assemblies. AVAILABILITY AND IMPLEMENTATION The tool is written in C and freely available under the MIT open source license as a BCFtools plugin available at http://github.com/freeseek/score.
Affiliation(s)
- Giulio Genovese
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Stanley Center, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, United States
- Nicole B Rockweiler
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Stanley Center, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, United States
- Bryan R Gorman
  - Center for Data and Computational Sciences, VA Boston HealthCare System, Boston, MA 02130, United States
  - Booz Allen Hamilton Inc, McLean, VA 22102, United States
- Tim B Bigdeli
  - Department of Psychiatry and Behavioral Sciences, SUNY Downstate Health Sciences University, Brooklyn, NY 11203, United States
  - Institute for Genomics in Health, SUNY Downstate Health Sciences University, Brooklyn, NY 11203, United States
  - Cooperative Studies Program, VA New York Harbor Healthcare System, Brooklyn, NY 11209, United States
- Michelle T Pato
  - Department of Psychiatry, Robert Wood Johnson Medical School, New Brunswick, NJ 08901, United States
- Carlos N Pato
  - Department of Psychiatry, Robert Wood Johnson Medical School, New Brunswick, NJ 08901, United States
- Kiku Ichihara
  - Stanley Center, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, United States
- Steven A McCarroll
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Stanley Center, Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
  - Department of Genetics, Harvard Medical School, Boston, MA 02115, United States
4
Zhang L, Yuan Y, Peng W, Tang B, Li MJ, Gui H, Wang Q, Li M. GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species. Genome Biol 2023; 24:76. [PMID: 37069653 PMCID: PMC10108510 DOI: 10.1186/s13059-023-02906-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2022] [Accepted: 03/22/2023] [Indexed: 04/19/2023]
Abstract
Whole-genome sequencing projects of millions of subjects contain enormous genotypes, entailing a huge memory burden and time for computation. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods to access and manage compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analysis would be substantially sped up if built on GBC to access genotypes of a large population. GBC's data structure and algorithms are valuable for accelerating large-scale genomic research.
Affiliation(s)
- Liubin Zhang
  - Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
  - Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Yangyang Yuan
  - Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
  - Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
  - School of Medical Technology and Information Engineering, Zhejiang Chinese Medical University, Hangzhou, China
- Wenjie Peng
  - Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
  - Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Bin Tang
  - Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
  - Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
- Mulin Jun Li
  - The Province and Ministry Co-Sponsored Collaborative Innovation Center for Medical Epigenetics, Tianjin Medical University, Tianjin, China
- Hongsheng Gui
  - Behavioral Health Services, Henry Ford Health, Detroit, MI, USA
  - Center for Health Policy & Health Services Research, Henry Ford Health, Detroit, MI, USA
- Qiang Wang
  - Mental Health Center, West China Hospital, Sichuan University, Chengdu, China
- Miaoxin Li
  - Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510080, China
  - Center for Precision Medicine, Sun Yat-Sen University, Guangzhou, China
  - Center for Disease Genome Research, Sun Yat-Sen University, Guangzhou, China
  - Key Laboratory of Tropical Disease Control (SYSU), Ministry of Education, Guangzhou, 510080, China
  - Guangdong Provincial Key Laboratory of Biomedical Imaging and Guangdong Provincial Engineering Research Center of Molecular Imaging, The Fifth Affiliated Hospital, Sun Yat-sen University, Zhuhai, China
5
Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol 2022; 18:e1009123. [PMID: 35639788 PMCID: PMC9286226 DOI: 10.1371/journal.pcbi.1009123] [Citation(s) in RCA: 45] [Impact Index Per Article: 22.5] [Received: 06/06/2021] [Revised: 07/15/2022] [Accepted: 04/11/2022] [Indexed: 11/30/2022]
Abstract
Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. The VCF format can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of files variants. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by the VCF format. Most bioinformatics workflows deal with DNA/RNA variations that are typically represented in the variant call format (VCF), a file format that describes mutations (SNP and MNP), insertions and deletions (INDEL) against a reference genome. Here we present a wide range of free and open source software tools that are used in biomedical sequencing workflows around the world today.
Affiliation(s)
- Erik Garrison
  - Department Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Zev N. Kronenberg
  - Pacific Biosciences, San Diego, California, United States of America
- Eric T. Dawson
  - NVIDIA Corporation, Santa Clara, California, United States of America
- Brent S. Pedersen
  - Center for Molecular Medicine, University Medical Center, Utrecht, The Netherlands
- Pjotr Prins
  - Department Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
6
Garrison E, Kronenberg ZN, Dawson ET, Pedersen BS, Prins P. A spectrum of free software tools for processing the VCF variant call format: vcflib, bio-vcf, cyvcf2, hts-nim and slivar. PLoS Comput Biol 2022. [PMID: 35639788 DOI: 10.1101/2021.05.21.445151] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Indexed: 05/14/2023]
Abstract
Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. The VCF format can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformations of files variants. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by the VCF format.
Affiliation(s)
- Erik Garrison
  - Department Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
- Zev N Kronenberg
  - Pacific Biosciences, San Diego, California, United States of America
- Eric T Dawson
  - NVIDIA Corporation, Santa Clara, California, United States of America
- Brent S Pedersen
  - Center for Molecular Medicine, University Medical Center, Utrecht, The Netherlands
- Pjotr Prins
  - Department Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, Tennessee, United States of America
7
Lin MF, Bai X, Salerno WJ, Reid JG. Sparse Project VCF: efficient encoding of population genotype matrices. Bioinformatics 2021; 36:5537-5538. [PMID: 33300997 PMCID: PMC8016461 DOI: 10.1093/bioinformatics/btaa1004] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/27/2020] [Revised: 11/13/2020] [Accepted: 11/20/2020] [Indexed: 11/13/2022]
Abstract
SUMMARY Variant Call Format (VCF), the prevailing representation for germline genotypes in population sequencing, suffers rapid size growth as larger cohorts are sequenced and more rare variants are discovered. We present Sparse Project VCF (spVCF), an evolution of VCF with judicious entropy reduction and run-length encoding, delivering >10× size reduction for modern studies with practically minimal information loss. spVCF interoperates with VCF efficiently, including tabix-based random access. We demonstrate its effectiveness with the DiscovEHR and UK Biobank whole-exome sequencing cohorts. AVAILABILITY AND IMPLEMENTATION Apache-licensed reference implementation: github.com/mlin/spVCF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiaodong Bai
  - Department of Regeneron Pharmaceuticals, Inc., Regeneron Genetics Center, Tarrytown, NY 10591, USA
- William J Salerno
  - Department of Regeneron Pharmaceuticals, Inc., Regeneron Genetics Center, Tarrytown, NY 10591, USA
- Jeffrey G Reid
  - Department of Regeneron Pharmaceuticals, Inc., Regeneron Genetics Center, Tarrytown, NY 10591, USA
8
Deorowicz S, Danek A, Kokot M. VCFShark: how to squeeze a VCF file. Bioinformatics 2021; 37:3358-3360. [PMID: 33787870 DOI: 10.1093/bioinformatics/btab211] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/21/2020] [Revised: 02/20/2021] [Accepted: 03/30/2021] [Indexed: 11/15/2022]
Abstract
SUMMARY VCF files with results of sequencing projects take a lot of space. We propose VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is the greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements lower than 30 GB allow our tool to be used on typical workstations even for large datasets. AVAILABILITY AND IMPLEMENTATION https://github.com/refresh-bio/vcfshark. SUPPLEMENTARY INFORMATION Supplementary data are available at publisher's Web site.
Affiliation(s)
- Sebastian Deorowicz
  - Faculty of Automatic Control, Electronics and Computer Science, Department of Algorithmics and Software, Silesian University of Technology, Gliwice, Poland
- Agnieszka Danek
  - Faculty of Automatic Control, Electronics and Computer Science, Department of Algorithmics and Software, Silesian University of Technology, Gliwice, Poland
- Marek Kokot
  - Faculty of Automatic Control, Electronics and Computer Science, Department of Algorithmics and Software, Silesian University of Technology, Gliwice, Poland
9
Lan D, Tobler R, Souilmi Y, Llamas B. Genozip - A Universal Extensible Genomic Data Compressor. Bioinformatics 2021; 37:2225-2230. [PMID: 33585897 PMCID: PMC8388020 DOI: 10.1093/bioinformatics/btab102] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/09/2020] [Revised: 01/25/2021] [Accepted: 02/12/2021] [Indexed: 11/14/2022]
Abstract
We present Genozip, a universal and fully featured compression software for genomic data. Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities - universality (support for all common genomic file formats), high compression ratios, speed, feature-richness, and extensibility. Genozip delivers high-performance compression for widely-used genomic data formats in genomics research, namely FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP, and 23andMe formats. Our test results show that Genozip is fast and achieves greatly improved compression ratios, even when the files are already compressed. Further, Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs. With this, we intend for Genozip to be a general-purpose compression platform where researchers can implement compression for additional file formats, as well as new codecs for data types or fields within files, in the future. We anticipate that this will ultimately increase the visibility and adoption of these algorithms by the user community, thereby accelerating further innovation in this space. Availability: Genozip is written in C. The code is open-source and available on GitHub (https://github.com/divonlan/genozip). The package is free for non-commercial use. It is distributed as a Docker container on DockerHub and through the conda package manager. Genozip is tested on Linux, Mac, and Windows. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Divon Lan
  - Australian Centre for Ancient DNA, School of Biological Sciences, Faculty of Sciences, The University of Adelaide, Adelaide SA 5005, Australia
- Ray Tobler
  - Australian Centre for Ancient DNA, School of Biological Sciences, Faculty of Sciences, The University of Adelaide, Adelaide SA 5005, Australia
  - Centre of Excellence for Australian Biodiversity and Heritage (CABAH), School of Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
- Yassine Souilmi
  - Australian Centre for Ancient DNA, School of Biological Sciences, Faculty of Sciences, The University of Adelaide, Adelaide SA 5005, Australia
  - National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 0200, Australia
- Bastien Llamas
  - Australian Centre for Ancient DNA, School of Biological Sciences, Faculty of Sciences, The University of Adelaide, Adelaide SA 5005, Australia
  - Centre of Excellence for Australian Biodiversity and Heritage (CABAH), School of Biological Sciences, University of Adelaide, Adelaide, SA 5005, Australia
  - National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 0200, Australia
10
Shokrof M, Abouelhoda M. IonCRAM: a reference-based compression tool for ion torrent sequence files. BMC Bioinformatics 2020; 21:397. [PMID: 32907531 PMCID: PMC7487613 DOI: 10.1186/s12859-020-03726-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/19/2020] [Accepted: 08/31/2020] [Indexed: 12/29/2022]
Abstract
Background Ion Torrent is one of the major next generation sequencing (NGS) technologies and it is frequently used in medical research and diagnosis. The built-in software for the Ion Torrent sequencing machines delivers the sequencing results in the BAM format. In addition to the usual SAM/BAM fields, the Ion Torrent BAM file includes technology-specific flow signal data. The flow signals occupy a big portion of the BAM file (about 75% for the human genome). Compressing SAM/BAM into CRAM format significantly reduces the space needed to store the NGS results. However, the tools for generating the CRAM formats are not designed to handle the flow signals. This missing feature has motivated us to develop a new program to improve the compression of the Ion Torrent files for long term archiving. Results In this paper, we present IonCRAM, the first reference-based compression tool to compress Ion Torrent BAM files for long term archiving. For the BAM files, IonCRAM could achieve a space saving of about 43%. This space saving is superior to what is achieved with the CRAM format by about 8–9%. Conclusions Reducing the space consumption of NGS data reduces the cost of storage and data transfer. Therefore, developing efficient compression software for clinical NGS data goes beyond the computational interest, as it ultimately contributes to the overall cost reduction of the clinical test. The space saving achieved by our tool is a practical step in this direction. The tool is open source and available at Code Ocean, github, and http://ioncram.saudigenomeproject.com.
Affiliation(s)
- Moustafa Shokrof
  - Faculty of Computer Science, University of California at Davis, Davis, CA, USA
- Mohamed Abouelhoda
  - King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
  - Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST), Riyadh, Saudi Arabia
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, University Square, Giza, Egypt