1
Müntefering F, Adhisantoso YG, Chandak S, Ostermann J, Hernaez M, Voges J. Genie: the first open-source ISO/IEC encoder for genomic data. Commun Biol 2024; 7:553. [PMID: 38724695 PMCID: PMC11082222 DOI: 10.1038/s42003-024-06249-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 04/26/2024] [Indexed: 05/12/2024]
Abstract
For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data in an efficient and interoperable way, ISO and IEC released the first version of the compression standard MPEG-G in 2019. However, non-proprietary implementations of the standard are not openly available so far, limiting fair scientific assessment of the standard and, therefore, hindering its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any other standard-compliant decoder independent from its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.
Affiliation(s)
- Fabian Müntefering
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Yeremia Gunawan Adhisantoso
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, 350 Jane Stanford Way, Stanford, CA, 94305, USA
- Jörn Ostermann
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Mikel Hernaez
  - Center for Applied Medical Research (CIMA), University of Navarra, Av. de Pío XII, 55, Pamplona, 31008, Navarra, Spain
- Jan Voges
  - Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
2
Lau B, Chandak S, Roy S, Tatwawadi K, Wootters M, Weissman T, Ji HP. Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing. Sci Rep 2023; 13:8514. [PMID: 37231057 DOI: 10.1038/s41598-023-29575-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 02/07/2023] [Indexed: 05/27/2023]
Abstract
The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations with scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repetitive and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially-scalable data address space using only small numbers of targeting primers for assembly and readout.
Affiliation(s)
- Billy Lau
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  - Stanford Genome Technology Center, Stanford University, Palo Alto, CA, 94304, USA
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Sharmili Roy
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Kedar Tatwawadi
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Mary Wootters
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Hanlee P Ji
  - Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
  - Stanford Genome Technology Center, Stanford University, Palo Alto, CA, 94304, USA
3
Meng Q, Chandak S, Zhu Y, Weissman T. Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach. Sci Rep 2023; 13:2082. [PMID: 36747011 PMCID: PMC9902536 DOI: 10.1038/s41598-023-29267-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Accepted: 02/01/2023] [Indexed: 02/08/2023]
Abstract
The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base, which is 3-6× lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
Affiliation(s)
- Qingxi Meng
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Yifan Zhu
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
4
Tabatabaei SK, Pham B, Pan C, Liu J, Chandak S, Shorkey SA, Hernandez AG, Aksimentiev A, Chen M, Schroeder CM, Milenkovic O. Expanding the Molecular Alphabet of DNA-Based Data Storage Systems with Neural Network Nanopore Readout Processing. Nano Lett 2022; 22:1905-1914. [PMID: 35212544 PMCID: PMC8915253 DOI: 10.1021/acs.nanolett.1c04203] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/01/2021] [Revised: 02/22/2022] [Indexed: 05/23/2023]
Abstract
DNA is a promising next-generation data storage medium, but challenges remain with synthesis costs and recording latency. Here, we describe a prototype of a DNA data storage system that uses an extended molecular alphabet combining natural and chemically modified nucleotides. Our results show that MspA nanopores can discriminate different combinations and ordered sequences of natural and chemically modified nucleotides in custom-designed oligomers. We further demonstrate single-molecule sequencing of the extended alphabet using a neural network architecture that classifies raw current signals generated by Oxford Nanopore sequencers with an average accuracy exceeding 60% (39× larger than random guessing). Molecular dynamics simulations show that the majority of modified nucleotides lead to only minor perturbations of the DNA double helix. Overall, the extended molecular alphabet may potentially offer a nearly 2-fold increase in storage density and potentially the same order of reduction in the recording latency, thereby enabling new implementations of molecular recorders.
Affiliation(s)
- S Kasra Tabatabaei
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Bach Pham
  - Department of Chemistry, University of Massachusetts at Amherst, Amherst, Massachusetts 01003, United States
- Chao Pan
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Jingqian Liu
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, California 94305, United States
- Spencer A Shorkey
  - Department of Chemistry, University of Massachusetts at Amherst, Amherst, Massachusetts 01003, United States
- Alvaro G Hernandez
  - Roy J. Carver Biotechnology Center, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Aleksei Aksimentiev
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
  - Department of Physics, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Min Chen
  - Department of Chemistry, University of Massachusetts at Amherst, Amherst, Massachusetts 01003, United States
- Charles M Schroeder
  - Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
  - Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
  - Department of Materials Science and Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Olgica Milenkovic
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
5
Agarwal A, Agarwal S, Chandak S. Response to the letter to the editor. Ultrasound 2022; 30:96. [PMID: 35173785 PMCID: PMC8841946 DOI: 10.1177/1742271x211055801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023]
Affiliation(s)
- A Agarwal
  - Arjit Agarwal, Department of Radiodiagnosis, Teerthanker Mahaveer Medical College and Research Center, Teerthanker Mahaveer University, Moradabad, UP, India
6
Chandak S, Tatwawadi K, Sridhar S, Weissman T. Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy. Bioinformatics 2020; 36:5313-5321. [PMID: 33325499 DOI: 10.1093/bioinformatics/btaa1017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/20/2020] [Revised: 10/14/2020] [Accepted: 11/24/2020] [Indexed: 11/14/2022]
Abstract
Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications. Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications. Availability and implementation: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.
Affiliation(s)
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Srivatsan Sridhar
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
7
Chandak S, Tatwawadi K, Ochoa I, Hernaez M, Weissman T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics 2019; 35:2674-2676. [PMID: 30535063 DOI: 10.1093/bioinformatics/bty1015] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 10/12/2018] [Accepted: 12/06/2018] [Indexed: 01/31/2023]
Abstract
Motivation: High-Throughput Sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack support for one or more crucial properties, such as variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression. Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools; for example, SPRING compresses 195 GB of 25× whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources. Availability and implementation: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING. Supplementary information: Supplementary data are available at Bioinformatics online.
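The figures quoted in this abstract can be turned into rough ratios. The sketch below only does arithmetic on the numbers stated above (195 GB input, under 7 GB output, around 1.6× smaller than earlier tools); the function name `compression_summary` is hypothetical, and because the abstract gives "less than 7 GB" and "around 1.6×", the results are approximate bounds, not measurements.

```python
def compression_summary(original_gb, compressed_gb, gain_over_previous):
    """Return (overall compression ratio, implied size for the earlier tools).

    Hypothetical helper for illustration; inputs are the rounded figures
    quoted in the abstract, so outputs are approximate.
    """
    ratio = original_gb / compressed_gb
    previous_gb = compressed_gb * gain_over_previous
    return ratio, previous_gb


# 195 GB -> under 7 GB, about 1.6x smaller than previous compressors.
ratio, previous_gb = compression_summary(195.0, 7.0, 1.6)
print(f"overall ratio: at least {ratio:.1f}x")
print(f"implied earlier state of the art: roughly {previous_gb:.1f} GB")
```

Since 7 GB is an upper bound on the compressed size, the computed ratio is a lower bound on the overall compression ratio.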
Affiliation(s)
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Kedar Tatwawadi
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
- Idoia Ochoa
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Mikel Hernaez
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
8
Puntambekar S, Chandak S, Goel A, Puntambekar A. 2048 Colo-Anal Anastomosis: A Novel Idea for Treatment of Re-Re-Recurrent Rectovaginal Fistula. J Minim Invasive Gynecol 2019. [DOI: 10.1016/j.jmig.2019.09.108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
9
Goel A, Manchekar M, Chitale M, Pattanaik S, Chandak S, Puntambekar A. 1749 Laparoscopic Rectovaginal Fistula Repair Following Benign Gynaecological Procedure. J Minim Invasive Gynecol 2019. [DOI: 10.1016/j.jmig.2019.09.151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
10
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. [PMID: 29444237 DOI: 10.1093/bioinformatics/btx639] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 06/12/2017] [Accepted: 10/06/2017] [Indexed: 12/30/2022]
Abstract
Motivation: New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results: We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact: schandak@stanford.edu. Supplementary information: Supplementary material is available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
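The reordering idea in this abstract can be illustrated with a toy example. The following is a minimal sketch, not the HARC implementation: it assumes error-free, same-strand reads, indexes each read by the hash of its first `k` bases, and greedily follows every read with an unused read that overlaps it at the smallest shift. The names `reorder_reads` and `overlaps`, the value of `k`, and the toy genome are all assumptions made for illustration; the actual tool's handling of mismatches, reverse complements and indexing is not shown here.

```python
from collections import defaultdict


def overlaps(a, b, shift):
    """True if read b matches read a when b starts `shift` bases into a."""
    n = min(len(a) - shift, len(b))
    return a[shift:shift + n] == b[:n]


def reorder_reads(reads, k=4):
    """Greedy toy reordering (hypothetical helper, not the HARC code).

    Returns a permutation of read indices in which each read is followed,
    where possible, by an unused read whose first k bases occur inside it.
    """
    # Hash index: first k bases of a read -> indices of reads starting with them.
    index = defaultdict(list)
    for i, read in enumerate(reads):
        index[read[:k]].append(i)

    used = [False] * len(reads)
    order = []
    for start in range(len(reads)):
        if used[start]:
            continue
        cur = start
        used[cur] = True
        order.append(cur)
        while True:
            read = reads[cur]
            nxt = None
            # Try shifts 1, 2, ... so the closest overlapping read is preferred.
            for shift in range(1, len(read) - k + 1):
                for cand in index.get(read[shift:shift + k], []):
                    if not used[cand] and overlaps(read, reads[cand], shift):
                        nxt = cand
                        break
                if nxt is not None:
                    break
            if nxt is None:
                break
            used[nxt] = True
            order.append(nxt)
            cur = nxt
    return order


# Toy genome and four length-8 reads taken at positions 0, 9, 3 and 6.
genome = "ACGTTGCAAGCTTAGGCTAC"
reads = [genome[p:p + 8] for p in (0, 9, 3, 6)]
print(reorder_reads(reads, k=4))  # -> [0, 2, 3, 1], i.e. positions 0, 3, 6, 9
```

Once overlapping reads sit next to each other, each read can be described mostly by its shift relative to its predecessor, which is the kind of redundancy a read compressor can exploit.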
Affiliation(s)
- Shubham Chandak
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
  - Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA