1
|
Schwarz PM, Freisleben B. Optimizing fountain codes for DNA data storage. Comput Struct Biotechnol J 2024; 23:3878-3896. [PMID: 39559773 PMCID: PMC11570749 DOI: 10.1016/j.csbj.2024.10.038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2024] [Revised: 10/22/2024] [Accepted: 10/22/2024] [Indexed: 11/20/2024] Open
Abstract
Fountain codes, originally developed for reliable multicasting in communication networks, are effectively applied in various data transmission and storage systems. Their recent use in DNA data storage systems has unique challenges, since the DNA storage channel deviates from the traditional Gaussian white noise erasure model considered in communication networks and has several restrictions as well as special properties. Thus, optimizing fountain codes to address these challenges promises to improve their overall usability in DNA data storage systems. In this article, we present several methods for optimizing fountain codes for DNA data storage. Apart from generally applicable optimizations for fountain codes, we propose optimization algorithms to create tailored distribution functions of fountain codes, which is novel in the context of DNA data storage. We evaluate the proposed methods in terms of various metrics related to the DNA storage channel. Our evaluation shows that optimizing fountain codes for DNA data storage can significantly enhance the reliability and capacity of DNA data storage systems. The developed methods represent a step forward in harnessing the full potential of fountain codes for DNA-based data storage applications. The new coding schemes and all developed methods are available under a free and open-source software license.
Collapse
Affiliation(s)
- Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35043, Marburg, Germany
| | - Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35043, Marburg, Germany
| |
Collapse
|
2
|
Schwarz PM, Freisleben B. Data recovery methods for DNA storage based on fountain codes. Comput Struct Biotechnol J 2024; 23:1808-1823. [PMID: 38707543 PMCID: PMC11066528 DOI: 10.1016/j.csbj.2024.04.048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2024] [Revised: 04/18/2024] [Accepted: 04/18/2024] [Indexed: 05/07/2024] Open
Abstract
Today's digital data storage systems typically offer advanced data recovery solutions to address the problem of catastrophic data loss, such as software-based disk sector analysis or physical-level data retrieval methods for conventional hard disk drives. However, DNA-based data storage currently relies solely on the inherent error correction properties of the methods used to encode digital data into strands of DNA. Any error that cannot be corrected utilizing the redundancy added by DNA encoding methods results in permanent data loss. To provide data recovery for DNA storage systems, we present a method to automatically reconstruct corrupted or missing data stored in DNA using fountain codes. Our method exploits the relationships between packets encoded with fountain codes to identify and rectify corrupted or lost data. Furthermore, we present file type-specific and content-based data recovery methods for three file types, illustrating how a fusion of fountain encoding-specific redundancy and knowledge about the data can effectively recover information in a corrupted DNA storage system, both in an automatic and in a guided manual manner. To demonstrate our approach, we introduce DR4DNA, a software toolkit that contains all methods presented. We evaluate DR4DNA using both in-silico and in-vitro experiments.
Collapse
Affiliation(s)
- Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße 6, Marburg, D-35043, Germany
| | - Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße 6, Marburg, D-35043, Germany
| |
Collapse
|
3
|
Rasool A, Hong J, Hong Z, Li Y, Zou C, Chen H, Qu Q, Wang Y, Jiang Q, Huang X, Dai J. An Effective DNA-Based File Storage System for Practical Archiving and Retrieval of Medical MRI Data. SMALL METHODS 2024; 8:e2301585. [PMID: 38807543 DOI: 10.1002/smtd.202301585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 03/29/2024] [Indexed: 05/30/2024]
Abstract
DNA-based data storage is a new technology in computational and synthetic biology, that offers a solution for long-term, high-density data archiving. Given the critical importance of medical data in advancing human health, there is a growing interest in developing an effective medical data storage system based on DNA. Data integrity, accuracy, reliability, and efficient retrieval are all significant concerns. Therefore, this study proposes an Effective DNA Storage (EDS) approach for archiving medical MRI data. The EDS approach incorporates three key components (i) a novel fraction strategy to address the critical issue of rotating encoding, which often leads to data loss due to single base error propagation; (ii) a novel rule-based quaternary transcoding method that satisfies bio-constraints and ensure reliable mapping; and (iii) an indexing technique designed to simplify random search and access. The effectiveness of this approach is validated through computer simulations and biological experiments, confirming its practicality. The EDS approach outperforms existing methods, providing superior control over bio-constraints and reducing computational time. The results and code provided in this study open new avenues for practical DNA storage of medical MRI data, offering promising prospects for the future of medical data archiving and retrieval.
Collapse
Affiliation(s)
- Abdur Rasool
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, 518055, China
| | - Jingwei Hong
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- College of Mathematics and Information Science, Hebei University, Baoding, 071002, China
| | - Zhiling Hong
- Quanzhou Development Group Co., Ltd, Quanzhou, 362000, China
| | - Yuanzhen Li
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen, 518055, China
| | - Chao Zou
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| | - Hui Chen
- Shenzhen Polytechnic University, Shenzhen, 518055, China
| | - Qiang Qu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| | - Yang Wang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| | - Qingshan Jiang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
| | - Xiaoluo Huang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen, 518055, China
| | - Junbiao Dai
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518055, China
| |
Collapse
|
4
|
Gao Y, No A. Efficient and low-complexity variable-to-variable length coding for DNA storage. BMC Bioinformatics 2024; 25:320. [PMID: 39354338 PMCID: PMC11446080 DOI: 10.1186/s12859-024-05943-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Accepted: 09/23/2024] [Indexed: 10/03/2024] Open
Abstract
BACKGROUND Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio between [ 0.5 - c GC , 0.5 + c GC ] (GC content constraint c GC ). Sequencing or synthesis errors tend to increase when these constraints are violated. RESULTS In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when h = 4 andc GC = 0.05 , the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub. CONCLUSION We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences, which achieves near-optimal rates.
Collapse
Affiliation(s)
- Yunfei Gao
- SJTU-Ruijing-UIH Institute for Medical Imaging Technology, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No. 197 Ruijin Second Road, Shanghai, 200025, China
| | - Albert No
- Department of Artificial Intelligence, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, 03722, South Korea.
| |
Collapse
|
5
|
Kim JW, Jeong J, Kwak HY, No JS. Design of DNA Storage Coding Scheme With LDPC Codes and Interleaving. IEEE Trans Nanobioscience 2024; 23:447-457. [PMID: 38512749 DOI: 10.1109/tnb.2024.3379976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/23/2024]
Abstract
In this paper, we propose a new coding scheme for DNA storage using low-density parity-check (LDPC) codes and interleaving techniques. While conventional coding schemes generally employ error correcting codes in both inter and intra-oligo directions, we show that inter-oligo LDPC codes, optimized by differential evolution, are sufficient in ensuring the reliability of DNA storage due to the powerful soft decoding of LDPC codes. In addition, we apply interleaving techniques for handling non-uniform error characteristics of DNA storage to enhance the decoding performance. Consequently, the proposed coding scheme reduces the required number of oligo reads for perfect recovery by 26.25% ~ 38.5% compared to existing state-of-the-art coding schemes. Moreover, we develop an analytical DNA channel model in terms of non-uniform binary symmetric channels. This mathematical model allows us to demonstrate the superiority of the proposed coding scheme while isolating the experimental variation, as well as confirm the independent effects of LDPC codes and interleaving techniques.
Collapse
|
6
|
Cao B, Zheng Y, Shao Q, Liu Z, Xie L, Zhao Y, Wang B, Zhang Q, Wei X. Efficient data reconstruction: The bottleneck of large-scale application of DNA storage. Cell Rep 2024; 43:113699. [PMID: 38517891 DOI: 10.1016/j.celrep.2024.113699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2023] [Revised: 11/15/2023] [Accepted: 01/05/2024] [Indexed: 03/24/2024] Open
Abstract
Over the past decade, the rapid development of DNA synthesis and sequencing technologies has enabled preliminary use of DNA molecules for digital data storage, overcoming the capacity and persistence bottlenecks of silicon-based storage media. DNA storage has now been fully accomplished in the laboratory through existing biotechnology, which again demonstrates the viability of carbon-based storage media. However, the high cost and latency of data reconstruction pose challenges that hinder the practical implementation of DNA storage beyond the laboratory. In this article, we review existing advanced DNA storage methods, analyze the characteristics and performance of biotechnological approaches at various stages of data writing and reading, and discuss potential factors influencing DNA storage from the perspective of data reconstruction.
Collapse
Affiliation(s)
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China; Centre for Frontier AI Research, Agency for Science, Technology, and Research (A(∗)STAR), 1 Fusionopolis Way, Singapore 138632, Singapore
| | - Yanfen Zheng
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
| | - Qi Shao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
| | - Zhenlu Liu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
| | - Lei Xie
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
| | - Yunzhu Zhao
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
| | - Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Xuefu Street, Dalian, Liaoning 116622, China
| | - Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China.
| | - Xiaopeng Wei
- School of Computer Science and Technology, Dalian University of Technology, Lingshui Street, Dalian, Liaoning 116024, China
| |
Collapse
|
7
|
Wang K, Cao B, Ma T, Zhao Y, Zheng Y, Wang B, Zhou S, Zhang Q. Storing Images in DNA via base128 Encoding. J Chem Inf Model 2024; 64:1719-1729. [PMID: 38385334 DOI: 10.1021/acs.jcim.3c01592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Abstract
Current DNA storage schemes lack flexibility and consistency in processing highly redundant and correlated image data, resulting in low sequence stability and image reconstruction rates. Therefore, according to the characteristics of image storage, this paper proposes storing images in DNA via base128 encoding (DNA-base128). In the data writing stage, data segmentation and probability statistics are carried out, and then, the data block frequency and constraint encoding set are associated with achieving encoding. When the image needs to be recovered, DNA-base128 completes internal error correction by threshold setting and drift comparison. Compared with representative work, the DNA-base128 encoding results show that the undesired motifs were reduced by 71.2-90.7% and that the local guanine-cytosine content variance was reduced by 3 times, indicating that DNA-base128 can store images more stably. In addition, the structural similarity index (SSIM) and multiscale structural similarity (MS-SSIM) of image reconstruction using DNA-base128 were improved by 19-102 and 6.6-20.3%, respectively. In summary, DNA-base128 provides image encoding with internal error correction and provides a potential solution for DNA image storage. The data and code are available at the GitHub repository: https://github.com/123456wk/DNA_base128.
Collapse
Affiliation(s)
- Kun Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| | - Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Tao Ma
- Brain Function Research Section, China Medical University, Shenyang 110001, China
| | - Yunzhu Zhao
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| | - Yanfen Zheng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Bin Wang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| | - Shihua Zhou
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| | - Qiang Zhang
- The Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| |
Collapse
|
8
|
Jeong J, Park H, Kwak HY, No JS, Jeon H, Lee JW, Kim JW. Iterative Soft Decoding Algorithm for DNA Storage Using Quality Score and Redecoding. IEEE Trans Nanobioscience 2024; 23:81-90. [PMID: 37294652 DOI: 10.1109/tnb.2023.3284406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Ever since deoxyribonucleic acid (DNA) was considered as a next-generation data-storage medium, lots of research efforts have been made to correct errors occurred during the synthesis, storage, and sequencing processes using error correcting codes (ECCs). Previous works on recovering the data from the sequenced DNA pool with errors have utilized hard decoding algorithms based on a majority decision rule. To improve the correction capability of ECCs and robustness of the DNA storage system, we propose a new iterative soft decoding algorithm, where soft information is obtained from FASTQ files and channel statistics. In particular, we propose a new formula for log-likelihood ratio (LLR) calculation using quality scores (Q-scores) and a redecoding method which may be suitable for the error correction and detection in the DNA sequencing area. Based on the widely adopted encoding scheme of the fountain code structure proposed by Erlich et al., we use three different sets of sequenced data to show consistency for the performance evaluation. The proposed soft decoding algorithm gives 2.3% ∼ 7.0% improvement of the reading number reduction compared to the state-of-the-art decoding method and it is shown that it can deal with erroneous sequenced oligo reads with insertion and deletion errors.
Collapse
|
9
|
Rasool A, Hong J, Jiang Q, Chen H, Qu Q. BO-DNA: Biologically optimized encoding model for a highly-reliable DNA data storage. Comput Biol Med 2023; 165:107404. [PMID: 37666064 DOI: 10.1016/j.compbiomed.2023.107404] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 08/13/2023] [Accepted: 08/26/2023] [Indexed: 09/06/2023]
Abstract
DNA data storage is a promising technology that utilizes computer simulation, and synthetic biology, offering high-density and reliable digital information storage. It is challenging to store massive data in a small amount of DNA without losing the original data since nonspecific hybridization errors occur frequently and severely affect the reliability of stored data. This study proposes a novel biologically optimized encoding model for DNA data storage (BO-DNA) to overcome the reliability problem. BO-DNA model is developed by a new rule-based mapping method to avoid data drop during the transcoding of binary data to premier nucleotides. A customized optimization algorithm based on a tent chaotic map is applied to maximize the lower bounds that help to minimize the nonspecific hybridization errors. The robustness of BO-DNA is computed by four bio-constraints to confirm the reliability of newly generated DNA sequences. Experimentally, different medical images are encoded and decoded successfully with 12%-59% improved lower bounds and optimally constrained-based DNA sequences reported with 1.77bit/nt average density. BO-DNA's results demonstrate substantial advantages in constructing reliable DNA data storage.
Collapse
Affiliation(s)
- Abdur Rasool
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Jingwei Hong
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China; College of Mathematics and Information Science, Hebei University, Baoding, 071002, China
| | - Qingshan Jiang
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
| | - Hui Chen
- Shenzhen Polytechnic University, Shenzhen, 518055, Guangdong, China
| | - Qiang Qu
- Shenzhen Key Laboratory for High Performance Data Mining, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China.
| |
Collapse
|
10
|
Mu Z, Cao B, Wang P, Wang B, Zhang Q. RBS: A Rotational Coding Based on Blocking Strategy for DNA Storage. IEEE Trans Nanobioscience 2023; 22:912-922. [PMID: 37028365 DOI: 10.1109/tnb.2023.3254514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/11/2023]
Abstract
The data volume of global information has grown exponentially in recent years, but the development of silicon-based memory has entered a bottleneck period. Deoxyribonucleic acid (DNA) storage is drawing attention owing to its advantages of high storage density, long storage time, and easy maintenance. However, the base utilization and information density of existing DNA storage methods are insufficient. Therefore, this study proposes a rotational coding based on blocking strategy (RBS) for encoding digital information such as text and images in DNA data storage. This strategy satisfies multiple constraints and produces low error rates in synthesis and sequencing. To illustrate the superiority of the proposed strategy, it was compared and analyzed with existing strategies in terms of entropy value change, free energy size, and Hamming distance. The experimental results show that the proposed strategy has higher information storage density and better coding quality in DNA storage, so it will improve the efficiency, practicality, and stability of DNA storage.
Collapse
|
11
|
Park SJ, Kim S, Jeong J, No A, No JS, Park H. Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads. Bioinformatics 2023; 39:btad548. [PMID: 37669160 PMCID: PMC10500082 DOI: 10.1093/bioinformatics/btad548] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2023] [Revised: 08/30/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION DNA-based data storage is one of the most attractive research areas for future archival storage. However, it faces the problems of high writing and reading costs for practical use. There have been many efforts to resolve this problem, but existing schemes are not fully suitable for DNA-based data storage, and more cost reduction is needed. RESULTS We propose whole encoding and decoding procedures for DNA storage. The encoding procedure consists of a carefully designed single low-density parity-check code as an inter-oligo code, which corrects errors and dropouts efficiently. We apply new clustering and alignment methods that operate on variable-length reads to aid the decoding performance. We use edit distance and quality scores during the sequence analysis-aided decoding procedure, which can discard abnormal reads and utilize high-quality soft information. We store 548.83 KB of an image file in DNA oligos and achieve a writing cost reduction of 7.46% and a significant reading cost reduction of 26.57% and 19.41% compared with the two previous works. AVAILABILITY AND IMPLEMENTATION Data and codes for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.
Collapse
Affiliation(s)
- Seong-Joon Park
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
| | - Sunghwan Kim
- Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, South Korea
| | - Jaeho Jeong
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
| | - Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, South Korea
| | - Jong-Seon No
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
| | - Hosung Park
- Department of Computer Engineering, Chonnam National University, Gwangju 61186, South Korea
- Department of ICT Convergence System Engineering, Chonnam National University, Gwangju 61186, South Korea
| |
Collapse
|
12
|
Wang P, Cao B, Ma T, Wang B, Zhang Q, Zheng P. DUHI: Dynamically updated hash index clustering method for DNA storage. Comput Biol Med 2023; 164:107244. [PMID: 37453377 DOI: 10.1016/j.compbiomed.2023.107244] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2023] [Revised: 06/08/2023] [Accepted: 07/07/2023] [Indexed: 07/18/2023]
Abstract
The exponential growth of global data leads to the problem of insufficient data storage capacity. DNA storage can be an ideal storage method due to its high storage density and long storage time. However, the DNA storage process is subject to unavoidable errors that can lead to increased cluster redundancy during data reading, which in turn affects the accuracy of the data reads. This paper proposes a dynamically updated hash index (DUHI) clustering method for DNA storage, which clusters sequences by constructing a dynamic core index set and using hash lookup. The proposed clustering method is analyzed in terms of overall reliability evaluation and visualization evaluation. The results show that the DUHI clustering method can reduce the redundancy of more than 10% of the sequences within the cluster and increase the reconstruction rate of the sequences to more than 99%. Therefore, our method solves the high redundancy problem after DNA sequence clustering, improves the accuracy of data reading, and promotes the development of DNA storage.
Collapse
Affiliation(s)
- Penghao Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China
| | - Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
| | - Tao Ma
- Brain Function Research Section, The First Hospital of China Medical University, 110001, Shenyang, China
| | - Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China.
| | - Qiang Zhang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, 116622, Dalian, China
| | - Pan Zheng
- Department of Accounting and Information Systems, University of Canterbury, 8140, Christchurch, New Zealand
| |
Collapse
|
13
|
Yang X, Shi X, Lai L, Chen C, Xu H, Deng M. Towards long double-stranded chains and robust DNA-based data storage using the random code system. Front Genet 2023; 14:1179867. [PMID: 37384333 PMCID: PMC10294226 DOI: 10.3389/fgene.2023.1179867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2023] [Accepted: 05/31/2023] [Indexed: 06/30/2023] Open
Abstract
DNA has become a popular choice for next-generation storage media due to its high storage density and stability. As the storage medium of life's information, DNA has significant storage capacity and low-cost, low-power replication and transcription capabilities. However, utilizing long double-stranded DNA for storage can introduce unstable factors that make it difficult to meet the constraints of biological systems. To address this challenge, we have designed a highly robust coding scheme called the "random code system," inspired by the idea of fountain codes. The random code system includes the establishment of a random matrix, Gaussian preprocessing, and random equilibrium. Compared to Luby transform codes (LT codes), random code (RC) has better robustness and recovery ability of lost information. In biological experiments, we successfully stored 29,390 bits of data in 25,700 bp chains, achieving a storage density of 1.78 bits per nucleotide. These results demonstrate the potential for using long double-stranded DNA and the random code system for robust DNA-based data storage.
Collapse
Affiliation(s)
- Xu Yang
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Xiaolong Shi
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Langwen Lai
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Congzhou Chen
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Huaisheng Xu
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Ming Deng
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| |
Collapse
|
14
|
Zan X, Chu L, Xie R, Su Y, Yao X, Xu P, Liu W. An image cryptography method by highly error-prone DNA storage channel. Front Bioeng Biotechnol 2023; 11:1173763. [PMID: 37152655 PMCID: PMC10154519 DOI: 10.3389/fbioe.2023.1173763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2023] [Accepted: 03/30/2023] [Indexed: 05/09/2023] Open
Abstract
Introduction: Rapid development in synthetic technologies has boosted DNA as a potential medium for large-scale data storage. Meanwhile, how to implement data security in the DNA storage system is still an unsolved problem. Methods: In this article, we propose an image encryption method based on the modulation-based storage architecture. The key idea is to take advantage of the unpredictable modulation signals to encrypt images in highly error-prone DNA storage channels. Results and Discussion: Numerical results have demonstrated that our image encryption method is feasible and effective with excellent security against various attacks (statistical, differential, noise, and data loss). When compared with other methods such as the hybridization reactions of DNA molecules, the proposed method is more reliable and feasible for large-scale applications.
Collapse
Affiliation(s)
- Xiangzhen Zan
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
| | - Ling Chu
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
| | - Ranze Xie
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
| | - Yanqing Su
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
| | - Xiangyu Yao
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
| | - Peng Xu
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- School of Computer Science of Information Technology, Qiannan Normal University for Nationalities, Duyun, Guizhou, China
- Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
| | - Wenbin Liu
- Institute of Computational Science and Technology, Guangzhou University, Guangzhou, Guangdong, China
- Guangdong Provincial Key Laboratory of Artificial Intelligence in Medical Image Analysis and Application, Guangzhou, Guangdong, China
| |
Collapse
|
15
|
Cao B, Wang B, Zhang Q. GCNSA: DNA storage encoding with a graph convolutional network and self-attention. iScience 2023; 26:106231. [PMID: 36876131 PMCID: PMC9982308 DOI: 10.1016/j.isci.2023.106231] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Revised: 01/31/2023] [Accepted: 02/14/2023] [Indexed: 02/22/2023] Open
Abstract
DNA Encoding, as a key step in DNA storage, plays an important role in reading and writing accuracy and the storage error rate. However, currently, the encoding efficiency is not high enough and the encoding speed is not fast enough, which limits the performance of DNA storage systems. In this work, a DNA storage encoding system with a graph convolutional network and self-attention (GCNSA) is proposed. The experimental results show that DNA storage code constructed by GCNSA increases by 14.4% on average under the basic constraints, and by 5%-40% under other constraints. The increase of DNA storage codes effectively improves the storage density of 0.7-2.2% in the DNA storage system. The GCNSA predicted more DNA storage codes in less time while ensuring the quality of codes, which lays a foundation for higher read and write efficiency in DNA storage.
Collapse
Affiliation(s)
- Ben Cao
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Bin Wang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, School of Software Engineering, Dalian University, Dalian 116622, China
| | - Qiang Zhang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| |
Collapse
|
16
|
Li X, Chen M, Wu H. Multiple errors correction for position-limited DNA sequences with GC balance and no homopolymer for DNA-based data storage. Brief Bioinform 2023; 24:6835379. [PMID: 36410731 DOI: 10.1093/bib/bbac484] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2022] [Revised: 10/12/2022] [Accepted: 10/13/2022] [Indexed: 11/23/2022] Open
Abstract
Deoxyribonucleic acid (DNA) is an attractive medium for long-term digital data storage due to its extremely high storage density, low maintenance cost and longevity. However, during the process of synthesis, amplification and sequencing of DNA sequences with homopolymers of large run-length, three different types of errors, namely, insertion, deletion and substitution errors frequently occur. Meanwhile, DNA sequences with large imbalances between GC and AT content exhibit high dropout rates and are prone to errors. These limitations severely hinder the widespread use of DNA-based data storage. In order to reduce and correct these errors in DNA storage, this paper proposes a novel coding schema called DNA-LC, which converts binary sequences into DNA base sequences that satisfy both the GC balance and run-length constraints. Furthermore, our coding mode is able to detect and correct multiple errors with a higher error correction capability than the other methods targeting single error correction within a single strand. The decoding algorithm has been implemented in practice. Simulation results indicate that our proposed coding scheme can offer outstanding error protection to DNA sequences. The source code is freely accessible at https://github.com/XiayangLi2301/DNA.
Collapse
Affiliation(s)
- Xiayang Li
- School of Mathematics, Tianjin University, Tianjin, 300372, China
| | - Moxuan Chen
- School of Mathematics, Tianjin University, Tianjin, 300372, China
| | - Huaming Wu
- Center for Applied Mathematics, Tianjin University, Tianjin, 300372, China
| |
Collapse
|
17
|
FMG: An observable DNA storage coding method based on frequency matrix game graphs. Comput Biol Med 2022; 151:106269. [PMID: 36356390 DOI: 10.1016/j.compbiomed.2022.106269] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2022] [Revised: 10/20/2022] [Accepted: 10/30/2022] [Indexed: 11/06/2022]
Abstract
Using complex biomolecules for storage is a new carbon-based storage method. For example, DNA has the potential to be a good method for archival long-term data storage. Reasonable and efficient coding is the first and most important step in DNA storage. However, current coding methods, such as altruism algorithm, have the problem of low coding efficiency and high complexity, and coding constraints and sets make it difficult to see the coding results visually. In this study, a new DNA storage coding method based on frequency matrix game graph (FMG) is proposed to generate DNA storage coding satisfying combinatorial constraints. Compared with the randomness of the heuristic algorithm that satisfies the constraints, the coding method based on the FMG is deterministic and can clearly explain the coding process. In addition, the constraints and coding results have observable characteristics and are better than the previously published results for the size of the coding set. For example, when length of the code n = 10, hamming distance d = 4, the results obtained by proposed approach combining chaos game and graph are 24% better than the previous results. The proposed coding scheme successfully constructs high-quality coding sets with less complexity, which effectively promotes the development of carbon-based storage coding.
Collapse
|
18
|
Qu G, Yan Z, Wu H. Clover: tree structure-based efficient DNA clustering for DNA-based data storage. Brief Bioinform 2022; 23:6668252. [PMID: 35975958 DOI: 10.1093/bib/bbac336] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2022] [Revised: 07/21/2022] [Accepted: 07/22/2022] [Indexed: 11/12/2022] Open
Abstract
Deoxyribonucleic acid (DNA)-based data storage is a promising new storage technology which has the advantage of high storage capacity and long storage time compared with traditional storage media. However, the synthesis and sequencing process of DNA can randomly generate many types of errors, which makes it more difficult to cluster DNA sequences to recover DNA information. Currently, the available DNA clustering algorithms are targeted at DNA sequences in the biological domain, which not only cannot adapt to the characteristics of sequences in DNA storage, but also tend to be unacceptably time-consuming for billions of DNA sequences in DNA storage. In this paper, we propose an efficient DNA clustering method termed Clover for DNA storage with linear computational complexity and low memory. Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. We argue through theoretical proofs that Clover has standard linear computational complexity, low space complexity, etc. Experiments show that our method can cluster 10 million DNA sequences into 50 000 classes in 10 s and meet an accuracy rate of over 99%. Furthermore, we have successfully completed an unprecedented clustering of 10 billion DNA data on a single home computer and the time consumption still satisfies the linear relationship. Clover is freely available at https://github.com/Guanjinqu/Clover.
Collapse
Affiliation(s)
- Guanjin Qu
- Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
| | - Zihui Yan
- Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
| | - Huaming Wu
- Center for Applied Mathematics, Tianjin University, Tianjin, 300072, China
| |
Collapse
|
19
|
Adaptive coding for DNA storage with high storage density and low coverage. NPJ Syst Biol Appl 2022; 8:23. [PMID: 35788589 PMCID: PMC9253015 DOI: 10.1038/s41540-022-00233-w] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Accepted: 06/10/2022] [Indexed: 11/09/2022] Open
Abstract
The rapid development of information technology has generated substantial data, which urgently requires new storage media and storage methods. DNA, as a storage medium with high density, high durability, and ultra-long storage time characteristics, is promising as a potential solution. However, DNA storage is still in its infancy and suffers from low space utilization of DNA strands, high read coverage, and poor coding coupling. Therefore, in this work, an adaptive coding DNA storage system is proposed to use different coding schemes for different coding region locations, and the method of adaptively generating coding constraint thresholds is used to optimize at the system level to ensure the efficient operation of each link. Images, videos, and PDF files of size 698 KB were stored in DNA using adaptive coding algorithms. The data were sequenced and losslessly decoded into raw data. Compared with previous work, the DNA storage system implemented by adaptive coding proposed in this paper has high storage density and low read coverage, which promotes the development of carbon-based storage systems.
Collapse
|
20
|
Zan X, Yao X, Xu P, Chen Z, Xie L, Li S, Liu W. A Hierarchical Error Correction Strategy for Text DNA Storage. Interdiscip Sci 2021; 14:141-150. [PMID: 34463928 DOI: 10.1007/s12539-021-00476-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2021] [Revised: 08/20/2021] [Accepted: 08/22/2021] [Indexed: 12/28/2022]
Abstract
DNA storage has been a thriving interdisciplinary research area because of its high density, low maintenance cost, and long durability for information storage. However, the complexity of errors in DNA sequences including substitutions, insertions and deletions hinders its application for massive data storage. Motivated by the divide-and-conquer algorithm, we propose a hierarchical error correction strategy for text DNA storage. The basic idea is to design robust codes for common characters which have one-base error correction ability including insertion and/or deletion. The errors are gradually corrected by the codes in DNA reads, multiple alignment of character lines, and finally word spelling. On one hand, the proposed encoding method provides a systematic way to design storage friendly codes, such as 50% GC content, no more than 2-base homopolymers, and robustness against secondary structures. On the other hand, the proposed error correction method not only corrects single insertion or deletion, but also deals with multiple insertions or deletions. Simulation results demonstrate that the proposed method can correct more than 98% errors when error rate is less than or equal to 0.05. Thus, it is more powerful and adaptable to the complicated DNA storage applications.
Collapse
Affiliation(s)
- Xiangzhen Zan
- Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
| | - Xiangyu Yao
- Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
| | - Peng Xu
- Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
| | - Zhihua Chen
- Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China
| | - Lian Xie
- Institution of Huangpu Research, Guangzhou University, Guangzhou, 510006, China
| | - Shudong Li
- Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, 510006, China
| | - Wenbin Liu
- Institution of Computational Science and Technology, Guangzhou University, Guangzhou, 510006, China.
| |
Collapse
|