

Current research on DNA storage usually focuses on improving storage density by developing effective encoding and decoding schemes, while giving little consideration to the uncertainty inherent in ultra-long-term data storage and retention. Consequently, current DNA storage systems are often not self-contained, meaning that they must resort to external tools to restore the stored DNA data. This carries a high risk of data loss, since the required tools might not be available in the far future. To address this issue, we propose in this paper a self-contained DNA storage system that makes its stored data self-explanatory without relying on any external tool. To this end, we design a specific DNA file format in which a separate storage scheme is developed to reduce data redundancy and an effective index is designed to support random read operations on the stored data file. We verified through experimental data that the proposed self-contained and self-explanatory method not only removes the reliance on external tools for data restoration but also minimises the data redundancy incurred once the amount of data to be stored reaches a certain scale.

With the prevalence of big data-based applications, the quantity of data created across the globe each year is increasing exponentially 1, and global data storage demand is expected to rise rapidly to 175 ZB 2 by 2025, with 2.5 exabytes per day, far exceeding the capacity that traditional storage technologies can afford 3. On the other hand, given the uniqueness of the gathered big data, it is often desirable to archive them in external storage devices so that their value extends over a long period of time.

To address these challenges, DNA is emerging as a novel storehouse 4, 5, 6, 7, 8 of information, as it not only has the potential to be orders of magnitude denser 9 than contemporary cutting-edge techniques but is also stable enough to retain information for hundreds or even thousands of years 10, compared to hard drives, which might last only several decades 11, 12. Although it is a distinguished medium for storing enormous amounts of data over millennia by virtue of its inherent high density 12, 13, 14, 15 and durable preservation 16, 17, DNA is still hardly practical today for storing more than several hundred megabytes because of the high cost of DNA synthesis 18.
