DNA storage is a new digital data storage technology based on specific encoding and decoding methods that convert between the 0/1 binary codes of digital data and the A-T-C-G quaternary codes of DNA. It is expected to develop into a major form of data storage in the future owing to its advantages, such as high data density, long storage time, low energy consumption, portability, concealed transportation, and multiple layers of encryption. In this review, we summarize recent research advances in four main encoding and decoding methods for DNA storage: the early-stage direct mapping method between 0/1 binary codes and A-T-C-G quaternary codes, fountain codes for higher logical storage density, inner and outer codes for random access to DNA storage data, and the CRISPR-mediated in vivo DNA storage method. The first three methods belong to in vitro DNA storage, which represents the mainstream of research and application in DNA storage. Their advantages and disadvantages are also reviewed: the direct mapping method is easy and efficient but has a high error rate and low logical density; fountain codes achieve higher storage density but do not support random access; inner and outer codes include an error-correction design and realize random access at the expense of logical density. This review provides a reference for, and an improved understanding of, DNA storage methods. The development of efficient and accurate encoding and decoding methods will play a very important, even decisive, role in the transition of DNA storage from the laboratory to practical application, which may fundamentally change the information industry in the future.
DNA storage technology is a new data storage technology that uses DNA as the storage medium. It stores digital data (text, image, audio, video, etc.) by synthesizing DNA with specific sequences according to a defined encoding method, and it recovers the data by decoding those sequences. Specifically, according to the encoding rules of DNA storage, the 0–1 binary codes that represent digital data are converted into corresponding DNA quaternary codes (i.e., combinations of A, T, C, and G), and the corresponding DNA is then synthesized so that the digital information is stored in DNA molecules with specific sequences. Conversely, according to the corresponding decoding rules, the stored DNA is sequenced to obtain the DNA quaternary codes, which are then restored to digital data in 0–1 binary codes. The encoding and decoding methods follow the same “codebook”, and decoding is the reverse process of encoding (Fig. 1).
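The round trip between binary and quaternary codes can be illustrated with a minimal sketch. The two-bits-per-base codebook below (00→A, 01→C, 10→G, 11→T) and the function names `encode` and `decode` are assumptions chosen for illustration; they are not the mapping of any particular published scheme, which typically adds constraints such as avoiding long homopolymer runs.

```python
# Hypothetical codebook: two bits per nucleotide (an illustrative assumption,
# not the mapping used by any specific published method).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert bytes to a DNA sequence, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(sequence: str) -> bytes:
    """Convert a DNA sequence back to bytes using the same codebook."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)          # sequence that would be synthesized
recovered = decode(strand)        # result after sequencing and decoding
print(strand)                     # CACACATGCAAC
print(recovered == message)       # True
```

Because both functions read from the same table, decoding is exactly the inverse of encoding, which is the sense in which the two steps share one codebook. The sketch assumes error-free synthesis and sequencing.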
This paper reviews research advances in encoding and decoding methods for DNA storage, covering several primary methods (Table 1): direct mapping between 0–1 binary digital data and A-T-C-G quaternary DNA storage data in the early stages (Church, Blawat, Goldman and Bornholt labs), fountain codes (Erlich & Zielinski and Anavy labs), inner and outer codes for random access to DNA storage data (Grass and Organick labs), and CRISPR-mediated in vivo DNA storage technology (developed by Church et al. 2012).
Each encoding and decoding method has its advantages and disadvantages. The direct mapping method is easy and efficient but has a high error rate and low logical density. Fountain codes made a breakthrough in logical density, reaching up to 80% of the theoretically estimated upper limit for DNA storage, but they cannot achieve random access. Grass's inner and outer codes have an error-correction design (at the chain and data block levels) and realize random access at the expense of logical density. These three encoding/decoding methods all belong to in vitro DNA storage, which has the advantages of high-throughput storage and lower cost. CRISPR-mediated DNA storage technology belongs to in vivo DNA storage, which has the advantages of long-term stable storage and cheap, random amplification of copies of the stored data. However, it is currently at an early exploratory stage and is suitable only for storing small amounts of data, owing to technical bottlenecks that have not yet been overcome (such as biochemical restrictions in vivo).
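The trade-off between error correction, random access, and logical density can be made concrete with a small calculation. The sketch below assumes that logical density is measured as information bits per synthesized nucleotide and that an unconstrained payload carries at most 2 bits per nucleotide. The function name `logical_density` and the strand layouts are hypothetical values chosen only for illustration, not figures from any of the cited methods.

```python
def logical_density(payload_nt: int, index_nt: int, ecc_nt: int,
                    bits_per_payload_nt: float = 2.0) -> float:
    """Information bits stored per synthesized nucleotide of one strand.

    payload_nt: nucleotides that carry user data
    index_nt:   nucleotides spent on addressing (e.g. for random access)
    ecc_nt:     nucleotides spent on error-correction redundancy
    """
    total_nt = payload_nt + index_nt + ecc_nt
    if total_nt <= 0:
        raise ValueError("strand must contain at least one nucleotide")
    return bits_per_payload_nt * payload_nt / total_nt


# Hypothetical 150-nt strand layouts (illustrative numbers only).
dense = logical_density(payload_nt=140, index_nt=10, ecc_nt=0)
protected = logical_density(payload_nt=100, index_nt=20, ecc_nt=30)

print(f"little overhead:       {dense:.2f} bits/nt")
print(f"index + error control: {protected:.2f} bits/nt")
```

In this toy layout, every nucleotide spent on addressing or redundancy is one that no longer carries user data, so the protected strand stores fewer bits per nucleotide than the dense one.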