At the moment, the write process must be measured in days because of the need to send the oligos to a specialist fabrication facility.
The aim of the memory “write” process is to turn the original data stream into a series of DNA oligonucleotides, or “oligos” for short. These can be sent to a company specialising in the manufacture of DNA to order, which returns a small ampoule of the data-encoded DNA.
Figure 1: The memory “Write” process (Source: Ron Neale)
Figure 1 illustrates the sequence of the data memory write process, with the segment size scaled down. It starts by breaking a block of the source data (in this case a tarball, similar to a Zip file) into small segments.
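As a minimal sketch of this first step, the Python below splits a block of bytes into fixed-size segments. The function name `split_into_segments`, the 32-byte default (chosen here only to match the payload size quoted later for each oligo), and the zero-padding of the last segment are all assumptions made for illustration, not details confirmed by the source.

```python
def split_into_segments(data: bytes, segment_size: int = 32) -> list[bytes]:
    """Split data into equal-sized segments (hypothetical helper).

    Assumption: the final segment is zero-padded so that every
    segment has the same length.
    """
    remainder = len(data) % segment_size
    if remainder:
        data += b"\x00" * (segment_size - remainder)
    return [data[i:i + segment_size] for i in range(0, len(data), segment_size)]


segments = split_into_segments(b"a" * 70)
print(len(segments), [len(s) for s in segments])
```

A 70-byte input gives three segments of 32 bytes each, the last one carrying 26 bytes of padding.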
The next step is the bitwise addition of randomly selected segments of the data. The small inset in Figure 1 shows the bimodal histogram for the segment selection; the number of segments subjected to the bitwise addition is mostly in the range 1 to 3, with a second peak at 13. This step is known as a Luby transform, named for the inventor of the first practical fountain code. The number of droplets will be slightly larger than the number of segments.
The next step is to add a randomly selected set of bits, described as a “seed,” to the data to create what are described as “droplets.” The seed is incremented for each droplet and acts to identify the droplet.
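The two steps above, bitwise addition of randomly chosen segments followed by attaching a seed, can be sketched roughly as follows. Bitwise addition is taken to be XOR (addition modulo 2). The function `make_droplet` is hypothetical, and two things are assumptions of this sketch rather than the published method: the seed is used to drive a pseudo-random generator so that a decoder could repeat the same selection, and the number of segments is drawn uniformly from 1 to 3 as a stand-in for the bimodal distribution shown in the figure.

```python
import random


def make_droplet(segments: list[bytes], seed: int, max_degree: int = 3) -> bytes:
    """Return a 4-byte seed followed by the XOR of randomly chosen segments.

    Assumptions: all segments have equal length, and a uniform choice of
    1..max_degree segments stands in for the real degree distribution.
    """
    rng = random.Random(seed)  # same seed -> same choice of segments
    degree = rng.randint(1, min(max_degree, len(segments)))
    chosen = rng.sample(range(len(segments)), degree)

    payload = bytes(len(segments[0]))  # start from all zero bits
    for idx in chosen:
        # Bitwise addition modulo 2 is XOR, applied byte by byte.
        payload = bytes(a ^ b for a, b in zip(payload, segments[idx]))
    return seed.to_bytes(4, "big") + payload


segments = [bytes([i]) * 32 for i in range(1, 6)]
# Incrementing the seed gives each droplet its own identifier.
droplets = [make_droplet(segments, seed) for seed in range(1, 8)]
print(len(droplets), len(droplets[0]))
```

Because XOR is its own inverse, a decoder that knows which segments went into a droplet can recover one unknown segment by XOR-ing out the ones it has already recovered, which is why the seed needs to travel with the payload.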
The innovation is the addition of a new step that is not part of the original fountain code methodology and that removes some of the problems and limitations of earlier attempts at using DNA as a data memory. It is a selection, or screening, process that maximises the effectiveness of the process and fully realises the coding potential of each nucleotide. As illustrated in Figure 1, the bit pairs of each droplet (00, 01, 10, 11) are converted into nucleobases (A, C, G, T, respectively). This is followed by a screening step that looks for biochemically undesirable features, such as homopolymer runs of the same base (e.g., TTTT) or a high GC content.
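The conversion from bit pairs to nucleobases can be illustrated with a few lines of Python. The mapping 00, 01, 10, 11 to A, C, G, T is the one given above; the function name `bytes_to_dna` and the choice to read each byte from its most significant bit pair first are assumptions of this sketch.

```python
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11


def bytes_to_dna(data: bytes) -> str:
    """Map every two bits to one nucleobase (hypothetical helper).

    Assumption: bit pairs are read most significant pair first.
    """
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


print(bytes_to_dna(b"\x1b"))  # 0x1b is 00 01 10 11 in binary -> ACGT
```

Each byte becomes four nucleotides, so the two bits per nucleotide are used in full before any screening takes place.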
Biochemical constraints dictate this screening step because high GC content or long homopolymer runs (e.g., TTTTT…) are undesirable, as they are difficult to synthesise and prone to sequencing (read) errors. The decay of these undesirable DNA features during storage can induce uneven representation of the oligos.
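A screening check of this kind might look like the sketch below. The function `passes_screen` is hypothetical, and the thresholds are illustrative assumptions only: runs of more than three identical bases are rejected (so TTTT fails), and GC content must fall between 45% and 55%. A droplet that fails would simply be discarded and a new one generated from the next seed.

```python
import re


def passes_screen(seq: str, max_run: int = 3,
                  gc_min: float = 0.45, gc_max: float = 0.55) -> bool:
    """Return True if seq has no long homopolymer run and acceptable GC content.

    Assumption: max_run, gc_min and gc_max are illustrative thresholds.
    """
    if not seq:
        return False
    # One base followed by max_run or more copies of itself,
    # i.e. a run longer than max_run.
    if re.search(r"(.)\1{%d,}" % max_run, seq):
        return False
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
    return gc_min <= gc_fraction <= gc_max


for candidate in ("ACGTACGT", "ACGTTTTACG", "GCGCGCGC"):
    print(candidate, passes_screen(candidate))
```

Here the first candidate passes, the second is rejected for its TTTT run, and the third for its 100% GC content.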
Figure 2: The final oligo with error checking and Illumina plate adapters (Source: Ron Neale)
As illustrated in Figure 2, each oligo in the work consisted of 38 bytes (152 nt), of which 32 bytes (128 nt) were the data payload, 4 bytes (16 nt) were the random seed (the transform droplet), and 2 bytes (8 nt) were for error checking. Added at each end were Illumina plate adapters, each of 6 bytes (24 nt).
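The byte and nucleotide counts above follow directly from the two bits carried by each nucleotide, as the short calculation below shows. The variable names are illustrative; the total length and the payload fraction are simply derived from the figures quoted, not taken from the source.

```python
BITS_PER_NT = 2  # one of A, C, G, T encodes two bits


def nt(num_bytes: int) -> int:
    """Number of nucleotides needed to hold num_bytes of data."""
    return num_bytes * 8 // BITS_PER_NT


PAYLOAD, SEED, ERROR_CHECK, ADAPTER = 32, 4, 2, 6  # sizes in bytes

core_nt = nt(PAYLOAD + SEED + ERROR_CHECK)  # 38 bytes -> 152 nt
total_nt = core_nt + 2 * nt(ADAPTER)        # plus a 24 nt adapter at each end -> 200 nt
payload_fraction = nt(PAYLOAD) / total_nt   # 128 / 200 = 0.64

print(core_nt, total_nt, payload_fraction)
```

On these figures, the synthesised strand is 200 nt long, of which 128 nt (64%) carry payload data.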