DNA Storage Programming Model - dna-storage/framed GitHub Wiki
Programming DNA Storage Systems
Our infrastructure assumes a consistent view of how information is converted in a DNA storage system, which lets it provide simple interfaces between arbitrary encoding algorithms. This view resembles an assembly line: the starting binary information is transformed step by step until it becomes a final DNA strand, and vice versa for decoding. We call this a DNA storage pipeline. Because DNA molecules are unordered, a model for the organizational information carried in each DNA strand is also required. Lastly, we provide a systematic way to organize and store critical encoder information that must survive the time between instantiating the encoder and instantiating the decoder. The following sections cover encoding and decoding pipelines, the indexing mechanisms this infrastructure supports for automatically organizing and separating related DNA strands from a random mixture, and headers, which store critical encoder information in either DNA or digital form.
Encoding
A general representation of a DNA storage pipeline is shown in the following figure. Here, a Packet of binary information is broken down into several Byte Strands. A Byte Strand is a unit of information whose bytes will all be encoded on the same physical DNA strand. All Byte Strands of a packet pass through an outer encoding that may add error correction Byte Strands to tolerate strand dropouts, as indicated by the green highlighted Byte Strand. From there, each Byte Strand is processed individually through a series of passes that fall into two categories. One category is the Inner ECC passes, which convert Byte Strands into new Byte Strands, possibly with additional error correction added. The other category is the Physical passes, which serve two purposes: first, to convert binary information into its final DNA form, and second, to add functional DNA pieces, e.g. appending/prepending primers that will be used for polymerase chain reaction (PCR) or promoter sites for DNA-to-RNA transcription. The number of passes in each of the three categories shown here is not limited; however, indexing imposes some limitations on which algorithms will work with the outer encoding, as covered in TODO: put section link here
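The pass structure above can be sketched in Python. This is a minimal illustration of the assembly-line idea, not the framework's actual API: the class and method names (`Pass`, `encode`, `run_pipeline`) and the toy passes are all hypothetical.

```python
# Hypothetical sketch of an encoding pipeline: a Byte Strand flows through
# an inner-ECC pass and then physical passes. All names here are
# illustrative, not the framework's actual interfaces.

class Pass:
    def encode(self, strand):
        raise NotImplementedError

class ParityPass(Pass):
    """Toy inner-ECC pass: append a single XOR parity byte."""
    def encode(self, strand):
        parity = 0
        for b in strand:
            parity ^= b
        return strand + bytes([parity])

class BaseMapPass(Pass):
    """Toy physical pass: map each 2-bit pair to a nucleotide."""
    BASES = "ACGT"
    def encode(self, strand):
        out = []
        for b in strand:
            for shift in (6, 4, 2, 0):
                out.append(self.BASES[(b >> shift) & 0b11])
        return "".join(out)

class PrimerPass(Pass):
    """Toy physical pass: prepend/append primer sequences for PCR."""
    def __init__(self, front, back):
        self.front, self.back = front, back
    def encode(self, strand):
        return self.front + strand + self.back

def run_pipeline(strand, passes):
    for p in passes:
        strand = p.encode(strand)
    return strand

pipeline = [ParityPass(), BaseMapPass(), PrimerPass("ACGT", "TGCA")]
dna = run_pipeline(b"\x00\xff", pipeline)  # -> "ACGTAAAATTTTTTTTTGCA"
```

Decoding would run the same chain in reverse, with each pass providing a matching `decode` method.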
Decoding
The decoding process for a DNA storage pipeline is essentially encoding in reverse. However, decoding must handle one unique situation: there will likely be many copies of the same DNA strand. Copies can arise from synthesis, processing steps like PCR, or library preparation before sequencing, yet the outer encoding considers only one copy of each Byte Strand when decoding. A mechanism to consolidate copies is therefore required from the infrastructure, and there are two ways to accomplish this. One is to consolidate while the information is still in DNA form, as is done by Rastchian et al. The other is to decode every DNA strand up to the outer decoding and, at that point, consolidate the copies into a single Byte Strand. Each approach has its pros and cons, and both are supported by this infrastructure. The following two figures illustrate DNA and Byte Strand consolidation, respectively.
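As a concrete illustration of the second approach, Byte Strand copies that share an index can be merged by a per-position majority vote. This is only a sketch of the idea under the assumption of equal-length copies; the `consolidate` helper is hypothetical, not the framework's actual consolidation code.

```python
# Illustrative Byte Strand consolidation: given several decoded copies of
# the same indexed strand, take a per-position majority vote.
from collections import Counter

def consolidate(copies):
    """Majority-vote each byte position across equal-length copies."""
    result = bytearray()
    for i in range(len(copies[0])):
        votes = Counter(c[i] for c in copies)
        result.append(votes.most_common(1)[0][0])
    return bytes(result)

# Two of three copies carry an error, but at different positions,
# so the vote recovers the original bytes.
copies = [b"hello", b"hellx", b"hezlo"]
assert consolidate(copies) == b"hello"
```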
/images/decode_DNA_consolidate.jpg /images/decode_byte_strand_consolidate.jpg
Indexing
In order for the outer decoding to work, Byte Strands need to be rearranged into the order they held at encoding time. A common and efficient approach to this is indexing, which simply adds an integer to each DNA strand describing its position within the stored information. In our approach, we determine each strand's integers from its position within each sub-unit of information that has an outer encoding applied to it, as illustrated in the following figure. A packet may be a sub-unit of an entire file, so each strand carries a packet identifier integer; then, as the packet is broken down into sub-packets at different levels, an integer for each level is assigned to each strand. Tracing a path through the sub-packet tree therefore yields a unique index. Because (sub-)packets are broken down further into sub-packets that may have error correction applied, all the user building the encoder needs to do is specify the width of each tree level. The infrastructure interprets each width as a divider and automatically groups strands together to determine their indexes. Note: a divider need not evenly divide the number of Byte Strands it is dividing, nor does the product of the dividers need to evenly divide the number of Byte Strands in the packet; the infrastructure pads as needed and ensures that every Byte Strand has a unique index.
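The divider scheme amounts to a mixed-radix decomposition of a strand's position. The sketch below shows the arithmetic under that assumption; the function name and field layout are illustrative, since the framework derives these values automatically.

```python
# Sketch of divider-based indexing: a strand's position within its packet
# is converted into one integer per tree level using the user-specified
# widths ("dividers"), outermost level first. Names are illustrative.

def index_fields(position, dividers):
    """Return one index integer per tree level for the strand at
    `position`, plus a leading packet-level remainder."""
    fields = []
    for width in reversed(dividers):   # innermost level varies fastest
        fields.append(position % width)
        position //= width
    fields.append(position)            # whatever is left of the position
    return list(reversed(fields))

# With dividers (4, 5), strand 13 sits in sub-packet 13 // 5 = 2,
# at offset 13 % 5 = 3 within that sub-packet.
assert index_fields(13, [4, 5]) == [0, 2, 3]
```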
Barcoding
On top of indexing, our pipeline infrastructure allows a barcode to be assigned to each pipeline. A barcode acts as a unique identifier for a group of data that should be decoded by a particular pipeline. The simplest use case is differentiating two files whose DNA strands ended up in the same sequenced mixture. Without such an identifier the decoder would be confused in this case, especially if both files used the same encoding parameters, since there would eventually be two strands with distinct data but the same index. More advanced use cases include having multiple pipelines for the same file, one for each part of the file, as we will see soon. It therefore makes sense for barcodes to carry multiple levels of information, so our infrastructure supports arbitrary-length barcodes for each pipeline.
Barcodes are not only necessary to separate unrelated data; they are also practical and general. For example, one may be tempted to use the DNA strings employed during random access as unique identifiers of information. While that may work, such strings may be altered over the lifetime of a DNA strand in a storage system. By embedding the barcode as encoded information, this part of a DNA strand should remain relatively consistent and should not reasonably be targeted by any chemical process. Furthermore, barcodes provide a consistent way to partition the DNA strands in an experiment, so that multiple pipelines can easily be used to determine the complete strand composition of an experiment, as shown in the following figure.
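The partitioning step can be sketched as routing each decoded strand to its pipeline by barcode prefix. This is a minimal illustration assuming the barcode occupies a fixed-length prefix of the Byte Strand; the helper and its layout are hypothetical.

```python
# Illustrative partitioning of a sequenced mixture by pipeline barcode:
# each Byte Strand starts with a barcode identifying the pipeline that
# should decode the rest of it. The 1-byte barcode is an assumption.

def partition_by_barcode(strands, barcode_len):
    """Group (barcode + payload) Byte Strands by their barcode prefix."""
    groups = {}
    for s in strands:
        groups.setdefault(s[:barcode_len], []).append(s[barcode_len:])
    return groups

strands = [b"\x01payloadA", b"\x02payloadB", b"\x01payloadC"]
groups = partition_by_barcode(strands, 1)
assert groups[b"\x01"] == [b"payloadA", b"payloadC"]
assert groups[b"\x02"] == [b"payloadB"]
```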
Headers
When encoding information, it may be necessary to store information related to a specific part of the pipeline, information that may not be explicitly known to the user. Consider, for example, an encoding pass that randomizes the bytes in a Byte Strand by XORing them with a random sequence. To reverse this process, the decoder needs a seed so that the same random sequence can be regenerated exactly. Such a seed may be generated automatically by the encoder and thus should not need to be handled or explicitly known by the user. To take care of this information, we place it in a data structure known as the pipeline's header. The header can be stored either in DNA or in digital form. At decode time, the header is provided to the pipeline, where it is decoded and used to re-instantiate the decoder in the state it was in at encoding. A diagram of this process is shown in the following figure. In this illustration, the header for a pipeline is processed by a separate header pipeline. We do this so that header data can be treated differently from payload data, e.g. one may want to add extra error correction to this critical information to ensure that the decoder can at least be configured correctly. Because the header data needs its own pipeline, that pipeline also needs a header, and this header must be stored in digital form. This digital data should add little storage overhead: only one digitally stored header is required per file, and these headers are generally on the order of tens of bytes. We will cover how to supply your own header data for your own passes later in this document.
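The XOR-randomizing example above can be made concrete. The sketch below assumes the header is a simple key-value store and uses a hypothetical `"xor_seed"` key; the framework's actual header format and pass interface will differ.

```python
# Sketch of a randomizing pass whose seed lives in the pipeline header so
# that decoding can regenerate the identical random byte stream. The
# header layout and key name are assumptions for illustration.
import random

def randomize(strand, seed):
    """XOR each byte with a seeded pseudo-random stream (self-inverse)."""
    rng = random.Random(seed)
    return bytes(b ^ rng.randrange(256) for b in strand)

# Encoding: the pass draws a seed and records it in the header.
header = {}
seed = 12345                      # would normally be drawn at random
header["xor_seed"] = seed
encoded = randomize(b"payload", seed)

# Decoding: the header re-instantiates the pass in its encoding state;
# XORing with the same stream restores the original bytes.
decoded = randomize(encoded, header["xor_seed"])
assert decoded == b"payload"
```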
Performance Considerations
Parallelism
Performance is important for a DNA storage encoding evaluation framework because strands tend to be sequenced to a certain depth, leading to many copies of each strand that need to be decoded. Furthermore, a framework that evaluates encodings through fault injection must consider both the number of unique configurations to be evaluated and the number of random-injection samples to be performed on each configuration. Running every configuration's strands for every random fault-injection sample sequentially would clearly be time prohibitive, so this infrastructure allows for parallelism at three different levels, shown in the following figure:
Most of the bookkeeping needed to enable parallelism is handled automatically by the infrastructure; the user only needs to specify the amount of parallelism to use for fault-injection run samples and for decoding strands. The infrastructure natively relies on a compute environment that supports batch and MPI parallelism. Batch parallelism parallelizes separate configurations, which have different output data, while MPI parallelism parallelizes independent pieces of work whose data are still related to each other, e.g. fault-injection samples and strand decoding results need to be aggregated together, which is easiest when the parallel workers have a direct mode of communication. Note that while the exact MPI implementation should not matter to the infrastructure, batch jobs are currently only supported for the IBM LSF job scheduler, though extending this to other job schedulers should be relatively straightforward.
Currently, parallelism is supported only across configurations, samples, and strands; no parallelization is built into any given pass. This is because parallelizing the decode process for a single strand is algorithm specific, and allocating cores to larger units of work like strands or fault-injection samples is more favorable for load balancing, given that decoding each strand should generally be relatively little work. If parallelism during strand decoding is warranted, however, the easiest approach would be to use a shared-memory programming model like OpenMP or to leverage GPU devices. This is easiest because the infrastructure does not provide an MPI communicator that can be used within a strand, whereas with locally managed parallelism such as threads, each individual process can control the parallelism itself. All that is needed to enable such parallelism is to ensure that the compute environment is set up properly across nodes; e.g. if 16 OpenMP threads are required for a single strand and a node has 64 cores, then MPI processes need to be tiled at 4 per node.
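The tiling arithmetic at the end of that example is worth spelling out. The helper below is a hypothetical convenience, not part of the infrastructure; it just reproduces the 64-core, 16-thread calculation.

```python
# Toy helper for the tiling arithmetic described above: how many MPI
# processes fit on one node if each strand decode uses a fixed number of
# shared-memory (e.g. OpenMP) threads.

def procs_per_node(cores_per_node, threads_per_strand):
    """Integer number of MPI ranks to place on each node."""
    return cores_per_node // threads_per_strand

assert procs_per_node(64, 16) == 4   # the example from the text
```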
Language Considerations
When considering performance, the language used to implement encoding/decoding passes also influences run time. As you will see in the following section, all of the infrastructure that connects encoding/decoding passes together is implemented in Python. However, a programmer implementing new encoding/decoding passes who wants a higher-performance language like C/C++ can limit their use of Python to writing wrapper methods around Python-C++ interfaces that link to C/C++ code implementing the actual pass. In fact, any language with an interface to Python can be used to implement the core behavior of encoding/decoding algorithms. This should also help reduce the amount of code that needs to be rewritten when adopting this infrastructure.
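To illustrate the wrapper pattern without a build step, the sketch below wraps Python's standard `zlib` module, whose core is implemented in C, behind a hypothetical pass interface. A real pass would swap `zlib` for your own C/C++ code exposed via pybind11, ctypes, or a similar binding; the `CompressPass` class and its `encode`/`decode` methods are assumptions, not the framework's API.

```python
# The Python class is only a thin adapter; the heavy byte-level work runs
# in C (here, zlib's C implementation stands in for custom pass code).
import zlib

class CompressPass:
    """Hypothetical pass: Python wrapper over a C-implemented core."""
    def encode(self, strand):
        return zlib.compress(strand)
    def decode(self, strand):
        return zlib.decompress(strand)

p = CompressPass()
data = b"ACGT" * 100
assert p.decode(p.encode(data)) == data   # round-trips exactly
assert len(p.encode(data)) < len(data)    # repetitive data compresses
```

The same shape works for any binding technology: the pipeline only ever sees the Python `encode`/`decode` surface, so the core can be reimplemented in a faster language without touching the rest of the code.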