Fault Injection - dna-storage/framed GitHub Wiki
Fault Injection Environments
Fault injection is carried out in what we call a fault injection environment. The environment performs the basic operations that generate "faulty strands," and it also holds the classes that generate strand distributions and inject errors, which are covered in the next two sections. The environment is represented by a class named Fi_env, found in the file fi_env.py. This class is unlikely to interest the average user, since little customization needs to be done here. The only thing the user should be aware of is the set of keyword arguments that control the environment:
- `fault_model`: name of the fault model that will be used.
- `distribution`: name of the distribution that will be used to generate DNA copies.
- `reverse_complement`: indicates whether reverse complements should be mixed in when performing fault injection.
The names of valid fault injection models can be found in the `open` class method of the BaseFI class in fault_injector.py. The same holds for read distributions, in the ReadDistribution class in the file readdist.py. With this knowledge, and provided you use fault models/distributions that already exist, you can go directly to the part of the infrastructure that runs fault injection experiments; in that case, see the section that explains launching jobs.
Fault Injection Models
Writing new fault injection models requires extending base classes, similar to what we did for implementing new components. An outline of an extending fault injection model is as follows:
```python
# Need the FaultDNA representation
from dnastorage.fi.fault_strand_representation import *
from dnastorage.fi.fault_injector import BaseFI

class MyFaultInjector(BaseFI):
    def __init__(self, **args):
        BaseFI.__init__(self)
        # initialize with keyword arguments from args
    def Run(self):
        out_list = []
        # fill out_list with FaultDNA objects
        return out_list
```
The job of the fault injector is simple: using the attribute _input_library, it generates a list of DNA strands, each of which must be of type FaultDNA. The fault injector can be initialized with anything the user wants to pass in through the keyword arguments of the __init__ method. How the fault injector modifies the _input_library during the Run call is up to the programmer. A FaultDNA object is instantiated with a reference to a BaseDNA object and a string that represents the new DNA strand. The FaultDNA constructor ensures that all attributes inserted by probes are carried over to the new object. The FaultDNA class can be found in the file fault_strand_representation.py. Note that after writing a new fault injection model, it needs to be registered under a name in the `open` class method of BaseFI.
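As a concrete, purely illustrative sketch of the mutation logic a Run method might apply, the function below substitutes each base of a strand with a fixed probability. FaultDNA and _input_library come from the real library, so only the string-level logic is shown; a real injector would wrap each mutated string, together with its source BaseDNA object, in a FaultDNA.

```python
import random

def substitute_bases(strand, rate, alphabet="ACGT"):
    """Return a copy of strand where each base is replaced, with
    probability rate, by a different base from the alphabet."""
    out = []
    for base in strand:
        if random.random() < rate:
            # pick any base other than the current one
            out.append(random.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Inside Run(), each result would then be wrapped roughly as
# FaultDNA(base_dna, substitute_bases(original_string, rate))
```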
Strand Distribution Models
Distribution models assign a number of copies to each strand. This matters in a DNA storage system because sequencing data typically contains many copies of each strand, produced by a number of possible processes. Whether an evaluation tries to leverage sequencing depth is up to the programmer, but simple distribution models can easily simulate 1 read per strand. A useful example distribution is given in the following code block; it can be followed to write new distribution models that fit different needs. In this model, a strand is dropped out with probability 1-kwargs["mean"]. If a strand is not dropped out, the number of reads returned is kwargs["n_success"]. This models both strand dropout and a guaranteed sequencing depth; the random variable for the number of reads per strand is thus a Bernoulli random variable scaled by the sequencing depth. A read distribution needs to support several methods. The pmf method defines a probability mass function: given an input X, it returns the probability that the random variable takes the value X. Most important is the method gen, which is called by the fault injection environment and returns a number sampled from the read distribution. gen takes an argument named strand, the strand gen is being called for; using this parameter is optional. The sampled value is interpreted as the read depth for that strand. When initializing the model, the super class ReadDistribution can be initialized with the mean and standard deviation of the distribution. These are not strictly required, but they provide a sanity check on the properties of the distribution, so the fields can be left as 0 or None.
```python
class DNABernoulli(ReadDistribution):
    def __init__(self, **kwargs):
        assert "mean" in kwargs and "n_success" in kwargs
        self._n_success = kwargs["n_success"]
        self._success_prob = kwargs["mean"]
        ReadDistribution.__init__(self, kwargs["mean"]*kwargs["n_success"],
                                  (1-kwargs["mean"])*(kwargs["mean"])*kwargs["n_success"]**2)
    def pmf(self, X):
        if X != 0 and X != self._n_success:
            return 0
        elif X == 0:
            return 1 - self._success_prob
        else:
            return self._success_prob
    def gen(self, strand):
        rand = generate.rand()
        if rand <= self._success_prob:
            return self._n_success
        else:
            return 0
```
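To see the scaled-Bernoulli behavior in isolation, here is a self-contained sketch with the same pmf/gen shape; random.random() stands in for the library's generate.rand(), and the class name is illustrative rather than part of the library.

```python
import random

class BernoulliReads:
    """Standalone sketch mirroring DNABernoulli above."""
    def __init__(self, mean, n_success):
        self._success_prob = mean
        self._n_success = n_success
    def pmf(self, X):
        if X == self._n_success:
            return self._success_prob
        if X == 0:
            return 1 - self._success_prob
        return 0
    def gen(self, strand=None):
        # strand is accepted but unused, as in the model above
        return self._n_success if random.random() <= self._success_prob else 0

dist = BernoulliReads(mean=0.9, n_success=10)
reads = [dist.gen() for _ in range(10000)]
# each strand gets either 0 or 10 reads; the empirical mean
# approaches mean * n_success = 9
```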
As with the fault injection model, the read distribution model needs to be registered with the `open` class method of the base class ReadDistribution.
Configuring Fault Injection Runs
To this point we have covered all the essential parts of a fault injection run: writing pipelines and custom passes, writing fault injection models that modify encoded DNA strands, and writing read distribution models that control the sequencing depth distribution that will be simulated. In the following sections we will discuss the tool that puts all of these pieces together with their parameters, so that fault injection simulations can be run in a scalable manner over a number of batch jobs and MPI processes.
Fault Injection Tool
The main tool for fault injection runs is a Python script named generate_fi_jobs.py. The script's parameters are described in detail by running python generate_fi_jobs.py --help
. The main function of the script is to take in compute environment parameters, like number of cores, core depth, and memory, as well as a parameter configuration file that describes all of the fault injection experiments to be executed. One of the main strengths of this infrastructure is the ability to generate many simultaneous batch jobs for exploring a combinatoric space of encoder/decoder parameters along with fault injection environment parameters. The compute environment parameters are relatively straightforward, so we will focus the discussion on the parameter configuration file and its format.
Fault Injection JSON Config Files
We use json-formatted configuration files to express fault injection parameters. Several fields are expected in this json file, as described below. If any of these fields is missing, an error will be thrown.
- `file`: The file of data that will be encoded and decoded during fault injection.
- `arch`: Name of the pipeline given in the FileSystemFormats dictionary.
- `header_version`: Version of the header being used. Could default this to "0.5" for simplicity right now.
- `simulation_runs`: Number of fault injection samples run for each fault injection configuration.
- `encoder_params`: Parameters for the pipeline.
- `header_params`: Parameters for the header pipeline.
- `fault_params`: Parameters for the fault injection model.
- `distribution_params`: Parameters for the strand distribution model.
- `fi_env_params`: Parameters for the fault injection environment class.
- `dna_processing`: Each entry in this field is a function name, and the accompanying array for each entry is a set of parameters for that function. This dictionary can be left empty, but it is a way to apply basic DNA functions to the encoded strands, like transcription simulation, so that the decoded strands have fields that mirror those expected from experiments.
Within each of the fields that are dictionaries, e.g. encoder_params, arguments are populated that control the pipeline, fault injector, and read distribution objects. These arguments can be anything and depend on which pipeline and fault injection models you are using. This makes the infrastructure extremely flexible for simulating many different models with very different parameters, but it makes the user responsible for knowing, at least at a surface level, the implementation of the objects they use. The following code block is an example that can be found in the examples path.
```json
{
    "file": "path/to/file",
    "arch": "BasicHedges",
    "header_version": "0.5",
    "simulation_runs": 1000,
    "encoder_params": {
        "primer3": "",
        "primer5": "",
        "hedges_rate": 0.167,
        "crc_type": "index",
        "strandSizeInBytes": 86,
        "blockSizeInBytes": 86,
        "outerECCStrands": ["value_list", 0],
        "dna_length": 100000,
        "title": "payload",
        "fi": true,
        "seq": true,
        "try_reverse": false,
        "reverse_payload": true,
        "hedges_guesses": ["value_list", 100000]
    },
    "header_params": {
        "primer3": "",
        "primer5": "",
        "hedges_rate": 0.25,
        "strandSizeInBytes": 5,
        "blockSizeInBytes": 100,
        "outerECCStrands": 36,
        "dna_length": 400,
        "title": "header",
        "try_reverse": false,
        "fi": true
    },
    "fault_params": {
        "fault_rate": ["value_list", 0.1, 0.2]
    },
    "distribution_params": {
        "mean": ["value_list", 1],
        "n_success": ["value_list", 1]
    },
    "fi_env_params": {
        "fault_model": "fixed_rate",
        "distribution": "bernoulli",
        "reverse_complement": false
    },
    "dna_processing": {
        "T7_Transcription": []
    }
}
```
For any given parameter, there is a syntax understood by the back end of the fault injection script for describing a set of values that should be simulated. This helps express multiple related experiments in a single configuration file. As exemplified by the "fault_rate" parameter in the code block above, sets of values follow the format:
"<parameter_name>":["<parameter_type>", <parameter_options>]
Here <parameter_name> is the name of the parameter, <parameter_options> are the options for the parameter, and <parameter_type> specifies how the options should be interpreted. Three parameter types are currently supported:
"value list"
: The parameter will take on values specified as as a sequence of integers in the<parameter_options>
field."range"
: The parameters will take on values specified by a sequence of 3 integers (start,stop,step), where stop is non-inclusive. Integer ranges and float ranges are both supported."file_name"
: The parameter options for this type of parameter is a sequence of 2 strings, (base directory, regexp). What the backend does for this parameter type is that starting from the base directory, all sub-paths are searched for files that match the input regexp. The regexp follows Python's representation of regular expressions. This parameter type is useful for easily specifying a set of files that may need to be read by a certain object during fault injection."dir_name"
: Same exact thing asfile_name
, except directory names are searched for.
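A minimal sketch of how such a spec could be expanded into concrete values follows. The real logic lives in param_util.py; the function name and exact behavior here are assumptions, and the filesystem-based types (file_name, dir_name) are omitted.

```python
def expand_param(spec):
    """Turn ["<parameter_type>", <options>...] into a list of values."""
    kind, *opts = spec
    if kind == "value_list":
        return list(opts)            # values are taken verbatim
    if kind == "range":
        start, stop, step = opts     # stop is non-inclusive
        vals = []
        v = start
        while v < stop:
            vals.append(v)
            v += step
        return vals
    raise ValueError("unknown parameter type: %s" % kind)
```

For example, expand_param(["value_list", 0.1, 0.2]) yields [0.1, 0.2], and expand_param(["range", 0, 6, 2]) yields [0, 2, 4].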
As needed, the parameter types can be extended in the file param_util.py.
The fault injection script combines every value of each parameter with every value of every other parameter to create a combinatoric set of unique fault injection experiments. Take, for example, 2 parameters specified as the value lists ["value_list",1,2] and ["value_list",3,4]. The script combines each value in one value_list with each value in the other to get a unique set of 4 tuples, (1,3),(1,4),(2,3),(2,4), each representing a unique fault injection run. A third such list of two elements would give 8 unique runs. So, while this allows large spaces of simulations to be expressed in 1 script, care should be taken not to explode the space. After specifying your parameters, the fault injection script can be launched.
The next section will discuss how all of the outputs of each fault injection run will be organized.
Structure of Stored Fault Injection Data
For each fault injection experiment, a unique directory is generated. There is typically a directory for each set of parameters, indicated by the parameter fields embedded in the directory name; this allows easier navigation of separate experiments. When the number of parameters is excessively large, which typically occurs for encoder parameters, the script instead uses a hash of all the parameters to generate a unique directory name. The same happens when the file_name option is used, e.g. the actual names of the files used as parameters are replaced by a hash of their absolute paths.
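One way such a hash fallback could look is sketched below; this is purely illustrative, and the script's actual naming code may differ.

```python
import hashlib

def experiment_dir_name(params, max_len=100):
    """Embed key=value pairs in the directory name; fall back to a
    hash of the parameters when the name would grow too long."""
    name = "_".join("%s=%s" % (k, v) for k, v in sorted(params.items()))
    if len(name) > max_len:
        name = hashlib.sha256(name.encode()).hexdigest()
    return name
```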
At the end of each unique directory path for each experiment there are several files that should be output for every run as explained below:
- `encoder_params.json`: json file that stores the configuration for the encode/decode pipeline.
- `fault_params.json`: json file that stores the fault injection model configuration.
- `distribution_params.json`: json file that stores the read distribution model configuration.
- `fi_env_params.json`: json file that stores the parameters for the fault injection environment.
- `dna_process.json`: json file that stores the DNA processing functions used before decode.
- `sim_params.json`: json file that stores general parameters of the fault injection runs.
- `header_params.json`: json file that stores the configuration for the header's encode/decode pipeline.
- `fi.stats`: human-readable file of statistics output from fault injection.
- `fi.pickle`: binary pickled version of fi.stats. This allows subsequent tools to load data directly instead of parsing human-readable forms.
- `header_pipeline<X>.header`: binary header file output by the header pipeline for the last Xth fault injection simulation. Typically several copies of this file are generated by separate processes, but only one is stored for reference by the user.
- `payload_pipeline<X>.header`: binary header file output by the payload pipeline for the last Xth fault injection simulation. Typically this would not be stored in a real DNA storage system, but it is useful for debugging.
- `*.dna`: stores the encoded data of the file that was fault injected. This is provided for convenience so that the fault injection tool can be used to generate encoded files which can be sent for synthesis.
The *.json files output from fault injection record all of the parameters associated with the run so that data can be categorized by each parameter.
Compiling Fault Injection Data
After running fault injection experiments, you will likely have a large set of fi.pickle files. To compile this data into a format that can be easily navigated and plotted from, there is a data compilation tool named db_gen.py. This script takes in a base directory for the statistics to be compiled: the top directory generated by the fault injection generation script. The script generates a pickled Pandas dataframe and stores it at that top directory. The keys of the dataframe are determined by the statistics names in fi.stats and the *.json files that describe all of the configuration parameters. For each unique fi.stats path, a row is generated in the dataframe containing fields for all configuration parameters and stored statistics. This format allows the user to load the dataframe and query for certain configurations to navigate relevant data.
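Querying the compiled dataframe might look like the following sketch. The toy frame stands in for the pickle you would load with pd.read_pickle, and the column names (e.g. fault_rate, strands_decoded) are assumptions mirroring the *.json parameters and statistics names.

```python
import pandas as pd

# Toy stand-in for the compiled dataframe; in practice it would be
# loaded from the pickle stored at the top data directory.
df = pd.DataFrame({
    "fault_rate": [0.1, 0.1, 0.2],
    "mean": [1, 1, 1],
    "strands_decoded": [950, 940, 870],  # hypothetical statistic column
})

# Select all rows for one fault injection configuration
subset = df[df["fault_rate"] == 0.1]
```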