Analyzing Sequencing Results - dna-storage/framed GitHub Wiki

Analyzing sequencing data is done in two phases. The first phase uses a secondary tool to the main fault injection tool, controlled in a similar way, that takes in the sequencing data and runs it through the decode portion of the pipeline. Because sequencing data is unknown (each strand's origin identity isn't known until some form of decoding succeeds or the strand is aligned to a ground-truth set of DNA strands), data cannot be collected up front the way it was for fault injection; that is, we can't compare against a baseline strand to determine where errors are occurring. Since we would still like this kind of in-depth analysis for sequencing data, the first pass simply generates a mapping between each sequencing read and the index that identifies which encoded strand it originated from. The second phase then uses the exact same fault injection setup and code to further analyze the sequencing data, so that probes can be used to generate in-depth error analysis. The next sections demonstrate how to configure the first phase and how to use its results in a fault injection run.

Indexing Sequenced DNA

The script that generates index maps is generate_seq_jobs.py. Run python generate_seq_jobs.py --help for more information on its arguments. Many of the arguments are similar to those of the fault injection script, e.g. the queue to submit jobs to, time to run, number of cores, etc. The two most important arguments are:

--sequencing_path: A single file or directory of sequencing data to analyze. If it is a directory, each file in the directory will be run as an individual job.

--params: Path to a JSON file that describes the parameters for the decoder used during indexing.

The next section covers how the --params JSON file is specified.

Sequencing Config Files

Each parameter in the JSON file can be specified with "value_list", etc., just as for fault injection runs. However, there is one main difference between sequencing configuration files and fault injection configuration files: a sequencing configuration file supports multiple pipeline specifications. Multiple pipelines are allowed because sequencing data may come from an experiment that pools together multiple files that used separate decoders. Thus, to get the complete set of mapping information out of the sequencing data, it is convenient to run all relevant decoders at the same time. An example of a sequencing configuration file can be found in our examples directory and is also pasted in the following code block:

{
    "file_list":[
	{
	    "arch": "BasicHedges",
	    "header_version":"0.5",
	    "encoder_params":{
		"primer3": "TTTCCTAAAGGTCGGGGGCA", 
		"primer5": "TCGAAGGCGACAGAAGAGTCAAGGTTC",
		"hedges_rate":0.167,
		"crc_type":"index",
		"strandSizeInBytes": 86,
		"blockSizeInBytes": 86,
		"hedges_guesses":["value_list",200000],
		"outerECCStrands":0,
		"dna_length": 100000, 
		"title": "payload",
		"fi": true,
		"try_reverse":false,
		"reverse_payload":true
	    },
	     "header_params":{
		 "primer3": "AAAAAAAAAAAAAAAAA", 
		 "primer5": "AAAAAAAAAAAAAAAAA", 
		 "hedges_rate": 0.25, 
		 "strandSizeInBytes": 5, 
		 "blockSizeInBytes": 100,
		 "outerECCStrands":36,
		 "dna_length": 400, 
		 "title": "header", 
		 "fi": true
	     },
	    "payload_header_path":"/path/to/payload/header",
	    "header_header_path":"/path/to/header/header"
	}
    ]
}

Each sequencing JSON file must have a "file_list" entry. This entry must be a list, where each element is a dictionary that bundles together the parameters for one decoder to run on the sequencing data. Most of the entries within each dictionary are identical to those used by the fault injection generation script, and they do indeed refer to the same things. However, two entries are new to the sequencing config file: "payload_header_path" and "header_header_path". These are paths to the payload's header binary file and the header's header binary file, respectively. The latter is not strictly required, since it may be recoverable by decoding the DNA itself, but it may be included to ensure that the header is instantiated properly. Both files should have been generated during fault injection runs for the given encoder. For example, from the section on fault injection outputs, the file header_pipeline<X>.header would be used for header_header_path and payload_pipeline<X>.header for payload_header_path.
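As a quick sanity check before submitting jobs, the structure described above can be validated with a short script. This is a hypothetical helper, not part of FrameD; the required key names are taken from the example configuration, and "header_header_path" is left optional as discussed above.

```python
import json

# Hypothetical sanity-check helper (not part of FrameD) for a sequencing
# --params file. Key names mirror the example configuration above;
# "header_header_path" is optional, as discussed in the text.
REQUIRED_DECODER_KEYS = {"arch", "encoder_params", "header_params",
                         "payload_header_path"}

def validate_seq_params(path):
    with open(path) as f:
        cfg = json.load(f)
    # Every sequencing config must carry a top-level "file_list" list.
    if not isinstance(cfg.get("file_list"), list):
        raise ValueError('"file_list" must be a list of decoder dictionaries')
    for i, entry in enumerate(cfg["file_list"]):
        missing = REQUIRED_DECODER_KEYS - entry.keys()
        if missing:
            raise ValueError("entry {}: missing keys {}".format(i, sorted(missing)))
    return cfg
```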

Indexing Output Data

The organization of the data output from sequencing analysis is similar to fault injection: a unique directory path is generated for each sequencing file and decoder configuration. In each unique path you will find .json files identical to those output by the fault injection analysis, along with a few important new files:

  1. sequencing.stats: human readable form of data extracted from analyzing the sequencing data.
  2. sequencing.pickle: binary pickled form of sequencing.stats.
  3. sequencing.map: mapping data output from sequencing in human readable form.
  4. sequencing.map.pickle: mapping data in pickled binary form.

The file most likely of immediate interest is sequencing.stats, which provides a broad overview of how many sequencing strands were imported, how many strands could be indexed, and where strands were filtered out, if any. The mapping files, while important, are more of an intermediate output used for fault injection analysis.
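For programmatic inspection, the mapping pickle can also be loaded directly. This sketch assumes sequencing.map.pickle unpickles to a dictionary keyed by fastq record ID with index tuples as values, matching the statistic tracked by the probe shown in the next section; the helper name is ours.

```python
import pickle
from collections import Counter

# Sketch for inspecting a mapping pickle. We assume sequencing.map.pickle
# unpickles to a dict of fastq record ID -> index tuple (the structure the
# IndexDistribution probe tracks); summarize_map is a name of our own.
def summarize_map(map_pickle_path):
    with open(map_pickle_path, "rb") as f:
        seq_map = pickle.load(f)
    # Count reads per index for a quick coverage overview.
    coverage = Counter(seq_map.values())
    return len(seq_map), coverage
```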

Generating a Sequencing Map

A sequencing.map file is not something FrameD inherently knows how to generate. Thus, a user must include a probe that supports generating such a file. The most basic method is to check the decoded index of a strand and then record the index and the fastq ID it pairs with. This is done in the following code, a snippet from a class provided as a probe in this code base. During initialization of IndexDistribution we register a "sequencing.map" file to be output from simulation, with an associated statistic. The statistic is tracked as a dictionary, which lets us associate each fastq ID with an index. Then, during decoding, when an index successfully decodes we make the pairing between index and fastq ID. Note the if hasattr(s,"record_id") statement: if a strand is from sequencing data, the attribute record_id should be set to the strand's unique ID given by sequencing.

This is just one way a mapping can be generated; other forms of mapping can be written for different purposes. For example, one may want to map based on the best base-level alignment of a sequenced strand against a set of ground-truth strands. As long as the user knows the index integers assigned to each ground-truth strand during encoding, this is easily accomplished by following the given example with a simple input file for the probe that contains the ground-truth information.

class IndexDistribution(BaseCodec,Probe):
    probe_id=0
    def __init__(self,probe_name="",CodecObj=None,prefix_to_match=tuple()):
        BaseCodec.__init__(self,CodecObj)
        if probe_name=="":
            self.name = "index_dist_probe_{}".format(IndexDistribution.probe_id)
        else:
            self.name = probe_name
        self._prefix_to_match = prefix_to_match #barcode prefix that decoded indexes must match
        #counter statistics referenced in _decode
        self._total_indexes_lost = stats.get_next_name("{}::total_indexes_lost".format(self.name))
        self._total_indexes_lost_barcode_mismatch = stats.get_next_name("{}::total_indexes_lost_barcode_mismatch".format(self.name))
        self._total_indexes_decode = stats.get_next_name("{}::total_indexes_decode".format(self.name))
        self._ideal_index_dist_attr = stats.get_next_name("{}::ideal_decode_dist".format(self.name))
        self._seq_map = "{}::seq_map".format(self.name) #create name for statistic that will be dumped to "sequencing.map"
        stats[self._seq_map]={} #dictionary statistic to track mapping between index and fastq ID
        stats.register_file(self._seq_map,"sequencing.map") #register "sequencing.map" file
        IndexDistribution.probe_id+=1
    def _decode(self,s):
        #pipeline doesn't make index_ints until AFTER the whole inner CW to CW cascade runs, do our own translation for now
        try:
            index_ints = tuple(unpack_bytes_to_indexes(s.codewords[0:s.index_bytes],s.index_bit_set))
        except Exception as e:
            stats.inc(self._total_indexes_lost)
            return s
        if self._prefix_to_match != index_ints[:len(self._prefix_to_match)]:
            stats.inc(self._total_indexes_lost_barcode_mismatch)
        else:
            stats.inc(self._total_indexes_decode)
            #if the strand we are seeing is from sequencing, record its index
            if hasattr(s,"record_id"): 
                stats[self._seq_map][s.record_id]=index_ints
        return s
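As a sketch of the alignment-based alternative described above, the following hypothetical class maps a read to the ground-truth strand it most resembles. The class name, the ground-truth input format, and the use of difflib are all illustrative assumptions; a real probe would plug into the pipeline like IndexDistribution and use a proper aligner.

```python
import difflib

# Hypothetical alternative to IndexDistribution: map each sequenced strand to
# the ground-truth strand it aligns to best, rather than relying on a decoded
# index. The class name and input format are assumptions for this sketch.
class AlignmentMap:
    def __init__(self, ground_truth):
        # ground_truth: dict of index tuple -> reference DNA string, known
        # from encoding time as described in the text above.
        self._ground_truth = ground_truth

    def best_index(self, read):
        # Score each reference with a simple similarity ratio; a production
        # probe would use a real aligner (e.g. edit distance with gap costs).
        def score(item):
            return difflib.SequenceMatcher(None, item[1], read).ratio()
        index, _ = max(self._ground_truth.items(), key=score)
        return index
```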

Fault Injection Analysis of Indexed DNA

The main bridge between sequencing analysis and fault injection is the sequencing.map.pickle file. It is used by a fault injection model that understands this mapping and uses it to generate "faulty" data from a set of encoded data. Thus, the fault_model parameter needs to be modified in the fault injection configuration file, and the location of each sequencing map needs to be provided. This is shown in the following example fault injection configuration used to analyze sequencing data:

{
    "file":"path/to/file",
    "arch": "BasicHedges",
    "header_version":"0.5",
    "simulation_runs":1000,
    "encoder_params":{
	"primer3": "", 
	"primer5": "",
	"hedges_rate":0.167,
	"crc_type":"index",
	"strandSizeInBytes": 86,
	"blockSizeInBytes": 86,
	"outerECCStrands":["value_list",0],
	"dna_length": 100000, 
	"title": "payload",
	"fi": true,
	"seq":true,
	"try_reverse":false,
	"reverse_payload":true,
	"hedges_guesses":["value_list",100000]
    },
    "header_params":{
	"primer3": "", 
	"primer5": "", 
	"hedges_rate": 0.25, 
	"strandSizeInBytes": 5, 
	"blockSizeInBytes": 100,
	"outerECCStrands":36,
	"dna_length": 400, 
	"title": "header",
	"try_reverse":false,
	"fi": true
    },
    "fault_params":{
	"experiment_path": ["dir_name","/top/sequencing/path",".*\\.fastq/decoder.*/?$"]
    },
    "distribution_params":{
	"mean":["value_list",1],
	"n_success":["value_list",1]
    },
    "fi_env_params":{
	"fault_model":"sequencing_experiment",
	"distribution":"bernoulli",
	"reverse_complement":false
    },
    "dna_processing":{
    }
}

In this example, the only fields that need to be changed are the first and second arguments of the experiment_path entry in fault_params. The first argument is a path that acts as the root of a search for directory names that match the given regular expression. The second argument is the regular expression that each path name should match. This regular expression is how you target the subset of sequencing data that corresponds to a certain encoder.
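To see what the example regular expression selects, it can be applied to a few candidate paths. The sample paths below are made up for illustration; the pattern is the one from the config above.

```python
import re

# The directory-matching regex from "experiment_path" above, applied to some
# hypothetical result paths to show which ones survive the filter.
pattern = re.compile(r".*\.fastq/decoder.*/?$")

paths = [
    "set1_run1.fastq/decoder0",   # fastq directory with a decoder sub-path: kept
    "set1_run1.fastq/decoder0/",  # trailing slash is tolerated by /?$
    "set1_run1.fastq",            # no decoder component: filtered out
]
matches = [p for p in paths if pattern.match(p)]
```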

Note the fault model used here, named sequencing_experiment. This fault model replaces the synthetic faults typically used during in-silico trials with the actual raw sequenced strands themselves, allowing us to derive error characteristics of NGS data.

After running a fault injection experiment using sequencing index data, the exact same output files are generated as discussed previously, and the data can be compiled and analyzed with dataframes in exactly the same way.
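As one possible way to compile results, the sketch below walks an output tree and collects every sequencing.pickle into row dictionaries that can be handed to a dataframe library (e.g. pandas.DataFrame(rows)). It assumes each pickle holds a flat dictionary of statistics, which is an assumption about the on-disk format rather than documented behavior.

```python
import os
import pickle

# Sketch: gather each sequencing.pickle under `root` into a list of row
# dicts. Assumes each pickle unpickles to a flat dict of statistics (an
# assumption about the format, not documented FrameD behavior).
def collect_stats(root):
    rows = []
    for dirpath, _, filenames in os.walk(root):
        if "sequencing.pickle" not in filenames:
            continue
        with open(os.path.join(dirpath, "sequencing.pickle"), "rb") as f:
            row = pickle.load(f)
        row["path"] = dirpath  # keep provenance so runs stay distinguishable
        rows.append(row)
    return rows
```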

Walkthrough Example Analyzing Sequencing Data

In this section we provide an end-to-end example of how to perform fault injection analysis on NGS data.

Assume we have a directory structure as such:

/
 NGS_DATA/
  set1_run1.fastq
  set1_run2.fastq
  set2_run1.fastq
 framed/
  tools/
   lsf/
    config/
     seq.json
     fi.json

In this example /NGS_DATA is a directory containing a suite of fastq results from sequencing. For the sake of the example, we assume this data has two subsets, prefixed set1 and set2. To keep the example simple, we assume the user's working directory is /framed/tools/lsf/, which contains a sub-directory named config holding the two JSON files: one for sequencing, the other for fault injection.

An example first command that can be issued for NGS analysis is the following:

python generate_seq_jobs.py --dump_dir `pwd` --params /framed/tools/lsf/config/seq.json  --sequencing_path /NGS_DATA/   --sequencing_regex "set1.+"  ... 

This command has a params argument pointing to the pipeline specification, a sequencing_path argument pointing to the NGS data to analyze, and a sequencing_regex parameter. The final parameter is optional, but it is frequently useful. Typically, when sequencing experiments are performed, multiple loosely related experiments are pooled into a single run, so strands from each experiment are uniquely tagged and later separated into their respective fastq files. To keep data organization simple, all of these fastqs can live in the same directory, and the regex option restricts the pipeline to the files that match the given string. In this scenario, only jobs for files prefixed with set1 are run.
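The effect of --sequencing_regex on the example directory can be reproduced in a few lines of plain Python; the file names come from the directory tree above.

```python
import re

# The --sequencing_regex filter re-created in plain Python: with "set1.+",
# only the set1 fastq files from the example directory are picked up.
fastqs = ["set1_run1.fastq", "set1_run2.fastq", "set2_run1.fastq"]
selected = [f for f in fastqs if re.match("set1.+", f)]
```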

After running the previous command, you should now have a directory structure that looks something like:

/
 NGS_DATA/
  set1_run1.fastq
  set1_run2.fastq
  set2_run1.fastq
 framed/
  tools/
   lsf/
    config/
     seq.json
     fi.json
    NGS_DATA/
     unique_set1_run1_path/
      sequencing.map
      sequencing_data_link
     unique_set1_run2_path/
      sequencing.map
      sequencing_data_link

The tool we ran created a directory NGS_DATA in our working directory, with a sub-path unique to each analyzed fastq file. Under each path is a sequencing.map and a sequencing_data_link file. The former provides the mapping discussed earlier, linking each fastq read to a strand's index. The latter is a soft link to the fastq file that was analyzed; including this link lets each fastq's directory hold every piece of information relevant to the analysis, which makes writing fault injection support easier.

The last command is a fault injection command that should have a config file set up to ingest the mapping files and should look like:

python generate_fi_jobs.py --params /framed/tools/lsf/config/fi.json  ... 

Here fi.json should use the fault model "fault_model":"sequencing_experiment" discussed previously, which already supports the directory structure produced by the sequencing analysis.