Validation - RCChan5/BioLockJ GitHub Wiki

Summary

Description: Validation checks whether the output files of a pipeline match the expectation.

validation.compareOn
validation.disableValidation
validation.expectationFile
validation.reportOn
validation.sizeWithinPercent
validation.stopPipeline

The validation utility creates a table for the output of each module where it reports the file name, size and md5. These tables are saved in the validation folder; the validation folder generated by a pipeline can be used as the expectations when re-running the same pipeline.

If there are no expectations, these values are reported in the validation folder.
If there are expectations, these values are reported and compared against the expected values; the result of the comparison is reported as either PASS or FAIL for each file.
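
As an illustration only, the following Python sketch builds the kind of table described above (file name, size in bytes, md5) and compares it against expected values. It is not BioLockJ's implementation: the function names `describe_outputs` and `compare_rows` are hypothetical, and treating a file with no expectation as FAIL is an assumption made for the sketch.

```python
import hashlib
from pathlib import Path


def describe_outputs(output_dir):
    """Return one (name, size, md5) tuple per regular file in output_dir."""
    rows = []
    for path in sorted(Path(output_dir).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        data = path.read_bytes()
        rows.append((path.name, len(data), hashlib.md5(data).hexdigest()))
    return rows


def compare_rows(rows, expected):
    """Map each file name to "PASS" or "FAIL".

    `expected` maps a file name to an (size, md5) tuple.
    Assumption: a file that has no expectation is reported as FAIL.
    """
    results = {}
    for name, size, md5 in rows:
        results[name] = "PASS" if expected.get(name) == (size, md5) else "FAIL"
    return results
```

This strict comparison requires both size and md5 to match exactly; it does not model the configurable choice of which columns to compare.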

If validation.stopPipeline=Y, the validation utility will halt the pipeline if any outputs FAIL to meet expectations; otherwise, the result is reported and the pipeline moves forward.

Soft Validation

Many components of a pipeline have the potential for tiny variation: maybe a date is stored in the output, or a reported confidence level is based on a random sampling. With these tiny variations, the file is practically the same, but it will FAIL md5 validation. The file might also be a few bytes bigger or smaller, so it will also FAIL size validation. "Soft validation" is the practice of allowing some wiggle room. If the config file gives validation.sizeWithinPercent=1, then an output file will PASS size validation as long as it is within 1.0% of the expected file size. By default, this value is 0, and a file must be exactly the expected size to pass size validation.
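
The size tolerance can be sketched in Python as follows. This is a hedged illustration, not BioLockJ's code: the function name `size_passes` is hypothetical, and treating a file that sits exactly on the boundary as a PASS is an assumption.

```python
def size_passes(actual_size, expected_size, size_within_percent=0.0):
    """Return True if actual_size is within the allowed percentage of expected_size.

    With the default of 0, the sizes must match exactly.
    Assumption: the boundary itself counts as passing.
    """
    allowed_difference = expected_size * size_within_percent / 100.0
    return abs(actual_size - expected_size) <= allowed_difference
```

For an expected size of 1000 bytes and a tolerance of 1, any file between 990 and 1010 bytes passes under this sketch.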

Expectations

Give the file path to the expectation file using validation.expectationFile=/path/to/saved/validation.

This path can point either to a tab-delimited table giving the expectations for a single module, or to a folder, in which case BioLockJ assumes that a file within this folder has a name that matches the module being validated. When validating an entire pipeline, the expectation files for all modules can be passed with a single path. The validation folder created by a pipeline is designed to be used as this input.

The expectation file format is:

The expectation file is a tab-delimited table.
The first row is column names.
The first column (labeled "name") gives the file names.
Optional column "size" gives the file size in bytes.
Optional column "md5" gives the md5 string.
Optional column "MATCHED_EXPECTION" is always ignored.
The file should not have any other columns.
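
To make the format concrete, here is a minimal Python sketch of a reader that enforces the rules listed above. It is an assumption-laden illustration rather than BioLockJ's parser: the function name `read_expectations` is hypothetical, and the ignored column name is copied exactly as it is spelled on this page.

```python
import csv
import io

# Column names as listed on this page; the last one is read but ignored.
ALLOWED_COLUMNS = {"name", "size", "md5", "MATCHED_EXPECTION"}


def read_expectations(handle):
    """Parse a tab-delimited expectation table into {file name: {column: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = reader.fieldnames or []
    if not columns or columns[0] != "name":
        raise ValueError('the first column must be labeled "name"')
    extra = set(columns) - ALLOWED_COLUMNS
    if extra:
        raise ValueError("unexpected columns: " + ", ".join(sorted(extra)))

    expectations = {}
    for row in reader:
        entry = {}
        if row.get("size"):  # optional column, size in bytes
            entry["size"] = int(row["size"])
        if row.get("md5"):  # optional column
            entry["md5"] = row["md5"]
        expectations[row["name"]] = entry
    return expectations


# Usage with an in-memory table (file name and md5 value are made up).
example = "name\tsize\tmd5\nreads.fq\t120\tabc\n"
parsed = read_expectations(io.StringIO(example))
```

A table that contains only the "name" column yields an empty entry per file, which corresponds to validating on name alone.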

Use cases

The expectation is usually based on a previous run of the same pipeline.

Maybe some software has been updated and the results are not expected to change, but you have to re-do your analysis with the latest version to appease reviewers.
Maybe you added a filtering step.
Maybe you just want to change module 5, and you expect 1-4 to produce the same outputs they did last time.
Maybe this analysis has been published and the original researcher made their pipeline available to you; you want to re-run it and know whether the output you generated is the same as what they had.

The expectation can be set by hand. This is recommended for validation using name only, or soft validation using size only. This is a way to prevent a pipeline from continuing after it is effectively doomed.

For example: Maybe module 5 is a resource-intensive classifier, and modules 1-4 are processing and filtering steps ending with the SeqFileValidator. If modules 1-4 filter out too much, you might not want to move forward with module 5 until you've made adjustments to the earlier modules. You could create an expectation file for module 4 that just lists the names of the files and their pre-filtering file sizes (in bytes), and set validation.sizeWithinPercent=80 and SeqFileValidator.stopPipeline=Y. With this, the pipeline will stop if any of those files are missing from the module 4 output or if any of them have been reduced by more than 80%. The output file names are predictable if you've ever seen output from that module before.
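
A hand-made expectation file of this kind could be produced with a short script. The Python sketch below is one possible way to do it, under stated assumptions: the function names, the file name `sample1.fastq`, and the 1,000,000-byte size are all hypothetical, and the lower bound is computed from the "within percent" reading of validation.sizeWithinPercent described earlier on this page.

```python
import csv
import io


def write_name_size_expectations(sizes, handle):
    """Write a tab-delimited table with "name" and "size" columns.

    `sizes` maps an expected output file name to its pre-filtering size in bytes.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "size"])
    for name, size in sorted(sizes.items()):
        writer.writerow([name, size])


def smallest_passing_size(expected_size, size_within_percent):
    """Smallest size that stays within the given percentage of expected_size."""
    return expected_size - expected_size * size_within_percent / 100.0


# Hypothetical input: one file that was 1,000,000 bytes before filtering.
buffer = io.StringIO()
write_name_size_expectations({"sample1.fastq": 1_000_000}, buffer)
lower_bound = smallest_passing_size(1_000_000, 80)
```

With a tolerance of 80, the hypothetical file would still pass at 200,000 bytes and would fail below that, which matches the "reduced by more than 80%" condition in the example.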