4. Information for Developers - DKFZ-ODCF/AlignmentAndQCWorkflows GitHub Wiki

Running the Tests

The workflow consists of a number of scripts written in various languages, including Bash, Perl, Python, R, and AWK. Currently, only the Bash and Perl code has some tests. You can run all tests with the following sequence of commands:

git clone https://github.com/kward/shunit2.git
export SHUNIT2=$PWD/shunit2/shunit2
cd resources/tests
bash runTests.sh

Workflow Structure

The structure of the different workflow variants included in the plugin is described in the section Configuring & Running. Some of the jobs are rather simple and don't need much documentation, but the alignAndPairSlim and mergeAndMarkDuplicatesSlim jobs are quite complex, because they are highly tuned to reduce IO.

Both jobs create, in a "main line", SAM data as TAM (text-based) and BAM (binary variant) streams and pipe that data into multiple tools for quality control. The quality-control part is identical for the two jobs. Here is the example for the alignAndPairSlim job:

[Figure: alignAndPairSlim job structure]

And a short legend:

[Figure: jobStructureLegend]

Note that parameters and input files may be missing from this alignAndPairSlim structure plot, but it provides a good overview. The data streams are implemented with pipes and named pipes (FIFOs) in Bash. The streams are split with tee or mbuffer, which is not reflected in the figure.
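As an illustration of the pipe/FIFO pattern described here, the following is a minimal sketch and not the workflow's actual code. The file names, the printf "producer", and the use of wc and sort as stand-ins for a QC tool and the main-line processing are all assumptions made for the example. It shows one stream being duplicated by tee into a named pipe that feeds a side branch, while the main line continues.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory; a real job would use its own temp location.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Named pipe (FIFO) that carries a copy of the stream to the side branch.
qc_fifo="$workdir/qc.fifo"
mkfifo "$qc_fifo"

# Side branch: wc stands in for a quality-control tool reading the stream.
wc -l < "$qc_fifo" > "$workdir/record_count.txt" &
qc_pid=$!

# Main line: printf stands in for the aligner output. tee copies every
# record into the FIFO, and sort stands in for the main-line processing.
printf 'read3\nread1\nread2\n' \
    | tee "$qc_fifo" \
    | sort > "$workdir/sorted.txt"

# Wait for the side branch before using its result.
wait "$qc_pid"

record_count=$(tr -d ' ' < "$workdir/record_count.txt")
sorted_first=$(head -n 1 "$workdir/sorted.txt")
echo "records seen by QC branch: $record_count"
echo "first record on main line: $sorted_first"
```

The side-branch reader is started before the writer so that tee does not block forever when opening the FIFO. The data is read from its source only once, although two consumers process it.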

Input/Output

Each cluster job has its characteristic input and output behaviour. The read data is usually the largest part. The following list ignores reading the indices, genome sequences, and annotation files, as well as writing statistics files.

  • The fastq jobs read the read files once.
  • The alignAndPairSlim cluster job reads the read data 2 times and writes it 2 times:
    • 1 or 2 FASTQ input files
    • samtools sort/bamsort write and read the BAM data once
    • sorted lane-BAM (FILENAME_SORTED_BAM)
  • The mergeAndMarkDuplicatesSlim cluster job reads the BAM data <= 2 times and writes it <= 2 times:
    • lane-BAM input
    • sambamba/Picard/bammarkduplicates create temporary files of maximally about the size of the input files
    • merged-BAM output
  • The coveragePlot[Single] job reads the merged-BAM once.
  • The annotateCoverageWindows job reads the merged-BAM once.

The total number of inputs and outputs of the read data depends on the specific workflow (WGS, WES, WGBS) and its configuration (fastq, coverage plots, etc.).
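To make this dependency concrete, here is a rough sketch that adds up the per-job pass counts from the list above for a chosen set of jobs. The functions passes_for_job and total_passes are hypothetical helpers written for this example, and the job selection at the end is an assumed configuration. The sketch counts passes over the read data, not bytes (FASTQ and BAM sizes differ), it uses the upper bound of 2 for mergeAndMarkDuplicatesSlim, and it does not account for per-lane jobs running once per lane.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Passes over the read data per job, as "reads:writes", taken from the list
# above. The mergeAndMarkDuplicatesSlim values are upper bounds (<= 2).
passes_for_job() {
    case "$1" in
        fastq)                      echo "1:0" ;;
        alignAndPairSlim)           echo "2:2" ;;
        mergeAndMarkDuplicatesSlim) echo "2:2" ;;
        coveragePlot)               echo "1:0" ;;
        annotateCoverageWindows)    echo "1:0" ;;
        *) echo "unknown job: $1" >&2; return 1 ;;
    esac
}

# Sum the read and write passes over all jobs given as arguments.
total_passes() {
    local reads=0 writes=0 job passes
    for job in "$@"; do
        passes=$(passes_for_job "$job")
        reads=$(( reads + ${passes%%:*} ))
        writes=$(( writes + ${passes##*:} ))
    done
    echo "reads=$reads writes=$writes"
}

# Assumed configuration: a single lane with all listed jobs enabled.
total_passes fastq alignAndPairSlim mergeAndMarkDuplicatesSlim \
    coveragePlot annotateCoverageWindows
```

Dropping optional jobs from the argument list, for example the fastq or coverage plot jobs, lowers the totals accordingly.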

Final Remark

There is an outdated Docker version of the QC part at https://github.com/HiDiHlabs/PanCanQC.