4. Information for Developers - DKFZ-ODCF/AlignmentAndQCWorkflows GitHub Wiki
Running the Tests
The workflow consists of a number of scripts written in various languages, including Bash, Perl, Python, R, and AWK. Currently, only the Bash and Perl code has some tests. You can run all tests with the following sequence of commands:
git clone https://github.com/kward/shunit2.git
export SHUNIT2=$PWD/shunit2/shunit2
cd resources/tests
runTests.sh
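Because the commands above export SHUNIT2 before running the tests, a small pre-flight check can catch a wrong path early. The following is only a sketch: the function name check_shunit2 is hypothetical and not part of the repository, and it assumes that the test runner needs SHUNIT2 to point to the shunit2 script.

```shell
# Hypothetical helper (not part of the repository): verify that SHUNIT2
# is set and points to a readable file before the tests are started.
check_shunit2() {
    if [ -z "${SHUNIT2:-}" ]; then
        echo "SHUNIT2 is not set" >&2
        return 1
    fi
    if [ ! -r "$SHUNIT2" ]; then
        echo "SHUNIT2 does not point to a readable file: $SHUNIT2" >&2
        return 1
    fi
    return 0
}

# Possible usage, assuming the shell is in the repository root:
#   check_shunit2 && (cd resources/tests && runTests.sh)
```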
Workflow Structure
The structure of the different workflow variants included in the plugin has been described in the section Configuring & Running. Some of the jobs are rather simple and don't need much documentation, but the alignAndPairSlim and mergeAndMarkDuplicatesSlim jobs are pretty complex, because they are highly tuned to reduce IO.
Basically, both jobs create SAM data in a "main line", as TAM (text-based) and BAM (binary variant) streams, and pipe that data into multiple tools for quality control. The quality-control part is identical for the two jobs. Here is the example for the alignAndPairSlim job:
And a little legend
Note that some parameters and input files may be missing from this alignAndPairSlim structure plot, but it provides a good overview. The implementation of the data streams is based on pipes and named pipes (FIFOs) in Bash. Data streams are split by tee or mbuffer, which is not reflected in the figure.
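The stream-splitting pattern can be illustrated with a toy example. This is a minimal sketch and not the workflow's actual code: the function name split_stream_demo, the file names, and the stand-in tools (wc as the "QC tool", sort as the "main line") are all assumptions chosen for illustration.

```shell
# Hypothetical demo: split one data stream into a main line and a QC side
# branch, using tee and a named pipe (FIFO).
split_stream_demo() {
    workdir=$(mktemp -d)
    fifo="$workdir/qc.fifo"
    mkfifo "$fifo"

    # Side branch: a stand-in QC tool that counts the records it receives.
    # It runs in the background and blocks until data arrives on the FIFO.
    wc -l < "$fifo" > "$workdir/qc.count" &
    qc_pid=$!

    # Main line: produce records, copy them into the FIFO with tee, and
    # continue processing the same data downstream without re-reading it.
    printf 'read1\nread2\nread3\n' | tee "$fifo" | sort -r > "$workdir/main.out"

    # Wait for the side branch before using its result.
    wait "$qc_pid"

    echo "qc_count=$(tr -d ' ' < "$workdir/qc.count")"
    echo "main_first=$(head -n 1 "$workdir/main.out")"
    rm -rf "$workdir"
}

split_stream_demo
```

The point of the pattern is that the data is produced once and consumed by several tools at the same time, so no intermediate file has to be written to and read back from disk.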
Input/Output
Each cluster job has its characteristic input and output behaviour. The read data is usually the largest part. The following list ignores reading the indices, genome sequences, and annotation files, as well as writing statistics files.
- The fastq jobs read the read files once.
- The alignAndPairSlim cluster job reads the read data 2 times and writes it 2 times:
  - 1 or 2 FASTQ input files
  - samtools sort/bamsort write and read the BAM data once
  - sorted lane-BAM (FILENAME_SORTED_BAM)
- The mergeAndMarkDuplicatesSlim cluster job reads the BAM data <= 2 times and writes it <= 2 times:
  - lane-BAM input
  - sambamba/Picard/bammarkduplicates create temporary files of, at most, about the size of the input files
  - merged-BAM output
- The coveragePlot[Single] job reads the merged-BAM once.
- The annotateCoverageWindows job reads the merged-BAM once.
The total number of inputs and outputs of the read data depends on the specific workflow (WGS, WES, WGBS) and its configuration (fastq, coverage plots, etc.).
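As a rough illustration, the per-job figures from the list above can be added up for a given set of jobs. This is only a sketch: the function name total_io and the label "fastq" are hypothetical, the "<= 2" figures are treated as upper bounds, and FASTQ and BAM passes are counted alike as one pass over the read data.

```shell
# Hypothetical helper: sum the upper-bound number of passes over the read
# data for the jobs given as arguments. Values are taken from the list above.
total_io() {
    reads=0
    writes=0
    for job in "$@"; do
        case "$job" in
            fastq)                      r=1; w=0 ;;
            alignAndPairSlim)           r=2; w=2 ;;
            mergeAndMarkDuplicatesSlim) r=2; w=2 ;;  # upper bounds (<= 2)
            coveragePlot)               r=1; w=0 ;;
            annotateCoverageWindows)    r=1; w=0 ;;
            *) echo "unknown job: $job" >&2; return 1 ;;
        esac
        reads=$((reads + r))
        writes=$((writes + w))
    done
    echo "reads=$reads writes=$writes"
}

# Example: one lane aligned, merged, and a coverage plot created.
total_io alignAndPairSlim mergeAndMarkDuplicatesSlim coveragePlot
```

With several lanes per sample, alignAndPairSlim would be counted once per lane, so the real totals depend on the input data as well as on the configuration.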
Final Remark
There is an outdated Docker version of the QC part at https://github.com/HiDiHlabs/PanCanQC.