Home - nsc-norway/pipeline GitHub Wiki

Scope

This repository contains scripts that automate a few data processing tasks. They were written for the NSC sites and are probably not usable elsewhere. The scripts interact with the Clarity LIMS and the Slurm resource manager.

The following documentation is for the "v2" branch, which will soon be merged into the master branch.

The primary components are the "task" scripts, each prefixed with a number that indicates the order in which they have to run. The scripts handle processing, analysis and management of the data, and automate some parts of the data delivery process.
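As an illustration of the numeric-prefix convention, the following minimal sketch sorts script names by their leading number. The function name `order_tasks` and the file names are hypothetical and are not taken from the repository.

```python
import re


def order_tasks(script_names):
    """Return script names sorted by their leading numeric prefix."""
    def prefix(name):
        match = re.match(r"(\d+)", name)
        if match is None:
            raise ValueError(f"no numeric prefix in {name!r}")
        # Compare as integers so that e.g. 100 sorts after 90.
        return int(match.group(1))

    return sorted(script_names, key=prefix)


# Hypothetical file names, for illustration only.
ordered = order_tasks(["30_demultiplexing.py", "10_copy_run.py", "20_sample_sheet.py"])
```

Sorting on the integer value rather than the raw string keeps the order correct if a prefix with more digits is ever introduced.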

Demultiplexing / initial data preparation phase

10. Copy metadata

This step copies the run folder from the primary storage to the secondary storage, but excludes the actual data, so only the directory structure, logs, config files, stats, etc. are copied. The run folder on the secondary storage then becomes the "working directory", into which all the subsequent processes write.
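A copy that keeps the directory structure but skips the data files could look roughly like the sketch below. It assumes that the data can be recognised by file name pattern; the patterns in `DATA_PATTERNS` and the function name `copy_metadata` are assumptions for illustration, and the actual script may select files differently.

```python
import shutil

# Assumption: these glob patterns identify the "actual data".
# The real script may use a different rule.
DATA_PATTERNS = ("*.bcl", "*.bcl.gz", "*.cbcl")


def copy_metadata(primary_run_dir, secondary_run_dir):
    """Copy the run folder tree, skipping files that match DATA_PATTERNS.

    Directories are still created, so the structure is preserved.
    The destination must not exist yet.
    """
    shutil.copytree(
        primary_run_dir,
        secondary_run_dir,
        ignore=shutil.ignore_patterns(*DATA_PATTERNS),
    )
```

Because `shutil.ignore_patterns` only filters by name, directories that contain nothing but data end up as empty directories in the working directory.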

20. Pre-demultiplexing

The script in step 20 prepares the sample sheet: it takes the sample sheet generated by the LIMS and converts it into a format that bcl2fastq2 can read.

30. Demultiplexing

This step invokes bcl2fastq2 on a compute node using the configured remote execution command (srun).
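The sketch below shows one way such a command line might be assembled. It only builds the argument list and does not run anything. The function name, the `remote_prefix` parameter and any paths passed in are hypothetical. `--runfolder-dir`, `--output-dir` and `--sample-sheet` are standard bcl2fastq2 options, but the options the pipeline actually passes are not documented here.

```python
def build_demultiplex_command(run_dir, output_dir, sample_sheet,
                              remote_prefix=("srun",)):
    """Return the argument list for running bcl2fastq through a remote prefix.

    remote_prefix stands in for the configured remote execution command;
    an empty tuple would run bcl2fastq locally.
    """
    return [
        *remote_prefix,
        "bcl2fastq",
        "--runfolder-dir", str(run_dir),
        "--output-dir", str(output_dir),
        "--sample-sheet", str(sample_sheet),
    ]
```

The resulting list can be handed to `subprocess.run(command, check=True)`. Keeping the remote prefix as a parameter means the same code works with or without Slurm.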

40. Post-demultiplexing

These tasks run after demultiplexing and conclude the demultiplexing "phase". After this, the fastq files are ready, and the subsequent tasks handle the "QC phase".

The "move_results" script moves the result files into a standard file structure. The default structure exposes the LIMS-IDs of the samples, which are irrelevant and confusing to our users, so keeping the default file tree is not a desirable option. The "processed" script marks the run as processed in the LIMS, and moves the run folder on the primary storage into an archive directory.

QC Phase

50. Initial QC

Lightweight reporting scripts which don't depend on FastQC or any other computations. The "emails" script generates reports in a format helpful for NSC to send out delivery emails. The "update_lims" script posts the demultiplexing stats to the LIMS process.

60. FastQC

Invokes FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) on a compute node, and organises the output files into a manageable structure.

70. Reports

Generates reports using the output of FastQC and the demultiplexing stats.

80. Final processing

These tasks run once all the QC results are in place. The "md5sum" script generates a checksum for all files in each project, including the PDF QC reports.
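For illustration, per-project checksums can be produced with the Python standard library as in the sketch below. The function names are hypothetical, and the real script may instead call the `md5sum` command line tool. The output lines follow the `md5sum` text format (digest, two spaces, relative path).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(project_dir):
    """Return one md5sum-style line per file below project_dir, sorted by path."""
    project_dir = Path(project_dir)
    lines = []
    for path in sorted(p for p in project_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(project_dir).as_posix()
        lines.append(f"{md5_of_file(path)}  {relative}")
    return lines
```

Writing these lines to a file inside the project directory would let a recipient check the delivery with `md5sum -c`.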

90. Delivery prep

Various local automation tasks to prepare data for delivery.

Processes (details)