QC pipeline - a-lud/nf-pipelines GitHub Wiki

Introduction
Arguments
Argument overview
Pipeline schematic
Output files

Introduction

This is a quality-control (QC) sub-workflow for paired-end short-read sequence data.

Arguments

The current version of the QC pipeline requires the following arguments.

--seqdir string              Directory path containing paired-end short-read FASTQ files.
--platform string            Specify the sequencing platform. Options: illumina, mgi.
--krakendb string            Directory path to pre-installed Kraken2 database.
--bq_phred integer           Minimum Phred quality for a base to count as qualified. Default is 15.
--n_base_limit integer       Number of N's in a read before read-pair is removed. Default is 5.
--average_qual integer       Minimum average quality a read needs to pass filtering. Default is 0 (no minimum average quality).
--length_required integer    Reads shorter than this length will be discarded. Default is 15.

Argument overview

Seqdir

This argument is the path to the directory containing the short-read sequencing data. The read files are expected to be paired-end and gzipped, with _R1 or _R2 (i.e. _R{1,2}) immediately before the file extension. If the files do not meet these requirements, the pipeline will fail to find them and exit with an error. Valid file names are shown below.

sample1_R1.fastq.gz
sample1_R2.fq.gz
sample.information_R1.fastq.gz
sample_information_R2.fq.gz
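
Before launching the pipeline it can help to check a directory listing against this naming convention. The following is a minimal Python sketch, not the pipeline's own file matching: the regular expression and the function name pair_fastq_files are assumptions based on the valid names listed above.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample>_R1 or <sample>_R2, followed by .fastq.gz or .fq.gz.
READ_FILE = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.(?:fastq|fq)\.gz$")


def pair_fastq_files(filenames):
    """Group file names by sample; return complete pairs, incomplete samples, and non-matching names."""
    by_sample = defaultdict(dict)
    unmatched = []
    for name in filenames:
        match = READ_FILE.match(name)
        if match is None:
            unmatched.append(name)
        else:
            by_sample[match["sample"]]["R" + match["read"]] = name
    # A sample is usable only when both mates are present.
    pairs = {s: reads for s, reads in by_sample.items() if set(reads) == {"R1", "R2"}}
    incomplete = sorted(s for s in by_sample if s not in pairs)
    return pairs, incomplete, unmatched


pairs, incomplete, unmatched = pair_fastq_files(
    ["sample1_R1.fastq.gz", "sample1_R2.fq.gz", "sample2_R1.fastq.gz", "sample3_1.fastq.gz"]
)
print(pairs, incomplete, unmatched)
```

Here sample1 forms a complete pair, sample2 is missing its R2 file, and sample3_1.fastq.gz does not follow the _R{1,2} convention.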

Platform

This argument specifies the sequencing platform the data was generated on. Currently illumina and mgi are valid options. This is important for the Fastp process, as it can correct MGI sequence headers that don't play well with downstream software (e.g. tools that operate on BAM files).

Krakendb

Provide the path to a pre-installed Kraken2 database. The Phoenix compute nodes do not have access to the internet, so it is easier to simply download this ahead of time and pass the path to the directory. Pre-compiled Kraken2 databases can be downloaded from this website.

Bq_phred

This is an argument specifically for Fastp (as are all the following arguments), corresponding to the qualified_quality_phred argument. It sets the minimum Phred quality a base needs to count as qualified. Bases below this value are counted as unqualified, and Fastp discards a read when too many of its bases are unqualified (controlled by Fastp's unqualified_percent_limit, 40% by default). The default value is 15.
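
To make the threshold concrete, here is a small Python sketch that decodes a FASTQ quality string and marks which bases meet bq_phred. It assumes Phred+33 encoding (the usual encoding for current Illumina and MGI FASTQ files); the function name qualified_mask is hypothetical and this is not Fastp's own code.

```python
def qualified_mask(quality_string, bq_phred=15, offset=33):
    """Return one boolean per base: True when its Phred score is >= bq_phred.

    Assumes Phred+33 encoding, i.e. score = ord(character) - 33.
    """
    return [ord(ch) - offset >= bq_phred for ch in quality_string]


# 'I' -> Q40, '5' -> Q20, '+' -> Q10, '#' -> Q2
mask = qualified_mask("II5+#")
print(mask)
```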

N_base_limit

This argument controls the number of N bases (uncalled bases) allowed in a read before the read pair is filtered out. The default value is 5.

Average_qual

This argument controls the minimum average quality of a read. If the average quality of a read drops below this value, it will be filtered out. The default value is 0, which means no filtering will take place and all reads will pass.

Length_required

The final argument controls the minimum length required of a read. If a read is shorter than this value (default: 15), it will be filtered out. I recommend setting this to around half the expected read length.
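
The three read-level thresholds above can be summarised in a short Python sketch. It is a simplified illustration with the defaults listed on this page, not Fastp's implementation: the function name passes_read_filters is hypothetical, the average is taken as a plain arithmetic mean of Phred scores, and a read is assumed to fail only when its N count exceeds n_base_limit.

```python
def passes_read_filters(sequence, phred_scores,
                        n_base_limit=5, average_qual=0, length_required=15):
    """Return True if a single read passes the length, N-count and average-quality checks."""
    # Length filter: reads shorter than length_required are discarded.
    if len(sequence) < length_required:
        return False
    # N filter: assumed to trigger only when the count exceeds the limit.
    if sequence.upper().count("N") > n_base_limit:
        return False
    # Average-quality filter: 0 means no minimum is applied.
    if average_qual > 0 and phred_scores:
        if sum(phred_scores) / len(phred_scores) < average_qual:
            return False
    return True


print(passes_read_filters("ACGT" * 10, [30] * 40))                 # long, clean read
print(passes_read_filters("ACGT", [30] * 4))                       # too short
print(passes_read_filters("N" * 6 + "A" * 20, [30] * 26))          # too many N bases
print(passes_read_filters("A" * 20, [10] * 20, average_qual=20))   # low average quality
```

For paired-end data a failing read would normally take its mate with it; the sketch leaves that step out.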

Pipeline schematic

Output files

The QC sub-workflow generates a number of output files. The most useful is likely to be multiqc_report.html, which aggregates all of the QC results.

qc-results/
├── reports
│   ├── DAG.svg
│   ├── report.html
│   ├── timeline.html
│   └── trace.txt
└── out_prefix
    ├── fastp
    │   ├── id.fastp.html
    │   ├── id.fastp.json
    │   ├── id_R1.fastq.gz
    │   ├── id_R1.unpaired.gz
    │   ├── id_R2.fastq.gz
    │   └── id_R2.unpaired.gz
    ├── fastqc
    │   ├── id_R1_fastqc.html
    │   ├── id_R1_fastqc.zip
    │   ├── id_R2_fastqc.html
    │   └── id_R2_fastqc.zip
    ├── kraken2
    │   ├── id_1.fastq.gz
    │   ├── id_2.fastq.gz
    │   ├── id-classified_1.fastq.gz
    │   ├── id-classified_2.fastq.gz
    │   └── id.report
    └── multiqc
        ├── multiqc_data
        │   ├── multiqc_citations.txt
        │   ├── multiqc_data.json
        │   ├── multiqc_fastp.txt
        │   ├── multiqc_fastqc.txt
        │   ├── multiqc_general_stats.txt
        │   ├── multiqc.log
        │   └── multiqc_sources.txt
        └── multiqc_report.html

There will be a sub-directory for each QC process containing the output files from that piece of software, in case you wish to examine the individual files or tool-specific outputs.
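
As an example of working with the per-tool outputs, the Python sketch below reads each id.fastp.json file and reports the fraction of reads that survived filtering. It assumes the directory layout shown above and that the Fastp JSON stores read counts under summary.before_filtering.total_reads and summary.after_filtering.total_reads; check these keys against one of your own files. The function names are hypothetical.

```python
import json
from pathlib import Path


def retained_fraction(report):
    """Fraction of reads kept, given one parsed Fastp JSON report (assumed key layout)."""
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]
    return after / before if before else 0.0


def reads_retained(results_dir):
    """Map each sample id to its retained fraction, scanning <results_dir>/*/fastp/*.fastp.json."""
    retained = {}
    for path in sorted(Path(results_dir).glob("*/fastp/*.fastp.json")):
        sample_id = path.name[: -len(".fastp.json")]
        retained[sample_id] = retained_fraction(json.loads(path.read_text()))
    return retained


if __name__ == "__main__":
    for sample_id, fraction in reads_retained("qc-results").items():
        print(f"{sample_id}\t{fraction:.1%}")
```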