Introduction

Introduction

This sub-workflow is designed to be a flexible variant calling pipeline that can accomodate both standard and joint genotyping using the software BCFtools. An additional aim when developing this pipeline was to speed up run-time, especially when joint genotyping, by calling variants for subsets of chromosomes in parallel and concatenating them all together at the end.

Arguments

The current version of the variant pipeline has the following arguments:

--sheet string               CSV file with paths to each BAM and its corresponding reference genome: '<path BAM>,<path reference>'
--method string              Method of variant calling. Options: standard, joint.
--vcftype string             Type of VCF to return. Options: variant, allsites.
--regions string             Directory path to region files (BED format) to call variants in. The basename of the file should match the reference it is to be used for.
--nodp boolean               Don't apply depth filter (only applies to 'standard' mode)
--mapq integer               Mapping quality applied at 'mpileup'. Default: 10.
--baseq integer              Base quality applied at 'mpileup'. Default: 10.
--genoq integer              Genotype-quality filter.
--ploidy integer             Ploidy of the organism. Default: 2
--mpileup_opt string         Optional arguments to pass to 'bcftools mpileup'. Provide within single quotes.
--call_opt string            Optional arguments to pass to 'bcftools call'. Provide within single quotes.
--filter_opt string          Optional arguments to pass to 'bcftools filter'. Provide within single quotes.
--view_opt string            Optional arguments to pass to 'bcftools view'. Provide within single quotes.
--norm_opt string            Optional arguments to pass to 'bcftools norm'. Provide within single quotes.
--sort_opt string            Optional arguments to pass to 'bcftools sort'. Provide within single quotes.

Below I'll go into each argument in a bit more detail.

Argument overview

Sheet

The pipeline requires a sample-sheet with two columns: BAM path and reference path. The first column contains the file paths to BAM files. The second column contains the file path to the reference the samples were aligned to - i.e. the BAMs corresponding reference file.

The CSV sample-sheet provides flexibility in processing the data. We can pass lots of different samples that have all had their sequence data aligned to different references without having to re-run the pipeline multiple times for different subsets.

An example of how the samplesheet should look is below:

/path/to/SAMPLE1.bam,/path/to/reference-A.fa
/path/to/SAMPLE2.bam,/path/to/reference-A.fa
/path/to/SAMPLE3.bam,/path/to/reference-B.fa

Method

This argument is used to specify the type of variant calling one wants to do. I've written the pipeline to accomodate joint-genotyping - that is, calling variants in multiple samples at the same time - and standard genotyping. Standard genotyping is simply calling variants on a sample-by-sample basis.

vcftype

The type of VCF output is dictated by the vcftype argument. The two options include allsites and variant. The allsites option returns all sites that could be genotyped, regardless of if they are variant or not ( these VCF files are also called invariant-VCF files). The alternative is the variant VCF which only returns a VCF file with variable sites (I.e. 0/1 or 1/1/).

Regions

The regions argument plays a role in a couple of pipeline processes. The argument expects a file path to a BED file corresponding to the regions you'd like to call variants in. Additionally, if you are joint genotyping, the first column of the BED file will be used to parallelise the variant calling process across chromosomes. For example, lets say we have the BED file below:

chr1    10  5000
chr8    390 2977
chrz    50000   1000000

First, variants will only be called in the regions between 10-5000 on chromosome 1, 390-2977 on chromosome 8 and 50000-1000000 in chromosome Z. Further, the pipeline will run variant calling on the three chromosomes ( chr1, chr8 and chrZ) in parallel (assuming joint genotyping).

If you do not provide a regions file, all chromosomes/contigs will be used and no splitting of the data will occur. Similarly, if you are variant calling in standard mode, the regions file will not be used to split samples by chromosome in an attempt to reduce cluster load (e.g. if you have hundreds of samples and many chromsomes, it could quickly become a big job...). Further, I couldn't be bothered writing the code to do that right now...

Nodp

This is simply an argument that can be used in standard mode. It'll prevent the pipeline from using 2 $\times$ Avg. DP and $^1/_3$ Avg. DP as the maximum and minimum depth filters.

MapQ/BaseQ/GenoQ/Ploidy

These are all simply arguments that are passed to the BCFtools suite during the variant calling process. They can be set to reasonable values to filter the pipeline results.

<argument>_opt

The final set of arguments are optional strings you can pass with software specific arguments. I've deliberately been light-on when it comes to hard-coded filters throughout the pipeline. Use these arguments to pass any additional custom arguments you'd like to the software. Simply pass the arguments in single quotes as a string. e.g.

--filter_opt '--IndelGap 10 --include'

Variant Pipeline - a-lud/nf-pipelines GitHub Wiki

Introduction

Arguments

Argument overview

Sheet

Method

vcftype

Regions

Nodp

MapQ/BaseQ/GenoQ/Ploidy

<argument>_opt

Pipeline schematic

⚠️ GitHub.com Fallback ⚠️

Variant Pipeline - a-lud/nf-pipelines GitHub Wiki

Introduction

Arguments

Argument overview

Sheet

Method

vcftype

Regions

Nodp

MapQ/BaseQ/GenoQ/Ploidy

<argument>_opt

Pipeline schematic

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️