Variant Pipeline - a-lud/nf-pipelines GitHub Wiki
This sub-workflow is designed to be a flexible variant calling pipeline that can accomodate both standard and joint genotyping using the software BCFtools. An additional aim when developing this pipeline was to speed up run-time, especially when joint genotyping, by calling variants for subsets of chromosomes in parallel and concatenating them all together at the end.
The current version of the variant
pipeline has the following arguments:
--sheet string CSV file with paths to each BAM and its corresponding reference genome: '<path BAM>,<path reference>'
--method string Method of variant calling. Options: standard, joint.
--vcftype string Type of VCF to return. Options: variant, allsites.
--regions string Directory path to region files (BED format) to call variants in. The basename of the file should match the reference it is to be used for.
--nodp boolean Don't apply depth filter (only applies to 'standard' mode)
--mapq integer Mapping quality applied at 'mpileup'. Default: 10.
--baseq integer Base quality applied at 'mpileup'. Default: 10.
--genoq integer Genotype-quality filter.
--ploidy integer Ploidy of the organism. Default: 2
--mpileup_opt string Optional arguments to pass to 'bcftools mpileup'. Provide within single quotes.
--call_opt string Optional arguments to pass to 'bcftools call'. Provide within single quotes.
--filter_opt string Optional arguments to pass to 'bcftools filter'. Provide within single quotes.
--view_opt string Optional arguments to pass to 'bcftools view'. Provide within single quotes.
--norm_opt string Optional arguments to pass to 'bcftools norm'. Provide within single quotes.
--sort_opt string Optional arguments to pass to 'bcftools sort'. Provide within single quotes.
Below I'll go into each argument in a bit more detail.
The pipeline requires a sample-sheet with two columns: BAM path and reference path. The first column contains the file paths to BAM files. The second column contains the file path to the reference the samples were aligned to - i.e. the BAMs corresponding reference file.
The CSV sample-sheet provides flexibility in processing the data. We can pass lots of different samples that have all had their sequence data aligned to different references without having to re-run the pipeline multiple times for different subsets.
An example of how the samplesheet should look is below:
/path/to/SAMPLE1.bam,/path/to/reference-A.fa
/path/to/SAMPLE2.bam,/path/to/reference-A.fa
/path/to/SAMPLE3.bam,/path/to/reference-B.fa
This argument is used to specify the type of variant calling one wants to do. I've written the pipeline to accomodate joint-genotyping - that is, calling variants in multiple samples at the same time - and standard genotyping. Standard genotyping is simply calling variants on a sample-by-sample basis.
The type of VCF output is dictated by the vcftype
argument. The two options include allsites
and variant
.
The allsites
option returns all sites that could be genotyped, regardless of if they are variant or not (
these VCF files are also called invariant-VCF files). The alternative is the variant
VCF which only returns
a VCF file with variable sites (I.e. 0/1 or 1/1/).
The regions
argument plays a role in a couple of pipeline processes. The argument expects a file path to a
BED file corresponding to the regions you'd like to call variants in. Additionally, if you are joint
genotyping,
the first column of the BED file will be used to parallelise the variant calling process across chromosomes. For
example, lets say we have the BED file below:
chr1 10 5000
chr8 390 2977
chrz 50000 1000000
First, variants will only be called in the regions between 10-5000 on chromosome 1, 390-2977 on chromosome 8
and 50000-1000000 in chromosome Z. Further, the pipeline will run variant calling on the three chromosomes (
chr1, chr8 and chrZ) in parallel (assuming joint
genotyping).
If you do not provide a regions file, all chromosomes/contigs will be used and no splitting of the data will occur. Similarly, if you are variant calling in standard mode, the regions file will not be used to split samples by chromosome in an attempt to reduce cluster load (e.g. if you have hundreds of samples and many chromsomes, it could quickly become a big job...). Further, I couldn't be bothered writing the code to do that right now...
This is simply an argument that can be used in standard
mode. It'll prevent the pipeline from using 2
These are all simply arguments that are passed to the BCFtools
suite during the variant calling process.
They can be set to reasonable values to filter the pipeline results.
The final set of arguments are optional strings you can pass with software specific arguments. I've deliberately been light-on when it comes to hard-coded filters throughout the pipeline. Use these arguments to pass any additional custom arguments you'd like to the software. Simply pass the arguments in single quotes as a string. e.g.
--filter_opt '--IndelGap 10 --include'