Genotype_GVCFs - MorrellLAB/sequence_handling GitHub Wiki
Basic Usage
The Genotype_GVCFs handler uses the Genome Analysis Toolkit (GATK) to create variant call format (VCF) files for each chromosome of your organism by pooling together all of your samples. Because the samples are pooled into a single file per chromosome, every sample must be included in this step. The process is automatically broken into chromosome parts, which allows the job to run as a task array and reduces computing time.
To run Genotype_GVCFs, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Genotype_GVCFs can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):
./sequence_handling Genotype_GVCFs Config
Where Config is the full file path to the configuration file.
Handler-Specific Variables
The following is a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.
Variable | Function |
---|---|
GG_QSUB | QSub settings for batch submission. Recommended settings are "mem=22gb,nodes=1:ppn=16,walltime=24:00:00". |
GG_QUEUE | The queue to which the job is submitted. |
GVCF_LIST | A list of full file paths to the GVCF files. This can be generated with the one-liner find $(pwd -P) -name "*.g.vcf" | sort -V > gvcf_list.txt or with the sample_list_generator.sh script. |
REF_DICT | The reference dictionary, which should end in .dict. |
NUM_CHR | The number of chromosomes or chromosome parts the reference has. This is an integer value that varies per species. For barley: 15 (7*2 chromosome parts + chrUn). For soybean: 20 (this excludes scaffolds). |
CUSTOM_INTERVALS | Leave blank if you do not wish to call SNPs on non-chromosomal sequence. Otherwise, the full file path to a list of the names of any and all scaffolds or parts of the reference not covered by the chromosomes above. It should be a file ending in .intervals containing one scaffold name per line. SAMtools-style intervals are also acceptable, one per line (ex: chr1:100-200). |
PLOIDY | The sample ploidy. Highly inbred samples (most barleys) will have a ploidy of 1. |
THETA | Genotype_GVCFs uses the THETA parameter under Haplotype_Caller. The nucleotide diversity per base pair (Watterson's theta). This varies per species. For barley: 0.008. For soybean: 0.001. |
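
As an illustration of the table above, the handler-specific variables might be set as follows. This is a minimal sketch that assumes Config uses shell-style `NAME=value` assignments; the queue name and all file paths are hypothetical placeholders, and the numeric values are the barley examples from the table.

```shell
# Hypothetical excerpt of Config for Genotype_GVCFs.
# Assumption: Config is read as shell variable assignments.
# The queue name and every path below are placeholders, not real locations.

GG_QSUB="mem=22gb,nodes=1:ppn=16,walltime=24:00:00"  # recommended settings from the table
GG_QUEUE="my_queue"                                  # placeholder queue name
GVCF_LIST="/path/to/gvcf_list.txt"                   # list of full paths to the GVCF files
REF_DICT="/path/to/reference.dict"                   # must end in .dict
NUM_CHR=15                                           # barley: 7*2 chromosome parts + chrUn
CUSTOM_INTERVALS=""                                  # blank: skip non-chromosomal sequence
PLOIDY=1                                             # highly inbred samples
THETA="0.008"                                        # barley; defined under Haplotype_Caller
```

Leaving CUSTOM_INTERVALS empty corresponds to the "leave blank" case described in the table.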
Output
Genotype_GVCFs generates a VCF file for each chromosome or chromosome part. The VCF files can be found at:
${OUT_DIR}/Genotype_GVCFs
A list of files is not generated from Genotype_GVCFs. However, you can generate one using sample_list_generator.sh.
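
As an alternative to the script, a list of the output files can be built with find, mirroring the one-liner given for GVCF_LIST. The sketch below is illustrative only: the helper name `list_vcfs`, the output file name `vcf_list.txt`, and the `*.vcf` file pattern are assumptions, not part of sequence_handling.

```shell
# Print the paths of all VCF files under a directory in natural (version) order,
# so that, for example, chr2 sorts before chr10.
# Assumption: the output files end in ".vcf"; adjust the pattern if yours differ.
# Paths are absolute only if the directory argument is absolute.
list_vcfs() {
    find "$1" -name "*.vcf" | sort -V
}

# Example use: OUT_DIR is the output directory defined in Config.
# The list is only written if the Genotype_GVCFs directory exists.
if [ -d "${OUT_DIR:-}/Genotype_GVCFs" ]; then
    list_vcfs "${OUT_DIR}/Genotype_GVCFs" > vcf_list.txt
fi
```

Wrapping the command in a function keeps the directory as a parameter, so the same helper can be pointed at any other directory of VCF files.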
Dependencies
Genotype_GVCFs depends on GATK for generating the VCFs. In addition, PBS is required for basic operation. If the reference dictionary needs to be generated, Genotype_GVCFs also depends on Picard. Please check the dependencies page to ensure that you are using the required version of each dependency.