Haplotype_Caller - MorrellLAB/sequence_handling GitHub Wiki

Basic Usage

The Haplotype_Caller handler uses the Genome Analysis Toolkit (GATK) to create a genomic variant call format (GVCF) file for each sample. It requires two inputs: a list of BAM files and the nucleotide diversity per base pair (Watterson's theta). Because the handler needs a large amount of memory, we recommend submitting the task array to the "ram256g" queue on MSI.

To run Haplotype_Caller, all common variables and handler-specific variables must be defined within the configuration file. Once the variables have been defined, Haplotype_Caller can be submitted to a job scheduler with the following command (assuming that you are in the directory containing sequence_handling):

./sequence_handling Haplotype_Caller Config

Where Config is the full file path to the configuration file.
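Because the handler works from a list of BAM files, it can help to confirm that every path in that list exists before submitting the job. The following is a minimal sketch of such a check. The function name check_bam_list is hypothetical and is not part of sequence_handling; it only assumes that the list contains one file path per line.

```shell
#!/bin/bash
# Hypothetical helper (not part of sequence_handling): checks that every
# path in a BAM list points to a readable file.
# Usage: check_bam_list /full/path/to/finished_bam_list.txt
check_bam_list() {
    local list="$1"
    local missing=0
    local bam
    # Read the list line by line, including a last line without a newline
    while IFS= read -r bam || [ -n "$bam" ]; do
        # Skip blank lines
        [ -z "$bam" ] && continue
        if [ ! -r "$bam" ]; then
            echo "Missing or unreadable: $bam" >&2
            missing=$((missing + 1))
        fi
    done < "$list"
    # Exit status is 0 only if no entries were missing
    [ "$missing" -eq 0 ]
}
```

The function returns a non-zero exit status if any entry is missing, so it can gate the submission, for example: check_bam_list "${FINISHED_BAM_LIST}" && ./sequence_handling Haplotype_Caller Config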

Handler-Specific Variables

The following is a list of variables that need to be defined within Config. In addition to the handler-specific variables, all common variables must be defined.

Variable Function
HC_QSUB QSub settings for batch submission. Recommended settings are "mem=250gb,nodes=1:ppn=24,walltime=24:00:00".
HC_QUEUE The queue to which the job will be submitted. Running sequence_handling on a server other than the one that provides the specified queue will produce an error message. Choose from: "lab", "mesabi", "ram256g", or another queue available on MSI. Recommended queue is "ram256g".
FINISHED_BAM_LIST A list of full file paths to the finished BAM files. This can be generated with sample_list_generator.sh.
THETA The nucleotide diversity per base pair (Watterson's theta). This varies per species, for example 0.008 for barley and 0.001 for soybean.
DO_NOT_TRIM_ACTIVE_REGIONS If true, GATK will not trim down the active region from the full region (active + extension) to just the active interval for genotyping. Recommended value: false.
FORCE_ACTIVE If true, all bases will be considered active regions. Recommended value: false.
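
As an illustration, the handler-specific part of Config might look like the sketch below. It assumes that Config is read as a set of shell variable assignments. The BAM list path is a placeholder, and the other values are the recommended settings from the table above, with the barley value used for THETA.

```shell
# Sketch of the Haplotype_Caller variables in Config (example values only)

# Recommended scheduler settings and queue from the table above
HC_QSUB="mem=250gb,nodes=1:ppn=24,walltime=24:00:00"
HC_QUEUE="ram256g"

# Placeholder path: replace with the full path to your own BAM list
FINISHED_BAM_LIST="/path/to/finished_bam_list.txt"

# Watterson's theta; 0.008 is the value listed above for barley
THETA=0.008

# Recommended values for the active-region options
DO_NOT_TRIM_ACTIVE_REGIONS=false
FORCE_ACTIVE=false
```

The common variables, such as OUT_DIR, still need to be defined elsewhere in the same file.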

Output

Haplotype_Caller generates a GVCF file for each BAM file specified. The GVCF files can be found at

${OUT_DIR}/Haplotype_Caller

Haplotype_Caller does not generate a list of its output files. However, you can generate one using sample_list_generator.sh.
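
If sample_list_generator.sh is not at hand, a list can also be built with standard tools. The sketch below is one way to do it and is not how sample_list_generator.sh itself works. The function name list_gvcfs is hypothetical, and the ".g.vcf" file extension is an assumption, so adjust the pattern to match the files in your output directory.

```shell
#!/bin/bash
# Hypothetical helper (not part of sequence_handling): writes a sorted list
# of GVCF file paths, one per line.
# Assumes the GVCF files end in ".g.vcf"; change the pattern if yours differ.
# Paths in the list are full paths when gvcf_dir is given as a full path.
# Usage: list_gvcfs "${OUT_DIR}/Haplotype_Caller" /path/to/gvcf_list.txt
list_gvcfs() {
    local gvcf_dir="$1"
    local out_list="$2"
    # Only look at regular files directly inside gvcf_dir
    find "$gvcf_dir" -maxdepth 1 -type f -name "*.g.vcf" | sort > "$out_list"
}
```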

Dependencies

Haplotype_Caller depends on GATK for generating the GVCFs. If the reference dictionary needs to be generated, Haplotype_Caller also depends on Picard. In addition, PBS is required for basic operation. Please check the dependencies page to ensure that you are using the required version of each dependency.

Next: Genotype_GVCFs