Running GATK version 3.7 variant calling - SIWLab/Lab_Info GitHub Wiki
The lab generally uses GATK pipelines to call SNPs and filter VCFs. Below is an example pipeline compiled from Wright Lab graduate students with descriptions for each step and an example of hard filtration (filtering entire sites out of a VCF based on some criterion). These roughly follow the best practices from GATK. In all of the examples, I use bash variables in my filenames, which are explained at the end. GATK and Picard are java programs, which you run with java -jar <path-to-gatk-or-picard>
, but in the steps I don't put the path to each---make sure you do. Note: This tutorial assumes you have paired-end Illumina sequence data. If you have something else, make sure to check every step for the appropriate options.
UPDATE: the examples below are for GATK 3.7. Things have changed in GATK4, but it should be similar.
There's multiple copies of each on the servers, but if you want or need to install your own, go to these links and/or ask someone.
install GATK: https://software.broadinstitute.org/gatk/guide/quickstart install Picard: https://broadinstitute.github.io/picard/ , https://github.com/broadinstitute/picard/releases/tag/2.8.1
A quick note: when you align a genome, you're going to end up with a sam file at some point. Convert this to a bam with samtools and delete the sam. Never keep a sam on the server---they're far too large and completely unnecessary.
For every bam you're running you're going to have to sort and index using samtools. This is written assuming samtools version 1.3.1 is in your $PATH. This proceeds as follows:
- Sort your bam with this command:
samtools sort -l 9 -O bam -o ${prefix}/${sample}.sorted.bam ${prefix}/${sample}.bam >> sort1.out 2>> sort1.err
The option -O
specifies the output file type, -o
specifies the output file path, and the input path simply follows. The arrows >>
and 2>>
are optional, but they specify paths to append std out and std error to, respectively.
- Index your bam:
samtools index -b ${prefix}/${sample}.sorted.bam ${prefix}/${sample}.sorted.bam.bai >> index1.out 2>> index1.err
The option -b
specifies the input bam path (same as the output from the previous command), followed by the output path for the index file, with the standard .bai
extension.
These steps essentially filter your alignment before you do any variant calling.
- Add or replace read groups (checks that you have proper read groups)
java -Djava.io.tmpdir=~/tmp -jar picard AddOrReplaceReadGroups \
INPUT=${prefix}/${sample}.sorted.bam \
OUTPUT=${prefix}/${sample}.sorted.rg.bam \
RGLB=DKCr \
RGPL=Illumina \
RGPU=unit1 \
RGSM=${sample} \
>> ~/RG.out 2>> ~/RG.err
Followed by another indexing step with samtools for the new bam:
samtools index -b ${prefix}/${sample}.sorted.rg.bam ${prefix}/${sample}.sorted.rg.bam.bai >> index2.out 2>> index2.err
Full disclosure at this point: I'm not sure if you need a samtools indexing here or a picard one, so just run both:
java -jar picard BuildBamIndex \
I=${prefix}/${sample}.sorted.rg.bam
- Split cigars Read GATK's description of this if you're curious.
java -jar gatk -T SplitNCigarReads -R ${ref} -I ${prefix}/${sample}.sorted.rg.bam -o ${prefix}/${sample}.sorted.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS >${sample}.sorted.split.out 2>${sample}.sorted.split.err
- Mark duplicates and index
java -jar picard MarkDuplicates \
I=${prefix}/${sample}.sorted.split.bam \
O=${prefix}/${sample}.sorted.split.dedup.bam \
M=${prefix}/${sample}.metrics.txt
...and index with picard again
java -jar picard BuildBamIndex \
INPUT=${prefix}/${sample}.sorted.split.dedup.bam
If you do these steps, make sure you're updating your input and output paths along the way...
- Create GVCF This is the final 'pre-processing' step. This creates GATK's GVCF file, which is essentially a VCF with information that only GATK can interpret.
java -jar gatk -T HaplotypeCaller -R ${ref} -I ${prefix}/${sample}.sorted.split.dedup.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o ${prefix}/${sample}.g.vcf -ERC GVCF >${prefix}/${sample}_hapl.out 2>${prefix}/${sample}_hapl.err
The result will be in ${prefix}/${sample}.g.vcf
, but note that this isn't a usable vcf. You have to do the next steps to call the genotypes.
IMPORTANT: if you want invariant sites as well, you have to use
-ERC BP_RESOLUTION
not the GVCF option. If you don't add this flag, your gvcf will be useless.
The following steps will combine all of your GVCFs into a single GVCF, then produce a usable VCF with all samples.
- Combine GVCFs Here I only show combining two samples:
java -jar gatk -T CombineGVCFs -R ${ref} --variant DKCr_182_84.15.g.vcf --variant DKCr_11_1207.g.vcf -o DKrubella.g.vcf > combgvcf.out 2> combgvcf.err
For each individual GVCF that you have, you need to call the --variant
option. So if you have 45 samples, you'll have 45 --variant
options. The file in -o
will be a combined GVCF.
- Genotype GVCF This is the step that will give you a combined, usable VCF. If you want only variant sites, run:
java -jar gatk -T GenotypeGVCFs -R ${ref} --variant <path-to-combined-gvcf> -o <output>.vcf
If you need invariant sites as well, run this instead:
java -jar gatk -T GenotypeGVCFs -R ${ref} --variant <path-to-combined-gvcf> --includeNonVariantSites -o <output>.vcf
It might be smart to run both, just in case you want invariant sites in the future.
Generally we hard-filter things, that is we drop entire sites if something is below our cutoffs. This is done in two steps. Below is just an example, do not use this as a guide for how to filter.
- Filter your VCF GATK first flags things you want to filter. You do this using JEXL expressions. For example, let's filter out sites with a genotype quality less than 30:
java -jar gatk -R ${ref} -T VariantFiltration --variant <path-to-vcf> -o <path-for-output> --filterExpression 'QUAL<30.0' --FilterName 'lowGQ'
- Select non-filtered sites The next step is to drop sites that are labeled. In our example this is:
java -jar gatk -T SelectVariants -R ${ref} -V <path-to-filtered-vcf> -ef -o <output>
The option -ef
hard filters these sites.
You can also filter specific types of variants, e.g. indels with the command --selectTypeToExclude INDEL
.
Depth is filtered with the JEXL expression DP
, like the QUAL
example. See GATK for more examples.
Always gzip vcfs and delete intermediate ones like the filtration steps. Thanks!
I like to set things up in automated pipelines. This includes setting variables for things like paths, and using config files. In this case, I set up a file with a sample name on every line, and wrapped my pre-processing steps in:
while read sample
do
<pre-processing>
done < $names
where $names
is a variable containing the path to my file with sample names. This loops through every sample name and runs the preprocessing steps. A full example pipeline follows:
#!/bin/bash
#install GATK: https://software.broadinstitute.org/gatk/guide/quickstart
#install Picard: https://broadinstitute.github.io/picard/ , https://github.com/broadinstitute/picard/releases/tag/2.8.1
#TK: set hard paths as variables so you don't need to touch the script below this section
prefix=/ohta/tyler.kent/Alignments/Rubella
picard=/ohta/tyler.kent/Software/picard-2.8.1/picard.jar
gatk=/ohta/tyler.kent/Software/GATK/GenomeAnalysisTK.jar
names=/ohta/tyler.kent/Alignments/Rubella/names1
ref=/ohta/tyler.kent/Storage/Capsella_rubella_v1.0_combined.fasta
#TK: loop over names in a file...more reproducible
while read sample
do
#################
#PRE-PROCESSESING
#################
#note, from GATK: "begin by mapping the sequence reads to the reference genome to produce a file in SAM/BAM format sorted by coordinate. Next, we mark duplicates to mitigate biases introduced by dat
a generation steps such as PCR amplification. Finally, we recalibrate the base quality scores, because the variant calling algorithms rely heavily on the quality scores assigned to the individual ba
se calls in each sequence read."
#Sort and index
#TK: rubella alignments are already bam...shouldn't be keeping sams on the servers
samtools sort -l 9 -O bam -T ~/tmp -o ${prefix}/${sample}.sorted.bam ${prefix}/${sample}.bam >> ~/sort1.out 2>> ~/sort2.out
samtools index -b ${prefix}/${sample}.sorted.bam ${prefix}/${sample}.sorted.bam.bai >> ~/index1.out 2>> ~/index2.err
java -Djava.io.tmpdir=~/tmp -jar $picard AddOrReplaceReadGroups \
INPUT=${prefix}/${sample}.sorted.bam \
OUTPUT=${prefix}/${sample}.sorted.rg.bam \
RGLB=DKCr \
RGPL=Illumina \
RGPU=unit1 \
RGSM=${sample} \
>> ~/RG.out 2>> ~/RG.err
samtools index -b ${prefix}/${sample}.sorted.rg.bam ${prefix}/${sample}.sorted.rg.bam.bai >> ~/index2.out 2>> ~/index2.err
java -Djava.io.tmpdir=~/tmp -jar $picard BuildBamIndex \
I=${prefix}/${sample}.sorted.rg.bam
#Splits Cigars
java -Djava.io.tmpdir=~/tmp -jar $gatk -T SplitNCigarReads -R ${ref} -I ${prefix}/${sample}.sorted.rg.bam -o ${prefix}/${sample}.sorted.split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U
ALLOW_N_CIGAR_READS >${sample}.sorted.split.out 2>${sample}.sorted.split.err
#Mark Duplicates & Index - this is the generalized command
java -Djava.io.tmpdir=~/tmp -jar $picard MarkDuplicates \
I=${prefix}/${sample}.sorted.split.bam \
O=${prefix}/${sample}.sorted.split.dedup.bam \
M=${prefix}/${sample}.metrics.txt
java -Djava.io.tmpdir=~/tmp -jar $picard BuildBamIndex \
INPUT=${prefix}/${sample}.sorted.split.dedup.bam
###################
#VARIANT DISCOVERY
###################
#Run haplotype caller with GVCF function
java -Djava.io.tmpdir=~/tmp -jar $gatk -T HaplotypeCaller -R ${ref} -I ${prefix}/${sample}.sorted.split.dedup.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -o ${prefix}/${sample}.g.vcf -ERC BP_RESOLUTION >${prefix}/${sample}_hapl.out 2>${prefix}/${sample}_hapl.err
done < $names
#up until now running commands over a loop for each file, the next step joins all files into one.
#CombineGVCFs
java -Djava.io.tmpdir=/ohta/tyler.kent/tmp -jar $gatk -T CombineGVCFs -R ${ref} --variant DKCr_182_84.15.g.vcf --variant DKCr_11_1207.g.vcf --variant DKCr_171_83.10..g.vcf --variant DKCr_189_22.13.g.vcf --variant DKCr_228_81.2.g.vcf --variant DKCr_4_762.g.vcf --variant DKCr_71_1316-5.g.vcf --variant DKCr_83_39.1.g.vcf --variant DKCr_12_1208.g.vcf --variant DKCr_172_72.12.g.vcf --variant DKCr_200_1408.g.vcf --variant DKCr_234_77.16.g.vcf --variant DKCr_52_1249-11.g.vcf --variant DKCr_73_1321-1.g.vcf --variant DKCr_8_925.g.vcf --variant DKCr_13_1209.g.vcf --variant DKCr_173_34.11.g.vcf --variant DKCr_202_RIAH.g.vcf --variant DKCr_2_697.g.vcf --variant DKCr_53_1267-15.g.vcf --variant DKCr_77_100.8.g.vcf --variant DKCr_9_984.g.vcf --variant DKCr_14.g.vcf --variant DKCr_177.g.vcf --variant DKCr_207_1411-3.g.vcf --variant DKCr_30_1574-1.g.vcf --variant DKCr_5_844.g.vcf --variant DKCr_78_79.1.g.vcf --variant DKCr_15_1311.g.vcf --variant DKCr_180_104.12.g.vcf --variant DKCr_20_774.g.vcf --variant DKCr_36_1407-8.g.vcf --variant DKCr_59_987-25.g.vcf --variant DKCr_7_907.g.vcf --variant DKCr_16_1377.g.vcf --variant DKCr_18_1453.g.vcf --variant DKCr_210_1314-10.g.vcf --variant DKCr_3_698.g.vcf --variant DKCr_63_1245-12.g.vcf --variant DKCr_79_75.2.g.vcf --variant DKCr_168_78.1.g.vcf --variant DKCr_185_6.26.g.vcf --variant DKCr_212_1319-3.g.vcf --variant DKCr_38_1504-11.g.vcf --variant DKCr_64_1354-12.g.vcf --variant DKCr_81_80TR1-TS1.g.vcf --variant DKCr_1_690.g.vcf --variant DKCr_188_23.9.g.vcf --variant DKCr_22_86IT1.g.vcf --variant DKCr_40_1575-1.g.vcf --variant DKCr_6_879.g.vcf --variant DKCr_82_TAAL-1.g.vcf -o DKrubella.g.vcf > combgvcf.out 2> combgvcf.err
#GenotypeGVCF
java -Djava.io.tmpdir=/ohta/tyler.kent/tmp -jar $gatk -T GenotypeGVCFs -R ${ref} --variant DKrubella.g.vcf --includeNonVariantSites -o DKrubella_allsites.vcf
###########
#FILTERING VARIANTS
###########
#see below for hard filtering
#filter by GQ
java -jar $gatk -R ${ref} -T VariantFiltration --variant DKrubella.vcf -o DKrubella_GQfilt.vcf --filterExpression 'QUAL<30.0' --FilterName 'lowGQ' 2> filtGQ.er
java -jar $gatk -T SelectVariants -R ${ref} -V DKrubella_allsites_GQfilt.vcf -ef -o DKrubella_allsites_excludedGQ30.vcf