18. Variant Calling with GATK - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

ANOTHER PIPELINE FOR CALLING VARIANTS

For the sake of completeness and to demonstrate that there are more ways to do the same thing, let’s run another pipeline using a different mapper and a different variant caller. Remember, you could just as easily use BWA and Freebayes, or Bowtie and GATK, or either combination of mapper and caller. Note that the same four steps as in the previous exercise will be taken: 1) map the reads, 2) sort the bam file, 3) call the variants, 4) filter/identify the variants.

interactive -p nocona -c 36

. ~/conda/etc/profile.d/conda.sh

conda activate varcall

module load gcc/9.2.0 samtools/1.11

mkdir -p /lustre/scratch/[eraider]/gge2022/variants/bwa

cd /lustre/scratch/[eraider]/gge2022/variants/bwa

ln -s ../../data/chr22/chr22.fa

ln -s ../../data/chr22/sample/pair.1.fq

ln -s ../../data/chr22/sample/pair.2.fq

Index the chromosome using bwa.

bwa index chr22.fa

While a little different from the way we invoked bowtie, this next line will actually do the mapping, this time skipping the production of a .sam file and going straight to the smaller .bam file.

bwa mem -M -R "@RG\tID:GGEClass\tPL:illumine\tPU:Unknown\tLB:lc_paired\tSM:GGEClass" chr22.fa -t 36 pair.1.fq pair.2.fq | samtools view -Sb -@ 36 -o bwa_chr22.bam

We need to sort as was done previously, this time using a module of GATK instead of samtools but with the same result, a sorted .bam file.

gatk SortSam --SORT_ORDER coordinate -I bwa_chr22.bam -O bwa_chr22_sort.bam -MAX_RECORDS_IN_RAM 1000000 -VALIDATION_STRINGENCY LENIENT -CREATE_INDEX TRUE -TMP_DIR tmp

One good thing about GATK is the very descriptive options flags. -I = input file; -O = output file; -CREATE_INDEX = create an index, etc.

Now, mark possible PCR duplicates. This is one place where GATK get's a little more thorough than freebayes. PCR duplicates are sequence reads that resulted from two or more copies of the same DNA fragment. At worst, they may contain erroneous base calls. At the very least, they make any variant alleles in those reads appear more often than they should in the data, thus skewing the outcome of the analysis. GATK takes the time to identify these types of reads, while freebayes does not.

gatk MarkDuplicates -I bwa_chr22_sort.bam -O bwa_chr22_marked.bam -M bwa_chr22_dupMetric.out -REMOVE_DUPLICATES TRUE -MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 1000000 -VALIDATION_STRINGENCY LENIENT -CREATE_INDEX TRUE -ASSUME_SORTED TRUE -TMP_DIR tmp

Finally, call the variants using a module of GATK, HaplotpyeCaller. However, it never fails that one program likes files to be indexed differently than another. The next two steps create the index and a sequence dictionary file. Both of these are then used in the HaplotypeCaller module.

gatk CreateSequenceDictionary -R chr22.fa

samtools faidx chr22.fa

gatk HaplotypeCaller -R chr22.fa -I bwa_chr22_marked.bam -O bwa_chr22_marked.vcf

Just like Freebayes, we can also limit ourselves to mapping only to certain regions.

gatk HaplotypeCaller -R chr22.fa -I bwa_chr22_marked.bam -O bwa_chr22_marked_small.vcf -L chr22:10536000-20806000

FOR YOU TO DO

Upload your answers/results to Assignment 16 - GATK.

Use vcffilter and vcfstats to compare the raw .vcf files from freebayes and GATK. Identify any differences in the numbers of SNPs and indels called by each.
Do the same for both sets of calls but filtered for calls that are of quality score 20 or better.
Compare the number of SNPs identified by each pipeline. What is the difference?
Use your skills with vcffilter and vcfstats to determine the source of those differences in number of SNPs identified.
Use your web browser to go to the Variant Effect Predictor. Input your bwa_chr22_marked.vcf and determine whether your data includes any nonsense or missense mutations. How many of each? What proportion of your variants were in introns?