GATK - Lavadav/EPP531_AGA GitHub Wiki

Step 6: Make a new folder for GATK and copy the .bam file in it.

mkdir 5_GATK
cp /work/pbgg8900/instructor_data/mapped_SRR7774156_sorted.bam

Step 7: Lets look at the percentage of mapped reads

samtools flagstat mapped_SRR7774156_sorted.bam > mapped_SRR7774156_sorted.stats

Step 8: Adding Read Groups

Now, that we have created a sorted BAM file, we have to add read groups to this file. There is no formal definition of what a 'read group' is, however in practice this term refers to a set of reads that are generated from a single run of a sequencing instrument.

Some programs require read groups to be present in the BAM files.

Generate header for picard tool

head -n1 /SRR7774156-trimmed-pair1.fastq | cut -f 1 -d ' '

Run Picard

ml picard/3.3.0-Java-17
java -jar $EBROOTPICARD/picard.jar \
                AddOrReplaceReadGroups \
                I=mapped_SRR7774156_sorted.bam \
                O=Final_SRR7774156_sorted_read_groups.bam \
                RGSM=SRR7774156 \
                RGLB=SRR7774156 \
                RGPL=illumina \
                [email protected]

Step 9: Index the picard output

samtools index Final_SRR7774156_sorted_read_groups.bam

Step 10: Remove and Sort Duplicates

GATK has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted.

Usage: Tool: GATK4 MarkDuplicatesSpark

Input: Aligned .bam file from BWA Step

Output: sorted and duplicate removed .bam file

gatk MarkDuplicatesSpark \
        -I Final_SRR7774156_sorted_read_groups.bam \
        -M Final_SRR7774156_metrics_dup.txt \
        -O Final_SRR7774156_sorted_dup_reads.bam

# Index your bam file
samtools index Final_SRR7774156_sorted_dup_reads.bam

Step 11: Alignment Matrices using Picard

Here, we will use the .bam file generated in above step and the Reference Genome used in BWA step to generate alignment and insert size metrices.

Usage: Tools: Picard Tools, Samtools

Input: sorted_dup_reads.bam; Reference Genome

Output: Merices Datafile and Histogram

java -jar $EBROOTPICARD/picard.jar CollectAlignmentSummaryMetrics \
        R= Tcacao_523_v2.Chrom.0.fa\
        INPUT=Final_SRR7774156_sorted_dup_reads.bam \
        OUTPUT=Final_SRR7774156_Alignment_metrics.txt

Step 12: Calling Variant

The first time we call variants to use it as an input for "Base Recalibration"

Usage:

Input: Sample_sorted_dup_reads.bam; Reference Genome

Output: Variant_raw.vcf

# Prepare Reference for GATK
gatk CreateSequenceDictionary -R Tcacao_523_v2.Chrom.0.fa
samtools faidx Tcacao_523_v2.Chrom.0.fa

# Call variant for base callibaration
gatk HaplotypeCaller \
        -R Tcacao_523_v2.Chrom.0.fa \
        -I Final_SRR7774156_sorted_dup_reads.bam \
        -O Final_SRR7774156_Variant_raw.vcf