GATK - Lavadav/EPP531_AGA GitHub Wiki
Step 6: Make a new folder for GATK and copy the .bam file in it.
mkdir 5_GATK
cp /work/pbgg8900/instructor_data/mapped_SRR7774156_sorted.bam
Step 7: Lets look at the percentage of mapped reads
samtools flagstat mapped_SRR7774156_sorted.bam > mapped_SRR7774156_sorted.stats
Step 8: Adding Read Groups
Now, that we have created a sorted BAM file, we have to add read groups to this file. There is no formal definition of what a 'read group' is, however in practice this term refers to a set of reads that are generated from a single run of a sequencing instrument.
Some programs require read groups to be present in the BAM files.
Generate header for picard tool
head -n1 /SRR7774156-trimmed-pair1.fastq | cut -f 1 -d ' '
Run Picard
ml picard/3.3.0-Java-17
java -jar $EBROOTPICARD/picard.jar \
AddOrReplaceReadGroups \
I=mapped_SRR7774156_sorted.bam \
O=Final_SRR7774156_sorted_read_groups.bam \
RGSM=SRR7774156 \
RGLB=SRR7774156 \
RGPL=illumina \
[email protected]
Step 9: Index the picard output
samtools index Final_SRR7774156_sorted_read_groups.bam
Step 10: Remove and Sort Duplicates
GATK has a combined tool to mark and sort duplicates called MarkDuplicateSpark tool. The sorts and marks duplicates based on query name and is different from Picard MarkDuplicate tool as it is coordinate sorted.
Usage: Tool: GATK4 MarkDuplicatesSpark
Input: Aligned .bam file from BWA Step
Output: sorted and duplicate removed .bam file
gatk MarkDuplicatesSpark \
-I Final_SRR7774156_sorted_read_groups.bam \
-M Final_SRR7774156_metrics_dup.txt \
-O Final_SRR7774156_sorted_dup_reads.bam
# Index your bam file
samtools index Final_SRR7774156_sorted_dup_reads.bam
Step 11: Alignment Matrices using Picard
Here, we will use the .bam file generated in above step and the Reference Genome used in BWA step to generate alignment and insert size metrices.
Usage: Tools: Picard Tools, Samtools
Input: sorted_dup_reads.bam; Reference Genome
Output: Merices Datafile and Histogram
java -jar $EBROOTPICARD/picard.jar CollectAlignmentSummaryMetrics \
R= Tcacao_523_v2.Chrom.0.fa\
INPUT=Final_SRR7774156_sorted_dup_reads.bam \
OUTPUT=Final_SRR7774156_Alignment_metrics.txt
Step 12: Calling Variant
The first time we call variants to use it as an input for "Base Recalibration"
Usage:
Input: Sample_sorted_dup_reads.bam; Reference Genome
Output: Variant_raw.vcf
# Prepare Reference for GATK
gatk CreateSequenceDictionary -R Tcacao_523_v2.Chrom.0.fa
samtools faidx Tcacao_523_v2.Chrom.0.fa
# Call variant for base callibaration
gatk HaplotypeCaller \
-R Tcacao_523_v2.Chrom.0.fa \
-I Final_SRR7774156_sorted_dup_reads.bam \
-O Final_SRR7774156_Variant_raw.vcf