Lab: Short read variation III Variant calling with GATK - mestato/EPP622_2024 GitHub Wiki

Calling Variants on Illumina Data with GATK

Before Calling Variants

Adding Read Groups

We need to add read group (RG) information to perform our downstream analysis. We're going to add it to our alignment files now, but RGs can also be added at the alignment stage (if you're curious, see the BWA manual and check out the -R option for bwa mem).

Read group data is metadata that gives more context to the alignments for downstream tools so they can do things like distinguish samples and detect errors.

We're going to use AddOrReplaceReadGroups via GATK, but it's a program that's part of Picard Tools that's has been bundled with GATK.

Navigate to your 3_bwa folder where you have your sorted BAM data.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	AddOrReplaceReadGroups \
	--INPUT <your_read_set>_sorted.bam \
	--OUTPUT <your_read_set>_sorted.RG.bam \
	--RGSM <your_read_set> \
	--RGLB <your_read_set> \
	--RGPL ILLUMINA \
	--RGPU <your_read_set>

Note

--INPUT is your sorted BAM file

--OUTPUT is the name of your new file that will have RG data

--RGSM is the name of the sample the input belongs to

--RGLB is the name of the library this sample belongs to.

--RGPL is the name of the platform that was used to create these reads (e.g. Illumina, PacBio, etc.)

--RGPU is (intended to be) the the platform unit that describes exactly which flowcell, lane, and sample multiplex barcode was used when the reads were created.

In our case, we are just setting the RGPU field to the name of the sample, which will be fine for our purposes. If our experiment was more complex, we would be using more specific information here.

See You can read about the arguments for this tool here. Also see this page for more information on read groups.

If you want to save space, you can delete your old sorted BAM file because your new [...].RG.bam is the file you'll work with from here on.

Samtools Index

Now that we've got a sorted BAM with read groups, we need to create an index with samtools index.

spack load [email protected]

samtools index <your_read_set>_sorted.RG.bam

The resulting index file will be named something like [...]_sorted.RG.bam.bai.

Other tools will often automatically look for this file in the same directory as the BAM when you run them. Typically, if a tool can't find this file, it will raise an error to tell you.

Example Samtools Index Loop

Note

Here's an example of what processing all of your samples with a for loop might look like. This isn't likely to be the best solution if you have a lot of data, but it would get the job done.

You don't need to run this, this is just an example

spack load [email protected]_10
spack load [email protected]

for BAM in *_sorted.bam; do
	SAMPLENAME=$(basename "$BAM" | sed 's/_sorted.bam$//')
	/pickett_shared/software/gatk-4.2.6.1/gatk AddOrReplaceReadGroups \
			--INPUT $BAM \
			--OUTPUT ${SAMPLENAME}_sorted.RG.bam \
			--RGSM $SAMPLENAME \
			--RGLB $SAMPLENAME \
			--RGPL ILLUMINA \
			--RGPU $SAMPLENAME

	samtools index ${SAMPLENAME}_sorted.RG.bam
done

Calling Variants with GATK

GATK Homepage

GATK is a large toolkit of programs for a variety of bioinformatic analyses. We're following the best practices workflow outlined by the Broad Institute here.

In addition to GATK, we'll be using some tools from Picard that are bundled with GATK.

Supporting files

Note

You do not need to run this particular step, these files are already in the reference folder

We use samtools faidx and CreateSequenceDictionary to make an index (.fai) and sequence dictionary (.dict) for our reference.

spack load [email protected]

samtools faidx /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz

/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary \
	-R /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz

Prep

Make a new folder for this step, and create symlinks to the files we're going to use.

mkdir 4_gatk
cd 4_gatk

# link the sorted RG BAM and its index
ln -s $(readlink -e ../3_bwa/*_sorted.RG.*) ./

# link some relevant reference genome files
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz . # The compressed reference file
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.gzi . # The bgzip index for the reference (you only need this file if the reference is bgzipped)
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.fai . # The genome index from samtools faidx
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.dict . # The sequence dictionary from CreateSequenceDictionary

Note

readlink resolves a relative path into a canonical path from root, so you don't have to worry if you relocate your symbolic link file, it will still point to the same file (as long as that other file doesn't move!)

See man readlink for more on how to use it.

Variant Calling

On a Single Sample

Using GATK's HaplotypeCaller, we'll call variants on a single sample. We're also going to call variants on only one chromosome for this example to same some time.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	--java-options "-Xmx4G" \
	HaplotypeCaller \
	-R solenopsis_invicta_genome.fa.gz \
	-I <your_read_set>_sorted.RG.bam \
	-O <your_read_set>_NC_052664.1.vcf \
	-bamout <your_read_set>_sorted_NC_052664.1.RG.realigned.bam \
	-L NC_052664.1

Note

-R is the reference file used to create the alignments.

-I is the input BAM you want to use when calling variants.

-O is where you want to write the output VCF. We appended a _NC_052664.1 to the name so we can tell this is only for one chromosome.

-bamout outputs the reads and assembled pseudohaplotypes that were realigned when calling variants. Again, we put NC_052664.1 in part of the name so we know it only contains data on the one chromosome. We will take a look at this file together in a little while.

-L is the interval where you would like to call variants. We provided just a chromosome name, but we could have also supplied a specific region with the format chrom:start-stop, for example `NC_052664.1:1-5000.

If you do not provide an -L argument, HaplotypeCaller will call variants on the entire genome.

If you use something like less to browse through the resulting VCF (Variant Call Format) file, you'll see a large amount of header information on the lines that start with ##. That information is relevant to downstream tools and for diagnostics, but isn't very easy to read or interesting to us right now.

Let's look at the first few lines after that header.

grep -v '^##' <your_read_set>_NC_052664.1.vcf | head -n 3

You should see something like the following:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SRR6922447
NC_052664.1     54393   .       T       C       176.97  .       AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=27.00;QD=29.50;SOR=3.912     GT:AD:DP:GQ:PL  1/1:0,6:6:18:191,18,0
NC_052664.1     59958   .       C       CACCG   526.02  .       AC=2;AF=1.00;AN=2;DP=12;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.20;QD=25.36;SOR=5.136    GT:AD:DP:GQ:PL  1/1:0,12:12:36:540,36,0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR6922447
NC_052664.1 54393 . T C 176.97 . AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=27.00;QD=29.50;SOR=3.912 GT:AD:DP:GQ:PL 1/1:0,6:6:18:191,18,0
NC_052664.1 59958 . C CACCG 526.02 . AC=2;AF=1.00;AN=2;DP=12;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.20;QD=25.36;SOR=5.136 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0

Now, this is good information on a couple of loci in this sample, but if we want to compare several samples' variants, we would have to match up the sites where they all have variants. Furthermore, we could only compare samples at sites where they all had variants unless we went back to the original data and checked those same sites over again.

We could run all of our samples at the same time and get a single VCF that describes the sites in all of the samples, but we wouldn't be able to add any more samples to that analysis afterwards, and if it were to fail midway (for example if the system crashed, or ran out of memory or storage), then we'd have to wait on an entire new run from scratch.

Especially if your study involves a large number of samples, those issues can be very troublesome. This is where the GVCF (Genomic Variant Call Format) approach comes in.

Genomic VCF

Let's generate a GVCF file with HaplotypeCaller this time instead.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	--java-options "-Xmx4G" \
	HaplotypeCaller \
	-R solenopsis_invicta_genome.fa.gz \
	-I <your_read_set>_sorted.RG.bam \
	-O <your_read_set>.g.vcf \
	-ERC GVCF

Note

-ERC GVCF (or --emit-ref-confidence GVCF) instructs the tool to output the GVCF format.

This time, we named the output [...].g.vcf so we know this is a genomic VCF.

We did not use -L this time, so we called variants for the whole genome.

-R and -I are the same as before.

The GVCF (Genomic Variant Call Format) is similar to the VCF, but it reports every site. Try taking a look at the first few entries like we did with the VCF above. In the GVCF, there are also lines that don't describe variants but instead describe stretches of the genome and the quality of the base calls in the BAM. These are used later to compare different samples to each other without having to dive back in to every sample's BAM to check if it was reference type or had missing/low quality data at that site.

The tradeoff with this approach it that we can't directly use the GVCF for analysis.

For now, please copy your GVCF file into a shared folder so we can use them all to create a single VCF that contains the variants for all of the samples.

cp <your_read_set>.g.vcf /pickett_sphinx/teaching/EPP622_2024/analyses/GVCFs/

Note

If you have a study that involves a very large number of samples (hundreds to thousands), look into GenomicsDBImport, which can more efficiently store and handle samples compared to the rest of this workflow.

Calling Variants with GVCFs and Multiple Samples

Now that we've got several samples' GVCFs, we can call variants on the group of them.

CombineGVCFs

Before we can call variants, we need to gather our GVCFs together into one file using GATK CombineGVCFs.

First, let's create a new folder for this step and link our GVCF files.

mkdir 5_gatk_combine
cd 5_gatk_combine

# symlink our reference genome and required other supporting files
ln -s $(readlink -e ../4_gatk/solenopsis_invicta_genome.*) .

# create symlinks to the GVCFs
ln -s /pickett_sphinx/teaching/EPP622_2024/analyses/GVCFs/*.g.vcf .

Now we need to tell GATK which samples to use when calling variants on the dataset. There are at least two ways to do this that make sense.

Method 1: Script to Build a Script

One way to do that is to supply every GVCF as a single argument. We can make a script that can build a script with those arguments automatically with a for loop like this:

echo '/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs -R solenopsis_invicta_genome.fa.gz \'

for f in *.g.vcf; do
	echo "--variant $f \\"
done

echo "-O solenopsis_combined.g.vcf"

If you create that script and run it, you will get an output like this:

/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs -R solenopsis_invicta_genome.fa.gz \
--variant <sample1>.g.vcf \
--variant <sample2>.g.vcf \
--variant <sampleEtc>.g.vcf \
-O solenopsis_combined.g.vcf

Every line with --variant lists one of the GVCF files. If you that text into a script and run it, then CombineGVCFs will run with all of your samples.

Method 2: Sample List

Another way to provide all of the files to GATK is by creating a file that lists all of them, then giving that file to CombineGVCFs.

# create the sample list
ls *.g.vcf > sample.list

# run CombineGVCFs
/pickett_shared/software/gatk-4.2.6.1/gatk \
	CombineGVCFs \
	-R solenopsis_invicta_genome.fa.gz \
	--variant sample.list \
	-O solenopsis_combined.g.vcf

Choose one of the above methods, and run it to generate your combined GVCF.

GenotypeGVCFs

Now we'll calculate the genotypes from the combined GVCF file using GATK GenotypeGVCFs.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	GenotypeGVCFs \
	-R solenopsis_invicta_genome.fa.gz  \
	-V solenopsis_combined.g.vcf \
	-O solenopsis_combined.vcf

Let's use bcftools to see how many SNPs we got. Use bcftools stats to create a summary of the VCF, then view it with something like less. Look for a line that says "number of SNPs:"

spack load bcftools

bcftools stats solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt

less solenopsis_combined.vcf.stats.txt

Now we have a whole load of SNPs, however, this VCF probably contains a lot of variants that may not be real, so we need to filter the variants so we know which sites are more likely to be real.

For filtering variants, Broad Institute recommends using their variant quality score recalibration (VQSR) technique when possible. However, that would require a large set of high-quality previously known variants, which we don't have. The next best thing is "hard filtering" our data.

Hard Filtering

We'll use GATK VariantFiltration to mark sites that we think aren't real.

Make a new folder and link your reference and combined VCF files from the last step.

mkdir 6_filter
cd 6_filter

ln -s $(readlink -e ../5_gatk_combine/solenopsis_invicta_genome*) .
ln -s $(readlink -e ../5_gatk_combine/solenopsis_combined.vcf*) . # this will link your combined VCF and the matching stats file

Basic filters

First, let's filter with a very basic set of parameters. We'll require that our sites are over 20.0 quality, and have more than 5x depth. We'll compare the results to another set of filters later.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	VariantFiltration \
	-R solenopsis_invicta_genome.fa.gz  \
	-V solenopsis_combined.vcf \
	-O solenopsis_combined.Basicfilters.vcf \
	--filter-name "basic_filter" --filter-expression "QUAL > 20.0 && DP > 5"

Recommended filters

Now, let's use the suggested filters from Broad Institute. More detail on these is available here as well.

Additionally, they recommend to that we filter the SNPs and INDELs separately, but we're just going to filter our data with the recommended SNP parameters to save some time.

/pickett_shared/software/gatk-4.2.6.1/gatk \
	VariantFiltration \
	-R solenopsis_invicta_genome.fa.gz  \
	-V solenopsis_combined.vcf \
	-O solenopsis_combined.GATKfilters.vcf \
	--filter-name "QD_filter" --filter-expression "QD < 2.0" \
	--filter-name "FS_filter" --filter-expression "FS > 60.0" \
	--filter-name "MQ_filter" --filter-expression "MQ < 40.0" \
	--filter-name "SOR_filter" --filter-expression "SOR > 4.0" \
	--filter-name "MQRankSum" --filter-expression "MQRankSum < -12.5" \
	--filter-name "ReadPosRankSum" --filter-expression "ReadPosRankSum < -8.0"

Comparing the Filtered VCFs

VariantFiltration only marks the sites if they don't pass, it doesn't remove them, so we need to provide an extra argument to bcftools stats to look at only those sites that passed our filtered.

spack load bcftools

bcftools stats -f "PASS,." ../5_gatk_combine/solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt

bcftools stats -f "PASS,." solenopsis_combined.Basicfilters.vcf > solenopsis_combined.Basicfilters.vcf.stats.txt

bcftools stats -f "PASS,." solenopsis_combined.GATKfilters.vcf > solenopsis_combined.GATKfilters.vcf.stats.txt

grep 'number of SNPs:' *stats.txt

You can see that with the filters from Broad Institute we kept many more SNPs than with our simple filter, and we can be relatively confident that most of them are high enough quality for meaningful analysis.

Visualizing Variants with IGV

We're going to look at some of the variants that HaplotypeCaller called using Integrative Genome Viewer (IGV).

Online IGV doesn't support b/gzipped files, so I recommend linking the already decompressed reference file and it's index to your directory:

ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa .
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.fai .

Use scp or your favorite method to copy the following files to your computer:

  • solenopsis_invicta_genome.fa
  • solenopsis_invicta_genome.fa.fai
  • <your_read_set>_sorted.RG.bam
  • <your_read_set>_sorted.RG.bam.bai
  • <your_read_set>_sorted.RG.realigned.bam
  • <your_read_set>_sorted.RG.realigned.bai
  • <your_read_set>_NC_052664.1.vcf

Tarball Compression

I recommend putting them all into one archive so you can move them easily. Then you can use readlink to get the path to your archive to copy it easily.

tar -hczf igv_files.tar.gz solenopsis_invicta_genome.* SRR6922447_sorted* SRR6922447_NC_052664.1.vcf
readlink -e ./igv_files.tar.gz

Note

If you are using Windows, you will need 7Zip or some other way to extract your igv_files.tar.gz file.

If you are using Mac, you can extract the archive from the command line with tar -xvf igv_files.tar.gz, or you can simply navigate to the file and double click on it to open it.

If you can't figure out how to extract your files, I recommend falling back on transferring them one-by-one instead.

The transfer will take a few minutes or so. Once it's done, extract your igv_files.tar.gz, and you'll have all of your files.

Online Viewer

Now, navigate to the Online IGV app.

Click Genome and select Local File.... Navigate to your reference and index files ([...].fa,[...].fa.fai), select both of them (e.g. by ctrl- or shift-clicking), and confirm the selection. You should see the S. invicta chromosome names from the reference appear near the top of the page.

Next, click Track and select the rest of your files at the same time. You should see horizontal tracks appear underneath the chromosome names.

⚠️ **GitHub.com Fallback** ⚠️