Lab: Short read variation III: Variant calling with GATK
We need to add read group (RG) information to perform our downstream analysis. We're going to add it to our alignment files now, but RGs can also be added at the alignment stage (if you're curious, see the BWA manual and check out the `-R` option for `bwa mem`).
Read group data is metadata that gives downstream tools more context about the alignments, so they can do things like distinguish samples and detect errors.
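For context, here's a sketch of what adding read groups at alignment time might look like with `bwa mem -R` (the FASTQ names are placeholders, and the tab-separated `@RG` fields match the ones we set with GATK below; you don't need to run this):

# hypothetical sketch: attach read group info during alignment instead of afterwards
# bwa interprets the \t separators inside the quoted @RG string
bwa mem \
    -R '@RG\tID:<your_read_set>\tSM:<your_read_set>\tLB:<your_read_set>\tPL:ILLUMINA\tPU:<your_read_set>' \
    solenopsis_invicta_genome.fa.gz \
    <your_read_set>_1.fastq <your_read_set>_2.fastq \
    > <your_read_set>.sam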
We're going to use AddOrReplaceReadGroups via GATK; it's a program from Picard Tools that has been bundled with GATK.
Navigate to your `3_bwa` folder where you have your sorted BAM data.
/pickett_shared/software/gatk-4.2.6.1/gatk \
AddOrReplaceReadGroups \
--INPUT <your_read_set>_sorted.bam \
--OUTPUT <your_read_set>_sorted.RG.bam \
--RGSM <your_read_set> \
--RGLB <your_read_set> \
--RGPL ILLUMINA \
--RGPU <your_read_set>
Note

- `--INPUT` is your sorted BAM file.
- `--OUTPUT` is the name of your new file that will have RG data.
- `--RGSM` is the name of the sample the input belongs to.
- `--RGLB` is the name of the library this sample belongs to.
- `--RGPL` is the name of the platform that was used to create these reads (e.g. Illumina, PacBio, etc.).
- `--RGPU` is (intended to be) the platform unit that describes exactly which flowcell, lane, and sample multiplex barcode was used when the reads were created.

In our case, we are just setting the `RGPU` field to the name of the sample, which will be fine for our purposes. If our experiment were more complex, we would use more specific information here.
You can read about the arguments for this tool here. Also see this page for more information on read groups.
If you want to save space, you can delete your old sorted BAM file, because your new `[...].RG.bam` is the file you'll work with from here on.
Now that we've got a sorted BAM with read groups, we need to create an index with `samtools index`.
spack load [email protected]
samtools index <your_read_set>_sorted.RG.bam
The resulting index file will be named something like `[...]_sorted.RG.bam.bai`.
Other tools will often automatically look for this file in the same directory as the BAM when you run them. Typically, if a tool can't find this file, it will raise an error to tell you.
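Since you just loaded `samtools`, you can also confirm that the read group made it into the BAM header:

samtools view -H <your_read_set>_sorted.RG.bam | grep '^@RG'

You should see a single `@RG` line carrying the SM, LB, PL, and PU values you supplied (plus an ID, which `AddOrReplaceReadGroups` sets to 1 by default).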
Note
Here's an example of what processing all of your samples with a for loop might look like. This isn't likely to be the best solution if you have a lot of data, but it would get the job done. You don't need to run this; it's just an example.
spack load [email protected]_10
spack load [email protected]

for BAM in *_sorted.bam; do
    # strip the _sorted.bam suffix to recover the sample name
    SAMPLENAME=$(basename "$BAM" _sorted.bam)
    /pickett_shared/software/gatk-4.2.6.1/gatk AddOrReplaceReadGroups \
        --INPUT "$BAM" \
        --OUTPUT "${SAMPLENAME}_sorted.RG.bam" \
        --RGSM "$SAMPLENAME" \
        --RGLB "$SAMPLENAME" \
        --RGPL ILLUMINA \
        --RGPU "$SAMPLENAME"
    samtools index "${SAMPLENAME}_sorted.RG.bam"
done
GATK is a large toolkit of programs for a variety of bioinformatic analyses. We're following the best practices workflow outlined by the Broad Institute here.
In addition to GATK, we'll be using some tools from Picard that are bundled with GATK.
Note
You do not need to run this particular step; these files are already in the reference folder.
We use `samtools faidx` and `CreateSequenceDictionary` to make an index (`.fai`) and a sequence dictionary (`.dict`) for our reference.
spack load [email protected]
samtools faidx /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz
/pickett_shared/software/gatk-4.2.6.1/gatk CreateSequenceDictionary \
-R /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz
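If you're curious, the `.dict` file is just a SAM-style header with one `@SQ` line per sequence in the reference. Its first lines look something like this (the values here are placeholders, not your actual output):

@HD	VN:1.6
@SQ	SN:<sequence_name>	LN:<sequence_length>	M5:<md5_checksum>	UR:<path_to_reference>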
Make a new folder for this step, and create symlinks to the files we're going to use.
mkdir 4_gatk
cd 4_gatk
# link the sorted RG BAM and its index
ln -s $(readlink -e ../3_bwa/*_sorted.RG.*) ./
# link some relevant reference genome files
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz . # The compressed reference file
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.gzi . # The bgzip index for the reference (you only need this file if the reference is bgzipped)
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.fai . # The genome index from samtools faidx
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.dict . # The sequence dictionary from CreateSequenceDictionary
Note
`readlink -e` resolves a relative path into a canonical path from the filesystem root, so you don't have to worry if you relocate your symbolic link; it will still point to the same file (as long as the target itself doesn't move!).
See `man readlink` for more on how to use it.
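For example (the resolved path here is hypothetical, just to show the shape of the output):

readlink -e ../3_bwa/<your_read_set>_sorted.RG.bam
# prints an absolute path like /path/to/your/analyses/3_bwa/<your_read_set>_sorted.RG.bam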
Using GATK's HaplotypeCaller, we'll call variants on a single sample. We're also going to call variants on only one chromosome in this example to save some time.
/pickett_shared/software/gatk-4.2.6.1/gatk \
--java-options "-Xmx4G" \
HaplotypeCaller \
-R solenopsis_invicta_genome.fa.gz \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>_NC_052664.1.vcf \
-bamout <your_read_set>_sorted_NC_052664.1.RG.realigned.bam \
-L NC_052664.1
Note

- `-R` is the reference file used to create the alignments.
- `-I` is the input BAM you want to use when calling variants.
- `-O` is where you want to write the output VCF. We appended `_NC_052664.1` to the name so we can tell this is only for one chromosome.
- `-bamout` outputs the reads and assembled pseudohaplotypes that were realigned when calling variants. Again, we put `NC_052664.1` in the name so we know it only contains data on the one chromosome. We will take a look at this file together in a little while.
- `-L` is the interval where you would like to call variants. We provided just a chromosome name, but we could have also supplied a specific region in the format `chrom:start-stop`, for example `NC_052664.1:1-5000`.

If you do not provide an `-L` argument, `HaplotypeCaller` will call variants on the entire genome.
If you use something like `less` to browse through the resulting VCF (Variant Call Format) file, you'll see a large amount of header information on the lines that start with `##`. That information is relevant to downstream tools and for diagnostics, but isn't very easy to read or interesting to us right now.
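For reference, the first few `##` lines typically look something like this (a sketch; the exact contents vary with the tool versions and reference used):

##fileformat=VCFv4.2
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##contig=<ID=NC_052664.1,length=<length>>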
Let's look at the first few lines after that header.
grep -v '^##' <your_read_set>_NC_052664.1.vcf | head -n 3
You should see something like the following:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR6922447
NC_052664.1 54393 . T C 176.97 . AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=27.00;QD=29.50;SOR=3.912 GT:AD:DP:GQ:PL 1/1:0,6:6:18:191,18,0
NC_052664.1 59958 . C CACCG 526.02 . AC=2;AF=1.00;AN=2;DP=12;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.20;QD=25.36;SOR=5.136 GT:AD:DP:GQ:PL 1/1:0,12:12:36:540,36,0
The same records, laid out as a table:

| #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | SRR6922447 |
|---|---|---|---|---|---|---|---|---|---|
| NC_052664.1 | 54393 | . | T | C | 176.97 | . | AC=2;AF=1.00;AN=2;DP=6;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=27.00;QD=29.50;SOR=3.912 | GT:AD:DP:GQ:PL | 1/1:0,6:6:18:191,18,0 |
| NC_052664.1 | 59958 | . | C | CACCG | 526.02 | . | AC=2;AF=1.00;AN=2;DP=12;ExcessHet=0.0000;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.20;QD=25.36;SOR=5.136 | GT:AD:DP:GQ:PL | 1/1:0,12:12:36:540,36,0 |
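To help decode the last two columns of the first record: the FORMAT string names the per-sample fields, and the sample column supplies the matching values.

GT  1/1        genotype: homozygous for the ALT allele
AD  0,6        allele depths: 0 reads support REF, 6 support ALT
DP  6          total read depth at this site
GQ  18         Phred-scaled genotype quality
PL  191,18,0   Phred-scaled likelihoods for the 0/0, 0/1, and 1/1 genotypes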
Now, this is good information on a couple of loci in this sample, but if we want to compare several samples' variants, we would have to match up the sites where they all have variants. Furthermore, we could only compare samples at sites where they all had variants unless we went back to the original data and checked those same sites over again.
We could run all of our samples at the same time and get a single VCF that describes the sites in all of the samples, but we wouldn't be able to add any more samples to that analysis afterwards, and if it were to fail midway (for example if the system crashed, or ran out of memory or storage), then we'd have to wait on an entire new run from scratch.
Especially if your study involves a large number of samples, those issues can be very troublesome. This is where the GVCF (Genomic Variant Call Format) approach comes in.
Let's generate a GVCF file with `HaplotypeCaller` this time instead.
/pickett_shared/software/gatk-4.2.6.1/gatk \
--java-options "-Xmx4G" \
HaplotypeCaller \
-R solenopsis_invicta_genome.fa.gz \
-I <your_read_set>_sorted.RG.bam \
-O <your_read_set>.g.vcf \
-ERC GVCF
Note

- `-ERC GVCF` (or `--emit-ref-confidence GVCF`) instructs the tool to output the GVCF format.
- This time, we named the output `[...].g.vcf` so we know this is a genomic VCF.
- We did not use `-L` this time, so we called variants on the entire genome.
- `-R` and `-I` are the same as before.
The GVCF (Genomic Variant Call Format) is similar to the VCF, but it reports every site. Try taking a look at the first few entries like we did with the VCF above. In the GVCF, there are also lines that don't describe variants but instead describe stretches of the genome and the quality of the base calls in the BAM. These are used later to compare different samples to each other without having to dive back into every sample's BAM to check whether it was reference type or had missing/low-quality data at a site.
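For example, a reference-block record looks something like this (an illustrative line, not taken from your file). The `<NON_REF>` ALT allele and the `END=` tag mark a stretch of non-variant sites with similar confidence:

NC_052664.1	54001	.	A	<NON_REF>	.	.	END=54392	GT:DP:GQ:MIN_DP:PL	0/0:5:12:4:0,12,135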
The tradeoff with this approach is that we can't directly use the GVCF for analysis.
For now, please copy your GVCF file into a shared folder so we can use them all to create a single VCF that contains the variants for all of the samples.
cp <your_read_set>.g.vcf /pickett_sphinx/teaching/EPP622_2024/analyses/GVCFs/
Note
If you have a study that involves a very large number of samples (hundreds to thousands), look into GenomicsDBImport, which can store and handle samples more efficiently than the approach in the rest of this workflow.
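For reference, a minimal sketch of that alternative might look like this (sample names and the interval are placeholders; we won't run this in the lab):

/pickett_shared/software/gatk-4.2.6.1/gatk GenomicsDBImport \
    -V <sample1>.g.vcf \
    -V <sample2>.g.vcf \
    --genomicsdb-workspace-path my_database \
    -L NC_052664.1

`GenotypeGVCFs` can then read the workspace directly with `-V gendb://my_database` instead of a combined GVCF.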
Now that we've got several samples' GVCFs, we can call variants on the group of them.
Before we can call variants, we need to gather our GVCFs together into one file using GATK CombineGVCFs.
First, let's create a new folder for this step and link our GVCF files.
mkdir 5_gatk_combine
cd 5_gatk_combine
# symlink our reference genome and other required supporting files
ln -s $(readlink -e ../4_gatk/solenopsis_invicta_genome.*) .
# create symlinks to the GVCFs
ln -s /pickett_sphinx/teaching/EPP622_2024/analyses/GVCFs/*.g.vcf .
Now we need to tell GATK which samples to use when calling variants on the dataset. There are at least two sensible ways to do this.
One way is to supply each GVCF via its own `--variant` argument. We can write a short script that builds that command for us with a for loop like this:
echo '/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs -R solenopsis_invicta_genome.fa.gz \'
for f in *.g.vcf; do
echo "--variant $f \\"
done
echo "-O solenopsis_combined.g.vcf"
If you create that script and run it, you will get an output like this:
/pickett_shared/software/gatk-4.2.6.1/gatk CombineGVCFs -R solenopsis_invicta_genome.fa.gz \
--variant <sample1>.g.vcf \
--variant <sample2>.g.vcf \
--variant <sampleEtc>.g.vcf \
-O solenopsis_combined.g.vcf
Every line with `--variant` lists one of the GVCF files. If you put that text into a script and run it, `CombineGVCFs` will run with all of your samples.
Another way to provide all of the files to GATK is by creating a file that lists all of them, then giving that file to `CombineGVCFs`.
# create the sample list
ls *.g.vcf > sample.list
# run CombineGVCFs
/pickett_shared/software/gatk-4.2.6.1/gatk \
CombineGVCFs \
-R solenopsis_invicta_genome.fa.gz \
--variant sample.list \
-O solenopsis_combined.g.vcf
Choose one of the above methods, and run it to generate your combined GVCF.
Now we'll calculate the genotypes from the combined GVCF file using GATK GenotypeGVCFs.
/pickett_shared/software/gatk-4.2.6.1/gatk \
GenotypeGVCFs \
-R solenopsis_invicta_genome.fa.gz \
-V solenopsis_combined.g.vcf \
-O solenopsis_combined.vcf
Let's use `bcftools` to see how many SNPs we got. Use `bcftools stats` to create a summary of the VCF, then view it with something like `less`. Look for a line that says "number of SNPs:".
spack load bcftools
bcftools stats solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt
less solenopsis_combined.vcf.stats.txt
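The counts you want live on the summary lines that start with `SN`; the relevant one looks like this (the count itself is a placeholder):

SN	0	number of SNPs:	<count>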
Now we have a whole load of SNPs. However, this VCF probably contains a lot of variants that may not be real, so we need to filter them to learn which sites are more likely to be genuine.
For filtering variants, the Broad Institute recommends using their variant quality score recalibration (VQSR) technique when possible. However, that would require a large set of high-quality, previously known variants, which we don't have. The next best thing is "hard filtering" our data.
We'll use GATK VariantFiltration to mark sites that we think aren't real.
Make a new folder and link your reference and combined VCF files from the last step.
mkdir 6_filter
cd 6_filter
ln -s $(readlink -e ../5_gatk_combine/solenopsis_invicta_genome*) .
ln -s $(readlink -e ../5_gatk_combine/solenopsis_combined.vcf*) . # this will link your combined VCF and the matching stats file
First, let's filter with a very basic set of parameters. We'll require that sites have a quality of at least 20.0 and at least 5x depth. We'll compare the results to another set of filters later.
/pickett_shared/software/gatk-4.2.6.1/gatk \
VariantFiltration \
-R solenopsis_invicta_genome.fa.gz \
-V solenopsis_combined.vcf \
-O solenopsis_combined.Basicfilters.vcf \
--filter-name "basic_filter" --filter-expression "QUAL < 20.0 || DP < 5"
Now, let's use the suggested filters from the Broad Institute. More detail on these is available here as well.
Additionally, they recommend filtering the SNPs and INDELs separately, but we're just going to filter our data with the recommended SNP parameters to save some time.
/pickett_shared/software/gatk-4.2.6.1/gatk \
VariantFiltration \
-R solenopsis_invicta_genome.fa.gz \
-V solenopsis_combined.vcf \
-O solenopsis_combined.GATKfilters.vcf \
--filter-name "QD_filter" --filter-expression "QD < 2.0" \
--filter-name "FS_filter" --filter-expression "FS > 60.0" \
--filter-name "MQ_filter" --filter-expression "MQ < 40.0" \
--filter-name "SOR_filter" --filter-expression "SOR > 4.0" \
--filter-name "MQRankSum" --filter-expression "MQRankSum < -12.5" \
--filter-name "ReadPosRankSum" --filter-expression "ReadPosRankSum < -8.0"
`VariantFiltration` only marks sites; it doesn't remove them. A site matching a filter expression gets that filter's name in its FILTER column (so each expression describes a failing condition), while sites matching none of the expressions are marked PASS. That means we need to provide an extra argument to `bcftools stats` to look at only those sites that passed our filters.
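For example, records in the filtered VCF carry either PASS or the name(s) of the failed filter(s) in their FILTER column (these two records are illustrative, not from your file):

NC_052664.1	54393	.	T	C	176.97	PASS	...
NC_052664.1	60211	.	G	A	45.10	QD_filter;FS_filter	...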
spack load bcftools
bcftools stats -f "PASS,." ../5_gatk_combine/solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.Basicfilters.vcf > solenopsis_combined.Basicfilters.vcf.stats.txt
bcftools stats -f "PASS,." solenopsis_combined.GATKfilters.vcf > solenopsis_combined.GATKfilters.vcf.stats.txt
grep 'number of SNPs:' *stats.txt
Compare the counts from the two filter sets. The Broad Institute's filters target specific statistical red flags rather than a blanket quality/depth cutoff, so the two schemes keep different sets of SNPs, and with the recommended filters we can be relatively confident that most of the SNPs that pass are high enough quality for meaningful analysis.
We're going to look at some of the variants that `HaplotypeCaller` called using the Integrative Genomics Viewer (IGV).
The online IGV doesn't support b/gzipped files, so I recommend linking the already-decompressed reference file and its index into your directory:
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa .
ln -s /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.fai .
Use `scp` or your favorite method to copy the following files to your computer:
- solenopsis_invicta_genome.fa
- solenopsis_invicta_genome.fa.fai
- <your_read_set>_sorted.RG.bam
- <your_read_set>_sorted.RG.bam.bai
- <your_read_set>_sorted_NC_052664.1.RG.realigned.bam
- <your_read_set>_sorted_NC_052664.1.RG.realigned.bai
- <your_read_set>_NC_052664.1.vcf
I recommend putting them all into one archive so you can move them easily. Then you can use `readlink` to get the path to your archive so you can copy it easily.
tar -hczf igv_files.tar.gz solenopsis_invicta_genome.* <your_read_set>_sorted* <your_read_set>_NC_052664.1.vcf
readlink -e ./igv_files.tar.gz
Note
If you are using Windows, you will need 7-Zip or some other way to extract your `igv_files.tar.gz` file.
If you are using a Mac, you can extract the archive from the command line with `tar -xvf igv_files.tar.gz`, or you can simply navigate to the file and double-click it to open it.
If you can't figure out how to extract your files, I recommend falling back on transferring them one-by-one instead.
The transfer will take a few minutes or so. Once it's done, extract your `igv_files.tar.gz`, and you'll have all of your files.
Now, navigate to the Online IGV app.
Click `Genome` and select `Local File...`. Navigate to your reference and index files (`[...].fa`, `[...].fa.fai`), select both of them (e.g. by `ctrl`- or `shift`-clicking), and confirm the selection. You should see the S. invicta chromosome names from the reference appear near the top of the page.
Next, click `Track` and select the rest of your files at the same time. You should see horizontal tracks appear underneath the chromosome names.