Notes on GATK Variant Filtering - 18liedan/genomics_memo GitHub Wiki
This is a description created by TinyBio (April 9th, 2025)
Hard Filtering with GATK VariantFiltration
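The VariantFiltration command evaluates each record in your VCF file against a series of filter expressions based on specific quality metrics. Records that match an expression are not deleted; they are marked with that filter's name in the FILTER column and can be dropped in a later step. Here's what each filter does: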
- QD (Quality by Depth):
  - Filter: [QD < 2.0]
  - Description: This filter flags variants whose quality score is low relative to the depth of coverage. QD is the QUAL score normalized by the read depth of the samples carrying the variant, so a low value suggests that the call's confidence is weak given the number of reads at the site.
- QUAL (Quality Score):
  - Filter: [QUAL < 30.0]
  - Description: This filter flags variants with a low overall quality score. QUAL is Phred-scaled, so a value of 30 corresponds to a 1-in-1,000 chance that the call is wrong; a higher QUAL score indicates greater confidence in the variant call.
- SOR (Strand Odds Ratio):
  - Filter: [SOR > 4.0]
  - Description: This filter flags variants with high strand bias, which can indicate sequencing errors. A high SOR value suggests that the alternate allele is not equally supported by reads from both strands.
- FS (Fisher Strand Bias):
  - Filter: [FS > 60.0]
  - Description: This filter flags variants with significant strand bias. FS is the Phred-scaled p-value of Fisher's exact test on the strand counts of reference and alternate reads, so high FS values indicate potential sequencing artifacts.
- MQ (Mapping Quality):
  - Filter: [MQ < 40.0]
  - Description: This filter flags variants with low mapping quality. MQ is the root-mean-square mapping quality of the reads at the site, so a low value can indicate that the reads supporting the variant are not well-aligned to the reference genome.
- MQRankSum (Mapping Quality Rank Sum Test):
  - Filter: [MQRankSum < -12.5]
  - Description: This test compares the mapping qualities of reads supporting the reference allele with those supporting the alternate allele. Negative values mean the alternate-supporting reads have lower mapping quality, and the filter flags variants where that difference is extreme.
- ReadPosRankSum (Read Position Rank Sum Test):
  - Filter: [ReadPosRankSum < -8.0]
  - Description: This test compares where in the reads the reference and alternate alleles are observed. Negative values mean the alternate allele is seen nearer the ends of reads, where sequencing errors are more common, and the filter flags variants where that difference is extreme.
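The thresholds above can be sketched in Python to make the logic concrete. This is a minimal illustration, not the pipeline used for these notes: the function names, the `_filter` label suffix, and the file names are made up for the example, and the command assumes GATK4-style arguments (`-V`, `-O`, `--filter-name`, `--filter-expression`). The sketch also assumes that an annotation missing from a record is simply skipped.

```python
import operator
import shlex

# (annotation, comparison, threshold) triples taken from the list above.
HARD_FILTERS = [
    ("QD", "<", 2.0),
    ("QUAL", "<", 30.0),
    ("SOR", ">", 4.0),
    ("FS", ">", 60.0),
    ("MQ", "<", 40.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
OPS = {"<": operator.lt, ">": operator.gt}


def failed_filters(annotations):
    """Return the names of the annotations whose threshold a record trips.

    Annotations absent from the record are skipped (an assumption of this
    sketch; rank-sum values need reads supporting both alleles).
    """
    return [
        name
        for name, op, threshold in HARD_FILTERS
        if name in annotations and OPS[op](annotations[name], threshold)
    ]


def variantfiltration_cmd(input_vcf, output_vcf):
    """Build a GATK4 VariantFiltration argument list, one named filter per rule."""
    cmd = ["gatk", "VariantFiltration", "-V", input_vcf, "-O", output_vcf]
    for name, op, threshold in HARD_FILTERS:
        cmd += [
            "--filter-name", f"{name}_filter",          # hypothetical label
            "--filter-expression", f"{name} {op} {threshold}",
        ]
    return cmd


# Hypothetical record: low QD and high SOR, other metrics acceptable.
record = {"QD": 1.4, "QUAL": 85.0, "SOR": 5.2, "FS": 12.0, "MQ": 58.0}
print(failed_filters(record))  # ['QD', 'SOR']

# Hypothetical file names; prints a shell-quoted command line.
print(shlex.join(variantfiltration_cmd("raw_snps.vcf.gz", "flagged_snps.vcf.gz")))
```

Keeping the thresholds in one list means the check and the command line cannot drift apart when a cutoff is changed.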
Soft Filtering with VCFtools
After hard filtering, you can apply additional population-level filters using VCFtools, which drops failing sites from its output:
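- Max Missing:
  - Option: [--max-missing 1.0]
  - Description: This option retains only variants that are genotyped in all samples (no missing data). Despite its name, the value is the minimum fraction of samples that must have data: 1.0 means 100% of samples must have data for the variant, while 0 would allow sites that are missing in every sample.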
- HWE (Hardy-Weinberg Equilibrium):
  - Option: [--hwe 0.001]
  - Description: This filter removes variants that deviate significantly from Hardy-Weinberg equilibrium, which can indicate genotyping errors or population structure. Variants with an exact-test p-value less than 0.001 are filtered out.
- MAF (Minor Allele Frequency):
  - Option: [--maf 0.1]
  - Description: This filter removes variants with a minor allele frequency below 0.1, so only variants whose less common allele makes up at least 10% of the called alleles are retained.
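- Recode:
  - Option: --recode
  - Description: This option writes a new VCF file containing only the variants that passed the filters. By default the INFO fields are dropped from the recoded file; adding --recode-INFO-all keeps them.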
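The three site-level criteria above can be sketched in Python for a single biallelic site. This is an illustrative approximation of what the options compute, not VCFtools' own code: the function names and the toy genotypes are invented for the example, it handles only diploid biallelic calls, and the HWE p-value is computed as the summed probability of all heterozygote counts no more likely than the observed one. The last function only assembles the VCFtools call with the options listed above; its file names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def genotype_counts(genotypes):
    """Count hom-ref, het, hom-alt and missing calls at a biallelic site."""
    counts = {"hom_ref": 0, "het": 0, "hom_alt": 0, "missing": 0}
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            counts["missing"] += 1
        elif alleles == ["0", "0"]:
            counts["hom_ref"] += 1
        elif alleles == ["1", "1"]:
            counts["hom_alt"] += 1
        else:
            counts["het"] += 1
    return counts


def hwe_exact_p(hom_ref, het, hom_alt):
    """Exact HWE p-value conditional on the observed allele counts."""
    n = hom_ref + het + hom_alt
    n_a = 2 * hom_ref + het
    n_b = 2 * hom_alt + het

    def prob(h):
        # Probability of observing h heterozygotes given n, n_a and n_b.
        num = 2**h * factorial(n) * factorial(n_a) * factorial(n_b)
        den = (factorial((n_a - h) // 2) * factorial(h)
               * factorial((n_b - h) // 2) * factorial(2 * n))
        return Fraction(num, den)

    observed = prob(het)
    # Heterozygote counts must share the parity of the allele counts.
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    return float(sum(p for p in map(prob, candidates) if p <= observed))


def site_passes(genotypes, max_missing=1.0, hwe=0.001, maf=0.1):
    """Apply the call-rate, HWE and MAF cutoffs to one site."""
    c = genotype_counts(genotypes)
    called = c["hom_ref"] + c["het"] + c["hom_alt"]
    if called == 0:
        return False
    call_rate = called / len(genotypes)
    alt_freq = (2 * c["hom_alt"] + c["het"]) / (2 * called)
    minor_freq = min(alt_freq, 1 - alt_freq)
    p_hwe = hwe_exact_p(c["hom_ref"], c["het"], c["hom_alt"])
    return call_rate >= max_missing and minor_freq >= maf and p_hwe >= hwe


def vcftools_cmd(input_vcf, out_prefix):
    """VCFtools call with the options above; writes <out_prefix>.recode.vcf."""
    return ["vcftools", "--vcf", input_vcf, "--max-missing", "1.0",
            "--hwe", "0.001", "--maf", "0.1", "--recode", "--out", out_prefix]


# Toy site: 20 samples, alternate allele frequency 0.25, no missing calls.
site = ["0/0"] * 12 + ["0/1"] * 6 + ["1/1"] * 2
print(site_passes(site))                    # True
print(site_passes(site[:-1] + ["./."]))     # False: one sample is missing
```

With `--max-missing 1.0`, a single missing genotype is enough to drop a site, which is why the second call fails even though its allele frequency is almost unchanged.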
Summary
- Hard Filtering: Flags variants that fail quality-metric thresholds so that only high-confidence variant calls are carried forward.
- Soft Filtering: Further refines the variant set based on population genetics criteria (data completeness, Hardy-Weinberg equilibrium, and minor allele frequency).