Notes on GATK Variant Filtering - 18liedan/genomics_memo GitHub Wiki
This is a description created by TinyBio (April 9th, 2025)
Hard Filtering with GATK VariantFiltration
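The VariantFiltration command applies a series of filters to your VCF file based on specific quality metrics. It does not delete records: a variant that fails a filter stays in the output with the filter's name written to its FILTER field, while variants that fail nothing are marked PASS. Here's what each filter does: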
- QD (Quality by Depth):
  - Filter: [QD < 2.0]
  - Description: Flags variants whose quality score is low relative to the depth of coverage. QD is the QUAL score normalized by read depth, so a low value means the call's confidence is weak for the number of reads covering the site.
- QUAL (Quality Score):
  - Filter: [QUAL < 30.0]
  - Description: Flags variants with a low overall quality score. QUAL is Phred-scaled, so a higher value indicates greater confidence in the variant call; QUAL 30 corresponds to roughly a 1-in-1,000 chance that the call is wrong.
- SOR (Strand Odds Ratio):
  - Filter: [SOR > 4.0]
  - Description: Flags variants with high strand bias, which can indicate sequencing artifacts. A high SOR value means the alternate allele is supported unevenly by forward-strand and reverse-strand reads.
- FS (Fisher Strand Bias):
  - Filter: [FS > 60.0]
  - Description: Flags variants with significant strand bias as measured by Fisher's exact test. FS is the Phred-scaled p-value of that test, so high values indicate potential sequencing artifacts.
- MQ (Mapping Quality):
  - Filter: [MQ < 40.0]
  - Description: Flags variants at sites where the root-mean-square mapping quality of the reads is low, which can indicate that the reads supporting the variant are not well aligned to the reference genome.
- MQRankSum (Mapping Quality Rank Sum Test):
  - Filter: [MQRankSum < -12.5]
  - Description: Flags variants where reads supporting the alternate allele have significantly lower mapping quality than reads supporting the reference allele. The annotation is a rank-sum test Z-score, and strongly negative values point to alternate-allele reads that are poorly mapped.
- ReadPosRankSum (Read Position Rank Sum Test):
  - Filter: [ReadPosRankSum < -8.0]
  - Description: Flags variants where the alternate allele sits significantly closer to the ends of reads than the reference allele does. Read ends are more error-prone, so strongly negative values suggest an artifact rather than a real variant.
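As a rough illustration of how the thresholds above fit together, the following Python sketch does two things: it checks a single record's annotations against each threshold, and it assembles the argument list for a `gatk VariantFiltration` call using one `--filter-expression` / `--filter-name` pair per filter. The filter names (`QD2`, `SOR4`, ...), the file names, and the helper functions are hypothetical and chosen only for this example. Treating a missing annotation as "not failing" is an assumption of the sketch; check your GATK version's handling of missing values before relying on it.

```python
import shlex

# Thresholds exactly as listed above. The filter names in the first column
# are hypothetical labels invented for this sketch.
HARD_FILTERS = [
    ("QD2", "QD", "<", 2.0),
    ("QUAL30", "QUAL", "<", 30.0),
    ("SOR4", "SOR", ">", 4.0),
    ("FS60", "FS", ">", 60.0),
    ("MQ40", "MQ", "<", 40.0),
    ("MQRankSum-12.5", "MQRankSum", "<", -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", "<", -8.0),
]


def failing_filters(annotations):
    """Return the names of the filters one record would fail.

    `annotations` maps a metric name to its value (QUAL is the VCF column,
    the rest are INFO annotations). A missing metric is skipped, i.e. treated
    as not failing; this is an assumption of the sketch.
    """
    failed = []
    for name, key, op, threshold in HARD_FILTERS:
        value = annotations.get(key)
        if value is None:
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(name)
    return failed


def variant_filtration_command(in_vcf, out_vcf):
    """Build the argument list for a single `gatk VariantFiltration` call."""
    cmd = ["gatk", "VariantFiltration", "-V", in_vcf, "-O", out_vcf]
    for name, key, op, threshold in HARD_FILTERS:
        cmd += ["--filter-expression", f"{key} {op} {threshold}",
                "--filter-name", name]
    return cmd


if __name__ == "__main__":
    # Hypothetical record: low QD and high SOR, everything else acceptable.
    record = {"QD": 1.4, "QUAL": 55.0, "SOR": 4.6, "FS": 12.0, "MQ": 58.0}
    print(failing_filters(record))
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(variant_filtration_command("raw.vcf.gz", "flagged.vcf.gz")))
```

Giving each expression its own name, rather than joining them all with `||` under a single name, keeps it visible in the FILTER field which threshold a given variant failed.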
Soft Filtering with VCFtools
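After hard filtering, you apply additional population-level filters using VCFtools. Unlike VariantFiltration, VCFtools drops failing sites from its output. It ignores the FILTER field unless told otherwise, so add --remove-filtered-all if variants flagged during hard filtering should also be excluded at this step: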
- Max Missing:
  - Option: [--max-missing 1.0]
  - Description: This option sets the minimum proportion of samples that must have a called genotype at a site, on a scale from 0 (sites may be completely missing) to 1 (no missing data allowed). A value of 1.0 retains only sites genotyped in 100% of samples.
- HWE (Hardy-Weinberg Equilibrium):
  - Option: [--hwe 0.001]
  - Description: This filter removes sites that deviate significantly from Hardy-Weinberg equilibrium, which can indicate genotyping errors or population structure. Sites with a test p-value below 0.001 are filtered out.
- MAF (Minor Allele Frequency):
  - Option: [--maf 0.1]
  - Description: This filter removes sites with a minor allele frequency below 0.1, so the less common allele must account for at least 10% of the called alleles. This discards rare variants, which are the ones most easily confused with genotyping errors.
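- Recode:
  - Option: --recode
  - Description: This option writes a new VCF file containing only the sites that passed the filters above. The file is named after the --out prefix with a .recode.vcf suffix. INFO fields are dropped from the output unless --recode-INFO-all is also given.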
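The options above can be combined into a single VCFtools call. The Python sketch below builds that argument list and, to make the two simplest criteria concrete, recomputes call rate and minor allele frequency for one biallelic site from a list of diploid genotypes. The helper names, file names, and the `drop_flagged` switch are hypothetical. The per-site functions are a simplified illustration, not VCFtools' implementation, and the HWE exact test is not reproduced here.

```python
import shlex


def call_rate(genotypes):
    """Fraction of samples with a called genotype (None marks a missing call)."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF at a biallelic site from diploid genotypes such as (0, 1)."""
    alleles = [a for g in genotypes if g is not None for a in g]
    if not alleles:
        return 0.0  # no called alleles: treat the site as monomorphic
    alt_freq = sum(alleles) / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


def site_passes(genotypes, max_missing=1.0, maf=0.1):
    """Simplified stand-in for the --max-missing and --maf criteria."""
    return (call_rate(genotypes) >= max_missing
            and minor_allele_frequency(genotypes) >= maf)


def vcftools_command(in_vcf, out_prefix, max_missing=1.0, hwe=0.001,
                     maf=0.1, drop_flagged=False):
    """Build the argument list for the VCFtools step described above."""
    input_flag = "--gzvcf" if in_vcf.endswith(".gz") else "--vcf"
    cmd = ["vcftools", input_flag, in_vcf]
    if drop_flagged:
        # Not part of the original option list: also drop non-PASS sites.
        cmd.append("--remove-filtered-all")
    cmd += ["--max-missing", str(max_missing),
            "--hwe", str(hwe),
            "--maf", str(maf),
            "--recode",
            "--out", out_prefix]
    return cmd


if __name__ == "__main__":
    complete = [(0, 0), (0, 1), (1, 1), (0, 0), (0, 0)]
    with_gap = [(0, 0), (0, 1), None, (0, 0)]
    print(site_passes(complete))   # all samples called, MAF 0.3
    print(site_passes(with_gap))   # one missing genotype
    print(shlex.join(vcftools_command("flagged.vcf.gz", "filtered")))
```

With `--out filtered`, the recoded file would be written as `filtered.recode.vcf`.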
Summary
- Hard Filtering: Flags variants that fail fixed thresholds on quality metrics, so that only high-confidence variant calls are marked PASS.
- Soft Filtering: Further refines the variant set based on population genetics criteria to ensure biological relevance and data completeness.