Notes on GATK Variant Filtering - 18liedan/genomics_memo GitHub Wiki

This is a description created by TinyBio (April 9th, 2025)

Hard Filtering with GATK VariantFiltration

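The VariantFiltration command applies a series of filters to your VCF file based on specific quality metrics. Variants that fail a filter are not deleted: they stay in the output VCF with the name of the failed filter written to the FILTER column, while variants that fail none are marked PASS. Here's what each filter does: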

  1. QD (Quality by Depth):

    • Filter[QD < 2.0]
    • Description: This filter flags variants whose quality score is low relative to the depth of coverage. QD is the QUAL score normalized by read depth, so a low value suggests that the call's confidence is weak given how many reads cover the site.
  2. QUAL (Quality Score):

    • Filter[QUAL < 30.0]
    • Description: This filter flags variants with a low overall quality score. QUAL is the Phred-scaled confidence that a variant exists at the site, so a value of 30 corresponds to an estimated error probability of 1 in 1,000.
  3. SOR (Strand Odds Ratio):

    • Filter[SOR > 4.0]
    • Description: This filter flags variants with high strand bias, which can indicate sequencing errors. A high SOR value means the alternate allele is seen disproportionately on one strand instead of being supported roughly equally by forward and reverse reads.
  4. FS (Fisher Strand Bias):

    • Filter[FS > 60.0]
    • Description: This filter flags variants with significant strand bias. FS is the Phred-scaled p-value from Fisher's exact test on the strand counts of reference and alternate reads, so high FS values indicate potential sequencing artifacts.
  5. MQ (Mapping Quality):

    • Filter[MQ < 40.0]
    • Description: This filter flags variants with low mapping quality. MQ is the root mean square mapping quality of the reads at the site, and a low value indicates that the reads covering the variant are not confidently aligned to the reference genome.
  6. MQRankSum (Mapping Quality Rank Sum Test):

    • Filter[MQRankSum < -12.5]
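    • Description: This filter flags variants where reads supporting the alternate allele have significantly lower mapping quality than reads supporting the reference allele. The value is a rank sum test statistic: negative values mean the alternate-allele reads map worse, and values near zero mean no difference.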
  7. ReadPosRankSum (Read Position Rank Sum Test):

    • Filter[ReadPosRankSum < -8.0]
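    • Description: This filter flags variants where the position of the variant within the reads differs significantly between reads supporting the reference and alternate alleles. Negative values mean the alternate allele is found near the ends of reads more often than the reference allele, where sequencing errors are more common.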
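
The seven thresholds above can be collected in one place and turned into a single VariantFiltration call. The following is a minimal Python sketch rather than the author's actual pipeline. It assumes GATK4, where each `--filter-expression` is paired with a `--filter-name`. The file names (`ref.fasta`, `raw.vcf.gz`, `filtered.vcf.gz`) and the filter names (`QD2`, `QUAL30`, and so on) are placeholders invented for illustration. The helper `failed_filters` only mimics the threshold comparison for one record in plain Python; skipping annotations that are absent from a record is an assumption of this sketch.

```python
import operator
import shlex

# (filter name, annotation, comparison, threshold), taken from the list above.
# The filter names are placeholders chosen for this sketch.
HARD_FILTERS = [
    ("QD2", "QD", "<", 2.0),
    ("QUAL30", "QUAL", "<", 30.0),
    ("SOR4", "SOR", ">", 4.0),
    ("FS60", "FS", ">", 60.0),
    ("MQ40", "MQ", "<", 40.0),
    ("MQRankSum-12.5", "MQRankSum", "<", -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", "<", -8.0),
]

_OPS = {"<": operator.lt, ">": operator.gt}


def variant_filtration_cmd(ref, vcf_in, vcf_out):
    """Build the argument list for a GATK4 VariantFiltration call."""
    cmd = ["gatk", "VariantFiltration", "-R", ref, "-V", vcf_in, "-O", vcf_out]
    for name, key, op, threshold in HARD_FILTERS:
        # One expression/name pair per filter, e.g. "QD < 2.0" named "QD2".
        cmd += ["--filter-expression", f"{key} {op} {threshold}",
                "--filter-name", name]
    return cmd


def failed_filters(annotations):
    """Return the names of the filters that a record with these values fails.

    Annotations missing from the record are skipped (treated as passing);
    this is an assumption of the sketch.
    """
    failed = []
    for name, key, op, threshold in HARD_FILTERS:
        value = annotations.get(key)
        if value is not None and _OPS[op](value, threshold):
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Placeholder file names; the command is printed, not executed.
    cmd = variant_filtration_cmd("ref.fasta", "raw.vcf.gz", "filtered.vcf.gz")
    print(shlex.join(cmd))
    print(failed_filters({"QD": 1.5, "QUAL": 250.0, "FS": 3.2, "MQ": 60.0}))
```

Giving each threshold its own filter name means the FILTER column records which specific criterion a variant failed, which helps when tuning thresholds later.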

Soft Filtering with VCFtools

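After hard filtering, you apply additional population-level filters using VCFtools. Unlike VariantFiltration, VCFtools drops failing sites from its output instead of flagging them. Note that VCFtools does not remove sites flagged in the FILTER column unless asked to, for example with --remove-filtered-all.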

  1. Max Missing:

    • Option[--max-missing 1.0]
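    • Description: This option retains only sites where every sample has a called genotype (no missing data). The value is the minimum proportion of samples with data, from 0 (sites that are completely missing are allowed) to 1 (no missing data allowed), so 1.0 means 100% of samples must have a genotype at the site.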
  2. HWE (Hardy-Weinberg Equilibrium):

    • Option[--hwe 0.001]
    • Description: This filter removes variants that deviate significantly from Hardy-Weinberg equilibrium, which can indicate genotyping errors or population structure. Sites whose HWE exact-test p-value is below 0.001 are filtered out.
  3. MAF (Minor Allele Frequency):

    • Option[--maf 0.1]
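    • Description: This filter removes variants with a minor allele frequency below 0.1, keeping only sites where the less common allele accounts for at least 10% of the called alleles. This excludes rare variants, which are more likely to be errors and carry little information for many population-level analyses.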
  4. Recode:

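    • Option[--recode]
    • Description: This option writes a new VCF file containing only the sites that passed the filters. By default the INFO fields are dropped from the recoded file; add --recode-INFO-all to keep them.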
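
The four options above can be combined into one VCFtools call. The following Python sketch builds that command and, separately, illustrates what the missingness and MAF thresholds measure for a single biallelic site. The file names (`filtered.vcf.gz`, `pop_filtered`) are placeholders. `--gzvcf`, `--out`, `--remove-filtered-all` and `--recode-INFO-all` do not appear in the notes above and are included here as assumptions about how the step would typically be run. The helpers `site_stats` and `passes_site` are hypothetical illustrations, not VCFtools' implementation, and the Hardy-Weinberg exact test is left out.

```python
import shlex


def vcftools_cmd(vcf_in, out_prefix, max_missing=1.0, hwe=0.001, maf=0.1):
    """Build the argument list for the population-level VCFtools filters."""
    return [
        "vcftools", "--gzvcf", vcf_in,
        "--remove-filtered-all",          # drop sites flagged by VariantFiltration
        "--max-missing", str(max_missing),
        "--hwe", str(hwe),
        "--maf", str(maf),
        "--recode", "--recode-INFO-all",  # write a new VCF and keep INFO fields
        "--out", out_prefix,
    ]


def site_stats(genotypes):
    """Return (call rate, minor allele frequency) for one biallelic site.

    genotypes: strings such as "0/1", "1|1" or "./." (missing). Any genotype
    containing "." is counted as missing in this sketch.
    """
    called = [g.replace("|", "/").split("/") for g in genotypes if "." not in g]
    call_rate = len(called) / len(genotypes)
    alleles = [a for g in called for a in g]
    if not alleles:
        return call_rate, None
    alt_freq = alleles.count("1") / len(alleles)
    return call_rate, min(alt_freq, 1.0 - alt_freq)


def passes_site(genotypes, max_missing=1.0, maf=0.1):
    """Apply the missingness and MAF thresholds to one site (HWE not checked)."""
    call_rate, site_maf = site_stats(genotypes)
    return site_maf is not None and call_rate >= max_missing and site_maf >= maf


if __name__ == "__main__":
    # Placeholder file names; the command is printed, not executed.
    print(shlex.join(vcftools_cmd("filtered.vcf.gz", "pop_filtered")))
    print(site_stats(["0/0", "0/1", "1/1", "./."]))
    print(passes_site(["0/0", "0/1", "1/1", "./."]))
```

With `--max-missing 1.0`, a single missing genotype is enough to drop a site, so this threshold becomes stricter as the number of samples grows.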

Summary

  • Hard Filtering: Flags variants that fail fixed thresholds on quality metrics, so that only high-confidence variant calls are marked PASS.
  • Soft Filtering: Further refines the variant set using population-level criteria: complete genotype data across samples, consistency with Hardy-Weinberg equilibrium, and a minimum minor allele frequency.