
# Variant Filtering Strategies

Once you have your raw VCF of variant calls, the next step is to filter for high-confidence sites. We’ll cover three common approaches.

## 1. Hard-Filtering by Quality Metrics

Apply simple “if-then” criteria on QUAL, depth (DP), mapping quality (MQ), etc.

```bash
# Keep only variants with QUAL >= 30, DP >= 10 and MQ >= 40
# (-i *includes* sites matching the expression; everything else is dropped)
bcftools filter \
  -i 'QUAL>=30 && DP>=10 && MQ>=40' \
  raw_snps_indels.vcf.gz \
  -Oz -o hard_filtered.vcf.gz

# index the filtered VCF
bcftools index hard_filtered.vcf.gz
```

- `QUAL`: Phred-scaled probability that the variant is wrong  
- `DP`: Total read depth at the site  
- `MQ`: Average mapping quality of supporting reads  
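Because `QUAL` is Phred-scaled, any threshold translates directly into an error probability via 10^(-QUAL/10). A quick check with awk:

```bash
# Phred scale: error probability = 10^(-QUAL/10)
# QUAL 30 therefore means roughly a 1-in-1000 chance the call is wrong:
awk 'BEGIN { printf "%.4f\n", 10^(-30/10) }'
# -> 0.0010
```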

> ⚠️ Choose thresholds based on your data (coverage, read technology, species).
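To see the thresholds in action, here is a self-contained sketch that emulates the same QUAL/DP/MQ logic with awk on a three-line toy VCF. (For real files, stick with bcftools, which parses INFO fields robustly; columns are space-separated here for readability, whereas real VCFs are tab-separated.)

```bash
cat > demo.vcf << 'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 100 . A G 45 . DP=20;MQ=60
chr1 200 . C T 12 . DP=30;MQ=60
chr1 300 . G A 99 . DP=5;MQ=58
EOF

# Emulate: keep sites with QUAL >= 30 && DP >= 10 && MQ >= 40
awk '/^#/ { print; next }
     {
       qual = $6 + 0; dp = 0; mq = 0
       n = split($8, kv, ";")
       for (i = 1; i <= n; i++) {
         if (kv[i] ~ /^DP=/) dp = substr(kv[i], 4) + 0
         if (kv[i] ~ /^MQ=/) mq = substr(kv[i], 4) + 0
       }
       if (qual >= 30 && dp >= 10 && mq >= 40) print
     }' demo.vcf > demo_filtered.vcf

grep -c -v '^#' demo_filtered.vcf
# -> 1  (only chr1:100 passes all three thresholds)
```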

## 2. Variant Quality Score Recalibration (VQSR) with GATK

A machine-learning–based approach that builds a model from known/training sites.


Before VQSR can run, GATK needs the reference FASTA indexed and a sequence dictionary built. With `ref_genome.fa` in the current directory and your conda env activated (the one where `gatk` and `samtools` live), run:
```bash
# 1) Index the FASTA for samtools/GATK
samtools faidx ref_genome.fa

# 2) Create a GATK-style sequence dictionary
gatk CreateSequenceDictionary \
  -R ref_genome.fa \
  -O ref_genome.dict
```

You should then see the following files:

```
ref_genome.fa
ref_genome.fa.fai
ref_genome.dict
```
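With the reference prepared, VQSR itself is a two-step process: `VariantRecalibrator` trains a model on annotations drawn from trusted sites, then `ApplyVQSR` filters to a chosen truth-sensitivity tranche. Below is a sketch of the two GATK commands; `known_sites.vcf.gz` is a placeholder for a real training resource, the `-an` annotation list should be tuned to your data, and note that VQSR needs many thousands of variants to fit its model, so it is unsuitable for tiny datasets.

```shell
# 1) Train the recalibration model (SNP mode shown; repeat with -mode INDEL)
#    known_sites.vcf.gz is a placeholder -- substitute a real resource VCF
gatk VariantRecalibrator \
  -R ref_genome.fa \
  -V raw_snps_indels.vcf.gz \
  --resource:known,known=true,training=true,truth=true,prior=10.0 known_sites.vcf.gz \
  -an QD -an MQ -an FS -an SOR \
  -mode SNP \
  -O snps.recal \
  --tranches-file snps.tranches

# 2) Apply the model, flagging variants below the 99.0% sensitivity tranche
gatk ApplyVQSR \
  -R ref_genome.fa \
  -V raw_snps_indels.vcf.gz \
  --recal-file snps.recal \
  --tranches-file snps.tranches \
  --truth-sensitivity-filter-level 99.0 \
  -mode SNP \
  -O vqsr_filtered.vcf.gz
```

Failing variants are not removed; they get a tranche label in the FILTER column, so you can still inspect them or relax the cutoff later.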

## 3. Functional-Impact Filtering with snpEff

### A) Using a pre-built snpEff database

Install snpEff (via Bioconda if you're using conda/miniconda):

```bash
conda install -c bioconda snpeff
```

List the available databases to find the exact name:

```bash
snpEff databases | grep -i escherichia
```

If a GCA-tagged database fails to download, it is because SnpEff doesn't actually host that archive on its blob server. The simplest fix is to use the generic K-12 MG1655 database (no "gca…" suffix), which is available. Verify which MG1655 names SnpEff knows about:

```bash
snpEff databases | grep -i mg1655
```

All database downloads come from the official snpEff.blob.core.windows.net site. If you prefer a system-wide install over conda, apt works too:

```bash
sudo apt update
sudo apt install snpeff
```

If the database downloads fail entirely, the fallback below builds a tiny local database instead.

### B) Building a local "toy" database

If you can't pull any of the pre-packaged databases from the SnpEff blob (e.g. every download returns HTTP 409), the fastest way to get a working demo is to build your own tiny "toy" database, right in your project folder. There's no need to point at NCBI's full E. coli GFF; a one-gene GFF3 file is enough to teach snpEff about it.

1. Make sure your toy database folder exists:

```bash
mkdir -p snpEff_data/toy
```

2. Create `toy.gff3` with a minimal annotation. Copy and paste the entire block below, including the closing `EOF` line; after pasting it, press Enter once and you'll return to the normal shell prompt. Note that GFF3 columns must be separated by real tab characters, so make sure your editor or terminal doesn't convert them to spaces when pasting:

```bash
cat > toy.gff3 << 'EOF'
##gff-version 3
##sequence-region NC_000913.3 1 1000
NC_000913.3   .   gene            1     1000   .   +   .   ID=gene1;Name=DemoGene
EOF
```

(To abort a half-pasted heredoc, press Ctrl+C.)

3. Verify that `toy.gff3` was created:

```bash
ls -lh toy.gff3
# you should see toy.gff3 listed now
```
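One common snag: GFF3 columns must be separated by literal tab characters, not runs of spaces, or snpEff's parser will reject the rows. A quick self-contained sanity check (it builds its own tiny example with printf, which emits real tabs; point the awk command at your own file to check it):

```bash
# Build a one-feature GFF3 with guaranteed tab separators
printf '##gff-version 3\n' > check.gff3
printf 'NC_000913.3\t.\tgene\t1\t1000\t.\t+\t.\tID=gene1;Name=DemoGene\n' >> check.gff3

# Every non-comment line must have exactly 9 tab-separated columns
awk -F'\t' '!/^#/ && NF != 9 { print "line " NR ": " NF " columns"; bad = 1 }
            END { exit bad }' check.gff3 \
  && echo "GFF3 column check passed"
```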

4. Copy it into your snpEff data dir:

```bash
cp toy.gff3 snpEff_data/toy/genes.gff3
```

5. Also copy your reference FASTA into that same folder:

```bash
cp ref_genome.fa snpEff_data/toy/sequences.fa
```

6. Now build the toy database:

```bash
snpEff \
  -dataDir $(pwd)/snpEff_data \
  build \
  -gff3 \
  -v toy
```

If the build complains that genome "toy" cannot be found, you may also need a `toy.genome : Toy` entry in your `snpEff.config` (a local copy can be passed with `-config`).

After that, you'll have a minimal "toy" snpEff database under `snpEff_data/toy`, ready for the annotation step below.

7. Annotate your filtered VCF. Once you have a hard-filtered or VQSR-filtered VCF (e.g. `hard_filtered.vcf.gz`), annotate it with the toy database:

```bash
# 1) Annotate with snpEff
snpEff \
  -dataDir $(pwd)/snpEff_data \
  -v toy \
  hard_filtered.vcf.gz \
| bgzip -c > annotated.vcf.gz

# 2) Index the annotated VCF
tabix -p vcf annotated.vcf.gz
```
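Once annotated, you can keep only the variants snpEff rates HIGH or MODERATE impact. `SnpSift filter` is the usual tool for this, but the logic is easy to see with plain awk on the `ANN` INFO field. A self-contained demo on a made-up mini-VCF (the ANN strings are simplified; real snpEff annotations carry many more sub-fields, and real VCFs are tab-separated):

```bash
cat > demo_annotated.vcf << 'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_000913.3 100 . A G 50 PASS ANN=G|missense_variant|MODERATE|DemoGene
NC_000913.3 200 . C T 50 PASS ANN=T|synonymous_variant|LOW|DemoGene
NC_000913.3 300 . G A 50 PASS ANN=A|stop_gained|HIGH|DemoGene
EOF

# Header lines pass through; body lines must mention |HIGH| or |MODERATE|
awk '/^#/ || $8 ~ /\|(HIGH|MODERATE)\|/' demo_annotated.vcf > impact_filtered.vcf

grep -c -v '^#' impact_filtered.vcf
# -> 2  (the LOW-impact variant was dropped)
```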

You've now covered the three canonical filtering tiers:

- Hard filters (QUAL, DP, MQ, etc.)
- VQSR (GATK-based recalibration)
- Functional filters (snpEff → HIGH/MODERATE impact)

At this point there really isn't anything else you "must" do under Variant Filtering Strategies; you've hit all of the major approaches. If you want to sprinkle in a few extras, you could add:

- Genotype-level filters: e.g. `bcftools view -i 'FMT/GQ>20 && FMT/DP>10'` to drop individual calls with low confidence.
- Allele-frequency filters: with a cohort, you might remove very common variants (`--max-maf 0.01`) or focus on rare ones.
- Strand-bias or positional filters: exclude variants with extreme strand bias (e.g. `INFO/SB > X`) or variants very close to indels.
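To see what the genotype-level expression actually tests, here is a small emulation on a toy single-sample VCF. (Real filtering should of course use bcftools, which parses the FORMAT column properly; this sketch hard-codes GQ as the second sub-field, and uses spaces where real VCFs use tabs.)

```bash
# Toy single-sample VCF; FORMAT declares GT:GQ:DP, sample values follow
cat > gt_demo.vcf << 'EOF'
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 100 . A G 50 PASS . GT:GQ:DP 0/1:35:22
chr1 200 . C T 50 PASS . GT:GQ:DP 0/1:8:30
EOF

# Keep rows whose sample GQ (2nd sub-field of column 10) exceeds 20
awk '/^#/ { print; next }
     { split($10, s, ":"); if (s[2] + 0 > 20) print }' gt_demo.vcf > gt_filtered.vcf

grep -c -v '^#' gt_filtered.vcf
# -> 1  (only the GQ=35 call survives)
```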

But those are all just variations on the same theme. Unless you have a specialty use-case, you’re done here. The next thing in your wiki would be moving on to “VCF Format Deep Dive” or “Basic VCF Operations” (sorting, indexing, querying, stats).