SnpEff and SnpSift - mestato/EPP622_2024 GitHub Wiki

Effect Prediction with SnpEff

Using SnpEff, we will take a look at the variants in our samples and determine potentially significant effects on the organism's biology.

Make a new folder called 7_SnpEff for this analysis alongside your other analysis folders inside of /pickett_sphinx/teaching/EPP622_24/YOURNAME/, and enter it.

mkdir 7_SnpEff
cd 7_SnpEff

Prep: Renaming Chromosomes

The SnpEff database has different names for the chromosomes than we have in our reference. If we look at the NCBI entry and scroll down to the "Chromosomes" table, we can see that this organism has two sets of names for its chromosomes, one for RefSeq, and one for Genbank.

We have the RefSeq names in our reference, but SnpEff's database uses the Genbank names, so we need to edit our VCF to make it compatible with the SnpEff database.

I've already made a translation table for you to use, so you just need to use bcftools annotate make a new VCF that SnpEff can use.

spack load bcftools

bcftools annotate \
        --rename-chrs /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/refseq_to_genbank_chr.tsv \
        ../6_filter/solenopsis_combined.GATKfilters.vcf \
        --output solenopsis_combined.GATKfilters.genbank_chrs.vcf

Here, we've named the output [...]genbank_chrs.vcf so know this VCF uses those alternate names.

Running SnpEff

spack load [email protected]_10
java -Xmx4G -jar /pickett_shared/software/snpEff/snpEff.jar Solenopsis_invicta solenopsis_combined.GATKfilters.genbank_chrs.vcf > solenopsis_combined.GATKfilters.genbank_chrs.ANN.vcf

If you look closely at the new VCF file's INFO column, SnpEff added a field called ANN that describes the particular site and any potential effects.

#CHROM	POS	...	INFO ...
CM028732.1	54393	...	AC=2;AF=1;AN=2;DP=6;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=27;QD=29.5;SOR=3.912;ANN=C|intergenic_region|MODIFIER|LOC105205108-LOC105206611|LOC105205108-LOC105206611|intergenic_region|LOC105205108-LOC105206611|||n.54393T>C||||||	...
CM028732.1	59958	...	AC=2;AF=1;AN=2;DP=12;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=50.2;QD=25.36;SOR=5.136;ANN=CACCG|intergenic_region|MODIFIER|LOC105205108-LOC105206611|LOC105205108-LOC105206611|intergenic_region|LOC105205108-LOC105206611|||n.59958_59959insACCG||||||	...
CM028732.1	75742	...	AC=1;AF=0.5;AN=2;BaseQRankSum=1.69;DP=12;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=18.89;ReadPosRankSum=0;SOR=0.693;ANN=T|intergenic_region|MODIFIER|LOC105205108-LOC105206611|LOC105205108-LOC105206611|intergenic_region|LOC105205108-LOC105206611|||n.75742C>T||||||	...
CM028732.1	261146	...	AC=2;AF=1;AN=2;BaseQRankSum=-0.534;DP=50;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=23.65;MQRankSum=-2.7;QD=25.74;ReadPosRankSum=0;SOR=5.59;ANN=T|upstream_gene_variant|MODIFIER|LOC105203981|LOC105203981|transcript|XM_011172946.3|protein_coding||c.-4598C>T|||||4544|,T|intron_variant|MODIFIER|LOC105206862|LOC105206862|transcript|LOC105206862_df_pt|pseudogene|10/10|n.3921+7722G>A||||||	...

The information in the ANN sub-fields (separated by |s after ANN=) contains lots of details about the individual site's effect. See this page for an in-depth description of how they work.

ANN=T|upstream_gene_variant|MODIFIER|LOC105203981|LOC105203981|transcript|XM_011172946.3|protein_coding||c.-4598C>T|||||4544|,T|intron_variant|MODIFIER|LOC105206862|LOC105206862|transcript|LOC105206862_df_pt|pseudogene|10/10|n.3921+7722G>A||||||

The kinds of annotations that SnpEff can label are detailed here.

Selecting Sites with SnpSift filter

So far, we have a VCF that describes variant sites, potential effects, and genotypes for a whole cohort of samples all in one file.

However, if we want to more easily work with certain kinds of variants, or only passing sites, or another subset of our VCF data, we should use SnpSift.

In particular, we're going to select sites that meet the following criteria:

  • Variant site passed previous filter criteria
  • Variant site has a Moderate or High potential effect

To select the desired sites, we have to write expressions in a sort of scripting language that SnpSift expects.

Selecting Passing Sites

To select only passing sites, we will use the expression "(( na FILTER ) \| (FILTER = 'PASS'))".

In plain language, the whole expression means, "sites with a FILTER column that is empty (( na FILTER)), OR (|) sites with a FILTER column that has only the word PASS ((FILTER = 'PASS'))."

That expression is actually two smaller expressions that are compared to each other with a logical OR comparison, so if one of them is true for a given site, that site should be selected.

Selecting Moderate and High Effect Sites

To select sites with potentially moderate or high effects, we will use the expression "((ANN[*].IMPACT has 'MODERATE') \| (ANN[*].IMPACT has 'HIGH'))".

This expression is made up of two smaller expressions just like the last one.

(ANN[*].IMPACT has 'MODERATE') means "sites with any annotations that contain the word 'MODERATE' in the IMPACT subfield."

(ANN[*].IMPACT has 'HIGH') means "sites with any annotations that contain the word 'HIGH' in the IMPACT subfield."

These two expressions are compared with a logical OR (|), so if either one is true for a given site, that site will be selected.

Selecting Both Passing and Moderate/High Effect Sites

We can combine the above expressions with a logical AND (&) to say that a given site must pass both filters to be selected.

(( na FILTER ) \| (FILTER = 'PASS')) & ((ANN[*].IMPACT has 'MODERATE') \| (ANN[*].IMPACT has 'HIGH')).

This expression reads as "sites that have either nothing or 'PASS' in their FILTER column, and also have either 'MODERATE' or 'HIGH' in any of their annotations' IMPACT sub-fields."

We need to be careful about how we use parentheses in our expressions, because without grouping the expressions correctly, we might not select exactly the sites we wanted. Notice that the above expression is written with parentheses around each small expression, each group of expressions, and then the whole expression.

Running SnpSift filter

Here's an example of how to run SnpSift filter with the above expressions.

spack load [email protected]_10

PASS="(( na FILTER ) | (FILTER = 'PASS'))"
MOD_OR_HIGH="((ANN[*].IMPACT has 'MODERATE') | (ANN[*].IMPACT has 'HIGH'))"

SIFT_EXPR="$PASS & $MOD_OR_HIGH"

cat solenopsis_combined.GATKfilters.refseq_chrs.ANN.vcf | java -Xmx4G -jar /pickett_shared/software/snpEff/SnpSift.jar filter "$SIFT_EXPR" > pass_mod_high_10x.vcf