variantFiltering - christianparobek/cambodiaWGS GitHub Wiki

###variantFiltering

Manske et al. use a number of filering criteria for filtering their 1M+ P. falciparum variants down to ~80K. We have 200K+ P. vivax variants, and would like to use only the highest quality variants among these.

After generating the combined VCF (n=69 ... we got rid of the 9 weakest samples), I filtered out SNPs falling among Neafsey's paralogs, TRF regions and the subtelomeric regions containing virs (viewed vir layout on chromosomes in PlasmoDB and made an interval files that removed the end of each chromosome with virs, with a 100bp buffer).

Next, applied a coverage filter that only accepts SNPs with >=05x coverage in 100% of our samples. NOTE: Could accepting only those SNPs that occur in 100% of samples obscure real population substructure, if for example, samples from OM have an extra genetic element that is in Sal1 but that BB doesn't have? That would be missing data in BB but would maybe be informative in OM. Not sure how distance matrices treat missing data...

Now, want to apply strict quality filters. I decided on quality filter cutoffs for the following metrics by first plotting the distributions of qualities, then cutting based on the observed distributions. Here are the quality measure I'm using:

Quality By Depth (QD) - This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples. –GATK Documentation
Fisher Score (FS) - Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls. –GATK Documentation
RMS Mapping Quality (MQ) - Root mean square mapping quality “provides an estimation of the overall mapping quality of reads supporting a variant call, averaged over all samples in a cohort. The root mean square is equivalent to the mean of the mapping qualities plus the standard deviation of the mapping qualities.” –GATK Documentation
MQRankSum - This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with ref bases vs. those with the alternate allele). Note that the mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls. –GATK Documentation
ReadPosRankSum - This is the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. Note that the read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. this will only be applied to heterozygous calls. –GATK Documentation

Plots of the distributions can be viewed here: (link to a github.io page??)

As we can see, for most of these quality criteria, there is a clear value below or above which we should make a quality cut. I will filter out all SNP entries with values:

QD < 5
FS > 10
MQ < 60
MQRankSum < -5
ReadPosRankSum < -5

This leaves us with 82,281 variants. Also want to filter out the super-rare alleles and keep only alleles that have 1% MAF across all samples or at least 10 supporting reads in one sample. However, it looks like all passing entries have at least 1% MAF, so everything passes this filter. As a last filter, keep only the biallelic SNPs.... this leaves us with 81,835 variants. At least one of the triallelic SNPs I glanced at in IGV looked believable, but we need to take them out since many programs only consider biallelics.

Finally, let's look at withon-exon SNP coverage depth vs outside-exon SNP coverage depth. Fortunately, for P. vivax it's about the same either way. See depthFinder.

#####19 June, 2015 Because of the way I named samples, most samples have some fragment of a date appended to the end of the sample name in the final VCF file. Use the following code to fix it:

## In only the line that has "CHROM" in it, remove the date bit that begins with a "_"
cat all_goods_UG.pass.vcf | sed '/CHROM/ s=_[0-9]*==g' > all_goods_UG.pass.namesfix.vcf

## Ensure old and new file are identical except for the fixed names
comm -23 all_goods_UG.pass.namesfix.vcf all_goods_UG.pass.vcf
comm -23 all_goods_UG.pass.vcf all_goods_UG.pass.namesfix.vcf