4. Quality Filtering - DianaCarolinaVergara/SNPs_pipeline GitHub Wiki

VCF Quality Filtering

Extracted from:

(http://grunwaldlab.github.io/Population_Genetics_in_R/qc.html)

Missing data

Missing data can frequently be due to samples (columns) or variants (rows) of low quality. Here we demonstrate how to identify samples and variants in the data set that has a high degree of missingness.

Quantifying missingness, one sample

To quantify missingness for a single sample we can use the function is.na(). This function uses a vector as an argument and returns a logical vector (TRUE and FALSE) indicating which values are NA. If we remind ourselves that TRUEs and FALSEs are numerically encoded as 1 and 0 it reminds us we can take a sum of this logical vector to determine the degree of missingness.

as.numeric(c(FALSE, TRUE))
 [1] 0 1
sum(as.numeric(c(FALSE, TRUE)))
 [1] 1

We can extract a matrix of depths and query the first sample for missingness. This reports the number of missing variants in the first sample. We could similarly count the number of missing samples from a variant by accessing a row instead of a column. We could also convert this to a percentage by using length() to determine the total number of values in either the column or row and use this as a denominator.

Muricea_dp<- extract.gt(Muricea.vcf_ID_SNPs, element = "DP", as.numeric=TRUE)

Example:

example_dp <- extract.gt(vcf_ID_SNPs, element = "DP", as.numeric=TRUE)
sum(is.na(example_dp[,1]))

heatmap.bp(example_dp, rlabels=FALSE)

Quantifying missingness, all samples

We can extend the functionality of the above example to many columns or rows by using the apply() function.

?apply

example_miss <- apply(example_dp, MARGIN = 2, function(x){ sum(is.na(x)) })
example_miss <- example_miss/nrow(example_vcf_novo_ID_SNPs)

library(RColorBrewer)
palette(brewer.pal(n=12, name = 'Set3'))

par(mar = c(12,4,4,2))
barplot(example_miss, las = 2, col = 1:12)
title(ylab = "Missingness (%)")

This allows us to visualize the degree of missingness on a per-sample basis. One decision could be to omit this sample. However, if the goal is to make comparisons among these species we may instead search for variants that are present in both taxa.

We can do something similar to query the variants (rows) for missingness. We will use a histogram.

par(mar = c(5,4,4,2))
example_miss_2 <- apply(example_dp, MARGIN = 1, function(x){ sum(is.na(x)) })
example_miss_2 <- example_miss_2/ncol(example_vcf_novo_ID_SNPs@gt[,-1])

hist(example_miss_2, col = "#8DD3C7", xlab = "Missingness (%)", main = "")

Censoring data

Here we censor the variants that we do not feel are of typical coverage. When we censor variants we score them as missing (NA). By censoring instead of removing we preserve the geometry of the matrix. This helps facilitate multiple manipulations of the matrix if desired. Let's begin by reminding us how abundant NAs are in our dataset.

input example_vcf_novo_ID_SNPs

A number of methods can be used to create intervals that you may consider acceptable. The use of quantiles may be considered desirable because they are non-parametric. We can fit different intervals to different samples using the function apply().

example_quants <- apply(example_dp, MARGIN=2, quantile, probs=c(0.1), na.rm=TRUE)
example_quants
class(example_quants)
example_quants[]

We can create a second matrix of depths where we subtract the lower threshold of each sample from its depth using the function sweep(). We can similarly subtract the upper threshold from our samples to create a second matrix. Now everything above zero is above our threshold and can be set to NA.

?sweep

example_dp_2 <- sweep(example_dp, MARGIN=2, FUN = "-", example_quants[])
example_dp[example_dp_2 < 0] <- NA

example_dp[Abip_dp <5] <- NA
example_dp

We now have a vcfR object that has variants that we feel are of higher quality than our original file.

example_vcf_novo_ID_SNPs@gt[,-1][is.na(example_dp) == TRUE] <- NA
example_vcf_novo_ID_SNPs

Omitting data

We omit samples and variants that have been set as missing (NA).

example_dp<- extract.gt(example_vcf_novo_ID_SNPs, element = "DP", as.numeric=TRUE)
example_vcf_novo_ID_SNPs

One way to visualize these data is to use a heatmap.

heatmap.bp(Abip_dp, rlabels=FALSE)

heatmap.bp(Abip_dp, rlabels=FALSE)

Omitting samples

We can see that some samples have a high degree of missingness. By omitting these samples we may reduce the overall missingness in the data set. We accomplish this by using the function is.na() in conjunction with apply() to determine each sample's degree of missingness and use this information to omit samples below a threshold.

We can then visualize this change with another heatmap.

example_miss <- apply(example_dp, MARGIN = 2, function(x){ sum(is.na(x)) })
example_miss <- example_miss/nrow(example_dp)*100

par(mar = c(12,4,4,2))
barplot(example_miss, las = 2, col = 1:12)
title(ylab = "Missingness (%)")

example_vcf_novo_ID_SNPs@gt <- example_vcf_novo_ID_SNPs@gt[, c(TRUE, example_miss < 70)]
example_vcf_novo_ID_SNPs

example_dp <- extract.gt(example_vcf_novo_ID_SNPs, element = "DP", as.numeric=TRUE)
heatmap.bp(Abip_dp, rlabels = FALSE)

For our Muricea samples:

Omitting samples with missingness > 70%

#Omitting samples

Muricea.vcf_ID_SNPs_miss <- apply(Muricea.vcf_ID_SNPs_dp, MARGIN = 2, function(x){ sum(is.na(x)) })
Muricea.vcf_ID_SNPs_miss <- Muricea.vcf_ID_SNPs_miss/nrow(Muricea.vcf_ID_SNPs_dp)*100


par(mar = c(7,4,4,2))
barplot(Muricea.vcf_ID_SNPs_miss, las = 2, col = 1:12, axes=TRUE, axisnames = TRUE)
title("Missingness (%)")
par(font.axis = 2)
par(font.lab = 2)
par(mar = c(5,4,4,2))
barplot(Muricea.vcf_ID_SNPs_miss, col = 1:100, axes=TRUE, axisnames = TRUE)
title(main= "Missingness in Muricea", ylab = "Missingness (%)", xlab= "Sample")

Muricea.vcf_ID_SNPs@gt <- Muricea.vcf_ID_SNPs@gt [, c(TRUE, Muricea.vcf_ID_SNPs_miss < 30)]


Muricea.vcf_ID_SNPs

Muricea.vcf_ID_SNPs_dp2<- extract.gt(Muricea.vcf_ID_SNPs, element = "DP", as.numeric=TRUE)


heatmap.bp(Muricea.vcf_ID_SNPs_dp2[1:1000,], rlabels = FALSE)

Omitting variants

Previously we have seen how to quantify and visualize missingness for variants in our dataset. We can use this information to omit variants that have a high degree of missingness similar to how we treated the samples.

example_miss <- apply(example_dp, MARGIN = 1, function(x){ sum( is.na(x) ) } )
example_miss <- example_miss / ncol(example_dp)*100
example_DP_filtered <- example_vcf_novo_ID_SNPs[example_miss<50]
example_DP_filtered
head(example_DP_filtered)
example_dp <- extract.gt(example_DP_filtered, element="DP", as.numeric = TRUE)
heatmap.bp(example_dp, rlabels = FALSE)

And Muricea samples with miss<90

Muricea_miss <- apply(Muricea_dp, MARGIN = 1, function(x){ sum( is.na(x) ) } )
Muricea_miss <- Muricea_miss / ncol(Muricea_dp)*100
Muricea_DP_filtered <- Muricea.vcf_ID_SNPs[Muricea_miss<90]

Muricea_DP_filtered

Muricea_dp <- extract.gt(Muricea_DP_filtered, element="DP", as.numeric = TRUE)
heatmap.bp(Muricea_dp, rlabels = FALSE)