Variant Analysis Vcftools - Bioinfo-niab/VariantAnalysis GitHub Wiki

Common filtering of VCF files using vcftools

  1. Output allele frequency for all sites in the input vcf file from chromosome 1

    vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

  2. Output a new vcf file from the input vcf file that removes any indel sites

    vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

    --recode : The output file has the suffix ".recode.vcf"
    --recode-INFO-all : Keeps all INFO fields from the original file

  3. Output file comparing the sites in two vcf files

    vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2

  4. Output a new vcf file to standard out without any sites that have a filter tag, then compress it with gzip

    vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz

  5. Output a Hardy-Weinberg p-value for every site in the bcf file that does not have any missing genotypes

    vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

  6. Output nucleotide diversity at a list of positions

    zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity

  7. Select biallelic sites only

    vcftools --vcf final_OnlySNP.recode.vcf --min-alleles 2 --max-alleles 2 --recode --out final_biallelicSNP
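To make the biallelic SNP criterion concrete, the following is a minimal sketch that applies a similar rule to a few toy records with awk. The records are hypothetical, and the rule (single-base REF, exactly one single-base ALT) only approximates what --remove-indels combined with --min-alleles 2 --max-alleles 2 selects; it is not a replacement for vcftools.

```shell
# Toy VCF records (tab-separated: CHROM POS ID REF ALT); hypothetical data
vcf_body=$(printf '1\t100\t.\tA\tG\n1\t200\t.\tA\tG,T\n1\t300\t.\tAT\tA\n1\t400\t.\tC\tT\n')

# Keep records with a single-base REF and exactly one single-base ALT allele
biallelic_snps=$(printf '%s\n' "$vcf_body" |
  awk -F'\t' 'length($4) == 1 && $5 ~ /^[ACGT]$/ { print $1 ":" $2 }')

printf '%s\n' "$biallelic_snps"
```

The multi-allelic record at position 200 and the deletion at position 300 are dropped, leaving positions 100 and 400.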

GENOTYPE VALUE FILTERING

         --min-meanDP : Includes only sites with mean depth values (over all included individuals) greater than or equal to this value
         --max-meanDP : Includes only sites with mean depth values (over all included individuals) less than or equal to this value
         --hwe : Assesses sites for Hardy-Weinberg Equilibrium using an exact test. Sites with a p-value below
                 the threshold defined by this option are taken to be out of HWE, and therefore excluded.
         --max-missing : Excludes sites on the basis of the proportion of missing data (between 0 and 1), where
                         0 allows sites that are completely missing and 1 indicates no missing data allowed.
         --max-missing-count : Excludes sites with more than this number of missing genotypes over all individuals.
         --phased : Excludes all sites that contain unphased genotypes.
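The scale of --max-missing is easy to misread, because a value of 1 means no missing data is allowed. The sketch below reproduces that logic on a toy genotype table with awk. The site names and genotypes are hypothetical, and treating only "./." as missing is a simplifying assumption.

```shell
# Toy table: site ID followed by four diploid genotypes (hypothetical data)
genotypes='siteA 0/1 0/0 1/1 0/1
siteB 0/1 ./. ./. 0/0
siteC ./. ./. ./. 0/1'

threshold=0.5   # same scale as --max-missing: 1 means no missing data allowed

kept=$(printf '%s\n' "$genotypes" | awk -v t="$threshold" '{
  called = 0
  for (i = 2; i <= NF; i++) if ($i != "./.") called++
  rate = called / (NF - 1)          # fraction of samples with a genotype call
  if (rate >= t) print $1, rate     # keep the site when enough samples are called
}')

printf '%s\n' "$kept"
```

With a threshold of 0.5, siteA (all samples called) and siteB (half called) are kept, while siteC (one of four called) is excluded.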
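  1. Multiple filtering together: site level

    vcftools --vcf DP100_JerseyKashmiri_finalSNP.vcf --out D100 --remove-filtered-all --max-missing-count 0 \
      --minDP 10 --minQ 10 --maxDP 1000 --maf 1.0 --recode --recode-INFO-all

  2. Multiple filtering together: genotype level

    vcftools --vcf DP100_JerseyKashmiri_finalSNP.vcf --out D100 --remove-filtered-geno-all --max-missing-count 0 \
      --minDP 30 --minQ 10 --maxDP 1000 --maf 1.0 --recode --recode-INFO-all

Note: a minor allele frequency cannot exceed 0.5, so --maf 1.0 as written would leave no sites; set --maf to the intended threshold before running. Both commands also write to the same --out prefix (D100), so the second run overwrites the output of the first.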

Example

  1. Filtering for: biallelic SNPs only

    vcftools --vcf milk_gene_snps_filtered_12_sahiwal.vcf --remove-filtered-all --remove-indels --recode --recode-INFO-all --min-alleles 2 --max-alleles 2 --out BiAllelicSNPs_only

    Results: After filtering, kept 1907548 out of a possible 2176045 Sites

  2. Filtering for: SNPs called in all individuals

    vcftools --vcf BiAllelicSNPs_only.recode.vcf --remove-filtered-geno-all --recode --recode-INFO-all \
      --max-missing-count 0 --min-meanDP 20 --minQ 10 --out BiAllelicSNPs_NOMissing

    Results: After filtering, kept 956736 out of a possible 1907548 Sites
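As a quick check on how much each step removes, the retained fractions can be computed from the site counts reported in the two results above. This is a small arithmetic sketch; the helper name pct is made up for illustration.

```shell
# Site counts reported in the two results above
raw=2176045
biallelic=1907548
no_missing=956736

# pct KEPT TOTAL: percentage of sites retained, to two decimals (hypothetical helper)
pct() { LC_ALL=C awk -v kept="$1" -v total="$2" 'BEGIN { printf "%.2f", 100 * kept / total }'; }

step1=$(pct "$biallelic" "$raw")          # biallelic SNP filter
step2=$(pct "$no_missing" "$biallelic")   # no-missing-data filter
overall=$(pct "$no_missing" "$raw")       # both filters combined

echo "Step 1 kept ${step1}% of sites, step 2 kept ${step2}%, overall ${overall}%"
```

The first filter keeps about 87.66% of sites, the second about 50.16% of what remains, so roughly 43.97% of the original sites survive both.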

### Genotype level
GT: Genotype
GQ: Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype
GL: Genotype Likelihood, log10-scaled likelihoods of the data given the called genotype for each possible
    genotype generated from the reference and alternate alleles given the sample ploidy
DP: Read Depth
AD: Number of observations for each allele
QR: Sum of quality of the reference observations
QA: Sum of quality of the alternate observations

### Site level
AC: Total number of alternate alleles in called genotypes
AN: Total number of alleles in called genotypes
AF: Estimated allele frequency in the range (0,1]
NS: Number of samples
RO: Reference allele observation count
AO: Alternate allele observation count
AB: Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous
ABP: Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality
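The site-level tags AC, AN and AF are related: for a biallelic site, AF is expected to equal AC divided by AN. The sketch below parses a hypothetical INFO string with awk to show the relationship. The tag values are invented for illustration, and multi-allelic sites (where AC holds a comma-separated list) are not handled.

```shell
info='NS=12;AN=24;AC=3'   # hypothetical INFO column for a biallelic site

af=$(printf '%s\n' "$info" | LC_ALL=C awk -F';' '{
  for (i = 1; i <= NF; i++) {
    split($i, kv, "=")               # kv[1] is the tag name, kv[2] its value
    if (kv[1] == "AC") ac = kv[2]
    if (kv[1] == "AN") an = kv[2]
  }
  printf "%.3f", ac / an             # alternate allele frequency
}')

echo "AF = $af"
```

Here 3 alternate alleles out of 24 called alleles give an allele frequency of 0.125.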

  3. Filtering for: less stringent thresholds. Keep only variants that have been successfully genotyped in 50% of individuals (--max-missing 0.5), with a minimum quality score of 20 (--minQ 20), a minor allele count of 3 (--mac 3), and a minimum depth of 3 for a genotype call (--minDP 3; this is a per-genotype depth, not a mean depth)

    vcftools --vcf BiAllelicSNPs_only.recode.vcf --remove-filtered-geno-all --recode --recode-INFO-all \
      --max-missing 0.5 --mac 3 --minQ 20 --minDP 3 --out BiAllelicSNPs_50Missing

  4. Filtering for: removing individuals that did not sequence well. We can do this by assessing individual levels of missing data.

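    # writes out.imiss (five columns; the fifth, F_MISS, is the fraction of missing data)
    vcftools --vcf raw.g5mac3dp3.recode.vcf --missing-indv
    cat out.imiss
    # collect the F_MISS values, skipping the header line
    mawk '!/IN/' out.imiss | cut -f5 > totalmissing
    # list individuals with more than 50% missing data
    mawk '$5 > 0.5' out.imiss | cut -f1 > lowDP.indv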
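The selection of poorly sequenced individuals can be tried out without running vcftools. The sketch below builds a small, hypothetical out.imiss file in the five-column layout described above and applies the same 50% cut-off. It uses plain awk rather than mawk and skips the header with NR > 1; both are substitutions made for portability, not part of the original commands.

```shell
workdir=$(mktemp -d)

# Hypothetical per-individual missingness report (tab-separated, with header)
{
  printf 'INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n'
  printf 'ind1\t1000\t0\t100\t0.1\n'
  printf 'ind2\t1000\t0\t620\t0.62\n'
  printf 'ind3\t1000\t0\t480\t0.48\n'
  printf 'ind4\t1000\t0\t900\t0.9\n'
} > "$workdir/out.imiss"

# Skip the header and list individuals with more than 50% missing data
awk -F'\t' 'NR > 1 && $5 > 0.5 { print $1 }' "$workdir/out.imiss" > "$workdir/lowDP.indv"

low=$(cat "$workdir/lowDP.indv")
printf '%s\n' "$low"

rm -rf "$workdir"
```

Only ind2 and ind4 exceed the cut-off, so they are the names that would be passed to --remove.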

### To remove low DP individuals from the list

    vcftools --vcf raw.g5mac3dp3.recode.vcf --remove lowDP.indv --recode --recode-INFO-all --out raw.g5mac3dplm

http://www.ddocent.com/filtering/