Basic VCF Operations - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

raw_snps_indels.vcf.gz

Basic VCF Operations (bcftools & vcftools)

  • View, filter & query
    • bcftools
# view all variants, or apply simple filters (e.g. QUAL>30, PASS only)
bcftools view raw_snps_indels.vcf.gz
bcftools view -i 'QUAL>30 && FILTER="PASS"' raw_snps_indels.vcf.gz
    • vcftools
# extract SNPs only and write a new VCF
# (--recode drops INFO fields unless --recode-INFO-all is also given)
vcftools --gzvcf raw_snps_indels.vcf.gz --remove-indels --recode --stdout > snps_only.vcf
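
The heading above mentions querying, but only view and filter commands are shown. As a minimal sketch, `bcftools query` with a format string prints selected fields as tab-separated text. The function name `list_sites` is a hypothetical helper introduced here for illustration, not part of bcftools.

```shell
# Hypothetical helper: print CHROM, POS, REF, ALT and QUAL for every record
# in the given VCF as tab-separated text, one line per site.
list_sites() {
  bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' "$1"
}

# Example call; runs only if bcftools and the input file are available.
if command -v bcftools >/dev/null 2>&1 && [ -f raw_snps_indels.vcf.gz ]; then
  list_sites raw_snps_indels.vcf.gz | head -n 5
fi
```

The output is plain text rather than VCF, so it can be passed straight to tools such as `awk`, `sort` or `cut`.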

  • Sort & index
# sort VCF by CHROM/POS and compress
bcftools sort -Oz -o sorted.vcf.gz raw_snps_indels.vcf.gz
# build a tabix (.tbi) index for fast random access
# (without -t, bcftools index writes a CSI index instead)
bcftools index -t sorted.vcf.gz
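
Once the sorted file is indexed, single regions can be pulled out without scanning the whole file. The sketch below uses `bcftools view -r`, which requires an index. The helper name `region_records` and the region `chr1:1-100000` are placeholders chosen for illustration; substitute a contig name from your own VCF header.

```shell
# Hypothetical helper: print the records overlapping a region, without the header.
#   $1 = bgzipped, indexed VCF    $2 = region such as chr1:1-100000
region_records() {
  bcftools view -H -r "$2" "$1"
}

# Example call; runs only if bcftools and the indexed file are available.
# The region is a placeholder and may not exist in your data.
if command -v bcftools >/dev/null 2>&1 && [ -f sorted.vcf.gz ]; then
  region_records sorted.vcf.gz chr1:1-100000 | head -n 5
fi
```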
  • Summary statistics
    • bcftools stats
# generate a full stats report
bcftools stats sorted.vcf.gz > stats.txt

# generate summary plots from those stats (written to vcf_plots/)
plot-vcfstats -p vcf_plots/ stats.txt
    • vcftools
# allele frequency distribution
vcftools --gzvcf raw_snps_indels.vcf.gz --freq --out allele_freq

# missing data per sample (writes out.imiss) and per site (writes out.lmiss);
# add --out <prefix> to replace the default "out" prefix
vcftools --gzvcf raw_snps_indels.vcf.gz --missing-indv
vcftools --gzvcf raw_snps_indels.vcf.gz --missing-site
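
The reports above are plain tab-separated text, so they can be summarised with standard shell tools. The sketch below assumes the file names `stats.txt` (from `bcftools stats`) and `out.imiss` (the default name vcftools gives the `--missing-indv` report when no `--out` prefix is set). The helper names and the 0.1 missingness threshold are illustrative choices, not recommendations from either tool.

```shell
# Hypothetical helper: show the summary-number (SN) lines of a bcftools stats
# report, dropping the leading "SN" and id columns.
stats_summary() {
  grep '^SN' "$1" | cut -f 3-
}

# Hypothetical helper: list samples whose fraction of missing genotypes
# (F_MISS, column 5 of a vcftools .imiss file) exceeds a threshold.
#   $1 = .imiss file    $2 = threshold between 0 and 1
high_missing() {
  awk -v t="$2" 'NR > 1 && $5 > t { print $1 "\t" $5 }' "$1"
}

# Example calls; each runs only if the corresponding report exists.
if [ -f stats.txt ]; then
  stats_summary stats.txt
fi
if [ -f out.imiss ]; then
  high_missing out.imiss 0.1   # 0.1 is an arbitrary example cutoff
fi
```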

Inspect the .frq output

head -n 10 allele_freq.frq

You should see columns like: