Basic VCF Operations - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Input file: raw_snps_indels.vcf.gz
Basic VCF Operations (bcftools & vcftools)
- View, filter & query
- bcftools
# view all variants, or apply simple filters (e.g. QUAL>30, PASS only)
bcftools view raw_snps_indels.vcf.gz
bcftools view -i 'QUAL>30 && FILTER="PASS"' raw_snps_indels.vcf.gz
- vcftools
# extract SNPs only and write a new VCF
vcftools --gzvcf raw_snps_indels.vcf.gz --remove-indels --recode --stdout > snps_only.vcf
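The heading mentions querying, so here is a minimal sketch of that step. It builds a small toy VCF (hypothetical records, not the tutorial's data), shows in a comment how `bcftools query` would tabulate the variants kept by the filter, and then applies the same rule in plain awk as a tool-independent cross-check. The file name `toy.vcf` and the variable `kept` are illustrative assumptions.

```shell
# Toy VCF with hypothetical records (tab-separated; not the tutorial's data).
printf '%s\n' '##fileformat=VCFv4.2' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> toy.vcf
printf 'chr1\t200\t.\tC\tT\t12\tPASS\t.\n' >> toy.vcf
printf 'chr1\t300\t.\tG\tGA\t80\tLowQual\t.\n' >> toy.vcf
printf 'chr2\t150\t.\tT\tC\t99\tPASS\t.\n' >> toy.vcf

# With bcftools installed, this prints one tab-separated row per kept variant:
#   bcftools query -i 'QUAL>30 && FILTER="PASS"' \
#       -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' toy.vcf

# Cross-check in awk: skip header lines, require QUAL > 30 and FILTER == PASS.
kept=$(awk -F'\t' '!/^#/ && $6 != "." && $6 + 0 > 30 && $7 == "PASS" { print $1 ":" $2 }' toy.vcf)
echo "$kept"
```

Comparing the awk result with the bcftools output on a small file is a quick way to confirm that a filter expression selects what you expect before running it on the full call set.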
- Sort & index
# sort VCF by CHROM/POS and compress
bcftools sort -Oz -o sorted.vcf.gz raw_snps_indels.vcf.gz
# build a tabix (.tbi) index for fast random access; without -t, bcftools index writes a CSI (.csi) index
bcftools index -t sorted.vcf.gz
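Indexing only works on a coordinate-sorted file, so it can help to verify the ordering first. The sketch below defines a hypothetical helper, `vcf_is_sorted`, that reads uncompressed VCF text on stdin. It succeeds only if each chromosome appears as one contiguous block and positions never decrease within it. It makes no assumption about the order of the chromosomes themselves.

```shell
# vcf_is_sorted (hypothetical helper): exit 0 if the VCF text on stdin has each
# chromosome in one contiguous block with non-decreasing POS, otherwise exit 1.
vcf_is_sorted() {
  awk -F'\t' '
    /^#/ { next }                        # skip header lines
    $1 != chrom {                        # a new chromosome block starts
      if ($1 in seen) bad = 1            # chromosome seen earlier: blocks are split
      seen[$1] = 1; chrom = $1; last = 0
    }
    { if ($2 + 0 < last) bad = 1; last = $2 + 0 }
    END { exit bad + 0 }
  '
}

# Toy input (CHROM and POS columns only, hypothetical values).
printf 'chr1\t100\nchr1\t250\nchr2\t50\n' | vcf_is_sorted && echo "sorted"
printf 'chr1\t250\nchr1\t100\n' | vcf_is_sorted || echo "not sorted"

# On the real file (bgzip output is gzip-compatible):
#   gzip -dc sorted.vcf.gz | vcf_is_sorted && echo "sorted"
```

Once the file is sorted and indexed, region queries such as `bcftools view -r chr1:1-1000 sorted.vcf.gz` become possible. The contig name `chr1` is an assumption here and depends on your reference.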
- Summary statistics
- bcftools
# generate a full stats report
bcftools stats sorted.vcf.gz > stats.txt
# generate summary plots from those stats (requires matplotlib; the PDF report also needs pdflatex)
plot-vcfstats -p vcf_plots/ stats.txt
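The stats report is a tab-separated text file, so the headline numbers can be pulled out without plotting. The sketch below uses a toy excerpt in the layout of the `SN` (summary numbers) section, with hypothetical counts. The file name `toy_stats.txt` and the variable `n_snps` are illustrative.

```shell
# Toy excerpt in the layout of bcftools stats output (hypothetical numbers).
printf 'SN\t0\tnumber of samples:\t3\n'    > toy_stats.txt
printf 'SN\t0\tnumber of records:\t120\n' >> toy_stats.txt
printf 'SN\t0\tnumber of SNPs:\t100\n'    >> toy_stats.txt
printf 'SN\t0\tnumber of indels:\t20\n'   >> toy_stats.txt
printf 'TSTV\t0\t70\t30\t2.33\t70\t30\t2.33\n' >> toy_stats.txt

# Keep only the summary-number rows; drop the section tag and the set id.
grep '^SN' toy_stats.txt | cut -f 3-

# Pull out a single value, here the SNP count.
n_snps=$(awk -F'\t' '$1 == "SN" && $3 == "number of SNPs:" { print $4 }' toy_stats.txt)
echo "SNPs: $n_snps"
```

On the real report, the same `grep '^SN' stats.txt | cut -f 3-` pipeline should give a compact overview of sample, record, SNP and indel counts.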
- vcftools
# allele frequency distribution
vcftools --gzvcf raw_snps_indels.vcf.gz --freq --out allele_freq
# missing data per sample (writes out.imiss) or per site (writes out.lmiss); with no --out given, vcftools uses its default prefix "out"
vcftools --gzvcf raw_snps_indels.vcf.gz --missing-indv
vcftools --gzvcf raw_snps_indels.vcf.gz --missing-site
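A common follow-up to the per-sample report is to list samples whose missing-genotype fraction exceeds a cutoff. The sketch below works on a toy file in the `.imiss` column layout (INDV, N_DATA, N_GENOTYPES_FILTERED, N_MISS, F_MISS). The sample names, the counts and the 0.1 threshold are all hypothetical.

```shell
# Toy file in the .imiss column layout (hypothetical samples and counts).
printf 'INDV\tN_DATA\tN_GENOTYPES_FILTERED\tN_MISS\tF_MISS\n' > toy.imiss
printf 'sampleA\t1000\t0\t12\t0.012\n' >> toy.imiss
printf 'sampleB\t1000\t0\t240\t0.24\n' >> toy.imiss
printf 'sampleC\t1000\t0\t95\t0.095\n' >> toy.imiss

# Illustrative cutoff: flag samples with more than 10% missing genotypes.
max_fmiss=0.1

# Skip the header row, compare F_MISS numerically, print the sample ID.
high_missing=$(awk -F'\t' -v max="$max_fmiss" 'NR > 1 && $5 + 0 > max + 0 { print $1 }' toy.imiss)
echo "$high_missing"
```

Saved to a file with one ID per line, such a list can be passed to `vcftools --remove` to exclude those samples in a later run. Choose the threshold to suit your data rather than treating 0.1 as a recommendation.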
- Inspect the .frq output
head -n 10 allele_freq.frq
You should see columns like: