Variant Analysis Bcftools - Bioinfo-niab/VariantAnalysis GitHub Wiki

Filtering with BCFtools

For advance filtering -i, --include and -e, --exclude. In the examples below, we demonstrate the usage on the query command because it allows us to show the output in a very compact form using the -f formatting option. (For details about the format, see the Extracting information page.) Simple example: filtering by fixed columns

Fixed columns such as QUAL, FILTER, INFO are straightforward to filter. In this example, we use the -e 'FILTER="."' expression to exclude sites where FILTER is not set:

$ bcftools query -e'FILTER="."' -f'%CHROM %POS %FILTER\n' file.bcf | head -2 1 3000150 PASS 1 3000151 LowQual

In this example, we use the -i 'QUAL>20 && DP>10' expression to include only sites with big enough quality and depth:

$ bcftools query -i'QUAL>20 && DP>10' -f'%CHROM %POS %QUAL %DP\n' file.bcf | head -2 1 14930 31.2757 13 1 17538 37.9458 12

todo: vcf=/lustre/scratch116/vr/projects/hipsci/cnv/exome-validation/mpileup/ffdm#ffdm_3.bcf $bt query $vcf -i'QUAL="."' -f' %CHROM %POS %QUAL\n' | head -2 Comparing string to numeric value: QUAL="."

FORMAT columns

When filtering FORMAT tags, the OR logic is applied with multiple samples. For example, if we want to remove sites with an uncalled genotype in any sample, the expression -i 'GT!="."' is not going to work:

$ bcftools query -i'GT!="."' -f'%CHROM %POS [ %GT]\n' file.bcf | head -2 1 30923 ./. 1/1 1 54490 ./. 1/1

Instead, the reverse logic -e 'GT ="."' must be applied:

$ bcftools query -e'GT ="."' -f'%CHROM %POS [ %GT]\n' file.bcf | head -2 1 69511 1/1 1/1 1 71850 0/0 0/1

FORMAT columns and boolean expressions (&& vs & and || vs |)

Say our VCF contains the per-sample depth and genotype quality annotations and we want to include only sites where one or more samples have big enough coverage (DP>10) and genotype quality (GQ>20). The expression -i 'FMT/DP>10 & FMT/GQ>20' selects sites where the conditions are satisfied within the same sample:

$ bcftools query -i'FMT/DP>10 & FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf 49979 SampleA:DP=10 GQ=50 SampleB:DP=20 GQ=40

On the other hand, if we need to include sites where both conditions met but not necessarily in the same sample, we use the && operator rather than &:

$ bcftools query -i'FMT/DP>10 && FMT/GQ>20' -f'%POS[\t%SAMPLE:DP=%DP GQ=%GQ]\n' file.bcf 31771 SampleA:DP=10 GQ=50 SampleB:DP=40 GQ=20 49979 SampleA:DP=10 GQ=50 SampleB:DP=20 GQ=40

Similarly, the | operator can select just the matching samples:

$ bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i'FMT/DP=19 | FMT/DP="."' test/view.filter.vcf 3162006 A 19

3162007 A . 3162007 B .

or the whole record when || is used:

$ bcftools query -f'[%POS %SAMPLE %DP\n]\n' -i'FMT/DP=19 || FMT/DP="."' test/view.filter.vcf 3162006 A 19 3162006 B 1

3162007 A . 3162007 B .

https://samtools.github.io/bcftools/howtos/filtering.html

Bcftools cheatsheet

*bcftools filter
*Filter variants per region (in this example, print out only variants mapped to chr1 and chr2)
qbcftools filter -r1,2 ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.hg38.vcf.gz

*printing out info for only 2 samples:
bcftools view -s NA20818,NA20819 filename.vcf.gz

*printing stats only for variants passing the filter:
bcftools view -f PASS filename.vcf.gz

*printing variants withoud header:
bcftools view -H

*printing variants on a particular region:
bcftools view -r chr20:1-200000 -s NA20818,NA20819 filename.vcf.gz

*print all variants except for the ones falling within region:
bcftools view -t ^chr20:1-30000000 ex_bams.samtools.20161231.vcf.gz >out.vcf

*view the positions passed in a file (accepted files are .vcf and .bed):
bcftools view -R 0002.vcf in.vcf.gz

*view the positions passed in a tsv file:
bcftools view -R 0002.tsv in.vcf.gz
# The format of 0002.tsv:
20      79000   80000
20      90000   100000

*printing out only the chr info:
bcftools query -f '%CHROM\n' filename.vcf
/
*now, print out the chr\tpos
bcftools query -f '%CHROM\t%POS\n' filename.vcf
/
*now, print out the AF INFO field
bcftools query -f '%INFO/AF\n'
/
#getting a particular annotation from the VCF
bcftools query -f '%QUAL\n' 0002.vcf
/
#printing chr pos and a particular annotation from a VCF:
bcftools query -f '%CHROM\t%POS\t%INFO/DP\n' in.vcf.gz
/
#printing out the sets assigned by GATK CombineVariants
~/bin/bcftools-1.6/bcftools query -f '%set\n' out_combine.vcf.gz |sort |uniq
/
0#printing a list of samples from a VCF:
bcftools query -l test.vcf
/
#also, the FORMAT annotations can be obtained by:
~/bin/bcftools/bcftools query -f '[%GT]\n' ../0002.vcf |wc -l #the GT in this case
/
*selecting snps from file:
~/bin/bcftools/bcftools view -v snps lc_bams.bcftools.20170319.NA12878.vcf.gz 
/
*selecting the variants from a VCF (excluding 0|0 genotypes)
bcftools view -c1 input.vcf
/
*selecting the non-variants from a VCF(AC=0)
bcftools view -H -C0 concat.allchrs.sites.vcf.gz 
/
#filtering:
/
#using one of the INFO annotations (IDV)
bcftools filter -sFilterName -e'IDV<5' input.vcf
/
#OR logical operator:
bcftools filter -sFilterName -e'DP>50000 | IDV<9' input.vcf
/
#filtering on FORMAT annotation:
bcftools filter -sFilterName -e'FORMAT/DP<5' input.vcf
/
#filtering on INFO annotation:
bcftools filter -sFilterName -e'INFO/DP<5' input.vcf
/
#printing out variants that pass the filter:
~/bin/bcftools/bcftools view -f.,PASS lc_bams.bcftools.20170411.exc.norm.SNP.filtered.vcf.gz
/
#bcftools stats and filtering:
~/bin/bcftools/bcftools stats -f "PASS,." file.vcf
/
#select only biallelic (excluding multiallelic) snps
bcftools view -m2 -M2 -v snps input.vcf.gz
/
#select only the multiallelic snps
bcftools view -m3 -v snps input.vcf.gz
/
#printing the set info in the INFO field:
bcftools view -i 'set="freebayes_lcex"' combined.all.chr20.vcf.gz
/
#printing all entries having a quality <10
bcftools view -i 'QUAL<10' in.vcf.gz
/
#removing FORMAT column from the VCF
#it will remove all FORMAT annotations except the GT information
bcftools annotate -x FORMAT ifile.vcf.gz
/
#removing INFO field from VCF
bcftools annotate --remove INFO in.vcf.gz
/
#annotating a vcf file using the annotations from a different VCF (in this case we only annotate the INFO/DP)
bcftools annotate -c 'INFO/DP' -a annt.vcf.gz in.vcf.gz
/
#annotating a vcf file with a tabular file:
see page https://github.com/samtools/bcftools/wiki/HOWTOs#annotate-from-bed
/
#drop individual genotype information
bcftools view -G input.vcf.gz
/
#getting stats on the number of REF/ALT swaps and other things:
bcftools +fixref file.bcf -- -f ref.fa
/
#correcting the REF/ALT swaps:
bcftools norm --check-ref ws -f ref.fa in.vcf.gz -o out.vcf.gz -Oz
//
#changing the sample names in a VCF:
#the samplenames.txt file has the following format:
#oldsamplename newsamplename
bcftools reheader -s samplenames.txt NA12878.giab.SNP.chr20.non_valid.vcf.gz -o NA12878.giab.SNP.chr20.non_valid.reheaded.vcf.gz
//
#changing the header:
bcftools reheader -h newheader.txt ../combined.all.chr20.vcf.gz.snps.biallelic.vcf_chr20.vcf.gz.ensembl.vcf.gz.phased.ligated.NA12878.ucsc.vcf.gz -o combined.all.chr20.vcf.gz.snps.biallelic.vcf_chr20.vcf.gz.ensembl.vcf.gz.phased.ligated.NA12878.ucsc.reheaded.vcf.gz
//
# in order to use the plugins:
export BCFTOOLS_PLUGINS=~/bin/bcftools-1.6/plugins/
//
# taqg2tag:
# convert PL to GL
bcftools +tag2tag in.vcf -- -r --pl-to-gl
//
# normalizing the multiallelic variants:

# with -any I will split the multiallelic variants (SNPs+INDELs) into several records
bcftools norm -m -any in.vcf.gz -o out.norm.vcf.gz -Oz

For example:

chr20   60280   .       TTTCCA  TTTCCATTCCA,T   744     PASS    .

Will be converted to:

chr20   60280   .       TTTCCA  TTTCCATTCCA     744     PASS    .
chr20   60280   .       TTTCCA  T       744     PASS    .
//
#selecting the missing (uncalled) genotypes:
bcftools view -u in.vcf.gz -o missing_genotypes.vcf.gz -Oz
//
#select a particular genotype (0/1 or 1/1) from a vcf. In this case access sample accessed by index 8:
bcftools view -H combined.snps_indels_chr1.filt.vcf.gz.onlyvariants.vcf.gz.ensembl.vcf.gz.85706.vcf.gz -i 'GT[8]="het"'
//
#select all lines having exactly AC=2
bcftools view -i'AC=2' in.vcf.gz
//
# if we have a tab in a VCF defined in the header like:
##INFO=<ID=GRCH37_38_REF_STRING_MATCH,Number=0,Type=Flag,Description="Indicates reference allele in origin GRCh37 vcf string-matches reference allele in dbsnp GRCh38 vcf">
/
#we can check for the records having this tag by doing:
bcftools view -H -i'GRCH37_38_REF_STRING_MATCH=1' ALL.chr7_GRCh38.genotypes.20170504.ensembl.vcf.NA12878.biallelic.nonvariants.nonvalid.snps.vcf.gz |less
/
#And the contrary by doing:
bcftools view -H -i'GRCH37_38_REF_STRING_MATCH=0' ALL.chr7_GRCh38.genotypes.20170504.ensembl.vcf.NA12878.biallelic.nonvariants.nonvalid.snps.vcf.gz |less
/

#####
## vcf to bcf
bcftools convert -o milk_gene_snps_filtered_12_sahiwal -O u milk_gene_snps_filtered_12_sahiwal.vcf

## Bialleleic and 1% missing
bcftools view milk_gene_snps_filtered_12_sahiwal.bcf -m2 -M2 -v snps -i 'F_MISSING<0.05' -o BiAllelicSNPs_NOMissing.bcf -O u
bcftools view milk_gene_snps_filtered_12_sahiwal.bcf --max-alleles 2 --exclude-types indels -o BiAllelicSNPs.bcf -O u