VCF_Manipulation_Filtering - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Workshop: Manipulating & Filtering VCF Files with VCFtools

Workshop Overview

In this workshop, students will learn how to use VCFtools to perform various filtering operations on Variant Call Format (VCF) files. VCFtools is a commonly used bioinformatics toolkit for processing and analyzing VCF files, particularly in population genetics studies.

By the end of this workshop, you will learn how to:

  1. Filter VCF files by specific positions or regions.
  2. Filter variants by type, allele frequency, genotype quality, and other criteria.
  3. Extract specific samples or regions.

Prerequisites

  • VCFtools installed (We have installed yesterday)

  • Using conda activate WGS_analysis to activate the conda environment

  • Make a new directory to store the output vcf

    mkdir VCF_output
    

Filtering by Position

1. Filtering by Chromosome

You can filter variants by specific chromosomes using the --chr or --not-chr options. These options can be used multiple times to include or exclude multiple chromosomes.

 vcftools --vcf SRR629180_chr1.vcf --chr NC_004325.2 --recode-INFO-all --recode --out VCF_output/keep_chr1

To exclude specific chromosomes:

vcftools --vcf SRR629180_chr1.vcf --not-chr NC_004325.2 --recode-INFO-all --recode --out VCF_output/excl_chr1

2. Filtering by Region

To filter by specific regions, use --chr, --from-bp, and --to-bp together to specify the region of interest.

vcftools --vcf SRR629180_chr1.vcf --chr NC_004325.2 --from-bp 10000 --to-bp 20000 --recode-INFO-all --recode --out VCF_output/keep_region

4. Filtering by BED File

First, you can generate a BED file using vim regions.bed

NC_004325.2 1 1000
NC_004325.2 9000 10000

You can also filter by regions specified in a BED file using --bed or --exclude-bed.

vcftools --vcf SRR629180_chr1.vcf --bed regions.bed --recode-INFO-all --recode --out VCF_output/keep_bedregion

To exclude regions specified in a BED file:

vcftools --vcf SRR629180_chr1.vcf --exclude-bed regions.bed --recode-INFO-all --recode --stdout VCF_output/excl_bedregion

Filtering by Variant Type

1. Filtering Indels

To retain only INDELs, use --keep-only-indels:

vcftools --vcf SRR629180_chr1.vcf --keep-only-indels --recode-INFO-all --recode --out VCF_output/indels_only

To remove INDELs, use --remove-indels:

vcftools --vcf SRR629180_chr1.vcf --remove-indels --recode --recode-INFO-all --out VCF_output/snps_only

Filtering by Allele and Genotype

1. Filtering by Allele Frequency

To filter by Minor Allele Frequency (MAF), use --maf or --max-maf:

vcftools --vcf SRR629180_chr1.vcf --maf 0.05 --recode --recode-INFO-all --out VCF_output/filtered_maf

To filter by non-reference allele frequency:

vcftools --vcf SRR629180_chr1.vcf --non-ref-af 0.1 --recode --recode-INFO-all --out VCF_output/filtered_non_ref_af

2. Filtering by Minor Allele Count

To filter by Minor Allele Count (MAC), use --mac or --max-mac:

vcftools --vcf SRR629180_chr1.vcf --mac 5 --recode --recode-INFO-all --out VCF_output/filtered_mac

3. Filtering by Number of Alleles

To retain variants with a specific number of alleles, use --min-alleles and --max-alleles:

vcftools --vcf SRR629180_chr1.vcf --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out VCF_output/filtered_alleles

Filtering by Coverage and Quality

1. Filtering by Coverage Depth

To filter based on mean coverage depth, use --min-meanDP or --max-meanDP:

vcftools --vcf SRR629180_chr1.vcf --min-meanDP 3 --recode --recode-INFO-all --out VCF_output/filtered_depth

2. Filtering by Missing Data

To filter variants based on missing data rate, use --max-missing:

vcftools --vcf SRR629180_chr1.vcf --max-missing 0.8 --recode --recode-INFO-all --out VCF_output/filtered_missing

To filter based on the count of missing individuals, use --max-missing-count:

vcftools --vcf SRR629180_chr1.vcf --max-missing-count 10 --recode --recode-INFO-all --out VCF_output/filtered_missing_count

3. Filtering by Quality Score

To retain variants with a quality score greater than a specific value, use --minQ:

vcftools --vcf SRR629180_chr1.vcf --minQ 30 --recode --recode-INFO-all --out VCF_output/filtered_quality

Filtering by Samples

Before this part's practice, you can make symbolic link of the VCF file I have generated of Chr1 of Pf7 dataset to your own directory

ln -s /home/renzirui/dataset/Pf7_practice_chr1.recode.ks.vcf ./Pf7_practice_chr1.vcf

1. Filtering Specific Samples

To retain or remove specific individuals, use --indv or --remove-indv:

vcftools --vcf Pf7_practice_chr1.vcf --indv PE0423-C --recode --recode-INFO-all --out VCF_output/keep_indv
vcftools --vcf Pf7_practice_chr1.vcf --remove-indv PE0423-C --recode --recode-INFO-all --out VCF_output/remove

multiple assignments of the --indv or --remove-indv parameter is allowed:

vcftools --vcf Pf7_practice_chr1.vcf --indv PE0423-C --indv PE0128-C --recode --recode-INFO-all --out VCF_output/keep_indv

To retain or remove multiple individuals from a file, use --keep or --remove:

First you can generate a list of samples to keep via vim samples_to_keep.list

PE0423-C
PE0128-C
QS0126-C
QS0132-C

Then try to keep those samples in the list

vcftools --vcf Pf7_practice_chr1.vcf --keep samples_to_keep.list --recode --recode-INFO-all --out VCF_output/kept_samplelist

2. Randomly Retain Samples

To randomly retain a specific number of individuals, use --max-indv:

vcftools --vcf Pf7_practice_chr1.vcf --max-indv 4 --recode --recode-INFO-all --out VCF_output/filtered_random_samples

Filter VCF file for downstream analysis

First make a new directory for Today's VCF filtering

mkdir VCF_filtering
cd VCF_filtering

Then copy the link into this directory

cp ../Pf7_practice_chr1.vcf ./
  • We start with filtering individual: using vcftools to calculate missing rate of each individual using --missing-indv
vcftools --vcf Pf7_practice_chr1.vcf --missing-indv
awk '$5 > 0.2' out.imiss | cut -f1 > lowDP.indv
  • Then we remove those individuals with high missing rate generated by the above step, and then filtering the mutation sites
vcftools --vcf Pf7_practice_chr1.vcf --max-missing 0.8 --max-alleles 2 --minQ 30 --remove-filtered-all --remove lowDP.indv --recode --recode-INFO-all --out Pf7_practice_chr1.filtered
  • Finally we filter the vcf to keep those snp sites

    cat Pf7_practice_chr1.filtered.recode.vcf > Pf7_practice_chr1.filtered.snps.vcf
    

Above are a classic filtering on both individual and the variation sites for downstream analysis.