VCF_Manipulation_Filtering - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Workshop: Manipulating & Filtering VCF Files with VCFtools
Workshop Overview
In this workshop, students will learn how to use VCFtools to perform various filtering operations on Variant Call Format (VCF) files. VCFtools is a commonly used bioinformatics toolkit for processing and analyzing VCF files, particularly in population genetics studies.
By the end of this workshop, you will learn how to:
- Filter VCF files by specific positions or regions.
- Filter variants by type, allele frequency, genotype quality, and other criteria.
- Extract specific samples or regions.
Prerequisites
-
VCFtools installed (We have installed yesterday)
-
Using
conda activate WGS_analysis
to activate the conda environment -
Make a new directory to store the output vcf
mkdir VCF_output
Filtering by Position
1. Filtering by Chromosome
You can filter variants by specific chromosomes using the --chr
or --not-chr
options. These options can be used multiple times to include or exclude multiple chromosomes.
vcftools --vcf SRR629180_chr1.vcf --chr NC_004325.2 --recode-INFO-all --recode --out VCF_output/keep_chr1
To exclude specific chromosomes:
vcftools --vcf SRR629180_chr1.vcf --not-chr NC_004325.2 --recode-INFO-all --recode --out VCF_output/excl_chr1
2. Filtering by Region
To filter by specific regions, use --chr
, --from-bp
, and --to-bp
together to specify the region of interest.
vcftools --vcf SRR629180_chr1.vcf --chr NC_004325.2 --from-bp 10000 --to-bp 20000 --recode-INFO-all --recode --out VCF_output/keep_region
4. Filtering by BED File
First, you can generate a BED file using vim regions.bed
NC_004325.2 1 1000
NC_004325.2 9000 10000
You can also filter by regions specified in a BED file using --bed
or --exclude-bed
.
vcftools --vcf SRR629180_chr1.vcf --bed regions.bed --recode-INFO-all --recode --out VCF_output/keep_bedregion
To exclude regions specified in a BED file:
vcftools --vcf SRR629180_chr1.vcf --exclude-bed regions.bed --recode-INFO-all --recode --stdout VCF_output/excl_bedregion
Filtering by Variant Type
1. Filtering Indels
To retain only INDELs, use --keep-only-indels
:
vcftools --vcf SRR629180_chr1.vcf --keep-only-indels --recode-INFO-all --recode --out VCF_output/indels_only
To remove INDELs, use --remove-indels
:
vcftools --vcf SRR629180_chr1.vcf --remove-indels --recode --recode-INFO-all --out VCF_output/snps_only
Filtering by Allele and Genotype
1. Filtering by Allele Frequency
To filter by Minor Allele Frequency (MAF), use --maf
or --max-maf
:
vcftools --vcf SRR629180_chr1.vcf --maf 0.05 --recode --recode-INFO-all --out VCF_output/filtered_maf
To filter by non-reference allele frequency:
vcftools --vcf SRR629180_chr1.vcf --non-ref-af 0.1 --recode --recode-INFO-all --out VCF_output/filtered_non_ref_af
2. Filtering by Minor Allele Count
To filter by Minor Allele Count (MAC), use --mac
or --max-mac
:
vcftools --vcf SRR629180_chr1.vcf --mac 5 --recode --recode-INFO-all --out VCF_output/filtered_mac
3. Filtering by Number of Alleles
To retain variants with a specific number of alleles, use --min-alleles
and --max-alleles
:
vcftools --vcf SRR629180_chr1.vcf --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out VCF_output/filtered_alleles
Filtering by Coverage and Quality
1. Filtering by Coverage Depth
To filter based on mean coverage depth, use --min-meanDP
or --max-meanDP
:
vcftools --vcf SRR629180_chr1.vcf --min-meanDP 3 --recode --recode-INFO-all --out VCF_output/filtered_depth
2. Filtering by Missing Data
To filter variants based on missing data rate, use --max-missing
:
vcftools --vcf SRR629180_chr1.vcf --max-missing 0.8 --recode --recode-INFO-all --out VCF_output/filtered_missing
To filter based on the count of missing individuals, use --max-missing-count
:
vcftools --vcf SRR629180_chr1.vcf --max-missing-count 10 --recode --recode-INFO-all --out VCF_output/filtered_missing_count
3. Filtering by Quality Score
To retain variants with a quality score greater than a specific value, use --minQ
:
vcftools --vcf SRR629180_chr1.vcf --minQ 30 --recode --recode-INFO-all --out VCF_output/filtered_quality
Filtering by Samples
Before this part's practice, you can make symbolic link of the VCF file I have generated of Chr1 of Pf7 dataset to your own directory
ln -s /home/renzirui/dataset/Pf7_practice_chr1.recode.ks.vcf ./Pf7_practice_chr1.vcf
1. Filtering Specific Samples
To retain or remove specific individuals, use --indv
or --remove-indv
:
vcftools --vcf Pf7_practice_chr1.vcf --indv PE0423-C --recode --recode-INFO-all --out VCF_output/keep_indv
vcftools --vcf Pf7_practice_chr1.vcf --remove-indv PE0423-C --recode --recode-INFO-all --out VCF_output/remove
multiple assignments of the --indv
or --remove-indv
parameter is allowed:
vcftools --vcf Pf7_practice_chr1.vcf --indv PE0423-C --indv PE0128-C --recode --recode-INFO-all --out VCF_output/keep_indv
To retain or remove multiple individuals from a file, use --keep
or --remove
:
First you can generate a list of samples to keep via vim samples_to_keep.list
PE0423-C
PE0128-C
QS0126-C
QS0132-C
Then try to keep those samples in the list
vcftools --vcf Pf7_practice_chr1.vcf --keep samples_to_keep.list --recode --recode-INFO-all --out VCF_output/kept_samplelist
2. Randomly Retain Samples
To randomly retain a specific number of individuals, use --max-indv
:
vcftools --vcf Pf7_practice_chr1.vcf --max-indv 4 --recode --recode-INFO-all --out VCF_output/filtered_random_samples
Filter VCF file for downstream analysis
First make a new directory for Today's VCF filtering
mkdir VCF_filtering
cd VCF_filtering
Then copy the link into this directory
cp ../Pf7_practice_chr1.vcf ./
- We start with filtering individual: using vcftools to calculate missing rate of each individual using
--missing-indv
vcftools --vcf Pf7_practice_chr1.vcf --missing-indv
awk '$5 > 0.2' out.imiss | cut -f1 > lowDP.indv
- Then we remove those individuals with high missing rate generated by the above step, and then filtering the mutation sites
vcftools --vcf Pf7_practice_chr1.vcf --max-missing 0.8 --max-alleles 2 --minQ 30 --remove-filtered-all --remove lowDP.indv --recode --recode-INFO-all --out Pf7_practice_chr1.filtered
-
Finally we filter the vcf to keep those snp sites
cat Pf7_practice_chr1.filtered.recode.vcf > Pf7_practice_chr1.filtered.snps.vcf
Above are a classic filtering on both individual and the variation sites for downstream analysis.