Running freebayes variant calling - SIWLab/Lab_Info GitHub Wiki

Freebayes is a haplotype-based bayesian variant caller. In its most simple form the only input required is a reference fasta file and one or more .bam files aligned to the same reference; the output is a .vcf file containing a list of SNPs/indels for each sample. Running freebayes on a small number of samples can be quite quick depending on genome size, coverage, repeat content, etc. Obtaining a gvcf (all sites in the genome, including invariant ones) is more time consuming. It can be parallelized to speed things up considerably. See the documentation or enter freebayes --help in the terminal for more details.

## basic example
freebayes -f ref.fa aln.bam >var.vcf

This is sufficient for one sample, but depending on your genome size, coverage, repeat content, etc. it might take a really really long time. The solution is to run in parallel. Freebayes has a built in wrapper that runs freebayes over a set of regions simultaneously using GNU parallel. When running this on the server I've run into problems since their script results in many open files which quickly exceeds the user file limit. The example below does the same thing as the freebayes-parallel script but in a way that doesn't crash.

freebayes parallel

generate regions file

Freebayes comes with a script that easily generates a list of genomic regions from your reference. You will need a fasta index of your reference genome.

samtools faidx ref.fa
fasta_generate_regions.py ref.fa.fai 1000000 > regions.txt #this creates a list of 1Mb regions spanning the entire genome.

The regions list will look something like:

X:0-1000000
X:1000000-2000000
X:2000000-3000000
X:3000000-4000000
X:4000000-5000000
X:5000000-6000000
X:6000000-7000000
X:7000000-8000000
X:8000000-9000000
X:9000000-10000000
X:10000000-11000000 ...

Now we can run the freebayes variant caller on multiple regions of the genome at the same time. The example below also includes additional parameters and multiple .bam samples.

#!/bin/bash
#freebayes version 1.3.4
export TMPDIR=/ohta2/bianca.sacchi/tmp/
regionsfile=../data/variants_hap1/regions.txt
ncpus=10
reference=../data/genome_hap1/NChap1_split.fa
bamlist=bamlist1tx # contains list of bams (with paths)


cat "$regionsfile" | parallel -j "$ncpus" "freebayes -f $reference \
        --use-best-n-alleles 4 \
        --genotype-qualities \
        --bam-list $bamlist \
        --region {} \
        --vcf ../data/variants_hap1/tmp_tx/{.}.vcf" #tmp vcfs go here

Once this has finished running you will end up with a temporary directory full of bams for every region.

# concatenate temp vcfs in the tmp folder specified 
cat tmp/*.vcf | vcffirstheader| vcfstreamsort -w 1000 | vcfuniq > output.vcf