17. Variant Calling with Freebayes - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Just assembling a genome is great but in reality just having a single assembly is not particularly useful if you want to investigate variation among individuals. Some of the utility of having a genome assembly comes in annotating the genes and other structural features but one thing evolutionary and functional biologists are interested in is what makes one genome different from others and how those differences manifest. One of the main principles of evolutionary theory is that all the action is happening in the variation among organisms, especially the variation among organisms within a species or population. For this exercise, you will learn a simple pipeline to call variant alleles in single individual. This could be expanded to multiple individuals but that’s for a later date (if we have time).

SETTING UP

All of these exercises can be conducted with an interactive session. No need to submit anything to the queue. However, in real analyses of complete eukaryotic genomes, you will likely want to do so. And you may wish to do so for this exercise for scheduling purposes because the analyses should take an hour or so to finish. For now though, these instructions assume you are going to use an interactive session.

Even though this is a relatively simple analysis using a limited data set, it never hurts to have plenty of computing power.

interactive -p nocona -c 36

It may take a while to get this many processors. If so, consider writing this exercise as a submission script after you get all the software installed and files downloaded so you can go about your life.

First, you’ll need to have some software in your path.

. ~/conda/etc/profile.d/conda.sh

conda activate

conda create --name varcall

conda activate varcall

conda config --set channel_priority strict

conda install -c conda-forge -c bioconda vcftools vcflib=1.0.3 freebayes=1.3.6 bwa rtg-tools bcftools bedtools samtools bowtie2 picard matplotlib gatk4

You could also try using mamba. It also may help to try and install them one at a time rather than all at once. That will allow you to determine which package is the most troublesome.

I encountered issues because samtools was out of date. You may as well. If necessary, there is a samtools module that is already installed on HPCC.

module load gcc/9.2.0 samtools/1.11

Make a directory for this project.

mkdir /lustre/scratch/[eraider]/gge2023/variants

Now, let’s get to work.

A PIPELINE FOR CALLING VARIANTS

The first step to any variant calling pipeline is to map your reads to the genome. In this case, we’re only going to map a small set of reads to a small part of the human genome. Specifically, we have a set of reads that map to a subregion of chromosome 22.

Step 1 - Get the chromosome assembly and the data to map

For this exercise, we will be calling variants for only a small region of human chromosome 22. You'll need that data to get this to work. Fortunately, it's available on the UCSC genome browser (along with ~20 billion other interesting things). I suggest you browse a little when you have a moment.

For now, get a copy of chromosome 22 for you to use.

mkdir -p /lustre/scratch/[eraider]/gge2023/data/chr22

cd /lustre/scratch/[eraider]/gge2023/data/chr22

The following command will download the data we need from the UCSC Genome Browswer.

wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz

And this will unzip it.

gunzip chr22.fa.gz

Now get some sample reads that I've collected for you.

cp -r /lustre/scratch/daray/gge2023/data/chr22/sample.tgz .

These data are compressed into what's called a tar archive. That's a way of combining multiple files into a single compressed file. You can decompress and separate those files using

tar -xvf sample.tgz

Your data will now consist of two sequence files in the data/chr22/sample directory.

Step 2 - Map the reads to the genome/genome region

As has been previously discussed, there are several packages for mapping reads to genomes but the major ones are samtools, bwa, and bowtie2. We will run all during this and the following exercise.

mkdir /lustre/scratch/[eraider]/gge2023/variants/bowtie2

cd /lustre/scratch/[eraider]/gge2023/variants/bowtie2

Create a link to chromosome 22 (no need to make copies unnecessarily).

ln -s /lustre/scratch/[eraider]/gge2023/data/chr22/chr22.fa

Do the same for each of the read pair files.

ln -s /lustre/scratch/[eraider]/gge2023/data/chr22/sample/pair.1.fq

ln -s /lustre/scratch/[eraider]/gge2023/data/chr22/sample/pair.2.fq

Index the chromosome fasta. We've already discussed indexing. Use the index name 'chr22'. Remember, you're using bowtie2 for this exercise. Make sure you use the methods associated with that software.

Map the reads as you've done before. Use as many as 128 processors on Nocona. This step took about 15 minutes for me using 36 processors. Run this job so that it creates a file called 'bowtie_chr22.sam' in your variants/bowtie2 directory. Then convert that .same file to a .bam file.

If you need to review .sam and .bam files, go back to exercise 10.

Step 3 – Sort the .bam file using samtools. This step took a few minutes. Then create an index for the sorted .bam file. You should end up with a file that is name 'bowtie_chr22_sort.bam' and a corresponding index file.

Step 4 - Call variants using Freebayes (https://github.com/ekg/freebayes).

We could also use GATK (https://software.broadinstitute.org/gatk/), another popular package. However, GATK rather complicated to use and tends to involve much more pre-processing. It’s also thought to be a bit more accurate. We’re going to start with Freebayes. Freebayes is a Bayesian variant caller .

From the freebayes site:

Freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

Freebayes is haplotype-based, in the sense that it calls variants based on the literal sequences of reads aligned to a particular target, not their precise alignment. This model is a straightforward generalization of previous ones (e.g. PolyBayes, samtools, GATK) which detect or report variants based on alignments. This method avoids one of the core problems with alignment-based variant detection--- that identical sequences may have multiple possible alignments:

For demonstration purposes, we're going to time how long this process takes using the time command. That time part of the following line is not necessary to run freebayes. It's just for us to check how long it takes for this exercise. It took me a few minutes.

Remember, you should have freebayes installed as part of your 'varcall' conda environment.

time freebayes -f chr22.fa bowtie_chr22_sort.bam > bowtie_chr22_calls.vcf

You could reduce time if you know a particular region where the reads should map. Of course, to do that you’d have to figure out where the reads are mapping in the genome or region of the genome. I've already done that for you. Note the addition of the -r option in the next attempt to run freebayes.

time freebayes -f chr22.fa -r chr22:10536000-50806000 bowtie_chr22_sort.bam > bowtie_large_region_calls.vcf

You’ll notice that it didn’t make much difference in time in this case. How many bases were included in the analysis? ~40 million bp (50806000 - 10536000 = ~40.27 Mb).

Try this, reduce the mappable region to only ~10.3 Mb:

time freebayes -f chr22.fa -r chr22:10536000-20806000 bowtie_chr22_sort.bam >bowtie_small_region_calls.vcf

Step 5 – Identifying the variants called.

Take a look at the output file that’s in .vcf format.

less -S bowtie_chr22_calls.vcf

There is the header:

##fileformat=VCFv4.2
##fileDate=20190730
##source=freeBayes v1.1.0-54-g49413aa
##reference=chr22.fa
##contig=<ID=chr22,length=50818468>
##phasing=none
##commandline="freebayes -f chr22.fa -r chr22:10536000-50806000 sample_sort.bam"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
...

Depending on your data, the header could be of various lengths. Many details on the data are provided in the header. For example, what filtering criteria were used, what quality thresholds were invoked, what program and version was used to generate this file, etc.

Followed by the actual variant calls.

...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	unknown
chr22	10587558	.	A	C	3.80278	.	AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=7.35324;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=13;MQMR=0;NS=1;NUMALT=1;ODDS=0.33611;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=45;QR=0;RO=0;RPL=2;RPP=7.35324;RPPR=0;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	1/1:2:0,2:0:0:2:45:-1.95215,-0.60206,0
chr22	10587575	.	A	G	3.01745	.	AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=7.35324;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=13;MQMR=0;NS=1;NUMALT=1;ODDS=0.00266737;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=40;QR=0;RO=0;RPL=2;RPP=7.35324;RPPR=0;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	1/1:2:0,2:0:0:2:40:-1.80734,-0.60206,0
chr22	10587605	.	G	A	0.000192697	.	AB=0;ABP=0;AC=0;AF=0;AN=2;AO=2;CIGAR=1X;DP=5;DPB=5;DPRA=0;EPP=3.0103;EPPR=3.73412;GTI=0;LEN=1;MEANALT=1;MQM=2.5;MQMR=24;NS=1;NUMALT=1;ODDS=10.5428;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=41;QR=69;RO=3;RPL=1;RPP=3.0103;RPPR=3.73412;RPR=1;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=3;SRP=9.52472;SRR=0;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	0/0:5:3,2:3:69:2:41:0,-1.03576,-5.12957
…. and so on.

There's no way to go through everything that's in each row, especially that last bit. But we can discuss the first few columns.

Field number Field name Description
1 CHROM The name of the chromosome/scaffold/contig on which the variant is being called.
2 POS The position (1-based) of the variant on the chromosome/scaffold/contig.
3 ID The identifier of the variant, if one exists. For example, a known SNP in dbSNP is rs6054257. In our case there are no known SNPs, so '-'.
4 REF The identity of the variant in the reference sequence. For example in the human chr22 we're using, the base at position 10587558 is 'A'.
5 ALT The identity of the variant in the mapped data.
6 QUAL The quality score of the inferred variant.
7 FILTER A flag indicating what filters were passed. PASS if all were passed.
8 INFO A possibly extensive list of values. Too many to discuss here.
9 FORMAT An optional set of fields describing the samples.

Get some basic stats. vcfstats is a convenient and easy to use tool. It generates a simple text file that has some basic information about the variants called.

mkdir nofilter

rtg vcfstats bowtie_chr22_calls.vcf >nofilter/bowtie_chr22_vcfstats.txt

Alternatively, we can use bcftools to get some additional information:

bcftools stats -F chr22.fa -s - bowtie_chr22_calls.vcf >nofilter/bowtie_chr22_bcfstats.txt

Note that the -s option can be used to include multiple vcf files for analysis.

A second tool, vcffilter, allows you to select calls with only quality values that you find acceptable. For example, the following command will create a new vcf file that only includes variants with a quality score of 20 or better. How many SNPS had call qualities better than 20 (considered high quality)?

mkdir filter_gt20
vcffilter -f "QUAL > 20" bowtie_chr22_calls.vcf >filter_gt20/bowtie_chr22_calls_gt20.vcf
rtg vcfstats filter_gt20/bowtie_chr22_calls_gt20.vcf> filter_gt20/bowtie_chr22_vcfstats_gt20.txt

You could do the same with bcftools. But in this case, you can pipe the output of vcffilter directly into bcftools:

vcffilter -f "QUAL > 20" bowtie_chr22_calls.vcf | bcftools stats -F chr22.fa -s - >filter_gt20/bowtie_chr22_bcfstats_gt20.txt

Finally, you can plot the results from bcftools to get a nice visual representation of your data.

plot-vcfstats -p filter_gt20 -t chr22_qual_gt20 -T chr22_qual_gt20 filter_gt20/bowtie_chr22_bcfstats_gt20.txt

plot-vcfstats -p nofilter -t chr22_nofilter -T chr22_nofilter nofilter/bowtie_chr22_bcfstats.txt

FOR YOU TO DO

  1. Use this page (http://www.htslib.org/doc/samtools.html) to determine what each option of each samtools command does. What is the sorting criterion for this step? What other criteria are possible?

  2. How long did it take to complete the first freebayes run (time freebayes -f chr22.fa bowtie_chr22_sort.bam > bowtie_chr22_calls.vcf)?

  3. What about for the two subsequent example runs that you did?

  4. Now imagine if you were working with the whole human genome but still only wanted to map to just this one section of chromosome 22 (chr22:10536000-20806000). Use the interwebs to find out what fraction of the human genome chromosome 22 represents and extrapolate to how long it would take to map our reads to the entire genome, only to determine that they mapped to only a single small region.

  5. Use vcffilter to create a .vcf file that consists only calls that scored 30 or higher. Call it bowtie_chr22_calls_gt30.vcf. How many SNPs meeting this criterion are there in the file? Provide a path to this file.

  6. How many of the variant calls with scores higher than 20 were insertions? Deletions? Compare that to the number of insertions and deletions in the original, unfiltered vcf.

  7. There's a new term introduced in output that we haven't discussed in class, 'mnp'. Knowing what you do about the term 'snp' and using context clues from the output, what do you think this is?

  8. Use vcffilter to create a vcf file with only variants that have a quality of less than 20. Call it bowtie_chr22_calls_lt20.vcf. Provide the path to that file.

  9. Use your web browser to go to the Variant Effect Predictor. Input your bwa_chr22_marked.vcf and determine whether your data includes any nonsense or missense mutations. How many of each? What proportion of your variants were in introns?

Submit all answers and files to Assignment 15 - Freebayes.