tutorial saccharomyces - KamilSJaron/smudgeplot GitHub Wiki

This tutorial is for v0.4.0.

version v0.2.4 - the corresponding version of this page

with hindsight

The default smudgeplot was misleading. There was a smudge on an unexpected place, and I thought it's because of too low cutoff (parameter L), but the problem was somewhere else. The sneaky part in this dataset was that there is no AB smudge and to figure that the "mysterious smudge" is just a problem of 1n coverage estimate and incorrect smudge annotation as a consequence. Once smudgeplot is told what is the expected monoploid coverage, everything starts to make sense.

Intro to the funky yeast

There is an open issue asking about a strange smudgeplot of supposedly tetraploid saccharomyces Knaus et al 2018(https://www.frontiersin.org/articles/10.3389/fgene.2018.00123/full).

strange_smudgeplot

Looking at the posted smudgeplot - yeah, it's really weird. Beside the AB smudge, there is also a one more ~ at the L * cov line (the error line). I am suspecting that the L was too low, but I got to investigate a tiny bit if I am right.

The analysis

Get the data

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_[12].fastq.gz

Use fastK to get a kmer database (kmer_db)

FastK -v -t10 -k21 -M16 -T4 *_[12].fastq.gz -Nkmer_db

I suspected that the L is too low, hence we should look at the kmer histogram. I fit there a default diploid GenomeScope model, because I more care about the shape of the histogram than anything else at this point

Histex -G kmer_db > kmer_k21.hist
genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Scer_genomescope

strange_genomescope

Alright. We see three peaks, ~15x, ~30x and ~60x. So the only question is if the smallest peak the 1n of the genome? I also expect that choosing L = 15 and high U (3000) would generate a smudgeplot similar to the one posted

smudgeplot.py hetmers -L 15 -t 4 -o kmerpairs_L15 --verbose kmer_db
smudgeplot.py all -o saccharomyces_L15 -t "yeast" kmerpairs_L15_text.smu

and indeed, it is somewhat similar. However, the updated version of smudgeplot actually deal much better with low-coverage datasets and even with the same settings the 1n coverage inference and the tetraploid structure of the genome is apparent.

How is that apparent? Well, only a tetraploid genome would predominantly feature tetraploid smudges (AAAB and AABB) while lacking any others. Nevertheless, if we chose lower L, we will likely see a much nicer plot as the smudges will be further from the coverage filter, say L = 5.

smudgeplot.py hetmers -L 5 -t 4 -o kmerpairs_L5 --verbose kmer_db
smudgeplot.py all -o saccharomyces_L5 -t "yeast" kmerpairs_L5_text.smu

And indeed we qualitatively see the same picture, but it looks cleaner

Now that we know we have a tetraploid and that we also have a good idea about the 1n coverage (around 16x), we can fit the genomescope model with the appropriate ploidy and coverage prior

genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -l 16 -n Scer_genomescope_p4

strange_genomescope2

And Voilà, the genome size is approximately right (14Mbp vs expected 12Mbp) and the spacing of the individual peaks seems to be exactly matching the tetraploid model. GenomeScope and Smudgeplot converged on similar 1n coverages and both look very sane in the context of expected biology. Genome profiling well done.