tutorial saccharomyces - KamilSJaron/smudgeplot GitHub Wiki
This tutorial is for v0.4.0.
with hindsight
The default smudgeplot was misleading. There was a smudge on an unexpected place, and I thought it's because of too low cutoff (parameter L), but the problem was somewhere else. The sneaky part in this dataset was that there is no AB smudge and to figure that the "mysterious smudge" is just a problem of 1n coverage estimate and incorrect smudge annotation as a consequence. Once smudgeplot is told what is the expected monoploid coverage, everything starts to make sense.
Intro to the funky yeast
There is an open issue asking about a strange smudgeplot of supposedly tetraploid saccharomyces Knaus et al 2018(https://www.frontiersin.org/articles/10.3389/fgene.2018.00123/full).
Looking at the posted smudgeplot - yeah, it's really weird. Beside the AB smudge, there is also a one more ~ at the L * cov line (the error line). I am suspecting that the L was too low, but I got to investigate a tiny bit if I am right.
The analysis
Get the data
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR326/001/SRR3265401/SRR3265401_[12].fastq.gz
Use fastK
to get a kmer database (kmer_db
)
FastK -v -t10 -k21 -M16 -T4 *_[12].fastq.gz -Nkmer_db
I suspected that the L is too low, hence we should look at the kmer histogram. I fit there a default diploid GenomeScope model, because I more care about the shape of the histogram than anything else at this point
Histex -G kmer_db > kmer_k21.hist
genomescope.R -i kmer_k21.hist -k 21 -p 2 -o . -n Scer_genomescope
Alright. We see three peaks, ~15x, ~30x and ~60x. So the only question is if the smallest peak the 1n of the genome? I also expect that choosing L = 15 and high U (3000) would generate a smudgeplot similar to the one posted
smudgeplot.py hetmers -L 15 -t 4 -o kmerpairs_L15 --verbose kmer_db
smudgeplot.py all -o saccharomyces_L15 -t "yeast" kmerpairs_L15_text.smu
and indeed, it is somewhat similar. However, the updated version of smudgeplot actually deal much better with low-coverage datasets and even with the same settings the 1n coverage inference and the tetraploid structure of the genome is apparent.
How is that apparent? Well, only a tetraploid genome would predominantly feature tetraploid smudges (AAAB and AABB) while lacking any others. Nevertheless, if we chose lower L, we will likely see a much nicer plot as the smudges will be further from the coverage filter, say L = 5.
smudgeplot.py hetmers -L 5 -t 4 -o kmerpairs_L5 --verbose kmer_db
smudgeplot.py all -o saccharomyces_L5 -t "yeast" kmerpairs_L5_text.smu
And indeed we qualitatively see the same picture, but it looks cleaner
Now that we know we have a tetraploid and that we also have a good idea about the 1n coverage (around 16x), we can fit the genomescope model with the appropriate ploidy and coverage prior
genomescope.R -i kmer_k21.hist -k 21 -p 4 -o . -l 16 -n Scer_genomescope_p4
And Voilà, the genome size is approximately right (14Mbp vs expected 12Mbp) and the spacing of the individual peaks seems to be exactly matching the tetraploid model. GenomeScope and Smudgeplot converged on similar 1n coverages and both look very sane in the context of expected biology. Genome profiling well done.