Genome modeling as a quality control - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

5. K-mer spectra as QC

K-mer spectra analysis is a powerful technique to detect various problems with sequencing libraries.

Lecture on YouTube: K-mers for quality control of genomic libraries

The lecture includes examples where genome profiling revealed various issues.

KAT for genome assembly QC

K-mer spectra can even be used to assess genome assemblies. In the last part of the lecture linked above, we explain how the plot is made and what we can learn from it.

For this tutorial we will use the tse-tse genome assembly TODO and reads in TODO. The full command looks something like this:

kat comp -t <number of threads> -o <output prefix, e.g. tse-tse_reads_vs_assembly> <input1-readfile> <input2-genomeassembly>

You could run it on an interactive node by first typing:

srun --ntasks=12 --mem-per-cpu=12G --time=02:00:00 --qos=devel --account=nn9458k --pty bash -i
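To get a feeling for how much this request asks of the cluster, the total memory can be worked out from the two flags. This is only a small sketch; it assumes SLURM's default of one CPU per task, which the command above does not override.

```shell
# Values copied from the srun command above
ntasks=12
mem_per_cpu_gb=12

# Assumption: one CPU per task (the SLURM default), so total memory
# is simply tasks x memory per CPU
total_gb=$(( ntasks * mem_per_cpu_gb ))
echo "Requested memory: ${total_gb}G across ${ntasks} tasks"
```

If the job sits in the queue for a long time, lowering either number is the first thing to try.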

Then run it like this (using, for example, one of the read files):

kat comp -t 8 -o tse-tse_reads_vs_assembly \
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/Glossina_14_R1.fastq.gz \
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/ncbi-genomes-2021-09-12/GCF_014805625.1_Yale_Gfus_2_genomic.fna.gz

This may take a while (about 30 minutes; you could ask for more resources when running srun). To keep the job quick, we selected only one of the smallest read files (Glossina_14_R1.fastq.gz), so the coverage in the histograms is not very high. But now you know how to run it on your own or other datasets!
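Once the run has finished, it helps to check what was written and, because the coverage here is low, to redraw the plot with a narrower x-axis. The sketch below makes some assumptions: that kat comp names its k-mer matrix `<prefix>-main.mx`, and that `kat plot spectra-cn` accepts `-x` for the x-axis maximum and `-o` for the output file. The value 100 and the output file name are just examples; check `kat plot spectra-cn --help` for your KAT version.

```shell
# Output prefix used in the kat comp call above
prefix=tse-tse_reads_vs_assembly

# Assumed name of the k-mer matrix written by kat comp
matrix="${prefix}-main.mx"

# List whatever kat comp produced under this prefix
ls -lh "${prefix}"* 2>/dev/null || echo "no output found for prefix ${prefix}"

# Redraw the spectra-cn plot with the x-axis cut at 100 (example value),
# but only if the matrix exists and kat is available
if [ -f "$matrix" ] && command -v kat >/dev/null 2>&1; then
    kat plot spectra-cn -x 100 -o "${prefix}.spectra-cn.x100.png" "$matrix"
fi
```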

To run it as a batch script with both your R1 and R2 reads, you can use something like:
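The script itself is not reproduced on this page, so what follows is only a minimal sketch of what such a script could look like; the copy in the course directory may differ. It takes the four arguments in the order used in the sbatch call further down (R1 reads, R2 reads, assembly, output prefix). The `#SBATCH` values other than the account are assumptions, and so is the function name `kat_two_read_files`. The sketch relies on kat comp treating a quoted, space-separated list of files as a single input.

```shell
#!/bin/bash
# Account taken from the srun command above; all other values are assumptions
#SBATCH --job-name=kat_comp
#SBATCH --account=nn9458k
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=12G

set -eu

# Compare k-mers from a pair of read files against an assembly.
# Arguments: <R1 reads> <R2 reads> <assembly> <output prefix>
kat_two_read_files() {
    reads_r1=$1
    reads_r2=$2
    assembly=$3
    out_prefix=$4
    # Use the CPUs granted by SLURM, or 8 when run outside a job
    threads=${SLURM_CPUS_PER_TASK:-8}
    # Passing both read files as one quoted, space-separated argument
    # makes kat comp treat them as a single read set
    kat comp -t "$threads" -o "$out_prefix" "$reads_r1 $reads_r2" "$assembly"
}

# Run only when all four arguments are given, e.g. via sbatch
if [ "$#" -eq 4 ]; then
    kat_two_read_files "$@"
else
    echo "usage: kat_two_read_files.sh <R1> <R2> <assembly> <output prefix>" >&2
fi
```

Depending on the cluster, KAT may first need to be made available (for example through a module or conda environment); that step is site-specific and left out here.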

We have a copy of this script in /cluster/projects/nn9458k/oh_know/teachers/kamil/data/kat_two_read_files.sh, so you can copy that one into your home directory :)

And then submit the job with:


mkdir -p data/sacharomyces/

wget TODO -O data/sacharomyces/TODO_R1.fastq.
wget TODO

sbatch kat_two_read_files.sh /cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/Glossina_12_R1.fastq.gz \
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/Glossina_12_R2.fastq.gz \
/cluster/projects/nn9458k/oh_know/teachers/kamil/data/tse-tse/ncbi-genomes-2021-09-12/GCF_014805625.1_Yale_Gfus_2_genomic.fna.gz \
tse-tse-ref_vs_reads