Concept of k mers - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

An introductory lecture is available on YouTube

What is a k-mer

In a genomic context, k-mers are substrings of nucleotides of length k contained within a biological sequence. This means any biological sequence can be decomposed into a number of k-mers, and this number depends on both the length of the sequence (L) and the k-mer length (k). For example, the sequence AAGTCCAT (L = 8) contains 7 k-mers of length 2 (2-mers), 6 3-mers, 5 4-mers, 4 5-mers, 3 6-mers and 2 7-mers: the number of k-mers in a sequence is always L - k + 1.
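The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool; the function name `kmers` is hypothetical and chosen only for this example. It slides a window of length k along the sequence and checks the L - k + 1 rule for the example sequence.

```python
def kmers(sequence, k):
    """Return every k-mer of `sequence`, in order of position."""
    # A window of length k fits at L - k + 1 start positions.
    # If k exceeds the sequence length, the range is empty and so is the list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "AAGTCCAT"  # example sequence from the text, L = 8

for k in range(2, 8):
    found = kmers(seq, k)
    # The number of k-mers always equals L - k + 1.
    assert len(found) == len(seq) - k + 1
    print(f"{k}-mers ({len(found)}): {found}")
```

For k = 2 this lists AA, AG, GT, TC, CC, CA and AT, the 7 2-mers counted in the text.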

[Figure: kmers_simple]

A sequence can be decomposed into k-mers whether it is an assembly, a read set, or any other sequence or set of sequences. K-mers are used for many purposes, so every rule has an exception, but usually decomposing a sequence leaves us with a set of k-mers and their respective frequencies. In practice, this means we lose the information about the genomic context, that is, where in the sequence each k-mer was located. This cost is compensated by the statistical power gained to learn about the genome.
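To make the "set of k-mers and their frequencies" concrete, here is a small sketch using Python's standard `collections.Counter`. The function name `count_kmers` and the two short reads are hypothetical, made up for this illustration. The sketch also assumes the simplest case: it does not merge a k-mer with its reverse complement, which dedicated k-mer counters commonly do.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count how often each k-mer occurs across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Two made-up overlapping reads (hypothetical data, for illustration only).
reads = ["AAGTCCAT", "GTCCATGA"]
counts = count_kmers(reads, 3)

# Only k-mers and their frequencies remain; the read each k-mer came from
# and its position within that read are not stored.
for kmer, frequency in sorted(counts.items()):
    print(kmer, frequency)
```

The result maps each 3-mer to a count and nothing else: the k-mers shared by both reads (GTC, TCC, CCA, CAT) get a count of 2, but the table no longer says where they occurred.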

k-mers of a whole-genome sequencing dataset

The power of k-mers is most obvious in the context of whole-genome sequencing.