Manual model fitting - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Introduction

This exercise has one sole purpose - understanding the logic behind fitting genome models to k-mer spectra, in particular using GenomeScope genome model.

K-mer spectra can have many forms and shapes dependent on both biology and the sequencing technique, but let's use an idealised case - a k-mer spectra, that does not have any sequencing errors, or repetitive DNA. It has been manually constructed from a k-mer spectra of Timema cristinae (all the sequencing reads associated with this biosample). We calculaded k-mer spectra (as in the last tutorial) and then manually trimmed both sides so it is close to the "ideal case". This is how the idealised k-mer spectra looks like:

simplified_linear_plot

See? All the k-mers are unique, the error rate is 0. It can't get much better, so, let's fit an analogous model ouserlves.

  1. Download the idealised k-mer spectra here.
  2. open R, load the k-mer spectra and plot it. From now on, all the code is going to be in R.
ideal_kmer_spec <- read.table('Timema_cristinae_kmer_k21_simplified.hist', col.names = c('cov', 'freq'))
plot(ideal_kmer_spec$cov, ideal_kmer_spec$freq, type = 'l', xlab = 'Coverage', ylab = 'Frequency', main = 'ideal')

Of course, this looks the same as... the plot above (it's mostly that you see that the data is really the same).

There are just two peaks, so the model we will fit will be something like

frequency ~ 1n peak +
            2n peak

We could choose from various distributions to model the two peaks, but it just happens, that negative binomial, fits sequencing coverage very well. Negative bionomial has usually two parameters stopping parameter and the success probability, however, that is not a convinient notion for us, we will therefore use a reparametrised version of negative binomial with mean 'mu', and 'size', the dispersion parameter. So in our model, we will fit mean, and the dispersion parameter and the model can look like this

Probability ~ α NB(  kcov, dispersion) +
              β NB(2*kcov, dispersion2)

Now, the α and β will be the contributions of the two peaks. Notice that we fit the mean of the first peak with the same parameter as the mean of the second peak s, just multiplied by a factor of 2. That is because the heterozygous loci have 1/2 of the coverage of the homozygous loci and therefore, we actually desire to co-estimate the haploid coverage jointly for both of the at the same time.

There are still a few adjustments we need to do to the model to finally make a reasonable fit. In the last model we parametrise the dispersion parameters of the two models independently, that is however unnecessary - with increased coverage, the variance in sequencing depth increases too, but the "overdispersal" scales the same for the whole sequencing run. We therefore use the same trick we used for the coverage and fit the overdispersion as a single parameter scaled by the mean

Probability ~ α NB(  kcov,   kcov / overdispersion) +
              β NB(2*kcov, 2*kcov / overdispersion)

There are a few more things to add, but this is a minimal viable product. So, we can attempt to fit this model using non-linear least squared method (nls). This is a process that is very similar to classical least square method, but instead of calculating the exact best fit, it iterates through a range of possible solutions and finds the best one among them. These techniques are powerful, for they can fit parameters to any kind of model really, but they are also very sensitive to the initial parameters and the strategy of the parameter search.

  1. Fit and plot the following model
y <- ideal_kmer_spec$freq
x <- ideal_kmer_spec$cov
gm1 <- nls(y ~ a * dnbinom(x, size = kcov   / overdispersion, mu = kcov    ) +
               b * dnbinom(x, size = kcov*2 / overdispersion, mu = kcov * 2),
           start = list(a = 300e6, b = 500e6, kcov = 25, overdispersion = 0.5),
           control = list(minFactor=1e-12, maxiter=40))

summary(gm1)
Model interpretation

Alright, we see that a (my α) is about 1/2 of b (β). That means there is about twice as much homozygous to heterozygous k-mers (the estimates are the estimates of how many exactly). We also see that the k-mer coverage is 27.6x. That is already quite some information, but nothing really

To plot the model, we need to get the values predicted within the range of our data. Let let's define the model as a function

predict_basic_model <- function(x, a, b, kcov, overdispersion){
    a * dnbinom(x, size = kcov   / overdispersion, mu = kcov    ) +
    b * dnbinom(x, size = kcov*2 / overdispersion, mu = kcov * 2)
}

gm1_y <- predict_basic_model(x, coef(gm1)['a'], coef(gm1)['b'], coef(gm1)['kcov'], coef(gm1)['overdispersion'])

and now we can finally plot it.

plot(ideal_kmer_spec$cov, ideal_kmer_spec$freq, type = 'l', xlab = 'Coverage', ylab = 'Frequency', main = 'ideal')
lines(ideal_kmer_spec$cov, gm1_y, lty = 2, col = 'red')

Does the model fit well the data? Ya, but there are so many annoying things about it... for instance, what is the genome size? Or what is the heterozygosity? Let's make a few last modification so the model will give us these estimates too.

  1. Fit a model that estimates heterozygosity

We can express α and β using probability of a heterozygous nucleotide (r) and the k-mer length (k). Look at the cascade of following expressions

Screenshot 2023-02-07 at 13 17 51

With this we could estimate the heterozygosity using k as a constant, while assuming that the heteorzygosity (probability of a nucleotidy being heterozygous) is the same for each nucleotide across the genome. Note there is a one less parameter too, instead of α and β (or a and b), there is just r, but there is one more consideration - while our previous model features large α and β coeficients, this model expects them to be relative proportions of the two peaks (the probability of heterozygous vs homozygous k-mers), which means we need to multiple the joint distribution by the genome length - this will be a one more added parameter to the equation, leading to the same total number as it was in the last case.

k <- 21
gm2 <- nls(y ~ ((2*(1-(1-r)^k)) * dnbinom(x, size = kcov   / overdispersion, mu = kcov    ) +
                ((1-r)^k)       * dnbinom(x, size = kcov*2 / overdispersion, mu = kcov * 2)) * length,
           start = list(r = 0, kcov = 25, overdispersion = 0.5, length = 1000e6),
           control = list(minFactor=1e-12, maxiter=40))
summary(gm2)

So, what do you think? The heterozygosity seems to be in the right ballpark (0.99% by GenomeScope vs 0.98% by our model), K-mer coverage is 27.6x in both models, and the genome size is also remarkably similar ~760Mbp.

  1. Fiddle around starting values

Now, try to fit the same model, while changing the starting values (e.g. r = 0.02 or 0.2). Have the starting condition affected the model?

What do you think singular gradient means?

That is when the least-square surface is flat around the initial values, which leads to impossibility to optimize the model, in other words, the initial guesstimates need to be in the right ballpark.

Till now we had an idealised case - trimmed of all repetitions and errors. In truth, the unidealised version of the k-mer spectra still looks pretty good and is available here. See check the genomescope model.

real_linear_plot

  1. So what's the effect of the "unidelised version"? Is the fit still good? Which estimated values get affected the most?
Solution

The model explains the data very well in both cases, the 1n coverage and heteoryzogisty are practically the same, the error peak also seems "belivable" (mostly made of sequencing errors). Notice however, how the genome size has changed - that is for all the repetitive DNA that is not part of the k-mer spectra. We will cover later in a different tutorial (TODO link genome size estimate).

  1. What happens if you fit your own model to the full k-mer spectra? Load that k-mer spectra and fit the model and compare it to the estimates by genomescope.
Solution
real_kmer_spec <- read.table('Timema_cristinae_kmer_k21.hist', col.names = c('cov', 'freq'))
x <- real_kmer_spec$cov
y <- real_kmer_spec$freq
k <- 21 

gm3 <- nls(y ~ ((2*(1-(1-r)^k)) * dnbinom(x, size = kcov   / overdispersion, mu = kcov    ) +
                ((1-r)^k)       * dnbinom(x, size = kcov*2 / overdispersion, mu = kcov * 2)) * length,
           start = list(r = 0, kcov = 25, overdispersion = 0.5, length = 1000e6),
           control = list(minFactor=1e-12, maxiter=40))
summary(gm3)

So, we did not model the error peak, but the coverage and heterozygosity is moreless right. This is in part because there are not many duplications in the genome (no bumps around 3n and 4n coverages), but also the majority of this estimate does come from the unique portion of the genome. THe main difference is however the genome length. I don't think this comes as a surprise, given our model does not take into account anything repetitive.

  1. Plot the real k-mer spectra with the model. Do you see peaks? If not, where do you think is the problem?
Solution

It happens that k-mers that are in the dataset once vastly dominate the dataset and totally rescale the y-axis. Furthermore, the full histogram can have thousands or even millions of coverages represented which stretches the x-axis outside of the region of interest. So, those genomic peaks are still there, but just squashed on both axes to a very new pixels. There are tons of ways around that, here is one

xlim_range <- c(1, coef(gm3)['kcov'] * 5) # plotting only up to 5x estimated coverage
ylim_range <- range(real_kmer_spec$freq[-c(1:5)]) # range of frequencies omitting coverages 1 to 5

plot(real_kmer_spec$cov, real_kmer_spec$freq, type = 'l', xlab = 'Coverage', ylab = 'Frequency', main = 'real', ylim = ylim_range, xlim = xlim_range)

predict_second_model <- function(x, r, k, kcov, overdispersion, length){
    ((2*(1-(1-r)^k)) * dnbinom(x, size = kcov   / overdispersion, mu = kcov    ) +
     ((1-r)^k)       * dnbinom(x, size = kcov*2 / overdispersion, mu = kcov * 2)) * length
}

gm3_y <- predict_second_model(real_kmer_spec$cov, coef(gm3)['r'], k, coef(gm3)['kcov'], coef(gm3)['overdispersion'], coef(gm3)['length'])

lines(real_kmer_spec$cov, gm3_y, lty = 2, col = 'red')

What's next

You hopefully understand the logic behind the combinatorics of genome modelling. To get a better appreciation of GenomeScope, we recommend reading the supplementary materials of the original GenomeScope paper that explains the model in full - e.g. how to extend the same logic to include also genomic duplications, estimating the error rate etc. These models were later extended to polyploids, where the combinatorics gets even wilder, check for a polyploid versions of the model.

👆 Go back to Table of Content

👉 ⚒ Fit some simple diploid (but real) genome models of notorious organisms.

👉 📖 Read about common difficulties in characterisation of diploid genomes using k-mer spectra analysis

⚠️ **GitHub.com Fallback** ⚠️