demonstrating the effect of sequencing error rate on k mer coverage - KamilSJaron/k-mer-approaches-for-biodiversity-genomics GitHub Wiki

Prerequisites

UNIX environment with wget and other command line tools;
samtools (wgsim)
FastK
R and GenomeScope
(helpful) supplementary materials of "k-mer approaches for biodiversity genomics" manuscript, so you can match equations in this script to the manuscript

Download a dummy reference, simulate reads, and calculate k-mer histograms

I will simulate some some reads using these 5 scaffolds as a reference. They are 4917567 bp (~5Mbp) in total and it makes is very easy to calculate the expected coverage given number of reads.

I will simulate Illumina-like reads with some reasonable genomic parameters, while varying the error rate.

wget https://github.com/KamilSJaron/generic_genomics/raw/refs/heads/master/test/data/five_scaffolds.fa.gz 
for err in 0 0.001 0.005 0.01 0.02; do
    # wgsim simulates reads from a haploid genome in certain mutational distance to the reference (-r)
    # haplotype 1 will be the same as the reference 
    wgsim -N 500000 -1 125 -2 125 -r 0 -R 0.05 -X 0.3 -e $err -h -S 170725 five_scaffolds.fa.gz 20x_reads_"$err"_h1_R1.fq 20x_reads_"$err"_h2_R2.fq > 20x_variants_h1_"$err".bed 
    # haplotype 2 will be 0.5% diverged with some indels too
    wgsim -N 500000 -1 125 -2 125 -r 0.005 -R 0.05 -X 0.3 -e $err -h -S 170726 five_scaffolds.fa.gz 20x_reads_"$err"_h2_R1.fq 20x_reads_"$err"_h1_R2.fq > 20x_variants_h2_"$err".bed 
done

For the all the datasets, we calculate the k-mer histogram and git genomescope model.

for err in 0 0.001 0.005 0.01 0.02; do
    echo $error_rate
    FastK -v -t1 -k31 -M120 -T4 20x_reads_"$err"_h[12]_R[12].fq -NFastK_Table_"$err"
    Histex -G FastK_Table_"$err" > e"$err"_k31.hist
    genomescope.R -i e"$err"_k31.hist -o gs -n gs_"$err" -k 31
done

Of these genomescope runs, I extracted the fit numbers and pasted them (for simplicity) to R in the code below. One could also get them directly from the runs.

GenomeScope genome sizes and k-mer coverages

G <- 4917567
L = 2e6 # 5e5 of pair-end reads (R1 and R2) for both h1 and h2, 4 * 5e5 = 2e6
R = 125
k = 31
Cg = (L * R) / (G * 2) # equation in supplementary materials 2
# taken from GenomeScope plots (it's also possible to extract it from summaries, but this was quicker for 5 values)
gs <- data.frame(err = c(0, 0.001, 0.005, 0.01, 0.02), Ck = c(19.7, 19.1, 16.9, 14.4, 10.5), G_est = c(4809317, 4823948, 4816498, 4823948, 4842251))
G - gs[, 'G_est']
#  108250  93619 101069  93619  75316
# GenomeScope genome size is gently overestimating
# probably due to the difference between the model simualting reads, and GenomeScope

Gk_exp <- ((R - k + 1) * Cg) / R # equation is in supplementary materials 2
# 19.3185 - expected Gk coverage assuming no errors

Gk_exp_err_adjusted <- Gk_exp * (1 - gs[, 'err'])^k # equation is in supplementary materials 2
# adjusting for the sequencing error
# 19.31850 18.72852 16.53822 14.14700 10.32718

Gk_exp_err_adjusted - gs[, 'Ck']
# -0.3815037 -0.3714802 -0.3617778 -0.2530000 -0.1728219
# all Ck estimated by GenomeScope are within 0.5x of what we would theoretically expect 
# given simulated genome length, reads length, number of reads, and most of all - error rates

hists <- lapply(paste0("e", gs[, 'err'], "_k31.hist"), read.table) 

# now applying equation 
G_est_ignoring_errs <- sapply(hists, function(hist){ sum(hist[, 1] * hist[, 2]) } ) / (gs[, 'Ck'] * 2) # equation was on line 259, now in supplementary materials 2
# 4819472 4970884 5617962 6593324 9042268
G_est_ignoring_errs - G
# -98095.30   53316.61  700394.78 1675756.78 4124701.00
# the greater error rate is, the greater overestimate of the genome size the naive estimate has

If we are happy with everything, we can cleanup.

rm gs/*
rm *.hist
rm 20x*
rm FastK*
rm .FastK*

What's next

The next step is look into common patterns one can observe while genome profiling shown in the next sections, we display a diversity of genomes with very different characteristics.

👆 Go back to Table of Content

👉 ⚒ In case you missed it, check out an example of how to manually model! manual model fitting.

👉 📖 Read about some of the challenges faced when modeling diploid genomes Common difficulties in characterisation of diploid genomes using k mer spectra analysis