Estimating what k mer parameter to use for a de bruijn graph based assembly (ABySS, Soapdenovo, Velvet) - Green-Biome-Institute/AWS GitHub Wiki
The following is the current method that we use in order to approximate the k-mer parameter that is required by most of the short-read assembler that we use to do genome assembly.
To do this, we will be using the following tool, called Velvet Advisor, which can be found at:
https://dna.med.monash.edu/~torsten/velvet_advisor/
In order to use this tool, we need 5 pieces of information. This includes:
- Total number of reads
On your EC2 instance with your files downloaded, we can use the following commands to determine the number of reads:
zcat your-filename.fastq.gz | wc -l
will output the total number of lines within the uncompressed version of your .fastq.gz
file (zcat
is used to expand and view a compressed file without uncompressing it). If the files is not compressed, simply use: wc your-filename.fastq
If your file is a FASTQ file, then divide this number by 4 to get the total number of reads within that file. If it is a FASTA file, divide by 2 to get the total number of reads. (please refer here for a description of different sequencing filetypes)
- If the reads are paired-end or single-end
Hopefully you know if your reads are paired or single-end if it is your data. If the data is not yours, please look at the description of the data from wherever you downloaded it. Otherwise, the two pieces of identifying information here are going to be that 1) you either have 1 or 2 sequencing files and 2) within the filename you might see something like _01
and _02
or _R1
and _R2
. If you only have 1 sequencing file, it is likely not paired-end (unless you lost the other file). If you have 2 sequencing files and they have _01
and _02
or _R1
and _R2
in their name, those are probably paired-end files.
Lastly, you can also check by reading out the first read of both files using:
zcat GBI-2_Cmolli_S78_R1_001.fastq.gz | head -n 2
on both of your FASTQ or FASTA files. If the read label for the first read is the same within each file besides a 1:N:0:
and 2:N:0:
(or some other way that labels the reads as 1 and 2 for forward and reverse), then those are paired-end reads.
- Read length
This is also information you hopefully know if it is your data. If not, use the following command on your FASTQ files:
zcat GBI-2_Cmolli_S78_R1_001.fastq.gz | head -n 2 | awk '{ print length }'
The second number will be your read length. For reads of 151, I tend to trim the last base that tends to have a high error rate and get read lengths of 150. Same for reads of 101, I trim the last base and get read lengths of 100.
- Estimated genome size
Please refer to the following page here for a protocol to determine an estimated genome size from your reads.
- Desired k-mer coverage for your assembly
The Velvet assembler gives some information regarding this value at the bottom of the page:
"The choice of k is a trade-off between specificity and sensitivity. Longer kmers bring you more specificity (i.e. less spurious overlaps); but reduce depth of coverage. The sweet spot is related to the properties of your genome and to the error rate in the reads...
Experience shows that kmer coverage should be above 10 to start getting decent results. If Ck is above 20, you might be "wasting" coverage. Experience also shows that empirical tests with different values of k are not costly to run."
Further information found at the bottom of the above Velvet assembler page: http://ivory.idyll.org/blog/the-k-parameter.html
Typically, I will input 10, 20, 30, 40, 50, and 60 to see what k-mer value is suggested and then mark down that range.
For example if I have 600 million paired-end reads with read lengths of 150 and i estimate my genome size to be 500 megabases, then use 10, 20, and 30 for the k-mer coverage, I get a range of:
10: 147 20: 143 30: 139 40: 135 50: 131 60: 127
From within this range, I will choose 2 or 3 values to start with, say 127 and 139. I will assemble with these values and then using the results of those assemblies, choose another value or two to assemble with. In this example, say my assembly with k=127 seems to have a higher BUSCO score and better statistics, I may then choose 123 and 131 and do another two assemblies with those two k-mer parameters.