ska weed - simonrharris/SKA GitHub Wiki
The weed subcommand allows kmers to be removed (or weeded) from split kmer files. This makes it possible to remove kmers from sequences that may cause problems for downstream analysis, such as mobile genetic elements, contaminant DNA or adapter sequences. Removing these kmers also speeds up analysis and reduces file sizes.
When using the split kmer files for mapping against a reference, this approach can be used to 'mask' regions of the reference genome.
There are two ways to weed kmers using ska weed:
- Provide a split kmer file (with the -i option) containing the split kmers to be removed from the input files. Note that matching kmers may differ at the middle base, so kmers can be removed even if they contain some variation.
- If multiple split kmer files are provided, kmers can be removed if they are present in fewer or more than N samples (defined with the -m and -M options respectively), or in less or more than a given proportion of the samples (defined with the -p and -P options respectively). The idea here is that split kmers that are absent from a large proportion of the samples probably belong to the accessory genome or come from contamination.
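The first approach can be illustrated with a small Python sketch. It is not SKA's implementation: the functions `split_kmers` and `weed` are hypothetical, the arm length of 3 is chosen only to keep the example short, and reverse complements and repeated arms are ignored. It assumes a split kmer is stored as a pair of flanking arms mapped to its middle base, so that weeding compares the arms only.

```python
def split_kmers(sequence, arm_length=3):
    """Return {(left_arm, right_arm): middle_base} for each window of the sequence.

    Illustrative only: reverse complements and repeated arms are not handled.
    """
    window = 2 * arm_length + 1
    kmers = {}
    for start in range(len(sequence) - window + 1):
        left = sequence[start:start + arm_length]
        middle = sequence[start + arm_length]
        right = sequence[start + arm_length + 1:start + window]
        kmers[(left, right)] = middle
    return kmers


def weed(sample_kmers, weed_kmers):
    """Drop every sample kmer whose arms occur in weed_kmers, whatever its middle base."""
    return {arms: middle for arms, middle in sample_kmers.items()
            if arms not in weed_kmers}


sample = split_kmers("AACGTTGCA")    # three split kmers, one with arms AAC/TTG
unwanted = split_kmers("AACATTG")    # same arms AAC/TTG, different middle base
weeded = weed(sample, unwanted)
```

Here the kmer with arms AAC and TTG is removed from `sample` even though its middle base (G) differs from the one in `unwanted` (A).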
To remove mobile elements when mapping to a reference, create a multi-fasta file of the reference's mobile element sequences and build a split kmer file from it with ska fasta.
A second approach is to compile a database of mobile element sequences for a species of interest in fasta format and create a split kmer file from it using ska fasta.
ska weed [options] <split kmer files>
Options:
-f <file>     File of split kmer file names. These will be added to or
              used as an alternative input to the list provided on the
              command line.
-h            Print this help.
-i <file>     Name of kmer file containing kmers to be weeded. [Required]
-m <int>      Minimum number of samples required to possess a split
              kmer for that kmer to be retained. [Default = 0]
-M <int>      Maximum number of samples required to possess a split
              kmer for that kmer to be retained. 0 = No maximum. [Default = 0]
-p <float>    Minimum proportion of samples required to possess a split
              kmer for that kmer to be retained. [Default = 0.0]
-P <float>    Maximum proportion of samples required to possess a split
              kmer for that kmer to be retained. [Default = 1.0]
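
One way to read the -m, -M, -p and -P thresholds is as a retention rule applied to each split kmer. The Python sketch below is an interpretation, not SKA's code: the function `retained` and its parameter names are hypothetical, and it assumes that all four bounds are inclusive and must hold together for a kmer to be kept. How SKA actually combines the count and proportion options is not stated here.

```python
def retained(count, n_samples, min_samples=0, max_samples=0,
             min_proportion=0.0, max_proportion=1.0):
    """Return True if a split kmer found in `count` of `n_samples` files is kept.

    Defaults mirror the help text: -m 0, -M 0 (no maximum), -p 0.0, -P 1.0.
    Assumption: every bound is inclusive and all bounds must be satisfied.
    """
    proportion = count / n_samples
    if count < min_samples:
        return False
    if max_samples != 0 and count > max_samples:  # 0 means no maximum
        return False
    if proportion < min_proportion or proportion > max_proportion:
        return False
    return True


# With the defaults nothing is weeded; raising the minimum proportion drops
# kmers that are absent from most samples.
keep_default = retained(1, 10)
keep_rare = retained(1, 10, min_proportion=0.9)
keep_core = retained(10, 10, min_proportion=0.9)
```

Under these assumptions, a minimum proportion of 0.9 keeps only kmers found in at least nine of ten samples, which matches the idea that kmers missing from most samples are probably accessory genome or contamination.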