ska fastq - simonrharris/SKA GitHub Wiki
The fastq subcommand creates a split kmer file from one or more fastq files.
The resulting file represents a single sample and contains the split kmers from all of the fastq files provided.
To remove noisy kmers caused by sequencing error from the split kmer file, ska applies three filters. First, it only creates kmers from sequence in which none of the base quality scores fall below a cutoff defined using the -q option. The default cutoff is 20, which in our experience gives a good balance between removing noise and retaining as much information as possible. Second, it removes any kmers that have a coverage of less than 4 or, if more than one fastq file is provided, a coverage of less than 2 in any one of the files. These defaults can be changed using the -c and -C options respectively. Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and can be adjusted using the -m option.
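The filters described above can be sketched in Python. This is an illustration only, not ska's implementation: the function names `split_kmers` and `filter_kmers` are hypothetical, coverage is assumed to mean the total number of observations of a split kmer, a kmer failing the minor allele check is simply dropped, and reverse complements are ignored for brevity.

```python
from collections import Counter


def split_kmers(seq, quals, k=15, min_qual=20):
    """Yield (flanks, middle_base) for each window of 2*k+1 bases.

    Windows containing any quality score below min_qual are skipped,
    mirroring the -q filter described in the text.
    """
    width = 2 * k + 1
    for start in range(len(seq) - width + 1):
        if min(quals[start:start + width]) < min_qual:
            continue  # at least one base in the window is low quality
        window = seq[start:start + width]
        yield window[:k] + window[k + 1:], window[k]


def filter_kmers(per_file_counts, min_cov=4, min_file_cov=2, max_minor=0.2):
    """Apply coverage and minor allele filters (assumed semantics).

    per_file_counts holds one dict per fastq file, mapping the flanking
    sequence of a split kmer to a Counter of observed middle bases.
    """
    totals = {}
    for counts in per_file_counts:
        for flanks, bases in counts.items():
            totals.setdefault(flanks, Counter()).update(bases)

    kept = {}
    for flanks, bases in totals.items():
        coverage = sum(bases.values())
        if coverage < min_cov:
            continue  # overall coverage cutoff (-c)
        if len(per_file_counts) > 1 and any(
            sum(counts.get(flanks, {}).values()) < min_file_cov
            for counts in per_file_counts
        ):
            continue  # per-file coverage cutoff (-C)
        base, top = bases.most_common(1)[0]
        if (coverage - top) / coverage > max_minor:
            continue  # minor alleles too frequent (-m), one reading of the text
        kept[flanks] = base
    return kept
```

With `k=2`, the read `ACGTA` gives a single split kmer with flanks `ACTA` and middle base `G`; lowering any one quality score in that window below the cutoff removes it.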
Allele frequencies can be written to a TSV file using the -a flag. The output is a tab-delimited list of split kmers, each with a count of the A, C, G and T middle bases found for that kmer in the skf file.
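A file like this could be read back with a few lines of Python. The sketch below assumes five columns per row in the order split kmer, A, C, G, T; the exact column layout and the presence or absence of a header row are not specified here, so treat both as assumptions. The helper names are hypothetical.

```python
import csv
import io


def read_allele_counts(handle):
    """Yield (split_kmer, counts) pairs from a tab-delimited handle.

    Assumes columns: kmer, A, C, G, T. Rows that do not fit that shape
    (blank lines, a possible header) are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 5 or not all(field.isdigit() for field in row[1:]):
            continue
        yield row[0], dict(zip("ACGT", map(int, row[1:])))


def minor_allele_fraction(counts):
    """Fraction of observations that are not the most common middle base."""
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - max(counts.values()) / total


# Example with made-up data rather than real ska output.
example = io.StringIO("ACTA\t0\t0\t9\t1\n")
rows = list(read_allele_counts(example))
```

Sorting the rows by `minor_allele_fraction` is one way to see which split kmers sit close to the -m threshold.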
ska fastq [options] <fastq files>
Options:
-h          Print this help.
-a          Print allele frequencies of split kmers to file [Default = false]
-c <int>    Coverage cutoff. Kmers with coverage below this value will
            be discarded. [Default = 4]
-C <int>    File coverage cutoff. Kmers with coverage below this value
            in any of the fastq files will be discarded. [Default = 2]
-k <int>    Split Kmer size. The kmer used for searches will be twice
            this length, with the variable base in the middle. e.g. a
            kmer of 15 will search for 31 base matches with the middle
            base being allowed to vary. Must be divisible by 3.
            [Default = 15]
-m <float>  Minimum allowable minor allele frequency. Kmer alleles below
            this frequency will be discarded. [Default = 0.2]
-o <file>   Output prefix. [Default = fastq]
-q <int>    Quality filter for fastq files. No kmers will be created
            from sequence including quality scores below this cutoff.
            [Default = 20]
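When ska fastq is called from a script, the options above can be assembled programmatically. The following is a minimal sketch: `build_ska_fastq_command` is a hypothetical helper, the file names and output prefix are placeholders, and only the flags listed above are used. It checks the divisible-by-3 rule for -k before building the argument list.

```python
def build_ska_fastq_command(fastq_files, k=15, coverage=4, file_coverage=2,
                            min_maf=0.2, quality=20, prefix="fastq",
                            allele_freqs=False):
    """Return an argument list for `ska fastq` using the documented flags."""
    if not fastq_files:
        raise ValueError("at least one fastq file is required")
    if k % 3 != 0:
        raise ValueError("split kmer size (-k) must be divisible by 3")
    cmd = [
        "ska", "fastq",
        "-k", str(k),              # split kmer size
        "-c", str(coverage),       # overall coverage cutoff
        "-C", str(file_coverage),  # per-file coverage cutoff
        "-m", str(min_maf),        # minor allele frequency setting
        "-q", str(quality),        # base quality cutoff
        "-o", prefix,              # output prefix
    ]
    if allele_freqs:
        cmd.append("-a")           # also write allele frequencies
    cmd.extend(fastq_files)
    return cmd


# Placeholder inputs; the full search length for k=15 is 2*15 + 1 = 31 bases.
cmd = build_ska_fastq_command(["reads_1.fastq", "reads_2.fastq"],
                              prefix="sampleA", allele_freqs=True)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where `ska` is on the PATH.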