ska fastq - simonrharris/SKA GitHub Wiki

SKA fastq

The fastq subcommand creates a split kmer file from one or more fastq files.

The command creates a split kmer file containing a single sample with all split kmers from all fastq files provided to it.

Filtering

To remove noisy kmers caused by sequencing error from the split kmer file, ska applies three filters. First, it only creates kmers from sequence in which no base quality score falls below a cutoff defined using the -q option. The default is a cutoff of 20, which in our experience gives a good balance between removing noise and retaining as much information as possible. Second, it removes any kmer that has a total coverage of less than 4 or, if more than one file is provided, a coverage of less than 2 in any individual fastq file. These defaults can be changed using the -c and -C options respectively. Finally, the base call for the middle base in the split kmer is filtered to remove any bases where the minor alleles are found in more than 20% of the kmers. This is a strategy often used in read mapping to account for sequencing error, and the threshold can be adjusted using the -m option.
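The coverage step described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not SKA's implementation: the function name `filter_by_coverage`, its parameters, and the toy three-base kmers are all hypothetical, and the per-file check is assumed to apply only when more than one file is given.

```python
def filter_by_coverage(per_file_counts, min_total=4, min_per_file=2):
    """Keep kmers that pass both the total and the per-file coverage cutoffs.

    per_file_counts: one dict per fastq file, mapping kmer -> times seen.
    Hypothetical helper for illustration; not part of SKA.
    """
    # Sum the counts for each kmer across all files.
    totals = {}
    for counts in per_file_counts:
        for kmer, n in counts.items():
            totals[kmer] = totals.get(kmer, 0) + n

    kept = {}
    for kmer, total in totals.items():
        if total < min_total:
            continue  # fails the overall coverage cutoff
        if len(per_file_counts) > 1 and any(
            counts.get(kmer, 0) < min_per_file for counts in per_file_counts
        ):
            continue  # fails the per-file cutoff in at least one file
        kept[kmer] = total
    return kept


# Toy counts for two files (made-up data).
file1 = {"AAA": 3, "CCC": 1, "GGG": 5}
file2 = {"AAA": 2, "CCC": 4}
print(filter_by_coverage([file1, file2]))
```

In this toy input only `AAA` survives: `CCC` is seen once in the first file and `GGG` is absent from the second, so both fail the per-file cutoff even though their totals reach 4.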

Allele frequencies

Allele frequencies can be output to a tsv file using the -a flag. The output is a tab-delimited list of split kmers, each with a count of the A, C, G and T middle bases found for that kmer in the skf file.
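A file like this could be inspected with a short Python script. The sketch below assumes each row holds the split kmer followed by the A, C, G and T counts in that order; the exact column layout, the presence or absence of a header row, and the example rows are assumptions rather than something the text specifies, so check them against a real output file first.

```python
import csv
import io


def read_allele_counts(handle):
    """Parse rows assumed to be: split kmer, A count, C count, G count, T count."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 5 or not row[1].isdigit():
            continue  # skip blank lines or a possible header row
        kmer, *counts = row
        table[kmer] = dict(zip("ACGT", map(int, counts)))
    return table


def minor_fraction(counts):
    """Fraction of observations that are not the most common middle base."""
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - max(counts.values()) / total


# Made-up rows standing in for a real allele frequency file.
example = io.StringIO("ACGACG\t18\t0\t2\t0\nCGTCGT\t0\t5\t5\t0\n")
table = read_allele_counts(example)
for kmer, counts in table.items():
    print(kmer, counts, round(minor_fraction(counts), 2))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.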

Usage

ska fastq [options] <fastq files>

Options:
-h		Print this help.
-a		Print allele frequencies of split kmers to file [Default = false]
-c <int>	Coverage cutoff. Kmers with coverage below this value will 
		be discarded. [Default = 4]
-C <int>	File coverage cutoff. Kmers with coverage below this value 
		in any of the fastq files will be discarded. [Default = 2]
-k <int>	Split Kmer size. The kmer used for searches will be twice 
		this length, with the variable base in the middle. e.g. a 
		kmer of 15 will search for 31 base matches with the middle 
		base being allowed to vary. Must be divisible by 3. 
		[Default = 15]
-m <float>	Minimum allowable minor allele frequency. Kmer alleles below 
		this frequency will be discarded. [Default = 0.2]
-o <file>	Output prefix. [Default = fastq]
-q <int>	Quality filter for fastq files. No kmers will be created 
		from sequence including quality scores below this cutoff. 
		[Default = 20]
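
The -k description can be made concrete with a small Python sketch. It slides a window of 2k + 1 bases along a sequence and separates the middle base from the two flanks of k bases each, so k = 15 gives the 31 base window mentioned above. The function `split_kmers` is hypothetical and ignores strand handling, quality filtering and whatever internal encoding SKA uses.

```python
def split_kmers(sequence, k=15):
    """Yield (flanks, middle_base) for every window of 2*k + 1 bases.

    Hypothetical helper for illustration; not part of SKA.
    """
    window = 2 * k + 1
    for start in range(len(sequence) - window + 1):
        left = sequence[start:start + k]                # k bases before the middle
        middle = sequence[start + k]                    # the base allowed to vary
        right = sequence[start + k + 1:start + window]  # k bases after the middle
        yield left + right, middle


# A short toy sequence with k=3, so each window is 7 bases long.
for flanks, middle in split_kmers("ACGTACGTA", k=3):
    print(flanks, middle)
```

Two reads that differ only at the middle base of a window produce the same flanks with different middle bases, which is what lets the variable position be compared between samples.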