Common file types - simonrharris/SKA GitHub Wiki
A split kmer is simply two kmers separated by a single base. Searching for exact kmer matches is extremely fast. Searching for split kmers retains this speed, but allows variation in the middle base. Single nucleotide polymorphisms can therefore be identified quickly between files wherever the flanking kmers are conserved.
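The idea can be illustrated with a minimal Python sketch. This is not SKA's implementation: the function names and the `flank` parameter (the length of each flanking kmer) are hypothetical, and the sketch ignores reverse complements, ambiguous bases and repeated flanks, keeping only the last middle base seen for each flank pair.

```python
def split_kmers(sequence, flank):
    """Yield (left_flank + right_flank, middle_base) for every window in sequence."""
    window = 2 * flank + 1
    for start in range(len(sequence) - window + 1):
        left = sequence[start:start + flank]
        middle = sequence[start + flank]
        right = sequence[start + flank + 1:start + window]
        yield left + right, middle


def middle_base_differences(seq_a, seq_b, flank):
    """Return flank pairs shared by both sequences whose middle bases differ."""
    # Exact dictionary lookups on the flanks keep the comparison fast.
    a = dict(split_kmers(seq_a, flank))
    b = dict(split_kmers(seq_b, flank))
    return {key: (a[key], b[key]) for key in a.keys() & b.keys() if a[key] != b[key]}
```

For two sequences that differ only at one position, the flanks around that position match exactly and only the middle base differs, which is how a SNP shows up.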
Split kmer files contain each unique split kmer identified within a sequence. For fastq files, these are filtered to attempt to remove noise caused by sequencing error (see ska fastq for details on filtering). SKA names split kmer files with a .skf (split kmer file) suffix.
There is no real need to look inside split kmer files, but for those interested, the format is described here.
The first three lines of the split kmer file are the header. The first line contains the version of SKA that created the file, the second line contains a single integer, which is the kmer size used to create the file, and the third line contains a list of the samples in the file.
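As a rough sketch, the three header lines could be read in Python as follows. The function name is hypothetical, and the code assumes the file can be opened in text mode and that the sample names on the third line are separated by whitespace; neither assumption is confirmed by the format description above.

```python
def read_skf_header(handle):
    """Read the three header lines from an open text handle."""
    version = handle.readline().rstrip("\n")  # line 1: SKA version that wrote the file
    kmer_size = int(handle.readline())        # line 2: kmer size as a single integer
    samples = handle.readline().split()       # line 3: sample names (assumed whitespace separated)
    return version, kmer_size, samples
```

After this call the handle is positioned at the first split kmer line.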
The rest of the lines in the file contain string representations of the split kmers and the samples they are found in. The first n/6 ASCII characters represent a bitstring recording which samples the split kmers on the line are found in. The rest of the line contains the split kmers. For each split kmer, the first character is the middle base of the split kmer, and is followed by a string of 2k/3 ASCII characters representing the split kmer itself. To save space and increase performance, the kmers are compressed by converting each trimer in the kmer into an ASCII symbol. This is why the kmer sizes used to create split kmer files must be divisible by 3.
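The trimer compression can be sketched in Python as below. The symbol table here is an assumption made for illustration: each trimer is treated as a base-4 number (A=0, C=1, G=2, T=3) and offset to start at the printable character `!`. The mapping SKA actually uses may differ, so this code should not be used to read or write real .skf files.

```python
BASES = "ACGT"
OFFSET = 33  # hypothetical choice: '!' is the first printable ASCII character


def compress(kmer):
    """Pack each trimer of the kmer into a single ASCII symbol."""
    if len(kmer) % 3 != 0:
        raise ValueError("kmer length must be divisible by 3")
    symbols = []
    for i in range(0, len(kmer), 3):
        index = 0
        for base in kmer[i:i + 3]:
            index = index * 4 + BASES.index(base)  # 64 possible trimers: 0..63
        symbols.append(chr(OFFSET + index))
    return "".join(symbols)


def decompress(symbols):
    """Expand each ASCII symbol back into its trimer."""
    bases = []
    for symbol in symbols:
        index = ord(symbol) - OFFSET
        bases.append(BASES[index // 16] + BASES[(index // 4) % 4] + BASES[index % 4])
    return "".join(bases)
```

Because there are 4^3 = 64 possible trimers, one printable character is enough for each, so the 2k flanking bases of a split kmer shrink to 2k/3 characters.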
SKA fasta and SKA alleles can create split kmers from fasta files, while SKA fastq can create them from fastq files. SKA annotate, SKA map and SKA type also take fasta files as input.
Many of the SKA subcommands allow input of a list of split kmer files in a file of file names. This is simply a file containing a white-space separated list of split kmer file locations.
e.g.
sample1.skf sample2.skf
sample3.skf
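Because the list is separated by any whitespace, spaces and newlines can be mixed freely, as in the example above. A minimal Python sketch of reading such a file (the helper name is hypothetical and this is not SKA's own parser):

```python
from pathlib import Path


def read_whitespace_list(path):
    """Return every whitespace-separated entry in the file, in order."""
    # str.split() with no arguments splits on any run of spaces, tabs or newlines.
    return Path(path).read_text().split()
```

The same helper would work for any file that holds a whitespace-separated list of names.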
As with the file of file names, many SKA subcommands provide the option to restrict the analysis to a set of samples listed in a file containing a white-space separated list of sample names. This is particularly useful for analysing a subset of samples in a merged split kmer file containing a large number of samples. SKA distance also outputs clusters in this format in files called .cluster..txt.
e.g.
sample1 sample2
sample3