ska summary - simonrharris/SKA GitHub Wiki
The summary subcommand prints summary statistics for a set of split kmer files. Its output contains the following columns:
Column | Description |
---|---|
Sample | The name of the sample being summarised |
Kmer size | The split kmer size used to create the file |
Total kmers | Total number of split kmers in the file |
As | Number of split kmers with an A as the middle base |
Cs | Number of split kmers with a C as the middle base |
Gs | Number of split kmers with a G as the middle base |
Ts | Number of split kmers with a T as the middle base |
Ns | Number of split kmers with an N as the middle base |
Others | Number of split kmers with any other letter as the middle base |
GC Content | The GC content of the middle base of all split kmers |
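As a rough illustration of how the GC Content column relates to the per-base counts, the sketch below computes a middle-base GC fraction from the As, Cs, Gs and Ts values. The function name `middle_base_gc` is hypothetical, and leaving Ns and Others out of the denominator is an assumption; this page does not say how ska treats them or whether it reports a fraction or a percentage.

```python
def middle_base_gc(a: int, c: int, g: int, t: int) -> float:
    """Return the fraction of A/C/G/T middle bases that are G or C.

    Assumption: Ns and Others are left out of the denominator.
    This may differ from how ska itself calculates GC content.
    """
    total = a + c + g + t
    if total == 0:
        raise ValueError("no A/C/G/T middle bases counted")
    return (c + g) / total
```

For example, `middle_base_gc(30, 20, 20, 30)` gives a GC fraction of 40 out of 100 middle bases.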
The summary subcommand is useful for QC purposes. You would expect the number of split kmers in each split kmer file to be approximately the length of the genome or slightly higher. If the number of split kmers is much lower than the expected genome size, the likely cause depends on how the file was made: if ska fastq was used, the sequence data may not be of high enough quality or depth for the default settings; if ska fasta was used, your assembly may be incomplete. If the number of split kmers is much larger than expected, you may have contamination in your sequencing data, or your data may be of low quality. Similarly, you would expect the GC content of the middle base of the split kmers to be representative of the species being sequenced.
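The QC checks described above could be automated along the following lines. This is a minimal sketch, not part of SKA. It assumes the summary output is tab-separated with a header row using the column names from the table (`Sample`, `Total kmers`, `GC Content`), and that `expected_gc` is given in the same units as the GC Content column. The function name `flag_samples` and the threshold defaults are hypothetical and should be tuned for your species and data.

```python
import csv
import io


def flag_samples(summary_text, expected_genome_size, expected_gc,
                 low_ratio=0.8, high_ratio=1.2, gc_tolerance=0.05):
    """Return {sample: [warnings]} for samples outside the given thresholds.

    Assumes tab-separated text with a header row containing the columns
    'Sample', 'Total kmers' and 'GC Content'. Thresholds are illustrative.
    """
    flagged = {}
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    for row in reader:
        warnings = []
        # Compare the split kmer count with the expected genome size.
        ratio = int(row["Total kmers"]) / expected_genome_size
        if ratio < low_ratio:
            warnings.append("too few split kmers")
        elif ratio > high_ratio:
            warnings.append("too many split kmers")
        # Compare the middle-base GC content with the expected value.
        if abs(float(row["GC Content"]) - expected_gc) > gc_tolerance:
            warnings.append("unexpected GC content")
        if warnings:
            flagged[row["Sample"]] = warnings
    return flagged
```

A sample with too few split kmers would then prompt a check of read quality and depth (ska fastq) or assembly completeness (ska fasta), while too many split kmers or an unexpected GC content would prompt a check for contamination.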
ska summary [options] <split kmer files>
Options:
  -f <file>  File of split kmer file names. These will be added to or
             used as an alternative input to the list provided on the
             command line.
  -h         Print this help
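When there are many split kmer files, the `-f` option avoids listing them all on the command line. The sketch below writes such a file of file names and builds the matching `ska summary` argument list. It assumes the file should contain one name per line and that your split kmer files end in `.skf`; neither is stated on this page. The helper names `write_file_list` and `summary_command` are hypothetical.

```python
from pathlib import Path


def write_file_list(directory, list_path, pattern="*.skf"):
    """Write matching file paths, one per line, and return them as a list.

    Assumptions: one name per line is the format -f expects, and
    split kmer files use the '.skf' extension.
    """
    files = sorted(str(p) for p in Path(directory).glob(pattern))
    Path(list_path).write_text("".join(f + "\n" for f in files))
    return files


def summary_command(list_path):
    """Build the argument list for 'ska summary -f <file>'."""
    return ["ska", "summary", "-f", str(list_path)]
```

The returned argument list can be passed to `subprocess.run(..., capture_output=True, text=True)` to capture the summary output for further processing.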