ska align - simonrharris/SKA GitHub Wiki

SKA align

The align subcommand allows reference-free alignment of split kmer files.

Running the command creates an alignment of the middle bases of all split kmers that are present in at least a given proportion of the split kmer files. This proportion is set by the user with the -p option (default 0.9) and must be between 0 and 1. Including split kmers that are found in only a few samples may add noise to the alignment. However, too stringent a threshold can lose signal, particularly in more diverse datasets.
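The effect of the -p threshold can be illustrated with a minimal Python sketch. This is not SKA's implementation: the function name, the data layout (a dict from sample name to a set of split kmer identifiers) and the inclusive comparison are assumptions made for illustration only.

```python
from collections import Counter


def filter_by_proportion(samples, min_proportion):
    """Return the split kmers present in at least min_proportion of samples.

    samples: dict mapping a sample name to the set of split kmers it holds
    (a hypothetical layout, not SKA's file format).
    """
    if not 0.0 <= min_proportion <= 1.0:
        raise ValueError("min_proportion must be between 0 and 1")
    counts = Counter()
    for kmers in samples.values():
        counts.update(set(kmers))  # count each kmer once per sample
    n_samples = len(samples)
    # Assumption: "at least a proportion" is treated as an inclusive threshold.
    return {k for k, c in counts.items() if c / n_samples >= min_proportion}


samples = {
    "s1": {"x", "y", "z"},
    "s2": {"x", "y"},
    "s3": {"x", "y"},
    "s4": {"x"},
}
strict = filter_by_proportion(samples, 0.9)    # only kmers in all 4 samples
relaxed = filter_by_proportion(samples, 0.75)  # kmers in 3 or more samples
```

Lowering the threshold admits "y" (present in three of four samples) alongside "x", which is the trade-off between signal and noise described above.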

The -v flag restricts the output to variant sites, i.e. those with at least two different bases (A, C, G or T) present at the site. This can be useful as input for phylogenetic reconstruction methods. SKA will also output the number of A, C, G and T 'constant' sites (note that these may also contain Ns or gaps), which can be supplied to some ML and Bayesian phylogenetic reconstruction methods for ascertainment bias correction.
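The distinction between variant and constant sites can be sketched in Python as follows. This is an illustration under stated assumptions, not SKA's code: alignment columns are represented as plain strings, and the function name and return format are hypothetical.

```python
BASES = set("ACGT")


def split_sites(columns):
    """Separate alignment columns into variant sites and constant-site counts.

    columns: list of strings, one per site, one character per sample.
    A site is variant if it holds at least two different bases from A, C, G, T.
    A site is constant if it holds exactly one such base; Ns and gaps are
    ignored when deciding, so a constant site may still contain them.
    """
    variant = []
    constant = {"A": 0, "C": 0, "G": 0, "T": 0}
    for col in columns:
        seen = {b for b in col.upper() if b in BASES}
        if len(seen) >= 2:
            variant.append(col)
        elif len(seen) == 1:
            constant[seen.pop()] += 1
        # Columns with no A, C, G or T at all are counted as neither.
    return variant, constant


columns = ["AAAA", "ACAA", "GGNG", "T-TT", "NN-N", "CCCT"]
variant_sites, constant_counts = split_sites(columns)
```

The four constant-site counts are the kind of values that some phylogenetic tools accept for ascertainment bias correction when given a variant-only alignment.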

The -k flag allows the user to print the aligned split kmers to file. This could be used to quickly add new samples to an alignment, or to annotate the kmers using ska annotate. If the -v option is used, only the kmers from variant aligned sites will be output.

During the alignment process, SKA attempts to estimate the number of alignments that may have been missed due to multiple variants occurring within a single kmer length. This number is intended as a rough guide to whether SKA is providing complete results. For alignments of small numbers of closely related samples, the expected number of missed alignments will be small (and probably overestimated in many cases), but for diverse samples or large numbers of samples (where the total number of variant sites becomes high), many alignments will be missed. SKA align is not recommended for analysis of diverse samples.
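Why nearby variants are missed can be seen in a small Python sketch. It assumes a split kmer is a pair of flanking kmers of length k around a middle base, uses a toy k of 3 and ignores reverse complements; the sequences and function names are made up for illustration. It does not reproduce how SKA estimates the number of missed alignments.

```python
def split_kmers(seq, k):
    """Map (left_flank, right_flank) -> middle base for each 2k+1 window."""
    out = {}
    for i in range(len(seq) - 2 * k):
        left = seq[i:i + k]
        middle = seq[i + k]
        right = seq[i + k + 1:i + 2 * k + 1]
        out[(left, right)] = middle
    return out


def variant_calls(a, b, k):
    """List flank pairs shared by both sequences whose middle bases differ."""
    ka, kb = split_kmers(a, k), split_kmers(b, k)
    shared = ka.keys() & kb.keys()
    return sorted((f, ka[f], kb[f]) for f in shared if ka[f] != kb[f])


ref      = "ACGTTGCAAGCTTAGGCATC"
one_snp  = "ACGTTGCAAGTTTAGGCATC"  # C->T at position 10
two_snps = "ACGTTGCAAGTTAAGGCATC"  # C->T at 10 and T->A at 12, within k bases

single = variant_calls(ref, one_snp, k=3)
double = variant_calls(ref, two_snps, k=3)
```

With these inputs the isolated SNP is recovered through its matching flanks, while the two SNPs that sit within one flank length of each other produce no shared split kmer and so no variant site.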

Please note that the sites in the output of SKA align are not in genome order. Therefore, the alignment is not suitable as input for methods such as Gubbins or ClonalFrameML, which identify regions affected by recombination using SNP density. Until we implement a method for sorting the output by genome position, please consider SKA map or other mapping software to produce input for these methods.

Usage

ska align [options] <split kmer files>

Options:
-h		Print this help.
-f <file>	File of split kmer file names. These will be added to or 
		used as an alternative input to the list provided on the 
		command line.
-k		Print aligned split kmers to file.
-o <file>	Prefix for output files. [Default = reference_free]
-p <float>	Minimum proportion of isolates required to possess a split 
		kmer for that kmer to be included in the alignment. 
		[Default = 0.9]
-s <file>	File of sample names to include in the alignment.
-v		Output variant only alignment. [Default = all sites]
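
For scripted use, the options above can be assembled into a command line from Python. The sketch below only builds the argument list from flags documented on this page; the helper name and the example file names are hypothetical, and the command is not executed here.

```python
def build_align_command(files, prefix="reference_free", proportion=0.9,
                        variants_only=False, print_kmers=False):
    """Build an argument list for `ska align` from the documented options."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    cmd = ["ska", "align", "-o", prefix, "-p", str(proportion)]
    if variants_only:
        cmd.append("-v")  # variant-only alignment
    if print_kmers:
        cmd.append("-k")  # also write the aligned split kmers to file
    cmd.extend(files)     # split kmer files come after the options
    return cmd


# Hypothetical split kmer file names, used only for illustration.
cmd = build_align_command(["a.skf", "b.skf"], prefix="run1",
                          proportion=0.8, variants_only=True)
# To run it where SKA is installed: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.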