ska map - simonrharris/SKA GitHub Wiki

SKA map

The map subcommand allows alignment of split kmer files to a reference fasta file.

The output file is a fasta file containing one or more sequences of the same length as the concatenated sequences in the reference fasta file. This allows split kmer files to be mapped either individually or in sets and then concatenated to form a multiple sequence alignment.
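Because every mapped sequence has the same length as the concatenated reference, combining separately mapped outputs is just a matter of joining the FASTA records. The following is a minimal Python sketch of that step, not part of SKA itself; the function names `read_fasta` and `merge_alignments` are hypothetical, and the length check is an added safeguard rather than something SKA is documented to do.

```python
def read_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def merge_alignments(fasta_texts):
    """Join several mapped FASTA texts into one alignment.

    Raises ValueError if the sequences are not all the same length,
    which would suggest they were mapped to different references.
    """
    merged = []
    for text in fasta_texts:
        merged.extend(read_fasta(text.splitlines()))
    lengths = {len(seq) for _, seq in merged}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return "".join(f">{name}\n{seq}\n" for name, seq in merged)
```

In practice the input texts would be read from the fasta files written by separate ska map runs against the same reference.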

By default the reference sequence is not included in the output file, but it can be included using the -i flag.

By default, split kmers that are repeated in the reference sequence are not aligned; the corresponding positions are filled with Ns instead. These kmers can be mapped using the -m flag. Please note, however, that this is done at your own risk, as it can introduce false positive variants into the alignment.

By default this command fills bases in the output alignment with the middle bases of aligned split kmers, written in upper case. Where a middle base differs from the reference sequence, the bases on either side that make up the kmer are filled with the reference sequence in lower case. This is done because a variant middle base prevents the kmers that overlap it from mapping, which would otherwise leave unmapped regions around every variant site. To apply the same filling technique to all aligned kmers, use the -a flag. This gives a slightly more complete alignment at a small added risk of misaligning some bases.
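The filling rule described above can be illustrated with a small Python sketch. This is one reading of the description, not SKA's actual implementation. The representation of a hit as a `(middle_position, middle_base)` pair, the function name `fill_alignment`, and the use of `-` for unmapped positions are all assumptions made for illustration.

```python
def fill_alignment(reference, hits, k, map_all=False, gap="-"):
    """Build one aligned sequence from split kmer hits.

    hits    : iterable of (middle_position, middle_base) pairs (hypothetical format)
    k       : split kmer size, i.e. the number of flanking bases on each side
    map_all : mimic the -a flag by filling flanks for every hit
    gap     : placeholder for unmapped positions (an assumption)
    """
    hits = list(hits)
    out = [gap] * len(reference)

    # Pass 1: fill flanking bases with lower-case reference sequence,
    # for variant hits only, or for all hits when map_all is set.
    for pos, base in hits:
        is_variant = base.upper() != reference[pos].upper()
        if is_variant or map_all:
            start = max(0, pos - k)
            end = min(len(reference), pos + k + 1)
            for i in range(start, end):
                if i != pos and out[i] == gap:
                    out[i] = reference[i].lower()

    # Pass 2: middle bases are written in upper case and take precedence.
    for pos, base in hits:
        out[pos] = base.upper()

    return "".join(out)


reference = "ACGTACGTAC"
hits = [(2, "G"), (6, "T")]  # position 6 is a variant (reference has G)
print(fill_alignment(reference, hits, k=2))                # --G-acTta-
print(fill_alignment(reference, hits, k=2, map_all=True))  # acGtacTta-
```

The second call shows why filling all kmers yields a more complete alignment: the flanks of the non-variant hit are filled as well.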

Aligned split kmers can be output to a file using the -v flag. This can be useful to allow annotation of variant kmers using ska annotate.

The -c flag tells SKA that all contigs in the reference fasta file are circular, which allows mapping up to the ends of the contigs by mapping across the two ends of each sequence.
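To see why circularity matters, consider which positions can ever be the middle base of a split kmer. The Python sketch below is an illustration only; it assumes that circular mapping is equivalent to letting search windows wrap from the end of a contig back to its start, and the function name `split_kmer_windows` is hypothetical.

```python
def split_kmer_windows(contig, k, circular=False):
    """Yield (start, window) for every window of 2k+1 bases in a contig.

    With circular=True, windows are allowed to wrap around the end of
    the contig, so there is one window per base.
    """
    width = 2 * k + 1
    if len(contig) < width:
        return
    if circular:
        # Append the first 2k bases so windows can span the junction.
        extended = contig + contig[:width - 1]
        n_windows = len(contig)
    else:
        extended = contig
        n_windows = len(contig) - width + 1
    for start in range(n_windows):
        yield start, extended[start:start + width]


contig = "ACGTACG"
print(len(list(split_kmer_windows(contig, k=1))))                 # 5
print(len(list(split_kmer_windows(contig, k=1, circular=True))))  # 7
```

In the linear case the first and last k bases of a contig never sit in the middle of a window, so they cannot be filled by a middle base. With wrapping, every base of the contig is the middle of exactly one window.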

During the alignment process, SKA attempts to estimate the number of alignments for each sample that may have been missed due to multiple variants occurring within a single kmer length. This number is intended as a rough guide to whether SKA is providing complete results. For alignments of samples that are very similar to the reference sequence, the expected number of missed alignments will be small (and probably overestimated in many cases), but for samples that are highly divergent from the reference, many alignments will be missed. SKA map is not recommended for analysis of divergent samples. SKA map also prints the percentage of the reference genome covered by aligned kmers. Together with the estimate of missed alignments, this can be used to assess the quality of the alignment.
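A similar coverage figure can be recomputed from an output alignment as a sanity check. The Python sketch below is an approximation, not the calculation SKA performs. It assumes that unmapped positions appear as `-` and repeat positions as `N`, and the function name `percent_covered` is hypothetical.

```python
def percent_covered(aligned_seq, uncovered="-Nn"):
    """Return the percentage of positions carrying a mapped base.

    Characters listed in `uncovered` are treated as not covered; which
    characters SKA uses for unmapped sites is an assumption here.
    """
    if not aligned_seq:
        return 0.0
    covered = sum(1 for base in aligned_seq if base not in uncovered)
    return 100.0 * covered / len(aligned_seq)


print(percent_covered("--G-acTta-"))  # 60.0
```

A sample whose coverage is much lower than that of the others mapped to the same reference is a candidate for closer inspection.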

Usage

ska map [options] <split kmer files>

Options:
-a		Map all bases of kmers (Default = just map middle base).
-c		Treat all reference contigs as circular.
-h		Print this help.
-f <file>	File of split kmer file names. These will be added to or 
		used as an alternative input to the list provided on the 
		command line.
-k <int>	Split Kmer size. The kmer used for searches will be twice 
		this length, with the variable base in the middle. e.g. a 
		kmer of 15 will search for 31 base matches with the middle 
		base being allowed to vary. Must be divisible by 3. 
		Must be the same value used to create the kmer files. 
		[Default = 15]
-i		Include reference sequence in alignment.
-m		Map bases to repeats rather than making them N.
-o <file>	Output file prefix. [Default = mappedkmers]
-r <file>	Reference fasta file name. [Required]
-s <file>	File of sample names to include in the alignment.
-v		Output variant only alignment. [Default = all sites]
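When ska map is driven from a script, it can help to assemble the command line programmatically. The Python sketch below builds an argument list using only options listed above; it does not run SKA. The function name `build_ska_map_command` and the file names `reference.fasta`, `sample1.skf` and `sample2.skf` are hypothetical.

```python
import shlex


def build_ska_map_command(reference, skf_files, prefix="mappedkmers", k=15,
                          include_reference=False, map_repeats=False,
                          circular=False):
    """Return the argument list for a ska map run."""
    # The help text states that the split kmer size must be divisible by 3.
    if k % 3 != 0:
        raise ValueError("split kmer size must be divisible by 3")
    cmd = ["ska", "map", "-r", reference, "-o", prefix, "-k", str(k)]
    if include_reference:
        cmd.append("-i")  # include the reference sequence in the alignment
    if map_repeats:
        cmd.append("-m")  # map bases to repeats rather than writing N
    if circular:
        cmd.append("-c")  # treat all reference contigs as circular
    cmd.extend(skf_files)  # split kmer files come after the options
    return cmd


cmd = build_ska_map_command("reference.fasta", ["sample1.skf", "sample2.skf"],
                            include_reference=True)
print(shlex.join(cmd))
# ska map -r reference.fasta -o mappedkmers -k 15 -i sample1.skf sample2.skf
```

The returned list can be passed to `subprocess.run` once SKA is installed. Remember that the value given to -k must match the one used to create the split kmer files.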