ska type - simonrharris/SKA GitHub Wiki

SKA type

The type subcommand allows rapid typing of alleles for a set of loci. If a profiles file is provided, a combined typing profile can also be identified, for example for MLST schemes.

Input format

The input for the command is a query split kmer file (containing one or more samples), one or more multi-fasta files containing the alleles of a typing locus, and an optional file linking alleles to profiles.

Each locus fasta should have the following format:

>abc_1
ACTGCTGC
>abc_2
ACTGCAGC

Where the string before the underscore (abc) is the name of the locus and the integer after the underscore (1,2,...,N) is the allele number.
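The naming convention above can be sketched in Python as follows. This is an illustrative parser, not SKA's own code: the function names `parse_allele_header` and `read_alleles` are hypothetical, and the sketch assumes that the allele number follows the last underscore and that anything after the first whitespace in a header is a description to be ignored.

```python
def parse_allele_header(header: str) -> tuple[str, int]:
    """Split a FASTA header such as '>abc_1' into ('abc', 1)."""
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError(f"empty FASTA header: {header!r}")
    # Assumption: the allele number follows the LAST underscore.
    locus, sep, allele = fields[0].rpartition("_")
    if not sep or not locus or not allele.isdigit():
        raise ValueError(f"header {header!r} is not of the form >locus_N")
    return locus, int(allele)


def read_alleles(lines) -> dict[tuple[str, int], str]:
    """Map (locus, allele number) to sequence for an iterable of FASTA lines."""
    alleles = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = parse_allele_header(line)
            alleles[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; join them.
            alleles[current] += line.upper()
    return alleles


example = [">abc_1", "ACTGCTGC", ">abc_2", "ACTGCAGC"]
print(read_alleles(example))
```

Running a check like this over each locus file before typing is one way to catch headers that do not follow the `locus_N` pattern.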

The profiles file should have the following format, which is widely used, for example for profiles downloaded from BIGSdb.

The first row must be a header row, and each following row represents a single profile.

Each row must be tab-separated into columns. One column must be labelled "ST" in the header row; in the following rows this column gives the sequence type of each profile. If a column is labelled "clonal_complex", it is treated as information on the clonal complex of the sequence type and is not included as an allele. This column can contain free text but will not appear in the output. All other column headers are interpreted as locus names and should match the locus names in the locus fasta files. In all rows following the header row, these columns must contain integers representing allele numbers, which must correspond to the allele numbers in the appropriate locus fasta file.

Example profiles input file format

ST	abc	def	ghi	jkl	mno	pqr	stu	clonal_complex
1	1	1	1	1	1	1	1	cc 1
2	2	1	5	2	1	1	2	cc 2
...
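The column rules above can be illustrated with a short Python reader. This is a sketch under stated assumptions, not SKA's implementation: `read_profiles` is a hypothetical name, the ST is kept as a string, and every column other than "ST" and "clonal_complex" is treated as a locus, as described above.

```python
import csv
import io


def read_profiles(handle):
    """Return (locus names, {allele tuple: ST}) from a tab-separated profiles file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or "ST" not in reader.fieldnames:
        raise ValueError('the header row must contain an "ST" column')
    # Every column except ST and clonal_complex is interpreted as a locus name.
    loci = [c for c in reader.fieldnames if c not in ("ST", "clonal_complex")]
    profiles = {}
    for row in reader:
        # int() raises ValueError if an allele column is not an integer.
        alleles = tuple(int(row[locus]) for locus in loci)
        profiles[alleles] = row["ST"]
    return loci, profiles


example = (
    "ST\tabc\tdef\tghi\tjkl\tmno\tpqr\tstu\tclonal_complex\n"
    "1\t1\t1\t1\t1\t1\t1\t1\tcc 1\n"
    "2\t2\t1\t5\t2\t1\t1\t2\tcc 2\n"
)
loci, profiles = read_profiles(io.StringIO(example))
print(loci)
print(profiles[(2, 1, 5, 2, 1, 1, 2)])  # prints 2
```

Keying the lookup on the tuple of allele numbers means a sample's ST can be found directly once every locus has a single, exact allele call.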

Output format

The output will be printed to screen as a tab-delimited list of samples with the allele(s) identified for each locus and, where a profiles file was provided, their ST.

Where multiple alleles of a locus are matched equally well, all of them will be listed, separated by forward slashes. For each allele match, suffix characters describe particular features of the match:

  • An asterisk (*) indicates that the match is not identical (i.e. no perfect match was found)
  • A hyphen (-) indicates that there were gaps in the alignment to the allele, indicating either that the allele was incomplete in the sample (possibly due to poor sequencing quality or a true deletion), or that the allele in the sample was too divergent from any of the alleles in the locus file to match the entire sequence (i.e. it has at least 2 SNPs within a kmer length or includes indels)
  • An N indicates that the split kmers mapped to the allele included at least one N, i.e. the sample included uncertainty in the split kmers matching the allele

Where there is uncertainty in the allele calls or no identical match is found for at least one allele, the ST column will be filled with a hyphen.
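The suffix notation described above can be decoded with a few lines of Python. This is a minimal sketch: `parse_allele_call` and the flag labels are hypothetical names, and the sketch assumes that suffix characters may follow any allele in a slash-separated call and pools them into one set of flags for the whole field.

```python
# Labels are illustrative shorthand for the three suffix characters.
SUFFIX_MEANINGS = {"*": "inexact", "-": "gapped", "N": "uncertain"}


def parse_allele_call(field: str):
    """Split a call such as '2/7*-' into allele numbers and match flags."""
    alleles = []
    flags = set()
    for part in field.split("/"):
        # Strip trailing suffix characters to leave the allele number.
        number = part.rstrip("*-N")
        flags.update(part[len(number):])
        if number:
            alleles.append(number)
    return alleles, {SUFFIX_MEANINGS[c] for c in flags}


for call in ("1", "5*", "2/7*-"):
    print(call, parse_allele_call(call))
```

A call with an empty flag set and a single allele is the only kind that can contribute to an ST assignment under the rule stated above.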

Example output

Sample	abc	def	ghi	jkl	mno	pqr	stu	ST
sample1	1	1	1	1	1	1	1	1
sample2	2	1	5	2	1	1	2	2
sample3	2	1	5*	2	1	1	2	-
sample4	1	2	2/7*-	1/8*-	3	4	1	-
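Output in this layout is straightforward to post-process. The following Python sketch reads the table and lists the samples that were not assigned an ST. It assumes the header uses the column names "Sample" and "ST" as in the example above; `read_typing_table` is a hypothetical helper, not part of SKA.

```python
import csv
import io


def read_typing_table(handle):
    """Map each sample to its per-locus calls and its ST (None if unassigned)."""
    results = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        sample = row.pop("Sample")
        # The ST column is only expected when a profiles file was given.
        st = row.pop("ST", None)
        results[sample] = {"calls": dict(row), "ST": None if st in (None, "-") else st}
    return results


example = (
    "Sample\tabc\tdef\tghi\tjkl\tmno\tpqr\tstu\tST\n"
    "sample1\t1\t1\t1\t1\t1\t1\t1\t1\n"
    "sample2\t2\t1\t5\t2\t1\t1\t2\t2\n"
    "sample3\t2\t1\t5*\t2\t1\t1\t2\t-\n"
    "sample4\t1\t2\t2/7*-\t1/8*-\t3\t4\t1\t-\n"
)
results = read_typing_table(io.StringIO(example))
untyped = [name for name, r in results.items() if r["ST"] is None]
print(untyped)  # prints ['sample3', 'sample4']
```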

Usage

ska type [options] <locus fasta files>

Options:
-h		Print this help.
-f <file>	File of locus fasta file names. These will be added to or 
		used as an alternative input to the list provided on the 
		command line.
-p <file>	tab file containing profile information.
-q <file>	Query split kmer file. This can be a single kmer file.
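As an illustration of how the options combine, the Python sketch below assembles a command line using only the flags listed above. The file names are hypothetical, `build_ska_type_command` is not part of SKA, and treating `-q` as always required is an assumption of this sketch. The command is only printed, not executed.

```python
import shlex


def build_ska_type_command(query_skf, locus_fastas=(), profiles=None, fasta_list=None):
    """Build an argument list for 'ska type' with options before the locus files."""
    cmd = ["ska", "type", "-q", query_skf]
    if profiles:
        cmd += ["-p", profiles]
    if fasta_list:
        cmd += ["-f", fasta_list]
    # Locus fasta files are positional and come after the options.
    cmd += list(locus_fastas)
    return cmd


cmd = build_ska_type_command("query.skf", ["abc.fas", "def.fas"], profiles="profiles.tsv")
print(shlex.join(cmd))  # prints: ska type -q query.skf -p profiles.tsv abc.fas def.fas
```

The resulting list can be passed to `subprocess.run` if `ska` is installed and on the PATH.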