SKA distance

The distance subcommand calculates pairwise distances between samples in split kmer files and performs single-linkage clustering of those samples based on user-defined SNP and identity cutoffs.

Clustering cutoffs

The clustering method is very simple. Two samples are clustered if they meet both of the following requirements:

The number of SNPs between them is less than the SNP cutoff set with the -s option [Default = 20], and
They meet the identity cutoff set with the -i option [Default = 0.9], i.e. they share at least this proportion of the total number of split kmers in the file with fewer kmers.
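
The two requirements and the single-linkage step can be sketched in Python as follows. This is an illustrative sketch only, not SKA's own code: the function names, the argument names and the union-find approach are assumptions, and exactly which split kmers count as "shared" is left to the caller.

```python
def meets_cutoffs(snps, shared_kmers, kmers_a, kmers_b,
                  snp_cutoff=20, identity_cutoff=0.9):
    """Return True if a pair of samples passes both clustering cutoffs.

    Hypothetical helper: shared_kmers is the number of split kmers the
    two samples share, kmers_a and kmers_b are their total kmer counts.
    """
    smaller = min(kmers_a, kmers_b)
    if smaller == 0:
        # Assumption: an empty file can never pass the identity cutoff.
        return False
    identity = shared_kmers / smaller
    # SNPs must be strictly below the cutoff; identity must be at least the cutoff.
    return snps < snp_cutoff and identity >= identity_cutoff


def single_linkage(samples, linked_pairs):
    """Group samples so that any chain of linked pairs ends up in one cluster."""
    parent = {s: s for s in samples}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())
```

Because the clustering is single-linkage, two samples that fail the cutoffs when compared directly can still share a cluster if a chain of passing pairs connects them.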

Output files

Distance.txt output columns

Column	Description
Sample 1	The name of the first sample being compared
Sample 2	The name of the second sample being compared
Matches	Number of split kmers found in both samples where the middle base is an A, C, G or T and matches between samples
Mismatches	Number of split kmers found in only one of the samples
Jaccard Index	Ratio of split kmers found in both samples to the total found in the two samples: matches/(matches+mismatches)
Mash-like distance	A distance based on the Mash distance calculation using the Jaccard Index (j) above and the split kmer length (k): (-1/(2k+1))*ln(2j/(1+j)) for 0<j≤1 or 1 for j=0
SNPs	Number of split kmers found in both samples where the middle base is an A, C, G or T but differs between samples
SNP distance	The ratio of SNPs to matches: SNPs/matches
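
As a worked illustration, the derived columns can be recomputed from the raw counts using the formulas in the table. This is a sketch, not SKA's implementation: the function name is hypothetical, k is used exactly as it appears in the formula above, and the handling of zero denominators is an assumption.

```python
import math


def distance_columns(matches, mismatches, snps, k):
    """Return (jaccard, mash_like, snp_distance) from the raw counts."""
    total = matches + mismatches
    # Assumption: with no kmers at all, treat the Jaccard Index as 0.
    jaccard = matches / total if total else 0.0

    if jaccard > 0:
        # (-1/(2k+1)) * ln(2j/(1+j)) for 0 < j <= 1
        mash_like = (-1.0 / (2 * k + 1)) * math.log(2 * jaccard / (1 + jaccard))
    else:
        # Defined as 1 when j = 0
        mash_like = 1.0

    # Assumption: report NaN when there are no matches to divide by.
    snp_distance = snps / matches if matches else float("nan")
    return jaccard, mash_like, snp_distance
```

For example, 900 matches, 100 mismatches and 9 SNPs with k = 15 give a Jaccard Index of 0.9 and an SNP distance of 0.01.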

Clusters.txt output columns

Column	Description
File	The name of the split kmer file
Cluster__autocolour	An index for the cluster containing the file

Note: The __autocolour suffix on the Cluster column allows automatic colouring when the file is opened in MicroReact.

Cluster.x.txt files

For each cluster comprising two or more samples, a samples file is produced. These files can be used as input for many SKA subcommands.

Dot file

A graph connecting samples using the cutoffs defined by the user is output and can be visualised along with the clusters.txt file in MicroReact. By default samples are only included in the dot file if they appear in a cluster of two or more samples. To include all samples in the file, use the -S flag.

Usage

ska distance [options] <split kmer files>

Options:
-c 		Do not print clusters files.
-d 		Do not print distances file.
-h		Print this help.
-f <file>	File of split kmer file names. These will be added to or 
		used as an alternative input to the list provided on the 
		command line.
-i <float>	Identity cutoff for defining clusters. Isolates will be 
		clustered if they share at least this proportion of the 
		split kmers in the file with fewer kmers and pass the SNP 
		cutoff. [Default = 0.9]
-o <file>	Prefix for output files. [Default = distances]
-s <int>	SNP cutoff for defining clusters. Isolates will be clustered 
		if they are separated by fewer than this number of SNPs and 
		pass the identity cutoff. [Default = 20]
-S 		Include singletons in dot file
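
When driving the subcommand from a script, the argument list can be assembled from the options above. The helper below is a hypothetical sketch: its name and parameters are not part of SKA, the input file names are placeholders, and it uses only the flags listed in the usage text.

```python
def build_ska_distance_command(files, prefix="distances", snp_cutoff=20,
                               identity_cutoff=0.9, include_singletons=False,
                               file_of_names=None):
    """Build the argument list for a `ska distance` call (hypothetical helper)."""
    cmd = ["ska", "distance",
           "-o", prefix,                 # prefix for output files
           "-s", str(snp_cutoff),        # SNP cutoff for defining clusters
           "-i", str(identity_cutoff)]   # identity cutoff for defining clusters
    if include_singletons:
        cmd.append("-S")                 # include singletons in the dot file
    if file_of_names is not None:
        cmd += ["-f", file_of_names]     # file of split kmer file names
    cmd += list(files)                   # split kmer files go after the options
    return cmd


# The list can then be passed to subprocess.run(cmd, check=True),
# assuming the ska executable is available on the PATH.
```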

ska distance - simonrharris/SKA GitHub Wiki
