Home - simonrharris/SKA GitHub Wiki

SKA

Introduction

SKA (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers. A split kmer is a pair of kmers in a DNA sequence that are separated by a single base. Split kmers allow rapid comparison and alignment of small, conserved genomes, so they are particularly suited to bacterial pathogen surveillance and outbreak investigations. SKA can produce split kmer files from fasta format assemblies or directly from fastq format read sequences, align them with or without a reference sequence, compute pairwise distances, identify clusters and provide various comparison and summary statistics. All testing to date has been carried out on high-quality Illumina read data, so results for other platforms may vary.
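
To make the definition concrete, the following is a minimal Python sketch of extracting split kmers from a sequence. It is an illustration only and not SKA's implementation: the function name `split_kmers`, the value of `k` and the toy sequences are assumptions, and the sketch ignores the reverse strand and simply keeps the last middle base seen when a flanking pair occurs more than once.

```python
def split_kmers(seq, k):
    """Return {(left_kmer, right_kmer): middle_base} for each window of 2k+1 bases.

    Simplified illustration: forward strand only, last occurrence wins.
    """
    seq = seq.upper()
    width = 2 * k + 1
    found = {}
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) - set("ACGT"):
            continue  # skip windows containing ambiguous bases
        left, middle, right = window[:k], window[k], window[k + 1:]
        found[(left, right)] = middle
    return found


# Two toy sequences that differ at a single position (G vs C).
a = split_kmers("ACGTACGTTGCA", k=3)
b = split_kmers("ACGTACCTTGCA", k=3)

# A SNP shows up as a shared flanking pair with a different middle base.
snps = {key: (a[key], b[key]) for key in a.keys() & b.keys() if a[key] != b[key]}
print(snps)
```

Because the two flanking kmers must match exactly for a middle base to be compared, only variants with conserved flanks are detected this way, which is consistent with the focus on closely related genomes.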

Compared with most read-mapping or assembly-based approaches for identifying genomic variation, SKA is simple to use, fast, memory efficient (for good quality data from small genomes) and requires relatively little disk space for file storage. It also provides methods for isolate clustering and an accurate reference-free alignment approach for closely-related bacterial genomes that could aid outbreak investigations where no close reference genome is available. SKA has no external dependencies; building it requires only GNU make and a version of g++ that supports C++11.

Installation

Please see the readme from the SKA github repository for installation instructions.

Synopsis

Create split kmer files for a set of paired fastq files for a number of DNA samples using ska fastq

ska fastq -o sample1 sample1_1.fastq.gz sample1_2.fastq.gz
ska fastq -o sample2 sample2_1.fastq.gz sample2_2.fastq.gz
ska fastq -o sample3 sample3_1.fastq.gz sample3_2.fastq.gz
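
For larger sample sets, the same commands can be generated rather than typed by hand. The sketch below assumes the paired files follow the `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming used above; the helper name `fastq_commands` is hypothetical, and only the `-o` option shown in this synopsis is used.

```python
def fastq_commands(sample_names):
    """Build one `ska fastq` argument list per sample.

    Assumes paired reads are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
    """
    return [
        ["ska", "fastq", "-o", name, f"{name}_1.fastq.gz", f"{name}_2.fastq.gz"]
        for name in sample_names
    ]


commands = fastq_commands(["sample1", "sample2", "sample3"])
for command in commands:
    # Print the command; to execute it instead, pass the list to
    # subprocess.run(command, check=True) from the standard library.
    print(" ".join(command))
```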

Print a brief summary of the split kmer files using ska summary to check that they are consistent with the species being sequenced

ska summary sample1.skf sample2.skf sample3.skf

Type MLST loci for sample1 using ska type

ska type -q sample1.skf -p MLST_profiles.tsv locus1_alleles.fasta locus2_alleles.fasta locus3_alleles.fasta locus4_alleles.fasta locus5_alleles.fasta locus6_alleles.fasta locus7_alleles.fasta

Compare split kmers in sample 1 with those in samples 2 and 3 using ska compare

ska compare -q sample1.skf sample2.skf sample3.skf

Merge samples 1 and 2 into a single file using ska merge

ska merge -o merged sample1.skf sample2.skf

Calculate pairwise distances of the three samples and assign them to clusters based on a 25 SNP cutoff and a minimum split kmer identity of 95% using ska distance

ska distance -s 25 -i 0.95 merged.skf sample3.skf
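
The idea of clustering on a SNP cutoff and a minimum identity can be sketched as follows. This is a conceptual illustration under stated assumptions, not SKA's algorithm: here identity is assumed to be the fraction of split kmers shared between two samples, the thresholds are assumed to be inclusive, and clusters are formed by single linkage. The names `compare` and `cluster` and the toy data are hypothetical.

```python
from itertools import combinations


def compare(a, b):
    """Return (snps, identity) for two {split_kmer: middle_base} dicts.

    Assumption: identity = shared split kmers / all split kmers in either sample.
    """
    shared = a.keys() & b.keys()
    snps = sum(1 for key in shared if a[key] != b[key])
    total = len(a.keys() | b.keys())
    identity = len(shared) / total if total else 0.0
    return snps, identity


def cluster(samples, max_snps, min_identity):
    """Single-linkage clustering of samples that pass both thresholds."""
    parent = {name: name for name in samples}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in combinations(samples, 2):
        snps, identity = compare(samples[x], samples[y])
        if snps <= max_snps and identity >= min_identity:
            parent[find(x)] = find(y)

    groups = {}
    for name in samples:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy data: split kmers are abbreviated to labels for readability.
samples = {
    "s1": {"k1": "A", "k2": "C", "k3": "G", "k4": "T"},
    "s2": {"k1": "A", "k2": "C", "k3": "T", "k4": "T"},  # 1 SNP from s1
    "s3": {"k1": "C", "k2": "G", "k3": "A", "k4": "T"},  # 3 SNPs from s1 and s2
}
clusters = cluster(samples, max_snps=1, min_identity=0.95)
print(clusters)
```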

Create a split kmer file for mobile elements from a reference genome using ska fasta

ska fasta -o MGEs MGEs.fasta

Weed out the MGE split kmers from each of the sample split kmer files using ska weed

ska weed -i MGEs.skf merged.skf
ska weed -i MGEs.skf sample3.skf
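
Conceptually, weeding removes from a sample every split kmer that also occurs in the unwanted set. The sketch below illustrates that idea on toy dictionaries; the function name `weed` and the data are hypothetical, and matching on the flanking kmers alone (ignoring the middle base) is an assumption rather than a statement of how SKA matches.

```python
def weed(sample, unwanted):
    """Return the sample's split kmers whose keys are absent from `unwanted`."""
    return {key: base for key, base in sample.items() if key not in unwanted}


# Toy data: split kmers are abbreviated to labels for readability.
sample = {"k1": "A", "k2": "C", "k3": "G"}
mges = {"k2": "C", "k9": "T"}

weeded = weed(sample, mges)
print(weeded)
```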

Align the weeded kmer files against the reference genome using ska map

ska map -o reference.aln -r reference.fasta merged.weeded.skf sample3.weeded.skf

Produce a reference-free alignment of variant split kmers from the three samples and output the variant split kmers to file using ska align

ska align -v -o reference_free merged.weeded.skf sample3.weeded.skf

Annotate the variant split kmers from the previous command on a gff of a reference genome and include product descriptions in the output vcf using ska annotate

ska annotate -p -r reference.gff -o annotated_variants reference_free_variant.skf

Identify split kmers that are unique to samples 2 and 3 using ska unique (assuming that they are more similar than either is to sample 1). For this you require a sample file containing the names of the ingroup samples (sample2 and sample3)

ska unique -o unique merged.weeded.skf sample3.weeded.skf -i ingroup.fofn
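
The notion of split kmers unique to an ingroup can be illustrated with set operations. In this sketch a split kmer counts as unique if it is present, with the same middle base, in every ingroup sample and in no other sample; that definition is an assumption made for illustration, and SKA's exact criteria may differ. The function name `unique_to_ingroup` and the toy data are hypothetical.

```python
def unique_to_ingroup(samples, ingroup):
    """Return split kmers found in all ingroup samples and in no other sample.

    `ingroup` must name at least one sample.
    """
    inside = [set(samples[name].items()) for name in ingroup]
    outside = [
        set(kmers.items()) for name, kmers in samples.items() if name not in ingroup
    ]
    shared = set.intersection(*inside)
    seen_elsewhere = set().union(*outside)
    return dict(shared - seen_elsewhere)


# Toy data: split kmers are abbreviated to labels for readability.
samples = {
    "sample1": {"k1": "A", "k2": "C"},
    "sample2": {"k1": "A", "k2": "G", "k3": "T"},
    "sample3": {"k1": "A", "k2": "G", "k3": "T"},
}
unique = unique_to_ingroup(samples, ["sample2", "sample3"])
print(unique)
```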

Quickly assess whether new samples contain those unique kmers (i.e. seem to be related to samples 2 and 3) using ska compare

ska compare -q unique.skf sample4.skf sample5.skf sample6.skf

Subcommands

SKA includes several subcommands to carry out a number of different analysis tasks.

  • align: Create a reference-free alignment of a set of split kmer files
  • alleles: Create a merged split kmer file for all sequences in one or more multifasta files
  • annotate: Locate/annotate split kmers in a reference fasta/gff file
  • compare: Print comparison statistics for a query split kmer file against a set of subject split kmer files
  • distance: Calculate pairwise distances and cluster isolates from a set of split kmer files
  • fasta: Create a split kmer file from one or more fasta files
  • fastq: Create a split kmer file from one or more fastq files
  • humanise: Print kmers from a split kmer file in human readable formats
  • info: Print some information about one or more kmer files
  • map: Align one or more split kmer files against a reference fasta file
  • merge: Merge multiple split kmer files into a single file
  • summary: Print some summary statistics for one or more split kmer files
  • type: Type split kmer files using a set of allele fasta files and a profile file
  • unique: Extract split kmers that are unique to a subset of samples
  • version: Just print the version and citation for SKA
  • weed: Remove split kmers from a split kmer file if they are present in a second split kmer file. This is useful for excluding split kmers that match mobile genetic elements, contaminants or adaptor sequences