MLST calling with ARIBA - sanger-pathogens/ariba GitHub Wiki

MLST calling with ARIBA

Get MLST scheme

ARIBA can be used for MLST calling with the typing schemes from PubMLST. To list the available species, run:

ariba pubmlstspecies

Download the data (in this example, Staphylococcus aureus) using pubmlstget:

ariba pubmlstget "Staphylococcus aureus" get_mlst

Note that a few species have two typing schemes, which are distinguished by a #1 or #2 suffix on the species name. For example:

  • Escherichia coli#1: Achtman's seven-gene scheme
  • Escherichia coli#2: Pasteur Institute's eight-gene scheme

See Issue 185 for how to make a customised MLST database.

Run ARIBA

Then run MLST calling with ARIBA:

ariba run get_mlst/ref_db reads_1.fq reads_2.fq ariba_out

where reads_1.fq and reads_2.fq are the paired reads files for your sample.
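When several samples need typing, the same command can be assembled programmatically. The following is only a minimal sketch: the helper name ariba_run_command is hypothetical, and it simply reproduces the argument order of the command shown above (reference directory, two reads files, output directory).

```python
import shlex


def ariba_run_command(ref_db, reads_1, reads_2, outdir):
    """Build the argument list for the `ariba run` command shown above.

    Hypothetical helper: it only mirrors the argument order used on this page
    and does not execute anything.
    """
    return ["ariba", "run", ref_db, reads_1, reads_2, outdir]


cmd = ariba_run_command("get_mlst/ref_db", "reads_1.fq", "reads_2.fq", "ariba_out")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where ARIBA is installed.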

Output files

The two important files are mlst_report.tsv and mlst_report.details.tsv.

mlst_report.tsv is a summary of the allele calls and identified sequence type. The format is like this:

ST   gene1  gene2  gene3
42   1      4      7

where in this case the sequence type is identified as 42.

A star next to any call indicates that there was some uncertainty. For example:

ST   gene1  gene2  gene3
42*  1*     4      7

A star is added to an allele call if any heterozygous SNPs are detected, if the percent of the gene assembled or the percent identity is less than 100, or if there is more than one contig in the assembly.
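Starred calls can be picked out of the summary report with a few lines of Python. This is a sketch under stated assumptions: the file is tab-delimited with one header line and one data row (as in the example above), and the function name read_mlst_report is hypothetical rather than part of ARIBA.

```python
import csv
import io


def read_mlst_report(handle):
    """Return (calls, uncertain) from an mlst_report.tsv-style file handle.

    Assumes a tab-delimited file with a header line and a single data row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    row = next(reader)
    calls = {name: value.strip() for name, value in row.items()}
    # A trailing star marks a call with some uncertainty.
    uncertain = sorted(name for name, value in calls.items() if value.endswith("*"))
    return calls, uncertain


# Inline copy of the starred example above, so the sketch runs without ARIBA output.
example = "ST\tgene1\tgene2\tgene3\n42*\t1*\t4\t7\n"
calls, uncertain = read_mlst_report(io.StringIO(example))
print(calls["ST"], uncertain)
```

For a real run, open ariba_out/mlst_report.tsv and pass the file handle instead of the inline example.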

mlst_report.details.tsv has more details on each allele call. For example, the file corresponding to the previous report could look like this:

gene   allele  cov     pc     ctgs  depth  hetmin  hets
gene1  1*      100.00  99.8   1     28.9   .       .
gene2  4       100.00  100.0  1     45.9   .       .
gene3  7       100.00  100.0  1     54.3   .       .

where the columns are as follows.

  1. gene: the name of the gene
  2. allele: the allele called
  3. cov: percent of the gene that was assembled
  4. pc: percent identity between the gene and assembly
  5. ctgs: number of contigs in the assembly
  6. depth: mean read depth of the contig(s)
  7. hetmin: for each heterozygous SNP, the depth of the most common allele is taken as a percent of the total depth at that SNP; hetmin is the minimum of these values across all identified heterozygous SNPs. For example, if the hets column (described next) is 30,10.25,10,5, then hetmin = 100 * min(30/(30+10), 25/(25+10+5)) = 62.5.
  8. hets: a list of the heterozygous SNP depths. SNPs are separated by a period and the allele depths within each SNP by commas. For example, 30,10.25,10,5 corresponds to two heterozygous SNPs, the first with read depths 30 and 10, and the second with depths 25, 10, and 5.
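The hetmin calculation can be reproduced from a hets value as a check. This sketch rests on two assumptions taken from the examples on this page rather than from ARIBA's code: depths are whole numbers, and a lone "." means no heterozygous SNPs were found. The function names are hypothetical.

```python
def parse_hets(hets):
    """Split a hets string such as '30,10.25,10,5' into per-SNP depth lists."""
    if hets == ".":  # assumed to mean "no heterozygous SNPs"
        return []
    # SNPs are separated by periods, allele depths within a SNP by commas.
    return [[int(depth) for depth in snp.split(",")] for snp in hets.split(".")]


def hetmin(hets):
    """Minimum, over all SNPs, of the top allele depth as a percent of total depth."""
    snps = parse_hets(hets)
    if not snps:
        return None
    return min(100 * max(depths) / sum(depths) for depths in snps)


print(parse_hets("30,10.25,10,5"))  # [[30, 10], [25, 10, 5]]
print(hetmin("30,10.25,10,5"))      # 62.5
```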

All other output files are as described in the run page.