Task: summary - sanger-pathogens/ariba GitHub Wiki

Task: summary

This summarises the results from one or more runs of ARIBA.

The usage is

ariba summary out in.report.1.tsv in.report.2.tsv ...

where in.report.1.tsv, in.report.2.tsv ... is a list of report files made by separate runs of ARIBA.
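
When summary is run from a script, the command line above can be assembled programmatically. The following is a minimal sketch; the helper name build_summary_command is hypothetical and not part of ARIBA. It only builds the argument list and does not run anything.

```python
import shlex


def build_summary_command(out_prefix, report_files):
    """Return the argument list for `ariba summary` (hypothetical helper)."""
    if not report_files:
        raise ValueError("at least one report file is required")
    # Layout taken from the usage line: ariba summary <out> <report> [<report> ...]
    return ["ariba", "summary", out_prefix, *report_files]


cmd = build_summary_command("out", ["in.report.1.tsv", "in.report.2.tsv"])
print(shlex.join(cmd))
```

If ARIBA is installed, the list can be passed to subprocess.run(cmd, check=True).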

It makes three output files:

  • out.csv. This is a csv file that can be viewed in your favourite spreadsheet program.

  • out.phandango.{csv,tre}. These are two files that allow you to view the results in Phandango. They can be drag-and-dropped straight into Phandango. Note that ARIBA makes a rough tree, using the contents of the CSV file. You may wish to provide your own tree file to Phandango and stop ARIBA from making a tree with the --no_tree option.

By default, the output is minimal. It contains one column per cluster and one row per sample, with a "yes" or "no" showing whether or not each sample has a match (as described below) to that cluster. You can control exactly what is output, or use the option --preset to choose one of several preset combinations. In order of increasing number of columns, the values that can be used with --preset are: minimal, cluster_small, cluster_all, cluster_var_groups, all, all_no_filter.

Continue reading for an explanation of all the columns that can be reported.
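
As an illustration of the minimal layout (one row per sample, one column per cluster), here is a sketch that reads such a table and lists the columns marked "yes" for each sample. The example data and the header names (name, cluster1.match, ...) are assumptions made for illustration; check them against your own out.csv. The code relies only on the first column holding the sample name.

```python
import csv
import io

# Hypothetical contents of out.csv with the default (minimal) output.
# The header names are assumptions made for illustration.
EXAMPLE_CSV = """name,cluster1.match,cluster2.match
sample1.tsv,yes,no
sample2.tsv,yes,yes
"""


def matches_per_sample(handle):
    """Map each sample name to the list of columns whose value is "yes"."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]  # everything after the sample-name column
    result = {}
    for row in reader:
        sample, values = row[0], row[1:]
        result[sample] = [col for col, val in zip(columns, values) if val == "yes"]
    return result


matches = matches_per_sample(io.StringIO(EXAMPLE_CSV))
print(matches)
```

To read a real file, replace io.StringIO(EXAMPLE_CSV) with open("out.csv", newline="").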

Tuning the output

Cluster columns

There can be up to seven columns output per cluster:

  1. assembled: this is one of "no", "fragmented", "interrupted", "yes", "yes_nonunique", depending on the flag. Please see here for how it is determined.

  2. match: this is either "yes" or "no". It is set to "yes" if the assembled column is "yes" or "yes_nonunique", and in the case of a variants-only gene it must also have a known variant. Otherwise it is set to "no".

  3. ref_seq: this is set to the name of the closest reference sequence for each sample. Set to "NA" if assembled is "no".

  4. pct_id: this is the percent identity of the contig that has the largest value in the ref_base_assembled column of the report. Set to "NA" if assembled is "no".

  5. ctg_cov: this is the mean read depth across the same contig that is used for pct_id. Set to "NA" if assembled is "no".

  6. known_var: "yes" or "no" depending on whether or not the sample has a known variant. Set to "NA" if assembled is "no".

  7. novel_var: "yes" or "no" depending on whether or not the sample has a novel variant (ie not specified in the original metadata). Set to "NA" if assembled is "no".
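
The rule for the match column can be written out as a small function. This is only a sketch of the rule as described in item 2, not ARIBA's own code; the function and parameter names are hypothetical.

```python
# Values of the assembled column that allow a match.
ASSEMBLED_OK = {"yes", "yes_nonunique"}


def match_value(assembled, variants_only=False, has_known_variant=False):
    """Return "yes" or "no" for the match column (sketch of the documented rule)."""
    if assembled not in ASSEMBLED_OK:
        return "no"
    # A variants-only gene must also carry a known variant.
    if variants_only and not has_known_variant:
        return "no"
    return "yes"


print(match_value("yes"))
print(match_value("fragmented"))
print(match_value("yes_nonunique", variants_only=True, has_known_variant=False))
```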

Which of the seven columns are output is controlled using the option --cluster_cols. Provide a comma-separated list of the names that you want in the output. For example:

--cluster_cols assembled,match,known_var

would report the three columns assembled, match and known_var.
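
If the column list is built in a script, it can be worth checking the names against the seven listed above before calling ARIBA. A minimal sketch, with a hypothetical helper name:

```python
# The seven cluster column names described above.
CLUSTER_COLS = (
    "assembled", "match", "ref_seq", "pct_id", "ctg_cov", "known_var", "novel_var",
)


def cluster_cols_option(columns):
    """Return ["--cluster_cols", "a,b,..."] after validating the names."""
    unknown = [c for c in columns if c not in CLUSTER_COLS]
    if unknown:
        raise ValueError("unknown cluster column(s): " + ",".join(unknown))
    return ["--cluster_cols", ",".join(columns)]


option = cluster_cols_option(["assembled", "match", "known_var"])
print(" ".join(option))
```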

Variant columns

By default, variants are not reported when running summary. There are three types of variant columns that can be reported. The reporting of variants can be switched on using any of the options --v_groups, --known_variants, and --novel_variants.

  • --v_groups: this only applies if you allocated the variants to groups, eg when running aln2meta. Otherwise, you can ignore this option. If it is used, it will output a column for each group, showing whether or not each sample has any variant from that group.

  • --known_variants: output a column per variant, showing whether or not each sample has it. This applies to variants that ARIBA is already aware of because they were provided in the original metadata when running prepareref.

  • --novel_variants: this is the same as --known_variants except novel variants are reported, ie any variants found that were not given in the original metadata.

Column and row filtering

By default, any row or column that only contains "no" or "NA" is removed. Each filter can be switched off separately, using the options --col_filter n and --row_filter n.
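
The effect of the two filters can be sketched as follows. This illustrates the rule stated above on an in-memory table and is not ARIBA's implementation; all names are hypothetical. Because a removed row or column contains only "no" or "NA", the order in which the two filters are applied does not change the result.

```python
EMPTY_VALUES = {"no", "NA"}


def filter_table(samples, columns, rows, row_filter=True, col_filter=True):
    """Drop rows/columns that contain only "no" or "NA".

    rows[i][j] is the value for samples[i] in columns[j].
    """
    keep_rows = [
        i for i, row in enumerate(rows)
        if not row_filter or any(v not in EMPTY_VALUES for v in row)
    ]
    keep_cols = [
        j for j in range(len(columns))
        if not col_filter or any(row[j] not in EMPTY_VALUES for row in rows)
    ]
    return (
        [samples[i] for i in keep_rows],
        [columns[j] for j in keep_cols],
        [[rows[i][j] for j in keep_cols] for i in keep_rows],
    )


samples = ["s1", "s2", "s3"]
columns = ["cluster1.match", "cluster2.match"]
rows = [["yes", "no"], ["no", "no"], ["yes", "no"]]

filtered = filter_table(samples, columns, rows)
print(filtered)
```

Here sample s2 and column cluster2.match are removed, because they contain only "no".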

Presets

Preset combinations of the columns to output are available using the --preset option. The default is --preset minimal. Using --preset will override the options --cluster_cols, --v_groups, --known_variants, --novel_variants, --col_filter, and --row_filter.

The cluster columns are set as follows depending on the preset:

Preset              Value of --cluster_cols
minimal             match
cluster_small       assembled,match,ref_seq,known_var
cluster_all         assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var
cluster_var_groups  assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var
all                 assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var
all_no_filter       assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var

The variant options and row/column filtering are set as follows depending on the preset:

Preset              Variant options used                          --row_filter  --col_filter
minimal             (none)                                        y             y
cluster_small       (none)                                        y             y
cluster_all         (none)                                        y             y
cluster_var_groups  --v_groups                                    y             y
all                 --v_groups --known_variants --novel_variants  y             y
all_no_filter       --v_groups --known_variants --novel_variants  n             n
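
The two tables can be combined into a single lookup that expands a preset name into the equivalent individual options. This is a transcription of the tables for illustration only, not ARIBA's internal representation; the names PRESETS and expand_preset are hypothetical.

```python
ALL_CLUSTER_COLS = "assembled,match,ref_seq,pct_id,ctg_cov,known_var,novel_var"
ALL_VARIANT_FLAGS = ["--v_groups", "--known_variants", "--novel_variants"]

# preset -> (cluster_cols, variant flags, row_filter, col_filter),
# transcribed from the two tables above.
PRESETS = {
    "minimal": ("match", [], "y", "y"),
    "cluster_small": ("assembled,match,ref_seq,known_var", [], "y", "y"),
    "cluster_all": (ALL_CLUSTER_COLS, [], "y", "y"),
    "cluster_var_groups": (ALL_CLUSTER_COLS, ["--v_groups"], "y", "y"),
    "all": (ALL_CLUSTER_COLS, ALL_VARIANT_FLAGS, "y", "y"),
    "all_no_filter": (ALL_CLUSTER_COLS, ALL_VARIANT_FLAGS, "n", "n"),
}


def expand_preset(name):
    """Return the individual options that a preset stands for."""
    cluster_cols, variant_flags, row_filter, col_filter = PRESETS[name]
    return [
        "--cluster_cols", cluster_cols,
        *variant_flags,
        "--row_filter", row_filter,
        "--col_filter", col_filter,
    ]


print(" ".join(expand_preset("cluster_var_groups")))
```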

Other options

The other options, not described above, are as follows:

  • --no_tree. This stops the tree from being calculated. Calculating the tree can be quite slow.

  • --min_id. The minimum percent identity to count as assembled. Default: 90.

  • --only_clusters Cluster_names. Only report data for the given comma-separated list of cluster names, eg: cluster1,cluster2,cluster42.