Task: flag - sanger-pathogens/ariba GitHub Wiki

Task: flag

This task reports the meaning of a flag.

During assembly, various things can happen. The possibilities are encoded into a flag (column 3 of the report). To get the meaning of a flag, for example 27, run:

ariba flag 27

The output is:

Meaning of flag 27
[X] assembled
[X] assembled_into_one_contig
[ ] region_assembled_twice
[X] complete_gene
[X] unique_contig
[ ] scaffold_graph_bad
[ ] assembly_fail
[ ] variants_suggest_collapsed_repeat
[ ] hit_both_strands
[ ] has_variant
[ ] ref_seq_choose_fail

An X means that part of the flag is true. In this example, flag 27 is an ideal result - the gene was assembled into one unique contig with a complete gene (starts with a start codon, and the only stop codon is at the end).
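The example is consistent with each entry in the list being one bit of the integer, with the first entry as the least significant bit (1 + 2 + 8 + 16 = 27). The following is a minimal Python sketch of that decoding, assuming this bit order holds for every flag. The names `FLAG_NAMES`, `decode_flag` and `format_flag` are hypothetical and are not part of ARIBA.

```python
# Flag names in the order printed by `ariba flag`.
# Assumption: the i-th name corresponds to bit i (value 2**i) of the flag.
FLAG_NAMES = [
    "assembled",
    "assembled_into_one_contig",
    "region_assembled_twice",
    "complete_gene",
    "unique_contig",
    "scaffold_graph_bad",
    "assembly_fail",
    "variants_suggest_collapsed_repeat",
    "hit_both_strands",
    "has_variant",
    "ref_seq_choose_fail",
]


def decode_flag(flag):
    """Return a dict mapping each flag name to True/False for the given integer."""
    return {name: bool(flag & (1 << i)) for i, name in enumerate(FLAG_NAMES)}


def format_flag(flag):
    """Render the decoded flag in the same layout as the output shown above."""
    lines = ["Meaning of flag {}".format(flag)]
    for name, is_set in decode_flag(flag).items():
        lines.append("[{}] {}".format("X" if is_set else " ", name))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_flag(27))
```

This can be handy for filtering a report programmatically, for example keeping only rows where `decode_flag(flag)["complete_gene"]` is true. For anything authoritative, use `ariba flag` itself.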

The meanings are as follows:

  • assembled: the assembly is compared to the reference sequence using nucmer. If at least 95% of the reference sequence has nucmer matches to the assembly, then assembled is set to true. The 95% is a default value that can be changed with the option --assembled_threshold. Note that this says nothing about how many contigs represent the gene (see assembled_into_one_contig).

  • assembled_into_one_contig: this is set to true if assembled is true, and also there is a single contig with a nucmer match that covers at least 95% of the reference sequence. Note that there could still be other contigs that match the reference (see region_assembled_twice).

  • region_assembled_twice: this is set to true if more than 3% of the reference sequence has more than one match to the assembly. The 3% cutoff can be changed with the option --unique_threshold.

  • complete_gene: if there is a match to the full length of the reference sequence, or if the match is not quite complete, then ARIBA will try to extend it to the nearest start and stop codons. If this is successful, and the only stop codon is at the end of the inferred gene sequence, then complete_gene is set to true. This will never be set if the reference is a non-coding sequence.

  • unique_contig: this is set to true if there is exactly one contig in the assembly that has nucmer matches to the reference sequence.

  • scaffold_graph_bad: the reads are mapped back to the assembly, and links between the contigs from read pair information are used to construct a scaffolding graph. If there is any ambiguity in this graph, for example the end of contig A could join to the start of contig B or contig C, then scaffold_graph_bad is set to true.

  • assembly_fail: this is set to true when the assembler produces no output. The most likely cause is a few reads spuriously mapped to the reference sequence, whose depth is too low to assemble.

  • variants_suggest_collapsed_repeat: after mapping the reads back to the assembly, variants are called using samtools. If samtools calls any variants in any position that matches to the reference gene, then this is set to true. It suggests that the assembly has collapsed more than one sequence down into one sequence, hence the reads suggesting variants. Alternatively, this could be caused by a mixed input sample.

  • hit_both_strands: this means there is a contig that has two (or more) matches to the reference, but the matches are in opposite orientations.

  • has_variant: this is set to true if there is any variant between the assembly and the reference. For a noncoding sequence, this means any nucleotide change. For a gene, this means any non-synonymous change. The exception is known variants: a known variant is only counted when the assembly has the variant type, as opposed to the wild type (bear in mind that the reference could have the wild type or the variant type).

  • ref_seq_choose_fail: this is set to true if something went wrong when trying to find the closest reference sequence within a cluster.
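
The coverage-based criteria above (assembled, assembled_into_one_contig, region_assembled_twice and unique_contig) can be illustrated with a small Python sketch. This is not ARIBA's code: the function `coverage_flags`, its input format (a list of `(contig_name, ref_start, ref_end)` tuples with 0-based, half-open reference coordinates) and its parameter names are all hypothetical, and ARIBA's real handling of nucmer matches may differ in detail. The defaults of 0.95 and 0.03 simply mirror the 95% and 3% cutoffs described above.

```python
def coverage_flags(ref_length, matches, assembled_threshold=0.95, unique_threshold=0.03):
    """Illustrative only: derive four flags from match intervals on the reference.

    matches: list of (contig_name, ref_start, ref_end), 0-based half-open.
    """
    # Count how many matches cover each reference position.
    depth = [0] * ref_length
    longest = 0
    for _, start, end in matches:
        start, end = max(start, 0), min(end, ref_length)
        longest = max(longest, end - start)
        for pos in range(start, end):
            depth[pos] += 1

    covered = sum(1 for d in depth if d >= 1) / ref_length  # matched at least once
    multi = sum(1 for d in depth if d > 1) / ref_length     # matched more than once

    assembled = covered >= assembled_threshold
    return {
        "assembled": assembled,
        # A single match must itself span the threshold fraction of the reference.
        "assembled_into_one_contig": assembled and longest / ref_length >= assembled_threshold,
        "region_assembled_twice": multi > unique_threshold,
        "unique_contig": len({name for name, _, _ in matches}) == 1,
    }


if __name__ == "__main__":
    # One contig covering 98% of a 1000 bp reference.
    print(coverage_flags(1000, [("contig1", 0, 980)]))
    # Two contigs that together cover the reference and overlap by 50 bp (5%).
    print(coverage_flags(1000, [("c1", 0, 600), ("c2", 550, 1000)]))
```

In the second call the reference is fully covered, so assembled is true, but no single match reaches 95%, the 5% overlap exceeds the 3% cutoff, and two contigs match, so the result differs from the first call on the other three flags.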