Autocycler clean - rrwick/Autocycler GitHub Wiki
Autocycler clean lets users manually remove or duplicate sequences in a consensus assembly produced by Autocycler combine. This is typically done when the consensus assembly is not fully resolved (one or more clusters contain multiple tigs), which often occurs with linear sequences due to inconsistent ends between input contigs. After removing/duplicating sequences, any linear paths will be merged together, which can allow the incomplete part of the genome to be resolved into a single sequence.
autocycler clean -i autocycler_out/consensus_assembly.gfa -o autocycler_out/cleaned_assembly.gfa -r 7,8
This command removes tigs 7 and 8 from consensus_assembly.gfa, merges any linear paths in the graph and then saves the result to cleaned_assembly.gfa. If you need your result in FASTA format, you can use Autocycler gfa2fasta to convert.
Usage: autocycler clean [OPTIONS] --in_gfa <IN_GFA> --out_gfa <OUT_GFA>
Options:
-i, --in_gfa <IN_GFA> Autocycler GFA file (required)
-o, --out_gfa <OUT_GFA> Output GFA file (required)
-r, --remove <REMOVE> Tig numbers to remove from the input graph
-d, --duplicate <DUPLICATE> Tig numbers to duplicate in the input graph
-h, --help Print help
-V, --version Print version
- Viewing consensus_assembly.gfa in Bandage can be useful for determining which tigs should be kept/deleted.
- Depth values in consensus_assembly.gfa represent the number of input sequences that contributed to each sequence. Users will typically delete lower-depth tigs to prioritise resolving higher-depth tigs.
- The output GFA created by Autocycler clean is also a valid input. This means that you can clean in multiple stages, using the output of one round of cleaning as the input for the next.
- The values for -r and -d can contain spaces if they are enclosed in quotes. For example: -r "7, 8". This allows for copy-pasting from Bandage's 'selected nodes' list.
- If neither -r nor -d is specified, no sequences will be removed or duplicated, and the input graph will remain unchanged.
- If any of the specified tigs in -r or -d do not exist in the input graph, Autocycler clean will return an error and terminate. Ensure tig IDs match those in the GFA.
- The only tigs which can be duplicated are those which have exactly two links to other tigs. Each copy of the duplicated tig will keep one of the links.
- If an invalid tig ID is specified in -d (e.g., a tig with more than two links), Autocycler will return an error and terminate. Verify tig IDs and their properties in Bandage before running the command.
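As a quick pre-check of the two-link rule before using -d, the links touching each tig can be counted directly from the GFA text. The following is only a sketch and is not part of Autocycler: the function name gfa_link_counts is hypothetical, it assumes a GFA v1 file with tab-separated S and L lines, and Autocycler's own validation may differ in detail.

```shell
# Count the distinct links touching each tig in a GFA graph.
# Assumes GFA v1: tab-separated S (segment) and L (link) lines.
gfa_link_counts() {
    # $1 = path to the GFA file
    awk -F'\t' '
        function flip(o) { return (o == "+") ? "-" : "+" }
        $1 == "S" { names[$2] = 1 }
        $1 == "L" {
            # A link and its reverse-complement form describe the same
            # connection, so reduce both to one canonical key.
            fwd = $2 $3 "\t" $4 $5
            rev = $4 flip($5) "\t" $2 flip($3)
            key = (fwd < rev) ? fwd : rev
            if (key in seen) next
            seen[key] = 1
            # Skip self-links: only links to other tigs are counted.
            if (($2 "") != ($4 "")) { links[$2]++; links[$4]++ }
        }
        END { for (n in names) print n "\t" (links[n] + 0) }
    ' "$1" | LC_ALL=C sort -k1,1n
}
```

Running gfa_link_counts autocycler_out/consensus_assembly.gfa prints one line per tig (name, then link count). Piping the output through awk '$2 == 2' leaves only the tigs with exactly two links, which are the candidates worth checking in Bandage before duplication.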
Here is an example assembly graph (consensus_assembly.gfa) made by Autocycler combine that is not fully resolved, as visualised by Bandage:
As you can see, the chromosome and five plasmids have assembled to completion (one sequence), but one linear plasmid did not. This is made obvious by the fact that consentigs are coloured blue and other sequences are coloured orange.
Here is a zoomed-in view of the incomplete plasmid, with Bandage's labels for name (top), length (middle) and depth (bottom):
For a final Autocycler graph, depth refers to how many input assemblies were used to create the sequence. The bulk of this plasmid is in sequence 3, a 90 kbp consentig made from 17 input assemblies.
On the right side, the linear plasmid has a hairpin end. Sequences 4, 9 and 12 have decent depth (almost as high as the main consentig) but sequences 5 and 8 have a depth of only 1× (i.e. supported by just one input assembly). To clean up this end of the plasmid, it therefore makes sense to delete sequences 5 and 8, which will allow the other sequences (4, 9 and 12) to merge with the main consentig.
On the left side, the linear plasmid has an open end. All sequences here have lower depths, i.e. only a small number of input assemblies support them. The non-integer depths (sequences 6 and 10) are because linear paths with different depths were merged together. To clean up this end of the plasmid, one could take two approaches: either delete all of these sequences (because they are low depth relative to the main consentig) or delete just sequences 7 and 11 (the shorter and lower-depth options at each branch). The former approach errs on the side of too little sequence and the latter errs on the side of too much.
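The depth comparison described above can also be done from the command line by listing each tig's length and depth. This is a minimal sketch, not an Autocycler feature: the function name gfa_tig_depths is hypothetical, and it assumes that each S line carries its sequence and stores depth in a DP:f: tag (tigs without that tag are reported as NA).

```shell
# List tig name, sequence length and depth, lowest depth first.
# Assumes tab-separated GFA S lines with a DP:f:<depth> tag.
gfa_tig_depths() {
    # $1 = path to the GFA file
    awk -F'\t' '
        $1 == "S" {
            depth = "NA"
            # Optional tags start at the fourth column.
            for (i = 4; i <= NF; i++)
                if ($i ~ /^DP:f:/) depth = substr($i, 6)
            print $2 "\t" length($3) "\t" depth
        }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k3,3n -k1,1n
}
```

Because the output is sorted by depth, the low-depth tigs that are candidates for -r appear at the top, and the length column helps when choosing between the shorter and longer option at a branch.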
Here is an Autocycler clean command which removes the 1× depth sequences from the graph and then merges linear paths:
autocycler clean -i autocycler_out/consensus_assembly.gfa -o autocycler_out/cleaned_assembly.gfa -r 5,7,8,11
And here is the resulting cleaned_assembly.gfa visualised by Bandage:
The linear plasmid now consists of a single sequence, completing the assembly. However, since this took the erring-on-too-much-sequence approach for the linear plasmid's open end, trimming that contig may be warranted. This could be done by aligning the reads to the assembly and looking for a position where the read depth drops off sharply.
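One way to look for such a drop-off is to summarise per-base read depth in fixed-size windows along the contig. The sketch below is an illustration only: the function name window_depth is hypothetical, and it assumes a three-column, tab-separated depth table (contig, 1-based position, depth), which is the layout written by samtools depth -a after the reads have been aligned and sorted.

```shell
# Mean read depth in fixed-size windows along one contig.
# Input columns (tab-separated): contig, 1-based position, depth.
window_depth() {
    # $1 = depth table, $2 = contig name, $3 = window size in bp
    LC_ALL=C awk -F'\t' -v contig="$2" -v win="$3" '
        $1 == contig {
            w = int(($2 - 1) / win)        # zero-based window index
            sum[w] += $3; n[w]++
            if ($2 > last[w]) last[w] = $2 # last position seen in window
            if (w > max) max = w
        }
        END {
            # Output: window start, window end, mean depth
            for (w = 0; w <= max; w++)
                if (n[w] > 0)
                    printf "%d\t%d\t%.1f\n", w * win + 1, last[w], sum[w] / n[w]
        }
    ' "$1"
}
```

For example, window_depth depth.tsv 3 1000 would report mean depth per 1 kbp window for a contig named 3 (both the file name and the contig name are placeholders). A run of windows at the open end with much lower depth than the rest of the contig suggests where trimming might be appropriate.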
This example involves another linear plasmid, this time with a terminal inverted repeat (TIR) at its ends. This will require running Autocycler clean twice: first to remove low-depth sequences and then to duplicate the TIR. Many thanks to Tom Raaymakers for providing the data for this example!
Here is the not-fully-resolved consensus_assembly.gfa assembly graph:
And here is a zoomed-in view of the incomplete plasmid, with Bandage's labels for name (top), length (middle) and depth (bottom):
Unlike the previous example, this plasmid has a TIR, so both ends of its main consentig connect to the unresolved part.
As with the previous example, we can clean this graph by removing the lower-depth sequences from the graph. We'll keep tigs 2, 3, 14, 12, 9, 13, 10 and 5 (tracing the highest-depth path), which means we'll remove tigs 4, 6, 7, 8, 11 and 15. Here is the command:
autocycler clean -i autocycler_out/consensus_assembly.gfa -o autocycler_out/cleaned_assembly_1.gfa -r 4,6,7,8,11,15
After cleaning, the TIR is now a single sequence, but the assembly is still incomplete:
So it is necessary to run Autocycler clean once more, this time duplicating the TIR so each end of the plasmid has a copy. The TIR is tig 4 (note that tigs are renumbered when Autocycler clean is run), so here is the command:
autocycler clean -i autocycler_out/cleaned_assembly_1.gfa -o autocycler_out/cleaned_assembly_2.gfa -d 4
This now results in a fully-resolved assembly graph:
Note that this process has relied on an assumption: that both TIR ends of the linear plasmid are identical (but opposite strands). If this was not the case, i.e. if the two ends had sequence differences, then our assembly will have erased those differences. For this reason, post-Autocycler polishing (e.g. with Medaka) is a good idea, as it can reintroduce any lost variation in the TIR. And as with the previous example, our assembled sequence may contain too many bases, so manual trimming may be warranted.