Generating input assemblies - rrwick/Autocycler GitHub Wiki
Basics
Each of the read subsets made by Autocycler subsample can now be assembled, producing multiple alternative assemblies of the same genome. It is beneficial to use different assemblers, as different tools can make different errors. While these assemblies are not produced by Autocycler, the Autocycler helper command makes them easier to run.
Importantly, the input assemblies should be complete, i.e. each sequence in the genome should be assembled into a single contig. Autocycler can tolerate a small number of fragmented input assemblies, but if most or all input assemblies are fragmented, Autocycler will not work well. In order to achieve this, you will likely need reads at least as long as the longest repeat in the genome. For most bacterial genomes, 10 kbp reads will be sufficient, but some genomes with long repeats will require longer reads.
Example commands
threads=16 # set as appropriate for your system (no more than 128)
genome_size=$(autocycler helper genome_size --reads "$reads" --threads "$threads") # can set this manually if you know the value
mkdir assemblies
for assembler in canu flye metamdbg miniasm necat nextdenovo plassembler raven; do
for i in 01 02 03 04; do
autocycler helper "$assembler" --reads subsampled_reads/sample_"$i".fastq --out_prefix assemblies/"$assembler"_"$i" --threads "$threads" --genome_size "$genome_size"
done
done
These commands assume that Autocycler subsample was run with the default value of --count 4
.
Notes
- Autocycler itself is relatively quick to run, so this step (generating input assemblies) will be the slowest step of a full Autocycler assembly. See Parallelising input assemblies for some suggestions on speeding this up.
- While the above commands are written to run in serial (one after another), the input assemblies can be generated in parallel to save time, if you have the computational resources to do so.
- Canu assemblies can be very accurate, but it often takes much longer to run than other assemblers. I recommend using Canu if you have the time and computational resources, but you may wish to exclude it for performance reasons.
- In my experience, NECAT can sometimes crash and fail to produce an assembly. Re-running with a different thread count often allows it to complete successfully.
- Assemblies made by Plassembler will only include plasmids (i.e. will exclude the chromosome). Using its assembler with Autocycler can improve the recovery of small plasmids, which other long-read assemblers often struggle with.
Optional manual intervention
Before continuing to the next step, consider inspecting each input assembly to identify and remove incomplete or problematic assemblies. This optional step helps ensure the rest of Autocycler's pipeline proceeds smoothly.
Consider the following:
- Check chromosome length: For most bacterial genomes, a single chromosome constitutes the majority of the genome. If an assembly lacks a contig of the expected chromosomal length, it is likely fragmented. For example, an E. coli chromosome is ~5 Mbp in length, so a complete E. coli assembly should contain a ~5 Mbp contig. If an E. coli assembly instead contains a 3 Mbp and 2 Mbp contig, it is almost certainly incomplete.
- Visual inspection of GFA: For assemblers that produce a GFA graph file, a visual inspection in Bandage (or the currently maintained fork BandageNG) can be useful. Most bacterial sequences are circular, and a contig with a single circularising link (connecting its start to its end) in the graph is likely complete.
- Input read sufficiency: If one or two or the input assemblies contain incomplete sequences, that's generally not a concern. But if many or most input assemblies are fragmented, that suggests that your input read set may be insufficient. For example, your genome may contain a long repeat exceeding your read length.
- What to remove: If an assembly has a fragmented sequence, you can delete the fragmented contigs but leave the rest of the assembly. For example, if an assembly had an incomplete chromosome but all the plasmids were complete, then you can just delete the chromosomal fragments and leave the complete plasmids.
It is also possible to add hints to the FASTA headers of input assemblies which will affect their weight in the clustering and consensus steps. See Influencing Autocycler via contig headers for more details.