Manually curated assembly - rrwick/Autocycler GitHub Wiki
This page follows the same steps as the Fully Automated Assembly page but adds optional manual steps in which you can inspect intermediate outputs and make adjustments, so that the final consensus assembly is as accurate as possible.
Steps 1 and 2: subsample reads and generate input assemblies
reads=ont.fastq.gz # your read set goes here
threads=16 # set as appropriate for your system (no more than 128)
genome_size=$(genome_size_raven.sh "$reads" "$threads") # can set this manually if you know the value
autocycler subsample --reads "$reads" --out_dir subsampled_reads --genome_size "$genome_size"
mkdir assemblies
for assembler in canu flye miniasm necat nextdenovo raven; do
    for i in 01 02 03 04; do
        "$assembler".sh subsampled_reads/sample_"$i".fastq assemblies/"$assembler"_"$i" "$threads" "$genome_size"
    done
done
# Optional step: remove the subsampled reads to save space
rm subsampled_reads/*.fastq
Manual step: curate input assemblies
At this stage, you can inspect each input assembly and decide whether you want to delete or modify it before continuing with Autocycler. See the Generating input assemblies page for more details.
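As a starting point for that inspection, the following is a minimal sketch that prints the contig count and total length of each input assembly. It assumes the assemblies are FASTA files with a `.fasta` extension in the `assemblies` directory; the function names `fasta_summary` and `summarise_assemblies` are hypothetical helpers and not part of Autocycler.

```shell
# Hypothetical helper: summarise one FASTA file as
# name, contig count and total bases (multi-line records are handled).
fasta_summary() {
    awk -v name="$1" '
        /^>/ { contigs++; next }      # header line: count one contig
        { bases += length($0) }       # sequence line: add its length
        END { printf "%s\t%d\t%d\n", name, contigs, bases }
    ' "$1"
}

# Hypothetical helper: print a table covering every *.fasta file in a directory.
# The .fasta extension is an assumption; adjust the glob if yours differs.
summarise_assemblies() {
    printf "assembly\tcontigs\tbases\n"
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue       # skip if the glob matched nothing
        fasta_summary "$f"
    done
}

# Usage: summarise_assemblies assemblies
```

An assembly whose total length is far from `$genome_size`, or whose contig count is much higher than the others, may be worth a closer look before you decide whether to keep it.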
You can also adjust the relative weights of contigs for the clustering and consensus steps by adding hints to their FASTA headers. See the Influencing Autocycler via contig headers page for more details.
Steps 3 and 4: compress and cluster input assemblies
autocycler compress -i assemblies -a autocycler_out
autocycler cluster -a autocycler_out
Manual step: curate clusters
At this stage, you can inspect the clustering and, if desired, modify it before continuing with Autocycler. See the Autocycler cluster page for more details.
Steps 5 and 6: trim and resolve each QC-pass cluster
for c in autocycler_out/clustering/qc_pass/cluster_*; do
    autocycler trim -c "$c"
    if [[ $(wc -c <"$c"/1_untrimmed.gfa) -lt 1000000 ]]; then
        autocycler dotplot -i "$c"/1_untrimmed.gfa -o "$c"/1_untrimmed.png
        autocycler dotplot -i "$c"/2_trimmed.gfa -o "$c"/2_trimmed.png
    fi
    autocycler resolve -c "$c"
done
The above loop also runs Autocycler dotplot on clusters smaller than ~1 Mbp, for both the untrimmed and trimmed sequences. The size limit is there because Autocycler dotplot is fast on small sequences (e.g. plasmids) but can take a while to finish on longer sequences (e.g. chromosomes).
Manual step: examine dotplots
After trimming, you can visually inspect each cluster's dotplots, which can show the effects of trimming and reveal potential structural issues. See the Autocycler dotplot page for more information.
Manual step: examine Autocycler bridging
In this step, you can review how Autocycler has bridged the sequences to form a consensus. This can be useful for identifying regions where sequence ambiguity remains. In particular, it can be helpful to examine each cluster's 4_merged.gfa file to see if there is structural heterogeneity or conflicts between assemblies, which may suggest areas to review or adjust manually.
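Before opening each graph, a quick count of its segments and links can help you decide which clusters to look at first. The sketch below relies only on the GFA convention that segment lines start with `S` and link lines start with `L`; the function name `gfa_counts` is a hypothetical helper and not part of Autocycler.

```shell
# Hypothetical helper: count segment (S) and link (L) lines in a GFA file.
# Output columns: file name, number of segments, number of links.
gfa_counts() {
    awk -F'\t' -v name="$1" '
        $1 == "S" { segs++ }
        $1 == "L" { links++ }
        END { printf "%s\t%d\t%d\n", name, segs, links }
    ' "$1"
}

# Usage, with the cluster paths used on this page:
# for c in autocycler_out/clustering/qc_pass/cluster_*; do
#     gfa_counts "$c"/4_merged.gfa
# done
```

A cluster whose 4_merged.gfa contains many segments is a reasonable candidate for closer review. The counts only indicate where to look and do not replace viewing the graph itself.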
Step 7: combine resolved clusters into a final assembly
autocycler combine -a autocycler_out -i autocycler_out/clustering/qc_pass/cluster_*/5_final.gfa
The final consensus assembly will be saved as autocycler_out/consensus_assembly.fasta.
Manual step: remove any extraneous sequences
If the consensus assembly is not fully resolved, viewing the assembly graph (consensus_assembly.gfa) in Bandage can reveal any problematic parts of the assembly. It may then be possible to use Autocycler clean to remove unwanted tigs to allow for a fully resolved assembly.
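To see which tigs are present before deciding what to remove, the following minimal sketch lists each segment in a GFA file with its length, longest first. It assumes that sequences are stored inline in the third column of each `S` line and that the graph sits next to the FASTA at `autocycler_out/consensus_assembly.gfa`; the function name `gfa_tig_lengths` is a hypothetical helper and not part of Autocycler.

```shell
# Hypothetical helper: list each segment (tig) in a GFA file with its length,
# sorted from longest to shortest. Output columns: segment name, length.
# Assumes the sequence is stored inline in column 3 (not as "*").
gfa_tig_lengths() {
    awk -F'\t' '$1 == "S" { printf "%s\t%d\n", $2, length($3) }' "$1" |
        sort -t "$(printf '\t')" -k2,2nr
}

# Usage: gfa_tig_lengths autocycler_out/consensus_assembly.gfa
```

The names in the first column are the labels to match against what you see in Bandage when choosing tigs to remove.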