Tutorial (hard) - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki
Welcome to the HARD version of the tutorial. Here you will be given:
- High-level instructions on what to do.
- Goals for each step in the process.
And that's it – no hand holding here!
Required files
If you haven't already, get a hybrid Illumina+ONT read set to assemble:
- Paired-end Illumina reads in FASTQ format
- ONT reads in either FASTQ format (basecalled) or fast5/pod5 format (raw)
You can use the [sample data]] – either the R10.4 or R9.4.1 reads will work. Or you can use your own reads if they meet [the requirements.
ONT basecalling
Your ONT reads are already in FASTQ format, you can skip this step. If your ONT reads are in fast5/pod5 format, read on. The goal of ONT basecalling is to translate the raw-data reads to a more usable FASTQ format, minimising sequence errors.
For ONT basecalling, I recommend using the latest version of their production basecaller. At the time of writing, this is Guppy (available on the ONT community site), though it may be replaced by Dorado in the future. Use the basecalling model which matches your pore/chemistry, and choose the highest-accuracy option (probably called 'super' or 'sup'). Basecalling is computationally intensive and will require a computer with a big GPU.
If your sequencing run used barcodes, Guppy can trim these as it basecalls. (--trim_barcodes
). This also trims adapters sequences (which are outside barcodes). If your run didn't use barcodes, use --trim_adapters
with Guppy.
Depending on the settings used, Guppy may have produced multiple FASTQ files for your isolate. If so, combine these into a single FASTQ so all of your ONT reads are in one file.
Read QC
The goal of read QC is to discard low-quality reads and/or trim off low-quality regions of reads. This will make them easier to use in later steps (assembly and polishing).
Illumina read QC is pretty easy – just run fastp (or a similar tool) to remove adapters and trim off low-quality bases. Assuming nothing is wrong with your Illumina read set, this should only remove a small fraction of the bases, so your post-QC Illumina read set should be almost as big as your pre-QC Illumina read set.
For ONT read QC, I recommended throwing out reads too short to be useful. A low threshold (e.g. discarding reads below 1 kbp) will ensure that any small plasmid reads are retained. If your pre-QC read set has a poor N50, a higher threshold (e.g. 10 kbp) may be appropriate, but small plasmids might be excluded. There is a trade-off between post-QC depth and N50: low thresholds favour depth and high thresholds favour N50. Both depth and N50 are important for assembly, so choose a value appropriate for your reads – e.g. if you have plenty of depth you can use a higher threshold. I also recommended discarding the lowest quality ONT reads. You can rely on the ONT sequencer/basecaller to do this filtering (e.g. only using 'pass' reads and not 'fail' reads). Or you can use Filtlong to discard the worst reads in the set.
At this point you should have post-QC Illumina reads (in two FASTQ files) and post-QC ONT reads (in one FASTQ file).
Trycycler assembly
In the Trycycler assembly step, you will be doing a long-read-only assembly, i.e. only using the ONT reads. The goal is to produce a complete assembly free of any structural errors. Small-scale errors are okay (and probably inevitable) – they will be dealt with in subsequent polishing steps.
I won't say much more about it here, because everything you need to know is in the Trycycler documentation.
Medaka polishing
Medaka polishing aims to fix as many remaining errors in your assembly using only ONT reads. The goal is to produce the best possible ONT-only assembly.
Medaka is pretty easy to run – the only key thing is matching its model to your pore/speed/basecaller. Some examples:
- If you used a R9.4.1 flowcell and basecalled with Guppy v5 super, then use the
r941_min_sup_g507
model. - If you used a R10.4 flowcell and basecalled with Guppy v6 super, then use the
r104_e81_sup_g610
model. - If you used a R10.4.1 flowcell at 400 bps and basecalled with Guppy v6 super, then use the
r1041_e82_400bps_sup_g615
model.
Note that you probably can't get an exact match with your Guppy version, e.g. you might have used Guppy v6.2.1 but there is no r104_e81_sup_g621
model, just r104_e81_sup_g610
. That's okay, just pick the closest one available – if it's the same major version it's probably fine.
Consult the Medaka documentation for more details, but note that you can run Medaka on a per-replicon basis using Trycycler's partitioned reads (read more here).
While Medaka usually makes an assembly better, I have encountered cases where it introduced more errors than it fixed, so you might want to try polishing with Clair3 as well. You can then use ALE to compare your Trycycler, Trycycler+Medaka and Trycycler+Clair3 assemblies (see Assessing quality without a reference). Whichever is best should be used as input for short-read polishing in the next step.
Polypolish polishing
At this point your Trycycler+Medaka assembly is the best possible genome you can get using only ONT reads. If all went well, it should only contain small-scale errors (mostly single-bp substitutions and indels). The goal of short-read polishing is to fix these small-scale errors using Illumina reads.
Polypolish is a very safe polisher, i.e. it's unlikely to introduce any errors during polishing, so you should use it first. Follow the instructions in the Polypolish documentation. Make sure to do the insert-size filter step, as this can help a little bit with errors in repeats.
Other short-read polishing
At this point your Trycycler+Medaka+Polypolish assembly may be perfect! However, since Polypolish is conservative, it may have left some errors, e.g. in regions of low Illumina depth. So you should now try other short-read polishing tools to see if they make any changes. I recommend POLCA (because it has a low chance of introduced errors) and FMLRC2 (because it can sometimes fix errors other tools cannot). ntEdit, HyPo and NextPolish might also be worth a try, see the Polypolish paper for more details on these tools.
However, do NOT assume any changes made in this step are necessarily improvements. Rather, interpret them as hypotheses. Each changed region of the genome should be manually examined in IGV so you can decide whether or not you believe it. There can be tough cases which may take some detective work!
Final steps
Hopefully you now have a perfect genome assembly. If you really need to ensure its perfection, I'd encourage you to do the entire assembly process again, but with different data/tools/choices. If your second genome assembly is base-for-base identical to your first, that increases your confidence in its perfection. If not, you should look closely at the regions where they differ and try to understand what's going on.
I would also recommend doing a short-read-only assembly with your Illumina reads. This can help you to find missing replicons (e.g. due to underrepresentation in your ONT reads) or spurious contigs (e.g. due to contamination in your ONT reads).