Tutorial (medium) - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki
Welcome to the MEDIUM version of the tutorial. Here you will be given:
- Moderately detailed instructions on what to do.
- Goals for each step in the process.
- Expected results after each step.
- Tips and guidelines along the way.
Required files
If you haven't already, download the sample data hybrid Illumina+ONT read set to assemble:
- Paired-end Illumina reads in FASTQ format:
S_aureus_JKD6159_Illumina_1.fastq.gz
: 3.4 million reads, 499 MbpS_aureus_JKD6159_Illumina_2.fastq.gz
: 3.4 million reads, 499 Mbp
- Basecalled ONT R10.4 reads in FASTQ format:
S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz
: 1.8 million reads, 5.6 Gbp
You can also grab my reference assembly (S_aureus_JKD6159.fasta
) to check your finished work.
Read QC
The goal of read QC is to discard low-quality reads and/or trim off low-quality regions of reads. This will make them easier to use in later steps (assembly and polishing).
For Illumina read QC, use fastp to remove adapters and trim off low-quality bases. Its default settings work well, so you just need to give it input and output files. Note that some paired-end reads become orphaned during QC, i.e. their corresponding read is discard so they are no longer part of a pair. This shouldn't be very many of these, so I like to save the orphaned reads into a file, confirm that it only contains a small proportion of the reads, then discard it.
This ONT read set has a poor N50 (4.2 kbp). Throwing out shorter reads will improve the N50 at the cost of depth, but since this read set is so deep (5.6 Gbp), that trade-off is worth it. Run Filtlong with --min_length 6000
to discard reads less than 6 kbp in length. You can then run Filtlong again with --keep_percent 90
to throw out the worst 10% of reads.
At this point you should have post-QC Illumina reads (in two FASTQ files) and post-QC ONT reads (in one FASTQ file). After these QC steps, your Illumina read set won't have changed much, but your ONT read will have a much better N50 (15 kbp) but still plenty of depth (1.8 Gbp).
Trycycler assembly
In the Trycycler assembly step, you will be doing a long-read-only assembly, i.e. only using the ONT reads. The goal is to produce a complete assembly free of any structural errors. Small-scale errors are okay (and probably inevitable) – they will be dealt with in subsequent polishing steps.
Trycycler is multi-step process, and you'll find lots of details in the Trycycler documentation. Briefly, the steps are:
- Run
trycycler subsample
to produce a group of semi-independent read subsets. - Make an assembly from each read subset. It helps to use a variety of assemblers, e.g. Flye, miniasm/Minipolish and Raven.
- Run
trycycler cluster
to cluster contigs from these assemblies. You'll then need to decide which clusters represent genuine replicons (and should be kept) and which are erroneous (and should be discarded). This is a key point of human judgement in the Trycycler pipeline. Hint: the S. aureus JKD6159 genome has one chromosome, one 21 kbp plasmid and a phage sequence that appears in both circular and integrated in the chromosome. You can discard or keep the phage cluster, but note that my reference assembly for this genome does not include a separate contig for the phage (i.e. it only contains the integrated version). - Run
trycycler reconcile
on each cluster. This step usually requires human judgement and interaction. Some contigs may not circularise and will need to be repaired or discarded. Small contigs might have duplicated sequence (e.g. the 21 kbp plasmid assembled into a 42 kbp contig) and will need to be trimmed. Some contigs may have large errors that result in poor alignment with other contigs. This is the hardest part of Trycycler assembly – it gets easier after this! - Run
trycycler msa
on each cluster to generate a multiple sequence alignment. - Run
trycycler partition
to assign each read to a cluster. - Run
trycycler consensus
to generate a consensus sequence for each cluster.
When you are finished, you should have a 7_final_consensus.fasta
file for each of your clusters.
Medaka polishing
Medaka polishing aims to fix as many remaining errors in your assembly using only ONT reads. The goal is to produce the best possible ONT-only assembly.
Medaka is pretty easy to run – the only key thing is matching its model to your pore/speed/basecaller. For this dataset, the r104_e81_sup_g610
model is the best match. Also note that while Medaka can use GPU acceleration, this isn't necessary for a bacterial genome – running it on a CPU is fine.
Since Trycycler partitions reads for each cluster, you can run Medaka on a per-replicon basis using Trycycler's partitioned reads (read more here).
When you are finished, you should have an 8_medaka.fasta
file for each of your clusters. You can now merge these together to form a single FASTA file for your ONT-only S. aureus JKD6159 assembly.
While Medaka usually makes an assembly better, I have encountered cases where it introduced more errors than it fixed, so you might want to try polishing with Clair3 as well. You can then use ALE to compare your Trycycler, Trycycler+Medaka and Trycycler+Clair3 assemblies (see Assessing quality without a reference). Whichever is best should be used as input for short-read polishing in the next step.
Polypolish polishing
At this point your Trycycler+Medaka assembly is the best possible genome you can get using only ONT reads. If all went well, it should only contain small-scale errors (mostly single-bp substitutions and indels). The goal of short-read polishing is to fix these small-scale errors using Illumina reads.
Polypolish is a very safe polisher, i.e. it's unlikely to introduce any errors during polishing, so you should use it first. Follow the instructions in the Polypolish documentation. Briefly, the steps are:
- Aligning your Illumina reads (separately for the
_1
and_2
files) to your assembly using the all-alignments-per-read option. - Running Polypolish's insert size filter to exclude alignments with incongruous insert sizes.
- Running Polypolish with the filtered alignments.
When you are finished, you should have a FASTA file for your Trycycler+Medaka+Polypolish assembly. It's a good idea to run the compare_assemblies.py
script (read more here) to compare this against your Trycycler+Medaka assembly, i.e. to see exactly what changes Polypolish made. The expectation is that Polypolish's changes are small (mostly single-bp) and sparsely distributed across the genome. If you see large changes or clusters of changes, that's a red flag that something odd could be happening at that part of the genome.
Other short-read polishing
At this point your Trycycler+Medaka+Polypolish assembly may be perfect! However, since Polypolish is conservative, it may have left some errors, e.g. in regions of low Illumina depth. So you should now try other short-read polishing tools to see if they make any changes. I recommend POLCA (because it has a low chance of introduced errors) and FMLRC2 (because it can sometimes fix errors other tools cannot). ntEdit, HyPo and NextPolish might also be worth a try, see the Polypolish paper for more details on these tools.
However, do NOT assume any changes made in this step are necessarily improvements. Rather, interpret them as hypotheses. Each changed region of the genome should be manually examined in IGV so you can decide whether or not you believe it:
- Use the
compare_assemblies.py
script (see Comparing assemblies) to find regions with differences. - Align both short and long reads to the assembly before and after changes were made.
- View the alignments in IGV for each region. I find it helpful to have two simultaneous instances of IGV: one for the 'before' assembly and one for the 'after' assembly.
- Sometimes one version of the assembly is clearly better/worse in both Illumina and ONT alignments.
- Sometimes Illumina reads prefer one version of the assembly while ONT reads prefer the other. These are tough cases and may take some detective work! Is the region in a repeat? Is there genomic heterogeneity here? Is the difference in a homopolymer? Is difference in a methylation motif?
Final steps
Hopefully you now have a perfect genome assembly. If you really need to ensure its perfection, I'd encourage you to do the entire assembly process again, but with some differences, for example:
- Different long read type (e.g. R9.4.1 instead of R10.4 – both are available in the sample data)
- Different read QC thresholds
- Different long-read assemblers in the Trycycler pipeline
- Different short-read polishers
If your second genome assembly is base-for-base identical to your first, that increases your confidence in its perfection. If not, you should look closely at the regions where they differ and try to understand what's going on.
I would also recommend doing a short-read-only assembly with your Illumina reads. This can help you to find missing replicons or spurious contigs. For example:
- Are there small plasmids in the short-read-only assembly that are missing from your hybrid assembly? If so, they may have been underrepresented in the ONT reads (read more here).
- Are there contigs in your hybrid assembly without corresponding sequence in the short-read-only assembly? If so, they might have come from ONT barcode leakage.
Check your work
You can now compare your assembly to mine (the reference assembly on the sample data page) to see if there are any differences. The compare_assemblies.py
script will be useful here, but note that it requires contigs to have the same strand and starting position, so you might need to 'rotate' some of your sequences before running it.