Tutorial (easy) - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

Welcome to the EASY version of the tutorial. Here you will be given:

  • Step-by-step instructions on what to do.
  • Exact commands to run.
  • Goals for each step in the process.
  • Expected results after each step.
  • Tips and guidelines along the way.

Assuming your required software is installed and working, this should be (mostly) foolproof!

Required files

If you haven't already, download the sample data hybrid Illumina+ONT read set to assemble:

  • Paired-end Illumina reads in FASTQ format:
    • S_aureus_JKD6159_Illumina_1.fastq.gz: 3.4 million reads, 499 Mbp
    • S_aureus_JKD6159_Illumina_2.fastq.gz: 3.4 million reads, 499 Mbp
  • Basecalled ONT R10.4 reads in FASTQ format:
    • S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz: 1.8 million reads, 5.59 Gbp

You can also grab my reference assembly (S_aureus_JKD6159.fasta) to check your finished work.

Here are commands to download the files:

mkdir reads reference
wget --no-check-certificate -O reference/S_aureus_JKD6159.fasta https://bridges.monash.edu/ndownloader/files/37312027
wget --no-check-certificate -O reads/S_aureus_JKD6159_Illumina_1.fastq.gz https://bridges.monash.edu/ndownloader/files/37312789
wget --no-check-certificate -O reads/S_aureus_JKD6159_Illumina_2.fastq.gz https://bridges.monash.edu/ndownloader/files/37312840
wget --no-check-certificate -O reads/S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz https://bridges.monash.edu/ndownloader/files/37317376

Read QC

The goal of read QC is to discard low-quality reads and/or trim off low-quality regions of reads. This will make them easier to use in later steps (assembly and polishing).

For Illumina read QC, use fastp to remove adapters and trim off low-quality bases. Its default settings work well, so you just need to give it input and output files:

mkdir reads_qc
fastp --in1 reads/S_aureus_JKD6159_Illumina_1.fastq.gz --in2 reads/S_aureus_JKD6159_Illumina_2.fastq.gz --out1 reads_qc/illumina_1.fastq.gz --out2 reads_qc/illumina_2.fastq.gz --unpaired1 reads_qc/illumina_u.fastq.gz --unpaired2 reads_qc/illumina_u.fastq.gz

Note that some paired-end reads become orphaned during QC, i.e. their corresponding read is discard so they are no longer part of a pair. This shouldn't be very many of these, so I like to save the orphaned reads into a file (illumina_u.fastq.gz), confirm that it only contains a small proportion of the reads, then discard it:

rm reads_qc/illumina_u.fastq.gz

This ONT read set has a poor N50 (4.2 kbp). Throwing out shorter reads will improve the N50 at the cost of depth, but since this read set is so deep (5.6 Gbp), that trade-off is worth it. Run Filtlong to discard reads less than 6 kbp in length:

filtlong --min_length 6000 reads/S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz > reads_qc/ont_6k.fastq

You can then run Filtlong again to throw out the worst 10% of reads:

filtlong --keep_percent 90 reads_qc/ont_6k.fastq > reads_qc/ont.fastq
rm reads_qc/ont_6k.fastq

After these QC steps, you should be left with an ONT read set with a much better N50 (15 kbp) but still plenty of depth (1.8 Gbp).

At this point you should have the following post-QC Illumina and ONT reads ready to be assembled:

reads_qc/illumina_1.fastq.gz
reads_qc/illumina_2.fastq.gz
reads_qc/ont.fastq

Trycycler assembly

In the Trycycler assembly step, you will be doing a long-read-only assembly, i.e. only using the ONT reads. The goal is to produce a complete assembly free of any structural errors. Small-scale errors are okay (and probably inevitable) – they will be dealt with in subsequent polishing steps.

Trycycler is multi-step process, and you'll find lots of details in the Trycycler documentation. There are a few steps along the way which require human judgement and intervention. I'll describe the exact commands I needed to run here, but your results might look slightly different (e.g. if you used different versions of the assemblers).

Trycycler step 1: subsampling reads

Run trycycler subsample to produce a group of semi-independent read subsets.

trycycler subsample --reads reads_qc/ont.fastq --out_dir read_subsets --genome_size 2.8m

Providing the genome size is optional but makes the command run faster.

After this finishes, you should have 12 FASTQ files in the read_subsets directory.

Trycycler step 2: assembly

Make an assembly from each read subset. It helps to use a variety of assemblers, here we use Flye, miniasm/Minipolish and Raven.

threads=16  # change as appropriate for your system
mkdir assemblies

flye --nano-hq read_subsets/sample_01.fastq --threads "$threads" --out-dir assembly_01 && cp assembly_01/assembly.fasta assemblies/assembly_01.fasta && rm -r assembly_01
miniasm_and_minipolish.sh read_subsets/sample_02.fastq "$threads" > assembly_02.gfa && any2fasta assembly_02.gfa > assemblies/assembly_02.fasta && rm assembly_02.gfa
raven --threads "$threads" --disable-checkpoints read_subsets/sample_03.fastq > assemblies/assembly_03.fasta

flye --nano-hq read_subsets/sample_04.fastq --threads "$threads" --out-dir assembly_04 && cp assembly_04/assembly.fasta assemblies/assembly_04.fasta && rm -r assembly_04
miniasm_and_minipolish.sh read_subsets/sample_05.fastq "$threads" > assembly_05.gfa && any2fasta assembly_05.gfa > assemblies/assembly_05.fasta && rm assembly_05.gfa
raven --threads "$threads" --disable-checkpoints read_subsets/sample_06.fastq > assemblies/assembly_06.fasta

flye --nano-hq read_subsets/sample_07.fastq --threads "$threads" --out-dir assembly_07 && cp assembly_07/assembly.fasta assemblies/assembly_07.fasta && rm -r assembly_07
miniasm_and_minipolish.sh read_subsets/sample_08.fastq "$threads" > assembly_08.gfa && any2fasta assembly_08.gfa > assemblies/assembly_08.fasta && rm assembly_08.gfa
raven --threads "$threads" --disable-checkpoints read_subsets/sample_09.fastq > assemblies/assembly_09.fasta

flye --nano-hq read_subsets/sample_10.fastq --threads "$threads" --out-dir assembly_10 && cp assembly_10/assembly.fasta assemblies/assembly_10.fasta && rm -r assembly_10
miniasm_and_minipolish.sh read_subsets/sample_11.fastq "$threads" > assembly_11.gfa && any2fasta assembly_11.gfa > assemblies/assembly_11.fasta && rm assembly_11.gfa
raven --threads "$threads" --disable-checkpoints read_subsets/sample_12.fastq > assemblies/assembly_12.fasta

After this finishes, you should have 12 FASTA files in the assemblies directory, each about 3 Mbp in size. If it worked as expected, you can now delete the read subsets to save space.

rm -r read_subsets

Trycycler step 3: clustering contigs

Run trycycler cluster to cluster contigs from these assemblies.

trycycler cluster --assemblies assemblies/*.fasta --reads reads_qc/ont.fastq --out_dir trycycler

You'll now need to decide which clusters represent genuine replicons (and should be kept) and which are erroneous (and should be discarded). This is a key point of human judgement in the Trycycler pipeline. Here's what my output looked like:

trycycler/cluster_001/1_contigs:
  trycycler/cluster_001/1_contigs/A_contig_1.fasta:   2,837,384 bp,  584.1x
  trycycler/cluster_001/1_contigs/B_utg000001c.fasta: 2,818,669 bp,  592.4x
  trycycler/cluster_001/1_contigs/C_Utg528.fasta:     2,818,375 bp,  585.7x
  trycycler/cluster_001/1_contigs/D_contig_1.fasta:   2,852,481 bp,  582.4x
  trycycler/cluster_001/1_contigs/E_utg000001c.fasta: 2,818,138 bp,  595.4x
  trycycler/cluster_001/1_contigs/F_Utg518.fasta:     2,818,138 bp,  595.3x
  trycycler/cluster_001/1_contigs/G_contig_2.fasta:   2,775,541 bp,  581.5x
  trycycler/cluster_001/1_contigs/H_utg000001c.fasta: 2,818,010 bp,  595.3x
  trycycler/cluster_001/1_contigs/I_Utg458.fasta:     2,816,674 bp,  595.4x
  trycycler/cluster_001/1_contigs/J_contig_2.fasta:   2,818,676 bp,  595.4x
  trycycler/cluster_001/1_contigs/K_utg000001l.fasta: 2,749,101 bp,  596.2x
  trycycler/cluster_001/1_contigs/L_Utg486.fasta:     2,818,064 bp,  595.2x

trycycler/cluster_002/1_contigs:
  trycycler/cluster_002/1_contigs/A_contig_3.fasta:      20,730 bp, 1016.0x
  trycycler/cluster_002/1_contigs/B_utg000003c.fasta:    20,730 bp, 1004.9x
  trycycler/cluster_002/1_contigs/C_Utg526.fasta:        20,726 bp, 1004.7x
  trycycler/cluster_002/1_contigs/D_contig_3.fasta:      20,730 bp,  993.7x
  trycycler/cluster_002/1_contigs/E_utg000002c.fasta:    20,729 bp, 1014.0x
  trycycler/cluster_002/1_contigs/F_Utg516.fasta:        20,723 bp, 1010.9x
  trycycler/cluster_002/1_contigs/G_contig_1.fasta:      20,730 bp, 1021.0x
  trycycler/cluster_002/1_contigs/H_utg000002c.fasta:    20,730 bp, 1005.2x
  trycycler/cluster_002/1_contigs/I_Utg460.fasta:        20,725 bp, 1010.0x
  trycycler/cluster_002/1_contigs/J_contig_1.fasta:      41,461 bp,  590.4x
  trycycler/cluster_002/1_contigs/K_utg000002c.fasta:    20,734 bp, 1010.8x
  trycycler/cluster_002/1_contigs/L_Utg484.fasta:        20,729 bp,  993.4x

trycycler/cluster_003/1_contigs:
  trycycler/cluster_003/1_contigs/A_contig_2.fasta:      43,144 bp,  548.9x
  trycycler/cluster_003/1_contigs/B_utg000002c.fasta:    42,633 bp,  192.2x
  trycycler/cluster_003/1_contigs/C_Utg530.fasta:        59,405 bp,  543.4x
  trycycler/cluster_003/1_contigs/D_contig_2.fasta:      43,144 bp,  461.0x
  trycycler/cluster_003/1_contigs/G_contig_3.fasta:      43,141 bp, 1341.3x

Yours might not be identical, but it should be similar. Clusters 1 and 2 are the chromosome and plasmid, and they look to be legit clusters. Viewing the contigs.newick file in FigTree reveals them to be nice tight clusters. Cluster 3 is a bit different: it only showed up in 5 out of the 12 assemblies, and its position in contigs.newick shows that it has some common sequence with cluster 1. This genome has a phage sequence which exists in two forms, integrated into the chromosome and circular, and cluster 3 is the circular form. For simplicity, we will exclude this cluster by renaming its directory.

mv trycycler/cluster_003 trycycler/bad_cluster_003

Trycycler step 4: reconciling clusters

We now run trycycler reconcile on each cluster. This step usually requires human judgement and interaction. Some contigs may not circularise and will need to be repaired or discarded. This is the hardest part of Trycycler assembly – it gets easier after this!

Let's start with cluster 1.

trycycler reconcile --reads reads_qc/ont.fastq --cluster_dir trycycler/cluster_001

My first attempt resulted in this error:

Error: failed to circularise sequence K_utg000001l because it contained too large of a start-end gap. You can either increase the value of the --max_add_seq/--max_add_seq_percent parameters or exclude the sequence altogether and try again.

I.e. K_utg000001l was missing too much sequence to be circularised. Let's exclude it (renaming it to not end in .fasta is sufficient) and try again.

mv trycycler/cluster_001/1_contigs/K_utg000001l.fasta trycycler/cluster_001/1_contigs/K_utg000001l.fasta.bad
trycycler reconcile --reads reads_qc/ont.fastq --cluster_dir trycycler/cluster_001

My second attempt result in this error:

Error: some pairwise worst-1kbp identities are below the minimum allowed value of 25.0%. Please remove offending sequences or lower the --min_1kbp_identity threshold and try again.

In this case, G_contig_1 and I_Utg454 aligned poorly to other contigs, indicating that they have some big errors. I'll exclude these two and try once more.

mv trycycler/cluster_001/1_contigs/G_contig_1.fasta trycycler/cluster_001/1_contigs/G_contig_1.fasta.bad
mv trycycler/cluster_001/1_contigs/I_Utg454.fasta trycycler/cluster_001/1_contigs/I_Utg454.fasta.bad
trycycler reconcile --reads reads_qc/ont.fastq --cluster_dir trycycler/cluster_001

This time the command completed without error, so we're done reconciling cluster 1!

Now let's reconcile cluster 2.

trycycler reconcile --reads reads_qc/ont.fastq --cluster_dir trycycler/cluster_002

This command immediately fails because two contigs (D_contig_3 and J_contig_1) are twice as long as the rest. Let's exclude them and try again.

mv trycycler/cluster_002/1_contigs/D_contig_3.fasta trycycler/cluster_002/1_contigs/D_contig_3.fasta.bad
mv trycycler/cluster_002/1_contigs/J_contig_1.fasta trycycler/cluster_002/1_contigs/J_contig_1.fasta.bad
trycycler reconcile --reads reads_qc/ont.fastq --cluster_dir trycycler/cluster_002

And it now it finished without error!

Each of your clusters should now have a 2_all_seqs.fasta file which contains that cluster's contigs with the same strand, same starting position and clean circularisation.

Trycycler step 5: multiple sequence alignment

Now you can run trycycler msa on each cluster to generate a multiple sequence alignment.

trycycler msa --cluster_dir trycycler/cluster_001
trycycler msa --cluster_dir trycycler/cluster_002

Each of your clusters should now have a 3_msa.fasta file that contains the same sequences as the 2_all_seqs.fasta file but aligned to each other.

Trycycler step 6: read partitioning

Now run trycycler partition to assign each read to a cluster.

trycycler partition --reads reads_qc/ont.fastq --cluster_dirs trycycler/cluster_*

Each of your clusters should now have a 4_reads.fastq file that contains the reads associated with the cluster. These will be used in the next Trycycler step (consensus) and during Medaka polishing.

Trycycler step 7: consensus generation

The final Trycycler step is to run trycycler consensus to generate a consensus sequence for each cluster.

trycycler consensus --cluster_dir trycycler/cluster_001
trycycler consensus --cluster_dir trycycler/cluster_002

When you are finished, you should have a 7_final_consensus.fasta file for each of your clusters. Let's merge them into a single FASTA file.

cat trycycler/cluster_*/7_final_consensus.fasta > trycycler.fasta

Medaka polishing

Medaka polishing aims to fix as many remaining errors in your assembly using only ONT reads. The goal is to produce the best possible ONT-only assembly.

Medaka is pretty easy to run – the only key thing is matching its model to your pore/speed/basecaller. For this dataset, the r104_e81_sup_g610 model is the best match. Also note that while Medaka can use GPU acceleration, this isn't necessary for a bacterial genome – running it on a CPU is fine.

Since Trycycler partitions reads for each cluster, you can run Medaka on a per-replicon basis using Trycycler's partitioned reads (read more here).

medaka_consensus -i trycycler/cluster_001/4_reads.fastq -d trycycler/cluster_001/7_final_consensus.fasta -o trycycler/cluster_001/medaka -m r104_e81_sup_g610
mv trycycler/cluster_001/medaka/consensus.fasta trycycler/cluster_001/8_medaka.fasta
rm -r trycycler/cluster_001/medaka trycycler/cluster_001/*.fai trycycler/cluster_001/*.mmi

medaka_consensus -i trycycler/cluster_002/4_reads.fastq -d trycycler/cluster_002/7_final_consensus.fasta -o trycycler/cluster_002/medaka -m r104_e81_sup_g610
mv trycycler/cluster_002/medaka/consensus.fasta trycycler/cluster_002/8_medaka.fasta
rm -r trycycler/cluster_002/medaka trycycler/cluster_002/*.fai trycycler/cluster_002/*.mmi

When you are finished, you should have an 8_medaka.fasta file for each of your clusters. You can now merge these together to form a single FASTA file for your ONT-only S. aureus JKD6159 assembly.

cat trycycler/cluster_*/8_medaka.fasta > trycycler_medaka.fasta

Polypolish polishing

At this point your Trycycler+Medaka assembly is the best possible genome you can get using only ONT reads. If all went well, it should only contain small-scale errors (mostly single-bp substitutions and indels). The goal of short-read polishing is to fix these small-scale errors using Illumina reads.

Polypolish is a very safe polisher, i.e. it's unlikely to introduce any errors during polishing, so you should use it first.

First, align your Illumina reads (separately for the _1 and _2 files) to your assembly using the all-alignments-per-read option.

threads=16
bwa index trycycler_medaka.fasta
bwa mem -t "$threads" -a trycycler_medaka.fasta reads_qc/illumina_1.fastq.gz > alignments_1.sam
bwa mem -t "$threads" -a trycycler_medaka.fasta reads_qc/illumina_2.fastq.gz > alignments_2.sam

Then run Polypolish's insert size filter to exclude alignments with incongruous insert sizes.

polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam

Then run Polypolish with the filtered alignments.

polypolish polish trycycler_medaka.fasta filtered_1.sam filtered_2.sam > trycycler_medaka_polypolish.fasta
rm *.bwt *.pac *.ann *.amb *.sa *.sam

When you are finished, you should have a FASTA file for your Trycycler+Medaka+Polypolish assembly. It's a good idea to compare (read more here) this against your Trycycler+Medaka assembly, i.e. to see exactly what changes Polypolish made.

compare_assemblies.py trycycler_medaka.fasta trycycler_medaka_polypolish.fasta

The expectation is that Polypolish's changes are small (mostly single-bp) and sparsely distributed across the genome. If you see large changes or clusters of changes, that's a red flag that something odd could be happening at that part of the genome.

Other short-read polishing

At this point your Trycycler+Medaka+Polypolish assembly may be perfect! However, since Polypolish is conservative, it may have left some errors, e.g. in regions of low Illumina depth. So you should now try other short-read polishing tools to see if they make any changes. I recommend POLCA because it has a low chance of introduced errors).

threads=16
polca.sh -a trycycler_medaka_polypolish.fasta -r reads_qc/illumina_1.fastq.gz" "reads_qc/illumina_2.fastq.gz -t "$threads"
seqtk seq trycycler_medaka_polypolish.fasta.PolcaCorrected.fa | paste - - | sort | tr '\t' '\n' > trycycler_medaka_polypolish_polca.fasta  # remove newlines from sequence and restore contig order
rm *PolcaCorrected.fa *.amb *.ann *.bai *.bam *.batches *.bwt *.fai *.names *.pac *.report *.sa *.sam *.success *.vcf

You can then see what changes POLCA made.

compare_assemblies.py trycycler_medaka_polypolish.fasta trycycler_medaka_polypolish_polca.fasta

While there are other short-read polishers you can try (e.g. FMLRC2), they have a much higher risk of introducing errors, so their changes must be viewed sceptically. We will therefore not use them here in the easy tutorial – see the medium tutorial for more details.

Check your work

You can now compare your assembly to mine (the reference assembly on the sample data page) to see if there are any differences. The compare_assemblies.py script will be useful here, but note that it requires contigs to have the same strand and starting position, so you might need to 'rotate' some of your sequences before running it.