Assembly Evaluation - mel-jo/probable-memory GitHub Wiki

Tools used in this step

  • QUAST - Evaluates the quality of genome assemblies by reporting a comprehensive set of metrics that describe the accuracy, completeness, and contiguity of an assembled genome.

Input for this step

.fasta files of the Canu assemblies, Flye assemblies and the reference genomes.

| Strain | Polished Flye assembly | Polished Canu assembly | Reference genome |
|---|---|---|---|
| R7 | polished_SRR24413072_flye_assembly.fasta | polished_SRR24413072_canu_assembly.fasta | R7_genome.fasta |
| HP126 | polished_SRR24413066_flye_assembly.fasta | polished_SRR24413066_canu_assembly.fasta | HP126_genome.fasta |
| DV3 | polished_SRR24413081_flye_assembly.fasta | polished_SRR24413082_canu_assembly.fasta | DV3_genome.fasta |

The Process

QUAST is run on each of the 6 assemblies generated so far in this project, using the reference genome of the corresponding strain.

Code
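```shell
# Load QUAST from the cluster's module system
module load bioinfo-tools
module load quast

# Evaluate one assembly against the reference genome of its strain
quast.py \
  assembly.fasta \
  -R reference_genome.fasta \
  -o output_dir/ \
  --threads 2
```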
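
Since the same command is repeated for all six assemblies, the runs could be scripted. The following is a minimal sketch, not the script used in this project: the file names come from the input table, while the function names and the `quast_<strain>_<assembler>/` output directory naming are assumptions made for illustration. The functions only print the commands so they can be inspected first.

```shell
# Map a strain/assembler pair to its polished assembly file (names from the input table).
assembly_file() {
  case "$1:$2" in
    R7:flye)    echo polished_SRR24413072_flye_assembly.fasta ;;
    R7:canu)    echo polished_SRR24413072_canu_assembly.fasta ;;
    HP126:flye) echo polished_SRR24413066_flye_assembly.fasta ;;
    HP126:canu) echo polished_SRR24413066_canu_assembly.fasta ;;
    DV3:flye)   echo polished_SRR24413081_flye_assembly.fasta ;;
    DV3:canu)   echo polished_SRR24413082_canu_assembly.fasta ;;
    *)          return 1 ;;
  esac
}

# Print the QUAST command for one strain/assembler pair.
# The output directory name is a hypothetical convention.
quast_command() {
  local strain assembler assembly
  strain=$1
  assembler=$2
  assembly=$(assembly_file "$strain" "$assembler") || return 1
  printf 'quast.py %s -R %s_genome.fasta -o quast_%s_%s/ --threads 2\n' \
    "$assembly" "$strain" "$strain" "$assembler"
}

# Print the commands for all six assemblies, one per line.
all_quast_commands() {
  local strain assembler
  for strain in R7 HP126 DV3; do
    for assembler in flye canu; do
      quast_command "$strain" "$assembler"
    done
  done
}
```

After loading the modules, the printed commands can be reviewed and then executed, for example with `all_quast_commands | bash`.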

Each run produces a set of files. Two of them are of most interest for judging the quality of these assemblies: report.html, which contains all the metrics needed to evaluate the assembly, and icarus.html, which opens the Icarus visualizer. Icarus shows the assembly side by side with the reference genome, making it easy to see how much of the genome the assembly covers.
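
For collecting numbers into tables, QUAST also writes a tab-separated report.tsv into the same output directory. The sketch below pulls a single metric row out of that file with awk. The function name `quast_metric` is hypothetical, and it assumes the row label is typed exactly as it appears in the first column of report.tsv.

```shell
# Print the value(s) of one metric row from a QUAST report.tsv.
# Usage: quast_metric "NG50" output_dir/report.tsv
quast_metric() {
  awk -F '\t' -v metric="$1" '
    $1 == metric {
      # Print every column after the metric name, tab-separated
      for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? "\t" : "\n")
    }' "$2"
}
```

If several assemblies were evaluated in one run, the function prints one value per assembly on the same line.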

R7 strain

| Metric | Canu Assembly | Flye Assembly |
|---|---|---|
| Genome fraction (%) | 99.973 | 99.93 |
| Duplication ratio | 1.009 | 1.000 |
| Largest alignment | 4,910,723 | 9,361,194 |
| Total aligned length | 9,410,713 | 9,653,734 |
| NG50 | 4,910,723 | 9,361,194 |
| LG50 | 1 | 1 |
| # Misassemblies | 1 | 0 |
| Misassembled contig length | 39,376 | 0 |
| # mismatches per 100 kbp | 0.34 | 0.21 |
| # indels per 100 kbp | 0.98 | 0.66 |
| # contigs | 5 | 2 |
| Largest contig | 4,910,897 | 9,361,194 |
| Total length | 9,741,684 | 9,653,734 |
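
A two-column comparison like this one could also come from a single QUAST run, because quast.py accepts several assemblies at once and labels their report columns with `-l`. This is a sketch of an alternative to the one-assembly-per-run command used in this step, not what was run here. The function name and the `quast_<strain>_compare/` output directory are assumptions.

```shell
# Print a QUAST command that evaluates the Canu and Flye assemblies of one strain together.
# Usage: compare_command STRAIN CANU_FASTA FLYE_FASTA
compare_command() {
  local strain canu flye
  strain=$1
  canu=$2
  flye=$3
  # -l sets the column labels in the report, in the same order as the assemblies
  printf 'quast.py %s %s -l "Canu,Flye" -R %s_genome.fasta -o quast_%s_compare/ --threads 2\n' \
    "$canu" "$flye" "$strain" "$strain"
}
```

For example, `compare_command R7 polished_SRR24413072_canu_assembly.fasta polished_SRR24413072_flye_assembly.fasta` prints the command for the R7 strain.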

Flye assembly - Most of the genome is covered, but there is a small gap spanning 4 kbp in the middle, at contig_pilon_3.

Canu assembly - Unlike the Flye assembly, there are no gaps, but one contig contains a misassembly.

All things considered, the Flye assembly is the better choice for downstream analyses here because it is more contiguous than the Canu assembly (2 contigs and an NG50 of 9,361,194 versus 5 contigs and an NG50 of 4,910,723). The Flye assembly also has no misassemblies, whereas the Canu assembly has one. Fewer mismatches and indels per 100 kbp in the Flye assembly also imply better base-level accuracy. The Flye assembly's duplication ratio is exactly 1.000, indicating no redundancy in the assembled regions. The Canu assembly, on the other hand, is slightly overrepresented (duplication ratio 1.009), which suggests minor over-assembly or unresolved repeats, as seen above with tig00000002_pilon.
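
One more number can be read off the table as a rough check: the difference between Total length and Total aligned length approximates how much assembled sequence did not align to the reference. The sketch below does that subtraction for comma-formatted values. The helper name `unaligned_bp` is hypothetical, and treating the difference as unaligned sequence is an approximation.

```shell
# Subtract "Total aligned length" from "Total length", accepting comma-formatted numbers.
# Usage: unaligned_bp TOTAL_LENGTH TOTAL_ALIGNED_LENGTH
unaligned_bp() {
  local total aligned
  total=$(printf '%s' "$1" | tr -d ',')
  aligned=$(printf '%s' "$2" | tr -d ',')
  echo $(( total - aligned ))
}

unaligned_bp 9,741,684 9,410,713   # R7 Canu: prints 330971
unaligned_bp 9,653,734 9,653,734   # R7 Flye: prints 0
```

With the R7 values, the Canu assembly carries about 0.33 Mbp that is not part of the aligned length, while the Flye assembly carries none.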

HP126 strain

| Metric | Canu Assembly | Flye Assembly |
|---|---|---|
| Genome fraction (%) | 91.679 | 90.726 |
| Duplication ratio | 1 | 1 |
| Largest alignment | 8,578,855 | 6,973,408 |
| Total aligned length | 8,819,142 | 8,727,420 |
| NG50 | 8,578,855 | 6,973,408 |
| LG50 | 1 | 1 |
| # Misassemblies | 0 | 0 |
| # mismatches per 100 kbp | 0.35 | 0.25 |
| # indels per 100 kbp | 1.53 | 0.87 |
| # contigs | 3 | 4 |
| Largest contig | 8,578,857 | 6,973,408 |
| Total length | 8,819,144 | 8,727,420 |

Flye assembly - The plasmid region is poorly assembled, and 0.5 Mbp towards the end of the genome is not assembled.

Canu assembly - The plasmid region is again poorly assembled, although this assembly covers more of it than the Flye assembly did. It also misses 0.5 Mbp of the genome, this time at the beginning.

Here the Canu assembly is better than the Flye assembly. It is more contiguous, with fewer contigs (3 vs 4) and a larger largest contig (8.58 Mbp compared to 6.97 Mbp for Flye). Both assemblies miss 0.5 Mbp of the genome, at different places, but the Canu assembly captures more of the plasmid region. The Flye assembly has better base-level accuracy, with fewer mismatches and indels, but these differences are relatively minor.

DV3 strain

| Metric | Canu Assembly | Flye Assembly |
|---|---|---|
| Genome fraction (%) | 81.001 | 80.56 |
| Duplication ratio | 1 | 1.003 |
| Largest alignment | 8,035,839 | 6,977,687 |
| Total aligned length | 8,779,469 | 8,757,914 |
| NG50 | 8,035,839 | 6,977,687 |
| LG50 | 1 | 1 |
| # Misassemblies | 0 | 0 |
| # mismatches per 100 kbp | 0.32 | 0.27 |
| # indels per 100 kbp | 1.5 | 1.5 |
| # contigs | 3 | 4 |
| Largest contig | 8,035,839 | 6,977,687 |
| Total length | 8,779,473 | 8,757,915 |

Flye assembly - Unlike with HP126, the plasmid regions are fully covered in this assembly. In return, the genomic regions are poorly assembled.

Canu assembly - Here the plasmid region is less well covered, but more of the genomic region is covered.

Of the three strains, this one has the worst assemblies, with only about 81% of the genome covered at best. Between the two, the Canu assembly is marginally better. Canu assembled the genome into fewer contigs (3 vs 4) and produced a substantially larger largest contig (8.04 Mbp vs 6.98 Mbp), indicating better contiguity. Both assemblies are free of misassemblies and have the same indel rate, though the Flye assembly has slightly fewer mismatches and a slightly higher duplication ratio. The final choice depends on what is prioritized: the genomic region or the plasmid region.

Output for this step

report.html, which has all the information needed to assess the assemblies, and icarus.html, for viewing the alignment of each assembly to its reference.
