03 Assembly evaluation - saltpinna/Genome_analysis_project GitHub Wiki

Quast

The script used to run Quast for quality evaluation of the Canu genome assembly can be found under code/scripts/Quast_script_assembly_eval.sh. It produces a report file containing the results of the evaluation, which is presented below. From the Quast report, we can see that the genome fraction aligned to the reference was 85%. This is not great, but a large majority of the genome could still be aligned to the reference. The length of the largest contig corresponds well to the chromosome of the bacterium, which indicates that my assembly reconstructed this part of the genome well.

Result from Quast report:

Mummer

As an additional analysis step, Mummer was used to evaluate the assembly. The script used to run Mummer for evaluation of the genome assembly and plot the results can be found under code/Mummerplot_script_assembly_eval.sh. This resulted in a Mummerplot dot plot representing the alignment of the assembled genome to a reference genome. The red dots in the Mummerplot represent matches on the forward strand, while blue dots represent matches on the reverse strand. We can see a clear red line of dots with slope 1 in the middle of the plot. This means that the assembly matches the reference genome quite well on the forward strand in this region. We can also see blue lines of dots at the beginning and end of the assembly, which indicate inversions of those sequences relative to the reference. Long, continuous lines like these are a sign of a good assembly. There are still a lot of dots that do not fall on a line, which indicate differences between our assembly and the reference genome.
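The way the dot plot is read can be illustrated with a small sketch: a match that runs in the same direction in both the reference and the assembly has a slope close to +1 (drawn red), while a match that runs backwards in the assembly has a slope close to -1 (drawn blue). The function name and the coordinates below are hypothetical and only serve as an illustration; they are not taken from the Mummer output of this project.

```python
def match_orientation(ref_start, ref_end, qry_start, qry_end):
    """Classify one alignment as forward or reverse and return its slope.

    Coordinates are positions on the reference (x axis) and on the
    assembly (y axis). A reverse-strand match is assumed to be written
    with qry_start > qry_end.
    """
    ref_span = ref_end - ref_start
    qry_span = qry_end - qry_start
    if ref_span == 0 or qry_span == 0:
        raise ValueError("an alignment must span at least two positions")
    slope = qry_span / ref_span
    orientation = "forward" if slope > 0 else "reverse"
    return orientation, slope


# Hypothetical alignments: (ref_start, ref_end, qry_start, qry_end)
alignments = [
    (1000, 2000, 1000, 2000),  # same direction in both sequences
    (1000, 2000, 5000, 4000),  # runs backwards in the assembly
]
for aln in alignments:
    print(aln, match_orientation(*aln))
```

A long run of forward matches therefore shows up as a diagonal line from the lower left to the upper right, and an inverted region as a line going the other way.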

Result from Mummerplot:

Questions

What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...)?

N50, N90 etc. are the contig lengths such that contigs of equal or greater length together contain 50% (or 90% etc.) of the bases in the assembly. They are quality measures because longer contigs indicate that repeat sequences have been spanned and that there are fewer gaps, i.e. the assembler has been able to join more reads into longer contigs. To say anything about the quality of the assembly based on the N50 or N90 value, it is a good idea to relate it to the length of the genome instead of the length of the assembly. This gives the NG50 value, which for this assembly was 30 127 nt. An ideal result would be for the NG50 value to equal the length of the largest contig, which in this case is 2 775 620 nt, since this would mean that the assembler managed to assemble the entire chromosome into one contig. The quality of the assembly is best measured by weighing together all of the measures mentioned above rather than relying on a single one.
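The definition above can be made concrete with a minimal sketch. The function name `nx` and the contig lengths are made up for illustration, and this is not how Quast itself is implemented; it only follows the definition of Nx, and of NGx when a genome length is supplied as `total`.

```python
def nx(lengths, fraction=0.5, total=None):
    """Return the contig length L such that contigs of length >= L
    together contain at least `fraction` of `total` bases.

    With total=None the assembly length is used (N50, N90, ...).
    Passing a genome length as `total` gives NG50, NG90, ...
    """
    if total is None:
        total = sum(lengths)
    threshold = fraction * total
    running = 0
    # Walk through the contigs from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None  # the contigs do not cover the requested fraction


# Hypothetical contig lengths, 300 bases in total
contigs = [100, 80, 50, 30, 20, 10, 10]

print(nx(contigs))                  # N50  -> 80
print(nx(contigs, fraction=0.9))    # N90  -> 20
print(nx(contigs, total=400))       # NG50 for an assumed 400-base genome -> 50
```

In this toy example the NG50 is lower than the N50 because the assumed genome is longer than the assembly, so more contigs are needed to reach half of it.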

How does your assembly compare with the reference assembly? What can have caused the differences?

Compared to the reference assembly, both the total assembly length and the resulting chromosome length were longer in my assembly. This indicates that my assembly probably contains some sequences that do not really belong there. One reason for the differences in assembly quality is that the authors of the article used multiple types of reads (Nanopore and Illumina reads were used to correct the reads and close gaps), as well as some additional software (Celera, BWA, Spades). Different algorithms and different input data can cause differences in assembly length and quality.

Why do you think your assembly is better/worse than the public one?

I think my assembly is worse than the public one because less data was used to create the genome assembly.

Does that mean your assembly is bad? Why?

This does not necessarily mean that my assembly is bad, but given a choice between mine and the public assembly, I would choose to use the public one.