Assessing quality without a reference - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

If you don't have a reliable reference sequence, how can you gauge the quality of your assembly? In particular, how can you compare two alternative versions of an assembly (e.g. before and after polishing) to see which is likely to be better?

A common scenario is at the long-read polishing stage. Imagine you've done a Trycycler assembly and polished with Medaka, and you want to see if Medaka made the assembly better or worse. If Medaka made it better, then you should use the Trycycler+Medaka assembly in subsequent polishing. But if Medaka made it worse, then you should use the Trycycler assembly.

Manually checking differences in IGV is a thorough way to do this, and it is well-suited to the final stages of short-read polishing, where you want to clean up every last error. However, this is quite laborious, so a more broad-brush-stroke approach is better for simple A-vs-B assembly assessment.

We examined this in the Polypolish paper using simulated read sets (see Table S4 and Figure S10), and we found that ALE score was the best predictor if you have short reads, and mean Prodigal length was the best predictor for long-read-only data.

Assessing with ALE

If you have short reads, then running ALE to get an ALE score is the best option. Larger ALE scores are better, and since ALE scores are negative, 'larger' means scores with a smaller magnitude and are closer to zero.

To make this easier, we have created the ale_score.sh script (which you can find in the scripts directory of this repo). Just give it the assembly, the two Illumina read files and a thread count like this:

ale_score.sh assembly.fasta reads/illumina_1.fastq.gz reads/illumina_2.fastq.gz 16

And it will produce a file (the assembly filename with .ale appended) which contains the ALE score.

Assessing with Prodigal

If you don't have short reads, then you can run Prodigal and calculate the mean length of predicted proteins. Larger is better because assembly errors (especially indels) often create stop codons.

To make this easier, we have created the mean_prodigal_length.sh script (which you can find in the scripts directory of this repo). Just give it the assembly like this:

mean_prodigal_length.sh assembly.fasta

And it will produce a file (the assembly filename with .prod appended) which contains the mean Prodigal length.