Effect of assembly fragmentation - rrwick/Verticall GitHub Wiki

Introduction

This page describes a mini-study I did to examine the effect of assembly fragmentation on Verticall results. Specifically, I wanted to answer this question: What is a good target contig N50 for assemblies that are to be used in Verticall? I.e. do assemblies with a contig N50 below a certain value produce inferior results in Verticall?

To examine this, I used eight E. coli chromosomes with completed assemblies: MINF_1A, MINF_9A, MSB1_1A, MSB1_1D, MSB1_4E, MSB1_8B, MSB1_8G and MSB2_1A. These genomes have a range of relatedness, as shown in this pairwise Verticall tree made with the completed assemblies:

Assembly fragmentation tree

These genomes also have some amount of horizontally acquired content – up to ~10% as determined by Verticall.

Methods

To produce fragmented assemblies, I started with the set of eight complete chromosomes and did the following:

Identify the assembly with the highest N50.
Take the largest contig in that assembly and split it at a random position.
Repeat the above two steps until the mean N50 of all eight genomes was below a target value.

This process was carried out for 24 target N50 values evenly spaced on a log scale: 100, 158, 251, 398, 631, 1000, 1585, 2512, 3981, 6310, 10000, 15849, 25119, 39811, 63096, 100000, 158489, 251189, 398107, 630957, 1000000, 1584893, 2511886 and 3981072. This produced 24 sets of eight assemblies, the first with an N50 of ~100, the second with an N50 of ~158, etc.

I then ran Verticall pairwise on each set of assemblies. This gave 56 pairwise comparisons (8² minus 8 self-vs-self) from which I took the mean vertical distance.

Results and discussion

Figure 1: Verticall pairwise distance vs assembly N50. There are 56 lines, each representing an ordered pair of assemblies.

Figure 1 shows that from an N50 of ~100k and up, Verticall's distances are very consistent, producing flat lines in the plot. This means that a Verticall analysis on assemblies with a 100k N50 is nearly the same as a Verticall analysis on completed genomes. As the N50 decreases from 100k down to 1k, the distances accumulate more and more noise and there are a couple instances of strange behaviour (discussed more below). Below an N50 of 1k, Verticall's distances trend downwards – i.e. with extremely fragmented assemblies, Verticall underestimates pairwise distance.

Between an N50 of about 3k to 100k, MSB1_1D behaved strangely when compared to MSB1_1A and MSB1_8G. This can be seen as the green and purple lines which bounce between a distance of about 0.008 and 0.001. This result is actually not primarily due to assembly fragmentation but is instead related to sensitivity to the smoothing factor. This case is described in greater detail on the Effect of smoothing factor page, so read more there.

Another oddity occurred with the MINF_9A-vs-MSB1_8B pair which switched from a very close distance (~0.00006) at large N50s to a greater distance (~0.002) at N50s below 3k. This is because there is horizontal content in this pair (~8.5% of the alignment) which Verticall detected at larger N50s but failed to detect in highly fragmented assemblies. This demonstrates that Verticall's ability to detect horizontal content can break down in very fragmented assemblies.

Conclusions

Based on these results, I recommend an assembly N50 of 100k or greater for ideal results with Verticall. Conveniently, this should be attainable for most bacterial genomes using modern short-read sequencing. Assemblies with an N50 of down to 10k are not ideal but still probably usable. Assemblies with an N50 below 10k are pretty bad, and I wouldn't recommend using them in Verticall.