Illustrated example 1 - rrwick/Verticall GitHub Wiki

This page illustrates the Verticall pairwise process using a real pair of assemblies:

  • INF208: a Klebsiella variicola isolate
  • KSB1_8D: a Klebsiella variicola isolate where some parts of the genome have been replaced by Klebsiella pneumoniae sequence

Klebsiella genomes engage in a lot of homologous recombination, even across species boundaries, making them good to use in Verticall demonstrations!

Both assemblies are complete, which doesn't change how Verticall works (see Effect of assembly fragmentation) but makes for cleaner painted-contig plots.

The plots shown on this page (and the other illustrated example pages) are edited versions of plots made by the Verticall view command.

Alignments

Verticall executed these commands to generate the alignments:

minimap2 -k15 -w10 -d KSB1_8D.mmi KSB1_8D.fasta
minimap2 -c -t 1 --eqx -x asm20 KSB1_8D.mmi INF208.fasta

This produces 126 raw alignments in total, but Verticall culls redundant ones (e.g. a lower-scoring alignment that overlaps with a higher-scoring alignment) resulting in 74 used alignments:

Example 1 alignments
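To make the culling idea concrete, here is a minimal sketch of one way to drop a lower-scoring alignment that overlaps a higher-scoring one. It is not Verticall's actual implementation: the `Alignment` fields, the `cull_overlapping` function and the 50% overlap threshold are all assumptions made for illustration.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    # Hypothetical minimal record; real alignments carry many more fields.
    query_start: int
    query_end: int  # exclusive
    score: int


def cull_overlapping(alignments, max_overlap_fraction=0.5):
    """Greedily keep high-scoring alignments, dropping any alignment whose
    overlap with an already-kept one exceeds the given fraction of its length.
    The threshold is an illustrative assumption, not a Verticall default."""
    kept = []
    for aln in sorted(alignments, key=lambda a: a.score, reverse=True):
        length = aln.query_end - aln.query_start
        redundant = False
        for other in kept:
            overlap = (min(aln.query_end, other.query_end)
                       - max(aln.query_start, other.query_start))
            if overlap > max_overlap_fraction * length:
                redundant = True
                break
        if not redundant:
            kept.append(aln)
    return sorted(kept, key=lambda a: a.query_start)
```

Visiting alignments from best to worst score means that a redundant alignment is always compared against something at least as good as itself, so the better alignment is the one that survives.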

Initial distribution

Verticall automatically chose a window size of 9000 bp and a step of 90 bp, resulting in a total of 50601 windows. Counting the number of differences in each of these windows produced this distribution:

Example 1 distribution

While this might superficially look like a histogram of continuous data, it's not! It's a discrete distribution of genomic distance from difference counts per 9000 bp window: 0/9000, 1/9000, 2/9000, etc. Zooming in reveals the discreteness more clearly:

Example 1 distribution zoomed
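The sliding-window count can be sketched as follows. This is an illustration under assumptions rather than Verticall's own code: it takes a single alignment as a per-position list of 0/1 difference flags (a hypothetical input format) and the function names are invented here.

```python
from collections import Counter
from itertools import accumulate


def window_difference_counts(is_difference, window_size, step):
    """Count differences in each sliding window of one alignment.
    is_difference holds one 0/1 flag per aligned position (assumed format)."""
    # prefix[i] is the number of differences in positions [0, i).
    prefix = [0] + list(accumulate(int(d) for d in is_difference))
    counts = []
    for start in range(0, len(is_difference) - window_size + 1, step):
        counts.append(prefix[start + window_size] - prefix[start])
    return counts


def distance_distribution(counts, window_size):
    """Turn per-window counts into a discrete distribution:
    distance (count / window_size) -> fraction of windows."""
    total = len(counts)
    tally = Counter(counts)
    return {count / window_size: n / total for count, n in sorted(tally.items())}
```

Because every window has the same size, the only distances that can occur are whole-number counts divided by that size, which is why the distribution is discrete.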

I often find these distributions nicer to view using a square-root transformation on the x-axis:

Example 1 distribution sqrt

You can do this in Verticall view with the --sqrt_distance option. Just remember this transformation means the spacing between bars shrinks as you go to the right, so the distribution has more mass on the right than it might appear.
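The shrinking spacing is easy to check numerically. This small sketch uses the 9000 bp window size from this example; the function names are made up for illustration.

```python
import math

WINDOW_SIZE = 9000  # window size chosen by Verticall in this example


def sqrt_position(difference_count, window_size=WINDOW_SIZE):
    """X-axis position of a bar after the square-root transformation."""
    return math.sqrt(difference_count / window_size)


def bar_spacing(difference_count, window_size=WINDOW_SIZE):
    """Gap between the bar for this count and the bar for the next count."""
    return (sqrt_position(difference_count + 1, window_size)
            - sqrt_position(difference_count, window_size))


# The gap between neighbouring bars gets smaller as the count grows.
for count in (0, 1, 10, 100):
    print(count, bar_spacing(count))
```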

Smoothed distribution

In its raw state, the distribution is noisy and has 286 local maxima! This won't do, so Verticall smooths the distribution:

Example 1 distribution smoothed

The smoothed distribution has only three local maxima – much better!
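This page does not say which smoothing method Verticall uses, so the sketch below substitutes a plain moving average to show the general effect: smoothing a noisy series reduces its number of local maxima. Both function names and the `half_width` parameter are assumptions.

```python
def moving_average(values, half_width):
    """Smooth a series with a centred moving average.
    Windows are truncated at the ends, so mass is not exactly preserved."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def count_local_maxima(values):
    """Count peaks, treating a flat-topped plateau as a single peak and
    treating the series as zero beyond both ends."""
    # Collapse runs of equal neighbouring values so plateaus count once.
    collapsed = [v for i, v in enumerate(values) if i == 0 or v != values[i - 1]]
    padded = [0.0] + collapsed + [0.0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )
```

Comparing `count_local_maxima(raw)` with `count_local_maxima(moving_average(raw, half_width))` for increasing `half_width` shows how heavier smoothing leaves fewer peaks.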

Partitioned distribution

Verticall then partitions the distribution into three categories: vertical (blue), horizontal (red) and ambiguous (grey):

Example 1 distribution partitioned

See the Pairwise assembly comparison page for a description of the values (e.g. maxpeak, tv-low, etc.) shown in this plot.

You can see that for this assembly pair, there is one local maximum to the right of the most-massive peak and another to the left. Verticall has therefore labelled some regions of the distribution as horizontal because they are too divergent (right side) and others as horizontal because they are too similar (left side).

Painted alignments

Now that each difference-count-per-window value has been assigned a category, we can 'paint' these onto the alignments. This plot shows all 74 alignments. The y-axis of this plot corresponds to the x-axis of the distributions shown above (again using a square-root transformation), and the squiggly line shows the genomic distance in the sliding windows. The categories (vertical, horizontal and ambiguous) are used to colour the background:

Example 1 painted alignments with ambiguous
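As a rough sketch of the painting step, the function below looks up a category for each window's difference count and merges consecutive windows with the same category into painted runs. The lookup table, the function name and the way windows are mapped to coordinates (each window contributes one step of length) are simplifying assumptions, not a description of Verticall's internals.

```python
from itertools import groupby


def paint_windows(window_counts, category_of_count, step):
    """Merge consecutive windows sharing a category into (start, end, category)
    runs. category_of_count is a hypothetical dict from difference count to
    'vertical', 'horizontal' or 'ambiguous'."""
    painted = []
    position = 0
    categories = (category_of_count[count] for count in window_counts)
    for category, run in groupby(categories):
        run_length = sum(1 for _ in run) * step
        painted.append((position, position + run_length, category))
        position += run_length
    return painted
```

For example, with a step of 90 bp, two windows painted horizontal followed by three painted vertical would give one 180 bp horizontal run and one 270 bp vertical run.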

Verticall then resolves the ambiguous regions based on the neighbouring regions, resulting in this cleaner plot with only two categories (vertical and horizontal):

Example 1 painted alignments

This resolution can also be applied back to the distribution, i.e. the ambiguous regions can be divided up between vertical and horizontal:

Example 1 distribution resolved

Painted contigs

Finally, Verticall transfers this 'paint' back onto the assemblies' contigs.

Here is INF208, the first assembly:

Example 1 painted contigs INF208

And here is KSB1_8D, the second assembly:

Example 1 painted contigs KSB1_8D

Note that while the painted alignments had only two categories (vertical and horizontal), the painted contigs have a third category, unaligned, for regions that did not align to the other assembly.

Distance

The mean distance between these two isolates (using the entirety of their alignments) is 0.01413. This is similar to the value you'd get using Mash (distance=0.01524) or FastANI (identity=98.282%, distance=0.01718).

However, that distance includes the horizontally-acquired regions, so it overestimates the vertical distance (i.e. the distance using only the vertically-inherited parts of the genome). Verticall's mean vertical distance uses only the vertically-painted parts of the alignments and gives a lower distance of 0.00807.
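The arithmetic behind these comparisons can be sketched as below. The FastANI conversion is simply one minus the identity fraction. The `mean_distance` function is an assumed simplification (total differences divided by total aligned length over the selected regions), not necessarily how Verticall computes its values.

```python
def identity_to_distance(identity_percent):
    """Convert a percent identity (e.g. from FastANI) to a distance."""
    return 1.0 - identity_percent / 100.0


def mean_distance(regions, categories=None):
    """Mean distance over painted regions.
    regions: list of (category, aligned_length, differences) tuples, a
    hypothetical format. categories: if given, only those categories count."""
    selected = [r for r in regions if categories is None or r[0] in categories]
    total_length = sum(length for _, length, _ in selected)
    if total_length == 0:
        raise ValueError("no aligned sequence in the selected categories")
    total_differences = sum(diffs for _, _, diffs in selected)
    return total_differences / total_length


# FastANI identity from this page: 98.282% corresponds to about 0.01718.
print(identity_to_distance(98.282))
```

Calling `mean_distance(regions)` uses everything, while `mean_distance(regions, {"vertical"})` excludes horizontally-acquired regions, which lowers the result whenever those regions are more divergent than the rest.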
