Primary vs secondary results - rrwick/Verticall GitHub Wiki
This page uses distances and trees made from a group of Klebsiella pneumoniae genomes to illustrate primary vs secondary results and how --multi can be used.
When Verticall analyses a pair of genomes, there is usually a clear majority component in the distance distribution (see illustrated examples 1, 2 and 3). But sometimes there is a close call, where about half of the distribution supports one distance and half supports another.
Here's an example using two genomes, INF028 and INF153, where about half of the genome supports a close distance and the other half supports a larger distance:
In this case, the closer peak had 52% of the distribution mass, so that is what Verticall used for its primary result, giving a distance of 0.00005. The more distant peak had 43% of the distribution mass, and if that had been used, the distance would have been 0.00476.
Whenever there is a secondary peak with a mass close enough to that of the primary peak (controlled by the --secondary option), Verticall will include both in its TSV. One line will contain the primary result, with primary in the result_level column (see Columns in pairwise tsv file), and another line will contain the secondary result, with secondary in the result_level column.
Here is an example of primary and secondary TSV lines for INF083 and INF201:
assembly_a assembly_b alignment_count n50_alignment_length aligned_fraction mean_distance window_size window_count mean_window_distance median_window_distance mass_peaks result_level peak_window_distance peak_mass alignments_vertical_fraction alignments_horizontal_fraction mean_vertical_window_distance median_vertical_window_distance mean_vertical_distance r/m assembly_a_vertical_fraction assembly_a_horizontal_fraction assembly_a_unaligned_fraction assembly_b_vertical_fraction assembly_b_horizontal_fraction assembly_b_unaligned_fraction
INF083 INF201 75 297216 0.927693605 0.003000368 9200 50498 0.002555766 0.000997074 0.000000000,0.001413043,0.003043478 primary 0.000031606 0.516523533 46.88% 53.12% 0.000092318 0.000051740 0.000082445 68.809523810 41.66% 50.22% 8.13% 42.38% 48.90% 8.72%
INF083 INF201 75 297216 0.927693605 0.003000368 9200 50498 0.002555766 0.000997074 0.000000000,0.001413043,0.003043478 secondary 0.003000510 0.428087835 39.70% 60.30% 0.005831898 0.004760175 0.006235939 0.070159786 37.74% 54.13% 8.13% 37.03% 54.25% 8.72%
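As a rough illustration of how such lines could be inspected, the following sketch groups TSV rows by assembly pair and reports any pair with more than one result. It is not part of Verticall. The helper name `results_by_pair` is hypothetical, only four of the real columns are kept for brevity, and it assumes the distance of interest is the median_vertical_window_distance column, since that column matches the 0.00005 and 0.00476 values quoted above.

```python
import csv
import io

# Reduced, tab-separated excerpt of the two INF083/INF201 lines above.
# Only a handful of the real columns are kept to keep the example short.
PAIRWISE_TSV = (
    "assembly_a\tassembly_b\tresult_level\tpeak_mass\t"
    "median_vertical_window_distance\n"
    "INF083\tINF201\tprimary\t0.516523533\t0.000051740\n"
    "INF083\tINF201\tsecondary\t0.428087835\t0.004760175\n"
)


def results_by_pair(tsv_text):
    """Group rows by (assembly_a, assembly_b), preserving file order."""
    pairs = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row["assembly_a"], row["assembly_b"])
        pairs.setdefault(key, []).append(
            (
                row["result_level"],
                float(row["peak_mass"]),
                float(row["median_vertical_window_distance"]),
            )
        )
    return pairs


pairs = results_by_pair(PAIRWISE_TSV)

# Pairs with more than one line are the close calls described above.
multi_result_pairs = {k: v for k, v in pairs.items() if len(v) > 1}

for (a, b), results in multi_result_pairs.items():
    for level, mass, distance in results:
        print(f"{a} vs {b}: {level:<9} mass={mass:.0%} distance={distance:.5f}")
```

To use this on a real file, the string passed to `results_by_pair` would be replaced with the contents of the pairwise TSV.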
Verticall matrix needs a single distance for each assembly pair, and so its --multi option will control its behaviour whenever there are multiple results for a pair:
- --multi first: this is the default behaviour, and it will use the distance in the first line in the TSV file for that pair. Since Verticall always writes primary lines before secondary lines, this will extract the primary distance. However, users can manually reorder the TSV file or delete lines in the TSV file if they want a specific secondary result used.
- --multi exclude: this will make Verticall discard assemblies that result in multi-result pairs. It discards samples in descending order of how many multi-result pairs they are in, until no more multi-result pairs remain. E.g. if one assembly was in all of the multi-result pairs, it would be the only one excluded.
- --multi high: this will make Verticall always choose the highest distance in each multi-result pair.
- --multi low: this will make Verticall always choose the lowest distance in each multi-result pair.
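The four behaviours can be sketched in a few lines of Python. This is only an illustration of the logic as described above, not Verticall's actual implementation. The function name `choose_distances` is hypothetical, as is the `sample_X` assembly and its distances, and the alphabetical tie-break used for exclude is an assumption, since the text above does not say how ties are resolved.

```python
from collections import Counter


def choose_distances(results, multi="first"):
    """Reduce each pair's list of distances (TSV order, primary first) to one.

    results maps (assembly_a, assembly_b) -> list of distances.
    """
    if multi == "exclude":
        results = dict(results)
        while True:
            # Count how many multi-result pairs each assembly appears in.
            counts = Counter()
            for pair, distances in results.items():
                if len(distances) > 1:
                    counts.update(pair)
            if not counts:
                break
            # Drop the assembly in the most multi-result pairs.
            # Assumption: ties are broken alphabetically.
            worst = max(sorted(counts), key=counts.get)
            results = {p: d for p, d in results.items() if worst not in p}
        return {pair: distances[0] for pair, distances in results.items()}

    pick = {"first": lambda d: d[0], "high": max, "low": min}[multi]
    return {pair: pick(distances) for pair, distances in results.items()}


results = {
    # Primary and secondary distances from the INF083/INF201 TSV lines.
    ("INF083", "INF201"): [0.000051740, 0.004760175],
    # Hypothetical single-result pairs, made up for this example.
    ("INF083", "sample_X"): [0.0040],
    ("INF201", "sample_X"): [0.0041],
}

for mode in ("first", "exclude", "high", "low"):
    print(mode, choose_distances(results, mode))
```

With this toy input, first and low both keep the primary distance for INF083 vs INF201, high keeps the secondary distance, and exclude removes one of the two assemblies so that no multi-result pair remains.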
Sometimes close-call results can lead to distances that are not compatible with trees. Consider these three distances made using --multi first:
- INF028 vs INF153: 0.00005
- INF028 vs INF251: 0.00447
- INF153 vs INF251: 0.00009
INF028 and INF251 are distant from each other, but both are very close to INF153. This violates the triangle inequality (0.00447 is much larger than 0.00005 + 0.00009 = 0.00014) and can thus confuse distance-based tree methods. Four isolates in this collection (INF083, INF153, INF251 and INF252) exhibit behaviour like this.
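The violation can be checked mechanically. The sketch below uses only the three distances listed above and looks for any pair whose direct distance exceeds the detour through the third genome. The helper names `d` and `triangle_violations` are illustrative and not part of Verticall.

```python
from itertools import permutations

# The three --multi first distances listed above.
distances = {
    frozenset(("INF028", "INF153")): 0.00005,
    frozenset(("INF028", "INF251")): 0.00447,
    frozenset(("INF153", "INF251")): 0.00009,
}


def d(a, b):
    """Symmetric lookup of the distance between two assemblies."""
    return distances[frozenset((a, b))]


def triangle_violations(names):
    """Return (a, b, via, direct, detour) where d(a, b) > d(a, via) + d(via, b)."""
    found = []
    for a, via, b in permutations(names, 3):
        # a < b avoids reporting the same unordered pair twice.
        if a < b:
            direct = d(a, b)
            detour = d(a, via) + d(via, b)
            if direct > detour:
                found.append((a, b, via, direct, detour))
    return found


violations = triangle_violations(["INF028", "INF153", "INF251"])
for a, b, via, direct, detour in violations:
    print(f"{a} vs {b}: direct {direct:.5f} > {detour:.5f} via {via}")
```

Running the same check over a full distance matrix would be one way to spot isolates that are likely to end up with strange branch lengths.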
Here is the tree made with --multi first distances, where those four isolates have strange branch lengths:
Here is the tree made with --multi exclude distances, where those four isolates have been removed:
Here is the tree made with --multi high distances, where those four isolates are now on deeper branches:
Here is the tree made with --multi low distances, where those four isolates have become part of the largest clade: