distance tree workflow - rrwick/Verticall GitHub Wiki

This page describes how to use Verticall to produce a distance-based tree. This workflow is best suited to collections of isolates which are too diverse for Gubbins.

Some key points about the distance tree workflow:

- It works with both closely related groups of isolates and highly diverse isolates. However, distance-based trees are not ideal for very closely related groups (e.g. 10s of SNPs between isolates).
- It scales with the square of the number of isolates: O(n²). Small collections should be fast, but large collections (e.g. thousands of isolates) will take a lot of time and/or CPUs.
- It filters out recombination in two ways: 1) by painting the alignments and ignoring horizontal parts and 2) by using the median sliding-window distance which is robust to outliers. This makes it more sensitive than the alignment tree workflow.
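
To get a feel for the O(n²) scaling, here is a small sketch that counts the unordered assembly pairs for a few collection sizes. The `pair_count` function is a hypothetical helper, not part of Verticall. Whether Verticall compares each pair in one or both directions is not stated on this page, so treat the numbers as a rough guide to growth rather than an exact job count.

```shell
# Number of unordered pairs among n assemblies: n * (n - 1) / 2
pair_count() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

for n in 10 100 1000; do
    echo "$n assemblies -> $(pair_count "$n") pairs"
done
```

Going from 100 to 1000 assemblies multiplies the number of pairs by roughly 100, which is why large collections need a lot of time and/or CPUs.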

Requirements

An assembly for each of your genomes in FASTA format.
- Put these in a single directory (the instructions below assume this directory is named assemblies).
- Sample names are taken from the assembly filenames: e.g. Sample_123.fasta is good, assembly.fasta is bad.
- The assemblies cannot contain ambiguous bases. You can use Verticall repair to split contigs at ambiguous bases if necessary.
- Good assemblies (e.g. with a big N50) are better, but fragmented assemblies are okay.
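
One way to check the ambiguous-base requirement before starting is sketched below. It assumes uncompressed files with a `.fasta` extension, and `find_ambiguous` is a hypothetical helper name. It lists every file whose sequence lines contain anything other than A, C, G or T.

```shell
# Print FASTA files in a directory whose sequence lines (non-header lines)
# contain any character other than A, C, G or T in either case.
# Assumes uncompressed *.fasta files.
find_ambiguous() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue
        if grep -v '^>' "$f" | grep -q '[^ACGTacgt]'; then
            echo "$f"
        fi
    done
}
```

Running `find_ambiguous assemblies` should print nothing if all assemblies are clean. Note that files with Windows line endings would also be reported, because the carriage return is not one of the four bases.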

Step 1: pairwise comparisons

This command will perform all-vs-all pairwise comparisons of your assemblies with Verticall pairwise:

verticall pairwise -i assemblies -o verticall.tsv

This scales O(n²) with the number of assemblies, so it may take a long time for large collections. If you have a big computing cluster, you can parallelise the work with the --part option like this:

# First run this to completion to ensure all alignment indices are built:
verticall pairwise -i assemblies -o verticall.tsv --index_only

# Then launch jobs:
for i in {001..100}; do
    job_scheduler "verticall pairwise -i assemblies -o verticall_"$i".tsv --part "$i"/100 --skip_check"
done

# When the jobs are finished, merge the results together:
cat verticall_*.tsv > verticall.tsv
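
Before merging, it may be worth confirming that every job actually produced its output. The sketch below assumes the `verticall_001.tsv` ... `verticall_100.tsv` naming used above. `check_parts` is a hypothetical helper, and its two arguments (filename prefix and number of parts) are illustrative.

```shell
# Report expected part files that are missing or empty.
# $1 = filename prefix (e.g. verticall_), $2 = total number of parts.
# Returns non-zero if any part is missing or empty.
check_parts() {
    prefix=$1
    total=$2
    missing=0
    i=1
    while [ "$i" -le "$total" ]; do
        part=$(printf '%s%03d.tsv' "$prefix" "$i")
        if [ ! -s "$part" ]; then
            echo "missing or empty: $part"
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    [ "$missing" -eq 0 ]
}
```

For example, `check_parts verticall_ 100 && cat verticall_*.tsv > verticall.tsv` only merges when all 100 part files are present and non-empty.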

Step 2: distance matrix

This Verticall matrix command will produce a PHYLIP distance matrix from the TSV file:

verticall matrix -i verticall.tsv -o verticall.phylip
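
A quick shape check on the matrix can catch problems before tree building. This sketch assumes a full square PHYLIP matrix (a first line giving the sample count, then one row per sample with a name followed by one distance per sample) and sample names without whitespace. `check_phylip` is a hypothetical helper.

```shell
# Check that a square PHYLIP distance matrix has the shape its header claims:
# n rows, each with a name plus n distances.
check_phylip() {
    awk '
        NR == 1 { n = $1 + 0; next }                 # header: number of samples
        NF > 0  { rows++; if (NF != n + 1) bad++ }   # name + n distances expected
        END {
            if (rows == n && bad == 0) {
                printf "ok: %d samples\n", n
            } else {
                printf "unexpected shape: header says %d, found %d rows, %d with wrong column count\n", n, rows, bad
                exit 1
            }
        }
    ' "$1"
}
```

Running `check_phylip verticall.phylip` should report a sample count equal to the number of assemblies.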

Step 3: tree

Here's a minimal tree-building command with FastME:

fastme -i verticall.phylip -o verticall.newick
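
As a rough sanity check on the result, the number of tips in a Newick tree equals the number of commas plus one. The sketch below relies on that, assuming the file holds a single tree and no sample name contains a comma. `count_tips` is a hypothetical helper.

```shell
# Count tips in a Newick file: a tree with k tips contains k - 1 commas.
# Assumes one tree per file and no commas inside sample names.
count_tips() {
    commas=$(tr -cd ',' < "$1" | wc -c)
    echo $((commas + 1))
}
```

The output of `count_tips verticall.newick` should match the number of assemblies in the assemblies directory.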

See Distance based tree methods for more info on building trees from a distance matrix.