distance tree workflow - rrwick/Verticall GitHub Wiki
This page describes how to use Verticall to produce a distance-based tree. This workflow is best suited to collections of isolates which are too diverse for Gubbins.
Some key points about the distance tree workflow:
- It works with both closely related groups of isolates and highly diverse isolates. However, distance-based trees are not ideal for very closely related groups (e.g. 10s of SNPs between isolates).
- It scales with the square of the number of isolates: O(n^2). Small collections should be fast, but large collections (e.g. thousands of isolates) will take a lot of time and/or CPUs.
- It filters out recombination in two ways: 1) by painting the alignments and ignoring horizontal parts and 2) by using the median sliding-window distance which is robust to outliers. This makes it more sensitive than the alignment tree workflow.
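To get a feel for the quadratic scaling, the number of assembly pairs can be estimated before starting. This is only a rough sketch: `pair_count` is a hypothetical helper (not part of Verticall), and it assumes each unordered pair is counted once. If comparisons are run in both directions, the figure doubles.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Verticall): number of unordered pairs
# among n assemblies, i.e. n * (n - 1) / 2.
pair_count() {
    local n=$1
    echo $(( n * (n - 1) / 2 ))
}

# Show how the workload grows with collection size.
for n in 10 100 1000; do
    echo "$n assemblies: $(pair_count "$n") pairs"
done
```

A tenfold increase in isolates gives roughly a hundredfold increase in pairs, which is why large collections benefit from splitting the work across many jobs.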
Requirements:
- An assembly for each of your genomes in FASTA format.
  - Put these in a single directory (the instructions below assume this directory is named assemblies).
  - Sample names are taken from the assembly filenames: e.g. Sample_123.fasta is good, assembly.fasta is bad.
  - The assemblies cannot contain ambiguous bases. You can use Verticall repair to split contigs at ambiguous bases if necessary.
  - Good assemblies (e.g. with a big N50) are better, but fragmented assemblies are okay.
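The ambiguous-base requirement can be checked up front with standard shell tools. The following is a minimal sketch, not a Verticall feature: `find_ambiguous` is a hypothetical helper, and it assumes uncompressed files with a `.fasta` extension and treats anything other than A, C, G or T in a sequence line as ambiguous.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of Verticall): print the path of
# every *.fasta file in a directory whose sequence lines contain a character
# other than A, C, G or T (case-insensitive).
find_ambiguous() {
    local dir=$1 f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Drop header lines, then look for any non-ACGT character.
        if grep -v '^>' "$f" | grep -qi '[^ACGT]'; then
            echo "$f"
        fi
    done
}

# Only runs if the directory name assumed by the instructions exists.
if [ -d assemblies ]; then
    find_ambiguous assemblies
fi
```

Any file this prints is a candidate for Verticall repair. Files with Windows line endings would also be reported, since the carriage return is not one of the four bases.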
This command will perform pairwise comparisons between all of the assemblies with Verticall pairwise:
verticall pairwise -i assemblies -o verticall.tsv
This scales O(n^2) with the number of assemblies, so it may take a long time for large collections. If you have a big computing cluster, you can parallelise the work with the --part option like this:
# First run this to completion to ensure all alignment indices are built:
verticall pairwise -i assemblies -o verticall.tsv --index_only
# Then launch jobs:
for i in {001..100}; do
job_scheduler "verticall pairwise -i assemblies -o verticall_"$i".tsv --part "$i"/100 --skip_check"
done
# When the jobs are finished, merge the results together:
cat verticall_*.tsv > verticall.tsv
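Before running the merge, it can be worth confirming that every job actually produced a file, since a failed job would otherwise silently leave a gap in the merged results. The sketch below is an assumption-laden convenience, not part of Verticall: `missing_parts` is a hypothetical helper that relies on the `verticall_001.tsv` style naming used in the loop above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Verticall): list expected part files that
# are missing or empty.
# Usage: missing_parts TOTAL [DIR]
# Assumes files are named verticall_001.tsv, verticall_002.tsv, ...
missing_parts() {
    local total=$1 dir=${2:-.} i f
    for i in $(seq 1 "$total"); do
        f=$(printf '%s/verticall_%03d.tsv' "$dir" "$i")
        # -s is true only if the file exists and has a size above zero.
        [ -s "$f" ] || echo "$f"
    done
}
```

Running `missing_parts 100` in the output directory prints nothing when all 100 parts are present and non-empty. Whether an empty part always indicates a failure is an assumption here, so treat any file it lists as something to inspect rather than as a definite error.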
This Verticall matrix command will produce a PHYLIP distance matrix from the TSV file:
verticall matrix -i verticall.tsv -o verticall.phylip
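A quick sanity check on the matrix is to compare its sample count with the number of assemblies. This sketch assumes the standard PHYLIP layout, in which the first line holds the number of taxa; `phylip_taxa` is a hypothetical helper, not a Verticall command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Verticall): print the taxon count stored
# on the first line of a PHYLIP distance matrix.
phylip_taxa() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Only runs if the matrix from the command above exists.
if [ -f verticall.phylip ]; then
    echo "samples in matrix: $(phylip_taxa verticall.phylip)"
fi
```

The number should normally match the number of files in the assemblies directory. A mismatch is a hint that some pairwise results may be absent from the TSV file.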
Here's a minimal tree-building command with FastME:
fastme -i verticall.phylip -o verticall.newick
See Distance based tree methods for more info on building trees from a distance matrix.