Comparing assemblies - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki
It is often useful to see the differences between two alternative versions of a genome assembly. In particular, I often need to do this during short-read polishing, e.g. when trying to decide if I approve of a change made by a polisher.
For this, I made the compare_assemblies.py
script, which you can find in the scripts directory of this repo. Note that due to its dependency on the edlib package, this script will not run on Python 3.11 or later, so I recommend using Python 3.8-3.10.
Usage
usage: compare_assemblies.py [--padding PADDING] [--merge MERGE] [--aligner {mappy,edlib}] [-h]
assembly_1 assembly_2
Assembly comparison tool
https://github.com/rrwick/Perfect-bacterial-genome-tutorial
Inputs:
assembly_1 First assembly to compare
assembly_2 Second assembly to compare
Settings:
--padding PADDING Bases of additional sequence to show before/after each change (default: 15)
--merge MERGE Changes this close are merged together in the output (default: 30)
--aligner {mappy,edlib} Aligner library: mappy has affine-gap, edlib is more robust (default: mappy)
Help:
-h, --help Show this help message and exit
Example
As an example, imagine you produced an assembly with Trycycler and with Unicycler, and you wanted to see where and how they differ.
Run the tool like this:
compare_assemblies.py trycycler.fasta unicycler.fasta
And it will produce results that look something like this:
trycycler_contig_1 52530-52724: ATGTTGTGGAAATCGCGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGA
unicycler_contig_1 52530-52559: ATGTTGTGGAAATCG---------------------------------------------------------------------------------------------------------------------------------------------------------------------CGTGAAATAATCAGA
*********************************************************************************************************************************************************************
trycycler_contig_1 75329-75359: GTTGGTACATTATATTGTCATTTTGAAAGTA
unicycler_contig_1 75164-75194: GTTGGTACATTATATCGTCATTTTGAAAGTA
*
trycycler_contig_1 176536-176565: TGGTGAAGCGACACT-AGAAAAAATGCTGAA
unicycler_contig_1 176371-176401: TGGTGAAGCGACACTGAGAAAAAATGCTGAA
*
trycycler_contig_1 315199-315229: ACAGGTAAAATCTATTGGTATGTTCTTAATG
unicycler_contig_1 315035-315065: ACAGGTAAAATCTATCGGTATGTTCTTAATG
*
trycycler_contig_1 490814-490844: AATACAAAGGGCAGCAAAACCGCGAGGTCAA
unicycler_contig_1 490650-490680: AATACAAAGGGCAGCGAAACCGCGAGGTCAA
*
trycycler_contig_1 491187-491284: TATTTAACATTCAAA----------TATTTTT----TGGTTAAAGCGATATTGCTTATGAAAATAAAG-CAGT-ATGCGAG--CGCTT-GACTAAAAAGAAATTGTACATTGAAAAC
unicycler_contig_1 491023-491138: TATTTAACATTCAAAAAAATGGGCCTATAGCTCAGCTGGTTAGAGCG-CACGCCTGATAAGCGTGAGGTCGGTGGTTCGAGTCCACTTAGGCCCACCATTATTTGTACATTGAAAAC
********** *** **** * ** *** * * *** * * * * ** * ** * * * ** ** ** *
trycycler_contig_1 492394-492424: AATGCCAATTAATTTGACTTGGGAGTCAGAA
unicycler_contig_1 492248-492278: AATGCCAATTAATTTAACTTGGGAGTCAGAA
*
trycycler_contig_1 494276-494306: TGATGAGGTTAATAGATTCGAGGTGGAAGCA
unicycler_contig_1 494130-494160: TGATGAGGTTAATAGGTTCGAGGTGGAAGCA
*
trycycler_contig_1 494848-494888: AGACCGCAAGGTCTCTTTTTTTTATGTCTAAAACGTCAAAA
unicycler_contig_1 494702-494741: AGACCGCAAGGTCTC-TTTTTTTATATCTAAAACGTCAAAA
* *
trycycler_contig_1 494935-494965: TGAATCTGACGAAACAAGAAAAGAGCGCAAC
unicycler_contig_1 494788-494818: TGAATCTGACGAAACGAGAAAAGAGCGCAAC
*
trycycler_contig_1 584077-584107: CACGGCAAATATTGTTGTGGGCTTTTTTAAA
unicycler_contig_1 583930-583960: CACGGCAAATATTGTCGTGGGCTTTTTTAAA
*
Tips
- It's a standalone script so doesn't require installation, but you'll need the edlib and mappy Python libraries:
python3 -m pip install edlib mappy
- The two assemblies must be very similar: same number of contigs, same contig order, same strand and same starting position. This is to allow per-contig global alignment.
- There are two alignment engines: edlib and mappy
- The default is mappy, which does affine-gap alignment. However, it's really a local aligner, so if two sequences diverge too much, it may fail to produce a global alignment, in which case the script will prompt you to try edlib instead.
- The edlib engine does true global alignment, but without affine-gaps, so indel regions can look messier.
- The output uses 1-based sequence coordinates. I.e. the first base of a 1000-bp contig is position 1 and the last base is position 1000.
- If you don't care about the human-readable output and just want a total difference count between the two assemblies, you can run it like this:
compare_assemblies.py assembly_1.fasta assembly_2.fasta | grep -o "*" | wc -l
. Note that this counts all single-bp differences, so a 5-bp indel will count as five differences.