Comparing assemblies - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

It is often useful to see the differences between two alternative versions of a genome assembly. In particular, I often need to do this during short-read polishing, e.g. when trying to decide if I approve of a change made by a polisher.

For this, I made the compare_assemblies.py script, which you can find in the scripts directory of this repo. Note that due to its dependency on the edlib package, this script will not run on Python 3.11 or later, so I recommend using Python 3.8-3.10.

Usage

usage: compare_assemblies.py [--padding PADDING] [--merge MERGE] [--aligner {mappy,edlib}] [-h]
                             assembly_1 assembly_2

Assembly comparison tool
https://github.com/rrwick/Perfect-bacterial-genome-tutorial

Inputs:
  assembly_1               First assembly to compare
  assembly_2               Second assembly to compare

Settings:
  --padding PADDING        Bases of additional sequence to show before/after each change (default: 15)
  --merge MERGE            Changes this close are merged together in the output (default: 30)
  --aligner {mappy,edlib}  Aligner library: mappy has affine-gap, edlib is more robust (default: mappy)

Help:
  -h, --help               Show this help message and exit

Example

As an example, imagine you produced an assembly with Trycycler and with Unicycler, and you wanted to see where and how they differ.

Run the tool like this:

compare_assemblies.py trycycler.fasta unicycler.fasta

And it will produce results that look something like this:

    trycycler_contig_1 52530-52724: ATGTTGTGGAAATCGCGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGAGGAATCGTTTGAAATCATCGTAGTAATCGCGCGAAAATCACGTGAAATAATCAGA
    unicycler_contig_1 52530-52559: ATGTTGTGGAAATCG---------------------------------------------------------------------------------------------------------------------------------------------------------------------CGTGAAATAATCAGA
                                                   *********************************************************************************************************************************************************************               

    trycycler_contig_1 75329-75359: GTTGGTACATTATATTGTCATTTTGAAAGTA
    unicycler_contig_1 75164-75194: GTTGGTACATTATATCGTCATTTTGAAAGTA
                                                   *               

  trycycler_contig_1 176536-176565: TGGTGAAGCGACACT-AGAAAAAATGCTGAA
  unicycler_contig_1 176371-176401: TGGTGAAGCGACACTGAGAAAAAATGCTGAA
                                                   *               

  trycycler_contig_1 315199-315229: ACAGGTAAAATCTATTGGTATGTTCTTAATG
  unicycler_contig_1 315035-315065: ACAGGTAAAATCTATCGGTATGTTCTTAATG
                                                   *               

  trycycler_contig_1 490814-490844: AATACAAAGGGCAGCAAAACCGCGAGGTCAA
  unicycler_contig_1 490650-490680: AATACAAAGGGCAGCGAAACCGCGAGGTCAA
                                                   *               

  trycycler_contig_1 491187-491284: TATTTAACATTCAAA----------TATTTTT----TGGTTAAAGCGATATTGCTTATGAAAATAAAG-CAGT-ATGCGAG--CGCTT-GACTAAAAAGAAATTGTACATTGAAAAC
  unicycler_contig_1 491023-491138: TATTTAACATTCAAAAAAATGGGCCTATAGCTCAGCTGGTTAGAGCG-CACGCCTGATAAGCGTGAGGTCGGTGGTTCGAGTCCACTTAGGCCCACCATTATTTGTACATTGAAAAC
                                                   **********   *** ****      *    ** ***  *  * *** * * * *  ** *    ** *   * * ** ** ** *               

  trycycler_contig_1 492394-492424: AATGCCAATTAATTTGACTTGGGAGTCAGAA
  unicycler_contig_1 492248-492278: AATGCCAATTAATTTAACTTGGGAGTCAGAA
                                                   *               

  trycycler_contig_1 494276-494306: TGATGAGGTTAATAGATTCGAGGTGGAAGCA
  unicycler_contig_1 494130-494160: TGATGAGGTTAATAGGTTCGAGGTGGAAGCA
                                                   *               

  trycycler_contig_1 494848-494888: AGACCGCAAGGTCTCTTTTTTTTATGTCTAAAACGTCAAAA
  unicycler_contig_1 494702-494741: AGACCGCAAGGTCTC-TTTTTTTATATCTAAAACGTCAAAA
                                                   *         *               

  trycycler_contig_1 494935-494965: TGAATCTGACGAAACAAGAAAAGAGCGCAAC
  unicycler_contig_1 494788-494818: TGAATCTGACGAAACGAGAAAAGAGCGCAAC
                                                   *               

  trycycler_contig_1 584077-584107: CACGGCAAATATTGTTGTGGGCTTTTTTAAA
  unicycler_contig_1 583930-583960: CACGGCAAATATTGTCGTGGGCTTTTTTAAA
                                                   *               

Tips

  • It's a standalone script so doesn't require installation, but you'll need the edlib and mappy Python libraries:
    • python3 -m pip install edlib mappy
  • The two assemblies must be very similar: same number of contigs, same contig order, same strand and same starting position. This is to allow per-contig global alignment.
  • There are two alignment engines: edlib and mappy
    • The default is mappy, which does affine-gap alignment. However, it's really a local aligner, so if two sequences diverge too much, it may fail to produce a global alignment, in which case the script will prompt you to try edlib instead.
    • The edlib engine does true global alignment, but without affine-gaps, so indel regions can look messier.
  • The output uses 1-based sequence coordinates. I.e. the first base of a 1000-bp contig is position 1 and the last base is position 1000.
  • If you don't care about the human-readable output and just want a total difference count between the two assemblies, you can run it like this: compare_assemblies.py assembly_1.fasta assembly_2.fasta | grep -o "*" | wc -l. Note that this counts all single-bp differences, so a 5-bp indel will count as five differences.