Genomes comparison - jsgounot/metagenomic-pipelines GitHub Wiki

Fasta / species rapid comparison

skani: The latest ANI tool, developed by Jim Shaw. The recommended default for now.
mash: The standard fast genome and metagenome distance estimation using MinHash.
bindash: Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop.
sourmash: Another potential Mash successor. Supports containment searches and can use k-mer abundance to estimate species or read abundance. Makes it easier to add new species to an existing database.

ANI to global alignment

While k-mer-based ANI calculations are important for large datasets, global alignments provide more reliable values, especially for closely related genomes. To minimize the number of pairwise global alignments, we can use an approach similar to dRep, like this pipeline that runs skani and then MUMmer4 for selected pairs with ANI < X.
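The selection step described above can be sketched in shell. This is only an illustration under stated assumptions, not the linked pipeline itself: the function name `select_pairs` and the cutoff value are hypothetical, and the input is assumed to be a tab-separated edge list with a header row and the columns Ref_file, Query_file, ANI (in percent), as written by `skani triangle -E`.

```shell
# Print "ref<TAB>query" for every skani pair whose ANI is below a cutoff.
# Assumes a tab-separated edge list with one header row and the columns
# Ref_file, Query_file, ANI (percent), ... (assumed `skani triangle -E` layout).
select_pairs() {
    cutoff="$1"   # hypothetical stand-in for the X used in the text
    awk -F'\t' -v x="$cutoff" 'NR > 1 && ($3 + 0) < x { print $1 "\t" $2 }'
}

# Hypothetical end-to-end usage, assuming skani and MUMmer4 (dnadiff) are installed:
# skani triangle -l fasta.list.txt -E -t 8 > skani.edges.tsv
# select_pairs 99 < skani.edges.tsv | while IFS="$(printf '\t')" read -r ref qry; do
#     dnadiff -p "$(basename "$ref" .fa)__$(basename "$qry" .fa)" "$ref" "$qry"
# done
```

The comparison follows the "ANI < X" wording of the text; if your workflow refines pairs above a cutoff instead, flip the operator in the awk condition.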

Archive part

About mash

The easy way to run mash, given a list of MAGs.

Make the sketch:

ls -d /absolute/path/to/your/fasta/*.fa > fasta.list.txt
mash sketch -l fasta.list.txt -p 8 -k 21 -s 10000 -o sketch.k21.s10000.msh

With two sketch files, run mash:

mash dist -p 8 -d 0.1 sketch.k21.s10000.number1.msh sketch.k21.s10000.number2.msh | gzip > dist.d01.tsv.gz
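Since the Mash distance roughly approximates 1 - ANI, the `-d 0.1` cutoff corresponds to about 90% ANI. A minimal sketch for adding an approximate ANI column to the output is shown here. It assumes the standard tab-separated `mash dist` layout with the distance in column 3, and the function name `add_approx_ani` is hypothetical.

```shell
# Append an approximate ANI (in percent) as an extra last column,
# using ANI ~ 100 * (1 - distance).
# Assumes tab-separated `mash dist` output with the distance in column 3.
add_approx_ani() {
    awk -F'\t' '{ printf "%s\t%.2f\n", $0, 100 * (1 - $3) }'
}

# Usage:
# zcat dist.d01.tsv.gz | add_approx_ani
```

This value is only an estimate derived from k-mer sharing; it does not replace an alignment-based ANI.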

You can quickly retrieve the best results (those with the smallest mash distance) with this command line:

zcat dist.d01.tsv.gz | sort -k3,3g | head
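The command above lists the lowest distances overall. To keep one best hit per query instead, something like the following could work. It is a sketch that assumes the standard five-column, tab-separated `mash dist` output (reference, query, distance, p-value, shared hashes); the function name `best_hit_per_query` is hypothetical.

```shell
# Keep, for each query (column 2), the row with the smallest Mash distance (column 3).
# Assumes tab-separated `mash dist` output:
# reference, query, distance, p-value, shared hashes.
best_hit_per_query() {
    awk -F'\t' '
        # Skip self-comparisons (same path on both sides), which have distance 0.
        $1 == $2 { next }
        # Record the row when the query is new or the distance improves.
        !($2 in best) || ($3 + 0) < best[$2] { best[$2] = $3 + 0; row[$2] = $0 }
        # Emit one row per query; the order is fixed by the sort below.
        END { for (q in row) print row[q] }
    ' | sort -t "$(printf '\t')" -k2,2
}

# Usage:
# zcat dist.d01.tsv.gz | best_hit_per_query
```

Adding `+ 0` forces a numeric comparison, so distances printed in scientific notation are handled as numbers. Remove the self-comparison rule if identical paths in both sketches should be reported.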

About sourmash

sourmash is a nice update to mash, based on SBT-MinHash, with multiple features:

Containment
Scalability
You can update a previously built sketch
Multiple additional functions, such as read classification

Issue:

So far, only one genome can be processed at a time.

Fasta / species finer comparison

FastANI: Fast Whole-Genome Similarity (ANI) estimation. Slower than MinHash protocols.
skani: Accurate, fast nucleotide identity calculation for MAGs and databases.
MUMmer4: Slow but more accurate.
FastGA: To be tested; might not be the best choice for bacterial genomes.

Fasta / species rapid estimation

RefSeq Masher: Mash MinHash search of your nucleotide sequences against an NCBI RefSeq genomes database.