Genomes comparison - jsgounot/metagenomic-pipelines GitHub Wiki
Fasta / species rapid comparison
- skani: The latest ANI tool, designed by Jim Shaw. The recommended default for now.
- mash: The standard fast genome and metagenome distance estimation using MinHash.
- bindash: Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop.
- sourmash: Another potential Mash successor. Supports containment and uses k-mer abundance to estimate species or read abundance. Also makes it easier to add new species to an existing database.
ANI to global alignment
While k-mer-based ANI calculations are important for large datasets, global alignments provide more reliable values, especially for closely related genomes. To minimize the number of pairwise global alignments, we can use an approach similar to dRep, like this pipeline that runs skani and then MUMmer4 on selected pairs with ANI < X.
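The pair-selection step between the fast ANI pass and the global alignment can be sketched with awk. This is only an illustration, not the linked pipeline: it assumes a tab-separated, skani-style table with a header line and the columns Ref_file, Query_file, ANI first; the function name select_pairs, the mock values, and the cutoff of 99 (standing in for X) are all hypothetical.

```shell
set -eu

# Keep the pairs whose ANI is below the cutoff (the "ANI < X" rule in the text).
# Assumed input: tab-separated, one header line, columns Ref_file, Query_file, ANI, ...
select_pairs() {
    cutoff=$1
    awk -F '\t' -v x="$cutoff" 'NR > 1 && $3 + 0 < x + 0 { print $1 "\t" $2 }'
}

# Mock table standing in for the fast ANI output (values are made up).
workdir=$(mktemp -d)
printf 'Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\n' >  "$workdir/ani.tsv"
printf 'a.fa\tb.fa\t99.50\t95.0\t94.0\n'                                      >> "$workdir/ani.tsv"
printf 'a.fa\tc.fa\t96.20\t80.0\t82.0\n'                                      >> "$workdir/ani.tsv"
printf 'b.fa\tc.fa\t96.10\t81.0\t83.0\n'                                      >> "$workdir/ani.tsv"

# 99 is a placeholder for X; pick the value that suits your dataset.
select_pairs 99 < "$workdir/ani.tsv" > "$workdir/pairs.tsv"
cat "$workdir/pairs.tsv"
```

Each line of pairs.tsv is one pair that would then go to MUMmer4 for global alignment. If your selection rule runs the other way (aligning only the closest pairs), flip the comparison in the awk condition.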
Archive part
About mash
The easy way to run mash, given a list of MAGs
Make the sketch
ls -d /absolute/path/to/your/fasta/*.fa > fasta.list.txt
mash sketch -l fasta.list.txt -p 8 -k 21 -s 10000 -o sketch.k21.s10000.msh
With two sketch files, run mash:
mash dist -p 8 -d 0.1 sketch.k21.s10000.number1.msh sketch.k21.s10000.number2.msh | gzip > dist.d01.tsv.gz
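You can quickly retrieve the best results (those with the smallest mash distance) with this command line. The -g option is used so that distances written in scientific notation (e.g. 2.4e-06) sort correctly:
zcat dist.d01.tsv.gz | sort -k3,3g | head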
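To keep one best hit per query instead of the overall top lines, the sorted table can be deduplicated on the query column. A minimal sketch, assuming the usual five-column mash dist output (reference, query, distance, p-value, shared hashes); the function name best_hit_per_query and the mock values are hypothetical.

```shell
set -eu

# Print the lowest-distance reference for every query of a mash dist table.
# Columns (tab-separated): reference, query, distance, p-value, shared-hashes.
best_hit_per_query() {
    # General numeric sort (-g) on column 3 also handles values like 2.4e-06.
    # After sorting, the first row seen for each query is its best hit.
    LC_ALL=C sort -t "$(printf '\t')" -k3,3g | awk -F '\t' '!seen[$2]++'
}

# Mock table standing in for `zcat dist.d01.tsv.gz` (values are made up).
workdir=$(mktemp -d)
printf 'refA.fa\tq1.fa\t0.05\t0\t2120/10000\n'    >  "$workdir/dist.tsv"
printf 'refB.fa\tq1.fa\t2.4e-06\t0\t9999/10000\n' >> "$workdir/dist.tsv"
printf 'refA.fa\tq2.fa\t0.01\t0\t6815/10000\n'    >> "$workdir/dist.tsv"
printf 'refB.fa\tq2.fa\t0.08\t0\t1028/10000\n'    >> "$workdir/dist.tsv"

best_hit_per_query < "$workdir/dist.tsv" > "$workdir/best.tsv"
cat "$workdir/best.tsv"
```

On real data the same function can be fed directly from the compressed file, for example: zcat dist.d01.tsv.gz | best_hit_per_query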
About sourmash
This is a nice update to mash based on SBT-MinHash (Sequence Bloom Trees), with several additional features:
- Containment
- Scalability
- You can update a previously built sketch
- There are multiple functions, such as read classification
Issue:
Fasta / species finer comparison
- FastANI: Fast whole-genome similarity (ANI) estimation. Slower than MinHash-based methods.
- skani: Accurate, fast nucleotide identity calculation for MAGs and databases.
- MUMmer4: Slow but more accurate.
- FASTGA: Still to be tested; might not be the best choice for bacterial genomes.
See also metagenome dereplication
Fasta / species rapid estimation
- RefSeq Masher: Mash MinHash search of your nucleotide sequences against an NCBI RefSeq genomes database