8 QAQC - coopermkr/sdepressaAssembly GitHub Wiki
There are lots of ways to quality check a genome assembly, but a few have emerged as core standards for all references. We have already seen some of those statistics earlier with code from the assemblathon, which lets us assess the contiguity of our genome. Now we can look at two different ways to assess completeness.
First, is BUSCO, which searches the assembly for highly conserved genes that should occur once (single copy orthologues): https://busco.ezlab.org/
Though it's important to note that in our case, because we're dealing with a tetraploid, it would make sense for many of our BUSCO genes to actually be double copy.
#Run against subphased blue scaffolds
busco -c 30 -i ../5.subphaser/final/blue.fa -l eudicots_odb10 -o out -m genome
Next we can use a different tool to assess how much of our reads are represented in our final assembly. To do this, we first need to build a kmer database of our pacbio long reads using the counter meryl.
indir=data
meryl count threads=20 k=20 $indir/pac*.fastq.gz output pacbio.meryl
Meryl outputs a specially formatted database that its parent program Merqury will interpret to determine how much of the information in our reads made it into our genome.
Documentation for both meryl and Merqury can be found at: https://github.com/marbl/merqury
merqury.sh ../pacbio.meryl 5.subphaser/final/blue.fa 5.subphaser/gold.fa phased.scaff
The most important output we get from Merqury is a summary table describing the kmer completeness of our assembly. The program has functionality for tetraploids, which I took advantage of here by specifying paths to fasta files containing both subgenomes. This way Merqury provides completeness metrics for both subgenomes individually and together.