GenomeMetrics - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Genome metrics

Section: Genome assembly assessment [3/5].

Continuity

A very popular metric for evaluating assemblies is N50, a weighted median of contig sizes. It is the size of the smallest contig in the set of largest contigs that together cover at least half of the total assembly size:

N50_plot

Calculation of N50

Sort the assembled contigs from longest to shortest. Then sum the contig sizes until the running total reaches half of the total assembly size (i.e. the sum of all contig sizes in the assembly). The size of the last contig added to the sum is the N50.
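The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular assembly tool; the function name `n50` and the contig lengths are made up for the example.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly, given a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)  # total assembly size
    running = 0
    # Walk through the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical assembly: total size 300, so half is 150.
lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(lengths))  # 80 + 70 = 150 reaches half, so this prints 70
```

In practice the contig lengths would be read from the assembly FASTA file rather than typed in by hand.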

Note that this metric relies on the assembly being correct. If all the reads were simply concatenated into one huge (completely wrong) contig, N50 would be huge even though the assembly is meaningless. N50 should never be the only metric used to evaluate assemblies.

NG50

A baby step towards biological reality is to use the known genome size instead of the total sum of contig sizes when calculating N50; this metric is called NG50.
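As a rough sketch, NG50 only changes the threshold: contigs are summed until half of the known genome size is reached, not half of the assembly size. The function name `ng50`, the contig lengths and the genome sizes below are invented for illustration. Returning `None` when the assembly covers less than half of the genome is a choice made for this example.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50, or None if the contigs cover less than half the genome."""
    running = 0
    # Walk through the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The threshold is half of the known genome size.
        if 2 * running >= genome_size:
            return length
    return None  # the assembly never reaches half of the genome size


# Hypothetical assembly of total size 300.
lengths = [80, 70, 50, 40, 30, 20, 10]
print(ng50(lengths, genome_size=400))   # 80 + 70 + 50 = 200, prints 50
print(ng50(lengths, genome_size=1000))  # assembly too small, prints None
```

For the same contigs, the value computed against the assembly size would be 70, so an assembly that is smaller than the genome gets a lower NG50 than N50.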

Completeness

The idea here is to build an approximate set of genes that are expected to be found in the assembly. The first approach, CEGMA, identified a dataset of ultra-conserved eukaryotic genes that should be found in every eukaryotic assembly. This software is no longer supported and has been replaced by BUSCO. The developers of BUSCO provide sets of conserved single-copy genes for several major taxonomic groups. The conserved genes in BUSCO are defined as genes that are present in a single copy in at least 90% of the species in a taxon, which allows the sets to be rather large. However, it also means that finding all of them in a single copy in one assembly may simply be impossible for biological reasons.

Note that these genes are expected to be present in a single copy. A large number of genes identified as duplicated may therefore indicate partial or complete separation of haplotypes in the assembly.

Here is an example of BUSCO output:

 INFO    Results:
 INFO    C:17.9%[S:17.8%,D:0.1%],F:0.1%,M:82.0%,n:1440
 INFO    258 Complete BUSCOs (C)
 INFO    257 Complete and single-copy BUSCOs (S)
 INFO    1 Complete and duplicated BUSCOs (D)
 INFO    1 Fragmented BUSCOs (F)
 INFO    1181 Missing BUSCOs (M)
 INFO    1440 Total BUSCO groups searched
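When comparing several assemblies it can be handy to pull the numbers out of the one-line summary. The sketch below parses a line in the format shown in the example above. The function name `parse_busco_summary` is made up for this illustration, and other BUSCO versions may print the summary differently, so the pattern may need adjusting.

```python
import re

# Pattern for a summary line such as:
# C:17.9%[S:17.8%,D:0.1%],F:0.1%,M:82.0%,n:1440
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages (C, S, D, F, M) and group count n as a dict."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("no BUSCO summary found in line")
    # Percentages become floats; the number of searched groups is an integer.
    result = {
        key: float(value)
        for key, value in match.groupdict().items()
        if key != "n"
    }
    result["n"] = int(match.group("n"))
    return result


line = " INFO    C:17.9%[S:17.8%,D:0.1%],F:0.1%,M:82.0%,n:1440"
summary = parse_busco_summary(line)
print(summary["C"], summary["M"], summary["n"])
```

The percentages are the counts divided by `n`: in the example above, 258 complete BUSCOs out of 1440 gives 17.9%.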

Contamination
