completeness metrics - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki
The idea here is to build an approximate set of genes that are expected to be found in the assembly. The first tool to take this approach, CEGMA, identified a dataset of ultra-conserved eukaryotic genes and used it as the set of genes that should be found in every eukaryotic assembly. CEGMA is no longer supported and has been replaced by BUSCO. The developers of BUSCO provide sets of conserved single-copy genes for several major taxonomic groups. A conserved gene in BUSCO is defined as a gene that is present as a single copy in at least 90% of the species in a taxon, which allows the sets to be rather large. At the same time, it means that finding all of them in a single copy in one assembly may simply be impossible for biological reasons: any given species can genuinely lack or have duplicated some of these genes.
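The 90% rule can be illustrated with a small sketch. This is not BUSCO's actual pipeline, only a simplified illustration of the selection criterion described above. The function name `conserved_single_copy`, the input layout (gene name mapped to per-species copy counts) and the toy data are all assumptions made for this example.

```python
def conserved_single_copy(copy_numbers, min_fraction=0.9):
    """Return genes found as a single copy in at least `min_fraction` of species.

    copy_numbers: dict mapping gene name -> dict mapping species -> copy count
    (hypothetical input layout, chosen for illustration only).
    """
    selected = []
    for gene, counts in copy_numbers.items():
        if not counts:
            continue  # no species information for this gene
        # Count species in which the gene is present exactly once
        single = sum(1 for copies in counts.values() if copies == 1)
        if single / len(counts) >= min_fraction:
            selected.append(gene)
    return selected


# Toy data: 10 species, copy counts per gene
species = [f"sp{i}" for i in range(10)]
toy = {
    # single copy in 9 of 10 species (duplicated in one): kept
    "geneA": {s: (2 if s == "sp0" else 1) for s in species},
    # single copy in 8 of 10 species (missing in two): rejected
    "geneB": {s: (0 if s in ("sp0", "sp1") else 1) for s in species},
}
print(conserved_single_copy(toy))
```

Note that `geneA` is kept even though it is duplicated in one species, which is exactly why a perfect single-copy score for every gene is not guaranteed for any particular genome.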
Note that the genes are supposed to be present in a single copy. Therefore, a large number of genes reported as duplicated can be a good indication of partial or complete separation of haplotypes in the assembly.
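As a rough sketch of how the duplicated fraction could be checked automatically, the snippet below parses the one-line score that BUSCO prints in its short summary, assuming the classic format `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` (newer versions may append extra fields). The function names and the 10% threshold are assumptions made for this example, not recommended values; the example line is made up.

```python
import re

# One-line BUSCO score, e.g. "C:85.0%[S:60.0%,D:25.0%],F:5.0%,M:10.0%,n:1000"
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Extract Complete/Single/Duplicated/Fragmented/Missing percentages and n."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def many_duplicates(scores, dup_threshold=10.0):
    """Flag assemblies whose duplicated fraction exceeds a threshold.

    dup_threshold is an arbitrary illustrative cut-off (in percent).
    """
    return scores["D"] > dup_threshold


# Made-up example line, not a real result
scores = parse_busco_summary("C:85.0%[S:60.0%,D:25.0%],F:5.0%,M:10.0%,n:1000")
print(scores["D"], many_duplicates(scores))
```

A flagged assembly is only a hint: a high duplicated fraction may reflect separated haplotypes, but it can also reflect real gene duplications in the species.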
Next
Read about Contamination.
Finish this section, go to Checkpoint
Go back to Table of contents.