Checkpoint: genome assembly answers - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

Checkpoint Genome assembly: Answers

What are the three main steps of long-read assembly? [OLC - Overlap, Layout and Consensus]
What is the computational bottleneck? [Overlap - this step requires an all-versus-all search for overlaps between reads. There are computational tricks that make this procedure faster, but it still remains the most time-consuming step.]
What aspects of a genome should help you decide which algorithm to use for assembly? [Genome size - for small genomes there are dedicated programs (HGAP, Unicycler) that do not scale to large genomes but are very good at assembling small ones. Heterozygosity, structural variation between haplotypes, and ploidy - Canu is not designed for "dangerous" levels of heterozygosity (> 0.5%, but < 3%): some haplotype sequences would get collapsed into one sequence while others would be assembled separately, resulting in something in between a haploid and a diploid assembly. The assembler suited for highly heterozygous and di/polyploid data is Falcon, with its module for unzipping haplotypes, Falcon-unzip.]
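
To get a feeling for why the all-versus-all overlap search is the bottleneck, here is a minimal Python sketch. It counts how many read pairs an exhaustive search would have to compare, and then shows one common kind of trick: index short k-mers first and only keep read pairs that share at least one. The function names `all_vs_all_pair_count` and `candidate_pairs`, the toy reads and the value `k=5` are made up for illustration; real assemblers use far more refined versions of this idea (Canu relies on MinHash-based overlapping, Miniasm on minimizer-based mapping), so treat this as a sketch of the principle only.

```python
from collections import defaultdict
from itertools import combinations


def all_vs_all_pair_count(n_reads):
    """Number of unordered read pairs an exhaustive overlap search must consider."""
    return n_reads * (n_reads - 1) // 2


def candidate_pairs(reads, k=5):
    """Return pairs of read indices that share at least one exact k-mer.

    Toy seed index (hypothetical helper, not any assembler's real algorithm):
    only the returned pairs would be passed on to an expensive alignment step.
    """
    index = defaultdict(set)  # k-mer -> indices of the reads that contain it
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(i)

    pairs = set()
    for read_ids in index.values():
        # Every two reads listed under the same k-mer form a candidate pair.
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = [
    "ACGTACGGTCA",  # read 0
    "CGGTCATTGAC",  # read 1, starts where read 0 ends
    "TTGACGGATCC",  # read 2, starts where read 1 ends
    "GGGGGGGGGGG",  # read 3, unrelated
]

print(all_vs_all_pair_count(len(reads)))   # 6 pairs for 4 reads
print(all_vs_all_pair_count(1_000_000))    # 499999500000 pairs for a million reads
print(sorted(candidate_pairs(reads)))      # [(0, 1), (1, 2)]
```

In this toy case the index reduces six possible comparisons to two. Exact k-mer matching like this would miss many true overlaps in noisy long reads, which is one reason why real tools use shorter or sampled seeds and tolerate mismatches.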

What is the extra step in Canu besides OLC? [Correction of reads: it produces corrected reads that then enter the classical OLC assembly.]
What is the omitted step in Miniasm assembly? Are you aware of a tool to calculate the omitted step? [It omits the Consensus step; a tool for calculating consensus that was built with Miniasm in mind is called Racon.]
Which of the assemblers was faster? Can you guess why? [Miniasm is much faster because it omits the correction step and the consensus step, but also because the algorithm it uses to search for overlaps is significantly faster than the one used by Canu.]
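
To illustrate what skipping the consensus step means, here is a toy Python sketch of a consensus call: a column-wise majority vote over reads that are assumed to be already aligned to each other. The function name `majority_consensus` and the example pile-up are hypothetical, and this is not how Racon works internally (Racon builds its consensus with partial order alignment); the sketch only shows why combining several noisy reads removes errors that a single read would carry into the assembly.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads.

    '-' marks a gap; columns where the gap is the most common symbol are dropped.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Five toy reads covering the same region, each row already aligned.
pileup = [
    "ACGTTAGC",
    "ACGATAGC",  # substitution error in column 4
    "ACGTTAGC",
    "ACG-TAGC",  # deletion error in column 4
    "ACGTTCGC",  # substitution error in column 6
]

print(majority_consensus(pileup))  # ACGTTAGC
```

Without a step like this, each position of the assembled sequence comes from a single read, so the assembly keeps roughly the error rate of the raw reads.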