Genome assembly - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki
Genome assembly is the process of reconstructing a genomic sequence from sequencing reads. The first assemblers processed reads one by one, trying to extend the sequence with a greedy approach. This approach has been abandoned in favour of graphs, which represent the relationships between sequencing reads and are then interpreted to produce the assembly. Depending on the type of graph used, assembly methods fall into two families:
- De Bruijn graph based assemblers - each read is decomposed into k-mers; every edge in the graph is a k-mer and every node is a (k-1)-mer, the overlap between consecutive k-mers. This is the predominant approach for short-read assembly. Examples: ABySS, SOAPdenovo2, SPAdes
- OLC assemblers, based on string graphs [1], where each vertex is a read and each edge is an overlap between two reads. This is the approach mostly used for long-read assembly. Examples: Celera, Falcon, SGA, MIRA
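To make the De Bruijn graph description above concrete, here is a minimal sketch in Python. It is not how any of the listed assemblers is implemented: the function name `de_bruijn_graph`, the toy reads and the choice `k=3` are assumptions made for illustration, and real assemblers also handle reverse complements, sequencing errors and k-mer coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy De Bruijn graph from reads.

    Nodes are (k-1)-mers; each distinct k-mer contributes one edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Returns a dict: node -> list of successor nodes.
    """
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen_kmers:
                continue  # keep a single edge per distinct k-mer
            seen_kmers.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap by four bases.
toy_reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(toy_reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Note that the two reads share the k-mers CGT and GTC, so they collapse onto the same edges: the graph does not need to know which read a k-mer came from.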
Why both of these approaches are used is well explained in the introduction of this preprint.
OLC stands for the three steps of assembly:
- Overlap - find pair-wise overlaps of sequencing reads
- Layout - build a string graph of reads, simplify graph and find a path through it
- Consensus - calculate a consensus sequence at every position of the assembly
The Overlap step is computationally very intensive: solved naively, it is a quadratic problem (every read is compared with every other read), although there are algorithms that solve it in linear time [2].
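The quadratic cost of the naive Overlap step can be seen in a small sketch that compares every ordered pair of reads. The function names, the toy reads and the `min_len` threshold are assumptions for illustration; the sketch only finds exact suffix-prefix matches, whereas long-read overlappers must tolerate sequencing errors and use indexing to avoid the all-vs-all comparison.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`.

    Returns 0 if no such match of at least `min_len` bases exists.
    `min_len` is assumed to be >= 1.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_len):
    """Naive all-vs-all overlap detection: n * (n - 1) read comparisons."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            overlaps[(i, j)] = length
    return overlaps


# Hypothetical error-free toy reads.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlaps = all_overlaps(toy_reads, min_len=3)
print(overlaps)
```

Each key `(i, j)` means that the end of read `i` overlaps the start of read `j`; these pairs become the edges of the overlap graph used in the Layout step.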
The Layout step simplifies the graph built from read overlaps and finds paths through it, resulting in a set of contiguous genomic sequences (contigs).
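As a rough illustration of the Layout step, the sketch below walks a small overlap graph and merges the reads along the path into a single contig. It is a greedy simplification and not the string graph algorithm used by real assemblers, which also remove transitive edges and resolve branches. The function name `layout_contig`, the toy reads and the hard-coded overlap lengths are assumptions for illustration.

```python
def layout_contig(reads, overlaps):
    """Merge reads along a path through a toy overlap graph.

    `overlaps` maps (i, j) -> length of the overlap between the end of
    read i and the start of read j. Only the best outgoing edge of each
    read is kept, and only the first path found is returned.
    """
    # Simplify the graph: keep the longest outgoing overlap per read.
    best_next = {}
    for (i, j), length in overlaps.items():
        if i not in best_next or length > best_next[i][1]:
            best_next[i] = (j, length)

    # Start from a read that no kept edge points to.
    has_incoming = {j for j, _ in best_next.values()}
    starts = [i for i in range(len(reads)) if i not in has_incoming]
    if not starts:
        raise ValueError("no start read found (the graph may be circular)")

    current = starts[0]
    contig = reads[current]
    visited = {current}
    while current in best_next:
        nxt, length = best_next[current]
        if nxt in visited:
            break  # avoid looping forever on a cycle
        contig += reads[nxt][length:]  # append only the non-overlapping part
        visited.add(nxt)
        current = nxt
    return contig


# Hypothetical toy input: three reads and their suffix-prefix overlaps.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_overlaps = {(0, 1): 4, (1, 2): 4}
contig = layout_contig(toy_reads, toy_overlaps)
print(contig)
```

With real data the graph rarely forms a single unbranched path, which is why assemblers report several contigs rather than one sequence.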
The Consensus step calls a consensus nucleotide sequence for each contig; without it, the final assembly would have approximately the error rate of the raw reads. One of the assemblers we will use (Miniasm) skips this step to speed up the process.
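The effect of the Consensus step can be sketched with a simple per-column majority vote over reads that are already aligned to each other. This is only an illustration under strong assumptions: the reads are given as equal-length strings, `-` marks a position a read does not cover, and the name `column_consensus` is made up. Real consensus tools work from pairwise or partial-order alignments and model insertions and deletions.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads.

    '-' marks a position that a read does not cover; columns without
    any covering base are skipped.
    """
    consensus = []
    for column in zip(*aligned_reads):
        bases = [base for base in column if base != "-"]
        if not bases:
            continue
        most_common_base, _count = Counter(bases).most_common(1)[0]
        consensus.append(most_common_base)
    return "".join(consensus)


# Hypothetical reads covering the same region, each with at most one error.
aligned = [
    "ACGTTCGA",
    "ACGATCGA",  # error at position 4 (T read as A)
    "ACGTTCGA",
    "ACCTTCGA",  # error at position 3 (G read as C)
]
consensus = column_consensus(aligned)
print(consensus)
```

Because the errors fall at different positions in different reads, the vote recovers a sequence that is more accurate than any single erroneous read, which is the reason an assembly without this step keeps roughly the raw read error rate.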
We put together a small and very incomplete guide to choosing an assembler (see the page Choice of assembler), but note that software changes quickly and the list is based on very limited experience.
Go to the tutorial on assembly using Canu.
Go to the tutorial on assembly using Miniasm.
Go back to the Table of contents.