Introduction

Section: Genome Assembly [1/5].

Genome assembly is the process of reconstructing a genomic sequence from sequencing reads. The first assemblers parsed the reads one by one, trying to extend the sequence with a greedy approach. This approach has been abandoned and replaced by methods that first represent the relationships between sequencing reads as a graph and then interpret that graph. Depending on the type of graph used, assembly methods fall into two families:

De Bruijn graph based assemblers: each read is decomposed into k-mers; every edge in the graph is a k-mer and every node is a (k-1)-mer overlap between k-mers. This is the predominant approach for short-read assembly. Examples: ABySS, SOAPdenovo2, SPAdes
OLC assemblers, based on string graphs, where each vertex is a read and each edge is an overlap between two reads. This is the approach mostly used for long-read assembly. Examples: Celera, Falcon, SGA, MIRA
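
To make the de Bruijn representation concrete, here is a minimal Python sketch. It is an illustration only, not how any of the assemblers listed above is implemented: the function name `de_bruijn_graph`, the toy reads and the choice k=3 are assumptions made for the example, and the reads are assumed to be error-free.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            # Each k-mer is an edge from its (k-1)-mer prefix to its suffix.
            # Repeated k-mers are kept as repeated edges (multiplicity).
            graph[prefix].append(suffix)
    return dict(graph)


# Hypothetical toy reads, assumed to be error-free.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

The k-mers CGT and GTC occur in both reads, so the corresponding edges appear twice. Real assemblers use such multiplicities, together with much larger k, to separate true sequence from sequencing errors.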

Why both of these approaches are used is well explained in the introduction of this preprint.

OLC assembly

OLC stands for the three steps of assembly:

Overlap - find pairwise overlaps between sequencing reads
Layout - build a string graph of reads, simplify the graph and find a path through it
Consensus - calculate a consensus sequence at every position of the assembly

The Overlap step is computationally very intensive. Naively it is a quadratic problem, because every read has to be compared with every other read, although there are algorithms that solve it in linear time².
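
The quadratic, naive version of the Overlap step can be sketched in a few lines of Python. This is a simplified illustration under strong assumptions: overlaps are exact suffix-prefix matches (real long reads contain errors, so real overlappers use approximate matching), reverse complements are ignored, and the names `overlap_length`, `all_overlaps` and `min_length` are hypothetical.

```python
from itertools import permutations


def overlap_length(a, b, min_length):
    """Longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_length` exists.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_length):
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = overlap_length(reads[i], reads[j], min_length)
        if length:
            overlaps[(i, j)] = length
    return overlaps


# Hypothetical toy reads, assumed to be error-free.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlaps = all_overlaps(reads, min_length=3)
print(overlaps)  # {(0, 1): 4, (1, 2): 4}
```

The loop over all ordered pairs is what makes the naive approach quadratic. Practical overlappers avoid it by indexing the reads first.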

The Layout step simplifies the graph built from read overlaps and finds paths through it, resulting in a set of contiguous genomic sequences (contigs).
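
As a rough sketch of the path-finding part of the Layout step, the Python example below walks an overlap graph that is assumed to be already simplified, so that every read has at most one predecessor and one successor. Graph simplification itself (for example, removing redundant edges) is not shown. The function name `layout_contigs`, the toy reads and the overlap table are hypothetical.

```python
def layout_contigs(reads, overlaps):
    """Merge reads into contigs by following unbranched overlap paths.

    `overlaps` maps (i, j) to the length of the overlap between the
    suffix of reads[i] and the prefix of reads[j]. The graph is assumed
    to be simplified: at most one predecessor and successor per read.
    """
    successor = {i: (j, length) for (i, j), length in overlaps.items()}
    has_predecessor = {j for (_, j) in overlaps}
    contigs = []
    for start in range(len(reads)):
        if start in has_predecessor:
            continue  # not the beginning of a path
        contig, current, visited = reads[start], start, {start}
        while current in successor:
            nxt, length = successor[current]
            if nxt in visited:  # guard in case the assumption is violated
                break
            # Append only the part of the next read beyond the overlap.
            contig += reads[nxt][length:]
            visited.add(nxt)
            current = nxt
        contigs.append(contig)
    return contigs


# Hypothetical toy input: three overlapping reads and one isolated read.
reads = ["ACGTAC", "GTACGG", "ACGGTT", "TTTTTT"]
overlaps = {(0, 1): 4, (1, 2): 4}
contigs = layout_contigs(reads, overlaps)
print(contigs)  # ['ACGTACGGTT', 'TTTTTT']
```

Reads that overlap with nothing end up as their own contigs, which is one reason why fragmented overlap graphs lead to fragmented assemblies.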

The Consensus step calls a consensus nucleotide sequence for the contigs; without it, the final assembly would have approximately the error rate of the raw reads. One of the assemblers we will use (Miniasm) skips this step to speed up the process.
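
The idea behind the Consensus step can be illustrated with a per-column majority vote. This is a deliberately simplified sketch, not the method used by any particular assembler: the reads are assumed to be already aligned to the same coordinates and padded to equal length, `-` is assumed to mean "no coverage" rather than a deletion, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus over equal-length, pre-aligned reads."""
    result = []
    for column in zip(*aligned_reads):
        # Count only the reads that cover this position.
        counts = Counter(base for base in column if base != "-")
        if counts:
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Hypothetical toy alignment; the second read has an error at position 3.
aligned = [
    "ACGTAC--",
    "ACCTACGG",
    "--GTACGG",
]
print(consensus(aligned))  # ACGTACGG
```

The erroneous C in the second read is outvoted by the two reads that carry G, which is how a consensus ends up with a lower error rate than the raw reads.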

Choice of Assembler

The most important aspect of genome assembly is the algorithm that is used. The important genomic features that will affect the choice are genome size, heterozygosity, and the proportion of low-complexity sequences.

assembler	genome_size	heterozygosity	pros	cons
Falcon	any	any	handling variable levels of haplotype divergence	hard to install
Canu	any	low or high	elegant read correction, nice assembly reports
HGAP	bacterial size		usually single contig assembly of bacteria	slow, hard to install
Miniasm	any	low	easy to install, super fast to run	not that accurate

It is important to mention that for some genomes you might need to consider replacing one of the Overlap-Layout-Consensus steps with an alternative approach that better suits your genome.

Other options: Hybrid assembly

For your information, you can also improve the quality of your assembly by adding Illumina data. This can be done with Pilon. Here is an example of an integrated pipeline.

IntroGenome - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
