IntroGenome - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Introduction

Section: Genome Assembly [1/5].

Genome assembly is the process of reconstructing a genomic sequence from sequencing reads. The first assemblers processed reads one by one, trying to extend the sequence with a greedy approach. This approach has been abandoned in favour of graphs, which represent the relationships between sequencing reads before the graph itself is interpreted. Depending on the type of graph used, assembly methods fall into two families:

  • De Bruijn graph based assemblers - each read is decomposed into k-mers; every edge in the graph is a k-mer and every node is a (k-1)-mer overlap between k-mers. This is the predominant approach for short read assembly. Examples: ABySS, SOAPdenovo2, SPAdes
  • OLC assemblers, based on string graphs, where each vertex is a read and each edge is an overlap between two reads. This is the approach mostly used for long read assembly. Examples: Celera, Falcon, SGA, MIRA
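The de Bruijn representation described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not how any of the listed assemblers is implemented: the function name `de_bruijn_graph` and the example reads are hypothetical, reads are assumed error-free, and reverse complements are ignored.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    # Collect every distinct k-mer found in the reads.
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical example reads, chosen only for illustration.
toy_reads = ["ACGTC", "CGTCA"]
toy_graph = de_bruijn_graph(toy_reads, k=3)
for node, successors in sorted(toy_graph.items()):
    print(node, "->", successors)
```

Note that the two reads share the k-mers CGT and GTC, so they collapse onto the same edges; the reads themselves are no longer visible in the graph, which is the main difference from the OLC representation.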

Why both of these approaches are used is well explained in the introduction of this preprint.

OLC assembly

OLC stands for the three steps of assembly:

  1. Overlap - find pair-wise overlaps of sequencing reads
  2. Layout - build a string graph of reads, simplify graph and find a path through it
  3. Consensus - calculate a consensus sequence at every position of the assembly

The Overlap step is computationally very intensive. Solved naively, it is a quadratic problem, although there are algorithms that solve it in linear time [2].
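To make the quadratic cost concrete, here is a minimal sketch of the naive Overlap step, which compares every ordered pair of reads. It assumes exact suffix-prefix matches, no sequencing errors and no reverse complements, which real long-read overlappers cannot assume. The names `suffix_prefix_overlap` and `all_overlaps`, the `min_len` threshold and the reads are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no exact overlap of at least min_len (>= 1) exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def all_overlaps(reads, min_len):
    """Naive all-against-all comparison: n * (n - 1) ordered pairs of reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = suffix_prefix_overlap(a, b, min_len)
                if length:
                    overlaps[(i, j)] = length
    return overlaps


# Hypothetical example reads, chosen only for illustration.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_overlaps = all_overlaps(toy_reads, min_len=3)
print(toy_overlaps)
```

The nested loop is what makes the naive version quadratic in the number of reads; practical tools avoid it by indexing the reads first.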

The Layout step simplifies the graph built from read overlaps and finds a path through it, resulting in a set of contiguous genomic sequences (contigs).
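A minimal sketch of the Layout step might look as follows. It assumes the overlaps are given as a dictionary mapping a pair of read indices to an overlap length (a hypothetical format), removes transitive edges as one simple form of graph simplification, and then walks the remaining unbranched path. The function names and data are illustrative only; real assemblers also deal with branches, repeats and errors.

```python
def remove_transitive_edges(overlaps):
    """Drop edge (a, c) whenever edges (a, b) and (b, c) also exist."""
    successors = {}
    for a, b in overlaps:
        successors.setdefault(a, set()).add(b)

    reduced = {}
    for (a, c), length in overlaps.items():
        transitive = any(
            c in successors.get(b, ()) for b in successors[a] if b != c
        )
        if not transitive:
            reduced[(a, c)] = length
    return reduced


def spell_contig(reads, reduced):
    """Follow the path from the read without a predecessor and merge the reads.

    Assumes the reduced graph is a single unbranched path.
    """
    successor = {a: (b, length) for (a, b), length in reduced.items()}
    targets = {b for (_, b) in reduced}
    current = next(r for r in range(len(reads)) if r not in targets)
    contig = reads[current]
    visited = {current}
    while current in successor:
        current, length = successor[current]
        if current in visited:  # guard against cycles in the toy graph
            break
        visited.add(current)
        contig += reads[current][length:]  # append only the non-overlapping part
    return contig


# Hypothetical reads and overlaps; (0, 2) is a transitive edge.
toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]
toy_overlaps = {(0, 1): 4, (1, 2): 4, (0, 2): 2}
toy_reduced = remove_transitive_edges(toy_overlaps)
toy_contig = spell_contig(toy_reads, toy_reduced)
print(toy_reduced, toy_contig)
```

The edge from read 0 to read 2 carries no extra information once the edges 0 to 1 and 1 to 2 are known, so removing it leaves a simple path that spells one contig.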

The Consensus step calls a consensus nucleotide sequence for each contig; without it, the final assembly would have approximately the same error rate as the raw reads. One of the assemblers we will use (Miniasm) skips this step to speed up the process.
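The idea behind the Consensus step can be sketched as a per-position majority vote. This toy assumes the reads are already aligned to the same coordinates and padded with `-` where a read does not cover a position; the function name `column_consensus` and the example reads are hypothetical, and real consensus tools use far more elaborate alignment and error models.

```python
from collections import Counter


def column_consensus(aligned_reads, pad="-"):
    """Majority vote at every position of equally long, pre-aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        # Ignore reads that do not cover this position.
        bases = [base for base in column if base != pad]
        if bases:
            consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical aligned reads; two of them carry a single error each.
toy_aligned = [
    "ACGTTC--",
    "--GATCGA",  # error at position 3 (A instead of T)
    "ACGTTCGT",  # error at position 7 (T instead of A)
    "-CGTTCGA",
]
toy_consensus = column_consensus(toy_aligned)
print(toy_consensus)
```

Each individual error is outvoted by the other reads covering the same position, which is why the consensus has a lower error rate than any single raw read.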

Choice of Assembler

The most important aspect of genome assembly is the algorithm that is used. The important genomic features that will affect the choice are genome size, heterozygosity and the proportion of low complexity sequences.

| assembler | genome_size | heterozygosity | pros | cons |
|---|---|---|---|---|
| Falcon | any | any | handling variable levels of haplotype divergence | hard to install |
| Canu | any | low or high | elegant read correction, nice assembly reports | |
| HGAP | bacterial size | | usually single contig assembly of bacteria | slow, hard to install |
| Miniasm | any | low | easy to install, super fast to run | not that accurate |

It is important to mention that for some genomes you might need to consider replacing one of the Overlap-Layout-Consensus steps with an alternative approach that better suits your genome.

Other options: Hybrid assembly

For your information, you can also improve the quality of your assembly by adding Illumina data. This can be done with Pilon. Here is an example of an integrated pipeline.
