Assembly canu - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

The goal is to assemble chromosome four of D. melanogaster from raw PacBio reads using the Canu assembler.

Canu

Canu is currently one of the leading assemblers; it was used for the assembly of the domestic goat genome. Canu has very decent documentation. For details of the method, you can take a look at the paper. Besides the classical OLC steps, Canu performs an error correction of reads before the assembly by aligning a subset of long reads to the longest ~40x of the input data. Computation-wise, it is like one extra Overlap step.

Data to assemble

We have chosen to assemble only chromosome four because it is small (~1.5 Mb) but still has the structure of a eukaryotic chromosome. We have prepared long reads of chromosome four for you, extracted in .fastq format. If you are interested in how exactly we did that, you can check details of getting chromosome four reads.

Get sequencing reads:

mkdir -p ~/pacbio/dna
wget -O ~/pacbio/dna/dmel_ch4_reads.fastq.gz ftp://ftpmrr.unil.ch/LongReadWorkshop/data/dmel_ch4_reads.fastq.gz
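
Before assembling, it can be useful to check that the download is complete and to estimate the coverage. The following is a minimal sketch using standard UNIX tools; the function name `fastq_stats` is made up for this example, it assumes the usual four-lines-per-read FASTQ layout, and the genome size of 1.5 Mb is the approximate value quoted above.

```shell
# Print read count, total bases and estimated coverage of a gzipped FASTQ file.
# Usage: fastq_stats <reads.fastq.gz> <genome size in bp>
# Assumes 4 lines per record (sequences are not wrapped over several lines).
fastq_stats() {
    gzip -dc "$1" | awk -v genome="$2" '
        NR % 4 == 2 { reads++; bases += length($0) }   # 2nd line of each record is the sequence
        END { printf "reads=%d bases=%d coverage=%.1fx\n", reads, bases, bases / genome }'
}

# Run on the workshop reads if they have been downloaded
reads="$HOME/pacbio/dna/dmel_ch4_reads.fastq.gz"
if [ -f "$reads" ]; then
    fastq_stats "$reads" 1500000
fi
```

The coverage value gives an idea of how much data is available relative to the ~40x that Canu uses for the correction step.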

Running of assembly

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, switch off the option to compute on a cluster with useGrid=false. For your own project, these settings should be discussed with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
              <-pacbio-raw |
               -pacbio-corrected |
               -nanopore-raw |
               -nanopore-corrected> read_file.fastq.gz
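
As an illustration of how the usage above could be filled in for this exercise, here is one possible command. Treat it as a sketch rather than the reference solution: the prefix `dmel_ch4` and the output directory `canu_ch4` are hypothetical names chosen for this example, `genomeSize=1.5m` follows the approximate chromosome size given earlier, and `maxThreads=2` is written in the key=value form used in the Canu documentation.

```shell
reads="$HOME/pacbio/dna/dmel_ch4_reads.fastq.gz"
outdir="$HOME/pacbio/dna/canu_ch4"   # hypothetical output directory
prefix="dmel_ch4"                     # hypothetical assembly prefix

# Print the command first, so it can be checked before starting the long run
canu_cmd="canu -p $prefix -d $outdir genomeSize=1.5m maxThreads=2 useGrid=false -pacbio-raw $reads"
echo "$canu_cmd"

# Start the assembly only if canu is installed and the reads are in place
if command -v canu >/dev/null 2>&1 && [ -f "$reads" ]; then
    canu -p "$prefix" -d "$outdir" genomeSize=1.5m maxThreads=2 useGrid=false -pacbio-raw "$reads"
fi
```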

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found here. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

⚠ There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

  • *.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
  • *.trimmedReads.fasta.gz : file containing the sequences after correction and final trimming
  • *.layout : file containing information about which reads were included in the final assembly
  • *.gfa : file containing the assembly graph produced by Canu
  • *.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

❟ How many contigs were produced?

❟ Does the total size seem to match your expectations?

⚠ The basic stats of the assembly can be read from the reports generated by the assembler, or calculated using standard UNIX command line tools.
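
One way to get these numbers with standard tools is a short awk script over the contigs file. This is only a sketch: the function name `assembly_stats` is made up for this example, and the path in the usage part assumes a hypothetical output directory `canu_ch4` and prefix `dmel_ch4`, so adjust both to whatever you passed to -d and -p.

```shell
# Print the number of contigs, the total assembly size and the longest contig
# of a FASTA file. Sequences may be wrapped over several lines.
# Usage: assembly_stats <contigs.fasta>
assembly_stats() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; contigs++; next }   # header: close the previous contig
             { len += length($0); total += length($0) }               # sequence line
        END  { if (len > max) max = len
               printf "contigs=%d total_bp=%d longest_bp=%d\n", contigs, total, max }' "$1"
}

# Hypothetical location of the Canu output; change it to match your run
contigs="$HOME/pacbio/dna/canu_ch4/dmel_ch4.contigs.fasta"
if [ -f "$contigs" ]; then
    assembly_stats "$contigs"
fi
```

The total size can then be compared with the expected size of the chromosome.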

Next

Go to tutorial assembly using Miniasm

Finish this section, go to Checkpoint

Go back to Table of contents.
