Genome Assembly

Section: Genome Assembly [2/5].

Data

We will practice assembly on D. melanogaster smallest chromosome (4). This was a choice driven by time constraints: size is small (~1.5M), but still conserving the structure of an eukaryotic chromosome. We have prepared for you long reads of chromosome four extracted in .fastq format:

# DNA long reads (PacBio)
wget https://drive.switch.ch/index.php/s/3yei57kRvBzOOPK/download -O dmel_ch4_reads.fastq.gz

Genome assembly: Miniasm

Miniasm is not a real OLC assembler. Is is actually performing only the layout (L) step. Consensus (C) step is just not computed and overlap (O) of reads is expected to be provided on input. The assembly does not require correction of reads, which is going to save a lot of time and resources, but it will cause a big nucleotide level errors of the resulting assembly. However, it comes very handy if you want to have a "quick-and-dirty" assembly to start playing on.

We compute overlaps between reads using the companion program (mapper) minimap2, currently the fastest solution. Minimap2 is a mapper, it's able to map read to reference, therefore when we map reads to reads, we need to specify the reads twice.

Both Minimap and Miniasm are installed on the Docker. Check them out!

minimap2 -h 
miniasm -h

Assembly

First, we compute overlaps:

minimap2 -x ava-pb <reads.fq.gz> <reads.fq.gz> | gzip -9 > <overlaps.paf.gz>

Then, we compute the string graph and its layout:

miniasm -f <reads.fq> <overlaps.paf.gz> > <contigs.gfa>

And finally convert the assembly graph to classical fasta with some awk magic:

awk '/^S/{print ">"$2"\n"$3}' <contigs.gfa> | fold > <contigs.fa>

Backup!

In case anything went wrong here, we provide the output of the steps of this assembly:

# miniasm assembly backup
wget https://drive.switch.ch/index.php/s/TiTZjzUgMZzRNYo/download -O miniasm.tar.gz

Genome assembly: Canu

Canu is currently one of leading assemblers, it was notably used for the assembly of domestic goat genome. It has has very decent documentation, details on the method are explicited in the paper. However, it's quite resource-hungry...

Besides the classical OLC steps, Canu performs a error correction of reads before the assembly by aligning a subset of long reads to longest ~40x of the input data. Computation-wise it is like one extra Overlap step.

Canu is already installed on the Docker. Check it out:

canu -h

Assembly

Choose the appropriate parameters to run Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2) and you need to disable cluster option, since we compute on a single Amazon server set off the option to compute on cluster useGrid=false.

Try this command:

canu -p <canu_out> -d <canu_outdir> genomeSize=1.5m -maxThreads=2 useGrid=false -pacbio-raw dmel_ch4_reads.fastq.gz

There are still a lot of parameters that are possible to tweak (wink wink: merylMemory; batMemory; batThreads; cormhapMemory; cormhapThreads). For example if we desire to assemble haplotypes separately of if we want to smash them together, we can alternate the error correction process. There is a brilliant section in documentation about parameter tweaking.

When you succeed in finding the right combination, you'll end up with a rich output directory. For example:

aechchik@dee-serv03:/scratch/beegfs/monthly/aechchik/test_canu18$ ls -h canu_outdir2/
canu-logs                           canu_out2.gkpStore.err           canu_out2.unitigs.layout
canu_out2.contigs.fasta             canu_out2.gkpStore.gkp           canu_out2.unitigs.layout.readToTig
canu_out2.contigs.gfa               canu_out2.report                 canu_out2.unitigs.layout.tigInfo
canu_out2.contigs.layout            canu_out2.trimmedReads.fasta.gz  canu-scripts
canu_out2.contigs.layout.readToTig  canu_out2.unassembled.fasta      correction
canu_out2.contigs.layout.tigInfo    canu_out2.unitigs.bed            trimming
canu_out2.correctedReads.fasta.gz   canu_out2.unitigs.fasta          unitigging
canu_out2.gkpStore                  canu_out2.unitigs.gfa

The most interesting files are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trim and split based on consensus evidence.
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing informations about read inclusion in the final assembly
*.gfa : file containing the assembly graph by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

Backup!

In case anything went wrong here, we provide the output of the steps of this assembly:

# canu assembly backup
wget https://drive.switch.ch/index.php/s/6yhvPsyAQQ52qH8/download -O canu.tar.gz

Genome Visualisation

Back

Table of content

GenomeAsm - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Genome Assembly

Data

Genome assembly: Miniasm

Assembly

Backup!

Genome assembly: Canu

Assembly

Backup!

Next

Back

⚠️ GitHub.com Fallback ⚠️

GenomeAsm - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Genome Assembly

Data

Genome assembly: Miniasm

Assembly

Backup!

Genome assembly: Canu

Assembly

Backup!

Next

Back

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️