GenomeAsm - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Genome Assembly [2/5].
We will practice assembly on D. melanogaster smallest chromosome (4). This was a choice driven by time constraints: size is small (~1.5M), but still conserving the structure of an eukaryotic chromosome. We have prepared for you long reads of chromosome four extracted in .fastq
format:
# DNA long reads (PacBio)
wget https://drive.switch.ch/index.php/s/3yei57kRvBzOOPK/download -O dmel_ch4_reads.fastq.gz
Miniasm is not a real OLC assembler. Is is actually performing only the layout (L) step. Consensus (C) step is just not computed and overlap (O) of reads is expected to be provided on input. The assembly does not require correction of reads, which is going to save a lot of time and resources, but it will cause a big nucleotide level errors of the resulting assembly. However, it comes very handy if you want to have a "quick-and-dirty" assembly to start playing on.
We compute overlaps between reads using the companion program (mapper) minimap2
, currently the fastest solution. Minimap2 is a mapper, it's able to map read to reference, therefore when we map reads to reads, we need to specify the reads twice.
Both Minimap and Miniasm are installed on the Docker. Check them out!
minimap2 -h
miniasm -h
First, we compute overlaps:
minimap2 -x ava-pb <reads.fq.gz> <reads.fq.gz> | gzip -9 > <overlaps.paf.gz>
Then, we compute the string graph and its layout:
miniasm -f <reads.fq> <overlaps.paf.gz> > <contigs.gfa>
And finally convert the assembly graph to classical fasta
with some awk magic:
awk '/^S/{print ">"$2"\n"$3}' <contigs.gfa> | fold > <contigs.fa>
In case anything went wrong here, we provide the output of the steps of this assembly:
# miniasm assembly backup
wget https://drive.switch.ch/index.php/s/TiTZjzUgMZzRNYo/download -O miniasm.tar.gz
Canu
is currently one of leading assemblers, it was notably used for the assembly of domestic goat genome. It has has very decent documentation, details on the method are explicited in the paper. However, it's quite resource-hungry...
Besides the classical OLC steps, Canu performs a error correction of reads before the assembly by aligning a subset of long reads to longest ~40x of the input data. Computation-wise it is like one extra Overlap step.
Canu is already installed on the Docker. Check it out:
canu -h
Choose the appropriate parameters to run Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2
) and you need to disable cluster option, since we compute on a single Amazon server set off the option to compute on cluster useGrid=false
.
Try this command:
canu -p <canu_out> -d <canu_outdir> genomeSize=1.5m -maxThreads=2 useGrid=false -pacbio-raw dmel_ch4_reads.fastq.gz
There are still a lot of parameters that are possible to tweak (wink wink: merylMemory; batMemory; batThreads; cormhapMemory; cormhapThreads
). For example if we desire to assemble haplotypes separately of if we want to smash them together, we can alternate the error correction process. There is a brilliant section in documentation about parameter tweaking.
When you succeed in finding the right combination, you'll end up with a rich output directory. For example:
aechchik@dee-serv03:/scratch/beegfs/monthly/aechchik/test_canu18$ ls -h canu_outdir2/
canu-logs canu_out2.gkpStore.err canu_out2.unitigs.layout
canu_out2.contigs.fasta canu_out2.gkpStore.gkp canu_out2.unitigs.layout.readToTig
canu_out2.contigs.gfa canu_out2.report canu_out2.unitigs.layout.tigInfo
canu_out2.contigs.layout canu_out2.trimmedReads.fasta.gz canu-scripts
canu_out2.contigs.layout.readToTig canu_out2.unassembled.fasta correction
canu_out2.contigs.layout.tigInfo canu_out2.unitigs.bed trimming
canu_out2.correctedReads.fasta.gz canu_out2.unitigs.fasta unitigging
canu_out2.gkpStore canu_out2.unitigs.gfa
The most interesting files are:
-
*.correctedReads.fasta.gz
: file containing the input sequences after correction, trim and split based on consensus evidence. -
*.trimmedReads.fastq
: file containing the sequences after correction and final trimming -
*.layout
: file containing informations about read inclusion in the final assembly -
*.gfa
: file containing the assembly graph by Canu -
*.contigs.fasta
: file containing everything that could be assembled and is part of the primary assembly
In case anything went wrong here, we provide the output of the steps of this assembly:
# canu assembly backup
wget https://drive.switch.ch/index.php/s/6yhvPsyAQQ52qH8/download -O canu.tar.gz