4. Assembly with TrioCanu - USDA-ARS-GBRU/Pepper_TrioBinning GitHub Wiki

TrioCanu refers to using the Canu assembler (v2.2 here) first to bin the sequencing reads by parental haplotype (trio-binning) and then to assemble each genome from its binned reads. See the 2018 publication "De novo assembly of haplotype-resolved genomes with trio binning" and the Trio-Binning quick start documentation for more information.

In all, we will run three scripts, because trio-binning is not directly compatible with HiFi reads in Canu v2.2:

Bin the reads
Assemble HDA149
Assemble HDA330

Bin HiFi Reads

canu \
 -p bin -d bins \
 genomeSize=3.5g \
 -haplotypeHDA149 /HDA149/fastp/*.fp.fq.gz \
 -haplotypeHDA330 /HDA330/fastp/*.fp.fq.gz \
 -pacbio /F1_PacBio/fastq/*.fastq.gz

Trio-binning is invoked simply by supplying the -haplotype* arguments, each followed by that parent's short-read files (.fq.gz here).
'-pacbio' should cause the run to stop once binning is finished. In practice it does not stop right away, so watch the output and stop the job manually once the binned reads have been written. The output files are described below.
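One way to know when it is safe to stop the job is to poll the binning log instead of watching the output by hand. The sketch below is not part of Canu. It assumes the log is written to bins/haplotype/haplotype.log and that the "bases written to haplotype file" lines only appear once binning is complete, as in the haplotype.log excerpt on this page. The function names and the polling interval are hypothetical.

```shell
# Return success once the binning log reports reads written to haplotype files.
# Assumption: the log lives at <dir>/haplotype.log (default dir: bins/haplotype).
binning_finished() {
    log="${1:-bins/haplotype}/haplotype.log"
    [ -f "$log" ] && grep -q 'bases written to haplotype file' "$log"
}

# Poll every <interval> seconds (default 600) until binning is finished,
# then print a reminder that the Canu job can be stopped.
wait_for_binning() {
    dir="${1:-bins/haplotype}"
    interval="${2:-600}"
    until binning_finished "$dir"; do
        sleep "$interval"
    done
    echo "binning finished: reads are in $dir"
}
```

Running `wait_for_binning bins/haplotype 300` in a second terminal would check every five minutes and return once the log shows the binned reads were written.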

The resulting directory is called 'bins'. Within this directory we can find the binned reads under 'haplotype' (bins/haplotype/). In this case, the binned reads are named:

haplotype-HDA149.fasta.gz
haplotype-HDA330.fasta.gz
haplotype-unknown.fasta.gz
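A quick sanity check on the binned files is to count the FASTA records in each one. The helper below is a minimal sketch: the function names are made up for illustration, and the default directory simply mirrors the -d bins option used above.

```shell
# Count FASTA records (header lines starting with '>') in a gzipped file.
count_fasta_reads() {
    gzip -dc "$1" | grep -c '^>'
}

# Print "<file name><TAB><read count>" for every binned file in a directory.
# The directory defaults to bins/haplotype.
report_bins() {
    bin_dir="${1:-bins/haplotype}"
    for f in "$bin_dir"/haplotype-*.fasta.gz; do
        # If the glob matched nothing, $f is the literal pattern and does not exist.
        [ -e "$f" ] || { echo "no binned reads found in $bin_dir" >&2; return 1; }
        printf '%s\t%s\n' "$(basename "$f")" "$(count_fasta_reads "$f")"
    done
}
```

Calling `report_bins` with no argument reads bins/haplotype. On files this large the decompression takes a while, so it is worth running once and saving the output.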

The binning results are summarized in haplotype.log:

--  Haplotype './0-kmers/haplotype-HDA149.meryl':
--   use kmers with frequency at least 13.
--  Haplotype './0-kmers/haplotype-HDA330.meryl':
--   use kmers with frequency at least 11.
-- Begin    processing file /data/F1_btrim/reads_pass.fastq
-- Finished processing file /data/F1_btrim/reads_pass.fastq with 11824151 records
--
--  4586844 reads  83587608838 bases written to haplotype file ./haplotype-HDA149.fasta.gz.
--  4505565 reads  82082140843 bases written to haplotype file ./haplotype-HDA330.fasta.gz.
--  2730499 reads  36181517399 bases written to haplotype file ./haplotype-unknown.fasta.gz.
--
--     1243 reads       473159 bases filtered for being too short.
--
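In the log above, the reads written to the three bins plus the reads filtered as too short add up to the number of records processed (4586844 + 4505565 + 2730499 + 1243 = 11824151). The sketch below automates that check with awk. It assumes the log keeps the line layout shown above. The function name is hypothetical, and the log path is passed in as an argument.

```shell
# Compare the number of input records with the reads accounted for
# (written to a haplotype file or filtered) in a haplotype.log file.
# Exits 0 when the two totals agree, 1 otherwise.
check_binning_log() {
    awk '
        # "-- Finished processing file <path> with <N> records"
        / with [0-9]+ records/ { total += $(NF - 1) }
        # "--  <N> reads  <M> bases written to ..." or "... filtered for ..."
        / reads .* bases (written|filtered)/ { accounted += $2 }
        END {
            printf "records=%d accounted=%d\n", total, accounted
            exit (total == accounted ? 0 : 1)
        }
    ' "$1"
}
```

For example, `check_binning_log bins/haplotype/haplotype.log` (the path is an assumption) prints both totals and returns a non-zero status if any reads are unaccounted for.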

HiCanu Script for HDA149

After binning, HiCanu (-pacbio-hifi) is used to generate assemblies for HDA149 and HDA330.

Wondering how much memory you'll need? For these two assemblies, we used 485 GB of memory with 60 cores, and the jobs took 19-48 hours to complete. Chile pepper has a 3.5 Gb genome, and the input files were 25 GB each (haplotype-HDA149.fasta.gz and haplotype-HDA330.fasta.gz) and 11 GB (haplotype-unknown.fasta.gz).

canu \
 -p HDA149_assembly -d HDA149_assembly \
 genomeSize=3.5g \
 -pacbio-hifi /bins/haplotype/haplotype-HDA149.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz

HiCanu Script for HDA330

canu \
 -p HDA330_assembly -d HDA330_assembly \
 genomeSize=3.5g \
 -pacbio-hifi /bins/haplotype/haplotype-HDA330.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
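
The two HiCanu scripts differ only in the parent name, so they can be generated from one template. The sketch below only prints the commands (a dry run) and does not call canu itself. The function name and the optional bins directory argument are hypothetical, and the flags are copied from the scripts above.

```shell
# Print (do not run) the HiCanu command for one parental haplotype.
# $1: parent name, e.g. HDA149
# $2: directory holding the binned reads (defaults to /bins/haplotype as above)
hicanu_cmd() {
    hap="$1"
    bins_dir="${2:-/bins/haplotype}"
    printf 'canu -p %s_assembly -d %s_assembly genomeSize=3.5g -pacbio-hifi %s/haplotype-%s.fasta.gz %s/haplotype-unknown.fasta.gz\n' \
        "$hap" "$hap" "$bins_dir" "$hap" "$bins_dir"
}

# Show the command for each parent.
for parent in HDA149 HDA330; do
    hicanu_cmd "$parent"
done
```

Each printed line can be pasted into its own job script, which keeps the two assemblies consistent if an option needs to change later.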

Also useful is the HDA149_assembly.report file, found in the same directory as the assembly output (HDA149_assembly/).

Get assembly statistics

We can get statistics for our assemblies with stats.sh from the BBTools suite.

module load bbtools
stats.sh -Xmx5g t=4 in=HDA149_assembly/HDA149_assembly.contigs.fasta
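
If BBTools is not available, the headline numbers can be cross-checked with standard tools. The sketch below is not stats.sh and reports far less. It is a plain awk and sort pipeline that computes the contig count, total length, and N50 from an uncompressed FASTA file. The function name is hypothetical.

```shell
# Compute contig count, total length, and N50 from an uncompressed FASTA file.
fasta_n50() {
    # Step 1: emit one sequence length per contig (sequences may span lines).
    awk '
        /^>/ { if (len) print len; len = 0; next }
             { len += length($0) }
        END  { if (len) print len }
    ' "$1" | sort -rn | awk '
        # Step 2: lengths arrive longest first; N50 is the length of the contig
        # at which the running sum reaches half of the total assembly length.
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%.0f N50=%d\n", NR, total, n50
        }
    '
}
```

For example, `fasta_n50 HDA149_assembly/HDA149_assembly.contigs.fasta` gives values that can be compared against the stats.sh report.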