4. Assembly with TrioCanu - USDA-ARS-GBRU/Pepper_TrioBinning GitHub Wiki

This assembly process also has two steps: first we bin the reads, then we use the binned reads to generate the assemblies.

Binning HiFi Reads

canu \
 -p bin -d bins \
 genomeSize=3.5g \
 gridOptionsbat="--partition=mem" batMemory=1300 \
 gridOptionscns="--partition=mem" cnsMemory=1300 \
 -haplotypeHDA149 /HDA149/fastp/*.fp.fq.gz \
 -haplotypeHDA330 /HDA330/fastp/*.fp.fq.gz \
 -pacbio /F1_PacBio/fastq/*.fastq.gz

Canu's different modes are selected by the options it is given. TrioCanu is invoked by supplying two -haplotype* options, each followed by one parent's .fq.gz read files; HiCanu is invoked with -pacbio-hifi. In the version available at the time of this research (early 2023, Canu v2.2), TrioCanu and HiCanu are not directly compatible. The workaround is to deliberately mis-specify the HiFi reads as CLR reads with -pacbio, which causes Canu to error out during assembly, and then to write two new HiCanu scripts, one per parent, that assemble the binned reads.

The resulting directory is called 'bins'. Within this directory we can find the binned reads under 'haplotype' (bins/haplotype/). In this case, the binned reads are named:

  1. haplotype-HDA149.fasta.gz
  2. haplotype-HDA330.fasta.gz
  3. haplotype-unknown.fasta.gz
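
Before moving on, it can help to confirm that all three bins exist and to see how the reads were split. The following is a minimal sketch, not part of the original workflow; the function names count_fasta_reads and summarize_bins are hypothetical helpers, and it assumes the bins are gzipped FASTA files as named above.

```shell
# Count FASTA records (header lines starting with '>') in a gzipped file.
count_fasta_reads() {
  gzip -cd "$1" | grep -c '^>'
}

# Print one "file<TAB>reads" line per bin found in a haplotype directory.
summarize_bins() {
  dir=$1
  for f in "$dir"/haplotype-*.fasta.gz; do
    [ -e "$f" ] || continue   # skip if the glob matched nothing
    printf '%s\t%s\n' "$(basename "$f")" "$(count_fasta_reads "$f")"
  done
}
```

For the layout described here, it would be called as `summarize_bins bins/haplotype`. Decompressing bins of this size takes a while, so this is best run as a batch job.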

After binning reads we can run HiCanu to generate assemblies.

HiCanu Script for HDA149

canu \
 -p HDA149_assembly -d HDA149_assembly \
 genomeSize=3.5g \
 gridOptionsbat="--partition=mem" batMemory=1300 \
 gridOptionscns="--partition=mem" cnsMemory=1300 \
 -pacbio-hifi /bins/haplotype/haplotype-HDA149.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
  • Notice that two bins are specified for each assembly: the bin for the corresponding parent and the shared reads (called haplotype-unknown.fasta.gz).
  • The assembly (HDA149_assembly.contigs.fasta) is written into the HDA149_assembly directory.

Also useful is the HDA149_assembly.report in the same directory. The first few lines are:

[TRIMMING/READS]
--
-- In sequence store './HDA149_assembly.seqStore':
--   Found 14638388 reads.
--   Found 239631337128 bases (68.46 times coverage).
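
The coverage figure in the report can be checked by hand: it is the total number of bases divided by the genome size given to Canu. The sketch below redoes that arithmetic with the numbers from the excerpt above and also derives the mean read length; the variable names are illustrative, and the genome size is assumed to be exactly the 3.5g passed on the command line.

```shell
# Values copied from the report excerpt; genome size is the genomeSize=3.5g given to Canu.
bases=239631337128
reads=14638388
genome_size=3500000000

# Coverage = total bases / genome size; mean read length = total bases / read count.
coverage=$(awk -v b="$bases" -v g="$genome_size" 'BEGIN { printf "%.1f", b / g }')
mean_len=$(awk -v b="$bases" -v n="$reads" 'BEGIN { printf "%.0f", b / n }')

echo "coverage: ${coverage}x"
echo "mean read length: ${mean_len} bp"
```

This gives roughly 68.5x coverage, in line with the report, and a mean read length of about 16.4 kb.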

HiCanu Script for HDA330

canu \
 -p HDA330_assembly -d HDA330_assembly \
 genomeSize=3.5g \
 gridOptionsbat="--partition=mem" batMemory=1300 \
 gridOptionscns="--partition=mem" cnsMemory=1300 \
 -pacbio-hifi /bins/haplotype/haplotype-HDA330.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
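
The two HiCanu scripts differ only in the parent name, so they can be generated from one template. The sketch below only prints the commands (a dry run) so they can be reviewed before submission; hicanu_cmd is a hypothetical helper, and every option and path is copied unchanged from the scripts above.

```shell
# Print (not run) the HiCanu command for one parent.
hicanu_cmd() {
  parent=$1
  # In an unquoted here-document, "\\" emits a single literal backslash.
  cat <<EOF
canu \\
 -p ${parent}_assembly -d ${parent}_assembly \\
 genomeSize=3.5g \\
 gridOptionsbat="--partition=mem" batMemory=1300 \\
 gridOptionscns="--partition=mem" cnsMemory=1300 \\
 -pacbio-hifi /bins/haplotype/haplotype-${parent}.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
EOF
}

# Emit one command per parent.
for p in HDA149 HDA330; do
  hicanu_cmd "$p"
  echo
done
```

Keeping the shared options in one place makes it harder for the two assemblies to drift apart, for example if the memory settings are changed for one parent only.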

Get assembly statistics

We can get statistics for our assemblies by using stats.sh from the BBTools suite.

module load bbtools
stats.sh -Xmx5g t=4 in=HDA149_assembly/HDA149_assembly.contigs.fasta
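
As a quick cross-check that does not depend on BBTools, the contig count, total length, and N50 can be computed directly from the FASTA file. This is a minimal sketch and not a replacement for stats.sh; fasta_summary is a hypothetical helper and assumes an uncompressed FASTA file.

```shell
# Print contig count, total length, and N50 for an uncompressed FASTA file.
fasta_summary() {
  # Step 1: emit one contig length per line (sequences may span several lines).
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  # Step 2: sort lengths from longest to shortest.
  sort -rn |
  # Step 3: N50 is the length of the contig at which the running sum
  # first reaches half of the total assembly length.
  awk '{ lens[NR] = $1; total += $1 }
       END {
         half = total / 2
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (run >= half) {
             printf "contigs=%d total_bp=%.0f N50=%.0f\n", NR, total, lens[i]
             exit
           }
         }
       }'
}
```

For example, `fasta_summary HDA149_assembly/HDA149_assembly.contigs.fasta` prints a single summary line that can be compared against the stats.sh output.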