4. Assembly with TrioCanu - USDA-ARS-GBRU/Pepper_TrioBinning GitHub Wiki
TrioCanu refers to using the Canu assembler to first bin the sequencing reads by parental haplotype (trio-binning) and then assemble each genome from its binned reads. We used Canu v2.2 here. See the 2018 publication "De novo assembly of haplotype-resolved genomes with trio binning" and the Trio-Binning quick start documentation for more information.
In all, we will run three scripts, because trio-binning is not directly compatible with HiFi reads in Canu v2.2:
- Bin the reads
- Assemble HDA149
- Assemble HDA330
Bin HiFi Reads
canu \
-p bin -d bins \
genomeSize=3.5g \
-haplotypeHDA149 /HDA149/fastp/*.fp.fq.gz \
-haplotypeHDA330 /HDA330/fastp/*.fp.fq.gz \
-pacbio /F1_PacBio/fastq/*.fastq.gz
- Trio-binning is simply invoked by supplying the '-haplotype* .fq.gz' arguments.
- '-pacbio' should cause the script to stop after binning is finished. It doesn't right away, so it's better to watch the output and stop the job manually once the binned reads are written. Output files are described below.
The resulting directory is called 'bins'. Within this directory we can find the binned reads under 'haplotype' (bins/haplotype/). In this case, the binned reads are named:
- haplotype-HDA149.fasta.gz
- haplotype-HDA330.fasta.gz
- haplotype-unknown.fasta.gz
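Because the binning job has to be stopped by hand, it can help to confirm that all three binned files are complete before killing it. The following is a minimal sketch, not part of the original workflow: `check_bins` is a hypothetical helper, and it assumes that a file still being written will fail `gzip -t`. Passing the check is a reasonable sign that the reads are fully written, but the Canu output should still be watched.

```shell
# Hypothetical helper: return 0 only if every expected binned-read file
# exists, is non-empty, and passes a gzip integrity test.
# Usage: check_bins <haplotype_dir> <parent1> <parent2>
check_bins() {
  dir=$1; shift
  status=0
  # The 'unknown' bin is always expected alongside the named parents.
  for name in "$@" unknown; do
    f="$dir/haplotype-$name.fasta.gz"
    if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "MISSING $f"
      status=1
    fi
  done
  return "$status"
}
```

For this project the call would be `check_bins bins/haplotype HDA149 HDA330`. Note that `gzip -t` reads each file in full, so it takes a while on files of this size.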
The binning results are summarized in haplotype.log:
-- Haplotype './0-kmers/haplotype-HDA149.meryl':
-- use kmers with frequency at least 13.
-- Haplotype './0-kmers/haplotype-HDA330.meryl':
-- use kmers with frequency at least 11.
-- Begin processing file /data/F1_btrim/reads_pass.fastq
-- Finished processing file /data/F1_btrim/reads_pass.fastq with 11824151 records
--
-- 4586844 reads 83587608838 bases written to haplotype file ./haplotype-HDA149.fasta.gz.
-- 4505565 reads 82082140843 bases written to haplotype file ./haplotype-HDA330.fasta.gz.
-- 2730499 reads 36181517399 bases written to haplotype file ./haplotype-unknown.fasta.gz.
--
-- 1243 reads 473159 bases filtered for being too short.
--
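To see at a glance how the reads were split between the two parents and the unknown bin, the "written to haplotype file" lines of the log can be tallied. This is a rough sketch: `bin_summary` is a hypothetical helper, and it assumes the log lines have exactly the layout shown in the excerpt above.

```shell
# Hypothetical helper: print each bin's read count and its share of all
# binned reads, tab-separated, from a Canu haplotype.log.
# Expects lines such as:
#   -- 4586844 reads 83587608838 bases written to haplotype file ./haplotype-HDA149.fasta.gz.
bin_summary() {
  awk '
    / reads .* bases written to haplotype file / {
      name = $NF
      sub(/^\.\//, "", name)   # drop the leading "./"
      sub(/\.$/, "", name)     # drop the sentence-ending period
      reads[name] = $2
      order[++k] = name
      total += $2
    }
    END {
      if (total == 0) exit     # nothing matched; avoid dividing by zero
      for (i = 1; i <= k; i++) {
        f = order[i]
        printf "%s\t%d\t%.1f%%\n", f, reads[f], 100 * reads[f] / total
      }
    }
  ' "$1"
}
```

Assuming the log sits next to the binned reads, the call would be `bin_summary bins/haplotype/haplotype.log`. Reads filtered for being too short are not counted in the total.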
HiCanu Script for HDA149
After binning, HiCanu (-pacbio-hifi) is used to generate assemblies for HDA149 and HDA330.
- Wondering how much memory you'll need? For these two assemblies, we used 485 MB with 60 cores and the jobs took 19-48 hours to complete. Chile pepper has a 3.5 Gb genome, and the input files were 25 Gb each (haplotype-HDA149.fasta.gz and haplotype-HDA330.fasta.gz) and 11 Gb (haplotype-unknown.fasta.gz).
canu \
-p HDA149_assembly -d HDA149_assembly \
genomeSize=3.5g \
-pacbio-hifi /bins/haplotype/haplotype-HDA149.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
HiCanu Script for HDA330
canu \
-p HDA330_assembly -d HDA330_assembly \
genomeSize=3.5g \
-pacbio-hifi /bins/haplotype/haplotype-HDA330.fasta.gz /bins/haplotype/haplotype-unknown.fasta.gz
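The two HiCanu scripts differ only in the parent name, so they could be generated from one template. The sketch below is an illustration rather than the scripts we ran: `assemble_haplotype` and the `RUN` switch are hypothetical names, and by default the function only prints each command so it can be reviewed or pasted into a job script.

```shell
# Hypothetical helper: build the HiCanu command for one parent.
# Prints the command by default; runs it only when RUN=1 is set.
assemble_haplotype() {
  hap=$1
  # Same options and paths as the two scripts above, with the parent name
  # substituted; the unknown bin is included in both assemblies.
  set -- canu \
    -p "${hap}_assembly" -d "${hap}_assembly" \
    genomeSize=3.5g \
    -pacbio-hifi "/bins/haplotype/haplotype-${hap}.fasta.gz" \
                 /bins/haplotype/haplotype-unknown.fasta.gz
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    echo "$@"
  fi
}

for hap in HDA149 HDA330; do
  assemble_haplotype "$hap"
done
```

With `RUN=1` the loop would run the assemblies one after the other; since each job took 19-48 hours, submitting them as separate jobs is likely the better choice.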
Also useful is the HDA149_assembly.report file, which Canu writes to the assembly output directory (HDA149_assembly/).
Get assembly statistics
We can get statistics for our assemblies by using stats.sh from the BBTools suite.
module load bbtools
stats.sh -Xmx5g t=4 in=HDA149_assembly/HDA149_assembly.contigs.fasta
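To collect statistics for both parents in one go, the same command can be wrapped in a loop. This is a sketch under a few assumptions: `assembly_stats`, the `STATS` variable, and the `*_assembly.stats.txt` output names are all hypothetical, and the contigs are expected at the paths the assembly commands above would produce.

```shell
# STATS can be overridden if stats.sh is not on the PATH.
STATS=${STATS:-stats.sh}

# Hypothetical helper: run stats.sh on each parent's contigs and save the
# report to <parent>_assembly.stats.txt in the current directory.
assembly_stats() {
  for hap in "$@"; do
    fasta="${hap}_assembly/${hap}_assembly.contigs.fasta"
    if [ -f "$fasta" ]; then
      "$STATS" -Xmx5g t=4 in="$fasta" > "${hap}_assembly.stats.txt"
    else
      # Skip assemblies that have not finished yet.
      echo "skipping $hap: $fasta not found" >&2
    fi
  done
}
```

After `module load bbtools`, running `assembly_stats HDA149 HDA330` would leave one report per parent, which makes the two assemblies easy to compare side by side.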