Genome Assembly - a-lud/nf-pipelines GitHub Wiki
This sub-workflow is designed around assembling PacBio HiFi sequence reads into high-quality chromosomal assemblies.
The current version of the Assembly pipeline requires the following inputs:
--hifi string Directory path containing the HiFi FASTQ file.
--assembly string Which genome assembly output to analyse. Options: primary, haplotype1, haplotype2, haplotypes, all.
--hic string Directory path containing the Hi-C FASTQ files.
--scaffolder string Which scaffolding software to use. Options: pin_hic, salsa2, all.
--busco_db string Directory path to a pre-downloaded BUSCO database.
This argument requires a directory path to where the PacBio HiFi reads are located. The reads are expected to be gzipped FASTQ files
(with the extension *.fastq.gz or *.fq.gz). The reads are used for the primary contig assembly using hifiasm, along with genome size
estimation using KMC and GenomeScope2. The FASTQ files are also converted to FASTA format for gap-closing in the subsequent
assembly_assessment sub-workflow.
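As a rough illustration of the input expected by --hifi, the sketch below lists the files in a directory that carry one of the two accepted extensions. It is not part of the pipeline; the function name find_fastq_files is hypothetical, and the sketch only checks file names, not file contents.

```python
from pathlib import Path

# Extensions the --hifi argument is described as accepting.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def find_fastq_files(directory):
    """Return the sorted gzipped FASTQ files found directly inside `directory`.

    Hypothetical helper for illustration only; it matches on file name
    and does not open or validate the files.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )
```

Running this on the directory you intend to pass to --hifi before launching the workflow is a quick way to confirm the reads will be picked up by name.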
Hifiasm produces a range of outputs depending on the data that you give it. If you provide HiFi and Hi-C data, it will generate not only the primary assembly, but also separate assemblies for haplotype-1 and haplotype-2. These files should contain phased scaffolds/chromosomes. However, each collection of sequences will be a mix of maternal and paternal sequences, e.g. chromosome one may be maternal in origin while chromosome two is paternal in origin. To properly phase the collections of sequences into maternal and paternal files you would require Trio data.
I provide the option to specify which sequence/s you wish to progress with - primary, haplotypes (both haplotypes), haplotype1, haplotype2 or all. Just note: providing all will use a fair bit of disk space. Consider this if you have a quota.
Similar to the --hifi argument, the --hic argument just expects a directory path to where the Hi-C files are located in GZIP FASTQ format.
These reads are used by hifiasm to phase contigs (hifiasm does not scaffold the contigs) and by the scaffolding tools to convert the
contigs to chromosomes.
There are many scaffolding tools out there, but pin_hic and SALSA2 seem to be some of the best. The --scaffolder argument accepts the values pin_hic, salsa2 or all. This simply lets you choose which tool to scaffold with, or to use both if you want to compare. From my testing, pin_hic seems to be as good as SALSA2 and works considerably faster while using less RAM and disk space.
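To get a feel for how the --assembly and --scaffolder choices multiply the amount of work (and disk space), the sketch below expands each option into the individual assemblies and tools it implies. The mapping is an assumption based on the option descriptions above (in particular, that all means the primary assembly plus both haplotypes); planned_scaffolds is a hypothetical helper, not something the pipeline exposes.

```python
from itertools import product

# Assumed expansion of each --assembly value into individual assemblies.
ASSEMBLY_CHOICES = {
    "primary": ["primary"],
    "haplotype1": ["haplotype1"],
    "haplotype2": ["haplotype2"],
    "haplotypes": ["haplotype1", "haplotype2"],
    "all": ["primary", "haplotype1", "haplotype2"],
}

# Assumed expansion of each --scaffolder value into individual tools.
SCAFFOLDER_CHOICES = {
    "pin_hic": ["pin_hic"],
    "salsa2": ["salsa2"],
    "all": ["pin_hic", "salsa2"],
}


def planned_scaffolds(assembly, scaffolder):
    """Return the (assembly, tool) pairs implied by the two options."""
    if assembly not in ASSEMBLY_CHOICES:
        raise ValueError(f"unknown --assembly value: {assembly}")
    if scaffolder not in SCAFFOLDER_CHOICES:
        raise ValueError(f"unknown --scaffolder value: {scaffolder}")
    return list(
        product(ASSEMBLY_CHOICES[assembly], SCAFFOLDER_CHOICES[scaffolder])
    )


if __name__ == "__main__":
    for pair in planned_scaffolds("all", "all"):
        print(pair)
```

Under these assumptions, choosing all for both options results in six scaffolding runs, compared with one for a single assembly and a single tool.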
To run BUSCO you need to specify a database to use. The Phoenix HPC at Adelaide University doesn't have internet access on the compute-nodes. Therefore, I've simply required that you download the database you want to use ahead of time and pass its location to this argument.

The assembly sub-workflow has a number of output files that it generates. Below I've included a tree-schematic of what the output directory
should look like.
out_directory
├── adapter-removed-reads
│   ├── out-prefix.filt.fastq.gz
│   ├── out-prefix.fasta.gz
│   ├── out-prefix.blocklist
│   └── out-prefix.stats
├── assembly-contigs
│   └── out-prefix
│       ├── out-prefix-hap1.fa
│       └── ...
├── assembly-scaffold
│   └── pin_hic-out-prefix-hap1
│       ├── out-prefix-hap1.agp
│       ├── out-prefix-hap1.scaffold.fa
│       ├── ...
│       └── juicebox-files
│           ├── out-prefix-hap1-pin_hic.assembly
│           └── out-prefix-hap1-pin_hic.hic
├── genome-size
│   └── genomescope-out-prefix
│       ├── out-prefix_linear_plot.png
│       ├── out-prefix_summary.txt
│       └── ...
├── post-assembly-qc
│   ├── busco
│   │   ├── contig-out-prefix-hap1
│   │   └── scaffold-out-prefix-hap1-pin_hic
└── reports
    ├── DAG.svg
    ├── report.html
    ├── timeline.html
    └── trace.txt
The tree diagram above doesn't include every output file, but should give an indication of the output directories and types of files you should see in each directory.
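If you want a quick sanity check after a run, a small script along these lines can report which of the top-level directories from the tree are missing. This is only a sketch: the directory names are taken from the diagram above, missing_output_dirs is a hypothetical helper, and a directory may legitimately be absent depending on the options you ran with.

```python
from pathlib import Path

# Top-level directories shown in the output tree.
EXPECTED_DIRS = (
    "adapter-removed-reads",
    "assembly-contigs",
    "assembly-scaffold",
    "genome-size",
    "post-assembly-qc",
    "reports",
)


def missing_output_dirs(out_directory):
    """Return the expected top-level directories not found in `out_directory`."""
    root = Path(out_directory)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list means every top-level directory from the diagram is present; it says nothing about whether the files inside them are complete.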
- If you are going to use the assembly_assessment sub-workflow after editing your assembly in Juicebox, DO NOT RENAME THE .assembly FILE. Juicebox will create a new .assembly file called <prefix>.review.assembly that is used by the assembly_assessment workflow. If you change anything about the <prefix>, the workflow will not know which genome the .review.assembly belongs to and will error.
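The reason the prefix matters can be illustrated with a short sketch of prefix-based matching. This is not the workflow's actual code; it only assumes, as described above, that the reviewed file is named <prefix>.review.assembly and the original <prefix>.assembly. Both function names are hypothetical.

```python
from pathlib import Path

REVIEW_SUFFIX = ".review.assembly"


def prefix_from_review(review_path):
    """Strip the .review.assembly suffix to recover the <prefix>."""
    name = Path(review_path).name
    if not name.endswith(REVIEW_SUFFIX):
        raise ValueError(f"{name} is not a .review.assembly file")
    return name[: -len(REVIEW_SUFFIX)]


def match_review_to_assembly(review_path, assembly_paths):
    """Return the original .assembly path sharing the review file's prefix.

    Returns None when no original file has the same prefix, which is
    what happens if the prefix was changed after editing in Juicebox.
    """
    expected_name = prefix_from_review(review_path) + ".assembly"
    for candidate in assembly_paths:
        if Path(candidate).name == expected_name:
            return candidate
    return None
```

With an untouched prefix such as out-prefix-hap1-pin_hic, the review file maps straight back to out-prefix-hap1-pin_hic.assembly; with a renamed prefix there is nothing to match against, which mirrors the error described above.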