Genome Assembly - a-lud/nf-pipelines GitHub Wiki

Introduction

This sub-workflow is designed around assembling PacBio HiFi sequence reads into high-quality chromosomal assemblies.

Arguments

The current version of the Assembly pipeline requires the following inputs:

--hifi string                Directory path containing the HiFi Fastq file.
--assembly string            Which genome assembly output to analyse. Options: primary, haplotype1, haplotype2, haplotypes, all.
--hic string                 Directory path containing the Hi-C Fastq files.
--scaffolder string          Which scaffolding software to use. Options: pin_hic, salsa2, all.
--busco_db string            Directory path to a pre-downloaded BUSCO database.
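
As a rough illustration of the argument list above, the following Python sketch checks the two option-style arguments against their allowed values and assembles the five arguments into a list. The function name build_assembly_args is hypothetical and not part of the pipeline, and the sketch deliberately leaves out the launch command itself, since that is not described on this page.

```python
# Allowed values, copied from the argument list above.
ASSEMBLY_CHOICES = ("primary", "haplotype1", "haplotype2", "haplotypes", "all")
SCAFFOLDER_CHOICES = ("pin_hic", "salsa2", "all")


def build_assembly_args(hifi, assembly, hic, scaffolder, busco_db):
    """Return the five assembly arguments as a flat list of strings.

    Hypothetical helper for illustration only: it validates the option
    values and does not launch anything.
    """
    if assembly not in ASSEMBLY_CHOICES:
        raise ValueError(f"--assembly must be one of {ASSEMBLY_CHOICES}, got {assembly!r}")
    if scaffolder not in SCAFFOLDER_CHOICES:
        raise ValueError(f"--scaffolder must be one of {SCAFFOLDER_CHOICES}, got {scaffolder!r}")
    return [
        "--hifi", str(hifi),
        "--assembly", assembly,
        "--hic", str(hic),
        "--scaffolder", scaffolder,
        "--busco_db", str(busco_db),
    ]
```

The returned list can be appended to whatever command you use to start the workflow.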

Argument overview

Hifi

This argument requires the path to the directory containing the PacBio HiFi reads. The reads are expected to be GZIP-compressed FASTQ files (with the extension *.fastq.gz or *.fq.gz). The reads are used for the primary contig assembly using hifiasm, along with genome size estimation using KMC and GenomeScope2. The FASTQ files are also converted to FASTA format for gap-closing in the subsequent assembly_assessment sub-workflow.
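
If you want to check a directory before passing it to --hifi (or --hic), something like the Python sketch below would list the files with the expected extensions. It only mirrors the naming convention described above; the function name find_gzip_fastq is hypothetical, and how the pipeline itself discovers the files is not shown here.

```python
from pathlib import Path

# Extensions the pipeline expects, per the description above.
FASTQ_SUFFIXES = (".fastq.gz", ".fq.gz")


def find_gzip_fastq(directory):
    """Return the sorted GZIP FASTQ files found directly inside `directory`."""
    directory = Path(directory)
    if not directory.is_dir():
        raise NotADirectoryError(f"not a directory: {directory}")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.name.endswith(FASTQ_SUFFIXES)
    )
```

An empty result means the directory holds no files matching *.fastq.gz or *.fq.gz.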

Assembly

Hifiasm produces a range of outputs depending on the data that you give it. If you provide HiFi and Hi-C data, it will generate not only the primary assembly, but also separate assemblies for haplotype-1 and haplotype-2. These files should contain phased scaffolds/chromosomes. However, each collection of sequences will be a mix of maternal and paternal sequences, e.g. chromosome one is maternal in origin, but chromosome two is paternal in origin. To properly phase the collections of sequences into maternal and paternal files you would require trio data.

I provide the option to specify which sequence/s you wish to progress with - primary, haplotypes (both haplotypes), haplotype1, haplotype2 or all. Just note: providing all will use a fair bit of disk space. Consider this if you have a quota.
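
To make the options concrete, here is a small Python sketch of how each --assembly value could map to the assemblies carried forward. The mapping for haplotypes follows the description above; reading all as the primary assembly plus both haplotypes is my assumption, and the name expand_assembly_choice is hypothetical.

```python
# Which assemblies each --assembly value selects.
# "all" = primary + both haplotypes is an assumption, not confirmed by the pipeline.
ASSEMBLY_SELECTION = {
    "primary": ["primary"],
    "haplotype1": ["haplotype1"],
    "haplotype2": ["haplotype2"],
    "haplotypes": ["haplotype1", "haplotype2"],
    "all": ["primary", "haplotype1", "haplotype2"],
}


def expand_assembly_choice(choice):
    """Return the list of assemblies selected by an --assembly value."""
    try:
        # Copy the list so callers cannot modify the lookup table.
        return list(ASSEMBLY_SELECTION[choice])
    except KeyError:
        valid = ", ".join(ASSEMBLY_SELECTION)
        raise ValueError(f"unknown --assembly value {choice!r}; expected one of: {valid}") from None
```

Every selected assembly goes through the downstream steps, which is why all uses the most disk space.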

Hic

Similar to the --hifi argument, the --hic argument just expects a directory path to where the Hi-C files are located in GZIP FASTQ format. These reads are used by hifiasm to phase contigs (hifiasm does not scaffold the contigs) and by the scaffolding tools to convert the contigs to chromosomes.

Scaffolder

There are many scaffolding tools out there, but pin_hic and SALSA2 seem to be some of the best. The --scaffolder argument accepts the values pin_hic, salsa2 or all. This simply lets you choose which tool to scaffold with, or to use both if you want to compare. From my testing, pin_hic seems to be as good as SALSA2 and works considerably faster while using less RAM and disk space.

BUSCO DB

To run BUSCO you need to specify a database to use. The Phoenix HPC at Adelaide University doesn't have internet access on the compute nodes. Therefore, I've simply required that you download the database you want to use ahead of time and pass its location to this argument.

Pipeline schematic

Output files

The assembly sub-workflow has a number of output files that it generates. Below I've included a tree-schematic of what the output directory should look like.

out_directory
├── adapter-removed-reads
│   ├── out-prefix.filt.fastq.gz
│   ├── out-prefix.fasta.gz
│   ├── out-prefix.blocklist
│   └── out-prefix.stats
├── assembly-contigs
│   └── out-prefix
│       ├── out-prefix-hap1.fa
│       └── ...
├── assembly-scaffold
│   └── pin_hic-out-prefix-hap1
│       ├── out-prefix-hap1.agp
│       ├── out-prefix-hap1.scaffold.fa
│       ├── ...
│       └── juicebox-files
│           ├── out-prefix-hap1-pin_hic.assembly
│           └── out-prefix-hap1-pin_hic.hic
├── genome-size
│   └── genomescope-out-prefix
│       ├── out-prefix_linear_plot.png
│       ├── out-prefix_summary.txt
│       └── ...
├── post-assembly-qc
│   ├── busco
│   │   ├── contig-out-prefix-hap1
│   │   └── scaffold-out-prefix-hap1-pin_hic
└── reports
    ├── DAG.svg
    ├── report.html
    ├── timeline.html
    └── trace.txt

The tree diagram above doesn't include every output file, but should give an indication of the output directories and types of files you should see in each directory.

Important output notes

  • If you are going to use the assembly_assessment sub-workflow after editing your assembly in Juicebox, DO NOT RENAME THE .assembly FILE. Juicebox will create a new .assembly file called <prefix>.review.assembly that is used by the assembly_assessment workflow. If you change anything about the <prefix>, the workflow will not know which genome the .review.assembly belongs to and will error.
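
A quick way to check that a reviewed file still carries the original prefix is sketched below in Python. The helper names assembly_prefix and review_matches_original are hypothetical; the sketch only compares file names and does not reproduce how assembly_assessment actually looks the files up.

```python
from pathlib import Path


def assembly_prefix(path):
    """Return <prefix> from '<prefix>.review.assembly' or '<prefix>.assembly'."""
    name = Path(path).name
    # Check the longer suffix first so '.review' is not left in the prefix.
    for suffix in (".review.assembly", ".assembly"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"not a Juicebox assembly file: {name}")


def review_matches_original(original, review):
    """True if `review` is '<prefix>.review.assembly' for the same <prefix> as `original`."""
    return (
        Path(review).name.endswith(".review.assembly")
        and assembly_prefix(original) == assembly_prefix(review)
    )
```

For example, out-prefix-hap1-pin_hic.review.assembly matches out-prefix-hap1-pin_hic.assembly, whereas a file saved as my-edit.review.assembly would not.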