Assembly Assessment - a-lud/nf-pipelines GitHub Wiki

Introduction

This sub-workflow runs a series of post-processing and assembly QC software on an assembly generated by the assembly sub-workflow.

Arguments

The current version of the assembly_assessment pipeline requires the following inputs:

--reviewed_assembly string   Directory path containing the Juicebox-edited assembly files (should contain '<id>.review.assembly' files).
--contig string              Directory path containing the Hifiasm contig output as generated by the 'assembly' pipeline.
--filtered_hifi string       Directory path to adapter-filtered HiFi sequence data generated by the 'assembly' pipeline.
--assembly string            Which genome assembly output to analyse. Options: primary, haplotype1, haplotype2, haplotypes, all.
--length integer             Remove scaffolds shorter than this length.
--busco_db string            Directory path to a pre-downloaded BUSCO database.

NOTE: There is a busco_plot process that plots all BUSCO results on a single figure (contig, scaffold, gap-filled). If you would like the contig and scaffold results to be in the figure, you'll need to set the mandatory argument --outdir to the same location as specified in the assembly pipeline.

Arguments overview

Reviewed assembly

This argument requires a directory path containing the <prefix>.review.assembly files output by Juicebox. These files must have the same prefix as the original <prefix>.assembly files generated by the assembly sub-workflow, so do not rename them. The prefix is the key used to match each reviewed file to the genome it belongs to. See here for an explanation as to why.
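The prefix matching described above can be sketched as follows. This is only an illustration of the naming rule, not the pipeline's own code: the function name `match_review_files` and the example file names are hypothetical.

```python
REVIEW_SUFFIX = ".review.assembly"
ASSEMBLY_SUFFIX = ".assembly"


def match_review_files(review_names, assembly_names):
    """Pair each '<prefix>.review.assembly' with '<prefix>.assembly' by prefix."""
    # Index the original assembly files by prefix. Review files also end in
    # '.assembly', so they are excluded explicitly.
    originals = {
        name[: -len(ASSEMBLY_SUFFIX)]: name
        for name in assembly_names
        if name.endswith(ASSEMBLY_SUFFIX) and not name.endswith(REVIEW_SUFFIX)
    }
    matched, unmatched = {}, []
    for name in review_names:
        if not name.endswith(REVIEW_SUFFIX):
            continue  # not a Juicebox review file
        prefix = name[: -len(REVIEW_SUFFIX)]
        if prefix in originals:
            matched[prefix] = (name, originals[prefix])
        else:
            unmatched.append(name)  # renamed file: no genome to match it to
    return matched, unmatched


# Hypothetical file names, for illustration only.
matched, unmatched = match_review_files(
    ["hydmaj-p_ctg.review.assembly", "renamed.review.assembly"],
    ["hydmaj-p_ctg.assembly"],
)
print(matched)
print(unmatched)
```

A renamed review file ends up in `unmatched`, which is the situation the naming rule is meant to prevent.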

Contig

This is the path to the directory containing the hifiasm contig FASTA files (p_ctg, hap1, hap2).

Filtered HiFi

Provide the directory path to the adapter-removed-reads directory produced by the assembly pipeline. This directory should contain the filtered FASTA and FASTQ files in gzip format.

Assembly

Which genome assembly to use throughout the pipeline: primary, haplotypes, haplotype1, haplotype2, all.
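As a rough sketch of how the five options might relate to the hifiasm contig labels (p_ctg, hap1, hap2), consider the mapping below. The mapping itself is an assumption inferred from the option names and the contig labels on this page, and `resolve_assembly` is a hypothetical helper, not part of the pipeline.

```python
# Assumed mapping from --assembly options to hifiasm contig labels.
ASSEMBLY_CHOICES = {
    "primary": ["p_ctg"],
    "haplotype1": ["hap1"],
    "haplotype2": ["hap2"],
    "haplotypes": ["hap1", "hap2"],
    "all": ["p_ctg", "hap1", "hap2"],
}


def resolve_assembly(option):
    """Return the contig labels selected by an --assembly value."""
    if option not in ASSEMBLY_CHOICES:
        valid = ", ".join(ASSEMBLY_CHOICES)
        raise ValueError(f"--assembly must be one of: {valid} (got {option!r})")
    return ASSEMBLY_CHOICES[option]


print(resolve_assembly("haplotypes"))
```

Rejecting unknown values up front gives a clearer error than a later "file not found" from a downstream process.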

Length

Whilst HiFi data can produce highly contiguous genome assemblies, many small fragments can still make their way into the final assembly. The length argument filters the reviewed assembly, discarding sequences shorter than --length.
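The effect of a length filter can be illustrated with a minimal sketch. It does not reproduce whichever tool the pipeline actually uses. The helper names are hypothetical, and treating the threshold as inclusive (sequences of exactly --length are kept) is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep records with at least min_length bases (inclusive: an assumption)."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy input: scaffold_1 has 12 bases, scaffold_2 has 3.
fasta = [">scaffold_1", "ACGTACGT", "ACGT", ">scaffold_2", "ACG"]
kept = filter_by_length(read_fasta(fasta), min_length=10)
print([header for header, _ in kept])
```

With a threshold of 10, only scaffold_1 survives the filter.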

BUSCO DB

To run BUSCO you need to specify a database to use. The Phoenix HPC at Adelaide University doesn't have internet access on the compute nodes, so I've simply required that you download the database you want to use ahead of time and pass its location to this argument.

Pipeline schematic

Output files

Below is an overview of what the output directory structure should look like once all the processes have completed.

assembly-results/
├── assembly-manual
│   └── pin_hic
│       ├── hydmaj-chromosome-p_ctg-pin_hic.fa
│       ├── ...
│       └── hydmaj-p_ctg.headers
├── assembly-gapClosed
│   └── hydmaj-chromosome-p_ctg
│       ├── hydmaj-chromosome-p_ctg.fa
│       ├── ...
│       └── hydmaj-chromosome-p_ctg.updated_scaff_infos
├── post-assembly-qc/mosdepth
│   ├── hydmaj-chromosome-p_ctg.mosdepth.global.dist.txt
│   ├── hydmaj-chromosome-p_ctg.mosdepth.summary.txt
│   ├── hydmaj-chromosome-p_ctg.per-base.bed.gz
│   └── hydmaj-chromosome-p_ctg.per-base.bed.gz.csi
├── post-assembly-qc/merqury/
│   ├── hydmaj-chromosome-p_ctg_only.bed
│   ├── hydmaj-chromosome-p_ctg_only.wig
│   ├── ...
│   └── reads.hist.ploidy
├── post-assembly-qc/quast/
│   ├── basic_stats
│   │   ├── cumulative_plot.png
│   │   ├── ...
│   │   └── Nx_plot.png
│   ├── quast.log
│   ├── ...
│   └── transposed_report.txt
└── post-assembly-qc/busco/
    ├── busco_figure.png
    ├── gapfilled-hydmaj-chromosome-hap1
    │   ├── logs
    │   ├── run_tetrapoda_odb10
    │   └── short_summary.specific.tetrapoda_odb10.gapfilled-hydmaj-chromosome-hap1.txt
    ├── gapfilled-hydmaj-chromosome-hap2
    │   ├── logs
    │   ├── run_tetrapoda_odb10
    │   └── short_summary.specific.tetrapoda_odb10.gapfilled-hydmaj-chromosome-hap2.txt
    └── gapfilled-hydmaj-chromosome-p_ctg
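
Once a run has finished, a quick way to check which BUSCO results are present is to look for the short_summary files under post-assembly-qc/busco. The sketch below assumes the layout shown above. `find_busco_summaries` is a hypothetical helper, and the output directory name is only an example.

```python
from pathlib import Path


def find_busco_summaries(outdir):
    """Map each BUSCO run directory name to its short_summary text file."""
    busco_dir = Path(outdir) / "post-assembly-qc" / "busco"
    if not busco_dir.is_dir():
        return {}  # nothing produced yet, or a different --outdir was used
    return {
        path.parent.name: path
        for path in sorted(busco_dir.glob("*/short_summary.*.txt"))
    }


# 'assembly-results' is the example output directory from the tree above.
for run_name, summary in find_busco_summaries("assembly-results").items():
    print(run_name, summary.name)
```

An empty result usually means the BUSCO processes have not completed or that --outdir points somewhere else.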