Assembly Assessment - a-lud/nf-pipelines GitHub Wiki
This sub-workflow runs a series of post-processing and assembly QC software on an assembly generated by the assembly
sub-workflow.
The current version of the assembly_assessment pipeline requires the following inputs:
--reviewed_assembly   string    Directory path containing the Juicebox-edited assembly files (should contain '<id>.review.assembly' files).
--contig              string    Directory path containing the Hifiasm contig output as generated by the 'assembly' pipeline.
--filtered_hifi       string    Directory path to adapter-filtered HiFi sequence data generated by the 'assembly' pipeline.
--assembly            string    Which genome assembly output to analyse. Options: primary, haplotype1, haplotype2, haplotypes, all.
--length              integer   Filter out scaffolds shorter than this length.
--busco_db            string    Directory path to a pre-downloaded BUSCO database.
NOTE: There is a busco_plot process that plots all BUSCO results on a single figure (contig, scaffold, gap-filled). If you would like the contig and scaffold results to be in the figure, you'll need to set the mandatory argument --outdir to the same location as specified in the assembly pipeline.
The --reviewed_assembly argument requires a directory path containing the <prefix>.review.assembly files output from the software Juicebox. These files should have the same prefix as the original <prefix>.assembly files generated by the assembly sub-workflow. Do not change the names of these files. The prefix is used as a key to match each file to the genome it belongs to. See here for an explanation as to why.
The --contig argument is the path to the directory containing the hifiasm contig FASTA files (p_ctg, hap1, hap2).
For --filtered_hifi, provide the directory path to the adapter-removed-reads directory produced by the assembly pipeline. This directory should contain the filtered FASTA and FASTQ files in GZIP format.
The --assembly argument selects which genome assembly to use throughout the pipeline: primary, haplotypes, haplotype1, haplotype2, all.
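One plausible reading of these options, given the three hifiasm contig sets named earlier (p_ctg, hap1, hap2), is sketched below. The mapping and the function name are assumptions for illustration and are not taken from the pipeline source.

```python
# Assumed mapping from each --assembly option to hifiasm contig sets.
ASSEMBLY_CHOICES = {
    "primary": ["p_ctg"],
    "haplotype1": ["hap1"],
    "haplotype2": ["hap2"],
    "haplotypes": ["hap1", "hap2"],
    "all": ["p_ctg", "hap1", "hap2"],
}


def select_contig_sets(option):
    """Return the contig sets implied by an --assembly option (hypothetical helper)."""
    try:
        return ASSEMBLY_CHOICES[option]
    except KeyError:
        valid = ", ".join(ASSEMBLY_CHOICES)
        raise ValueError(f"unknown option '{option}'; expected one of: {valid}")


selected = select_contig_sets("haplotypes")
```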
Whilst HiFi data can produce highly contiguous genome assemblies, many small fragments can still make their way into the final assembly. The --length argument filters the reviewed assembly, keeping only sequences longer than the value given.
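The filtering step can be pictured with the minimal sketch below. It is not the tool the pipeline uses; the function names are hypothetical, and treating the threshold as inclusive (keeping sequences of exactly that length) is an assumption, since the boundary behaviour is not stated here.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep records with at least min_length bases (inclusive bound is an assumption)."""
    return [(h, s) for h, s in records if len(s) >= min_length]


example = io.StringIO(">scaffold_1\nACGTACGTAC\n>scaffold_2\nACG\n")
kept = filter_by_length(read_fasta(example), min_length=5)
kept_names = [header for header, _ in kept]
```

Here scaffold_2 (3 bases) is dropped and scaffold_1 (10 bases) is retained.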
To run BUSCO you need to specify a database to use. The Phoenix HPC at Adelaide University doesn't have internet access on the compute nodes. Therefore, I've simply required that you download the database you want to use ahead of time and pass its location to --busco_db.
Below is an overview of what the output directory structure should look like once all the processes have completed.
assembly-results/
├── assembly-manual
│   └── pin_hic
│       ├── hydmaj-chromosome-p_ctg-pin_hic.fa
│       ├── ...
│       └── hydmaj-p_ctg.headers
├── assembly-gapClosed
│   └── hydmaj-chromosome-p_ctg
│       ├── hydmaj-chromosome-p_ctg.fa
│       ├── ...
│       └── hydmaj-chromosome-p_ctg.updated_scaff_infos
├── post-assembly-qc/mosdepth
│   ├── hydmaj-chromosome-p_ctg.mosdepth.global.dist.txt
│   ├── hydmaj-chromosome-p_ctg.mosdepth.summary.txt
│   ├── hydmaj-chromosome-p_ctg.per-base.bed.gz
│   └── hydmaj-chromosome-p_ctg.per-base.bed.gz.csi
├── post-assembly-qc/merqury/
│   ├── hydmaj-chromosome-p_ctg_only.bed
│   ├── hydmaj-chromosome-p_ctg_only.wig
│   ├── ...
│   └── reads.hist.ploidy
├── post-assembly-qc/quast/
│   ├── basic_stats
│   │   ├── cumulative_plot.png
│   │   ├── ...
│   │   └── Nx_plot.png
│   ├── quast.log
│   ├── ...
│   └── transposed_report.txt
└── post-assembly-qc/busco/
    ├── busco_figure.png
    ├── gapfilled-hydmaj-chromosome-hap1
    │   ├── logs
    │   ├── run_tetrapoda_odb10
    │   └── short_summary.specific.tetrapoda_odb10.gapfilled-hydmaj-chromosome-hap1.txt
    ├── gapfilled-hydmaj-chromosome-hap2
    │   ├── logs
    │   ├── run_tetrapoda_odb10
    │   └── short_summary.specific.tetrapoda_odb10.gapfilled-hydmaj-chromosome-hap2.txt
    └── gapfilled-hydmaj-chromosome-p_ctg