Getting Started - a-lud/nf-pipelines GitHub Wiki
This page provides an introduction to running this pipeline. I'll try to detail some of the key steps and gotchas that you'll likely encounter when getting started.
The instructions for how to install the pipeline are in the README.
Before running any analyses, it's usually important to curate your data files a little to ensure a smooth experience.
Remove unusual characters from file names before running any analyses. The last thing you want is a typo, non-standard character or incorrect name being carried through an analysis pipeline. I also like to reserve specific characters as separators (e.g. '-' or '_'). For example, I'll use hyphens as field separators and underscores as a replacement for white space. Further, it's typically a good idea to have a consistent file-naming scheme.
If you have multiple sequence runs for a single sample, combine the files (unless there is a purpose to keeping them separate). Further, compress large files to save disk space.
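As a minimal sketch of the tidy-up described above (the sample and run file names here are made up for illustration), the renaming, merging and compression steps might look like:

```shell
#!/usr/bin/env bash
# Hypothetical example: clean up FASTQ file names, merge runs for one
# sample, then compress the result. 'sample A (run1/2).fastq' are stand-ins.

mkdir -p clean
for f in "sample A (run1).fastq" "sample A (run2).fastq"; do
    touch "$f"                                  # stand-in for real data files
    new=$(echo "$f" | tr ' ' '_' | tr -d '()')  # e.g. sample_A_run1.fastq
    mv "$f" "clean/$new"
done

# Combine multiple runs for a single sample into one file
cat clean/sample_A_run1.fastq clean/sample_A_run2.fastq > clean/sample_A.fastq

# Compress the combined file to save disk space
gzip -f clean/sample_A.fastq
```

Note that gzip streams can also be concatenated directly (`cat a.fastq.gz b.fastq.gz > merged.fastq.gz`), so you don't need to decompress runs just to merge them.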
I've written these pipelines to be compatible with the Phoenix HPC at Adelaide University. Phoenix users have a $FAST partition that is ~1TB in size and has fast I/O. I recommend that the pipeline and the data files you intend to use be installed in a similar location on your own system!
The README specifies that working versions of Nextflow and conda are needed to run these pipelines. It does not matter how these are installed (e.g. locally, through a package manager or via modules (EasyBuild)); they just need to be present in your PATH when you go to run the pipeline.
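A quick way to confirm both tools are visible on your PATH is a small helper like the one below (a sketch — `check_tools` is a made-up function name, not part of the pipeline):

```shell
# Report whether each required tool is on PATH; return non-zero if any is missing.
check_tools() {
    local missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: NOT FOUND - install it or load the relevant module" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Check the two dependencies the README lists
check_tools nextflow conda || echo "Fix your PATH before running the pipeline"
```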
The pipeline installs software via conda, essentially creating self-contained conda environments for each process. Rather than creating the conda environments every time the pipeline is run, they are stored in a cache directory within the nf-pipelines repository. As such, it is important that you clone the pipeline repository to a location that has sufficient storage to handle multiple conda environments of varying size (hence the recommendation above to install somewhere with plenty of space).
Nextflow pipelines run processes in a working directory (-work-dir /path/to/dir). When a process finishes, Nextflow will handle the data based on the publishDir directive. I've currently set the pipeline up to use mode: copy, which copies the required output files to the designated output directory (as specified by the user). This means the pipeline essentially uses twice the amount of storage needed, as there are two copies of each file: one in the workdir and one in the outdir. Where possible, I've tried to remove unneeded outputs in process working directories. However, some files need to be kept for -resume functionality in instances where the pipeline fails.
At terminating processes (i.e. a process whose output is not consumed by any downstream process), I've tried to use mode: move to reduce the redundancy. By default, Nextflow uses mode: symlink, which isn't bad, but means the actual files you'll want to interact with are located in the working directory, which can become problematic.
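A toy illustration of why symlinked outputs can bite (the file and directory names here are invented): if the working directory is ever cleaned, symlinked results dangle, whereas copied results survive.

```shell
# Simulate Nextflow's publishDir behaviour with plain shell commands.
mkdir -p work out_symlink out_copy
echo "result" > work/result.txt

ln -s "$PWD/work/result.txt" out_symlink/result.txt  # roughly what mode: symlink does
cp work/result.txt out_copy/result.txt               # roughly what mode: copy does

rm -rf work                                          # e.g. deleting the workdir later

cat out_copy/result.txt                              # still readable
cat out_symlink/result.txt 2>/dev/null || echo "broken symlink!"
```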
The main take-away is: ensure you have plenty of storage space for the pipeline to consume, as this pipeline will use a lot of it.
If you're happy with the output of the pipeline, you are welcome to remove the workdir once the pipeline has finished and you have verified that everything completed successfully.
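Reclaiming that space might look like the sketch below (the relative `test-work` path is a stand-in for whatever you passed to -work-dir). Alternatively, `nextflow clean -f` removes the work files of the previous run while keeping Nextflow's own logs and run history.

```shell
# Check how much space the work directory is using, then remove it.
WORKDIR="test-work"     # stand-in for your actual -work-dir path

mkdir -p "$WORKDIR"     # created here only so this sketch is self-contained
du -sh "$WORKDIR"       # see what you're about to reclaim
rm -rf "$WORKDIR"       # safe once the outdir holds everything you need
```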
Running a Nextflow pipeline is pretty simple. Below I outline how I submit Nextflow jobs to the Phoenix HPC from a bash script running in a screen session.
Other compute environments may have different rules on how to run Nextflow jobs. Be sure to check with your HPC team if you are unsure of the best approach.
There are a few mandatory arguments that must be passed to nf-pipelines.
--outdir        string    Path to the output directory where the results will be saved.
--out_prefix    string    Prefix for output files.
--pipeline      string    Which sub-workflow to run. Options: msa, hyphy, codeml, transcurate, assembly, assembly_assessment, repeat.
--partition     string    Which HPC partition to use. Options: skylake, skylakehm, test.
If these arguments are not provided, the pipeline will error early and not run.
You can bring up the arguments for the pipeline you want to run by using the --help command followed by the name of the pipeline you want the help page for. Below is an example of calling --help for the assembly pipeline.
nextflow run main.nf --help assembly
Writing a simple submission script is the easiest way to run Nextflow pipelines (outside of running them directly from the command line). Below is an example of a script that runs the assembly sub-workflow.
#!/usr/bin/env bash
# Setting an environment variable for Nextflow
export NXF_OPTS="-Xms500M -Xmx2G"
## Pipeline/Directory paths
PIPE="/home/a1234567/hpcfs/software/nf-pipelines"
OUTDIR="/path/to/assembly-test"
# Call to the Nextflow pipeline
nextflow run ${PIPE}/main.nf \
--pipeline 'assembly' \
--outdir "${OUTDIR}" \
--out_prefix 'filename-out' \
-profile 'conda,slurm' \
-work-dir "${OUTDIR}/test-work" \
-with-notification 'first.last@email' \
--hifi '/path/to/hifi/data/dir' \
--hic '/path/to/hic/data/dir' \
--scaffolder 'salsa2' \
--assembly 'primary' \
--busco_db '/path/to/busco_db/tetrapoda_odb10' \
--partition 'skylakehm' \
-resume
The first section of the script defines some Java Virtual Machine (JVM) variables to limit the resources used by Nextflow.
# Setting an environment variable for Nextflow
export NXF_OPTS="-Xms500M -Xmx2G"
Here we are specifying the starting memory pool (-Xms500M) and the maximum memory pool allocation (-Xmx2G). This prevents Nextflow from soaking up all the resources on the head node when we go to run our pipeline.
Next, we define some directory paths.
PIPE="/home/a1234567/hpcfs/software/nf-pipelines"
OUTDIR="/path/to/assembly-test"
The PIPE variable stores the location where we cloned the pipeline repository. The OUTDIR variable is where we'd like the output to go. Nextflow will create the output directory if it does not already exist.
Finally, we have the call to the pipeline. I've commented what each line does below.
nextflow run ${PIPE}/main.nf \
--pipeline 'assembly' \ # Which sub-workflow to run (only assembly is supported currently!)
--outdir "${OUTDIR}" \ # Output directory location.
--out_prefix 'filename-out' \ # Name used for output files
-profile 'conda,slurm' \ # Conda for software/slurm for hpc execution
-work-dir "${OUTDIR}/test-work" \ # Manually specifying working directory location
-with-notification 'first.last@email' \ # Your email. Will send a nicely formatted run-summary
--hifi '/path/to/hifi/data/dir' \ # Path to directory containing HIFI FASTQ file
--hic '/path/to/hic/data/dir' \ # Path to directory containing Hi-C FASTQ file
--scaffolder 'salsa2' \ # Which scaffolding tool to use
--assembly 'primary' \ # Which Hifiasm output to use throughout the pipeline
--busco_db '/path/to/busco_db/tetrapoda_odb10' \ # Path to pre-downloaded BUSCO database
--partition 'skylakehm' \ # Which HPC partition to submit the job to
-resume # I leave this in. Resume the pipeline if it fails for some reason
NOTE: Remove the comments (#) if you copy-and-paste from the chunk above. Any white space or comments after the \ will cause errors.
Now that we have our script ready, we simply need to run it from the head node. Pipelines typically take a long time to complete, so it is preferable to run them in the background. If you simply ran the script above from the command line, it'd run fine until you closed your terminal, at which point the pipeline would fail and so would the submitted jobs.
My preferred method for background jobs is to use screen, but you could easily use the built-in Nextflow option -bg if you don't mind not seeing how your jobs are progressing, e.g. nextflow -bg run main.nf ....
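If you go the -bg route, a small wrapper like the one below can keep the console summary in a file (a sketch — `launch_bg` and `nextflow-console.log` are invented names; Nextflow itself writes its detailed log to .nextflow.log in the launch directory):

```shell
# Launch a pipeline detached, capturing Nextflow's console output to a file.
launch_bg() {
    nextflow -bg run main.nf "$@" > nextflow-console.log 2>&1
}

# e.g. launch_bg --pipeline 'assembly' --outdir "$OUTDIR" --out_prefix 'run1' ...
# Monitor progress later with:
#   tail -f .nextflow.log
```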
To set up a screen environment on the head node of Phoenix, I run the following commands (again, remove the comments if you copy and paste!)
$ cd /path/to/script/dir    # Change into the directory containing our Nextflow script
$ screen -S nf-assembly     # Start a screen session with the name 'nf-assembly'
$ bash nf-assembly.sh       # Run the bash script
You should then be greeted by something that looks like the following:
N E X T F L O W ~ version 21.10.0
Launching `/home/a1645424/hpcfs/software/nf-pipelines/main.nf` [kickass_poincare] - revision: efba8a88d7
---------------------------- mandatory -----------------------------------------
out_prefix filename-out
outdir /path/to/assembly-test
pipeline assembly
partition skylakehm
---------------------------- nf_arguments --------------------------------------
start 2022-04-14T10:30:53.253378+09:30
workDir /path/to/assembly-test/test-work
profile conda,slurm
---------------------------- assembly ------------------------------------------
hifi
- path /path/to/hifi/data/dir
- pattern *.fastq.gz
- nfiles 1
assembly primary
hic
- path /path/to/hic/data/dir
- pattern *_R{1,2}.fastq.gz
- nfiles 2
scaffolder salsa2
busco_db /path/to/busco_db/tetrapoda_odb10
---------------------------- cluster -------------------------------------------
partition skylakehm
max_memory 377 GB
max_cpus 40
max_time 3d
[67/2844aa] process > ASSEMBLY:hifiadapterfilt (HifiAdapterFilt hydmaj) [ 0%] 0 of 1
[- ] process > ASSEMBLY:seqkit_fq2fa -
[- ] process > ASSEMBLY:hifiasm_hic -
[- ] process > ASSEMBLY:busco_contig -
[- ] process > ASSEMBLY:bwa_mem2_index -
[- ] process > ASSEMBLY:arima_map_filter_combine -
[- ] process > ASSEMBLY:arima_dedup_sort -
[- ] process > ASSEMBLY:matlock_bam2 -
[- ] process > ASSEMBLY:salsa2 -
[- ] process > ASSEMBLY:busco_salsa2 -
[- ] process > ASSEMBLY:assembly_visualiser_salsa -
[0a/46267b] process > ASSEMBLY:kmc (KMC hydmaj) [ 0%] 0 of 1
[- ] process > ASSEMBLY:genomescope -
If everything looks to be running OK, you can detach from the screen session with the following key sequence
Ctrl + a, then d
To list your screen sessions, you can use the command below.
screen -ls
Which should show something similar to the following (note: your identifier before the name will be different)
There is a screen on:
30903.nf-assembly (Detached)
This says we have one screen session in a detached state (i.e. running in the background). We can reattach to it by running the following
screen -r nf-assembly