Getting Started - a-lud/nf-pipelines GitHub Wiki

This page provides an introduction to running this pipeline. I'll try to detail some of the key steps and gotchas that you'll likely encounter when getting started.

Installation

The instructions for how to install the pipeline are in the README.

Data preparation

Before running any analyses, it's worth curating your data files a little to ensure a smooth experience.

Sanitize file names

Remove unusual characters from file names before running any analyses. The last thing you want is a typo, non-standard character or incorrect name being carried through an analysis pipeline. I also like to reserve specific characters as separators (e.g. '-' or '_'): for example, hyphens as field separators and underscores as a replacement for white space. It's also a good idea to use a consistent file naming scheme, as in the sketch below.
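
As a minimal sketch, the snippet below replaces white space with underscores in any file names in the current directory; the example names in the comments are purely illustrative, not real pipeline inputs.

# Replace white space with underscores in all file names in the current directory
for f in *\ *; do
    [ -e "$f" ] || continue          # Skip if nothing matched the glob
    mv -- "$f" "${f// /_}"
done

# A consistent scheme might then look like (hypothetical sample names):
#   sampleA-liver-hifi_R1.fastq.gz
#   sampleA-liver-hifi_R2.fastq.gz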

Aggregate and compress

If you have multiple sequence runs for a single sample, combine the files (unless there is a purpose to keeping them separate). Further, compress large files to save disk space.
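 
For example, gzipped FASTQ files can be concatenated directly without decompressing them first, and uncompressed files can be gzipped to save space. The file names below are placeholders.

# gzip files can be concatenated as-is; no need to decompress first
cat sampleA_run1.fastq.gz sampleA_run2.fastq.gz > sampleA.fastq.gz

# Compress any remaining uncompressed FASTQ files
gzip sampleB.fastq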

Setting up your workspace

I've written these pipelines to be compatible with the Phoenix HPC at the University of Adelaide. Phoenix users have a $FAST partition that is ~1TB in size and has fast I/O. I recommend installing the pipeline, and storing the data files you intend to use, in a similar location on your own system!

The README specifies that working versions of Nextflow and conda are needed to run these pipelines. It does not matter how these are installed (e.g. locally, through a package manager or via modules (EasyBuild)); they just need to be present in your PATH when you go to run the pipeline.
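
A quick way to confirm both tools are visible in your PATH before launching anything (the exact paths and versions printed will obviously differ on your system):

command -v nextflow && nextflow -version    # Path and version of Nextflow
command -v conda && conda --version         # Path and version of conda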

The pipeline installs software via conda, essentially creating self-contained conda environments for each process. Rather than creating the conda environments every time the pipeline is run, they are stored in a cache directory within the nf-pipelines repository. As such, it is important that you clone the pipeline repository to a location that has sufficient storage to handle multiple conda environments of varying size (hence the recommendation above to install somewhere with plenty of space).
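
As a rough guide, you can check how much space is free at your intended install location and keep an eye on how much the cloned repository (including its cached conda environments) is using over time. The paths below are placeholders for wherever you clone the repository.

df -h /path/to/install/location       # Free space on the filesystem you intend to clone into
du -sh /path/to/nf-pipelines          # Total size of the cloned repository, conda cache included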

Storage requirements and the 'working directory'

Nextflow pipelines run processes in a working directory (-work-dir /path/to/dir). When a process finishes, Nextflow will handle the data based on the publishDir directive. I've currently set the pipeline up to use mode: copy, which copies the required output files to the designated output directory (as specified by the user). This means the pipeline essentially uses twice the storage it needs, as there are two copies of each file: one in the workdir and one in the outdir. Where possible, I've tried to remove unneeded outputs in process working directories. However, some files need to be kept for -resume functionality in instances where the pipeline fails.

For terminating processes (i.e. processes whose output is not consumed by any downstream process), I've tried to use mode: move to reduce this redundancy. By default, Nextflow uses mode: symlink, which isn't bad, but it means the actual files you'll want to interact with live in the working directory, which can become problematic.

The main take-away is: ensure you have plenty of storage space available, as this pipeline will use a lot of it. Once the pipeline has finished and you have verified that everything completed successfully, you are welcome to remove the workdir.
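
For example, you could remove the working directory by hand, or use Nextflow's clean command. The path below matches the -work-dir used in the example script later on this page and is only illustrative.

# Option 1: remove the working directory manually
rm -rf /path/to/assembly-test/test-work

# Option 2: use Nextflow's clean command, run from the directory the pipeline was launched from
nextflow clean -f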

Running the pipeline

Running a Nextflow pipeline is pretty simple. Below I outline how I submit Nextflow jobs to the Phoenix HPC from a bash script running inside a screen session.

Other compute environments may have different rules on how to run Nextflow jobs. Be sure to check with your HPC team if you are unsure of the best approach.

Mandatory arguments

There are a few mandatory arguments that must be passed to nf-pipelines.

--outdir string              Path to the output directory where the results will be saved.
--out_prefix string          Prefix for output files.
--pipeline string            Specification of which sub-workflow to run. Options: msa, hyphy, codeml, transcurate, assembly, assembly_assessment, repeat.
--partition string           Which HPC partition to use. Options: skylake, skylakehm, test.

If these arguments are not provided, the pipeline will error early and not run.
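
As a minimal sketch, a run supplying only the mandatory arguments might look like the following. The paths and prefix are placeholders, and most sub-workflows will also require their own arguments, described next.

nextflow run main.nf \
    --pipeline 'assembly' \
    --outdir '/path/to/output' \
    --out_prefix 'my-sample' \
    --partition 'skylake'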

Sub-workflow arguments

You can bring up the arguments for the pipeline you want to run by using the --help flag followed by the name of the pipeline you want the help page for. Below is an example of calling --help for the assembly pipeline.

nextflow run main.nf --help assembly

Create a job script

Writing a simple submission script is the easiest way to run Nextflow pipelines (outside of running them directly from the command line). Below is an example of a script that runs the assembly sub-workflow.

#!/usr/bin/env bash

# Setting an environment variable for Nextflow
export NXF_OPTS="-Xms500M -Xmx2G"

## Pipeline/Directory paths
PIPE="/home/a1234567/hpcfs/software/nf-pipelines"
OUTDIR="/path/to/assembly-test"

# Call to the Nextflow pipeline
nextflow run ${PIPE}/main.nf \
    --pipeline 'assembly' \
    --outdir "${OUTDIR}" \
    --out_prefix 'filename-out' \
    -profile 'conda,slurm' \
    -work-dir "${OUTDIR}/test-work" \
    -with-notification 'first.last@email' \
    --hifi '/path/to/hifi/data/dir' \
    --hic '/path/to/hic/data/dir' \
    --scaffolder 'salsa2' \
    --assembly 'primary' \
    --busco_db '/path/to/busco_db/tetrapoda_odb10' \
    --partition 'skylakehm' \
    -resume

The first section of the script sets an environment variable containing Java Virtual Machine (JVM) options to limit the resources used by Nextflow.

# Setting an environment variable for Nextflow
export NXF_OPTS="-Xms500M -Xmx2G"

Here we are specifying the starting memory pool (-Xms500M) and the maximum memory pool allocation (-Xmx2G). This prevents Nextflow from soaking up all the resources on the head-node when we go to run our pipeline.

Next, we define some directory paths.

PIPE="/home/a1234567/hpcfs/software/nf-pipelines"
OUTDIR="/path/to/assembly-test"

The PIPE variable stores the location to where we cloned the pipeline repository. The OUTDIR variable is where we'd like the output to go. Nextflow will create the output directory if it does not exist already.

Finally, we have the call to the pipeline. I've commented what each line does below.

nextflow run ${PIPE}/main.nf \
    --pipeline 'assembly' \                             # Which sub-workflow to run (only assembly is supported currently!)
    --outdir "${OUTDIR}" \                              # Output directory location.
    --out_prefix 'filename-out' \                       # Name used for output files
    -profile 'conda,slurm' \                            # Conda for software/slurm for hpc execution
    -work-dir "${OUTDIR}/test-work" \                   # Manually specifying working directory location
    -with-notification 'first.last@email' \             # Your email. Will send a nicely formatted run-summary
    --hifi '/path/to/hifi/data/dir' \                   # Path to directory containing HIFI FASTQ file
    --hic '/path/to/hic/data/dir' \                     # Path to directory containing Hi-C FASTQ file
    --scaffolder 'salsa2' \                             # Which scaffolding tool to use
    --assembly 'primary' \                              # Which Hifiasm output to use throughout the pipeline
    --busco_db '/path/to/busco_db/tetrapoda_odb10' \    # Path to pre-downloaded BUSCO database
    --partition 'skylakehm' \                           # Which HPC partition to submit the job to
    -resume                                             # I leave this in. Resume the pipeline if it fails for some reason

NOTE: Remove the comments (#) if you copy-and-paste from the chunk above. Any white space/comments after the \ will cause errors.

Run the pipeline from a screen

Now that we have our script ready, we simply need to run it from the head node. Pipelines typically take a long time to complete, so it is preferable to run them in the background. If you simply ran the script above from the command line, it'd run fine until you closed your terminal, at which point the pipeline would fail and so would the submitted jobs.

My preferred method for background jobs is to use screen, but you could easily use Nextflow's built-in -bg option if you don't mind not seeing how your jobs are progressing, e.g. nextflow -bg run main.nf ....
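
If you do go the -bg route, a sketch is shown below; Nextflow writes its progress to the .nextflow.log file in the launch directory, which you can follow instead of watching the terminal. The '...' stands in for the same arguments used in the script above.

nextflow -bg run ${PIPE}/main.nf --pipeline 'assembly' ...    # Same arguments as in the script above
tail -f .nextflow.log                                         # Follow progress via the log in the launch directory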

To set up a screen environment on the head node of Phoenix, I run the following commands (again, remove comments if you copy and paste!)

cd /path/to/script/dir          # Change into the directory containing our Nextflow script
screen -S nf-assembly           # This will activate a screen environment with the name 'nf-assembly'
bash nf-assembly.sh             # Run the bash script

You should then be greeted by something that looks like the following

N E X T F L O W  ~  version 21.10.0
Launching `/home/a1645424/hpcfs/software/nf-pipelines/main.nf` [kickass_poincare] - revision: efba8a88d7
---------------------------- mandatory -----------------------------------------

out_prefix                   filename-out
outdir                       /path/to/assembly-test
pipeline                     assembly
partition                    skylakehm

---------------------------- nf_arguments --------------------------------------

start                        2022-04-14T10:30:53.253378+09:30
workDir                      /path/to/assembly-test/test-work
profile                      conda,slurm

---------------------------- assembly ------------------------------------------

hifi
 - path                      /path/to/hifi/data/dir
 - pattern                   *.fastq.gz
 - nfiles                    1
assembly                     primary
hic
 - path                      /path/to/hic/data/dir
 - pattern                   *_R{1,2}.fastq.gz
 - nfiles                    2
scaffolder                   salsa2
busco_db                     /path/to/busco_db/tetrapoda_odb10

---------------------------- cluster -------------------------------------------

partition                    skylakehm
max_memory                   377 GB
max_cpus                     40
max_time                     3d

[67/2844aa] process > ASSEMBLY:hifiadapterfilt (HifiAdapterFilt hydmaj) [  0%] 0 of 1
[-        ] process > ASSEMBLY:seqkit_fq2fa                             -
[-        ] process > ASSEMBLY:hifiasm_hic                              -
[-        ] process > ASSEMBLY:busco_contig                             -
[-        ] process > ASSEMBLY:bwa_mem2_index                           -
[-        ] process > ASSEMBLY:arima_map_filter_combine                 -
[-        ] process > ASSEMBLY:arima_dedup_sort                         -
[-        ] process > ASSEMBLY:matlock_bam2                             -
[-        ] process > ASSEMBLY:salsa2                                   -
[-        ] process > ASSEMBLY:busco_salsa2                             -
[-        ] process > ASSEMBLY:assembly_visualiser_salsa                -
[0a/46267b] process > ASSEMBLY:kmc (KMC hydmaj)                         [  0%] 0 of 1
[-        ] process > ASSEMBLY:genomescope                              -

If everything looks to be running OK, you can exit the screen session using the following key sequence

ctrl + a, then d

To list your screen environments, you can use the command below.

screen -ls

This should show something similar to the following (note: the identifier before the name will be different on your system)

There is a screen on:
	30903.nf-assembly	(Detached)

This says we have one screen in a detached state (i.e. running in the background). We can reattach to this screen by running the following

screen -r nf-assembly