012: Nextflow execution on UCL cluster

This tutorial provides a working example of how to run nextflow pipelines on the UCL cluster (and submit jobs from nextflow). It assumes that you are familiar with the basics of Nextflow, Singularity and job submission under Sun Grid Engine (SGE). If not, please familiarize yourself with those tools first.

Step 1: Configure your account to run Nextflow

:warning: If you're planning to use nextflow on the UCL cluster, you will need to log in to a special node - askey. It is accessible through gamble, i.e. you first need to ssh gamble and then ssh <your_user_name>@askey.cs.ucl.ac.uk (try just ssh askey if the previous command doesn't work). If you see something like:

could not open any host key
ssh_keysign: no reply
sign using hostkey ecdsa-sha2-nistp521 SHA256:Vr+PP1cVtxUtq23TEhjvye0MmRGjhyKTWDpK0AKobbU failed
Authentication failed.

ask [email protected] for access.
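
For reference, the full login sequence looks something like the sketch below; <your_user_name> is a placeholder, and it assumes gamble is reachable from where you connect (you may need its full address or an ~/.ssh/config entry):

# first hop: gamble, then the nextflow node askey
ssh <your_user_name>@gamble
ssh <your_user_name>@askey.cs.ucl.ac.uk   # or simply: ssh askey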

:warning: Please use only this node to run nextflow. :warning:

As with any other software, you need to load the nextflow module from the library. In addition, nextflow requires Java to be available on the server, so it needs to be loaded too. I suggest adding the following lines to ~/.bashrc so that the modules are loaded and available as soon as you log in to the cluster:

export PATH=/share/apps/jdk-10.0.1/bin/:${PATH}
export JAVA_HOME=/share/apps/jdk-10.0.1/
export PATH=/share/apps/colcc/nextflow/:${PATH}

You need to log out and log back in for the changes to take effect. If you don't want to add these lines to your ~/.bashrc, simply run them in the terminal. However, once you log out and log in again, nextflow won't be available anymore.
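
If you prefer not to log out, a quick alternative is to re-read ~/.bashrc in the current shell and check that both tools are picked up (a sketch; the exact paths printed may differ):

source ~/.bashrc   # apply the new PATH/JAVA_HOME to the current session
which nextflow     # should point somewhere under /share/apps/colcc/nextflow/
java -version      # should report the JDK from /share/apps/jdk-10.0.1/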

To test that nextflow is loaded, type in terminal:

nextflow -v

Output should be:

nextflow version 20.07.1.5412

Step 2: Examine main.nf

main.nf is just a regular nextflow script; nothing distinguishes it from scripts run on local machines. In fact, it could be run on a local computer.
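
For illustration only, a purely local run might look like the sketch below. It assumes samtools, bwa and picard are already installed on your machine (without the UCL profile there is no SGE submission and no container is attached to the processes), and it passes the two parameters that are otherwise defined in conf/ucl.conf:

# hypothetical local run, outside the cluster
nextflow run main.nf -entry prepare_reference_genome \
    --refGen "$PWD/_assets/reference_genome/test.Homo_sapiens.GRCh37.chr6.dna_sm.fa" \
    --refGen_link "ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna_sm.chromosome.6.fa.gz"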

The script contains 4 processes:

  1. download_reference_genome - downloads chr6 of hg19 (GRCh37) from Ensembl
  2. index_reference_genome_samtools - indexes downloaded fasta with samtools
  3. index_reference_genome_bwa - indexes downloaded fasta with bwa
  4. index_reference_genome_picard - indexes downloaded fasta with picard

All indexing processes are executed in parallel, i.e. indexing with samtools, bwa and picard runs at the same time. This is done to speed up the overall indexing.

Note that although indexing is done with samtools, bwa and picard, we do not load the corresponding modules. This is because we will use a singularity container (see below).

Upon successful completion of the workflow a message will be shown, and the directory with temporary files (work/) can be removed. In addition, the directory _assets/reference_genome/ will be created, containing the following files:

test.Homo_sapiens.GRCh37.chr6.dna_sm.dict
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.amb
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.ann
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.bwt
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.fai
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.pac
test.Homo_sapiens.GRCh37.chr6.dna_sm.fa.sa

Step 3: Configuration files

In comparison to a standard nextflow run, we need to make two changes: 1) tell nextflow that we are running it under SGE, and 2) tell it to use singularity containers.

Configuration of nextflow on SGE

All the code described below is in conf/ucl.conf; please examine the file carefully.

In general, to let nextflow know that we will run it on a server managed by SGE, the following directives are added under the executor scope:

executor {
    name = 'sge'
    queueSize = 75
    pollInterval = '30sec'
}
  • name states that the scheduler running on the server is SGE
  • queueSize limits the number of jobs nextflow will submit at the same time. The default is 100.
  • pollInterval determines how often nextflow polls to check for process termination.
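
For a one-off run you don't have to edit the config to change the queue size: nextflow run also accepts a -qs option (the full run command and its other options are explained in Step 4). A sketch:

# cap this particular run at 20 concurrently submitted jobs, overriding queueSize
nextflow run main.nf -profile ucl -entry prepare_reference_genome -qs 20 -bg 1>nf.out 2>nf.err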

When the workflow is executed, every process will be submitted to the cluster as a separate job. This is why we need to specify clusterOptions under the process scope.

process {
    executor = 'sge'
    errorStrategy = 'finish'
    cache = 'lenient'

    clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=1G,tmem=2G'

    withLabel: 'XL' {
        cpus = 8
        penv = 'smp'
        clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=32G,tmem=32G -pe smp 8'
   }
}
  • executor = 'sge' specifies that every process will be run under the SGE executor defined above.
  • clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=1G,tmem=2G' sets the cluster options for processes without any label. Cluster options are basically a linearized header of a classical job submission script.
    • The -S /bin/bash part is necessary and is followed by the various options usually found in the header of a job submission script.
    • In the example above, -cwd tells SGE to use the directory from which the job was submitted as the working directory.
    • -l h_rt=24:00:00,h_vmem=1G,tmem=2G determines the wall time (24 h) and the virtual and physical memory limits. Please note that the memory is requested per CPU, i.e. if you request 2G of memory and 8 CPUs, 16G of memory will be allocated in total. A full list of all directives is available here.
    • Please note that here we didn't specify the number of cores the process will run on, so it defaults to 1.
  • clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=32G,tmem=32G -pe smp 8' defines the resources available to a process requiring extra large amounts of memory and time. Note that -pe smp 8 is added here, which gives the process 8 cores to run on. Since the process will run on several cores, we also need to specify the penv = 'smp' directive. The cpus = 8 directive does not itself set the number of CPUs available to the process; I put it there so I could refer to the number of requested cores in the process command via task.cpus and set the correct number of threads. Please see the index_reference_genome_picard process in main.nf for an example, and the short sketch below.
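
As a minimal sketch of that pattern (a hypothetical process, not part of this pipeline), a label plus task.cpus lets the command use the same number of threads that the cluster options request:

process example_multithreaded_sort {
    // 'XL' picks up '-pe smp 8' from conf/ucl.conf; task.cpus then evaluates to 8
    label 'XL'

    input:
        path(bam)

    script:
    """
    samtools sort -@ ${task.cpus} -o sorted.bam ${bam}
    """
}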

Configuration to use Singularity with Nextflow

The code described below is in nextflow.config; please examine the file carefully.

In order to allow nextflow to use singularity containers, the following scope needs to be added:

singularity {
    enabled = true
    autoMounts = true
    runOptions = "--bind ${PWD}"
}

Out of the options above, the most important is runOptions = "--bind ${PWD}". It allows a singularity container to access files outside the container by mounting host paths inside it. Unlike with Docker, Nextflow does not automatically mount host paths into the container when using Singularity; it expects them to be configured and mounted system-wide by the Singularity runtime.
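
If your data lives outside the launch directory, you can bind additional host paths in the same option. A sketch, with /SAN/colcc/my_data as a made-up example path:

singularity {
    enabled = true
    autoMounts = true
    // bind the launch directory plus an extra (hypothetical) data directory
    runOptions = "--bind ${PWD} --bind /SAN/colcc/my_data"
}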

The code described below is in conf/ucl.conf.

Usually, nextflow is capable of automatically pulling containers from multiple sources, including Docker registries and the singularity library (https://cloud.sylabs.io/library), so no pre-run download of containers is needed. However, the UCL cluster has a firewall which prevents automatic pulling, and therefore containers have to be downloaded before the pipeline is run. Be careful if you would like to use Docker containers, as they don't always work under Singularity. It is better to create a corresponding Singularity container; ask Maria for help if needed.

For this tutorial we need just one container, which we will pull from the singularity library. It contains bwa, samtools and picard.

mkdir _singularity_images
cd _singularity_images
singularity pull library://marialitovchenko/default/bwa:v0.7.17 
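
Once the image is downloaded (by default singularity pull names it bwa_v0.7.17.sif, which is the file name the configuration below expects), it is worth a quick sanity check that the tools inside it respond, for example:

# still inside _singularity_images/: the bundled samtools should print its version
singularity exec bwa_v0.7.17.sif samtools --version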

Now we can let our processes know that we have downloaded the container and that they can use it (in conf/ucl.conf the withName selectors below sit inside the process scope):

   params {
        singularityDir="${PWD}/_singularity_images/"
   }

   /* Containers */
   withName:index_reference_genome_samtools {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }

   withName:index_reference_genome_picard {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }
        
   withName:index_reference_genome_bwa {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }

Step 4: Running Nextflow on SGE with the ability to submit jobs

Usually, to submit a job one would create a script with scheduler directives in the header (see example here) followed by a call to the software one wants to use. It does not work this way with nextflow. A classical job script is submitted from the login node, which then sends the job to compute nodes; nextflow itself has to run on the login node so that it can submit its processes as jobs to the compute nodes. In other words, do not put a call to nextflow in your job submission script.

To run nextflow type in terminal:

nextflow run main.nf -profile ucl -entry prepare_reference_genome -bg 1>nf.out 2>nf.err 
  • -profile ucl tells nextflow to run with the UCL SGE profile configured in conf/ucl.conf.
  • -entry prepare_reference_genome specifies the workflow to execute.
  • -bg puts the nextflow run in the background; this option allows you to run long workflows without constantly keeping the terminal open.
  • 1>nf.out 2>nf.err redirects messages and errors into nf.out and nf.err respectively.
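
If the run stops partway (an error, or you kill it), you can usually restart from the cached results instead of from scratch by adding nextflow's -resume flag, for example:

# re-run, reusing the results of processes that already completed successfully
nextflow run main.nf -profile ucl -entry prepare_reference_genome -resume -bg 1>nf.out 2>nf.err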

After you press enter, nothing should happen and no message should appear. However, if you run

qstat

you should see something like

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2377141 0.00000 nf-prepare username     qw    09/08/2020 13:09:34                                2        

which means the first process, namely downloading of the genome, is now in the queue and will use 2 slots (cores). It may take some time to appear.

Yay! Nextflow is running!

Now it's a good time to log off by typing

exit

in a terminal and log back in again. You have to type exit and not just close the terminal, because otherwise your background processes (nextflow included) will be terminated.
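
After logging back in, you can confirm that the background nextflow run survived, for example:

# the nextflow java process should still be listed
ps -u $USER -f | grep -i [n]extflow
# nextflow's own run history (run this from the directory you launched the pipeline in)
nextflow log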

When you log back in, the download job may already be running:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2377141 0.00000 nf-prepare username     r    09/08/2020 13:09:34 [email protected]      2        

After some time, three indexing jobs will be running in parallel:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2377145 0.00000 nf-prepare username     r    09/08/2020 13:12:34 [email protected]      4        
2377146 0.00000 nf-prepare username     r    09/08/2020 13:12:34 [email protected]      8        
2377147 0.00000 nf-prepare username     r    09/08/2020 13:12:34 [email protected]       6 

Note: sometimes, while the workflow is still running, the output of qstat may be empty. That's OK; it just means the scheduler hasn't started the next processes of the pipeline yet.

If you would like to check on the status of the pipeline by means other than qstat, you can display the contents of nf.out and nf.err:

tail nf.* 

Upon successful workflow completion, the following message will be written to nf.out:

N E X T F L O W  ~  version 20.07.1
Launching `main.nf` [modest_tuckerman] - revision: 31df34739a
WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE
[c4/4303ae] Submitted process > prepare_reference_genome:download_reference_genome (download_reference)
[f8/c0b748] Submitted process > prepare_reference_genome:index_reference_genome_bwa (index_reference_bwa)
[a1/70f0a9] Submitted process > prepare_reference_genome:index_reference_genome_samtools (index_reference_samtools)
[59/6655b5] Submitted process > prepare_reference_genome:index_reference_genome_picard (index_reference)
Pipeline completed at: 2020-09-08T16:04:49.321214+01:00
Execution status: OK

Success! Now you can remove the work/ directory containing temporary files, for example as shown below.
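
For example, from the launch directory:

# remove intermediate task directories; the published results in _assets/ are kept
rm -rf work/
# alternatively, let nextflow clean up the work directories of the last run
nextflow clean -f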

Minimal nextflow pipeline example

main.nf

#!/usr/bin/env nextflow
nextflow.preview.dsl=2

/*
        Organization: UCL Cancer Institute
        Laboratory: Cancer Genome Evolution
        Authors: Maria Litovchenko
        Purpose: A simple script to show nextflow run in UCL cluster (SGE)
        Notes:
*/

params.refGenDir = file(params.refGen).getParent()

/* ----------------------------------------------------------------------------
* Help message
*----------------------------------------------------------------------------*/

// Help message
def helpMessage() {
        log.info """

        Nextflow test pipeline to demonstrate how to run it on UCL cluster.
        UCL cluster is managed by SGE. This pipeline downloads and indexes
        reference genome (hg19) with use of singularity container. The
        pipeline should be finished under 5 minutes. No input files required.

        To run the pipeline, type in terminal:
        nextflow run main.nf -profile ucl -entry prepare_reference_genome -bg 1>nf.out 2>nf.err 
        """
}

// Show help message
params.help = ''
if (params.help) {
    helpMessage()
    exit 0
}

/* ----------------------------------------------------------------------------
* Workflows
*----------------------------------------------------------------------------*/
workflow prepare_reference_genome {
        /*
                A workflow to download and index reference genome
        */

        download_reference_genome(params.refGen_link)
        index_reference_genome_samtools(download_reference_genome.out)
        index_reference_genome_bwa(download_reference_genome.out)
        index_reference_genome_picard(download_reference_genome.out)
}

/* ----------------------------------------------------------------------------
* Processes
*----------------------------------------------------------------------------*/
process download_reference_genome {
        /*
                Downloads reference genome
        */

        publishDir "${params.refGenDir}", pattern: '*sm.fa', mode: "copy",
                   overwrite: true
        tag { "download_reference" }
        label "S"

        input:
                val link_to_genome

        output:
                file "*sm.fa"

        shell:
        """
        wget ${link_to_genome}
        gunzip *

        refGene_fileName=`basename !{params.refGen}`
        echo \$refGene_fileName
        cat *.fa > \$refGene_fileName
        """
}

process index_reference_genome_samtools {
        /*
                Index reference genome with samtools
        */
        publishDir "${params.refGenDir}", pattern: "*.fai",
                   mode: "copy", overwrite: true
        tag { "index_reference_samtools" }
        label "M"
        input:
                path(path_to_genome)
        output:
                file "*.fai"

        script:
        """
        samtools faidx ${path_to_genome}
        """
}

process index_reference_genome_bwa {
        /*
                Index reference genome for use with BWA
        */
        publishDir "${params.refGenDir}", pattern: '*.{bwt,pac,ann,amb,sa}',
                   mode: "copy", overwrite: true
        tag { "index_reference_bwa" }
        label "XL"

        input:
                path(path_to_genome)
        output:
                tuple file("*.ann"), file("*.bwt"), file("*.pac"), file("*.sa"),
                      file("*.amb")

        script:
        """
        bwa index -a bwtsw ${path_to_genome}
        """
}

process index_reference_genome_picard {
        /*
                Index reference genome with picard
        */
        publishDir "${params.refGenDir}", pattern: "*.dict", mode: "copy"
        tag { "index_reference" }
        label "L"

        input:
                path(path_to_genome)
        output:
                file "*.dict"

        shell:
        '''
        refGene_fileName=$(basename !{path_to_genome})
        refGene_fileName=${refGene_fileName%.*}
        java -jar -XX:ParallelGCThreads=!{task.cpus} /bin/picard.jar \
                  CreateSequenceDictionary R=!{path_to_genome} \
                                           O=$refGene_fileName.dict
        '''
}

// inform about completion
workflow.onComplete {
    println "Pipeline completed at: $workflow.complete"
    println "Execution status: ${ workflow.success ? 'OK' : 'failed' }"
}

nextflow.config

import java.time.*
Date now = new Date()

manifest {
  name = 'Nextflow on UCL cluster (SGE)'
  homePage = 'https://github.com/McGranahanLab/Wiki'
  description = 'Example of nextflow pipeline to run on UCL cluster (under SGE) with singularity containers'
  mainScript = 'main.nf'
  nextflowVersion = '>=19.09.0-edge'
  version = '1.0.0'
}

singularity {
    enabled = true
    autoMounts = true
    runOptions = "--bind ${PWD}"
}

params {
    /* Defaults */
    assets = "${PWD}/_assets"

    timestamp = now.format("yyyyMMdd-HH-mm-ss")
    today = now.format("yyyyMMdd")
    tracedir = "pipeline_info"
}

process {
    cache = 'lenient'
    errorStrategy = 'finish'
}

profiles {
    ucl { includeConfig 'conf/ucl.conf' }
}

timeline {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_timeline.html"
}
report {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_report.html"
}
trace {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_trace.txt"
}

dag {
    enabled = true
    file = "${params.tracedir}/${params.timestamp}_dag.svg"
}

conf/ucl.conf

/*
        Organization: UCL Cancer Institute
        Laboratory: Cancer Genome Evolution
        Authors: Maria Litovchenko
        Purpose: A configuration file to run nextflow on UCL cluster (SGE)
        Notes:
*/


executor {
    name = 'sge'
    queueSize = 75
    pollInterval = '30sec'
}

notification {
    enabled = true
    to = '[email protected]'
}

params {
    debug = false

    /* Reference genome links and paths */
    refGen_link = "ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna_sm.chromosome.6.fa.gz"
    refGen="${PWD}/_assets/reference_genome/test.Homo_sapiens.GRCh37.chr6.dna_sm.fa"

    singularityDir="${PWD}/_singularity_images/"
}
process {
    executor = 'sge'
    errorStrategy = 'finish'
    cache = 'lenient'

    clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=1G,tmem=2G'

    maxRetries = 3

    /* Labels */
    withLabel: 'XS' {
        clusterOptions = '-S /bin/bash -cwd -l h_rt=1:00:00,h_vmem=2G,tmem=2G'
    }

    withLabel: 'S' {
        cpus = 2
        penv = 'smp'
        clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=4G,tmem=4G -pe smp 2'
    }

    withLabel: 'M' {
        cpus = 4
        penv = 'smp'
        clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=8G,tmem=8G -pe smp 4'
    }

    withLabel: 'L' {
        cpus = 6
        penv = 'smp'
        clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=16G,tmem=16G -pe smp 6'
    }

    withLabel: 'XL' {
        cpus = 8
        penv = 'smp'
        clusterOptions = '-S /bin/bash -cwd -l h_rt=24:00:00,h_vmem=32G,tmem=32G -pe smp 8'
   }

   /* Containers */
   withName:index_reference_genome_samtools {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }

   withName:index_reference_genome_picard {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }

   withName:index_reference_genome_bwa {
        container = "${params.singularityDir}bwa_v0.7.17.sif"
   }
}