NGS III: Computer clusters - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Analysis-of-next-generation-sequencing-data (SC00024)

The purpose of this exercise for you to become familiar with how to use a computer cluster for heavy bioinformatics processes.

A note on computer clusters

Bioinformatics analyses can take long time (and memory) due to the big amount of data we handle. To deal with this it is quite common to use a computer cluster, i.e. a set of connected computers (nodes) that work together as if they are a single (much more powerful) machine.

Until now, we have been running our jobs (all the commands for the QC and Mapping practicals) in the login node, as they do not take so many resources. However, this is never advised when you work in a cluster. Remember that a cluster is used by several users at the same time and we have experienced how slow the server can get! So as a best practice always use the computer nodes.

Copy (or make a soft link) of /home/courses/Cluster/MySample.bam

Nodes and cores

List all available nodes using

$ qstat -u '*' -f

Q. How many nodes are there and how many cores does each node have?

Scheduling system

Our cluster uses SGE as job schedule system, and we just need to create a script with some instructions (or commands) and submit it to the queue.

Create a script called runIndexStats.sh with the following information:

#!/bin/bash -l

#$ -N indexStats 
#$ -j y
#$ -cwd
#$ -pe mpi 1 
#$ -q all.q

# load the required modules
module load samtools/1.17 

# print the command
echo "-------------
cmd: samtools index $1 
-------------
"
# Run the command, $1 means the first user input 
time samtools index $1 


# printing and running the command
echo "-------------
cmd: samtools flagstat $1 > $1.flagstat 
-------------
"
time samtools flagstat $1 > $1.flagstat 

# print end message
echo "--------------- Fin ---------------"

Q. What does -N, -j,-cwd,-pe and -q mean? What is the script doing?

Queue your script using:

$ qsub Path/to/runIndexStats.sh Path/to/bamfile

Check your job with qstat -u '*' -f , you should see something like:

queuename                      qtype resv/used/tot. np_load  arch          states
---------------------------------------------------------------------------------
[email protected]      BIP   0/6/16         0.00     lx-amd64
       132 0.55500 indexStats teacher3     r     10/24/2023 11:38:41     1
---------------------------------------------------------------------------------
[email protected]       BIP   0/6/16         0.00     lx-amd64
       273 0.55500 mapping    student24    r     11/02/2023 17:03:46     1
---------------------------------------------------------------------------------
[email protected]         BIP   0/2/16         0.00     lx-amd64

Once the script is done, check the standard error/output stream of the job: scriptName.ojobid. If everything went well, you should see:

-------------
cmd: samtools index chr21_dna.bowtie.sorted.bam
-------------


real    0m1.041s
user    0m0.966s
sys     0m0.025s
-------------
cmd: samtools flagstat chr21_dna.bowtie.sorted.bam > chr21_dna.bowtie.sorted.bam.flagstat
-------------


real    0m1.072s
user    0m0.957s
sys     0m0.026s
--------------- Fin ---------------

Things can always go wrong, in case you need to interrupt the job, you can use qdel jobid.

Add the following line to your script, after loading the module:

sleep 10000;

Submit your job
Display the list of jobs

After a while the script will still be running (it will sleep for around 3 hours!).

Use qdel to terminate the job
Confirm your job is not running anymore

You now have a bash script that you can reuse anytime you want to generate an index for your alignment file and calculating some statistics, which is great. It will save you a lot of time.

Loops

This script will take one alignment file at the time, so in case you have a lot of files, remember you can use a for-loop. The main idea is to submit one job for each one of the alignment.

If you have several bam files, you can try the following (I had 11 bam files lying around):

for i in *bam; do qsub Path/to/runSamtoolsIndex.sh $i; done

When checking the queue, you should see something like:

queuename                      qtype resv/used/tot. np_load  arch          states
---------------------------------------------------------------------------------
[email protected]      BIP   0/11/16         0.00     lx-amd64
       274 0.55500 indexStats teacher3     r     11/02/2023 17:03:46     1
       275 0.55500 indexStats teacher3     r     11/02/2023 17:03:47     1
       278 0.55500 indexStats teacher3     r     11/02/2023 17:03:47     1
       281 0.55500 indexStats teacher3     r     11/02/2023 17:03:48     1
       284 0.55500 indexStats teacher3     r     11/02/2023 17:03:49     1
       287 0.55500 mapping    student24    r     11/02/2023 17:03:49     6
---------------------------------------------------------------------------------
[email protected]       BIP   0/16/16         0.00     lx-amd64
       273 0.55500 indexStats teacher3     r     11/02/2023 17:03:46     1
       276 0.55500 indexStats teacher3     r     11/02/2023 17:03:47     1
       279 0.55500 indexStats teacher3     r     11/02/2023 17:03:48     1
       282 0.55500 indexStats teacher3     r     11/02/2023 17:03:48     1
       285 0.55500 mapping    student24    r     11/02/2023 17:03:49     6
       288 0.55500 mapping    student24    r     11/02/2023 17:03:49     6
---------------------------------------------------------------------------------
[email protected]         BIP   0/2/16         0.00     lx-amd64
       283 0.55500 indexStats teacher3     r     11/02/2023 17:03:48     1
       286 0.55500 indexStats teacher3     r     11/02/2023 17:03:49     1

In this way you can make use of all the resources rather than overloading the login node.

Home: Analysis of next generation sequencing data (SC00024)

Developed by Sanna Abrahamsson and Marcela Dávila, 2023. Updated by Marcela Dávila, 2024