NGS III: Computer clusters - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Analysis-of-next-generation-sequencing-data (SC00024)
The purpose of this exercise for you to become familiar with how to use a computer cluster for heavy bioinformatics processes.
A note on computer clusters
Bioinformatics analyses can take long time (and memory) due to the big amount of data we handle. To deal with this it is quite common to use a computer cluster
, i.e. a set of connected computers (nodes) that work together as if they are a single (much more powerful) machine.
Until now, we have been running our jobs (all the commands for the QC and Mapping practicals) in the login node
, as they do not take so many resources. However, this is never advised when you work in a cluster. Remember that a cluster is used by several users at the same time and we have experienced how slow the server can get! So as a best practice always use the computer nodes.
- Copy (or make a soft link) of
/home/courses/Cluster/MySample.bam
Nodes and cores
- List all available nodes using
$ qstat -u '*' -f
Q. How many nodes are there and how many cores does each node have?
Scheduling system
Our cluster uses SGE as job schedule system, and we just need to create a script with some instructions (or commands) and submit it to the queue.
- Create a script called
runIndexStats.sh
with the following information:
#!/bin/bash -l
#$ -N indexStats
#$ -j y
#$ -cwd
#$ -pe mpi 1
#$ -q all.q
# load the required modules
module load samtools/1.17
# print the command
echo "-------------
cmd: samtools index $1
-------------
"
# Run the command, $1 means the first user input
time samtools index $1
# printing and running the command
echo "-------------
cmd: samtools flagstat $1 > $1.flagstat
-------------
"
time samtools flagstat $1 > $1.flagstat
# print end message
echo "--------------- Fin ---------------"
Q. What does
-N
,-j
,-cwd
,-pe
and-q
mean? What is the script doing?
- Queue your script using:
$ qsub Path/to/runIndexStats.sh Path/to/bamfile
- Check your job with
qstat -u '*' -f
, you should see something like:
queuename qtype resv/used/tot. np_load arch states
---------------------------------------------------------------------------------
[email protected] BIP 0/6/16 0.00 lx-amd64
132 0.55500 indexStats teacher3 r 10/24/2023 11:38:41 1
---------------------------------------------------------------------------------
[email protected] BIP 0/6/16 0.00 lx-amd64
273 0.55500 mapping student24 r 11/02/2023 17:03:46 1
---------------------------------------------------------------------------------
[email protected] BIP 0/2/16 0.00 lx-amd64
Once the script is done, check the standard error/output stream of the job: scriptName.ojobid
. If everything went well, you should see:
-------------
cmd: samtools index chr21_dna.bowtie.sorted.bam
-------------
real 0m1.041s
user 0m0.966s
sys 0m0.025s
-------------
cmd: samtools flagstat chr21_dna.bowtie.sorted.bam > chr21_dna.bowtie.sorted.bam.flagstat
-------------
real 0m1.072s
user 0m0.957s
sys 0m0.026s
--------------- Fin ---------------
Things can always go wrong, in case you need to interrupt the job, you can use qdel jobid
.
- Add the following line to your script, after loading the module:
sleep 10000;
- Submit your job
- Display the list of jobs
After a while the script will still be running (it will sleep for around 3 hours!).
- Use
qdel
to terminate the job - Confirm your job is not running anymore
You now have a bash script that you can reuse anytime you want to generate an index for your alignment file and calculating some statistics, which is great. It will save you a lot of time.
Loops
This script will take one alignment file at the time, so in case you have a lot of files, remember you can use a for-loop
. The main idea is to submit one job for each one of the alignment.
If you have several bam
files, you can try the following (I had 11 bam
files lying around):
for i in *bam; do qsub Path/to/runSamtoolsIndex.sh $i; done
When checking the queue, you should see something like:
queuename qtype resv/used/tot. np_load arch states
---------------------------------------------------------------------------------
[email protected] BIP 0/11/16 0.00 lx-amd64
274 0.55500 indexStats teacher3 r 11/02/2023 17:03:46 1
275 0.55500 indexStats teacher3 r 11/02/2023 17:03:47 1
278 0.55500 indexStats teacher3 r 11/02/2023 17:03:47 1
281 0.55500 indexStats teacher3 r 11/02/2023 17:03:48 1
284 0.55500 indexStats teacher3 r 11/02/2023 17:03:49 1
287 0.55500 mapping student24 r 11/02/2023 17:03:49 6
---------------------------------------------------------------------------------
[email protected] BIP 0/16/16 0.00 lx-amd64
273 0.55500 indexStats teacher3 r 11/02/2023 17:03:46 1
276 0.55500 indexStats teacher3 r 11/02/2023 17:03:47 1
279 0.55500 indexStats teacher3 r 11/02/2023 17:03:48 1
282 0.55500 indexStats teacher3 r 11/02/2023 17:03:48 1
285 0.55500 mapping student24 r 11/02/2023 17:03:49 6
288 0.55500 mapping student24 r 11/02/2023 17:03:49 6
---------------------------------------------------------------------------------
[email protected] BIP 0/2/16 0.00 lx-amd64
283 0.55500 indexStats teacher3 r 11/02/2023 17:03:48 1
286 0.55500 indexStats teacher3 r 11/02/2023 17:03:49 1
In this way you can make use of all the resources rather than overloading the login node.
Analysis of next generation sequencing data (SC00024)
Home:Developed by Sanna Abrahamsson and Marcela Dávila, 2023. Updated by Marcela Dávila, 2024