Using the ei-gpu partition on the Earlham Institute computing cluster
This note is meant for someone who is already somewhat familiar with the HPC, e.g. according to the notes here.
A subset of mathematical problems are parallelizable; software that solves them can run many-fold faster on graphics processing units (GPUs) than on a normal CPU. The EI high-performance compute cluster has nodes with GPUs on them, but because we have few such nodes, and because the number of GPU-based workflows is expected to increase dramatically over time as programs such as AlphaFold and Evo2 are adopted more widely, we must make sure that we utilize them well.
This means:
- we don't run CPU-only programs, or programs that do not have GPU-specific optimization, on the GPU, i.e. not all software is designed to run faster on a GPU (see here).
- in GPU runs, we make sure that resources are being utilized effectively:
- we don't request more resources than necessary, e.g. wrong GPU/CPU/memory combinations can lock up GPUs in an unused state (see here).
- we use commands like nvidia-smi that report GPU utilization, so that we don't realize, say, after 1 week of run time that the GPU was not being used all along (see here).
- we make sure to have turned on any GPU flags in the software we are using, as software programs may be designed to run only in the slower CPU-only mode by default (see here).
- we check we are getting expected partial outputs early on in the run, e.g. it is unpleasant to do a 1-week run and see all NaNs in the output at the end of the run.
- where we can avoid GPUs, we do avoid them
- e.g. if a program is 5x faster on 1 GPU than on 1 CPU, running it on 5 CPUs is likely to be about as fast as running it on a GPU, so it may be prudent to use 5 CPUs instead.
- some software, like ONT's nanopore basecaller, is already run using GPUs on the nanopore device during sequencing, so it is unnecessary to re-basecall nanopore data using GPUs on the EI HPC except in very specific scenarios.
A few additional considerations are:
- whether you have access to EI's GPUs (see here).
- what to do if your program needs access to the internet while running (see here).
- making sure you have the environment/flags needed to use GPUs (see here, here, and here)
You can also check out two pages from Research Computing about running jobs on GPUs.
How do I write a program from scratch that can run on EI's GPUs?
We do not cover how to write a GPU-enabled program from scratch using a programming language like C or Python here. What we do cover is the case where you have access to a software program that can run on a GPU, it has already been installed on the HPC, and you want to run it using a SLURM script. If the software has not been installed on the HPC, then please read this note anyway, as it has links on software installation on the HPC, and also talk to your data champion.
How do I write a SLURM script that can run on EI's GPUs?
Example of a GPU job
An integral part of running GPU jobs is ensuring that you are a 'good citizen'. Please see the subsection after this one to check whether you are one.
Here's an example of a GPU job that basically sleeps for 2 minutes, i.e. does nothing. I signal that this is a GPU job by asking for the ei-gpu partition and by specifying that we want 1 GPU using --gres=gpu:1. I also specify that I need 14 cores and 120G of memory, but are there guidelines for how much memory and/or CPU to request? Please see the next subsection.
#!/bin/bash
#SBATCH --mem=120G
#SBATCH -c 14
#SBATCH -p ei-gpu
#SBATCH --gres=gpu:1
#SBATCH -J test_120
#SBATCH --mail-type=END,FAIL
#SBATCH --time=00:03:00
sleep 120;
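To submit and monitor this job, assuming you have saved the script under a name of your choosing (gpu_sleep_test.sh below is just a placeholder), you would do something like:
# submit the script to SLURM
sbatch gpu_sleep_test.sh
# check the state of your jobs in the queue
squeue -u $USER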
Have I checked what resources are available before running the job? And have I checked if I am a 'good citizen'?
There are job configurations that can collectively 'lock' up a GPU and render it unusable by anybody.
Let's say a GPU node contains n GPUs, m CPUs, and q GB memory in total.
Let's say three people launch three jobs on that node that request 1 GPU each but m/3 CPUs and q/3 memory each.
So SLURM allocates a total of 3 GPUs, m CPUs, and q GB of memory to satisfy these three jobs.
The remaining resources on that node are thus n-3 GPUs (1 GPU on our 4-GPU nodes), 0 CPUs, and 0 GB of memory.
As one cannot start a GPU job without at least some CPUs and some memory,
what has transpired here is that one GPU has been locked up!
People actually do this on the cluster - see the stats on the node t1024n3 in the real-life example below.
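As a concrete, made-up illustration using the node sizes shown in the next subsection (4 GPUs, 56 CPUs, and 1024 GB of memory per node):
node total:        4 GPUs, 56 CPUs, 1024 GB
job 1 requests:    1 GPU,  19 CPUs,  341 GB
job 2 requests:    1 GPU,  19 CPUs,  341 GB
job 3 requests:    1 GPU,  18 CPUs,  342 GB
remaining:         1 GPU,   0 CPUs,    0 GB   (this last GPU is now unusable)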
So, how do we avoid this problem? Firstly, we have to make sure that we do not over-request resources, i.e. only request an adequate amount of GPUs, CPUs, and memory.
Let's quickly look at what nodes are available. Running the command sinfo, you can see four nodes t1024n1,...,t1024n4 in the ei-gpu partition. Examining a node in detail, e.g. using scontrol show node t1024n1, you can see that there is a total of 4 GPUs, 56 CPUs, and 1024 GB of memory in this node. You can also see what resources are being used by others right now, so that you can plan your job parameters.
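For example (these are standard SLURM commands; the CfgTRES and AllocTRES fields appear in the scontrol output shown in the next subsection), you can get a quick overview like this:
# list the nodes in the ei-gpu partition and their current state
sinfo -p ei-gpu
# show total (CfgTRES) versus allocated (AllocTRES) CPUs, memory, and GPUs for one node;
# what is free is the difference between the two
scontrol show node t1024n1 | grep -E 'CfgTRES|AllocTRES'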
Let's see these ideas in action.
A real-life example
At the time of this writing, there are four jobs running on the ei-gpu partition (job names and usernames have been scrambled for privacy).
[thiyagar@sub04 ~]$ squeue | grep ei-gpu
11111193 ei-gpu nn_zzzz bbb23qqq R 1-11:18:29 1 t1024n4
11215125 ei-gpu nnn_yyyy rrrrssss R 15:36:13 1 t1024n3
11215126 ei-gpu bbb_zzzz ttttuuuu R 5:46:33 1 t1024n4
11313341 ei-gpu rrr_ssss aaaavvvv R 6-11:33:08 1 t1024n3
So two nodes are free and two nodes are being utilized. Let's take a closer look.
[thiyagar@sub04 ~]$ scontrol show node t1024n4
NodeName=t1024n4 Arch=x86_64 CoresPerSocket=28
CPUAlloc=48 CPUEfctv=56 CPUTot=56 CPULoad=60.88
AvailableFeatures=xeon,nht,sse4,avx2,avx512
ActiveFeatures=xeon,nht,sse4,avx2,avx512
Gres=gpu:a100:4,ssd:3576
NodeAddr=t1024n4 NodeHostName=t1024n4 Version=23.02.7
OS=Linux 5.14.0-503.34.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 27 06:00:50 EDT 2025
RealMemory=1025000 AllocMem=745760 FreeMem=573511 Sockets=2 Boards=1
State=MIXED+CLOUD+REBOOT_REQUESTED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=ei-gpu
BootTime=2025-03-31T01:50:28 SlurmdStartTime=2025-04-02T10:23:45
LastBusyTime=2025-04-02T10:17:02 ResumeAfterTime=None
CfgTRES=cpu=56,mem=1025000M,billing=56,gres/gpu=4
AllocTRES=cpu=48,mem=745760M,gres/gpu=1
CapWatts=n/a
CurrentWatts=844 AveWatts=670
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[thiyagar@sub04 ~]$ scontrol show node t1024n3
NodeName=t1024n3 Arch=x86_64 CoresPerSocket=28
CPUAlloc=56 CPUEfctv=56 CPUTot=56 CPULoad=56.01
AvailableFeatures=xeon,nht,sse4,avx2,avx512
ActiveFeatures=xeon,nht,sse4,avx2,avx512
Gres=gpu:a100:4,ssd:3576
NodeAddr=t1024n3 NodeHostName=t1024n3 Version=23.02.7
OS=Linux 5.14.0-503.34.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Mar 27 06:00:50 EDT 2025
RealMemory=1025000 AllocMem=1000000 FreeMem=727000 Sockets=2 Boards=1
State=ALLOCATED+CLOUD+REBOOT_REQUESTED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=ei-gpu
BootTime=2025-03-29T05:27:27 SlurmdStartTime=2025-04-02T10:23:42
LastBusyTime=2025-04-02T10:17:02 ResumeAfterTime=None
CfgTRES=cpu=56,mem=1025000M,billing=56,gres/gpu=4
AllocTRES=cpu=56,mem=1000000M
CapWatts=n/a
CurrentWatts=846 AveWatts=593
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Upon examining the AllocTRES lines, we see that:
- on t1024n4, 1 GPU, 48 CPUs, and 745760M of memory are allocated.
- on t1024n3, 0 GPUs, 56 CPUs, and 1000000M of memory are allocated.
The four GPUs on t1024n3 are basically locked up and cannot be used by anyone, as that node has a maximum of 56 CPUs and all 56 are in use right now (remember that one cannot run a GPU job with 0 CPUs).
Is this really what the job authors want? It is hard to say. Luckily, there were no other jobs requesting GPUs in the queue, but some of these are multi-day jobs and might prevent jobs that arrive in the future from starting.
So what do I do in the real-life example above if I want to run a job right now?
You could use the two remaining free nodes, but for the sake of this exercise, let's say they are also occupied.
How do I choose job parameters to ensure my job gets started?
On t1024n3, there's just no chance: there are GPUs available, but they are locked up as no CPUs are available. On t1024n4, the scontrol invocation above tells me that there is 1025000-745760=279240M of memory available, and we know 3 GPUs and 8 CPUs are available. So the following script test_1.sh should run:
#!/bin/bash
#SBATCH --mem=100G
#SBATCH -c 2
#SBATCH -p ei-gpu
#SBATCH --gres=gpu:1
#SBATCH -J test_120
#SBATCH --nodelist=t1024n4
#SBATCH --mail-type=END,FAIL
#SBATCH --time=00:03:00
sleep 120;
But I will not be able to start three such jobs simultaneously, as the total memory I would need (300G) would exceed the available ~279G. Sure enough:
[thiyagar@sub04 ~]$ sbatch test_1.sh
Submitted batch job 12217123
[thiyagar@sub04 ~]$ sbatch test_1.sh
Submitted batch job 12217124
[thiyagar@sub04 ~]$ sbatch test_1.sh
Submitted batch job 12217125
[thiyagar@sub04 ~]$ squeue | grep thiyagar
12217125 ei-gpu test_120 thiyagar PD 0:00 1 (Resources)
12217124 ei-gpu test_120 thiyagar R 0:06 1 t1024n4
12217123 ei-gpu test_120 thiyagar R 0:10 1 t1024n4
12209350 ei-intera interact thiyagar R 7:11:08 1 e512n49
In summary, please request resources mindfully, and tailor your job to whatever resources are available if you want it to start right away.
Do I have access to the ei-gpu partition?
To use GPUs on the HPC, you need to be part of a group of people who can run jobs on the ei-gpu partition. If you are not in this group, please talk to your data champion or line manager; together you can contact Research Computing and ask for you to be added to this group. If you do not know whether you are a part of this group, please talk to your data champion.
Is the program I am interested in running optimized for GPUs?
Not all programs can utilize the potential speedup offered by a GPU. So you should check if the program you are interested in has a GPU-capable mode by reading its documentation and/or exploring its command-line options.
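A rough first check, sketched below with a made-up program name (the keywords to search for will vary by program), is to look for GPU-related options in the program's help text:
# replace some_program with your actual program; look for GPU/CUDA-related options
some_program --help 2>&1 | grep -i -E 'gpu|cuda|device'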
Have I set the flags/input options that enable the GPU mode in this program?
Software that can use GPU acceleration may not have the GPU mode turned on by default, and may not automatically sense a GPU and turn it on. You may have to do this manually, either on the command line or otherwise. Please read the documentation of your specific program and make sure you have turned these options on.
Conversely, some software may always have the GPU mode on by default and may get confused if you try to run it in a CPU-only mode (if you choose to do this). In this case, you will have to set some options to turn the GPU mode off.
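As a purely hypothetical illustration (the program name and the --device flag below are made up; your program's real options will differ, so check its documentation):
# hypothetical: explicitly ask the program to use the GPU
some_program --device cuda --input my_data.txt --output results.txt
# hypothetical: force the same program to run in CPU-only mode
some_program --device cpu --input my_data.txt --output results.txt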
Have I checked that I am using singularity exec --nv?
Normally, we run software programs on the HPC through a software container program called singularity.
This program is run behind the scenes every time you use a software package installed by Research Computing that you load using source package. Whether it is you installing the package or Research Computing doing it, please make sure that the invocation singularity exec --nv is set in the software definition file or used when you run the software (--nv stands for NVIDIA support). Please note that this flag will not cause problems if you run the software you are interested in in a CPU-only mode. If Research Computing installed the package, it is likely that they would have set the --nv flag (if they were informed it is a GPU-enabled package). If you do not know what I am talking about here, please consult our notes on software installation here and/or talk to your data champion.
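For reference, a direct invocation looks like the sketch below (the container image path is hypothetical; source package normally hides this step from you):
# --nv makes the node's NVIDIA driver and GPU devices visible inside the container
singularity exec --nv /path/to/hypothetical_gpu_container.img nvidia-smi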
Do I have the environment (e.g. CUDA/hardware) needed for the GPU program?
GPU-optimized programs are usually written to use the NVIDIA CUDA software to interact with the GPU. So an additional complexity is that a particular software program may require a specific CUDA version, and this may not be the version of CUDA that is running on our GPU nodes. As far as I know, the CUDA version on the EI HPC's GPUs cannot be changed except by Research Computing. If you feel you may run into this problem, or you don't really know what I am talking about here, please discuss with your data champion.
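A quick way to see what is available (both commands below are standard, though nvcc is only present if a CUDA toolkit is in your software environment):
# on a GPU node: the header of this report shows the driver version and the
# highest CUDA version that driver supports
nvidia-smi
# if a CUDA toolkit is available in your environment: report its version
nvcc --version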
Another complexity is GPU hardware: some software programs may require a specific kind of GPU to run. An example is the software program Evo2; as of a few weeks before the time of writing, this program required a GPU called the H100, which EI does not have, so it cannot be run on our cluster.
If this program needs access to the internet, how can I run it on the EI HPC, which cannot access the internet?
If your program needs to download some data from the internet during its execution, and offers options so that you can pre-download this data, then you are fine: download the data by some means (talk to your data champion if you don't know how to do this), save it in your project space, and point the program to that location during execution.
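As a hypothetical sketch of this pattern (the URL, file names, and the --model-dir option below are made up for illustration):
# on a machine with internet access: pre-download the data the program needs
wget -P /my/project/space/models https://example.org/hypothetical_model.tar.gz
tar -xzf /my/project/space/models/hypothetical_model.tar.gz -C /my/project/space/models
# on the HPC: point the program at the pre-downloaded location
some_program --model-dir /my/project/space/models --input my_data.fastq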
If your program cannot do this, and absolutely requires access to the internet during execution, then please talk to your data champion who may be able to help you.
Do I have a way to check if my program is actually using the GPU, and how efficiently it is using it?
Below, I start an 'interactive' GPU job, run a trial of the program dorado (which I know utilizes GPUs) in the background, and check that it is utilizing the GPU using the nvidia-smi command. You can use a similar trial workflow with your program to make sure that the GPU is being utilized, before running the full workflow using sbatch.
[thiyagar@sub04 ~]$ srun -p ei-gpu --gres=gpu:1 --time=01:00:00 --mem=100G bash
# load dorado - this is ONT's basecaller that can use GPUs
source package 44726663-0cc1-4aab-8f95-18a59543f5c9;
# set inputs and outputs
dorado_dna_model_folder_simplex_hac=/some/folder/[email protected]
pod5_folder=/some/folder/of/pod5/files
output_folder=/some/folder/trial_xxxxxx
# run dorado
dorado \
basecaller \
"$dorado_dna_model_folder_simplex_hac" \
"$pod5_folder" \
--recursive \
> "$output_folder"/brdu_calls.bam &
# this command monitors GPU usage
nvidia-smi
# output of the command above
Tue Apr 8 13:13:56 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:65:00.0 Off | 0 |
| N/A 54C P0 296W / 300W | 22899MiB / 81920MiB | 99% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2370683 C ...e/dorado-0.3.4-linux-x64/bin/dorado 22890MiB |
+-----------------------------------------------------------------------------------------+
As we can see from the output above, the dorado process is running on the GPU, and GPU utilization is at 99%.
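If you want to watch utilization over time rather than take a single snapshot, you can refresh the report periodically (both commands below are standard):
# re-run nvidia-smi every 5 seconds (stop with Ctrl+C)
watch -n 5 nvidia-smi
# or use nvidia-smi's built-in loop mode
nvidia-smi -l 5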
After you've done this trial workflow, hit Ctrl+C (this works in a Mac terminal too) to close this 'interactive' job and free up the GPU.