4. Working in the HPC environment

Questions:

  • How do I run tasks on the HPC without disrupting other users?

  • How do I request resources and submit jobs through Slurm?

Objectives:

  • Start an interactive session on a compute node.

  • Write and submit a batch script with 'sbatch', specifying the resources a job needs.

  • Use 'sinfo' to inspect the partitions available on the cluster.

Keypoints:

  • Never run resource-intensive or long-running tasks on the submission node.

  • Jobs are submitted with 'sbatch'; options can be given as '#SBATCH' directives in the script or on the command line.

  • Partitions (queues) are subsets of nodes with their own resource allocations and usage policies.

When you log into the High-Performance Computing (HPC) system, a shell is opened for you on a submission node. These nodes serve as the hub for scheduling jobs for all users on the HPC. It's crucial to avoid running resource-intensive or long-running tasks directly on the submission node, as doing so can severely disrupt other users' work, or even their ability to access the HPC environment at all. Hence, it's recommended not to run anything that takes longer than a few seconds on the submission node, to maintain efficient usage for all users.

EI's preferred approach to avoid overloading the submission node is to start an interactive session. This opens a shell directly on one of the main compute nodes. By default, an interactive session requests modest resources: 2GB of RAM and a single CPU core. However, users can adjust resource allocations to suit their requirements, for example by using the '--mem' option to request more memory. It's important to note that interactive sessions are transient: any processes running within them will be terminated when you log out of the session. Therefore, it's advisable to complete tasks promptly within interactive sessions.
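For example, you might request more memory and an extra core when starting the session. The exact command varies between clusters; the sketch below assumes the generic Slurm approach of launching an interactive shell with 'srun' and its '--pty' option, so check the local documentation for any site-specific wrapper:

srun --mem=8G -c 2 --pty /bin/bash    # interactive shell with 8GB of RAM and 2 cores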

For more efficient (and persistent) use of the HPC resources, users can interact with Slurm directly. To do this, users create a script outlining their job requirements and instructions, then submit it to Slurm using the 'sbatch' command followed by the name of the script (e.g., 'sbatch slurm_batch.sh'). This method allows users to define the exact resources their job needs and ensures optimal scheduling and execution within the HPC environment.
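As an illustration, a minimal 'slurm_batch.sh' might look like the sketch below. The partition name and the final command are placeholders for your own choices; the '#SBATCH' lines are directives that Slurm reads from the top of the script:

#!/bin/bash
#SBATCH -p ei-short        # partition to run on (placeholder name)
#SBATCH --mem=4G           # memory required
#SBATCH -c 1               # CPU cores per task
#SBATCH -t 0-01:00         # time limit (one hour)
#SBATCH -o my_job.%j.out   # standard output file (%j is the job ID)

# The commands to run go below the directives
my_analysis_command input.txt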

You can submit this script to Slurm for resource provisioning by typing:

sbatch slurm_batch.sh

When submitting jobs via Slurm, users can specify various options to tailor the execution of their jobs to their requirements. These options can be given as '#SBATCH' directives in the submitted script or on the command line when 'sbatch' is invoked (see the example after this list). They include:

-N, --nodes: the number of nodes required

-n, --ntasks: the number of tasks to run

--mem: the amount of memory required

-c, --cpus-per-task: the number of CPU cores per task (for multithreaded code)

-t, --time: a limit on the total runtime of the job

--array=m-M: submit a job array with indices m to M

-o, --output: specify a file for standard output

-e, --error: specify a file for standard error output

-p, --partition: specify the partition the job should be executed on
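As noted above, the same options can also be passed to 'sbatch' on the command line, where they override any matching '#SBATCH' directives in the script. The values below are purely illustrative:

sbatch -p ei-short --mem=8G -c 4 -t 0-02:00 -o job_%j.out -e job_%j.err slurm_batch.sh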

Partitions, also referred to as queues, are subsets of the total nodes available on the HPC cluster. These partitions are configured with specific resource allocations and usage policies. Users can examine the partitions using the 'sinfo' command, which provides detailed information about each one, including its current status, available resources, and any associated restrictions. Understanding the characteristics of each partition enables users to make informed decisions when submitting jobs, ensuring they are allocated the appropriate resources for their tasks.

(image of sinfo screen)
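As a quick illustration, running 'sinfo' with no arguments summarises every partition, and the '-p' option restricts the output to a single partition (the name below is a placeholder):

sinfo                 # summary of all partitions
sinfo -p ei-short     # details for one partition only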

In addition to these basic options, Slurm provides advanced functionality for specific use cases. For example, users can define job dependencies to ensure that one job completes successfully before another begins, create job arrays to execute a series of similar tasks efficiently, and parallelize jobs using the Message Passing Interface (MPI) for enhanced performance. Implementing these in full goes beyond the scope of this episode (a brief sketch is shown below), but the official Slurm documentation (or the RC documentation) should be helpful.
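For a flavour of the syntax, the sketches below submit a ten-element job array and a job that only starts once another job has completed successfully; the job ID and script names are placeholders:

sbatch --array=1-10 slurm_batch.sh              # job array with indices 1 to 10
sbatch --dependency=afterok:12345 next_step.sh  # start only after job 12345 succeeds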

The 'srun' command is used to execute an application within a job allocation obtained from Slurm. It can be particularly useful when users need to run multiple tasks in parallel within a single job script submitted via 'sbatch'. By incorporating 'srun' commands into the job script, users can control the execution of individual tasks within the allocated resources effectively.

#!/bin/bash
#SBATCH -p tgac-short
#SBATCH -n 3          # Allocate 3 tasks

# Launch each task in the background so they run in parallel within the allocation
srun -n 1 task1 &
srun -n 1 task2 &
srun -n 1 task3 &

# The wait is important to ensure all the tasks finish
wait

Slurm also offers the '--wrap' option, allowing users to submit simple commands or one-liner scripts directly, without the need for a separate script file. This feature streamlines the job submission process for quick, ad-hoc tasks that do not require extensive setup or configuration:

sbatch -p ei-short --wrap="sleep 60"

Most nodes within the HPC cluster are accessible from multiple partitions. This flexibility allows users to choose the most appropriate partition based on their specific requirements and workload characteristics. However, it's worth noting that certain specialized nodes, such as those with large-memory configurations, may be exclusive to a single partition due to their unique hardware specifications.
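One way to see which partitions a given node belongs to is to ask 'sinfo' for a node-oriented listing:

sinfo -N    # one row per node and partition, showing where each node appears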