Quick Start Guide - tum-t38/firefly GitHub Wiki
The T38 Computing Cluster uses the module system to manage multiple program versions and dependencies.
```
module avail           # Show available modules.
module list            # List loaded modules.
module [un]load <name> # Load (unload) a module.
module spider          # Show all modules (including hidden modules).
module spider <name>   # Search for a module and show its dependencies.
```
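The commands above combine into a typical workflow: search for a module, load it, then confirm it is loaded. `gcc` is only an illustrative module name here; run `module avail` to see what is actually installed on Firefly.

```shell
module spider gcc   # Show available gcc versions and their dependencies.
module load gcc     # Load the default gcc version.
module list         # Confirm that gcc now appears among the loaded modules.
```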
The T38 Computing Cluster uses the SLURM queuing system to schedule jobs. A detailed explanation of all SLURM commands can be found in the man pages.
This page summarizes the most commonly used commands and provides some examples.
| Command | Usage | Description |
| --- | --- | --- |
| sbatch | sbatch <script> | Submit a script to the queue. |
| sinfo | sinfo | Display the status of compute nodes. |
| sq | sq | List the status of your jobs (Firefly specific). |
| squeue | squeue | List the status of all jobs. |
| scancel | scancel <jobid> | Cancel a job. |
| sinteract | sinteract | Start an interactive job. |
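A typical job lifecycle using these commands might look as follows; `submit.sh` stands in for your own submission script, and the job ID shown is only a placeholder for the ID that SLURM prints at submission.

```shell
sbatch submit.sh   # Submit the script; SLURM prints the assigned job ID.
sq                 # Check the status of your own jobs.
scancel <jobid>    # Cancel the job using the ID printed by sbatch, if needed.
```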
Run t38-slurm-example to generate an example submission script.
Jobs are submitted to SLURM with the sbatch command along with a submission script that is used to start your program. The submission script is commonly (although not required to be) a Bash script. The script generally contains SLURM options that control the behavior of SLURM and determine which resources are allocated to your job. Options provided on the sbatch command line override options within the script.
A typical script may start with the following lines:
```bash
#!/bin/bash
#SBATCH --job-name=JOB_NAME          # Job name will be JOB_NAME
#SBATCH --time=1-23:59:59            # Allocate resources for 1 day, 23 hours, 59 minutes, and 59 seconds
#SBATCH --nodes=1                    # Allocate 1 node
#SBATCH --ntasks=4                   # Allocate resources for 4 MPI tasks
#SBATCH --gres=gpu:1                 # Allocate 1 GPU per node
#SBATCH --mail-user=<YOUR_TUM_EMAIL> # Send SLURM job status updates to the given address
#SBATCH --mail-type=ALL              # Send all kinds of updates
```
The default time limit is 3 minutes, and there is no upper limit. Please request a realistic time for your job: the efficient scheduling of other jobs depends on each job running for approximately the requested time. Abuse of this flexibility will be noted.
By default, each job is allocated 4 GB of memory per node. To change this, specify the --mem option. The default unit is megabytes; other units can be specified with the suffixes [K|M|G|T].
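For example, a job needing more memory could add the following line to its submission script (the 16 GB value is only illustrative):

```shell
#SBATCH --mem=16G   # Request 16 GB of memory per node instead of the 4 GB default.
```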
GPUs or other hardware may fail or suffer problems during a run, forcing a job to end prematurely. Please prepare your workflow to be able to recover from such interruptions. Create restart files and write scripts that can easily start from where a previous job stopped.
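A restart-aware job step can be sketched as below; the program name and checkpoint file (`my_program`, `state.chk`) are hypothetical placeholders for whatever your own workflow produces.

```shell
#!/bin/bash
# Resume from a checkpoint file if one exists; otherwise start fresh.
# "my_program" and "state.chk" are hypothetical placeholders.
CHECKPOINT="state.chk"

if [ -f "$CHECKPOINT" ]; then
    echo "Resuming from $CHECKPOINT"
    # my_program --restart "$CHECKPOINT"
else
    echo "Starting fresh run"
    # my_program --input input.dat
fi
```

If the job hits its time limit or a node fails, simply resubmitting the same script continues from the last checkpoint.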
Occasionally, nodes or the entire cluster will have to be shut down for maintenance or in response to an incident. As far as possible, advance warning will be given via e-mail, but sometimes this is not possible.