# SLURM
Lesson learned: do not run a Slurm job script directly!
- Submit it with `sbatch the_script.sh` instead.
- Running it directly makes the `srun` calls inside fail with: `srun: error: Job request does not match any supported policy.`
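A minimal illustration of the difference (`my_job.sh` is a placeholder job script containing `srun` commands):

```bash
# WRONG: runs the script on the login node; the srun calls inside it
# have no allocation and fail with the policy error above.
./my_job.sh

# RIGHT: hands the script to Slurm, which allocates resources first.
sbatch my_job.sh
```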
Let me break down these Slurm (Simple Linux Utility for Resource Management) concepts in simple terms, using an HPC cluster analogy:
- What it is: A job scheduler/resource manager for supercomputers and HPC clusters (like Perlmutter at NERSC).
- Analogy: Think of Slurm as the "air traffic controller" for a cluster:
  - Manages who gets compute resources (GPUs, CPUs)
  - Decides when/where jobs run
  - Enforces fair resource sharing
## sbatch

- Purpose: Submit a script to run later (non-interactive).
- Example:
  ```bash
  sbatch my_job.sh   # Submit a job script to the queue
  ```
- Key features:
  - Used for production runs (e.g., training LLMs for hours/days)
  - Slurm schedules it when resources become available
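A minimal sketch of a script you might hand to `sbatch` (the job name, time limit, and resource values are placeholders; adjust them to your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=demo          # Name shown in squeue
#SBATCH --output=slurm-%j.out    # Stdout/stderr file (%j expands to the job ID)
#SBATCH --time=00:30:00          # Wall-clock limit (HH:MM:SS)
#SBATCH --nodes=1                # One compute node
#SBATCH --gpus-per-node=1        # One GPU (drop this on CPU-only clusters)

srun python train.py             # Launch the task inside the allocation
```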
## srun

- Purpose: Launch tasks interactively (for testing/debugging).
- Example:
  ```bash
  srun --nodes=2 --gpus-per-node=4 python train.py   # Runs immediately if resources are free
  ```
- Key features:
  - Blocks until resources are allocated
  - Great for quick experiments (but don’t abuse it!)
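A common interactive-debugging pattern, assuming your site allows interactive jobs, is to request a pseudo-terminal shell on a compute node:

```bash
# Request one GPU for 30 minutes and open an interactive shell on the node.
srun --nodes=1 --gpus=1 --time=00:30:00 --pty bash

# Once the shell opens you are on a compute node and can run/debug directly:
python test.py
```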
## sinfo

- Purpose: Check available partitions (queues) and node status.
  ```bash
  sinfo   # Shows partitions (e.g., "gpu", "debug") and node availability
  ```
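A few commonly used variants (partition names like `gpu` are placeholders for whatever your cluster defines):

```bash
sinfo -s       # Summarized view: one line per partition
sinfo -p gpu   # Show only the "gpu" partition
sinfo -N -l    # Node-oriented long listing (state, CPUs, memory per node)
```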
## squeue

- Purpose: List pending/running jobs.
  ```bash
  squeue -u $USER   # See your jobs
  ```
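Some useful follow-ups once a job is queued (replace `<JOBID>` with your job's ID):

```bash
squeue -u $USER --start    # Show Slurm's estimated start times for pending jobs
squeue -j <JOBID>          # Status of one specific job
scontrol show job <JOBID>  # Full details: requested resources, reason for pending, etc.
```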
## Typical workflow

| Step | Command | Analogy |
|---|---|---|
| 1. Prepare | Write `my_job.sh` | Write a recipe |
| 2. Submit | `sbatch my_job.sh` | Give recipe to a chef (Slurm) |
| 3. Wait | Slurm queues job | Chef waits for kitchen space |
| 4. Execute | Slurm runs `srun` tasks | Chef cooks your dish |
| 5. Monitor | `squeue` | Check kitchen progress |
## Example job script

```bash
#!/bin/bash
#SBATCH --nodes=2              # Request 2 compute nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node
#SBATCH --ntasks-per-node=4    # 4 tasks (MPI processes) per node
#SBATCH --cpus-per-task=32     # 32 CPU cores per task

srun python train.py           # Launches 8 tasks (2 nodes × 4 tasks/node)
```
| Parameter | Purpose | Maps to... |
|---|---|---|
| `--nodes` | Number of machines | Physical servers in the cluster |
| `--gpus-per-node` | GPUs per machine | NVIDIA A100/H100 GPUs |
| `--ntasks-per-node` | Processes per node | MPI ranks or parallel workers |
| `--cpus-per-task` | CPU cores per process | OpenMP threads or data loaders |
| Component | Slurm Parameter | veRL/Code Usage |
|---|---|---|
| MPI processes | `--ntasks` | Distributed training (DDP in PyTorch) |
| OpenMP threads | `--cpus-per-task` + `export OMP_NUM_THREADS` | CPU parallelism (data loading) |
| CUDA | `--gpus-per-node` | GPU acceleration (PyTorch/CUDA kernels) |
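One way to tie `--cpus-per-task` to OpenMP, as the table suggests, is a batch-script excerpt like this sketch (it relies on the `SLURM_CPUS_PER_TASK` environment variable that Slurm sets inside an allocation):

```bash
#SBATCH --cpus-per-task=32

# Give each task's OpenMP runtime (and PyTorch's intra-op thread pool)
# exactly the cores Slurm allocated to it.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun python train.py
```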
## When to use sbatch vs. srun

| Scenario | Command | Why |
|---|---|---|
| Long training job | `sbatch` | Scheduled execution, logs saved |
| Debugging code | `srun --pty` (interactive) | Immediate feedback |
| Testing multi-node | `srun --nodes=2` | Quick resource check |
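Concrete examples of the three rows above (script and file names are placeholders):

```bash
# Long training job: submit and let Slurm schedule it.
sbatch train_llm.sh

# Debugging: open an interactive shell on a compute node.
srun --gpus=1 --time=00:30:00 --pty bash

# Quick multi-node sanity check: print the hostname of each allocated node.
srun --nodes=2 --ntasks-per-node=1 hostname
```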
## Example: running veRL with Slurm

1. Write a job script (`my_job.sh`):

   ```bash
   #!/bin/bash
   #SBATCH --nodes=2
   #SBATCH --gpus-per-node=4
   #SBATCH --time=01:00:00

   srun python3 -m verl.trainer.main_ppo   # Your veRL training command
   ```

2. Submit it:

   ```bash
   sbatch my_job.sh   # Job enters the queue
   ```

3. Monitor:

   ```bash
   squeue -u $USER   # Check job status
   ```

4. Cancel if needed:

   ```bash
   scancel <JOBID>   # Terminate a running job
   ```
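Two more commands for following a running job, as a sketch: by default `sbatch` writes stdout/stderr to `slurm-<JOBID>.out` in the submission directory, and `sacct` only works if the site has job accounting enabled.

```bash
tail -f slurm-<JOBID>.out                       # Follow the job's stdout/stderr live
sacct -j <JOBID> --format=JobID,State,Elapsed   # Accounting summary during/after the run
```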
## Key takeaways

- `sbatch`: For production jobs (submit and forget).
- `srun`: For interactive testing (requires immediate resources).
- Parameters:
  - `--nodes`, `--gpus-per-node` define hardware
  - `--ntasks`, `--cpus-per-task` define software parallelism
For hands-on practice, start with small jobs (e.g., `srun --gpus=1 --time=10:00 python test.py`) before scaling up.
Q: Can `sbatch` call `srun` inside the job script? Yes: that is the standard pattern (the example scripts above do exactly this). Each `srun` inside an `sbatch` script launches a job step on the nodes already allocated to the job.
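A small sketch of that pattern with multiple job steps in one allocation (the Python file names are illustrative placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:20:00

srun hostname               # Job step 1: one 'hostname' per task (8 total)
srun python preprocess.py   # Job step 2: starts after step 1 finishes
srun python train.py        # Job step 3: reuses the same allocation
```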