SLURM
Lesson learned: do not run a Slurm job script directly!
- You must submit it with sbatch, e.g. `sbatch my_job.sh`.
- Otherwise you get: `srun: error: Job request does not match any supported policy.`
A breakdown of the core Slurm (Simple Linux Utility for Resource Management) concepts in simple terms, illustrated with analogies:
- What it is: A job scheduler/resource manager for supercomputers and HPC clusters (like Perlmutter at NERSC).
- Analogy: Think of Slurm as the "air traffic controller" for a cluster:
  - Manages who gets compute resources (GPUs, CPUs)
  - Decides when/where jobs run
  - Enforces fair resource sharing
- Purpose of `sbatch`: Submit a script to run later (non-interactive).
- Example:

  ```bash
  sbatch my_job.sh  # Submit a job script to the queue
  ```

- Key features:
  - Used for production runs (e.g., training LLMs for hours/days)
  - Slurm schedules the job when resources become available (a fuller script sketch follows below)
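As a reference, a slightly fuller batch script might look like the sketch below; the job name, the partition name `debug`, and the output filename pattern are assumptions to adapt for your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=smoke-test   # Name shown in squeue (assumed)
#SBATCH --partition=debug       # Queue/partition to use (assumed name; list them with sinfo)
#SBATCH --nodes=1               # One compute node
#SBATCH --time=00:10:00         # Wall-clock limit (HH:MM:SS)
#SBATCH --output=%x-%j.out      # Log file named <job-name>-<job-id>.out

srun hostname                   # Each task prints the node it landed on
```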
- Purpose of `srun`: Launch tasks interactively (for testing/debugging).
- Example:

  ```bash
  srun --nodes=2 --gpus-per-node=4 python train.py  # Runs immediately if resources are free
  ```

- Key features:
  - Blocks until resources are allocated
  - Great for quick experiments (but don't abuse it!); an interactive-shell example follows below
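For interactive debugging, one common pattern (assuming your cluster allows interactive allocations through `srun`) is to request a pseudo-terminal shell on a compute node and run commands from there:

```bash
# Grab one GPU on one node for 30 minutes and open a shell on it
srun --nodes=1 --gpus=1 --time=00:30:00 --pty bash

# Then, inside that shell on the compute node:
python test.py
```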
- Purpose of `sinfo`: Check available partitions (queues) and node status.

```bash
sinfo  # Shows partitions (e.g., "gpu", "debug") and node availability
```
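`sinfo` can also be filtered or expanded; the partition name `gpu` below is an assumption, so check the plain `sinfo` output for the names on your cluster:

```bash
sinfo -p gpu   # Show only the "gpu" partition (assumed name)
sinfo -N -l    # Node-oriented, long format: state, CPUs, memory per node
```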
- Purpose of `squeue`: List pending/running jobs.

```bash
squeue -u $USER  # See your jobs
```
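Two handy variations (standard `squeue` options; `<JOBID>` is a placeholder):

```bash
squeue -u $USER --start   # Estimated start times for your pending jobs
squeue -j <JOBID>         # Status of one specific job
```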
Typical workflow:

| Step | Command | Analogy |
|---|---|---|
| 1. Prepare | Write `my_job.sh` | Write a recipe |
| 2. Submit | `sbatch my_job.sh` | Give the recipe to a chef (Slurm) |
| 3. Wait | Slurm queues the job | Chef waits for kitchen space |
| 4. Execute | Slurm runs `srun` tasks | Chef cooks your dish |
| 5. Monitor | `squeue` | Check kitchen progress |
Example job script:

```bash
#!/bin/bash
#SBATCH --nodes=2            # Request 2 compute nodes
#SBATCH --gpus-per-node=4    # 4 GPUs per node
#SBATCH --ntasks-per-node=4  # 4 tasks (MPI processes) per node
#SBATCH --cpus-per-task=32   # 32 CPU cores per task

srun python train.py         # Launches 8 tasks (2 nodes × 4 tasks/node)
```

| Parameter | Purpose | Maps to... |
|---|---|---|
| `--nodes` | Number of machines | Physical servers in the cluster |
| `--gpus-per-node` | GPUs per machine | NVIDIA A100/H100 GPUs |
| `--ntasks-per-node` | Processes per node | MPI ranks or parallel workers |
| `--cpus-per-task` | CPU cores per process | OpenMP threads or data loaders |
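For the job script above, the request works out to 2 × 4 = 8 tasks, 2 × 4 = 8 GPUs (one per task), and 8 × 32 = 256 CPU cores in total.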
How the Slurm parameters map to parallelism in code (e.g., veRL):

| Component | Slurm Parameter | veRL/Code Usage |
|---|---|---|
| MPI processes | `--ntasks` | Distributed training (DDP in PyTorch) |
| OpenMP threads | `--cpus-per-task` + `export OMP_NUM_THREADS` | CPU parallelism (data loading) |
| CUDA | `--gpus-per-node` | GPU acceleration (PyTorch/CUDA kernels) |
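To make the OpenMP row concrete, a common pattern inside a batch script is to derive the thread count from Slurm's environment; a minimal sketch (note that `SLURM_CPUS_PER_TASK` is only set when `--cpus-per-task` was requested):

```bash
# Tie OpenMP threads to the CPUs Slurm granted each task (default to 1 if unset)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

srun python train.py   # Each task now uses OMP_NUM_THREADS CPU threads
```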
When to use which:

| Scenario | Command | Why |
|---|---|---|
| Long training job | `sbatch` | Scheduled execution, logs saved |
| Debugging code | `srun --pty bash` (interactive) | Immediate feedback |
| Testing multi-node | `srun --nodes=2` | Quick resource check |
Example: running a veRL training job:

1. Write a job script (`my_job.sh`):

   ```bash
   #!/bin/bash
   #SBATCH --nodes=2
   #SBATCH --gpus-per-node=4
   #SBATCH --time=01:00:00

   srun python3 -m verl.trainer.main_ppo  # Your veRL training command
   ```

2. Submit it:

   ```bash
   sbatch my_job.sh  # Job enters the queue
   ```

3. Monitor:

   ```bash
   squeue -u $USER  # Check job status
   ```

4. Cancel if needed:

   ```bash
   scancel <JOBID>  # Terminate a running job
   ```
Key takeaways:

- `sbatch`: For production jobs (submit and forget).
- `srun`: For interactive testing (requires immediate resources).
- Parameters:
  - `--nodes`, `--gpus-per-node` define hardware
  - `--ntasks`, `--cpus-per-task` define software parallelism
- For hands-on practice, start with small jobs (e.g., `srun --gpus=1 --time=10:00 python test.py`) before scaling up.
Can sbatch call srun inside?
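Yes. A script submitted with `sbatch` normally calls `srun` inside it, exactly as in the examples above: `sbatch` obtains the allocation, and each `srun` launches a job step within that allocation. A minimal sketch (the two Python scripts are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=02:00:00

# Each srun below runs as a separate job step inside the allocation sbatch obtained.
srun --ntasks=1 python prepare_data.py   # Step 1: a single-task step (placeholder script)
srun python train.py                     # Step 2: 8 tasks (2 nodes × 4 tasks/node)
```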