# Slurm job script and srun
Yes, your Slurm job script (`my_job.sh`) can and often should contain `srun`! Here's why and how they work together:
## 1. `sbatch` vs. `srun`: Their Roles

- **`sbatch`**
  - Submits your script to the Slurm queue.
  - Allocates resources (nodes, GPUs, CPUs) but does not run your code directly.
  - Example: `sbatch my_job.sh` → Slurm schedules the job and runs the script when resources are free (see the submission sketch after this list).
- **`srun`**
  - Launches parallel tasks within the resources allocated by `sbatch`.
  - Distributes work across nodes/GPUs (like MPI or PyTorch DDP).
  - Example: Inside `my_job.sh`, `srun python train.py` → runs `train.py` on all allocated GPUs.
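A minimal submission sketch, assuming a standard Slurm setup where `sbatch`, `squeue`, and `sacct` are available (`my_job.sh` is the script discussed below):

```bash
# Submit the batch script; Slurm prints the new job ID
sbatch my_job.sh

# Watch your jobs in the queue until resources are granted
squeue -u $USER

# After the job finishes, check its accounting record (replace <jobid>)
sacct -j <jobid> --format=JobID,State,Elapsed
```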
## 2. Why Use `srun` Inside `my_job.sh`?

### (a) Proper Resource Utilization
Slurm allocates resources (nodes, GPUs) to your job via `sbatch`, but `srun` ensures your program actually uses them. Without `srun`, your code might run on just 1 GPU or node!
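A quick way to see the difference (a hedged sketch, assuming the 2-node × 4-task allocation used below):

```bash
# Runs once, on the first allocated node only
hostname

# Runs once per allocated task: 2 nodes × 4 tasks/node prints 8 hostnames
srun hostname
```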
### (b) Parallel Execution
In `my_job.sh`:

```bash
#!/bin/bash
#SBATCH --nodes=2             # Allocate 2 nodes
#SBATCH --gpus-per-node=4     # 4 GPUs per node
#SBATCH --ntasks-per-node=4   # 4 tasks (processes) per node

# Without srun: runs 1 process on the first node!
# With srun: launches 8 processes (2 nodes × 4 tasks/node)
srun python train.py
```
### (c) Signal Propagation
`scancel` (job termination) and checkpointing work better when using `srun`, as Slurm can manage the launched processes properly.
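For checkpointing specifically, one common pattern (a sketch, not the only way) is to have Slurm signal the job steps shortly before the time limit; this assumes `train.py` installs a `SIGUSR1` handler that saves a checkpoint, and the 60-second lead time is arbitrary:

```bash
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --signal=USR1@60   # send SIGUSR1 to the job steps 60 s before the time limit

# Because train.py is launched via srun, it receives the signal and can
# checkpoint cleanly (assumes train.py handles SIGUSR1 itself).
srun python train.py
```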
## 3. Example: Multi-Node veRL Training
Here's a job script for veRL that uses `srun`:

```bash
#!/bin/bash
#SBATCH --job-name=verl_ppo
#SBATCH --nodes=2              # 2 nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node (8 GPUs total)
#SBATCH --ntasks-per-node=4    # 4 tasks per node (1 per GPU)
#SBATCH --cpus-per-task=32     # 32 CPUs per task (for data loading)
#SBATCH --time=04:00:00

# Set up environment
module load cudatoolkit
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=12345

# Launch veRL training across all GPUs
srun python3 -m verl.trainer.main_ppo \
    trainer.nnodes=$SLURM_NNODES \
    trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
    data.train_batch_size=1024
```
## 4. Key Concepts

### (a) `srun` Inherits Resources from `sbatch`
The `#SBATCH` directives define resources (nodes, GPUs); `srun` uses those resources to launch tasks.
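`srun` can also launch job steps on a subset of the allocation; a hedged sketch (the preprocessing step is hypothetical):

```bash
# Inherits everything from #SBATCH: one task per GPU across both nodes
srun python train.py

# Restrict a single step to one task on one node, e.g. for light preprocessing
srun --nodes=1 --ntasks=1 python preprocess.py
```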
### (b) Process Mapping
- `--ntasks-per-node=4` + `srun` → launches 4 processes per node (1 per GPU); see the sketch after this list.
- Each process binds to a GPU (handled by PyTorch/PyTorch Lightning).
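A hedged sketch of how each task can identify itself (`SLURM_PROCID` and `SLURM_LOCALID` are set by Slurm for every task launched by `srun`):

```bash
# With 2 nodes × 4 tasks/node, this prints 8 lines, one per task
srun bash -c 'echo "global rank $SLURM_PROCID, local rank $SLURM_LOCALID, host $(hostname)"'
```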
### (c) Environment Variables
- `MASTER_ADDR` / `MASTER_PORT`: required for distributed training (PyTorch DDP); one way to set them is sketched below.
- `SLURM_NNODES`, `SLURM_GPUS_PER_NODE`: automatically set by Slurm.
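One hedged way to set the rendezvous variables inside the job script (the port range is an arbitrary assumption; any free port works):

```bash
# First hostname in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

# Derive a port from the job ID to reduce collisions when jobs share a node
export MASTER_PORT=$(( 20000 + SLURM_JOB_ID % 10000 ))
```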
## 5. When to Avoid `srun`
- Single-process jobs: if your code uses all GPUs on a node internally (e.g., `torch.nn.DataParallel`), you might skip `srun`; see the single-node sketch after this list.
- Non-parallel code: simple scripts that don't need multi-node/GPU parallelism.
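A minimal single-node sketch of that case (the resource numbers are assumptions; the single `train.py` process drives all GPUs itself):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --ntasks=1            # one process that manages all 4 GPUs internally
#SBATCH --cpus-per-task=16

# No srun: multi-GPU parallelism is handled inside the single process
python train.py
```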
## 6. Common Mistakes

| Mistake | Fix |
|---|---|
| Forgetting `srun` | Code runs on 1 GPU instead of all allocated GPUs; prefix your launch command with `srun`. |
| Mismatched `--ntasks` | Ensure `--ntasks-per-node` matches the number of GPUs per node. |
| Incorrect `MASTER_ADDR` | Use `scontrol show hostnames` to set it automatically. |
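A hedged debugging sketch to catch these before launching (`SLURM_NTASKS_PER_NODE` is only set when `--ntasks-per-node` is specified):

```bash
# Print what Slurm actually allocated before starting training
echo "nodes=$SLURM_NNODES  tasks/node=$SLURM_NTASKS_PER_NODE  gpus/node=$SLURM_GPUS_PER_NODE"
echo "MASTER_ADDR=$MASTER_ADDR  MASTER_PORT=$MASTER_PORT"
scontrol show hostnames "$SLURM_JOB_NODELIST"
```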
## Summary
- `sbatch`: requests resources.
- `srun`: uses those resources to run parallel tasks.
- Always use `srun` inside `my_job.sh` for multi-node/multi-GPU training.
For more details, see:
- Slurm Documentation
- PyTorch Distributed with Slurm