# Slurm job script and srun

Yes, your Slurm job script (`my_job.sh`) can and often should contain `srun`! Here's why and how they work together:
## 1. sbatch vs. srun: Their Roles

- **`sbatch`**
  - Submits your script to the Slurm queue.
  - Allocates resources (nodes, GPUs, CPUs) but does not run your code directly.
  - Example: `sbatch my_job.sh` → Slurm schedules the job and runs the script when resources are free.
- **`srun`**
  - Launches parallel tasks within the resources allocated by `sbatch`.
  - Distributes work across nodes/GPUs (like MPI or PyTorch DDP).
  - Example: inside `my_job.sh`, `srun python train.py` → runs `train.py` on all allocated GPUs.
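As a quick illustration of the division of labor (using the same placeholder name `my_job.sh`), a typical workflow from the login node looks like this:

```bash
# Submit the job script; sbatch only queues it and prints a job ID
sbatch my_job.sh

# Check its state: PENDING until resources are free, then RUNNING
squeue -u $USER

# The srun lines inside my_job.sh only execute once the allocation starts
```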
## 2. Why Use srun Inside my_job.sh?

### (a) Proper Resource Utilization

Slurm allocates resources (nodes, GPUs) to your job via `sbatch`, but `srun` ensures your program actually uses them. Without `srun`, your code may run as a single process on the first node and use only one GPU!
### (b) Parallel Execution

`my_job.sh`:

```bash
#!/bin/bash
#SBATCH --nodes=2             # Allocate 2 nodes
#SBATCH --gpus-per-node=4     # 4 GPUs per node
#SBATCH --ntasks-per-node=4   # 4 tasks (processes) per node

# Without srun: runs 1 process on the first node!
# With srun: launches 8 processes (2 nodes × 4 tasks/node)
srun python train.py
```
### (c) Signal Propagation

Job termination (`scancel`) and checkpointing work better when your program is launched with `srun`, because Slurm can then track and signal every task it started.
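One way this is commonly exploited for checkpointing: a minimal sketch, assuming the training code saves a checkpoint when it receives `SIGUSR1` (the signal choice and the 120-second warning window are illustrative, not part of the original example):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=01:00:00
#SBATCH --signal=B:USR1@120   # warn the batch shell 120 s before the time limit

# Forward the warning to the srun-launched tasks so they can checkpoint
trap 'echo "Time limit approaching, signaling tasks"; scancel --signal=USR1 "$SLURM_JOB_ID"' USR1

# Run the step in the background so the trap can fire while it is in flight
srun python train.py &
wait
```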
## 3. Example: Multi-Node veRL Training

Here's a job script for veRL that uses `srun`:

```bash
#!/bin/bash
#SBATCH --job-name=verl_ppo
#SBATCH --nodes=2             # 2 nodes
#SBATCH --gpus-per-node=4     # 4 GPUs per node (8 GPUs total)
#SBATCH --ntasks-per-node=4   # 4 tasks per node (1 per GPU)
#SBATCH --cpus-per-task=32    # 32 CPUs per task (for data loading)
#SBATCH --time=04:00:00

# Set up environment
module load cudatoolkit
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=12345

# Launch veRL training across all GPUs
srun python3 -m verl.trainer.main_ppo \
    trainer.nnodes=$SLURM_NNODES \
    trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
    data.train_batch_size=1024
```
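Assuming the script above is saved as `verl_job.sh` (the filename is arbitrary), submitting and inspecting the job follows the usual pattern:

```bash
# Submit; sbatch prints "Submitted batch job <jobid>"
sbatch verl_job.sh

# Watch queue state and, once running, the job's steps
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,NNodes   # use the job ID printed by sbatch
```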
## 4. Key Concepts

### (a) srun Inherits Resources from sbatch

- The `#SBATCH` directives define the resources (nodes, GPUs); `srun` uses those resources to launch tasks, as in the snippet below.
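For example, inside the allocation an `srun` line needs no resource flags of its own, and it can also request just a subset of what `sbatch` granted (an illustrative snippet, not from the original script):

```bash
# Uses the full allocation: one task per allocated task slot
srun hostname

# Restrict to a single task on one node, e.g. for a quick GPU sanity check
srun --nodes=1 --ntasks=1 nvidia-smi
```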
### (b) Process Mapping

- `--ntasks-per-node=4` + `srun` → launches 4 processes per node (1 per GPU); see the diagnostic one-liner below.
- Each process binds to a GPU (handled by PyTorch/PyTorch Lightning).
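To verify the mapping, a throwaway diagnostic line like the following can be dropped into the job script (assuming the `--ntasks-per-node=4` layout from the earlier example):

```bash
# Each task reports its global rank, its local rank on the node, and its host
srun bash -c 'echo "rank=$SLURM_PROCID local_rank=$SLURM_LOCALID host=$(hostname)"'
```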
### (c) Environment Variables

- `MASTER_ADDR`/`MASTER_PORT`: required for distributed training (PyTorch DDP).
- `SLURM_NNODES`, `SLURM_GPUS_PER_NODE`: set automatically by Slurm.
## 5. When to Avoid srun

- **Single-Process Jobs**: if your code uses all GPUs on a node internally (e.g., `torch.nn.DataParallel`), you might skip `srun`; see the sketch after this list.
- **Non-Parallel Code**: simple scripts that don't need multi-node/GPU parallelism.
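A minimal single-node script of that kind might look like the sketch below; `train_single_node.py` is a hypothetical script that spreads work across the node's GPUs on its own:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --ntasks=1            # one process; the framework uses the 4 GPUs internally
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

# No srun: the script body runs once, on the single allocated node
python train_single_node.py
```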
## 6. Common Mistakes

| Mistake | Consequence / Fix |
|---|---|
| Forgetting `srun` | Code runs on 1 GPU instead of all allocated GPUs; prefix the training command with `srun`. |
| Mismatched `--ntasks` | Ensure `--ntasks-per-node` matches the number of GPUs per node. |
| Incorrect `MASTER_ADDR` | Use `scontrol show hostnames` to set it automatically. |
## Summary

- `sbatch`: requests resources.
- `srun`: uses those resources to run parallel tasks.
- Always use `srun` inside `my_job.sh` for multi-node/multi-GPU training.
For more details, see:

- Slurm Documentation
- PyTorch Distributed with Slurm