Slurm job script and srun

Yes, your Slurm job script (my_job.sh) can and often should contain srun! Here’s why and how they work together:


1. sbatch vs. srun: Their Roles

  • sbatch:

    • Submits your script to the Slurm queue.
    • Allocates resources (nodes, GPUs, CPUs) but does not run your code directly.
    • Example: sbatch my_job.sh → Slurm schedules the job and runs the script when resources are free.
  • srun:

    • Launches parallel tasks within the resources allocated by sbatch.
    • Distributes work across nodes/GPUs (e.g., for MPI or PyTorch DDP programs).
    • Example: Inside my_job.sh, srun python train.py → launches one copy of train.py per allocated task, typically one per GPU (see the sketch below).
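
Putting the two together, a minimal sketch of the workflow (my_job.sh is the placeholder script name from above):

sbatch my_job.sh               # ask Slurm for resources; prints the assigned job ID
squeue -u $USER                # watch the job wait for, then receive, its allocation
# Once the job starts, every srun line inside my_job.sh launches one process per requested task.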

2. Why Use srun Inside my_job.sh?

(a) Proper Resource Utilization

Slurm allocates resources (nodes, GPUs) to your job via sbatch, but srun is what actually launches your program on them. Without srun, the script body runs as a single process on the first allocated node, so the remaining nodes and GPUs sit idle.

(b) Parallel Execution

my_job.sh:

#!/bin/bash
#SBATCH --nodes=2              # Allocate 2 nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node
#SBATCH --ntasks-per-node=4    # 4 tasks (processes) per node

# Without srun: Runs 1 process on the first node!
# With srun: Launches 8 processes (2 nodes × 4 tasks/node)
srun python train.py
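
A quick way to see the fan-out from inside the same script (a sketch; output order varies from run to run):

srun hostname                  # with the directives above, prints 8 lines: each of the 2 node names appears 4 times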

(c) Signal Propagation

Signals sent by scancel (job termination) and the time-limit warnings often used for checkpointing are delivered to every task when the tasks are launched with srun, because Slurm then tracks and manages those processes directly.
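
Below is a minimal checkpoint-on-signal sketch, assuming train.py polls for a sentinel file; the filename save_and_exit and the 5-minute lead time are illustrative, not Slurm or veRL defaults:

#!/bin/bash
#SBATCH --signal=B:USR1@300           # ask Slurm to send SIGUSR1 to the batch shell 5 minutes before the time limit

request_checkpoint() {
  echo "SIGUSR1 received: asking train.py to checkpoint and exit"
  touch save_and_exit                 # hypothetical sentinel file that train.py watches for
}
trap request_checkpoint USR1

srun python train.py &                # run the step in the background so the trap can fire
SRUN_PID=$!
wait $SRUN_PID                        # returns early if the signal arrives
wait $SRUN_PID                        # wait again for srun itself to finish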


3. Example: Multi-Node veRL Training

Here’s a job script for veRL that uses srun:

#!/bin/bash
#SBATCH --job-name=verl_ppo
#SBATCH --nodes=2              # 2 nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node (total 8 GPUs)
#SBATCH --ntasks-per-node=4    # 4 tasks per node (1 per GPU)
#SBATCH --cpus-per-task=32     # 32 CPUs per task (for data loading)
#SBATCH --time=04:00:00

# Set up environment
module load cudatoolkit
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=12345

# Launch veRL training across all GPUs
srun python3 -m verl.trainer.main_ppo \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
  data.train_batch_size=1024
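
To submit and monitor it, assuming the script above is saved as verl_job.sh (an assumed filename):

JOBID=$(sbatch --parsable verl_job.sh)   # --parsable prints only the job ID
squeue -j "$JOBID"                       # check pending/running state
tail -f "slurm-${JOBID}.out"             # follow the default output file, slurm-<jobid>.out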

4. Key Concepts

(a) srun Inherits Resources from sbatch

  • The #SBATCH directives define resources (nodes, GPUs).
  • srun uses those resources to launch tasks.
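
For example, a serial setup step can run inside the same allocation before the parallel launch; srun options given on the command line override the inherited defaults for that step only (prepare_data.py is a hypothetical script):

srun --nodes=1 --ntasks=1 python prepare_data.py   # one task on one of the allocated nodes
srun python train.py                               # then the full 2 nodes × 4 tasks/node from the #SBATCH directives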

(b) Process Mapping

  • --ntasks-per-node=4 + srun → Launches 4 processes per node (1 per GPU).
  • Each process binds to one GPU; typically your framework (PyTorch, PyTorch Lightning) selects the device from the task's local rank, and depending on site configuration Slurm's GPU binding may also restrict each task to a single device.
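
To check the mapping before a long run, print what each task sees (a sketch; whether CUDA_VISIBLE_DEVICES is set per task depends on your site's Slurm/cgroup configuration):

srun bash -c 'echo "task $SLURM_PROCID on $(hostname): CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'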

(c) Environment Variables

  • MASTER_ADDR/MASTER_PORT: Required for distributed training (PyTorch DDP).
  • SLURM_NNODES, SLURM_GPUS_PER_NODE: Automatically set by Slurm.
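
These per-task variables map directly onto the rank and world-size values a distributed launcher needs (a sketch; how train.py consumes them depends on how it initializes its process group):

# MASTER_ADDR/MASTER_PORT exported in the batch script (as in section 3) are inherited by every srun task:
srun bash -c 'echo "rank=$SLURM_PROCID local_rank=$SLURM_LOCALID world_size=$SLURM_NTASKS master=$MASTER_ADDR:$MASTER_PORT"'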

5. When to Avoid srun

  • Single-Process Jobs: If your code uses all GPUs on a node internally (e.g., torch.nn.DataParallel), you might skip srun (see the sketch below).
  • Non-Parallel Code: Simple scripts that don’t need multi-node/GPU parallelism.
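
A minimal single-node, single-process script without srun (a sketch; train_single.py is a placeholder for code that manages all of its GPUs itself):

#!/bin/bash
#SBATCH --job-name=single_proc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

python train_single.py        # one process; it is responsible for using all 4 GPUs itself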

6. Common Mistakes

  • Forgetting srun: the script body runs on 1 GPU instead of all allocated GPUs. Fix: prefix the launch command with srun.
  • Mismatched --ntasks: tasks and GPUs get out of step. Fix: ensure --ntasks-per-node matches the number of GPUs per node.
  • Incorrect MASTER_ADDR: a hard-coded address breaks when the job lands on different nodes. Fix: use scontrol show hostnames to set it automatically.

Summary

  • sbatch: Requests resources.
  • srun: Uses those resources to run parallel tasks.
  • Always use srun inside my_job.sh for multi-node/multi-GPU training.

For more details, see:
  • Slurm Documentation
  • PyTorch Distributed with Slurm