Slurm job script and srun

Yes, your Slurm job script (my_job.sh) can and often should contain srun! Here’s why and how they work together:


1. sbatch vs. srun: Their Roles

  • sbatch:

    • Submits your script to the Slurm queue.
    • Allocates resources (nodes, GPUs, CPUs) but does not run your code directly.
    • Example: sbatch my_job.sh → Slurm schedules the job and runs the script when resources are free.
  • srun:

    • Launches parallel tasks within the resources allocated by sbatch.
    • Distributes work across nodes/GPUs (e.g., for MPI or PyTorch DDP programs).
    • Example: Inside my_job.sh, srun python train.py → launches one copy of train.py per allocated task, typically one per GPU (see the sketch below).
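
Putting the two together, a minimal sketch of the workflow (my_job.sh is the placeholder script name from above):

sbatch my_job.sh               # ask Slurm for resources; prints the assigned job ID
squeue -u $USER                # watch the job wait for, then receive, its allocation
# Once the job starts, every srun line inside my_job.sh launches one process per requested task.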

2. Why Use srun Inside my_job.sh?

(a) Proper Resource Utilization

Slurm allocates resources (nodes, GPUs) to your job via sbatch, but srun is what actually launches your program on them. Without srun, the script body runs as a single process on the first allocated node, so the remaining nodes and GPUs sit idle.

(b) Parallel Execution

my_job.sh:

#!/bin/bash
#SBATCH --nodes=2              # Allocate 2 nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node
#SBATCH --ntasks-per-node=4    # 4 tasks (processes) per node

# Without srun: Runs 1 process on the first node!
# With srun: Launches 8 processes (2 nodes × 4 tasks/node)
srun python train.py
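
A quick way to see the fan-out from inside the same script (a sketch; output order varies from run to run):

srun hostname                  # with the directives above, prints 8 lines: each of the 2 node names appears 4 times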

(c) Signal Propagation

Signals sent by scancel (job termination) and the time-limit warnings often used for checkpointing are delivered to every task when the tasks are launched with srun, because Slurm then tracks and manages those processes directly.
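
Below is a minimal checkpoint-on-signal sketch, assuming train.py polls for a sentinel file; the filename save_and_exit and the 5-minute lead time are illustrative, not Slurm or veRL defaults:

#!/bin/bash
#SBATCH --signal=B:USR1@300           # ask Slurm to send SIGUSR1 to the batch shell 5 minutes before the time limit

request_checkpoint() {
  echo "SIGUSR1 received: asking train.py to checkpoint and exit"
  touch save_and_exit                 # hypothetical sentinel file that train.py watches for
}
trap request_checkpoint USR1

srun python train.py &                # run the step in the background so the trap can fire
SRUN_PID=$!
wait $SRUN_PID                        # returns early if the signal arrives
wait $SRUN_PID                        # wait again for srun itself to finish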


3. Example: Multi-Node veRL Training

Here’s a job script for veRL that uses srun:

#!/bin/bash
#SBATCH --job-name=verl_ppo
#SBATCH --nodes=2              # 2 nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node (total 8 GPUs)
#SBATCH --ntasks-per-node=4    # 4 tasks per node (1 per GPU)
#SBATCH --cpus-per-task=32     # 32 CPUs per task (for data loading)
#SBATCH --time=04:00:00

# Set up environment
module load cudatoolkit
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)
export MASTER_PORT=12345

# Launch veRL training across all GPUs
srun python3 -m verl.trainer.main_ppo \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
  data.train_batch_size=1024
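
To submit and monitor it, assuming the script above is saved as verl_job.sh (an assumed filename):

JOBID=$(sbatch --parsable verl_job.sh)   # --parsable prints only the job ID
squeue -j "$JOBID"                       # check pending/running state
tail -f "slurm-${JOBID}.out"             # follow the default output file, slurm-<jobid>.out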

4. Key Concepts

(a) srun Inherits Resources from sbatch

  • The #SBATCH directives define resources (nodes, GPUs).
  • srun uses those resources to launch tasks.
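
For example, a serial setup step can run inside the same allocation before the parallel launch; srun options given on the command line override the inherited defaults for that step only (prepare_data.py is a hypothetical script):

srun --nodes=1 --ntasks=1 python prepare_data.py   # one task on one of the allocated nodes
srun python train.py                               # then the full 2 nodes × 4 tasks/node from the #SBATCH directives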

(b) Process Mapping

  • --ntasks-per-node=4 + srun → Launches 4 processes per node (1 per GPU).
  • Each process binds to one GPU; typically your framework (PyTorch, PyTorch Lightning) selects the device from the task's local rank, and depending on site configuration Slurm's GPU binding may also restrict each task to a single device.
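
To check the mapping before a long run, print what each task sees (a sketch; whether CUDA_VISIBLE_DEVICES is set per task depends on your site's Slurm/cgroup configuration):

srun bash -c 'echo "task $SLURM_PROCID on $(hostname): CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'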

(c) Environment Variables

  • MASTER_ADDR/MASTER_PORT: Required for distributed training (PyTorch DDP).
  • SLURM_NNODES, SLURM_GPUS_PER_NODE: Automatically set by Slurm.
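
These per-task variables map directly onto the rank and world-size values a distributed launcher needs (a sketch; how train.py consumes them depends on how it initializes its process group):

# MASTER_ADDR/MASTER_PORT exported in the batch script (as in section 3) are inherited by every srun task:
srun bash -c 'echo "rank=$SLURM_PROCID local_rank=$SLURM_LOCALID world_size=$SLURM_NTASKS master=$MASTER_ADDR:$MASTER_PORT"'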

5. When to Avoid srun

  • Single-Process Jobs: If your code uses all GPUs on a node internally (e.g., torch.nn.DataParallel), you might skip srun (see the sketch below).
  • Non-Parallel Code: Simple scripts that don’t need multi-node/GPU parallelism.
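
A minimal single-node, single-process script without srun (a sketch; train_single.py is a placeholder for code that manages all of its GPUs itself):

#!/bin/bash
#SBATCH --job-name=single_proc
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=02:00:00

python train_single.py        # one process; it is responsible for using all 4 GPUs itself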

6. Common Mistakes

  • Forgetting srun: the script body runs on 1 GPU instead of all allocated GPUs. Fix: prefix the launch command with srun.
  • Mismatched --ntasks: tasks and GPUs get out of step. Fix: ensure --ntasks-per-node matches the number of GPUs per node.
  • Incorrect MASTER_ADDR: a hard-coded address breaks when the job lands on different nodes. Fix: use scontrol show hostnames to set it automatically.

Summary

  • sbatch: Requests resources.
  • srun: Uses those resources to run parallel tasks.
  • Always use srun inside my_job.sh for multi-node/multi-GPU training.

For more details, see:
  • Slurm Documentation
  • PyTorch Distributed with Slurm