SLURM - chunhualiao/public-docs GitHub Wiki

Lesson learned: do not run a Slurm job script directly!

  • You must submit it with sbatch, e.g. sbatch my_job.sh (see the sketch below).
  • Otherwise you hit: srun: error: Job request does not match any supported policy.
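For concreteness, a minimal sketch of the wrong vs. right invocation (my_job.sh is a placeholder name):

    ./my_job.sh        # Wrong: runs on the login node; the #SBATCH directives are ignored
    sbatch my_job.sh   # Right: submits the script to the Slurm scheduler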

Let me break down these Slurm (Simple Linux Utility for Resource Management) concepts in simple terms, using an HPC cluster analogy:


1. Slurm (The System)

  • What it is: A job scheduler/resource manager for supercomputers and HPC clusters (like Perlmutter at NERSC).
  • Analogy: Think of Slurm as the "air traffic controller" for a cluster:
    • Manages who gets compute resources (GPUs, CPUs)
    • Decides when/where jobs run
    • Enforces fair resource sharing

2. Key Slurm Commands

sbatch (Submit Batch Job)

  • Purpose: Submit a script to run later (non-interactive).
  • Example:
    sbatch my_job.sh  # Submit a job script to the queue
  • Key Features:
    • Used for production runs (e.g., training LLMs for hours/days)
    • Slurm schedules it when resources become available
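Options given on the sbatch command line override the #SBATCH directives inside the script; a hedged example (the job name and time limit here are arbitrary):

    sbatch --job-name=llm_train --time=02:00:00 my_job.sh  # Command-line options take precedence over #SBATCH lines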

srun (Run Interactive/Parallel Job)

  • Purpose: Launch tasks interactively (for testing/debugging).
  • Example:
    srun --nodes=2 --gpus-per-node=4 python train.py  # Run immediately if resources are free
  • Key Features:
    • Blocks until resources are allocated
    • Great for quick experiments (but don’t abuse it!)
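A common interactive pattern is to request a shell on a compute node (the partition name debug and the limits below are assumptions; check sinfo for what your cluster actually offers):

    srun --partition=debug --nodes=1 --gpus=1 --time=00:30:00 --pty bash  # Opens an interactive shell on a compute node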

sinfo (Cluster Status)

  • Purpose: Check available partitions (queues) and node status.
    sinfo  # Shows partitions (e.g., "gpu", "debug") and node availability
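A couple of useful variants (the partition name gpu is an assumption):

    sinfo -p gpu   # Show only the "gpu" partition
    sinfo -N -l    # Node-oriented, long format (one line per node with CPUs, memory, state)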

squeue (Job Queue)

  • Purpose: List pending/running jobs.
    squeue -u $USER  # See your jobs
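Other handy forms (the job ID is a placeholder):

    squeue -u $USER --start        # Estimated start times for your pending jobs
    squeue -j <JOBID>              # Status of a single job
    watch -n 10 squeue -u $USER    # Refresh the view every 10 seconds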

3. How They Relate

Workflow Analogy:

| Step | Command | Analogy |
|------|---------|---------|
| 1. Prepare | Write my_job.sh | Write a recipe |
| 2. Submit | sbatch my_job.sh | Give the recipe to a chef (Slurm) |
| 3. Wait | Slurm queues the job | The chef waits for kitchen space |
| 4. Execute | Slurm runs the srun tasks | The chef cooks your dish |
| 5. Monitor | squeue | Check progress in the kitchen |

4. Mapping to HPC Resources

Job Script Structure (my_job.sh):

#!/bin/bash
#SBATCH --nodes=2              # Request 2 compute nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node
#SBATCH --ntasks-per-node=4    # 4 tasks (MPI processes) per node
#SBATCH --cpus-per-task=32     # 32 CPU cores per task

srun python train.py  # Launches 8 tasks (2 nodes × 4 tasks/node)

Key Parameters:

| Parameter | Purpose | Maps to... |
|-----------|---------|------------|
| --nodes | Number of machines | Physical servers in the cluster |
| --gpus-per-node | GPUs per machine | NVIDIA A100/H100 GPUs |
| --ntasks-per-node | Processes per node | MPI ranks or parallel workers |
| --cpus-per-task | CPU cores per process | OpenMP threads or data loaders |

5. MPI/OpenMP/CUDA Mapping

| Component | Slurm Parameter | veRL/Code Usage |
|-----------|-----------------|-----------------|
| MPI processes | --ntasks | Distributed training (DDP in PyTorch) |
| OpenMP threads | --cpus-per-task + export OMP_NUM_THREADS | CPU parallelism (data loading) |
| CUDA | --gpus-per-node | GPU acceleration (PyTorch/CUDA kernels) |
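A minimal sketch of how these map inside a job script (SLURM_CPUS_PER_TASK is set by Slurm for each task; train.py is a placeholder):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4    # MPI ranks / DDP workers per node
    #SBATCH --cpus-per-task=32     # CPU cores per rank
    #SBATCH --gpus-per-node=4      # CUDA devices per node

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # OpenMP threads = cores per task
    srun python train.py                         # 8 tasks total; each gets its share of CPUs/GPUs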

6. When to Use Which

| Scenario | Command | Why |
|----------|---------|-----|
| Long training job | sbatch | Scheduled execution, logs saved |
| Debugging code | srun --pty bash (interactive shell) | Immediate feedback |
| Testing multi-node | srun --nodes=2 | Quick resource check |

7. Example Workflow

  1. Write a job script (my_job.sh):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=4
    #SBATCH --time=01:00:00
    
    srun python3 -m verl.trainer.main_ppo  # Your veRL training command
  2. Submit it:

    sbatch my_job.sh  # Job enters the queue
  3. Monitor:

    squeue -u $USER   # Check job status
  4. Cancel if needed:

    scancel <JOBID>   # Terminate a running job
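By default the job's stdout/stderr lands in slurm-<JOBID>.out in the submission directory (unless #SBATCH --output is set); a hedged sketch of common follow-up commands:

    tail -f slurm-<JOBID>.out   # Follow the job's output as it runs
    sacct -j <JOBID>            # Accounting info (state, elapsed time, exit code) after it finishes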

Key Takeaways

  • sbatch: For production jobs (submit and forget).
  • srun: For interactive testing (requires immediate resources).
  • Parameters:
    • --nodes, --gpus-per-node define hardware
    • --ntasks, --cpus-per-task define software parallelism

For hands-on practice, start with small jobs (e.g., srun --gpus=1 --time=10:00 python test.py) before scaling up.

Can sbatch call srun inside? Yes. The job scripts above do exactly that: each srun inside a script submitted with sbatch launches a job step within the nodes/GPUs that sbatch allocated.
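A minimal sketch (the script and module names are placeholders); each srun below is a separate job step running inside the allocation that sbatch requested, and the steps run one after another:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=4
    #SBATCH --time=02:00:00

    srun python prepare_data.py              # Job step 1: uses the allocated nodes
    srun python3 -m verl.trainer.main_ppo    # Job step 2: starts after step 1 finishes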
