SLURM - chunhualiao/public-docs GitHub Wiki

Lesson learned: do not run a Slurm job script directly!

  • You must submit it with sbatch, e.g. sbatch my_job.sh (see the sketch below).
  • Otherwise you hit: srun: error: Job request does not match any supported policy.
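For concreteness, a minimal sketch of the wrong vs. right invocation (my_job.sh is a placeholder name):

    ./my_job.sh        # Wrong: runs on the login node; the #SBATCH directives are ignored
    sbatch my_job.sh   # Right: submits the script to the Slurm scheduler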

Let me break down these Slurm (Simple Linux Utility for Resource Management) concepts in simple terms, using an HPC cluster analogy:


1. Slurm (The System)

  • What it is: A job scheduler/resource manager for supercomputers and HPC clusters (like Perlmutter at NERSC).
  • Analogy: Think of Slurm as the "air traffic controller" for a cluster:
    • Manages who gets compute resources (GPUs, CPUs)
    • Decides when/where jobs run
    • Enforces fair resource sharing

2. Key Slurm Commands

sbatch (Submit Batch Job)

  • Purpose: Submit a script to run later (non-interactive).
  • Example:
    sbatch my_job.sh  # Submit a job script to the queue
  • Key Features:
    • Used for production runs (e.g., training LLMs for hours/days)
    • Slurm schedules it when resources become available
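Options given on the sbatch command line override the #SBATCH directives inside the script; a hedged example (the job name and time limit here are arbitrary):

    sbatch --job-name=llm_train --time=02:00:00 my_job.sh  # Command-line options take precedence over #SBATCH lines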

srun (Run Interactive/Parallel Job)

  • Purpose: Launch tasks interactively (for testing/debugging).
  • Example:
    srun --nodes=2 --gpus-per-node=4 python train.py  # Run immediately if resources are free
  • Key Features:
    • Blocks until resources are allocated
    • Great for quick experiments (but don’t abuse it!)
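A common interactive pattern is to request a shell on a compute node (the partition name debug and the limits below are assumptions; check sinfo for what your cluster actually offers):

    srun --partition=debug --nodes=1 --gpus=1 --time=00:30:00 --pty bash  # Opens an interactive shell on a compute node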

sinfo (Cluster Status)

  • Purpose: Check available partitions (queues) and node status.
    sinfo  # Shows partitions (e.g., "gpu", "debug") and node availability
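A couple of useful variants (the partition name gpu is an assumption):

    sinfo -p gpu   # Show only the "gpu" partition
    sinfo -N -l    # Node-oriented, long format (one line per node with CPUs, memory, state)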

squeue (Job Queue)

  • Purpose: List pending/running jobs.
    squeue -u $USER  # See your jobs
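Other handy forms (the job ID is a placeholder):

    squeue -u $USER --start        # Estimated start times for your pending jobs
    squeue -j <JOBID>              # Status of a single job
    watch -n 10 squeue -u $USER    # Refresh the view every 10 seconds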

3. How They Relate

Workflow Analogy:

| Step | Command | Analogy |
|------|---------|---------|
| 1. Prepare | Write my_job.sh | Write a recipe |
| 2. Submit | sbatch my_job.sh | Give the recipe to a chef (Slurm) |
| 3. Wait | Slurm queues the job | The chef waits for kitchen space |
| 4. Execute | Slurm runs the srun tasks | The chef cooks your dish |
| 5. Monitor | squeue | Check progress in the kitchen |

4. Mapping to HPC Resources

Job Script Structure (my_job.sh):

#!/bin/bash
#SBATCH --nodes=2              # Request 2 compute nodes
#SBATCH --gpus-per-node=4      # 4 GPUs per node
#SBATCH --ntasks-per-node=4    # 4 tasks (MPI processes) per node
#SBATCH --cpus-per-task=32     # 32 CPU cores per task

srun python train.py  # Launches 8 tasks (2 nodes × 4 tasks/node)

Key Parameters:

| Parameter | Purpose | Maps to... |
|-----------|---------|------------|
| --nodes | Number of machines | Physical servers in the cluster |
| --gpus-per-node | GPUs per machine | NVIDIA A100/H100 GPUs |
| --ntasks-per-node | Processes per node | MPI ranks or parallel workers |
| --cpus-per-task | CPU cores per process | OpenMP threads or data loaders |

5. MPI/OpenMP/CUDA Mapping

| Component | Slurm Parameter | veRL/Code Usage |
|-----------|-----------------|-----------------|
| MPI processes | --ntasks | Distributed training (DDP in PyTorch) |
| OpenMP threads | --cpus-per-task + export OMP_NUM_THREADS | CPU parallelism (data loading) |
| CUDA | --gpus-per-node | GPU acceleration (PyTorch/CUDA kernels) |
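A minimal sketch of how these map inside a job script (SLURM_CPUS_PER_TASK is set by Slurm for each task; train.py is a placeholder):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4    # MPI ranks / DDP workers per node
    #SBATCH --cpus-per-task=32     # CPU cores per rank
    #SBATCH --gpus-per-node=4      # CUDA devices per node

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # OpenMP threads = cores per task
    srun python train.py                         # 8 tasks total; each gets its share of CPUs/GPUs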

6. When to Use Which

| Scenario | Command | Why |
|----------|---------|-----|
| Long training job | sbatch | Scheduled execution, logs saved |
| Debugging code | srun --pty bash (interactive shell) | Immediate feedback |
| Testing multi-node | srun --nodes=2 | Quick resource check |

7. Example Workflow

  1. Write a job script (my_job.sh):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=4
    #SBATCH --time=01:00:00
    
    srun python3 -m verl.trainer.main_ppo  # Your veRL training command
  2. Submit it:

    sbatch my_job.sh  # Job enters the queue
  3. Monitor:

    squeue -u $USER   # Check job status
  4. Cancel if needed:

    scancel <JOBID>   # Terminate a running job
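By default the job's stdout/stderr lands in slurm-<JOBID>.out in the submission directory (unless #SBATCH --output is set); a hedged sketch of common follow-up commands:

    tail -f slurm-<JOBID>.out   # Follow the job's output as it runs
    sacct -j <JOBID>            # Accounting info (state, elapsed time, exit code) after it finishes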

Key Takeaways

  • sbatch: For production jobs (submit and forget).
  • srun: For interactive testing (requires immediate resources).
  • Parameters:
    • --nodes, --gpus-per-node define hardware
    • --ntasks, --cpus-per-task define software parallelism

For hands-on practice, start with small jobs (e.g., srun --gpus=1 --time=10:00 python test.py) before scaling up.

Can sbatch call srun inside? Yes. The job scripts above do exactly that: each srun inside a script submitted with sbatch launches a job step within the nodes/GPUs that sbatch allocated.
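A minimal sketch (the script and module names are placeholders); each srun below is a separate job step running inside the allocation that sbatch requested, and the steps run one after another:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --gpus-per-node=4
    #SBATCH --time=02:00:00

    srun python prepare_data.py              # Job step 1: uses the allocated nodes
    srun python3 -m verl.trainer.main_ppo    # Job step 2: starts after step 1 finishes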
