# veRL: HPC cluster
Here's a step-by-step guide to configuring a SLURM job for veRL (Volcano Engine Reinforcement Learning) training on multi-node/multi-GPU setups, along with a mapping of HPC concepts to veRL components:
```bash
#!/bin/bash
#SBATCH --job-name=verl_multi_gpu
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=2                # Use 2 nodes
#SBATCH --gpus-per-node=4        # 4 GPUs per node (Perlmutter: 4x A100 per node)
#SBATCH --ntasks-per-node=4      # 1 task per GPU
#SBATCH --cpus-per-task=32       # CPU cores per task (Perlmutter: 128 cores/node)
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60       # Signal handling for checkpointing

# Load modules
module purge
module load PrgEnv-nvidia cudatoolkit

# Set environment variables for DDP (Distributed Data Parallel)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c4))
export WORLD_SIZE=$((SLURM_NNODES * SLURM_GPUS_PER_NODE))
export LOCAL_RANK=$SLURM_LOCALID
export GLOBAL_RANK=$SLURM_PROCID

# Launch veRL training
srun python3 -m verl.trainer.main_ppo \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    trainer.nnodes=$SLURM_NNODES \
    trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
    trainer.logger=['wandb'] \
    +trainer.resume=True \
    +trainer.checkpoint_dir=auto \
    trainer.total_epochs=20
```
| Parameter | Purpose | veRL Mapping |
|---|---|---|
| `--nodes=2` | Allocates 2 compute nodes | `trainer.nnodes=2` |
| `--gpus-per-node=4` | 4 GPUs per node | `trainer.n_gpus_per_node=4` |
| `--ntasks-per-node=4` | 1 task per GPU (aligns with GPUs/node) | Implicit in DDP setup |
| `--cpus-per-task=32` | CPU cores per GPU task (for data loading) | Affects `data.train_batch_size` efficiency |
| `MASTER_ADDR` | IP of the main node for DDP coordination | Automatically set via SLURM |
| `tensor_model_parallel_size` | Splits the model across GPUs (e.g., 2-way TP) | `actor_rollout_ref.rollout.tensor_model_parallel_size=2` |
**Processes (MPI/DDP)**
- Role: Coordinate distributed training across nodes (e.g., gradient synchronization).
- veRL Usage: Managed implicitly by PyTorch's `DistributedDataParallel` (DDP); each GPU runs a separate process.
- SLURM Mapping: `--ntasks-per-node=4` creates 4 MPI-like processes per node; a quick way to inspect these ranks is sketched below.
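
To make the process layout concrete, here is a minimal sketch, assuming the 2-node / 4-GPU allocation above and run from inside the batch script (or an `salloc` session) where the `MASTER_ADDR`/`MASTER_PORT` exports are in effect. Each `srun` task prints the rank variables that the DDP processes are built from:

```bash
# One echo per task: global rank, local rank, host, and the DDP rendezvous address.
srun --ntasks-per-node=4 bash -c \
    'echo "global rank $SLURM_PROCID (local $SLURM_LOCALID) on $(hostname), rendezvous $MASTER_ADDR:$MASTER_PORT"'
```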
**CPU Threads (OpenMP)**
- Role: Parallelize CPU-bound tasks (e.g., data preprocessing).
- veRL Usage: Controlled via `OMP_NUM_THREADS` (set to match `--cpus-per-task=32`).
- Example: `export OMP_NUM_THREADS=32` in the SLURM script; see the sketch below for deriving it from the allocation.
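
A small sketch, assuming the `--cpus-per-task=32` setting above, that derives the thread count from the allocation instead of hard-coding it:

```bash
# SLURM_CPUS_PER_TASK mirrors --cpus-per-task; fall back to 32 if it is unset.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
echo "Using $OMP_NUM_THREADS OpenMP threads per task"
```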
**GPUs (CUDA)**
- Role: Accelerate model inference/training on GPUs.
- veRL Usage: Leveraged via PyTorch/CUDA integration. Configure rollout GPU memory with `actor_rollout_ref.rollout.gpu_memory_utilization=0.4` (a quick visibility check is sketched below).
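
Before launching training, a one-liner like the following (a sketch, run inside the same allocation) confirms every node actually exposes the expected GPUs:

```bash
# List the GPUs visible on each allocated node (one task per node).
srun --ntasks-per-node=1 nvidia-smi -L
```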
- **Checkpointing:**
  - Use `+trainer.resume=True` and `+trainer.checkpoint_dir=auto` to resume from the latest checkpoint.
  - Signal handling (`--signal=B:USR1@60`) allows graceful preemption handling; a possible trap setup is sketched below.
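
A minimal sketch of how the USR1 signal could be handled in the batch script. The trap, the sentinel file, and the `VERL_ARGS` array are illustrative assumptions, not part of veRL itself:

```bash
# Because of the B: prefix in --signal=B:USR1@60, USR1 is delivered to the batch
# shell ~60s before the time limit, so run srun in the background and wait on it
# so the trap can fire.
checkpoint_handler() {
    echo "USR1 received: time limit approaching, requesting a final checkpoint" >&2
    touch "$SLURM_SUBMIT_DIR/PREEMPT_REQUESTED"   # hypothetical sentinel a training loop could poll
}
trap checkpoint_handler USR1

VERL_ARGS=(trainer.nnodes=$SLURM_NNODES +trainer.resume=True)   # plus the other overrides shown above
srun python3 -m verl.trainer.main_ppo "${VERL_ARGS[@]}" &
wait $!
```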
- **Hybrid Parallelism:**
  - Tensor Parallelism: Split model layers across GPUs (`tensor_model_parallel_size`).
  - Data Parallelism: Split batches across GPUs (handled by DDP); the resulting replica count is worked out below.
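
A quick back-of-the-envelope check for the configuration above (2 nodes × 4 GPUs with 2-way tensor parallelism):

```bash
# 8 GPUs total, split into tensor-parallel groups of 2 -> 4 data-parallel replicas.
NNODES=2; GPUS_PER_NODE=4; TP_SIZE=2
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
echo "world size: $WORLD_SIZE, data-parallel replicas: $((WORLD_SIZE / TP_SIZE))"
```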
- **Performance Tuning:**
  - Set `actor_rollout_ref.rollout.log_prob_micro_batch_size` to balance GPU memory usage.
  - Use `gpu_memory_utilization` to avoid OOM errors; example overrides follow this list.
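
A sketch of how these could be passed as extra Hydra-style overrides on the launch command; the micro batch size value is an assumption chosen only to illustrate the syntax:

```bash
# Tuning overrides appended to the srun command shown earlier (values are illustrative).
TUNING_OVERRIDES=(
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8
)
srun python3 -m verl.trainer.main_ppo "${TUNING_OVERRIDES[@]}"   # plus the data/trainer options from above
```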
- **Submit Job:** `sbatch verl_multi_node.slurm`
- **Monitor:**
  - Track logs with `tail -f logs/verl_multi_gpu-<JOBID>.out`.
  - Use `wandb` for real-time metrics; a few companion SLURM commands are listed below.
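
Standard SLURM commands that pair well with the log tail (a sketch; the job name matches the `--job-name` directive above):

```bash
squeue -u $USER -n verl_multi_gpu                # queue/run state of the job
sacct -j <JOBID> --format=JobID,Elapsed,State    # accounting summary for the job
tail -f logs/verl_multi_gpu-<JOBID>.out          # live training log
```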
- **Debugging:**
  - Check GPU utilization with `nvidia-smi`; one way to run it against a live job is sketched below.
  - Validate the distributed setup with `torch.distributed.is_initialized()` in code.
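
One way to check GPU utilization on the compute nodes of a running job without logging into them (a sketch; `--overlap` requires a reasonably recent SLURM release):

```bash
# Run one nvidia-smi per node inside the existing job allocation.
srun --jobid=<JOBID> --overlap --ntasks-per-node=1 nvidia-smi
```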
- **Network:** Ensure the NCCL backend is configured for fast inter-node communication (the default on most HPC clusters).
- **Scaling:** Test with a small `WORLD_SIZE` first (e.g., 2 nodes, 8 GPUs) before full-scale runs; see the example below.
- **Mixed Precision:** Add `+trainer.fp16=True`, if supported by veRL, for faster training.
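
Command-line `sbatch` options override the `#SBATCH` directives in the script, so the same file can be reused for a short smoke test before scaling up (a sketch; the reduced sizes are arbitrary):

```bash
# Half-size, 30-minute smoke test using the same job script.
sbatch --nodes=1 --gpus-per-node=2 --ntasks-per-node=2 --time=00:30:00 verl_multi_node.slurm
```

Because the script derives `trainer.nnodes` and `trainer.n_gpus_per_node` from the SLURM environment variables, the veRL side scales down automatically with the smaller allocation.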
For further details, refer to the veRL documentation and NERSC Perlmutter guidelines.