veRL: HPC cluster

Here is a step-by-step guide to configuring a SLURM job for veRL (Volcano Engine Reinforcement Learning) training in a multi-node, multi-GPU setup, along with how common HPC concepts map to veRL components.


1. SLURM Job Configuration

Basic SLURM Script Structure

```bash
#!/bin/bash
#SBATCH --job-name=verl_multi_gpu
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=2               # Use 2 nodes
#SBATCH --gpus-per-node=4       # 4 GPUs per node (Perlmutter: 4x A100 per node)
#SBATCH --ntasks-per-node=4     # 1 task per GPU
#SBATCH --cpus-per-task=32      # CPU cores per task (Perlmutter GPU nodes: 64 cores / 128 hardware threads)
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60      # Signal handling for checkpointing

# Load modules
module purge
module load PrgEnv-nvidia cudatoolkit

# Set environment variables for DDP (Distributed Data Parallel)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c4))
export WORLD_SIZE=$((SLURM_NNODES * SLURM_GPUS_PER_NODE))
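# Note: SLURM_LOCALID and SLURM_PROCID only take their per-task values inside
# srun-launched tasks; in the batch script itself they refer to the batch step.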
export LOCAL_RANK=$SLURM_LOCALID
export GLOBAL_RANK=$SLURM_PROCID

# Launch veRL training
srun python3 -m verl.trainer.main_ppo \
  data.train_files=$DATA_DIR/train.parquet \
  data.val_files=$DATA_DIR/test.parquet \
  actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
  trainer.logger=['wandb'] \
  +trainer.resume=True \
  +trainer.checkpoint_dir=auto \
  trainer.total_epochs=20
```

2. Key SLURM Parameters for veRL

| Parameter | Purpose | veRL Mapping |
|---|---|---|
| `--nodes=2` | Allocates 2 compute nodes | `trainer.nnodes=2` |
| `--gpus-per-node=4` | 4 GPUs per node | `trainer.n_gpus_per_node=4` |
| `--ntasks-per-node=4` | 1 task per GPU (aligns with GPUs/node) | Implicit in the DDP setup |
| `--cpus-per-task=32` | CPU cores per GPU task (data loading) | Affects data-loading efficiency for `data.train_batch_size` |
| `MASTER_ADDR` | IP of the main node for DDP coordination | Derived from `$SLURM_JOB_NODELIST` in the script |
| `tensor_model_parallel_size` | Splits the model across GPUs (e.g., 2-way TP) | `actor_rollout_ref.rollout.tensor_model_parallel_size=2` |
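
Because the launch command reads SLURM_NNODES and SLURM_GPUS_PER_NODE at run time, the same script scales by overriding the allocation at submission time; the node count below is illustrative:

```bash
# sbatch command-line options override the matching #SBATCH directives, and the
# trainer.nnodes / trainer.n_gpus_per_node flags pick up the new values automatically.
sbatch --nodes=4 --gpus-per-node=4 verl_multi_node.slurm
```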

3. Mapping HPC Concepts to veRL

MPI Processes

  • Role: Coordinate distributed training across nodes (e.g., gradient synchronization).
  • veRL Usage: Managed implicitly by PyTorch’s DistributedDataParallel (DDP); each GPU runs a separate process.
  • SLURM Mapping: --ntasks-per-node=4 creates 4 MPI-like processes (tasks) per node, as in the sanity check below.
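
A quick way to confirm the task layout before a full training run is to print each task's rank and host from inside the allocation (a minimal sketch; run it in the batch script or an interactive salloc session):

```bash
# Expect SLURM_NNODES x 4 lines, one per task, each reporting its ranks and host.
srun --ntasks-per-node=4 bash -c \
  'echo "global_rank=$SLURM_PROCID local_rank=$SLURM_LOCALID host=$(hostname)"'
```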

OpenMP Threads

  • Role: Parallelize CPU-bound tasks (e.g., data preprocessing).
  • veRL Usage: Controlled via OMP_NUM_THREADS (set --cpus-per-task=32 to match the CPU cores per task).
  • Example: export OMP_NUM_THREADS=32 in the SLURM script, as shown below.
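
Tying the thread count to the allocation keeps the two settings in sync if the job size changes (a small sketch for the batch script):

```bash
# Match OpenMP threads to the cores SLURM granted each task, so the 4 GPU tasks
# on a node do not oversubscribe the shared CPUs.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
```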

CUDA Code

  • Role: Accelerate model inference/training on GPUs.
  • veRL Usage: Leveraged via PyTorch’s CUDA integration. GPU memory for the rollout engine is configured with actor_rollout_ref.rollout.gpu_memory_utilization=0.4 (see the sketch below).
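
The memory-related knobs can be collected in one place and appended to the srun launch command from section 1. The values below are illustrative placeholders, not tuned recommendations:

```bash
# Rollout memory settings, passed as extra Hydra-style overrides; 0.4 and 8
# should be tuned to the model size and available GPU memory.
ROLLOUT_OVERRIDES=(
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4
  actor_rollout_ref.rollout.log_prob_micro_batch_size=8
)
# Append "${ROLLOUT_OVERRIDES[@]}" to the srun python3 -m verl.trainer.main_ppo command.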

4. Advanced Configuration

  1. Checkpointing:

    • Use +trainer.resume=True and +trainer.checkpoint_dir=auto to resume from the latest checkpoint.
    • Signal handling (--signal=B:USR1@60) delivers USR1 to the batch script about 60 seconds before the time limit, allowing a graceful shutdown or requeue (see the sketch after this list).
  2. Hybrid Parallelism:

    • Tensor Parallelism: Split model layers across GPUs (tensor_model_parallel_size).
    • Data Parallelism: Split batches across GPUs (handled by DDP).
  3. Performance Tuning:

    • Set actor_rollout_ref.rollout.log_prob_micro_batch_size to balance GPU memory usage.
    • Use gpu_memory_utilization to avoid OOM errors.
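
A common pattern for the signal-based preemption handling in item 1 is to trap USR1 in the batch script and requeue the job once the trainer has had time to write its latest checkpoint. The sketch below assumes checkpoints are written periodically, that the cluster allows requeueing, and that +trainer.resume=True picks the checkpoint up on restart; the handler name is illustrative and the full launch flags from section 1 are omitted for brevity:

```bash
# Requeue on USR1 (sent ~60 s before the time limit by --signal=B:USR1@60).
requeue_job() {
    echo "USR1 received; requeueing job $SLURM_JOB_ID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap requeue_job USR1

# Run the training step in the background so the shell can receive the signal,
# then wait for it to finish.
srun python3 -m verl.trainer.main_ppo \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE &
wait $!
```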

5. Example Workflow

  1. Submit Job:

    sbatch verl_multi_node.slurm
  2. Monitor:

    • Track logs with tail -f logs/verl_multi_gpu-<JOBID>.out (see the monitoring sketch after this list).
    • Use wandb for real-time metrics.
  3. Debugging:

    • Check GPU utilization with nvidia-smi on the compute nodes.
    • Validate the distributed setup with torch.distributed.is_initialized() in the training code.
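
Putting the submit/monitor steps together from a login node (a sketch; the log name follows the #SBATCH directives above, and the --overlap step requires a reasonably recent SLURM):

```bash
JOBID=$(sbatch --parsable verl_multi_node.slurm)   # submit, capture the job ID
squeue -j "$JOBID"                                 # queue / run state
tail -f "logs/verl_multi_gpu-${JOBID}.out"         # stream training output

# Spot-check GPU utilization on each allocated node without a second allocation.
srun --jobid="$JOBID" --overlap --ntasks-per-node=1 \
  nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
```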

6. Key Considerations

  • Network: Ensure the NCCL backend is configured for fast inter-node communication (the default on most HPC clusters); the sketch below shows common debug settings.
  • Scaling: Test with a small WORLD_SIZE first (e.g., 2 nodes, 8 GPUs) before full-scale runs.
  • Mixed Precision: Add +trainer.fp16=True, if supported by your veRL version, for faster training.
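
When inter-node communication misbehaves, standard NCCL environment variables (not veRL-specific options) make the initialization visible in the job log:

```bash
# Print NCCL's ring/tree setup and network selection to the job output; remove
# these once the run is healthy, since the logging is verbose.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```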

For further details, refer to the veRL documentation and NERSC Perlmutter guidelines.
