veRL: HPC cluster

Here is a step-by-step guide to configuring a SLURM job for veRL (Volcano Engine Reinforcement Learning) training in a multi-node, multi-GPU setup, along with how common HPC concepts map to veRL components.


1. SLURM Job Configuration

Basic SLURM Script Structure

```bash
#!/bin/bash
#SBATCH --job-name=verl_multi_gpu
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err
#SBATCH --partition=gpu
#SBATCH --nodes=2               # Use 2 nodes
#SBATCH --gpus-per-node=4       # 4 GPUs per node (Perlmutter: 4x A100 per node)
#SBATCH --ntasks-per-node=4     # 1 task per GPU
#SBATCH --cpus-per-task=32      # CPU cores per task (Perlmutter GPU nodes: 64 cores / 128 hardware threads)
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60      # Signal handling for checkpointing

# Load modules
module purge
module load PrgEnv-nvidia cudatoolkit

# Set environment variables for DDP (Distributed Data Parallel)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c4))
export WORLD_SIZE=$((SLURM_NNODES * SLURM_GPUS_PER_NODE))
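# Note: SLURM_LOCALID and SLURM_PROCID only take their per-task values inside
# srun-launched tasks; in the batch script itself they refer to the batch step.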
export LOCAL_RANK=$SLURM_LOCALID
export GLOBAL_RANK=$SLURM_PROCID

# Launch veRL training
srun python3 -m verl.trainer.main_ppo \
  data.train_files=$DATA_DIR/train.parquet \
  data.val_files=$DATA_DIR/test.parquet \
  actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE \
  trainer.logger=['wandb'] \
  +trainer.resume=True \
  +trainer.checkpoint_dir=auto \
  trainer.total_epochs=20
```

2. Key SLURM Parameters for veRL

| Parameter | Purpose | veRL Mapping |
|---|---|---|
| `--nodes=2` | Allocates 2 compute nodes | `trainer.nnodes=2` |
| `--gpus-per-node=4` | 4 GPUs per node | `trainer.n_gpus_per_node=4` |
| `--ntasks-per-node=4` | 1 task per GPU (aligns with GPUs/node) | Implicit in the DDP setup |
| `--cpus-per-task=32` | CPU cores per GPU task (data loading) | Affects data-loading efficiency for `data.train_batch_size` |
| `MASTER_ADDR` | IP of the main node for DDP coordination | Derived from `$SLURM_JOB_NODELIST` in the script |
| `tensor_model_parallel_size` | Splits the model across GPUs (e.g., 2-way TP) | `actor_rollout_ref.rollout.tensor_model_parallel_size=2` |
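
Because the launch command reads SLURM_NNODES and SLURM_GPUS_PER_NODE at run time, the same script scales by overriding the allocation at submission time; the node count below is illustrative:

```bash
# sbatch command-line options override the matching #SBATCH directives, and the
# trainer.nnodes / trainer.n_gpus_per_node flags pick up the new values automatically.
sbatch --nodes=4 --gpus-per-node=4 verl_multi_node.slurm
```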

3. Mapping HPC Concepts to veRL

MPI Processes

  • Role: Coordinate distributed training across nodes (e.g., gradient synchronization).
  • veRL Usage: Managed implicitly by PyTorch’s DistributedDataParallel (DDP); each GPU runs a separate process.
  • SLURM Mapping: --ntasks-per-node=4 creates 4 MPI-like processes (tasks) per node, as in the sanity check below.
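
A quick way to confirm the task layout before a full training run is to print each task's rank and host from inside the allocation (a minimal sketch; run it in the batch script or an interactive salloc session):

```bash
# Expect SLURM_NNODES x 4 lines, one per task, each reporting its ranks and host.
srun --ntasks-per-node=4 bash -c \
  'echo "global_rank=$SLURM_PROCID local_rank=$SLURM_LOCALID host=$(hostname)"'
```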

OpenMP Threads

  • Role: Parallelize CPU-bound tasks (e.g., data preprocessing).
  • veRL Usage: Controlled via OMP_NUM_THREADS (set --cpus-per-task=32 to match the CPU cores per task).
  • Example: export OMP_NUM_THREADS=32 in the SLURM script, as shown below.
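
Tying the thread count to the allocation keeps the two settings in sync if the job size changes (a small sketch for the batch script):

```bash
# Match OpenMP threads to the cores SLURM granted each task, so the 4 GPU tasks
# on a node do not oversubscribe the shared CPUs.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-32}
```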

CUDA Code

  • Role: Accelerate model inference/training on GPUs.
  • veRL Usage: Leveraged via PyTorch’s CUDA integration. GPU memory for the rollout engine is configured with actor_rollout_ref.rollout.gpu_memory_utilization=0.4 (see the sketch below).
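
The memory-related knobs can be collected in one place and appended to the srun launch command from section 1. The values below are illustrative placeholders, not tuned recommendations:

```bash
# Rollout memory settings, passed as extra Hydra-style overrides; 0.4 and 8
# should be tuned to the model size and available GPU memory.
ROLLOUT_OVERRIDES=(
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4
  actor_rollout_ref.rollout.log_prob_micro_batch_size=8
)
# Append "${ROLLOUT_OVERRIDES[@]}" to the srun python3 -m verl.trainer.main_ppo command.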

4. Advanced Configuration

  1. Checkpointing:

    • Use +trainer.resume=True and +trainer.checkpoint_dir=auto to resume from the latest checkpoint.
    • Signal handling (--signal=B:USR1@60) delivers USR1 to the batch script about 60 seconds before the time limit, allowing a graceful shutdown or requeue (see the sketch after this list).
  2. Hybrid Parallelism:

    • Tensor Parallelism: Split model layers across GPUs (tensor_model_parallel_size).
    • Data Parallelism: Split batches across GPUs (handled by DDP).
  3. Performance Tuning:

    • Set actor_rollout_ref.rollout.log_prob_micro_batch_size to balance GPU memory usage.
    • Use gpu_memory_utilization to avoid OOM errors.
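
A common pattern for the signal-based preemption handling in item 1 is to trap USR1 in the batch script and requeue the job once the trainer has had time to write its latest checkpoint. The sketch below assumes checkpoints are written periodically, that the cluster allows requeueing, and that +trainer.resume=True picks the checkpoint up on restart; the handler name is illustrative and the full launch flags from section 1 are omitted for brevity:

```bash
# Requeue on USR1 (sent ~60 s before the time limit by --signal=B:USR1@60).
requeue_job() {
    echo "USR1 received; requeueing job $SLURM_JOB_ID"
    scontrol requeue "$SLURM_JOB_ID"
}
trap requeue_job USR1

# Run the training step in the background so the shell can receive the signal,
# then wait for it to finish.
srun python3 -m verl.trainer.main_ppo \
  trainer.nnodes=$SLURM_NNODES \
  trainer.n_gpus_per_node=$SLURM_GPUS_PER_NODE &
wait $!
```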

5. Example Workflow

  1. Submit Job:

    sbatch verl_multi_node.slurm
  2. Monitor:

    • Track logs with tail -f logs/verl_multi_gpu-<JOBID>.out (see the monitoring sketch after this list).
    • Use wandb for real-time metrics.
  3. Debugging:

    • Check GPU utilization with nvidia-smi on the compute nodes.
    • Validate the distributed setup with torch.distributed.is_initialized() in the training code.
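
Putting the submit/monitor steps together from a login node (a sketch; the log name follows the #SBATCH directives above, and the --overlap step requires a reasonably recent SLURM):

```bash
JOBID=$(sbatch --parsable verl_multi_node.slurm)   # submit, capture the job ID
squeue -j "$JOBID"                                 # queue / run state
tail -f "logs/verl_multi_gpu-${JOBID}.out"         # stream training output

# Spot-check GPU utilization on each allocated node without a second allocation.
srun --jobid="$JOBID" --overlap --ntasks-per-node=1 \
  nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
```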

6. Key Considerations

  • Network: Ensure the NCCL backend is configured for fast inter-node communication (the default on most HPC clusters); the sketch below shows common debug settings.
  • Scaling: Test with a small WORLD_SIZE first (e.g., 2 nodes, 8 GPUs) before full-scale runs.
  • Mixed Precision: Add +trainer.fp16=True, if supported by your veRL version, for faster training.
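
When inter-node communication misbehaves, standard NCCL environment variables (not veRL-specific options) make the initialization visible in the job log:

```bash
# Print NCCL's ring/tree setup and network selection to the job output; remove
# these once the run is healthy, since the logging is verbose.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```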

For further details, refer to the veRL documentation and NERSC Perlmutter guidelines.
