veRL - chunhualiao/public-docs GitHub Wiki

See also veRL:HPC cluster

reinforcement learning

veRL: Volcano Engine Reinforcement Learning for LLM

Here’s a concise tutorial on veRL (Volcano Engine Reinforcement Learning) and how to use it for LLM training, synthesized from its documentation and research papers:

Code Analysis

veRL:trainer/main_generation.py

What is veRL?

veRL is an open-source reinforcement learning (RL) framework designed for post-training large language models (LLMs) with techniques such as RLHF (Reinforcement Learning from Human Feedback). Developed by ByteDance’s Doubao team and collaborators, it focuses on:

  • Flexibility: Supports diverse RL algorithms (PPO, ReMax, Safe-RLHF) and integrates with LLM frameworks like PyTorch FSDP, Megatron-LM, and vLLM.
  • Efficiency: Uses 3D-HybridEngine to reduce memory redundancy and communication overhead during training-inference transitions, achieving up to 20x higher throughput compared to DeepSpeed-Chat and OpenRLHF.
  • Scalability: Runs on clusters with hundreds of GPUs, handling models up to 70B parameters.

How to Use veRL: Quickstart Guide

1. Install veRL

git clone https://github.com/volcengine/verl
cd verl
pip install -e .  # Installs veRL and its dependencies (PyTorch, Ray, etc.)

2. Prepare Data

Example: Use the GSM8K math dataset:

cd examples/data_preprocess
python3 gsm8k.py --local_dir ~/data/gsm8k  # Generates train.parquet/test.parquet
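
To sanity-check the output, you can inspect the generated parquet files with pandas (a minimal sketch; the exact column names, e.g. prompt and reward_model, depend on the preprocessing script version):

# Inspect the preprocessed GSM8K data (column names may vary across veRL versions).
import os
import pandas as pd

df = pd.read_parquet(os.path.expanduser("~/data/gsm8k/train.parquet"))
print(df.shape)          # number of records
print(list(df.columns))  # e.g., data_source, prompt, reward_model, extra_info
print(df.iloc[0])        # one example record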

3. Download a Base Model

Use a Hugging Face model (e.g., DeepSeek-Math-7B):

huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct
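
If you prefer Python over the CLI, the same download can be done with the huggingface_hub library (a minimal sketch; requires huggingface_hub, which is typically installed alongside transformers):

# Download the base model with the huggingface_hub Python API (equivalent to the CLI command above).
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/deepseek-math-7b-instruct",
    local_dir=os.path.expanduser("~/models/deepseek-math-7b-instruct"),
)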

4. Run Supervised Fine-Tuning (SFT)

Launch the FSDP SFT trainer with torchrun (adjust --nproc_per_node to the number of GPUs on your node):

torchrun --standalone --nnodes=1 --nproc_per_node=8 -m verl.trainer.fsdp_sft_trainer \
  data.train_files=~/data/gsm8k/train.parquet \
  data.val_files=~/data/gsm8k/test.parquet \
  model.partial_pretrain=deepseek-ai/deepseek-math-7b-instruct \
  trainer.project_name=gsm8k-sft \
  trainer.total_epochs=4

5. Perform PPO Training

python3 -m verl.trainer.main_ppo \
  data.train_files=~/data/gsm8k/train.parquet \
  data.val_files=~/data/gsm8k/test.parquet \
  data.max_prompt_length=512 \
  actor_rollout_ref.model.path=~/models/deepseek-math-7b-instruct \
  critic.model.path=~/models/deepseek-math-7b-instruct \
  actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
  trainer.n_gpus_per_node=8 \
  trainer.logger=['wandb'] \
  trainer.total_epochs=15

Here, tensor_model_parallel_size sets the tensor parallelism used by the rollout engine, n_gpus_per_node is the number of GPUs used on each node, and trainer.logger=['wandb'] sends metrics to Weights & Biases.

Key Features & Tips

  1. Hybrid Programming Model

    • Combines single-controller flexibility with multi-controller efficiency for RL workflows.
    • Example: distributed actor/critic training is organized into parallel worker classes (see the conceptual sketch after this list).
  2. 3D-HybridEngine

    • Optimizes GPU memory usage during transitions between training and inference phases.
    • Adjust tensor_model_parallel_size and gpu_memory_utilization for your hardware.
  3. Rule-Based Rewards

    • Define custom reward functions (e.g., regex-based answer matching for GSM8K); see the reward-function sketch after this list.
  4. Monitoring

    • Track metrics via wandb or mlflow with trainer.logger.
  5. Scaling

    • For multi-node clusters, set trainer.nnodes and trainer.n_gpus_per_node to distribute workloads.
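
To make the single-controller idea concrete, below is a stripped-down Ray sketch (a conceptual illustration only; veRL's actual worker classes, module paths, and method names differ): a single driver script dispatches work to distributed actor/critic workers and gathers the results.

# Conceptual sketch of the single-controller pattern with Ray (not veRL's real API).
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class ActorWorker:
    def generate(self, prompts):
        # In a real system this would run rollout/generation on a model shard.
        return [p + " -> response" for p in prompts]

@ray.remote
class CriticWorker:
    def score(self, responses):
        # In a real system this would compute values/advantages.
        return [float(len(r)) for r in responses]

# The "single controller" (this driver script) orchestrates the distributed workers.
actor = ActorWorker.remote()
critic = CriticWorker.remote()

prompts = ["1+1=?", "2+3=?"]
responses = ray.get(actor.generate.remote(prompts))
scores = ray.get(critic.score.remote(responses))
print(scores)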
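
Below is also a sketch of a rule-based reward for GSM8K-style answers (hypothetical function name and signature; veRL ships its own GSM8K scorer, which may differ): extract the last number in the model's response and compare it with the ground-truth answer.

# Hypothetical rule-based reward: regex-match the final number in the response.
import re

def gsm8k_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the response equals the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0

# Example usage:
print(gsm8k_reward("The answer is 42.", "42"))  # 1.0
print(gsm8k_reward("I think it's 7", "42"))     # 0.0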

Documentation & Resources

  • Source code and examples: https://github.com/volcengine/verl

veRL simplifies RLHF for LLMs by balancing flexibility and speed, making it a good fit for both research and production use cases.