veRL - chunhualiao/public-docs GitHub Wiki

See also veRL:HPC cluster

reinforcement learning

veRL: Volcano Engine Reinforcement Learning for LLM

Here’s a concise tutorial on veRL (Volcano Engine Reinforcement Learning) and how to use it for LLM training, synthesized from its documentation and research papers:

Code Analysis

veRL:trainer/main_generation.py

What is veRL?

veRL is an open-source reinforcement learning (RL) framework designed for post-training large language models (LLMs) with techniques such as RLHF (Reinforcement Learning from Human Feedback). Developed by ByteDance’s Doubao team and collaborators, it focuses on:

  • Flexibility: Supports diverse RL algorithms (PPO, ReMax, Safe-RLHF) and integrates with LLM frameworks like PyTorch FSDP, Megatron-LM, and vLLM.
  • Efficiency: Uses 3D-HybridEngine to reduce memory redundancy and communication overhead during training-inference transitions, achieving up to 20x higher throughput compared to DeepSpeed-Chat and OpenRLHF.
  • Scalability: Runs on clusters with hundreds of GPUs, handling models up to 70B parameters.

How to Use veRL: Quickstart Guide

1. Install veRL

git clone https://github.com/volcengine/verl
cd verl
pip install -e .  # Installs veRL and its dependencies (PyTorch, Ray, etc.)

2. Prepare Data

Example: Use the GSM8K math dataset:

cd examples/data_preprocess
python3 gsm8k.py --local_dir ~/data/gsm8k  # Generates train.parquet/test.parquet
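
To sanity-check the output, you can inspect the generated parquet files with pandas (a minimal sketch; the exact column names, e.g. prompt and reward_model, depend on the preprocessing script version):

# Inspect the preprocessed GSM8K data (column names may vary across veRL versions).
import os
import pandas as pd

df = pd.read_parquet(os.path.expanduser("~/data/gsm8k/train.parquet"))
print(df.shape)          # number of records
print(list(df.columns))  # e.g., data_source, prompt, reward_model, extra_info
print(df.iloc[0])        # one example record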

3. Download a Base Model

Use a Hugging Face model (e.g., DeepSeek-Math-7B):

huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct
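
If you prefer Python over the CLI, the same download can be done with the huggingface_hub library (a minimal sketch; requires huggingface_hub, which is typically installed alongside transformers):

# Download the base model with the huggingface_hub Python API (equivalent to the CLI command above).
import os
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="deepseek-ai/deepseek-math-7b-instruct",
    local_dir=os.path.expanduser("~/models/deepseek-math-7b-instruct"),
)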

4. Run Supervised Fine-Tuning (SFT)

Launch the FSDP SFT trainer with torchrun (adjust --nproc_per_node to the number of GPUs on your node):

torchrun --standalone --nnodes=1 --nproc_per_node=8 -m verl.trainer.fsdp_sft_trainer \
  data.train_files=~/data/gsm8k/train.parquet \
  data.val_files=~/data/gsm8k/test.parquet \
  model.partial_pretrain=deepseek-ai/deepseek-math-7b-instruct \
  trainer.project_name=gsm8k-sft \
  trainer.total_epochs=4

5. Perform PPO Training

python3 -m verl.trainer.main_ppo \
  data.train_files=~/data/gsm8k/train.parquet \
  data.val_files=~/data/gsm8k/test.parquet \
  data.max_prompt_length=512 \
  actor_rollout_ref.model.path=~/models/deepseek-math-7b-instruct \
  critic.model.path=~/models/deepseek-math-7b-instruct \
  actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
  trainer.n_gpus_per_node=8 \
  trainer.logger=['wandb'] \
  trainer.total_epochs=15

Here, tensor_model_parallel_size sets the tensor parallelism used by the rollout engine, n_gpus_per_node is the number of GPUs used on each node, and trainer.logger=['wandb'] sends metrics to Weights & Biases.

Key Features & Tips

  1. Hybrid Programming Model

    • Combines single-controller flexibility with multi-controller efficiency for RL workflows.
    • Example: distributed actor/critic training is organized into parallel worker classes (see the conceptual sketch after this list).
  2. 3D-HybridEngine

    • Optimizes GPU memory usage during transitions between training and inference phases.
    • Adjust tensor_model_parallel_size and gpu_memory_utilization for your hardware.
  3. Rule-Based Rewards

    • Define custom reward functions (e.g., regex-based answer matching for GSM8K); see the reward-function sketch after this list.
  4. Monitoring

    • Track metrics via wandb or mlflow with trainer.logger.
  5. Scaling

    • For multi-node clusters, set trainer.nnodes and trainer.n_gpus_per_node to distribute workloads.
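
To make the single-controller idea concrete, below is a stripped-down Ray sketch (a conceptual illustration only; veRL's actual worker classes, module paths, and method names differ): a single driver script dispatches work to distributed actor/critic workers and gathers the results.

# Conceptual sketch of the single-controller pattern with Ray (not veRL's real API).
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class ActorWorker:
    def generate(self, prompts):
        # In a real system this would run rollout/generation on a model shard.
        return [p + " -> response" for p in prompts]

@ray.remote
class CriticWorker:
    def score(self, responses):
        # In a real system this would compute values/advantages.
        return [float(len(r)) for r in responses]

# The "single controller" (this driver script) orchestrates the distributed workers.
actor = ActorWorker.remote()
critic = CriticWorker.remote()

prompts = ["1+1=?", "2+3=?"]
responses = ray.get(actor.generate.remote(prompts))
scores = ray.get(critic.score.remote(responses))
print(scores)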
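
Below is also a sketch of a rule-based reward for GSM8K-style answers (hypothetical function name and signature; veRL ships its own GSM8K scorer, which may differ): extract the last number in the model's response and compare it with the ground-truth answer.

# Hypothetical rule-based reward: regex-match the final number in the response.
import re

def gsm8k_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the last number in the response equals the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth.strip() else 0.0

# Example usage:
print(gsm8k_reward("The answer is 42.", "42"))  # 1.0
print(gsm8k_reward("I think it's 7", "42"))     # 0.0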

Documentation & Resources

  • Source code and examples: https://github.com/volcengine/verl

veRL simplifies RLHF for LLMs by balancing flexibility and speed, making it a good fit for both research and production use cases.