See also veRL:HPC cluster
veRL: Volcano Engine Reinforcement Learning for LLM
Here’s a concise tutorial on veRL (Volcano Engine Reinforcement Learning) and how to use it for LLM training, synthesized from its documentation and research papers:
Code Analysis
veRL:trainer/main_generation.py
What is veRL?
veRL is an open-source reinforcement learning (RL) framework for post-training large language models (LLMs), e.g. via RLHF (Reinforcement Learning from Human Feedback). Developed by ByteDance's Doubao team and collaborators, it focuses on:
- Flexibility: Supports diverse RL algorithms (PPO, ReMax, Safe-RLHF) and integrates with LLM frameworks like PyTorch FSDP, Megatron-LM, and vLLM.
- Efficiency: Uses 3D-HybridEngine to reduce memory redundancy and communication overhead during training-inference transitions, achieving up to 20x higher throughput compared to DeepSpeed-Chat and OpenRLHF.
- Scalability: Runs on clusters with hundreds of GPUs, handling models up to 70B parameters.
How to Use veRL: Quickstart Guide
1. Install veRL
```bash
git clone https://github.com/volcengine/verl
cd verl
pip install -e .   # installs veRL and its Python dependencies (PyTorch, Ray, etc.)
```
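After installation, a quick import check confirms the editable install is visible to your Python environment (optional, and assumes the `pip install` above completed without errors):

```python
# Optional sanity check: confirm the verl package is importable after `pip install -e .`
import verl
print(verl.__file__)  # should point into the cloned verl/ checkout
```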
2. Prepare Data
Example: Use the GSM8K math dataset:
```bash
cd examples/data_preprocess
python3 gsm8k.py --local_dir ~/data/gsm8k   # writes train.parquet / test.parquet
```
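To see what the preprocessing script produced, you can inspect the Parquet files directly. The column names are an assumption based on recent verl preprocessing scripts; check your own output rather than relying on them:

```python
# Inspect the preprocessed GSM8K data (assumes pandas and pyarrow are installed).
import os
import pandas as pd

df = pd.read_parquet(os.path.expanduser("~/data/gsm8k/train.parquet"))
print(df.columns.tolist())  # expected to include a prompt column plus reward/answer metadata (assumed schema)
print(df.iloc[0])           # first preprocessed example
```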
3. Download a Base Model
Use a Hugging Face model (e.g., DeepSeek-Math-7B):
```bash
huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct
```
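A quick way to confirm the downloaded checkpoint is usable is to load its tokenizer with `transformers`. This is an optional check, not part of the veRL workflow itself:

```python
# Optional: verify the local checkpoint loads (assumes `transformers` is installed).
import os
from transformers import AutoTokenizer

model_dir = os.path.expanduser("~/models/deepseek-math-7b-instruct")
tok = AutoTokenizer.from_pretrained(model_dir)
print(tok("What is 2 + 2?"))  # should print input_ids / attention_mask
```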
4. Run Supervised Fine-Tuning (SFT)
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    -m verl.trainer.fsdp_sft_trainer \
    data.train_files=~/data/gsm8k/train.parquet \
    data.val_files=~/data/gsm8k/test.parquet \
    model.partial_pretrain=deepseek-ai/deepseek-math-7b-instruct \
    trainer.project_name=gsm8k-sft \
    trainer.total_epochs=4
```
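For a rough sense of training length: GSM8K's train split has roughly 7.5k examples, so the optimizer step count follows directly from the epoch count and the global batch size. The batch size below is an assumed placeholder, not a veRL default:

```python
# Back-of-the-envelope step count for the SFT run above.
train_examples = 7473        # approximate GSM8K train split size
epochs = 4                   # trainer.total_epochs above
global_batch_size = 256      # assumed placeholder; use your actual configured batch size
steps = epochs * train_examples // global_batch_size
print(steps)                 # ~116 optimizer steps under these assumptions
```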
5. Perform PPO Training
```bash
python3 -m verl.trainer.main_ppo \
    data.train_files=~/data/gsm8k/train.parquet \
    data.val_files=~/data/gsm8k/test.parquet \
    data.max_prompt_length=512 \
    actor_rollout_ref.model.path=~/models/deepseek-math-7b-instruct \
    critic.model.path=~/models/deepseek-math-7b-instruct \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    trainer.n_gpus_per_node=8 \
    trainer.logger=['wandb'] \
    trainer.total_epochs=15
```

Here `tensor_model_parallel_size=2` enables tensor parallelism for rollout, `trainer.n_gpus_per_node=8` uses 8 GPUs per node, and `trainer.logger=['wandb']` logs metrics to Weights & Biases.
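The dotted `key=value` arguments in both the SFT and PPO commands are Hydra/OmegaConf-style overrides of a nested YAML config. The snippet below is a standalone illustration of how such overrides resolve into nested settings; it uses `omegaconf` directly and shows only the keys from the command above, not veRL's full config schema:

```python
# Illustration only: how dotted overrides map onto a nested config.
# Requires `pip install omegaconf`; this is not veRL's actual configuration file.
from omegaconf import OmegaConf

overrides = [
    "data.max_prompt_length=512",
    "actor_rollout_ref.rollout.tensor_model_parallel_size=2",
    "trainer.n_gpus_per_node=8",
    "trainer.total_epochs=15",
]
cfg = OmegaConf.from_dotlist(overrides)
print(OmegaConf.to_yaml(cfg))        # nested YAML view of the overrides
print(cfg.trainer.n_gpus_per_node)   # 8
```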
Key Features & Tips
- Hybrid Programming Model
  - Combines single-controller flexibility with multi-controller efficiency for RL workflows.
  - Example: Use `ParallelWorker` classes for distributed actor/critic training.
- 3D-HybridEngine
  - Optimizes GPU memory usage during transitions between training and inference phases.
  - Adjust `tensor_model_parallel_size` and `gpu_memory_utilization` for your hardware.
- Rule-Based Rewards
  - Define custom reward functions (e.g., regex-based answer matching in GSM8K); see the sketch after this list.
- Monitoring
  - Track metrics via `wandb` or `mlflow` with `trainer.logger`.
- Scaling
  - For large clusters, use `nnodes` and `trainer.n_gpus_per_node` to distribute workloads.
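As a concrete illustration of the rule-based reward idea, here is a minimal sketch of a GSM8K-style scorer that extracts the final `#### <number>` answer with a regex and compares it to the ground truth. The function name and signature are illustrative only, not veRL's actual reward API; consult the repo for how custom reward functions are registered:

```python
import re

def gsm8k_rule_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: 1.0 if the final '#### <answer>' matches, else 0.0.

    Illustrative sketch; the real veRL reward interface may differ.
    """
    match = re.search(r"####\s*(-?[\d,\.]+)", response)
    if match is None:
        return 0.0
    predicted = match.group(1).replace(",", "").rstrip(".")
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example usage:
print(gsm8k_rule_reward("... so the answer is #### 42", "42"))  # 1.0
print(gsm8k_rule_reward("I am not sure.", "42"))                # 0.0
```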
Documentation & Resources
- GitHub Repo: volcengine/verl
- Paper: HybridFlow: A Flexible and Efficient RLHF Framework (arXiv:2409.19256)
- Tutorials: veRL Quickstart
veRL simplifies RLHF for LLMs by balancing flexibility and speed—ideal for both research and production use cases.