GRPO
Improvements in Reinforcement Learning for DeepSeek-R1-Zero
Here’s a table comparing Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) based on key characteristics:
| Feature | PPO (Proximal Policy Optimization) | GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Critic Model | Requires a critic model to estimate value functions. | Does not require a critic model, reducing computational cost. |
| Optimization Approach | Uses a clipped objective function to constrain policy updates. | Uses a group-based comparison approach to estimate advantages for policy updates. |
| Baseline Estimation | Uses a learned value function (critic) to estimate advantages. | Estimates the baseline from the group's scores instead of from a critic. |
| Computational Cost | High, due to critic training and bootstrapping. | Lower, since it avoids the critic and uses simpler relative comparisons. |
| Stability of Training | More stable due to critic-guided learning. | May be more sensitive to sample quality, but benefits from comparative scoring. |
| Sample Efficiency | Requires more samples, since both the policy and the critic must be trained. | More sample-efficient per update, since advantages come directly from group comparisons. |
| Use in DeepSeek-R1 | Not used, due to its high computational overhead. | Chosen for DeepSeek-R1-Zero for its efficiency and its ability to improve reasoning performance. |
| Suitability | Well suited for continuous control tasks (e.g., robotics, game playing). | Better suited for reasoning tasks in LLMs, where direct group comparisons help refine policy updates. |
In summary:
- PPO is widely used for general RL tasks but requires a critic model, making it more computationally expensive.
- GRPO is more efficient for LLMs as it removes the critic and focuses on direct group-based optimization, making it a better fit for reasoning-oriented RL in DeepSeek-R1.
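To make the "baseline from group scores" point concrete, here is a minimal sketch in Python (not the paper's implementation) of how group-relative advantages can be computed once each sampled response to the same prompt has been scored by a reward function:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled responses.

    The group mean acts as the baseline, so no critic/value model is
    needed; `eps` guards against a zero standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()      # group baseline replaces the critic
    scale = rewards.std() + eps    # normalize so update sizes are comparable
    return (rewards - baseline) / scale

# Example: 4 sampled answers to the same question, scored 0/1 for correctness.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct answers get positive advantages, incorrect ones negative.
```

In the full method, these advantages are then plugged into a PPO-style clipped surrogate objective with a KL penalty toward a reference policy; the sketch above covers only the critic-free advantage step.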
The authors of the DeepSeek-R1 paper introduced improvements in reinforcement learning (RL) for DeepSeek-R1-Zero primarily through Group Relative Policy Optimization (GRPO). This method improves RL efficiency by:
- Avoiding a Critic Model: Traditional RL methods such as Proximal Policy Optimization (PPO) rely on a critic model to estimate value functions, requiring extra computation. GRPO instead estimates the baseline from group scores, making training more computationally efficient.
- Group-based Optimization: The model samples multiple responses per question and compares them within the group, optimizing the policy based on relative advantages rather than absolute values.
- Rule-based Reward Modeling (a rough sketch of both reward types follows after this list):
  - Accuracy Rewards: Evaluate correctness on verifiable tasks such as math and coding.
  - Format Rewards: Encourage structured reasoning outputs, using predefined tags such as `<think>` and `</think>` to separate the reasoning process from the final answer.
- Self-Evolution Process:
  - The model autonomously develops reasoning strategies without requiring supervised fine-tuning.
  - Over RL iterations, DeepSeek-R1-Zero naturally extends its thinking process, improving problem-solving efficiency.
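The two rule-based reward types above can be illustrated with a short sketch. The exact tag handling and the plain string comparison below are simplifying assumptions, not the paper's actual rules; real accuracy checkers parse math answers or run code tests:

```python
import re

# One <think>...</think> block followed by the final answer (illustrative convention).
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think>
    followed by a final answer, else 0.0 (illustrative rule)."""
    return 1.0 if THINK_PATTERN.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the text after </think> matches the reference answer
    (simple string comparison here, as a stand-in for a real checker)."""
    match = THINK_PATTERN.fullmatch(response.strip())
    if not match:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == reference.strip() else 0.0

response = "<think>2 + 2 equals 4.</think> 4"
print(format_reward(response), accuracy_reward(response, "4"))  # 1.0 1.0
```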
While DeepSeek-R1-Zero demonstrated strong reasoning capabilities through RL alone, it faced notable drawbacks:
- Poor Readability: Outputs were difficult to understand due to language mixing.
- Lack of Alignment: Responses were not optimized for human preferences, making them less user-friendly.
To address these issues, the authors introduced DeepSeek-R1, which builds upon DeepSeek-R1-Zero with additional refinements:
- Cold-Start Data:
  - Before RL, the model is fine-tuned on a small dataset of high-quality reasoning examples.
  - This prevents instability during early RL training and improves readability.
- Reasoning-Oriented RL:
  - Similar RL training is applied, but with additional constraints to enhance readability and coherence.
  - A language consistency reward is introduced to prevent language mixing (a rough sketch of such a reward follows after this list).
- Rejection Sampling and Supervised Fine-Tuning (SFT):
  - After RL convergence, high-quality responses are selected and used to fine-tune the model with additional general-domain data (a sketch of this selection step follows after this list).
  - This stage integrates reasoning with non-reasoning tasks such as writing and role-playing.
- Final RL Alignment:
  - Another RL phase refines helpfulness and harmlessness.
  - Combines rule-based rewards for reasoning with human preference-based rewards for general tasks.
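As a rough illustration of the language consistency reward mentioned in the Reasoning-Oriented RL step, the sketch below uses the fraction of ASCII word tokens as a crude stand-in for "proportion of target-language words"; this is an illustrative heuristic, not the paper's exact metric:

```python
def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of whitespace-separated tokens that are plain ASCII,
    used as a crude proxy for 'written in English'. The reward described
    for DeepSeek-R1 measures the proportion of target-language words in
    the chain of thought; this ASCII check is only an illustrative stand-in."""
    tokens = chain_of_thought.split()
    if not tokens:
        return 0.0
    ascii_tokens = sum(1 for t in tokens if t.isascii())
    return ascii_tokens / len(tokens)

print(language_consistency_reward("First, compute 2 + 2 = 4."))  # 1.0
print(language_consistency_reward("First, 计算 2 + 2 = 4."))      # mixed languages -> below 1.0
```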
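The rejection-sampling step can be sketched in the same spirit: draw several candidate responses per prompt from the RL-trained policy, score them, and keep only the high-quality ones as SFT data. The `generate` and `reward_fn` callables below are placeholders for whatever sampling and scoring machinery is actually used:

```python
from typing import Callable, List, Tuple

def rejection_sample_sft_data(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # (prompt, n) -> n sampled responses
    reward_fn: Callable[[str, str], float],     # (prompt, response) -> score
    samples_per_prompt: int = 8,
    threshold: float = 1.0,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, response) pairs whose reward clears the threshold,
    mirroring the idea of rejection sampling to build an SFT dataset."""
    kept = []
    for prompt in prompts:
        for response in generate(prompt, samples_per_prompt):
            if reward_fn(prompt, response) >= threshold:
                kept.append((prompt, response))
    return kept

# Toy usage with stub sampler and reward:
data = rejection_sample_sft_data(
    ["What is 2 + 2?"],
    generate=lambda p, n: ["<think>2+2=4</think> 4", "<think>guess</think> 5"],
    reward_fn=lambda p, r: 1.0 if r.endswith(" 4") else 0.0,
)
print(data)  # [('What is 2 + 2?', '<think>2+2=4</think> 4')]
```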
In short:
- DeepSeek-R1-Zero relies purely on RL, which enables self-evolving reasoning but results in poor readability and user alignment.
- DeepSeek-R1 incorporates cold-start data and iterative fine-tuning, improving both reasoning quality and usability.
- DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 and outperforms earlier DeepSeek models on reasoning benchmarks such as MATH-500 and AIME.
These enhancements demonstrate that combining pure RL with structured pretraining and fine-tuning significantly improves reasoning capabilities in language models.