GRPO
Improvements in Reinforcement Learning for DeepSeek-R1-Zero
Here’s a table comparing Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) based on key characteristics:
| Feature | PPO (Proximal Policy Optimization) | GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Critic Model | Requires a critic model to estimate value functions. | Does not require a critic model, reducing computational cost. |
| Optimization Approach | Uses a clipped objective function to constrain policy updates. | Uses a group-based comparison approach to estimate advantages for policy updates. |
| Baseline Estimation | Uses a learned value function (critic) to estimate advantages. | Estimates the baseline from the group's scores instead of from a critic. |
| Computational Cost | High, due to critic training and bootstrapping. | Lower, since it avoids the critic and uses simpler relative comparisons. |
| Stability of Training | More stable due to critic-guided learning. | May be more sensitive to sample quality, but benefits from comparative scoring. |
| Sample Efficiency | Requires more samples, since both the policy and the critic must be trained. | More sample-efficient per update, since advantages come directly from group comparisons. |
| Use in DeepSeek-R1 | Not used, due to its high computational overhead. | Chosen for DeepSeek-R1-Zero for its efficiency and its ability to improve reasoning performance. |
| Suitability | Well suited for continuous control tasks (e.g., robotics, game playing). | Better suited for reasoning tasks in LLMs, where direct group comparisons help refine policy updates. |
In summary:
- PPO is widely used for general RL tasks but requires a critic model, making it more computationally expensive.
- GRPO is more efficient for LLMs as it removes the critic and focuses on direct group-based optimization, making it a better fit for reasoning-oriented RL in DeepSeek-R1.
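To make the "baseline from group scores" point concrete, here is a minimal sketch in Python (not the paper's implementation) of how group-relative advantages can be computed once each sampled response to the same prompt has been scored by a reward function:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled responses.

    The group mean acts as the baseline, so no critic/value model is
    needed; `eps` guards against a zero standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()      # group baseline replaces the critic
    scale = rewards.std() + eps    # normalize so update sizes are comparable
    return (rewards - baseline) / scale

# Example: 4 sampled answers to the same question, scored 0/1 for correctness.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# Correct answers get positive advantages, incorrect ones negative.
```

In the full method, these advantages are then plugged into a PPO-style clipped surrogate objective with a KL penalty toward a reference policy; the sketch above covers only the critic-free advantage step.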
The authors of the DeepSeek-R1 paper introduced improvements in reinforcement learning (RL) for DeepSeek-R1-Zero primarily through Group Relative Policy Optimization (GRPO). This method improves RL efficiency by:
- Avoiding a Critic Model: Traditional RL methods such as Proximal Policy Optimization (PPO) rely on a critic model to estimate value functions, requiring extra computation. GRPO instead estimates the baseline from group scores, making training more computationally efficient.
- Group-based Optimization: The model samples multiple responses per question and compares them within the group, optimizing the policy based on relative advantages rather than absolute values.
- Rule-based Reward Modeling (a rough sketch of both reward types follows after this list):
  - Accuracy Rewards: Evaluate correctness on verifiable tasks such as math and coding.
  - Format Rewards: Encourage structured reasoning outputs, using predefined tags such as `<think>` and `</think>` to separate the reasoning process from the final answer.
- Self-Evolution Process:
  - The model autonomously develops reasoning strategies without requiring supervised fine-tuning.
  - Over RL iterations, DeepSeek-R1-Zero naturally extends its thinking process, improving problem-solving efficiency.
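The two rule-based reward types above can be illustrated with a short sketch. The exact tag handling and the plain string comparison below are simplifying assumptions, not the paper's actual rules; real accuracy checkers parse math answers or run code tests:

```python
import re

# One <think>...</think> block followed by the final answer (illustrative convention).
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think>
    followed by a final answer, else 0.0 (illustrative rule)."""
    return 1.0 if THINK_PATTERN.fullmatch(response.strip()) else 0.0

def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the text after </think> matches the reference answer
    (simple string comparison here, as a stand-in for a real checker)."""
    match = THINK_PATTERN.fullmatch(response.strip())
    if not match:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == reference.strip() else 0.0

response = "<think>2 + 2 equals 4.</think> 4"
print(format_reward(response), accuracy_reward(response, "4"))  # 1.0 1.0
```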
While DeepSeek-R1-Zero demonstrated strong reasoning capabilities through RL alone, it faced notable drawbacks:
- Poor Readability: Outputs were difficult to understand due to language mixing.
- Lack of Alignment: Responses were not optimized for human preferences, making them less user-friendly.
To address these issues, the authors introduced DeepSeek-R1, which builds upon DeepSeek-R1-Zero with additional refinements:
- Cold-Start Data:
  - Before RL, the model is fine-tuned on a small dataset of high-quality reasoning examples.
  - This prevents instability during early RL training and improves readability.
- Reasoning-Oriented RL:
  - Similar RL training is applied, but with additional constraints to enhance readability and coherence.
  - A language consistency reward is introduced to prevent language mixing (a rough sketch of such a reward follows after this list).
- Rejection Sampling and Supervised Fine-Tuning (SFT):
  - After RL convergence, high-quality responses are selected and used to fine-tune the model with additional general-domain data (a sketch of this selection step follows after this list).
  - This stage integrates reasoning with non-reasoning tasks such as writing and role-playing.
- Final RL Alignment:
  - Another RL phase refines helpfulness and harmlessness.
  - Combines rule-based rewards for reasoning with human preference-based rewards for general tasks.
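As a rough illustration of the language consistency reward mentioned in the Reasoning-Oriented RL step, the sketch below uses the fraction of ASCII word tokens as a crude stand-in for "proportion of target-language words"; this is an illustrative heuristic, not the paper's exact metric:

```python
def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of whitespace-separated tokens that are plain ASCII,
    used as a crude proxy for 'written in English'. The reward described
    for DeepSeek-R1 measures the proportion of target-language words in
    the chain of thought; this ASCII check is only an illustrative stand-in."""
    tokens = chain_of_thought.split()
    if not tokens:
        return 0.0
    ascii_tokens = sum(1 for t in tokens if t.isascii())
    return ascii_tokens / len(tokens)

print(language_consistency_reward("First, compute 2 + 2 = 4."))  # 1.0
print(language_consistency_reward("First, 计算 2 + 2 = 4."))      # mixed languages -> below 1.0
```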
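The rejection-sampling step can be sketched in the same spirit: draw several candidate responses per prompt from the RL-trained policy, score them, and keep only the high-quality ones as SFT data. The `generate` and `reward_fn` callables below are placeholders for whatever sampling and scoring machinery is actually used:

```python
from typing import Callable, List, Tuple

def rejection_sample_sft_data(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # (prompt, n) -> n sampled responses
    reward_fn: Callable[[str, str], float],     # (prompt, response) -> score
    samples_per_prompt: int = 8,
    threshold: float = 1.0,
) -> List[Tuple[str, str]]:
    """Keep only (prompt, response) pairs whose reward clears the threshold,
    mirroring the idea of rejection sampling to build an SFT dataset."""
    kept = []
    for prompt in prompts:
        for response in generate(prompt, samples_per_prompt):
            if reward_fn(prompt, response) >= threshold:
                kept.append((prompt, response))
    return kept

# Toy usage with stub sampler and reward:
data = rejection_sample_sft_data(
    ["What is 2 + 2?"],
    generate=lambda p, n: ["<think>2+2=4</think> 4", "<think>guess</think> 5"],
    reward_fn=lambda p, r: 1.0 if r.endswith(" 4") else 0.0,
)
print(data)  # [('What is 2 + 2?', '<think>2+2=4</think> 4')]
```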
In short:
- DeepSeek-R1-Zero relies purely on RL, which enables self-evolving reasoning but results in poor readability and user alignment.
- DeepSeek-R1 incorporates cold-start data and iterative fine-tuning, improving both reasoning quality and usability.
- DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 and outperforms earlier DeepSeek models on reasoning benchmarks such as MATH-500 and AIME.
These enhancements demonstrate that combining pure RL with structured pretraining and fine-tuning significantly improves reasoning capabilities in language models.