GRPO

GRPO: Objective Function

Improvements in Reinforcement Learning for DeepSeek-R1-Zero

Here’s a table comparing Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) based on key characteristics:

| Feature | PPO (Proximal Policy Optimization) | GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Critic Model | Requires a critic model to estimate value functions. | Does not require a critic model, reducing computational cost. |
| Optimization Approach | Uses a clipped objective function to constrain policy updates. | Uses group-based comparisons of sampled responses to guide policy updates. |
| Baseline Estimation | Uses a learned value function (critic) to estimate advantages. | Estimates the baseline from group scores instead of a critic. |
| Computational Cost | Higher, due to critic training and bootstrapping. | Lower, since it avoids the critic and uses simpler relative comparisons. |
| Training Stability | More stable due to critic-guided learning. | May be more sensitive to sample quality, but benefits from comparative scoring. |
| Sample Efficiency | Requires extensive exploration due to critic reliance. | More sample-efficient due to direct advantage estimation from group comparisons. |
| Use in DeepSeek-R1 | Not used, due to high computational overhead. | Chosen for DeepSeek-R1-Zero for its efficiency and its ability to improve reasoning performance. |
| Suitability | Well-suited to continuous control tasks (e.g., robotics, game playing). | Better suited to reasoning tasks in LLMs, where direct group comparisons help refine policy updates. |

In summary:

  • PPO is widely used for general RL tasks but requires a critic model, making it more computationally expensive.
  • GRPO is more efficient for LLMs as it removes the critic and focuses on direct group-based optimization, making it a better fit for reasoning-oriented RL in DeepSeek-R1.
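
The baseline difference can be made concrete with a small sketch. The following Python/NumPy snippet is not from the paper; it only illustrates the idea that GRPO scores each sampled response relative to the other responses in its group, rather than against a learned critic's value estimate.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each response's reward against
    the mean and standard deviation of its own group (no critic model)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical example: 4 responses sampled for the same question,
# each scored by a reward function (e.g., correct / incorrect).
rewards = [1.0, 0.0, 1.0, 0.5]
advantages = grpo_advantages(rewards)
print(advantages)  # positive for above-average responses, negative otherwise
```

Because each advantage is measured only against other samples for the same prompt, no separate value network has to be trained, which is the main source of GRPO's lower computational cost.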

The authors of the DeepSeek-R1 paper introduced improvements in reinforcement learning (RL) for DeepSeek-R1-Zero primarily through Group Relative Policy Optimization (GRPO). This method optimizes RL efficiency by:

  1. Avoiding a Critic Model: Traditional RL methods, such as Proximal Policy Optimization (PPO), rely on a critic model to estimate value functions, requiring extra computation. Instead, GRPO estimates the baseline using group scores, making training more computationally efficient.

  2. Group-based Optimization: The model samples multiple responses per question and compares them within a group, optimizing the policy based on relative advantages rather than absolute values.

  3. Rule-based Reward Modeling (see the sketch after this list):

    • Accuracy Rewards: Evaluate correctness for tasks such as math and coding.
    • Format Rewards: Encourage structured reasoning outputs, using predefined tags such as `<think> </think>` to separate the reasoning process from the final answer.
  4. Self-Evolution Process:

    • The model autonomously develops reasoning strategies without requiring supervised fine-tuning.
    • Over RL iterations, DeepSeek-R1-Zero naturally extends its thinking process, improving problem-solving efficiency.
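
The paper describes rule-based accuracy and format rewards, but their exact rules are not reproduced here, so the sketch below is only illustrative: the exact-match answer check, the 0.2 format bonus, and the tag-checking regular expression are all assumptions rather than the authors' implementation.

```python
import re

# Response must start with a <think>...</think> block and then give an answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the text after </think> matches the reference answer, else 0.0
    (exact string match is an illustrative stand-in for a real checker)."""
    answer = response.split("</think>")[-1].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Small bonus when reasoning is wrapped in <think> tags before the answer."""
    return 0.2 if THINK_PATTERN.match(response.strip()) else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    return accuracy_reward(response, reference_answer) + format_reward(response)

print(total_reward("<think>2 + 2 = 4</think> 4", "4"))  # 1.2
```

In GRPO, per-response scores like these would then feed the group-relative advantage computation sketched earlier.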

Comparison with Previous Methods

Limitations of DeepSeek-R1-Zero

While DeepSeek-R1-Zero demonstrated strong reasoning capabilities through RL alone, it faced notable drawbacks:

  • Poor Readability: Outputs were difficult to understand due to language mixing.
  • Lack of Alignment: Responses were not optimized for human preferences, making them less user-friendly.

Enhancements in DeepSeek-R1

To address these issues, the authors introduced DeepSeek-R1, which builds upon DeepSeek-R1-Zero with additional refinements:

  1. Cold-Start Data:

    • Before RL, the model is fine-tuned on a small dataset of high-quality reasoning examples.
    • This prevents instability during early RL training and improves readability.
  2. Reasoning-Oriented RL:

    • Similar RL training is applied, but with additional constraints to enhance readability and coherence.
    • A language consistency reward is introduced to prevent language mixing.
  3. Rejection Sampling and Supervised Fine-Tuning (SFT):

    • After RL convergence, high-quality responses are selected (see the sketch after this list) and combined with additional general-domain data to fine-tune the model.
    • This stage integrates reasoning with non-reasoning tasks such as writing and role-playing.
  4. Final RL Alignment:

    • Another RL phase refines helpfulness and harmlessness.
    • Combines rule-based rewards for reasoning with human preference-based rewards for general tasks.
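
The rejection-sampling step in item 3 can be pictured with a short sketch. Everything here is hypothetical: the function name, the `score_fn` interface, and the acceptance threshold simply stand in for whatever reward model or rule-based checks are actually used to filter responses.

```python
from typing import Callable, Iterable, List, Tuple

def rejection_sample(prompt: str,
                     responses: Iterable[str],
                     score_fn: Callable[[str, str], float],
                     threshold: float = 1.0) -> List[Tuple[str, str]]:
    """Keep only responses whose score clears the threshold; the survivors
    become (prompt, response) pairs for the supervised fine-tuning stage."""
    return [(prompt, resp) for resp in responses
            if score_fn(prompt, resp) >= threshold]

# Usage idea: sample many responses per prompt from the RL-converged model,
# score them (correctness, readability, language consistency), and keep only
# the best ones; these pairs are then mixed with general-domain SFT data.
```

The surviving pairs, combined with general-domain examples, form the supervised fine-tuning corpus described in item 3.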

Key Takeaways

  • DeepSeek-R1-Zero relies purely on RL, which enables self-evolving reasoning but results in poor readability and user alignment.
  • DeepSeek-R1 incorporates cold-start data and iterative fine-tuning, improving both reasoning quality and usability.
  • DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217, outperforming previous versions on reasoning tasks like MATH-500 and AIME.

These enhancements demonstrate that combining pure RL with structured pretraining and fine-tuning significantly improves reasoning capabilities in language models.
