

Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is an optimization technique used in reinforcement learning (RL) to efficiently update a policy model. GRPO focuses on maximizing the rewards of outputs generated by the model while balancing computational efficiency and stability during training.

GRPO is a variant of policy optimization that replaces the traditional critic model (used in approaches like Proximal Policy Optimization, PPO) with group-based scoring. Instead of requiring a separate critic network, GRPO calculates relative advantages for outputs by comparing them within a sampled group.
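
To make the critic-free idea concrete, here is a minimal NumPy sketch (illustrative only, not code from the DeepSeek-R1 paper): an output's advantage is simply its reward standardized against the other outputs sampled for the same query, so the group itself acts as the baseline that a critic would otherwise provide.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group; no value network needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four outputs sampled for the same query, scored by a reward model.
print(group_relative_advantages([1.0, 0.5, 0.2, -0.3]))
# roughly [ 1.38  0.32 -0.32 -1.38]
```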


How GRPO Works

  1. Group Sampling:

    • For each query $q$, the model generates a group of outputs $\{o_1, o_2, \ldots, o_G\}$ by sampling from the old policy $\pi_{\theta_\text{old}}$, i.e., the policy as it stood before the current update step (where $\theta$ denotes the model parameters).
  2. Reward Assignment:

    • Each output $o_i$ is evaluated by the reward model, which assigns a reward $r_i$ based on task-specific criteria (e.g., accuracy, readability, language consistency).
  3. Advantage Calculation:

    • The advantage $A_i$ for each output is computed relative to the other outputs in the group: $$A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}$$
    • This ensures that rewards are normalized within the group, making training stable and robust.
  4. Policy Update:

    • The policy model is updated to increase the likelihood of generating outputs with higher advantages. Ignoring the clipping and KL terms, the GRPO objective is: $$J_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_\text{old}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)}\, A_i \right]$$
    • As in PPO, the policy ratio is clipped to keep updates stable: $$\text{clip}\!\left(\frac{\pi_{\theta}(o_i|q)}{\pi_{\theta_\text{old}}(o_i|q)},\ 1-\epsilon,\ 1+\epsilon\right)$$ The objective takes the minimum of the clipped and unclipped ratio terms, each multiplied by $A_i$.
  5. KL Regularization:

    • An additional KL-divergence term penalizes deviation of the updated policy from the reference policy, maintaining training stability (see the code sketch after this list).
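
Putting steps 3–5 together, the following is a minimal PyTorch sketch of the per-group loss. It is illustrative only, not code from the DeepSeek-R1 release: the function and argument names are hypothetical, per-output log-probabilities are assumed to be already summed over tokens, and the hyperparameter values are placeholders.

```python
import torch

def grpo_loss(logp_new,    # log pi_theta(o_i|q) per output, requires grad, shape (G,)
              logp_old,    # log pi_theta_old(o_i|q), detached, shape (G,)
              logp_ref,    # log pi_ref(o_i|q), detached, shape (G,)
              rewards,     # scalar rewards r_i from the reward model, shape (G,)
              clip_eps=0.2, kl_coef=0.04):
    # Step 3: group-relative advantages -- normalize within the group, no critic.
    adv = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-8)

    # Step 4: clipped importance-ratio surrogate, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # Step 5: KL penalty toward the reference policy
    # (estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1).
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0

    # Maximize the surrogate minus the KL penalty, so minimize the negative.
    return -(surrogate - kl_coef * kl).mean()
```

In a full training loop this loss would be averaged over many queries per batch, each query contributing its own group of $G$ sampled outputs.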

Why GRPO is Used in This Paper

The DeepSeek-R1 paper adopts GRPO for the following reasons:

1. Computational Efficiency

  • Traditional RL methods like PPO require a separate critic model, which is typically as large as the policy model. Training such a critic alongside the policy model significantly increases computational cost.
  • GRPO eliminates the need for a critic by using group-based relative scoring, reducing the overall training overhead.

2. Stability in Training

  • GRPO’s group-based advantage estimation normalizes rewards within each batch, mitigating issues like reward scale variability. This normalization ensures smooth and stable updates to the policy model.

3. Scalability

  • GRPO is well-suited for large-scale RL training, where generating multiple outputs for a query (group sampling) is computationally feasible. The group-based approach scales effectively with large models like DeepSeek-R1.

4. Improved Generalization

  • By comparing outputs within groups, GRPO inherently prioritizes the relative quality of outputs rather than absolute reward values. This improves the model's ability to generalize across diverse tasks.

5. Alignment with Reward Model

  • The rule-based reward model in this paper (focused on accuracy, readability, and language consistency) works well with GRPO, as the method ensures that outputs with higher task-specific quality are favored without requiring a complex reward prediction network.

Concrete Example of GRPO in Action

Task: Solve $x^2 - 4x + 3 = 0$.

  1. Group Sampling: The policy model generates a group of outputs:

    • $o_1$: "Factor: $(x-3)(x-1)=0$, so $x=3$ or $x=1$."
    • $o_2$: "Solve by completing the square: $x = 2 \pm 1$."
    • $o_3$: "Factor: incorrect factorization."
    • $o_4$: "Direct answer: $x = 5$."
  2. Reward Assignment: The reward model evaluates the outputs:

    • $r_1 = 10$ (correct and clear reasoning).
    • $r_2 = 8$ (correct but less clear).
    • $r_3 = 3$ (incorrect reasoning).
    • $r_4 = 1$ (incorrect answer, no reasoning).
  3. Advantage Calculation: Normalize rewards within the group (mean $= 5.5$, std $\approx 3.64$; the arithmetic is checked in the snippet after this list): $$A_1 = \frac{10 - 5.5}{3.64} \approx 1.24, \quad A_2 = \frac{8 - 5.5}{3.64} \approx 0.69, \quad A_3 = \frac{3 - 5.5}{3.64} \approx -0.69, \quad A_4 = \frac{1 - 5.5}{3.64} \approx -1.24.$$

  4. Policy Update: Update the policy model to increase the likelihood of generating outputs like $o_1$ and $o_2$ while reducing the probability of $o_3$ and $o_4$.
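
For a quick check of these numbers, this plain-Python snippet (illustrative, using the population standard deviation) reproduces the mean, standard deviation, and advantages above:

```python
rewards = [10, 8, 3, 1]
mean = sum(rewards) / len(rewards)                                   # 5.5
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # ~3.64
advantages = [round((r - mean) / std, 2) for r in rewards]
print(mean, round(std, 2), advantages)  # 5.5 3.64 [1.24, 0.69, -0.69, -1.24]
```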


Key Advantages of GRPO

  • Cost-Effective: Reduces computational overhead by eliminating the critic model.
  • Stable Updates: Normalizes rewards within groups, ensuring robust training.
  • Practical for Large Models: Efficiently trains large-scale language models like DeepSeek-R1, which generate multiple outputs per query.

In summary, GRPO is an optimization technique tailored for efficient and scalable reinforcement learning, making it an ideal choice for the large-scale training requirements of DeepSeek-R1.