# GRPO: Objective Function
## Mathematical Formulation of GRPO (Group Relative Policy Optimization)
The Group Relative Policy Optimization (GRPO) algorithm optimizes the policy model by maximizing the following objective function:
### GRPO Objective Function

$$
J_{\text{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[
\frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i,\;
\text{clip} \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1 - \epsilon,\, 1 + \epsilon \right) A_i \right)
- \beta\, D_{KL} \left( \pi_{\theta} \,\|\, \pi_{\text{ref}} \right) \right]
$$
## Breaking Down Each Term
### 1. Policy Probability Ratio

$$
\frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}
$$
- Pronounced as: "pi theta of o sub i given q over pi theta old of o sub i given q"
- Meaning: This is the ratio of the new policy's probability of generating output $o_i$ given question $q$ to the old policy's probability of the same output.
- Why is it here? It measures how much the policy has changed from the old version. If this ratio is too large, the policy update might be too aggressive.
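To make this concrete, the ratio is usually computed from log-probabilities rather than raw probabilities. The snippet below is a minimal, hypothetical sketch (the function name and the use of sequence-level log-probabilities are assumptions, not something specified by the formula):

```python
import torch

def policy_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Ratio pi_theta(o_i|q) / pi_theta_old(o_i|q) for each sampled output.

    Inputs are the summed log-probabilities of each output under the current
    and old policies. Working in log space avoids underflow from multiplying
    many small per-token probabilities.
    """
    return torch.exp(logp_new - logp_old)
```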
### 2. Clipping Term (Trust Region)

$$
\text{clip} \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)},\, 1 - \epsilon,\, 1 + \epsilon \right)
$$
- Pronounced as: "clip the policy ratio between one minus epsilon and one plus epsilon"
- Meaning: Ensures the policy does not change too drastically by limiting the update magnitude.
- Why is it here? Prevents overly large updates, which can destabilize training. This is borrowed from PPO, which also uses a clipping mechanism to maintain stable policy updates.
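A minimal sketch of this clipped term, assuming the ratio and the advantage are already available as tensors; `epsilon = 0.2` is a common PPO-style default, not a value prescribed by the formula:

```python
import torch

def clipped_surrogate(ratio: torch.Tensor,
                      advantage: torch.Tensor,
                      epsilon: float = 0.2) -> torch.Tensor:
    """PPO-style clipped term: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the minimum gives no extra credit for pushing the ratio
    # outside [1 - eps, 1 + eps], which keeps updates conservative.
    return torch.min(unclipped, clipped)
```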
### 3. Advantage Estimation

$$
A_i = \frac{r_i - \text{mean}(\{r_1, r_2, \ldots, r_G\})}{\text{std}(\{r_1, r_2, \ldots, r_G\})}
$$
- Pronounced as: "advantage sub i equals reward sub i minus mean reward over standard deviation of rewards"
- Meaning: This is a group-relative advantage. Instead of using a learned critic, GRPO normalizes the rewards within a group.
- Why is it here? It compares how good an output is relative to others in the group, rather than relying on an absolute value function. This avoids training a separate critic network.
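A minimal sketch of this normalization for one group of $G$ rewards; the small `eps` in the denominator is an added safeguard against identical rewards, not part of the formula:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """Normalize the G rewards for one question into advantages A_i.

    rewards: shape (G,), one scalar reward per sampled output o_i.
    """
    # eps guards against division by zero when every output in the
    # group receives the same reward.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```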
### 4. KL Divergence Regularization

$$
D_{KL} \left( \pi_{\theta} \,\|\, \pi_{\text{ref}} \right)
$$
- Pronounced as: "Kullback-Leibler divergence between pi theta and pi reference"
- Meaning: Measures how different the updated policy is from a reference policy.
- Why is it here? This acts as a regularization term, ensuring that the policy does not deviate too far from an existing reference (which might be a supervised fine-tuned model or a pre-trained policy).
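One common low-variance way to estimate this KL per sampled output is $r - \log r - 1$ with $r = \pi_{\text{ref}}/\pi_{\theta}$; the sketch below assumes that estimator and sequence-level log-probabilities, which are implementation choices rather than part of the objective above:

```python
import torch

def kl_to_reference(logp_theta: torch.Tensor,
                    logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-sample estimate of D_KL(pi_theta || pi_ref).

    Uses the estimator r - log(r) - 1 with r = pi_ref / pi_theta,
    which is non-negative by construction.
    """
    log_r = logp_ref - logp_theta          # log(pi_ref / pi_theta)
    return torch.exp(log_r) - log_r - 1.0
```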
### 5. Expectation Over Training Data

$$
\mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O \mid q)}
$$
- Pronounced as: "expectation over q sampled from P of Q and outputs o sub i sampled from old policy pi theta old given q"
- Meaning: The objective is averaged over all training samples $q$ (i.e., prompts/questions) and the corresponding group of sampled outputs.
- Why is it here? Ensures that the optimization is based on a diverse set of inputs, preventing overfitting to specific cases.
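Putting the pieces together, a per-question version of the (negated) objective might look like the sketch below. The hyperparameter values, sequence-level log-probabilities, and function name are illustrative assumptions; a real trainer would average this over many sampled questions to approximate the expectation:

```python
import torch

def grpo_loss_for_question(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           logp_ref: torch.Tensor,
                           rewards: torch.Tensor,
                           epsilon: float = 0.2,
                           beta: float = 0.04) -> torch.Tensor:
    """Negative GRPO objective for the G outputs sampled for one question q.

    Each tensor has shape (G,): summed log-probabilities of each sampled
    output under the current, old, and reference policies, plus its reward.
    logp_old, logp_ref, and rewards are treated as constants (no gradient).
    """
    # Group-relative advantage: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-weighted surrogate, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages)

    # KL penalty toward the reference policy (r - log r - 1 estimator).
    log_r = logp_ref - logp_new
    kl = torch.exp(log_r) - log_r - 1.0

    # Average over the group; negate because optimizers minimize.
    return -(surrogate - beta * kl).mean()
```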
## Key Insights on GRPO's Design
- **No Critic Needed** 🏆
  - Unlike PPO, which requires a separate critic model to estimate advantages, GRPO normalizes rewards within a group to compute relative advantages.
  - This reduces computation and avoids potential bias in critic learning.
- **Group-Based Learning** 👥
  - Instead of relying on a single sampled trajectory, GRPO compares multiple outputs per question.
  - This stabilizes learning by ensuring updates are based on relative performance, not absolute rewards.
- **Prevents Large Policy Shifts** ⚠️
  - The clipping mechanism (similar to PPO) ensures the model does not make drastic changes that could destabilize training.
- **Encourages Diversity While Regularizing** 🔄
  - The KL-divergence term ensures that the updated policy remains close to a reference model, preventing over-exploration.
## Comparison to PPO

| Feature | PPO (Proximal Policy Optimization) | GRPO (Group Relative Policy Optimization) |
|---|---|---|
| Advantage Estimation | Uses a critic model (value function). | Uses group-relative advantages; no critic. |
| Policy Stability | Uses clipping to limit large updates. | Also uses clipping, applied to each output in the sampled group. |
| Computation | Higher (requires training a critic alongside the policy). | Lower (no critic; only group-wise reward normalization). |
| Policy Update | Constrains updates relative to the old policy. | Also constrains updates relative to the old policy, with advantages computed within a group of outputs. |
| Best Suited For | General RL tasks (e.g., robotics, games). | LLM training and reasoning-focused RL. |
## Final Thoughts
- GRPO is a more efficient and scalable alternative to PPO for language model reinforcement learning, particularly in reasoning tasks.
- By avoiding a critic model and using group-relative comparisons, GRPO improves training efficiency while maintaining strong performance.
- The KL-divergence term keeps policy updates close to the reference model, so the policy can improve its reasoning reward without drifting into degenerate or incoherent outputs.
🚀 GRPO is well-suited for training LLMs like DeepSeek-R1-Zero, where reasoning quality is crucial!