DeepSeek R1 Zero
Reinforcement Learning (RL) plays a critical role in improving the reasoning capabilities of DeepSeek-R1. The process is designed to incentivize reasoning behaviors such as generating long chains of thought (CoT), self-verification, and reflection. Here's a step-by-step breakdown with examples:
- Base Model: Training starts with DeepSeek-V3-Base, a pre-trained language model that serves as the foundation.
- Initial Training: For DeepSeek-R1-Zero, RL is applied directly to this base model. For DeepSeek-R1, a small amount of curated cold-start data is used first to stabilize training and improve output readability.
Models Involved in the RL Process
Policy Model:
- This is the primary model undergoing training during RL.
- It starts as the DeepSeek-V3-Base model (for DeepSeek-R1-Zero) or a fine-tuned checkpoint (for DeepSeek-R1 after cold-start data).
- The policy model generates candidate outputs for each input query during training.
- Weights of the policy model are adjusted during the RL process.
Reward Model:
- This model evaluates the outputs generated by the policy model and assigns rewards based on specific criteria (e.g., accuracy, language consistency, format adherence).
- For DeepSeek-R1-Zero, rewards are rule-based rather than produced by a neural reward model, which avoids issues such as reward hacking.
- The reward model’s weights are fixed during the RL process.
Reference Model (optional):
- Used for calculating regularization terms, such as KL divergence, to ensure that the policy model does not deviate excessively from a stable baseline.
- The reference model typically starts as a copy of the policy model before RL begins.
- The reference model’s weights are fixed during the RL process.
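To make the reference model's role concrete, below is a minimal sketch of a per-token KL penalty, assuming PyTorch tensors holding the log-probabilities of the sampled tokens under each model (the function name and signature are illustrative, not DeepSeek's code):

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(policy || reference), used as a regularization signal.

    Both tensors hold log-probabilities of the sampled tokens: one evaluated under
    the trainable policy model, the other under the frozen reference model.
    Uses the non-negative estimator exp(x) - x - 1 with x = log(ref / policy).
    """
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1  # always >= 0, low variance
```

In GRPO, a term of this kind enters the objective scaled by a small coefficient, so reward maximization cannot push the policy arbitrarily far from the reference.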
Rule-based reward signals guide the RL process. DeepSeek-R1-Zero uses two types, accuracy and format rewards; a language consistency reward is added later for DeepSeek-R1:
- Accuracy Rewards:
  - The model is rewarded for producing correct answers.
  - Example: For a math problem like 2x + 3 = 7, the model is rewarded if it outputs:
    <think> To solve for x: 2x = 7 - 3 = 4, x = 4/2 = 2. </think> Answer: 2
  - This ensures reasoning accuracy.
- Format Rewards:
  - The model is rewarded for presenting reasoning in a clear format.
  - Example: Responses are rewarded if they adhere to a structure like:
    <think> Reasoning steps here... </think> Answer: Final result here.
  - This prevents chaotic outputs and enforces readability.
- Language Consistency Rewards:
  - Encourages the model to avoid mixing languages (e.g., switching between English and Chinese).
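To illustrate how such rule-based scoring might be composed, here is a minimal sketch; the weights, the regex, the exact-match check, and the CJK test are assumptions made for this example, not DeepSeek's actual reward implementation:

```python
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*Answer:", re.DOTALL)

def rule_based_reward(output: str, reference_answer: str) -> float:
    """Illustrative combination of accuracy, format, and language-consistency rewards."""
    reward = 0.0

    # Accuracy: compare the text after "Answer:" with the reference answer.
    if "Answer:" in output:
        predicted = output.rsplit("Answer:", 1)[1].strip().rstrip(".")
        if predicted == reference_answer.strip():
            reward += 1.0

    # Format: reasoning must appear inside <think>...</think> before the answer.
    if THINK_ANSWER.search(output):
        reward += 0.5

    # Language consistency: crude check that an English-language response
    # contains no CJK characters (i.e., no language mixing).
    if not re.search(r"[\u4e00-\u9fff]", output):
        reward += 0.25

    return reward
```

The paper notes that math answers are required in a fixed format (e.g., inside a box) so they can be checked reliably, and that code answers are verified by compiling them and running predefined test cases.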
- Algorithm: Group Relative Policy Optimization (GRPO) is used for RL. It avoids the computational cost of a critic model and uses group-based advantage estimation.
- Optimization Objective:
  - For each question q, a group of outputs o_1, o_2, ..., o_G is sampled from the current policy.
  - A reward r_i is computed for each output based on accuracy, format, and consistency, and the rewards are normalized within the group to give advantages A_i = (r_i - mean(r)) / std(r).
  - The policy is updated to increase the likelihood of outputs with above-average rewards (see the GRPO sketch after the worked example below).
- The model generates multiple outputs for a query and evaluates them using rule-based rewards.
- Example: For the question "Solve: x^2 - 4x + 3 = 0", the model might generate:
  - Output 1 (high reward: correct and clear reasoning):
    <think> Solve by factoring: (x - 3)(x - 1) = 0, so x = 3 or x = 1. </think> Answer: 3, 1.
  - Output 2 (low reward: incorrect reasoning):
    <think> Factoring x^2 - 4x + 3: Incorrect method applied... </think> Answer: Error.
- Outputs with higher rewards influence the model's policy to generate better reasoning steps in the future.
- Over time, the model learns to produce correct and well-structured reasoning.
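The group-relative advantage estimation mentioned above can be sketched in a few lines; this simplified version omits the clipped importance ratio and KL term of the full GRPO objective, and the tensor names are illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rule-based rewards within a group of G outputs for one question.

    rewards: shape (G,). Returns per-output advantages; no learned critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example with the two outputs above: Output 1 correct (1.0), Output 2 wrong (0.0).
advantages = grpo_advantages(torch.tensor([1.0, 0.0]))
# A policy-gradient step then up-weights the tokens of Output 1 and down-weights
# those of Output 2, e.g. loss = -(advantages.unsqueeze(-1) * token_logprobs).mean()
```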
- As training progresses, the model autonomously develops advanced reasoning behaviors such as:
  - Reflection: Revisiting and correcting its own steps. Example:
    <think> Let’s reevaluate. Initially, I factored incorrectly. Correct factorization is (x - 3)(x - 1)... </think> Answer: 3, 1.
  - Verification: Double-checking intermediate results before providing a final answer.
- Once the RL process nears convergence, the model generates multiple outputs for each query.
- Outputs are filtered through rejection sampling to collect high-quality reasoning data.
- These refined samples are used for supervised fine-tuning (SFT) in later stages.
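A rough sketch of that filtering step, assuming a policy object with a generate method and a scoring function over (prompt, response) pairs (both hypothetical names), could look like this:

```python
def rejection_sample(prompts, policy, score_fn, num_samples=16, threshold=0.9):
    """Collect high-quality reasoning samples for later SFT (illustrative sketch)."""
    sft_data = []
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(num_samples)]
        best_score, best = max(((score_fn(prompt, c), c) for c in candidates),
                               key=lambda pair: pair[0])
        if best_score >= threshold:  # keep only prompts with at least one good sample
            sft_data.append({"prompt": prompt, "response": best})
    return sft_data
```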
To enhance generality, DeepSeek-R1 is trained on diverse prompts, including:
- Math: Solving equations, proofs.
- Logic: Reasoning through puzzles.
- Code: Writing and debugging programs.
Example: For the prompt "Write a Python function to check if a number is prime.", the model outputs:

<think> To check if a number is prime, we need to test divisibility from 2 to sqrt(n). </think>

```python
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

Answer: Function completed.
Rewards are assigned based on:
- Correctness of the function.
- Inclusion of reasoning steps (e.g., "test divisibility from 2 to sqrt(n)").
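For this code prompt, the correctness part of the reward could be computed by executing the generated function against a handful of test cases, roughly as sketched below (the test set and the scoring scheme are illustrative assumptions):

```python
def code_correctness_reward(namespace: dict) -> float:
    """Score a generated is_prime implementation by running simple test cases."""
    tests = {1: False, 2: True, 9: False, 17: True, 25: False}
    fn = namespace.get("is_prime")
    if fn is None:
        return 0.0
    try:
        passed = sum(fn(n) == expected for n, expected in tests.items())
    except Exception:
        return 0.0
    return passed / len(tests)

# Usage: exec() the generated code into a fresh namespace, then score it.
# ns = {}; exec(generated_code, ns); reward = code_correctness_reward(ns)
```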
- The model is tested on reasoning benchmarks (e.g., AIME, MATH-500) during RL training.
- Adjustments are made to the reward function to address issues like:
- Over-reliance on certain reasoning patterns.
- Language mixing in multi-lingual scenarios.
- The RL process significantly boosts the model's reasoning capabilities, achieving performance comparable to OpenAI-o1-1217.
- Emergent behaviors like self-reflection and long CoT are direct results of incentivizing reasoning through RL.
This step-by-step RL framework demonstrates how DeepSeek-R1's reasoning capabilities evolve through iterative reinforcement learning.