DeepSeek R1 Zero
Reinforcement Learning (RL) plays a critical role in improving the reasoning capabilities of DeepSeek-R1. The process is designed to incentivize reasoning behaviors such as generating long chains of thought (CoT), self-verification, and reflection. Here's a step-by-step breakdown with examples:
- Base Model: Training starts with DeepSeek-V3-Base, a pre-trained language model that serves as the foundation.
- Initial Training: For DeepSeek-R1-Zero, RL is applied directly to this base model. For DeepSeek-R1, a small amount of curated cold-start data is used first to stabilize training and improve output readability.
Models Involved in the RL Process
Policy Model:
- This is the primary model undergoing training during RL.
- It starts as the DeepSeek-V3-Base model (for DeepSeek-R1-Zero) or a fine-tuned checkpoint (for DeepSeek-R1 after cold-start data).
- The policy model generates candidate outputs for each input query during training.
- Weights of the policy model are adjusted during the RL process.
Reward Model:
- This model evaluates the outputs generated by the policy model and assigns rewards based on specific criteria (e.g., accuracy, language consistency, format adherence).
- For DeepSeek-R1-Zero, rewards are rule-based rather than produced by a neural reward model, which avoids issues such as reward hacking.
- The reward model’s weights are fixed during the RL process.
Reference Model (optional):
- Used for calculating regularization terms, such as KL divergence, to ensure that the policy model does not deviate excessively from a stable baseline.
- The reference model typically starts as a copy of the policy model before RL begins.
- The reference model’s weights are fixed during the RL process.
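To make the reference model's role concrete, below is a minimal sketch of a per-token KL penalty, assuming PyTorch tensors holding the log-probabilities of the sampled tokens under each model (the function name and signature are illustrative, not DeepSeek's code):

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(policy || reference), used as a regularization signal.

    Both tensors hold log-probabilities of the sampled tokens: one evaluated under
    the trainable policy model, the other under the frozen reference model.
    Uses the non-negative estimator exp(x) - x - 1 with x = log(ref / policy).
    """
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1  # always >= 0, low variance
```

In GRPO, a term of this kind enters the objective scaled by a small coefficient, so reward maximization cannot push the policy arbitrarily far from the reference.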
Rule-based reward signals guide the RL process. DeepSeek-R1-Zero uses two types, accuracy and format rewards; a language consistency reward is added later for DeepSeek-R1:
- Accuracy Rewards:
  - The model is rewarded for producing correct answers.
  - Example: For a math problem like 2x + 3 = 7, the model is rewarded if it outputs:
    <think> To solve for x: 2x = 7 - 3 = 4, x = 4/2 = 2. </think> Answer: 2
  - This ensures reasoning accuracy.
- Format Rewards:
  - The model is rewarded for presenting reasoning in a clear format.
  - Example: Responses are rewarded if they adhere to a structure like:
    <think> Reasoning steps here... </think> Answer: Final result here.
  - This prevents chaotic outputs and enforces readability.
- Language Consistency Rewards:
  - Encourages the model to avoid mixing languages (e.g., switching between English and Chinese).
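To illustrate how such rule-based scoring might be composed, here is a minimal sketch; the weights, the regex, the exact-match check, and the CJK test are assumptions made for this example, not DeepSeek's actual reward implementation:

```python
import re

THINK_ANSWER = re.compile(r"<think>.*?</think>\s*Answer:", re.DOTALL)

def rule_based_reward(output: str, reference_answer: str) -> float:
    """Illustrative combination of accuracy, format, and language-consistency rewards."""
    reward = 0.0

    # Accuracy: compare the text after "Answer:" with the reference answer.
    if "Answer:" in output:
        predicted = output.rsplit("Answer:", 1)[1].strip().rstrip(".")
        if predicted == reference_answer.strip():
            reward += 1.0

    # Format: reasoning must appear inside <think>...</think> before the answer.
    if THINK_ANSWER.search(output):
        reward += 0.5

    # Language consistency: crude check that an English-language response
    # contains no CJK characters (i.e., no language mixing).
    if not re.search(r"[\u4e00-\u9fff]", output):
        reward += 0.25

    return reward
```

The paper notes that math answers are required in a fixed format (e.g., inside a box) so they can be checked reliably, and that code answers are verified by compiling them and running predefined test cases.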
- Algorithm: Group Relative Policy Optimization (GRPO) is used for RL. It avoids the computational cost of a critic model and uses group-based advantage estimation.
- Optimization Objective:
  - For each question q, a group of outputs o_1, o_2, ..., o_G is sampled from the current policy.
  - A reward r_i is computed for each output based on accuracy, format, and consistency, and the rewards are normalized within the group to give advantages A_i = (r_i - mean(r)) / std(r).
  - The policy is updated to increase the likelihood of outputs with above-average rewards (see the GRPO sketch after the worked example below).
- The model generates multiple outputs for a query and evaluates them using rule-based rewards.
- Example: For the question "Solve: x^2 - 4x + 3 = 0", the model might generate:
  - Output 1 (high reward: correct and clear reasoning):
    <think> Solve by factoring: (x - 3)(x - 1) = 0, so x = 3 or x = 1. </think> Answer: 3, 1.
  - Output 2 (low reward: incorrect reasoning):
    <think> Factoring x^2 - 4x + 3: Incorrect method applied... </think> Answer: Error.
- Outputs with higher rewards influence the model's policy to generate better reasoning steps in the future.
- Over time, the model learns to produce correct and well-structured reasoning.
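The group-relative advantage estimation mentioned above can be sketched in a few lines; this simplified version omits the clipped importance ratio and KL term of the full GRPO objective, and the tensor names are illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rule-based rewards within a group of G outputs for one question.

    rewards: shape (G,). Returns per-output advantages; no learned critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example with the two outputs above: Output 1 correct (1.0), Output 2 wrong (0.0).
advantages = grpo_advantages(torch.tensor([1.0, 0.0]))
# A policy-gradient step then up-weights the tokens of Output 1 and down-weights
# those of Output 2, e.g. loss = -(advantages.unsqueeze(-1) * token_logprobs).mean()
```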
- As training progresses, the model autonomously develops advanced reasoning behaviors such as:
  - Reflection: Revisiting and correcting its own steps. Example:
    <think> Let’s reevaluate. Initially, I factored incorrectly. Correct factorization is (x - 3)(x - 1)... </think> Answer: 3, 1.
  - Verification: Double-checking intermediate results before providing a final answer.
- Once the RL process nears convergence, the model generates multiple outputs for each query.
- Outputs are filtered through rejection sampling to collect high-quality reasoning data.
- These refined samples are used for supervised fine-tuning (SFT) in later stages.
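A rough sketch of that filtering step, assuming a policy object with a generate method and a scoring function over (prompt, response) pairs (both hypothetical names), could look like this:

```python
def rejection_sample(prompts, policy, score_fn, num_samples=16, threshold=0.9):
    """Collect high-quality reasoning samples for later SFT (illustrative sketch)."""
    sft_data = []
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(num_samples)]
        best_score, best = max(((score_fn(prompt, c), c) for c in candidates),
                               key=lambda pair: pair[0])
        if best_score >= threshold:  # keep only prompts with at least one good sample
            sft_data.append({"prompt": prompt, "response": best})
    return sft_data
```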
To enhance generality, DeepSeek-R1 is trained on diverse prompts, including:
- Math: Solving equations, proofs.
- Logic: Reasoning through puzzles.
- Code: Writing and debugging programs.
Example: For the prompt "Write a Python function to check if a number is prime.", the model outputs:

<think> To check if a number is prime, we need to test divisibility from 2 to sqrt(n). </think>

```python
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
```

Answer: Function completed.
Rewards are assigned based on:
- Correctness of the function.
- Inclusion of reasoning steps (e.g., "test divisibility from 2 to sqrt(n)").
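For this code prompt, the correctness part of the reward could be computed by executing the generated function against a handful of test cases, roughly as sketched below (the test set and the scoring scheme are illustrative assumptions):

```python
def code_correctness_reward(namespace: dict) -> float:
    """Score a generated is_prime implementation by running simple test cases."""
    tests = {1: False, 2: True, 9: False, 17: True, 25: False}
    fn = namespace.get("is_prime")
    if fn is None:
        return 0.0
    try:
        passed = sum(fn(n) == expected for n, expected in tests.items())
    except Exception:
        return 0.0
    return passed / len(tests)

# Usage: exec() the generated code into a fresh namespace, then score it.
# ns = {}; exec(generated_code, ns); reward = code_correctness_reward(ns)
```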
- The model is tested on reasoning benchmarks (e.g., AIME, MATH-500) during RL training.
- Adjustments are made to the reward function to address issues like:
- Over-reliance on certain reasoning patterns.
- Language mixing in multi-lingual scenarios.
- The RL process significantly boosts the model's reasoning capabilities, achieving performance comparable to OpenAI-o1-1217.
- Emergent behaviors like self-reflection and long CoT are direct results of incentivizing reasoning through RL.
This step-by-step RL framework demonstrates how DeepSeek-R1's reasoning capabilities evolve through iterative reinforcement learning.