DeepSeek R1 Zero

DeepSeek-R1

Reinforcement Learning (RL) plays a critical role in improving the reasoning capabilities of DeepSeek-R1. The process is designed to incentivize reasoning behaviors such as generating long chains of thought (CoT), self-verification, and reflection. Here's a step-by-step breakdown with examples:


1. Base Model Setup

  • Base Model: Training starts with DeepSeek-V3-Base, a pre-trained language model that serves as the foundation.
  • Initial Training: For DeepSeek-R1-Zero, RL is applied directly to this base model. For DeepSeek-R1, a small amount of curated cold-start data is used first to stabilize training and improve output readability.

Models Involved in the RL Process

Policy Model:

  • This is the primary model undergoing training during RL.
  • It starts as the DeepSeek-V3-Base model (for DeepSeek-R1-Zero) or a fine-tuned checkpoint (for DeepSeek-R1 after cold-start data).
  • The policy model generates candidate outputs for each input query during training.
  • Weights of the policy model are adjusted during the RL process.

Reward Model:

  • This model evaluates the outputs generated by the policy model and assigns rewards based on specific criteria (e.g., accuracy, language consistency, format adherence).
  • For DeepSeek-R1-Zero, the rewards are rule-based rather than produced by a learned neural reward model, which avoids complications such as reward hacking.
  • The reward model’s weights are fixed during the RL process.

Reference Model (optional):

  • Used for calculating regularization terms, such as a KL-divergence penalty, to ensure that the policy model does not deviate excessively from a stable baseline (a minimal estimator sketch follows this list).
  • The reference model typically starts as a copy of the policy model before RL begins.
  • The reference model’s weights are fixed during the RL process.
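
The sketch below shows how such a KL term can be estimated from sampled tokens, using PyTorch. The estimator (ratio − log ratio − 1, with the ratio taken between reference and policy probabilities) is the low-variance form commonly used in GRPO-style training; the function name and example values are illustrative, not taken from DeepSeek's code.

```python
import torch

def kl_penalty(policy_logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(policy || reference) from sampled tokens:
    exp(ref - policy) - (ref - policy) - 1. Always non-negative, and it grows
    as the trainable policy drifts away from the frozen reference model."""
    log_ratio = ref_logprobs - policy_logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0

# Log-probabilities of the same sampled tokens under both models (illustrative values).
policy_lp = torch.tensor([-0.5, -1.2, -0.9])
ref_lp = torch.tensor([-0.6, -1.0, -0.9])
print(kl_penalty(policy_lp, ref_lp).mean())  # scalar penalty added to the RL loss
```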

2. Reinforcement Learning Setup

(a) Reward Signal Design

Three types of reward signals are designed to guide the RL process (a minimal rule-based sketch follows the list):

  1. Accuracy Rewards:

    • The model is rewarded for producing correct answers.
    • Example: For a math problem like 2x + 3 = 7, the model is rewarded if it outputs:
      <think> To solve for x: 2x = 7 - 3 = 4, x = 4/2 = 2. </think>
      Answer: 2
      
    • This ensures reasoning accuracy.
  2. Format Rewards:

    • The model is rewarded for presenting reasoning in a clear format.
    • Example: Responses are rewarded if they adhere to a structure like:
      <think> Reasoning steps here... </think>
      Answer: Final result here.
      
    • This prevents chaotic outputs and enforces readability.
  3. Language Consistency Rewards:

    • Encourages the model to avoid mixing languages within a single response (e.g., switching between English and Chinese). This reward is added during DeepSeek-R1's RL stage, after language mixing was observed in DeepSeek-R1-Zero's outputs.
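
To make this concrete, here is a minimal rule-based reward sketch in Python. The regex-based format check, the exact-match answer comparison, and the equal weighting of the two terms are illustrative assumptions rather than the paper's exact rules.

```python
import re

def format_reward(output: str) -> float:
    """Reward 1.0 if the output wraps its reasoning in <think>...</think>
    and ends with an explicit 'Answer:' line, else 0.0."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_answer = re.search(r"Answer:\s*\S+", output) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """Reward 1.0 if the extracted final answer matches the known ground truth."""
    match = re.search(r"Answer:\s*(.+)", output)
    if match is None:
        return 0.0
    answer = match.group(1).strip().rstrip(".")
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Simple sum; the real weighting between reward terms is not specified here.
    return accuracy_reward(output, ground_truth) + format_reward(output)

# Example: the math problem 2x + 3 = 7 with ground-truth answer "2"
sample = "<think> 2x = 7 - 3 = 4, so x = 4/2 = 2. </think>\nAnswer: 2"
print(total_reward(sample, "2"))  # -> 2.0
```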

(b) Policy Optimization with GRPO

  • Algorithm: Group Relative Policy Optimization (GRPO) is used for RL. It avoids the computational cost of a separate critic model by using group-based advantage estimation.
  • Optimization Objective:
    • For each question q, a group of outputs o_1, o_2, ..., o_G is sampled from the current policy.
    • A reward r_i is calculated for each output based on accuracy, format, and consistency.
    • Each reward is normalized against the group's mean and standard deviation to form a relative advantage, and the policy is updated to increase the likelihood of outputs with higher advantages (see the sketch below).
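
The sketch below shows only the group-relative advantage computation that gives GRPO its name (in PyTorch); the clipped policy-gradient objective and the KL penalty against the reference model are omitted, and the function name is illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize the rewards of one group of G outputs sampled for the same
    question: advantage_i = (r_i - mean(r)) / std(r). This group statistic
    replaces the learned value function (critic) that PPO would need."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# G = 4 outputs for one question, scored by rule-based rewards (illustrative values).
rewards = torch.tensor([2.0, 1.0, 0.0, 2.0])
print(group_relative_advantages(rewards))
# Outputs with above-average reward get positive advantages, so the policy update
# raises the probability of the token sequences that earned them.
```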

3. Training Steps

Step 1: Initial Exploration

  • The model generates multiple outputs for a query and evaluates them using rule-based rewards.
  • Example: For the question:
    Solve: x^2 - 4x + 3 = 0
    
    The model might generate:
    • Output 1:
      <think> Solve by factoring: (x - 3)(x - 1) = 0, so x = 3 or x = 1. </think> Answer: 3, 1.
      
      (High reward: Correct and clear reasoning)
    • Output 2:
      <think> Factoring x^2 - 4x + 3: Incorrect method applied... </think> Answer: Error.
      
      (Low reward: Incorrect reasoning)

Step 2: Policy Update

  • Outputs with higher rewards influence the model's policy to generate better reasoning steps in the future.
  • Over time, the model learns to produce correct and well-structured reasoning.

Step 3: Emergence of Complex Behaviors

  • As training progresses, the model autonomously develops advanced reasoning behaviors like:
    • Reflection: Revisiting and correcting its own steps.
      • Example:
        <think> Let’s reevaluate. Initially, I factored incorrectly. Correct factorization is (x - 3)(x - 1)... </think>
        Answer: 3, 1.
        
    • Verification: Double-checking intermediate results before providing a final answer.

4. Rejection Sampling

  • Once the RL process nears convergence, the model generates multiple outputs for each query.
  • Outputs are filtered through rejection sampling to collect high-quality reasoning data.
  • These refined samples are used for supervised fine-tuning (SFT) in later stages; a minimal filtering sketch follows this list.
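
A minimal sketch of this filtering step is shown below; the sample count, reward threshold, and the generate / reward_fn callables (standing in for the near-converged policy and the rule-based scorer) are illustrative assumptions.

```python
from typing import Callable, Dict, List

def rejection_sample(question: str,
                     generate: Callable[[str], str],
                     reward_fn: Callable[[str], float],
                     num_samples: int = 16,
                     min_reward: float = 2.0) -> List[Dict[str, str]]:
    """Sample several candidate responses from the near-converged RL policy,
    keep only those whose reward clears the threshold, and return them as
    (prompt, response) pairs for the later SFT stage."""
    kept = []
    for _ in range(num_samples):
        output = generate(question)
        if reward_fn(output) >= min_reward:
            kept.append({"prompt": question, "response": output})
    return kept
```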

5. Multi-Scenario Reinforcement Learning

To enhance generality, DeepSeek-R1 is trained on diverse prompts, including:

  • Math: Solving equations, proofs.
  • Logic: Reasoning through puzzles.
  • Code: Writing and debugging programs.

Example: For the prompt:

Write a Python function to check if a number is prime.

The model outputs:

<think> To check if a number is prime, we need to test divisibility from 2 to sqrt(n). </think>
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
Answer: Function completed.

Rewards are assigned based on:

  • Correctness of the function (e.g., checked by executing it against test cases, as sketched below).
  • Inclusion of reasoning steps (e.g., "test divisibility from 2 to sqrt(n)").
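
A minimal execution-based correctness check for the prime example might look like the sketch below; the test cases, the unsandboxed exec call, and the fractional scoring are illustrative assumptions, not the paper's actual grading harness.

```python
def code_reward(generated_source: str, test_cases) -> float:
    """Run the generated code in an isolated namespace, then score it by the
    fraction of test cases it passes. (A real pipeline would sandbox this.)"""
    env = {}
    try:
        exec(generated_source, env)        # defines is_prime inside env
        fn = env["is_prime"]
    except Exception:
        return 0.0                         # code that does not even load earns nothing
    passed = 0
    for arg, expected in test_cases:
        try:
            if fn(arg) == expected:
                passed += 1
        except Exception:
            pass                           # runtime errors count as failures
    return passed / len(test_cases)

tests = [(1, False), (2, True), (9, False), (17, True), (25, False)]
generated = '''
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
'''
print(code_reward(generated, tests))  # -> 1.0 for the output shown above
```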

6. Evaluation and Refinement

  • The model is tested on reasoning benchmarks (e.g., AIME, MATH-500) during RL training; a minimal accuracy-check sketch follows this list.
  • Adjustments are made to the reward function to address issues like:
    • Over-reliance on certain reasoning patterns.
    • Language mixing in multi-lingual scenarios.
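
A minimal sketch of the exact-match accuracy measurement such benchmark testing implies is shown below; the generate and extract_answer callables stand in for the model under training and an answer parser, and are illustrative.

```python
from typing import Callable, List, Tuple

def benchmark_accuracy(problems: List[Tuple[str, str]],
                       generate: Callable[[str], str],
                       extract_answer: Callable[[str], str]) -> float:
    """Greedy pass@1-style accuracy: one generation per problem, exact match of
    the extracted final answer against the reference answer."""
    correct = 0
    for question, reference in problems:
        prediction = extract_answer(generate(question))
        if prediction.strip() == reference.strip():
            correct += 1
    return correct / len(problems)
```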

7. Key Results

  • The RL process significantly boosts the model's reasoning capabilities, achieving performance comparable to OpenAI-o1-1217.
  • Emergent behaviors like self-reflection and long CoT are direct results of incentivizing reasoning through RL.

This step-by-step RL framework demonstrates how DeepSeek-R1's reasoning capabilities evolve through iterative reinforcement learning.
