DeepSeek R1: Reward Model

The reward model in the DeepSeek-R1 paper is a critical component of the reinforcement learning (RL) process. Its primary role is to evaluate the outputs generated by the policy model and assign a numerical reward, which guides the policy model's learning. Below is a detailed explanation of how the reward model works, including concrete examples.


Components of the Reward Model

  1. Rule-Based Reward System:

    • The reward model in DeepSeek-R1-Zero and DeepSeek-R1 is primarily rule-based, ensuring reliability and interpretability.
    • Rewards are calculated based on specific criteria, such as:
      • Accuracy Rewards: Evaluating whether the output is correct.
      • Format Rewards: Assessing the structure and readability of the output.
      • Language Consistency Rewards: Checking whether the output stays in a single language (the language of the query) rather than mixing languages.
  2. Reward Formula:

    • The total reward $R$ is the sum of the individual reward components: $R = R_{\text{accuracy}} + R_{\text{format}} + R_{\text{language-consistency}}$ (a code sketch of this combination follows below).
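A minimal sketch of how such a rule-based reward could be computed in code. The helper checks below (exact matching against an `Answer:` line, a `<think>` tag check, and an ASCII-ratio proxy for English consistency) are illustrative assumptions, not DeepSeek's published rules:

```python
import re

# Hypothetical rule-based reward sketch (illustrative; not DeepSeek's actual code).

THINK_TAGS = re.compile(r"<think>.*?</think>", re.DOTALL)

def accuracy_reward(response: str, expected_answer: str) -> float:
    """1.0 if an 'Answer:' line matches the expected answer, else 0.0."""
    match = re.search(r"Answer:\s*(.+)", response)
    if match and match.group(1).strip() == expected_answer.strip():
        return 1.0
    return 0.0

def format_reward(response: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if THINK_TAGS.search(response) else 0.0

def language_consistency_reward(response: str) -> float:
    """Proportion of alphabetic characters that are ASCII (a crude proxy for 'all English')."""
    letters = [c for c in response if c.isalpha()]
    if not letters:
        return 1.0
    return sum(c.isascii() for c in letters) / len(letters)

def total_reward(response: str, expected_answer: str) -> float:
    # Plain sum of the three components, matching the formula above.
    return (accuracy_reward(response, expected_answer)
            + format_reward(response)
            + language_consistency_reward(response))
```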

How the Reward Model Works (Step-by-Step)

  1. Input and Output:

    • Input: A query or task (e.g., a math problem, a coding challenge, or a logic puzzle).
    • Output: Candidate responses generated by the policy model.
  2. Evaluation Criteria:

    • The reward model evaluates each candidate response against the task requirements and assigns a reward.
  3. Reward Assignment:

    • For each response, the reward is calculated based on accuracy, adherence to formatting rules, and language consistency.
  4. Feedback to Policy Model:

    • The policy model uses the rewards to update its parameters via reinforcement learning, optimizing for higher rewards (a conceptual sketch of this loop follows below).
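A conceptual sketch of steps 1–4 as a training loop. The `policy.generate` / `policy.update` interface, `task` fields, and `reward_fn` are hypothetical placeholders; the paper's actual optimizer is GRPO, which is not reproduced here:

```python
# Conceptual sketch of the reward-driven RL loop (hypothetical interface;
# the actual DeepSeek-R1 training uses GRPO, not reproduced here).

def training_step(policy, tasks, reward_fn, num_samples=4):
    batch = []
    for task in tasks:
        # Steps 1-2: the policy model generates several candidate responses per task.
        candidates = [policy.generate(task.prompt) for _ in range(num_samples)]
        # Step 3: each candidate is scored by the rule-based reward model.
        rewards = [reward_fn(c, task.expected_answer) for c in candidates]
        batch.append((task.prompt, candidates, rewards))
    # Step 4: the policy updates its parameters to favor higher-reward responses.
    policy.update(batch)
```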

Concrete Examples

Example 1: Math Problem

  • Input:
    Solve for x: 2x + 3 = 7
    
  • Generated Outputs:
    1. Output A:

      <think> Subtract 3 from both sides: 2x = 4. Divide by 2: x = 2. </think>
      Answer: 2
      
      • Accuracy Reward: High (correct solution).
      • Format Reward: High (clear structure with <think> tags).
      • Language Consistency Reward: High (entirely in English).
      • Total Reward: 10 (hypothetical).
    2. Output B:

      <think> x = (7 - 3)/2 = 2 </think> x is two.
      
      • Accuracy Reward: High (correct solution).
      • Format Reward: Medium (no explicit reasoning steps).
      • Language Consistency Reward: High (entirely in English).
      • Total Reward: 7.
    3. Output C:

      <think> 2x + 3 = 7, so x = 5. </think>
      Answer: 5
      
      • Accuracy Reward: Low (incorrect solution).
      • Format Reward: High (clear structure).
      • Language Consistency Reward: High.
      • Total Reward: 3.
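Running Example 1's outputs through the hypothetical `total_reward` sketch from the Reward Formula section reproduces the same ordering on a 0–3 scale (the bullet values above use an illustrative 0–10 scale). Note that Output B states its correct answer as "x is two", which the naive `Answer:` matcher in the sketch misses; the paper avoids this ambiguity by requiring the final answer in a specified format so it can be verified by rules:

```python
# Uses total_reward() from the sketch in the "Reward Formula" section above.
outputs = {
    "A": "<think> Subtract 3 from both sides: 2x = 4. Divide by 2: x = 2. </think>\nAnswer: 2",
    "B": "<think> x = (7 - 3)/2 = 2 </think> x is two.",
    "C": "<think> 2x + 3 = 7, so x = 5. </think>\nAnswer: 5",
}
for name, text in outputs.items():
    print(name, total_reward(text, expected_answer="2"))
# A -> 3.0 (correct answer, <think> tags, single language)
# B -> 2.0 (correct value, but not in the "Answer: ..." form this naive matcher expects)
# C -> 2.0 (well formatted and consistent, but the extracted answer "5" is wrong)
```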

Example 2: Coding Problem

  • Input:
    Write a Python function to check if a number is prime.
    
  • Generated Outputs:
    1. Output A:

      <think> To check if a number is prime, test divisibility from 2 to sqrt(n). </think>
      def is_prime(n):
          if n < 2:
              return False
          for i in range(2, int(n**0.5) + 1):
              if n % i == 0:
                  return False
          return True
      Answer: Function completed.
      • Accuracy Reward: High (function works correctly).
      • Format Reward: High (clear reasoning with <think> and properly formatted code).
      • Language Consistency Reward: High (entirely in English, matching the prompt).
      • Total Reward: 10.
    2. Output B:

      def prime_check(n):
          if n == 2:
              return True
          for i in range(2, n):
              if n % i == 0:
                  return False
          return True
      • Accuracy Reward: Medium (trial division runs all the way to n, and inputs below 2 are incorrectly reported as prime).
      • Format Reward: Low (no reasoning provided).
      • Language Consistency Reward: High.
      • Total Reward: 5.
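For coding problems, the paper mentions compiler feedback based on predefined test cases. A minimal sketch of such an accuracy check, with an assumed test-case list and a deliberately simplified (non-sandboxed) harness:

```python
# Illustrative test-case harness for the coding example. Running arbitrary model
# output with exec() is unsafe outside a sandbox; this is a sketch only.

TEST_CASES = [(0, False), (1, False), (2, True), (3, True), (4, False), (97, True)]

def code_accuracy_reward(source_code: str) -> float:
    """Fraction of predefined test cases the submitted function passes."""
    namespace = {}
    try:
        exec(source_code, namespace)  # NOTE: never exec untrusted code without isolation
    except Exception:
        return 0.0
    fn = next((v for k, v in namespace.items()
               if callable(v) and not k.startswith("__")), None)
    if fn is None:
        return 0.0
    passed = 0
    for arg, expected in TEST_CASES:
        try:
            if fn(arg) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(TEST_CASES)

output_b = """
def prime_check(n):
    if n == 2:
        return True
    for i in range(2, n):
        if n % i == 0:
            return False
    return True
"""
print(code_accuracy_reward(output_b))  # 4/6: Output B fails the n = 0 and n = 1 cases
```

Output A would pass all six cases, while Output B fails the inputs below 2, which is what the "Medium" accuracy rating above reflects.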

Example 3: Language Mixing Issue

  • Input:
    Solve: x^2 - 4 = 0
    
  • Generated Outputs:
    1. Output A:

      <think> x^2 = 4, so x = ±2. </think>
      Answer: x = 2, -2.
      
      • Language Consistency Reward: High (consistent language usage).
      • Total Reward: High.
    2. Output B:

      <think> x^2 = 4, 所以 x = ±2. </think>
      答案: x = 2, -2.
      
      • Language Consistency Reward: Low (mix of English and Chinese).
      • Total Reward: Low.
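The paper describes the language consistency reward as the proportion of target-language words in the chain of thought. A sketch of that idea, where treating "English" as ASCII-only letters is a simplifying assumption:

```python
import re

# Sketch of a language-consistency score: the proportion of target-language words
# in the chain of thought. Treating "English" as ASCII-only letters is an assumption.

def language_consistency(response: str) -> float:
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    text = think.group(1) if think else response
    # Only consider tokens containing letters; digits and symbols are language-neutral.
    words = [w for w in text.split() if any(ch.isalpha() for ch in w)]
    if not words:
        return 1.0
    in_target = [w for w in words
                 if all(ch.isascii() for ch in w if ch.isalpha())]
    return len(in_target) / len(words)

# Output A: reasoning entirely in English -> 1.0
print(language_consistency("<think> x^2 = 4, so x = ±2. </think>\nAnswer: x = 2, -2."))
# Output B: mixed English/Chinese reasoning -> roughly 0.67
print(language_consistency("<think> x^2 = 4, 所以 x = ±2. </think>\n答案: x = 2, -2."))
```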

Key Features of the Reward Model

  1. Rule-Based Simplicity:
    • Ensures interpretability and avoids complex neural reward models prone to reward hacking.
  2. Focus on Readability:
    • Encourages clear and user-friendly outputs.
  3. Flexibility:
    • Adapts to domain-specific requirements like math correctness, logical flow, or proper coding syntax.

By systematically assigning rewards based on the criteria above, the reward model helps train the policy model to produce accurate, structured, and user-friendly reasoning outputs across diverse tasks.
