DeepSeek R1: Nature paper - chunhualiao/public-docs GitHub Wiki
https://www.nature.com/articles/s41586-025-09422-z
Here’s the “what’s going on” in this Nature paper—fast but complete.
DeepSeek-R1 shows you can teach an LLM to reason using only reinforcement learning (RL) from final-answer correctness, no human chain-of-thought labels. The model first learns long, reflective reasoning via pure RL (“R1-Zero”), then a multistage pipeline adds alignment and broader skills to make “R1.” On math, coding, and STEM benchmarks it beats supervised-only approaches.
- Pure RL on a base model (DeepSeek-V3 Base): They train with Group Relative Policy Optimization (GRPO)—sample a group of answers per question, score each with rule-based checkers (e.g., numeric answer match, code unit tests), compute within-group advantages, and update the policy while staying close to a reference model (KL penalty). No value network, no human CoT.
- Only verifiable rewards at first: R1-Zero is driven by two rule-based rewards, accuracy (is the final answer correct?) and format (keeping the reasoning inside `<think>…</think>` and the final answer inside `<answer>…</answer>` tags). This yields emergent behaviors like self-verification ("wait…") and trying alternatives.
- Then make it usable (R1): They add (1) a small cold-start SFT with human-like long CoT for readability, (2) RL again with rule-based rewards on reasoning plus model-based rewards for helpfulness/harmlessness, and (3) another SFT mixing reasoning + non-reasoning data. See the pipeline diagram on page 3 (Fig. 2).
- Reasoning emerges & gets longer: On page 2 (Fig. 1a), AIME-2024 pass@1 climbs from 15.6% → 77.9% (and to 86.7% with self-consistency). Fig. 1b shows average response length steadily grows into the thousands of tokens, reflecting more deliberate reasoning. A notable jump occurs after ~8.2k steps, when the rollout max length was increased, per Methods.
- Benchmark suite: Final R1 improves broad instruction-following (e.g., AlpacaEval 2.0 LC-winrate 87.6, Arena-Hard 92.3) while keeping strong reasoning (e.g., AIME-2024 79.8 pass@1, MATH-500 97.3; Codeforces rating ~2029). See Table 2 on page 4.
- Linguistic “aha moment”: The model starts using reflective terms like “wait” much more around step ~8k, quantified in Extended Data Fig. 1 (page 9) and illustrated in Table 1 (page 3).
Instead of learning a value function, GRPO samples G outputs per prompt, scores them, normalizes each answer’s reward by the group’s mean/std to get advantages, and does a clipped policy update with a KL penalty to a reference policy—simpler, cheaper, and well-suited to verifiable tasks. Methods (pages 7–8) give the objective and training hyperparameters.
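Here's a minimal PyTorch sketch of a GRPO-style update as described above, assuming sequence-level log-probabilities are already computed; the group size, clipping threshold, KL estimator, and coefficients are illustrative placeholders, not the paper's settings.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """GRPO-style loss for one prompt with G sampled answers.

    logp_new, logp_old, logp_ref: (G,) summed log-probs of each sampled answer
    under the current, rollout, and frozen reference policies.
    rewards: (G,) scalar rule-based rewards. clip_eps and kl_coef are placeholders.
    """
    # Group-relative advantage: normalize rewards by the group's mean and std.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Clipped PPO-style surrogate using importance ratios against the rollout policy.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Simple KL penalty keeps the policy close to the reference model (no value network).
    kl = (logp_new - logp_ref).mean()

    return -(surrogate.mean() - kl_coef * kl)

# Toy usage: 4 sampled answers for one prompt, two of them judged correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
logp_old = torch.tensor([-20.0, -25.0, -22.0, -30.0])
logp_ref = logp_old.clone()
logp_new = logp_old + 0.1 * torch.randn(4)  # pretend the policy moved slightly
print(grpo_loss(logp_new, logp_old, logp_ref, rewards))
```

The point of the group normalization is that each answer is judged relative to its siblings for the same prompt, so no learned value function is needed as a baseline.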
- Rule-based (reasoning): exact-match graders for math; compilers/test suites for code; a format reward to keep the `<think>`/`<answer>` structure (see the verifier sketch below). This avoids reward-model gaming during large-scale RL.
- Model-based (general): pairwise helpfulness and pointwise safety reward models (trained on ~66k and ~106k examples, respectively) only in later R1 stages, used sparingly to reduce reward hacking. A language-consistency bonus discourages English/Chinese mixing.
- Structured output & tool use are still weak (no native tool calling), though RL environments for tools look like a tractable next step. Token efficiency shows some overthinking on easy prompts. Language mixing persists outside EN/ZH. Few-shot prompts can hurt R1; prefer zero-shot with an explicit output format. Software-engineering gains lag because evaluations are slow; the authors suggest asynchronous evaluation/rejection sampling as future work. See "Conclusion, limitation and future work" (pages 4–5).
- If your task has a verifier (e.g., unit tests for code, formal checks on diagrams/specs), you can skip human CoT and push reasoning via pure RL on correctness, then optionally align with light SFT + preference RL.
- Use scaffolding + format rewards (tags, required boxes) to make the verifier reliable; add a language/consistency reward if outputs must conform to a schema (think: structured SysML).
- For long-horizon problems, expect emergent self-checking if your reward only cares about final verifiable success; budget for longer outputs and consider self-consistency decoding on hard items.
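A minimal sketch of the kind of rule-based verifier referenced above: an exact-match accuracy check plus a format check on `<think>`/`<answer>` tags. The reward weighting and the light answer normalization are assumptions for illustration, not the paper's exact implementation.

```python
import re

THINK_ANSWER = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>…</think><answer>…</answer> layout."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference after light normalization."""
    m = THINK_ANSWER.search(completion)
    if not m:
        return 0.0
    predicted = m.group(2).strip().rstrip(".")
    return 1.0 if predicted == gold_answer.strip() else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # Weighting accuracy above format is an arbitrary choice for this sketch.
    return accuracy_reward(completion, gold_answer) + 0.1 * format_reward(completion)

completion = "<think>13 * 7 = 91, check: 91 / 7 = 13.</think><answer>91</answer>"
print(total_reward(completion, "91"))  # 1.1
```

For code tasks, `accuracy_reward` would instead compile the program and run unit tests; the rest of the scaffolding stays the same.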
If you want, I can walk through (a) GRPO math with a tiny toy example, (b) how to design a verifier for code/diagram tasks, or (c) how to adapt their multistage pipeline to your domain.
Here’s what the Supplementary Information adds beyond the main paper—i.e., the “how” and a lot of careful diagnostics:
#1 RL algorithm choice & infrastructure. It directly compares GRPO vs. PPO (Supplementary Fig. 2): PPO can match GRPO only after sensitive λ tuning and with the extra compute of a value model, so GRPO is the more practical choice at scale. It also diagrams the RL framework (Supplementary Fig. 3): rollouts with vLLM workers, expert parallelism for the MoE, duplicated "hot" experts, self-speculative decoding via MTP (multi-token prediction), and a modular rule-based reward executor.
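For the rollout side, a rough sketch of sampling a group of answers per prompt with vLLM's public API and handing them to a rule-based reward; the model name, sampling settings, and group size are assumptions, not the paper's infrastructure.

```python
from vllm import LLM, SamplingParams

# Hypothetical setup: any open base/chat model works for the sketch.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sampling = SamplingParams(n=8, temperature=0.6, top_p=0.95, max_tokens=2048)

prompts = ["Solve: what is 17 * 23? Put the final answer in <answer>...</answer>."]
outputs = llm.generate(prompts, sampling)

# Feed each sampled completion to a rule-based reward (see the verifier sketch above),
# then hand (prompt, completion, reward) tuples back to the GRPO trainer.
for request in outputs:
    for completion in request.outputs:
        text = completion.text
        # reward = total_reward(text, gold_answer="391")
```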
#2 Data recipe in detail (incl. “cold start”). It spells out the supervised dataset used after rejection sampling: ~600k reasoning samples and ~200k non-reasoning samples; some labels are produced by a generative judge (DeepSeek-V3), and the “thinking” style guidelines (concise, conversational, no markdown) are listed. A table gives domain counts and token stats for the ~800k SFT set. Note that most data are single-turn.
#3 Exact hyper-params & training cost. It lists LR schedules, batch sizes, context lengths, and distillation LRs per base model. Crucially, it gives a compute breakdown: ~147k H800 GPU-hours total (~$294k at $2/GPU-h), with R1-Zero ~101k hours, SFT data creation ~5k, and R1 ~41k; plus cluster specs (64×8 H800s) and wall-times (R1-Zero ~198 h; R1 ~80 h). (Supplementary Table 3/4 & §2.4.4.)
#4 Reward-model pitfalls (“reward hacking”). A plot shows the helpful-reward score rising while Codeforces pass@1 falls, warning that model-based preference rewards can be gamed without improving real problem-solving. (Supplementary Fig. 4.)
#5 Language Consistency reward ablation. Keeping the chain-of-thought in one target language remains stable only with the LC reward; math holds steady but code drops slightly—quantifying the readability vs. performance trade-off. (Supplementary Fig. 5; §2.6.)
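One plausible way to implement a language-consistency reward of the kind ablated here: score the chain-of-thought by the fraction of alphabetic characters in the target script. The character-range heuristic is an assumption for illustration, not the paper's implementation.

```python
def language_consistency_reward(cot: str, target: str = "en") -> float:
    """Fraction of alphabetic characters in the chain-of-thought that belong to the target script.

    Simple heuristic sketch: 'en' counts ASCII letters, 'zh' counts CJK unified ideographs.
    """
    letters = [ch for ch in cot if ch.isalpha()]
    if not letters:
        return 0.0
    if target == "en":
        in_target = sum(ch.isascii() for ch in letters)
    elif target == "zh":
        in_target = sum("\u4e00" <= ch <= "\u9fff" for ch in letters)
    else:
        raise ValueError(f"unsupported target language: {target}")
    return in_target / len(letters)

print(language_consistency_reward("First, factor the polynomial, 然后检查根。", target="en"))
```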
#6 Self-evolution signals in R1-Zero. It tracks reasoning behaviors during RL: the frequency of reflective tokens (e.g., “wait”, “check”, “verify”) grows 5–7×, with “wait” spiking after step ~8k—evidence of emergent reflection timing. (Supplementary Fig. 7; §3.2.)
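A tiny sketch of this kind of diagnostic: count how often reflective words appear, on average, in responses sampled at each training checkpoint. The word list and per-response averaging are assumptions.

```python
import re
from collections import Counter

REFLECTIVE = ("wait", "check", "verify", "however", "alternatively")

def reflective_counts(responses: list[str]) -> Counter:
    """Average occurrences of each reflective word per response at one checkpoint."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in REFLECTIVE:
                counts[word] += 1
    return Counter({w: counts[w] / max(len(responses), 1) for w in REFLECTIVE})

# Toy usage: compare outputs sampled at an early vs. a late checkpoint.
early = ["The answer is 42."]
late = ["Let me check: 6 * 7 = 42. Wait, verify the carry... yes, 42."]
print(reflective_counts(early))
print(reflective_counts(late))
```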
#7 Evaluation protocol transparency. It documents pass@k settings (e.g., k=64 for AIME/GPQA), decoding temps (0.6), dataset windows (e.g., LiveCodeBench 2024-08→2025-01), and Codeforces setup—so results are reproducible. It also details aggressive decontamination (10-gram filtering; ~6 million math texts removed) and cautions about paraphrase leakage.
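A minimal sketch of 10-gram decontamination as described: drop any training document that shares a 10-gram with a benchmark item. Whitespace tokenization is an assumption; the actual pipeline (and its paraphrase-leakage caveat) is more involved.

```python
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs: list[str], benchmark_items: list[str], n: int = 10) -> list[str]:
    """Keep only training documents that share no n-gram with any benchmark item."""
    banned = set()
    for item in benchmark_items:
        banned |= ngrams(item, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & banned)]

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
bench = ["a quick brown fox jumps over the lazy dog near the river bank today ok"]
print(len(decontaminate(train, bench)))  # 0: the doc shares a 10-gram with a benchmark item
```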
#8 Test-time scaling beyond “just sample more”. Majority voting barely helps GPT-4o on AIME-2024 (9.3%→13.4% pass@64), yet R1 starts much higher (79.8% pass@1) and still benefits (→86.7% with voting; 90.0% pass@64). It explains why independent samples help less for non-reasoning models. (Supplementary §5.4/5.5.)
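A minimal sketch of the majority-voting (self-consistency) procedure being compared: sample k completions, extract the final answer from each, and return the most common one. The answer-extraction regex assumes the `<answer>` tag format used above.

```python
import re
from collections import Counter
from typing import Optional

def majority_vote(completions: list[str]) -> Optional[str]:
    """Return the most frequent extracted final answer among k sampled completions."""
    answers = []
    for text in completions:
        m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
        if m:
            answers.append(m.group(1).strip())
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

samples = ["<answer>336</answer>", "<answer>336</answer>", "<answer>84</answer>"]
print(majority_vote(samples))  # "336"
```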
#9 Safety: pipeline & benchmarks. It introduces a risk-control system (a keyword filter followed by a model-based "risk review" prompt) and reports standardized safety results (HELM and reproduced DNA/HarmBench runs). It also reports results with and without the CoT exposed ("hide cot"). Notably, R1 is comparable overall but shows a gap on HarmBench, and the pipeline measurably hardens the service. (Supplementary §4.3; Table 6; Listing 8.)
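A rough sketch of the two-stage risk-control idea (cheap keyword filter, then a model-based risk review); the keyword list and the `risk_review` stub are hypothetical placeholders, not the paper's system.

```python
BLOCKED_KEYWORDS = ("illustrative banned phrase one", "illustrative banned phrase two")

def risk_review(prompt: str) -> bool:
    """Stand-in for the model-based risk-review stage; a real system would query a safety model."""
    return False  # placeholder decision

def risk_control(prompt: str) -> str:
    # Stage 1: keyword filter catches the obvious cases cheaply.
    lowered = prompt.lower()
    if any(kw in lowered for kw in BLOCKED_KEYWORDS):
        return "refuse"
    # Stage 2: model-based risk review for anything the keyword filter misses.
    return "refuse" if risk_review(prompt) else "allow"

print(risk_control("Explain how transformers compute attention."))  # "allow"
```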
#10 Distillation: what was distilled, how, and how well. They distill R1 into open bases (Qwen/Llama) using ~800k R1-generated samples—SFT-only (no RL) to highlight pure transfer—and show consistent gains vs. human-only SFT baselines (see Supplementary Table 12).
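A minimal sketch of the SFT-only distillation recipe, using Hugging Face transformers with a hypothetical small student model; a real run would mask prompt tokens in the loss and batch the ~800k teacher-generated samples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical student base; the paper distills into Qwen/Llama bases.
name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(name)
student = AutoModelForCausalLM.from_pretrained(name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One teacher-generated (prompt, long-CoT response) pair stands in for the dataset.
prompt = "Question: what is 12 * 13?\n"
teacher_response = "<think>12 * 13 = 156.</think><answer>156</answer>"

# Plain SFT: next-token cross-entropy on the concatenated text, no RL anywhere.
batch = tok(prompt + teacher_response, return_tensors="pt")
labels = batch["input_ids"].clone()
loss = student(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```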
#11 Where R1 is strong/weak & how it scales thinking. A category breakdown over 2024 contests shows strengths in number theory/algebra and weaker geometry/combinatorics; on 2025 AIME/2024 AMC it reaches USAMO-qualifying caliber when combined. It also quantifies adaptive CoT length—hard problems average ~8,793 thinking tokens. (Supplementary Fig. 15–16; §5.2–5.4.)
#12 Stage-wise gains by difficulty. A table for LiveCodeBench shows most improvements accruing on medium/hard problems across Dev1→Dev3→R1, clarifying where the pipeline helps. (Supplementary Table 11.)
If you want, I can turn any of these into a one-page “cheat sheet” (e.g., costs, data sizes, ablations, safety pipeline) for quick reference.