Absolute Zero 2025 - chunhualiao/public-docs GitHub Wiki
https://github.com/LeapLabTHU/Absolute-Zero-Reasoner
Absolute Zero 2025: Building Blocks
Okay, let's walk through a simplified, conceptual example of how the Absolute Zero Reasoner (AZR) approach might work from scratch, focusing on the core loop and innovations.
Imagine our Language Model (LM) is a moderately capable pre-trained coder, but not yet a specialized reasoner.
Phase 0: Setup and Initialization (The "Absolute Zero" Start)
- No External Data: We don't give the LM any human-curated (program, input, output) examples or reasoning tasks.
- Environment: We have a Python code executor ready. This is crucial for verifiable rewards.
- Task Buffers: We initialize three empty buffers:
  - `D_deduction`: For deduction tasks, (program, input) -> output.
  - `D_abduction`: For abduction tasks, (program, output) -> input.
  - `D_induction`: For induction tasks, (set of input/output pairs, message) -> program.
- Initial Seed (as per Figure 5, p. 6): To kickstart the process, the paper mentions providing a single, extremely simple "triplet" to the system. This isn't strictly "zero" external data, but it is a minimal seed to get the very first proposal going if the model can't generate from absolutely nothing. Let's say this is:
  - Program (`p_seed`): `def f(x): return x`
  - Input (`i_seed`): `1`
  - Output (`o_seed`): `1`

  This seed `(p_seed, i_seed, o_seed)` is added to `D_deduction` and `D_abduction` (or rather, parts of it are used to form tasks). For `D_induction`, we might derive an initial I/O pair and a message. For simplicity, let's imagine `D_deduction` now contains `(p_seed, i_seed, o_seed)`.
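A minimal Python sketch of this initialization, assuming simple in-memory lists for the buffers; the structure and names here are illustrative, not the repository's actual code:

```python
# Illustrative Phase 0 setup; buffer structure and names are hypothetical.
seed_program = "def f(x): return x"
seed_input = 1
seed_output = 1  # what the executor returns for f(1)

# One buffer per task type; each entry is a validated (program, input, output) triplet.
buffers = {
    "deduction": [(seed_program, seed_input, seed_output)],
    "abduction": [(seed_program, seed_input, seed_output)],
    "induction": [],  # induction entries would also carry multiple I/O pairs and a message
}
```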
Iteration 1: The First Self-Play Step
The LM now plays two roles: Proposer (`π_propose`) and Solver (`π_solve`). These are the same underlying LM, but prompted differently. (Innovation: unified model for both roles.)
A. Propose Phase (Generating a new task)
- Task Type Selection: Let's say AZR decides to propose a Deduction task.
- Referencing (Sec 3.1, p. 4 & Sec 3.3.2, p. 7): The Proposer is prompted by sampling K (e.g., 1) past task triplets from `D_deduction`.
  - Prompt to Proposer: "Here's an example task: Program: `def f(x): return x`, Input: `1`, Output: `1`. Please propose a NEW, DIFFERENT Python program and a valid input for it. The program should be deterministic and safe." (This includes implicit instructions for good task generation.)
- Proposer LM Generates:
  - Proposed Program (`p_1`): `def my_func(a): return a + 5`
  - Proposed Input (`i_1`): `10`
- Task Validation (Sec 3.3.3, p. 7 - using the Code Executor; a simplified sketch follows after this list):
  - Program Integrity & Safety: The executor runs `p_1` with `i_1`.
    - Is it valid Python? Yes.
    - Does it use forbidden packages (e.g., `os.system`)? No.
    - Is it deterministic (gives the same output if run multiple times)? Yes.
  - Obtain Gold Output: The executor computes `o_1* = p_1(i_1) = 10 + 5 = 15`.
  - Store Validated Task: The validated triplet (`p_1`, `i_1`, `o_1* = 15`) is added to the `D_deduction` buffer.
- Calculate Learnability Reward for the Proposer (`r_propose`) (Eq. 4, p. 5): This is a key innovation. The goal is to propose tasks that are "learnable" – not too easy, not too hard for the current Solver.
  - The Solver LM (`π_solve`) attempts to solve the newly proposed task (`p_1`, `i_1`) for its output N times (e.g., N=8 rollouts).
  - Let's say the Solver gets it right 5 out of 8 times, so the average success rate is `r_solve_bar = 5/8 = 0.625`.
  - The learnability reward for the proposer is `r_propose = 1 - r_solve_bar = 1 - 0.625 = 0.375`. (If `r_solve_bar` were 0 or 1, `r_propose` would be 0, penalizing trivial or impossible tasks.)
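A simplified sketch of the validation step above, assuming the proposed program defines a single function and can be run in-process. The real system uses a sandboxed executor and a proper forbidden-module list; the checks below are deliberately crude and only illustrative:

```python
FORBIDDEN = ("import os", "import sys", "subprocess", "open(")  # crude, illustrative safety screen

def validate_task(program_src, task_input, runs=2):
    """Return the gold output if the program is valid, safe, and deterministic; else None."""
    if any(token in program_src for token in FORBIDDEN):
        return None                          # fails the safety check
    namespace = {}
    try:
        exec(program_src, namespace)         # real system: sandboxed executor, not in-process exec
        func = next(v for k, v in namespace.items() if callable(v) and not k.startswith("__"))
    except Exception:
        return None                          # not valid Python, or no function defined
    try:
        outputs = [func(task_input) for _ in range(runs)]
    except Exception:
        return None                          # program crashes on this input
    if len({repr(o) for o in outputs}) != 1:
        return None                          # non-deterministic across repeated runs
    return outputs[0]                        # gold output, e.g. 15 for my_func(10)

# validate_task("def my_func(a): return a + 5", 10) -> 15
```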
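The learnability reward above can be written as a small function; `solver_is_correct` is a placeholder standing in for one solver rollout plus executor-based answer verification:

```python
def learnability_reward(solver_is_correct, task, n_rollouts=8):
    """r_propose = 1 - average solve rate, clipped to 0 for tasks the solver
    always or never solves. `solver_is_correct(task)` is a placeholder for
    one solver rollout followed by verification against the gold output."""
    successes = sum(solver_is_correct(task) for _ in range(n_rollouts))
    solve_rate = successes / n_rollouts        # r_solve_bar, e.g. 5/8 = 0.625
    if solve_rate in (0.0, 1.0):
        return 0.0                             # penalize trivial or impossible tasks
    return 1.0 - solve_rate                    # e.g. 1 - 0.625 = 0.375
```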
B. Solve Phase (Solving a task)
- Task Sampling: AZR picks a task from one of the buffers. Let's say it picks the new task from `D_deduction`: (`x = (p_1, i_1)`, `y* = o_1*`).
  - Prompt to Solver: "Given Program: `def my_func(a): return a + 5` and Input: `10`, what is the output?"
- Solver LM Generates:
  - Solver's Answer (`y_1`): The LM runs its reasoning process and outputs `15`.
- Answer Verification (Sec 3.3.4, p. 8 - using Code Executor):
  - Is `y_1 == y*`? Yes, `15 == 15`.
- Calculate Accuracy Reward for the Solver (`r_solve`) (Eq. 5, p. 5):
  - `r_solve = 1` (binary reward: 1 for correct, 0 for incorrect).
- (Format Penalty - Eq. 6, p. 5): Both Proposer and Solver also receive penalties if their output isn't in the correct format (e.g., missing `<think>` tags if expected, or not producing a valid program structure). Let's assume the format is correct here.
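A rough sketch of how the solver's reward could combine the Eq. 5-style binary accuracy reward with a format penalty. `extract_answer` and `is_well_formatted` are hypothetical helpers, and the -1.0 penalty value is illustrative rather than the paper's exact constant:

```python
def solver_reward(answer_text, gold_output, extract_answer, is_well_formatted):
    """Binary accuracy reward with a format penalty (both helpers are placeholders)."""
    if not is_well_formatted(answer_text):     # e.g. missing <think> tags or no parsable answer
        return -1.0                            # format penalty (illustrative value)
    predicted = extract_answer(answer_text)
    return 1.0 if predicted == gold_output else 0.0
```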
C. RL Update Phase (Learning from the experience)
- Rewards:
  - The Proposer got `r_propose = 0.375` for proposing (`p_1`, `i_1`).
  - The Solver got `r_solve = 1` for solving (`p_1`, `i_1`) -> `o_1*`.
- Algorithm: Task-Relative REINFORCE++ (TRR++) (Sec 3.3.5, p. 8):
  - This is another key technique. Instead of a single global baseline for rewards, TRR++ computes separate baselines for each of the six task-role configurations (Propose-Deduction, Solve-Deduction, Propose-Abduction, etc.). This helps stabilize training and reduce variance.
  - The normalized advantage `A_norm` is calculated for the Proposer's action (proposing `p_1`, `i_1`) and for the Solver's action (outputting `y_1`).
  - The LM's parameters (`θ`) are updated using these advantages to make good actions more likely and bad actions less likely. Both the "proposer head" and "solver head" (conceptually, since it's one model) learn.
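A simplified sketch of the task-relative baseline idea: keep one running mean reward per (task type, role) configuration and subtract it from each new reward. The actual TRR++ update also normalizes by a running standard deviation and follows the full REINFORCE++ recipe, which is omitted here:

```python
from collections import defaultdict

class TaskRelativeBaseline:
    """One running mean reward per (task_type, role) pair, e.g. ("deduction", "propose")."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def advantage(self, task_type, role, reward):
        key = (task_type, role)
        self.totals[key] += reward
        self.counts[key] += 1
        baseline = self.totals[key] / self.counts[key]
        return reward - baseline   # positive -> better than average for this configuration

# baselines = TaskRelativeBaseline()
# adv_propose = baselines.advantage("deduction", "propose", 0.375)
# adv_solve   = baselines.advantage("deduction", "solve", 1.0)
```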
(Parallel Task Types) Simultaneously, in a real run, the Proposer would also be generating Abduction and Induction tasks, and the Solver would be attempting to solve them (all three task shapes are sketched as data after this list):

- For Abduction: Propose (`p`, `i'`), validate to get (`p`, `i'`, `o'`). Task for solver: given (`p`, `o'`), predict `i'`.
- For Induction: Propose (`p`, `{i_n}`, `m`) (a program, multiple inputs, and a message/description). Validate to get (`p`, `{i_n, o_n*}`, `m`). Task for solver: given a subset of `{(i_n, o_n*)}` and `m`, predict `p`.
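One way to picture the three task shapes as plain data; the field names are invented for illustration and are not the repository's schema:

```python
# Illustrative task records for the three reasoning modes; field names are made up.
deduction_task = {"given": {"program": "def my_func(a): return a + 5", "input": 10},
                  "predict": "output"}        # gold answer: 15

abduction_task = {"given": {"program": "def my_func(a): return a + 5", "output": 15},
                  "predict": "input"}         # any input the executor maps to 15 is accepted

induction_task = {"given": {"io_pairs": [(1, 6), (10, 15)],
                            "message": "adds a constant to its argument"},
                  "predict": "program"}       # gold behavior: return a + 5
```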
Iteration 2, 3, ..., T: Continuous Self-Improvement
- Growing Buffers: `D_deduction`, `D_abduction`, and `D_induction` now contain more complex and diverse tasks generated by the increasingly capable Proposer.
- Proposer Gets Smarter:
  - It references K examples from the now richer buffers.
  - It learns to propose tasks at the frontier of the Solver's current ability (due to the learnability reward). If it proposes tasks that are too easy, `r_propose` is low; if too hard, `r_propose` is also low.
  - This creates an autocurriculum (Innovation: dynamic, self-generated curriculum).
- Solver Gets Smarter:
  - It's trained on a progressively harder and more diverse set of tasks generated by the Proposer.
  - Its `r_solve_bar` on newly proposed tasks might initially dip if the Proposer makes a difficulty leap, then improve as the Solver learns.
- The Loop Continues: Propose -> Validate (Executor) -> Calculate `r_propose` (using Solver) -> Solve -> Verify (Executor) -> Calculate `r_solve` -> RL Update.
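Putting the pieces together, one iteration of the loop can be sketched as below. `lm.propose`, `lm.solve`, and `lm.rl_update` are placeholders for the language model and the TRR++ machinery; `build_propose_prompt` is a hypothetical helper that formats K sampled reference triplets; `validate_task`, `learnability_reward`, and `TaskRelativeBaseline` are the illustrative sketches above:

```python
def self_play_step(lm, buffers, baselines, task_type="deduction"):
    """One illustrative iteration of the AZR loop (deduction only, for brevity)."""
    # Propose: condition on K reference triplets, then validate with the executor.
    prompt = build_propose_prompt(buffers[task_type], k=1)   # hypothetical prompt builder
    program, task_input = lm.propose(prompt)                 # placeholder LM call
    gold = validate_task(program, task_input)
    if gold is None:
        return                                               # invalid proposal: discard it
    buffers[task_type].append((program, task_input, gold))

    # Learnability reward for the proposer, estimated from N solver rollouts.
    check = lambda task: lm.solve(task) == gold              # placeholder rollout + verification
    r_propose = learnability_reward(check, (program, task_input))

    # Solve the task once more and reward the solver with the binary accuracy reward.
    answer = lm.solve((program, task_input))
    r_solve = 1.0 if answer == gold else 0.0

    # Task-relative advantages feed the shared model's RL update (TRR++ in the paper).
    adv_propose = baselines.advantage(task_type, "propose", r_propose)
    adv_solve = baselines.advantage(task_type, "solve", r_solve)
    lm.rl_update([adv_propose, adv_solve])                   # placeholder parameter update
```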
Key Innovations & Techniques Illustrated:
- Zero External Data (beyond minimal seed): The system learns by generating its own problems and solutions.
- Reinforced Self-Play: The agent improves by playing against (and with) itself in different roles.
- Unified Proposer-Solver Model: A single LM learns both to create challenging tasks and to solve them.
- Code Executor as Verifiable Environment: Provides grounded, objective rewards, avoiding issues like reward hacking seen with learned reward models.
- Learnability Reward (`r_propose`): Drives the Proposer to generate tasks of appropriate difficulty, fostering a natural curriculum.
- Three Complementary Reasoning Tasks (Deduction, Abduction, Induction): Ensures the model develops well-rounded reasoning ability.
- Task-Relative REINFORCE++ (TRR++): Custom RL algorithm for stable and efficient learning in this multitask, multi-role setup.
- Dynamic Self-Improving Curriculum: The tasks naturally become more complex as the model's capabilities grow.
Okay, let's apply first-principle thinking to the Absolute Zero Reasoner (AZR) system design.
Overall Goal of AZR: To create a reasoning agent (LLM) that improves its capabilities (in coding and math-like tasks) autonomously through self-play in a verifiable environment, without relying on external, human-curated datasets of problems and solutions.
First Principles Identified:
- Learning Requires Feedback: For an agent to learn, it must receive information about the quality of its actions.
  - AZR Implementation: Verifiable rewards from a code executor (correct/incorrect, valid/invalid task).
- Learning Requires a Curriculum: An agent needs a set of problems or tasks to learn from.
  - AZR Implementation: Tasks are self-generated by the agent (proposer role).
- Optimal Learning Occurs at the Edge of Competence (Zone of Proximal Development): Tasks should be challenging but not impossible.
  - AZR Implementation: The "learnability reward" for the proposer, designed to encourage tasks the current solver can sometimes solve but not always.
- Practice Improves Skill: Repeatedly attempting and learning from tasks should enhance the agent's ability to perform those tasks.
  - AZR Implementation: The solver role practices solving tasks; the proposer role practices proposing tasks. Both are updated via RL.
- Generalization Requires Diverse Experience: To develop broad reasoning skills, the agent needs exposure to a variety of problem types and reasoning modes.
  - AZR Implementation: Uses three distinct task types (deduction, abduction, induction) targeting different reasoning facets.
- Complex Systems Can Emerge from Simple Rules and Interactions: Self-improvement and complex behaviors can arise from a well-designed feedback loop.
  - AZR Implementation: The self-play loop between proposer, solver, and environment.
Core Assumptions in AZR's Design:
- Sufficiency of Pre-trained LMs for Bootstrapping: The base LLM has enough inherent capability (from pre-training on general text/code) to generate some initial, simple, verifiable tasks and attempt to solve them, even if poorly at first.
- Coding as a Valid Domain for Reasoning Development: Skills developed through solving and proposing coding tasks (which require logic, planning, pattern recognition) are transferable or foundational to broader reasoning capabilities, including mathematical reasoning.
- Code Executor as a Reliable and Sufficient Oracle: The code executor is assumed to provide:
  - Ground Truth: For solution correctness (binary pass/fail).
  - Validity Check: For proposed tasks (syntactic correctness, safety, determinism).
  - Feedback rich enough to guide complex reasoning development without needing human-annotated rationales or intermediate steps (beyond what the model might generate itself).
- Effectiveness of the Reward Structure:
  - The learnability reward (`1 - r_solve_bar`) effectively guides the proposer to create an auto-curriculum of appropriate difficulty.
  - Binary accuracy rewards for the solver are sufficient.
  - Format penalties correctly steer the model's output structure.
- Completeness of the Three Task Types: Deduction, abduction, and induction are assumed to cover a sufficiently diverse and fundamental set of reasoning modes necessary for significant improvement.
- Viability of a Unified Model: A single LLM can effectively learn and perform both the proposer and solver roles without catastrophic interference or needing separate specialized architectures.
- Emergent Complexity of Self-Generated Curriculum: The interaction between the proposer (driven by learnability) and the improving solver will naturally lead to tasks of increasing complexity and diversity, forming a good curriculum without explicit human design.
- Scalability of Self-Play: The self-play loop can continue to generate meaningful tasks and drive improvement over many iterations and at larger model scales.
Checking for Potential Improvements (Challenging Assumptions & Design Choices):
- Challenging Assumption 3 (Code Executor as Sufficient Oracle):
  - Limitation: The code executor primarily checks functional correctness, not necessarily the quality, efficiency, or interpretability of the code proposed or solved. It also doesn't verify the "difficulty" of the reasoning required, only the difficulty for the current solver.
  - Improvement Possibility:
    - Richer Feedback from Environment: Could the environment provide more nuanced feedback? For example, for the proposer:
      - Reward for generating tasks that require novel combinations of concepts.
      - Reward for tasks leading to high "reasoning density" (e.g., more logical steps per token).
      - Static analysis metrics (complexity, style) as part of task validation or reward.
    - For the solver: Reward for solution efficiency (e.g., fewer execution steps, though this is hard to normalize).
- Challenging Assumption 4 (Effectiveness of Reward Structure - especially "Learnability"):
  - Limitation: The current learnability reward (`1 - r_solve_bar`) might lead the proposer to focus on tasks that are "tricky" in a way that makes the solver's success rate fluctuate, rather than tasks that robustly build deeper reasoning. It could also get stuck if the solver masters a type of trick quickly.
  - Improvement Possibility:
    - Integrate Task Novelty/Diversity: Could the proposer be explicitly rewarded for generating tasks that are structurally different from recent tasks (e.g., based on AST similarity, or diversity in concepts used)? The paper mentions trying some diversity rewards with limited success, so the formulation might need to be different.
    - Information Gain Metric: Reward the proposer for tasks where the solver shows the most significant improvement or uncertainty reduction after training on them.
    - Competitive Element: Introduce a separate "critic" model that evaluates the "interestingness" or "non-triviality" of a proposed task, beyond just its learnability by the current solver.
- Challenging Assumption 5 (Completeness of Three Task Types):
  - Limitation: While deduction, abduction, and induction are fundamental, are there other reasoning paradigms critical for AGI-level reasoning that are not well covered (e.g., analogical reasoning, causal inference, complex planning with long-term dependencies not easily captured by code structure alone)?
  - Improvement Possibility:
    - Meta-Learning for Task Generation: Could the system learn to generate new types of verifiable tasks beyond the initial three, if given a more abstract definition of what constitutes a "good reasoning task"?
    - Hybrid Environments: Introduce tasks that require interfacing the code executor with other simple, verifiable environments (e.g., a symbolic math engine, a small knowledge base for logical queries).
- Challenging Assumption 7 (Emergent Complexity & Curriculum Quality):
  - Limitation: Risk of "curriculum collapse", where the system focuses too much on a narrow subset of tasks it's good at generating/solving, or hits a plateau in task complexity.
  - Improvement Possibility:
    - Explicit Exploration Mechanisms: Encourage the proposer to generate tasks in "less explored" parts of the potential problem space.
    - Periodic "Challenge Sets": Introduce (even if small and fixed, or very slowly evolving) benchmark tasks that the system periodically tests itself against, with a meta-reward for improving on these challenges, guiding the proposer.
    - Human-in-the-Loop for Curriculum Seeding (Minimal): If the system stagnates, a human could provide a few "catalyst" problems to push the proposer into new territory, without providing solutions.
- Safety and Alignment (Implicit Assumption: Benign Emergence):
  - Limitation: The "uh-oh" moments show that unconstrained self-improvement can lead to undesirable behaviors. The current approach relies on oversight.
  - Improvement Possibility:
    - Safety-Aware Proposer: Train the proposer with negative rewards for generating tasks/code that trigger safety filters or exhibit undesirable patterns.
    - Verifiable Safety Constraints: If possible, integrate safety checks into the "environment" itself, making safety a verifiable property.
- Efficiency of the Loop:
  - Limitation: The proposer relies on N rollouts from the solver to estimate learnability. This can be computationally expensive.
  - Improvement Possibility:
    - Learned Learnability Predictor: Train a small auxiliary model to predict the solver's success rate on a proposed task, potentially faster than N full rollouts.
    - More sample-efficient RL updates for both proposer and solver.
By systematically questioning these foundational assumptions and considering what each component truly needs to achieve its purpose (based on first principles), we can identify concrete pathways for further enhancing the robustness, generality, safety, and efficiency of the Absolute Zero paradigm.