absolute zero 2025: building blocks

Key Techniques and Building Blocks Dependency Graph for Absolute Zero Reasoner

Based on my analysis of the repository, here are the key techniques and building blocks with their dependency relationships:

```mermaid
graph TD
    A["Base Language Model"] --> B["veRL Framework"]
    B --> C["PPO Training Infrastructure"]

    D["Problem Type Definitions"] --> E["Data Construction System"]
    E --> F["Self-Generated Prompts"]

    F --> G["PROPOSE Phase"]
    F --> H["SOLVE Phase"]

    G --> I["Task Generation"]
    H --> J["Solution Generation"]

    I --> K["Python Executor"]
    J --> K

    K --> L["Multi-Modal Reward System"]
    L --> M["TRR++ Algorithm"]
    M --> N["Advantage Estimation"]
    N --> C

    C --> O["Model Updates"]
    O --> P["Self-Evolving Loop"]
    P --> G
    P --> H

    Q["Intrinsic Rewards"] --> L
    R["Code Validation"] --> L
    S["Math Verification"] --> L
```

Core Building Blocks for Beginners:

1. Foundation Layer

  • Base Language Model: The starting point (Qwen, Llama, etc.); a minimal loading sketch follows this list
  • veRL Framework: Distributed RL training infrastructure
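
A minimal, hedged sketch of the foundation layer: loading a base causal LM with Hugging Face Transformers and running a single generation. The model name and decoding settings are illustrative only; in the repository the actual checkpoint and rollout settings are configured through veRL.

```python
# Minimal sketch: load a base model as the starting policy and run one generation.
# The model name below is a hypothetical choice; any causal LM checkpoint works.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # illustrative; configured via veRL in the actual repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

prompt = "def f(x):\n    return x * 2\n# What is f(3)?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```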

2. Problem Definition System

  • Three Problem Types: code_i (input prediction), code_o (output prediction), code_f (function implementation); the sketch below shows how each type can be framed from the same (program, input, output) triple
  • Data Construction: Self-generates tasks using structured prompts
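
A hypothetical sketch of how the three problem types relate to a single (program, input, output) triple. The dataclass and dictionary below are illustrative, not the repository's actual data structures.

```python
# Illustrative framing of the three task types around one (program, input, output) triple.
from dataclasses import dataclass

@dataclass
class Triplet:
    program: str   # Python source of a deterministic function f
    input: str     # repr of the argument passed to f
    output: str    # repr of f(input)

triplet = Triplet(program="def f(x):\n    return sorted(x)[::-1]",
                  input="[3, 1, 2]",
                  output="[3, 2, 1]")

tasks = {
    # code_i (abduction): given program + output, predict a consistent input
    "code_i": {"given": (triplet.program, triplet.output), "predict": "input"},
    # code_o (deduction): given program + input, predict the output
    "code_o": {"given": (triplet.program, triplet.input), "predict": "output"},
    # code_f (induction): given input/output examples, implement the function
    "code_f": {"given": (triplet.input, triplet.output), "predict": "program"},
}
print(tasks["code_o"]["predict"])
```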

3. Two-Phase Algorithm

  • PROPOSE Phase: Model generates new reasoning tasks spanning abduction, deduction, and induction
  • SOLVE Phase: Model attempts to solve its self-generated tasks (a schematic of one round follows this list)
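
A schematic of one PROPOSE/SOLVE round, assuming hypothetical `policy.generate` and `executor.run` interfaces; the real rollout and sandbox plumbing in the repository differ.

```python
# Schematic of one PROPOSE/SOLVE round. `policy` and `executor` are hypothetical
# stand-ins for the actual model rollout and Python sandbox.
def propose_solve_round(policy, executor, reference_triplets, task_type):
    # PROPOSE: condition the model on past triplets and ask for a new task
    propose_prompt = (
        f"Here are previous {task_type} tasks:\n{reference_triplets}\n"
        "Propose a new, harder task of the same type."
    )
    proposed_task = policy.generate(propose_prompt)

    # Validate the proposal by actually executing its program
    result = executor.run(proposed_task)
    if not result.ok:
        return None  # malformed or unverifiable tasks earn no solve attempt

    # SOLVE: the same model now attempts its own task
    solve_prompt = f"Solve the following {task_type} task:\n{result.task}"
    answer = policy.generate(solve_prompt)
    return result.task, answer
```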

4. Validation and Reward System

  • Python Executor: Validates solutions through code execution
  • Multi-Modal Rewards: Combines accuracy, diversity, complexity, and format rewards (an execution-based verification sketch follows this list)
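
A minimal sketch of execution-based verification and a composite reward for a code_o (output prediction) task. The repository runs programs in a sandboxed executor and uses a richer reward; the bare `exec` call and the specific weights here are purely illustrative.

```python
# Execution-based check for a code_o task: run the ground-truth program and
# compare its real output against the model's predicted output.
def verify_output_prediction(program: str, given_input, predicted_output) -> bool:
    namespace = {}
    exec(program, namespace)              # illustrative only; the repo sandboxes execution
    actual = namespace["f"](given_input)  # run the ground-truth function
    return repr(actual) == repr(predicted_output)

def solver_reward(correct: bool, well_formatted: bool) -> float:
    # Illustrative composite reward: accuracy plus a small format term.
    # The actual system also folds in diversity and complexity signals.
    return (1.0 if correct else 0.0) + (0.1 if well_formatted else -0.1)

program = "def f(x):\n    return x * 2"
print(solver_reward(verify_output_prediction(program, 3, 6), well_formatted=True))
```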

5. Learning Algorithm

  • TRR++ (Task-Relative REINFORCE++): Advantage estimation with separate baselines per task type and role (see the sketch after this list)
  • PPO Training: Proximal Policy Optimization with a custom trainer
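
A hedged sketch in the spirit of task-relative advantage estimation: each (task type, role) group keeps its own running baseline, and rewards are normalized against that group's statistics. The class name and normalization details are assumptions, not the repository's implementation.

```python
# Task-relative baseline sketch: advantages are computed per (task type, role) group.
from collections import defaultdict
import statistics

class TaskRelativeBaseline:
    def __init__(self):
        # key: (task_type, role), e.g. ("code_o", "solve") or ("code_i", "propose")
        self.rewards = defaultdict(list)

    def advantage(self, task_type: str, role: str, reward: float) -> float:
        group = self.rewards[(task_type, role)]
        group.append(reward)
        if len(group) < 2:
            return 0.0  # not enough history for a meaningful baseline yet
        mean = statistics.fmean(group)
        std = statistics.pstdev(group) or 1.0  # guard against zero variance
        return (reward - mean) / std

baseline = TaskRelativeBaseline()
for r in [0.0, 1.0, 1.0, 0.0]:
    print(baseline.advantage("code_o", "solve", r))
```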

6. Self-Evolution Loop

  • Continuous Improvement: Model improves through both the generation (PROPOSE) and solving (SOLVE) phases; an illustrative outer loop follows below
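
An illustrative outer loop tying the earlier sketches together: propose, validate, solve, score, and update, with validated tasks growing the reference buffer. All names (`propose_solve_round` from the round sketch above, `executor.score`, `policy.update`) are placeholders rather than the repository's API.

```python
# Illustrative self-play loop; reuses propose_solve_round from the earlier sketch.
def self_play_loop(policy, executor, seed_triplets, num_iterations):
    buffer = list(seed_triplets)          # grows with every validated self-proposed task
    task_types = ("code_i", "code_o", "code_f")
    for _ in range(num_iterations):
        trajectories = []
        for task_type in task_types:
            outcome = propose_solve_round(policy, executor, buffer, task_type)
            if outcome is None:
                continue                  # invalid proposals contribute no solve trajectory
            task, answer = outcome
            buffer.append(task)           # valid tasks become future reference examples
            reward = executor.score(task, answer)  # hypothetical verification call
            trajectories.append((task, answer, reward))
        policy.update(trajectories)       # PPO / TRR++ step over proposer and solver rollouts
    return policy
```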

Recommended Learning Path for Beginners:

  1. Start with the README: Understand the overall algorithm flow and results

  2. Study Problem Types: Examine how different reasoning tasks are defined

  3. Understand Data Construction: Learn how the system generates its own training data

  4. Explore Reward System: Study how solutions are evaluated and rewarded

  5. Examine Training Loop: Understand the self-play training process

  6. Run Simple Example: Start with the provided scripts for hands-on experience

Notes

The Absolute Zero Reasoner represents a novel approach to improving language model reasoning capabilities through self-play without external training data. The key innovation lies in the iterative PROPOSE-SOLVE cycle where the model both generates its own reasoning tasks and learns to solve them, creating a self-evolving system. The system leverages advanced RL techniques like TRR++ for effective learning and employs a sophisticated reward system that evaluates multiple aspects of generated solutions including correctness, diversity, and complexity.