Biology of a Large Language Model - chunhualiao/public-docs GitHub Wiki

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

The linked paper introduces a novel method for reverse-engineering language models using attribution graphs, demonstrated through case studies on Claude 3.5 Haiku. Here’s a structured breakdown of its key insights:


Core Methodology

  1. Local Replacement Models:
    The researchers build simplified, interpretable proxy models that:

    • Use human-interpretable features (e.g., concepts like “Texas” or “rhyme scheme”)
    • Inherit attention patterns from the original model
    • Include error nodes to quantify gaps in interpretability[1]
  2. Attribution Graphs:
    These directed graphs map how features interact to produce outputs:

    • Nodes represent conceptual features
    • Edges show causal relationships between features
    • Pruned and grouped into supernodes for readability[1]
  3. Validation via Interventions:
    Hypothesized mechanisms are tested by:

    • Inhibiting/activating specific feature clusters
    • Measuring downstream effects on model behavior[1]
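
The graph-plus-intervention idea can be illustrated with a toy sketch (the feature names and weights below are invented for illustration; the actual method operates on learned features inside the model):

```python
# Toy attribution graph: nodes are interpretable features, weighted
# directed edges represent causal influence between them.
graph = {
    "Dallas":  {"Texas": 0.9},
    "Texas":   {"Austin": 0.8},
    "capital": {"Austin": 0.7},
}

def influence(graph, src, dst):
    """Total influence of src on dst, summed over all paths
    (product of edge weights along each path). Assumes a DAG."""
    total = 0.0
    for mid, w in graph.get(src, {}).items():
        total += w if mid == dst else w * influence(graph, mid, dst)
    return total

influence(graph, "Dallas", "Austin")  # 0.9 * 0.8 = 0.72
```

Deleting the "Texas" node from such a graph severs every Dallas→Austin path, which is the graph-level analogue of the inhibition experiments described above.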

The article clarifies that attribution graphs and simplified proxy models are distinct but interconnected components of their interpretability framework:

  1. Simplified Proxy Models
    These are purpose-built replacements for the original model ([1][4]):

    • Use human-interpretable features (e.g., "Texas" or "rhyme scheme")
    • Inherit attention patterns from Claude 3.5 Haiku
    • Include error nodes to quantify interpretability gaps
  2. Attribution Graphs
    These are computational maps derived from the proxy models ([1][4]):

    • Show causal relationships between features in the proxy model
    • Pruned to 10% of original nodes while retaining 80% explanatory power
    • Group related features into supernodes for readability
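
The pruning step can be sketched as a greedy procedure that keeps the strongest edges until they cover most of the total attribution (a simplification; the paper's actual pruning criterion is more involved, and the edge list here is made up):

```python
def prune_edges(edges, keep_fraction=0.8):
    """Greedily keep the highest-magnitude edges until they account
    for keep_fraction of the total attribution magnitude."""
    ranked = sorted(edges, key=lambda e: abs(e[2]), reverse=True)
    total = sum(abs(w) for _, _, w in ranked)
    kept, covered = [], 0.0
    for src, dst, w in ranked:
        if covered >= keep_fraction * total:
            break  # remaining weak edges are dropped
        kept.append((src, dst, w))
        covered += abs(w)
    return kept

edges = [("Dallas", "Texas", 0.9), ("Texas", "Austin", 0.8),
         ("capital", "Austin", 0.7), ("comma", "Austin", 0.1)]
prune_edges(edges)  # drops the weak ("comma", "Austin", 0.1) edge
```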

Key Differences

  Aspect     | Proxy Models                        | Attribution Graphs
  -----------|-------------------------------------|-----------------------------------------------
  Purpose    | Mimic the original model's behavior | Visualize causal pathways in the proxy models
  Structure  | Functional neural networks          | Directed graphs with nodes and edges
  Validation | Match the original model's outputs  | Tested via feature inhibition/activation

The proxy models act as interpretable substitutes for Claude 3.5 Haiku, while attribution graphs provide a mechanistic lens to understand how these proxies process information. Their combined use allows both behavioral fidelity and computational transparency.

Key Case Studies

1. Multi-Step Reasoning
For the prompt “Fact: the capital of the state containing Dallas is”:

  • The attribution graph revealed two distinct steps:
    Dallas → Texas, then Texas + capital → Austin[1]
  • Intervention proof: Suppressing “Texas” features caused outputs like “Sacramento” when California-related features were activated[1]
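
The intervention can be caricatured with a toy feature model (the feature names and the additive scoring rule are invented for illustration; in the paper, interventions act on learned features inside the running model):

```python
# Active features before the intervention.
features = {"Texas": 1.0, "California": 0.0, "say a capital": 1.0}

# Each candidate output is supported by a set of features.
supports = {
    "Austin":     ["Texas", "say a capital"],
    "Sacramento": ["California", "say a capital"],
}

def predict(features, supports):
    """Pick the output whose supporting features are most active."""
    scores = {out: sum(features[f] for f in fs) for out, fs in supports.items()}
    return max(scores, key=scores.get)

before = predict(features, supports)   # "Austin"
features["Texas"] = 0.0                # inhibit the Texas feature cluster
features["California"] = 1.0           # activate California-related features
after = predict(features, supports)    # "Sacramento"
```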

2. Poetry Planning
When generating rhyming couplets:

  • Pre-writing planning: Features for candidate end-words (e.g., “rabbit”) activated before composing the line[1]
  • Backward chaining: The model structures intermediate words (e.g., “like a”) to reach the planned rhyme[1]
  • Steering experiments showed a 70% success rate in forcing specific end-words via feature injection[1]
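
The plan-then-backfill behavior can be sketched in a few lines (everything here is invented for illustration; real steering injects planned-word features into the model mid-generation rather than passing an argument):

```python
def compose_line(rhyme_word, candidates, injected=None):
    """Pick an end word first (here: crude suffix-match rhyme test),
    unless a 'feature injection' forces a different plan; then build
    the rest of the line backward from that planned ending."""
    plan = injected or next(w for w in candidates
                            if w.endswith(rhyme_word[-2:]))
    return f"like a {plan}", plan

_, end = compose_line("habit", ["carrot", "rabbit"])       # plans "rabbit"
_, steered = compose_line("habit", ["carrot", "rabbit"],
                          injected="carrot")               # forced ending
```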

3. Multilingual Processing
For prompts meaning “big” in English, French, and Chinese:

  • Shared, language-agnostic features handled the antonym operation (mapping “big” to “small”)
  • Language-specific features determined output syntax[1]
  • Cross-lingual interventions: Swapping antonym/synonym operations worked across languages despite being derived from English data[1]
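
The shared-operation picture can be sketched as a language-agnostic concept layer plus language-specific lexicons (all names and mappings below are invented for illustration):

```python
ANTONYM = {"big": "small", "small": "big"}    # language-agnostic operation
LEXICON = {                                   # language-specific surface forms
    "en": {"big": "big",   "small": "small"},
    "fr": {"big": "grand", "small": "petit"},
    "zh": {"big": "大",    "small": "小"},
}
# Inverse map: surface word -> shared concept.
CONCEPT = {word: concept
           for forms in LEXICON.values()
           for concept, word in forms.items()}

def antonym(word, lang):
    """Map to the shared concept, apply the shared op, re-emit in-language."""
    return LEXICON[lang][ANTONYM[CONCEPT[word]]]

antonym("grand", "fr")  # "petit"
```

Because `ANTONYM` is defined once over concepts rather than per language, swapping it for a different operation changes behavior in every language at once, mirroring the cross-lingual intervention result.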

Limitations and Implications

  • Partial interpretability: Attribution graphs yielded satisfying explanations for only ~25% of the prompts analyzed[1]
  • Scale dependence: Multilingual features become more prevalent in larger models[1]
  • Practical impact: These methods help audit model reliability but remain labor-intensive

The work demonstrates that even advanced models like Claude 3.5 Haiku use human-like intermediate steps (e.g., planning, multi-hop reasoning), validated through rigorous experimentation. However, full reverse-engineering remains challenging due to the models’ inherent complexity[1].