Biology of a Large Language Model
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
The linked paper introduces a novel method for reverse-engineering language models using attribution graphs, demonstrated through case studies on Claude 3.5 Haiku. Here’s a structured breakdown of its key insights:
Core Methodology
- Local Replacement Models: The researchers build simplified, interpretable proxy models that:
  - Use human-interpretable features (e.g., concepts like “Texas” or “rhyme scheme”)
  - Inherit attention patterns from the original model
  - Include error nodes to quantify gaps in interpretability[1]
- Attribution Graphs: These directed graphs map how features interact to produce outputs:
  - Nodes represent conceptual features
  - Edges show causal relationships between features
  - Graphs are pruned and related features are grouped into supernodes for readability[1]
- Validation via Interventions: Hypothesized mechanisms are tested by:
  - Inhibiting or activating specific feature clusters
  - Measuring downstream effects on model behavior[1] (a minimal sketch follows this list)
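To make these three ingredients concrete, the toy sketch below encodes a few hypothetical features plus an error node, wires them into a small weighted attribution graph, and runs a suppression intervention. Every feature name, weight, and activation here is an illustrative assumption; this is not the paper's code or Claude 3.5 Haiku's actual features.

```python
# Toy illustration of: interpretable features + an error node, a weighted
# attribution graph, and a suppression intervention. All values are assumptions.
import networkx as nx

# Hypothetical feature activations for one prompt; the error node stands in for
# whatever the interpretable features fail to capture.
activations = {"feature_A": 1.0, "feature_B": 0.8, "error_node": 0.2}

# Attribution graph: directed edges weighted by estimated causal influence.
graph = nx.DiGraph()
graph.add_weighted_edges_from([
    ("feature_A", "feature_B", 0.7),   # A promotes B
    ("feature_B", "output", 0.6),      # B promotes the output token
    ("error_node", "output", 0.1),     # unexplained residual contribution
])

def support_for(graph, activations, node):
    """Sum of (upstream activation x edge weight) flowing into `node`."""
    total = 0.0
    for src, _, data in graph.in_edges(node, data=True):
        total += activations.get(src, 0.0) * data["weight"]
    return total

baseline = support_for(graph, activations, "output")

# Intervention: clamp feature_B to zero and re-measure the output's support.
ablated = dict(activations, feature_B=0.0)
after = support_for(graph, ablated, "output")
print(f"output support: baseline={baseline:.2f}, feature_B suppressed={after:.2f}")
```

The drop in the output's support after clamping a single feature is the kind of downstream effect the intervention experiments measure, just at toy scale.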
The paper clarifies that attribution graphs and simplified proxy models are distinct but interconnected components of their interpretability framework:
- Simplified Proxy Models: These are purpose-built replacements for the original model ([1][4]):
  - Use human-interpretable features (e.g., "Texas" or "rhyme scheme")
  - Inherit attention patterns from Claude 3.5 Haiku
  - Include error nodes to quantify interpretability gaps
- Attribution Graphs: These are computational maps derived from the proxy models ([1][4]):
  - Show causal relationships between features in the proxy model
  - Are pruned to 10% of the original nodes while retaining 80% of the explanatory power (a pruning sketch follows this list)
  - Group related features into supernodes for readability
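The pruning and supernode grouping can be pictured with a short sketch: keep only the most influential nodes, track how much total influence survives, and merge related features under one label. The influence scores, the 10% keep fraction used here, and the grouping rule are illustrative assumptions, not the paper's actual algorithm or reported numbers.

```python
# Toy pruning and supernode grouping over hypothetical per-node influence scores.

def prune_by_influence(influence, keep_fraction=0.10):
    """Keep the top `keep_fraction` of nodes by influence; report influence retained."""
    ranked = sorted(influence.items(), key=lambda kv: kv[1], reverse=True)
    keep_n = max(1, int(len(ranked) * keep_fraction))
    kept = dict(ranked[:keep_n])
    retained = sum(kept.values()) / sum(influence.values())
    return kept, retained

def group_into_supernodes(nodes, label_fn):
    """Group individual features under a shared human-readable label."""
    supernodes = {}
    for node in nodes:
        supernodes.setdefault(label_fn(node), []).append(node)
    return supernodes

# Illustrative usage: many raw features, most with negligible influence.
influence = {f"feature_{i}": 1.0 / (i + 1) for i in range(100)}
kept, retained = prune_by_influence(influence, keep_fraction=0.10)
print(f"kept {len(kept)} of {len(influence)} nodes, retaining {retained:.0%} of influence")

# Hypothetical grouping rule: the first few features all relate to one concept.
supernodes = group_into_supernodes(
    kept, lambda n: "Texas-related" if int(n.split("_")[1]) < 5 else "other"
)
print(supernodes)
```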
Key Differences
| Aspect | Proxy Models | Attribution Graphs |
| --- | --- | --- |
| Purpose | Mimic the original model's behavior | Visualize causal pathways in proxy models |
| Structure | Functional neural networks | Directed graphs with nodes/edges |
| Validation | Match the original model's outputs | Tested via feature inhibition/activation |
The proxy models act as interpretable substitutes for Claude 3.5 Haiku, while attribution graphs provide a mechanistic lens to understand how these proxies process information. Their combined use allows both behavioral fidelity and computational transparency.
Key Case Studies
1. Multi-Step Reasoning
For the prompt “Fact: the capital of the state containing Dallas is”:
- The attribution graph revealed two distinct steps:
  - Dallas → Texas
  - Texas + capital → Austin[1]
- Intervention proof: Suppressing “Texas” features caused outputs like “Sacramento” when California-related features were activated[1] (see the sketch below)
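The logic of this intervention can be mimicked with a toy two-hop lookup. The dictionary and feature scores below are stand-ins chosen for illustration; in the paper, the intervention acts on learned features inside Claude 3.5 Haiku, not on a table.

```python
# Toy two-hop "capital of the state containing Dallas" with a feature intervention.
CAPITALS = {"Texas": "Austin", "California": "Sacramento"}

def predicted_capital(state_features):
    """Second hop: the strongest active state feature plus 'capital' selects the answer."""
    state = max(state_features, key=state_features.get)
    return CAPITALS[state]

# Baseline: "Dallas" activates the "Texas" feature (first hop), so the answer is Austin.
features = {"Texas": 0.9, "California": 0.0}
print(predicted_capital(features))   # -> Austin

# Intervention: suppress "Texas" and activate California-related features;
# the answer flips to Sacramento, mirroring the case study.
features = {"Texas": 0.0, "California": 0.9}
print(predicted_capital(features))   # -> Sacramento
```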
2. Poetry Planning
When generating rhyming couplets:
- Pre-writing planning: Features for candidate end-words (e.g., “rabbit”) activated before composing the line[1]
- Backward chaining: The model structures intermediate words (e.g., “like a”) to reach the planned rhyme[1]
- Steering experiments showed a 70% success rate in forcing specific end-words via feature injection[1] (see the sketch below)
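A toy sketch of the plan-then-steer idea: the end word is fixed before the line is written, and the intermediate words are chosen to lead toward it. The lead-in phrases and line template below are invented for illustration and are not model internals.

```python
# Toy "backward chaining": pick connective words that set up a planned rhyme word.
SETUPS = {"rabbit": "like a", "habit": "out of"}   # hypothetical lead-in phrases

def compose_line(planned_end_word):
    """Build the rest of the line around the end word that was planned in advance."""
    return f"He grabbed the carrot {SETUPS[planned_end_word]} {planned_end_word}"

# Normal run: the planning features settle on "rabbit" before the line is composed.
print(compose_line("rabbit"))

# Steering experiment: inject a different end-word feature ("habit") and check whether
# the rest of the line is restructured around it, as in the paper's steering runs.
print(compose_line("habit"))
```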
3. Multilingual Processing
For prompts meaning “big” in English, French, and Chinese:
- Shared language-agnostic features handled the antonym operation (“small”)
- Language-specific features determined output syntax[1]
- Cross-lingual interventions: Swapping the antonym operation for a synonym operation worked across languages, even though the intervention features were derived from English data[1] (see the sketch below)
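The language-agnostic/language-specific split can be pictured as an operation in a shared concept space followed by per-language rendering. The operation and vocabulary tables below are illustrative assumptions, not the model's actual features.

```python
# Toy factorization: a shared concept-level operation, then language-specific output.
CONCEPT_OPS = {
    "antonym": {"big": "small"},
    "synonym": {"big": "large"},
}
RENDER = {  # language-specific rendering of the resulting concept
    "en": {"small": "small", "large": "large"},
    "fr": {"small": "petit", "large": "grand"},
    "zh": {"small": "小", "large": "大"},
}

def apply_op(word, op, language):
    """Language-agnostic operation in concept space, then language-specific rendering."""
    concept = CONCEPT_OPS[op][word]
    return RENDER[language][concept]

for lang in ("en", "fr", "zh"):
    print(lang, apply_op("big", "antonym", lang))

# Cross-lingual intervention: swapping antonym -> synonym changes the answer in every
# language at once, mirroring the finding that the operation is shared across languages.
for lang in ("en", "fr", "zh"):
    print(lang, apply_op("big", "synonym", lang))
```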
Limitations and Implications
- Partial interpretability: Attribution graphs explain ~25% of prompts analyzed[1]
- Scale dependence: Multilingual features become more prevalent in larger models[1]
- Practical impact: These methods help audit model reliability but remain labor-intensive
The work demonstrates that even advanced models like Claude 3.5 Haiku use human-like intermediate steps (e.g., planning, multi-hop reasoning), validated through rigorous experimentation. However, full reverse-engineering remains challenging due to the models’ inherent complexity[1].