# Interpretable Proxy Model
Large language models (LLMs) can be simplified into interpretable proxy models through strategic approximations that retain key behaviors while exposing mechanistic patterns. Here's how researchers achieve this:
## Core Approaches
### 1. Feature-Based Simplification
- Human-interpretable features replace raw neuron activations (e.g., "rhyme scheme" instead of abstract vectors)[1][2]
- Local replacement models mimic the original LLM’s behavior using these features while preserving attention patterns[2][6]
- Example: a poetry generator proxy might explicitly track "syllable count" and "end rhyme" features instead of opaque neural computations (see the sketch after this list)
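A minimal Python sketch of what such a feature-based proxy could look like: it maps a couplet to two human-readable features ("syllables per line", "end rhyme") instead of raw activations. The heuristics and helper names (`syllable_count`, `end_rhyme`) are illustrative assumptions, not features extracted from any particular model.

```python
import re

# Cheap syllable heuristic: count vowel groups per word (an assumption,
# not a linguistic model).
VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def syllable_count(line: str) -> int:
    """Estimate syllables in a line by counting vowel groups in each word."""
    return sum(len(VOWEL_GROUPS.findall(word)) or 1 for word in line.split())

def end_rhyme(line_a: str, line_b: str) -> bool:
    """Crude rhyme check: do the final words share their last two letters?"""
    last_a = line_a.split()[-1].lower().strip(".,!?;:")
    last_b = line_b.split()[-1].lower().strip(".,!?;:")
    return last_a[-2:] == last_b[-2:]

def proxy_features(couplet: list[str]) -> dict:
    """Interpretable feature vector standing in for opaque neural computations."""
    return {
        "syllables_per_line": [syllable_count(line) for line in couplet],
        "has_end_rhyme": end_rhyme(couplet[0], couplet[1]),
    }

print(proxy_features(["The cat sat on the mat", "And dreamed about a rat"]))
# -> {'syllables_per_line': [6, 7], 'has_end_rhyme': True}
```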
### 2. Symbolic Program Integration
- LLM-based Symbolic Programs (LSPs) combine neural networks with rule-based systems[6]:
  - LLMs generate natural language concepts (e.g., "identify rhyme candidates")
  - Symbolic rules assemble these concepts into decision trees
- Result: a medical diagnosis proxy could use a rule like `IF (fever > 38°C) AND (cough present) THEN suggest influenza`[4] (see the sketch after this list)
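The sketch below shows the shape of such a concept-plus-rule composition in Python, using the influenza example above. The `Patient` fields, the 38°C threshold, and the concept functions are illustrative assumptions; in an actual LSP the concept checks would be produced by LLM calls rather than hand-written predicates.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    temperature_c: float
    has_cough: bool

def concept_fever(p: Patient) -> bool:
    """Concept an LLM might phrase as: 'does the patient have a fever?'"""
    return p.temperature_c > 38.0

def concept_cough(p: Patient) -> bool:
    """Concept: 'is a cough present?'"""
    return p.has_cough

def diagnose(p: Patient) -> str:
    """Symbolic rule assembling the concepts into an explicit decision."""
    if concept_fever(p) and concept_cough(p):
        return "suggest influenza"
    return "no suggestion"

print(diagnose(Patient(temperature_c=38.6, has_cough=True)))  # suggest influenza
```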
### 3. Decoding-Time Guidance
- Proxy-tuning adjusts outputs without modifying weights[3][5]:
  - A small tuned model ("expert") and its untuned version ("anti-expert") guide a larger base model
  - Formula (sketched in code below):

$$ \text{Adjusted logits} = \text{Base logits} + (\text{Expert logits} - \text{Anti-expert logits}) $$
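A minimal NumPy sketch of this decoding-time adjustment, assuming the three models share a vocabulary so their logit vectors are aligned. The toy vocabulary size and random logits are placeholders for real forward-pass outputs over the same prompt.

```python
import numpy as np

def proxy_tuned_logits(base: np.ndarray,
                       expert: np.ndarray,
                       anti_expert: np.ndarray) -> np.ndarray:
    """Adjusted logits = base logits + (expert logits - anti-expert logits)."""
    return base + (expert - anti_expert)

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax for turning logits into probabilities."""
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 5-token vocabulary; in practice these come from forward passes of the
# large base model, the small tuned "expert", and its untuned "anti-expert".
rng = np.random.default_rng(0)
base_logits = rng.normal(size=5)
anti_expert_logits = rng.normal(size=5)   # small untuned model
expert_logits = anti_expert_logits.copy()
expert_logits[2] += 2.0                   # tuning boosts token 2

adjusted = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
print(softmax(base_logits).round(3))
print(softmax(adjusted).round(3))         # probability mass shifts toward token 2
```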
## Validation & Trade-offs
| Method | Strengths | Limitations |
|---|---|---|
| Feature simplification | Directly interpretable steps | Explains ~25% of cases[2] |
| Symbolic programs | Human-transferable logic[6] | Limited to predefined DSLs |
| Proxy-tuning | No weight access needed[5] | Indirect behavioral alignment |
## Why It Works
- Modularity: LLMs internally develop pseudo-symbolic representations (e.g., discrete "country capital" features)[2][6]
- Attention Conservation: Proxy models inherit the original model’s attention focus areas[1][2]
- Error Quantification: Built-in error nodes highlight where simplifications fail[2]
While these proxies don’t fully replicate LLM capabilities, they provide actionable insights into model decision-making—a crucial step toward trustworthy AI systems.
## Citations
- [1] https://arxiv.org/html/2312.03656v2
- [2] https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- [3] https://huggingface.co/papers/2401.08565
- [4] https://www.nature.com/articles/s41467-023-43713-1
- [5] https://www.linkedin.com/pulse/proxy-tuning-efficient-customizable-adaptation-models-ayoub-tn0me
- [6] https://arxiv.org/html/2406.17224v1
- [7] https://openreview.net/forum?id=dribhnhm1i
- [8] https://arxiv.org/html/2402.01761v1
- [9] https://openreview.net/pdf/5177c627eec4962417ca349e4766628da370fae9.pdf
- [10] https://towardsdatascience.com/interpretable-features-in-large-language-models-377fb25c72eb/
- [11] https://openreview.net/forum?id=yibaLrx5Bm
- [12] https://arxiv.org/html/2401.08565v2
- [13] https://royalsocietypublishing.org/doi/10.1098/rsos.241692
- [14] https://arxiv.org/abs/2401.08565
- [15] https://lilywenglab.github.io/WengLab_2024_CBLLM.pdf
- [16] https://www.reddit.com/r/SaaS/comments/1fiuvxp/i_built_an_ai_proxy_platform_to_simplify_my_job/
- [17] https://dl.acm.org/doi/10.1145/3639372
- [18] https://news.mit.edu/2023/language-models-scalable-self-learners-0608
- [19] https://aclanthology.org/2024.findings-naacl.138.pdf
- [20] https://www.youtube.com/watch?v=4d_eyBqG75I