# Interpretable Proxy Models

Large language models (LLMs) can be simplified into interpretable proxy models through strategic approximations that retain key behaviors while exposing mechanistic patterns. Here's how researchers achieve this:


## Core Approaches

### 1. Feature-Based Simplification

- Human-interpretable features replace raw neuron activations (e.g., a "rhyme scheme" feature instead of an abstract activation vector)[1][2]
- Local replacement models mimic the original LLM's behavior using these features while preserving its attention patterns[2][6]
- Example: a poetry-generation proxy might explicitly track "syllable count" and "end rhyme" features instead of opaque neural computations
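To make this concrete, here is a minimal sketch, assuming the behavior to imitate is a single scalar (e.g., the LLM's score for a candidate line of poetry): a linear local replacement model is fit over hand-named features so that each weight and per-feature contribution is directly readable. The feature extractors and the `llm_scores` targets are hypothetical placeholders, not the procedure from the cited work.

```python
import numpy as np

FEATURE_NAMES = ["syllable_count", "end_rhyme", "alliteration"]

def extract_features(line: str) -> np.ndarray:
    """Hypothetical hand-written features for one line of poetry."""
    words = line.split()
    syllable_count = sum(max(1, len(w) // 3) for w in words)            # crude syllable estimate
    end_rhyme = 1.0 if words and words[-1].lower().endswith(("ay", "ow", "ee")) else 0.0
    alliteration = 1.0 if words and len({w[0].lower() for w in words}) < len(words) else 0.0
    return np.array([syllable_count, end_rhyme, alliteration])

def fit_proxy(lines, llm_scores):
    """Least-squares fit of proxy weights over interpretable features."""
    X = np.stack([extract_features(l) for l in lines])
    X = np.hstack([X, np.ones((len(lines), 1))])                        # bias column
    weights, *_ = np.linalg.lstsq(X, np.asarray(llm_scores, dtype=float), rcond=None)
    return weights

def explain(line, weights):
    """Each feature's contribution to the proxy score is directly readable."""
    x = np.append(extract_features(line), 1.0)
    contributions = dict(zip(FEATURE_NAMES + ["bias"], x * weights))
    return float(x @ weights), contributions
```

Fitting the proxy on lines scored by the original model and then calling `explain` on a new line yields a per-feature breakdown in place of opaque activations.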

### 2. Symbolic Program Integration

- LLM-based Symbolic Programs (LSPs) combine neural networks with rule-based systems[6]:
  1. LLMs generate natural-language concepts (e.g., "identify rhyme candidates")
  2. Symbolic rules assemble these concepts into decision trees
- Result: a medical diagnosis proxy could use rules like `IF (fever > 38°C) AND (cough present) THEN suggest influenza`[4]
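A minimal sketch of how such a rule tree could be represented, assuming the concepts have already been proposed (in an LSP they would come from the LLM) and grounded by hand; the concept names, thresholds, and decision labels below are hypothetical illustrations.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    concept: str                          # human-readable concept name
    predicate: Callable[[dict], bool]     # grounded check over the input record
    if_true: object = None                # another Rule or a string decision
    if_false: object = None

def evaluate(node, record):
    """Walk the rule tree until a string leaf (the decision) is reached."""
    while isinstance(node, Rule):
        node = node.if_true if node.predicate(record) else node.if_false
    return node

# "IF (fever > 38°C) AND (cough present) THEN suggest influenza"
flu_tree = Rule(
    concept="fever above 38 °C",
    predicate=lambda p: p["temperature_c"] > 38.0,
    if_true=Rule(
        concept="cough present",
        predicate=lambda p: p["cough"],
        if_true="suggest influenza",
        if_false="inconclusive",
    ),
    if_false="no flu indication",
)

print(evaluate(flu_tree, {"temperature_c": 38.6, "cough": True}))  # -> suggest influenza
```

Because every branch is named by a concept, the path taken for any input reads as a human-auditable chain of reasoning.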

### 3. Decoding-Time Guidance

- Proxy-tuning adjusts outputs without modifying the base model's weights[3][5]:
  - A small tuned model (the "expert") and its untuned counterpart (the "anti-expert") steer a larger base model at decoding time
  - Formula:

    $$ \text{Adjusted logits} = \text{Base logits} + (\text{Expert logits} - \text{Anti-expert logits}) $$
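At each decoding step the adjustment is plain logit arithmetic, as the sketch below shows with random placeholder logits; it assumes all three models share the same tokenizer and vocabulary.

```python
import numpy as np

def proxy_tuned_distribution(base_logits, expert_logits, anti_expert_logits):
    """Adjusted logits = base + (expert - anti-expert), then a stable softmax."""
    adjusted = base_logits + (expert_logits - anti_expert_logits)
    shifted = adjusted - adjusted.max()        # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Placeholder logits; in practice each array comes from running the large base
# model, the small tuned expert, and the small untuned anti-expert on the same prefix.
rng = np.random.default_rng(0)
vocab_size = 5
probs = proxy_tuned_distribution(
    rng.normal(size=vocab_size),
    rng.normal(size=vocab_size),
    rng.normal(size=vocab_size),
)
print(probs, probs.sum())   # a valid next-token distribution (sums to 1)
```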

## Validation & Trade-offs

| Method | Strengths | Limitations |
|---|---|---|
| Feature simplification | Directly interpretable steps | Explains ~25% of cases[2] |
| Symbolic programs | Human-transferable logic[6] | Limited to predefined DSLs |
| Proxy-tuning | No weight access needed[5] | Indirect behavioral alignment |

## Why It Works

1. Modularity: LLMs internally develop pseudo-symbolic representations (e.g., discrete "country capital" features)[2][6]
2. Attention conservation: proxy models inherit the original model's attention focus areas[1][2]
3. Error quantification: built-in error nodes highlight where the simplification fails[2]
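A crude illustration of point 3, under the simplifying assumption that an "error node" is just the residual between the original model's output and the proxy's reconstruction on the same inputs (the cited work defines error nodes within its replacement-model framework; the numbers and threshold below are made up):

```python
import numpy as np

def flag_error_nodes(original_outputs, proxy_outputs, threshold=0.5):
    """Flag inputs where the proxy's reconstruction deviates strongly."""
    residual = np.abs(np.asarray(original_outputs, dtype=float)
                      - np.asarray(proxy_outputs, dtype=float))
    return np.where(residual > threshold)[0], residual

flagged, residual = flag_error_nodes([0.9, 0.2, 0.7, 0.1], [0.85, 0.8, 0.65, 0.15])
print(flagged)    # indices where the simplified explanation should not be trusted
```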

While these proxies don’t fully replicate LLM capabilities, they provide actionable insights into model decision-making—a crucial step toward trustworthy AI systems.

Answer from Perplexity: https://www.perplexity.ai/search/help-me-understand-https-trans-3bfapnyoQ7m2qX6vrxqkiA?utm_source=copy_output