# Interpretable Proxy Model
Large language models (LLMs) can be simplified into interpretable proxy models through strategic approximations that retain key behaviors while exposing mechanistic patterns. Here's how researchers achieve this:
## Core Approaches
### 1. Feature-Based Simplification
- Human-interpretable features replace raw neuron activations (e.g., "rhyme scheme" instead of abstract vectors)[1][2]
- Local replacement models mimic the original LLM’s behavior using these features while preserving attention patterns[2][6]
- Example: a poetry generator proxy might explicitly track "syllable count" and "end rhyme" features instead of opaque neural computations (see the sketch after this list)
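A minimal Python sketch of what such a feature-based proxy could look like: it maps a couplet to two human-readable features ("syllables per line", "end rhyme") instead of raw activations. The heuristics and helper names (`syllable_count`, `end_rhyme`) are illustrative assumptions, not features extracted from any particular model.

```python
import re

# Cheap syllable heuristic: count vowel groups per word (an assumption,
# not a linguistic model).
VOWEL_GROUPS = re.compile(r"[aeiouy]+", re.IGNORECASE)

def syllable_count(line: str) -> int:
    """Estimate syllables in a line by counting vowel groups in each word."""
    return sum(len(VOWEL_GROUPS.findall(word)) or 1 for word in line.split())

def end_rhyme(line_a: str, line_b: str) -> bool:
    """Crude rhyme check: do the final words share their last two letters?"""
    last_a = line_a.split()[-1].lower().strip(".,!?;:")
    last_b = line_b.split()[-1].lower().strip(".,!?;:")
    return last_a[-2:] == last_b[-2:]

def proxy_features(couplet: list[str]) -> dict:
    """Interpretable feature vector standing in for opaque neural computations."""
    return {
        "syllables_per_line": [syllable_count(line) for line in couplet],
        "has_end_rhyme": end_rhyme(couplet[0], couplet[1]),
    }

print(proxy_features(["The cat sat on the mat", "And dreamed about a rat"]))
# -> {'syllables_per_line': [6, 7], 'has_end_rhyme': True}
```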
### 2. Symbolic Program Integration
- LLM-based Symbolic Programs (LSPs) combine neural networks with rule-based systems[6]:
  - LLMs generate natural language concepts (e.g., "identify rhyme candidates")
  - Symbolic rules assemble these concepts into decision trees
- Result: a medical diagnosis proxy could use a rule like `IF (fever > 38°C) AND (cough present) THEN suggest influenza`[4] (see the sketch after this list)
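The sketch below shows the shape of such a concept-plus-rule composition in Python, using the influenza example above. The `Patient` fields, the 38°C threshold, and the concept functions are illustrative assumptions; in an actual LSP the concept checks would be produced by LLM calls rather than hand-written predicates.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    temperature_c: float
    has_cough: bool

def concept_fever(p: Patient) -> bool:
    """Concept an LLM might phrase as: 'does the patient have a fever?'"""
    return p.temperature_c > 38.0

def concept_cough(p: Patient) -> bool:
    """Concept: 'is a cough present?'"""
    return p.has_cough

def diagnose(p: Patient) -> str:
    """Symbolic rule assembling the concepts into an explicit decision."""
    if concept_fever(p) and concept_cough(p):
        return "suggest influenza"
    return "no suggestion"

print(diagnose(Patient(temperature_c=38.6, has_cough=True)))  # suggest influenza
```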
### 3. Decoding-Time Guidance
- Proxy-tuning adjusts outputs without modifying weights[3][5]:
  - A small tuned model ("expert") and its untuned version ("anti-expert") guide a larger base model
  - Formula (sketched in code below):

$$ \text{Adjusted logits} = \text{Base logits} + (\text{Expert logits} - \text{Anti-expert logits}) $$
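A minimal NumPy sketch of this decoding-time adjustment, assuming the three models share a vocabulary so their logit vectors are aligned. The toy vocabulary size and random logits are placeholders for real forward-pass outputs over the same prompt.

```python
import numpy as np

def proxy_tuned_logits(base: np.ndarray,
                       expert: np.ndarray,
                       anti_expert: np.ndarray) -> np.ndarray:
    """Adjusted logits = base logits + (expert logits - anti-expert logits)."""
    return base + (expert - anti_expert)

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax for turning logits into probabilities."""
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy 5-token vocabulary; in practice these come from forward passes of the
# large base model, the small tuned "expert", and its untuned "anti-expert".
rng = np.random.default_rng(0)
base_logits = rng.normal(size=5)
anti_expert_logits = rng.normal(size=5)   # small untuned model
expert_logits = anti_expert_logits.copy()
expert_logits[2] += 2.0                   # tuning boosts token 2

adjusted = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
print(softmax(base_logits).round(3))
print(softmax(adjusted).round(3))         # probability mass shifts toward token 2
```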
## Validation & Trade-offs
| Method | Strengths | Limitations |
|---|---|---|
| Feature simplification | Directly interpretable steps | Explains ~25% of cases[2] |
| Symbolic programs | Human-transferable logic[6] | Limited to predefined DSLs |
| Proxy-tuning | No weight access needed[5] | Indirect behavioral alignment |
## Why It Works
- Modularity: LLMs internally develop pseudo-symbolic representations (e.g., discrete "country capital" features)[2][6]
- Attention Conservation: Proxy models inherit the original model’s attention focus areas[1][2]
- Error Quantification: Built-in error nodes highlight where simplifications fail[2]
While these proxies don’t fully replicate LLM capabilities, they provide actionable insights into model decision-making—a crucial step toward trustworthy AI systems.
## Citations
- [1] https://arxiv.org/html/2312.03656v2
- [2] https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- [3] https://huggingface.co/papers/2401.08565
- [4] https://www.nature.com/articles/s41467-023-43713-1
- [5] https://www.linkedin.com/pulse/proxy-tuning-efficient-customizable-adaptation-models-ayoub-tn0me
- [6] https://arxiv.org/html/2406.17224v1
- [7] https://openreview.net/forum?id=dribhnhm1i
- [8] https://arxiv.org/html/2402.01761v1
- [9] https://openreview.net/pdf/5177c627eec4962417ca349e4766628da370fae9.pdf
- [10] https://towardsdatascience.com/interpretable-features-in-large-language-models-377fb25c72eb/
- [11] https://openreview.net/forum?id=yibaLrx5Bm
- [12] https://arxiv.org/html/2401.08565v2
- [13] https://royalsocietypublishing.org/doi/10.1098/rsos.241692
- [14] https://arxiv.org/abs/2401.08565
- [15] https://lilywenglab.github.io/WengLab_2024_CBLLM.pdf
- [16] https://www.reddit.com/r/SaaS/comments/1fiuvxp/i_built_an_ai_proxy_platform_to_simplify_my_job/
- [17] https://dl.acm.org/doi/10.1145/3639372
- [18] https://news.mit.edu/2023/language-models-scalable-self-learners-0608
- [19] https://aclanthology.org/2024.findings-naacl.138.pdf
- [20] https://www.youtube.com/watch?v=4d_eyBqG75I