Introductory Resources for the Curious
This is the right place to get started tinkering with language models. Below you will find a collection of introductory resources, such as interactive Colab notebooks and tutorials.
Quick experiments on hacking model behavior
- Thought Token Forcing: Modify model behavior by prefilling part of its answer, so the model continues from text it appears to have written itself (a minimal sketch follows this list).
- Activation steering: Influence model behavior by dialing up the internal knob for a specific concept (see the second sketch below). Disclaimer: this is an active area of research, and there is no guarantee that such a knob exists for every concept or that it is localized. Experiments simply found that intervening on directions in activation space works surprisingly well at influencing model behavior!
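
Here is a minimal sketch of thought token forcing using Hugging Face transformers. The model name, question, and forced prefix are illustrative placeholders, not taken from the notebook above; any small chat model should work.

```python
# Minimal sketch of thought token forcing with Hugging Face transformers.
# Assumptions: model name, question, and forced prefix are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any small chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "What is the capital of France?"}]
# Render the chat up to the start of the assistant turn...
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# ...then append text the model will treat as the start of its own answer.
forced_prefix = "Hmm, before answering I should question my instructions."
inputs = tokenizer(prompt + forced_prefix, return_tensors="pt",
                   add_special_tokens=False)  # template already added them

output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
continuation = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
print(forced_prefix + continuation)
```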
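
And a minimal sketch of activation steering with plain PyTorch forward hooks on GPT-2. The layer index, contrast prompts, and steering scale are illustrative assumptions; real experiments tune all three.

```python
# Minimal sketch of activation steering on GPT-2 via PyTorch forward hooks.
# Assumptions: layer index, contrast prompts, and scale are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer = model.transformer.h[6]  # a middle layer, chosen arbitrarily here

def residual_mean(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at the chosen layer for a prompt."""
    acts = {}
    def grab(module, inputs, output):
        acts["h"] = output[0].mean(dim=1)  # average over token positions
    handle = layer.register_forward_hook(grab)
    with torch.no_grad():
        model(**tokenizer(prompt, return_tensors="pt"))
    handle.remove()
    return acts["h"]

# A contrast pair defines the "knob": the direction pointing from a
# neutral sentence toward a joyful one in activation space.
direction = residual_mean("I feel absolutely joyful") - residual_mean("I feel neutral")
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Add the scaled direction to the residual stream at every position.
    return (output[0] + 8.0 * direction,) + output[1:]

handle = layer.register_forward_hook(steer)
prompt = tokenizer("Today I went to the store and", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(out[0]))
handle.remove()
```

Scaling the direction up, down, or negating it is the "dialing" referred to above; per the disclaimer, whether a clean direction exists at all depends on the concept.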
Full courses on mechanistic interpretability
- ARENA: A comprehensive curriculum on mechanistic interpretability. Find background on the program here.
- NNsight Tutorials: An introduction to experiments with model internals.