Interpretability
Circuit Complexity & Transformers
- Relations among Complexity Measures (1979)
- Theoretical Limitations of Self-Attention in Neural Sequence Models (2019)
- On the Power of Saturated Transformers: A View from Circuit Complexity (2021)
- Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity (2022)
- What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment (2022)
- Transformers Learn Shortcuts to Automata (2022)
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers (2023)
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective (2023)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2024)
- A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers (2025)
- Exact Expressive Power of Transformers with Padding (2025)
- Transformers Learn to Implement Multi-Step Gradient Descent with Chain of Thought (2025)
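A rough map for this subsection (a summary, not a claim from any single paper): The Parallelism Tradeoff shows log-precision transformers are contained in uniform $\mathsf{TC}^0$, which sits low in the classical hierarchy

$$\mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{L} \subseteq \mathsf{P},$$

so, under standard conjectures, fixed-depth transformers cannot solve problems hard for $\mathsf{NC}^1$ or beyond. The chain-of-thought papers show that emitting intermediate tokens adds serial computation that escapes this bound.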
Steering
- Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
- Improving LLM Reasoning through Interpretable Role-Playing Steering
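These steering papers share one mechanism: add a learned or contrast-derived direction to the residual stream at inference time. A minimal sketch with a PyTorch forward hook; the layer choice, scale `alpha`, and how `steering_vector` is derived are illustrative assumptions, not taken from any specific paper:

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float = 4.0):
    """Return a forward hook that adds alpha * steering_vector to the
    residual-stream output of a transformer block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical): derive steering_vector as the mean difference between
# activations on contrastive prompt pairs, then register the hook on a block:
# handle = model.transformer.h[12].register_forward_hook(
#     make_steering_hook(steering_vector, alpha=4.0))
# ... generate ...
# handle.remove()
```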
Agentic Interpretability
Algorithmic Learning Theory
- Learning Universal Predictors
- Language Modeling Is Compression
- Neural Networks and the Chomsky Hierarchy
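Language Modeling Is Compression builds on a classical identity worth keeping in mind: paired with arithmetic coding, a language model compresses a sequence $x_{1:n}$ to roughly its negative log-likelihood in bits, so better next-token prediction is literally better compression:

$$\ell(x_{1:n}) \approx -\sum_{i=1}^{n} \log_2 p(x_i \mid x_{<i}) \ \text{bits}.$$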
Rate-Distortion Theory
- Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression
- Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction
- White-Box Transformers via Sparse Rate Reduction
- Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
- Scaling White-Box Transformers for Vision
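The rate-reduction line above all optimizes the coding-rate objective. For reference (notation follows the MCR² paper), with features $Z \in \mathbb{R}^{d \times n}$ and class-membership matrices $\Pi = \{\Pi_j\}$, the rate reduction is

$$\Delta R(Z, \Pi, \epsilon) = \frac{1}{2}\log\det\!\Big(I + \tfrac{d}{n\epsilon^2} ZZ^\top\Big) \;-\; \sum_{j} \frac{\operatorname{tr}(\Pi_j)}{2n} \log\det\!\Big(I + \tfrac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2} Z\Pi_j Z^\top\Big),$$

maximized so that features are globally diverse but compact within each class. The white-box transformer papers derive attention-like layers by unrolling optimization of a sparsified variant of this objective.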
Information-Theoretic Analysis
- How much do language models memorize?
- Adaptive Language Modeling Using the Maximum Entropy Principle
- Adaptive Statistical Language Modeling: A Maximum Entropy Approach
- Layer by Layer: Uncovering Hidden Representations in Language Models
- Measuring the Mixing of Contextual Information in the Transformer
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
- Multi-View Information Bottleneck without Variational Approximation
- Information Flow Routes: Automatically Interpreting Language Models at Scale
- Unveiling the Dynamics of Information Interplay in Supervised Learning
- L2M: Mutual Information Scaling Law for Long-Context Language Modeling
- Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
- Approaches to Information-Theoretic Analysis of Neural Activity
- An Information-Theoretic Framework for Deep Learning
- An Information Theoretic Interpretation to Deep Neural Networks
- Information-Theoretic Generalization Bounds for Deep Neural Networks
Mechanistic Interpretability
- Weight-sparse transformers have interpretable circuits
- Superposition Yields Robust Neural Scaling
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
- Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
- Towards Interpreting Visual Information Processing in Vision-Language Models
- Open Problems in Mechanistic Interpretability
- What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
- Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
- The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Bilinear MLPs Enable Weight-Based Mechanistic Interpretability
- The Persian Rug: Solving Toy Models of Superposition Using Large-Scale Symmetries
- Decomposing the Dark Matter of Sparse Autoencoders
- Mechanistic Interpretability for AI Safety: A Review
- Automatically Interpreting Millions of Features in Large Language Models
- Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Efficient Dictionary Learning with Switch Sparse Autoencoders
- Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
- DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
- Polysemanticity and Capacity in Neural Networks
- An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
- How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
- Tracr: Compiled Transformers as a Laboratory for Interpretability
- Locating and Editing Factual Associations in GPT
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
- Toy Models of Superposition
- Transcoders enable fine-grained interpretable circuit analysis for language models
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
- SelfIE: Self-Interpretation of Large Language Model Embeddings
- A Primer on the Inner Workings of Transformer-based Language Models
- Retrieval Head Mechanistically Explains Long-Context Factuality
- What Does the Knowledge Neuron Thesis Have to Do with Knowledge?
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- Universal Neurons in GPT2 Language Models
- Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
- A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations
- Grokking modular arithmetic
- Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- Mechanistically Analyzing the Effects of Finetuning on Procedurally Defined Tasks
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
- Efficient sparse coding algorithms
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs
- Scaling and Evaluating Sparse Autoencoders (OpenAI)
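Several entries in this section (Towards Monosemanticity, SAEBench, Switch SAEs, the OpenAI scaling paper) rest on the same object: a sparse autoencoder trained to reconstruct residual-stream activations under an L1 sparsity penalty. A minimal sketch; dimensions and coefficients are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_model activations -> d_dict sparse features."""
    def __init__(self, d_model: int = 768, d_dict: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # nonnegative feature activations
        x_hat = self.decoder(f)       # linear reconstruction from features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Training step (illustrative), where x is a batch of residual-stream activations:
# x_hat, f = sae(x); loss = sae_loss(x, x_hat, f); loss.backward(); opt.step()
```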
Prompting-based
- STRUX: An LLM for Decision-Making with Structured Explanations
- Towards LLM-guided Causal Explainability for Black-box Text Classifiers
- Are self-explanations from Large Language Models faithful?
- Make Your Decision Convincing! A Unified Two-Stage Framework: Self-Attribution and Decision-Making
- Post Hoc Explanations of Language Models Can Improve Language Models
- Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges
Human Perspective
- A Rationale-Centric Framework for Human-in-the-loop Machine Learning
- The Impact of Imperfect XAI on Human-AI Decision-Making
- Explanations Can Reduce Overreliance on AI Systems During Decision-Making
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning
- Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience
- Evaluating Saliency Methods for Neural Language Models
- Explanation-Based Human Debugging of NLP Models: A Survey
- Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning
Evaluation
- Aligned Probing: Relating Toxic Behavior and Model Internals
- Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
- Testing methods of neural systems understanding
- FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Concept-based
- Measuring the Mixing of Contextual Information in the Transformer
- Explaining How Transformers Use Context to Build Predictions
- ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability
- Explaining Language Model Predictions with High-Impact Concepts
- Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees
- Post-hoc Concept Bottleneck Models
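Concept bottleneck models (the last entry above, and the Concept Bottleneck Models paper listed under Interpretation Methods) route every prediction through human-interpretable concepts. A minimal sketch with hypothetical dimensions:

```python
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Predict concepts first, then predict the label only from concepts,
    so each prediction can be inspected and intervened on at the concept layer."""
    def __init__(self, d_in: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                         nn.Linear(256, n_concepts))
        self.label_net = nn.Linear(n_concepts, n_classes)  # sees only concepts

    def forward(self, x):
        concepts = self.concept_net(x).sigmoid()  # supervised with concept labels
        return self.label_net(concepts), concepts
```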
Interpretability, Explainability, Robustness
- TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
- From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
- Causal Abstraction for Faithful Model Interpretation
- Interpretability and Explainability: A Machine Learning Zoo Mini-tour
- On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning
- Robust Encodings: A Framework for Combating Adversarial Typos
- On the Lack of Robust Interpretability of Neural Text Classifiers
- Concealed Data Poisoning Attacks on NLP Models
- Adversarial Examples Are Not Bugs, They Are Features
- Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
- Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
- Contextualized Perturbation for Textual Adversarial Attack
- Universal Adversarial Triggers for Attacking and Analyzing NLP
- A Closer Look at Accuracy vs. Robustness
- Getting a CLUE: A Method for Explaining Uncertainty Estimates
- Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
- How Does Mixup Help Robustness and Generalization?
- Adversarial Training for Large Neural Language Models
- Pathologies of Neural Models Make Interpretations Difficult
- Robust Attribution Regularization
- Fooling Network Interpretation in Image Classification
- Interpretable Deep Learning under Fire
- Interpreting Adversarial Examples by Activation Promotion and Suppression
- Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
- On the (In)fidelity and Sensitivity of Explanations
- Visualizing and Understanding the Effectiveness of BERT
- Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
- Structured Adversarial Attack: Towards General Implementation and Better Interpretability
- A simple defense against adversarial attacks on heatmap explanations
- Smoothed Geometry for Robust Attribution
- Fairwashing Explanations with Off-Manifold Detergent
- Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
- Gradient-based Analysis of NLP Models is Manipulable
- Interpretation of Neural Networks Is Fragile
- Towards Robust Explanations for Deep Neural Networks
- Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
- Interpretable Adversarial Perturbation in Input Embedding Space for Text
- Adversarial Training Methods for Semi-Supervised Text Classification
- Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training
- Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications
- Towards Robust Interpretability with Self-Explaining Neural Networks
- Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples
- TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
- Reevaluating Adversarial Examples in Natural Language
- Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples
- Explanations can be manipulated and geometry is to blame
- Fooling Neural Network Interpretations via Adversarial Model Manipulation
- Concise Explanations of Neural Networks Using Adversarial Training. Makes explanations sparse via adversarial training.
- Proper Network Interpretability Helps Adversarial Robustness in Classification. Makes interpretations robust to adversarial attacks.
- Informative Dropout for Robust Representation Learning: A Shape-bias Perspective. Interesting work utilizing self-information.
- Robust and Stable Black Box Explanations
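For reference, the canonical attack underlying many of the adversarial-robustness entries above is the fast gradient sign method (FGSM), which perturbs the input in the direction that maximally increases the loss:

$$x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(\theta, x, y)\big).$$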
Information Bottleneck
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction
- Supervised feature selection by clustering using conditional mutual information-based distances
- The information bottleneck method
- Multivariate Information Bottleneck
- Agglomerative Information Bottleneck
- Agglomerative Multivariate Information Bottleneck
- Nonlinear Information Bottleneck. Interesting point: upper bound using the empirical distribution of the training data.
- On the Information Bottleneck Theory of Deep Learning
- The Information Bottleneck Problem and Its Applications in Machine Learning
- Information Bottleneck Co-clustering
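All of these papers optimize variants of the same objective: compress $X$ into a representation $T$ while preserving information about $Y$,

$$\min_{p(t \mid x)} \; I(X; T) - \beta\, I(T; Y),$$

where $\beta$ trades off compression against prediction.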
Feature Interactions
- Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport
- Interpreting Hierarchical Linguistic Interactions in DNNs
- Explaining Explanations: Axiomatic Feature Interactions for Deep Networks
- Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT
- The Shapley Taylor Interaction Index
- Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection. Gradient-based method + LIME.
- Detecting Statistical Interactions from Neural Network Weights. Analyzes feedforward neural networks.
- How does this interaction affect me? Interpretable attribution for feature interactions
- Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer
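A quick way to see what these methods quantify: features $i \neq j$ interact when the effect of changing $x_i$ depends on $x_j$, i.e. the mixed partial $\partial^2 f / \partial x_i \partial x_j$ is nonzero. A hedged finite-difference sketch (the function `f` and step size are illustrative):

```python
import numpy as np

def pairwise_interaction(f, x, i, j, h=1e-3):
    """Estimate the mixed partial d^2 f / dx_i dx_j at x (for i != j) by central
    differences; a nonzero value means i and j interact rather than act additively."""
    x = np.asarray(x, dtype=float)
    def shift(di, dj):
        z = x.copy()
        z[i] += di * h
        z[j] += dj * h
        return f(z)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * h**2)

# Example: f(x) = x[0] * x[1] gives interaction ~1 between features 0 and 1,
# while the additive f(x) = x[0] + x[1] gives ~0.
```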
Influence Functions
Find supporting training examples as explanations (see the influence formula below).
- Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
- On Second-Order Group Influence Functions for Black-Box Predictions
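For reference, both papers build on the influence-function approximation of Koh & Liang: the effect of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$ is

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta).$$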
Interpretation for Analyzing Tasks
- Designing and Interpreting Probes with Control Tasks. Explores whether ELMo representations really encode the linguistic structure needed for downstream tasks (e.g., POS tagging).
- Information-Theoretic Probing for Linguistic Structure. Discusses the structure of the information encoded in BERT.
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
Explainability Evaluation
- The Struggles and Subjectivity of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets
- Explainability Fact Sheets: A Framework for Systematic Assessment of Explainable Approaches. An overview of explainable AI.
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Explores whether explanations help user judgments; proposes a way to generate counterfactual examples.
- Aligning Faithful Interpretations with their Social Attribution. On the faithfulness of interpretations; criticizes how existing explanation methods evaluate faithfulness.
- When Explanations Lie: Why Many Modified BP Attributions Fail. Shows the saliency map is determined only by the first layers.
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Interpretation Methods
- Interpretable Neural Predictions with Differentiable Binary Variables. Very similar to our previous idea of using a Beta distribution with reparameterization. Tests on both SST and NLI, but the word-pair (attention) level did not work well.
- Explaining Groups of Points in Low-Dimensional Representations. Leverages the model that learned the low-dimensional representation to identify the key differences between groups; defines Global Counterfactual Explanations.
- Concept Bottleneck Models. Utilizes human-annotated concepts to guide model training.
- Cost-Effective Interactive Attention Learning with Neural Attention Processes. An interactive learning framework with human annotators.
- FastSHAP: Real-Time Shapley Value Estimation
- Improving Deep Learning Interpretability by Saliency Guided Training
- Machine Learning Explainability for External Stakeholders
- SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers (concept-level)
- Contrastive Explanations for Model Interpretability
- Annotators with Attitudes: How Annotator Beliefs and Identities Bias Toxic Language Detection
- Information-Theoretic Measures of Dataset Difficulty
Meta-learning for Few-Shot Text Classification
- Few-Shot Text Classification with Distributional Signatures. Another application of integrating an additional layer to learn word importance during training.
- Diverse Few-Shot Text Classification with Multiple Metrics. Detects similar tasks and shares the encoder.
Shapley Values for Interpretation
- Interpreting Hierarchical Linguistic Interactions in DNNs
- The Many Shapley Values for Model Explanation
- Problems with Shapley-value-based explanations as feature importance measures
- Efficient nonparametric statistical inference on population feature importance using Shapley values
- The Shapley Taylor Interaction Index
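As a reference for this section: the Shapley value of feature $i$ averages its marginal contribution over feature orderings. Exact computation is exponential, so it is usually estimated by sampling permutations. A hedged sketch, where the value function `v` is an assumption (e.g., model output on an input with only the coalition's features unmasked):

```python
import random

def shapley_monte_carlo(v, n_features, i, n_samples=1000, seed=0):
    """Estimate the Shapley value of feature i for set function v
    by averaging its marginal contribution over random orderings."""
    rng = random.Random(seed)
    features = list(range(n_features))
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(features)
        pos = features.index(i)
        before = frozenset(features[:pos])      # coalition preceding i
        total += v(before | {i}) - v(before)    # marginal contribution of i
    return total / n_samples

# Sanity check: for the additive game v(S) = len(S),
# every feature's estimated Shapley value is exactly 1.
```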
Improving Transparency
- Transparency Promotion with Model-Agnostic Linear Competitors. Balances the trade-off between transparency and predictive performance via a hybrid framework combining a neural network with a linear model.
Rationales
- Invariant Rationalization. Removes spuriously correlated rationales.
Variational Information Bottleneck-Based Methods
- Restricting the Flow: Information Bottlenecks for Attribution
- A Game Theoretic Approach to Class-wise Selective Rationalization
- Explaining A Black-box By Using A Deep Variational Information Bottleneck Approach
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data
- Variational Information Planning for Sequential Decision Making
- Towards a Deep and Unified Understanding of Deep Neural Models in NLP
- Specializing Word Embeddings (for Parsing) by Information Bottleneck
- Deep Variational Information Bottleneck
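For reference, the deep VIB objective (the last entry above) optimizes a variational bound on the IB Lagrangian, with encoder $q(z \mid x)$, decoder $p(y \mid z)$, and prior $r(z)$:

$$\mathcal{L}_{\text{VIB}} = \mathbb{E}_{q(z \mid x)}\big[-\log p(y \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q(z \mid x)\,\|\, r(z)\big).$$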
Learn Masks for Interpretation
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Towards Explanation of DNN-based Prediction with Guided Feature Inversion
- Self-Supervised Discovering of Causal Features: Towards Interpretable Reinforcement Learning
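These methods share a template: learn a (soft or discrete) mask $m \in [0,1]^d$ over input features so the model's prediction is preserved while the mask stays sparse. A minimal sketch of the continuous relaxation; the loss weight, step count, and learning rate are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def learn_mask(model, x, target, steps=200, lam=0.01, lr=0.1):
    """Optimize a sigmoid-parameterized mask so the masked input keeps the
    target prediction while the mask stays sparse (L1 penalty)."""
    mask_logits = torch.zeros_like(x, requires_grad=True)  # mask parameters
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask_logits)                     # soft mask in [0, 1]
        pred = model(x * m)                                # model on masked input
        loss = F.cross_entropy(pred, target) + lam * m.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()             # feature importance
```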
Learning with Rationales
- Learning Credible Models
- Learning Credible Deep Neural Networks with Rationale Regularization
- Incorporating Priors with Feature Attribution on Text Classification
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
- Deriving Machine Attention from Human Rationales
- Rationale-Augmented Convolutional Neural Networks for Text Classification