Interpretability
Circuit Complexity & Transformers
- Relations among Complexity Measures (1979)
- Theoretical Limitations of Self-Attention in Neural Sequence Models (2019)
- On the Power of Saturated Transformers: A View from Circuit Complexity (2021)
- Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity (2022)
- What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment (2022)
- Transformers Learn Shortcuts to Automata (2022)
- The Parallelism Tradeoff: Limitations of Log-Precision Transformers (2023)
- Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective (2023)
- Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2024)
- A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers (2025)
- Exact Expressive Power of Transformers with Padding (2025)
- Transformers Learn to Implement Multi-Step Gradient Descent with Chain of Thought (2025)
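A rough map for this subsection (a summary, not a claim from any single paper): The Parallelism Tradeoff shows log-precision transformers are contained in uniform $\mathsf{TC}^0$, which sits low in the classical hierarchy

$$\mathsf{AC}^0 \subsetneq \mathsf{TC}^0 \subseteq \mathsf{NC}^1 \subseteq \mathsf{L} \subseteq \mathsf{P},$$

so, under standard conjectures, fixed-depth transformers cannot solve problems hard for $\mathsf{NC}^1$ or beyond. The chain-of-thought papers show that emitting intermediate tokens adds serial computation that escapes this bound.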
Steering
- Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
- SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models
- Improving LLM Reasoning through Interpretable Role-Playing Steering
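These steering papers share one mechanism: add a learned or contrast-derived direction to the residual stream at inference time. A minimal sketch with a PyTorch forward hook; the layer choice, scale `alpha`, and how `steering_vector` is derived are illustrative assumptions, not taken from any specific paper:

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float = 4.0):
    """Return a forward hook that adds alpha * steering_vector to the
    residual-stream output of a transformer block."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical): derive steering_vector as the mean difference between
# activations on contrastive prompt pairs, then register the hook on a block:
# handle = model.transformer.h[12].register_forward_hook(
#     make_steering_hook(steering_vector, alpha=4.0))
# ... generate ...
# handle.remove()
```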
Agentic Interpretability
Algorithmic Learning Theory
- Learning Universal Predictors
- Language Modeling Is Compression
- Neural Networks and the Chomsky Hierarchy
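Language Modeling Is Compression builds on a classical identity worth keeping in mind: paired with arithmetic coding, a language model compresses a sequence $x_{1:n}$ to roughly its negative log-likelihood in bits, so better next-token prediction is literally better compression:

$$\ell(x_{1:n}) \approx -\sum_{i=1}^{n} \log_2 p(x_i \mid x_{<i}) \ \text{bits}.$$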
Rate-Distortion Theory
- Segmentation of Multivariate Mixed Data via Lossy Data Coding and Compression
- Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction
- White-Box Transformers via Sparse Rate Reduction
- Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
- Scaling White-Box Transformers for Vision
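The rate-reduction line above all optimizes the coding-rate objective. For reference (notation follows the MCR² paper), with features $Z \in \mathbb{R}^{d \times n}$ and class-membership matrices $\Pi = \{\Pi_j\}$, the rate reduction is

$$\Delta R(Z, \Pi, \epsilon) = \frac{1}{2}\log\det\!\Big(I + \tfrac{d}{n\epsilon^2} ZZ^\top\Big) \;-\; \sum_{j} \frac{\operatorname{tr}(\Pi_j)}{2n} \log\det\!\Big(I + \tfrac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2} Z\Pi_j Z^\top\Big),$$

maximized so that features are globally diverse but compact within each class. The white-box transformer papers derive attention-like layers by unrolling optimization of a sparsified variant of this objective.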
Information-Theoretic Analysis
- How much do language models memorize?
- Adaptive Language Modeling Using the Maximum Entropy Principle
- Adaptive Statistical Language Modeling: A Maximum Entropy Approach
- Layer by Layer: Uncovering Hidden Representations in Language Models
- Measuring the Mixing of Contextual Information in the Transformer
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
- Multi-View Information Bottleneck without Variational Approximation
- Information Flow Routes: Automatically Interpreting Language Models at Scale
- Unveiling the Dynamics of Information Interplay in Supervised Learning
- L2M: Mutual Information Scaling Law for Long-Context Language Modeling
- Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
- Approaches to Information-Theoretic Analysis of Neural Activity
- An Information-Theoretic Framework for Deep Learning
- An Information Theoretic Interpretation to Deep Neural Networks
- Information-Theoretic Generalization Bounds for Deep Neural Networks
Mechanistic Interpretability
- Weight-sparse transformers have interpretable circuits
- Superposition Yields Robust Neural Scaling
- Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
- Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models
- Towards Interpreting Visual Information Processing in Vision-Language Models
- Open Problems in Mechanistic Interpretability
- What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
- Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
- The Geometry of Concepts: Sparse Autoencoder Feature Structure
- Bilinear MLPs Enable Weight-Based Mechanistic Interpretability
- The Persian Rug: Solving Toy Models of Superposition Using Large-Scale Symmetries
- Decomposing the Dark Matter of Sparse Autoencoders
- Mechanistic Interpretability for AI Safety: A Review
- Automatically Interpreting Millions of Features in Large Language Models
- Sparse Crosscoders for Cross-Layer Features and Model Diffing
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Efficient Dictionary Learning with Switch Sparse Autoencoders
- Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
- DILA: Dictionary Label Attention for Mechanistic Interpretability in High-dimensional Multi-label Medical Coding Prediction
- Polysemanticity and Capacity in Neural Networks
- An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs
- Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
- Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
- How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
- Tracr: Compiled Transformers as a Laboratory for Interpretability
- Locating and Editing Factual Associations in GPT
- Eliciting Latent Predictions from Transformers with the Tuned Lens
- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
- Toy Models of Superposition
- Transcoders enable fine-grained interpretable circuit analysis for language models
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
- SelfIE: Self-Interpretation of Large Language Model Embeddings
- A Primer on the Inner Workings of Transformer-based Language Models
- Retrieval Head Mechanistically Explains Long-Context Factuality
- What Does the Knowledge Neuron Thesis Have to Do with Knowledge?
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- Universal Neurons in GPT2 Language Models
- Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
- A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations
- Grokking modular arithmetic
- Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- Mechanistically Analyzing the Effects of Finetuning on Procedurally Defined Tasks
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
- Efficient sparse coding algorithms
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
- Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- Automatically Identifying Local and Global Circuits with Linear Computation Graphs
- Scaling and Evaluating Sparse Autoencoders (OpenAI)
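Several entries in this section (Towards Monosemanticity, SAEBench, Switch SAEs, the OpenAI scaling paper) rest on the same object: a sparse autoencoder trained to reconstruct residual-stream activations under an L1 sparsity penalty. A minimal sketch; dimensions and coefficients are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_model activations -> d_dict sparse features."""
    def __init__(self, d_model: int = 768, d_dict: int = 768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = F.relu(self.encoder(x))   # nonnegative feature activations
        x_hat = self.decoder(f)       # linear reconstruction from features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging few active features.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Training step (illustrative), where x is a batch of residual-stream activations:
# x_hat, f = sae(x); loss = sae_loss(x, x_hat, f); loss.backward(); opt.step()
```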
Prompting-based
- STRUX: An LLM for Decision-Making with Structured Explanations
- Towards LLM-guided Causal Explainability for Black-box Text Classifiers
- Are self-explanations from Large Language Models faithful?
- Make Your Decision Convincing! A Unified Two-Stage Framework: Self-Attribution and Decision-Making
- Post Hoc Explanations of Language Models Can Improve Language Models
- Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges
Human Perspective
- A Rationale-Centric Framework for Human-in-the-loop Machine Learning
- The Impact of Imperfect XAI on Human-AI Decision-Making
- Explanations Can Reduce Overreliance on AI Systems During Decision-Making
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning
- Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience
- Evaluating Saliency Methods for Neural Language Models
- Explanation-Based Human Debugging of NLP Models: A Survey
- Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning
Evaluation
- Aligned Probing: Relating Toxic Behavior and Model Internals
- Measuring Faithfulness of Chains of Thought by Unlearning Reasoning Steps
- Testing methods of neural systems understanding
- FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Concept-based
- Measuring the Mixing of Contextual Information in the Transformer
- Explaining How Transformers Use Context to Build Predictions
- ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability
- Explaining Language Model Predictions with High-Impact Concepts
- Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees
- Post-hoc Concept Bottleneck Models
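Concept bottleneck models (the last entry above, and the Concept Bottleneck Models paper listed under Interpretation Methods) route every prediction through human-interpretable concepts. A minimal sketch with hypothetical dimensions:

```python
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    """Predict concepts first, then predict the label only from concepts,
    so each prediction can be inspected and intervened on at the concept layer."""
    def __init__(self, d_in: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_net = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                         nn.Linear(256, n_concepts))
        self.label_net = nn.Linear(n_concepts, n_classes)  # sees only concepts

    def forward(self, x):
        concepts = self.concept_net(x).sigmoid()  # supervised with concept labels
        return self.label_net(concepts), concepts
```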
Interpretability, Explainability, Robustness
- TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
- From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
- Causal Abstraction for Faithful Model Interpretation
- Interpretability and Explainability: A Machine Learning Zoo Mini-tour
- On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning
- Robust Encodings: A Framework for Combating Adversarial Typos
- On the Lack of Robust Interpretability of Neural Text Classifiers
- Concealed Data Poisoning Attacks on NLP Models
- Adversarial Examples Are Not Bugs, They Are Features
- Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
- Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
- Contextualized Perturbation for Textual Adversarial Attack
- Universal Adversarial Triggers for Attacking and Analyzing NLP
- A Closer Look at Accuracy vs. Robustness
- Getting a CLUE: A Method for Explaining Uncertainty Estimates
- Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
- How Does Mixup Help Robustness and Generalization?
- Adversarial Training for Large Neural Language Models
- Pathologies of Neural Models Make Interpretations Difficult
- Robust Attribution Regularization
- Fooling Network Interpretation in Image Classification
- Interpretable Deep Learning under Fire
- Interpreting Adversarial Examples by Activation Promotion and Suppression
- Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
- On the (In)fidelity and Sensitivity of Explanations
- Visualizing and Understanding the Effectiveness of BERT
- Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
- Structured Adversarial Attack: Towards General Implementation and Better Interpretability
- A simple defense against adversarial attacks on heatmap explanations
- Smoothed Geometry for Robust Attribution
- Fairwashing Explanations with Off-Manifold Detergent
- Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
- Gradient-based Analysis of NLP Models is Manipulable
- Interpretation of Neural Networks Is Fragile
- Towards Robust Explanations for Deep Neural Networks
- Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
- Interpretable Adversarial Perturbation in Input Embedding Space for Text
- Adversarial Training Methods for Semi-Supervised Text Classification
- Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training
- Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications
- Towards Robust Interpretability with Self-Explaining Neural Networks
- Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples
- TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
- Reevaluating Adversarial Examples in Natural Language
- Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples
- Explanations can be manipulated and geometry is to blame
- Fooling Neural Network Interpretations via Adversarial Model Manipulation
- Concise Explanations of Neural Networks Using Adversarial Training. Makes explanations sparse via adversarial training.
- Proper Network Interpretability Helps Adversarial Robustness in Classification. Makes interpretations robust to adversarial attacks.
- Informative Dropout for Robust Representation Learning: A Shape-bias Perspective. Interesting work utilizing self-information.
- Robust and Stable Black Box Explanations
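For reference, the canonical attack underlying many of the adversarial-robustness entries above is the fast gradient sign method (FGSM), which perturbs the input in the direction that maximally increases the loss:

$$x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x \mathcal{L}(\theta, x, y)\big).$$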
Information Bottleneck
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction
- Supervised feature selection by clustering using conditional mutual information-based distances
- The information bottleneck method
- Multivariate Information Bottleneck
- Agglomerative Information Bottleneck
- Agglomerative Multivariate Information Bottleneck
- Nonlinear Information Bottleneck. Interesting point: upper bound using the empirical distribution of the training data.
- On the Information Bottleneck Theory of Deep Learning
- The Information Bottleneck Problem and Its Applications in Machine Learning
- Information Bottleneck Co-clustering
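All of these papers optimize variants of the same objective: compress $X$ into a representation $T$ while preserving information about $Y$,

$$\min_{p(t \mid x)} \; I(X; T) - \beta\, I(T; Y),$$

where $\beta$ trades off compression against prediction.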
Feature Interactions
- Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport
- Interpreting Hierarchical Linguistic Interactions in DNNs
- Explaining Explanations: Axiomatic Feature Interactions for Deep Networks
- Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT
- The Shapley Taylor Interaction Index
- Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection. Gradient-based method + LIME.
- Detecting Statistical Interactions from Neural Network Weights. Analyzes feedforward neural networks.
- How does this interaction affect me? Interpretable attribution for feature interactions
- Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer
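A quick way to see what these methods quantify: features $i \neq j$ interact when the effect of changing $x_i$ depends on $x_j$, i.e. the mixed partial $\partial^2 f / \partial x_i \partial x_j$ is nonzero. A hedged finite-difference sketch (the function `f` and step size are illustrative):

```python
import numpy as np

def pairwise_interaction(f, x, i, j, h=1e-3):
    """Estimate the mixed partial d^2 f / dx_i dx_j at x (for i != j) by central
    differences; a nonzero value means i and j interact rather than act additively."""
    x = np.asarray(x, dtype=float)
    def shift(di, dj):
        z = x.copy()
        z[i] += di * h
        z[j] += dj * h
        return f(z)
    return (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * h**2)

# Example: f(x) = x[0] * x[1] gives interaction ~1 between features 0 and 1,
# while the additive f(x) = x[0] + x[1] gives ~0.
```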
Influence Functions
Find supporting training examples as explanations (see the influence formula below).
- Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
- On Second-Order Group Influence Functions for Black-Box Predictions
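For reference, both papers build on the influence-function approximation of Koh & Liang: the effect of upweighting a training point $z$ on the loss at a test point $z_{\text{test}}$ is

$$\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta L(z, \hat\theta), \qquad H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat\theta).$$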
Interpretation for Analyzing Tasks
- Designing and Interpreting Probes with Control Tasks. Explores whether ELMo representations really encode the linguistic structure needed for downstream tasks (e.g., POS tagging).
- Information-Theoretic Probing for Linguistic Structure. Discusses the structure of the information encoded in BERT.
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
Explainability Evaluation
- The Struggles and Subjectivity of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets
- Explainability Fact Sheets: A Framework for Systematic Assessment of Explainable Approaches. An overview of explainable AI.
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Explores whether explanations help user judgments; proposes a way to generate counterfactual examples.
- Aligning Faithful Interpretations with their Social Attribution. On the faithfulness of interpretations; criticizes how existing explanation methods evaluate faithfulness.
- When Explanations Lie: Why Many Modified BP Attributions Fail. Shows the saliency map is determined only by the first layers.
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Interpretation Methods
- Interpretable Neural Predictions with Differentiable Binary Variables. Very similar to our previous idea of using a Beta distribution with reparameterization. Tests on both SST and NLI, but the word-pair (attention) level did not work well.
- Explaining Groups of Points in Low-Dimensional Representations. Leverages the model that learned the low-dimensional representation to identify the key differences between groups; defines Global Counterfactual Explanations.
- Concept Bottleneck Models. Utilizes human-annotated concepts to guide model training.
- Cost-Effective Interactive Attention Learning with Neural Attention Processes. An interactive learning framework with human annotators.
- FastSHAP: Real-Time Shapley Value Estimation
- Improving Deep Learning Interpretability by Saliency Guided Training
- Machine Learning Explainability for External Stakeholders
- SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers (concept-level)
- Contrastive Explanations for Model Interpretability
- Annotators with Attitudes: How Annotator Beliefs and Identities Bias Toxic Language Detection
- Information-Theoretic Measures of Dataset Difficulty
Meta-learning for Few-Shot Text Classification
- Few-Shot Text Classification with Distributional Signatures. Another application of integrating an additional layer to learn word importance during training.
- Diverse Few-Shot Text Classification with Multiple Metrics. Detects similar tasks and shares the encoder.
Shapley Values for Interpretation
- Interpreting Hierarchical Linguistic Interactions in DNNs
- The Many Shapley Values for Model Explanation
- Problems with Shapley-value-based explanations as feature importance measures
- Efficient nonparametric statistical inference on population feature importance using Shapley values
- The Shapley Taylor Interaction Index
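As a reference for this section: the Shapley value of feature $i$ averages its marginal contribution over feature orderings. Exact computation is exponential, so it is usually estimated by sampling permutations. A hedged sketch, where the value function `v` is an assumption (e.g., model output on an input with only the coalition's features unmasked):

```python
import random

def shapley_monte_carlo(v, n_features, i, n_samples=1000, seed=0):
    """Estimate the Shapley value of feature i for set function v
    by averaging its marginal contribution over random orderings."""
    rng = random.Random(seed)
    features = list(range(n_features))
    total = 0.0
    for _ in range(n_samples):
        rng.shuffle(features)
        pos = features.index(i)
        before = frozenset(features[:pos])      # coalition preceding i
        total += v(before | {i}) - v(before)    # marginal contribution of i
    return total / n_samples

# Sanity check: for the additive game v(S) = len(S),
# every feature's estimated Shapley value is exactly 1.
```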
Improving Transparency
- Transparency Promotion with Model-Agnostic Linear Competitors. Balances the trade-off between transparency and predictive performance via a hybrid framework combining a neural network with a linear model.
Rationales
- Invariant Rationalization. Removes spuriously correlated rationales.
Variational Information Bottleneck-Based Methods
- Restricting the Flow: Information Bottlenecks for Attribution
- A Game Theoretic Approach to Class-wise Selective Rationalization
- Explaining A Black-box By Using A Deep Variational Information Bottleneck Approach
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data
- Variational Information Planning for Sequential Decision Making
- Towards a Deep and Unified Understanding of Deep Neural Models in NLP
- Specializing Word Embeddings (for Parsing) by Information Bottleneck
- Deep Variational Information Bottleneck
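For reference, the deep VIB objective (the last entry above) optimizes a variational bound on the IB Lagrangian, with encoder $q(z \mid x)$, decoder $p(y \mid z)$, and prior $r(z)$:

$$\mathcal{L}_{\text{VIB}} = \mathbb{E}_{q(z \mid x)}\big[-\log p(y \mid z)\big] + \beta\, D_{\mathrm{KL}}\big(q(z \mid x)\,\|\, r(z)\big).$$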
Learn Masks for Interpretation
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Towards Explanation of DNN-based Prediction with Guided Feature Inversion
- Self-Supervised Discovering of Causal Features: Towards Interpretable Reinforcement Learning
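These methods share a template: learn a (soft or discrete) mask $m \in [0,1]^d$ over input features so the model's prediction is preserved while the mask stays sparse. A minimal sketch of the continuous relaxation; the loss weight, step count, and learning rate are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def learn_mask(model, x, target, steps=200, lam=0.01, lr=0.1):
    """Optimize a sigmoid-parameterized mask so the masked input keeps the
    target prediction while the mask stays sparse (L1 penalty)."""
    mask_logits = torch.zeros_like(x, requires_grad=True)  # mask parameters
    opt = torch.optim.Adam([mask_logits], lr=lr)
    for _ in range(steps):
        m = torch.sigmoid(mask_logits)                     # soft mask in [0, 1]
        pred = model(x * m)                                # model on masked input
        loss = F.cross_entropy(pred, target) + lam * m.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logits).detach()             # feature importance
```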
Learning with Rationales
- Learning Credible Models
- Learning Credible Deep Neural Networks with Rationale Regularization
- Incorporating Priors with Feature Attribution on Text Classification
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
- Deriving Machine Attention from Human Rationales
- Rationale-Augmented Convolutional Neural Networks for Text Classification