Interpretability and Explainability - HanjieChen/Reading-List GitHub Wiki
Mechanistic Interpretability
- SelfIE: Self-Interpretation of Large Language Model Embeddings
- A Primer on the Inner Workings of Transformer-based Language Models
- Retrieval Head Mechanistically Explains Long-Context Factuality
- What Does the Knowledge Neuron Thesis Have to Do with Knowledge?
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis
- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
- Universal Neurons in GPT2 Language Models
- Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
- A Toy Model of Universality: Reverse Engineering how Networks Learn Group Operations
- Grokking modular arithmetic
- Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
- Mechanistically Analyzing the Effects of Finetuning on Procedurally Defined Tasks
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks
Interpretable Model
Human Perspective
- A Rationale-Centric Framework for Human-in-the-loop Machine Learning
- The Impact of Imperfect XAI on Human-AI Decision-Making
- Explanations Can Reduce Overreliance on AI Systems During Decision-Making
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning
- Designerly Understanding: Information Needs for Model Transparency to Support Design Ideation for AI-Powered User Experience
- Evaluating Saliency Methods for Neural Language Models
- Explanation-Based Human Debugging of NLP Models: A Survey
- Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning
Evaluation
- Testing methods of neural systems understanding
- FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Concept-based
- Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees
- Post-hoc Concept Bottleneck Models
Interpretability, Explainability, Robustness
- Interpretability and Explainability: A Machine Learning Zoo Mini-tour
- On Fast Adversarial Robustness Adaptation in Model-Agnostic Meta-Learning
- Robust Encodings: A Framework for Combating Adversarial Typos
- On the Lack of Robust Interpretability of Neural Text Classifiers
- Concealed Data Poisoning Attacks on NLP Models
- Adversarial Examples Are Not Bugs, They Are Features
- Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
- Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
- Contextualized Perturbation for Textual Adversarial Attack
- Universal Adversarial Triggers for Attacking and Analyzing NLP
- A Closer Look at Accuracy vs. Robustness
- Getting a CLUE: A Method for Explaining Uncertainty Estimates
- Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
- How Does Mixup Help Robustness and Generalization?
- Adversarial Training for Large Neural Language Models
- Pathologies of Neural Models Make Interpretations Difficult
- Robust Attribution Regularization
- Fooling Network Interpretation in Image Classification
- Interpretable Deep Learning under Fire
- Interpreting Adversarial Examples by Activation Promotion and Suppression
- Statistical stability indices for LIME: obtaining reliable explanations for Machine Learning models
- On the (In)fidelity and Sensitivity of Explanations
- Visualizing and Understanding the Effectiveness of BERT
- Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations
- Structured Adversarial Attack: Towards General Implementation and Better Interpretability
- A simple defense against adversarial attacks on heatmap explanations
- Smoothed Geometry for Robust Attribution
- Fairwashing Explanations with Off-Manifold Detergent
- Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
- Gradient-based Analysis of NLP Models is Manipulable
- Interpretation of Neural Networks Is Fragile
- Towards Robust Explanations for Deep Neural Networks
- Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients
- Interpretable Adversarial Perturbation in Input Embedding Space for Text
- Adversarial Training Methods for Semi-Supervised Text Classification
- Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training
- Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications
- Towards Robust Interpretability with Self-Explaining Neural Networks
- Searching for a Search Method: Benchmarking Search Algorithms for Generating NLP Adversarial Examples
- TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
- Reevaluating Adversarial Examples in Natural Language
- Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples
- Explanations can be manipulated and geometry is to blame
- Fooling Neural Network Interpretations via Adversarial Model Manipulation
- Concise Explanations of Neural Networks using Adversarial Training. Makes explanations sparse via adversarial training.
- Proper Network Interpretability Helps Adversarial Robustness in Classification. Makes interpretations robust to adversarial attacks.
- Informative Dropout for Robust Representation Learning: A Shape-bias Perspective. Interesting work utilizing self-information.
- Robust and Stable Black Box Explanations
Information Bottleneck
- An Information Bottleneck Approach for Controlling Conciseness in Rationale Extraction
- Supervised feature selection by clustering using conditional mutual information-based distances
- The information bottleneck method
- Multivariate Information Bottleneck
- Agglomerative Information Bottleneck
- Agglomerative Multivariate Information Bottleneck
- Nonlinear Information Bottleneck. Interesting point: an upper bound using the empirical distribution of the training data.
- On the Information Bottleneck Theory of Deep Learning
- The Information Bottleneck Problem and Its Applications in Machine Learning
- Information Bottleneck Co-clustering
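For orientation, the objective underlying the papers in this section is the information bottleneck Lagrangian, which compresses an input X into a representation T while preserving information about a target Y:

```latex
\min_{p(t \mid x)} \; I(X;T) - \beta \, I(T;Y)
```

Here β trades off compression of X against prediction of Y; the papers above vary in how they estimate or bound the two mutual-information terms.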
Feature Interactions
- Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport
- Interpreting Hierarchical Linguistic Interactions in DNNs
- Explaining Explanations: Axiomatic Feature Interactions for Deep Networks
- Perturbed Masking: Parameter-free Probing for Analyzing and Interpreting BERT
- The Shapley Taylor Interaction Index
- Feature Interaction Interpretability: A Case for Explaining Ad-Recommendation Systems via Neural Interaction Detection. Gradient-based method + LIME.
- Detecting Statistical Interactions from Neural Network Weights. Analyzes feedforward neural networks.
- How does this interaction affect me? Interpretable attribution for feature interactions
- Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer
Influence Functions
Find supporting training examples as explanations
- Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
- On Second-Order Group Influence Functions for Black-Box Predictions
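As a reference point for both papers, the classical first-order influence-function score estimates how upweighting a training example z changes the loss at a test example z_test, where θ̂ is the trained parameter vector and H the Hessian of the training loss at θ̂:

```latex
\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})
```

Training examples with the largest-magnitude scores are reported as the most supportive (or most harmful) for a given prediction; the second paper extends this beyond single examples to groups.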
Interpretation for analyzing tasks
- Designing and Interpreting Probes with Control Tasks. Explores whether ELMo representations really encode the linguistic structure needed for downstream tasks (e.g., POS tagging).
- Information-Theoretic Probing for Linguistic Structure. Discusses the structure of the information encoded in BERT.
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
Explainability Evaluation
- The Struggles and Subjectivity of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets
- Explainability Fact Sheets: A Framework for Systematic Assessment of Explainable Approaches. An overview of explainable AI.
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? Explores whether explanations can help user judgments; proposes a way to generate counterfactual examples.
- Aligning Faithful Interpretations with their Social Attribution. On the faithfulness of interpretations; criticizes how existing explanation methods evaluate faithfulness.
- When Explanations Lie: Why Many Modified BP Attributions Fail. The saliency map is determined only by the first layers.
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data
Interpretation Methods
- Interpretable Neural Predictions with Differentiable Binary Variables. Very similar to our previous idea of using a Beta distribution with reparameterization. Tested on both SST and NLI, but the word-pair-level (attention) variant did not work well.
- Explaining Groups of Points in Low-Dimensional Representations. Leverages the model that learned the low-dimensional representation to identify the key differences between groups; defines Global Counterfactual Explanations.
- Concept Bottleneck Models. Utilizes human-annotated concepts to guide model training.
- Cost-Effective Interactive Attention Learning with Neural Attention Processes. An interactive learning framework with human annotators.
- FastSHAP: Real-Time Shapley Value Estimation
- Improving Deep Learning Interpretability by Saliency Guided Training
- Machine Learning Explainability for External Stakeholders
- SelfExplain: A Self-Explaining Architecture for Neural Text Classifiers (concept-level)
- Contrastive Explanations for Model Interpretability
- Annotators with Attitudes: How Annotator Beliefs and Identities Bias Toxic Language Detection
- Information-Theoretic Measures of Dataset Difficulty
Meta-learning for Few-Shot Text Classification
- Few-Shot Text Classification with Distributional Signatures. Another application of integrating an additional layer that learns word importance during training.
- Diverse Few-Shot Text Classification with Multiple Metrics. Detects similar tasks and shares the encoder among them.
Shapley Values for Interpretation
- Interpreting Hierarchical Linguistic Interactions in DNNs
- The Many Shapley Values for Model Explanation
- Problems with Shapley-value-based explanations as feature importance measures
- Efficient nonparametric statistical inference on population feature importance using Shapley values
- The Shapley Taylor Interaction Index
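As a common denominator for the papers above, the exact Shapley value averages a feature's marginal contribution over all subsets of the remaining features. A minimal sketch (the toy value function `v`, its payoffs, and the interaction bonus are illustrative, not taken from any of the papers):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values for an n-player value function v(S), S a frozenset."""
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                S = frozenset(subset)
                # |S|! * (n-|S|-1)! / n! : probability that i joins right after S
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[i] += w * (v(S | {i}) - v(S))
    return phi

# Toy value function: additive payoffs [1, 2, 3], plus a +4 bonus
# when features 0 and 1 are both present (an interaction term).
def v(S):
    payoff = sum(w for j, w in enumerate([1.0, 2.0, 3.0]) if j in S)
    return payoff + (4.0 if {0, 1} <= S else 0.0)

phi = shapley_values(v, 3)  # the pair bonus is split equally: [3.0, 4.0, 3.0]
```

By the efficiency axiom the values sum to v(N) − v(∅); the exponential cost of this enumeration is exactly what the approximation papers in this section address.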
Improving Transparency
- Transparency Promotion with Model-Agnostic Linear Competitors. Balances the trade-off between transparency and predictive performance via a hybrid framework combining a neural network with a linear model.
Rationales
- Invariant Rationalization. Removes spuriously correlated rationales.
Variational Information Bottleneck based Methods
- Restricting the Flow: Information Bottlenecks for Attribution
- A Game Theoretic Approach to Class-wise Selective Rationalization
- Explaining A Black-box By Using A Deep Variational Information Bottleneck Approach
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Greedy Attack and Gumbel Attack: Generating Adversarial Examples for Discrete Data
- Variational Information Planning for Sequential Decision Making
- Towards a Deep and Unified Understanding of Deep Neural Models in NLP
- Specializing Word Embeddings (for Parsing) by Information Bottleneck
- Deep Variational Information Bottleneck
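These methods typically optimize the variational bound introduced in Deep Variational Information Bottleneck rather than the IB objective directly. Schematically, for an example x with label y, with encoder p(z|x), variational decoder q(y|z), and prior r(z):

```latex
\mathcal{L} = \mathbb{E}_{p(z \mid x)}\left[-\log q(y \mid z)\right] + \beta \, \mathrm{KL}\!\left(p(z \mid x) \,\|\, r(z)\right)
```

The KL term upper-bounds the compression term I(X;Z), and the reconstruction term lower-bounds I(Z;Y), making the objective trainable with the reparameterization trick.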
Learn masks for interpretation
- Learning to Explain: An Information-Theoretic Perspective on Model Interpretation
- Towards Explanation of DNN-based Prediction with Guided Feature Inversion
- Self-Supervised Discovering of Causal Features: Towards Interpretable Reinforcement Learning
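The common recipe in this section: optimize a soft mask over input features so that the masked input preserves the model's output while the mask stays sparse. A minimal numpy sketch on a toy linear model (the model, data, λ, and learning rate are all illustrative assumptions, not from any specific paper):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy "model": a linear scorer; features 1 and 2 are irrelevant (zero weight).
w = np.array([2.0, 0.0, 0.0, 1.0])
x = np.ones(4)
f = lambda inp: w @ inp
target = f(x)

lam = 0.05       # sparsity strength
lr = 0.5
theta = np.zeros(4)              # mask logits; mask m = sigmoid(theta)
for _ in range(1000):
    m = sigmoid(theta)
    err = f(m * x) - target      # keep the prediction of the masked input intact
    # gradient of err^2 + lam * sum(m) w.r.t. theta (chain rule through sigmoid)
    grad = (2.0 * err * w * x + lam) * m * (1.0 - m)
    theta -= lr * grad

mask = sigmoid(theta)  # near 1 on relevant features, near 0 on irrelevant ones
```

The fidelity term keeps mask entries for high-weight features near 1, while the sparsity penalty drives the irrelevant ones toward 0; the papers above replace this toy setup with learned masking networks and black-box models.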
Learning with rationale
- Learning Credible Models
- Learning Credible Deep Neural Networks with Rationale Regularization
- Incorporating Priors with Feature Attribution on Text Classification
- Interpretations are Useful: Penalizing Explanations to Align Neural Networks with Prior Knowledge
- Deriving Machine Attention from Human Rationales
- Rationale-Augmented Convolutional Neural Networks for Text Classification