Evaluation
Explanation/Rationale/Reasoning
- GLoRE: Evaluating Logical Reasoning of Large Language Models
- SOCREVAL: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
- Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI
- Evaluating the Impact of Human Explanation Strategies on Human-AI Visual Decision-Making
- Four Principles of Explainable Artificial Intelligence
- Towards Faithful Model Explanation in NLP: A Survey
- Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations
- Using Natural Language Explanations to Rescale Human Judgments
- Challenges in Explanation Quality Evaluation
- INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
- In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning
- RECEVAL: Evaluating Reasoning Chains via Correctness and Informativeness
- Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales
- REV: Information-Theoretic Evaluation of Free-Text Rationales
Bias/Fairness/Factuality
- Multi2Claim: Generating Scientific Claims from Multi-Choice Questions for Scientific Fact-Checking
- FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
- ExpertQA: Expert-Curated Questions and Attributed Answers
- Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models
NLG Evaluation
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- A Critical Evaluation of Evaluations for Long-form Question Answering
- LENS: A Learnable Evaluation Metric for Text Simplification
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- Towards a Unified Multi-Dimensional Evaluator for Text Generation
- Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models
- SALTED: A Framework for SAlient Long-tail Translation Error Detection
- Evaluation of Text Generation: A Survey
- Don’t Take It Literally: An Edit-Invariant Sequence Loss for Text Generation
- BLEURT: Learning Robust Metrics for Text Generation
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
- The Authenticity Gap in Human Evaluation
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- On the Possibilities of AI-Generated Text Detection
- MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers