Evaluation
Explanation/Rationale/Reasoning
- GLoRE: Evaluating Logical Reasoning of Large Language Models
- SOCREVAL: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation
- Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI
- Evaluating the Impact of Human Explanation Strategies on Human-AI Visual Decision-Making
- Four Principles of Explainable Artificial Intelligence
- Towards Faithful Model Explanation in NLP: A Survey
- Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations
- Using Natural Language Explanations to Rescale Human Judgments
- Challenges in Explanation Quality Evaluation
- INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning
- In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making
- Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning
- RECEVAL: Evaluating Reasoning Chains via Correctness and Informativeness
- Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales
- REV: Information-Theoretic Evaluation of Free-Text Rationales
Bias/Fairness/Factuality
- Multi2Claim: Generating Scientific Claims from Multi-Choice Questions for Scientific Fact-Checking
- FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
- ExpertQA: Expert-Curated Questions and Attributed Answers
- Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models
NLG Evaluation
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- A Critical Evaluation of Evaluations for Long-form Question Answering
- LENS: A Learnable Evaluation Metric for Text Simplification
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
- Towards a Unified Multi-Dimensional Evaluator for Text Generation
- Operationalizing Specifications, In Addition to Test Sets for Evaluating Constrained Generative Models
- SALTED: A Framework for SAlient Long-tail Translation Error Detection
- Evaluation of Text Generation: A Survey
- Don’t Take It Literally: An Edit-Invariant Sequence Loss for Text Generation
- BLEURT: Learning Robust Metrics for Text Generation
- Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation
- The Authenticity Gap in Human Evaluation
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
- On the Possibilities of AI-Generated Text Detection
- MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers