QA-Evaluator

Overview


Score range	Float [0-1] for the F1 score evaluator: the higher the score, the more similar the response is to the ground truth. Integer [1-5] for the AI-assisted quality evaluators for question-answering (QA) scenarios, where 1 is bad and 5 is good.
What is this metric?	Comprehensively measures the groundedness, relevance, coherence, and fluency of a response in QA scenarios, as well as the textual similarity between the response and its ground truth.
How does it work?	The QA evaluator combines prompt-based AI-assisted evaluators, which use a language model as a judge of the response to a user query: `GroundednessEvaluator` (requires the `context` input), `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`, and `SimilarityEvaluator` (requires the `ground_truth` input). It also includes a natural language processing (NLP) metric, `F1ScoreEvaluator`, which computes the F1 score on tokens shared between the response and its ground truth. See the definitions and scoring rubrics for these AI-assisted evaluators and the F1 score evaluator.
When to use it?	Use it when assessing the readability and user-friendliness of your model's generated responses in real-world applications.
What does it need as input?	Query, Response, Context, Ground Truth
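
To make the F1 component concrete, here is a minimal sketch of an F1 score over shared tokens. It is an illustration only, not the `F1ScoreEvaluator` implementation: the function name `token_f1` is hypothetical, and the lowercase, whitespace-based tokenization is an assumption (the real evaluator may normalize text differently).

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Illustrative token-overlap F1 in [0, 1]; higher means more similar.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as many
    # times as it occurs in both texts.
    shared = Counter(response_tokens) & Counter(truth_tokens)
    num_shared = sum(shared.values())
    if num_shared == 0:
        return 0.0

    precision = num_shared / len(response_tokens)  # shared / response length
    recall = num_shared / len(truth_tokens)        # shared / ground-truth length
    return 2 * precision * recall / (precision + recall)
```

Because the score depends only on token overlap, word order does not matter: a response that reorders the ground-truth words still scores the maximum, which is one reason the QA evaluator pairs this metric with the AI-assisted evaluators.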

Version: 2