models qna rag metrics eval - Azure/azureml-assets GitHub Wiki

qna-rag-metrics-eval

Overview

The Q&A RAG (Retrieval Augmented Generation) evaluation flow evaluates Q&A RAG systems by using state-of-the-art Large Language Models (LLMs) to measure the quality and safety of your responses. Using a GPT model to assist with these measurements aims to achieve higher agreement with human evaluations than traditional mathematical measurements do.

Inference samples

| Inference type | CLI | VS Code Extension |
|--|--|--|
| Real time | deploy-promptflow-model-cli-example | deploy-promptflow-model-vscode-extension-example |
| Batch | N/A | N/A |

Sample inputs and outputs (for real-time inference)

Sample input

{
    "inputs": {
        "question": "What is the purpose of the LLM Grounding Score, and what does a higher score mean in this context?",
        "answer": "The LLM Grounding Score gauges an LLM's grasp of provided context in in-context learning. A higher score implies better understanding and more accurate responses.",
        "metrics": "gpt_groundedness,gpt_retrieval_score,gpt_relevance",
        "documents": "{'documents': [{'[doc1]': {'title': 'In-Context Learning with Large-Scale Pretrained Language Models',\r'content': 'In-Context Learning uses large pretrained models to acquire new skills. GPT-3 introduced this, achieving accuracy similar to fine-tuned models. Prompt order and similar training examples affect performance. Retrievers locate exemplary few-shot examples, with semantic similarity fine-tuning. Advanced retriever use includes code generation, but 'fantastic' examples assumption has task-specific limitations.'}}]}"
    }
}
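The request body above can be assembled programmatically. The following is a minimal sketch, not part of the flow itself: `build_payload` and `SAMPLE_METRICS` are hypothetical names, and the sketch assumes (based only on the sample) that `metrics` is a comma-separated string and that `documents` is passed as an already-serialized string.

```python
import json

# Metric names copied from the sample input; whether the flow accepts
# other metric names is not stated on this page.
SAMPLE_METRICS = ("gpt_groundedness", "gpt_retrieval_score", "gpt_relevance")


def build_payload(question, answer, documents, metrics=SAMPLE_METRICS):
    """Return the request body as a JSON string (hypothetical helper)."""
    # The sample passes documents as one string, so no re-serialization
    # is attempted here; that choice is an assumption.
    if not isinstance(documents, str):
        raise TypeError("documents must be a pre-serialized string")
    body = {
        "inputs": {
            "question": question,
            "answer": answer,
            "metrics": ",".join(metrics),  # comma-separated, as in the sample
            "documents": documents,
        }
    }
    return json.dumps(body, indent=4)
```

Calling `build_payload("...", "...", "{'documents': []}")` yields a string with the same top-level shape as the sample input, ready to be sent to a deployed endpoint.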

Sample output

{
    "outputs": {
        "gpt_groundedness":5,
        "gpt_relevance":5,
        "gpt_retrieval_score":1
    }
}
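A response of this shape can be post-processed to flag weak metrics. The sketch below is illustrative only: `low_scores` is a hypothetical helper, and the threshold of 3 is an arbitrary assumption, since this page does not document the score range.

```python
import json


def low_scores(response_text, threshold=3):
    """Return {metric: score} for metrics scoring below threshold (hypothetical helper)."""
    outputs = json.loads(response_text)["outputs"]
    return {name: score for name, score in outputs.items() if score < threshold}


# Same values as the sample output above.
sample = '{"outputs": {"gpt_groundedness": 5, "gpt_relevance": 5, "gpt_retrieval_score": 1}}'
print(low_scores(sample))  # {'gpt_retrieval_score': 1}
```

For the sample output, only `gpt_retrieval_score` falls below the assumed threshold.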

Version: 7

View in Studio: https://ml.azure.com/registries/azureml/models/qna-rag-metrics-eval/version/7

Properties

is-promptflow: True

azureml.promptflow.section: gallery

azureml.promptflow.type: evaluate

azureml.promptflow.name: QnA RAG Evaluation

azureml.promptflow.description: Compute the quality of the answer for the given question based on the retrieved documents

inference-min-sku-spec: 2|0|14|28

inference-recommended-sku: Standard_DS3_v2
