models qna non rag metrics eval - Azure/azureml-assets GitHub Wiki
The Q&A evaluation flow evaluates Q&A systems by using state-of-the-art large language models (LLMs) to measure the quality and safety of your responses. Using GPT and a GPT embedding model for these measurements aims to achieve higher agreement with human evaluations than traditional mathematical measurements.
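To make the embedding-based measurement more concrete, here is a minimal sketch of how a similarity score between two embedding vectors could be computed. It assumes that the `ada_similarity` metric is a cosine similarity between the embedding of the answer and the embedding of the ground truth; this page does not confirm that. The vectors below are made-up toy values, not real model embeddings, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Similarity is undefined for a zero vector; 0.0 is a convention chosen here.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins for the embeddings of an answer and a ground truth (assumed values).
answer_vec = [0.20, 0.10, 0.70]
truth_vec = [0.25, 0.05, 0.65]
print(cosine_similarity(answer_vec, truth_vec))
```

A score close to 1 means the two vectors point in nearly the same direction, which is how a value such as the `ada_similarity` in the sample output would be read under this assumption.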
Inference type | CLI | VS Code Extension |
---|---|---|
Real time | deploy-promptflow-model-cli-example | deploy-promptflow-model-vscode-extension-example |
Batch | N/A | N/A |
Sample input:
{
"inputs": {
"question": "Which camping table holds the most weight?",
"answer": "The Alpine Explorer Tent is the most waterproof.",
"context": "From our product list, the Alpine Explorer Tent is the most waterproof. The Adventure Dining Table has higher weight.",
"ground_truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m",
"metrics": "gpt_groundedness,f1_score,ada_similarity,gpt_fluency,gpt_coherence,gpt_similarity,gpt_relevance"
}
}
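A request body with this shape can be assembled programmatically. The sketch below is illustrative only: `build_payload` and `KNOWN_METRICS` are hypothetical names, and the metric list is copied from the sample input, so the flow may accept other values. The one detail taken directly from the sample is that the metrics are passed as a single comma-separated string.

```python
import json

# Metric names copied from the sample input; the flow may support others.
KNOWN_METRICS = [
    "gpt_groundedness", "f1_score", "ada_similarity", "gpt_fluency",
    "gpt_coherence", "gpt_similarity", "gpt_relevance",
]


def build_payload(question, answer, context, ground_truth, metrics):
    """Hypothetical helper: build a request body shaped like the sample input."""
    unknown = [m for m in metrics if m not in KNOWN_METRICS]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return {
        "inputs": {
            "question": question,
            "answer": answer,
            "context": context,
            "ground_truth": ground_truth,
            # The sample passes the selection as one comma-separated string.
            "metrics": ",".join(metrics),
        }
    }


payload = build_payload(
    question="Which camping table holds the most weight?",
    answer="The Alpine Explorer Tent is the most waterproof.",
    context="(retrieved product text goes here)",
    ground_truth="The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m",
    metrics=["f1_score", "ada_similarity"],
)
print(json.dumps(payload, indent=2))
```

Validating the metric names before sending the request catches typos locally instead of at the endpoint.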
Sample output:
{
"outputs": {
"f1_score": 0.5,
"gpt_coherence": 1,
"gpt_similarity": 1,
"gpt_fluency": 1,
"gpt_relevance": 1,
"gpt_groundedness": 5,
"ada_similarity": 0.9317354400079281
}
}
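The `f1_score` in the sample can be reproduced with a token-overlap F1 between the answer and the ground truth. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation, dropping the articles "a", "an" and "the"). The flow's actual normalization is not stated on this page, so the agreement with the sample value is suggestive rather than confirmation.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, drop articles, split on whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_score(answer, ground_truth):
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred = normalize(answer)
    ref = normalize(ground_truth)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    # 2PR / (P + R) with P = overlap/len(pred) and R = overlap/len(ref) simplifies to this.
    return 2 * overlap / (len(pred) + len(ref))


answer = "The Alpine Explorer Tent is the most waterproof."
ground_truth = "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"
print(f1_score(answer, ground_truth))  # prints 0.5
```

After normalization the answer has 6 tokens, the ground truth has 10, and 4 are shared ("alpine", "explorer", "tent", "waterproof"), giving 2 × 4 / (6 + 10) = 0.5.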
Version: 5
View in Studio: https://ml.azure.com/registries/azureml/models/qna-non-rag-metrics-eval/version/5
is-promptflow: True
azureml.promptflow.section: gallery
azureml.promptflow.type: evaluate
azureml.promptflow.name: QnA Evaluation
azureml.promptflow.description: Compute the quality of the answer for the given question based on the ground_truth and the context
inference-min-sku-spec: 2|0|14|28
inference-recommended-sku: Standard_DS3_v2