S11_Evaluation
Evaluating RAG systems is inherently complex due to the fuzzy nature of the inputs, outputs, and transformations within the pipeline. Practical factors such as runtime, API costs, and GPU usage add to this complexity, alongside the many components involved in a typical RAG system. Proper evaluation is critical for strategy selection, hyperparameter tuning, and continuous monitoring to ensure optimal performance. Evaluation can target the end-to-end pipeline or its individual components:
- Pipeline performance: Does the system provide the correct response given the knowledge base and queries?
- Retriever component: Assess the accuracy of the retrieved contexts.
- Generator component: Evaluate how well the LLM generates the final answer without a retriever, with a retriever, and with a 'perfect' retriever.
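As a rough illustration of the generator ablation above, one can score the same question set under three retrieval settings and compare the averages. All helper functions in this sketch are hypothetical placeholders, not part of any specific library.

```python
# Sketch of the generator ablation: run the same questions with no retrieval,
# with the real retriever, and with oracle ("perfect") contexts, then compare
# a single quality score. Replace the three stubs with your own components.

def retrieve(question: str) -> list[str]:
    raise NotImplementedError("Replace with your retriever.")

def generate_answer(question: str, contexts: list[str]) -> str:
    raise NotImplementedError("Replace with your LLM call.")

def score_answer(answer: str, ground_truth: str) -> float:
    raise NotImplementedError("Replace with your answer-quality metric.")

def generator_ablation(questions, gold_contexts, gold_answers):
    settings = {
        "no_retriever": lambda q, i: [],                     # parametric memory only
        "with_retriever": lambda q, i: retrieve(q),          # real retrieval pipeline
        "perfect_retriever": lambda q, i: gold_contexts[i],  # oracle contexts
    }
    results = {}
    for name, get_contexts in settings.items():
        scores = [
            score_answer(generate_answer(q, get_contexts(q, i)), gold_answers[i])
            for i, q in enumerate(questions)
        ]
        results[name] = sum(scores) / len(scores)
    return results  # compare the three averages to separate retriever from generator issues
```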
For the retriever component, traditional information-retrieval metrics include:
- Precision: The fraction of relevant items among the retrieved items.
- Recall: The fraction of correctly retrieved items among all relevant items.
- F1 score: The harmonic mean of precision and recall.
- Rank-aware metrics: Includes metrics like hit rate, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (NDCG).
These metrics require ground-truth labels for which chunks should be retrieved, which are not always available.
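A minimal, dependency-free sketch of how these classic metrics can be computed per query from chunk IDs, assuming the ground-truth relevant chunks are known:

```python
# Per-query precision, recall, F1, and reciprocal rank over plain chunk IDs.

def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = list(retrieved), set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    # Rank-aware: 1 / rank of the first relevant item (0 if none is retrieved).
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: 2 of the 3 retrieved chunks are relevant, first hit at rank 1.
print(precision_recall_f1(["c1", "c7", "c3"], {"c1", "c3", "c9"}))  # ~ (0.67, 0.67, 0.67)
print(reciprocal_rank(["c1", "c7", "c3"], {"c1", "c3", "c9"}))      # 1.0
```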
LLM-as-a-judge: Uses LLMs to evaluate retrieval performance. Example metrics include:
- Context precision: Whether the ground-truth-relevant contexts are ranked at the top of the retrieved contexts.
- Context relevance: Measures the relevance of retrieved contexts to the query.
- Context recall (or context coverage): Alignment between retrieved contexts and ground truth answers.
- Context entity recall: The fraction of entities from the ground truth that also appear in the retrieved contexts.
- ...
These metrics offer the flexibility to evaluate the relevance of retrieved contexts to the user query without requiring a specific ground-truth chunk to retrieve.
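A minimal sketch of an LLM-as-a-judge check for context relevance. `call_llm` is a hypothetical placeholder for your chat-completion client, and the prompt and 0-to-1 scoring scheme are illustrative rather than standardized:

```python
# LLM-as-a-judge sketch: score each retrieved context for relevance to the query.

JUDGE_PROMPT = """You are grading retrieval quality.
Question: {question}
Retrieved context: {context}
On a scale from 0 (irrelevant) to 1 (fully relevant), how relevant is the
context to answering the question? Reply with a single number."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client (OpenAI, local model, etc.).")

def context_relevance(question: str, contexts: list[str]) -> float:
    scores = []
    for ctx in contexts:
        reply = call_llm(JUDGE_PROMPT.format(question=question, context=ctx))
        try:
            scores.append(max(0.0, min(1.0, float(reply.strip()))))
        except ValueError:
            scores.append(0.0)  # an unparseable judgement counts as irrelevant
    return sum(scores) / len(scores) if scores else 0.0
```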
Such LLM-based proxy metrics depend on the strength of the judging LLM, the prompting technique, and the specific metric and use case. While they can offer insights into system performance, they might not always provide a reliable basis for further development; LLM-based evaluation combined with human verification can be an alternative.
For the generator component, evaluation helps answer practical questions such as:
- Which model to use (e.g., GPT-4 vs. smaller models)?
- Should you fine-tune the LLM?
- Which prompts to use?
Commonly evaluated aspects of the generated answer include:
- Faithfulness: The factual consistency of the answer with the retrieved context.
- Answer relevancy: The relevance of the answer to the query.
- Answer correctness: The accuracy of the answer compared to the ground truth answer.
- Answer semantic similarity: The similarity between the answer and the ground truth answer.
- ...
These aspects can be measured using deterministic metrics, semantic metrics, and LLM-based metrics.
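As one example of a deterministic metric, a SQuAD-style token-level F1 between the generated answer and the ground-truth answer can serve as a cheap proxy for answer correctness:

```python
# Deterministic generation metric: token-level F1 between generated and gold answers.
# Semantic or LLM-based variants would swap this for embedding similarity or a judge prompt.

import re
from collections import Counter

def token_f1(generated: str, ground_truth: str) -> float:
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    gen, gold = Counter(tokenize(generated)), Counter(tokenize(ground_truth))
    overlap = sum((gen & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("The Eiffel Tower is in Paris.", "It is located in Paris, France."))  # 0.5
```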
An insightful blog post on metrics for evaluating different aspects of generated responses: A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation) | by Yi Zhang | Relari Blog.
LLM-based metrics can sometimes outperform traditional metrics by aligning more closely with human judgment. However, they may be costly and time-consuming. A potential solution is to use ensemble metrics, combining deterministic metrics with LLMs in uncertain cases. Uncertain cases can be determined using conformal prediction. For more information, please refer to A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation) | by Yi Zhang | Relari Blog.
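A much-simplified sketch of this ensemble idea: trust the cheap deterministic score when it is confidently high or low, and defer to an LLM judge only in the uncertain band. In practice the band would be calibrated (e.g., with conformal prediction on a labelled set); the thresholds and `llm_judge` below are placeholders.

```python
# Hybrid scoring sketch: deterministic metric first, LLM judge only when uncertain.

def llm_judge(generated: str, ground_truth: str) -> bool:
    raise NotImplementedError("Replace with an LLM-as-a-judge call.")

def is_correct(generated, ground_truth, cheap_score, low=0.2, high=0.8):
    if cheap_score >= high:
        return True    # deterministic metric is confident: accept
    if cheap_score <= low:
        return False   # deterministic metric is confident: reject
    return llm_judge(generated, ground_truth)  # uncertain band: escalate to the LLM
```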
Some of the popular frameworks for evaluating RAG include:
- Ragas
- TruLens
- DeepEval
- UpTrain
- FaaF
- ARES
- Continuous-eval
- SuperKnowa
- EXAM
- Promptfoo
A landscape overview of different metrics and frameworks for RAG evaluation is given in this post: Twitter - RAG evaluation frameworks.
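As an example of how such a framework is used, here is a hedged sketch of evaluating a few records with Ragas (based on the v0.1-style API; exact imports and column names may differ between versions, and an LLM API key is required under the hood):

```python
# Sketch: score a small result set with Ragas metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["What does RAG stand for?"],
    "answer": ["Retrieval-Augmented Generation."],
    "contexts": [["RAG (Retrieval-Augmented Generation) combines retrieval with an LLM."]],
    "ground_truth": ["Retrieval-Augmented Generation."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores; Ragas calls an LLM judge internally
```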
However, these frameworks have some limitations:
- Some frameworks depend on OpenAI keys, limiting the ability to customize models.
- Multimodal question generation and evaluation might not be fully supported.
- Metrics may vary in definition and calculation based on different prompts.
- Certain frameworks do not allow for the custom declaration of new metrics.
- Automation of unit tests is not available in some frameworks.
- Some frameworks lack a web UI for tracing and debugging.
A typical RAG evaluation dataset contains the following fields:
- Question: Text-based queries.
- Ground-truth answer: The correct answer for the given question.
- Ground-truth contexts: (Optional) The list of contexts required to answer the question.
- Generated answer: The answer generated by RAG.
- List of contexts: The contexts used by RAG for generation.
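A sketch of one record in such a dataset as a typed structure; the field names here are illustrative, and each framework expects its own schema:

```python
# One evaluation record: question, references, and the RAG system's outputs.
from typing import Optional, TypedDict

class RAGEvalRecord(TypedDict):
    question: str                               # text-based query
    ground_truth_answer: str                    # reference answer
    ground_truth_contexts: Optional[list[str]]  # optional: contexts needed to answer
    generated_answer: str                       # answer produced by the RAG system
    retrieved_contexts: list[str]               # contexts the RAG system used for generation

record: RAGEvalRecord = {
    "question": "What does RAG stand for?",
    "ground_truth_answer": "Retrieval-Augmented Generation.",
    "ground_truth_contexts": None,
    "generated_answer": "It stands for Retrieval-Augmented Generation.",
    "retrieved_contexts": ["RAG combines a retriever with a generator LLM."],
}
```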
Such ground-truth data can be created in two main ways:
- Human annotations: Offer better quality and assessment but are expensive and hard to scale.
  - Require more effort and maintenance but offer high reliability.
- LLM generation (synthetic data): Simple to implement but may require careful prompting and double-checking.
  - Easy to start with, but quality varies and human verification is needed.
A typical pipeline for synthetic test-set generation:
- Chunk the data.
- Choose the chunk(s) to ask questions about: a single chunk, multiple chunks, or multi-hop chunks.
- Question generation: Use LLMs to generate questions based on the chunked data.
- Answer generation: Generate ground-truth answers using the context and LLMs.
- Quality check: Use critique agents to verify the quality of generated questions and answers.
Depending on whether the chunk is text, an image, or a mix of both, the required LLM can be unimodal or multimodal.
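A rough sketch of these steps with an LLM in the loop; `call_llm` is a hypothetical placeholder for your (unimodal or multimodal) model client, and the prompts are illustrative:

```python
# Synthetic QA generation sketch: question, answer, then a critique-style quality check.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client.")

def generate_qa_pairs(chunks: list[str], max_pairs: int = 10) -> list[dict]:
    pairs = []
    for chunk in chunks[:max_pairs]:
        question = call_llm(f"Write one question answerable only from this passage:\n{chunk}")
        answer = call_llm(f"Passage:\n{chunk}\n\nQuestion: {question}\nAnswer using only the passage:")
        # Quality check: keep only pairs judged clear and fully grounded in the passage.
        verdict = call_llm(
            f"Passage:\n{chunk}\nQuestion: {question}\nAnswer: {answer}\n"
            "Is the question clear and the answer fully supported by the passage? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            pairs.append({
                "question": question,
                "ground_truth_answer": answer,
                "ground_truth_contexts": [chunk],
            })
    return pairs
```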
Generated questions can vary in type and difficulty:
- Simple or complex questions
- Basic Q&A, math problems, summarization tasks, yes/no questions
- Single or multi-hop/multi-document queries
- ...
To evaluate metrics that involve multimodal data, one can use a multimodal LLM with multimodal prompting. An example of measuring multimodal Faithfulness and Relevancy is given at LlamaIndex - Evaluating multimodal RAG.
It is important to note that multimodal RAG evaluation is still underdeveloped, which may pose challenges for future research and applications.
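In the spirit of the LlamaIndex example above, a hedged sketch of a multimodal faithfulness check; `call_multimodal_llm` is a hypothetical placeholder for a vision-capable model client, and the YES/NO prompt is illustrative:

```python
# Multimodal faithfulness sketch: ask a vision-capable judge whether the answer
# is supported by the retrieved text contexts and images.

def call_multimodal_llm(prompt: str, image_paths: list[str]) -> str:
    raise NotImplementedError("Replace with a multimodal LLM client (text + images).")

def multimodal_faithfulness(answer: str, text_contexts: list[str], image_paths: list[str]) -> bool:
    joined_context = "\n".join(text_contexts)
    prompt = (
        "Given the text context and the attached images, is every claim in the answer "
        "supported by them? Reply YES or NO.\n\n"
        f"Text context:\n{joined_context}\n\nAnswer:\n{answer}"
    )
    return call_multimodal_llm(prompt, image_paths).strip().upper().startswith("YES")
```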