S11_Evaluation
Evaluating RAG systems is inherently complex due to the fuzzy nature of the inputs, outputs, and transformations within the pipeline. Practical factors such as runtime, API costs, and GPU usage add to this complexity, alongside the many components involved in a typical RAG system. Proper evaluation is critical for strategy selection, hyperparameter tuning, and continuous monitoring to ensure optimal performance. Evaluation can target the end-to-end pipeline or its individual components:
- Pipeline performance: Does the system provide the correct response given the knowledge base and queries?
- Retriever component: Assess the accuracy of the retrieved contexts.
- Generator component: Evaluate how well the LLM generates the final answer without a retriever, with a retriever, and with a 'perfect' retriever.
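As a rough illustration of the generator ablation above, one can score the same question set under three retrieval settings and compare the averages. All helper functions in this sketch are hypothetical placeholders, not part of any specific library.

```python
# Sketch of the generator ablation: run the same questions with no retrieval,
# with the real retriever, and with oracle ("perfect") contexts, then compare
# a single quality score. Replace the three stubs with your own components.

def retrieve(question: str) -> list[str]:
    raise NotImplementedError("Replace with your retriever.")

def generate_answer(question: str, contexts: list[str]) -> str:
    raise NotImplementedError("Replace with your LLM call.")

def score_answer(answer: str, ground_truth: str) -> float:
    raise NotImplementedError("Replace with your answer-quality metric.")

def generator_ablation(questions, gold_contexts, gold_answers):
    settings = {
        "no_retriever": lambda q, i: [],                     # parametric memory only
        "with_retriever": lambda q, i: retrieve(q),          # real retrieval pipeline
        "perfect_retriever": lambda q, i: gold_contexts[i],  # oracle contexts
    }
    results = {}
    for name, get_contexts in settings.items():
        scores = [
            score_answer(generate_answer(q, get_contexts(q, i)), gold_answers[i])
            for i, q in enumerate(questions)
        ]
        results[name] = sum(scores) / len(scores)
    return results  # compare the three averages to separate retriever from generator issues
```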
For the retriever component, traditional information-retrieval metrics include:
- Precision: The fraction of relevant items among the retrieved items.
- Recall: The fraction of correctly retrieved items among all relevant items.
- F1 score: The harmonic mean of precision and recall.
- Rank-aware metrics: Includes metrics like hit rate, mean reciprocal rank (MRR), mean average precision (MAP), and normalized discounted cumulative gain (NDCG).
These metrics require ground-truth labels for which chunks should be retrieved, which are not always available.
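A minimal, dependency-free sketch of how these classic metrics can be computed per query from chunk IDs, assuming the ground-truth relevant chunks are known:

```python
# Per-query precision, recall, F1, and reciprocal rank over plain chunk IDs.

def precision_recall_f1(retrieved, relevant):
    retrieved, relevant = list(retrieved), set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved, relevant):
    # Rank-aware: 1 / rank of the first relevant item (0 if none is retrieved).
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: 2 of the 3 retrieved chunks are relevant, first hit at rank 1.
print(precision_recall_f1(["c1", "c7", "c3"], {"c1", "c3", "c9"}))  # ~ (0.67, 0.67, 0.67)
print(reciprocal_rank(["c1", "c7", "c3"], {"c1", "c3", "c9"}))      # 1.0
```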
LLM-as-a-judge: Uses LLMs to evaluate retrieval performance. Example metrics include:
- Context precision: Whether the ground-truth-relevant contexts are ranked at the top of the retrieved contexts.
- Context relevance: Measures the relevance of retrieved contexts to the query.
- Context recall (or context coverage): Alignment between retrieved contexts and ground truth answers.
- Context entity recall: The fraction of entities from the ground truth that also appear in the retrieved contexts.
- ...
These metrics offer the flexibility to evaluate the relevance of retrieved contexts to the user query without requiring a specific ground-truth chunk to retrieve.
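A minimal sketch of an LLM-as-a-judge check for context relevance. `call_llm` is a hypothetical placeholder for your chat-completion client, and the prompt and 0-to-1 scoring scheme are illustrative rather than standardized:

```python
# LLM-as-a-judge sketch: score each retrieved context for relevance to the query.

JUDGE_PROMPT = """You are grading retrieval quality.
Question: {question}
Retrieved context: {context}
On a scale from 0 (irrelevant) to 1 (fully relevant), how relevant is the
context to answering the question? Reply with a single number."""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client (OpenAI, local model, etc.).")

def context_relevance(question: str, contexts: list[str]) -> float:
    scores = []
    for ctx in contexts:
        reply = call_llm(JUDGE_PROMPT.format(question=question, context=ctx))
        try:
            scores.append(max(0.0, min(1.0, float(reply.strip()))))
        except ValueError:
            scores.append(0.0)  # an unparseable judgement counts as irrelevant
    return sum(scores) / len(scores) if scores else 0.0
```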
Such LLM-based proxy metrics depend on the strength of the judging LLM, the prompting technique, and the specific metric and use case. While they can offer insights into system performance, they might not always provide a reliable basis for further development; LLM-based evaluation combined with human verification can be an alternative.
For the generator component, evaluation helps answer practical questions such as:
- Which model to use (e.g., GPT-4 vs. smaller models)?
- Should you fine-tune the LLM?
- Which prompts to use?
Commonly evaluated aspects of the generated answer include:
- Faithfulness: The factual consistency of the answer with the retrieved context.
- Answer relevancy: The relevance of the answer to the query.
- Answer correctness: The accuracy of the answer compared to the ground truth answer.
- Answer semantic similarity: The similarity between the answer and the ground truth answer.
- ...
These aspects can be measured using deterministic metrics, semantic metrics, and LLM-based metrics.
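As one example of a deterministic metric, a SQuAD-style token-level F1 between the generated answer and the ground-truth answer can serve as a cheap proxy for answer correctness:

```python
# Deterministic generation metric: token-level F1 between generated and gold answers.
# Semantic or LLM-based variants would swap this for embedding similarity or a judge prompt.

import re
from collections import Counter

def token_f1(generated: str, ground_truth: str) -> float:
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    gen, gold = Counter(tokenize(generated)), Counter(tokenize(ground_truth))
    overlap = sum((gen & gold).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

print(token_f1("The Eiffel Tower is in Paris.", "It is located in Paris, France."))  # 0.5
```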
An insightful blog post on metrics for evaluating different aspects of generated responses: A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation) | by Yi Zhang | Relari Blog.
LLM-based metrics can sometimes outperform traditional metrics by aligning more closely with human judgment. However, they may be costly and time-consuming. A potential solution is to use ensemble metrics, combining deterministic metrics with LLMs in uncertain cases. Uncertain cases can be determined using conformal prediction. For more information, please refer to A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation) | by Yi Zhang | Relari Blog.
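A much-simplified sketch of this ensemble idea: trust the cheap deterministic score when it is confidently high or low, and defer to an LLM judge only in the uncertain band. In practice the band would be calibrated (e.g., with conformal prediction on a labelled set); the thresholds and `llm_judge` below are placeholders.

```python
# Hybrid scoring sketch: deterministic metric first, LLM judge only when uncertain.

def llm_judge(generated: str, ground_truth: str) -> bool:
    raise NotImplementedError("Replace with an LLM-as-a-judge call.")

def is_correct(generated, ground_truth, cheap_score, low=0.2, high=0.8):
    if cheap_score >= high:
        return True    # deterministic metric is confident: accept
    if cheap_score <= low:
        return False   # deterministic metric is confident: reject
    return llm_judge(generated, ground_truth)  # uncertain band: escalate to the LLM
```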
Some of the popular frameworks for evaluating RAG include:
- Ragas
- TruLens
- DeepEval
- UpTrain
- FaaF
- ARES
- Continuous-eval
- SuperKnowa
- EXAM
- Promptfoo
A landscape overview of different metrics and frameworks for RAG evaluation is given in this post: Twitter - RAG evaluation frameworks.
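As an example of how such a framework is used, here is a hedged sketch of evaluating a few records with Ragas (based on the v0.1-style API; exact imports and column names may differ between versions, and an LLM API key is required under the hood):

```python
# Sketch: score a small result set with Ragas metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

data = {
    "question": ["What does RAG stand for?"],
    "answer": ["Retrieval-Augmented Generation."],
    "contexts": [["RAG (Retrieval-Augmented Generation) combines retrieval with an LLM."]],
    "ground_truth": ["Retrieval-Augmented Generation."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores; Ragas calls an LLM judge internally
```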
However, these frameworks have some limitations:
- Some frameworks depend on OpenAI keys, limiting the ability to customize models.
- Multimodal question generation and evaluation might not be fully supported.
- Metrics may vary in definition and calculation based on different prompts.
- Certain frameworks do not allow for the custom declaration of new metrics.
- Automation of unit tests is not available in some frameworks.
- Some frameworks lack a web UI for tracing and debugging.
A typical RAG evaluation dataset contains the following fields:
- Question: Text-based queries.
- Ground-truth answer: The correct answer for the given question.
- Ground-truth contexts: (Optional) The list of contexts required to answer the question.
- Generated answer: The answer generated by RAG.
- List of contexts: The contexts used by RAG for generation.
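A sketch of one record in such a dataset as a typed structure; the field names here are illustrative, and each framework expects its own schema:

```python
# One evaluation record: question, references, and the RAG system's outputs.
from typing import Optional, TypedDict

class RAGEvalRecord(TypedDict):
    question: str                               # text-based query
    ground_truth_answer: str                    # reference answer
    ground_truth_contexts: Optional[list[str]]  # optional: contexts needed to answer
    generated_answer: str                       # answer produced by the RAG system
    retrieved_contexts: list[str]               # contexts the RAG system used for generation

record: RAGEvalRecord = {
    "question": "What does RAG stand for?",
    "ground_truth_answer": "Retrieval-Augmented Generation.",
    "ground_truth_contexts": None,
    "generated_answer": "It stands for Retrieval-Augmented Generation.",
    "retrieved_contexts": ["RAG combines a retriever with a generator LLM."],
}
```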
Such ground-truth data can be created in two main ways:
- Human annotations: Offer better quality and assessment but are expensive and hard to scale.
  - Require more effort and maintenance but offer high reliability.
- LLM generation (synthetic data): Simple to implement but may require careful prompting and double-checking.
  - Easy to start with, but quality varies and human verification is needed.
A typical pipeline for synthetic test-set generation:
- Chunk the data.
- Choose the chunk(s) to ask questions about: a single chunk, multiple chunks, or multi-hop chunks.
- Question generation: Use LLMs to generate questions based on the chunked data.
- Answer generation: Generate ground-truth answers using the context and LLMs.
- Quality check: Use critique agents to verify the quality of generated questions and answers.
Depending on whether the chunk is text, an image, or a mix of both, the required LLM can be unimodal or multimodal.
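A rough sketch of these steps with an LLM in the loop; `call_llm` is a hypothetical placeholder for your (unimodal or multimodal) model client, and the prompts are illustrative:

```python
# Synthetic QA generation sketch: question, answer, then a critique-style quality check.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client.")

def generate_qa_pairs(chunks: list[str], max_pairs: int = 10) -> list[dict]:
    pairs = []
    for chunk in chunks[:max_pairs]:
        question = call_llm(f"Write one question answerable only from this passage:\n{chunk}")
        answer = call_llm(f"Passage:\n{chunk}\n\nQuestion: {question}\nAnswer using only the passage:")
        # Quality check: keep only pairs judged clear and fully grounded in the passage.
        verdict = call_llm(
            f"Passage:\n{chunk}\nQuestion: {question}\nAnswer: {answer}\n"
            "Is the question clear and the answer fully supported by the passage? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            pairs.append({
                "question": question,
                "ground_truth_answer": answer,
                "ground_truth_contexts": [chunk],
            })
    return pairs
```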
Generated questions can vary in type and difficulty:
- Simple or complex questions
- Basic Q&A, math problems, summarization tasks, yes/no questions
- Single or multi-hop/multi-document queries
- ...
To evaluate metrics that involve multimodal data, one can use a multimodal LLM with multimodal prompting. An example of measuring multimodal Faithfulness and Relevancy is given at LlamaIndex - Evaluating multimodal RAG.
It is important to note that multimodal RAG evaluation is still underdeveloped, which may pose challenges for future research and applications.
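In the spirit of the LlamaIndex example above, a hedged sketch of a multimodal faithfulness check; `call_multimodal_llm` is a hypothetical placeholder for a vision-capable model client, and the YES/NO prompt is illustrative:

```python
# Multimodal faithfulness sketch: ask a vision-capable judge whether the answer
# is supported by the retrieved text contexts and images.

def call_multimodal_llm(prompt: str, image_paths: list[str]) -> str:
    raise NotImplementedError("Replace with a multimodal LLM client (text + images).")

def multimodal_faithfulness(answer: str, text_contexts: list[str], image_paths: list[str]) -> bool:
    joined_context = "\n".join(text_contexts)
    prompt = (
        "Given the text context and the attached images, is every claim in the answer "
        "supported by them? Reply YES or NO.\n\n"
        f"Text context:\n{joined_context}\n\nAnswer:\n{answer}"
    )
    return call_multimodal_llm(prompt, image_paths).strip().upper().startswith("YES")
```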