RAG Solution and Performance Improvement using Cohere Rerank - dtoinagn/flyingbird.github.io GitHub Wiki

Overview

The Retrieval-Augmented Generation (RAG) pattern is an industry-standard approach to building applications that use language models to process specific or proprietary data that the model doesn't already know. Designing, experimenting with, and evaluating a RAG solution involves many complex considerations:

  • The architecture of a RAG solution
  • How to determine which test documents and queries to use during evaluation
  • How to choose a chunking strategy
  • How to determine which chunks you should enrich and how to enrich them
  • How to choose the right embedding model
  • How to configure the search index
  • How to determine which searches, such as vector, full text, hybrid, and manual multiple searches, you should perform
  • How to evaluate each step
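As one illustration of the chunking decision above, a minimal fixed-size chunking strategy with overlap might look like the following sketch. The function name and parameters are illustrative, not from any particular library; real pipelines often chunk on sentence or token boundaries instead of raw characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap,
    so content spanning a chunk boundary appears in two adjacent chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 100  # a 500-character toy document
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks), len(chunks[0]))  # → 4 200
```

Larger overlaps reduce the risk of splitting an answer across chunks, at the cost of indexing more redundant text.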

RAG Application Flow

The following workflow describes a high-level flow for a RAG application.

  1. The user issues a query in an intelligent application user interface.
  2. The intelligent application makes an API call to an orchestrator. You can implement the orchestrator with tools or platforms like Semantic Kernel, Azure Machine Learning prompt flow, or LangChain.
  3. The orchestrator determines which search to perform on Azure AI Search and issues the query.
  4. The orchestrator packages the top N results of the search, includes them and the query as context within a prompt, and sends the prompt to the language model.
  5. The orchestrator returns the language model's response to the intelligent application for the user to read.

The reliability and accuracy of the responses hinge on finding the right source materials. Therefore, honing the search process in RAG is crucial to boosting the trustworthiness of the generated responses.

RAG systems are important tools for building search and retrieval applications, but they often fall short of expectations because of a suboptimal retrieval step. Adding a rerank step to the pipeline can improve search quality.
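The retrieve-then-rerank pattern can be sketched as a two-stage pipeline: a cheap, recall-oriented first stage returns a broad candidate set, and a precision-oriented reranker rescores each candidate against the query. In production the reranker would be a cross-encoder model such as Cohere Rerank; the word-overlap scoring below is a toy stand-in, and all function names are illustrative.

```python
def first_stage_retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Recall-oriented stage: keep any document sharing a word with the query."""
    q = set(query.lower().split())
    return [d for d in corpus if q & set(d.lower().split())][:k]

def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Precision-oriented stage: rescore candidates, keep the best top_n.
    Word-overlap ratio stands in for a cross-encoder relevance score."""
    q = set(query.lower().split())
    def score(doc: str) -> float:
        words = set(doc.lower().split())
        return len(q & words) / len(words)
    return sorted(candidates, key=score, reverse=True)[:top_n]

corpus = [
    "reranking improves search relevance",
    "search engines index many documents for search",
    "cats sleep most of the day",
]
candidates = first_stage_retrieve("improves search relevance", corpus, k=10)
best = rerank("improves search relevance", candidates, top_n=1)
print(best[0])  # → reranking improves search relevance
```

Because the expensive scorer runs only on the small candidate set, this pattern adds precision without rescoring the whole corpus.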

Document Retrieval in RAG Orchestration

Dense retrieval is one technique for retrieving documents in a RAG orchestration that aims to understand the semantic meaning and intent behind user queries. The goal of dense retrieval is to map both user queries and documents (or passages) into a dense vector space and find the documents closest to the query. In this space, the similarity between the query and document vectors can be computed with standard distance metrics such as cosine similarity or Euclidean distance. The documents that match the semantic meaning of the user query most closely, based on the calculated distance metric, are then presented back to the user.
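A minimal sketch of dense retrieval over toy 3-dimensional embeddings follows. Real systems use learned embeddings with hundreds of dimensions produced by an embedding model; the hand-made vectors and document names here are purely illustrative of ranking by cosine similarity in a shared vector space.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend embeddings for three documents and one query.
doc_vectors = {
    "doc_pricing":  [0.9, 0.1, 0.0],
    "doc_returns":  [0.1, 0.9, 0.1],
    "doc_shipping": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # semantically close to "doc_pricing"

# Rank documents by similarity to the query, highest first.
ranked = sorted(doc_vectors,
                key=lambda d: cosine_similarity(query_vector, doc_vectors[d]),
                reverse=True)
print(ranked[0])  # → doc_pricing
```

Swapping cosine similarity for Euclidean distance (and sorting ascending) would give the same top result here; with normalized vectors the two metrics produce the same ranking.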

The quality of the final responses to search queries is significantly influenced by the relevance of the retrieved documents. While dense retrieval models are very efficient and can scale to large datasets, they struggle with more complex data and questions due to the simplicity of the method. A document vector compresses the meaning of the text into a single representation, typically 768-1536 dimensions, which often results in loss of information. As a consequence, when documents are retrieved during a vector search, the most relevant information is not always ranked at the top of the results.

Reference

https://aws.amazon.com/blogs/machine-learning/improve-rag-performance-using-cohere-rerank/