Retrieval-Augmented Generation (RAG)

Classification

Core Processing Pattern

Intent

To enhance Large Language Model (LLM) outputs by integrating external knowledge sources during generation, thus grounding responses in factual information and reducing hallucinations.

Also Known As

Knowledge-Augmented Generation, External Knowledge Integration, Document-Grounded Generation

Motivation

LLMs are trained on large but finite datasets with knowledge cutoffs, making them prone to several limitations:

  • They may generate factually incorrect information (hallucinations)
  • Their knowledge becomes outdated after the training cutoff
  • They lack access to private, domain-specific, or specialized information
  • They cannot cite specific sources for verification

Traditional approaches like fine-tuning on domain-specific data are resource-intensive and don't scale well for frequently changing information. The RAG pattern addresses these challenges by:

  1. Retrieving relevant information from external knowledge sources
  2. Augmenting prompts with this retrieved information
  3. Generating responses grounded in the retrieved facts

This approach combines the creative generation capabilities of LLMs with the factual accuracy of knowledge bases, resulting in more reliable, up-to-date, and verifiable outputs.

Applicability

Use the RAG pattern when:

  • Factual accuracy is critical (customer support, legal applications, medical information)
  • Working with domain-specific knowledge not widely available in LLM training data
  • Dealing with time-sensitive or frequently changing information
  • Needing to provide traceable sources or references for generated content
  • Building applications that require access to private or proprietary data
  • Creating systems that need to respond based on user-specific information
  • Implementing solutions where hallucinations would pose significant risks

Structure

flowchart LR
    User((User))
    KB[(Knowledge Base)]
    Retriever[Retriever]
    LLM[LLM]
    
    User -- "(1) Query" --> Retriever
    Retriever -- "(2) Search" --> KB
    KB -- "(3) Results" --> Retriever
    Retriever -- "(4) Augmented Prompt" --> LLM
    LLM -- "(5) Response" --> User
    
    classDef user fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef kb fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    classDef retriever fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    classDef llm fill:#fff8e1,stroke:#ffa000,stroke-width:2px
    
    class User user
    class KB kb
    class Retriever retriever
    class LLM llm

Components

  • Knowledge Source: External repositories containing factual information (documents, databases, APIs, etc.)
  • Vector Database: Storage system for embeddings that enables similarity search
  • Chunker: Component that breaks documents into manageable sections
  • Embedding Model: Converts text chunks into vector representations
  • Retriever: System that identifies and extracts relevant information from knowledge sources based on the query
  • Context Builder: Assembles retrieved information into a format suitable for augmenting the prompt
  • Generator: The LLM that produces the final response based on the augmented prompt
  • Query Analyzer: Optional component that reformulates or expands the original query to improve retrieval
  • Citation Manager: Optional component that tracks sources of information for attribution
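
For a sense of how these components fit together in code, the sketch below expresses the main ones as Python protocols. The class and method names are illustrative assumptions for this page, not a standard interface.

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Chunk:
    text: str
    source: str  # tracked so the citation manager can attribute responses


class EmbeddingModel(Protocol):
    def embed(self, texts: Sequence[str]) -> Sequence[Sequence[float]]: ...


class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int = 5) -> list[Chunk]: ...


class ContextBuilder(Protocol):
    def build_prompt(self, query: str, chunks: Sequence[Chunk]) -> str: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...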

Interactions

  1. When a user query is received, the system may first analyze and reformulate it to optimize for retrieval.
  2. The retriever converts the query into a vector representation using the embedding model.
  3. The retriever performs similarity search against the vector database to find relevant chunks of information.
  4. The context builder assembles the retrieved chunks and integrates them with the original query to create an augmented prompt.
  5. The generator (LLM) processes the augmented prompt to produce a response grounded in the retrieved information.
  6. The citation manager may track which sources contributed to the response for attribution purposes.
  7. The final response is returned to the user, potentially including citations or references.
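
A hedged sketch of this query-time flow, written against the illustrative interfaces from the Components section (the helper objects are assumptions, not any specific library's API):

def answer(query: str, retriever, context_builder, generator, top_k: int = 5) -> str:
    """Query-time RAG flow, following steps 1-7 above."""
    # (1) Optional: analyze/reformulate the query here before retrieval.
    # (2-3) Embed the query and similarity-search the knowledge base.
    chunks = retriever.retrieve(query, top_k=top_k)

    # (4) Assemble the retrieved chunks and the query into an augmented prompt.
    prompt = context_builder.build_prompt(query, chunks)

    # (5) Generate a response grounded in the retrieved information.
    response = generator.generate(prompt)

    # (6-7) A citation manager could append chunk.source references here.
    return response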

For dynamic knowledge bases, additional background processes include:

  1. Ingesting new documents through the chunker, which breaks them into appropriate segments
  2. Converting these chunks into vector embeddings
  3. Storing the embeddings and their associated text in the vector database
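
A corresponding ingestion sketch, again with hypothetical helpers (chunk_document, embedding_model, and vector_db stand in for whatever chunker, embedding model, and vector database you choose):

def ingest(documents, chunk_document, embedding_model, vector_db) -> None:
    """Background ingestion: chunk each document, embed the chunks, store them."""
    for doc in documents:
        # (1) Chunker: split the document into appropriately sized segments.
        pieces = chunk_document(doc.text)
        # (2) Embedding model: convert each segment into a vector representation.
        vectors = embedding_model.embed(pieces)
        # (3) Store each vector with its text and source metadata.
        for piece, vector in zip(pieces, vectors):
            vector_db.upsert(vector=vector, text=piece, metadata={"source": doc.id})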

Consequences

Benefits

  • Significantly reduces hallucinations by grounding responses in factual information
  • Enables access to up-to-date information beyond the LLM's training cutoff
  • Allows integration of private, domain-specific, or proprietary information
  • Supports source attribution and verification
  • Decouples knowledge from reasoning capabilities, allowing each to be updated independently
  • Can be more cost-effective than continuous fine-tuning for rapidly changing information

Limitations

  • Introduces additional system complexity and dependencies
  • May increase latency due to retrieval operations
  • Quality heavily depends on the retrieval component's effectiveness
  • Limited by the coverage and quality of the knowledge sources
  • May struggle with nuanced information needs requiring synthesis across many sources
  • Can encounter challenges with contradictory information in the knowledge base

Performance Implications

  • Retrieval operations add latency to response generation
  • Vector database query performance impacts overall system responsiveness
  • Document chunking strategies affect both storage requirements and retrieval precision
  • Embedding model choice influences both speed and quality of retrieval

Implementation

  1. Define knowledge requirements:

    • Identify what external knowledge the system needs access to
    • Determine update frequency and freshness requirements
  2. Design the knowledge base architecture:

    • Select appropriate document storage systems
    • Choose vector database technology (Pinecone, Weaviate, Qdrant, etc.)
    • Determine embedding models for vectorization (OpenAI, Cohere, BERT variants, etc.)
  3. Implement chunking strategy:

    • Develop document parsing pipelines
    • Define chunk size and overlap parameters (a minimal chunker is sketched after this list)
    • Create metadata extraction processes
  4. Build retrieval mechanisms:

    • Implement similarity search functionality
    • Develop query expansion or reformulation techniques
    • Create relevance scoring and filtering systems
  5. Design prompt augmentation:

    • Create templates for integrating retrieved information
    • Implement context window management strategies
    • Develop methods for handling multiple sources
  6. Implement citation and sourcing:

    • Design source tracking mechanisms
    • Create citation formatting standards
    • Implement verification capabilities
  7. Optimize for performance:

    • Implement caching strategies
    • Consider hybrid retrieval approaches
    • Create monitoring systems for retrieval quality
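
As a concrete illustration of the chunking strategy in step 3, here is a minimal fixed-size chunker with overlap. The default sizes are placeholders to tune per document type and embedding model.

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks.

    Overlap keeps sentences that straddle a boundary retrievable from both
    neighbouring chunks; token-based splitting with the embedding model's
    tokenizer is usually preferable in practice.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)
            if text[start:start + chunk_size].strip()]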

Code Examples

The sketch below wires the pattern together end to end using the OpenAI Python SDK and a naive in-memory index. It is illustrative rather than production-ready: the model names are placeholders, SimpleVectorStore is a stand-in for a real vector database such as Pinecone, Weaviate, or Qdrant, and error handling, batching, and caching are omitted.
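
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
EMBED_MODEL = "text-embedding-3-small"  # placeholder model names
CHAT_MODEL = "gpt-4o-mini"


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into unit-length vectors."""
    response = client.embeddings.create(model=EMBED_MODEL, input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


class SimpleVectorStore:
    """In-memory stand-in for a vector database (cosine similarity search)."""

    def __init__(self):
        self.vectors = None
        self.texts: list[str] = []

    def add(self, texts: list[str]) -> None:
        vectors = embed(texts)
        self.texts.extend(texts)
        self.vectors = vectors if self.vectors is None else np.vstack([self.vectors, vectors])

    def search(self, query: str, top_k: int = 3) -> list[str]:
        scores = self.vectors @ embed([query])[0]
        return [self.texts[i] for i in np.argsort(scores)[::-1][:top_k]]


def rag_answer(query: str, store: SimpleVectorStore) -> str:
    """Retrieve relevant chunks, augment the prompt, and generate a grounded answer."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(store.search(query)))
    messages = [
        {"role": "system",
         "content": "Answer using only the numbered sources provided. "
                    "Cite sources like [1]. If the sources are insufficient, say so."},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
    ]
    response = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
    return response.choices[0].message.content


store = SimpleVectorStore()
store.add(["The warranty period for the X100 is 24 months.",
           "Support tickets are answered within one business day."])
print(rag_answer("How long is the X100 warranty?", store))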

Variations

  • Hybrid RAG: Combines dense vector retrieval with traditional keyword search for improved recall
  • Multi-Stage RAG: Implements a sequence of retrieval operations, using initial generation to guide subsequent retrievals
  • Recursive RAG: Uses the LLM itself to determine what additional information to retrieve in an iterative process
  • Fusion RAG: Combines information from multiple knowledge sources with different characteristics
  • Semantic Router RAG: Uses a classifier to route queries to different retrieval systems based on query type
  • Self-RAG: Incorporates a self-evaluation step where the LLM assesses its need for additional information
  • RAG with Reranking: Adds a post-retrieval ranking phase to improve precision of selected documents (see the sketch below)
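
As one example of how these variations change the pipeline, RAG with Reranking only adds a second scoring pass between retrieval and generation. The cross_encoder object below is a hypothetical reranking model with a score(query, text) method; over-fetching fetch_k candidates and keeping the best top_k after reranking is the common shape of this stage.

def retrieve_with_rerank(query: str, store, cross_encoder, top_k: int = 3, fetch_k: int = 20) -> list[str]:
    """RAG with Reranking: over-fetch with fast vector search, then rerank for precision."""
    candidates = store.search(query, top_k=fetch_k)       # recall-oriented first stage
    scored = [(cross_encoder.score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # precision-oriented rerank
    return [text for _, text in scored[:top_k]]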

Real-World Examples

  • Customer Support Systems: Companies like Intercom and Zendesk implement RAG to augment chatbots with product documentation and knowledge bases
  • Legal Research Assistants: Legal tech companies like Casetext use RAG to ground responses in case law and statutes
  • Enterprise Search: Organizations implement RAG-based systems to answer questions about internal documentation and policies
  • Medical Information Systems: Healthcare platforms use RAG to provide accurate information grounded in medical literature and guidelines
  • Financial Analysis Tools: Investment platforms use RAG to combine historical market data with current news for investment insights

Related Patterns

  • Chain-of-Thought Prompting: Often combined with RAG to improve reasoning with retrieved information
  • Semantic Caching: Frequently used to optimize RAG systems by storing previous retrievals
  • Multi-Agent Systems: May use RAG to provide specialized agents with domain-specific knowledge
  • Reflection: Can be integrated with RAG to evaluate information needs and retrieval quality
  • Fallback Chains: Useful for implementing graceful degradation when retrieval fails
  • Output Filtering: Commonly paired with RAG to verify that generated content accurately represents retrieved information