Retrieval-Augmented Generation (RAG)

Classification

Core Processing Pattern

Intent

To enhance Large Language Model (LLM) outputs by integrating external knowledge sources during generation, thus grounding responses in factual information and reducing hallucinations.

Also Known As

Knowledge-Augmented Generation, External Knowledge Integration, Document-Grounded Generation

Motivation

LLMs are trained on large but finite datasets with knowledge cutoffs, making them prone to several limitations:

  • They may generate factually incorrect information (hallucinations)
  • Their knowledge becomes outdated after the training cutoff
  • They lack access to private, domain-specific, or specialized information
  • They cannot cite specific sources for verification

Traditional approaches like fine-tuning on domain-specific data are resource-intensive and don't scale well for frequently changing information. The RAG pattern addresses these challenges by:

  1. Retrieving relevant information from external knowledge sources
  2. Augmenting prompts with this retrieved information
  3. Generating responses grounded in the retrieved facts

This approach combines the creative generation capabilities of LLMs with the factual accuracy of knowledge bases, resulting in more reliable, up-to-date, and verifiable outputs.

Applicability

Use the RAG pattern when:

  • Factual accuracy is critical (customer support, legal applications, medical information)
  • Working with domain-specific knowledge not widely available in LLM training data
  • Dealing with time-sensitive or frequently changing information
  • Needing to provide traceable sources or references for generated content
  • Building applications that require access to private or proprietary data
  • Creating systems that need to respond based on user-specific information
  • Implementing solutions where hallucinations would pose significant risks

Structure

flowchart LR
    User((User))
    KB[(Knowledge Base)]
    Retriever[Retriever]
    LLM[LLM]
    
    User -- "(1) Query" --> Retriever
    Retriever -- "(2) Search" --> KB
    KB -- "(3) Results" --> Retriever
    Retriever -- "(4) Augmented Prompt" --> LLM
    LLM -- "(5) Response" --> User
    
    classDef user fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef kb fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    classDef retriever fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
    classDef llm fill:#fff8e1,stroke:#ffa000,stroke-width:2px
    
    class User user
    class KB kb
    class Retriever retriever
    class LLM llm

Components

  • Knowledge Source: External repositories containing factual information (documents, databases, APIs, etc.)
  • Vector Database: Storage system for embeddings that enables similarity search
  • Chunker: Component that breaks documents into manageable sections
  • Embedding Model: Converts text chunks into vector representations
  • Retriever: System that identifies and extracts relevant information from knowledge sources based on the query
  • Context Builder: Assembles retrieved information into a format suitable for augmenting the prompt
  • Generator: The LLM that produces the final response based on the augmented prompt
  • Query Analyzer: Optional component that reformulates or expands the original query to improve retrieval
  • Citation Manager: Optional component that tracks sources of information for attribution
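
For a sense of how these components fit together in code, the sketch below expresses the main ones as Python protocols. The class and method names are illustrative assumptions for this page, not a standard interface.

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Chunk:
    text: str
    source: str  # tracked so the citation manager can attribute responses


class EmbeddingModel(Protocol):
    def embed(self, texts: Sequence[str]) -> Sequence[Sequence[float]]: ...


class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int = 5) -> list[Chunk]: ...


class ContextBuilder(Protocol):
    def build_prompt(self, query: str, chunks: Sequence[Chunk]) -> str: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...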

Interactions

  1. When a user query is received, the system may first analyze and reformulate it to optimize for retrieval.
  2. The retriever converts the query into a vector representation using the embedding model.
  3. The retriever performs similarity search against the vector database to find relevant chunks of information.
  4. The context builder assembles the retrieved chunks and integrates them with the original query to create an augmented prompt.
  5. The generator (LLM) processes the augmented prompt to produce a response grounded in the retrieved information.
  6. The citation manager may track which sources contributed to the response for attribution purposes.
  7. The final response is returned to the user, potentially including citations or references.
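
A hedged sketch of this query-time flow, written against the illustrative interfaces from the Components section (the helper objects are assumptions, not any specific library's API):

def answer(query: str, retriever, context_builder, generator, top_k: int = 5) -> str:
    """Query-time RAG flow, following steps 1-7 above."""
    # (1) Optional: analyze/reformulate the query here before retrieval.
    # (2-3) Embed the query and similarity-search the knowledge base.
    chunks = retriever.retrieve(query, top_k=top_k)

    # (4) Assemble the retrieved chunks and the query into an augmented prompt.
    prompt = context_builder.build_prompt(query, chunks)

    # (5) Generate a response grounded in the retrieved information.
    response = generator.generate(prompt)

    # (6-7) A citation manager could append chunk.source references here.
    return response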

For dynamic knowledge bases, additional background processes include:

  1. Ingesting new documents through the chunker, which breaks them into appropriate segments
  2. Converting these chunks into vector embeddings
  3. Storing the embeddings and their associated text in the vector database
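
A corresponding ingestion sketch, again with hypothetical helpers (chunk_document, embedding_model, and vector_db stand in for whatever chunker, embedding model, and vector database you choose):

def ingest(documents, chunk_document, embedding_model, vector_db) -> None:
    """Background ingestion: chunk each document, embed the chunks, store them."""
    for doc in documents:
        # (1) Chunker: split the document into appropriately sized segments.
        pieces = chunk_document(doc.text)
        # (2) Embedding model: convert each segment into a vector representation.
        vectors = embedding_model.embed(pieces)
        # (3) Store each vector with its text and source metadata.
        for piece, vector in zip(pieces, vectors):
            vector_db.upsert(vector=vector, text=piece, metadata={"source": doc.id})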

Consequences

Benefits

  • Significantly reduces hallucinations by grounding responses in factual information
  • Enables access to up-to-date information beyond the LLM's training cutoff
  • Allows integration of private, domain-specific, or proprietary information
  • Supports source attribution and verification
  • Decouples knowledge from reasoning capabilities, allowing each to be updated independently
  • Can be more cost-effective than continuous fine-tuning for rapidly changing information

Limitations

  • Introduces additional system complexity and dependencies
  • May increase latency due to retrieval operations
  • Quality heavily depends on the retrieval component's effectiveness
  • Limited by the coverage and quality of the knowledge sources
  • May struggle with nuanced information needs requiring synthesis across many sources
  • Can encounter challenges with contradictory information in the knowledge base

Performance Implications

  • Retrieval operations add latency to response generation
  • Vector database query performance impacts overall system responsiveness
  • Document chunking strategies affect both storage requirements and retrieval precision
  • Embedding model choice influences both speed and quality of retrieval

Implementation

  1. Define knowledge requirements:

    • Identify what external knowledge the system needs access to
    • Determine update frequency and freshness requirements
  2. Design the knowledge base architecture:

    • Select appropriate document storage systems
    • Choose vector database technology (Pinecone, Weaviate, Qdrant, etc.)
    • Determine embedding models for vectorization (OpenAI, Cohere, BERT variants, etc.)
  3. Implement chunking strategy:

    • Develop document parsing pipelines
    • Define chunk size and overlap parameters (a minimal chunker is sketched after this list)
    • Create metadata extraction processes
  4. Build retrieval mechanisms:

    • Implement similarity search functionality
    • Develop query expansion or reformulation techniques
    • Create relevance scoring and filtering systems
  5. Design prompt augmentation:

    • Create templates for integrating retrieved information
    • Implement context window management strategies
    • Develop methods for handling multiple sources
  6. Implement citation and sourcing:

    • Design source tracking mechanisms
    • Create citation formatting standards
    • Implement verification capabilities
  7. Optimize for performance:

    • Implement caching strategies
    • Consider hybrid retrieval approaches
    • Create monitoring systems for retrieval quality
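
As a concrete illustration of the chunking strategy in step 3, here is a minimal fixed-size chunker with overlap. The default sizes are placeholders to tune per document type and embedding model.

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping, character-based chunks.

    Overlap keeps sentences that straddle a boundary retrievable from both
    neighbouring chunks; token-based splitting with the embedding model's
    tokenizer is usually preferable in practice.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size]
            for start in range(0, len(text), step)
            if text[start:start + chunk_size].strip()]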

Code Examples

The sketch below wires the pattern together end to end using the OpenAI Python SDK and a naive in-memory index. It is illustrative rather than production-ready: the model names are placeholders, SimpleVectorStore is a stand-in for a real vector database such as Pinecone, Weaviate, or Qdrant, and error handling, batching, and caching are omitted.
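
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
EMBED_MODEL = "text-embedding-3-small"  # placeholder model names
CHAT_MODEL = "gpt-4o-mini"


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts into unit-length vectors."""
    response = client.embeddings.create(model=EMBED_MODEL, input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


class SimpleVectorStore:
    """In-memory stand-in for a vector database (cosine similarity search)."""

    def __init__(self):
        self.vectors = None
        self.texts: list[str] = []

    def add(self, texts: list[str]) -> None:
        vectors = embed(texts)
        self.texts.extend(texts)
        self.vectors = vectors if self.vectors is None else np.vstack([self.vectors, vectors])

    def search(self, query: str, top_k: int = 3) -> list[str]:
        scores = self.vectors @ embed([query])[0]
        return [self.texts[i] for i in np.argsort(scores)[::-1][:top_k]]


def rag_answer(query: str, store: SimpleVectorStore) -> str:
    """Retrieve relevant chunks, augment the prompt, and generate a grounded answer."""
    context = "\n\n".join(f"[{i + 1}] {text}" for i, text in enumerate(store.search(query)))
    messages = [
        {"role": "system",
         "content": "Answer using only the numbered sources provided. "
                    "Cite sources like [1]. If the sources are insufficient, say so."},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
    ]
    response = client.chat.completions.create(model=CHAT_MODEL, messages=messages)
    return response.choices[0].message.content


store = SimpleVectorStore()
store.add(["The warranty period for the X100 is 24 months.",
           "Support tickets are answered within one business day."])
print(rag_answer("How long is the X100 warranty?", store))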

Variations

  • Hybrid RAG: Combines dense vector retrieval with traditional keyword search for improved recall
  • Multi-Stage RAG: Implements a sequence of retrieval operations, using initial generation to guide subsequent retrievals
  • Recursive RAG: Uses the LLM itself to determine what additional information to retrieve in an iterative process
  • Fusion RAG: Combines information from multiple knowledge sources with different characteristics
  • Semantic Router RAG: Uses a classifier to route queries to different retrieval systems based on query type
  • Self-RAG: Incorporates a self-evaluation step where the LLM assesses its need for additional information
  • RAG with Reranking: Adds a post-retrieval ranking phase to improve precision of selected documents (see the sketch below)
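
As one example of how these variations change the pipeline, RAG with Reranking only adds a second scoring pass between retrieval and generation. The cross_encoder object below is a hypothetical reranking model with a score(query, text) method; over-fetching fetch_k candidates and keeping the best top_k after reranking is the common shape of this stage.

def retrieve_with_rerank(query: str, store, cross_encoder, top_k: int = 3, fetch_k: int = 20) -> list[str]:
    """RAG with Reranking: over-fetch with fast vector search, then rerank for precision."""
    candidates = store.search(query, top_k=fetch_k)       # recall-oriented first stage
    scored = [(cross_encoder.score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # precision-oriented rerank
    return [text for _, text in scored[:top_k]]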

Real-World Examples

  • Customer Support Systems: Companies like Intercom and Zendesk implement RAG to augment chatbots with product documentation and knowledge bases
  • Legal Research Assistants: Legal tech companies like Casetext use RAG to ground responses in case law and statutes
  • Enterprise Search: Organizations implement RAG-based systems to answer questions about internal documentation and policies
  • Medical Information Systems: Healthcare platforms use RAG to provide accurate information grounded in medical literature and guidelines
  • Financial Analysis Tools: Investment platforms use RAG to combine historical market data with current news for investment insights

Related Patterns

  • Chain-of-Thought Prompting: Often combined with RAG to improve reasoning with retrieved information
  • Semantic Caching: Frequently used to optimize RAG systems by storing previous retrievals
  • Multi-Agent Systems: May use RAG to provide specialized agents with domain-specific knowledge
  • Reflection: Can be integrated with RAG to evaluate information needs and retrieval quality
  • Fallback Chains: Useful for implementing graceful degradation when retrieval fails
  • Output Filtering: Commonly paired with RAG to verify that generated content accurately represents retrieved information