Semantic Caching
Classification
Intent
To optimize response time and resource utilization by storing and reusing previous LLM responses based on semantic similarity rather than exact matching, thereby reducing redundant processing of semantically equivalent queries.
Also Known As
- Similarity-Based Caching
- Embedding-Based Caching
- Neural Caching
- Contextual Response Caching
Motivation
Large language model (LLM) operations are computationally expensive and time-consuming. Traditional caching systems rely on exact key matching, which cannot recognize that differently worded queries may be asking for the same information.
For example, the queries "What is the capital of France?" and "Tell me the capital city of France" have different text but identical meaning. A traditional cache would miss the opportunity to reuse the response, while semantic caching recognizes their similarity.
This pattern addresses:
- The high computational cost of LLM inference
- The latency challenge in real-time applications
- The need to reduce redundant processing
- The opportunity to leverage semantic understanding for efficiency
Applicability
Use Semantic Caching when:
- Your application handles many semantically similar queries
- Response time is critical to user experience
- You need to reduce API costs for external LLM services
- You want to optimize resource utilization for self-hosted models
- The domain of queries is relatively bounded or predictable
- The answers to queries are not highly time-sensitive or rapidly changing
Prerequisites:
- Access to embedding models for converting queries to vector representations
- Vector database or search capability for similarity matching
- Defined threshold policies for "close enough" matching
- Cache invalidation strategies for time-sensitive information
Structure
To do...
Components
- Query Embedder: Transforms incoming text queries into vector embeddings that capture semantic meaning
- Vector Store: Database optimized for storing and retrieving vector representations with similarity search capabilities
- Similarity Matcher: Determines if an incoming query is semantically similar enough to cached queries using distance metrics
- Cache Manager: Handles storage, retrieval, invalidation, and maintenance of the cache
- Threshold Controller: Configures and adjusts the similarity thresholds that determine cache hits
- Expiration Handler: Manages time-based invalidation of cached entries
- Cache Analyzer: Monitors cache performance metrics like hit rates and suggests optimizations
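One way to make this component list concrete is to give each piece a small interface. The sketch below uses Python protocols; the method names and signatures are illustrative assumptions, not taken from any specific library.

```python
from typing import Optional, Protocol, Sequence


class QueryEmbedder(Protocol):
    def embed(self, text: str) -> Sequence[float]:
        """Convert a text query into a vector embedding."""
        ...


class VectorStore(Protocol):
    def add(self, key: str, vector: Sequence[float], response: str) -> None:
        """Store a query vector alongside its cached response."""
        ...

    def nearest(self, vector: Sequence[float], k: int = 1) -> list[tuple[str, float, str]]:
        """Return up to k (key, similarity, response) tuples, best match first."""
        ...


class SimilarityMatcher(Protocol):
    def is_hit(self, similarity: float, category: str = "default") -> bool:
        """Decide whether a similarity score counts as a cache hit."""
        ...


class CacheManager(Protocol):
    def lookup(self, query: str) -> Optional[str]:
        """Return a cached response for a semantically similar query, if any."""
        ...

    def store(self, query: str, response: str) -> None:
        """Cache a freshly generated response."""
        ...
```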
Interactions
- When a query arrives, the Query Embedder converts it to a vector representation
- The Similarity Matcher searches the Vector Store for semantically similar previous queries
- If a match exceeding the similarity threshold is found, the Cache Manager retrieves the cached response
- If no suitable match exists, the query is processed by the LLM and the result is stored in the cache
- Periodically, the Expiration Handler removes outdated entries based on configured policies
- The Cache Analyzer continuously evaluates performance and may trigger threshold adjustments or cache pruning
The Threshold Controller may dynamically adjust similarity thresholds based on:
- The criticality of the query
- The domain context
- Current system load
- Historical hit/miss patterns
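Taken together, the interaction flow condenses into a single lookup-or-generate routine. The sketch below assumes the illustrative component interfaces from the Components section; `llm_generate` is a placeholder for whatever model call the application makes, and expiration/pruning run out of band.

```python
def get_response(query: str,
                 embedder,        # QueryEmbedder
                 store,           # VectorStore
                 matcher,         # SimilarityMatcher
                 llm_generate) -> str:
    """Serve a query from the semantic cache when possible, else call the LLM."""
    vector = embedder.embed(query)              # 1. embed the incoming query
    candidates = store.nearest(vector, k=1)     # 2. search for similar past queries

    if candidates:
        _, similarity, cached_response = candidates[0]
        if matcher.is_hit(similarity):          # 3. similarity above threshold -> cache hit
            return cached_response

    response = llm_generate(query)              # 4. cache miss -> run the model
    store.add(query, vector, response)          #    and store the result for next time
    return response
```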
Consequences
Benefits:
- Significantly reduced response latency for semantically equivalent queries
- Lower computational resource requirements
- Decreased external API costs for cloud-based LLMs
- More consistent response quality for similar queries
- Reduced environmental impact through lower energy consumption
Limitations:
- Implementation complexity compared to traditional caching
- Computing embeddings adds some overhead to each query
- Determining appropriate similarity thresholds requires tuning
- Risk of returning slightly inappropriate responses if thresholds are too loose
- Cache invalidation becomes more complex with semantic matching
- Storage requirements increase with vector representations
Performance implications:
- Trade-off between embedding computation time and potential cache hit benefits
- Vector search performance depends on database size and optimization
- Memory footprint grows with both cached responses and their vector representations
- System may require batch maintenance operations that impact performance
Implementation
To implement Semantic Caching effectively (each step below is illustrated with a short sketch in the Code Examples section):
- Select an embedding model: Choose a model appropriate for your domain, balancing quality and performance. Sentence transformers or similar models designed for semantic similarity are good choices.
- Configure vector similarity metrics: Determine whether to use cosine similarity, Euclidean distance, or other metrics based on your embedding space characteristics.
- Establish threshold policies: Set initial similarity thresholds conservatively and refine based on performance data. Different query categories may require different thresholds.
- Implement a layered approach:
  - First check for exact matches (traditional cache)
  - Then check for high-confidence semantic matches
  - Finally, fall back to LLM processing
- Design for cache maintenance:
  - Implement time-based expiration for domains with changing information
  - Create pruning strategies for least-recently or least-frequently used entries
  - Consider periodic re-embedding if embedding models are updated
- Monitor and measure:
  - Track semantic hit rates separately from exact matches
  - Measure false positive rates (inappropriate matches)
  - Evaluate latency improvements and cost savings
- Handle edge cases:
  - Develop strategies for queries containing time-sensitive elements
  - Create bypass mechanisms for queries requiring fresh information
  - Implement user feedback loops to identify problematic cache hits
Code Examples
To do...
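The sketches below follow the implementation steps above; they are illustrative rather than a reference implementation. Step 1, selecting an embedding model, shown with the sentence-transformers library; the model name is a common general-purpose default, not a recommendation for every domain.

```python
from sentence_transformers import SentenceTransformer

# A small, general-purpose model; domain-specific models may match better.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str):
    # normalize_embeddings=True lets a plain dot product serve as cosine similarity
    return model.encode(text, normalize_embeddings=True)

v1 = embed("What is the capital of France?")
v2 = embed("Tell me the capital city of France")
similarity = float(v1 @ v2)   # cosine similarity, since the vectors are normalized
```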
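Step 2, configuring similarity metrics. A sketch of cosine similarity and Euclidean distance in NumPy; with unit-normalized embeddings the two rank candidates identically, but their threshold scales differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher is more similar (1.0 = identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Lower is more similar (0.0 = identical vectors).
    return float(np.linalg.norm(a - b))
```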
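Step 3, threshold policies. One illustrative way to express per-category thresholds; the categories and numbers are placeholder assumptions to be tuned against real traffic.

```python
# Cosine-similarity thresholds per query category (illustrative values).
THRESHOLDS = {
    "default": 0.90,
    "factual": 0.92,    # factual lookups tolerate reuse well
    "medical": 0.97,    # high-stakes domains demand near-identical queries
    "chitchat": 0.85,   # low-stakes small talk can match loosely
}

def is_cache_hit(similarity: float, category: str = "default") -> bool:
    return similarity >= THRESHOLDS.get(category, THRESHOLDS["default"])
```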
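Step 4, the layered approach. A sketch of the three tiers: an exact-match dictionary first, then the semantic index, and finally the LLM. The `embedder` and `store` arguments follow the illustrative interfaces sketched under Components.

```python
exact_cache: dict[str, str] = {}   # tier 1: normalized query text -> response

def layered_lookup(query: str, embedder, store, llm_generate,
                   threshold: float = 0.90) -> str:
    # Tier 1: exact match (cheapest check).
    key = query.strip().lower()
    if key in exact_cache:
        return exact_cache[key]

    # Tier 2: semantic match against previously seen queries.
    vector = embedder.embed(query)
    candidates = store.nearest(vector, k=1)
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][2]

    # Tier 3: fall back to the LLM and populate both tiers.
    response = llm_generate(query)
    exact_cache[key] = response
    store.add(query, vector, response)
    return response
```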
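Step 5, cache maintenance. A sketch of time-based expiration and least-recently-used pruning over a simple in-memory record; a production deployment would more likely lean on the vector database's own TTL or deletion mechanisms.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    vector: list[float]
    response: str
    created_at: float = field(default_factory=time.time)
    last_used: float = field(default_factory=time.time)

def expire_entries(entries: dict[str, CacheEntry], ttl_seconds: float) -> None:
    """Drop entries older than the TTL (time-based invalidation)."""
    now = time.time()
    for key in [k for k, e in entries.items() if now - e.created_at > ttl_seconds]:
        del entries[key]

def prune_lru(entries: dict[str, CacheEntry], max_entries: int) -> None:
    """Keep only the most recently used entries when the cache grows too large."""
    if len(entries) <= max_entries:
        return
    excess = len(entries) - max_entries
    oldest_first = sorted(entries.items(), key=lambda kv: kv[1].last_used)
    for key, _ in oldest_first[:excess]:
        del entries[key]
```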
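Step 6, monitoring. A sketch of the counters worth tracking; false positives generally surface through user feedback or spot checks rather than being detected automatically.

```python
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    false_positives: int = 0   # reported via feedback, not detected automatically

    @property
    def semantic_hit_rate(self) -> float:
        total = self.exact_hits + self.semantic_hits + self.misses
        return self.semantic_hits / total if total else 0.0
```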
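Step 7, edge cases. A simple bypass heuristic that routes time-sensitive queries straight to the LLM; the keyword list is a crude stand-in for whatever freshness classifier the application actually uses.

```python
TIME_SENSITIVE_MARKERS = ("today", "now", "latest", "current", "this week", "breaking")

def should_bypass_cache(query: str) -> bool:
    """Route time-sensitive queries straight to the LLM for fresh information."""
    lowered = query.lower()
    return any(marker in lowered for marker in TIME_SENSITIVE_MARKERS)
```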
Variations
Hybrid Exact-Semantic Caching: Combines traditional exact-match caching with semantic caching to get the best of both worlds - the speed of exact matching when available and the flexibility of semantic matching when needed.
Hierarchical Similarity Thresholds: Implements multiple tiers of similarity thresholds with different confidence levels, potentially requiring human verification for lower-confidence matches.
Query Normalization: Pre-processes queries to normalize formatting, spelling, and structure before embedding to improve match likelihood.
Context-Aware Caching: Incorporates conversation context or user profile information as part of the cache key to provide personalized cached responses.
Generative Cache Augmentation: Instead of directly returning cached responses, uses them as context for a lightweight model to generate tailored responses, blending caching with generation.
Real-World Examples
- Customer Support Systems: Support platforms use semantic caching to quickly retrieve answers to common questions phrased in various ways, reducing response time and agent workload.
- Enterprise Search Applications: Knowledge management systems implement semantic caching to speed up frequently searched concepts across different syntactic formulations.
- AI Assistants: Virtual assistants employ semantic caching to maintain responsiveness when handling repetitive queries about weather, factual information, or recommendations.
- Educational Platforms: Learning systems cache explanations and examples for similar educational queries, ensuring consistent quality while reducing processing requirements.
Related Patterns
- Retrieval-Augmented Generation (RAG): Often combined with Semantic Caching, where cache misses trigger retrieval operations and the results are then cached for future similar queries.
- Complexity-Based Routing: Works alongside Semantic Caching, where simple queries might be served from cache while complex ones go to more powerful models.
- Fallback Chains: Semantic Caching can be implemented as the first step in a fallback chain, with LLM processing as a fallback when cache misses occur.
- Episodic Memory: Complements Semantic Caching by maintaining conversation history that can inform cache matching decisions with additional context.
- Dynamic Prompt Engineering: Can be applied to cache misses, optimizing the prompt before sending it to the underlying model.