Semantic Caching
Classification
Intent
To optimize response time and resource utilization by storing and reusing previous LLM responses based on semantic similarity rather than exact matching, thereby reducing redundant processing of semantically equivalent queries.
Also Known As
- Similarity-Based Caching
- Embedding-Based Caching
- Neural Caching
- Contextual Response Caching
Motivation
Large language model (LLM) operations are computationally expensive and time-consuming. Traditional caching systems rely on exact key matching, which cannot recognize that differently worded queries may be asking for the same information.
For example, the queries "What is the capital of France?" and "Tell me the capital city of France" have different text but identical meaning. A traditional cache would miss the opportunity to reuse the response, while semantic caching recognizes their similarity.
This pattern addresses:
- The high computational cost of LLM inference
- The latency challenge in real-time applications
- The need to reduce redundant processing
- The opportunity to leverage semantic understanding for efficiency
Applicability
Use Semantic Caching when:
- Your application handles many semantically similar queries
- Response time is critical to user experience
- You need to reduce API costs for external LLM services
- You want to optimize resource utilization for self-hosted models
- The domain of queries is relatively bounded or predictable
- The answers to queries are not highly time-sensitive or rapidly changing
Prerequisites:
- Access to embedding models for converting queries to vector representations
- Vector database or search capability for similarity matching
- Defined threshold policies for "close enough" matching
- Cache invalidation strategies for time-sensitive information
Structure
To do...
Components
- Query Embedder: Transforms incoming text queries into vector embeddings that capture semantic meaning
- Vector Store: Database optimized for storing and retrieving vector representations with similarity search capabilities
- Similarity Matcher: Determines if an incoming query is semantically similar enough to cached queries using distance metrics
- Cache Manager: Handles storage, retrieval, invalidation, and maintenance of the cache
- Threshold Controller: Configures and adjusts the similarity thresholds that determine cache hits
- Expiration Handler: Manages time-based invalidation of cached entries
- Cache Analyzer: Monitors cache performance metrics like hit rates and suggests optimizations
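One way to make this component list concrete is to give each piece a small interface. The sketch below uses Python protocols; the method names and signatures are illustrative assumptions, not taken from any specific library.

```python
from typing import Optional, Protocol, Sequence


class QueryEmbedder(Protocol):
    def embed(self, text: str) -> Sequence[float]:
        """Convert a text query into a vector embedding."""
        ...


class VectorStore(Protocol):
    def add(self, key: str, vector: Sequence[float], response: str) -> None:
        """Store a query vector alongside its cached response."""
        ...

    def nearest(self, vector: Sequence[float], k: int = 1) -> list[tuple[str, float, str]]:
        """Return up to k (key, similarity, response) tuples, best match first."""
        ...


class SimilarityMatcher(Protocol):
    def is_hit(self, similarity: float, category: str = "default") -> bool:
        """Decide whether a similarity score counts as a cache hit."""
        ...


class CacheManager(Protocol):
    def lookup(self, query: str) -> Optional[str]:
        """Return a cached response for a semantically similar query, if any."""
        ...

    def store(self, query: str, response: str) -> None:
        """Cache a freshly generated response."""
        ...
```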
Interactions
- When a query arrives, the Query Embedder converts it to a vector representation
- The Similarity Matcher searches the Vector Store for semantically similar previous queries
- If a match exceeding the similarity threshold is found, the Cache Manager retrieves the cached response
- If no suitable match exists, the query is processed by the LLM and the result is stored in the cache
- Periodically, the Expiration Handler removes outdated entries based on configured policies
- The Cache Analyzer continuously evaluates performance and may trigger threshold adjustments or cache pruning
The Threshold Controller may dynamically adjust similarity thresholds based on:
- The criticality of the query
- The domain context
- Current system load
- Historical hit/miss patterns
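Taken together, the interaction flow condenses into a single lookup-or-generate routine. The sketch below assumes the illustrative component interfaces from the Components section; `llm_generate` is a placeholder for whatever model call the application makes, and expiration/pruning run out of band.

```python
def get_response(query: str,
                 embedder,        # QueryEmbedder
                 store,           # VectorStore
                 matcher,         # SimilarityMatcher
                 llm_generate) -> str:
    """Serve a query from the semantic cache when possible, else call the LLM."""
    vector = embedder.embed(query)              # 1. embed the incoming query
    candidates = store.nearest(vector, k=1)     # 2. search for similar past queries

    if candidates:
        _, similarity, cached_response = candidates[0]
        if matcher.is_hit(similarity):          # 3. similarity above threshold -> cache hit
            return cached_response

    response = llm_generate(query)              # 4. cache miss -> run the model
    store.add(query, vector, response)          #    and store the result for next time
    return response
```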
Consequences
Benefits:
- Significantly reduced response latency for semantically equivalent queries
- Lower computational resource requirements
- Decreased external API costs for cloud-based LLMs
- More consistent response quality for similar queries
- Reduced environmental impact through lower energy consumption
Limitations:
- Implementation complexity compared to traditional caching
- Computing embeddings adds some overhead to each query
- Determining appropriate similarity thresholds requires tuning
- Risk of returning slightly inappropriate responses if thresholds are too loose
- Cache invalidation becomes more complex with semantic matching
- Storage requirements increase with vector representations
Performance implications:
- Trade-off between embedding computation time and potential cache hit benefits
- Vector search performance depends on database size and optimization
- Memory footprint grows with both cached responses and their vector representations
- System may require batch maintenance operations that impact performance
Implementation
To implement Semantic Caching effectively (each step below is illustrated with a short sketch in the Code Examples section):
- Select an embedding model: Choose a model appropriate for your domain, balancing quality and performance. Sentence transformers or similar models designed for semantic similarity are good choices.
- Configure vector similarity metrics: Determine whether to use cosine similarity, Euclidean distance, or other metrics based on your embedding space characteristics.
- Establish threshold policies: Set initial similarity thresholds conservatively and refine based on performance data. Different query categories may require different thresholds.
- Implement a layered approach:
  - First check for exact matches (traditional cache)
  - Then check for high-confidence semantic matches
  - Finally, fall back to LLM processing
- Design for cache maintenance:
  - Implement time-based expiration for domains with changing information
  - Create pruning strategies for least-recently or least-frequently used entries
  - Consider periodic re-embedding if embedding models are updated
- Monitor and measure:
  - Track semantic hit rates separately from exact matches
  - Measure false positive rates (inappropriate matches)
  - Evaluate latency improvements and cost savings
- Handle edge cases:
  - Develop strategies for queries containing time-sensitive elements
  - Create bypass mechanisms for queries requiring fresh information
  - Implement user feedback loops to identify problematic cache hits
Code Examples
To do...
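The sketches below follow the implementation steps above; they are illustrative rather than a reference implementation. Step 1, selecting an embedding model, shown with the sentence-transformers library; the model name is a common general-purpose default, not a recommendation for every domain.

```python
from sentence_transformers import SentenceTransformer

# A small, general-purpose model; domain-specific models may match better.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text: str):
    # normalize_embeddings=True lets a plain dot product serve as cosine similarity
    return model.encode(text, normalize_embeddings=True)

v1 = embed("What is the capital of France?")
v2 = embed("Tell me the capital city of France")
similarity = float(v1 @ v2)   # cosine similarity, since the vectors are normalized
```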
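Step 2, configuring similarity metrics. A sketch of cosine similarity and Euclidean distance in NumPy; with unit-normalized embeddings the two rank candidates identically, but their threshold scales differ.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher is more similar (1.0 = identical direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Lower is more similar (0.0 = identical vectors).
    return float(np.linalg.norm(a - b))
```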
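Step 3, threshold policies. One illustrative way to express per-category thresholds; the categories and numbers are placeholder assumptions to be tuned against real traffic.

```python
# Cosine-similarity thresholds per query category (illustrative values).
THRESHOLDS = {
    "default": 0.90,
    "factual": 0.92,    # factual lookups tolerate reuse well
    "medical": 0.97,    # high-stakes domains demand near-identical queries
    "chitchat": 0.85,   # low-stakes small talk can match loosely
}

def is_cache_hit(similarity: float, category: str = "default") -> bool:
    return similarity >= THRESHOLDS.get(category, THRESHOLDS["default"])
```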
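Step 4, the layered approach. A sketch of the three tiers: an exact-match dictionary first, then the semantic index, and finally the LLM. The `embedder` and `store` arguments follow the illustrative interfaces sketched under Components.

```python
exact_cache: dict[str, str] = {}   # tier 1: normalized query text -> response

def layered_lookup(query: str, embedder, store, llm_generate,
                   threshold: float = 0.90) -> str:
    # Tier 1: exact match (cheapest check).
    key = query.strip().lower()
    if key in exact_cache:
        return exact_cache[key]

    # Tier 2: semantic match against previously seen queries.
    vector = embedder.embed(query)
    candidates = store.nearest(vector, k=1)
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][2]

    # Tier 3: fall back to the LLM and populate both tiers.
    response = llm_generate(query)
    exact_cache[key] = response
    store.add(query, vector, response)
    return response
```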
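Step 5, cache maintenance. A sketch of time-based expiration and least-recently-used pruning over a simple in-memory record; a production deployment would more likely lean on the vector database's own TTL or deletion mechanisms.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    vector: list[float]
    response: str
    created_at: float = field(default_factory=time.time)
    last_used: float = field(default_factory=time.time)

def expire_entries(entries: dict[str, CacheEntry], ttl_seconds: float) -> None:
    """Drop entries older than the TTL (time-based invalidation)."""
    now = time.time()
    for key in [k for k, e in entries.items() if now - e.created_at > ttl_seconds]:
        del entries[key]

def prune_lru(entries: dict[str, CacheEntry], max_entries: int) -> None:
    """Keep only the most recently used entries when the cache grows too large."""
    if len(entries) <= max_entries:
        return
    excess = len(entries) - max_entries
    oldest_first = sorted(entries.items(), key=lambda kv: kv[1].last_used)
    for key, _ in oldest_first[:excess]:
        del entries[key]
```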
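Step 6, monitoring. A sketch of the counters worth tracking; false positives generally surface through user feedback or spot checks rather than being detected automatically.

```python
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    false_positives: int = 0   # reported via feedback, not detected automatically

    @property
    def semantic_hit_rate(self) -> float:
        total = self.exact_hits + self.semantic_hits + self.misses
        return self.semantic_hits / total if total else 0.0
```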
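Step 7, edge cases. A simple bypass heuristic that routes time-sensitive queries straight to the LLM; the keyword list is a crude stand-in for whatever freshness classifier the application actually uses.

```python
TIME_SENSITIVE_MARKERS = ("today", "now", "latest", "current", "this week", "breaking")

def should_bypass_cache(query: str) -> bool:
    """Route time-sensitive queries straight to the LLM for fresh information."""
    lowered = query.lower()
    return any(marker in lowered for marker in TIME_SENSITIVE_MARKERS)
```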
Variations
Hybrid Exact-Semantic Caching: Combines traditional exact-match caching with semantic caching to get the best of both worlds - the speed of exact matching when available and the flexibility of semantic matching when needed.
Hierarchical Similarity Thresholds: Implements multiple tiers of similarity thresholds with different confidence levels, potentially requiring human verification for lower-confidence matches.
Query Normalization: Pre-processes queries to normalize formatting, spelling, and structure before embedding to improve match likelihood.
Context-Aware Caching: Incorporates conversation context or user profile information as part of the cache key to provide personalized cached responses.
Generative Cache Augmentation: Instead of directly returning cached responses, uses them as context for a lightweight model to generate tailored responses, blending caching with generation.
Real-World Examples
- Customer Support Systems: Support platforms use semantic caching to quickly retrieve answers to common questions phrased in various ways, reducing response time and agent workload.
- Enterprise Search Applications: Knowledge management systems implement semantic caching to speed up frequently searched concepts across different syntactic formulations.
- AI Assistants: Virtual assistants employ semantic caching to maintain responsiveness when handling repetitive queries about weather, factual information, or recommendations.
- Educational Platforms: Learning systems cache explanations and examples for similar educational queries, ensuring consistent quality while reducing processing requirements.
Related Patterns
- Retrieval-Augmented Generation (RAG): Often combined with Semantic Caching, where cache misses trigger retrieval operations and the results are then cached for future similar queries.
- Complexity-Based Routing: Works alongside Semantic Caching, where simple queries might be served from cache while complex ones go to more powerful models.
- Fallback Chains: Semantic Caching can be implemented as the first step in a fallback chain, with LLM processing as a fallback when cache misses occur.
- Episodic Memory: Complements Semantic Caching by maintaining conversation history that can inform cache matching decisions with additional context.
- Dynamic Prompt Engineering: Can be applied to cache misses, optimizing the prompt before sending it to the underlying model.