# RAG Improvements - kennetholsenatm-gif/q_mini_wasm_v2 GitHub Wiki
This document describes the improvements made to the Retrieval-Augmented Generation (RAG) system in the q_mini_wasm_v2 project.
## Pluggable Embedding Service

**Problem:** The original implementation used a placeholder embedding that returned identical values for every dimension, making semantic search ineffective.

**Solution:** Created a pluggable embedding service interface with multiple implementations:
- `PlaceholderEmbeddingService`: Generates deterministic embeddings based on a hash of the text (for testing)
- `TFIDFEmbeddingService`: TF-IDF based embeddings for keyword-focused search
- `CompositeEmbeddingService`: Combines multiple embedding services with weighted averaging
**Benefits:**
- Modular design allows easy integration of external embedding models (OpenAI, Cohere, local ONNX models)
- Deterministic placeholder ensures reproducible results during development
- Caching at the embedding level reduces redundant computation
## Hybrid Search

**Problem:** Only semantic vector search was used, missing exact keyword matches.

**Solution:** Implemented hybrid search combining:
- Semantic Search: Vector similarity using Qdrant
- Keyword Search: BM25-like scoring for term frequency matching
- Configurable Weights: Adjust balance between semantic (default 70%) and keyword (default 30%) results
- Reranking: Post-retrieval reranking based on query-document overlap
**Benefits:**
- Better recall for exact term matches (function names, error codes)
- Improved relevance through combined scoring
- Fallback to semantic search if hybrid search fails
## Two-Level Caching

**Problem:** The absence of caching led to repeated expensive operations for identical queries and embeddings.

**Solution:** Implemented two-level caching:
- `EmbeddingCache`: Caches generated embeddings
- `QueryCache`: Caches complete query results
- LRU/LFU/FIFO Eviction: Configurable eviction policies
- TTL Support: Time-based cache expiration
**Benefits:**
- Reduced latency for repeated queries
- Lower computational costs
- Configurable cache size and policies
## Service Integration

**Changes:**
- Integrated all new components
- Added comprehensive metrics tracking (cache hits/misses)
- Improved error handling with fallback mechanisms
- Better configuration management
## Configuration

```go
type ServiceConfig struct {
	Qdrant        QdrantConfig       // Vector database configuration
	Chunker       ChunkConfig        // Document chunking settings
	Scaler        TokenScalerConfig  // Token budget scaling
	HybridSearch  HybridSearchConfig // Hybrid search weights
	Cache         CacheConfig        // Caching configuration
	ProjectRoot   string             // Root directory for indexing
	AutoIndex     bool               // Auto-index on startup
	EmbeddingType string             // "placeholder", "tfidf", "composite"
}

type HybridSearchConfig struct {
	SemanticWeight float64 // Weight for semantic search (0-1)
	KeywordWeight  float64 // Weight for keyword search (0-1)
	RerankTopK     int     // Number of results to rerank
	MinScore       float32 // Minimum score threshold
}

type CacheConfig struct {
	MaxSize        int           // Maximum number of items
	TTL            time.Duration // Time to live
	EvictionPolicy string        // "lru", "lfu", "fifo"
}
```

Example configuration:

```go
config := rag.ServiceConfig{
	Qdrant: rag.QdrantConfig{
		Host:           "localhost",
		Port:           6334,
		CollectionName: "qminiwasm_docs",
		VectorSize:     384,
	},
	HybridSearch: rag.HybridSearchConfig{
		SemanticWeight: 0.7,
		KeywordWeight:  0.3,
	},
	// ... (truncated; see source for complete code)
}
```

## Usage

```go
result, err := service.RetrieveContext(ctx, "How to implement quantum gates?", 2048, "code")
if err != nil {
	log.Fatal(err)
}
for _, chunk := range result.Chunks {
	fmt.Printf("Score: %.2f, File: %s, Lines: %d-%d\n",
		chunk.Score, chunk.SourceFile, chunk.StartLine, chunk.EndLine)
}
```

## Metrics

The service now tracks enhanced metrics:
```go
type Metrics struct {
	TotalDocuments int64   // Total indexed documents
	TotalChunks    int64   // Total indexed chunks
	TotalQueries   int64   // Total queries served
	CacheHits      int64   // Cache hit count
	CacheMisses    int64   // Cache miss count
	AvgLatencyMs   float64 // Average query latency
	AvgTokensSaved float64 // Average tokens saved
	CacheHitRate   float64 // Cache hit rate percentage
}
```

## Future Improvements

- External Embedding Integration: Add support for OpenAI, Cohere, or local ONNX models
- BM25 Index: Full BM25 implementation with document frequency tracking
- Query Expansion: Use LLM to expand queries for better retrieval
- Semantic Chunking: Use embedding similarity for smarter document splitting
- Distributed Caching: Redis-based caching for multi-instance deployments
## REST API Endpoints

- `POST /api/v1/rag/retrieve` - Retrieve context for a query
- `GET /api/v1/rag/metrics` - Get RAG service metrics
- `POST /api/v1/rag/index` - Index a single document
## MCP Tools

- `rag_retrieve_context` - Retrieve relevant context
- `rag_index_document` - Index a document
- `rag_get_metrics` - Get service metrics
- `rag_reindex_all` - Reindex all documents
- `kanban_rag_status` - Get RAG service status