Core Concepts in Vector Similarity Search
Table of Contents
- Vector Embeddings
- ChromaDB Architecture
- SentenceTransformers
- Similarity Metrics
- Metadata Filtering
- HNSW Algorithm
- Best Practices
- Common Use Cases
- Common Issues and Best Practices
Vector Embeddings
What are Vector Embeddings?
Vector embeddings are numerical representations of text, images, or other data types in a high-dimensional space. They capture semantic meaning and relationships between different pieces of content.
Key Properties
- Semantic Similarity: Similar concepts have similar vector representations
- Dimensional Space: Typically 384, 512, 768, or 1024 dimensions
- Mathematical Operations: Enable similarity calculations using distance metrics
Example
# Text: "Software Engineer with Python skills"
# Embedding: [0.234, -0.123, 0.456, ..., 0.789] (384 dimensions)
# Text: "Python Developer"
# Embedding: [0.245, -0.134, 0.467, ..., 0.798] (384 dimensions)
# Similar vectors because of semantic relationship
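A runnable version of this example (a minimal sketch; the printed value depends on the model version, but the two texts should score far higher together than unrelated texts would):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([
    "Software Engineer with Python skills",
    "Python Developer"
])
print(embeddings.shape)  # (2, 384)
# Cosine similarity between the two vectors
print(util.cos_sim(embeddings[0], embeddings[1]).item())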
Benefits
- Context Understanding: Captures meaning beyond keywords
- Language Agnostic: Multilingual models map different languages into a shared vector space
- Transferable: Pre-trained models work on various domains
ChromaDB Architecture
Overview
ChromaDB is an open-source vector database designed for AI applications, providing efficient storage and retrieval of embeddings.
Core Components
1. Client
import chromadb

client = chromadb.Client()  # In-memory client
# OR
client = chromadb.PersistentClient(path="./chroma_db")  # Persistent storage
2. Collections
collection = client.create_collection(
    name="employees",
    metadata={"description": "Employee data"},
    embedding_function=embedding_function
)
3. Documents and Metadata
collection.add(
    documents=["Software Engineer with Python skills"],
    metadatas=[{"department": "Engineering", "experience": 5}],
    ids=["emp_1"]
)
Storage Options
- In-Memory: Fast, temporary storage for development
- Persistent: File-based storage for production
- Client-Server: Distributed deployment for scale
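For the client-server option, ChromaDB provides an HTTP client; a minimal sketch, assuming a Chroma server is already running on the default port 8000:
import chromadb

# Connect to a running Chroma server rather than embedding the database in-process
client = chromadb.HttpClient(host="localhost", port=8000)
client.heartbeat()  # Simple connectivity check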
Key Features
- Automatic Indexing: Efficient similarity search
- Metadata Filtering: Combine vector search with structured queries
- Multiple Distance Metrics: Cosine, Euclidean, Manhattan
- Batch Operations: Efficient bulk operations
SentenceTransformers
Overview
SentenceTransformers is a Python library that provides an easy method to compute dense vector representations for sentences, paragraphs, and images.
Model Selection
all-MiniLM-L6-v2
- Dimensions: 384
- Performance: Good balance of quality and speed
- Use Case: General-purpose sentence embeddings
- Size: ~90MB
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello World", "Goodbye World"])
Other Popular Models
- all-mpnet-base-v2: Higher quality, slower (768 dimensions)
- all-distilroberta-v1: RoBERTa-based (768 dimensions)
- paraphrase-MiniLM-L6-v2: Optimized for paraphrasing
Integration with ChromaDB
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
Advantages
- Pre-trained: No training required
- Multilingual: Many models support multiple languages
- Domain Adaptation: Can be fine-tuned for specific domains
- Efficient: Optimized for batch processing
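The batch-processing point in practice: encode() accepts a whole list plus an explicit batch size (the sentences below are made up for illustration):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [f"Employee profile {i}" for i in range(1000)]
# One call embeds the full list; batch_size bounds memory use per forward pass
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (1000, 384)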
Similarity Metrics
Cosine Similarity
Most common metric for text embeddings.
Formula: similarity = cos(θ) = (A · B) / (||A|| ||B||)
Range: -1 to 1 for similarity (ChromaDB converts it to a distance of 1 − similarity, giving a 0-to-2 range)
- 0 = Identical vectors
- 1 = Orthogonal vectors
- 2 = Opposite vectors
Advantages:
- Magnitude independent
- Works well with normalized embeddings
- Intuitive interpretation
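A worked example of the formula and the distance conversion above, using plain NumPy and toy 3-dimensional vectors:
import numpy as np

a = np.array([0.234, -0.123, 0.456])  # Toy 3-dimensional vectors
b = np.array([0.245, -0.134, 0.467])
# similarity = (A · B) / (||A|| ||B||)
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
distance = 1 - similarity  # ChromaDB's cosine distance
print(f"similarity={similarity:.4f}, distance={distance:.4f}")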
Euclidean Distance
Measures straight-line distance in vector space.
Formula: distance = √(Σᵢ(aᵢ − bᵢ)²)
Use Cases:
- When magnitude matters
- Lower-dimensional spaces
- Geometric applications
Manhattan Distance (L1)
Sum of absolute differences.
Formula: distance = Σᵢ|aᵢ − bᵢ|
Use Cases:
- High-dimensional sparse data
- When robustness to outliers matters (differences are penalized linearly rather than quadratically)
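Both formulas on the same toy vectors used in the cosine example:
import numpy as np

a = np.array([0.234, -0.123, 0.456])
b = np.array([0.245, -0.134, 0.467])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # √(Σ(aᵢ − bᵢ)²)
manhattan = np.sum(np.abs(a - b))          # Σ|aᵢ − bᵢ|
print(f"euclidean={euclidean:.4f}, manhattan={manhattan:.4f}")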
Metadata Filtering
Overview
Metadata filtering allows combining vector similarity with structured queries, enabling precise and efficient search.
Query Types
Exact Match
collection.get(where={"department": "Engineering"})
Range Queries
# Greater than or equal
collection.get(where={"experience": {"$gte": 10}})
# Less than
collection.get(where={"experience": {"$lt": 5}})
# Between values
collection.get(where={
    "experience": {"$gte": 5, "$lte": 15}
})
Array Operations
# In array
collection.get(where={
    "location": {"$in": ["New York", "San Francisco"]}
})
# Not in array
collection.get(where={
    "department": {"$nin": ["HR", "Finance"]}
})
Logical Operations
# AND operation
collection.get(where={
    "$and": [
        {"department": "Engineering"},
        {"experience": {"$gte": 5}}
    ]
})
# OR operation
collection.get(where={
    "$or": [
        {"location": "New York"},
        {"experience": {"$gte": 15}}
    ]
})
Combined Search
# Vector similarity + metadata filtering
results = collection.query(
    query_texts=["Python developer"],
    n_results=5,
    where={
        "$and": [
            {"department": "Engineering"},
            {"experience": {"$gte": 3}}
        ]
    }
)
Performance Benefits
- Pre-filtering: Reduces vector search space
- Index Optimization: Database can optimize queries
- Precision: Combines semantic and structured search
HNSW Algorithm
Overview
Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for approximate nearest neighbor search.
Key Concepts
1. Graph Structure
- Nodes: Vector embeddings
- Edges: Connections between similar vectors
- Layers: Hierarchical organization for efficient search
2. Search Process
- Entry Point: Start at top layer
- Greedy Search: Move to most similar neighbor
- Layer Descent: Move down layers for precision
- Result: Approximate nearest neighbors (see the simplified sketch below)
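The hierarchy is what makes HNSW fast, but the greedy step alone is easy to illustrate. A deliberately simplified single-layer sketch on random toy data (not the real algorithm: no layers, no candidate lists):
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8))  # 100 toy embeddings
# Toy graph: connect each node to its 4 nearest neighbors by Euclidean distance
dists = np.linalg.norm(vectors[:, None] - vectors[None, :], axis=-1)
neighbors = np.argsort(dists, axis=1)[:, 1:5]

def greedy_search(query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    while True:
        candidates = neighbors[current]
        best = min(candidates, key=lambda i: np.linalg.norm(vectors[i] - query))
        if np.linalg.norm(vectors[best] - query) >= np.linalg.norm(vectors[current] - query):
            return current  # Local minimum: no neighbor is closer
        current = best

query = rng.normal(size=8)
print("approximate nearest neighbor:", greedy_search(query))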
Configuration in ChromaDB
collection = client.create_collection(
    name="my_collection",
    configuration={
        "hnsw": {
            "space": "cosine",       # Distance metric
            "M": 16,                 # Max connections per node
            "ef_construction": 200,  # Search depth during construction
            "ef": 10,                # Search depth during query
            "max_elements": 10000    # Maximum elements
        }
    }
)
Parameters Explained
M (Max Connections)
- Default: 16
- Higher M: Better recall, more memory
- Lower M: Faster queries, less memory
- Typical Range: 8-64
ef_construction
- Default: 200
- Purpose: Controls index quality during building
- Higher Values: Better quality, slower indexing
- Typical Range: 100-800
ef (Query Time)
- Default: 10
- Purpose: Controls search thoroughness
- Higher Values: Better recall, slower queries
- Typical Range: 10-500
Trade-offs
- Speed vs Accuracy: Approximate but very fast
- Memory vs Recall: More connections = better recall, at the cost of more memory
- Build Time vs Query Time: Better index = faster queries
Advantages
- Scalability: Logarithmic search complexity
- Flexibility: Configurable parameters
- Performance: Fast approximate search
- Memory Efficiency: Reasonable memory usage
Best Practices
1. Embedding Strategy
- Choose appropriate model for your domain
- Consistent preprocessing of text (see the sketch after this list)
- Consider multilingual needs
- Balance quality vs speed requirements
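A minimal sketch of consistent preprocessing; the normalization choices here are illustrative, not prescriptive, and 'collection' is assumed to be a ChromaDB collection created as shown earlier. The key point is applying the same function at index time and at query time:
def preprocess(text: str) -> str:
    """Normalize text identically at index time and query time."""
    return " ".join(text.lower().split())  # Lowercase, collapse whitespace

# Apply on both paths
collection.add(documents=[preprocess("  Software  Engineer ")], ids=["emp_1"])
results = collection.query(query_texts=[preprocess("software engineer")], n_results=5)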
2. Collection Design
- Use descriptive collection names
- Include relevant metadata fields
- Plan for scalability from the start
- Consider data privacy requirements
3. Query Optimization
- Use metadata filters to reduce search space
- Batch queries when possible
- Cache frequent query results (batching and caching are sketched after this list)
- Monitor performance metrics
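A sketch of the first two points, assuming 'collection' is a ChromaDB collection as created earlier: query_texts accepts several queries in one call, and a plain dict serves as a minimal cache (no eviction policy):
# Batch: one call embeds and searches all queries together
batch_results = collection.query(
    query_texts=["Python developer", "Data scientist", "DevOps engineer"],
    n_results=5
)

# Cache: memoize results for repeated query strings
_query_cache = {}

def cached_query(collection, text, n_results=5):
    key = (text, n_results)
    if key not in _query_cache:
        _query_cache[key] = collection.query(query_texts=[text], n_results=n_results)
    return _query_cache[key]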
4. Error Handling
- Always check for empty results
- Handle model loading errors
- Validate input data
- Implement graceful degradation
5. Performance Tuning
- Adjust HNSW parameters based on use case
- Monitor memory usage
- Consider persistent storage for production
- Profile query performance regularly
Common Use Cases
1. Document Search
- Legal document retrieval
- Research paper discovery
- Knowledge base search
2. Recommendation Systems
- Product recommendations
- Content suggestions
- Similar item finding
3. Similarity Detection
- Duplicate detection
- Plagiarism checking
- Content clustering
4. Question Answering
- FAQ matching
- Customer support
- Educational systems
5. Content Classification
- Automatic tagging
- Category assignment
- Quality assessment
This guide provides the theoretical foundation for understanding how vector similarity search works in practice. The concepts here are implemented in the main application script.
Common Issues and Best Practices
ChromaDB Metadata Limitations
Supported Data Types
ChromaDB metadata only supports scalar values:
- str (string)
- int (integer)
- float (floating-point number)
- bool (boolean)
- None (null value)
Handling Complex Data
# ❌ This will fail
metadata = {
    "tags": ["fiction", "adventure"],     # Lists not supported
    "author_info": {"name": "Author"}     # Objects not supported
}

# ✅ Convert to supported types
metadata = {
    "tags": "fiction, adventure",          # Convert list to string
    "author_info": '{"name": "Author"}',   # Convert object to JSON string
    "tag_count": 2,                        # Extract numeric properties
    "has_tags": True                       # Extract boolean properties
}
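A helper that applies these conversions automatically (a sketch; the comma-joined list and JSON fallback are one possible convention, not a ChromaDB requirement):
import json

def sanitize_metadata(metadata: dict) -> dict:
    """Coerce unsupported values into ChromaDB-compatible scalars."""
    clean = {}
    for key, value in metadata.items():
        if isinstance(value, (str, int, float, bool)) or value is None:
            clean[key] = value                       # Already supported
        elif isinstance(value, (list, tuple)):
            clean[key] = ", ".join(map(str, value))  # List -> comma-separated string
        else:
            clean[key] = json.dumps(value)           # Anything else -> JSON string
    return clean

print(sanitize_metadata({"tags": ["fiction", "adventure"], "author_info": {"name": "Author"}}))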
Performance Optimization
Collection Configuration
collection = client.create_collection(
    name="optimized_collection",
    configuration={
        "hnsw": {
            "space": "cosine",       # Choose appropriate distance metric
            "M": 16,                 # Balance between speed and accuracy
            "ef_construction": 200,  # Higher = better accuracy, slower build
            "ef": 50,                # Higher = better accuracy, slower search
            "max_elements": 100000   # Set realistic capacity
        }
    }
)
Embedding Model Selection
# For speed (recommended for development)
model = SentenceTransformer('all-MiniLM-L6-v2') # 384 dim, 90MB
# For accuracy (recommended for production)
model = SentenceTransformer('all-mpnet-base-v2') # 768 dim, 420MB
# For multilingual
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
Search Strategy Guidelines
Query Design
# ✅ Good queries are specific and descriptive
query = "experienced Python developer with machine learning skills"
# ❌ Avoid overly vague or unfocused queries
query = "developer"  # Too vague
query = "python ml ai data science software engineering experience"  # Keyword soup dilutes the semantic signal
Combining Search Types
# Use semantic search for content discovery
semantic_results = collection.query(
    query_texts=["AI researcher"],
    n_results=20
)

# Use metadata filtering for precise requirements
filtered_results = collection.query(
    query_texts=["AI researcher"],
    where={"experience_years": {"$gte": 5}},
    n_results=10
)
Error Handling Patterns
Graceful Degradation
def robust_search(collection, query, filters=None, n_results=5):
    """Search with fallback strategies."""
    try:
        # Try combined search first
        results = collection.query(
            query_texts=[query],
            where=filters,
            n_results=n_results
        )
        if len(results["ids"][0]) == 0 and filters:
            # Fallback: retry without filters
            results = collection.query(
                query_texts=[query],
                n_results=n_results
            )
        return results
    except Exception as e:
        print(f"Search error: {e}")
        return {"ids": [[]], "documents": [[]], "distances": [[]]}
Collection Health Checks
def validate_collection(collection):
    """Verify the collection is properly configured."""
    try:
        count = collection.count()
        if count == 0:
            raise ValueError("Collection is empty")
        # Test query: raises if the index or embedding function is broken
        collection.query(query_texts=["test"], n_results=1)
        return True
    except Exception as e:
        print(f"Collection validation failed: {e}")
        return False
These concepts and best practices ensure reliable, performant vector search implementations.