Uses Examples and Tutorials - udit-asopa/similarity_search_chromadb GitHub Wiki
Usage Examples and Tutorials
Table of Contents
- Quick Start Guide
- Basic Usage Examples
- Advanced Search Patterns
- Customization Examples
- Production Deployment
- Troubleshooting
Quick Start Guide
1. Installation and Setup
# Clone the repository
git clone <repository-url>
cd sim_search_chromadb
# Install dependencies
pixi install
# Run the demo
pixi run python script.py
2. Expected Output
Collection created: employee_collection
Collection contents:
Number of documents: 15
=== Similarity Search Examples ===
1. Searching for Python developers:
Query: 'Python developer with web development experience'
1. John Doe (employee_1) - Distance: 0.3245
Role: Software Engineer, Department: Engineering
Document: Software Engineer with 5 years of experience...
Basic Usage Examples
Example 1: Simple Text Search
import chromadb
from chromadb.utils import embedding_functions
# Initialize
client = chromadb.Client()
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
# Create collection
collection = client.create_collection(
name="simple_search",
embedding_function=ef
)
# Add documents
collection.add(
documents=["I love programming in Python", "Java is great for enterprise"],
ids=["doc1", "doc2"]
)
# Search
results = collection.query(
query_texts=["Python development"],
n_results=1
)
print(f"Best match: {results['documents'][0][0]}")
# Output: "I love programming in Python"
Example 2: Search with Metadata
# Add documents with metadata
collection.add(
documents=["Senior Python Developer needed", "Junior Java Developer position"],
metadatas=[
{"language": "Python", "level": "Senior"},
{"language": "Java", "level": "Junior"}
],
ids=["job1", "job2"]
)
# Search with filtering
results = collection.query(
query_texts=["Python programming job"],
where={"level": "Senior"},
n_results=5
)
Example 3: Batch Operations
# Batch document addition
documents = [
"Machine Learning Engineer with TensorFlow experience",
"Data Scientist skilled in Python and R",
"Full Stack Developer using React and Node.js"
]
metadatas = [
{"role": "ML Engineer", "skills": ["TensorFlow", "Python"]},
{"role": "Data Scientist", "skills": ["Python", "R", "Statistics"]},
{"role": "Full Stack Developer", "skills": ["React", "Node.js", "JavaScript"]}
]
ids = ["ml_eng_1", "data_sci_1", "fullstack_1"]
collection.add(
documents=documents,
metadatas=metadatas,
ids=ids
)
Advanced Search Patterns
Pattern 1: Multi-Field Semantic Search
# Create rich documents combining multiple fields
def create_employee_document(employee):
"""Generate comprehensive search document from employee data."""
skills_text = f"Skills include {employee['skills']}"
role_text = f"Works as {employee['role']} in {employee['department']}"
experience_text = f"Has {employee['experience']} years of experience"
location_text = f"Located in {employee['location']}"
return f"{role_text}. {experience_text}. {skills_text}. {location_text}."
# Usage
employee = {
"role": "Senior Data Scientist",
"department": "Analytics",
"experience": 8,
"skills": "Python, Machine Learning, Statistics, SQL",
"location": "San Francisco"
}
document = create_employee_document(employee)
# Result: "Works as Senior Data Scientist in Analytics. Has 8 years of experience.
# Skills include Python, Machine Learning, Statistics, SQL. Located in San Francisco."
Pattern 2: Hierarchical Search
def hierarchical_search(collection, query, filters=None):
"""Perform search with fallback strategies."""
# First: Try with strict filters
if filters:
results = collection.query(
query_texts=[query],
where=filters,
n_results=5
)
if len(results['ids'][0]) > 0:
return results, "strict"
# Second: Try with relaxed filters
if filters and "experience" in filters:
relaxed_filters = {k: v for k, v in filters.items() if k != "experience"}
results = collection.query(
query_texts=[query],
where=relaxed_filters,
n_results=5
)
if len(results['ids'][0]) > 0:
return results, "relaxed"
# Third: Pure semantic search
results = collection.query(
query_texts=[query],
n_results=10
)
return results, "semantic_only"
# Usage
query = "experienced Python developer"
filters = {
"$and": [
{"department": "Engineering"},
{"experience": {"$gte": 5}},
{"location": "San Francisco"}
]
}
results, search_type = hierarchical_search(collection, query, filters)
print(f"Search completed using {search_type} strategy")
Pattern 3: Similarity Threshold Filtering
def search_with_threshold(collection, query, threshold=0.7, max_results=10):
"""Return only results above similarity threshold."""
results = collection.query(
query_texts=[query],
n_results=max_results
)
filtered_results = {
'ids': [](/udit-asopa/similarity_search_chromadb/wiki/),
'documents': [](/udit-asopa/similarity_search_chromadb/wiki/),
'metadatas': [](/udit-asopa/similarity_search_chromadb/wiki/),
'distances': [](/udit-asopa/similarity_search_chromadb/wiki/)
}
for i, distance in enumerate(results['distances'][0]):
# ChromaDB uses distance (lower = more similar)
# Convert to similarity: similarity = 1 - (distance / 2)
similarity = 1 - (distance / 2)
if similarity >= threshold:
filtered_results['ids'][0].append(results['ids'][0][i])
filtered_results['documents'][0].append(results['documents'][0][i])
filtered_results['metadatas'][0].append(results['metadatas'][0][i])
filtered_results['distances'][0].append(distance)
return filtered_results
# Usage
high_quality_results = search_with_threshold(
collection,
"senior software engineer",
threshold=0.8
)
Pattern 4: Multi-Query Ensemble Search
def ensemble_search(collection, queries, weights=None):
"""Combine results from multiple related queries."""
if weights is None:
weights = [1.0] * len(queries)
all_results = {}
# Collect results for each query
for i, query in enumerate(queries):
results = collection.query(
query_texts=[query],
n_results=10
)
weight = weights[i]
for j, doc_id in enumerate(results['ids'][0]):
distance = results['distances'][0][j]
weighted_score = distance * weight
if doc_id in all_results:
all_results[doc_id]['total_score'] += weighted_score
all_results[doc_id]['query_count'] += 1
else:
all_results[doc_id] = {
'total_score': weighted_score,
'query_count': 1,
'document': results['documents'][0][j],
'metadata': results['metadatas'][0][j]
}
# Sort by average score
sorted_results = sorted(
all_results.items(),
key=lambda x: x[1]['total_score'] / x[1]['query_count']
)
return sorted_results
# Usage
queries = [
"Python web developer",
"backend engineer with API experience",
"full stack developer"
]
weights = [1.0, 0.8, 0.6] # Prioritize first query
ensemble_results = ensemble_search(collection, queries, weights)
Customization Examples
Custom Embedding Function
class CustomEmbeddingFunction:
"""Custom embedding function with preprocessing."""
def __init__(self, model_name="all-MiniLM-L6-v2"):
from sentence_transformers import SentenceTransformer
self.model = SentenceTransformer(model_name)
def __call__(self, texts):
"""Generate embeddings with custom preprocessing."""
# Custom preprocessing
processed_texts = []
for text in texts:
# Normalize text
text = text.lower().strip()
# Remove special characters (optional)
import re
text = re.sub(r'[^\w\s]', '', text)
# Add context markers
text = f"Employee profile: {text}"
processed_texts.append(text)
# Generate embeddings
embeddings = self.model.encode(processed_texts)
return embeddings.tolist()
# Usage
custom_ef = CustomEmbeddingFunction()
collection = client.create_collection(
name="custom_embeddings",
embedding_function=custom_ef
)
Dynamic Collection Management
class EmployeeSearchSystem:
"""Complete search system with collection management."""
def __init__(self, persist_directory="./chroma_db"):
self.client = chromadb.PersistentClient(path=persist_directory)
self.collections = {}
self.embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-MiniLM-L6-v2"
)
def create_department_collection(self, department_name):
"""Create department-specific collection."""
collection_name = f"employees_{department_name.lower()}"
try:
collection = self.client.create_collection(
name=collection_name,
embedding_function=self.embedding_function,
metadata={"department": department_name}
)
except Exception:
# Collection exists, get it
collection = self.client.get_collection(collection_name)
self.collections[department_name] = collection
return collection
def add_employee(self, employee, department=None):
"""Add employee to appropriate collection."""
dept = department or employee.get('department', 'general')
if dept not in self.collections:
self.create_department_collection(dept)
collection = self.collections[dept]
# Generate document
document = self._create_document(employee)
collection.add(
documents=[document],
metadatas=[employee],
ids=[employee['id']]
)
def search_all_departments(self, query, n_results=5):
"""Search across all department collections."""
all_results = []
for dept, collection in self.collections.items():
results = collection.query(
query_texts=[query],
n_results=n_results
)
# Add department info to results
for i, doc_id in enumerate(results['ids'][0]):
result = {
'id': doc_id,
'document': results['documents'][0][i],
'metadata': results['metadatas'][0][i],
'distance': results['distances'][0][i],
'department_collection': dept
}
all_results.append(result)
# Sort by distance
all_results.sort(key=lambda x: x['distance'])
return all_results[:n_results]
def _create_document(self, employee):
"""Create searchable document from employee data."""
return (f"{employee['role']} with {employee['experience']} years "
f"in {employee['department']}. Skills: {employee['skills']}. "
f"Located in {employee['location']}.")
# Usage
search_system = EmployeeSearchSystem()
# Add employees
employees = [
{"id": "eng_1", "name": "Alice", "department": "Engineering",
"role": "Software Engineer", "experience": 5, "skills": "Python, React"},
{"id": "mkt_1", "name": "Bob", "department": "Marketing",
"role": "Marketing Manager", "experience": 8, "skills": "SEO, Analytics"}
]
for emp in employees:
search_system.add_employee(emp)
# Search across departments
results = search_system.search_all_departments("Python developer")
Production Deployment
Configuration for Production
import chromadb
from chromadb.config import Settings
# Production client configuration
client = chromadb.PersistentClient(
path="./production_chroma_db",
settings=Settings(
# Enable authentication
chroma_client_auth_provider="chromadb.auth.basic_authn.BasicAuthClientProvider",
chroma_client_auth_credentials="admin:secure_password",
# Performance settings
chroma_server_grpc_port=8001,
chroma_server_http_port=8000,
# Security settings
chroma_server_ssl_enabled=True,
# Resource limits
chroma_segment_cache_policy="LRU",
chroma_segment_cache_size=1000000
)
)
Monitoring and Logging
import logging
import time
from functools import wraps
# Setup logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('search_system.log'),
logging.StreamHandler()
]
)
def monitor_search_performance(func):
"""Decorator to monitor search performance."""
@wraps(func)
def wrapper(*args, **kwargs):
start_time = time.time()
try:
result = func(*args, **kwargs)
duration = time.time() - start_time
logging.info(f"Search completed in {duration:.3f}s - "
f"Query: {kwargs.get('query_texts', 'Unknown')}")
return result
except Exception as e:
duration = time.time() - start_time
logging.error(f"Search failed after {duration:.3f}s - "
f"Error: {str(e)}")
raise
return wrapper
# Apply monitoring to collection methods
original_query = chromadb.Collection.query
chromadb.Collection.query = monitor_search_performance(original_query)
Batch Processing for Large Datasets
def bulk_add_employees(collection, employees, batch_size=100):
"""Add employees in batches for better performance."""
for i in range(0, len(employees), batch_size):
batch = employees[i:i + batch_size]
# Prepare batch data
documents = []
metadatas = []
ids = []
for emp in batch:
documents.append(create_employee_document(emp))
metadatas.append({k: v for k, v in emp.items() if k != 'id'})
ids.append(emp['id'])
# Add batch to collection
try:
collection.add(
documents=documents,
metadatas=metadatas,
ids=ids
)
logging.info(f"Added batch {i//batch_size + 1}: {len(batch)} employees")
except Exception as e:
logging.error(f"Failed to add batch {i//batch_size + 1}: {str(e)}")
raise
# Usage
large_employee_dataset = [...] # List of 10,000+ employees
bulk_add_employees(collection, large_employee_dataset, batch_size=500)
Troubleshooting
Common Issues and Solutions
1. Import Error: SentenceTransformers
# Error
ModuleNotFoundError: No module named 'sentence_transformers'
# Solution
pixi add sentence-transformers
# or
pip install sentence-transformers
2. Collection Already Exists
# Error
chromadb.errors.DuplicateIDError: Collection 'employee_collection' already exists
# Solution
try:
collection = client.create_collection(name="employee_collection")
except chromadb.errors.DuplicateIDError:
collection = client.get_collection(name="employee_collection")
3. Empty Search Results
def debug_empty_results(collection, query):
"""Debug function for empty search results."""
# Check collection size
all_items = collection.get()
print(f"Collection contains {len(all_items['ids'])} documents")
if len(all_items['ids']) == 0:
print("Collection is empty - add documents first")
return
# Check query embedding
results = collection.query(
query_texts=[query],
n_results=len(all_items['ids']) # Get all results
)
print(f"Query returned {len(results['ids'][0])} results")
if len(results['ids'][0]) > 0:
print(f"Best match distance: {results['distances'][0][0]:.4f}")
print(f"Worst match distance: {results['distances'][0][-1]:.4f}")
# Show sample documents
print("\nSample documents in collection:")
for i in range(min(3, len(all_items['documents']))):
print(f" {i+1}: {all_items['documents'][i][:100]}...")
4. Performance Issues
# Monitor query performance
def profile_search(collection, queries, n_results=5):
"""Profile search performance across multiple queries."""
import time
results = {}
for query in queries:
start_time = time.time()
search_results = collection.query(
query_texts=[query],
n_results=n_results
)
duration = time.time() - start_time
results[query] = {
'duration': duration,
'result_count': len(search_results['ids'][0])
}
# Print performance summary
avg_duration = sum(r['duration'] for r in results.values()) / len(results)
print(f"Average query time: {avg_duration:.3f}s")
for query, stats in results.items():
print(f"Query: '{query[:50]}...' - {stats['duration']:.3f}s - {stats['result_count']} results")
return results
# Usage
test_queries = [
"Python developer",
"team leader with experience",
"marketing manager",
"senior engineer"
]
profile_search(collection, test_queries)
5. Memory Issues with Large Collections
# Memory-efficient search for large collections
def memory_efficient_search(collection, query, batch_size=1000):
"""Search large collections in batches to manage memory."""
# Get total document count
all_items = collection.get(limit=1) # Just get count
# Note: ChromaDB doesn't directly provide count,
# so we estimate based on batch retrieval
results = []
offset = 0
while True:
# Get batch of documents
batch_items = collection.get(
limit=batch_size,
offset=offset
)
if len(batch_items['ids']) == 0:
break
# Search within batch
batch_results = collection.query(
query_texts=[query],
n_results=min(10, len(batch_items['ids'])),
include=['documents', 'metadatas', 'distances']
)
results.extend(zip(
batch_results['ids'][0],
batch_results['documents'][0],
batch_results['metadatas'][0],
batch_results['distances'][0]
))
offset += batch_size
# Sort all results by distance
results.sort(key=lambda x: x[3]) # Sort by distance
return results[:10] # Return top 10
Performance Optimization Tips
- Use Persistent Client: For production workloads
- Batch Operations: Add documents in batches of 100-1000
- Optimize HNSW Parameters: Tune based on your use case
- Pre-filter with Metadata: Reduce vector search space
- Cache Frequent Queries: Store common search results
- Monitor Memory Usage: Use appropriate batch sizes
- Use Appropriate Model: Balance quality vs speed requirements
This comprehensive guide should help you understand and implement various patterns with the ChromaDB similarity search system!
HTML Dashboard Usage Examples
Getting Started with the Web Interface
1. Setup and Launch
# Start the FastAPI server
pixi run dev
# Open the HTML dashboard in your browser
xdg-open frontend/index.html
2. Similarity Search Examples
Tab: 🎯 Similarity Search
Try these natural language queries:
-
"Python developer with web experience"
- Finds: John Doe (Software Engineer), Alex Rodriguez (Lead Software Engineer)
- Shows semantic understanding of programming skills
-
"team leader with management experience"
- Finds: David Lee (Engineering Manager), Rachel Brown (Marketing Director)
- Identifies leadership roles across departments
-
"marketing professional with social media skills"
- Finds: Jane Smith (Marketing Manager), Emily Wilson (Marketing Assistant)
- Matches domain expertise and specific skills
3. Filter Search Examples
Tab: 🔍 Filter Search
Use precise criteria to filter employees:
-
Engineering Department + 5+ Years Experience:
- Department: "Engineering"
- Min Experience: 5
- Results: Senior engineers and architects
-
California Employees:
- Location: "San Francisco" or "Los Angeles"
- Shows geographic filtering
-
Part-time Employees:
- Employment Type: "Part-time"
- Filters by work arrangement
4. Advanced Search Examples
Tab: ⚡ Advanced Search
Combine semantic search with filters:
-
"senior developer" + Engineering + 8+ years:
- Query: "senior developer with architecture experience"
- Department: "Engineering"
- Min Experience: 8
- Finds: Michael Brown, Chris Evans, Alex Rodriguez
-
"marketing manager" + California:
- Query: "marketing manager with leadership skills"
- Location: "Los Angeles"
- Finds: Jane Smith and similar profiles
5. Understanding Results
Each result card shows:
- Employee name and role
- Match score (higher = better match)
- Department, experience, location
- Full description with skills and background
- Hover effects for better interaction
6. Pro Tips
- Use descriptive queries - "Python web developer" works better than just "Python"
- Combine filters wisely - Don't over-constrain your search
- Check similarity scores - Scores below 0.5 indicate very good matches
- Try different phrasings - "team lead" vs "manager" vs "supervisor"