
Core Concepts in Vector Similarity Search

Table of Contents

  1. Vector Embeddings
  2. ChromaDB Architecture
  3. SentenceTransformers
  4. Similarity Metrics
  5. Metadata Filtering
  6. HNSW Algorithm
  7. Best Practices
  8. Common Use Cases
  9. Common Issues and Best Practices

Vector Embeddings

What are Vector Embeddings?

Vector embeddings are numerical representations of text, images, or other data types in a high-dimensional space. They capture semantic meaning and relationships between different pieces of content.

Key Properties

  • Semantic Similarity: Similar concepts have similar vector representations
  • Dimensional Space: Typically 384, 512, 768, or 1024 dimensions
  • Mathematical Operations: Enable similarity calculations using distance metrics

Example

# Text: "Software Engineer with Python skills"
# Embedding: [0.234, -0.123, 0.456, ..., 0.789] (384 dimensions)

# Text: "Python Developer"  
# Embedding: [0.245, -0.134, 0.467, ..., 0.798] (384 dimensions)
# Similar vectors because of semantic relationship
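A minimal runnable sketch of this idea, assuming the sentence-transformers package is installed (the texts are the illustrative ones above):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Software Engineer with Python skills", "Python Developer"]
emb = model.encode(texts)

# Semantically related texts score close to 1.0
print(util.cos_sim(emb[0], emb[1]))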

Benefits

  • Context Understanding: Captures meaning beyond keywords
  • Multilingual Support: Multilingual models map different languages into a shared vector space
  • Transferable: Pre-trained models work on various domains

ChromaDB Architecture

Overview

ChromaDB is an open-source vector database designed for AI applications, providing efficient storage and retrieval of embeddings.

Core Components

1. Client

import chromadb

client = chromadb.Client()  # In-memory client
# OR
client = chromadb.PersistentClient(path="./chroma_db")  # Persistent storage

2. Collections

collection = client.create_collection(
    name="employees",
    metadata={"description": "Employee data"},
    embedding_function=embedding_function  # See the SentenceTransformers section below
)

3. Documents and Metadata

collection.add(
    documents=["Software Engineer with Python skills"],
    metadatas=[{"department": "Engineering", "experience": 5}],
    ids=["emp_1"]
)
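Once documents are added, the collection can be queried by text; a minimal sketch (the query text is illustrative):

results = collection.query(
    query_texts=["Python developer"],
    n_results=3
)
print(results["documents"][0])  # Top matching documents for the first query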

Storage Options

  • In-Memory: Fast, temporary storage for development
  • Persistent: File-based storage for production
  • Client-Server: Distributed deployment for scale

Key Features

  • Automatic Indexing: Efficient similarity search
  • Metadata Filtering: Combine vector search with structured queries
  • Multiple Distance Metrics: Cosine, Euclidean (L2), and Inner Product
  • Batch Operations: Efficient bulk operations

SentenceTransformers

Overview

SentenceTransformers is a Python library that provides an easy method to compute dense vector representations for sentences, paragraphs, and images.

Model Selection

all-MiniLM-L6-v2

  • Dimensions: 384
  • Performance: Good balance of quality and speed
  • Use Case: General-purpose sentence embeddings
  • Size: ~90MB

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello World", "Goodbye World"])
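encode returns a NumPy array with one 384-dimensional vector per input sentence, so the result above has shape (2, 384).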

Other Popular Models

  • all-mpnet-base-v2: Higher quality, slower (768 dimensions)
  • all-distilroberta-v1: RoBERTa-based (768 dimensions)
  • paraphrase-MiniLM-L6-v2: Optimized for paraphrasing

Integration with ChromaDB

from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
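The function can then be attached to a collection so documents and queries are embedded automatically; a minimal sketch, reusing the client from the Architecture section:

collection = client.create_collection(
    name="employees",
    embedding_function=ef
)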

Advantages

  • Pre-trained: No training required
  • Multilingual: Many models support multiple languages
  • Domain Adaptation: Can be fine-tuned for specific domains
  • Efficient: Optimized for batch processing

Similarity Metrics

Cosine Similarity

Most common metric for text embeddings.

Formula: similarity = cos(θ) = (A · B) / (||A|| ||B||)

Range: -1 to 1; ChromaDB reports a cosine distance (1 - similarity) ranging from 0 to 2:

  • 0 = Identical direction
  • 1 = Orthogonal vectors
  • 2 = Opposite direction

Advantages:

  • Magnitude independent
  • Works well with normalized embeddings
  • Intuitive interpretation

Euclidean Distance

Measures straight-line distance in vector space.

Formula: distance = √(Σ(aᵢ - bᵢ)²)

Use Cases:

  • When magnitude matters
  • Lower-dimensional spaces
  • Geometric applications

Manhattan Distance (L1)

Sum of absolute differences.

Formula: distance = Σ|aᵢ - bᵢ|

Use Cases:

  • High-dimensional sparse data
  • When outliers are problematic
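Note that ChromaDB's HNSW index exposes cosine, L2 (Euclidean), and inner-product spaces; Manhattan is described above as a general concept. A minimal NumPy sketch of the three formulas, using illustrative vectors:

import numpy as np

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.3, 0.4, 0.2])

cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1 - cosine_sim                # What ChromaDB reports for "cosine"
euclidean = np.sqrt(np.sum((a - b) ** 2))   # Equivalent: np.linalg.norm(a - b)
manhattan = np.sum(np.abs(a - b))

print(cosine_dist, euclidean, manhattan)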

Metadata Filtering

Overview

Metadata filtering allows combining vector similarity with structured queries, enabling precise and efficient search.

Query Types

Exact Match

collection.get(where={"department": "Engineering"})

Range Queries

# Greater than or equal
collection.get(where={"experience": {"$gte": 10}})

# Less than
collection.get(where={"experience": {"$lt": 5}})

# Between values (ChromaDB allows one operator per field, so combine with $and)
collection.get(where={
    "$and": [
        {"experience": {"$gte": 5}},
        {"experience": {"$lte": 15}}
    ]
})

Array Operations

# In array
collection.get(where={
    "location": {"$in": ["New York", "San Francisco"]}
})

# Not in array
collection.get(where={
    "department": {"$nin": ["HR", "Finance"]}
})

Logical Operations

# AND operation
collection.get(where={
    "$and": [
        {"department": "Engineering"},
        {"experience": {"$gte": 5}}
    ]
})

# OR operation
collection.get(where={
    "$or": [
        {"location": "New York"},
        {"experience": {"$gte": 15}}
    ]
})

Combined Search

# Vector similarity + metadata filtering
results = collection.query(
    query_texts=["Python developer"],
    n_results=5,
    where={
        "$and": [
            {"department": "Engineering"},
            {"experience": {"$gte": 3}}
        ]
    }
)

Performance Benefits

  • Pre-filtering: Reduces vector search space
  • Index Optimization: Database can optimize queries
  • Precision: Combines semantic and structured search

HNSW Algorithm

Overview

Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for approximate nearest neighbor search.

Key Concepts

1. Graph Structure

  • Nodes: Vector embeddings
  • Edges: Connections between similar vectors
  • Layers: Hierarchical organization for efficient search

2. Search Process

  1. Entry Point: Start at top layer
  2. Greedy Search: Move to most similar neighbor
  3. Layer Descent: Move down layers for precision
  4. Result: Approximate nearest neighbors
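ChromaDB manages its HNSW index internally, but the same mechanics can be seen directly with the standalone hnswlib library; a minimal sketch, assuming hnswlib is installed and using illustrative random data and parameter values:

import hnswlib
import numpy as np

dim = 384
data = np.random.rand(1000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # Higher ef = better recall, slower queries

labels, distances = index.knn_query(data[:1], k=5)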

Configuration in ChromaDB

collection = client.create_collection(
    name="my_collection",
    configuration={
        "hnsw": {
            "space": "cosine",     # Distance metric
            "M": 16,               # Max connections per node
            "ef_construction": 200, # Search depth during construction
            "ef": 10,              # Search depth during query
            "max_elements": 10000  # Maximum elements
        }
    }
)

Parameters Explained

M (Max Connections)

  • Default: 16
  • Higher M: Better recall, more memory
  • Lower M: Faster queries, less memory
  • Typical Range: 8-64

ef_construction

  • Default: 200
  • Purpose: Controls index quality during building
  • Higher Values: Better quality, slower indexing
  • Typical Range: 100-800

ef (Query Time)

  • Default: 10
  • Purpose: Controls search thoroughness
  • Higher Values: Better recall, slower queries
  • Typical Range: 10-500

Trade-offs

  • Speed vs Accuracy: Approximate but very fast
  • Memory vs Performance: More connections = better performance
  • Build Time vs Query Time: Better index = faster queries

Advantages

  • Scalability: Logarithmic search complexity
  • Flexibility: Configurable parameters
  • Performance: Fast approximate search
  • Memory Efficiency: Reasonable memory usage

Best Practices

1. Embedding Strategy

  • Choose appropriate model for your domain
  • Consistent preprocessing of text
  • Consider multilingual needs
  • Balance quality vs speed requirements

2. Collection Design

  • Use descriptive collection names
  • Include relevant metadata fields
  • Plan for scalability from the start
  • Consider data privacy requirements

3. Query Optimization

  • Use metadata filters to reduce search space
  • Batch queries when possible (see the sketch after this list)
  • Cache frequent query results
  • Monitor performance metrics
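ChromaDB accepts multiple query texts in a single call, which embeds and searches them in one round trip instead of looping; a minimal sketch of the batching point above, assuming a populated collection:

# One call handles both queries; results are returned per query, in order
results = collection.query(
    query_texts=["Python developer", "data scientist"],
    n_results=5
)
# results["documents"][0] -> matches for the first query
# results["documents"][1] -> matches for the second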

4. Error Handling

  • Always check for empty results
  • Handle model loading errors
  • Validate input data
  • Implement graceful degradation

5. Performance Tuning

  • Adjust HNSW parameters based on use case
  • Monitor memory usage
  • Consider persistent storage for production
  • Profile query performance regularly

Common Use Cases

1. Document Search

  • Legal document retrieval
  • Research paper discovery
  • Knowledge base search

2. Recommendation Systems

  • Product recommendations
  • Content suggestions
  • Similar item finding

3. Similarity Detection

  • Duplicate detection
  • Plagiarism checking
  • Content clustering

4. Question Answering

  • FAQ matching
  • Customer support
  • Educational systems

5. Content Classification

  • Automatic tagging
  • Category assignment
  • Quality assessment

This guide provides the theoretical foundation for understanding how vector similarity search works in practice. The concepts here are implemented in the main application script.

Common Issues and Best Practices

ChromaDB Metadata Limitations

Supported Data Types

ChromaDB metadata only supports scalar values:

  • str (string)
  • int (integer)
  • float (floating-point number)
  • bool (boolean)
  • None (null value)

Handling Complex Data

# ❌ This will fail
metadata = {
    "tags": ["fiction", "adventure"],      # Lists not supported
    "author_info": {"name": "Author"}      # Objects not supported  
}

# ✅ Convert to supported types
metadata = {
    "tags": "fiction, adventure",          # Convert list to string
    "author_info": '{"name": "Author"}',   # Convert object to JSON string
    "tag_count": 2,                        # Extract numeric properties
    "has_tags": True                       # Extract boolean properties
}
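When complex values are stored this way, they need to be decoded on the way back out; a minimal sketch, assuming the metadata shape above (the id "book_1" is illustrative):

import json

record = collection.get(ids=["book_1"])
meta = record["metadatas"][0]
author_info = json.loads(meta["author_info"])              # Back to a dict
tags = [t.strip() for t in meta["tags"].split(",")]        # Back to a list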

Performance Optimization

Collection Configuration

collection = client.create_collection(
    name="optimized_collection",
    configuration={
        "hnsw": {
            "space": "cosine",           # Choose appropriate distance metric
            "M": 16,                     # Balance between speed and accuracy
            "ef_construction": 200,      # Higher = better accuracy, slower build
            "ef": 50,                    # Higher = better accuracy, slower search
            "max_elements": 100000       # Set realistic capacity
        }
    }
)

Embedding Model Selection

# For speed (recommended for development)
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dim, 90MB

# For accuracy (recommended for production)  
model = SentenceTransformer('all-mpnet-base-v2')  # 768 dim, 420MB

# For multilingual
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

Search Strategy Guidelines

Query Design

# ✅ Good queries are specific and descriptive
query = "experienced Python developer with machine learning skills"

# ❌ Avoid overly broad or keyword-stuffed queries
query = "developer"  # Too vague
query = "python ml ai data science software engineering experience"  # Keyword soup dilutes the semantic signal

Combining Search Types

# Use semantic search for content discovery
semantic_results = collection.query(
    query_texts=["AI researcher"],
    n_results=20
)

# Use metadata filtering for precise requirements
filtered_results = collection.query(
    query_texts=["AI researcher"],
    where={"experience_years": {"$gte": 5}},
    n_results=10
)

Error Handling Patterns

Graceful Degradation

def robust_search(collection, query, filters=None, n_results=5):
    """Search with fallback strategies"""
    try:
        # Try combined search first
        results = collection.query(
            query_texts=[query],
            where=filters,
            n_results=n_results
        )
        
        if len(results['ids'][0]) == 0 and filters:
            # Fallback: try without filters
            results = collection.query(
                query_texts=[query],
                n_results=n_results
            )
            
        return results
        
    except Exception as e:
        print(f"Search error: {e}")
        return {"ids": [](/udit-asopa/similarity_search_chromadb/wiki/), "documents": [](/udit-asopa/similarity_search_chromadb/wiki/), "distances": [](/udit-asopa/similarity_search_chromadb/wiki/)}

Collection Health Checks

def validate_collection(collection):
    """Verify collection is properly configured"""
    try:
        count = collection.count()
        if count == 0:
            raise ValueError("Collection is empty")
            
        # Smoke-test that a query runs at all
        collection.query(
            query_texts=["test"],
            n_results=1
        )
        
        return True
        
    except Exception as e:
        print(f"Collection validation failed: {e}")
        return False

These concepts and best practices ensure reliable, performant vector search implementations.