
Core Concepts in Vector Similarity Search

Table of Contents

  1. Vector Embeddings
  2. ChromaDB Architecture
  3. SentenceTransformers
  4. Similarity Metrics
  5. Metadata Filtering
  6. HNSW Algorithm
  7. Best Practices
  8. Common Use Cases
  9. Common Issues and Best Practices

Vector Embeddings

What are Vector Embeddings?

Vector embeddings are numerical representations of text, images, or other data types in a high-dimensional space. They capture semantic meaning and relationships between different pieces of content.

Key Properties

  • Semantic Similarity: Similar concepts have similar vector representations
  • Dimensional Space: Typically 384, 512, 768, or 1024 dimensions
  • Mathematical Operations: Enable similarity calculations using distance metrics

Example

# Text: "Software Engineer with Python skills"
# Embedding: [0.234, -0.123, 0.456, ..., 0.789] (384 dimensions)

# Text: "Python Developer"  
# Embedding: [0.245, -0.134, 0.467, ..., 0.798] (384 dimensions)
# Similar vectors because of semantic relationship
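A minimal runnable sketch of this idea, assuming the sentence-transformers package is installed (the texts are the illustrative ones above):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Software Engineer with Python skills", "Python Developer"]
emb = model.encode(texts)

# Semantically related texts score close to 1.0
print(util.cos_sim(emb[0], emb[1]))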

Benefits

  • Context Understanding: Captures meaning beyond keywords
  • Multilingual Support: Multilingual models map different languages into a shared vector space
  • Transferable: Pre-trained models work on various domains

ChromaDB Architecture

Overview

ChromaDB is an open-source vector database designed for AI applications, providing efficient storage and retrieval of embeddings.

Core Components

1. Client

import chromadb

client = chromadb.Client()  # In-memory client
# OR
client = chromadb.PersistentClient(path="./chroma_db")  # Persistent storage

2. Collections

collection = client.create_collection(
    name="employees",
    metadata={"description": "Employee data"},
    embedding_function=embedding_function  # See the SentenceTransformers section below
)

3. Documents and Metadata

collection.add(
    documents=["Software Engineer with Python skills"],
    metadatas=[{"department": "Engineering", "experience": 5}],
    ids=["emp_1"]
)
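Once documents are added, the collection can be queried by text; a minimal sketch (the query text is illustrative):

results = collection.query(
    query_texts=["Python developer"],
    n_results=3
)
print(results["documents"][0])  # Top matching documents for the first query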

Storage Options

  • In-Memory: Fast, temporary storage for development
  • Persistent: File-based storage for production
  • Client-Server: Distributed deployment for scale

Key Features

  • Automatic Indexing: Efficient similarity search
  • Metadata Filtering: Combine vector search with structured queries
  • Multiple Distance Metrics: Cosine, Euclidean (L2), and Inner Product
  • Batch Operations: Efficient bulk operations

SentenceTransformers

Overview

SentenceTransformers is a Python library that provides an easy method to compute dense vector representations for sentences, paragraphs, and images.

Model Selection

all-MiniLM-L6-v2

  • Dimensions: 384
  • Performance: Good balance of quality and speed
  • Use Case: General-purpose sentence embeddings
  • Size: ~90MB

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(["Hello World", "Goodbye World"])
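encode returns a NumPy array with one 384-dimensional vector per input sentence, so the result above has shape (2, 384).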

Other Popular Models

  • all-mpnet-base-v2: Higher quality, slower (768 dimensions)
  • all-distilroberta-v1: RoBERTa-based (768 dimensions)
  • paraphrase-MiniLM-L6-v2: Optimized for paraphrasing

Integration with ChromaDB

from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
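The function can then be attached to a collection so documents and queries are embedded automatically; a minimal sketch, reusing the client from the Architecture section:

collection = client.create_collection(
    name="employees",
    embedding_function=ef
)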

Advantages

  • Pre-trained: No training required
  • Multilingual: Many models support multiple languages
  • Domain Adaptation: Can be fine-tuned for specific domains
  • Efficient: Optimized for batch processing

Similarity Metrics

Cosine Similarity

Most common metric for text embeddings.

Formula: similarity = cos(θ) = (A · B) / (||A|| ||B||)

Range: -1 to 1; ChromaDB reports a cosine distance (1 - similarity) ranging from 0 to 2:

  • 0 = Identical direction
  • 1 = Orthogonal vectors
  • 2 = Opposite direction

Advantages:

  • Magnitude independent
  • Works well with normalized embeddings
  • Intuitive interpretation

Euclidean Distance

Measures straight-line distance in vector space.

Formula: distance = √(Σ(aᵢ - bᵢ)²)

Use Cases:

  • When magnitude matters
  • Lower-dimensional spaces
  • Geometric applications

Manhattan Distance (L1)

Sum of absolute differences.

Formula: distance = Σ|aᵢ - bᵢ|

Use Cases:

  • High-dimensional sparse data
  • When outliers are problematic
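Note that ChromaDB's HNSW index exposes cosine, L2 (Euclidean), and inner-product spaces; Manhattan is described above as a general concept. A minimal NumPy sketch of the three formulas, using illustrative vectors:

import numpy as np

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.3, 0.4, 0.2])

cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1 - cosine_sim                # What ChromaDB reports for "cosine"
euclidean = np.sqrt(np.sum((a - b) ** 2))   # Equivalent: np.linalg.norm(a - b)
manhattan = np.sum(np.abs(a - b))

print(cosine_dist, euclidean, manhattan)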

Metadata Filtering

Overview

Metadata filtering allows combining vector similarity with structured queries, enabling precise and efficient search.

Query Types

Exact Match

collection.get(where={"department": "Engineering"})

Range Queries

# Greater than or equal
collection.get(where={"experience": {"$gte": 10}})

# Less than
collection.get(where={"experience": {"$lt": 5}})

# Between values (ChromaDB allows one operator per field, so combine with $and)
collection.get(where={
    "$and": [
        {"experience": {"$gte": 5}},
        {"experience": {"$lte": 15}}
    ]
})

Array Operations

# In array
collection.get(where={
    "location": {"$in": ["New York", "San Francisco"]}
})

# Not in array
collection.get(where={
    "department": {"$nin": ["HR", "Finance"]}
})

Logical Operations

# AND operation
collection.get(where={
    "$and": [
        {"department": "Engineering"},
        {"experience": {"$gte": 5}}
    ]
})

# OR operation
collection.get(where={
    "$or": [
        {"location": "New York"},
        {"experience": {"$gte": 15}}
    ]
})

Combined Search

# Vector similarity + metadata filtering
results = collection.query(
    query_texts=["Python developer"],
    n_results=5,
    where={
        "$and": [
            {"department": "Engineering"},
            {"experience": {"$gte": 3}}
        ]
    }
)

Performance Benefits

  • Pre-filtering: Reduces vector search space
  • Index Optimization: Database can optimize queries
  • Precision: Combines semantic and structured search

HNSW Algorithm

Overview

Hierarchical Navigable Small World (HNSW) is a graph-based algorithm for approximate nearest neighbor search.

Key Concepts

1. Graph Structure

  • Nodes: Vector embeddings
  • Edges: Connections between similar vectors
  • Layers: Hierarchical organization for efficient search

2. Search Process

  1. Entry Point: Start at top layer
  2. Greedy Search: Move to most similar neighbor
  3. Layer Descent: Move down layers for precision
  4. Result: Approximate nearest neighbors
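ChromaDB manages its HNSW index internally, but the same mechanics can be seen directly with the standalone hnswlib library; a minimal sketch, assuming hnswlib is installed and using illustrative random data and parameter values:

import hnswlib
import numpy as np

dim = 384
data = np.random.rand(1000, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=1000, ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)  # Higher ef = better recall, slower queries

labels, distances = index.knn_query(data[:1], k=5)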

Configuration in ChromaDB

collection = client.create_collection(
    name="my_collection",
    configuration={
        "hnsw": {
            "space": "cosine",     # Distance metric
            "M": 16,               # Max connections per node
            "ef_construction": 200, # Search depth during construction
            "ef": 10,              # Search depth during query
            "max_elements": 10000  # Maximum elements
        }
    }
)

Parameters Explained

M (Max Connections)

  • Default: 16
  • Higher M: Better recall, more memory
  • Lower M: Faster queries, less memory
  • Typical Range: 8-64

ef_construction

  • Default: 200
  • Purpose: Controls index quality during building
  • Higher Values: Better quality, slower indexing
  • Typical Range: 100-800

ef (Query Time)

  • Default: 10
  • Purpose: Controls search thoroughness
  • Higher Values: Better recall, slower queries
  • Typical Range: 10-500

Trade-offs

  • Speed vs Accuracy: Approximate but very fast
  • Memory vs Performance: More connections = better performance
  • Build Time vs Query Time: Better index = faster queries

Advantages

  • Scalability: Logarithmic search complexity
  • Flexibility: Configurable parameters
  • Performance: Fast approximate search
  • Memory Efficiency: Reasonable memory usage

Best Practices

1. Embedding Strategy

  • Choose appropriate model for your domain
  • Consistent preprocessing of text
  • Consider multilingual needs
  • Balance quality vs speed requirements

2. Collection Design

  • Use descriptive collection names
  • Include relevant metadata fields
  • Plan for scalability from the start
  • Consider data privacy requirements

3. Query Optimization

  • Use metadata filters to reduce search space
  • Batch queries when possible (see the sketch after this list)
  • Cache frequent query results
  • Monitor performance metrics
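ChromaDB accepts multiple query texts in a single call, which embeds and searches them in one round trip instead of looping; a minimal sketch of the batching point above, assuming a populated collection:

# One call handles both queries; results are returned per query, in order
results = collection.query(
    query_texts=["Python developer", "data scientist"],
    n_results=5
)
# results["documents"][0] -> matches for the first query
# results["documents"][1] -> matches for the second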

4. Error Handling

  • Always check for empty results
  • Handle model loading errors
  • Validate input data
  • Implement graceful degradation

5. Performance Tuning

  • Adjust HNSW parameters based on use case
  • Monitor memory usage
  • Consider persistent storage for production
  • Profile query performance regularly

Common Use Cases

1. Document Search

  • Legal document retrieval
  • Research paper discovery
  • Knowledge base search

2. Recommendation Systems

  • Product recommendations
  • Content suggestions
  • Similar item finding

3. Similarity Detection

  • Duplicate detection
  • Plagiarism checking
  • Content clustering

4. Question Answering

  • FAQ matching
  • Customer support
  • Educational systems

5. Content Classification

  • Automatic tagging
  • Category assignment
  • Quality assessment

This guide provides the theoretical foundation for understanding how vector similarity search works in practice. The concepts here are implemented in the main application script.

Common Issues and Best Practices

ChromaDB Metadata Limitations

Supported Data Types

ChromaDB metadata only supports scalar values:

  • str (string)
  • int (integer)
  • float (floating-point number)
  • bool (boolean)
  • None (null value)

Handling Complex Data

# ❌ This will fail
metadata = {
    "tags": ["fiction", "adventure"],      # Lists not supported
    "author_info": {"name": "Author"}      # Objects not supported  
}

# ✅ Convert to supported types
metadata = {
    "tags": "fiction, adventure",          # Convert list to string
    "author_info": '{"name": "Author"}',   # Convert object to JSON string
    "tag_count": 2,                        # Extract numeric properties
    "has_tags": True                       # Extract boolean properties
}
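When complex values are stored this way, they need to be decoded on the way back out; a minimal sketch, assuming the metadata shape above (the id "book_1" is illustrative):

import json

record = collection.get(ids=["book_1"])
meta = record["metadatas"][0]
author_info = json.loads(meta["author_info"])              # Back to a dict
tags = [t.strip() for t in meta["tags"].split(",")]        # Back to a list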

Performance Optimization

Collection Configuration

collection = client.create_collection(
    name="optimized_collection",
    configuration={
        "hnsw": {
            "space": "cosine",           # Choose appropriate distance metric
            "M": 16,                     # Balance between speed and accuracy
            "ef_construction": 200,      # Higher = better accuracy, slower build
            "ef": 50,                    # Higher = better accuracy, slower search
            "max_elements": 100000       # Set realistic capacity
        }
    }
)

Embedding Model Selection

# For speed (recommended for development)
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dim, 90MB

# For accuracy (recommended for production)  
model = SentenceTransformer('all-mpnet-base-v2')  # 768 dim, 420MB

# For multilingual
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

Search Strategy Guidelines

Query Design

# ✅ Good queries are specific and descriptive
query = "experienced Python developer with machine learning skills"

# ❌ Avoid overly broad or keyword-stuffed queries
query = "developer"  # Too vague
query = "python ml ai data science software engineering experience"  # Keyword soup dilutes the semantic signal

Combining Search Types

# Use semantic search for content discovery
semantic_results = collection.query(
    query_texts=["AI researcher"],
    n_results=20
)

# Use metadata filtering for precise requirements
filtered_results = collection.query(
    query_texts=["AI researcher"],
    where={"experience_years": {"$gte": 5}},
    n_results=10
)

Error Handling Patterns

Graceful Degradation

def robust_search(collection, query, filters=None, n_results=5):
    """Search with fallback strategies"""
    try:
        # Try combined search first
        results = collection.query(
            query_texts=[query],
            where=filters,
            n_results=n_results
        )
        
        if len(results['ids'][0]) == 0 and filters:
            # Fallback: try without filters
            results = collection.query(
                query_texts=[query],
                n_results=n_results
            )
            
        return results
        
    except Exception as e:
        print(f"Search error: {e}")
        return {"ids": [](/udit-asopa/similarity_search_chromadb/wiki/), "documents": [](/udit-asopa/similarity_search_chromadb/wiki/), "distances": [](/udit-asopa/similarity_search_chromadb/wiki/)}

Collection Health Checks

def validate_collection(collection):
    """Verify collection is properly configured"""
    try:
        count = collection.count()
        if count == 0:
            raise ValueError("Collection is empty")
            
        # Smoke-test that a query runs at all
        collection.query(
            query_texts=["test"],
            n_results=1
        )
        
        return True
        
    except Exception as e:
        print(f"Collection validation failed: {e}")
        return False

These concepts and best practices ensure reliable, performant vector search implementations.