On Vector Stores - robbiemu/aclarai GitHub Wiki

📦 Vector Store Summary (Approved)

This document summarizes the vector tables we’ve committed to implementing, their purpose, and the sprints/tasks where they are used.

Description: Embeddings of Tier 1 block-level utterance chunks.

Used In:

🟡 Sprint 2 — "Embed utterances and save vectors to Postgres"
🔵 Sprint 3 — Used by summary agent to retrieve relevant utterances during Tier 2 summary generation
🟣 Sprint 4 — RAG context source for concept generation

Purpose:

Description: Noun phrases extracted from claims and summaries, embedded and staged for deduplication.

Used In:

🟣 Sprint 4 — "Create noun phrase extractor", "Use HNSWlib for concept detection"

Purpose:

Description: Canonical (:Concept) terms, embedded to support semantic search and linking.

Used In:

Purpose:

The concepts vector store supports two primary access patterns:

Similarity Search (similarity_search): Used for discovery, such as finding a few concepts semantically similar to a new claim or another concept. This is a "one-to-few" operation.
Bulk Embedding Retrieval (get_embeddings_for_concepts): Used for data-intensive tasks like clustering, where the embeddings for a large, known set of concepts are required. This is a "many-to-many" operation: many concept identifiers go in and many embeddings come back, all fetched in a single database query. The Concept Clustering Job should use this pattern instead of looking up concepts one at a time in a loop, which would cause N+1 query overhead.
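
The two access patterns can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes an in-memory SQLite table as a stand-in for the real store, embeddings serialized as JSON, and a brute-force cosine scan where a production store would use a vector index. The table layout, the function signatures, and the sample data are all hypothetical; only the two method names come from the text above.

```python
import json
import math
import sqlite3

# Hypothetical stand-in for the concepts vector table (assumption: the real
# store is database-backed and indexed; SQLite is used here only for a
# self-contained example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concepts (name TEXT PRIMARY KEY, embedding TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO concepts VALUES (?, ?)",
    [
        ("gpu", json.dumps([1.0, 0.0, 0.0])),
        ("cuda", json.dumps([0.9, 0.1, 0.0])),
        ("banana", json.dumps([0.0, 0.0, 1.0])),
    ],
)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query_embedding, top_k=2):
    """One-to-few: return the top_k (name, score) pairs closest to the query."""
    rows = conn.execute("SELECT name, embedding FROM concepts").fetchall()
    scored = [(name, cosine(query_embedding, json.loads(emb))) for name, emb in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def get_embeddings_for_concepts(names):
    """Bulk: fetch embeddings for a known set of names in ONE query.

    Names that are not in the table are simply absent from the result.
    """
    names = list(names)
    if not names:
        return {}
    placeholders = ",".join("?" for _ in names)
    rows = conn.execute(
        f"SELECT name, embedding FROM concepts WHERE name IN ({placeholders})",
        names,
    ).fetchall()
    return {name: json.loads(emb) for name, emb in rows}


# Discovery: a few neighbours for one query vector.
print(similarity_search([1.0, 0.0, 0.0], top_k=2))

# Clustering input: all requested embeddings in a single round trip.
print(get_embeddings_for_concepts(["gpu", "banana"]))
```

The point of the second function is the single `IN (...)` query: a clustering job that instead looped over concept names and issued one lookup each would make one round trip per concept, which is the N+1 pattern the text warns against.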

Covered in:

| Candidate | Reason Rejected |
| --- | --- |
| `claim_vectors` | Redundant; not used in any sprint |
| `summary_vectors` | Covered by RAG over utterances/claims/summaries |
| `rag_passages` | Future-use; redundant with Tier 3 RAG design |

This summary reflects vector table requirements through Sprint 4 and supports future RAG capabilities without unnecessary expansion.