On Vector Stores - robbiemu/aclarai GitHub Wiki
📦 Vector Store Summary (Approved)
This document summarizes the vector tables we’ve committed to implementing, their purpose, and the sprints/tasks where they are used.
✅ 1. `utterances`
Description: Embeddings of Tier 1 block-level utterance chunks.
Used In:
- 🟡 Sprint 2 — "Embed utterances and save vectors to Postgres"
- 🔵 Sprint 3 — Used by summary agent to retrieve relevant utterances during Tier 2 summary generation
- 🟣 Sprint 4 — RAG context source for concept generation
Purpose:
- Vault-wide similarity search
- Supports example mining, duplicate detection, RAG summaries
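The vault-wide similarity search described above can be sketched as a brute-force cosine ranking over stored utterance vectors. This is only an illustration under stated assumptions: the in-memory dictionary stands in for the Postgres-backed table, and the names `top_k_utterances`, `utterance_vectors`, and the `blk_*` identifiers are hypothetical rather than taken from the aclarai codebase.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_utterances(
    query: List[float], store: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Rank every stored utterance against the query and keep the k best matches."""
    scored = [(uid, cosine_similarity(query, vec)) for uid, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings keyed by block ID (assumption: real vectors live in Postgres).
utterance_vectors = {
    "blk_001": [1.0, 0.0, 0.0],
    "blk_002": [0.9, 0.1, 0.0],
    "blk_003": [0.0, 1.0, 0.0],
}

results = top_k_utterances([1.0, 0.0, 0.0], utterance_vectors, k=2)
```

A production store would normally delegate the ranking to the database or an approximate index instead of scanning every vector, but the contract is the same: a query vector goes in, and a short ranked list of utterance IDs comes out.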
✅ 2. `concept_candidates`
Description: Noun phrases extracted from claims and summaries, embedded and staged for deduplication.
Used In:
- 🟣 Sprint 4 — "Create noun phrase extractor", "Use HNSWlib for concept detection"
Purpose:
- Support nearest-neighbor grouping
- Canonical concept promotion pipeline
- Staging area before inserting into the `:Concept` graph
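One way to picture the nearest-neighbor grouping step is a greedy pass that assigns each candidate phrase to the first group whose representative is similar enough. The sketch below is a brute-force stand-in for the HNSW index named in the Sprint 4 task, not the project's implementation; the function `group_candidates`, the `0.9` threshold, and the sample phrases are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Return the cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_candidates(
    candidates: Dict[str, List[float]], threshold: float = 0.9
) -> List[List[str]]:
    """Greedily group phrases whose embedding is close to a group's first member."""
    groups: List[List[str]] = []
    for phrase, vec in candidates.items():
        for group in groups:
            representative = candidates[group[0]]
            if cosine_similarity(vec, representative) >= threshold:
                group.append(phrase)
                break
        else:
            # No existing group was close enough, so this phrase starts a new one.
            groups.append([phrase])
    return groups


# Hypothetical staged noun phrases with toy 2-D embeddings.
candidates = {
    "vector store": [1.0, 0.0],
    "vector stores": [0.98, 0.05],
    "summary agent": [0.0, 1.0],
}

groups = group_candidates(candidates, threshold=0.9)
```

Each resulting group is a set of near-duplicate phrases; under this sketch, the first member of a group would be the natural candidate for promotion to a canonical concept.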
✅ 3. `concepts`
Description: Canonical `(:Concept)` terms, embedded to support semantic search and linking.
Used In:
- 🟣 Sprint 4 — "Link claims to concepts"
- 🔵 Sprint 5+ — Similarity-based exploration and disambiguation
- 🟡 Sprint 10 — "Implement Concept clustering job"
Purpose:
- Detect semantic duplicates
- Enable concept-to-concept similarity
- Assist in onboarding new claims or summaries
- Support efficient bulk retrieval for clustering and analytics
Access Patterns
The `concepts` vector store supports two primary access patterns:
- Similarity Search (`similarity_search`): Used for discovery, such as finding a few concepts semantically similar to a new claim or another concept. This is a "one-to-few" operation.
- Bulk Embedding Retrieval (`get_embeddings_for_concepts`): Used for data-intensive tasks like clustering, where the embeddings for a large, known set of concepts are required. This is a "many-to-many" operation performed in a single, efficient database query. This pattern should be used by the Concept Clustering Job to avoid N+1 query performance issues.
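The bulk retrieval pattern can be illustrated with a single `IN (...)` query in place of one query per concept. The sketch below is hedged accordingly: it uses an in-memory SQLite database as a stand-in for the real vector store, the table layout and the JSON-encoded `embedding` column are assumptions, and only the method name `get_embeddings_for_concepts` comes from this page (its signature here is hypothetical).

```python
import json
import sqlite3
from typing import Dict, List


def get_embeddings_for_concepts(
    conn: sqlite3.Connection, concept_ids: List[str]
) -> Dict[str, List[float]]:
    """Fetch embeddings for many known concept IDs in one query (avoids N+1 lookups)."""
    if not concept_ids:
        return {}
    # One placeholder per ID keeps the query parameterized.
    placeholders = ", ".join("?" for _ in concept_ids)
    rows = conn.execute(
        f"SELECT concept_id, embedding FROM concepts WHERE concept_id IN ({placeholders})",
        concept_ids,
    ).fetchall()
    # IDs that are not in the table are simply absent from the result.
    return {concept_id: json.loads(embedding) for concept_id, embedding in rows}


# Hypothetical schema and rows; the real store is assumed to live in Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concepts (concept_id TEXT PRIMARY KEY, embedding TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO concepts VALUES (?, ?)",
    [
        ("c1", json.dumps([0.1, 0.2])),
        ("c2", json.dumps([0.3, 0.4])),
        ("c3", json.dumps([0.5, 0.6])),
    ],
)

embeddings = get_embeddings_for_concepts(conn, ["c1", "c3", "missing"])
```

Returning a dictionary keyed by concept ID lets a clustering job detect missing embeddings up front, and the whole batch costs one round trip regardless of how many concepts are requested. For very large ID lists, a real implementation would likely need to chunk the request to stay within the database's parameter limits.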
✅ 4. 🔁 Used in RAG (but no new vector DBs)
- Claims: retrieved from graph, not vectorized
- Summaries: retrieved from graph, not vectorized
Covered in:
- 📄 Tier3 Rag Workflow design doc
❌ Declined Vector Stores
| Candidate | Reason Rejected |
|---|---|
| `claim_vectors` | Redundant; not used in any sprint |
| `summary_vectors` | Covered by RAG over utterances/claims/summaries |
| `rag_passages` | Future-use; redundant with Tier 3 RAG design |
This summary reflects vector table requirements through Sprint 4 and supports future RAG capabilities without unnecessary expansion.