On Concepts - robbiemu/aclarai GitHub Wiki
🧠 Concept Creation and Drift Handling (Initial Strategy)
aclarai extracts, organizes, and updates concepts in a lightweight but effective way that suits individual users and small teams. This approach builds on the existing block-level processing and nightly maintenance workflows.
1. 📌 How Concepts Are Created
Concepts are born automatically during sentence/claim extraction batches.
Step | What Happens |
---|---|
1.1. Extract surface candidates | After running Claimify on a batch, scan the resulting sentences and summaries using lightweight noun phrase extraction (e.g., spaCy or regex chunking). See the concept_candidates Vector Table section. |
1.2. Normalize candidates | Lowercase, trim whitespace, strip punctuation, singularize nouns. E.g., ["dashboards", "Dashboard", "Dash board"] → "dashboard". |
1.3. Check for existing concepts | Use hnswlib to run a cosine similarity search on all current concept embeddings. If a match ≥ 0.9 is found, reuse that concept. Otherwise, create a new one. |
1.4. Create or update the graph | New concept → create (:Concept {id, name, embedding, aliases}); existing concept → update last_seen and expand aliases if necessary. |
1.5. Link to claims | For each extracted claim, create (:Claim)-[:SUPPORTS_CONCEPT {strength}]->(:Concept) or the CONTRADICTS_CONCEPT equivalent. |
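Steps 1.2 and 1.3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not aclarai's implementation: `normalize` and `match_concept` are hypothetical names, the singularization rule is deliberately naive, and a brute-force cosine scan stands in for the hnswlib index so the example has no dependencies.

```python
import math
import re
import string


def normalize(phrase: str) -> str:
    """Lowercase, trim, strip punctuation, and naively singularize the last word."""
    text = phrase.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    words = text.split(" ")
    # Assumption: a real pipeline would use a proper lemmatizer here.
    last = words[-1]
    if last.endswith("ies") and len(last) > 4:
        last = last[:-3] + "y"
    elif last.endswith("s") and not last.endswith("ss") and len(last) > 3:
        last = last[:-1]
    words[-1] = last
    return " ".join(words)


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_concept(embedding, concepts, threshold=0.9):
    """Return the name of the most similar concept at or above threshold, else None.

    `concepts` maps concept name -> embedding vector. The linear scan is a
    stand-in for a nearest-neighbor query against the concept index.
    """
    best_name, best_score = None, threshold
    for name, vec in concepts.items():
        score = cosine(embedding, vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Note that this string normalization maps "dashboards" and "Dashboard" to "dashboard" but leaves "Dash board" as "dash board"; in this sketch, folding that variant into the same concept would depend on the embedding similarity check rather than on string cleanup.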
🧩 Properties of (:Concept) nodes
Each canonical concept is represented by a (:Concept) node in the graph. These nodes contain both identifying information and operational metadata used during syncing, search, and linking.
Property | Type | Description |
---|---|---|
name | String | The canonical name of the concept (also used in the Markdown filename) |
embedding_hash | String | SHA256 hash of the Markdown text used to produce the current embedding |
last_updated | Datetime | Timestamp of last embedding refresh (usually from nightly concept sync) |
version | Int | Parsed from the concept file’s aclarai:id block to track vault edits |
status | String | Optional: may be "active", "merged", or "deprecated" for hygiene ops |
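One plausible use of embedding_hash is to skip re-embedding when the source text has not changed. The sketch below assumes that usage; the function names are hypothetical, and only the choice of SHA256 over the Markdown text comes from the table above.

```python
import hashlib
from typing import Optional


def embedding_hash(markdown_text: str) -> str:
    """SHA256 hex digest of the Markdown text that was (or will be) embedded."""
    return hashlib.sha256(markdown_text.encode("utf-8")).hexdigest()


def needs_refresh(stored_hash: Optional[str], markdown_text: str) -> bool:
    """True when no hash is stored yet or the text differs from what was embedded."""
    return stored_hash != embedding_hash(markdown_text)
```

Comparing hashes rather than full text keeps the check cheap and means the node only needs to store a fixed-length string.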
📦 concept_candidates Vector Table
Extracted noun phrases are embedded and stored in a persistent vector table called concept_candidates. This vector space:
- Spans the entire vault, not just the current file
- Accumulates candidates from every processed claim and summary
- Enables fast nearest-neighbor search to detect existing concepts before creating new ones
Each row includes:
- text: the noun phrase (e.g., "slice object")
- embedding: vector representation
- source_claim_id and aclarai_id for traceability
- status: e.g., "pending", "merged", "promoted"
This table supports:
- Concept deduplication across time and documents
- Canonical concept creation via similarity threshold (≥ 0.9)
- Concept linking without introducing noise into the final graph
🔁 Like utterances, this is a cumulative, appendable vector index, but it is used specifically for short phrases, not full utterances.
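A row in this table could be modeled roughly as below. The field names follow the list above; the `ConceptCandidate` class, the `resolve_candidate` helper, and the mapping of outcomes to "merged" and "promoted" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConceptCandidate:
    text: str                 # the noun phrase, e.g. "slice object"
    embedding: List[float]    # vector representation
    source_claim_id: str      # traceability back to the claim
    aclarai_id: str           # traceability back to the source block
    status: str = "pending"   # new rows start unresolved


def resolve_candidate(candidate: ConceptCandidate,
                      best_similarity: float,
                      threshold: float = 0.9) -> ConceptCandidate:
    """Set the status from the best similarity against existing concepts.

    Assumed semantics: at or above the threshold the candidate is folded into
    the matching concept ("merged"); otherwise it becomes a new canonical
    concept ("promoted").
    """
    candidate.status = "merged" if best_similarity >= threshold else "promoted"
    return candidate
```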
🧠 Concept Vector Index (Canonical Concepts)
Once concepts are promoted to canonical :Concept nodes, they are also embedded and stored in a separate concepts vector table. This allows:
- Fast nearest-neighbor lookup when linking new claims or summaries
- Semantic similarity grouping across promoted concepts (e.g., "GPU crash" ≈ "CUDA failure")
- Deduplication enforcement when additional noun phrases are proposed
This vector index is used primarily during:
- Linking: (:Claim)-[:MENTIONS_CONCEPT]->(:Concept)
- UI refinement (e.g., concept grouping or merging suggestions)
🔍 Tier 3 Markdown files are generated via graph traversal, not vector similarity — the concept vector DB supports semantic alignment only.
2. 🔄 Nightly Concept Refresh Job
Once per day, the same maintenance job that syncs Markdown and Neo4j also performs basic concept hygiene.
Step | What It Does |
---|---|
2.1. Refresh embeddings | For each concept, embed the first paragraph of its Tier 3 Markdown page (usually the definition or summary). Store the new vector. |
2.2. Detect duplicates | Run pairwise similarity via hnswlib. If two concepts are ≥ 0.95 similar, mark one as a duplicate. |
2.3. Auto-merge low-risk duplicates | If both concepts have a low total claim count (e.g., < 10) and one is clearly newer (by created_at), then: move claims to the older concept; merge aliases; delete the duplicate node; add a redirect note in the Tier 3 file (e.g., “Merged from [[Old Concept]]”). |
📝 No user review queues, moderation UIs, or merge confirmation dialogs are required.
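Steps 2.2 and 2.3 can be sketched as an in-memory routine. This is an illustration under assumptions, not the actual job: the `Concept` dataclass and function names are hypothetical, claims are reduced to a count, and a brute-force pairwise scan replaces the hnswlib query.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime
from itertools import combinations
from typing import List, Optional, Set, Tuple


@dataclass
class Concept:
    name: str
    embedding: List[float]
    created_at: datetime
    claim_count: int
    aliases: Set[str] = field(default_factory=set)


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(concepts: List[Concept], threshold: float = 0.95):
    """Step 2.2: return every pair whose embeddings are at or above the threshold."""
    return [(a, b) for a, b in combinations(concepts, 2)
            if cosine(a.embedding, b.embedding) >= threshold]


def plan_merge(a: Concept, b: Concept,
               max_claims: int = 10) -> Optional[Tuple[Concept, Concept]]:
    """Step 2.3 guard: return (keeper, duplicate) for a low-risk pair, else None."""
    if a.claim_count >= max_claims or b.claim_count >= max_claims:
        return None
    if a.created_at == b.created_at:
        return None  # neither concept is clearly newer
    return (a, b) if a.created_at < b.created_at else (b, a)


def merge(keeper: Concept, duplicate: Concept) -> str:
    """Move claims and aliases to the older concept; return the redirect note.

    Deleting the duplicate node is left to the caller in this sketch.
    """
    keeper.claim_count += duplicate.claim_count
    keeper.aliases |= duplicate.aliases | {duplicate.name}
    return f"Merged from [[{duplicate.name}]]"
```

Keeping the older concept as the merge target means existing links and filenames stay stable, and the newer name survives as an alias.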
3. ✂️ Optional Concept Split Detection (Later)
Not included in the MVP, but the groundwork is in place:
- Over time, if a concept’s linked claims split into two unrelated semantic clusters, we can detect that via intra-concept claim embedding divergence.
- When needed, this can trigger a split into multiple new concepts with redistributed claims and backlink notes.
4. 🔁 How This Fits The Graph-Vault Sync Loop
Pipeline phase | Concept actions |
---|---|
During sentence batch | Extract new concepts and link claims immediately after Claimify. |
Nightly job | Refresh embeddings, detect & merge concept duplicates. |
Manual edits to concept files | Cause the associated (:Concept) node to be marked dirty and reprocessed the next night. |
✅ Summary
- Uses hnswlib for fast, lightweight concept matching
- Merges similar concepts automatically based on similarity threshold and size
- Updates embeddings from Tier 3 page definitions every night
- No UI review flow required—MVP stays simple and autonomous
- Fully integrated into the existing nightly sync and block-based claim extraction pipeline
This plan keeps aclarai usable for individuals from day one while still laying the groundwork for smarter concept maintenance over time.