On Concepts - robbiemu/aclarai GitHub Wiki

🧠 Concept Creation and Drift Handling (Initial Strategy)

aclarai extracts, organizes, and updates concepts in a lightweight but effective way that suits individual users and small teams. This approach builds on the existing block-level processing and nightly maintenance workflows.

1. 📌 How Concepts Are Created

Concepts are born automatically during sentence/claim extraction batches.

Step	What Happens
1.1. Extract surface candidates	After running Claimify on a batch, scan resulting sentences and summaries using lightweight noun phrase extraction (e.g., spaCy or regex chunking). see concept_candidates Vector Table
1.2. Normalize candidates	Lowercase, trim whitespace, strip punctuation, singularize nouns. E.g., `["dashboards", "Dashboard", "Dash board"] → "dashboard"`.
1.3. Check for existing concepts	Use `hnswlib` to run a cosine similarity search on all current concept embeddings. If a match ≥ `0.9` is found, reuse that concept. Otherwise, create a new one.
1.4. Create or update the graph	\n- New concept → create `(:Concept {id, name, embedding, aliases})`\n- Existing concept → update `last_seen` and expand `aliases` if necessary
1.5. Link to claims	For each extracted claim, create `(:Claim)-[:SUPPORTS_CONCEPT {strength}]→(:Concept)` or `CONTRADICTS_CONCEPT`.

🧩 Properties of `(:Concept)` nodes

Each canonical concept is represented by a (:Concept) node in the graph. These nodes contain both identifying information and operational metadata used during syncing, search, and linking.

Property	Type	Description
`name`	`String`	The canonical name of the concept (also used in the Markdown filename)
`embedding_hash`	`String`	SHA256 hash of the Markdown text used to produce the current embedding
`last_updated`	`Datetime`	Timestamp of last embedding refresh (usually from nightly concept sync)
`version`	`Int`	Parsed from the concept file’s `aclarai:id` block to track vault edits
`status`	`String`	Optional: may be `"active"`, `"merged"`, or `"deprecated"` for hygiene ops

📦 `concept_candidates` Vector Table

Extracted noun phrases are embedded and stored in a persistent vector table, called concept_candidates. This vector space:

Spans the entire vault, not just the current file
Accumulates candidates from every processed claim and summary
Enables fast nearest-neighbor search to detect existing concepts before creating new ones

Each row includes:

text: the noun phrase (e.g. "slice object")
embedding: vector representation
source_claim_id and aclarai_id for traceability
status: e.g., "pending", "merged", "promoted"

This table supports:

Concept deduplication across time and documents
Canonical concept creation via similarity threshold (≥ 0.9)
Concept linking without introducing noise into the final graph

🔁 Like utterances, this is a cumulative, appendable vector index—but used specifically for short phrases, not full utterances.

Yes — perfect approach. Let’s just add a concise new section after the existing description of :Concept nodes in on-concepts.md.

The document already refers to :Concept nodes in the graph, but it does not yet explain that a vector store is used for semantic matching when linking claims or resolving new candidates.

✅ Suggested Addition (Minimal Insertion)

Add after the explanation of concept promotion or :Concept node creation:

🧠 Concept Vector Index (Canonical Concepts)

Once concepts are promoted to canonical :Concept nodes, they are also embedded and stored in a separate concepts vector table. This allows:

Fast nearest-neighbor lookup when linking new claims or summaries
Semantic similarity grouping across promoted concepts (e.g., "GPU crash" ≈ "CUDA failure")
Deduplication enforcement when additional noun phrases are proposed

This vector index is used primarily during:

Linking: (:Claim)-[:MENTIONS_CONCEPT]->(:Concept)
UI refinement (e.g. concept grouping or merging suggestions)

🔍 Tier 3 Markdown files are generated via graph traversal, not vector similarity — the concept vector DB supports semantic alignment only.

2. 🔄 Nightly Concept Refresh Job

Once per day, the same maintenance job that syncs Markdown and Neo4j also performs basic concept hygiene.

Step	What It Does
2.1. Refresh embeddings	For each concept, embed the first paragraph of its Tier 3 Markdown page (usually the definition or summary). Store new vector.
2.2. Detect duplicates	Run pairwise similarity via `hnswlib`. If two concepts are ≥ `0.95` similar, mark one as a duplicate.
2.3. Auto-merge low-risk duplicates	If:\n- Both concepts have low total claim count (e.g., < 10), and\n- One is clearly newer (`created_at`), then:\n - Move claims to older concept\n - Merge aliases\n - Delete the duplicate node\n - Add a `redirect` note in the Tier 3 file (e.g., “Merged from [[Old Concept]]”)

📝 No user review queues, moderation UIs, or merge confirmation dialogs are required.

3. ✂️ Optional Concept Split Detection (Later)

Not included in the MVP, but the groundwork is in place:

Over time, if a concept’s linked claims split into two unrelated semantic clusters, we can detect that via intra-concept claim embedding divergence.
When needed, this can trigger a split into multiple new concepts with redistributed claims and backlink notes.

4. 🔁 How This Fits The Graph-Vault Sync Loop

Pipeline phase	Concept actions
During sentence batch	Extract new concepts and link claims immediately after Claimify.
Nightly job	Refresh embeddings, detect & merge concept duplicates.
Manual edits to concept files	Cause the associated `(:Concept)` node to be marked dirty and reprocessed the next night.

✅ Summary

Uses hnswlib for fast, lightweight concept matching
Merges similar concepts automatically based on similarity threshold and size
Updates embeddings from Tier 3 page definitions every night
No UI review flow required—MVP stays simple and autonomous
Fully integrated into the existing nightly sync and block-based claim extraction pipeline

This plan keeps aclarai usable for individuals from day one while still laying the groundwork for smarter concept maintenance over time.