2026 04 03_semantic_clustering_follow_on_plan - mark-ik/graphshell GitHub Wiki
Status: Active follow-on plan
Scope: Extracts the semantic-clustering lane from 2026-02-24_physics_engine_extensibility_plan.md into an execution plan that bridges semantic enrichment, out-of-band clustering computation, and graph layout consumption.
Related:
- 2026-02-24_physics_engine_extensibility_plan.md
- 2026-03-11_graph_enrichment_plan.md
- force_layout_and_barnes_hut_spec.md
- layout_behaviors_and_physics_spec.md
- 2026-04-03_layout_backend_state_ownership_plan.md
- 2026-04-03_layout_variant_follow_on_plan.md
- 2026-04-03_wasm_layout_runtime_plan.md
Semantic clustering already exists in Graphshell, but only as a partial cross-lane capability:
- the enrichment lane already owns semantic provenance, classification, and user-facing explanation requirements
- the force-layout lane already treats semantic clustering as a Graphshell-owned extension force
- the physics extensibility umbrella note sketches future algorithmic follow-ons such as k-means, DBSCAN, and embedding-driven grouping
What is still missing is a dedicated execution lane for the part in the middle:
- what semantic inputs are allowed to drive clustering
- how cluster assignments are computed, invalidated, and diagnosed
- how those assignments affect layout without becoming a hidden source of graph truth
This plan exists so semantic clustering is no longer split between "enrichment someday" and "physics helper already exists" with no authority for the actual bridge.
Non-goals:
- treating semantic clustering as a graph-canonical mutation
- deepening hidden runtime-only semantic state without explanation or provenance
- making ML embeddings a prerequisite for all semantic grouping behavior
- replacing domain clustering, frame-affinity behavior, or other existing layout helpers
- turning this lane into a general-purpose model-serving or vector-search plan
The first missing decision is what data semantic clustering is allowed to consume. The umbrella note points at burn embeddings and UDC similarity; the enrichment lane already requires provenance, confidence, and user-facing explanation.
- Define the allowed semantic inputs for clustering in priority order: embeddings when available, classification/tag similarity when not, and explicit fallback rules.
- Require every clustering input source to remain attributable to enrichment metadata rather than hidden renderer or physics state.
- Define whether clustering operates on per-node vectors, pairwise similarity tables, or both.
- Specify how cluster inputs are invalidated when node content or semantic metadata changes.
Acceptance criteria:
- Clustering can explain which semantic source produced a grouping decision.
- Missing embeddings degrade to a documented fallback path rather than disabling the whole lane.
- Input invalidation triggers only when the relevant semantic data changes.
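The input priority and invalidation rules above could be sketched roughly as follows. This is a minimal illustration, not Graphshell's actual API: `NodeEnrichment`, `ClusterInput`, `ResolvedInput`, `resolve_input`, `is_stale`, and the `semantic_revision` counter are all hypothetical names, and the sketch assumes enrichment bumps a revision number whenever a node's semantic data changes.

```rust
/// Hypothetical semantic inputs, in the priority order described above.
#[derive(Debug, Clone, PartialEq)]
pub enum ClusterInput {
    /// Per-node embedding vector supplied by enrichment.
    Embedding(Vec<f32>),
    /// Classification or tag labels, used when no embedding exists.
    Tags(Vec<String>),
}

/// Hypothetical enrichment metadata for a single node.
#[derive(Debug, Clone, Default)]
pub struct NodeEnrichment {
    pub embedding: Option<Vec<f32>>,
    pub tags: Vec<String>,
    /// Assumed to be bumped by enrichment when embedding or tags change.
    pub semantic_revision: u64,
}

/// A resolved input together with the provenance needed to explain it.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedInput {
    pub input: ClusterInput,
    /// Which semantic source produced this input.
    pub source: &'static str,
    /// Revision of the enrichment data this input was derived from.
    pub semantic_revision: u64,
}

/// Pick the highest-priority input available, or `None` if the node has
/// no usable semantic data and should be left out of clustering.
pub fn resolve_input(meta: &NodeEnrichment) -> Option<ResolvedInput> {
    if let Some(v) = meta.embedding.as_ref().filter(|v| !v.is_empty()) {
        return Some(ResolvedInput {
            input: ClusterInput::Embedding(v.clone()),
            source: "embedding",
            semantic_revision: meta.semantic_revision,
        });
    }
    if !meta.tags.is_empty() {
        return Some(ResolvedInput {
            input: ClusterInput::Tags(meta.tags.clone()),
            source: "tags",
            semantic_revision: meta.semantic_revision,
        });
    }
    None
}

/// A cached input is stale only when the semantic revision has moved on.
pub fn is_stale(cached: &ResolvedInput, meta: &NodeEnrichment) -> bool {
    cached.semantic_revision != meta.semantic_revision
}
```

Carrying `source` alongside the input keeps every grouping decision attributable to enrichment metadata, and comparing revisions rather than node content keeps invalidation limited to semantic changes.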
The umbrella note explicitly places clustering computation out-of-band rather than inside the per-frame physics step. That boundary is important for both performance and inspectability.
- Define a background or on-demand clustering pipeline that computes cluster assignments outside the interactive layout step.
- Start with a bounded first-slice algorithm choice and leave richer alternatives such as DBSCAN as later admissions rather than day-one complexity.
- Produce a stable cluster-assignment artifact keyed by `GraphViewId` and node identity.
- Keep clustering recomputation policy explicit: manual refresh, data-change invalidation, or bounded automatic recompute.
Acceptance criteria:
- Cluster assignments are stable for identical inputs.
- Recompute cadence is explicit and diagnosable.
- Background clustering cannot directly mutate graph truth or bypass reducer-owned enrichment.
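One way to picture a stable assignment artifact is the sketch below. It does not reflect an algorithm the plan has chosen: it swaps in a simple similarity-threshold grouping (connected components over cosine similarity) purely to show determinism. Only `GraphViewId` appears in the plan; its definition here, along with `NodeId`, `ClusterAssignments`, and `compute_assignments`, is a hypothetical placeholder.

```rust
use std::collections::BTreeMap;

/// Placeholder identity types; the real definitions live elsewhere in Graphshell.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct GraphViewId(pub u64);
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct NodeId(pub u64);

/// Read-only artifact: node -> cluster label, scoped to one graph view.
/// The label is the smallest `NodeId` in the cluster, which keeps it stable.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClusterAssignments {
    pub view: GraphViewId,
    pub clusters: BTreeMap<NodeId, NodeId>,
}

/// Cosine similarity; mismatched or zero-length vectors count as unrelated.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Group nodes whose similarity reaches `threshold` (transitively).
/// Inputs are iterated in sorted key order, so identical inputs always
/// yield an identical artifact.
pub fn compute_assignments(
    view: GraphViewId,
    inputs: &BTreeMap<NodeId, Vec<f32>>,
    threshold: f32,
) -> ClusterAssignments {
    let ids: Vec<NodeId> = inputs.keys().copied().collect();
    let mut parent: Vec<usize> = (0..ids.len()).collect();

    // Union-find lookup with path halving.
    fn find(parent: &mut [usize], mut i: usize) -> usize {
        while parent[i] != i {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        i
    }

    for i in 0..ids.len() {
        for j in (i + 1)..ids.len() {
            if cosine(&inputs[&ids[i]], &inputs[&ids[j]]) >= threshold {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                // Root at the smaller index so the label is the smallest NodeId.
                let (lo, hi) = if ri < rj { (ri, rj) } else { (rj, ri) };
                parent[hi] = lo;
            }
        }
    }

    let mut clusters = BTreeMap::new();
    for i in 0..ids.len() {
        let root = find(&mut parent, i);
        clusters.insert(ids[i], ids[root]);
    }
    ClusterAssignments { view, clusters }
}
```

The function only reads its inputs and returns a new value, so a background task running it has no path to mutate graph truth. The pairwise loop is quadratic in node count, which is one reason to keep this work out of the per-frame physics step.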
The force-layout spec already says semantic clustering is a post-step extension force. This plan needs to define how richer cluster assignments feed that force without becoming a second hidden layout engine.
- Define how cluster assignments feed layout behavior: centroid targets, affinity groups, or other explicit extension-force inputs.
- Keep semantic clustering independent from domain clustering and frame-affinity behavior, while allowing them to compose predictably.
- Define profile and diagnostics surfaces for enabling, weighting, or disabling semantic clustering effects.
- Ensure semantic clustering remains a toggleable behavioral consumer rather than an always-on replacement for baseline layout semantics.
Acceptance criteria:
- Enabling semantic clustering measurably changes related-node positions.
- Disabling the feature removes its spatial effect without altering graph truth.
- Semantic clustering composes with domain clustering rather than silently overriding it.
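As an illustration of the centroid-target option, a post-step extension force might look like the sketch below. `Vec2`, `SemanticClusterForce`, and `displacements` are hypothetical names, node and cluster ids are plain `u64` values for brevity, and the sketch assumes the layout engine sums the returned displacements with those of other helpers rather than letting any one of them overwrite positions.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

/// Hypothetical profile settings for the semantic clustering force.
pub struct SemanticClusterForce {
    pub enabled: bool,
    /// Fraction of the distance to the cluster centroid applied per step.
    pub weight: f32,
}

impl SemanticClusterForce {
    /// Per-node displacement toward the centroid of the node's cluster.
    /// Returns an empty map when disabled, so the force has no spatial
    /// effect and never touches graph data.
    pub fn displacements(
        &self,
        positions: &BTreeMap<u64, Vec2>,
        clusters: &BTreeMap<u64, u64>,
    ) -> BTreeMap<u64, Vec2> {
        let mut out = BTreeMap::new();
        if !self.enabled || self.weight == 0.0 {
            return out;
        }

        // Accumulate position sums and member counts per cluster.
        let mut sums: BTreeMap<u64, (f32, f32, u32)> = BTreeMap::new();
        for (node, cluster) in clusters {
            if let Some(p) = positions.get(node) {
                let e = sums.entry(*cluster).or_insert((0.0, 0.0, 0));
                e.0 += p.x;
                e.1 += p.y;
                e.2 += 1;
            }
        }

        // Pull each positioned node part of the way toward its centroid.
        for (node, cluster) in clusters {
            if let (Some(p), Some(&(sx, sy, n))) = (positions.get(node), sums.get(cluster)) {
                let (cx, cy) = (sx / n as f32, sy / n as f32);
                out.insert(
                    *node,
                    Vec2 {
                        x: (cx - p.x) * self.weight,
                        y: (cy - p.y) * self.weight,
                    },
                );
            }
        }
        out
    }
}
```

Returning displacements instead of writing positions directly is what would let this force compose additively with domain clustering and frame-affinity behavior under this assumption.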
The enrichment umbrella already sets the prototype rule: explain before automate. Semantic clustering should not become a hidden grouping engine that users cannot inspect, reject, or reason about.
- Expose requested vs resolved semantic clustering state through diagnostics.
- Provide user-facing explanation hooks for why nodes are being grouped semantically.
- Keep clustering provenance aligned with the enrichment inspector/filter surfaces instead of a physics-only debug panel.
- Define how suggested or low-confidence semantic inputs affect clustering policy.
Acceptance criteria:
- A user can inspect why a node is participating in a semantic cluster.
- Diagnostics distinguish semantic clustering from other organizer helpers.
- Low-confidence or missing semantic inputs degrade according to documented policy.
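The requested-versus-resolved distinction and the explanation hook could be modeled along these lines. Every name here (`Requested`, `Resolved`, `SemanticInputSummary`, `resolve`, `explain`) is hypothetical, and the single `min_confidence` threshold is an assumed stand-in for whatever low-confidence policy the plan eventually documents.

```rust
/// What the user or layout profile asked for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Requested {
    Off,
    On,
}

/// What the system actually did, suitable for a diagnostics surface.
#[derive(Debug, Clone, PartialEq)]
pub enum Resolved {
    Disabled,
    Active { source: &'static str },
    Degraded { reason: String },
}

/// Hypothetical summary of the semantic input available for a node.
pub struct SemanticInputSummary {
    pub source: Option<&'static str>,
    pub confidence: f32,
}

/// Resolve the requested state against input availability and confidence.
pub fn resolve(
    requested: Requested,
    input: &SemanticInputSummary,
    min_confidence: f32,
) -> Resolved {
    if requested == Requested::Off {
        return Resolved::Disabled;
    }
    match input.source {
        None => Resolved::Degraded {
            reason: "no semantic input available".to_string(),
        },
        Some(_) if input.confidence < min_confidence => Resolved::Degraded {
            reason: format!(
                "confidence {:.2} below policy minimum {:.2}",
                input.confidence, min_confidence
            ),
        },
        Some(source) => Resolved::Active { source },
    }
}

/// User-facing explanation of a node's semantic clustering state.
pub fn explain(node_label: &str, resolved: &Resolved) -> String {
    match resolved {
        Resolved::Disabled => format!("{node_label}: semantic clustering is off"),
        Resolved::Active { source } => {
            format!("{node_label}: grouped by {source} similarity")
        }
        Resolved::Degraded { reason } => {
            format!("{node_label}: not grouped semantically ({reason})")
        }
    }
}
```

For example, `explain("node-a", &resolve(Requested::On, &SemanticInputSummary { source: Some("embedding"), confidence: 0.9 }, 0.5))` yields `node-a: grouped by embedding similarity`, while the same request with confidence 0.2 reports a degraded state with the reason attached.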
This plan is complete when Graphshell has a documented and testable semantic clustering pipeline that starts from attributable semantic inputs, computes cluster assignments out-of-band, feeds them into explicit layout behavior, and exposes the result through both diagnostics and enrichment-facing explanation surfaces.