Durability Recovery - Z-M-Huang/openhive GitHub Wiki
| State | Persisted In | Purpose |
|---|---|---|
| Org tree (parent-child relationships, team status, bootstrap state) | SQLite `org_tree` | Rebuild hierarchy on restart. `status` column tracks active/idle/shutdown. `bootstrapped` column tracks bootstrap completion. Cleaned up on `shutdown_team`. |
| Task queues (status, priority, correlation IDs) | SQLite `task_queue` | Resume pending work |
| Memory entries | SQLite `memories` | Persistent team memory (identity, lessons, decisions, context, references, historical). See Memory-System. |
| Memory search index | SQLite `memory_chunks` + `memory_chunks_fts` + `embedding_cache` | Derived from `memories.content` for hybrid search. |
| Trigger dedup state (event IDs, TTLs) | SQLite `trigger_dedup` | Prevent duplicate processing |
| Trigger configurations (state, circuit breaker) | SQLite `trigger_configs` | Restore active triggers on restart |
| Escalation correlation IDs | SQLite `escalation_correlations` | Track cross-team workflows |
| Channel interactions (message log) | SQLite `channel_interactions` | Audit trail of channel I/O. Stores `direction`, `channelId`, `userId`/`teamId`, `contentSnippet` (2000 char cap), `contentLength`, `topicId`, `createdAt`. 24-hour retention enforced by periodic cleanup via `cleanOlderThan()`. |
| Topics | SQLite `topics` | Topic names, state (active/idle/done), channel mapping. Used to rehydrate topic sessions after restart. See Conversation-Threading. |
| Sender trust decisions | SQLite `sender_trust` | Persistent sender trust decisions. Global, NOT team-scoped. Survives `shutdown_team`. |
| Trust audit log | SQLite `trust_audit_log` | Append-only security audit of all trust decisions. No auto-retention. Global. |
| Team vault | SQLite `team_vault` | Secrets (`is_secret=1`) + operational state (`is_secret=0`) |
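The `channel_interactions` row describes two rules that are easy to get wrong: the snippet is capped at 2000 characters while `contentLength` keeps the full length, and rows older than 24 hours are removed by `cleanOlderThan()`. The following is a minimal in-memory sketch of those two rules, not the project's actual store. The `makeInteraction` helper, the function signatures, the `string` type for `direction`, and the epoch-millisecond `createdAt` are all assumptions made for illustration.

```typescript
// Illustrative only: an in-memory stand-in for channel_interactions rows.
// Field names follow the table; userId/teamId are omitted for brevity.
const SNIPPET_CAP = 2000;                  // character cap on contentSnippet
const RETENTION_MS = 24 * 60 * 60 * 1000;  // 24-hour retention window

interface ChannelInteraction {
  direction: string;        // value set is not specified in the source
  channelId: string;
  contentSnippet: string;   // at most SNIPPET_CAP characters
  contentLength: number;    // length of the full, uncapped content
  topicId: string | null;
  createdAt: number;        // assumed: epoch milliseconds
}

// Hypothetical helper: builds a row, truncating the snippet but keeping the true length.
function makeInteraction(
  direction: string,
  channelId: string,
  content: string,
  topicId: string | null,
  createdAt: number,
): ChannelInteraction {
  return {
    direction,
    channelId,
    contentSnippet: content.slice(0, SNIPPET_CAP),
    contentLength: content.length,
    topicId,
    createdAt,
  };
}

// Assumed signature: keeps only rows created at or after the cutoff.
function cleanOlderThan(rows: ChannelInteraction[], cutoffMs: number): ChannelInteraction[] {
  return rows.filter((row) => row.createdAt >= cutoffMs);
}
```

A periodic job could call `cleanOlderThan(rows, Date.now() - RETENTION_MS)` to enforce the 24-hour window under these assumptions.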
When a team is shut down, all its team-scoped data is permanently deleted:
- `org_tree`, `scope_keywords` rows
- `memories`, `memory_chunks` rows (and associated FTS/embedding entries)
- `trigger_configs` rows (and in-memory trigger handlers)
- `task_queue` rows (pending, running, done, failed, and cancelled)
- `escalation_correlations` rows (both source and target directions)
- `channel_interactions` rows
- `team_vault` entries (`WHERE team_name = ?`)
- `.run/teams/{name}/` directory (config, rules, skills, subagents)
System-global tables (trigger_dedup, log_entries, sender_trust, trust_audit_log) are not affected by team shutdown. sender_trust and trust_audit_log are system-scoped, not team-scoped — trust decisions and their audit trail persist across all team lifecycles.
With cascade: true, all descendant teams are cleaned depth-first before the parent.
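The depth-first ordering can be illustrated with a small sketch. It assumes a hypothetical in-memory map from each team to its children in place of the `org_tree` table, and the function name `shutdownOrder` is invented for this example. It only computes the order in which teams would be cleaned; it does not delete anything.

```typescript
// Hypothetical stand-in for org_tree: team name -> names of its child teams.
type OrgTree = Record<string, string[]>;

// Returns the order in which teams would be cleaned when `team` is shut down.
// With cascade, every descendant is visited depth-first and listed before its parent.
// Without cascade, only the team itself is listed; how the real system treats
// children in that case is not specified here.
function shutdownOrder(tree: OrgTree, team: string, cascade: boolean): string[] {
  const order: string[] = [];
  if (cascade) {
    for (const child of tree[team] ?? []) {
      order.push(...shutdownOrder(tree, child, true));
    }
  }
  order.push(team);
  return order;
}
```

For a tree where `root` has children `a` and `b`, and `a` has children `a1` and `a2`, a cascading shutdown of `root` yields `a1, a2, a, b, root`: each parent appears only after all of its descendants.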
The following state is not persisted:
- Session state (AI SDK sessions are ephemeral)
- In-flight conversation context (rebuilt from memory + rules + recent channel interactions at re-spawn)
```mermaid
flowchart TD
    A[Container starts] --> B[Bootstrap loads config]
    B --> C[Load org tree from SQLite]
    C --> D[Reset running tasks to pending]
    D --> D1[Reset trigger overlap state]
    D1 --> D2[Mark all topic sessions as idle]
    D2 --> E[Start main agent]
    E --> F[Resume trigger engine]
    F --> G[Pending tasks picked up on demand]
    G --> H
    H --> I[Load triggers from SQLite trigger_configs]
    I --> J[System operational]
```
On restart:
- Load the org tree from SQLite (which teams exist and what their parent-child relationships are).
- `startup-recovery.ts` resets any `running` tasks back to `pending` and identifies teams with pending work. `cancelled` tasks are terminal and are NOT reset.
- Reset trigger overlap state: `UPDATE trigger_configs SET overlap_count = 0, active_task_id = NULL`. All sessions are destroyed on restart, so no running instances exist to overlap with. Consistent with ADR-10 (disposable sessions) and ADR-34 (overlap policy).
- Mark all topic sessions as `idle` in the `topics` table. Topic sessions are disposable: metadata (name, state, channel mapping) survives in SQLite, but in-memory sessions are lost. The next message to a topic rehydrates it with conversation history from `channel_interactions` filtered by `topic_id`. See Conversation-Threading.
- Pending tasks are picked up by normal task processing as new messages arrive or triggers fire (sessions are spawned on demand, not eagerly re-created).
- Memory entries in SQLite provide context continuity across fresh sessions.
- Load active triggers from the SQLite `trigger_configs` table via `triggerEngine.loadFromStore()`.
- Resume the Trigger Engine (schedule handlers start firing).
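The three reset steps in the list above can be sketched against plain in-memory arrays. This is an illustration under stated assumptions, not the contents of `startup-recovery.ts`. The `Store` shape, the camel-cased field names, and the function `recoverOnStartup` are hypothetical. The sketch also assumes that only `active` topics are moved to `idle` and that `done` topics are left alone.

```typescript
// Hypothetical in-memory stand-ins for task_queue, trigger_configs, and topics rows.
type TaskStatus = "pending" | "running" | "done" | "failed" | "cancelled";
interface Task { id: string; team: string; status: TaskStatus }
interface TriggerConfig { name: string; overlapCount: number; activeTaskId: string | null }
interface Topic { name: string; state: "active" | "idle" | "done" }
interface Store { tasks: Task[]; triggers: TriggerConfig[]; topics: Topic[] }

// Applies the reset steps in order and returns the teams that have pending work.
function recoverOnStartup(store: Store): string[] {
  const teamsWithPendingWork: string[] = [];

  // Step 1: running -> pending. Terminal states (done, failed, cancelled) are untouched.
  for (const task of store.tasks) {
    if (task.status === "running") task.status = "pending";
    if (task.status === "pending" && teamsWithPendingWork.indexOf(task.team) === -1) {
      teamsWithPendingWork.push(task.team);
    }
  }

  // Step 2: no session survives a restart, so no trigger run can still be in flight.
  for (const trigger of store.triggers) {
    trigger.overlapCount = 0;
    trigger.activeTaskId = null;
  }

  // Step 3: in-memory topic sessions are gone; mark their rows idle.
  // Assumption: topics already in the done state are left as-is.
  for (const topic of store.topics) {
    if (topic.state === "active") topic.state = "idle";
  }

  return teamsWithPendingWork;
}
```

Running the function twice gives the same result as running it once, which matches the idea that recovery can safely run on every startup regardless of how the container stopped.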
If a session dies mid-task (container restart, OOM kill, unexpected exception, etc.), recovery happens at the next container startup:
- Session is lost. The SDK `query()` session is disposable; there is nothing to resume.
- Recovery runs at startup. `startup-recovery.ts` scans the `task_queue` table and resets any tasks with status `running` back to `pending` (`startup-recovery.ts:42`). `cancelled` tasks are terminal and are not reset. There is no runtime session-loss detection or health check; recovery is startup-only.
- Pending tasks re-dispatched. The normal restart recovery flow (see above) re-spawns team sessions and feeds them the now-pending tasks.
- Memory provides continuity. The team's memory entries in SQLite give the fresh session context from before the crash.
```mermaid
flowchart TD
    A[Task executing in session] --> B[Container crashes<br/>OOM / restart / etc.]
    B --> C[Container restarts]
    C --> D["startup-recovery.ts:<br/>UPDATE task_queue SET status='pending'<br/>WHERE status='running'"]
    D --> E[Normal restart recovery:<br/>re-spawn sessions, feed pending tasks]
    E --> F[Memory entries in SQLite provide pre-crash context]
```
No manual intervention is required. The only observable effect is that the task restarts from the beginning (not from a mid-task checkpoint). Tasks that were running at crash time are retried as pending on the next startup.
Stalled tasks are detected by the task queue engine's built-in stall detector, a periodic infrastructure check running every 10 minutes. This is a detection mechanism, not a recovery mechanism — it alerts operators to tasks that may need manual intervention.
| Condition | Severity | Action |
|---|---|---|
| `pending` > 1 hour | warning | Alert to originating channel or escalate |
| `pending` or `running` > 24 hours | error | Alert + escalate through hierarchy |
Stall detection complements restart recovery (above). Restart recovery handles crash scenarios by resetting running tasks to pending. Stall detection handles non-crash scenarios where tasks are stuck due to blocked sessions, missing tools, or other runtime issues. See Architecture-Decisions#ADR-38.
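The thresholds in the table can be expressed as a small classification function. This is a sketch rather than the task queue engine's actual detector. The `QueuedTask` shape, the `classifyStall` name, and the choice to measure age from the moment a task entered its current status are assumptions made for illustration.

```typescript
type Severity = "warning" | "error";
type TaskStatus = "pending" | "running" | "done" | "failed" | "cancelled";

// Hypothetical view of a task_queue row; statusSinceMs is an assumed field
// holding the epoch-millisecond time at which the task entered its current status.
interface QueuedTask { id: string; status: TaskStatus; statusSinceMs: number }

const HOUR_MS = 60 * 60 * 1000;

// Returns the stall severity for a task, or null if it is not considered stalled.
// The 24-hour rule is checked first so that it takes precedence over the 1-hour rule.
function classifyStall(task: QueuedTask, nowMs: number): Severity | null {
  const ageMs = nowMs - task.statusSinceMs;
  const inFlight = task.status === "pending" || task.status === "running";
  if (inFlight && ageMs > 24 * HOUR_MS) return "error";
  if (task.status === "pending" && ageMs > HOUR_MS) return "warning";
  return null;
}
```

A periodic check, such as the 10-minute one described above, could apply this function to every non-terminal task and route `warning` and `error` results to the alerting and escalation paths. The function only detects; it never changes task state, which keeps it consistent with stall detection being a detection mechanism and not a recovery mechanism.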
The SQLite database (`.run/openhive.db`) is backed up nightly to `.run/backups/`. The backup encryption key is stored separately from the backup data.
See also Durability-Recovery restart flow above. The recovery sequence applies identically whether the container was stopped gracefully or crashed unexpectedly. The key guarantee: no pending task is lost as long as it was enqueued in SQLite before the crash.