Durability Recovery - Z-M-Huang/openhive GitHub Wiki

Workflow Durability and Recovery

What Is Durable (SQLite at .run/openhive.db)

| State | Persisted in | Purpose |
| --- | --- | --- |
| Org tree (parent-child relationships, team status, bootstrap state) | SQLite org_tree | Rebuild hierarchy on restart. status column tracks active/idle/shutdown. bootstrapped column tracks bootstrap completion. Cleaned up on shutdown_team. |
| Task queues (status, priority, correlation IDs) | SQLite task_queue | Resume pending work |
| Memory entries | SQLite memories | Persistent team memory (identity, lessons, decisions, context, references, historical). See Memory-System. |
| Memory search index | SQLite memory_chunks + memory_chunks_fts + embedding_cache | Derived from memories.content for hybrid search. |
| Trigger dedup state (event IDs, TTLs) | SQLite trigger_dedup | Prevent duplicate processing |
| Trigger configurations (state, circuit breaker) | SQLite trigger_configs | Restore active triggers on restart |
| Escalation correlation IDs | SQLite escalation_correlations | Track cross-team workflows |
| Channel interactions (message log) | SQLite channel_interactions | Audit trail of channel I/O. Stores direction, channelId, userId/teamId, contentSnippet (2000 char cap), contentLength, topicId, createdAt. 24-hour retention enforced by periodic cleanup via cleanOlderThan(). |
| Topics | SQLite topics | Topic names, state (active/idle/done), channel mapping. Used to rehydrate topic sessions after restart. See Conversation-Threading. |
| Sender trust decisions | SQLite sender_trust | Persistent sender trust decisions. Global, NOT team-scoped. Survives shutdown_team. |
| Trust audit log | SQLite trust_audit_log | Append-only security audit of all trust decisions. No auto-retention. Global. |
| Team vault | SQLite team_vault | Secrets (is_secret=1) + operational state (is_secret=0) |
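The durable/disposable split can be illustrated with a minimal sketch. This uses Python's stdlib sqlite3 module purely for illustration (the project itself is TypeScript), and the column names are assumptions, not the actual openhive schema:

```python
import sqlite3

# Illustrative schema only: column names are assumptions, not the real openhive tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task_queue (
    id INTEGER PRIMARY KEY,
    team_name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending/running/done/failed/cancelled
    priority INTEGER NOT NULL DEFAULT 0,
    correlation_id TEXT
);
CREATE TABLE team_vault (
    team_name TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    is_secret INTEGER NOT NULL  -- 1 = secret, 0 = operational state
);
""")

# Enqueued work survives a restart because it lives in SQLite, not in a session.
db.execute("INSERT INTO task_queue (team_name, priority) VALUES ('research', 5)")
status = db.execute("SELECT status FROM task_queue").fetchone()[0]
print(status)  # pending
```

The point of the sketch: anything a session would need after a crash must be written to one of these tables first; the session itself holds nothing durable.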

What Is Removed on shutdown_team

When a team is shut down, all its team-scoped data is permanently deleted:

  • org_tree, scope_keywords rows
  • memories, memory_chunks rows (and associated FTS/embedding entries)
  • trigger_configs rows (and in-memory trigger handlers)
  • task_queue rows (pending, running, done, failed, and cancelled)
  • escalation_correlations rows (both source and target directions)
  • channel_interactions rows
  • team_vault rows (WHERE team_name = ?)
  • .run/teams/{name}/ directory (config, rules, skills, subagents)
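The deletion above could be sketched as a single transaction over the team-scoped tables, leaving global tables untouched. This is a simplified Python/sqlite3 illustration (a subset of the tables, single-column matching; the real cleanup is TypeScript and, for escalation_correlations, matches both source and target directions):

```python
import sqlite3

# Simplified subset of the team-scoped tables listed above.
TEAM_SCOPED = ["org_tree", "memories", "trigger_configs", "task_queue",
               "channel_interactions", "team_vault"]

db = sqlite3.connect(":memory:")
for t in TEAM_SCOPED + ["sender_trust"]:  # sender_trust stands in for a global table
    db.execute(f"CREATE TABLE {t} (team_name TEXT, data TEXT)")
    db.execute(f"INSERT INTO {t} VALUES ('alpha', 'x')")

def shutdown_team(db, name):
    # Delete all team-scoped rows atomically; global tables are never touched.
    with db:
        for table in TEAM_SCOPED:
            db.execute(f"DELETE FROM {table} WHERE team_name = ?", (name,))

shutdown_team(db, "alpha")
remaining = {t: db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
             for t in TEAM_SCOPED + ["sender_trust"]}
print(remaining["task_queue"], remaining["sender_trust"])  # 0 1
```

Wrapping the deletes in one transaction means a crash mid-shutdown leaves either all of the team's rows or none of them, never a half-deleted team.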

System-global tables (trigger_dedup, log_entries, sender_trust, trust_audit_log) are not affected by team shutdown. sender_trust and trust_audit_log are system-scoped, not team-scoped — trust decisions and their audit trail persist across all team lifecycles.

With cascade: true, all descendant teams are cleaned depth-first before the parent.
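"Depth-first, children before parents" is a post-order traversal of the org tree. A minimal sketch (Python; the function name and tree shape are hypothetical):

```python
def descendants_first(org_tree, root):
    """Yield teams in post-order: every child before its parent."""
    for child in org_tree.get(root, []):
        yield from descendants_first(org_tree, child)
    yield root

# parent -> children mapping, as rebuilt from the org_tree table
tree = {"ceo": ["eng", "ops"], "eng": ["frontend", "backend"]}
order = list(descendants_first(tree, "ceo"))
print(order)  # ['frontend', 'backend', 'eng', 'ops', 'ceo']
```

Cleaning in this order guarantees no orphaned child rows: by the time a parent's rows are deleted, every descendant has already been cleaned.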

What Is Disposable

  • Session state (AI SDK sessions are ephemeral)
  • In-flight conversation context (rebuilt from memory + rules + recent channel interactions at re-spawn)

Restart Recovery

flowchart TD
    A[Container starts] --> B[Bootstrap loads config]
    B --> C[Load org tree from SQLite]
    C --> D[Reset running tasks to pending]
    D --> D1[Reset trigger overlap state]
    D1 --> D2[Mark all topic sessions as idle]
    D2 --> E[Start main agent]
    E --> F[Load triggers from SQLite trigger_configs]
    F --> G[Resume trigger engine]
    G --> H[Pending tasks picked up on demand]
    H --> I[System operational]

On restart:

  1. Load org tree from SQLite (who are the teams, what are the parent-child relationships)
  2. startup-recovery.ts resets any running tasks back to pending and identifies teams with pending work. cancelled tasks are terminal and are NOT reset.
  3. Reset trigger overlap state: UPDATE trigger_configs SET overlap_count = 0, active_task_id = NULL. All sessions are destroyed on restart, so no running instances exist to overlap with. Consistent with ADR-10 (disposable sessions) and ADR-34 (overlap policy).
  4. Mark all topic sessions as idle in the topics table. Topic sessions are disposable — metadata (name, state, channel mapping) survives in SQLite, but in-memory sessions are lost. The next message to a topic rehydrates it with conversation history from channel_interactions filtered by topic_id. See Conversation-Threading.
  5. Pending tasks are picked up by normal task processing as new messages arrive or triggers fire (sessions are spawned on demand, not eagerly re-created)
  6. Memory entries in SQLite provide context continuity across fresh sessions
  7. Load active triggers from SQLite trigger_configs table via triggerEngine.loadFromStore()
  8. Resume the Trigger Engine (schedule handlers start firing)
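Steps 2-4 above amount to three idempotent SQL statements run once at startup. A Python/sqlite3 sketch (the real logic lives in startup-recovery.ts; table contents and the enqueued column names here are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task_queue (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE trigger_configs (id INTEGER PRIMARY KEY, overlap_count INTEGER, active_task_id INTEGER);
CREATE TABLE topics (id INTEGER PRIMARY KEY, state TEXT);
INSERT INTO task_queue (status) VALUES ('running'), ('pending'), ('cancelled');
INSERT INTO trigger_configs VALUES (1, 3, 42);
INSERT INTO topics (state) VALUES ('active'), ('done');
""")

with db:
    # Step 2: running -> pending; cancelled is terminal and untouched.
    db.execute("UPDATE task_queue SET status = 'pending' WHERE status = 'running'")
    # Step 3: no sessions survive a restart, so no trigger can still be overlapping.
    db.execute("UPDATE trigger_configs SET overlap_count = 0, active_task_id = NULL")
    # Step 4: topic sessions are disposable; metadata stays, active topics go idle.
    db.execute("UPDATE topics SET state = 'idle' WHERE state = 'active'")

statuses = [r[0] for r in db.execute("SELECT status FROM task_queue ORDER BY id")]
print(statuses)  # ['pending', 'pending', 'cancelled']
```

Because each statement only moves rows out of a transient state, re-running the whole block is harmless, which is exactly what a crash-during-recovery requires.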

Crash During Task Execution

If a session dies mid-task (container restart, OOM kill, unexpected exception, etc.), recovery happens at the next container startup:

  1. Session is lost. The SDK query() session is disposable; there is nothing to resume.
  2. Recovery runs at startup. startup-recovery.ts scans the task_queue table and resets any tasks with status running back to pending (startup-recovery.ts:42). cancelled tasks are terminal and are not reset. There is no runtime session-loss detection or health check; recovery is startup-only.
  3. Pending tasks re-dispatched. The normal restart recovery flow (see above) re-spawns team sessions and feeds them the now-pending tasks.
  4. Memory provides continuity. The team's memory entries in SQLite give the fresh session context from before the crash.
flowchart TD
    A[Task executing in session] --> B[Container crashes<br/>OOM / restart / etc.]
    B --> C[Container restarts]
    C --> D["startup-recovery.ts:<br/>UPDATE task_queue SET status='pending'<br/>WHERE status='running'"]
    D --> E[Normal restart recovery:<br/>re-spawn sessions, feed pending tasks]
    E --> F[Memory entries in SQLite provide pre-crash context]

No manual intervention is required. The only observable effect is that the task restarts from the beginning (not from a mid-task checkpoint). Tasks that were running at crash time are retried as pending on the next startup.

Stall Detection

Stalled tasks are detected by the task queue engine's built-in stall detector, a periodic infrastructure check running every 10 minutes. This is a detection mechanism, not a recovery mechanism — it alerts operators to tasks that may need manual intervention.

| Condition | Severity | Action |
| --- | --- | --- |
| pending > 1 hour | warning | Alert to originating channel or escalate |
| pending or running > 24 hours | error | Alert + escalate through hierarchy |

Stall detection complements restart recovery (above). Restart recovery handles crash scenarios by resetting running tasks to pending. Stall detection handles non-crash scenarios where tasks are stuck due to blocked sessions, missing tools, or other runtime issues. See Architecture-Decisions#ADR-38.
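The two thresholds in the table could be checked by a single periodic query over task age. A Python/sqlite3 sketch (the detector itself is part of the TypeScript task queue engine; enqueued_at is an assumed column name):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_queue (id INTEGER PRIMARY KEY, status TEXT, enqueued_at REAL)")
now = time.time()
db.executemany("INSERT INTO task_queue (status, enqueued_at) VALUES (?, ?)", [
    ("pending", now - 2 * 3600),    # pending for 2h   -> warning
    ("running", now - 25 * 3600),   # running for 25h  -> error
    ("pending", now - 60),          # fresh            -> no alert
])

def detect_stalls(db, now):
    """Classify stalled tasks. Detection only: no status is changed here."""
    alerts = []
    for task_id, status, age in db.execute(
            "SELECT id, status, ? - enqueued_at FROM task_queue "
            "WHERE status IN ('pending', 'running')", (now,)):
        if age > 24 * 3600:
            alerts.append((task_id, "error"))
        elif status == "pending" and age > 3600:
            alerts.append((task_id, "warning"))
    return alerts

alerts = detect_stalls(db, now)
print(alerts)  # [(1, 'warning'), (2, 'error')]
```

Note that the detector never mutates task status; acting on an alert (unblocking a session, restoring a missing tool) is left to the operator, per the table above.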

Automated Backups

A nightly backup of the SQLite database (.run/openhive.db) is written to .run/backups/. The backup encryption key is stored separately from the backup data.
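SQLite's online-backup API copies a live database safely, page by page, without blocking writers, and is one way such a nightly job could be implemented. A sketch using Python's stdlib sqlite3 (openhive's actual backup and encryption tooling is not specified here; in-memory databases stand in for the real file paths):

```python
import sqlite3

# Source database stands in for .run/openhive.db.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
src.execute("INSERT INTO memories (content) VALUES ('lesson learned')")
src.commit()

# Destination stands in for a file under .run/backups/.
# Connection.backup uses SQLite's online-backup API, so the source
# can keep serving reads and writes while the copy proceeds.
dst = sqlite3.connect(":memory:")
src.backup(dst)

count = dst.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
print(count)  # 1
```

Using the backup API rather than copying the .db file avoids snapshotting a half-written page while the application holds the database open.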


Scenario 5: Restart Recovery

See also Durability-Recovery restart flow above. The recovery sequence applies identically whether the container was stopped gracefully or crashed unexpectedly. The key guarantee: no pending task is lost as long as it was enqueued in SQLite before the crash.
