Durability Recovery - Z-M-Huang/openhive GitHub Wiki

Workflow Durability and Recovery

What Is Durable (SQLite at .run/openhive.db)

| State | Persisted in | Purpose |
| --- | --- | --- |
| Org tree (parent-child relationships, team status, bootstrap state) | SQLite org_tree | Rebuild hierarchy on restart. status column tracks active/idle/shutdown. bootstrapped column tracks bootstrap completion. Cleaned up on shutdown_team. |
| Task queues (status, priority, correlation IDs) | SQLite task_queue | Resume pending work |
| Memory entries | SQLite memories | Persistent team memory (identity, lessons, decisions, context, references, historical). See Memory-System. |
| Memory search index | SQLite memory_chunks + memory_chunks_fts + embedding_cache | Derived from memories.content for hybrid search. |
| Trigger dedup state (event IDs, TTLs) | SQLite trigger_dedup | Prevent duplicate processing |
| Trigger configurations (state, circuit breaker) | SQLite trigger_configs | Restore active triggers on restart |
| Escalation correlation IDs | SQLite escalation_correlations | Track cross-team workflows |
| Channel interactions (message log) | SQLite channel_interactions | Audit trail of channel I/O. Stores direction, channelId, userId/teamId, contentSnippet (2000 char cap), contentLength, topicId, createdAt. 24-hour retention enforced by periodic cleanup via cleanOlderThan(). |
| Topics | SQLite topics | Topic names, state (active/idle/done), channel mapping. Used to rehydrate topic sessions after restart. See Conversation-Threading. |
| Sender trust decisions | SQLite sender_trust | Persistent sender trust decisions. Global, NOT team-scoped. Survives shutdown_team. |
| Trust audit log | SQLite trust_audit_log | Append-only security audit of all trust decisions. No auto-retention. Global. |
| Team vault | SQLite team_vault | Secrets (is_secret=1) + operational state (is_secret=0) |
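The durable/disposable split can be illustrated with a minimal sketch. This uses Python's stdlib sqlite3 module purely for illustration (the project itself is TypeScript), and the column names are assumptions, not the actual openhive schema:

```python
import sqlite3

# Illustrative schema only: column names are assumptions, not the real openhive tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task_queue (
    id INTEGER PRIMARY KEY,
    team_name TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',  -- pending/running/done/failed/cancelled
    priority INTEGER NOT NULL DEFAULT 0,
    correlation_id TEXT
);
CREATE TABLE team_vault (
    team_name TEXT NOT NULL,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    is_secret INTEGER NOT NULL  -- 1 = secret, 0 = operational state
);
""")

# Enqueued work survives a restart because it lives in SQLite, not in a session.
db.execute("INSERT INTO task_queue (team_name, priority) VALUES ('research', 5)")
status = db.execute("SELECT status FROM task_queue").fetchone()[0]
print(status)  # pending
```

The point of the sketch: anything a session would need after a crash must be written to one of these tables first; the session itself holds nothing durable.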

What Is Removed on shutdown_team

When a team is shut down, all its team-scoped data is permanently deleted:

  • org_tree, scope_keywords rows
  • memories, memory_chunks rows (and associated FTS/embedding entries)
  • trigger_configs rows (and in-memory trigger handlers)
  • task_queue rows (pending, running, done, failed, and cancelled)
  • escalation_correlations rows (both source and target directions)
  • channel_interactions rows
  • team_vault rows (WHERE team_name = ?)
  • .run/teams/{name}/ directory (config, rules, skills, subagents)
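The deletion above could be sketched as a single transaction over the team-scoped tables, leaving global tables untouched. This is a simplified Python/sqlite3 illustration (a subset of the tables, single-column matching; the real cleanup is TypeScript and, for escalation_correlations, matches both source and target directions):

```python
import sqlite3

# Simplified subset of the team-scoped tables listed above.
TEAM_SCOPED = ["org_tree", "memories", "trigger_configs", "task_queue",
               "channel_interactions", "team_vault"]

db = sqlite3.connect(":memory:")
for t in TEAM_SCOPED + ["sender_trust"]:  # sender_trust stands in for a global table
    db.execute(f"CREATE TABLE {t} (team_name TEXT, data TEXT)")
    db.execute(f"INSERT INTO {t} VALUES ('alpha', 'x')")

def shutdown_team(db, name):
    # Delete all team-scoped rows atomically; global tables are never touched.
    with db:
        for table in TEAM_SCOPED:
            db.execute(f"DELETE FROM {table} WHERE team_name = ?", (name,))

shutdown_team(db, "alpha")
remaining = {t: db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
             for t in TEAM_SCOPED + ["sender_trust"]}
print(remaining["task_queue"], remaining["sender_trust"])  # 0 1
```

Wrapping the deletes in one transaction means a crash mid-shutdown leaves either all of the team's rows or none of them, never a half-deleted team.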

System-global tables (trigger_dedup, log_entries, sender_trust, trust_audit_log) are not affected by team shutdown. sender_trust and trust_audit_log are system-scoped, not team-scoped — trust decisions and their audit trail persist across all team lifecycles.

With cascade: true, all descendant teams are cleaned depth-first before the parent.
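"Depth-first, children before parents" is a post-order traversal of the org tree. A minimal sketch (Python; the function name and tree shape are hypothetical):

```python
def descendants_first(org_tree, root):
    """Yield teams in post-order: every child before its parent."""
    for child in org_tree.get(root, []):
        yield from descendants_first(org_tree, child)
    yield root

# parent -> children mapping, as rebuilt from the org_tree table
tree = {"ceo": ["eng", "ops"], "eng": ["frontend", "backend"]}
order = list(descendants_first(tree, "ceo"))
print(order)  # ['frontend', 'backend', 'eng', 'ops', 'ceo']
```

Cleaning in this order guarantees no orphaned child rows: by the time a parent's rows are deleted, every descendant has already been cleaned.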

What Is Disposable

  • Session state (AI SDK sessions are ephemeral)
  • In-flight conversation context (rebuilt from memory + rules + recent channel interactions at re-spawn)

Restart Recovery

flowchart TD
    A[Container starts] --> B[Bootstrap loads config]
    B --> C[Load org tree from SQLite]
    C --> D[Reset running tasks to pending]
    D --> D1[Reset trigger overlap state]
    D1 --> D2[Mark all topic sessions as idle]
    D2 --> E[Start main agent]
    E --> F[Load triggers from SQLite trigger_configs]
    F --> G[Resume trigger engine]
    G --> H[Pending tasks picked up on demand]
    H --> I[System operational]

On restart:

  1. Load org tree from SQLite (who are the teams, what are the parent-child relationships)
  2. startup-recovery.ts resets any running tasks back to pending and identifies teams with pending work. cancelled tasks are terminal and are NOT reset.
  3. Reset trigger overlap state: UPDATE trigger_configs SET overlap_count = 0, active_task_id = NULL. All sessions are destroyed on restart, so no running instances exist to overlap with. Consistent with ADR-10 (disposable sessions) and ADR-34 (overlap policy).
  4. Mark all topic sessions as idle in the topics table. Topic sessions are disposable — metadata (name, state, channel mapping) survives in SQLite, but in-memory sessions are lost. The next message to a topic rehydrates it with conversation history from channel_interactions filtered by topic_id. See Conversation-Threading.
  5. Pending tasks are picked up by normal task processing as new messages arrive or triggers fire (sessions are spawned on demand, not eagerly re-created)
  6. Memory entries in SQLite provide context continuity across fresh sessions
  7. Load active triggers from SQLite trigger_configs table via triggerEngine.loadFromStore()
  8. Resume the Trigger Engine (schedule handlers start firing)
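Steps 2-4 above amount to three idempotent SQL statements run once at startup. A Python/sqlite3 sketch (the real logic lives in startup-recovery.ts; table contents and the enqueued column names here are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task_queue (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE trigger_configs (id INTEGER PRIMARY KEY, overlap_count INTEGER, active_task_id INTEGER);
CREATE TABLE topics (id INTEGER PRIMARY KEY, state TEXT);
INSERT INTO task_queue (status) VALUES ('running'), ('pending'), ('cancelled');
INSERT INTO trigger_configs VALUES (1, 3, 42);
INSERT INTO topics (state) VALUES ('active'), ('done');
""")

with db:
    # Step 2: running -> pending; cancelled is terminal and untouched.
    db.execute("UPDATE task_queue SET status = 'pending' WHERE status = 'running'")
    # Step 3: no sessions survive a restart, so no trigger can still be overlapping.
    db.execute("UPDATE trigger_configs SET overlap_count = 0, active_task_id = NULL")
    # Step 4: topic sessions are disposable; metadata stays, active topics go idle.
    db.execute("UPDATE topics SET state = 'idle' WHERE state = 'active'")

statuses = [r[0] for r in db.execute("SELECT status FROM task_queue ORDER BY id")]
print(statuses)  # ['pending', 'pending', 'cancelled']
```

Because each statement only moves rows out of a transient state, re-running the whole block is harmless, which is exactly what a crash-during-recovery requires.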

Crash During Task Execution

If a session dies mid-task (container restart, OOM kill, unexpected exception, etc.), recovery happens at the next container startup:

  1. Session is lost. The SDK query() session is disposable; there is nothing to resume.
  2. Recovery runs at startup. startup-recovery.ts scans the task_queue table and resets any tasks with status running back to pending (startup-recovery.ts:42). cancelled tasks are terminal and are not reset. There is no runtime session-loss detection or health check; recovery is startup-only.
  3. Pending tasks re-dispatched. The normal restart recovery flow (see above) re-spawns team sessions and feeds them the now-pending tasks.
  4. Memory provides continuity. The team's memory entries in SQLite give the fresh session context from before the crash.
flowchart TD
    A[Task executing in session] --> B[Container crashes<br/>OOM / restart / etc.]
    B --> C[Container restarts]
    C --> D["startup-recovery.ts:<br/>UPDATE task_queue SET status='pending'<br/>WHERE status='running'"]
    D --> E[Normal restart recovery:<br/>re-spawn sessions, feed pending tasks]
    E --> F[Memory entries in SQLite provide pre-crash context]

No manual intervention is required. The only observable effect is that the task restarts from the beginning (not from a mid-task checkpoint). Tasks that were running at crash time are retried as pending on the next startup.

Stall Detection

Stalled tasks are detected by the task queue engine's built-in stall detector, a periodic infrastructure check running every 10 minutes. This is a detection mechanism, not a recovery mechanism — it alerts operators to tasks that may need manual intervention.

| Condition | Severity | Action |
| --- | --- | --- |
| pending > 1 hour | warning | Alert to originating channel or escalate |
| pending or running > 24 hours | error | Alert + escalate through hierarchy |

Stall detection complements restart recovery (above). Restart recovery handles crash scenarios by resetting running tasks to pending. Stall detection handles non-crash scenarios where tasks are stuck due to blocked sessions, missing tools, or other runtime issues. See Architecture-Decisions#ADR-38.
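The two thresholds in the table could be checked by a single periodic query over task age. A Python/sqlite3 sketch (the detector itself is part of the TypeScript task queue engine; enqueued_at is an assumed column name):

```python
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_queue (id INTEGER PRIMARY KEY, status TEXT, enqueued_at REAL)")
now = time.time()
db.executemany("INSERT INTO task_queue (status, enqueued_at) VALUES (?, ?)", [
    ("pending", now - 2 * 3600),    # pending for 2h   -> warning
    ("running", now - 25 * 3600),   # running for 25h  -> error
    ("pending", now - 60),          # fresh            -> no alert
])

def detect_stalls(db, now):
    """Classify stalled tasks. Detection only: no status is changed here."""
    alerts = []
    for task_id, status, age in db.execute(
            "SELECT id, status, ? - enqueued_at FROM task_queue "
            "WHERE status IN ('pending', 'running')", (now,)):
        if age > 24 * 3600:
            alerts.append((task_id, "error"))
        elif status == "pending" and age > 3600:
            alerts.append((task_id, "warning"))
    return alerts

alerts = detect_stalls(db, now)
print(alerts)  # [(1, 'warning'), (2, 'error')]
```

Note that the detector never mutates task status; acting on an alert (unblocking a session, restoring a missing tool) is left to the operator, per the table above.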

Automated Backups

A nightly backup of the SQLite database (.run/openhive.db) is written to .run/backups/. The backup encryption key is stored separately from the backup data.
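SQLite's online-backup API copies a live database safely, page by page, without blocking writers, and is one way such a nightly job could be implemented. A sketch using Python's stdlib sqlite3 (openhive's actual backup and encryption tooling is not specified here; in-memory databases stand in for the real file paths):

```python
import sqlite3

# Source database stands in for .run/openhive.db.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
src.execute("INSERT INTO memories (content) VALUES ('lesson learned')")
src.commit()

# Destination stands in for a file under .run/backups/.
# Connection.backup uses SQLite's online-backup API, so the source
# can keep serving reads and writes while the copy proceeds.
dst = sqlite3.connect(":memory:")
src.backup(dst)

count = dst.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
print(count)  # 1
```

Using the backup API rather than copying the .db file avoids snapshotting a half-written page while the application holds the database open.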


Scenario 5: Restart Recovery

See also Durability-Recovery restart flow above. The recovery sequence applies identically whether the container was stopped gracefully or crashed unexpectedly. The key guarantee: no pending task is lost as long as it was enqueued in SQLite before the crash.
