Logging - Z-M-Huang/openhive GitHub Wiki

Logging

Requirement: Verbose structured logging for everything. All operations, at all levels, with full context. Logs are the primary debugging tool for an AI-driven system where behavior is non-deterministic.

Log Levels

Level Purpose Examples
trace Fine-grained execution detail Tool input/output, rule loading, prompt assembly, MCP message payloads
debug Internal system decisions Scope matching results, queue operations, session lifecycle events
info Operational events Team spawns/shutdowns, task delegation, escalation, trigger fires
warn Recoverable problems Task retry, session re-spawn, dedup collision, scope ambiguity
error Failures requiring attention Task permanently failed, session crash, config validation error, hook failure

Default level: info. Configurable globally via log_level in /data/config/config.yaml.

What Is Logged (Verbose by Default)

Every operation produces structured log entries:

  • Team lifecycle: spawn, idle, shutdown, crash, re-spawn (with session duration, task count)
  • Task flow: delegation, queue insertion, dispatch, completion, failure, re-queue (with correlation IDs)
  • Escalation: escalation sent, received, resolved, timeout (with full escalation chain)
  • Tool calls: tool name, input summary, duration, result summary, audit decisions (via withAudit() wrapper)
  • Routing: list_teams queries, delegation target selection
  • Triggers: fire, dedup hit, dedup miss, rate limit, dispatch target
  • Rule loading: which rules loaded, cascade order, any conflicts detected
  • Organization tool operations: tool calls, server connections/disconnections
  • Errors: full stack traces, context, affected team/task/agent
  • Trust evaluation decisions: TrustGate logs allowed/denied decisions with sender_id, channel_type, and reason to the trust_audit_log table

Vault Operation Events

All vault tool calls produce structured log entries via the withAudit() wrapper:

  • vault_set -- key logged, value scrubbed. If the operation is rejected because the key has is_secret=1, the rejection is logged with reason ("Secret entries are system-managed").
  • vault_get -- key logged. Secret values (is_secret=1) are scrubbed from audit output; only the key and access metadata appear in logs.
  • vault_list -- prefix (if any) logged. Values for is_secret=1 entries are omitted from audit output.
  • vault_delete -- key logged. If the operation is rejected because the key has is_secret=1, the rejection is logged with reason.

Learning Cycle Events

Learning cycle operations (triggered by the learning-cycle.md system skill) produce structured log entries:

  • Learning session start -- logged at info level with team name, trigger source, and topics to explore.
  • Finding stored -- logged at info level when a new memory entry is created from a learning finding. Includes topic, source URL, confidence level, and memory type (lesson or reference).
  • Journal updated -- logged at debug level when the learning progression journal is written back to the vault via vault_set("learning:{team}:{subagent}:journal", ...). Includes topics covered count and next focus topic.

Reflection Cycle Events

Reflection cycle operations (triggered by the reflection-cycle system skill) produce structured log entries:

  • Reflection session start -- logged at info level with team name and trigger source.
  • Evidence gathered -- logged at info level when completed tasks are queried for outcome patterns. Includes task count and date range.
  • Diagnosis identified -- logged at info level when the highest-impact issue is identified. Includes issue description and evidence count.
  • Change applied -- logged at info level when a skill/rule change is applied via evolution flow. Includes before/after summary.
  • Change skipped -- logged at info level when no actionable issue is found or cooldown is active.
  • Reflection complete -- logged at info level with duration, diagnosis summary, and outcome. See Self-Evolution#Self-Reflection.

Stall Detection Events

Stall detection operations (emitted by the task queue stall detector) produce structured log entries:

  • Stalled task warning -- logged at warn level when a pending task exceeds 1 hour. Includes task ID, team name, and age.
  • Stalled task error -- logged at error level when a pending or running task exceeds 24 hours. Includes task ID, team name, age, and escalation target.
  • Stall detection scan clean -- logged at debug level when a scan finds no stalled tasks. See Durability-Recovery#Stall Detection.

Transports (Dual Output)

Logs go to two destinations simultaneously:

  1. stdout -- Structured JSON via Pino. Default sink. Always active. Enables docker logs, log aggregation, and piping to external services.
  2. SQLite -- All log entries at info level and above persisted to .run/openhive.db log_entries table. Schema: (id, level, message, context, duration_ms, created_at). Context is a JSON string containing structured metadata (team, taskId, tool name, etc.).

The logging configuration schema (loadLogging() in config/loader.ts) supports additional transports and per-team overrides, but these are not yet wired into bootstrap. Currently only the global log_level from config.yaml is used.

Credential Scrubbing (Hard Requirement)

All log output is scrubbed before reaching ANY transport. Secret values are inline API keys resolved by the provider-resolver at spawn time; they are never read from .env files.

  1. SecretString class -- wraps all secret values (including inline API keys from provider-resolver); toString() returns [REDACTED]. The SecretString[] array of active credentials is registered at spawn time as the scrub list. First line of defense.
  2. Team credential scrubbing -- The withAudit() wrapper (tool-audit.ts) receives team credential values (from team_vault where is_secret=1) as rawSecrets and passes them to scrubSecrets(). This ensures team API keys are redacted from tool input/output. Values shorter than 8 characters are excluded to prevent over-redaction (aligned with credential-scrubber.ts threshold). Credential extraction is centralized via extractStringCredentials() from domain/credential-utils.ts.
  3. Dynamic credential extraction -- The withAudit() wrapper inspects tool_input.credentials at call time and augments the scrub list with any credential values found. This catches credentials passed dynamically (e.g., via spawn_team with new credentials). Extraction is tool-agnostic.
  4. SDK stderr callback -- intercepts SDK process output, scrubs all registered SecretString values before forwarding to Pino.
  5. Inline scrubbing layers -- scrubSecrets() utility applied at multiple points: audit wrappers, handlers, task consumer. Scrubs all registered secret values before any output reaches log sinks.
  6. Scrub-before-truncate -- withAudit() scrubs tool results before truncating to 200 characters, preventing partial secret leakage at the cutoff boundary.
  7. Agent response scrubbing -- handleMessage and TaskConsumer scrub team credential values from SDK responses before returning or storing results.

No secret value ever appears in any log transport.

Interaction Audit Store (Separate from Logging)

In addition to the structured log system above, OpenHive maintains a separate interaction audit store in the channel_interactions SQLite table. This is not a logger transport — it is written directly via interactionStore.log() in channel-handler-factory.ts and task-consumer.ts.

The interaction store records inbound user messages, outbound agent replies, and async task notifications with channel/team attribution. With conversation threading, each record also includes a topic_id column for per-topic history filtering (see Conversation-Threading). It has a 24-hour retention policy enforced by a periodic cleanup interval (cleanOlderThan() in index.ts). Content snippets are capped at 2000 characters — this is not a full transcript store.

These records are queried at session start to inject recent conversation history into the system prompt, giving agents awareness of prior channel activity.