Scenarios - Z-M-Huang/openhive GitHub Wiki

Scenarios

End-to-end operational walkthroughs. Each scenario shows the complete flow from user action to final result, including failure modes and system behavior. For component details, see the linked pages.

Convention: Scenarios show the flow (who calls what, in what order). They do not duplicate component documentation -- they link to the canonical pages for details.


1. User Creates a Team (spawn_team)

spawn_team is a two-step async flow. The handler returns immediately with a queued status; the bootstrap session runs in the background and the originating channel is notified when it completes.

Step A — Immediate return (synchronous)

  1. User messages: "Create a QA team"
  2. Main agent calls spawn_team(name: "qa", description: "...", scope_accepts: ["testing and QA automation", "end-to-end test strategy"], init_context: "...")
  3. spawn_team handler (see Organization-Tools#org-tools.ts):
    • a. Validates: name not already in org tree (duplicate check before scaffolding)
    • b. Scaffolds directory: .run/teams/qa/ with config.yaml, org-rules/, team-rules/ (including team-context.md), skills/, subagents/
    • c. Registers in org tree (SQLite org_tree table)
    • d. Stores scope keywords (SQLite scope_keywords table)
    • e. Enqueues bootstrap task (type: bootstrap, priority: critical)
    • f. Returns { status: 'queued', bootstrap_task_id: "task_abc123", message_for_user: "QA team is being set up — I'll let you know when it's ready." } immediately
  4. Main agent echoes message_for_user to the user — the caller MUST relay this message so the user understands the team is initializing asynchronously.

Step B — Background bootstrap (asynchronous)

  1. TaskConsumer dequeues the bootstrap task and calls handleMessage(init_context) in a fresh session for the QA team.
  2. Bootstrap session creates subagent definitions, skills, plugins, and learning/reflection triggers.
  3. On completion, TaskConsumer posts "[qa] Team bootstrapped and ready." to the originating channel.
  4. Parent sees via get_status: (initializing) while bootstrap runs, then active after.
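The two-step contract above can be sketched in TypeScript. This is a minimal illustration, not the real spawn-team.ts code: the `OrgStore` interface and helper names are hypothetical stand-ins for the actual scaffold/registry/queue calls.

```typescript
// Illustrative sketch of the spawn_team two-step contract.
// OrgStore and its methods are hypothetical, not the real API.
type SpawnResult =
  | { status: "queued"; bootstrap_task_id: string; message_for_user: string }
  | { status: "error"; error: string };

interface OrgStore {
  exists(name: string): boolean;
  register(name: string): void;
  enqueue(task: { type: string; priority: string }): string; // returns task id
}

function spawnTeam(org: OrgStore, name: string): SpawnResult {
  // Step 3a: duplicate check happens before any scaffolding.
  if (org.exists(name)) {
    return { status: "error", error: `Team '${name}' already exists` };
  }
  // Steps 3b-3d: scaffold the directory and register the team (elided here).
  org.register(name);
  // Step 3e: bootstrap runs asynchronously; the handler returns immediately.
  const taskId = org.enqueue({ type: "bootstrap", priority: "critical" });
  return {
    status: "queued",
    bootstrap_task_id: taskId,
    message_for_user: `${name} team is being set up — I'll let you know when it's ready.`,
  };
}
```

Note that success here only means "queued": completion is reported later by the TaskConsumer to the originating channel, never by this return value.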

Failure: Duplicate Team Name

spawn_team rejects at step 3a (before any scaffolding occurs)

  • --> Returns error: Team 'qa' already exists
  • --> No filesystem or SQLite changes made

Failure: Scaffolding Fails (disk full, permission error)

Step 3a passes, 3b fails mid-way

  • --> Rollback: removes partially scaffolded directory, returns error to parent
  • --> No partial state left in SQLite or filesystem (spawn-team.ts rollback logic)

Failure: Bootstrap Task Fails

Steps 3a-3f succeed, but the bootstrap session (Step B) hits max_turns or errors mid-run

  • --> Task marked failed; parent sees "Bootstrap failed" via get_status
  • --> Team exists in org tree but may need manual intervention or shutdown_team

Sequence Diagram

sequenceDiagram
    participant User
    participant Main as Main Agent
    participant Handler as spawn_team handler
    participant FS as Filesystem
    participant DB as SQLite
    participant TC as TaskConsumer
    participant QA as QA Session

    User->>Main: "Create a QA team"
    Main->>Handler: spawn_team(name:"qa", ...)
    Handler->>DB: Check org_tree for "qa"
    alt Duplicate name
        DB-->>Handler: exists
        Handler-->>Main: Error: "Team 'qa' already exists"
    else Name available
        DB-->>Handler: not found
        Handler->>FS: Scaffold .run/teams/qa/
        Handler->>DB: Insert org_tree + scope_keywords
        Handler->>DB: Enqueue bootstrap task (critical, bootstrap_task_id)
        Handler-->>Main: { status: 'queued', bootstrap_task_id, message_for_user }
        Main-->>User: "QA team is being set up — I'll let you know when it's ready."
        Note over TC,QA: Step B runs asynchronously
        TC->>DB: Dequeue bootstrap task
        TC->>QA: handleMessage(init_context)
        QA-->>TC: Bootstrap complete
        TC->>Main: "[qa] Team bootstrapped and ready."
        Main-->>User: "[qa] Team bootstrapped and ready."
    end

2. User Delegates a Task (delegate_task)

Happy Path

  1. User messages: "Review the PR on the auth module"
  2. Main agent calls list_teams() to see available children (see Organization-Tools#LLM-Driven Routing)
  3. Main agent picks "engineering" based on description/keywords -- calls delegate_task("engineering", "Review PR on auth module", priority: "high")
  4. delegate_task handler validates: "engineering" is a direct child of caller (delegate-task.ts:51)
  5. Task enqueued in SQLite task_queue (type: delegate, priority: high, sourceChannelId from caller)
  6. TaskConsumer dequeues task -- fresh handleMessage() call creates new session for engineering
  7. Engineering orchestrator reads subagent definitions, picks the best-fit subagent (e.g., code-reviewer), and delegates the task to it (ADR-40)
  8. Result stored in task_queue -- notification routed to sourceChannelId
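The validation in step 4 can be sketched as a standalone check. This is an illustration of the direct-child rule, not the actual delegate-task.ts code; the `OrgNode` shape is a hypothetical stand-in for the org_tree rows.

```typescript
// Illustrative sketch of delegate_task's target validation.
interface OrgNode { name: string; parent: string | null }

function validateDelegation(
  tree: OrgNode[],
  caller: string,
  target: string,
): { ok: true } | { ok: false; error: string } {
  const node = tree.find((t) => t.name === target);
  if (!node) return { ok: false, error: `Team '${target}' not found` };
  // Must be a DIRECT child of the caller -- ancestry alone is not enough.
  if (node.parent !== caller) {
    return { ok: false, error: `Team '${target}' is not a child of '${caller}'` };
  }
  return { ok: true };
}
```

This is why a grandparent cannot delegate straight to a grandchild: the task must traverse through the intermediate team.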

Failure: Target Not a Direct Child

delegate_task validates parent-child, not just ancestry -- caller must be direct parent

  • --> Returns error: Team 'engineering' is not a child of 'main'
  • --> Must traverse through intermediate teams (escalate or re-route)

Failure: Wrong Team Selected (runtime)

Steps 1-7 proceed, but engineering team cannot handle the task

  • --> Engineering calls escalate({message: "This is a frontend issue, not backend"})
  • --> Escalation creates type: escalation task for parent (Main)
  • --> Main receives escalation, re-routes to correct team

Failure: Target Team Does Not Exist

Steps 1-3 proceed, but "engineering" not in org tree

  • --> delegate_task returns error: Team 'engineering' not found
  • --> Main agent may spawn the team first, then delegate

Sequence Diagram

sequenceDiagram
    participant User
    participant Main as Main Agent
    participant Handler as delegate_task handler
    participant DB as SQLite
    participant TC as TaskConsumer
    participant Eng as Engineering Session

    User->>Main: "Review the PR on auth module"
    Main->>Main: list_teams() → picks "engineering"
    Main->>Handler: delegate_task("engineering", task, "high")
    Handler->>DB: Validate direct parent-child
    alt Not direct child
        DB-->>Handler: validation fails
        Handler-->>Main: Error: not a child
    else Valid
        Handler->>DB: Enqueue task (type:delegate, priority:high)
        Handler-->>Main: Task queued
        TC->>DB: Dequeue task
        TC->>Eng: handleMessage(task)
        Eng->>Eng: Process task
        alt Task succeeds
            Eng-->>TC: Result
            TC->>DB: Store result
            TC->>Main: Notification via sourceChannelId
        else Cannot handle
            Eng->>Handler: escalate({message:"frontend issue"})
            Handler->>DB: Enqueue escalation for Main
            TC->>Main: Escalation notification
            Main->>Main: Re-route to correct team
        end
    end

3. Team Queries a Child (query_team -- synchronous)

Happy Path

  1. Parent team needs quick info -- calls query_team("ops", "What's the current deployment status?")
  2. query_team handler validates caller is direct parent of target (query-team.ts:58 checks parent-child, not ancestry)
  3. Calls queryRunnerRef() -- handleMessage() creates fresh session for ops with the query
  4. Ops session processes query synchronously (runs to completion within the parent's tool call)
  5. Response returned directly to parent as tool result (no task queue, no notification)
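The synchronous shape of this tool, including both failure modes below, can be sketched as follows. This is illustrative only: the real path goes through queryRunnerRef and handleMessage, and the runner here is simplified to a plain function.

```typescript
// Illustrative sketch of the synchronous query_team path.
// QueryRunner is a simplified stand-in for queryRunnerRef/handleMessage.
type QueryRunner = (team: string, query: string) => string;

function queryTeam(
  runner: QueryRunner | undefined,
  team: string,
  query: string,
): string {
  // One error message covers both "runner not wired" and "no providers".
  if (!runner) throw new Error("providers not configured");
  const text = runner(team, query);
  // A session can hit max_turns with only tool calls and no final text.
  if (text.trim() === "") throw new Error("empty response");
  return text;
}
```

The key contrast with delegate_task is that the response is the tool result itself: no task queue row, no channel notification.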

Failure: No Providers Configured / queryRunner Not Wired

queryRunnerRef is undefined or providers not configured (query-team.ts:63)

  • --> Returns error: providers not configured (same error message for both cases)
  • --> Parent sees error, can retry after providers are configured and bootstrap completes

Failure: Empty Response

Query completes but LLM produces no text (hits max_turns with only tool calls)

  • --> queryRunner returns empty string or throws
  • --> Parent receives error, can rephrase query or use delegate_task instead

Sequence Diagram

sequenceDiagram
    participant Parent as Parent Agent
    participant Handler as query_team handler
    participant DB as SQLite
    participant Runner as queryRunnerRef
    participant Ops as Ops Session

    Parent->>Handler: query_team("ops", "deployment status?")
    Handler->>DB: Validate direct parent-child
    alt Not direct parent
        Handler-->>Parent: Error: not authorized
    else Valid
        Handler->>Runner: queryRunnerRef()
        alt Runner unavailable or no providers
            Runner-->>Handler: undefined / error
            Handler-->>Parent: Error: "providers not configured"
        else Runner available
            Runner->>Ops: handleMessage(query)
            Ops->>Ops: Process query synchronously
            alt Response text produced
                Ops-->>Runner: Response text
                Runner-->>Handler: Response
                Handler-->>Parent: Direct tool result
            else Empty response (max_turns, no text)
                Ops-->>Runner: Empty string / error
                Runner-->>Handler: Error
                Handler-->>Parent: Error: empty response
                Note over Parent: Can rephrase or use delegate_task
            end
        end
    end

4. Trigger Fires a Task

Happy Path (keyword/message trigger)

  1. Keyword trigger "deploy-monitor" fires on keyword match in a channel (see Triggers#Execution Flow)
  2. Engine checks: trigger is active, not rate-limited, event not deduplicated, overlap policy allows firing
  3. Trigger engine calls delegateTask(team, task) -- enqueues in task_queue (type: trigger, correlationId: trigger:deploy-monitor:<uuid>)
  4. TRIGGER_NOTIFY_INSTRUCTION appended to task content by task-consumer.ts
  5. TaskConsumer dequeues -- fresh handleMessage() call processes task
  6. LLM includes {"notify": true/false} in response
  7. parseLlmNotifyDecision() extracts decision, stripNotifyBlock() cleans response
  8. If notify: true -- response routed to sourceChannelId (channel where keyword was detected)
  9. If notify: false -- result stored in task_queue only (no channel notification)
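The notify handshake in steps 6-9 can be sketched as follows. parseLlmNotifyDecision and stripNotifyBlock are the real function names from task-consumer.ts, but these bodies are illustrative approximations, not the actual implementation.

```typescript
// Illustrative sketch of the notify-decision parse with its fail-safe.
const NOTIFY_BLOCK = /\{\s*"notify"\s*:\s*(true|false)\s*\}/;

function parseLlmNotifyDecision(response: string): boolean {
  const match = response.match(NOTIFY_BLOCK);
  // Fail-safe: if the model omitted the block, never silently suppress.
  if (!match) return true;
  return match[1] === "true";
}

function stripNotifyBlock(response: string): string {
  // Remove the decision block before routing the text to a channel.
  return response.replace(NOTIFY_BLOCK, "").trim();
}
```

The fail-safe default is the important design choice: a missing decision is treated as notify: true, matching the "never silently suppress" rule below.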

Happy Path (schedule trigger)

  1. Schedule trigger "daily-report" fires at 9:00 AM (see Triggers#Execution Flow)
  2. Engine checks: trigger is active, not rate-limited, event not deduplicated, overlap policy allows firing
  3. Trigger engine calls delegateTask(team, task) -- enqueues in task_queue (type: trigger, correlationId: trigger:daily-report:<uuid>)
  4. TRIGGER_NOTIFY_INSTRUCTION appended to task content by task-consumer.ts
  5. TaskConsumer dequeues -- fresh handleMessage() call processes task
  6. Schedule triggers have no sourceChannelId -- results are stored in task_queue but not pushed to any channel
  7. If the LLM determines the result warrants attention, it calls escalate() to notify the parent team

Failure: Dedup Skip

Event ID already processed within TTL window

  • --> Trigger does not fire; event skipped as a duplicate

Failure: Rate-Limit Skip

Trigger source exceeded rate limit

  • --> Trigger does not fire; logged as rate-limited

Failure: Inactive Trigger

Trigger is in pending or disabled state

  • --> Trigger does not fire; handler not registered in engine

Failure: Task Fails Consecutively

Task fails -- trigger engine increments failure counter (see Triggers#Circuit Breaker)

  • --> After N consecutive failures (failure_threshold, default 3) -- trigger auto-disabled
  • --> Logged as warning, can be re-enabled via enable_trigger

Failure: LLM Does Not Include Notify Decision

Response has no {"notify": ...} block

  • --> Fail-safe: default to notify: true (never silently suppress)

Failure: No sourceChannelId (non-schedule triggers)

Keyword or message trigger has no sourceChannelId (unexpected — these should always have one)

  • --> Logged as error: Task notification has no sourceChannelId -- cannot route (index.ts:241)
  • --> Result stored but no notification sent

Note: Schedule triggers are inherently non-notifying — missing sourceChannelId on a schedule trigger is expected, not an error. See Triggers#Notification Routing & Policy.

Sequence Diagram

sequenceDiagram
    participant Cron as Cron / Event Source
    participant Engine as Trigger Engine
    participant DB as SQLite
    participant TC as TaskConsumer
    participant Team as Team Session
    participant Ch as Channel Adapter

    Cron->>Engine: Timer/event fires trigger
    Engine->>Engine: Check: active? dedup? rate-limit? overlap?
    alt Dedup/rate-limit/inactive/overlap-skip
        Engine-->>Engine: Skip (no task created)
    else Passes checks
        Engine->>DB: Enqueue task (type:trigger, correlationId)
        TC->>DB: Dequeue task
        Note over TC: Appends TRIGGER_NOTIFY_INSTRUCTION
        TC->>Team: handleMessage(task + notify instruction)
        Team->>Team: Process task
        Team-->>TC: Response with {"notify": true/false}
        TC->>TC: parseLlmNotifyDecision()
        alt No notify decision in response
            Note over TC: Fail-safe: default to notify: true
        end
        alt keyword/message trigger + notify: true
            TC->>Ch: Route response to sourceChannelId + topicId
        else notify: false OR schedule trigger
            TC->>DB: Store result only
            Note over Team: Schedule: escalate() if significant
        end
    end

5. Team Escalates to Parent

Happy Path

  1. Child team encounters issue outside its scope
  2. Calls escalate({message: "Need database access for migration", reason: "no DB credentials"}) (schema is {message, reason?} -- there is no severity field; see escalate.ts:14)
  3. escalate handler validates parent exists, generates escalation correlation ID
  4. Task enqueued for parent (type: escalation, priority: high)
  5. Parent session processes escalation, takes action (e.g., delegates to DB team)
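The precondition in step 3 (and the root-team case below) can be sketched as a parent lookup. This is illustrative, not the real escalate.ts code; the `Map` stands in for the org tree.

```typescript
// Illustrative sketch of the escalate precondition: a team with no
// parent (the root) cannot escalate, and fails before anything is enqueued.
function escalateTarget(
  parents: Map<string, string | null>,
  caller: string,
): string {
  const parent = parents.get(caller) ?? null;
  if (parent === null) {
    throw new Error(`Team '${caller}' is the root team and cannot escalate`);
  }
  return parent; // the escalation task is enqueued for this team
}
```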

Root Team Escalation

escalate called by main (root) team — the root team has no parent, so escalate returns an error. The main agent communicates directly with the user via the channel adapter; it does not need an escalation path. Learning and reflection findings originate from child-team subagents (ADR-40), which escalate through the normal child → parent → main → user chain.

Sequence Diagram

sequenceDiagram
    participant Child as Child Team
    participant Handler as escalate handler
    participant DB as SQLite
    participant TC as TaskConsumer
    participant Parent as Parent Team

    Child->>Handler: escalate({message, reason?})
    Handler->>DB: Look up parent of child
    alt Root team (no parent)
        DB-->>Handler: no parent found
        Handler-->>Child: Error: root team cannot escalate
    else Has parent
        Handler->>DB: Enqueue escalation task (type:escalation, priority:high)
        Handler-->>Child: Escalation sent
        TC->>DB: Dequeue escalation
        TC->>Parent: handleMessage(escalation)
        Parent->>Parent: Decide action
        Parent->>Parent: delegate_task to appropriate team
    end

6. Team Shuts Down (shutdown_team)

Happy Path

  1. Parent decides child team's work is done
  2. Calls shutdown_team(name: "marketing-q4", cascade: false)
  3. shutdown_team handler (see Organization-Tools#org-tools.ts):
    • a. Deletes all task_queue rows for team (shutdown-team.ts:87 -- rows are DELETED via removeByTeam, not marked failed)
    • b. Deletes all memories and memory_chunks rows for team
    • c. Removes triggers for team (see Triggers#Per-Team Registry)
    • d. Removes team from org tree
    • e. Terminates session
    • f. Deletes team directory (.run/teams/marketing-q4/)
  4. Team data is gone. The team name can be re-used -- a fresh spawn_team with the same name creates a new team.

Failure: Cascade Shutdown with Active Children

cascade: true -- shuts down team AND all descendants

  • --> Each descendant goes through the same shutdown sequence
  • --> Order: leaves first (deepest children), then upward
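The leaves-first ordering is a post-order traversal of the subtree. A minimal sketch (the `Team` shape here is illustrative, not the real org-tree type):

```typescript
// Illustrative sketch of leaves-first shutdown order for cascade: true.
interface Team { name: string; children: Team[] }

function shutdownOrder(root: Team): string[] {
  const order: string[] = [];
  const walk = (t: Team): void => {
    for (const child of t.children) walk(child); // descend first
    order.push(t.name); // then shut this team down
  };
  walk(root);
  return order;
}
```

Shutting down leaves first means no team is ever removed while it still has live descendants in the org tree.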

Sequence Diagram

sequenceDiagram
    participant Parent as Parent Agent
    participant Handler as shutdown_team handler
    participant DB as SQLite
    participant FS as Filesystem
    participant Session as Team Session

    Parent->>Handler: shutdown_team("marketing-q4", cascade:false)
    Handler->>DB: Delete task_queue rows (removeByTeam)
    Handler->>DB: Delete memories + memory_chunks rows
    Handler->>DB: Remove triggers
    Handler->>DB: Remove from org_tree
    Handler->>Session: Terminate session
    Handler->>FS: Delete .run/teams/marketing-q4/
    Handler-->>Parent: Team shut down

    Note over Parent: Name "marketing-q4" is now available for re-use

7. Browser Automation (browser_navigate)

Happy Path

  1. Team calls browser_navigate(url: "https://example.com/api/docs")
  2. Gate 1: Check team has browser: config (browser-tools.ts:18)
  3. Gate 2: validateBrowserUrl() checks SSRF + domain allowlist (see Browser-Proxy#SSRF Protection)
  4. BrowserRelay.callTool() forwards to @playwright/mcp child process
  5. @playwright/mcp navigates Chromium, returns result
  6. Result returned to team session

Failure: No browser: Config

Team's config.yaml has no browser: section

  • --> Gate 1 fails (browser-tools.ts:18): returns {success: false, error: "browser tools not enabled for this team"}

Failure: BrowserRelay Unavailable

@playwright/mcp not installed or init failed at startup

  • --> Browser tools not registered in tool set at all (conditional registration)
  • --> Model cannot invoke browser tools (they do not appear in activeTools)

Failure: SSRF Blocked

URL points to private IP (e.g., 169.254.169.254 AWS metadata)

  • --> validateBrowserUrl() rejects immediately
  • --> Error: URL blocked: private/reserved address

Failure: Domain Not in Allowlist

Team has browser.allowed_domains, URL hostname does not match

  • --> validateBrowserUrl() rejects
  • --> Error: URL hostname not in allowed domains
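The two URL gates can be sketched as one function. This is an approximation: the real validateBrowserUrl may also handle DNS resolution and IPv6; this sketch only checks literal hostnames against the common private/reserved IPv4 ranges and the allowlist.

```typescript
// Illustrative sketch of validateBrowserUrl: SSRF block, then allowlist.
// Returns an error string, or null if the URL passes both gates.
function validateBrowserUrl(url: string, allowedDomains?: string[]): string | null {
  const host = new URL(url).hostname;
  const privatePatterns = [
    /^127\./, /^10\./, /^169\.254\./, /^192\.168\./,
    /^172\.(1[6-9]|2\d|3[01])\./, /^localhost$/i,
  ];
  if (privatePatterns.some((p) => p.test(host))) {
    return "URL blocked: private/reserved address";
  }
  // Allowlist only applies when the team configured browser.allowed_domains.
  if (allowedDomains && !allowedDomains.some((d) => host === d || host.endsWith("." + d))) {
    return "URL hostname not in allowed domains";
  }
  return null;
}
```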

Sequence Diagram

sequenceDiagram
    participant Team as Team Session
    participant BT as browser_navigate handler
    participant Val as validateBrowserUrl
    participant Relay as BrowserRelay
    participant PW as @playwright/mcp

    Team->>BT: browser_navigate("https://example.com/...")
    BT->>BT: Gate 1: browser: config exists?
    alt No browser config
        BT-->>Team: Error: "browser tools not enabled"
    else Config present
        BT->>Val: validateBrowserUrl(url)
        alt SSRF blocked
            Val-->>BT: Rejected: private/reserved IP
            BT-->>Team: Error: "URL blocked"
        else Domain not in allowlist
            Val-->>BT: Rejected: hostname not allowed
            BT-->>Team: Error: "URL hostname not in allowed domains"
        else URL valid
            Val-->>BT: OK
            BT->>Relay: callTool("browser_navigate", url)
            alt Relay unavailable
                Relay-->>BT: Error
                BT-->>Team: Error: "BrowserRelay unavailable"
            else Relay OK
                Relay->>PW: Navigate Chromium
                PW-->>Relay: Page result
                Relay-->>BT: Result
                BT-->>Team: Navigation result
            end
        end
    end

8. Trigger Lifecycle (create, enable, test, update)

Happy Path

  1. Parent agent calls create_trigger(team: "ops", name: "daily-health", type: "schedule", config: {cron: "0 9 * * *"}, task: "Run health check")
  2. Trigger created in pending state (see Triggers#Trigger State Machine)
  3. Parent calls test_trigger(team: "ops", trigger_name: "daily-health") -- enqueues a one-shot task; returns taskId for tracking (does NOT return the task result directly)
  4. Parent checks task result via get_status -- then calls enable_trigger(team: "ops", trigger_name: "daily-health")
  5. Trigger moves to active state; handler registered in engine

Note: create_trigger does NOT validate cron expressions at creation time (create-trigger.ts:31). Invalid cron expressions will fail at runtime when the trigger engine attempts to schedule the handler.

Failure: Trigger Name Not Slugified

Name must match /^[a-z0-9]+(-[a-z0-9]+)*$/ (create-trigger.ts:11)

  • --> Returns validation error with expected format
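The slug rule is exactly the regex quoted above, so the check is a one-liner (the function name here is an illustrative wrapper, not the real create-trigger.ts symbol):

```typescript
// The trigger-name slug rule from create-trigger.ts:11.
const TRIGGER_NAME = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function isValidTriggerName(name: string): boolean {
  return TRIGGER_NAME.test(name);
}
```

In practice this allows lowercase alphanumeric segments joined by single hyphens, and rejects uppercase, underscores, leading/trailing hyphens, and doubled hyphens.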

Failure: Invalid Cron at Runtime

Cron expression is syntactically invalid -- enable_trigger sets state to active in SQLite before attempting handler registration (enable-trigger.ts:40). If node-cron rejects the expression at schedule() time (schedule.ts:18), the trigger is left in active state but without a running handler.

  • --> No automatic rollback to pending -- the trigger appears active but does not fire
  • --> Caller can use disable_trigger then fix the cron expression via update_trigger

Sequence Diagram

sequenceDiagram
    participant Parent as Parent Agent
    participant CT as create_trigger
    participant TT as test_trigger
    participant ET as enable_trigger
    participant DB as SQLite
    participant Engine as Trigger Engine

    Parent->>CT: create_trigger("ops", "daily-health", schedule, ...)
    CT->>CT: Validate slug format
    alt Invalid name
        CT-->>Parent: Error: invalid trigger name
    else Valid name
        CT->>DB: Insert trigger (state: pending)
        CT-->>Parent: Trigger created (pending)
    end

    Parent->>TT: test_trigger("ops", trigger_name: "daily-health")
    TT->>DB: Enqueue one-shot task (type: trigger)
    TT-->>Parent: {taskId} (track via get_status)

    Parent->>ET: enable_trigger("ops", trigger_name: "daily-health")
    ET->>DB: Set state: active
    ET->>Engine: Register handler
    alt Cron expression valid
        Engine-->>ET: Handler registered
        ET-->>Parent: Trigger enabled
    else Invalid cron at runtime
        Engine-->>ET: node-cron rejects expression
        Note over ET: DB still shows active (no rollback)
        ET-->>Parent: Trigger enabled (but will not fire)
        Note over Parent: Fix: disable_trigger + update_trigger cron + re-enable
    end

9. Web Fetch

Happy Path

  1. Team calls web_fetch(url: "https://api.example.com/status", method: "GET")
  2. SSRF check via validateBrowserUrl() (same protection as browser tools; see Browser-Proxy#SSRF Protection)
  3. Domain allowlist check against team's browser.allowed_domains (if configured)
  4. HTTP request made with configurable timeout
  5. Returns {status, headers, body} (body truncated at limit)
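The response shaping in step 5 can be sketched as a pure function. The limit value and the `truncated` flag here are hypothetical illustrations; the real limit and return shape live in the web_fetch handler.

```typescript
// Illustrative sketch of web_fetch response shaping with body truncation.
const BODY_LIMIT = 10_000; // hypothetical limit, for illustration only

function shapeResponse(status: number, headers: Record<string, string>, body: string) {
  const truncated = body.length > BODY_LIMIT;
  return {
    status,
    headers,
    body: truncated ? body.slice(0, BODY_LIMIT) : body,
    truncated, // hypothetical flag so callers can tell the body was cut
  };
}
```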

Failure: SSRF Blocked

Same as browser_navigate -- private/reserved IPs blocked

  • --> validateBrowserUrl() rejects immediately
  • --> Error: URL blocked: private/reserved address

Failure: Timeout

HTTP request exceeds timeout

  • --> Returns error with timeout information

Sequence Diagram

sequenceDiagram
    participant Team as Team Session
    participant WF as web_fetch handler
    participant Val as validateBrowserUrl
    participant HTTP as HTTP Client

    Team->>WF: web_fetch("https://api.example.com/status", "GET")
    WF->>Val: validateBrowserUrl(url)
    alt SSRF blocked
        Val-->>WF: Rejected: private/reserved IP
        WF-->>Team: Error: "URL blocked"
    else Domain not in allowlist
        Val-->>WF: Rejected: hostname not allowed
        WF-->>Team: Error: "URL hostname not in allowed domains"
    else URL valid
        Val-->>WF: OK
        WF->>HTTP: GET https://api.example.com/status
        alt Timeout
            HTTP-->>WF: Timeout error
            WF-->>Team: Error: request timed out
        else Success
            HTTP-->>WF: {status, headers, body}
            WF-->>Team: {status, headers, body}
        end
    end

10. User Creates a Skill via Repository Search (search_skill_repository)

Happy Path

  1. User messages: "Create a skill for frontend code review"
  2. Main agent identifies engineering team as the owner — delegates via delegate_task("engineering", "Create a code review skill for frontend")
  3. Engineering orchestrator routes to its skill-builder subagent
  4. Subagent calls search_skill_repository("frontend code review best practices")
  5. Repository returns two matches:
    • "frontend-design" from anthropics/skills (222K installs, 78% match)
    • "code-review-guidelines" from community (12K installs, 85% match)
  6. Subagent presents both to user (via escalation) with install counts, sources, and match scores
  7. User picks "code-review-guidelines" (higher match score)
  8. Subagent downloads the SKILL.md content from the GitHub source
  9. Extract/create plugins (plugin-first per ADR-39): Subagent identifies executable operations the skill needs. Registers each as a plugin tool via register_plugin_tool({ tool_name, source_code }) — e.g., diff_analyzer.ts for parsing PR diffs
  10. Subagent tailors: reads the downloaded SKILL.md, converts to OpenHive format (Purpose, Steps, Inputs, Outputs, Error Handling), adds ## Required Tools listing the plugins, adds a team-specific review checklist, and adapts to the team's tech stack
  11. Subagent writes the adapted skill to .run/teams/engineering/skills/code-review.md
  12. Wire to subagent: Adds the skill to the target subagent's ## Skills section in subagents/code-reviewer.md (see Skills#4-Step Creation Workflow)
  13. Subagent tests the skill by invoking it on a sample task
  14. Result flows back: subagent → orchestrator → main → user: "Created code-review skill, adapted from community/code-review-guidelines (85% match, 12K installs), wired to code-reviewer subagent"

Failure: No Match Above 60%

search_skill_repository returns no results ≥60%. Agent creates the skill from scratch per Skills#Initial Skill Creation — analyzes the team's purpose and generates a custom skill file.

Failure: skills.sh Unreachable

Network error when querying skills.sh. Agent logs warning and creates the skill from scratch. Never blocks.

Sequence Diagram

sequenceDiagram
    participant User
    participant Main as Main Agent
    participant Eng as Engineering Orchestrator
    participant SA as skill-builder Subagent
    participant Repo as skills.sh

    User->>Main: "Create a skill for frontend code review"
    Main->>Eng: delegate_task("engineering", "Create code review skill")
    Eng->>SA: invoke skill-builder subagent
    SA->>Repo: search_skill_repository("frontend code review")
    Repo-->>SA: [{name: "code-review-guidelines", match: 85%, installs: 12K}]
    SA->>Eng: escalate("Found match, need user confirmation")
    Eng->>Main: escalate to user
    Main->>User: "Found 'code-review-guidelines' (85% match). Use this?"
    User->>Main: "Yes"
    Main->>Eng: "User confirmed"
    Eng->>SA: "Proceed with match"
    SA->>Repo: Download SKILL.md content
    SA->>SA: Generate plugins (ADR-39) + tailor skill
    SA->>SA: Wire to subagents/code-reviewer.md
    SA-->>Eng: "Skill created and wired"
    Eng-->>Main: result
    Main-->>User: "Created code-review skill, wired to code-reviewer subagent"

11. User Starts a New Topic While Assistant is Busy (Conversation Threading)

Happy Path

  1. User sends: "Add 2FA to the login page"
    • 0 active topics → new topic-1 ("Add 2FA") created automatically, no classification needed
    • Main agent begins working: researching auth libraries, delegating to engineering team
  2. While topic-1 is processing, user sends: "What's the deploy status?"
    • 1 active topic → main agent evaluates during processing and recognizes this is unrelated to 2FA
    • New topic-2 ("Deploy status") created automatically
    • Topic-2 session starts in parallel
  3. Topic-2 responds quickly: "Last deploy was 2 hours ago, all green"
    • Response arrives with topic_id: "t-def456", topic_name: "Deploy status"
  4. Topic-1 continues working in the background
  5. User sends: "Use WebAuthn instead of TOTP"
    • 2 active topics → lightweight LLM classification call
    • Classified to topic-1 ("Add 2FA") based on semantic match
    • Topic-1 session receives the message and adjusts approach
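The routing policy by active-topic count can be sketched as follows. This is illustrative: it collapses the one-topic in-flight agent evaluation and the multi-topic lightweight LLM call into a single `classify` callback, and the types are hypothetical.

```typescript
// Illustrative sketch of topic routing by active-topic count.
// `classify` stands in for both the agent's own relevance check (1 topic)
// and the lightweight LLM classification call (2+ topics).
type Topic = { id: string; name: string };
type Route = { kind: "new" } | { kind: "existing"; topicId: string };

function routeMessage(
  activeTopics: Topic[],
  classify: (topics: Topic[]) => string | null, // null = unrelated to all
): Route {
  // 0 active topics: new topic, no classification needed.
  if (activeTopics.length === 0) return { kind: "new" };
  const match = classify(activeTopics);
  return match === null ? { kind: "new" } : { kind: "existing", topicId: match };
}
```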

Failure: Misclassification

User's message is routed to the wrong topic. The response doesn't match context. User clarifies with a follow-up message — the classifier re-routes to the correct topic based on content matching.

Sequence Diagram

sequenceDiagram
    participant User
    participant TC as TopicClassifier
    participant T1 as Topic-1: Add 2FA
    participant T2 as Topic-2: Deploy Status

    User->>TC: "Add 2FA to the login page"
    Note over TC: 0 topics → new
    TC->>T1: Create topic, handleMessage()
    T1->>T1: Researching, delegating...

    User->>TC: "What's the deploy status?"
    Note over TC: 1 topic → agent evaluates → unrelated → new
    TC->>T2: Create topic, handleMessage()
    T2-->>User: "Last deploy 2h ago, all green" [topic: Deploy Status]

    User->>TC: "Use WebAuthn instead of TOTP"
    Note over TC: 2 topics → LLM call → matches "Add 2FA"
    TC->>T1: Route to existing topic
    T1-->>User: "Switching to WebAuthn..." [topic: Add 2FA]

12. Memory Save, Search, and Supersede

Happy Path

  1. Operations team completes an incident response. During the session, it saves a lesson:

    • Calls memory_save(key: "redis-timeout", content: "Redis connections timeout after 30s under load. Increase pool size to 20.", type: "lesson")
    • New row inserted in memories table with team_name: "operations", is_active: 1
    • Content is chunked and indexed in memory_chunks + memory_chunks_fts
  2. Three days later, a new session starts. The lesson is auto-injected into the system prompt's memory section under [LESSON], making it available in the agent's reasoning context.

  3. During this session, the agent investigates a new Redis issue and discovers the original lesson was wrong:

    • Calls memory_search(query: "redis connection pool") → returns the redis-timeout entry with score 0.92
    • Agent realizes the root cause was actually connection leak, not pool size
    • Agent asks user: "I have a memory that says Redis timeouts are fixed by increasing pool size to 20, but I'm seeing a connection leak pattern. Which is correct?"
    • User confirms: "It was a connection leak. The pool size change masked it temporarily."
  4. Agent supersedes the old memory with the correction:

    • Calls memory_save(key: "redis-timeout", content: "Redis timeouts were caused by a connection leak in the retry handler, not pool exhaustion. Fix: close connections in the finally block.", type: "lesson", supersede_reason: "User confirmed original diagnosis was wrong. Pool size increase masked the real issue (connection leak in retry handler).")
    • Old entry: is_active set to 0
    • New entry: is_active: 1, supersedes_id points to old entry, supersede_reason recorded
  5. Later, memory_search(query: "redis timeout") returns both entries:

    • Active entry (score 0.95): the corrected lesson
    • Superseded entry (score 0.72, marked [SUPERSEDED]): the original wrong diagnosis — still searchable for audit trail
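The supersede mechanics (deactivate the old row, insert a linked new row, reject overwrites without a reason) can be sketched over an in-memory row set. The real store is SQLite; the column names follow the text above, but this function body is an illustration, not the actual memory_save implementation.

```typescript
// Illustrative sketch of memory_save supersede mechanics over plain rows.
interface MemoryRow {
  id: number; key: string; content: string;
  is_active: 0 | 1; supersedes_id: number | null; supersede_reason: string | null;
}

function memorySave(
  rows: MemoryRow[],
  key: string,
  content: string,
  supersedeReason?: string,
): MemoryRow {
  const active = rows.find((r) => r.key === key && r.is_active === 1);
  if (active && !supersedeReason) {
    throw new Error(
      `Active memory '${key}' already exists. Provide supersede_reason to replace it, or use a different key.`,
    );
  }
  if (active) active.is_active = 0; // old entry stays searchable for audit
  const row: MemoryRow = {
    id: rows.length + 1, key, content, is_active: 1,
    supersedes_id: active ? active.id : null,
    supersede_reason: supersedeReason ?? null,
  };
  rows.push(row);
  return row;
}
```

The deactivated row is never deleted, which is what keeps the superseded entry searchable as an audit trail.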

Failure: Overwrite Without Reason

Agent tries to save a new entry with a key that already exists but forgets the reason:

  • Calls memory_save(key: "redis-timeout", content: "New content...", type: "lesson")
  • Tool rejects: "Active memory 'redis-timeout' already exists. Provide supersede_reason to replace it, or use a different key."
  • Agent must either provide a reason or choose a different key

Soft Supersede (Typo Fix)

Agent notices a typo in an existing memory:

  • Calls memory_save(key: "redis-timeout", content: "...corrected typo...", type: "lesson", supersede_reason: "minor correction")
  • Same mechanics, no user verification needed

Sequence Diagram

sequenceDiagram
    participant Agent as Operations Agent
    participant MT as memory_save
    participant DB as SQLite (memories)
    participant Search as memory_search
    participant User

    Note over Agent: Session 1: Incident response
    Agent->>MT: memory_save("redis-timeout", "Increase pool to 20", type:"lesson")
    MT->>DB: INSERT (is_active=1)

    Note over Agent: Session 2: New investigation
    Agent->>Search: memory_search("redis connection pool")
    Search->>DB: FTS5 + vector query
    DB-->>Search: redis-timeout (score: 0.92)
    Search-->>Agent: [{key: "redis-timeout", snippet: "...pool size to 20...", score: 0.92}]

    Agent->>User: "Memory says pool size fix, but I see connection leak. Which is correct?"
    User-->>Agent: "Connection leak. Pool size masked it."

    Agent->>MT: memory_save("redis-timeout", "Connection leak in retry handler", type:"lesson", supersede_reason:"Pool size was wrong diagnosis")
    MT->>DB: UPDATE old row (is_active=0)
    MT->>DB: INSERT new row (supersedes_id=old.id, is_active=1)

    Note over Agent: Session 3: Future search
    Agent->>Search: memory_search("redis timeout")
    Search-->>Agent: Active: corrected lesson (0.95)<br/>Superseded: original diagnosis (0.72, marked)

13. Sender Trust Evaluation (TrustGate)

Happy Path

  1. Discord user (ID in sender_trust DB with trust_level="trusted") sends a message
  2. ChannelRouter forwards to TrustGate with channelType="discord", senderId="112233445566"
  3. TrustGate evaluates: denylist (not found) → DB (found, trusted) → allow
  4. Message proceeds to TopicClassifier → normal processing
  5. Admin later grants trust to a new sender via the add_trusted_sender tool
  6. New sender's next message succeeds

Failure: Unknown Sender

Unknown sender (not in allowlist or DB) sends a message on a deny-policy channel.

  • --> TrustGate returns static "Not authorized." response
  • --> Message never reaches TopicClassifier or LLM
  • --> Decision logged to trust_audit_log with reason="default_policy_deny"

Failure: Explicitly Denied Sender

Sender in denylist or with trust_level="denied" sends a message.

  • --> TrustGate issues silent deny — no response at all
  • --> Prevents enumeration (sender cannot determine if the system exists)
  • --> Decision logged to trust_audit_log with reason="sender_denylist"

Failure: Trust Database Unavailable

SQLite becomes inaccessible during operation.

  • --> Deny-policy channels: TrustGate fails closed (all messages denied)
  • --> Startup warning logged if no trust: section in channels.yaml

Sequence Diagram

sequenceDiagram
    participant User
    participant Adapter as Channel Adapter
    participant Router as ChannelRouter
    participant TG as TrustGate
    participant TC as TopicClassifier
    participant Main as Main Agent

    User->>Adapter: Send message
    Adapter->>Router: Forward (channelType, senderId, message)
    Router->>TG: Evaluate trust (channelType, senderId)

    Note over TG: 6-step evaluation order:<br/>1. Sender denylist check<br/>2. DB lookup (sender_trust)<br/>3. Sender allowlist check<br/>4. Channel-specific overrides<br/>5. Channel-level policy<br/>6. Default policy

    alt Denylist match (trust_level="denied")
        TG-->>Router: DENY (silent)
        Note over Router: No response sent to user<br/>Logged: reason="sender_denylist"
    else Unknown sender + deny-policy channel
        TG-->>Router: DENY (static message)
        Router-->>Adapter: "Not authorized."
        Adapter-->>User: "Not authorized."
        Note over Router: Logged: reason="default_policy_deny"
    else Trusted (allowlist or DB)
        TG-->>Router: ALLOW
        Router->>TC: Forward message
        TC->>Main: Route to topic
        Main-->>TC: Response
        TC-->>Router: Response
        Router-->>Adapter: Response
        Adapter-->>User: Response
    end

14. Admin Dashboard Access

Happy Path

  1. Operator opens browser to container port (e.g., http://localhost:8080)
  2. Dashboard loads — SPA shell with navigation
  3. System Health Overview displays: uptime, SQLite size, team count, queue backlogs
  4. Operator navigates to Live Org Tree — sees team hierarchy with status and queue depth
  5. Operator checks Task Queue Dashboard — filters by team, sees pending/running/done/failed/cancelled
  6. Operator browses Log Viewer — searches by level, team, time range
  7. Operator opens Memory Browser — views memories by team/type, supersede chains
  8. Operator opens Trigger Manager — enables a disabled trigger via toggle (one of two write operations; the other is plugin lifecycle actions)
  9. Operator checks Conversation History — sees message flow and topic states

Failure: Dashboard Unreachable

Dashboard port not accessible (firewall, container not running).

  • --> Operator can use Discord or direct DB access as fallback

Note: No authentication. Operator manages network access. For remote access, use Traefik, Cloudflare Tunnel, or similar reverse proxy with their own auth.

Sequence Diagram

sequenceDiagram
    participant Op as Operator
    participant Browser
    participant SPA as Dashboard SPA
    participant API as REST API
    participant DB as SQLite

    Op->>Browser: Open http://localhost:8080
    alt Dashboard reachable
        Browser->>SPA: Load SPA shell
        SPA->>API: GET /api/v1/overview
        API->>DB: Query uptime, team count, queue stats
        DB-->>API: System metrics
        API-->>SPA: Health overview
        SPA-->>Op: Dashboard loaded

        Op->>SPA: Navigate to Trigger Manager
        SPA->>API: GET /api/v1/triggers?team=ops-team
        API-->>SPA: Trigger list with states

        Op->>SPA: Enable a disabled trigger
        SPA->>API: POST /api/v1/triggers/:id/enable
        API->>DB: Update trigger state
        DB-->>API: Updated
        API-->>SPA: Trigger state changed
        Note over SPA: Two write mutations: trigger toggle + plugin lifecycle
    else Dashboard unreachable
        Browser-->>Op: Connection refused
        Note over Op: Fallback: Discord or direct DB access
    end

15. Credential Migration

Happy Path

  1. Operator runs bootstrap with --dry-run flag to preview what credential migration would change
  2. Dry-run output shows: which teams have credentials: in config.yaml, which vault entries would be created, which runtime artifacts contain legacy credential accessor references
  3. Operator reviews output and confirms migration is safe to proceed
  4. Bootstrap runs migration: for each team with credentials: in config.yaml, insert each key-value pair into team_vault with is_secret=1, updated_by='migration'. Config.yaml-wins rule: if a key exists in both config.yaml and vault, the config.yaml value overwrites the vault entry
  5. Runtime artifact scan: search .run/teams/{name}/{skills,subagents,team-rules,org-rules}/*.md for legacy credential accessor references, replace each occurrence with vault_get
  6. Remove credentials: section from each team's config.yaml
  7. Team sessions start with vault-based credential access — no behavioral change from the team's perspective

Failure: Migration Fails for One Team

Migration encounters an error for a specific team (e.g., malformed credentials block, write failure).

  • --> Team is quarantined (skipped); migration continues for remaining teams
  • --> Other teams start normally with vault-based credentials
  • --> Error logged with team name and failure reason
  • --> Operator fixes the issue and re-runs migration for the quarantined team

Failure: Stale Vault After Partial Migration

Operator re-runs migration after a partial failure or after updating config.yaml credentials.

  • --> Config.yaml-wins rule ensures re-running migration overwrites stale vault data with current config.yaml values
  • --> Vault entries created by a previous run are overwritten, not duplicated
  • --> Safe to re-run migration any number of times (idempotent)

Sequence Diagram

sequenceDiagram
    participant Op as Operator
    participant Boot as Bootstrap
    participant Cfg as Config Scanner
    participant Vault as team_vault (SQLite)
    participant FS as Filesystem (.run/teams/)
    participant Team as Team Session

    Op->>Boot: --dry-run
    Boot->>Cfg: Scan all config.yaml files
    Cfg-->>Boot: Teams with credentials: [ops, eng, qa]
    Boot->>FS: Scan .run/teams/*/skills,subagents,team-rules,org-rules/*.md
    FS-->>Boot: Files with legacy credential accessor refs: [3 files]
    Boot-->>Op: Dry-run report (teams, vault entries, artifact refs)

    Op->>Boot: Run migration
    Boot->>Cfg: Scan config.yaml for each team

    loop For each team with credentials:
        alt Migration succeeds
            Boot->>Vault: INSERT/UPDATE team_vault (is_secret=1, updated_by='migration')
            Note over Vault: Config.yaml-wins: overwrites existing vault entries
            Boot->>FS: Replace legacy credential accessor → vault_get in artifacts
            Boot->>Cfg: Remove credentials: section from config.yaml
        else Migration fails for team
            Boot-->>Boot: Quarantine team, log error
            Note over Boot: Continue with remaining teams
        end
    end

    Boot-->>Op: Migration complete (N migrated, M quarantined)
    Boot->>Team: Start team sessions
    Team->>Vault: vault_get(key) for credential access
    Vault-->>Team: Credential value

16. Autonomous Learning Cycle

Bootstrap creates active learning-cycle-{subagent} triggers per subagent with readiness gates checked at runtime (ADR-35, ADR-40). Each trigger targets a specific subagent for deterministic routing. Main agent has no learning triggers (no subagents). For the full 6-phase cycle, vault journal structure, and configuration defaults, see Self-Evolution#Autonomous Learning.

Happy Path

  1. Schedule trigger learning-cycle-learner fires for the learner subagent in ops-team (see Triggers#Execution Flow). Orchestrator routes deterministically — no LLM cost.
  2. Readiness gates pass: bootstrapped=1, scope_keywords present, 6-tool bundle available.
  3. Journal Read: vault_get("learning:ops-team:learner:journal") — loads prior state (or initializes empty on first run).
  4. Topic Analysis: Derives topics from scope_keywords, ranks by task-history gaps, checks existing memories to avoid re-learning.
  5. Web Discovery: Fetches sources via web_fetch + browser tools, skipping cached URLs (TTL-based) and deprioritized sources (90-day expiry).
  6. Validation: Cross-domain corroboration — 3+ independent root domains → high confidence (lesson), 2 → medium (lesson), 1 → low (reference only). Near-duplicate/mirror content counts as one source.
  7. Storage: memory_save with deterministic key learn:{topic_slug}:{claim_hash} for dedup. Capped at max_learnings_per_session (default 5).
  8. Journal Update: Persists progress, prunes expired deprioritized sources, sets next focus.
  9. Duration budget: If max_duration_minutes (default 30) exceeded, in-progress operation completes, then skips to journal update and exits gracefully.

Failure: Web Tools Unavailable

  • Task fails — journal NOT updated (no partial corruption)
  • Circuit breaker increments; after 3 consecutive failures: trigger auto-disabled (see Triggers#Circuit Breaker)

Failure: Journal Corrupted

  • Non-fatal: treated as first run — existing memories prevent duplicate storage via dedup keys
  • Topic coverage lost but no incorrect data injected

Sequence Diagram

sequenceDiagram
    participant Cron as Schedule Trigger
    participant SA as learner Subagent
    participant Skill as learning-cycle skill
    participant Vault as vault_get / vault_set
    participant Web as web_fetch / browser
    participant Mem as memory_save / memory_search

    Cron->>SA: delegateTask (deterministic routing)
    SA->>Skill: follow learning-cycle.md
    Skill->>Skill: Readiness gates + 6-tool check
    Skill->>Vault: vault_get(journal)
    Vault-->>Skill: prior state (or empty)
    Skill->>Skill: Derive topics from scope_keywords
    Skill->>Web: Fetch sources (skip cached + deprioritized)
    Web-->>Skill: Candidate findings
    Skill->>Skill: Cross-domain corroboration
    Skill->>Mem: memory_save (lessons + references)
    Skill->>Vault: vault_set(journal with updated progress)

    alt Duration budget exceeded
        Skill->>Vault: vault_set(journal — partial progress)
        Note over Skill: Graceful exit
    end

    alt Web tools fail
        Web-->>Skill: Error
        Note over Skill: Task fails — journal untouched
    end

17. Self-Reflection Cycle

Happy Path

  1. Schedule trigger reflection-cycle-learner fires the reflection-cycle skill for the learner subagent in the operations team at 3 AM (see Triggers#Reflection Trigger). Orchestrator routes directly to the subagent (deterministic routing per ADR-40).
  2. READINESS GATES: Skill checks bootstrapped=1, scope_keywords present, and required tool bundle (vault_get, vault_set, memory_save, memory_search, memory_list, list_completed_tasks). All gates pass -- proceed
  3. JOURNAL READ: vault_get("reflection:ops-team:learner:journal") returns previous journal with next_focus: "task completion time"
  4. EVIDENCE GATHER: list_completed_tasks retrieves last 50 completed tasks. Pattern analysis identifies 12 tasks with >10 min completion time for simple queries
  5. DIAGNOSE: Highest-impact issue: slow response template loading adds ~3 min per task. Evidence: 12/50 tasks affected
  6. PROPOSE: Draft change: add early-exit to response template skill when input matches a known pattern. Before/after comparison shows expected 60% reduction in affected task completion time
  7. APPLY: Subagent escalates proposal to orchestrator for confirmation (propose+confirm model per ADR-40). Orchestrator reviews, approves, and applies the change via Edit tool with governance enforcement
  8. JOURNAL UPDATE: vault_set("reflection:ops-team:learner:journal", ...) records diagnosis, proposal, outcome (applied), next focus ("error rate patterns")

Edge Case: No Actionable Issue

  1. Trigger fires, readiness gates pass
  2. Evidence gathered -- all tasks within normal parameters
  3. Diagnosis: no significant inefficiency detected
  4. Change skipped -- journal updated with "no action taken", next focus unchanged

Edge Case: Tool Missing

  1. Trigger fires
  2. Readiness gate check: list_completed_tasks not in team's allowed_tools
  3. Skill logs warning: "Reflection skipped: missing tool list_completed_tasks"
  4. Exit without error. Trigger remains active for next firing

Sequence Diagram

sequenceDiagram
    participant TE as Trigger Engine
    participant Orch as Orchestrator
    participant SA as Subagent (learner)
    participant Skill as reflection-cycle skill
    participant Tasks as list_completed_tasks
    participant Vault as team_vault
    participant Gov as Governance

    TE->>Orch: reflection-cycle-learner fires (3 AM)
    Orch->>SA: Route to learner (deterministic)

    alt Readiness gates pass
        SA->>Skill: Follow reflection-cycle.md steps

        Skill->>Vault: vault_get("reflection:ops-team:learner:journal")
        Vault-->>Skill: Previous journal (next_focus: "task completion time")

        Skill->>Tasks: list_completed_tasks (last 50)
        Tasks-->>Skill: Task outcomes with duration data

        alt Actionable issue found
            Note over Skill: Diagnose: slow template loading (12/50 tasks)
            Skill->>SA: Propose: add early-exit to template skill
            SA->>Orch: escalate() proposal for confirmation
            Orch->>Gov: Governance check (scope, cooldown)
            Gov-->>Orch: Approved
            Orch->>SA: Apply change via Edit tool
            SA->>Skill: Change applied
            Skill->>Vault: vault_set("reflection:ops-team:learner:journal", {applied, next_focus})
        else No actionable issue
            Note over Skill: All tasks within normal parameters
            Skill->>Vault: vault_set("reflection:ops-team:learner:journal", {no_action, next_focus unchanged})
        end
    else Tool missing (e.g., list_completed_tasks)
        SA->>Skill: Readiness gate check
        Skill-->>SA: warn("Reflection skipped: missing tool")
        Note over SA: Exit without error. Trigger remains active.
    end

18. Stall Detection

Happy Path: Stalled Task Detected

  1. Stall detector periodic scan runs (every 10 minutes, engine-level infrastructure — see Architecture-Decisions#ADR-38)
  2. Scan queries task_queue for pending tasks older than 1 hour and pending/running tasks older than 24 hours
  3. One task found: task_id=42, team ops-team, status pending, age 2 hours 15 minutes
  4. Warning-level alert: logged at warn level. Alert routed to originating channel (if available) or escalated
  5. Next scan (10 min later): same task still pending, age 2 hours 25 minutes -- warning repeated
  6. Team processes the task before the 24-hour threshold -- task transitions to done, no further alerts

Edge Case: 24-Hour Escalation

  1. Task task_id=99, team research-team, status running, age 25 hours
  2. Error-level alert: logged at error level. Alert escalated through hierarchy
  3. Main team delivers escalation to user via channel adapter (root team escalation path)
  4. Operator investigates -- blocked session due to missing API credentials. Credentials added, task resumes

Edge Case: Clean Scan

  1. Stall detector scan runs
  2. No tasks exceed either threshold
  3. Logged at debug level: "Stall detection scan clean"
  4. No alerts generated

Sequence Diagram

sequenceDiagram
    participant SD as Stall Detector
    participant DB as task_queue (SQLite)
    participant Log as Logger
    participant Ch as Channel Adapter
    participant Main as Main Agent
    participant User

    loop Every 10 minutes
        SD->>DB: Query pending >1hr, pending/running >24hr
        alt Stalled task found (<24hr)
            DB-->>SD: task_id=42, ops-team, pending, 2hr 15min
            SD->>Log: warn("Stalled task", {task_id, team, age})
            alt sourceChannelId present
                SD->>Ch: Route warning to originating channel
            else No sourceChannelId (schedule-triggered)
                SD->>Main: escalate() warning to parent team
            end
            Note over SD: Next scan: re-check same task
        else Stalled task found (>24hr)
            DB-->>SD: task_id=99, research-team, running, 25hr
            SD->>Log: error("Critical stall", {task_id, team, age})
            SD->>Main: escalate() through hierarchy
            Main->>User: "Task stalled >24hr in research-team"
        else Clean scan
            DB-->>SD: No tasks exceed thresholds
            SD->>Log: debug("Stall detection scan clean")
        end
    end

19. Continuous-Watch Team with Work Handoff

This scenario demonstrates the activation primitives introduced by ADR-41 through ADR-44 working together: a team on continuous watch during a bounded window detects an event, hands work to its parent, and the parent fans out a parallel research query across peers before executing a mutation. The walkthrough uses a generic "inventory-watcher" analogue and can be re-read with domain-specific names (trading, security, news, etc.) without structural changes.

Motivating Evidence: Serial Fan-out Cost (Pre-ADR-41)

Before ADR-41/42, an orchestrator that needed answers from three peers invoked query_team sequentially because each child enforced the ADR-9 "one session per team" invariant. The live database recorded a single cycle of 1,136,410 ms ≈ 19 minutes — eleven query_team calls, none overlapping. Each arrow below is a distinct blocking call:

sequenceDiagram
    participant T as parent-orch
    participant F as peer-A
    participant S as peer-B
    participant P as peer-C

    Note over T: 19:10:04 cycle starts
    T->>F: query_team (19:11:53)
    F-->>T: result (200,891 ms)
    T->>F: query_team (19:15:19)
    F-->>T: empty (198,960 ms)
    T->>S: query_team (19:18:51)
    S-->>T: result (85,104 ms)
    T->>P: query_team (19:20:21)
    P-->>T: result (164,105 ms)
    T->>S: query_team (19:23:42)
    S-->>T: result (114,058 ms)
    T->>P: query_team (19:25:47)
    P-->>T: result (26,499 ms)
    T->>P: query_team (19:27:10)
    P-->>T: result (23,046 ms)
    T->>F: query_team (19:30:33)
    F-->>T: result (7,250 ms)
    T->>S: query_team (19:30:46)
    S-->>T: result (78,945 ms)
    T->>P: query_team (19:32:14)
    P-->>T: result (70,427 ms)
    T->>P: mutate (19:33:35)
    P-->>T: result (42,893 ms)
    Note over T: cycle ends (≈19 min)

Wall-clock of the cycle equals the sum of child durations, not the max. ADR-41 removes the per-team single-flight for daily-ops so peers can run concurrently; query_teams (see Organization-Tools#query_teams — Parallel Fan-out) exposes that concurrency as a single tool call.

Happy Path: Window Tick → Event → Handoff → Parallel Research → Mutation

  1. Bootstrap (setup). Parent orchestrator creates a window trigger on the watcher child: create_trigger(team="watcher", name="market-hours-watch", type="window", config={ watch_window: "30 9-16 * * 1-5", tick_interval_ms: 30000, max_tokens_per_window: 200000, max_ticks_per_window: 800, overlap_policy: "always-skip" }, subagent="event-scanner", skill="scan-window"). Trigger is created pending, verified with test_trigger, then enabled (see Triggers#window Trigger Type and Tool-Guidelines#Trigger Tools).
  2. Window opens. At 09:30, the cron expression matches — engine transitions market-hours-watch from WindowClosed to WindowOpen (see Triggers#window Trigger Type state diagram).
  3. Routine tick (no-op). At 09:30:30 the engine dispatches a fresh disposable session to the watcher (ADR-10). The event-scanner subagent reads its cursor event-scanner:last_scan_cursor from memory, fetches the upstream feed via web_fetch(url, rate_limit_key="upstream-feed"), finds no new items past the cursor, and returns { "action": "noop", "reason": "no new items since 2026-04-15T09:30:00Z" }. The engine records the cursor update and produces no notification, no parent-queue insertion. See Tool-Guidelines#No-op Tick Contract.
  4. Eventful tick. At 10:47:30 a new item appears. The scanner records it, advances its cursor, and calls enqueue_parent_task({ task: "new actionable event detected: <context>", priority: "high", correlation_id: "<uuid>" }). The payload carries context only, no subagent directive — the parent still routes (ADR-40/ADR-43). See Organization-Tools#enqueue_parent_task — Work Handoff.
  5. Parent dequeues. The parent orchestrator dequeues the high-priority task, consults the Activation Decision Framework (Tool-Guidelines#Activation Decision Framework), and recognises this is on-demand research (a specific event is already in hand) rather than recurring work.
  6. Parallel fan-out. Parent calls query_teams([{team: "peer-A", query: "..."}, {team: "peer-B", query: "..."}, {team: "peer-C", query: "..."}], default_timeout_ms: 150000). Three child sessions run concurrently — wall-clock is max(child durations) rather than the sum. All three are daily-ops per ADR-41 and share the parent's SQLite WAL without serialisation.
  7. Partial-failure tolerance. peer-C times out; peer-A and peer-B return {ok: true, result_or_error: ...}. Parent evaluates partial results per Tool-Guidelines#query_teams Partial Failure — quorum met, proceeds without retry.
  8. Mutation step. Parent delegates a single mutating task to a specialist child via delegate_task (org-ops-free — delegate_task itself is daily-ops; any structural change inside the specialist remains single-flight per-team). Specialist returns within its own cap.
  9. Window closes. At 16:00 the cron no longer matches. Any in-progress tick completes; no new ticks start. Engine transitions to WindowClosed. The learner trigger (if configured for this team) then runs after window close per [[Self-Evolution#Interaction with window Triggers]].

Failure: Cursor Not Advanced

The scanner detects an item, emits enqueue_parent_task, but crashes before writing the updated cursor.

  • → Next tick re-reads the unchanged cursor, detects the same item again, and would emit a duplicate handoff.
  • → Guard: enqueue_parent_task dedup window (default 60 s by correlation_id) absorbs the duplicate within the guard window. See Organization-Tools#enqueue_parent_task.
  • → Outside the dedup window, the subagent's idempotency invariant requires gating the external side effect on cursor advancement; re-queuing the same event after the dedup expiry is a subagent-level bug (see Subagents#Window-Trigger Subagents Responsibilities template).

Other failure modes (rate-limited handoff storms, query_teams partial failure, target saturation) are documented at their canonical locations: Organization-Tools#enqueue_parent_task — Work Handoff and Tool-Guidelines#query_teams Partial Failure.

Sequence Diagram (new-architecture composite flow)

sequenceDiagram
    participant Eng as Trigger Engine
    participant W as watcher subagent
    participant Mem as memory (cursor)
    participant P as parent-orch
    participant A as peer-A
    participant B as peer-B
    participant C as peer-C
    participant Sp as specialist

    Note over Eng: Window opens at 09:30 (watch_window cron)
    loop tick_interval_ms = 30s
        Eng->>W: dispatch fresh session (ADR-10)
        W->>Mem: read event-scanner:last_scan_cursor
        alt no new items
            W-->>Eng: {action:"noop", reason:"..."} (see Tool-Guidelines#No-op)
            Note over Eng: No notification, no parent queue, cursor updated
        else event detected
            W->>Mem: write updated cursor + event id
            W->>P: enqueue_parent_task(context, priority:high, correlation_id)
            Note over P: Parent orchestrator routes per ADR-40
            P->>P: consult Activation Framework → on-demand fan-out
            par daily-ops parallel
                P->>A: query_teams child
            and
                P->>B: query_teams child
            and
                P->>C: query_teams child
            end
            A-->>P: {team:"A", ok:true, result_or_error:"..."}
            B-->>P: {team:"B", ok:true, result_or_error:"..."}
            C-->>P: {team:"C", ok:false, result_or_error:"saturation"}
            Note over P: Partial failure tolerated (Tool-Guidelines#query_teams Partial Failure)
            P->>Sp: delegate_task(mutation with synthesized context)
            Sp-->>P: result
        end
    end
    Note over Eng: Window closes at 16:00 — no new ticks

Canonical Diagrams Referenced

| Concept | Canonical page |
| --- | --- |
| query_teams fan-out sequence | Organization-Tools#query_teams — Parallel Fan-out |
| enqueue_parent_task handoff flowchart | Organization-Tools#enqueue_parent_task — Work Handoff |
| window trigger state machine | Triggers#window Trigger Type |
| Activation Decision Framework | Tool-Guidelines#Activation Decision Framework |
| Daily-ops vs org-ops pool + mutex | Architecture#Execution Model |

User → Fix Flow

When a user reports an issue, the fix flows through the full 5-layer hierarchy: Main → Orchestrator → Subagent → Skill → Plugin.

sequenceDiagram
    participant User
    participant Main as Main Agent
    participant Orch as Team Orchestrator
    participant SA as Subagent
    participant SK as Skill
    participant PL as Plugin

    User->>Main: "Loggly alerts aren't working"
    Note over Main: Routes only. Identifies ops-team.
    Main->>Orch: delegate_task("ops-team", "Fix loggly alert issue")

    Note over Orch: Reads subagent defs → picks loggly-monitor

    Orch->>SA: invoke loggly-monitor subagent
    Note over SA: Context loaded: loggly-monitor.md + skills + task

    SA->>SK: follow alert-check skill steps
    SK->>PL: loggly_fetch.ts → fetch recent alerts
    PL-->>SK: alert data
    SK->>PL: classify_entries.ts → analyze
    PL-->>SK: classification
    SK-->>SA: diagnosis complete

    SA-->>Orch: "Fixed: alert threshold was misconfigured"
    Orch-->>Main: result
    Main-->>User: "Fixed. Alert threshold in loggly was misconfigured."