Incident Response Runbook - nself-org/cli GitHub Wiki
Status: v2 (S9.T01 - P1/P2/P3 SLAs + PagerDuty + full war-room protocol)
Canonical URL: .claude/docs/operations/incident-response-runbook.md
Cross-ref: incident-response.md (SEV definitions + decision tree), bus-factor.md, postmortem-template.md, pagerduty-setup.md, runbooks/ (per-scenario)
nSelf uses five severity levels, P0 through P4. P1/P2/P3 map to the most common operator response patterns.
| Level | Alias | Threshold | Ack SLA | Mitigate SLA | Resolve SLA | Postmortem? | Status Page? |
|---|---|---|---|---|---|---|---|
| P0 / SEV1 | Total outage | Production down, data loss risk, or confirmed breach. >10% users blocked. | 5 min | 60 min | 4 hours | Required (48h) | YES, every 15 min |
| P1 / SEV2 | Partial outage | Single-customer or partial outage. <10% users, or one critical-tier customer blocked. Auth/billing broken. | 1 hour | 4 hours | 8 hours | Required (48h) | YES, on detect/mitigate/resolve |
| P2 / SEV3 | Degraded | User-visible degradation but workaround exists. Feature unavailable, slow response, non-critical plugin offline. | 4 hours | 24 hours | 48 hours | Optional | Optional |
| P3 / SEV4 | Monitoring | No user impact observed. Issue visible in monitoring only. Cert expiry >14 days, log volume warning. | Next business day | Next sprint | Next sprint | No | No |
| P4 | Info | Capacity warning, informational signal. No action needed. | - | - | - | No | No |
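The SLA columns above can be turned into absolute deadlines once a detection time is known. The following is a minimal sketch, not part of the nSelf tooling: the names `SLA` and `sla_deadlines` are hypothetical, and it assumes each window is measured from the moment of detection, which the table does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# Ack / mitigate / resolve windows copied from the severity table.
# P3 and P4 are left out: their SLAs are calendar-based ("next business day",
# "next sprint") or absent, so they do not reduce to a fixed duration.
SLA = {
    "P0": {"ack": timedelta(minutes=5), "mitigate": timedelta(minutes=60), "resolve": timedelta(hours=4)},
    "P1": {"ack": timedelta(hours=1), "mitigate": timedelta(hours=4), "resolve": timedelta(hours=8)},
    "P2": {"ack": timedelta(hours=4), "mitigate": timedelta(hours=24), "resolve": timedelta(hours=48)},
}


def sla_deadlines(severity: str, detected_at: datetime) -> dict:
    """Return the absolute deadline for each SLA stage.

    Assumption: every window starts at detection time.
    """
    return {stage: detected_at + window for stage, window in SLA[severity].items()}


detected = datetime(2026, 5, 7, 14, 17, tzinfo=timezone.utc)
for stage, due in sla_deadlines("P0", detected).items():
    print(f"{stage:<8} due {due:%Y-%m-%d %H:%M} UTC")
```

For a P0 detected at 14:17 UTC this gives an ack deadline of 14:22, mitigation by 15:17, and resolution by 18:17.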
- Auth broken for ANY user → P1 minimum.
- Payment/billing broken → P1 minimum.
- Any credential or secret exposed → P0 regardless of user count.
- If uncertain between P1 and P2: pick P1 and downgrade after triage.
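These override rules only ever raise a severity, never lower it. A minimal sketch of that logic follows; the function name `apply_overrides` and its keyword flags are hypothetical and exist only for illustration.

```python
def apply_overrides(initial: str, *, auth_broken: bool = False,
                    billing_broken: bool = False, secret_exposed: bool = False,
                    unsure_p1_or_p2: bool = False) -> str:
    """Apply the override rules to an initial severity guess.

    A lower number means a more severe incident, so "raising" severity
    means taking the minimum of the current level and the rule's floor.
    """
    level = int(initial.lstrip("P"))
    if unsure_p1_or_p2 and level == 2:
        level = 1                      # pick P1, downgrade after triage
    if auth_broken or billing_broken:
        level = min(level, 1)          # P1 minimum
    if secret_exposed:
        level = 0                      # P0 regardless of user count
    return f"P{level}"


print(apply_overrides("P3", auth_broken=True))     # P1
print(apply_overrides("P2", secret_exposed=True))  # P0
```

An incident already at P0 stays at P0 whichever flags are set, because every rule is a floor rather than an assignment.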
nSelf is a single-maintainer ecosystem. Primary on-call is [email protected] (24/7). Backup on-call: per bus-factor.md (USER DECISION PENDING: nomination required for 9 critical services).
Via nself-incident-mgmt plugin UI (preferred when running):
- Open https://your-nself-instance/incidents or localhost:3833/incidents
- Click "Acknowledge" on the triggered alert
- Fill in: IC name, initial hypothesis, war-room channel name
- Set status to "Investigating"
Via Slack (fallback when UI unavailable):
Post to #incidents within ack SLA:
:ack: P{N} ack - {one-line summary} - IC: {your name} - {HH:MM UTC}
Opening war room: #incident-{YYYY-MM-DD}-{slug}
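The two-line Slack acknowledgement can be generated rather than typed under pressure. This is a sketch only: `war_room_channel` and `ack_message` are hypothetical helpers, the field separator is written as a plain hyphen, and the slug is assumed to be chosen by the IC rather than derived automatically.

```python
from datetime import datetime, timezone


def war_room_channel(started: datetime, slug: str) -> str:
    """Build the war-room channel name: #incident-{YYYY-MM-DD}-{slug}."""
    return f"#incident-{started:%Y-%m-%d}-{slug}"


def ack_message(level: int, summary: str, ic: str, slug: str, now: datetime) -> str:
    """Render the two-line ack post for #incidents.

    `now` is expected to be a timezone-aware UTC datetime.
    """
    first = f":ack: P{level} ack - {summary} - IC: {ic} - {now:%H:%M} UTC"
    second = f"Opening war room: {war_room_channel(now, slug)}"
    return f"{first}\n{second}"


now = datetime(2026, 5, 7, 14, 20, tzinfo=timezone.utc)
print(ack_message(1, "Auth returning 500s", "Aric", "auth-down", now))
```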
Via PagerDuty (if wired; see pagerduty-setup.md):
- Accept the page from the PagerDuty mobile app or email
- Post ack note in the PagerDuty incident: "Investigating - IC: {name}"
- PagerDuty ack automatically updates nself-incident-mgmt via webhook
Every action during the incident gets a timestamped entry. Minimum data per entry:
{HH:MM UTC} | {who} | {what} | {outcome/finding}
Examples:
14:23 UTC | Aric | Checked Hasura logs - confirmed query timeout spike | 500 errors since 14:17
14:28 UTC | Aric | Restarted Hasura container | Errors clearing
14:35 UTC | Aric | Confirmed resolution - no new errors for 5 min | Mitigated
Update the timeline in nself-incident-mgmt UI or in the war-room Slack channel pinned message.
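The four-field timeline entry is simple enough to model as a small record, which keeps every entry in the same shape. The following is an illustrative sketch; the `TimelineEntry` class is hypothetical and is not part of nself-incident-mgmt.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime      # when the action happened (timezone-aware, UTC)
    who: str          # person who performed it
    what: str         # action taken
    outcome: str      # outcome or finding

    def render(self) -> str:
        """Format as: {HH:MM UTC} | {who} | {what} | {outcome/finding}"""
        return f"{self.at:%H:%M} UTC | {self.who} | {self.what} | {self.outcome}"


entry = TimelineEntry(
    at=datetime(2026, 5, 7, 14, 28, tzinfo=timezone.utc),
    who="Aric",
    what="Restarted Hasura container",
    outcome="Errors clearing",
)
print(entry.render())
```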
| Severity | Notify | Channel | Method | Timing |
|---|---|---|---|---|
| P0 | Primary on-call + all engineers | #incident-{slug} | PagerDuty page + Slack | Immediately on detection |
| P1 | Primary on-call | #incidents + #incident-{slug} | PagerDuty or Slack push | Within ack SLA |
| P2 | Primary on-call | #incidents | Slack | Within ack SLA |
| P3 | Primary on-call | #ops-low | Slack message | Business hours |
| Severity | Status Page | Customer Email | Timing |
|---|---|---|---|
| P0 | Post immediately on detection | Send within 30 min of mitigation | Update every 15 min until resolved |
| P1 | Post on detection | Send within 1h of resolution | Update on detect/mitigate/resolve |
| P2 | Optional | No (unless customer contacts support) | Single update on resolution |
| P3 | No | No | - |
Initial post:
[Investigating] {service} - {one-line symptom} - {start-time UTC}
We are investigating reports of {symptom}. Affected users may see {impact}.
Next update: {now + 15 min}.
Identified:
[Identified] Root cause located; fix in progress.
We identified {cause}. Rollout is underway; impact should clear by {ETA UTC}.
Next update: {now + 15 min}.
Resolved:
[Resolved] Service restored at {time UTC}. Duration: {N} min.
{service} is fully restored. Post-mortem within 48h at {link}.
Subject: nSelf service incident - {one-line summary}
Hi {name or "there"},
We had an issue affecting {service} from {start UTC} to {end UTC} ({N} min).
What happened: {plain-English explanation, no jargon}.
Impact on you: {specific impact if known, else "some requests may have failed"}.
What we did: {plain-English mitigation steps}.
What's next: post-mortem within 48h; we'll send it your way.
If you noticed anything we missed, reply here.
- Aric, nSelf
(Tone rule: human, no "we apologize for the inconvenience"; see GCI Outbound Human Correspondence.)
nSelf uses PagerDuty in stub mode: routing is configured but single-escalation for now.
Alert triggered (nself-alert-router → PagerDuty)
└─ Tier 1: Primary on-call (push notification + SMS), 5 min
   └─ No ack → Tier 2: Backup on-call (see bus-factor.md), +5 min
      └─ No ack → Escalation policy: email + phone, +10 min
See pagerduty-setup.md for integration key setup and test flow.
See War Room Protocol (Section 6).
Run in this order:
# Check overall health
nself health
# Check specific service
nself service status <postgres|hasura|auth|nginx|redis>
# Check recent logs (last 100 lines)
nself logs <service> --tail 100
# Check error rate spike
nself monitor --last 15m
- Which nself.org subdomains affected? (curl -I https://{subdomain}.nself.org/health)
- Which plugins returning errors? (Grafana → Plugin Error Rate panel)
- Cloud customers affected? (check np_cloud_tenants activity via Hasura)
- License validation broken? (curl https://ping.nself.org/health)
# What deployed in the last 2 hours?
cd /Volumes/X9/Sites/nself/cli && git log --oneline --since="2 hours ago"
# Any nself update recently?
nself version && nself update --dry-run
Navigate to the matching scenario, then follow it to resolution:
| Symptom | Runbook |
|---|---|
| Postgres down / query errors / deadlock | runbooks/postgres-deadlock.md |
| Hetzner server unreachable | hetzner-failover-runbook.md |
| Hasura metadata errors / migration failure | hasura-migration-runbook.md |
| Vercel deploy needed / rollback | vercel-failover-runbook.md |
| Cloudflare DNS failure / license validation down | cloudflare-dns-failure-runbook.md |
| Stripe billing broken | stripe-failover-runbook.md |
| Secret/credential exposed | runbooks/secret-rotation.md |
| Mass data leak / GDPR trigger | runbooks/mass-data-leak.md |
| AI provider (OpenAI/Anthropic) unreachable | runbooks/ai-provider-outage.md |
| License server (ping.nself.org) down | runbooks/license-server-outage.md |
| Malicious plugin behavior detected | runbooks/malicious-plugin-response.md |
| No matching runbook | Use Root-Cause Template below as live working doc |
Rollback a bad deploy:
# Vercel: roll back to prior production deployment
vercel rollback --prod
# CLI fix pushed to prod: roll back to prior tag
cd /Volumes/X9/Sites/nself/cli
git tag --sort=-creatordate | head -5 # find prior tag
# then trigger deploy via nself deploy with prior version
Restart a crashed service:
nself service restart <service-name>
Enable graceful degradation (disable a bad plugin):
nself plugin disable <plugin-name>
Route around a failing dependency: See specific runbooks for Hetzner / Cloudflare / Vercel failovers.
- P0: required, within 48 hours.
- P1: required, within 48 hours.
- P2: optional but recommended for recurring issues.
- P3/P4: skip.
Use postmortem-template.md verbatim. Key sections:
## Post-Mortem: {one-line title}
**Date:** {YYYY-MM-DD}
**Severity:** P{N}
**Duration:** {start UTC} to {resolved UTC} ({N} min)
**IC:** {name}
**Authors:** {names}
### Impact
{Quantified user impact: N users affected, N min downtime, N failed requests.}
### Timeline
{Timestamped list: detection through resolution.}
### Root Cause
{Specific, factual. Use 5-Whys: keep asking "why" until you reach a process or system gap.}
### What Went Well
{Be honest: what actually helped?}
### What Went Wrong
{Be honest: what slowed response or caused the gap?}
### Action Items
| Action | Owner | Due | Status |
|---|---|---|---|
| {specific preventive or detective fix} | {name} | {YYYY-MM-DD} | open |
- Blameless means no "X did the wrong thing." Focus on system and process gaps.
- Every action item gets an owner and due date. Track in .claude/tasks/active.md until closed.
- Action items are always one specific, verifiable fix, never vague ("improve monitoring").
- Publish within 48 hours. Send link to affected customers by email.
- P0: always.
- P1: always.
- P2: at IC discretion.
1. Create the Slack channel:
#incident-{YYYY-MM-DD}-{slug} e.g. #incident-2026-05-07-auth-down
Channel topic:
P{N} | {one-line summary} | IC: {name} | Zoom: {url} | started {HH:MM UTC}
2. Assign roles:
| Role | Responsibility | Required? |
|---|---|---|
| Incident Commander (IC) | Owns the incident. Drives triage, calls mitigation, declares resolution. | Always |
| Technical Lead | Runs commands, reads logs, executes the runbook steps. | P0/P1 |
| Scribe | Posts timestamped updates every 15 min. Updates status page. | P0/P1 |
| Comms Officer | Drafts customer messages. Handles @-mentions + support tickets. | P0 (optional P1) |
For single-maintainer: IC = Technical Lead. Scribe and Comms can be the same person or deferred until mitigation is underway.
3. Open Zoom bridge:
Use the standing bridge URL stored in vault:
source ~/.claude/vault.env && echo "$INCIDENT_ZOOM_URL"
Paste the URL in the channel topic immediately. For P2: text-only in Slack is fine.
4. Pin the root-cause template:
Paste into the channel as a pinned message:
## Incident {YYYY-MM-DD} - {slug}
**P-level:** P{N}
**IC:** {name}
**Started:** {HH:MM UTC}
**Mitigated:** pending
**Resolved:** pending
### Symptoms
- {bullet each user-visible signal with timestamp}
### Affected surfaces
- [ ] {service / endpoint / customer segment}
### Hypotheses (live: update as ruled out or confirmed)
- H1: {hypothesis} - Evidence: {link/paste} - Status: open
- H2: …
### Actions taken
- {HH:MM} {action} → {outcome}
| Severity | Internal (Slack) | External (status page) |
|---|---|---|
| P0 | Every 10 min | Every 15 min |
| P1 | Every 30 min | On state change |
| P2 | On state change | On resolution |
Update format for internal:
[HH:MM UTC] Update #{N}: {one-line current state}. Next: {what we're doing now}.
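The internal cadence and update format above lend themselves to a small helper that renders the next update and says when the following one is due. This is a hedged sketch: `INTERNAL_CADENCE`, `internal_update`, and `next_internal_update` are hypothetical names, and P2 is treated as having no timer because its internal updates are driven by state changes.

```python
from datetime import datetime, timedelta, timezone

# Internal (Slack) update intervals from the cadence table.
# P2 is absent on purpose: it updates on state change, not on a clock.
INTERNAL_CADENCE = {
    "P0": timedelta(minutes=10),
    "P1": timedelta(minutes=30),
}


def internal_update(n: int, state: str, next_step: str, now: datetime) -> str:
    """Render: [HH:MM UTC] Update #{N}: {state}. Next: {what we're doing now}."""
    return f"[{now:%H:%M} UTC] Update #{n}: {state}. Next: {next_step}."


def next_internal_update(severity: str, last_update: datetime):
    """Return when the next timed update is due, or None if there is no timer."""
    interval = INTERNAL_CADENCE.get(severity)
    return last_update + interval if interval else None


now = datetime(2026, 5, 7, 14, 28, tzinfo=timezone.utc)
print(internal_update(3, "Hasura restarted, errors clearing", "watch error rate for 10 min", now))
print(next_internal_update("P0", now))
```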
The IC declares resolution when:
- The user-visible symptom is confirmed gone (not just "looks better").
- Root cause is identified and either fixed or safely mitigated.
- No new errors in the past 10 min (P0) or 5 min (P1).
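The three resolution criteria can be expressed as a single check, which makes it harder to declare resolution on "looks better" alone. The sketch below is illustrative: `can_declare_resolved` and `QUIET_WINDOW` are hypothetical names, and because only P0 and P1 quiet windows are defined above, any other severity raises a `KeyError` rather than guessing a value.

```python
from datetime import datetime, timedelta, timezone

# Error-free window required before resolution, per severity.
QUIET_WINDOW = {
    "P0": timedelta(minutes=10),
    "P1": timedelta(minutes=5),
}


def can_declare_resolved(severity: str, symptom_gone: bool, cause_handled: bool,
                         last_error_at: datetime, now: datetime) -> bool:
    """True only when all three resolution criteria hold.

    symptom_gone:  the user-visible symptom is confirmed gone
    cause_handled: root cause is identified and fixed or safely mitigated
    """
    quiet_long_enough = (now - last_error_at) >= QUIET_WINDOW[severity]
    return symptom_gone and cause_handled and quiet_long_enough


last_error = datetime(2026, 5, 7, 14, 28, tzinfo=timezone.utc)
now = datetime(2026, 5, 7, 14, 35, tzinfo=timezone.utc)
print(can_declare_resolved("P1", True, True, last_error, now))  # 7 quiet minutes, P1 needs 5
print(can_declare_resolved("P0", True, True, last_error, now))  # 7 quiet minutes, P0 needs 10
```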
Say explicitly in the channel:
:white_check_mark: RESOLVED - P{N} resolved at {HH:MM UTC}. Duration: {N} min.
Post-mortem due: {YYYY-MM-DD HH:MM UTC}.
Then: post status page resolution, send customer email if required, and file the post-mortem task in .claude/tasks/active.md.
- nself-incident-mgmt UI: incident ack + timeline
- status.nself.org: public status page
- PagerDuty setup: alert routing + integration key
- postmortem-template.md: blameless format
- on-call-rotation.md: who is primary on-call
- vendor-contacts.md: Cloudflare / Hetzner / Vercel / Stripe escalation
- bus-factor.md: backup admin assignments
- dr-runbook.md: full disaster recovery