INCIDENT RESPONSE PLAYBOOK - nself-org/nchat GitHub Wiki
**Version**: 1.0.0
**Last Updated**: February 9, 2026
**Owner**: Operations Team
**Review Cycle**: Quarterly
- Incident Classification
- Response Procedures
- Communication Templates
- Escalation Paths
- Post-Mortem Process
| Level | Description | Response Time | Example |
|---|---|---|---|
| P0 | Critical - Complete service outage | 15 minutes | Total system down, data loss |
| P1 | High - Major functionality unavailable | 30 minutes | Auth service down, messages not sending |
| P2 | Medium - Degraded performance | 2 hours | Slow queries, intermittent errors |
| P3 | Low - Minor issues | 1 business day | UI glitches, non-critical features |
| P4 | Info - Monitoring alerts | Best effort | Resource usage warnings |
IMPACT x URGENCY = PRIORITY
High Impact + High Urgency = P0
High Impact + Low Urgency = P1
Low Impact + High Urgency = P1
Low Impact + Low Urgency = P2/P3
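The matrix can also be expressed as a tiny helper for triage tooling. This is an illustrative sketch only; the `priority` function below is hypothetical and not a script in this repo.
# Hypothetical helper mapping impact/urgency (high|low) to a priority label
priority() {
  case "$1-$2" in
    high-high) echo "P0" ;;
    high-low|low-high) echo "P1" ;;
    low-low) echo "P2/P3" ;;
    *) echo "unknown" ;;
  esac
}
# Example: priority high low  ->  P1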
- Automated Monitoring (see the alerting sketch after this list)
  - Sentry error rate alerts
  - Grafana threshold violations
  - Health check failures
  - Resource exhaustion warnings
- User Reports
  - Support tickets
  - Social media mentions
  - Direct customer contact
  - Status page comments
- Internal Discovery
  - Team member observation
  - Routine maintenance checks
  - Security scans
  - Performance testing
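A minimal automated health-check alert can be a cron job that curls the critical health endpoints and posts failures to chat. The sketch below is an assumption, not an existing repo script: SLACK_WEBHOOK_URL is a placeholder, and the endpoints mirror the checks used in the P0 procedure below.
# Sketch: cron-driven health-check alert; SLACK_WEBHOOK_URL is a placeholder
for url in http://localhost:8080/healthz http://localhost:4000/healthz; do
  if ! curl -fsS --max-time 5 "$url" > /dev/null; then
    curl -fsS -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"Health check failed: $url\"}" "$SLACK_WEBHOOK_URL"
  fi
done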
Objective: Restore service within 1 hour (RTO)
# 1. ACKNOWLEDGE THE INCIDENT
# Log into incident tracking system
# Create incident ticket with P0 tag
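# (Sketch) If the tracker is GitHub Issues -- the playbook does not name one, so this is
# an assumption -- the ticket can be opened from the CLI; label names are placeholders:
gh issue create --repo nself-org/nchat \
  --title "P0: <brief description>" \
  --label "P0" --label "incident" \
  --body "Start time: $(date -u +'%Y-%m-%dT%H:%MZ') - details in war room doc"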
# 2. ASSEMBLE WAR ROOM
# Slack: #incident-response
# Zoom: incidents.zoom.us/warroom
# Google Doc: Incident Log Template
# 3. QUICK HEALTH CHECK
cd /Users/admin/Sites/nself-chat/.backend
nself status
# Check all critical services
docker ps --filter "label=nself.service=critical"
# Check disk space
df -h
# Check memory
free -h
# Check database connectivity
docker exec nself-postgres pg_isready
# Check Hasura
curl -f http://localhost:8080/healthz || echo "HASURA DOWN"
# Check Auth
curl -f http://localhost:4000/healthz || echo "AUTH DOWN"

# 1. IDENTIFY FAILING COMPONENT
# Check service logs
nself logs postgres --tail 100
nself logs hasura --tail 100
nself logs auth --tail 100
# Check Sentry for errors
# https://sentry.io/organizations/[org]/issues/
# Check Grafana dashboards
# http://localhost:3000/d/system-overview
# 2. ISOLATE THE PROBLEM
# Is it a single service or cascading failure?
# Is data at risk?
# Can we fail over to backup?
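# (Sketch) Quick checks for the questions above -- generic Docker/Postgres commands, not repo scripts:
# Any containers restarting or exited? (cascading-failure indicator)
docker ps -a --filter "status=restarting" --filter "status=exited" \
  --format "table {{.Names}}\t{{.Status}}"
# Are commits still landing? Run twice a few seconds apart; a flat counter means writes are stalled
docker exec nself-postgres psql -U postgres -c \
  "SELECT sum(xact_commit) FROM pg_stat_database"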
# 3. ESTIMATE RECOVERY TIME
# Quick fix available? < 30 min
# Requires restore? 1-2 hours
# Requires rebuild? 2-4 hours

Scenario 1: Database Failure
# Stop all services writing to DB
docker stop nself-hasura nself-auth
# Assess corruption
docker exec nself-postgres psql -U postgres -c "SELECT 1"
# If corrupted, initiate restore
# See: DISASTER-RECOVERY-PROCEDURES.md
# If successful, restart services
docker start nself-hasura nself-auth
# Verify writes working
docker exec nself-postgres psql -U postgres -d nself_db -c \
"INSERT INTO health_check (timestamp) VALUES (NOW())"Scenario 2: Service Crash Loop
# Identify crashing service
docker ps -a | grep Restarting
# Get crash logs
docker logs [container-id] --tail 200
# Check resource limits
docker stats --no-stream
# Try safe restart with increased limits
docker update --memory 4g --cpus 2 [container-id]
docker restart [container-id]
# If still failing, rollback to last known good
# See: DISASTER-RECOVERY-PROCEDURES.md

Scenario 3: Network Partition
# Check connectivity between services
docker network inspect nself_default
# Test DNS resolution
docker exec nself-hasura ping nself-postgres
# Check firewall rules
sudo iptables -L -n
# Recreate network if needed
docker network rm nself_default
docker network create nself_default
# Restart services to reconnect

# Update status page every 15 minutes
# Template: "We are investigating reports of [issue].
# Current impact: [description].
# Next update: [time]"
# Internal updates every 10 minutes in war room
# Use: INCIDENT-STATUS-TEMPLATE.md

# 1. VERIFY ALL SERVICES HEALTHY
./scripts/ops/verify-system-health.sh
# 2. RUN SMOKE TESTS
pnpm test:e2e:smoke
# 3. CHECK DATA INTEGRITY
./scripts/ops/verify-data-integrity.sh
# 4. MONITOR FOR 30 MINUTES
# Watch error rates, response times, resource usage
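# (Sketch) Simple soak-monitoring loop for the 30-minute window; the Hasura error grep is
# an example heuristic, not an official metric:
end=$((SECONDS + 1800))
while [ $SECONDS -lt $end ]; do
  docker stats --no-stream
  echo "Hasura errors in last minute: $(docker logs nself-hasura --since 1m 2>&1 | grep -ci error)"
  sleep 60
done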
# 5. DECLARE RESOLUTION
# Update status page
# Send all-clear to stakeholders
# Schedule post-mortem

Objective: Restore functionality within 4 hours
- Assessment (0-30 min): Understand scope and impact
- Containment (30-60 min): Prevent escalation
- Resolution (60-240 min): Fix root cause
- Verification (ongoing): Confirm restoration
Auth Service Unavailable
# Check auth container
docker logs nself-auth --tail 100
# Common fix: JWT secret misconfiguration
docker exec nself-auth env | grep JWT
# Restart auth service
docker restart nself-auth
# Verify login working
curl -X POST http://localhost:4000/v1/auth/signin \
-H "Content-Type: application/json" \
-d '{"email":"[email protected]","password":"test123"}'Messages Not Sending
# Check GraphQL engine
curl http://localhost:8080/healthz
# Check database connection pool
docker exec nself-postgres psql -U postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"
# Check message queue if applicable
docker logs nself-redis --tail 50
# Test message insert directly
docker exec nself-postgres psql -U postgres -d nself_db -c \
"INSERT INTO nchat_messages (content, user_id, channel_id)
VALUES ('test', '123', '456') RETURNING id"

Objective: Resolve within 1 business day
- Performance degradation (>500ms p95 latency)
- Intermittent errors (<5% error rate)
- Non-critical feature outages
Response: Standard troubleshooting during business hours
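For the latency and intermittent-error cases above, a useful first step is to look at the slowest statements. The query below is a sketch; it assumes the pg_stat_statements extension is enabled and PostgreSQL 13+ column names, which this playbook does not state.
# Sketch: top 10 slowest statements -- assumes pg_stat_statements is enabled
docker exec nself-postgres psql -U postgres -d nself_db -c \
  "SELECT round(mean_exec_time) AS ms, calls, left(query, 60) AS query
   FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10"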
Objective: Schedule for next maintenance window
- Cosmetic issues
- Feature requests
- Performance optimization opportunities
Subject: [P0/P1] Production Incident - [Brief Description]
Status: INVESTIGATING
Start Time: [timestamp]
Impact: [user-facing impact]
Services Affected: [list]
CURRENT SITUATION:
[What we know]
ACTIONS TAKEN:
- [timestamp] [action]
- [timestamp] [action]
NEXT STEPS:
[What we're doing next]
ESTIMATED RESOLUTION: [time or "unknown"]
Next Update: [time]
Incident Commander: [name]
War Room: [link]
Subject: [P0/P1] Update #[N] - [Brief Description]
Status: [INVESTIGATING|IDENTIFIED|MONITORING|RESOLVED]
Duration: [elapsed time]
Last Updated: [timestamp]
UPDATE:
[New information since last update]
ACTIONS TAKEN:
- [new actions]
CURRENT STATUS:
[Service status, recovery progress]
NEXT STEPS:
[What's happening next]
ESTIMATED RESOLUTION: [updated estimate]
Next Update: [time] or when significant change occurs
Subject: [RESOLVED] [P0/P1] - [Brief Description]
Status: RESOLVED
Start Time: [timestamp]
End Time: [timestamp]
Total Duration: [duration]
Impact: [summary]
RESOLUTION:
[What fixed the issue]
ROOT CAUSE:
[Brief explanation - detailed post-mortem to follow]
PREVENTIVE MEASURES:
[What we're doing to prevent recurrence]
We apologize for the disruption. A detailed post-mortem will be
published within 48 hours.
Questions? Contact: [support email]
[HH:MM] [NAME]: Current status check
- Service A: [status]
- Service B: [status]
- Root cause: [hypothesis]
- Current action: [what's being done]
- Blockers: [any blockers]
- ETA: [estimate]
P0 Incident Detected
↓
On-Call Engineer (0-15 min)
↓ (if not resolved in 30 min)
Senior Engineer + Engineering Manager
↓ (if not resolved in 1 hour)
CTO + CEO (for customer communication)
↓ (if data breach or security)
Legal + PR Team
on-call-engineer:
  primary: [contact info]
  backup: [contact info]
  escalation_time: 15 minutes

engineering-manager:
  contact: [info]
  escalation_time: 30 minutes

senior-engineer:
  contact: [info]
  availability: 24/7 for P0

cto:
  contact: [info]
  notify_for: P0, Security incidents

ceo:
  contact: [info]
  notify_for: P0 > 1 hour, Data breach

security-team:
  contact: [info]
  notify_for: Any security incident

legal:
  contact: [info]
  notify_for: Data breach, Compliance violation

Immediate Escalation to CTO:
- Data loss or corruption
- Security breach detected
- Estimated downtime > 4 hours
- Customer data exposed
Immediate Escalation to CEO:
- Media attention
- Major customer impact (>1000 users)
- Legal/compliance implications
- Potential PR crisis
- 0-24 hours: Initial data collection
- 24-48 hours: Draft post-mortem
- 48-72 hours: Review and finalize
- 1 week: Present to team
- 2 weeks: Complete action items
# Post-Mortem: [Incident Title]
**Date**: [incident date]
**Authors**: [names]
**Status**: [Draft|Review|Final]
**Severity**: [P0|P1|P2]
## Executive Summary
[2-3 sentence overview]
## Impact
- **Duration**: [total time]
- **Users Affected**: [number/percentage]
- **Services Affected**: [list]
- **Revenue Impact**: [$amount or N/A]
- **Data Loss**: [description or "none"]
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | [first symptom detected] |
| HH:MM | [incident declared] |
| HH:MM | [action taken] |
| HH:MM | [service restored] |
| HH:MM | [incident resolved] |
## Root Cause
[Detailed technical explanation]
### Contributing Factors
1. [Factor 1]
2. [Factor 2]
3. [Factor 3]
## Resolution
[How we fixed it]
## What Went Well
- ✅ [positive aspect]
- ✅ [positive aspect]
## What Went Poorly
- ❌ [negative aspect]
- ❌ [negative aspect]
## Action Items
| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| [Action 1] | [name] | [date] | [Open/Done] |
| [Action 2] | [name] | [date] | [Open/Done] |
## Lessons Learned
1. [Lesson 1]
2. [Lesson 2]
## Prevention
[How we'll prevent this in the future]
## Appendix
- Logs: [link]
- Metrics: [link]
- War Room Notes: [link]

Key Principles:
- Focus on Systems, Not People
  - "The system failed to prevent..." not "Person X caused..."
- Assume Good Intent
  - Everyone was doing their best with available information
- Learn and Improve
  - Every incident is a learning opportunity
- Psychological Safety
  - Encourage honest reporting
  - No punishment for mistakes
  - Reward transparency
- Review Timeline (10 min)
- Discuss Root Cause (15 min)
- Identify Contributing Factors (10 min)
- Brainstorm Action Items (15 min)
- Assign Owners and Deadlines (5 min)
- Document Lessons Learned (5 min)
- MTTD (Mean Time To Detect): Incident start → Detection/alert
- MTTA (Mean Time To Acknowledge): Alert → Acknowledgment by on-call
- MTTI (Mean Time To Investigate): Acknowledgment → Root cause identified
- MTTR (Mean Time To Repair): Detection → Resolution
- MTTF (Mean Time To Fix): Root cause identified → Fix deployed
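Worked example (illustrative numbers only): an incident that begins at 14:00 UTC, alerts at 14:05, is acknowledged at 14:10, has its root cause identified at 14:40, and is resolved at 15:30 gives MTTD = 5 min, MTTA = 5 min, MTTI = 30 min, and MTTR = 85 min.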
INCIDENT SUMMARY - [Month Year]
Total Incidents: [number]
- P0: [count] (target: 0)
- P1: [count] (target: <2)
- P2: [count] (target: <5)
- P3: [count]
Average MTTR by Priority:
- P0: [time] (target: <1 hour)
- P1: [time] (target: <4 hours)
- P2: [time] (target: <1 day)
Top Causes:
1. [cause] - [count]
2. [cause] - [count]
3. [cause] - [count]
Action Items Completed: [x/y] ([%])
Trending: [up/down/stable]
- Status Page: status.example.com
- Sentry: sentry.io/organizations/[org]
- Grafana: grafana.example.com
- PagerDuty: (if applicable)
- War Room: zoom.us/j/warroom
- Runbook: DISASTER-RECOVERY-PROCEDURES.md
# Service health check
nself status
# View all logs
nself logs --follow
# Database backup
./scripts/ops/backup-database.sh
# Restore from backup
./scripts/ops/restore-database.sh [backup-file]
# Check resource usage
docker stats --no-stream
# View recent errors
docker logs nself-hasura --since 10m 2>&1 | grep ERROR

Is the entire system down?
YES → P0
NO ↓
Is critical functionality unavailable?
YES → P1
NO ↓
Is performance significantly degraded?
YES → P2
NO ↓
Is a non-critical feature affected?
YES → P3
NO ↓
Is it just a warning or informational?
YES → P4
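The same decision tree can be run as an interactive prompt during triage. The script below is a sketch and is not part of the repo's scripts/ directory.
# Sketch: interactive triage helper mirroring the decision tree above (not a repo script)
ask() { read -r -p "$1 [y/N] " a; [ "$a" = "y" ] || [ "$a" = "Y" ]; }
ask "Is the entire system down?"             && { echo "P0"; exit 0; }
ask "Is critical functionality unavailable?" && { echo "P1"; exit 0; }
ask "Is performance significantly degraded?" && { echo "P2"; exit 0; }
ask "Is a non-critical feature affected?"    && { echo "P3"; exit 0; }
echo "P4 (warning / informational)"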
For each incident:
- Create incident ticket
- Post initial status update
- Notify on-call engineer
- Create war room
- Update status page
- Send customer notification (P0/P1)
- Post updates every 15 min (P0) or 30 min (P1)
- Declare resolution
- Send resolution notification
- Schedule post-mortem
- Complete post-mortem within 48 hours
- Track action items to completion
Related Documents: