QUICK REFERENCE CARD - nself-org/nchat GitHub Wiki
Print this page and keep it handy during incidents
| Priority | Description | Response Time |
|---|---|---|
| P0 | Complete outage / Data loss | 15 minutes |
| P1 | Major feature down | 30 minutes |
| P2 | Degraded performance | 2 hours |
| P3 | Minor issue | 1 business day |
# Quick health check
cd /Users/admin/Sites/nself-chat
./scripts/ops/verify-system-health.sh
# Check what's down
docker ps --format "table {{.Names}}\t{{.Status}}"
# Check logs for errors
cd .backend && nself logs --tail 100 | grep -i error
- P0/P1: Create war room immediately
- Slack: #incident-response
- Create ticket: [Incident Tracking System]
- Start timeline: Document all actions
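The "Start timeline" step can be as simple as appending timestamped lines to a file. A minimal sketch, assuming bash; `log_action` and `TIMELINE_FILE` are hypothetical names, not part of the repository's scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a timestamped entry to an incident timeline file.
# TIMELINE_FILE and log_action are illustrative names (assumptions).
TIMELINE_FILE="${TIMELINE_FILE:-./incident-timeline.md}"

log_action() {
  # Write "- <UTC timestamp> <action>" as one line at the end of the file.
  printf -- '- %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$TIMELINE_FILE"
}
```

For example, `log_action "Restarted nself-hasura"` appends one line in the `- [timestamp] [action]` form used by the incident notification template.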
# Automated recovery (use this!)
./scripts/ops/start-services-ordered.sh
# Manual restart (if needed)
docker start nself-postgres # Wait 30s
docker start nself-redis # Wait 10s
docker start nself-minio # Wait 15s
docker start nself-auth # Wait 15s
docker start nself-hasura # Wait 20s
# Complete health check
./scripts/ops/verify-system-health.sh
# Data integrity check
./scripts/ops/verify-data-integrity.sh
# Quick service checks
docker exec nself-postgres pg_isready
docker exec nself-redis redis-cli ping
curl http://localhost:8080/healthz # Hasura
curl http://localhost:4000/healthz # Auth
# Stop dependent services first
docker stop nself-hasura nself-auth
# Restore from latest backup
./scripts/ops/restore-database.sh /backups/postgres/daily/latest.dump
# Verify and restart
./scripts/ops/verify-data-integrity.sh
./scripts/ops/start-services-ordered.sh

| Service | RTO | RPO | Recovery Method |
|---|---|---|---|
| PostgreSQL | 30 min | 5 min | Restore + WAL replay |
| Hasura | 15 min | 0 | Restart |
| Auth | 15 min | 5 min | Restart + DB restore if needed |
| Redis | 15 min | 15 min | Restart + RDB restore |
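To stay within the RTO targets above during a manual restart, it can help to poll each service's readiness check rather than rely only on fixed waits. A minimal sketch, assuming bash; `wait_for` is a hypothetical helper and not one of the repository's scripts, and the timeouts simply reuse the wait durations from the manual restart sequence on this card:

```shell
#!/usr/bin/env bash
# wait_for is a hypothetical helper (an assumption, not part of nself).
# Usage: wait_for <timeout_seconds> <command> [args...]
wait_for() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))
  # Re-run the check once per second until it succeeds or the deadline passes.
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep 1
  done
}

# Example sequence using the checks from this card. It is commented out so the
# sketch can be sourced without touching any containers:
# docker start nself-postgres && wait_for 30 docker exec nself-postgres pg_isready
# docker start nself-redis    && wait_for 10 docker exec nself-redis redis-cli ping
# docker start nself-hasura   && wait_for 20 curl -fsS http://localhost:8080/healthz
```

Polling returns as soon as a service is ready and fails loudly when it is not, whereas a fixed sleep does neither.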
On-Call Engineer (0-15 min)
↓
Engineering Manager + Senior Engineer (30 min)
↓
CTO (1 hour) + CEO (customer comms)
↓
Security Team (if security incident)
↓
Legal/PR (if data breach)
- On-Call: [Phone/Pager]
- Backup On-Call: [Phone/Pager]
- Engineering Manager: [Phone]
- CTO: [Phone]
🚨 [P0] Production Incident - [Brief Description]
Status: INVESTIGATING
Impact: [user-facing impact]
Services Affected: [list]
CURRENT SITUATION:
[What we know]
ACTIONS TAKEN:
- [timestamp] [action]
NEXT STEPS:
[What we're doing next]
Next Update: [time]
Incident Commander: [name]
Update #[N] - [Brief Description]
Status: [INVESTIGATING|IDENTIFIED|MONITORING|RESOLVED]
Duration: [elapsed time]
UPDATE:
[New information]
CURRENT STATUS:
[Service status, recovery progress]
ETA: [updated estimate]
Next Update: [time]
✅ [RESOLVED] - [Brief Description]
Duration: [total time]
Impact: [summary]
RESOLUTION:
[What fixed it]
ROOT CAUSE:
[Brief explanation]
Post-mortem within 48 hours.
# Check status
docker exec nself-postgres pg_isready
# If down, restart
docker restart nself-postgres
# If corrupted, restore
# See: docs/ops/DISASTER-RECOVERY-PROCEDURES.md
# Check health
curl http://localhost:8080/healthz
# Restart
docker restart nself-hasura
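# Verify DB connection (reachability probe only: Postgres does not speak HTTP,
# so curl reports an error even when the port is open; "connection refused" or
# "could not resolve host" means the database is unreachable)
docker exec nself-hasura curl http://nself-postgres:5432
# Check health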
curl http://localhost:4000/healthz
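# Check dependencies (reachability probes only: neither port speaks HTTP, so
# curl typically reports an error even when the dependency is up)
docker exec nself-auth curl http://nself-postgres:5432
docker exec nself-auth curl http://nself-redis:6379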
# Restart
docker restart nself-auth
# Use automated recovery
./scripts/ops/start-services-ordered.sh
# Takes ~10-15 minutes
# Monitors health of each service
# Reports any failures
- Service restored and verified
- Status page updated (resolved)
- Customers notified
- Timeline documented
- Post-mortem scheduled (within 48 hours)
- Action items created
- Runbooks updated if needed
df -h /
# If > 80%, clean up logs or expand disk
free -h
docker stats --no-stream
# If exhausted, restart heavy services
# All services
nself logs --since 10m
# Specific service
docker logs nself-postgres --tail 100
docker logs nself-hasura --tail 100 | grep ERROR
# Between containers
docker exec nself-hasura ping nself-postgres
docker exec nself-auth ping nself-redis
# From host
curl http://localhost:8080/healthz
curl http://localhost:4000/healthz
- Incident Response: docs/ops/INCIDENT-RESPONSE-PLAYBOOK.md
- Disaster Recovery: docs/ops/DISASTER-RECOVERY-PROCEDURES.md
- Drill Scenarios: docs/ops/RECOVERY-DRILL-SCENARIOS.md
- RTO/RPO Details: docs/ops/RTO-RPO-TARGETS.md
- Week 1: Database restore drill
- Week 2: Service recovery drill
- Week 3: PITR drill
- Week 4: Cascading failure drill
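The rotation above can be looked up from the calendar day. A minimal sketch, assuming bash and assuming that "Week N" means the Nth seven-day block of the month (days 1-7, 8-14, and so on); `drill_for_day` is a hypothetical helper, not one of the repository's scripts:

```shell
#!/usr/bin/env bash
# drill_for_day is a hypothetical helper. It assumes "Week N" on this card
# means the Nth seven-day block of the calendar month.
# Usage: drill_for_day <day_of_month>
drill_for_day() {
  local day=$(( 10#$1 ))   # force base 10 so "08" and "09" parse correctly
  case $(( (day - 1) / 7 + 1 )) in
    1) echo "Database restore drill" ;;
    2) echo "Service recovery drill" ;;
    3) echo "PITR drill" ;;
    4) echo "Cascading failure drill" ;;
    *) echo "No drill scheduled" ;;   # days 29-31 fall outside the four weeks
  esac
}

# Print the drill for today's date.
drill_for_day "$(date +%d)"
```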
# Run a drill
export DRILL_MODE=test
./scripts/ops/run-recovery-drill.sh service-outage
Keep calm and follow the runbooks!
Last Updated: February 9, 2026 | Version: 1.0.0