Disaster Recovery Procedures

Version: 1.0.0 Last Updated: February 9, 2026 Owner: Operations Team Review Cycle: Quarterly

Recovery Objectives
Backup Strategy
Recovery Procedures
Service Recovery Order
Data Integrity Verification
Failover Procedures

Recovery Objectives

RTO/RPO Targets

Service	RTO (Recovery Time)	RPO (Recovery Point)	Priority
PostgreSQL	30 minutes	5 minutes	Critical
Hasura	15 minutes	0 (stateless)	Critical
Auth	15 minutes	5 minutes	Critical
Redis	10 minutes	15 minutes	High
MinIO	1 hour	1 hour	Medium
MeiliSearch	2 hours	1 day	Low

Business Impact by Service

Critical Services (Revenue/Reputation Impact)

PostgreSQL: All data storage
Hasura: API gateway
Auth: User access

High Priority Services (Degraded Functionality)

Redis: Session/cache
MinIO: File storage

Medium Priority Services (Workarounds Available)

MeiliSearch: Search (can use DB queries)
Monitoring stack: Observability

Backup Strategy

Automated Backups

PostgreSQL - Continuous + Daily

WAL (Write-Ahead Log) Archiving

# Configured in docker-compose.yml
# WAL archives every 16MB or 60 seconds

# Location: /backups/postgres/wal/
# Retention: 7 days
# Enables Point-In-Time Recovery (PITR)

Daily Full Backups

# Automated via cron
# Time: 02:00 UTC daily
# Location: /backups/postgres/daily/
# Retention: 30 days
# Format: pg_dump custom format (compressed)

Weekly Long-term Backups

# Automated via cron
# Time: Sunday 03:00 UTC
# Location: /backups/postgres/weekly/
# Retention: 12 weeks
# Format: pg_dump custom format

Redis - Snapshot + AOF

RDB Snapshots

# Configuration in redis.conf
# save 900 1      # After 900 sec if 1 key changed
# save 300 10     # After 300 sec if 10 keys changed
# save 60 10000   # After 60 sec if 10000 keys changed

# Location: /data/redis/dump.rdb
# Retention: Keep last 7 snapshots

AOF (Append Only File)

# Configuration
# appendonly yes
# appendfsync everysec

# Location: /data/redis/appendonly.aof
# Auto-rewrite when size doubles

MinIO - Object Replication

# Versioning enabled on all buckets
# Lifecycle policy: retain 30 versions

# Optional: Replicate to secondary MinIO instance
# or S3 bucket for off-site backup

MeiliSearch - Index Snapshots

# Manual snapshots before major changes
# Location: /backups/meilisearch/
# Retention: Last 5 snapshots

# Can rebuild from database if needed

Backup Verification

Daily Verification

#!/bin/bash
# /scripts/ops/verify-backups.sh

# 1. Check backup files exist
if [ ! -f "/backups/postgres/daily/$(date +%Y%m%d).dump" ]; then
  echo "ERROR: Daily PostgreSQL backup missing"
  exit 1
fi

# 2. Test restore to temporary database
pg_restore --dbname=test_restore /backups/postgres/daily/latest.dump

# 3. Verify row counts match production
psql -d test_restore -c "SELECT COUNT(*) FROM nchat_messages"

# 4. Cleanup
dropdb test_restore

# 5. Send notification
echo "Backup verification passed" | mail -s "Backup OK" [email protected]

Off-Site Backup

Strategy: 3-2-1 Rule

3 copies of data
2 different media types
1 off-site copy

Implementation:

# Daily sync to S3 or equivalent
aws s3 sync /backups/ s3://company-backups-offsite/ \
  --storage-class STANDARD_IA \
  --exclude "*.tmp"

# Encrypted with KMS
# Lifecycle: Transition to Glacier after 90 days
# Retention: 1 year minimum

Recovery Procedures

Scenario 1: Database Corruption

Symptoms:

PostgreSQL crashes repeatedly
Data inconsistency errors
Checksum failures

Recovery Steps:

# 1. STOP ALL DEPENDENT SERVICES
echo "Stopping services that write to database..."
docker stop nself-hasura nself-auth nself-functions

# 2. ASSESS CORRUPTION
docker exec nself-postgres pg_dump -U postgres -d nself_db \
  --schema-only > /tmp/schema_check.sql

# If dump fails, corruption confirmed

# 3. BACKUP CURRENT STATE (even if corrupted)
docker exec nself-postgres pg_dumpall -U postgres \
  > /backups/emergency/corrupted_$(date +%Y%m%d_%H%M%S).sql

# 4. STOP DATABASE
docker stop nself-postgres

# 5. RENAME CORRUPTED DATA DIRECTORY
mv /data/postgres /data/postgres_corrupted_$(date +%Y%m%d_%H%M%S)

# 6. CREATE NEW DATA DIRECTORY
mkdir -p /data/postgres
chown -R 999:999 /data/postgres  # postgres user in container

# 7. START DATABASE (will initialize fresh)
docker start nself-postgres

# Wait for initialization
sleep 10

# 8. RESTORE FROM LATEST BACKUP
./scripts/ops/restore-database.sh /backups/postgres/daily/latest.dump

# 9. IF PITR NEEDED, REPLAY WAL
# See: Point-In-Time Recovery section below

# 10. VERIFY DATA INTEGRITY
./scripts/ops/verify-data-integrity.sh

# 11. RESTART DEPENDENT SERVICES
docker start nself-hasura nself-auth nself-functions

# 12. VERIFY APPLICATION FUNCTIONALITY
./scripts/ops/smoke-test.sh

# Expected recovery time: 20-30 minutes
# Data loss: Maximum 5 minutes (last WAL archive)

Scenario 2: Complete Service Outage

Symptoms:

All Docker containers stopped
Host system crash/reboot
Disk failure

Recovery Steps:

# 1. VERIFY SYSTEM HEALTH
df -h  # Check disk space
free -h  # Check memory
docker info  # Check Docker daemon

# 2. CHECK DATA DIRECTORY INTEGRITY
ls -lah /data/
# Should see: postgres/, redis/, minio/, meilisearch/

# 3. START SERVICES IN ORDER (see Service Recovery Order below)
./scripts/ops/start-services-ordered.sh

# 4. MONITOR STARTUP
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'

# 5. VERIFY CONNECTIVITY
./scripts/ops/verify-system-health.sh

# 6. CHECK LOGS FOR ERRORS
nself logs --since 5m | grep -i error

# 7. RUN SMOKE TESTS
pnpm test:e2e:smoke

# Expected recovery time: 10-15 minutes
# Data loss: None (if data directory intact)

Scenario 3: Data Loss / Accidental Deletion

Symptoms:

Critical data missing
Tables dropped
Mass deletion event

Recovery Steps:

# 1. IMMEDIATELY STOP WRITES
docker pause nself-hasura nself-auth

# This prevents WAL from being overwritten

# 2. IDENTIFY WHEN DATA WAS LAST GOOD
# Check application logs, audit logs, user reports

# 3. PERFORM POINT-IN-TIME RECOVERY
RECOVERY_TARGET="2026-02-09 14:30:00 UTC"

# Stop database
docker stop nself-postgres

# Move current data aside
mv /data/postgres /data/postgres_before_recovery

# Restore base backup
mkdir -p /data/postgres
./scripts/ops/restore-database.sh /backups/postgres/daily/[date].dump

# Create recovery.conf
cat > /data/postgres/recovery.conf <<EOF
restore_command = 'cp /backups/postgres/wal/%f %p'
recovery_target_time = '$RECOVERY_TARGET'
recovery_target_action = 'promote'
EOF

# Start database in recovery mode
docker start nself-postgres

# Monitor recovery
docker logs -f nself-postgres

# Wait for "database system is ready to accept connections"

# 4. VERIFY RECOVERED DATA
docker exec nself-postgres psql -U postgres -d nself_db -c \
  "SELECT COUNT(*) FROM nchat_messages WHERE created_at < '$RECOVERY_TARGET'"

# 5. IF CORRECT, RESUME SERVICES
docker unpause nself-hasura nself-auth

# Expected recovery time: 30-60 minutes
# Data loss: Depends on PITR target (can be sub-minute)

Scenario 4: Service-Specific Failure

Hasura Failure

# Hasura is stateless - simple restart usually sufficient

# 1. Check configuration
docker exec nself-hasura env | grep HASURA

# 2. Verify database connectivity
docker exec nself-hasura curl -f http://nself-postgres:5432 || echo "DB unreachable"

# 3. Restart with clean state
docker stop nself-hasura
docker rm nself-hasura
cd .backend && nself build && nself start hasura

# 4. Apply migrations if needed
hasura migrate apply --endpoint http://localhost:8080

# Expected recovery time: 5 minutes

Auth Service Failure

# Auth stores sessions in Redis + user data in PostgreSQL

# 1. Verify dependencies
docker exec nself-auth curl -f http://nself-postgres:5432 || echo "DB down"
docker exec nself-auth curl -f http://nself-redis:6379 || echo "Redis down"

# 2. Check JWT configuration
docker exec nself-auth env | grep JWT_SECRET

# 3. Restart auth service
docker restart nself-auth

# 4. Verify users can login
curl -X POST http://localhost:4000/v1/auth/signin \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]","password":"test123"}'

# Expected recovery time: 5 minutes

Redis Failure

# Redis failure means session loss but data is in PostgreSQL

# 1. Check if Redis process is running
docker exec nself-redis redis-cli ping

# 2. Check for memory issues
docker stats nself-redis --no-stream

# 3. If corrupted, restore from RDB + AOF
docker stop nself-redis
cp /backups/redis/dump.rdb /data/redis/dump.rdb
cp /backups/redis/appendonly.aof /data/redis/appendonly.aof
docker start nself-redis

# 4. If persistent issues, start fresh
docker stop nself-redis
rm -rf /data/redis/*
docker start nself-redis

# Users will need to re-login
# Expected recovery time: 5 minutes

MinIO Failure

# MinIO stores user-uploaded files

# 1. Check disk space
df -h /data/minio

# 2. Verify data directory
ls -lah /data/minio/.minio.sys

# 3. Restart MinIO
docker restart nself-minio

# 4. If data corrupted, restore from backup
docker stop nself-minio
rsync -av /backups/minio/ /data/minio/
docker start nself-minio

# 5. Verify buckets accessible
mc ls local/nchat-uploads

# Expected recovery time: 10-30 minutes depending on data size

Service Recovery Order

Dependency Graph

1. PostgreSQL (no dependencies)
   ↓
2. Redis (no dependencies)
   ↓
3. MinIO (no dependencies)
   ↓
4. Auth (depends on: PostgreSQL, Redis)
   ↓
5. Hasura (depends on: PostgreSQL)
   ↓
6. Functions (depends on: PostgreSQL, Hasura)
   ↓
7. MeiliSearch (optional, depends on: PostgreSQL)
   ↓
8. Monitoring (observability only)

Automated Recovery Script

#!/bin/bash
# /scripts/ops/start-services-ordered.sh

set -e  # Exit on error

echo "=== Starting Service Recovery ==="
START_TIME=$(date +%s)

# Function to wait for service health
wait_for_service() {
  local service=$1
  local health_check=$2
  local max_wait=60
  local waited=0

  echo "Waiting for $service to be healthy..."
  until eval $health_check &>/dev/null; do
    sleep 2
    waited=$((waited + 2))
    if [ $waited -ge $max_wait ]; then
      echo "ERROR: $service failed to start after ${max_wait}s"
      exit 1
    fi
  done
  echo "✓ $service is healthy"
}

# 1. Start PostgreSQL
echo "Starting PostgreSQL..."
docker start nself-postgres
wait_for_service "PostgreSQL" \
  "docker exec nself-postgres pg_isready -U postgres"

# 2. Start Redis
echo "Starting Redis..."
docker start nself-redis
wait_for_service "Redis" \
  "docker exec nself-redis redis-cli ping"

# 3. Start MinIO
echo "Starting MinIO..."
docker start nself-minio
wait_for_service "MinIO" \
  "curl -f http://localhost:9000/minio/health/live"

# 4. Start Auth
echo "Starting Auth..."
docker start nself-auth
wait_for_service "Auth" \
  "curl -f http://localhost:4000/healthz"

# 5. Start Hasura
echo "Starting Hasura..."
docker start nself-hasura
wait_for_service "Hasura" \
  "curl -f http://localhost:8080/healthz"

# 6. Start Functions
echo "Starting Functions..."
docker start nself-functions || echo "Functions not configured"

# 7. Start MeiliSearch
echo "Starting MeiliSearch..."
docker start nself-meilisearch || echo "MeiliSearch not configured"
sleep 5

# 8. Start Monitoring
echo "Starting Monitoring Stack..."
docker start nself-prometheus nself-grafana nself-loki 2>/dev/null || echo "Monitoring not configured"

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

echo ""
echo "=== Recovery Complete ==="
echo "Total time: ${DURATION} seconds"
echo ""
echo "Next steps:"
echo "1. Run: ./scripts/ops/verify-system-health.sh"
echo "2. Run: pnpm test:e2e:smoke"
echo "3. Check logs: nself logs --since 5m"

Data Integrity Verification

Verification Script

#!/bin/bash
# /scripts/ops/verify-data-integrity.sh

echo "=== Data Integrity Verification ==="

# 1. Check database consistency
echo "Checking database consistency..."
docker exec nself-postgres psql -U postgres -d nself_db -c "
  SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
  FROM pg_tables
  WHERE schemaname = 'public'
  ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
  LIMIT 10;
"

# 2. Check for orphaned records
echo ""
echo "Checking for orphaned records..."
docker exec nself-postgres psql -U postgres -d nself_db -c "
  SELECT
    'Messages with invalid user_id' as issue,
    COUNT(*) as count
  FROM nchat_messages m
  WHERE NOT EXISTS (SELECT 1 FROM nchat_users u WHERE u.id = m.user_id);
"

# 3. Check for future timestamps (data corruption indicator)
echo ""
echo "Checking for future timestamps..."
docker exec nself-postgres psql -U postgres -d nself_db -c "
  SELECT
    'Future timestamps' as issue,
    COUNT(*) as count
  FROM nchat_messages
  WHERE created_at > NOW() + INTERVAL '1 day';
"

# 4. Verify user counts match
echo ""
echo "Verifying user counts..."
USERS_COUNT=$(docker exec nself-postgres psql -U postgres -d nself_db -t -c \
  "SELECT COUNT(*) FROM nchat_users;")
AUTH_USERS=$(docker exec nself-postgres psql -U postgres -d nself_db -t -c \
  "SELECT COUNT(*) FROM auth.users;")

echo "Users in nchat_users: $USERS_COUNT"
echo "Users in auth.users: $AUTH_USERS"

if [ "$USERS_COUNT" -ne "$AUTH_USERS" ]; then
  echo "WARNING: User count mismatch!"
fi

# 5. Check constraint violations
echo ""
echo "Checking constraint violations..."
docker exec nself-postgres psql -U postgres -d nself_db -c "
  SELECT
    conrelid::regclass AS table_name,
    conname AS constraint_name,
    contype AS constraint_type
  FROM pg_constraint
  WHERE conrelid::regclass::text LIKE 'nchat_%'
  ORDER BY conrelid::regclass::text;
"

# 6. Verify indexes
echo ""
echo "Verifying critical indexes..."
docker exec nself-postgres psql -U postgres -d nself_db -c "
  SELECT
    tablename,
    indexname,
    indexdef
  FROM pg_indexes
  WHERE schemaname = 'public'
    AND tablename LIKE 'nchat_%'
  ORDER BY tablename, indexname;
"

echo ""
echo "=== Verification Complete ==="

Expected Results

Healthy System:

No orphaned records
No future timestamps
User counts match
All constraints valid
All indexes present

Needs Attention:

Orphaned records > 0
Future timestamps detected
User count mismatch
Missing indexes

Failover Procedures

Active-Passive Failover

Architecture:

Primary Site (Active)
  ↓ Continuous replication
Secondary Site (Passive)
  ↓ Automatic failover trigger
Becomes Active

PostgreSQL Streaming Replication

Setup:

# On primary
# In postgresql.conf
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64

# Create replication user
CREATE USER replicator REPLICATION LOGIN ENCRYPTED PASSWORD 'secure_password';

# On standby
# Create recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=primary port=5432 user=replicator password=secure_password'
trigger_file = '/tmp/postgresql.trigger.5432'

Failover Process:

# 1. Verify primary is down
pg_isready -h primary-host -p 5432

# 2. Promote standby to primary
touch /tmp/postgresql.trigger.5432

# 3. Wait for promotion
tail -f /var/log/postgresql/postgresql.log | grep "database system is ready"

# 4. Update DNS/Load Balancer to point to new primary
# or
# Update environment variables
export POSTGRES_HOST=standby-host

# 5. Restart application services
docker restart nself-hasura nself-auth

# Expected failover time: 2-5 minutes
# Data loss: Minimal (depends on replication lag)

Geographic Failover

Multi-Region Setup:

US-East (Primary)
  ↓ Async replication
EU-West (Secondary)
  ↓ Manual failover

Failover Checklist:

Verify primary region is down
Check replication lag on secondary
Promote secondary database
Update DNS to point to secondary region
Update CDN configuration
Notify users of potential data loss window
Monitor error rates
Update status page

Expected RTO: 15-30 minutes Expected RPO: 5-15 minutes (replication lag)

Recovery Drills

Monthly Drill Schedule

Week 1: Database restore drill Week 2: Service recovery drill Week 3: PITR (Point-In-Time Recovery) drill Week 4: Full disaster recovery drill

Drill Execution

# Run monthly recovery drill
./scripts/ops/run-recovery-drill.sh

# This script:
# 1. Creates test environment
# 2. Simulates failure
# 3. Executes recovery procedure
# 4. Verifies recovery
# 5. Generates report

See RECOVERY-DRILL-SCENARIOS.md for detailed drill procedures.

Contact Information

Emergency Contacts

On-Call Engineer: [phone/email] Backup On-Call: [phone/email] Database Administrator: [phone/email] Infrastructure Lead: [phone/email]

Escalation

After Hours: Page on-call via PagerDuty Business Hours: #ops-emergency Slack channel Critical Issues: Call CTO directly

Appendix

Backup Locations

Local Backups:
/backups/
├── postgres/
│   ├── daily/          # Last 30 days
│   ├── weekly/         # Last 12 weeks
│   └── wal/           # Last 7 days
├── redis/
│   ├── dump.rdb       # Latest snapshot
│   └── appendonly.aof # AOF log
├── minio/             # Bucket replication
└── meilisearch/       # Index snapshots

Off-Site Backups:
s3://company-backups/
├── postgres/
├── redis/
└── minio/

Recovery Time Estimates

Scenario	RTO	RPO	Complexity
Single service restart	5 min	0	Low
Database restore	30 min	5 min	Medium
Full system recovery	1 hour	5 min	High
PITR from corruption	2 hours	<1 min	Very High
Geographic failover	30 min	15 min	High

Related Documents:

DISASTER RECOVERY PROCEDURES - nself-org/nchat GitHub Wiki

Disaster Recovery Procedures

Table of Contents

Recovery Objectives

RTO/RPO Targets

Business Impact by Service

Backup Strategy

Automated Backups

PostgreSQL - Continuous + Daily

Redis - Snapshot + AOF

MinIO - Object Replication

MeiliSearch - Index Snapshots

Backup Verification

Off-Site Backup

Recovery Procedures

Scenario 1: Database Corruption

Scenario 2: Complete Service Outage

Scenario 3: Data Loss / Accidental Deletion

Scenario 4: Service-Specific Failure

Hasura Failure

Auth Service Failure

Redis Failure

MinIO Failure

Service Recovery Order

Dependency Graph

Automated Recovery Script

Data Integrity Verification

Verification Script

Expected Results

Failover Procedures

Active-Passive Failover

PostgreSQL Streaming Replication

Geographic Failover

Recovery Drills

Monthly Drill Schedule

Drill Execution

Contact Information

Emergency Contacts

Escalation

Appendix

Backup Locations

Recovery Time Estimates

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️