nself-chat Operations Runbook

Operational procedures and troubleshooting guide for a production nself-chat deployment.

Quick Reference
Common Operations
Incident Response
Troubleshooting
Maintenance Procedures
Performance Tuning
Security Procedures
Disaster Recovery

Quick Reference

Essential Commands

# Check application status
kubectl get pods -n nself-chat
kubectl logs -f deployment/nself-chat -n nself-chat

# Scale application
kubectl scale deployment/nself-chat --replicas=5 -n nself-chat

# Restart deployment
kubectl rollout restart deployment/nself-chat -n nself-chat

# Check database
kubectl exec -it postgres-0 -n nself-chat -- psql -U nchat_user -d nchat

# Check Redis ($REDIS_PASSWORD is expanded by your local shell, so it must be set there)
kubectl exec -it deployment/redis -n nself-chat -- redis-cli -a "$REDIS_PASSWORD" ping

# View metrics
kubectl top pods -n nself-chat
kubectl top nodes

Service URLs

Service	URL	Purpose
Application	https://nchat.example.com	Main application
Grafana	https://monitoring.nchat.example.com	Dashboards
Prometheus	Internal only	Metrics
Hasura Console	Internal only	GraphQL admin

Critical Thresholds

Metric	Warning	Critical	Action
CPU Usage	70%	85%	Scale up
Memory Usage	75%	90%	Scale up
Disk Usage	70%	85%	Expand storage
Response Time	500ms	1000ms	Investigate
Error Rate	1%	5%	Incident
Database Connections	150/200	190/200	Investigate
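
The thresholds above can be turned into a simple scripted check. The following is a minimal sketch, not part of the nself-chat tooling: classify_metric is a hypothetical helper, and it assumes that a reading at or above a threshold counts as a breach, which the table does not state.

```shell
# classify_metric VALUE WARNING CRITICAL
# Hypothetical helper: prints "ok", "warning", or "critical" for an integer reading.
# Assumption: a reading equal to a threshold already counts as a breach.
classify_metric() {
  value=$1 warning=$2 critical=$3
  if [ "$value" -ge "$critical" ]; then
    echo critical
  elif [ "$value" -ge "$warning" ]; then
    echo warning
  else
    echo ok
  fi
}

classify_metric 72 70 85     # CPU at 72%: prints "warning"
classify_metric 160 150 190  # 160 of 200 database connections: prints "warning"
```

The helper only handles whole numbers, so percentages and connection counts work directly, while response times would need to be passed in milliseconds.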

Common Operations

Viewing Logs

Application Logs

# Tail logs
kubectl logs -f deployment/nself-chat -n nself-chat

# Tail logs from all pods
kubectl logs -f -l app.kubernetes.io/name=nself-chat -n nself-chat

# Get logs from the last hour
kubectl logs deployment/nself-chat -n nself-chat --since=1h

# Get previous pod logs (after crash)
kubectl logs deployment/nself-chat -n nself-chat --previous

Database Logs

# PostgreSQL logs
kubectl logs postgres-0 -n nself-chat

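# List the slowest queries (requires pg_stat_statements; on PostgreSQL 12 and
# earlier the columns are named total_time and mean_time)
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"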

System Events

# Get cluster events
kubectl get events -n nself-chat --sort-by='.lastTimestamp'

# Watch events in real-time
kubectl get events -n nself-chat --watch

Scaling Operations

Manual Scaling

# Scale deployment
kubectl scale deployment/nself-chat --replicas=10 -n nself-chat

# Scale database (if using StatefulSet; extra replicas only help if PostgreSQL replication is configured)
kubectl scale statefulset/postgres --replicas=3 -n nself-chat

# Verify scaling
kubectl get pods -n nself-chat -w

Autoscaling

# Check HPA status
kubectl get hpa -n nself-chat
kubectl describe hpa nself-chat -n nself-chat

# Update HPA
kubectl patch hpa nself-chat -n nself-chat -p \
  '{"spec":{"minReplicas":5,"maxReplicas":20}}'
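
Hand-written JSON patches are easy to get wrong under pressure. As a hedged sketch, the hypothetical helper below builds the same patch body from two numbers and rejects obviously invalid input. It assumes minReplicas must be at least 1, which holds unless scale-to-zero has been enabled on the cluster.

```shell
# hpa_patch MIN MAX
# Hypothetical helper: prints the JSON patch body used with "kubectl patch hpa".
# Assumption: both values are positive integers and MIN <= MAX.
hpa_patch() {
  for n in "$1" "$2"; do
    case "$n" in
      ''|0*|*[!0-9]*)
        echo "replica counts must be positive integers" >&2
        return 1
        ;;
    esac
  done
  if [ "$1" -gt "$2" ]; then
    echo "minReplicas must not exceed maxReplicas" >&2
    return 1
  fi
  printf '{"spec":{"minReplicas":%s,"maxReplicas":%s}}\n' "$1" "$2"
}

# Example use (not executed here):
#   kubectl patch hpa nself-chat -n nself-chat -p "$(hpa_patch 5 20)"
```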

Deployment Updates

Rolling Update

# Update image
kubectl set image deployment/nself-chat \
  nself-chat=ghcr.io/nself/nself-chat:v0.3.1 \
  -n nself-chat

# Watch rollout
kubectl rollout status deployment/nself-chat -n nself-chat

# Pause rollout (if issues detected)
kubectl rollout pause deployment/nself-chat -n nself-chat

# Resume rollout
kubectl rollout resume deployment/nself-chat -n nself-chat

Using Helm

# Upgrade release
helm upgrade nself-chat ./deploy/helm/nself-chat \
  -f deploy/helm/nself-chat/values-production.yaml \
  --set image.tag=v0.3.1 \
  -n nself-chat

# Check status
helm status nself-chat -n nself-chat

Configuration Updates

Update ConfigMap

# Edit configmap
kubectl edit configmap nself-chat-config -n nself-chat

# Restart to apply changes
kubectl rollout restart deployment/nself-chat -n nself-chat

Update Secrets

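# Update secret (the generated manifest contains only the keys listed here;
# keys omitted from it may be dropped from the existing secret)
kubectl create secret generic nself-chat-secrets \
  --from-literal=new-key=new-value \
  -n nself-chat \
  --dry-run=client -o yaml | kubectl apply -n nself-chat -f -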

# Restart pods to use new secrets
kubectl rollout restart deployment/nself-chat -n nself-chat

Incident Response

Severity Levels

Level	Response Time	Description
P1 - Critical	Immediate	Complete outage, data loss
P2 - High	15 minutes	Partial outage, severe degradation
P3 - Medium	1 hour	Performance issues, minor features down
P4 - Low	24 hours	Cosmetic issues, non-urgent bugs

P1 - Critical Incident (Application Down)

Immediate Actions

Verify the issue:

curl https://nchat.example.com/api/health
kubectl get pods -n nself-chat
kubectl get nodes

Check pod status:

kubectl describe pod <pod-name> -n nself-chat
kubectl logs <pod-name> -n nself-chat

Check recent changes:

kubectl rollout history deployment/nself-chat -n nself-chat
kubectl get events -n nself-chat --sort-by='.lastTimestamp' | tail -20

Quick fixes:

# Restart deployment
kubectl rollout restart deployment/nself-chat -n nself-chat

# OR roll back if a recent deployment caused the issue
kubectl rollout undo deployment/nself-chat -n nself-chat

Verify recovery:

kubectl rollout status deployment/nself-chat -n nself-chat
curl https://nchat.example.com/api/health
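
A single health check right after a restart or rollback can fail simply because pods are still starting. The sketch below shows one way to poll until recovery. The retry function is a hypothetical helper, and the example invocation assumes the health endpoint returns a 2xx status once the application is healthy.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  attempts=$1 delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use (not executed here): poll the health endpoint for up to 5 minutes.
#   retry 30 10 curl -fsS -o /dev/null https://nchat.example.com/api/health
```

The -f flag makes curl exit non-zero on HTTP errors, which is what lets retry treat a 5xx response as a failed attempt.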

P2 - High Severity (Database Issues)

Database Connection Issues

Check database pod:

kubectl get pods -n nself-chat | grep postgres
kubectl logs postgres-0 -n nself-chat

Test connectivity:

kubectl exec -it postgres-0 -n nself-chat -- pg_isready

Check connections:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT count(*) FROM pg_stat_activity;"

Kill idle connections:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '5 minutes';"

Database Performance Issues

Check slow queries:

# On PostgreSQL 12 and earlier the columns are named total_time and mean_time
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT query, calls, total_exec_time, mean_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

Check database size:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT pg_size_pretty(pg_database_size('nchat'));"

Vacuum and analyze:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c "VACUUM ANALYZE;"

P3 - Medium Severity (High CPU/Memory)

High CPU Usage

Identify resource hogs:

kubectl top pods -n nself-chat --sort-by=cpu

Scale horizontally:

kubectl scale deployment/nself-chat --replicas=8 -n nself-chat

Investigate application:

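# Note: node --prof only profiles a process it launches; run as-is, this starts
# a new Node.js session in the pod rather than attaching to the running server
kubectl exec -it deployment/nself-chat -n nself-chat -- node --prof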

High Memory Usage

Check memory usage:

kubectl top pods -n nself-chat --sort-by=memory

Look for memory leaks:

kubectl logs deployment/nself-chat -n nself-chat | grep -i "out of memory"

Restart affected pods:

kubectl delete pod <pod-name> -n nself-chat

Troubleshooting

Application Won't Start

Check pod status:

kubectl get pods -n nself-chat
kubectl describe pod <pod-name> -n nself-chat

Common issues:

Image pull errors:

# Check image pull secret
kubectl get secret nself-chat-registry -n nself-chat

# Verify image exists
docker manifest inspect ghcr.io/nself/nself-chat:v0.3.0

Insufficient resources:

kubectl describe nodes
kubectl top nodes

ConfigMap/Secret missing:

kubectl get configmap -n nself-chat
kubectl get secret -n nself-chat

Database Connection Failures

Verify database is running:

kubectl get pods -n nself-chat | grep postgres

Test connection:

kubectl run pg-test --image=postgres:16 --rm -it --restart=Never -n nself-chat -- \
  psql -h postgres -U nchat_user -d nchat

Check credentials:

kubectl get secret nself-chat-secrets -n nself-chat -o jsonpath='{.data.database-url}' | base64 -d

Slow Performance

Check response times:

curl -w "@curl-format.txt" -o /dev/null -s https://nchat.example.com/api/health

Create curl-format.txt:

time_namelookup:  %{time_namelookup}s\n
time_connect:  %{time_connect}s\n
time_appconnect:  %{time_appconnect}s\n
time_pretransfer:  %{time_pretransfer}s\n
time_redirect:  %{time_redirect}s\n
time_starttransfer:  %{time_starttransfer}s\n
time_total:  %{time_total}s\n
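
One way to create the file from a shell is a quoted here-document, which keeps the `\n` sequences literal so that curl, not the shell, interprets them. This is a small sketch. write_curl_format is a hypothetical helper name.

```shell
# write_curl_format PATH
# Hypothetical helper: writes the curl timing template shown above to PATH.
# The quoted 'EOF' delimiter stops the shell from expanding anything in the body.
write_curl_format() {
  cat > "$1" <<'EOF'
time_namelookup:  %{time_namelookup}s\n
time_connect:  %{time_connect}s\n
time_appconnect:  %{time_appconnect}s\n
time_pretransfer:  %{time_pretransfer}s\n
time_redirect:  %{time_redirect}s\n
time_starttransfer:  %{time_starttransfer}s\n
time_total:  %{time_total}s\n
EOF
}

write_curl_format curl-format.txt
```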

Check database performance:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT * FROM pg_stat_activity WHERE state = 'active';"

Check Redis performance:

kubectl exec -it deployment/redis -n nself-chat -- \
  redis-cli -a $REDIS_PASSWORD --latency

SSL/TLS Issues

Check certificate:

kubectl get certificate -n nself-chat
kubectl describe certificate nself-chat-tls -n nself-chat

Verify cert-manager:

kubectl get pods -n cert-manager
kubectl logs -n cert-manager deployment/cert-manager

Manual certificate check:

openssl s_client -connect nchat.example.com:443 -servername nchat.example.com

Maintenance Procedures

Database Maintenance

Routine Maintenance (Weekly)

# Vacuum and analyze
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c "VACUUM ANALYZE;"

# Reindex (blocks writes to each table while its indexes are rebuilt; run in a maintenance window)
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c "REINDEX DATABASE nchat;"

# List the ten largest tables (a starting point when looking for bloat)
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size FROM pg_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC LIMIT 10;"

Backup Verification

# Test backup restore (on test environment)
gunzip < backup-20250129.sql.gz | \
  kubectl exec -i postgres-0 -n nself-chat-test -- \
  psql -U nchat_user nchat_test
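
Before piping a backup into psql, it can help to confirm that the archive is intact. The sketch below is one possible pre-check, not an existing script. verify_backup is a hypothetical helper, and it assumes the backup is a gzip-compressed plain-format pg_dump, which ends with a "PostgreSQL database dump complete" comment.

```shell
# verify_backup FILE
# Hypothetical helper: returns 0 only if FILE is a non-empty, valid gzip archive
# whose contents end with pg_dump's completion marker.
# Assumption: the backup was produced by pg_dump in plain (SQL) format.
verify_backup() {
  if [ ! -s "$1" ]; then
    echo "backup missing or empty: $1" >&2
    return 1
  fi
  if ! gzip -t "$1" 2>/dev/null; then
    echo "backup is not a valid gzip archive: $1" >&2
    return 1
  fi
  if ! gunzip -c "$1" | tail -n 5 | grep -q 'PostgreSQL database dump complete'; then
    echo "backup looks truncated (no completion marker): $1" >&2
    return 1
  fi
  return 0
}

# Example use (not executed here):
#   verify_backup backup-20250129.sql.gz && echo "safe to restore"
```

A passing check only shows that the file is complete. The test restore above remains the real verification.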

Application Updates

Zero-Downtime Deployment

Prepare:

# Ensure HPA is configured
kubectl get hpa nself-chat -n nself-chat

Deploy:

helm upgrade nself-chat ./deploy/helm/nself-chat \
  -f deploy/helm/nself-chat/values-production.yaml \
  --set image.tag=v0.3.1 \
  -n nself-chat \
  --wait \
  --timeout 10m

Monitor:

kubectl rollout status deployment/nself-chat -n nself-chat
watch -n 2 'kubectl get pods -n nself-chat'

Verify:

curl https://nchat.example.com/api/health
kubectl logs -f deployment/nself-chat -n nself-chat

Certificate Renewal

cert-manager renews certificates automatically. To force a manual renewal:

# Delete certificate to force renewal
kubectl delete certificate nself-chat-tls -n nself-chat

# Recreate
kubectl apply -f deploy/k8s/ingress.yaml

# Verify
kubectl get certificate nself-chat-tls -n nself-chat

Performance Tuning

Database Optimization

Connection Pooling

Update Hasura environment:

HASURA_GRAPHQL_PG_CONNECTIONS: 50
HASURA_GRAPHQL_PG_TIMEOUT: 180

Query Optimization

# Enable pg_stat_statements
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"

# List high-cardinality columns (candidates to check for missing indexes)
kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U nchat_user -d nchat -c \
  "SELECT schemaname, tablename, attname, n_distinct, correlation FROM pg_stats WHERE schemaname NOT IN ('pg_catalog', 'information_schema') AND n_distinct > 100 ORDER BY n_distinct DESC LIMIT 20;"

Redis Optimization

# Check memory usage
kubectl exec -it deployment/redis -n nself-chat -- \
  redis-cli -a $REDIS_PASSWORD INFO memory

# Set maxmemory policy (CONFIG SET does not survive a restart unless the Redis configuration is also updated)
kubectl exec -it deployment/redis -n nself-chat -- \
  redis-cli -a $REDIS_PASSWORD CONFIG SET maxmemory-policy allkeys-lru

Application Tuning

Update deployment resources:

resources:
  requests:
    cpu: '500m'
    memory: '1Gi'
  limits:
    cpu: '2000m'
    memory: '2Gi'

Security Procedures

Rotating Secrets

Rotate JWT Secret

Generate new secret:

# openssl wraps base64 output at 64 characters, so strip the embedded newline
NEW_JWT_SECRET=$(openssl rand -base64 64 | tr -d '\n')

Update secret:

kubectl create secret generic nself-chat-secrets \
  --from-literal=jwt-secret="$NEW_JWT_SECRET" \
  -n nself-chat \
  --dry-run=client -o yaml | kubectl apply -n nself-chat -f -

Restart services:

kubectl rollout restart deployment/nself-chat -n nself-chat
kubectl rollout restart deployment/hasura -n nself-chat

Rotate Database Password

Change password in database:

kubectl exec -it postgres-0 -n nself-chat -- \
  psql -U postgres -c \
  "ALTER USER nchat_user WITH PASSWORD 'new-secure-password';"

Update application secret:

kubectl create secret generic nself-chat-secrets \
  --from-literal=database-url="postgresql://nchat_user:new-secure-password@postgres:5432/nchat" \
  -n nself-chat \
  --dry-run=client -o yaml | kubectl apply -n nself-chat -f -

Restart application:

kubectl rollout restart deployment/nself-chat -n nself-chat

Security Audit

# List secrets in the namespace (review for unexpected or stale entries)
kubectl get secrets -n nself-chat

# Review RBAC
kubectl get rolebindings -n nself-chat
kubectl get clusterrolebindings | grep nself-chat

# Check network policies
kubectl get networkpolicies -n nself-chat

# Scan for vulnerabilities (using Trivy)
kubectl run trivy --rm -it --restart=Never \
  --image=aquasec/trivy:latest \
  -- image ghcr.io/nself/nself-chat:latest

Disaster Recovery

Full System Restore

Prerequisites:

Recent database backup
Infrastructure as Code (Terraform) configurations
Kubernetes manifests in version control

Procedure:

Restore infrastructure (if using Terraform):

cd deploy/terraform
terraform init
terraform apply

Deploy application stack:

kubectl create namespace nself-chat
kubectl apply -f deploy/k8s/

Restore database:

aws s3 cp s3://nself-chat-backups/latest.sql.gz .
gunzip < latest.sql.gz | \
  kubectl exec -i postgres-0 -n nself-chat -- \
  psql -U nchat_user nchat

Verify services:

kubectl get pods -n nself-chat
curl https://nchat.example.com/api/health

Restore monitoring:

```
kubectl apply -f deploy/k8s/monitoring/
```

Data Recovery

Recover deleted data (if within retention period):

# List available backups
aws s3 ls s3://nself-chat-backups/

# Download specific backup
aws s3 cp s3://nself-chat-backups/backup-20250128.sql.gz .

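# Extract specific table (crude: copies up to 10000 lines starting at the table
# definition, so the output may include unrelated statements)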
gunzip < backup-20250128.sql.gz | \
  grep -A 10000 "CREATE TABLE messages" > messages_backup.sql

# Restore specific table (review messages_backup.sql first; CREATE TABLE fails if the table already exists)
kubectl exec -i postgres-0 -n nself-chat -- \
  psql -U nchat_user nchat < messages_backup.sql
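
When only the rows are needed, extracting the table's COPY block is usually more precise than a fixed-length grep. The following is a sketch under stated assumptions: the backup is a plain-format pg_dump that uses COPY (the default, not --inserts), and the table is schema-qualified as public.messages in the dump. extract_table_data is a hypothetical helper.

```shell
# extract_table_data TABLE
# Hypothetical helper: reads a plain-format pg_dump on stdin and prints only the
# "COPY TABLE (...) FROM stdin;" block for TABLE, up to and including the
# terminating "\." line.
# Assumption: TABLE is written exactly as it appears in the dump, e.g. public.messages.
extract_table_data() {
  table=$1
  awk -v start="COPY $table " '
    index($0, start) == 1 { copying = 1 }
    copying { print }
    copying && $0 == "\\." { copying = 0 }
  '
}

# Example use (not executed here):
#   gunzip < backup-20250128.sql.gz | extract_table_data public.messages > messages_data.sql
```

Loading the extracted rows into a scratch database first, then copying back only the deleted rows, avoids primary-key conflicts with rows that still exist in production.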

Contact Information

On-Call Rotation

Primary: [Team Lead]
Secondary: [Senior Engineer]
Escalation: [Engineering Manager]

External Services

Cloud Provider Support: [Support Link]
DNS Provider: [Support Link]
SSL Provider: [Support Link]
SMTP Service: [Support Link]

Monitoring Alerts

Slack: #nself-chat-alerts
PagerDuty: [Integration Key]
Email: [email protected]

Appendix

Useful kubectl Aliases

alias k='kubectl'
alias kn='kubectl -n nself-chat'
alias kgp='kubectl get pods -n nself-chat'
alias kgd='kubectl get deployments -n nself-chat'
alias kgs='kubectl get services -n nself-chat'
alias kl='kubectl logs -f -n nself-chat'
alias kx='kubectl exec -it -n nself-chat'
alias kd='kubectl describe -n nself-chat'

Emergency Commands Reference

# Nuclear option: Delete and recreate deployment
kubectl delete deployment nself-chat -n nself-chat
kubectl apply -f deploy/k8s/deployment.yaml

# Force pod deletion
kubectl delete pod <pod-name> -n nself-chat --force --grace-period=0

# Drain node for maintenance
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Uncordon node
kubectl uncordon <node-name>