OBSERVABILITY RUNBOOK - nself-org/nchat GitHub Wiki
Version: 1.0.0 | Last Updated: February 9, 2026 | Status: Production Ready
This runbook provides step-by-step procedures for investigating production issues, debugging errors, and maintaining the observability infrastructure.
- Quick Reference
- Investigating Errors
- Performance Debugging
- Alert Response Procedures
- Common Issues
- Debugging Workflows
- Metrics Reference
- Dashboard Guide
| Service | URL | Purpose |
|---|---|---|
| Sentry | https://sentry.io/organizations/nself-chat | Error tracking |
| Grafana | http://localhost:3000 | Metrics dashboard |
| Prometheus | http://localhost:9090 | Metrics storage |
| Application Logs | Docker logs | Structured logs |
| Metrics Endpoint | http://localhost:3000/api/metrics | Prometheus scrape |
| Health Check | http://localhost:3000/api/health | Service status |
```bash
# View real-time metrics
curl http://localhost:3000/api/metrics
# Check application health
curl http://localhost:3000/api/health
# View Prometheus targets
curl http://localhost:9090/api/v1/targets
# View active alerts
curl http://localhost:9090/api/v1/alerts
```

```bash
# View application logs
docker logs nself-chat-app -f
# View logs with timestamps
docker logs nself-chat-app --timestamps
# View last 100 lines
docker logs nself-chat-app --tail 100
# View logs from last hour
docker logs nself-chat-app --since 1h
# Search logs for errors
docker logs nself-chat-app 2>&1 | grep ERROR
# Search logs by request ID
docker logs nself-chat-app 2>&1 | grep "req-abc123"Check Sentry Dashboard:
- Go to https://sentry.io/organizations/nself-chat
- View "Issues" tab
- Sort by "Last Seen" or "Events"
- Look for patterns in error messages
Check Application Logs:
```bash
# Recent errors
docker logs nself-chat-app --tail 500 | grep ERROR
# Errors in last hour
docker logs nself-chat-app --since 1h | grep ERROR
```
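Since the logs are structured (see the Quick Reference table), they can also be filtered by field instead of grepped. A minimal sketch, assuming each log line is a JSON object with a `level` field (adjust the field name to the actual log schema):

```bash
# Parse each log line as JSON and keep only error-level entries;
# fromjson? silently skips lines that are not valid JSON.
docker logs nself-chat-app --since 1h 2>&1 \
  | jq -R 'fromjson? | select(.level == "error")'
```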
From Sentry:
- Error message and stack trace
- Affected users (count and IDs)
- Release version
- Environment (production/staging)
- Breadcrumbs (user actions before error)
- Tags (route, feature, etc.)
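Much of this context can also be pulled programmatically. A minimal sketch using Sentry's REST API, assuming a read-scope auth token is available in `SENTRY_TOKEN` (the endpoint shape mirrors the releases call used later in this runbook):

```bash
# List unresolved issues for the organization, most frequent first.
curl -s -H "Authorization: Bearer $SENTRY_TOKEN" \
  "https://sentry.io/api/0/organizations/nself-chat/issues/?query=is:unresolved&sort=freq"
```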
From Logs:
```bash
# Find request ID from Sentry
REQUEST_ID="req-abc123"
# Get full request context
docker logs nself-chat-app 2>&1 | grep $REQUEST_ID
# Get surrounding logs (10 lines before/after)
docker logs nself-chat-app 2>&1 | grep -A 10 -B 10 $REQUEST_ID
```

Reproduce the error locally:
- Check the release version in Sentry
- Check out that version: `git checkout <commit-sha>`
- Start the application: `pnpm dev`
- Follow the steps from the breadcrumbs
- Check the browser console and network tab
Common Patterns:
| Error Type | Likely Cause | Investigation |
|---|---|---|
| TypeError: Cannot read property 'X' | Null/undefined value | Check data flow |
| Network Error | API down or timeout | Check backend health |
| 401 Unauthorized | Auth token expired | Check token refresh |
| 500 Internal Server Error | Server-side exception | Check API logs |
| Database Error | Query timeout/deadlock | Check DB metrics |
| WebSocket Error | Connection dropped | Check WS latency |
Check Related Metrics:
```bash
# Query Prometheus
curl 'http://localhost:9090/api/v1/query?query=http_request_duration_seconds'
# Check error rate
curl 'http://localhost:9090/api/v1/query?query=rate(http_requests_total{status=~"5.."}[5m])'
```

Fix and verify:
- Create the fix in a feature branch
- Write test to prevent regression
- Deploy to staging
- Verify fix in Sentry (wait 1 hour)
- Deploy to production
- Monitor for 24 hours
Symptoms:
- Slow page loads
- Delayed API responses
- High WebSocket latency
- Database query timeouts
Check Grafana Dashboard:
- Open http://localhost:3000/d/performance
- Look for spikes in:
- Response time (P95, P99)
- Error rate
- CPU/Memory usage
- Database connections
Step 1: Identify Slow Endpoint
```bash
# Query Prometheus for the slowest endpoints
curl -G 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=topk(10, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))'
```

Step 2: Analyze Request
In Sentry:
- Go to Performance tab
- Find transaction by endpoint
- View flame graph
- Identify slow spans (database, external API, etc.)
Step 3: Check Database Queries
```bash
# Check Hasura metrics
curl http://hasura:8080/v1/metrics
# Check PostgreSQL slow queries
docker exec -it postgres psql -U postgres -c \
"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"Step 4: Optimize
Step 4: Optimize
Common fixes:
- Add a database index (see the sketch after this list)
- Implement query batching
- Add Redis caching
- Use GraphQL field limiting
- Implement pagination
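As an example of the first fix, an index can be added without blocking writes using `CREATE INDEX CONCURRENTLY`. The table and column below are hypothetical placeholders; derive the real ones from the slow query found in Step 3:

```bash
# Hypothetical index on messages.channel_id; CONCURRENTLY avoids a write lock
# while the index builds (it cannot run inside a transaction block).
docker exec -it postgres psql -U postgres -c \
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_channel_id ON messages (channel_id);"
```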
Step 5: Verify Improvement
```bash
# Compare metrics before/after
curl 'http://localhost:9090/api/v1/query_range?query=histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{endpoint="/api/messages"}[5m]))&start=<before>&end=<after>&step=60'
```

Symptoms:
- Gradual memory increase over time
- Memory usage doesn't decrease after load
- Out of memory errors
Investigation:
```bash
# Check memory usage trend
curl 'http://localhost:9090/api/v1/query?query=node_memory_MemAvailable_bytes'
# Check memory by process
docker stats nself-chat-app
# Start the app under the inspector, then take a heap snapshot from Chrome DevTools
node --inspect-brk src/server.js
```

Common Causes:
- Event listeners not cleaned up
- Global variables accumulating data
- WebSocket connections not closed
- Database connections not released
- Large in-memory caches
Fix and Verify:
- Identify leaking component
- Add cleanup in useEffect return
- Implement connection pooling
- Add cache TTL and eviction
- Monitor memory over 24 hours (see the sketch below)
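For the last step, a minimal sketch of a sampler that records container memory every five minutes, so the 24-hour trend can be reviewed afterwards:

```bash
# Append a timestamped memory reading to memory-trend.log every 5 minutes.
while true; do
  echo "$(date -u +%FT%TZ) $(docker stats nself-chat-app --no-stream --format '{{.MemUsage}}')" \
    >> memory-trend.log
  sleep 300
done
```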
Alert: CPU usage > 95% for 2 minutes
Response:
- Acknowledge: Respond within 5 minutes
- Assess: Check the Grafana CPU dashboard
- Identify: Find the process causing high CPU: `docker exec nself-chat-app top -b -n 1`
- Mitigate:
  - If infinite loop: restart the app
  - If high load: scale horizontally
  - If under attack: enable rate limiting
- Document: Create an incident report
Alert: Available memory < 10% for 2 minutes
Response:
- Acknowledge: Respond within 5 minutes
- Check: Memory usage by process: `docker stats nself-chat-app`
- Mitigate:
  - Restart the app to clear memory
  - Increase the container memory limit (see the sketch after this list)
  - Clear the Redis cache if needed
- Investigate: Check for memory leaks
- Monitor: Watch memory over the next hour
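If raising the container limit is the chosen mitigation, `docker update` can adjust it in place; the 2g value below is illustrative, not a recommendation:

```bash
# Raise the memory limit (and swap cap) on the running container.
docker update --memory 2g --memory-swap 2g nself-chat-app
```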
Alert: P95 API latency > 1s for 2 minutes
Response:
- Acknowledge: Respond within 10 minutes
- Identify: Find slow endpoints
  - Check Sentry Performance
  - Query Prometheus metrics
- Check dependencies:
  - Database query time
  - External API calls
  - Cache hit rate
- Mitigate:
  - Enable aggressive caching
  - Reduce query complexity
  - Increase the connection pool
- Document: Root cause and fix
Alert: Error rate > 5% for 2 minutes
Response:
- Acknowledge: Immediately (< 2 minutes)
- Assess impact:
  - Check Sentry for error types
  - Identify affected endpoints
  - Count affected users
- Mitigate:
  - If deployment issue: roll back (see the sketch after this list)
  - If a dependency is down: enable its fallback
  - If database issue: check DB health
- Communicate: Post a status update
- Resolve: Deploy the fix
- Post-Mortem: Within 24 hours
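For the rollback path, a minimal sketch assuming the app is deployed from this repository with docker-compose (adapt to the actual deploy pipeline):

```bash
# Revert the offending commit and rebuild/redeploy the app container.
git revert --no-edit <commit-sha>
docker-compose up -d --build nself-chat-app
```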
Alert: P95 API latency > 500ms for 5 minutes
Response:
- Acknowledge within 30 minutes
- Investigate slow endpoints
- Check database query performance
- Schedule optimization work
- Monitor trend over next day
Alert: Error rate > 1% for 5 minutes
Response:
- Acknowledge within 30 minutes
- Check Sentry for error patterns
- Identify affected users
- Create fix if widespread
- Monitor for escalation
Alert: Redis cache hit rate < 80% for 10 minutes
Response:
- Acknowledge within 1 hour
- Check cache configuration
- Verify TTL settings (see the sketch after this list)
- Review cache keys being used
- Optimize cache strategy
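The counters behind the hit rate, and the TTLs actually in effect, can be checked directly with `redis-cli`:

```bash
# Cumulative hit/miss counters (hit rate = hits / (hits + misses)).
docker-compose exec redis redis-cli INFO stats | grep -E 'keyspace_(hits|misses)'
# Spot-check the TTL of a cache key: -1 means no expiry was set, -2 means the key is missing.
docker-compose exec redis redis-cli TTL "<cache-key>"
```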
Symptoms:
- Sudden spike in errors after deployment
- Errors in Sentry with new release tag
Investigation:
```bash
# Check recent deployments
git log -10 --oneline
# Check Sentry release
curl https://sentry.io/api/0/organizations/nself-chat/releases/
```

Resolution:
- Review changes in deployment
- Check for breaking API changes
- Roll back if critical: `git revert <commit>`
- Deploy fix
- Add tests to prevent regression
Symptoms:
- Users report disconnections
- High reconnection rate in metrics
Investigation:
```bash
# Check WebSocket metrics
curl 'http://localhost:9090/api/v1/query?query=rate(websocket_disconnections_total[5m])'
# Check nginx logs
docker logs nginx | grep WebSocket
```

Resolution:
- Check nginx timeout settings (see the sketch after this list)
- Verify WebSocket heartbeat
- Check load balancer configuration
- Increase connection limits if needed
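To see which timeout and upgrade settings are actually in effect, dump the running nginx configuration; a sketch, assuming the container is named `nginx` as in the log command above:

```bash
# Dump the full resolved config and pick out WebSocket-relevant directives.
docker exec nginx nginx -T 2>/dev/null \
  | grep -E 'proxy_(read|send)_timeout|proxy_http_version|Upgrade'
```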
Symptoms:
- "Too many connections" errors
- Slow query performance
Investigation:
```bash
# Check connection count
docker exec postgres psql -U postgres -c \
"SELECT count(*) FROM pg_stat_activity;"
# Check max connections
docker exec postgres psql -U postgres -c \
"SHOW max_connections;"Resolution:
- Identify connection leaks (see the sketch after this list)
- Ensure connections are properly released
- Increase max_connections if needed
- Implement connection pooling
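To find where connections are going, break `pg_stat_activity` down by state; a large count of long-lived `idle` rows usually points at a pool that is not releasing connections:

```bash
# Count connections by state; many idle rows suggest a connection leak.
docker exec postgres psql -U postgres -c \
  "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count DESC;"
```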
Symptoms:
- Warning emails from Sentry
- Events being dropped
Investigation:
- Check Sentry quota usage
- Identify noisy errors
- Review error grouping
Resolution:
- Add ignoreErrors for known issues
- Implement error sampling
- Increase quota if needed
- Fix root cause of errors
```
1. User Reports Issue
↓
2. Check Browser Console
- Error messages
- Network failures
- Warnings
↓
3. Check Sentry
- Error details
- Breadcrumbs
- User context
↓
4. Check Network Tab
- API responses
- Status codes
- Timing
↓
5. Reproduce Locally
- Follow breadcrumbs
- Check React DevTools
- Add console.logs
↓
6. Identify Root Cause
↓
7. Fix and Test
↓
8. Deploy
↓
9. Verify in Production
```
```
1. Alert Triggered
↓
2. Check Grafana
- Identify affected metric
- Find spike timing
- Correlate with deployment
↓
3. Check Logs
- Filter by time range
- Search for errors
- Find request IDs
↓
4. Check Sentry
- Error details
- Stack traces
- User impact
↓
5. Check Dependencies
- Database health
- Redis health
- External APIs
↓
6. Reproduce Locally
- Use same data
- Test with curl/Postman
↓
7. Identify Root Cause
↓
8. Fix and Test
↓
9. Deploy
↓
10. Monitor Metrics
```
```
1. Slow Response Reported
↓
2. Check Metrics
- Response time (P95, P99)
- Error rate
- Throughput
↓
3. Identify Slow Endpoint
- Query Prometheus
- Check Sentry Performance
↓
4. Analyze Traces
- View flame graph
- Find slow spans
↓
5. Check Database
- Query performance
- Connection count
- Indexes
↓
6. Check Cache
- Hit rate
- TTL
- Memory usage
↓
7. Profile Code
- Add instrumentation
- Measure execution time
↓
8. Optimize
- Add indexes
- Implement caching
- Optimize queries
↓
9. Measure Improvement
↓
10. Deploy and Monitor
```
```promql
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
# Response time (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Response time (P99)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Active WebSocket connections
websocket_connections_active
# Messages per second
rate(messages_sent_total[5m])
# Active users (last 5 minutes)
count(user_activity_timestamp > (time() - 300))
# Cache hit rate (windowed; the raw counters alone would give only the lifetime ratio)
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
# Database connections
pg_stat_database_numbackends
# Queue depth
message_queue_depth
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Disk usage
(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} * 100
# Network throughput
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
```
URL: http://localhost:3000/d/performance
Panels:
- Request Rate: Total requests per second
- Response Time (P95/P99): API latency percentiles
- Error Rate: Percentage of failed requests
- Active WebSocket Connections: Current connection count
- CPU Usage: Server CPU utilization
- Memory Usage: Server memory utilization
- Database Connections: Active DB connections
- Cache Hit Rate: Redis cache efficiency
How to Use:
- Set time range (default: last 1 hour)
- Use variables to filter by endpoint, status code
- Click on graph to zoom in
- Hover for exact values
- Use annotations to mark deployments (see the sketch below)
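Deployment annotations can be created through Grafana's HTTP API; a minimal sketch, assuming a service-account token in `GRAFANA_TOKEN` and Grafana reachable at the URL above (the version text is a placeholder):

```bash
# Mark a deployment on dashboards via the annotations API.
curl -s -X POST http://localhost:3000/api/annotations \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"text": "Deployed <version>", "tags": ["deployment"]}'
```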
URL: http://localhost:3000/d/business
Panels:
- Messages Sent: Message throughput
- Active Users: User activity
- Channels Created: Channel growth
- File Uploads: Storage usage
- Search Queries: Search activity
```bash
# Check system health
curl http://localhost:3000/api/health
# Review overnight errors
# Go to Sentry → Last 24 hours
# Check disk space
df -h
# Review active alerts
curl http://localhost:9090/api/v1/alerts
```

```bash
# Review performance trends
# Open Grafana → Last 7 days
# Check Sentry quota usage
# Sentry → Settings → Usage
# Clean up unused Docker data (old containers, images, networks)
docker system prune -f
# Review slow queries
docker exec postgres psql -U postgres -c \
"SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"# Review and update alert thresholds
# Based on observed patterns
# Clean up unused metrics
# Remove metrics with no data
# Review observability costs
# Sentry usage
# Storage costs
# Update runbook
# Document new issues and solutions
```

| Level | Response Time | Escalation |
|---|---|---|
| Critical | 5 minutes | Immediate escalation |
| High | 30 minutes | Escalate if not resolved in 1 hour |
| Medium | 4 hours | Escalate if not resolved in 8 hours |
| Low | 24 hours | Track in backlog |
On-Call Engineer: [Your team's on-call rotation]
Engineering Manager: [Manager contact]
DevOps Lead: [DevOps contact]
Emergency Contact: [Emergency escalation]
- Acknowledge: Respond to alert
- Assess: Determine severity and impact
- Communicate: Post status update
- Mitigate: Immediate fix or rollback
- Resolve: Deploy permanent fix
- Post-Mortem: Document lessons learned
- Sentry Documentation: https://docs.sentry.io
- Prometheus Query Language: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Grafana Documentation: https://grafana.com/docs/
- Architecture Docs: `/.wiki/ARCHITECTURE.md`
- API Documentation: `/docs/API.md`
- Deployment Guide: `/docs/DEPLOYMENT.md`
```bash
# Restart application
docker-compose restart nself-chat-app
# View real-time logs
docker-compose logs -f
# Check service health
docker-compose ps
# Access database
docker-compose exec postgres psql -U postgres
# Clear Redis cache
docker-compose exec redis redis-cli FLUSHALL
# Reload Prometheus config
curl -X POST http://localhost:9090/-/reload
```

This runbook covers the most common observability scenarios. For issues not covered here:
- Check Sentry for similar errors
- Search internal documentation
- Consult with team
- Update this runbook with solution
Last Updated: February 9, 2026 Next Review: March 9, 2026