# Operational Runbook
This runbook provides step-by-step instructions for monitoring, supporting, and troubleshooting the Portfolio Analytics System in production. It is intended for DevOps engineers, SREs, and on-call support staff.
---
## Monitoring
### 1. Health Checks
* **API Health:**
Use the `/health` endpoint exposed by each service.
Example: `curl http://<service-host>:<port>/health`
* **Service Status:**
Check pod status with `kubectl`:
```bash
kubectl get pods -n <namespace>
```
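The two checks above can be combined into a quick sweep across services. The following is a minimal sketch, not part of the system itself: `check_health` is a hypothetical helper name, and it assumes that a healthy service answers `/health` with HTTP 200.

```bash
# check_health URL...: probe each /health URL and report OK/FAIL per service.
# Assumption: HTTP 200 means healthy; anything else (or no answer) is a failure.
# Returns non-zero if any probe fails.
check_health() {
  local url code failed=0
  for url in "$@"; do
    # -s: quiet, -o /dev/null: discard the body, -w: print only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Usage: `check_health http://<service-host>:<port>/health` with one URL per service. A non-zero exit code makes the helper easy to wire into a cron job or alerting script.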
### 2. Log Monitoring
* **Centralized Logging:**
All services log to Splunk/ELK, and every log entry carries a correlation ID (`corr_id=...`).
* **Search by Correlation ID:**
To trace an event across the pipeline:
```
corr_id=ING:550e8400-e29b-41d4-a716-446655440000
```
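When the central log platform is unavailable, the same search can be run against raw pod logs. This is a minimal sketch under that assumption; `filter_by_corr_id` is a hypothetical helper name, and it relies only on the `corr_id=...` tag described above.

```bash
# filter_by_corr_id ID: keep only the stdin lines tagged with the given
# correlation ID.
filter_by_corr_id() {
  # -F matches the ID as a fixed string, so ':' and '-' are taken literally
  grep -F -- "corr_id=$1"
}
```

Usage: `kubectl logs <pod> -n <namespace> | filter_by_corr_id 'ING:550e8400-e29b-41d4-a716-446655440000'`. Repeating this for each service's pods follows one event through the pipeline.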
### 3. Kafka Monitoring
* **Consumer Lag:**
Monitor with a Kafka UI tool or from the command line:
```bash
kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <service-group>
```
* **Dead Letter Queue (DLQ):**
Monitor DLQ topics for failed events. Investigate each failure and replay the affected events once the underlying issue is fixed.
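For alerting, the per-partition output of the consumer-lag command above can be reduced to a single number. The sketch below assumes the usual `--describe` table with a `LAG` header column; `total_lag` is a hypothetical helper name. It locates the column by its header rather than by position, because the column layout differs between Kafka versions.

```bash
# total_lag: read `kafka-consumer-groups.sh --describe` output on stdin and
# print the summed lag across all partitions.
total_lag() {
  awk '
    # Header row: remember which column is labelled LAG
    !col && /LAG/ { for (i = 1; i <= NF; i++) if ($i == "LAG") col = i; next }
    # Data rows: add numeric lag values; "-" (no committed offset) is skipped
    col && $col ~ /^[0-9]+$/ { total += $col }
    END { print total + 0 }
  '
}
```

Usage: `kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <service-group> | total_lag`. A total that keeps growing between runs suggests the consumer is stalled or under-provisioned.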
### 4. Database Monitoring
* **Connection Health:**
Monitor with built-in PostgreSQL dashboards or queries.
* **Schema Consistency:**
Validate the schema after migrations, for example by describing a table in `psql`:
```sql
\d processed_events;
```
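For the connection-health check, one option is to compare the number of open connections against `max_connections`. This is a minimal sketch, assuming `psql` can reach the database through the standard `PG*` environment variables; the helper names and the 80% default threshold are illustrative assumptions, not documented limits of this system.

```bash
# pg_connection_usage: print "<open connections>/<max_connections>".
# Note: pg_stat_activity may also list background processes, so treat the
# count as an approximation.
pg_connection_usage() {
  psql -At -c "SELECT count(*)::text || '/' || current_setting('max_connections') FROM pg_stat_activity;"
}

# check_connections [THRESHOLD_PCT]: warn when usage reaches the threshold.
check_connections() {
  local threshold=${1:-80} usage used max pct
  usage=$(pg_connection_usage) || return 2
  used=${usage%/*}   # part before the slash
  max=${usage#*/}    # part after the slash
  pct=$(( used * 100 / max ))
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN connections at ${pct}% ($usage)"
    return 1
  fi
  echo "OK connections at ${pct}% ($usage)"
}
```

Usage: `check_connections` for the default threshold, or `check_connections 90` to warn only at 90% and above.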
---
## Incident Response
### 1. Common Issues
* **Event Processing Failures:**
  * Symptom: Messages in DLQ, missing downstream calculations.
  * Action: Investigate root cause in logs (search by correlation ID), fix data/code, replay DLQ if safe.
* **Service Unavailable:**
  * Symptom: Health check fails, pods not running.
  * Action: Restart pod/deployment, check logs for startup errors.
* **DB Migration Issues:**
  * Symptom: Service startup errors after deploy, schema mismatch.
  * Action: Roll forward with a new migration or restore from backup as per the [Production Database Migration Guide](Production-Database-Migration-Guide.md).
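When triaging a "Service Unavailable" incident, it helps to narrow the pod list down to the pods that actually need attention. The sketch below assumes the default `kubectl get pods` column layout (`NAME READY STATUS ...`) for a single namespace; `unhealthy_pods` is a hypothetical helper name.

```bash
# unhealthy_pods: read `kubectl get pods` output on stdin and print
# "<name> <status> <ready>" for each pod that is not fully up.
unhealthy_pods() {
  awk '
    NR > 1 {                          # skip the header row
      if ($3 == "Completed") next     # finished jobs are not failures
      split($2, ready, "/")           # READY looks like "1/1"
      # Flag pods that are not Running, or Running with containers not ready
      if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
    }
  '
}
```

Usage: `kubectl get pods -n <namespace> | unhealthy_pods`. For a pod in a crash loop, `kubectl logs <pod> -n <namespace> --previous` shows the output of the last terminated container, which is usually where the startup error appears.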
### 2. On-Call Escalation
* **Critical errors or data loss:**
  * Notify lead engineer and product owner immediately.
  * Follow incident playbook for client communications.
---
## Maintenance Tasks
* **Rolling Restarts:**
Use Kubernetes rolling update or restart deployments to apply patches or config changes.
* **Kafka Topic Cleanup:**
Periodically clean up old topics and DLQ after retention expiry.
* **DB Backups:**
Ensure automated backups are running and periodically tested for restore.
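The rolling restart described above can be scripted so that each deployment is restarted only after the previous one has become healthy again. This is a minimal sketch: `rolling_restart` is a hypothetical helper name, and the 120-second timeout is an illustrative assumption to tune per service.

```bash
# rolling_restart NAMESPACE DEPLOYMENT...: restart deployments one at a time,
# waiting for each rollout to finish before starting the next.
rolling_restart() {
  local ns=$1 deploy
  shift
  for deploy in "$@"; do
    # Stop at the first failure so a bad rollout does not spread further
    kubectl rollout restart "deployment/$deploy" -n "$ns" &&
      kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s ||
      return 1
  done
}
```

Usage: `rolling_restart <namespace> <deployment> [<deployment> ...]`. Restarting sequentially keeps the rest of the pipeline serving traffic while one service cycles.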
---
## Useful Links
* [Observability & Logging](Observability-&-Logging.md)
* [Production Database Migration Guide](Production-Database-Migration-Guide.md)
* [Testing Guide](Testing-Guide.md)