# Operational Runbook

## Overview

This runbook provides step-by-step instructions for monitoring, supporting, and troubleshooting the Portfolio Analytics System in production. It is intended for DevOps engineers, SREs, and on-call support staff.


---

## Core Operational Tasks

### 1. Health Checks

* **API Health:**
  Use the `/health` endpoints for all services. Example:
  `curl http://<service-host>:<port>/health`

* **Service Status:**
  Check service pods/status using Kubernetes:

  ```bash
  kubectl get pods -n <namespace>
  ```
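When several services need checking at once, the `/health` call can be wrapped in a small helper. The sketch below assumes that a healthy service answers with HTTP 200 and that a 5-second timeout is acceptable. Both are assumptions, as are the function names `check_health` and `check_all`.

```bash
# check_health URL: print OK/FAIL for one /health endpoint; non-zero exit on failure.
check_health() {
  local url="$1" code
  # -s: quiet, -o /dev/null: drop the body, -w: print only the HTTP status,
  # --max-time: give up after 5 seconds so a hung service does not stall the check.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  if [ "$code" = "200" ]; then
    echo "OK   $url"
  else
    echo "FAIL $url (HTTP ${code:-none})"
    return 1
  fi
}

# check_all URL...: check every endpoint and report how many failed.
check_all() {
  local failed=0 url
  for url in "$@"; do
    check_health "$url" || failed=$((failed + 1))
  done
  echo "$failed endpoint(s) failing"
  [ "$failed" -eq 0 ]
}

# Usage (hosts and ports are placeholders):
#   check_all http://<service-host>:<port>/health http://<other-host>:<port>/health
```

Because `check_all` exits non-zero when any endpoint fails, it can be used directly in a cron job or CI step.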

### 2. Log Monitoring

* **Centralized Logging:**
  All services send their logs to Splunk/ELK, and each log entry carries a correlation ID (`corr_id=...`).
* **Search by Correlation ID:**
  To trace an event across the pipeline:

  ```
  corr_id=ING:550e8400-e29b-41d4-a716-446655440000
  ```
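When the central log platform is unavailable, the same search can be approximated against raw pod logs. This is a minimal sketch; the helper name `filter_by_corr_id` is hypothetical, and it assumes the `corr_id=<id>` token appears verbatim in each log line.

```bash
# filter_by_corr_id ID: read log lines on stdin, keep those tagged with that ID.
filter_by_corr_id() {
  # -F treats the ID as a literal string, so ':' and '-' need no escaping.
  grep -F -- "corr_id=$1"
}

# Usage (deployment name and namespace are placeholders):
#   kubectl logs deployment/<service> -n <namespace> --since=1h \
#     | filter_by_corr_id "ING:550e8400-e29b-41d4-a716-446655440000"
```

Running the same filter against each service in the pipeline shows how far the event travelled before it failed.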

### 3. Kafka Monitoring

* **Consumer Lag:**
  Monitor lag with a Kafka UI tool or from the command line:

  ```bash
  kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <service-group>
  ```
* **Dead Letter Queue (DLQ):**
  Watch the DLQ topics for failed events. Investigate each failure and replay the events once the underlying issue is fixed.
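The consumer-lag output is per partition, so a single total per group is often easier to alert on. The sketch below sums the `LAG` column from `kafka-consumer-groups.sh --describe` output. It locates the column by its header because the column order differs between Kafka versions. The helper name `total_lag` is hypothetical.

```bash
# total_lag: read `kafka-consumer-groups.sh --describe` output on stdin
# and print the summed lag across all partitions.
total_lag() {
  awk '
    # Until the header row is seen, look for the LAG column and skip the line.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Sum numeric lag values; partitions without a committed offset show "-".
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}

# Usage (broker and group are placeholders):
#   kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <service-group> | total_lag
```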

### 4. Database Monitoring

* **Connection Health:**
  Monitor connections with PostgreSQL dashboards or by querying statistics views such as `pg_stat_activity`.
* **Schema Consistency:**
  Validate the schema after migrations, for example with the psql `\d` meta-command:

  ```sql
  \d processed_events;
  ```
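Connection health can also be checked from a script. The sketch below compares the number of rows in `pg_stat_activity` with `max_connections` and warns above a threshold. The function names and the 80% default are assumptions. The count is approximate, since recent PostgreSQL versions also list background processes in that view.

```bash
# connection_usage: print "<open connections> <max_connections>" on one line.
# psql reads the target from the standard PGHOST/PGPORT/PGUSER/PGDATABASE variables.
connection_usage() {
  # -A: unaligned output, -t: rows only, -F ' ': separate columns with a space.
  psql -At -F ' ' -c "SELECT count(*), current_setting('max_connections') FROM pg_stat_activity;"
}

# check_connections [THRESHOLD_PCT]: warn when usage reaches the threshold (default 80).
check_connections() {
  local threshold="${1:-80}" used max pct
  # Split the single output line into its two numbers.
  set -- $(connection_usage)
  used="$1" max="$2"
  if [ "${max:-0}" -le 0 ]; then
    echo "could not read connection stats" >&2
    return 2
  fi
  pct=$((used * 100 / max))
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARN connections at ${pct}% ($used/$max)"
    return 1
  fi
  echo "OK connections at ${pct}% ($used/$max)"
}

# Usage: check_connections 90
```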

---

## Incident Response

### 1. Common Issues

* **Event Processing Failures:**

  * Symptom: Messages in DLQ, missing downstream calculations.
  * Action: Find the root cause in the logs (search by correlation ID), fix the data or code, then replay the DLQ events if it is safe to do so.

* **Service Unavailable:**

  * Symptom: Health check fails, pods not running.
  * Action: Restart pod/deployment, check logs for startup errors.

* **DB Migration Issues:**

  * Symptom: Service startup errors after deploy, schema mismatch.
  * Action: Roll forward with a new migration or restore from backup as per the [Production Database Migration Guide](Production-Database-Migration-Guide.md).
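For the "Service Unavailable" case, it helps to list only the pods that need attention before restarting anything. This is a minimal sketch that filters `kubectl get pods --no-headers` output. It assumes the default column order (NAME, READY, STATUS), and the helper name `unhealthy_pods` is hypothetical.

```bash
# unhealthy_pods: read `kubectl get pods --no-headers` output on stdin and print
# name, readiness and status for pods that are not Running or not fully ready.
unhealthy_pods() {
  awk '
    { split($2, ready, "/") }          # READY column, e.g. "0/1"
    $3 == "Completed" { next }         # finished jobs are not a problem
    $3 != "Running" || ready[1] != ready[2] { print $1, $2, $3 }
  '
}

# Usage (namespace is a placeholder):
#   kubectl get pods -n <namespace> --no-headers | unhealthy_pods
```

Startup errors for any pod it reports can then be inspected with `kubectl logs <pod> -n <namespace>` or `kubectl describe pod <pod> -n <namespace>`.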

### 2. On-Call Escalation

* **Critical errors or data loss:**

  * Notify lead engineer and product owner immediately.
  * Follow incident playbook for client communications.

---

## Maintenance Tasks

* **Rolling Restarts:**
  Use a Kubernetes rolling update or `kubectl rollout restart` to apply patches or config changes.
* **Kafka Topic Cleanup:**
  Periodically clean up obsolete topics and DLQ topics once their retention period has expired.
* **DB Backups:**
  Ensure automated backups are running and periodically tested for restore.
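A rolling restart can be scripted so that it waits for the rollout to finish before reporting success. This is a sketch only: the function name `rolling_restart` and the 300-second timeout are assumptions, and the deployment and namespace are supplied by the caller.

```bash
# rolling_restart DEPLOYMENT NAMESPACE: restart a deployment and wait for it to settle.
rolling_restart() {
  local deployment="$1" namespace="$2"
  # Trigger a rolling restart, then block until the new pods are ready
  # or the timeout expires (non-zero exit in that case).
  kubectl rollout restart "deployment/$deployment" -n "$namespace" &&
    kubectl rollout status "deployment/$deployment" -n "$namespace" --timeout=300s
}

# Usage (names are placeholders):
#   rolling_restart <deployment> <namespace>
```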

---

## Useful Links

* [Observability & Logging](Observability-&-Logging.md)
* [Production Database Migration Guide](Production-Database-Migration-Guide.md)
* [Testing Guide](Testing-Guide.md)
