Templates (PIR) - cilerler/cilerler.github.io GitHub Wiki
Status | Description |
---|---|
Draft | The postmortem is being written. Initial facts are collected. |
Awaiting Root Cause | The incident is known but root cause is still being investigated. |
In Review | The draft is under internal or peer review. |
Pending Approval | Awaiting sign-off from leads or responsible engineers. |
Approved | Approved and ready to publish or share. |
Completed | The postmortem is published. Action items are assigned but not all are closed. |
Follow-up Required | Additional remediation steps remain open. |
Closed | All actions are resolved. The incident lifecycle is fully completed. |
Canceled | The postmortem was intentionally abandoned. No further work will be done. |
Obsolete | Superseded by a newer or merged postmortem. |
Reopened | Incident was previously closed or canceled but has reoccurred or evolved. |
2025-03-22 14:17 UTC
The `user-api` service (built with .NET and fronted by YARP) was down for approximately 37 minutes due to a misconfigured DAPR pub/sub component. The misconfiguration occurred during a redeploy triggered via Flux, which caused the service to enter a crash loop. This impacted login and registration endpoints across all platforms.
Automatic alerts were triggered by Grafana OnCall within 2 minutes of deployment. No end-user reports were received prior to system alerts.
- 12,400 login attempts failed (approx. 8% of daily average)
- 220 user registrations dropped (approx. 5% of daily average)
- Customer support ticket volume increased by 30%
- Users could not log in or register via web or mobile apps.
- Frustration noted in support interactions, but no escalations.
- All disrupted sessions resumed automatically post-recovery.
- Incident acknowledged publicly on the status page at 14:28 UTC
- Internal comms handled via Google Chat (`#incident-response` space)
- Final postmortem published in GitHub Wiki and linked from the status page
- 14:17 - Deployment triggered via Flux GitOps
- 14:19 - Grafana OnCall alerts fired
- 14:22 - SRE began triage
- 14:34 - Root cause identified in `component.yaml` (DAPR config)
- 14:41 - Hotfix committed and redeployed via Flux
- 14:54 - All systems stable and healthy
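
The intervals implied by this timeline can be checked with a few lines of Python. This is a minimal sketch: the timestamps are copied from the list above, while the metric names (`time_to_detect`, `time_to_root_cause`, `time_to_recover`) are illustrative labels rather than terms used in the original report. It assumes all events fall on the same UTC day.

```python
from datetime import datetime

# Timestamps copied from the timeline (all UTC, same day).
TIMELINE = {
    "deployed": "14:17",
    "alerted": "14:19",
    "triage_started": "14:22",
    "root_cause_found": "14:34",
    "hotfix_deployed": "14:41",
    "stable": "14:54",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: dict) -> dict:
    """Derive intervals measured from the deployment that started the incident."""
    start = timeline["deployed"]
    return {
        "time_to_detect": minutes_between(start, timeline["alerted"]),
        "time_to_root_cause": minutes_between(start, timeline["root_cause_found"]),
        "time_to_recover": minutes_between(start, timeline["stable"]),
    }


if __name__ == "__main__":
    for name, minutes in incident_metrics(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

The recovery interval computed this way matches the roughly 37 minutes of downtime and the 2-minute alerting delay stated in the summary.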
A malformed RabbitMQ topic name was introduced in the DAPR pubsub component (`dapr-components/component.yaml`). The invalid configuration caused DAPR to fail during initialization, placing the `user-api` service into a crash loop. No validation mechanism was present in the CI pipeline to detect the error before deployment.
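
A pre-deploy check along the following lines could catch this class of error before it reaches the cluster. This is a minimal sketch, not the team's actual validation step. It assumes the YAML file has already been parsed into a Python dict (for example with a YAML library) and checks the top-level fields that Dapr component manifests use (`apiVersion`, `kind`, `metadata.name`, `spec.type`, `spec.version`, `spec.metadata`). The `topic` key name and `TOPIC_PATTERN` are hypothetical, since the report does not say which field held the topic name or what made it malformed.

```python
import re

# ASSUMPTION: a hypothetical naming rule; the real rule is not given in the report.
TOPIC_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def validate_component(component: dict, topic_keys: tuple = ("topic",)) -> list:
    """Return a list of problems found; an empty list means all checks passed.

    topic_keys names the spec.metadata entries treated as topic names.
    ASSUMPTION: the default key "topic" is illustrative only.
    """
    problems = []
    if component.get("apiVersion") != "dapr.io/v1alpha1":
        problems.append("apiVersion must be dapr.io/v1alpha1")
    if component.get("kind") != "Component":
        problems.append("kind must be Component")

    metadata = component.get("metadata")
    if not isinstance(metadata, dict) or not metadata.get("name"):
        problems.append("metadata.name is required")

    spec = component.get("spec")
    if not isinstance(spec, dict):
        problems.append("spec is required")
        return problems

    if not str(spec.get("type", "")).startswith("pubsub."):
        problems.append("spec.type must start with 'pubsub.'")
    if not spec.get("version"):
        problems.append("spec.version is required")

    # Each spec.metadata entry is expected to be a mapping with a "name" key.
    for item in spec.get("metadata") or []:
        if not isinstance(item, dict) or "name" not in item:
            problems.append("every spec.metadata entry needs a name")
            continue
        if item["name"] in topic_keys:
            value = str(item.get("value", ""))
            if not TOPIC_PATTERN.fullmatch(value):
                problems.append(
                    f"spec.metadata {item['name']} {value!r} does not match the naming rule"
                )
    return problems


# Illustrative manifest content; the values are made up for this sketch.
EXAMPLE_COMPONENT = {
    "apiVersion": "dapr.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "pubsub"},
    "spec": {
        "type": "pubsub.rabbitmq",
        "version": "v1",
        "metadata": [{"name": "topic", "value": "user-events"}],
    },
}

if __name__ == "__main__":
    found = validate_component(EXAMPLE_COMPONENT)
    print("OK" if not found else "\n".join(found))
```

In a CI job, a non-empty result would fail the build so the manifest never reaches Flux. A structural check like this does not replace validation against the full component schema.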
Once Grafana OnCall triggered alerts, the SRE team began investigating immediately. Logs from the DAPR sidecar and container restarts pointed to a subscription initialization failure. The root cause was traced to the incorrect RabbitMQ topic. The configuration was fixed, committed, and redeployed via Flux by 14:41 UTC. All health checks passed by 14:54 UTC. No data restoration was needed.
- Grafana OnCall alerted early and reliably
- Strong collaboration between SRE and development teams
- Deployment via Flux was smooth once the fix was applied
- Observability tools (Loki, Tempo) made tracing the root cause efficient
- No schema validation for DAPR component configurations
- Direct production deployment skipped staging validation
- Grafana alert context lacked details on the failing DAPR component
- Corrected RabbitMQ topic in `dapr-components/component.yaml`
- Redeployed `user-api` service via Flux with fixed config
- Enhanced Grafana alert templates with component-level failure context
- Introduced a temporary PowerShell pre-deploy validation step for DAPR configs
- CI pipeline now includes schema validation for all DAPR configs
- All production deployments require successful staging rollout
- On-call runbook updated with procedures for DAPR crash loops and service restarts
- Scheduled internal training session covering DAPR config best practices (RabbitMQ, Redis)
Action | Owner | Due Date |
---|---|---|
Add CI-based schema validation for DAPR | DevOps | 2025-03-25 |
Enforce staging validation in Flux workflows | Platform | 2025-03-27 |
Improve Grafana alert context for DAPR issues | SRE | 2025-03-30 |