Templates (PIR) - cilerler/cilerler.github.io GitHub Wiki

Post-Incident Review (aka Postmortem): User API Outage - March 22, 2025

Status

| Status | Description |
| --- | --- |
| Draft | The postmortem is being written. Initial facts are collected. |
| Awaiting Root Cause | The incident is known but root cause is still being investigated. |
| In Review | The draft is under internal or peer review. |
| Pending Approval | Awaiting sign-off from leads or responsible engineers. |
| Approved | Approved and ready to publish or share. |
| Completed | The postmortem is published. Action items are assigned but not all are closed. |
| Follow-up Required | Additional remediation steps remain open. |
| Closed | All actions are resolved. The incident lifecycle is fully completed. |
| Canceled | The postmortem was intentionally abandoned. No further work will be done. |
| Obsolete | Superseded by a newer or merged postmortem. |
| Reopened | Incident was previously closed or canceled but has reoccurred or evolved. |

Incident Date

2025-03-22 14:17 UTC

Summary

The user-api service (built with .NET and fronted by YARP) was down for approximately 37 minutes (14:17 to 14:54 UTC) due to a misconfigured DAPR pub/sub component. The misconfiguration was introduced in a redeploy triggered via Flux and sent the service into a crash loop. Login and registration endpoints were unavailable on all platforms for the duration.

Detection

Automatic alerts were triggered by Grafana OnCall within 2 minutes of deployment. No end-user reports were received prior to system alerts.

Impact

  • 12,400 login attempts failed (approx. 8% of daily average)
  • 220 user registrations dropped (approx. 5% of daily average)
  • Customer support ticket volume increased by 30%
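The percentages above imply a daily baseline for each metric. As a rough sketch (the function name is illustrative, and the results are only as precise as the "approx." figures in the report), the implied baselines can be backed out like this:

```python
def implied_daily_average(affected: int, percent_of_daily: int) -> int:
    """Return the daily baseline implied by 'affected is about N% of the daily average'."""
    if percent_of_daily <= 0:
        raise ValueError("percent_of_daily must be positive")
    # Integer arithmetic avoids floating-point rounding on round percentages.
    return affected * 100 // percent_of_daily


# Figures taken from the Impact list above.
logins_per_day = implied_daily_average(12_400, 8)
registrations_per_day = implied_daily_average(220, 5)

print(f"Implied daily logins: {logins_per_day:,}")
print(f"Implied daily registrations: {registrations_per_day:,}")
```

This works out to roughly 155,000 logins and 4,400 registrations per day, which gives a sense of scale for the failed requests.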

Customer Impact

  • Users could not log in or register via web or mobile apps.
  • Frustration noted in support interactions, but no escalations.
  • All disrupted sessions resumed automatically post-recovery.

Communications

  • Incident acknowledged publicly on the status page at 14:28 UTC
  • Internal comms handled via Google Chat (#incident-response space)
  • Final postmortem published in GitHub Wiki and linked from the status page

Timeline (UTC)

  • 14:17 - Deployment triggered via Flux GitOps
  • 14:19 - Grafana OnCall alerts fired
  • 14:22 - SRE began triage
  • 14:34 - Root cause identified in component.yaml (DAPR config)
  • 14:41 - Hotfix committed and redeployed via Flux
  • 14:54 - All systems stable and healthy
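The timeline above is enough to derive the usual response metrics. The following is a minimal sketch, not part of the original tooling: the dictionary keys and metric names are illustrative, and it assumes all events fall on the same UTC day.

```python
from datetime import datetime

# Timeline entries copied from the report (all times UTC on 2025-03-22).
TIMELINE = {
    "deployment_triggered": "14:17",
    "alerts_fired": "14:19",
    "triage_started": "14:22",
    "root_cause_identified": "14:34",
    "hotfix_redeployed": "14:41",
    "systems_stable": "14:54",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: dict) -> dict:
    """Minutes from the triggering deployment to each later milestone."""
    start = timeline["deployment_triggered"]
    return {
        "time_to_detect": minutes_between(start, timeline["alerts_fired"]),
        "time_to_triage": minutes_between(start, timeline["triage_started"]),
        "time_to_root_cause": minutes_between(start, timeline["root_cause_identified"]),
        "time_to_fix_deployed": minutes_between(start, timeline["hotfix_redeployed"]),
        "total_outage": minutes_between(start, timeline["systems_stable"]),
    }


for name, minutes in incident_metrics(TIMELINE).items():
    print(f"{name}: {minutes} min")
```

With these timestamps, detection took 2 minutes and the total outage comes to 37 minutes, matching the figures in the Summary and Detection sections.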

Root Cause

A malformed RabbitMQ topic name was introduced in the DAPR pub/sub component (dapr-components/component.yaml). The invalid configuration caused DAPR to fail during initialization, which put the user-api service into a crash loop. The CI pipeline had no validation step that could detect the error before deployment.

Resolution and Recovery (How did we respond?)

Once Grafana OnCall triggered alerts, the SRE team began investigating immediately. Logs from the DAPR sidecar, together with the repeated container restarts, pointed to a subscription initialization failure. The root cause was traced to the incorrect RabbitMQ topic at 14:34 UTC. The configuration was fixed, committed, and redeployed via Flux by 14:41 UTC. All health checks passed by 14:54 UTC. No data restoration was needed.

What Went Well

  • Grafana OnCall alerted early and reliably
  • Strong collaboration between SRE and development teams
  • Deployment via Flux was smooth once the fix was applied
  • Observability tools (Loki, Tempo) made tracing the root cause efficient

What Went Wrong

  • No schema validation for DAPR component configurations
  • The change was deployed directly to production, skipping staging validation
  • Grafana alert context lacked details on the failing DAPR component

Corrective Measures (What did we do to fix the current problem?)

  • Corrected RabbitMQ topic in dapr-components/component.yaml
  • Redeployed user-api service via Flux with fixed config
  • Enhanced Grafana alert templates with component-level failure context
  • Introduced a temporary PowerShell pre-deploy validation step for DAPR configs

Preventative Measures (How are we preventing similar issues in the future?)

  • CI pipeline now includes schema validation for all DAPR configs
  • All production deployments require successful staging rollout
  • On-call runbook updated with procedures for DAPR crash loops and service restarts
  • Scheduled internal training session covering DAPR config best practices (RabbitMQ, Redis)
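The report does not show the validation that was added to CI, so the following is only a sketch of what such a check could look like. It assumes the component file has already been parsed into a Python dict (for example by a YAML loader) and checks the general shape of a Dapr component definition. The `topic` metadata field name and the `TOPIC_PATTERN` naming rule are hypothetical and exist only for this example; they are not requirements of Dapr or RabbitMQ.

```python
import re

# Hypothetical naming rule for this example only: lowercase words joined by "." or "-".
TOPIC_PATTERN = re.compile(r"[a-z0-9]+([.-][a-z0-9]+)*")

REQUIRED_TOP_LEVEL = ("apiVersion", "kind", "metadata", "spec")


def validate_component(component: dict, topic_fields=("topic",)) -> list:
    """Return a list of human-readable problems; an empty list means the check passed.

    topic_fields names the spec.metadata entries treated as topic names
    (a hypothetical convention, supplied by the caller).
    """
    errors = [f"missing top-level field: {key}"
              for key in REQUIRED_TOP_LEVEL if key not in component]
    if errors:
        return errors

    if component["kind"] != "Component":
        errors.append(f"kind must be 'Component', got {component['kind']!r}")

    metadata = component["metadata"]
    if not isinstance(metadata, dict) or not metadata.get("name"):
        errors.append("metadata.name is required")

    spec = component["spec"]
    if not isinstance(spec, dict):
        return errors + ["spec must be a mapping"]
    for key in ("type", "version"):
        if not spec.get(key):
            errors.append(f"spec.{key} is required")

    for i, item in enumerate(spec.get("metadata") or []):
        name = item.get("name") if isinstance(item, dict) else None
        if not name:
            errors.append(f"spec.metadata[{i}] has no name")
            continue
        value = item.get("value")
        # Each entry needs either an inline value or a secret reference.
        if value is None and "secretKeyRef" not in item:
            errors.append(f"spec.metadata[{i}] ({name}) has no value or secretKeyRef")
        if name in topic_fields and not TOPIC_PATTERN.fullmatch(str(value or "")):
            errors.append(f"spec.metadata[{i}] ({name}) is not a valid topic name: {value!r}")
    return errors


# Illustrative component; the names and values are made up for this example.
sample = {
    "apiVersion": "dapr.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "user-pubsub"},
    "spec": {
        "type": "pubsub.rabbitmq",
        "version": "v1",
        "metadata": [{"name": "topic", "value": "user events!"}],
    },
}

for problem in validate_component(sample):
    print(problem)
```

A CI job could run a check like this over every file under dapr-components/ and fail the build whenever the returned list is non-empty, so a malformed value is caught before Flux reconciles it.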

Action Items

| Action | Owner | Due Date |
| --- | --- | --- |
| Add CI-based schema validation for DAPR | DevOps | 2025-03-25 |
| Enforce staging validation in Flux workflows | Platform | 2025-03-27 |
| Improve Grafana alert context for DAPR issues | SRE | 2025-03-30 |

Links
