Post Incident Review (aka Postmortem): User API Outage - March 22, 2025

Status

Status	Description
Draft	The postmortem is being written. Initial facts are collected.
Awaiting Root Cause	The incident is known but root cause is still being investigated.
In Review	The draft is under internal or peer review.
Pending Approval	Awaiting sign-off from leads or responsible engineers.
Approved	Approved and ready to publish or share.
Completed	The postmortem is published. Action items are assigned but not all are closed.
Follow-up Required	Additional remediation steps remain open.
Closed	All actions are resolved. The incident lifecycle is fully completed.
Canceled	The postmortem was intentionally abandoned. No further work will be done.
Obsolete	Superseded by a newer or merged postmortem.
Reopened	Incident was previously closed or canceled but has reoccurred or evolved.

Incident Date

2025-03-22 14:17 UTC

Summary

The user-api service (built with .NET and fronted by YARP) was down for approximately 37 minutes (14:17 to 14:54 UTC) due to a misconfigured DAPR pub/sub component. The misconfiguration shipped in a redeploy triggered via Flux and put the service into a crash loop. Login and registration endpoints were unavailable on all platforms for the duration.

Detection

Grafana OnCall fired automatic alerts at 14:19 UTC, 2 minutes after the deployment. No end-user reports were received before the alerts fired.

Impact

12,400 login attempts failed (approx. 8% of daily average)
220 user registrations dropped (approx. 5% of daily average)
Customer support ticket volume increased by 30%
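
As a rough sanity check on the percentages above, the daily averages they imply can be back-calculated. This is only a sketch: the function name is hypothetical, and because the source percentages are approximate ("approx."), the results are approximate too.

```python
def implied_daily_average(affected: int, percent_of_daily: int) -> int:
    """Back-calculate the daily average implied by 'affected = percent_of_daily % of average'."""
    return round(affected * 100 / percent_of_daily)

# Figures taken from the Impact list above.
implied_logins = implied_daily_average(12_400, 8)
implied_registrations = implied_daily_average(220, 5)

print(f"Implied daily login attempts: ~{implied_logins:,}")
print(f"Implied daily registrations:  ~{implied_registrations:,}")
```

With the numbers reported here, this works out to roughly 155,000 login attempts and 4,400 registrations per day.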

Customer Impact

Users could not log in or register via web or mobile apps.
Frustration noted in support interactions, but no escalations.
All disrupted sessions resumed automatically post-recovery.

Communications

Incident acknowledged publicly on the status page at 14:28 UTC
Internal comms handled via Google Chat (#incident-response space)
Final postmortem published in GitHub Wiki and linked from the status page

Timeline (UTC)

14:17 - Deployment triggered via Flux GitOps
14:19 - Grafana OnCall alerts fired
14:22 - SRE began triage
14:34 - Root cause identified in component.yaml (DAPR config)
14:41 - Hotfix committed and redeployed via Flux
14:54 - All systems stable and healthy
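
The response intervals implied by this timeline (time to detect, time to root cause, time to recover) can be derived mechanically. The following is a minimal sketch, assuming all entries fall on the same day; the event labels and the function name are shortened or made up for illustration.

```python
from datetime import datetime

# Timeline entries copied from above (all times UTC, same day).
TIMELINE = [
    ("14:17", "deployment triggered"),
    ("14:19", "alerts fired"),
    ("14:22", "triage began"),
    ("14:34", "root cause identified"),
    ("14:41", "hotfix redeployed"),
    ("14:54", "systems stable"),
]

def minutes_since_start(timeline):
    """Map each event to the whole minutes elapsed since the first entry."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[0][0], fmt)
    return {
        event: int((datetime.strptime(stamp, fmt) - start).total_seconds() // 60)
        for stamp, event in timeline
    }

elapsed = minutes_since_start(TIMELINE)
for event, minutes in elapsed.items():
    print(f"{minutes:>3} min  {event}")
```

Read this way, detection took 2 minutes, root cause identification 17 minutes, and full recovery 37 minutes from the triggering deployment.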

Root Cause

A malformed RabbitMQ topic name was introduced in the DAPR pubsub component (dapr-components/component.yaml). The invalid configuration caused DAPR to fail during initialization, placing the user-api service into a crash loop. No validation mechanism was present in the CI pipeline to detect the error before deployment.

Resolution and Recovery (How did we respond?)

After Grafana OnCall fired alerts at 14:19 UTC, the SRE team began triage at 14:22 UTC. DAPR sidecar logs and repeated container restarts pointed to a subscription initialization failure, which was traced to the incorrect RabbitMQ topic. The configuration was fixed, committed, and redeployed via Flux at 14:41 UTC. All health checks passed by 14:54 UTC. No data restoration was needed.

What Went Well

Grafana OnCall alerted early and reliably
Strong collaboration between SRE and development teams
Deployment via Flux was smooth once the fix was applied
Observability tools (Loki, Tempo) made tracing root cause efficient

What Went Wrong

No schema validation for DAPR component configurations
Direct production deployment skipped staging validation
Grafana alert context lacked details on the failing DAPR component

Corrective Measures (What did we do to fix the current problem?)

Corrected RabbitMQ topic in dapr-components/component.yaml
Redeployed user-api service via Flux with fixed config
Enhanced Grafana alert templates with component-level failure context
Introduced a temporary PowerShell pre-deploy validation step for DAPR configs

Preventative Measures (How are we preventing similar issues in the future?)

CI pipeline now includes schema validation for all DAPR configs
All production deployments require successful staging rollout
On-call runbook updated with procedures for DAPR crash loops and service restarts
Scheduled internal training session covering DAPR config best practices (RabbitMQ, Redis)
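
To illustrate the kind of check a CI validation step for DAPR configs might perform, here is a minimal structural sketch. It is not the team's actual pipeline step and not an official schema: it only checks fields that Dapr component manifests commonly carry (apiVersion, kind, metadata.name, spec.type, spec.version, and spec.metadata entries). The component name and connection string in the samples are hypothetical. The check works on an already-parsed document; in a real pipeline the dictionary would come from a YAML parser, which is assumed here rather than shown.

```python
EXPECTED_API_VERSION = "dapr.io/v1alpha1"

def validate_component(doc):
    """Return a list of problems found; an empty list means the structure looks sane."""
    if not isinstance(doc, dict):
        return ["document is not a mapping"]

    errors = []
    if doc.get("apiVersion") != EXPECTED_API_VERSION:
        errors.append(f"apiVersion should be {EXPECTED_API_VERSION!r}")
    if doc.get("kind") != "Component":
        errors.append("kind should be 'Component'")

    metadata = doc.get("metadata")
    if not isinstance(metadata, dict) or not metadata.get("name"):
        errors.append("metadata.name is required")

    spec = doc.get("spec")
    if not isinstance(spec, dict):
        errors.append("spec is required")
        return errors

    component_type = spec.get("type")
    if not isinstance(component_type, str) or "." not in component_type:
        errors.append("spec.type should look like '<building-block>.<implementation>'")
    if not spec.get("version"):
        errors.append("spec.version is required")

    # Each spec.metadata entry needs a name plus either a literal value or a secret reference.
    for index, item in enumerate(spec.get("metadata") or []):
        if not isinstance(item, dict) or not item.get("name"):
            errors.append(f"spec.metadata[{index}] needs a name")
        elif "value" not in item and "secretKeyRef" not in item:
            errors.append(f"spec.metadata[{index}] ({item['name']}) needs value or secretKeyRef")
    return errors

# Hypothetical sample documents, standing in for parsed component files.
good = {
    "apiVersion": "dapr.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "user-api-pubsub"},
    "spec": {
        "type": "pubsub.rabbitmq",
        "version": "v1",
        "metadata": [{"name": "connectionString", "value": "amqp://localhost:5672"}],
    },
}
bad = {
    "apiVersion": "dapr.io/v1alpha1",
    "kind": "Component",
    "metadata": {},
    "spec": {"type": "rabbitmq", "metadata": [{"name": "connectionString"}]},
}

print(validate_component(good))
for problem in validate_component(bad):
    print("ERROR:", problem)
```

A structural check of this kind would not by itself catch a malformed topic name. A value-level rule would have to be added for that, and its exact form depends on details of the bad value that this review does not record.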

Action Items

Action	Owner	Due Date
Add CI-based schema validation for DAPR	DevOps	2025-03-25
Enforce staging validation in Flux workflows	Platform	2025-03-27
Improve Grafana alert context for DAPR issues	SRE	2025-03-30

Templates (PIR) - cilerler/cilerler.github.io GitHub Wiki
