Backend Support Alarm Runbook - a2n-seattle/rms-app GitHub Wiki
Default
- Check in using the :eyes: emoji on the chatbot message.
- Find the Root Cause of the issue using the Dashboard.
  - Check Metrics to find the problematic service.
    - Start from the top of the tech stack and move down until you find the offending service. (The Dashboard is ordered with the services at the top of the stack at the top of the page.) This gives you an idea of what might have gone wrong.
  - Check Logs to find the root cause of the specific problem that triggered the alarm.
  - Document the root cause (with the corresponding error message, if applicable) in the chatbot alert thread.
  - If the root cause is not clear, tag @Jeremy Yau in the thread and he will help take a look.
- Propose a Mitigating Action in the thread.
  - If no Mitigating Action is required, mark the chatbot message with a ❎.
  - If a code change is required to take the Mitigating Action:
    - Create a GitHub ticket for the Mitigating Action to track progress on the fix.
    - Once the GitHub ticket is created, comment the link to the ticket in the thread, then mark the issue as resolved by marking the chatbot message with a ✅.
    - Depending on urgency, the code change can be made immediately or put into the backlog to be prioritized by the PM.
  - If another Mitigating Action is required, reach out to @Jeremy Yau.
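
The mitigation branch above can be read as a small decision table. The following is only an illustrative sketch of that logic, not part of any existing tooling: the names `Mitigation`, `Resolution`, and `resolve` are hypothetical, and the runbook does not specify a reaction for the "other action" case, so it is left as `None` here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Mitigation(Enum):
    """Hypothetical labels for the three outcomes described in the runbook."""
    NONE = "none"                # no Mitigating Action required
    CODE_CHANGE = "code_change"  # a code change is required
    OTHER = "other"              # some other Mitigating Action is required


@dataclass(frozen=True)
class Resolution:
    reaction: Optional[str]      # emoji to put on the chatbot message, if any
    steps: Tuple[str, ...]       # follow-up steps, in order


def resolve(mitigation: Mitigation) -> Resolution:
    """Map a mitigation outcome to the reaction and follow-up steps."""
    if mitigation is Mitigation.NONE:
        return Resolution("❎", ())
    if mitigation is Mitigation.CODE_CHANGE:
        return Resolution(
            "✅",
            (
                "Create a GitHub ticket for the Mitigating Action",
                "Comment the ticket link in the alert thread",
                "Fix immediately or put into the backlog, depending on urgency",
            ),
        )
    # The runbook names no reaction for this case (assumption: leave unmarked).
    return Resolution(None, ("Reach out to @Jeremy Yau",))


if __name__ == "__main__":
    for outcome in Mitigation:
        result = resolve(outcome)
        print(outcome.name, result.reaction, list(result.steps))
```

Note that in the code-change case the ✅ is applied once the ticket is created and linked, not once the fix ships.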