# Partner Team Incident Reports

## 04-01-2023: EE Team - EP Merge

On 04-01-2023 at 3:38 PM Pacific time, the VRO team received a message in our support channel from a member of the EE team that their EP Merge application was down in the production environment. They had been notified by a DataDog health check alert they had set up.

We had received an alert in our alerts channel at 3:30 PM the same day that there were unavailable pods in prod. However, this alert alone does not necessarily indicate that an application will experience downtime. These alerts are most often triggered by cluster changes made by the LHDI team, which cause many new instances of pods to begin spinning up. The K8S agent waits for these new pods to report a ready status before evicting older versions of the pods. Once all of the new pods report ready status, the alert is usually resolved without any action from VRO engineers. So while we were alerted to a potential issue before our partner team engineer reported it, it was not clear from that alert alone that their application had begun to experience downtime.
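For context, the readiness-gated rollout behavior described above is configured in the deployment manifest via a rolling-update strategy and a readiness probe. A minimal sketch follows; the names, endpoint, and thresholds are illustrative assumptions, not values from the actual VRO charts:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app                  # hypothetical name, not the actual VRO deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0              # do not evict an old pod until a replacement is ready
      maxSurge: 1                    # allow one extra pod to spin up during the rollout
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example/app:1.0.0   # placeholder image
          readinessProbe:            # the rollout waits for this to pass before evicting old pods
            httpGet:
              path: /health          # assumed health endpoint
              port: 8080
            periodSeconds: 5
```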

A discovery ticket was created (https://github.com/department-of-veterans-affairs/abd-vro/issues/2816), along with a follow-up ticket for execution. These tickets focus on ensuring that pod evictions are handled gracefully so that applications hosted on the VRO platform experience greater stability.
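One common pattern for handling evictions gracefully, offered here only as a sketch of the kind of change the discovery work might consider (not an outcome of the ticket), is a PodDisruptionBudget that keeps at least one replica available during voluntary disruptions such as node drains. Names and values below are illustrative assumptions:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb        # hypothetical name
spec:
  minAvailable: 1              # keep at least one replica running during node drains
  selector:
    matchLabels:
      app: example-app         # must match the labels on the application's pods
```

Pairing this with a `preStop` hook and an adequate `terminationGracePeriodSeconds` in the pod spec would also give an application time to drain in-flight requests before it is terminated.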

## 04-22-2024: CC Team

On 04-22-2024 at 8:10 AM (MDT, Mountain time), the VRO team received a message in our support channel from a member of the CC team that the CC API service was down in the production environment. They had noticed that something wasn't right and were reviewing Datadog health check logs as a result. The last log entry indicated that the application was shutting down, with no further output after that.

After acknowledging their message, both the Primary and Secondary on-call engineers began investigating the outage as the top priority. During this initial troubleshooting, it was quickly identified that the CC service's Helm configuration was unexpectedly broken. A known issue (a typo in the Helm config) that was present in the develop branch of abd-vro had somehow been deployed to prod. Additionally, the deployment referenced an image that had not been signed by SecRel and therefore would never run in the production environment. This combination of broken Helm config and incorrect image reference was preventing redeployment via the GitHub Actions workflows that normally perform the deployment.

To bring things back online, Erik applied the needed Helm config fix and located the most recently SecRel-signed CC image. With both of these in hand, he ran the helm upgrade command locally. This approach let the VRO team apply the patched Helm config and pin a SecRel-signed CC image without waiting on GitHub Actions workflows to complete, recovering from the outage sooner.
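For reference, the manual recovery amounted to something like the following, assuming the patched chart is checked out locally and the chart exposes the image tag as a value; the release name, chart path, namespace, and tag are placeholders, not the exact values used during the incident:

```sh
# Apply the patched chart and pin a SecRel-signed image tag explicitly.
helm upgrade <cc-release-name> ./path/to/cc-chart \
  --namespace <prod-namespace> \
  --set image.tag=<most-recent-secrel-signed-tag>

# Confirm the new pods come up healthy before declaring the outage resolved.
kubectl rollout status deployment/<cc-deployment-name> --namespace <prod-namespace>
```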

Datadog HTTP status monitoring of the CC service shows that the outage began on Apr 19, 2024 at 11:52 AM and that the service returned to a fully operational state on Apr 22, 2024 at 11:34:43 AM.

Further root cause analysis will be performed in issue #2883. The VRO team also plans to develop an incident response plan, as described in #2570.
