What to Do When You Have No Idea What Is Wrong

Sometimes, things crash, and we have no idea why. SimpleReport is a system that depends on several external pieces of infrastructure, and any one of them can run into issues.

Consider this document a "Choose Your Own Adventure" guide to navigating an incident. If you run into a situation that isn't covered here, congratulations! You're going to have an amazing story to tell. Once you've recovered from the ordeal, make sure you add your knowledge to this document. Even the smallest of contributions will help others as we move forward.

Let's get started, shall we?

START HERE: Is it Azure?

Remember: SimpleReport depends on Microsoft Azure. If Azure is having trouble with its data center network, whether due to a physical problem at a data center or a networking trunk issue, the critical components that make up SimpleReport may go offline. When these outages are at their worst, the Azure portal may also become inaccessible.

What to Check

Your first stop will be to check the Azure Status page. This should give you a detailed picture of what condition Azure's services are in.

Unfortunately, Microsoft is notorious for not updating this status page until after an incident has been resolved or, in the case of a very bad incident, until they get enough pressure on Twitter. They may also be unable to update the page at all if any of its dependencies are hosted on their ailing cloud.

To get a better idea of Azure's condition, check out DownDetector. This status is crowdsourced, and it can usually tell you that Microsoft is having a problem before Microsoft knows about it.

How to Respond

If Azure is the problem:

  • Turn on Maintenance Mode.
  • Monitor. There's nothing more you can do until Azure comes back up.

If Azure is NOT the problem: Proceed to the next section.

Check our Subscriptions

If Azure is working, we need to run a sanity check to make sure our Azure subscription is properly allocated.

What to Check

Fetch a fresh SU password from the CDC network. Once you have that, log in to Azure with your credentials.

What you do next depends on the result of your login.

How to Respond

If you're not able to log in:

  • Escalate the PagerDuty alert to your backup. You will need someone else to handle on-call until your permissions are rectified.

If you're not able to log in, and you ARE the backup:

If you're able to log in, and cannot see any of our subscriptions:

  • Escalate the PagerDuty alert to your backup. You will need someone else to handle on-call until your permissions are rectified.

If you're able to log in, you cannot see any of our subscriptions, and you ARE the backup:

If none of the above:

  • Proceed to the next section.

Check the Backend App Service

The backend consists of our SimpleReport App Service and PostgreSQL Flexible Server instance. Logs from the former should indicate problems with the latter, and will provide you with helpful debugging information.

What to Check

Follow the steps in our Container Debugging document.
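Once you have the logs in hand, a rough first pass is to sort the error messages into the categories this section responds to. The sketch below is only an illustration: the keyword patterns are hypothetical guesses (for example, that a changelog lock failure mentions "change log lock"), not strings confirmed from SimpleReport's actual log output, so adjust them to whatever you really see.

```python
import re

# Hypothetical keyword patterns, checked in order. They are guesses at what
# the log text might contain, not confirmed SimpleReport log formats.
CATEGORY_PATTERNS = [
    ("changelog_lock", re.compile(r"change\s?log\s?lock", re.IGNORECASE)),
    ("okta", re.compile(r"okta", re.IGNORECASE)),
    ("database", re.compile(r"postgres|jdbc|database", re.IGNORECASE)),
]


def classify_log_line(line: str) -> str:
    """Return the first matching category, or 'anything_else'."""
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(line):
            return category
    return "anything_else"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        category = classify_log_line(line)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

A summary dominated by one category is a hint about which branch to follow, not a diagnosis. Read the actual messages before acting.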

How to Respond

If the error messages indicate a problem acquiring a database changelog lock:

  • Follow the steps here.

If the error messages indicate a problem with application startup, and the problem is the DATABASE:

If the error messages indicate a problem with Okta:

If the error messages indicate a problem with ANYTHING ELSE:

If swapping slots doesn't solve the problem:

If none of the above work:

Check the Database

Without our database running the show, SimpleReport will grind to a halt. This is arguably our most critical component outside of the code itself.

What to Check

First, make sure the database isn't in a maintenance window. Details on managed maintenance, including how to find out when it is scheduled, can be found in Microsoft's documentation.

Next, check the basics. Does the database exist? Is the server started and running? Perform your standard sanity checks. If you need detailed instructions for troubleshooting, including diagnosing metrics that are outside of expected ranges, start here.
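If you prefer the command line to the portal for the "is it started?" check, the idea can be sketched as follows. This assumes you have already captured the server's JSON description (for example, from `az postgres flexible-server show`) and that it carries a `state` field with values such as "Ready" or "Stopped". Both the field name and the values are assumptions to verify against your own output, and the returned step names are made up for this example.

```python
import json

# Assumed transitional states; verify against what Azure actually reports.
TRANSITIONAL_STATES = {"starting", "stopping", "updating"}


def database_next_step(server_json: str) -> str:
    """Map a server description (a JSON object) to a rough next step."""
    server = json.loads(server_json)
    state = str(server.get("state", "")).strip().lower()
    if state == "stopped":
        return "start"  # the server is turned off
    if state == "ready":
        return "check_responsiveness"  # started, so see if it responds
    if state in TRANSITIONAL_STATES:
        return "monitor"  # wait for the transition to finish
    return "investigate"  # missing or unrecognized state
```

The function only reads JSON that you hand it. It never calls Azure itself, so it is safe to run against saved output during an incident.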

How to Respond

If the database is in a maintenance window:

  • Monitor. There's nothing more you can do until maintenance is finished. Maintenance should be happening on a weekend outside of business hours, and usually lasts no more than an hour, so you won't need to worry about maintenance mode for this.

If the maintenance window lasts until the start of business hours:

  • Call DevSecOps, and make some popcorn. There will be some fun calls with Microsoft Support for you to listen to in the near future.

If the database is turned off:

  • Turn it on by clicking the "Start" button.

If the database is started, but isn't responding:

If the restart doesn't fix the issue:

  • Attempt to connect to the database using the Bastion procedure. More information about that is included in our secure docset, which can only be accessed with a CDC account by members of the CDCent org.

If you are unable to Bastion into the database:

If the database doesn't exist:

  • Did you make a typo in the search box? Are you in the right subscription?

If you're sure you typed it correctly, and you verified that you're in the right subscription:

Check Akamai

Ah, Akamai: wonderful purveyor of rapid data and blocker of unwanted DDoS attacks when it works, and an endless source of confusion and misery when it doesn't.

Akamai is our Content Delivery Network (CDN) layer. It is controlled through a CDC-owned panel, with a dedicated support team.

What to Check

First, make sure the backend is working. Follow the steps in Check the Backend App Service above to verify that.

If the backend is up, try to navigate to the website again. If you see this screen:

[Image: the Akamai error screen]

...your problem is likely Akamai. The screen above is what renders when Akamai logs an error while trying to load the origin site. To confirm, navigate to the following URL, replacing <env> with your environment's shortcode:

https://origin-<env>.simplereport.gov

What you do next will depend on what you see at the link above. Specifically, check whether you are seeing a certificate error of any sort.
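If you would rather not squint at a browser warning, the origin check can be sketched with Python's standard library. This is a minimal sketch, not an official tool: the function names and result labels are made up for illustration, and the `http_error` outcome (the origin answers, but with an error status) is an extra case this document does not cover.

```python
import ssl
import urllib.error
import urllib.request


def origin_url(env: str) -> str:
    """Build the origin URL for an environment shortcode."""
    return f"https://origin-{env}.simplereport.gov"


def classify_failure(exc: Exception) -> str:
    """Tell certificate problems apart from other connection failures."""
    # urllib wraps low-level errors in URLError and keeps them in .reason.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLCertVerificationError):
        return "certificate_error"
    return "does_not_load"


def probe_origin(env: str, timeout: float = 10.0) -> str:
    """Request the origin once and report what happened."""
    try:
        with urllib.request.urlopen(origin_url(env), timeout=timeout):
            return "loads"
    except urllib.error.HTTPError:
        # The TLS handshake worked, but the origin sent an error status.
        return "http_error"
    except (urllib.error.URLError, OSError) as exc:
        return classify_failure(exc)
```

Calling `probe_origin` with your environment's shortcode returns one of the labels above. Treat the result as a hint, and confirm in a browser before calling anyone.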

How to Respond

If you see the Akamai error screen, and the origin link loads with no certificate error:

  • Call DevSecOps. Make it their problem...you'll thank me later. (Editor's Note: This statement was written by a DevSecOps engineer.)

If you see the Akamai error screen, and the origin link loads WITH A CERTIFICATE ERROR:

  • Call DevSecOps. They forgot to renew an SSL certificate, or Azure somehow lost it. Regardless, it's their problem now.

If you see the Akamai error screen, and the origin link DOES NOT LOAD:

Check Okta

We use Okta as our central clearinghouse for authentication and authorization throughout the app. If Okta goes down, we're dead in the water.

What to Check

To see if Okta is reporting an outage, you can check their status page. They tend to be pretty good about providing updates, and will likely send notifications via email, as well.

You can also leverage DownDetector as an early-warning system for potential issues.

How to Respond

If Okta reports an outage:

  • Turn on Maintenance Mode.
  • Monitor. There's nothing more you can do until Okta comes back up.

If Okta is up, but authorization and authentication are still broken:

  • Call DevSecOps. There is likely a configuration issue that needs to be taken care of, or Terraform decided to nuke something. Regardless, it's time to make this their problem.


Politely Panic

If you've reached this part of the document...now you can start to panic. But, let's panic constructively!

If you have a secondary on-call member, escalate your alert to them.

If you are the secondary, or have no secondary to escalate to, now is the time to escalate the alert to the entirety of the on-call list. Make direct phone calls if you have to; our goal is to get as many eyes on the problem as possible.
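The two escalation rules above can be written down as a tiny function. It is purely illustrative: the names are hypothetical, it does not talk to PagerDuty, and leaving yourself out of the "everyone" list is an assumption made for this example.

```python
def escalation_targets(me, secondary, on_call_list):
    """Return who to alert next, following the two rules above."""
    # Rule 1: if there is a secondary and it isn't you, escalate to them.
    if secondary is not None and secondary != me:
        return [secondary]
    # Rule 2: otherwise, escalate to everyone else on the on-call list.
    return [person for person in on_call_list if person != me]
```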

If you hit this step, please remember one thing: breathe. You've done all you can; this is why we are a team. We're all here to help, and we will get through this!
