PagerDuty - alphagov/notifications-manuals GitHub Wiki

Services and escalation policies

Services

Notify has set up two "services" in PagerDuty. A "service" groups related alerts and has its own integrations and escalation policies.

  • GOV.UK Notify P1 outages
    • Alerts sent by Email, SNS (CloudWatch), Elastalert, Concourse, Managed Prometheus, Cronitor, Sentry and Live Call Routing
    • Sends alerts to Slack incident channel
    • Automatically escalates using the "P1 outages" escalation policy (see below)
    • Allows manual escalation to "Escalate to Managers" escalation policy
  • GOV.UK Notify Warnings
    • Alerts sent by Email, SNS (CloudWatch), Elastalert, Concourse, Alertmanager (Prometheus), Managed Prometheus, Cronitor and Sentry
    • Sends alerts to Slack incident channel
    • Automatically escalates using the "Warnings" escalation policy (see below)
    • Allows manual escalation to "Escalate to Managers" escalation policy

Escalation policies

There are three escalation policies configured in PagerDuty:

  • Escalate to Managers - This policy is not linked to a service and is therefore not in use
  • P1 outages - Escalates P1 outages to:
    • Tech leads during office hours
    • The out-of-hours (OOH) team at other times
    • Notify managers, if an incident is still unacknowledged after 5 minutes
  • Warnings - Escalates lower priority alerts to Tech leads during office hours
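The "P1 outages" routing above can be read as a small decision function. The sketch below is illustrative only: the office-hours window (09:30 to 17:30, Monday to Friday) is an assumption that does not come from the PagerDuty configuration, and the function and constant names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed office hours; the real schedule lives in PagerDuty and may differ.
OFFICE_START = time(9, 30)
OFFICE_END = time(17, 30)

# From the policy description: managers are notified after 5 minutes.
MANAGER_ESCALATION_DELAY = timedelta(minutes=5)


def is_office_hours(now: datetime) -> bool:
    """True on weekdays between the assumed office start and end times."""
    return now.weekday() < 5 and OFFICE_START <= now.time() < OFFICE_END


def p1_responders(triggered_at: datetime, now: datetime, acknowledged: bool) -> list[str]:
    """Return who the "P1 outages" policy has notified by `now`."""
    first_line = "Tech leads" if is_office_hours(triggered_at) else "OOH team"
    responders = [first_line]
    # Unacknowledged incidents escalate to managers once the delay has passed.
    if not acknowledged and now - triggered_at >= MANAGER_ESCALATION_DELAY:
        responders.append("Notify managers")
    return responders
```

For example, an unacknowledged incident raised on a Saturday would go to the OOH team first and reach Notify managers five minutes later.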

PagerDuty integrations

Concourse PagerDuty alerts

Concourse has PagerDuty integration keys for both the GOV.UK Notify P1 outages and GOV.UK Notify Warnings services. The Jinja templates set the priority_level for each alert, which determines which integration key is used.

| Type     | Name                           | PagerDuty service |
| -------- | ------------------------------ | ----------------- |
| smoke    | file-uploads                   | Warnings          |
| smoke    | text-message-sending           | P1 outages        |
| smoke    | email-sending                  | P1 outages        |
| smoke    | letter-sending                 | Warnings          |
| provider | text-message-delivery-receipts | Warnings          |
| provider | text-message-receiving         | Warnings          |
| provider | email-delivery-receipts        | P1 outages        |
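To illustrate how a priority_level could select an integration key, here is a minimal sketch that builds a PagerDuty Events API v2 "trigger" event. The priority level names ("p1" and "warning"), the placeholder key values, the summary wording, and the function name are all assumptions made for illustration; the real values are defined in the Concourse Jinja templates and secrets.

```python
import json

# Placeholder integration keys (hypothetical); real keys are secrets held by Concourse.
INTEGRATION_KEYS = {
    "p1": "P1_OUTAGES_INTEGRATION_KEY",
    "warning": "WARNINGS_INTEGRATION_KEY",
}

# Events API v2 severities chosen for each assumed priority level.
SEVERITY = {"p1": "critical", "warning": "warning"}

# Endpoint that accepts Events API v2 events via HTTP POST.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(check_type: str, name: str, priority_level: str) -> dict:
    """Build the JSON body for a PagerDuty Events API v2 "trigger" event."""
    return {
        # The routing key is the integration key of the target service.
        "routing_key": INTEGRATION_KEYS[priority_level],
        "event_action": "trigger",
        "payload": {
            "summary": f"{check_type} test failed: {name}",
            "source": "concourse",
            "severity": SEVERITY[priority_level],
        },
    }


event = build_trigger_event("smoke", "email-sending", "p1")
print(json.dumps(event, indent=2))
```

Sending the resulting JSON as a POST body to `EVENTS_API_URL` would open an incident on whichever service owns the routing key, which is why choosing the key is equivalent to choosing between the P1 outages and Warnings services.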

Sentry PagerDuty alerts

See the information on the Sentry page.

Holiday

Before hols

  • Wait until the day before you go on holiday
  • Log in to PagerDuty and go to Escalation Policies (under the "People" nav item): https://governmentdigitalservice.pagerduty.com/escalation_policies
  • For each of "GOV.UK Notify - P1 outages" and "GOV.UK Notify - Warnings":
    • Click the cog icon to edit
    • Remove your on-call schedule from step 1 ("Notify the following users or schedules")
  • Consider setting a Slackbot reminder for your first morning back to remind you to re-add yourself to the schedule.

After hols

When you return from holiday, add your on-call schedule back to step 1 of each escalation policy by following the steps above in reverse.
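If you ever script this instead of using the UI, the change amounts to removing one schedule target from the first rule of each policy and adding it back later. The sketch below only manipulates a dictionary shaped like the escalation policy object in the PagerDuty REST API (an `escalation_rules` list whose entries each have a `targets` list); it makes no API calls, and the function names and the schedule ID `PSCHED1` used in the example are hypothetical.

```python
import copy


def remove_schedule(policy: dict, schedule_id: str) -> dict:
    """Return a copy of `policy` with the schedule removed from step 1."""
    updated = copy.deepcopy(policy)
    first_rule = updated["escalation_rules"][0]
    first_rule["targets"] = [
        t for t in first_rule["targets"]
        if not (t["type"] == "schedule_reference" and t["id"] == schedule_id)
    ]
    return updated


def add_schedule(policy: dict, schedule_id: str) -> dict:
    """Return a copy of `policy` with the schedule present in step 1."""
    updated = copy.deepcopy(policy)
    targets = updated["escalation_rules"][0]["targets"]
    # Avoid adding a duplicate if the schedule is already listed.
    if not any(t["id"] == schedule_id for t in targets):
        targets.append({"id": schedule_id, "type": "schedule_reference"})
    return updated


# Example policy with a hypothetical schedule ID.
policy = {
    "escalation_rules": [
        {"targets": [
            {"id": "PSCHED1", "type": "schedule_reference"},
            {"id": "PSCHED2", "type": "schedule_reference"},
        ]}
    ]
}
before_holiday = remove_schedule(policy, "PSCHED1")
after_holiday = add_schedule(before_holiday, "PSCHED1")
```

Both helpers return a modified copy rather than changing the input, so the original policy can be kept as a record of what to restore.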