How To: Connect Dead Man's Switch to OpsGenie

Intro

SpV has configured a number of alerts in both Prometheus and Splunk. These alerts send notifications to specified Slack channels. The alerts monitor specific metrics and fire when a value exceeds a defined threshold.

But what if the whole cluster goes down, and all of the alerts with it? Who is watching the watcher? For this case we implement a feature called a dead man's switch.

Detailed Description

Prometheus (installed on every cluster and included in the sre-system-gitops repo) has a dead man's switch implemented out of the box, called Watchdog. To properly configure the dead man's switch we will:

  1. create an API key in Opsgenie and a kind: Secret object in the cluster containing the API key
  2. create a new Prometheus rule with a new watchdog and custom labels
  3. create an Alertmanager config to send the watchdog signal to a receiver (Opsgenie)
  4. set up Opsgenie to receive the signal from the cluster

Create API key and secret for OpsGenie

  1. Create a heartbeat API key:
    • in Opsgenie navigate to Settings → Integration list → API
    • give the integration a name, copy the autogenerated key and hit Save Integration
    • use the copied key when creating the secret 👇
  2. Create a secret with the heartbeat API key encoded in base64 (a kubectl sketch follows the manifest):
apiVersion: v1
kind: Secret
metadata:
  namespace: monitoring
  name: opsgenie
type: Opaque
data:
  # apiKey is the heartbeat API key encoded in base64
  apiKey: <YOUR-BASE64-ENCODED-API-KEY>
  username: Og==
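
If you prefer to let kubectl handle the base64 encoding, the same secret can be created imperatively. This is a minimal sketch: the key value is a placeholder, and the username ':' mirrors the Og== value shown above.

# encode the copied API key yourself (-n avoids encoding a trailing newline)...
echo -n '<YOUR-API-KEY>' | base64

# ...or let kubectl create and encode the secret in one step
kubectl -n monitoring create secret generic opsgenie \
  --from-literal=apiKey='<YOUR-API-KEY>' \
  --from-literal=username=':'

# verify the secret exists
kubectl -n monitoring get secret opsgenie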

Create a new Prometheus rule

Apply the following YAML to your cluster with kubectl apply -f <file-path>. Change the namespace name based on your requirements; a verification sketch follows the manifest.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: watchdog-opsgenie
  namespace: <namespace-name>
# for Prometheus to load new PrometheusRules, the labels must match
# the ruleSelector labels in the kind: Prometheus configuration
# (check them with: kubectl get prometheus <prometheus-name> -o yaml)
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
spec:
  groups:
  - name: general.rules
    rules:
    - alert: Watchdog
      annotations:
        description: |
          This is an alert meant to ensure that the entire alerting pipeline is functional.
          This alert is always firing, therefore it should always be firing in Alertmanager
          and always fire against a receiver. There are integrations with various notification
          mechanisms that send a notification when this alert is not firing.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-watchdog
        summary: An alert that should always be firing to verify that Alertmanager
          is working properly.
      expr: vector(1)
      labels: # labels are important for routing the notification to the receiver defined in the AlertmanagerConfig
        severity: none
        namespace: monitoring
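
To check that Prometheus actually loaded the rule, compare the labels above with the ruleSelector of your Prometheus object and confirm the PrometheusRule exists. A minimal sketch, assuming the Prometheus object lives in the monitoring namespace (use whatever name kubectl get prometheus returns for your installation):

# confirm the PrometheusRule was created
kubectl -n <namespace-name> get prometheusrule watchdog-opsgenie

# list the Prometheus objects and inspect the ruleSelector labels one of them expects
kubectl -n monitoring get prometheus
kubectl -n monitoring get prometheus <prometheus-name> -o jsonpath='{.spec.ruleSelector}'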

Create Alertmanager config

In the manifest below, replace the heartbeat name in webhookConfigs.url; the same name must be used when creating the heartbeat in Opsgenie. Also change the reference to the secret containing apiKey and username if you used a different one than described in the secret manifest above. As with the Prometheus rule, apply the Alertmanager config to your cluster with kubectl apply -f <path>; a curl sketch for testing the heartbeat URL follows the manifest.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    managedBy: team-sre
  name: alertmanager-opsgenie-config
  namespace: <namespace-name>
spec:
  receivers:
  - name: deadMansSwitch
    webhookConfigs:
      # URL of the specific heartbeat; replace <heartbeat-name> with the name of your heartbeat
      - url: 'https://api.opsgenie.com/v2/heartbeats/<heartbeat-name>/ping'
        sendResolved: true
        httpConfig:
          basicAuth:
            # reference to the secret containing login credentials
            password:
              key: apiKey
              name: opsgenie
            username:
              key: username
              name: opsgenie
  route:
    groupBy:
    - job
    groupInterval: 10s
    groupWait: 0s
    repeatInterval: 10s
    matchers:
      - name: alertname
        value: Watchdog
      - name: namespace
        value: <namespace-name>
    receiver: deadMansSwitch
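
Before relying on Alertmanager, you can ping the heartbeat manually to confirm the URL and API key work. A minimal sketch assuming the standard Opsgenie heartbeat ping endpoint; replace the placeholders with your heartbeat name and the raw (not base64-encoded) API key:

# manually ping the heartbeat; a 2xx status code means the name and key were accepted
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'Authorization: GenieKey <YOUR-API-KEY>' \
  'https://api.opsgenie.com/v2/heartbeats/<heartbeat-name>/ping'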

Set up Opsgenie

  1. Navigate to Settings → Heartbeats → Create heartbeat
  2. Give the heartbeat a name (without spaces) - it will be part of the URL defined in the Alertmanager config
  3. Select team sre
  4. Define the interval within which the heartbeat is expected to be received - if the heartbeat is not received within that time frame, Opsgenie will send an alert to the Slack channel cluster-down that, well, you guessed it, the cluster might be down
  5. Hit Create and enable
  6. Check whether the heartbeat received a signal from the cluster (the status will change to Active) - see the curl sketch after this list
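
The heartbeat status can also be checked from the command line instead of the UI. A minimal sketch assuming the standard Opsgenie heartbeat API; replace the placeholders accordingly:

# fetch the heartbeat details to confirm it exists and is receiving pings
curl -s -H 'Authorization: GenieKey <YOUR-API-KEY>' \
  'https://api.opsgenie.com/v2/heartbeats/<heartbeat-name>'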

Finito

Now whenever the cluster goes down you should receive an alert on :slack:. This will also happen when you upgrade something with the blue/green 💙 💚 strategy and take down the blue cluster.
