How To: connect the dead man's switch to Opsgenie - theartusz/config GitHub Wiki
Intro
SpV has configured a number of alerts in both Prometheus and Splunk. These alerts send notifications to specified Slack channels. The alerts monitor specific metrics and fire when a value exceeds a defined threshold.
But what if the whole cluster goes down, and all of the alerts with it? Who is watching the watcher? For this case we implement a feature called a dead man's switch.
Detailed Description
Prometheus (installed on every cluster and included in the sre-system-gitops repo) has a dead man's switch implemented out of the box, called Watchdog. To properly configure the dead man's switch we will:
- create an API key in Opsgenie and create a kind: Secret object in the cluster containing the API key
- create a new PrometheusRule with a new watchdog and custom labels
- create an AlertmanagerConfig to send the watchdog signal to the receiver (Opsgenie)
- set up Opsgenie to receive the signal from the cluster
- Create a heartbeat API key:
  - in Opsgenie navigate to Settings → Integration list → API
  - give the integration your name, copy the autogenerated key, and hit Save Integration
  - use the copied key when creating the secret 👇
- Create a secret with the heartbeat API key encoded in base64
```yaml
apiVersion: v1
kind: Secret
metadata:
  namespace: monitoring
  name: opsgenie
type: Opaque
data:
  # apiKey is encoded in base64
  apiKey: <YOUR-PASSWORD>
  # base64-encoded ':'
  username: Og==
```
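If you are unsure how to produce the base64 values, a quick shell sketch (`<YOUR-API-KEY>` is a placeholder for the key copied from the integration page, not a real key):

```shell
# Encode the copied API key for the Secret's data.apiKey field
# (replace the placeholder with the real key before running):
echo -n '<YOUR-API-KEY>' | base64

# The username value Og== is simply ':' encoded in base64:
echo -n ':' | base64   # prints Og==
```

Note the `-n` flag: without it, `echo` appends a trailing newline that gets encoded into the value and breaks authentication.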
Apply the following YAML to your cluster with kubectl apply -f <file-path>. Change the namespace name based on your requirements.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: watchdog-opsgenie
  namespace: <namespace-name>
  # for Prometheus to load new PrometheusRules, the labels must match
  # the ruleSelector labels in the kind: Prometheus configuration
  # (kubectl get prometheus <prometheus-name> -o yaml)
  labels:
    app: kube-prometheus-stack
    release: kube-prometheus-stack
spec:
  groups:
    - name: general.rules
      rules:
        - alert: Watchdog
          annotations:
            description: |
              This is an alert meant to ensure that the entire alerting pipeline is functional.
              This alert is always firing, therefore it should always be firing in Alertmanager
              and always fire against a receiver. There are integrations with various notification
              mechanisms that send a notification when this alert is not firing.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-watchdog
            summary: An alert that should always be firing to verify that Alertmanager
              is working properly.
          expr: vector(1)
          # labels are important for sending the notification to the right
          # channel defined in the AlertmanagerConfig
          labels:
            severity: none
            namespace: monitoring
```
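Why the labels matter: the Prometheus Operator only loads PrometheusRule objects whose labels match the ruleSelector in the Prometheus custom resource. A minimal Python sketch of the matchLabels semantics (the selector value below is an assumption for illustration - check your own Prometheus CR):

```python
def selector_matches(match_labels: dict, labels: dict) -> bool:
    """Kubernetes matchLabels semantics: every selector pair must be present."""
    return all(labels.get(key) == value for key, value in match_labels.items())

# hypothetical ruleSelector from a kube-prometheus-stack Prometheus CR
rule_selector = {"release": "kube-prometheus-stack"}

rule_labels = {"app": "kube-prometheus-stack", "release": "kube-prometheus-stack"}
assert selector_matches(rule_selector, rule_labels)          # rule is loaded
assert not selector_matches(rule_selector, {"app": "other"})  # rule is ignored
```

If the labels do not match, the rule is silently ignored and the Watchdog never fires, so this is the first thing to check when the heartbeat stays Inactive.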
In the manifest below, replace <heartbeat-name> in webhookConfigs.url; the same heartbeat name must be used when creating the heartbeat in Opsgenie. Also change the references to the secret containing apiKey and username if you used names different from those in the Secret manifest above. As with the PrometheusRule, apply the AlertmanagerConfig to your cluster with kubectl apply -f <path>.
```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    managedBy: team-sre
  name: alertmanager-opsgenie-config
  namespace: <namespace-name>
spec:
  receivers:
    - name: deadMansSwitch
      webhookConfigs:
        # url of the specific heartbeat; replace <heartbeat-name> with your heartbeat name
        - url: 'https://api.opsgenie.com/v2/heartbeats/<heartbeat-name>/ping'
          sendResolved: true
          httpConfig:
            basicAuth:
              # references to the secret containing the login credentials
              password:
                key: apiKey
                name: opsgenie
              username:
                key: username
                name: opsgenie
  route:
    groupBy:
      - job
    groupInterval: 10s
    groupWait: 0s
    repeatInterval: 10s
    matchers:
      - name: alertname
        value: Watchdog
      - name: namespace
        value: <namespace-name>
    # must match the receiver name above (case-sensitive)
    receiver: deadMansSwitch
```
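The route's matchers decide which alerts reach the deadMansSwitch receiver: only the Watchdog alert with the matching namespace label, which is why the custom labels on the PrometheusRule matter. A small Python sketch of equality-matcher semantics (simplified - real Alertmanager also supports regex and negative matchers):

```python
def route_matches(matchers: list, alert_labels: dict) -> bool:
    """Equality matchers: every matcher's name/value pair must be on the alert."""
    return all(alert_labels.get(m["name"]) == m["value"] for m in matchers)

matchers = [
    {"name": "alertname", "value": "Watchdog"},
    {"name": "namespace", "value": "monitoring"},
]

# the Watchdog alert carries both labels, so it is routed to the heartbeat receiver
assert route_matches(matchers, {"alertname": "Watchdog", "namespace": "monitoring", "severity": "none"})
# any other alert falls through to the default route instead
assert not route_matches(matchers, {"alertname": "KubePodCrashLooping", "namespace": "monitoring"})
```

The aggressive `repeatInterval: 10s` is deliberate: the always-firing Watchdog must keep pinging the heartbeat more often than the Opsgenie interval you configure below.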
- Navigate to Settings → Heartbeats → Create heartbeat
- Give the heartbeat any name you want (without spaces) - it will be part of the URL defined in the AlertmanagerConfig
- Select the team sre
- Define the interval within which the heartbeat is expected to be received - if no heartbeat arrives within that time frame, Opsgenie will send an alert to the Slack channel cluster-down that, well, you guessed it, the cluster might be down
- Hit Create and enable
- Check that the heartbeat received a signal from the cluster (the status will change to Active)
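The interval is the heart of the dead man's switch: Alertmanager keeps re-sending the always-firing Watchdog, and Opsgenie alerts only when the pings stop. A minimal sketch of that timeout logic (illustrative only, not Opsgenie's actual implementation):

```python
from datetime import datetime, timedelta

def heartbeat_expired(last_ping: datetime, interval: timedelta, now: datetime) -> bool:
    """Dead man's switch: alert when no ping has arrived within the interval."""
    return now - last_ping > interval

now = datetime(2024, 1, 1, 12, 0)
interval = timedelta(minutes=1)  # hypothetical heartbeat interval

# Alertmanager pinged 30 seconds ago: heartbeat healthy, no alert
assert not heartbeat_expired(now - timedelta(seconds=30), interval, now)
# cluster down, last ping 5 minutes ago: Opsgenie raises the alert
assert heartbeat_expired(now - timedelta(minutes=5), interval, now)
```

Pick the interval comfortably larger than the Alertmanager repeatInterval, or a single delayed ping will page the team for no reason.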
Now whenever the cluster goes down you should receive an alert on Slack. This will also happen when you upgrade something with the blue-green (💙 💚) strategy and take down the blue cluster.