Monitoring - kimschles/schlesinger-knowledge GitHub Wiki

Chapter 6 of the SRE Book

Monitoring

Monitoring and alerting enable a system to tell us when it’s broken, or perhaps to tell us what’s about to break.

Effective alerting systems have good signal and very low noise.

A good monitoring system tells you what's broken (the symptom) and why (the cause).

The Four Golden Signals

Latency
- The time it takes to respond to a request
Traffic
- Web: HTTP requests per second
- Key/value storage: transactions and retrievals per second
Errors
Saturation
- Knowing when your service is 'full'
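
To make the four signals concrete, here is a minimal sketch that summarizes one window of requests. The `Request` record, the `golden_signals` function, and the choice of p99 for latency, 5xx for errors, and in-flight requests over capacity for saturation are all assumptions for illustration, not definitions from the book.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one window of requests as the four golden signals."""
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 99th percentile.
        rank = max(1, math.ceil(0.99 * len(durations)))
        p99 = durations[rank - 1]
    else:
        p99 = 0.0
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests) if requests else 0.0,
        # Assumption: 'full' means in-flight requests have reached capacity.
        "saturation": in_flight / capacity,
    }
```

What counts as 'full' depends on the service. Memory, disk, or queue depth may be a better saturation measure than in-flight requests.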

When monitoring, keep things simple:

The rules that catch real incidents most often should be as simple, predictable, and reliable as possible. Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal. Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
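
One way to act on this is to review which alert rules have not fired recently. The sketch below assumes you can export a mapping from rule name to the last time it fired. The `removal_candidates` name and the 90-day default (roughly one quarter) are hypothetical.

```python
from datetime import datetime, timedelta


def removal_candidates(last_fired, now, max_idle=timedelta(days=90)):
    """Return the names of rules that have not fired within max_idle, sorted.

    last_fired maps rule name -> datetime of the last firing, or None if the
    rule has never fired.
    """
    return sorted(
        name
        for name, fired_at in last_fired.items()
        if fired_at is None or now - fired_at > max_idle
    )


# Example with made-up rule names:
rules = {
    "high_error_rate": datetime(2024, 5, 20),
    "disk_full": datetime(2023, 12, 1),
    "legacy_queue_depth": None,
}
stale = removal_candidates(rules, now=datetime(2024, 6, 1))
```

A rule on this list is only a candidate. Some rarely firing rules guard against rare but severe failures and are worth keeping.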

Chapter 4 of the SRE Workbook

https://landing.google.com/sre/workbook/chapters/monitoring/

Metrics and structured logging

A good monitoring system has the following attributes:

Speed
- How 'fresh' do you need your data to be?
- How fast can you retrieve your data?
Calculations
Interfaces
- Give people different options for looking at the data (types of graphs, ways of drilling down into the data)
Alerts
- Some alerts are more important than others. How do you classify your alerts? (Slack-only vs. PagerDuty)
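
A classification like this can be as small as a severity level mapped to a destination. The sketch below is hypothetical: the `Severity` levels and the channel names are placeholders, and it does not call any real Slack or PagerDuty API.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


def route(severity):
    """Map an alert severity to a notification channel name (placeholder names)."""
    if severity is Severity.CRITICAL:
        return "pager"      # wakes someone up
    if severity is Severity.WARNING:
        return "chat"       # posted to a team channel only
    return "dashboard"      # visible, but notifies no one
```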

Sources of Monitoring Data

Sources of monitoring data: logs, metrics, distributed tracing and runtime introspection
Metrics are numbers that represent attributes and events
Logs are an append-only record of events
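
The difference between the two can be shown in a few lines. This is a minimal in-memory sketch, not a real metrics or logging library: `request_count`, `event_log`, and `record_request` are made-up names.

```python
import json
from collections import Counter

request_count = Counter()  # metric: one number per (path, status) pair
event_log = []             # log: append-only record, one entry per event


def record_request(path, status, duration_ms):
    """Record a single request as both a metric increment and a structured log line."""
    request_count[(path, status)] += 1
    event_log.append(
        json.dumps(
            {"path": path, "status": status, "duration_ms": duration_ms},
            sort_keys=True,
        )
    )


record_request("/home", 200, 12.5)
record_request("/home", 200, 8.0)
```

After the two calls, the metric holds a single aggregated number for `("/home", 200)`, while the log holds two separate entries that keep the per-request detail.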

Metrics with Purpose

Alert when your SLI metrics show that your error budget is under threat
SLI metrics should be easy to see on the landing page of your dashboard
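
One common way to express "under threat" is a burn rate: how fast errors are consuming the budget relative to the rate the SLO allows. The sketch below assumes a ratio-style SLI. The function names are hypothetical, and the default threshold of 14.4 (2% of a 30-day budget spent in one hour) is an assumption that should be tuned per service.

```python
def burn_rate(error_ratio, slo_target):
    """Return how fast the error budget is being spent.

    1.0 means errors arrive exactly at the rate the SLO allows; higher values
    mean the budget will run out before the SLO window ends.
    """
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget


def budget_under_threat(error_ratio, slo_target, threshold=14.4):
    """True if the observed error ratio burns the budget faster than threshold."""
    return burn_rate(error_ratio, slo_target) > threshold
```

For example, with a 99.9% SLO, a 2% error ratio gives a burn rate of about 20, which is above the assumed threshold, while a 0.1% error ratio gives a burn rate of about 1.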

Intended Changes

Make sure your monitoring tells you when intended changes, such as a new release of your codebase, reach production

Dependencies

Monitor responses coming from important dependencies
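
A simple way to do this is to wrap every call to a dependency and record its outcome and latency. The sketch below is an in-memory illustration under that assumption: `call_dependency`, `dependency_outcomes`, and `dependency_latency_s` are hypothetical names, and a real service would export these values to its metrics system.

```python
import time
from collections import Counter

dependency_outcomes = Counter()  # (dependency name, "ok" | "error") -> count
dependency_latency_s = {}        # dependency name -> list of call durations


def call_dependency(name, fn, *args, **kwargs):
    """Call fn and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        result = fn(*args, **kwargs)
    except Exception:
        dependency_outcomes[(name, "error")] += 1
        raise  # the caller still sees the original failure
    else:
        dependency_outcomes[(name, "ok")] += 1
        return result
    finally:
        # Runs on both success and failure, so every call is timed.
        dependency_latency_s.setdefault(name, []).append(time.monotonic() - start)
```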

Saturation

Status of Served Traffic

Testing Alerting Logic

Write tests for your monitoring systems
Good luck, you'll have to develop a DSL
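
Short of a full DSL, alerting logic that lives in a plain function can be tested with synthetic data. The sketch below is hypothetical: `should_alert` stands in for a rule that fires only when a value stays above a threshold for several consecutive samples, and the tests feed it hand-written series.

```python
def should_alert(samples, threshold, for_samples):
    """True if the last `for_samples` values are all above `threshold`."""
    if for_samples < 1:
        raise ValueError("for_samples must be at least 1")
    if len(samples) < for_samples:
        return False  # not enough data to decide yet
    return all(value > threshold for value in samples[-for_samples:])


def test_fires_on_sustained_breach():
    assert should_alert([0.1, 0.6, 0.7, 0.8], threshold=0.5, for_samples=3)


def test_ignores_brief_spike():
    # The dip to 0.1 inside the last three samples resets the condition.
    assert not should_alert([0.6, 0.7, 0.1, 0.8], threshold=0.5, for_samples=3)


def test_needs_enough_samples():
    assert not should_alert([0.9, 0.9], threshold=0.5, for_samples=3)
```

Tests named like this can be collected by a runner such as pytest, or called directly.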