Monitoring - kimschles/schlesinger-knowledge GitHub Wiki
Chapter 6 of the SRE Book
Monitoring
Monitoring and alerting enable a system to tell us when it’s broken, or perhaps to tell us what’s about to break.
Effective alerting systems have good signal and very low noise.
A good monitoring system tells you what's broken (the symptom) and why (the cause).
The Four Golden Signals
- Latency
  - The time it takes to respond to a request
- Traffic
  - Web: HTTP requests per second
  - Key/value storage: transactions and retrievals per second
- Errors
- Saturation
  - Knowing when your service is 'full'
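To make the four signals concrete, here is a minimal sketch that summarizes them for one time window. The `Request` record, the `golden_signals` helper, the choice of the 99th percentile, and the use of in-flight requests over capacity as the saturation measure are all assumptions for illustration, not something the SRE Book prescribes.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time taken to respond to the request
    status: int         # HTTP status code


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize the four golden signals over one time window."""
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 99th percentile; None when the window is empty.
    p99 = durations[math.ceil(0.99 * len(durations)) - 1] if durations else None
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": p99,
        "traffic_rps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "saturation": in_flight / capacity,  # how 'full' the service is
    }


sample = [Request(20.0, 200), Request(35.0, 200), Request(900.0, 500), Request(40.0, 200)]
signals = golden_signals(sample, window_seconds=2, in_flight=30, capacity=40)
print(signals)
```

A real service would usually get these numbers from its metrics system rather than from raw request records.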
When monitoring, keep things simple:
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible. Data collection, aggregation, and alerting configuration that is rarely exercised (e.g., less than once a quarter for some SRE teams) should be up for removal. Signals that are collected, but not exposed in any prebaked dashboard nor used by any alert, are candidates for removal.
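One way to find rarely exercised rules is to record when each alert last fired and list the ones that have been idle for longer than a quarter. This is only a sketch: the `removal_candidates` helper, the rule names, and the 90-day cutoff are made up for illustration.

```python
from datetime import datetime, timedelta


def removal_candidates(last_fired, now, max_idle=timedelta(days=90)):
    """Return names of rules that never fired or have been idle longer than max_idle."""
    return sorted(
        name
        for name, fired_at in last_fired.items()
        if fired_at is None or now - fired_at > max_idle
    )


now = datetime(2024, 6, 1)
last_fired = {
    "HighErrorRatio": datetime(2024, 5, 20),   # fired recently, keep
    "DiskAlmostFull": datetime(2023, 11, 2),   # idle for more than a quarter
    "LegacyQueueDepth": None,                  # never fired
}
candidates = removal_candidates(last_fired, now)
print(candidates)
```

The result is a list to review, not to delete blindly: a rule can be quiet because the failure it guards against is rare but serious.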
Chapter 4 of the SRE Workbook
https://landing.google.com/sre/workbook/chapters/monitoring/
Metrics and structured logging
A good monitoring system has the following attributes:
- Speed
  - How 'fresh' do you need your data to be?
  - How fast can you retrieve your data?
- Calculations
- Interfaces
  - You should give people different options when looking at the data (types of graphs, ways of drilling down into the data)
- Alerts
  - Some alerts are more important than others. How do you classify your alerts? (Slack-only vs. PagerDuty)
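The alert classification question can be sketched as a small routing table. The two severity levels, the classification rule, and the channel names below are assumptions for illustration. No real Slack or PagerDuty API is called.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # someone must act now
    NOTIFY = "notify"  # worth knowing, can wait until working hours


# Hypothetical channel names; a real setup would call each service's API here.
ROUTES = {
    Severity.PAGE: ["pagerduty", "slack"],
    Severity.NOTIFY: ["slack"],
}


def classify(slo_at_risk: bool, user_visible: bool) -> Severity:
    """Assumed rule: page only when users are affected and the SLO is threatened."""
    return Severity.PAGE if slo_at_risk and user_visible else Severity.NOTIFY


def channels_for(slo_at_risk: bool, user_visible: bool) -> list:
    """Look up where an alert with these properties should be delivered."""
    return ROUTES[classify(slo_at_risk, user_visible)]


print(channels_for(slo_at_risk=True, user_visible=True))
print(channels_for(slo_at_risk=False, user_visible=True))
```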
Sources of Monitoring Data
- Sources of monitoring data: logs, metrics, distributed tracing and runtime introspection
- Metrics are numbers that represent attributes and events
- Logs are an append-only record of events
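The difference between a metric and a log shows up when the same request is recorded both ways. In this sketch, which uses only the standard library, the metric is a counter keyed by labels and the log is a list of JSON lines. The `record_request` helper and the field names are assumptions for illustration.

```python
import json
from collections import Counter

request_count = Counter()  # metric: one number per (path, status) label pair
event_log = []             # log: append-only record, one JSON line per event


def record_request(path, status, duration_ms):
    """Record a single request as both a metric increment and a structured log line."""
    request_count[(path, status)] += 1
    event_log.append(json.dumps(
        {"event": "request", "path": path, "status": status, "duration_ms": duration_ms},
        sort_keys=True,
    ))


record_request("/home", 200, 12.5)
record_request("/home", 200, 8.0)
record_request("/cart", 500, 301.2)

print(request_count[("/home", 200)])  # aggregated: individual requests are gone
print(event_log[-1])                  # per-event detail is preserved
```

The counter is cheap to store and query but loses per-request detail. The log keeps every event but takes more work to aggregate.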
Metrics with Purpose
- Alert when your SLI metrics show that your error budget is under threat
- SLI metrics should be easy to see on the landing page of your dashboard
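A worked example of "error budget under threat", using a burn-rate calculation, which is one common way to express it. The function names, the 99.9% target, and the request counts are assumptions chosen to keep the arithmetic easy to follow.

```python
def error_budget_remaining(slo_target, total, failed):
    """Fraction of the period's error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


def burn_rate(slo_target, window_total, window_failed):
    """Speed of budget consumption in a window; 1.0 means exactly on budget."""
    return (window_failed / window_total) / (1 - slo_target)


# A 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)

# In the last window, 50 of 10,000 requests failed: a 0.5% error ratio
# against a 0.1% allowance, so the budget is burning about five times too fast.
rate = burn_rate(0.999, window_total=10_000, window_failed=50)

print(round(remaining, 3), round(rate, 3))
```

An alert could then fire when `rate` stays above a chosen threshold. The right threshold depends on how long the team can tolerate that rate before the budget runs out.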
Intended Changes
- Make sure you can tweak alerting so that it accounts for the changes you have made to your codebase
Dependencies
- Monitor responses coming from important dependencies
Saturation
Status of Served Traffic
Testing Alerting Logic
- Write tests for your monitoring systems
- Good luck: you'll probably have to develop a DSL (domain-specific language) to do it
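If an alert rule can be written as a pure function, it can be tested like any other code without a DSL. The sketch below assumes exactly that. `should_page`, its threshold, and the "three consecutive samples" condition are hypothetical. Many monitoring systems keep their rules in their own configuration language instead, which is where the DSL problem comes from.

```python
def should_page(error_ratios, threshold=0.05, for_samples=3):
    """Fire only when the last `for_samples` samples all exceed the threshold."""
    recent = error_ratios[-for_samples:]
    return len(recent) == for_samples and all(r > threshold for r in recent)


def test_brief_spike_does_not_page():
    # A single bad sample followed by recovery should stay quiet.
    assert not should_page([0.01, 0.20, 0.01, 0.02])


def test_sustained_errors_page():
    # Three consecutive samples above 5% should page.
    assert should_page([0.01, 0.08, 0.09, 0.10])


def test_too_little_data_does_not_page():
    # Fewer samples than required is not enough evidence.
    assert not should_page([0.5, 0.5])


# Run the checks directly; a test runner such as pytest would also discover them.
test_brief_spike_does_not_page()
test_sustained_errors_page()
test_too_little_data_does_not_page()
```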