SRE
Grafana Alloy
Grafana Loki
Go
- grafana/alloy/blob/main/docs/developer/contributing.md
- go.dev/wiki/CodeReviewComments
- go.dev/wiki/TestComments
- go.dev/doc/effective_go
- google.github.io/styleguide/go/decisions
- uber-go/guide/blob/master/style.md
- go.dev/blog/package-names
- go.dev/doc/code
- go.dev/wiki/CommonMistakes
- Organizing Go code
- peter.bourgon.org/go-in-production/#formatting-and-style
- peter.bourgon.org/go-for-industrial-programming/
OLD Stuff
Links
- GCP - Transparent SLI
- GCP - monitoring
- Transparent SLIs: See Google Cloud the way your application experiences it
- meeting-reliability-challenges-with-sre-principles
Books
SRE Book
6. Monitoring Distributed Systems
The Four Golden Signals
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
Latency
- The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests.
Traffic
- For a web service, this measurement is usually HTTP requests per second.
Errors
- The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error").
Saturation
- Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
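As a concrete illustration of instrumenting the four golden signals, here is a minimal Go sketch using prometheus/client_golang. The metric names, labels, and the saturation source are assumptions for this example, not taken from any particular service.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Traffic: total requests served, labelled by response code.
	requests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests.",
	}, []string{"code"})

	// Latency: request duration histogram, also labelled by response code
	// so successful and failed requests can be distinguished.
	latency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request latency.",
		Buckets: prometheus.DefBuckets,
	}, []string{"code"})

	// Saturation: e.g. how full a bounded work queue is (0..1).
	saturation = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "work_queue_utilization_ratio",
		Help: "Fraction of queue capacity in use.",
	})
)

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	code := "200"
	w.Write([]byte("ok"))
	requests.WithLabelValues(code).Inc()
	latency.WithLabelValues(code).Observe(time.Since(start).Seconds())
}

func main() {
	saturation.Set(0.1) // would normally be updated by the queue itself
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	http.ListenAndServe(":8080", nil)
}
```

The error rate (the fourth signal) can then be derived from the code label on the request counter, e.g. the share of 5xx responses among all requests.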
Availability Chart (Rounded)
Level | Per Year | Per Quarter | Per 30 days |
---|---|---|---|
95% | 18d | 4.5d | 1.5d |
99% | 3d | 21h | 7h |
99.9% | 8h | 2h | 43m |
99.95% | 4h | 1h | 21m |
99.99% | 52m | 12m | 4m |
99.999% | 5m | 1m | 25s |
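These figures follow from allowed downtime = (1 - availability) * period. For example, at 99.9% over 30 days: 0.001 * 30 * 24 * 60 ≈ 43 minutes; over a year: 0.001 * 365 * 24 ≈ 8.8 hours.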
1. What's the difference between DevOps and SRE?
DevOps
- Reduce Organizational Silos
- Accept Failure as Normal
- Implement Gradual Change
- Leverage Tooling & Automation
- Measure "Everything"
SRE is a prescriptive way to accomplish the DevOps philosophy; you could say that class SRE implements DevOps:
- Share ownership
- SLOs & Blameless PMs
- Reduces costs of failure - canary deploy
- Automate this year's job away
- Measure toil and reliability
2. SLIs, SLOs, SLAs, oh my!
SRE
We must first define what availability means, then set target levels of availability, and finally plan what to do in case of failure.
SLIs drive SLOs which inform SLAs
- SLIs are Service Level Indicators or metrics over time which inform about the health of a service
- SLOs are Service Level Objectives which are agreed upon bounds for how often those SLIs must be met.
- SLAs are business-level agreements which define the service availability for a customer and the penalties for breaking that availability.
SLI - Service Level Indicator
Request latency
95th percentile latency of homepage request over past 5 minutes < 300 ms
Failures per request
The number of errors / total requests < 1%
Batch throughput
???
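A minimal Go sketch of what the request-latency SLI above means mechanically: compute the 95th percentile of latencies observed in a window and compare it to the 300 ms target. The sample data and function names are made up for illustration; in practice the percentile would come from your metrics system.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// p95 returns the 95th percentile of the given latency samples.
func p95(samples []time.Duration) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(0.95 * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// Latencies observed for homepage requests over the past 5 minutes.
	window := []time.Duration{
		120 * time.Millisecond, 180 * time.Millisecond, 250 * time.Millisecond,
		90 * time.Millisecond, 400 * time.Millisecond, 210 * time.Millisecond,
	}
	target := 300 * time.Millisecond

	if got := p95(window); got < target {
		fmt.Printf("SLI met: p95=%v < %v\n", got, target)
	} else {
		fmt.Printf("SLI missed: p95=%v >= %v\n", got, target)
	}
}
```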
SLO - Service Level Objectives
SLOs are binding targets for a collection of SLIs
Example:
The 95th percentile homepage latency SLI will be met 99.9% of the time over the trailing year.
SLA - Service Level Agreement
Business agreement between a customer and a service provider typically based on SLOs
Example:
Service credits are issued if the 95th percentile homepage SLI is met less than 99.5% of the time over the trailing year.
3. Risk and Error Budgets
While we want to reduce the risk of system failure, we also have to accept risk in order to deliver new products and features.
In the SRE discipline, error budgets are the prescriptive, quantitative measurements for how much risk a service is willing to tolerate. Error budgets are the byproduct of the agreed-upon SLOs (Service Level Objectives) between product owners and systems engineers.
Risk and error budgets are directly related to many DevOps principles. Error budgets clearly define that "accidents are normal" by quantifying accidents and risk. Error budgets also enforce that "change should be gradual", because non-gradual changes could quickly break the SLO and prevent further development for the quarter. This is why we say class SRE implements DevOps.
Resources
4. Toil and Toil Budgets
Toil is work tied to running a production service that is:
- Manual
- Repetitive
- Automatable
- Tactical
- Devoid of long-term value
- Scales linearly as service grows
Overhead (Not toil)
- Expense reports
- Meetings
- Traveling
Resource
5. Now SRE Everyone Else with CRE!
We need to teach our customers about SRE so they understand our API SLIs and SLOs.
Additionally, CRE relates closely to "accepting failure as normal" by setting clear expectations with users about service availability. Lastly, it relates to "measuring everything" by sharing service level measurements in a single pane of glass between customers and platform providers. This is why we say class SRE implements DevOps.
Resource
6. Managing Risks as a Site Reliability Engineer
Risk Analysis
List of items that may cause an SLO violation:
- Database backup downtime:
120 min * 12/year * 100% of users = 1,440 bad minutes / year
- Slow response times on every second Friday:
60 min * 26 weeks/year * 50% of users = 780 bad minutes / year
Expected Cost
Risk | TTD | TTR | Freq / Year | Users | Bad / Year |
---|---|---|---|---|---|
Production database | 0 | 2h | 12 | 100% | 1440m |
Degraded code push during deployment | 30m | 30m | 26 | 50% | 780m |
Datacenter failure | 15m | 8h | 1 | 100% | 495m |
Bad code deploy | 45m | 15m | 150 | 25% | 2250m |
Upstream provider failure | 5m | 30m | 6 | 75% | 157m |
99.5% Error Budget
365 * 24 * 60 * (100 - 99.5) / 100 = 2,628 bad minutes / year
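To make the arithmetic behind the table explicit, here is a minimal Go sketch that computes each risk's expected cost as (TTD + TTR) * frequency * affected users and compares the total to the 99.5% budget. The risk struct and field names are invented for this example; the numbers mirror the table above.

```go
package main

import "fmt"

type risk struct {
	name        string
	ttdMinutes  float64 // time to detect
	ttrMinutes  float64 // time to repair
	freqPerYear float64 // expected occurrences per year
	userFrac    float64 // fraction of users affected
}

func (r risk) badMinutesPerYear() float64 {
	return (r.ttdMinutes + r.ttrMinutes) * r.freqPerYear * r.userFrac
}

func main() {
	slo := 0.995
	budget := 365 * 24 * 60 * (1 - slo) // ≈ 2,628 bad minutes/year

	risks := []risk{
		{"production database", 0, 120, 12, 1.0},
		{"degraded code push during deployment", 30, 30, 26, 0.5},
		{"datacenter failure", 15, 480, 1, 1.0},
		{"bad code deploy", 45, 15, 150, 0.25},
		{"upstream provider failure", 5, 30, 6, 0.75},
	}

	var total float64
	for _, r := range risks {
		total += r.badMinutesPerYear()
	}
	fmt.Printf("expected bad minutes/year: %.0f (budget %.0f)\n", total, budget)
}
```

With these illustrative numbers the expected cost (~5,100 bad minutes/year) exceeds the budget (~2,628), which is the signal to reduce TTD, TTR, frequency, or blast radius for the largest risks.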
7. Actionable Alerting for SRE
Alerting on low-level metrics such as CPU usage or disk space doesn't actually tell us whether users are experiencing issues with our product or service. Instead, we should build our alerts from our SLOs. By tracking how quickly the remaining error budget is being consumed (the burn rate), we can see how outages or partial outages affect the SLO.
- Fast burn alert
- Slow burn alert
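Worked example (illustrative numbers, following the multi-window burn-rate approach described in the SRE Workbook): burn rate = (fraction of budget consumed) / (fraction of the SLO window elapsed). Consuming 2% of a 30-day budget within 1 hour is a burn rate of 0.02 * 720h / 1h ≈ 14.4 (fast burn alert); consuming 10% within 3 days is a burn rate of about 1 (slow burn alert).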
Resource
- Drilling down into Stackdriver Service Monitoring
- creating-a-dashboard-with-stackdriver-sli-monitoring
8. Observability of Distributed Systems
We use structured logs, metrics, and traces to help SRE and DevOps practitioners find out where systems are broken. We use metrics to find slow or erroring queries, traces to follow interactions between components, and logs to understand the errors in more detail.
- Structured logging
- Metrics (aggregated, typed data): counters (e.g. requests), gauges (e.g. CPU load), distributions (e.g. latency)
- Traces: execution flows
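A minimal sketch of structured logging in Go using the standard library's log/slog package; field names and values are illustrative. Emitting JSON makes the fields queryable downstream, e.g. in Loki.

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// JSON output keeps log fields machine-parseable for a log pipeline.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	logger.Info("request handled",
		slog.String("route", "/checkout"),
		slog.Int("status", 500),
		slog.Duration("latency", 240*time.Millisecond),
		slog.String("trace_id", "abc123"), // correlate logs with traces
	)
}
```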
Resources
- stackdriver
- drilling-down-into-stackdriver-service-monitoring
- Improving Reliability with Error Budgets, Metrics, and Tracing in Stackdriver (Cloud Next '18)
9. Incident Management
- Process for declaring incidents
- Dashboard for viewing current incidents
- Database of who to contact for each kind of incident
Roles
- Incident Commander ("IC") - holds the high-level view and plan, assigns responsibilities
- Operations Lead - does the hands-on operational work to mitigate the incident
- Communications Lead
- Planning Lead
- Logistics Lead
10. Postmortem and Retrospectives
Blameless Postmortem Metadata
- What systems were affected?
- Who was involved in responding?
- How did we find out about the event?
- When did we start responding?
- What mitigations did we deploy?
- When did the incident conclude?
End of course
Monitoring
The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active (not predicted) problems.
- Tracing
- Stats/Metrics
- zPages
Why black-box monitoring is not sufficient
- Your probe is not real user traffic.
- Long-tail latency matters.
- You still need to investigate causes.
White-box
- Instrumentation
Postmortem
Services
Tools
Prometheus, instrumenting, metrics