SRE

Links

Books

SRE Book

6. Monitoring Distributed Systems

The Four Golden Signals

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Latency
  • The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests.
Traffic
  • For a web service, this measurement is usually HTTP requests per second
Errors
  • The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error").
Saturation
  • How "full" your service is. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Availability Chart (Rounded)

| Level   | Per Year | Per Quarter | Per 30 days |
|---------|----------|-------------|-------------|
| 95%     | 18d      | 4.5d        | 1.5d        |
| 99%     | 3d       | 21h         | 7h          |
| 99.9%   | 8h       | 2h          | 43m         |
| 99.95%  | 4h       | 1h          | 21m         |
| 99.99%  | 52m      | 12m         | 4m          |
| 99.999% | 5m       | 1m          | 25s         |
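
These figures follow directly from the arithmetic: allowed downtime = window length * (100 - level) / 100. A minimal sketch in Go that reproduces the chart, assuming a 365-day year, a 90-day quarter, and a 30-day month:

```go
package main

import (
	"fmt"
	"time"
)

// allowedDowntime returns how much downtime a given availability level
// (in percent) permits within a window, e.g. 99.9% over 30 days.
func allowedDowntime(availability float64, window time.Duration) time.Duration {
	return time.Duration((1 - availability/100) * float64(window))
}

func main() {
	levels := []float64{95, 99, 99.9, 99.95, 99.99, 99.999}
	windows := []struct {
		name string
		d    time.Duration
	}{
		{"year", 365 * 24 * time.Hour},
		{"quarter", 90 * 24 * time.Hour},
		{"30 days", 30 * 24 * time.Hour},
	}
	for _, level := range levels {
		fmt.Printf("%8.3f%%:", level)
		for _, w := range windows {
			fmt.Printf("  %s=%v", w.name, allowedDowntime(level, w.d).Round(time.Second))
		}
		fmt.Println()
	}
}
```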

1. What's the difference between DevOps and SRE?

DevOps

  • Reduce Organizational Silos
  • Accept Failure as Normal
  • Implement Gradual Change
  • Leverage Tooling & Automation
  • Measure "Everything"

SRE is a prescriptive way to accomplish the DevOps philosophy, and you could say that class SRE implements DevOps:

  • Share ownership
  • SLOs & Blameless PMs
  • Reduce the cost of failure - canary deploys
  • Automate this year's job away
  • Measure toil and reliability

2. SLIs, SLOs, SLAs, oh my!

SRE

We must first define what availability means for our service, then set target levels of availability, and finally plan what to do in case of failure.

SLIs drive SLOs which inform SLAs

  • SLIs are Service Level Indicators: metrics over time that inform about the health of a service.
  • SLOs are Service Level Objectives: agreed-upon bounds for how often those SLIs must be met.
  • SLAs are business-level agreements which define the service availability for a customer and the penalties for breaking that availability.

SLI - Service Level Indicator

Request latency

95th percentile latency of homepage request over past 5 minutes < 300 ms

Failures per request

The number of errors / total requests < 1%

Batch throughput

???
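
As a concrete illustration of the latency and error-rate SLIs above, here is a minimal sketch in Go that computes a 95th percentile latency and an error rate from a batch of observed requests; the sample data and thresholds are made up for the example:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// request is one observed request: how long it took and whether it failed.
type request struct {
	latency time.Duration
	failed  bool
}

// percentile returns the p-th percentile latency using the nearest-rank method.
func percentile(reqs []request, p float64) time.Duration {
	latencies := make([]time.Duration, len(reqs))
	for i, r := range reqs {
		latencies[i] = r.latency
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	rank := int(math.Ceil(float64(len(latencies)) * p / 100))
	if rank < 1 {
		rank = 1
	}
	return latencies[rank-1]
}

// errorRate returns the fraction of requests that failed.
func errorRate(reqs []request) float64 {
	failed := 0
	for _, r := range reqs {
		if r.failed {
			failed++
		}
	}
	return float64(failed) / float64(len(reqs))
}

func main() {
	// Hypothetical sample of homepage requests from the past 5 minutes.
	sample := []request{
		{90 * time.Millisecond, false},
		{120 * time.Millisecond, false},
		{180 * time.Millisecond, false},
		{250 * time.Millisecond, false},
		{400 * time.Millisecond, true},
	}
	fmt.Println("95th percentile latency:", percentile(sample, 95)) // compare against the 300 ms target
	fmt.Printf("error rate: %.1f%%\n", errorRate(sample)*100)       // compare against the 1% target
}
```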

SLO - Service Level Objectives

SLOs are binding targets for a collection of SLIs

Example:

95th percentile homepage SLI will succeed 99.9% of the time over the trailing year.

SLA - Service Level Agreement

Business agreement between a customer and a service provider, typically based on SLOs.

Example:

Service credits if the 95th percentile homepage SLI succeeds less than 99.5% of the time over the trailing year.

3. Risk and Error Budgets

While we want to reduce the risk of system failure, we also have to accept risk in order to deliver new products and features.

In the SRE discipline, error budgets are the prescriptive, quantitative measurements for how much risk a service is willing to tolerate. Error budgets are the byproduct of the agreed-upon SLOs (Service Level Objectives) between product owners and systems engineers.

Risk and error budgets are directly related to many DevOps principles. Error budgets clearly define that "accidents are normal" by quantifying accidents and risk. Error budgets also enforce that "change should be gradual", because non-gradual changes could quickly break the SLO and prevent further development for the quarter. This is why we say class SRE implements DevOps.

4. Toil and Toil Budgets

Toil is the kind of work, tied to running a production service, that tends to be:

  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of long-term value
  • Scaling linearly as the service grows

Overhead (not toil)

  • Email
  • Expense reports
  • Meetings
  • Traveling

5. Now SRE Everyone Else with CRE!

We need to teach our customers about SRE so they understand our API SLIs and SLOs.

Additionally, it relates closely to "accepting failure as normal" by setting clear expectations with users about service availability. Lastly, it relates to "measuring everything" by sharing service level measurements in a single pane of glass between customers and platform providers. This is why we say class SRE implements DevOps.

6. Managing Risks as a Site Reliability Engineer

Risk Analysis

List of items that may cause SLO violations:

  • Database backup downtime: 120 min * 12 times/year * 100% of users = 1,440 bad minutes / year
  • Slow response times every second Friday: 60 min * 26 times/year * 50% of users = 780 bad minutes / year

Expected Cost

| Risk | TTD | TTR | Freq / Year | Users Affected | Bad Minutes / Year |
|------|-----|-----|-------------|----------------|--------------------|
| Production database | 0 | 2h | 12 | 100% | 1440m |
| Degraded code push during deployment | 30m | 30m | 26 | 50% | 780m |
| Datacenter failure | 15m | 1h | 1 | 100% | 495m |
| Bad code deploy | 45m | 15m | 150 | 25% | 2250m |
| Upstream provider failure | 5m | 30m | 6 | 75% | 157m |

99.5% Error Budget

365 * 24 * 60 * (100 - 99.5) / 100 = 2,628 bad minutes / year
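
Putting the two together: each risk's expected cost is roughly (TTD + TTR) * frequency per year * fraction of users affected, and the sum can be compared against the error budget. A minimal sketch in Go, using a subset of the rows above and assuming that impact model:

```go
package main

import "fmt"

// risk models one row of the risk analysis: expected time to detect (TTD),
// expected time to repair (TTR), incidents per year, and fraction of users hit.
type risk struct {
	name      string
	ttd, ttr  float64 // minutes
	freqYear  float64
	userShare float64 // 0.0 - 1.0
}

// badMinutes is the expected user-impacting minutes per year for this risk.
func (r risk) badMinutes() float64 {
	return (r.ttd + r.ttr) * r.freqYear * r.userShare
}

func main() {
	const slo = 99.5
	budget := 365 * 24 * 60 * (100 - slo) / 100 // about 2,628 bad minutes / year

	risks := []risk{
		{"Production database", 0, 120, 12, 1.0},
		{"Degraded code push during deployment", 30, 30, 26, 0.5},
		{"Bad code deploy", 45, 15, 150, 0.25},
		{"Upstream provider failure", 5, 30, 6, 0.75},
	}

	total := 0.0
	for _, r := range risks {
		m := r.badMinutes()
		total += m
		fmt.Printf("%-40s %6.0f bad minutes/year\n", r.name, m)
	}
	fmt.Printf("total: %.0f, budget: %.0f, within budget: %v\n", total, budget, total <= budget)
}
```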

7. Actionable Alerting for SRE

Alerting on low-level metrics such as CPU usage or disk space doesn't actually show whether our users are experiencing issues with our product or service. Instead, we should build our alerts using our SLOs. By integrating our remaining error budget over time, we can see how outages or partial outages will affect our SLO.

  • Fast burn alert
  • Slow burn alert
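
Both alerts can be expressed in terms of the error-budget burn rate: the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO window. A fast-burn alert pages on a high burn rate over a short window; a slow-burn alert opens a ticket on a lower burn rate over a long window. A minimal sketch in Go; the windows and thresholds here are illustrative assumptions, not prescribed values:

```go
package main

import "fmt"

// burnRate is the observed error rate divided by the error rate the SLO allows.
// A burn rate of 1 consumes exactly the whole error budget over the SLO window;
// higher values exhaust it proportionally faster.
func burnRate(observedErrorRate, sloPercent float64) float64 {
	allowed := 1 - sloPercent/100
	return observedErrorRate / allowed
}

func main() {
	const slo = 99.9 // SLO for the window, in percent

	// Hypothetical measurements: error rate averaged over a short and a long window.
	shortWindowErrRate := 0.02 // 2% errors over the last 5 minutes
	longWindowErrRate := 0.003 // 0.3% errors over the last 6 hours

	// Illustrative thresholds: page on a fast burn, open a ticket on a slow burn.
	const fastBurnThreshold = 14.0
	const slowBurnThreshold = 2.0

	fast := burnRate(shortWindowErrRate, slo)
	slow := burnRate(longWindowErrRate, slo)

	fmt.Printf("fast burn rate (5m window): %.1f, page if > %.0f\n", fast, fastBurnThreshold)
	fmt.Printf("slow burn rate (6h window): %.1f, ticket if > %.0f\n", slow, slowBurnThreshold)
}
```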

8. Observability of Distributed Systems

We use structured logs, metrics, and traces to help SRE and DevOps practitioners find out where systems are broken. We'll use metrics to find slow or erroring queries, traces to find interactions between components, and logs to understand the errors in more detail.

  • Structured logging
  • Metrics (aggregated, typed data: counters such as request counts, gauges such as CPU load, distributions such as latency); see the instrumentation sketch below
  • Traces - execution flows
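
For the metrics part, a minimal white-box instrumentation sketch using the Prometheus Go client, with a counter for requests and a histogram for the latency distribution; the metric names, handler, and port are assumptions for the example:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: monotonically increasing number of handled requests.
	requests = promauto.NewCounter(prometheus.CounterOpts{
		Name: "homepage_requests_total",
		Help: "Total homepage requests handled.",
	})
	// Histogram: latency distribution, from which percentiles can be derived.
	latency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "homepage_request_duration_seconds",
		Help:    "Homepage request latency.",
		Buckets: prometheus.DefBuckets,
	})
)

func homepage(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(latency)
	defer timer.ObserveDuration()
	requests.Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", homepage)
	// Expose the metrics for a Prometheus server to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```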

9. Incident Management

  1. Process for declaring incidents
  2. Dashboard for viewing current incidents
  3. Database of who to contact for each kind of incident

Roles

  • Incident Commander ("IC") - owns the overall plan for the response
  • Operations Lead - does the hands-on operational work
  • Communications Lead
  • Planning Lead
  • Logistics Lead

10. Postmortem and Retrospectives

Blameless Postmortem Metadata

  1. What systems were affected?
  2. Who was involved in responding?
  3. How did we find out about the event?
  4. When did we start responding?
  5. What mitigations did we deploy?
  6. When did the incident conclude?

End of course

Monitoring

The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active (not predicted) problems.

  • Tracing
  • Stats/Metrics
  • zPages
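
A black-box probe only sees what an external user sees: whether the request succeeded and how long it took. A minimal synthetic-probe sketch in Go; the URL, timeout, and latency threshold are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe issues a single synthetic request, the way an external user would,
// and reports only externally visible symptoms: status code and latency.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	elapsed := time.Since(start)

	if resp.StatusCode >= 500 {
		return fmt.Errorf("server error: %s", resp.Status)
	}
	if elapsed > 300*time.Millisecond {
		return fmt.Errorf("too slow: %v", elapsed)
	}
	return nil
}

func main() {
	// Placeholder target; a real probe would run on a schedule and alert on failures.
	if err := probe("https://example.com/", 2*time.Second); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe ok")
}
```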

Why black-box monitoring is not sufficient

  1. Your probe is not real user traffic.
  2. Long-tail latency matters.
  3. You still need to investigate causes.

White-box

  • Instrumentation

Postmortem

Services

Tools

Prometheus, instrumenting, metrics

Logging

OpenCensus Go

OpenCensus Java

Grafana

Video