SRE

Links

Books

SRE Book

6. Monitoring Distributed Systems

The Four Golden Signals

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Latency
  • The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests.
Traffic
  • For a web service, this measurement is usually HTTP requests per second
Errors
  • The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error").
Saturation
  • How "full" your service is. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Availability Chart (Rounded)

| Level   | Per Year | Per Quarter | Per 30 days |
|---------|----------|-------------|-------------|
| 95%     | 18d      | 4.5d        | 1.5d        |
| 99%     | 3d       | 21h         | 7h          |
| 99.9%   | 8h       | 2h          | 43m         |
| 99.95%  | 4h       | 1h          | 21m         |
| 99.99%  | 52m      | 12m         | 4m          |
| 99.999% | 5m       | 1m          | 25s         |
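
These figures follow directly from the arithmetic: allowed downtime = window length * (100 - level) / 100. A minimal sketch in Go that reproduces the chart, assuming a 365-day year, a 90-day quarter, and a 30-day month:

```go
package main

import (
	"fmt"
	"time"
)

// allowedDowntime returns how much downtime a given availability level
// (in percent) permits within a window, e.g. 99.9% over 30 days.
func allowedDowntime(availability float64, window time.Duration) time.Duration {
	return time.Duration((1 - availability/100) * float64(window))
}

func main() {
	levels := []float64{95, 99, 99.9, 99.95, 99.99, 99.999}
	windows := []struct {
		name string
		d    time.Duration
	}{
		{"year", 365 * 24 * time.Hour},
		{"quarter", 90 * 24 * time.Hour},
		{"30 days", 30 * 24 * time.Hour},
	}
	for _, level := range levels {
		fmt.Printf("%8.3f%%:", level)
		for _, w := range windows {
			fmt.Printf("  %s=%v", w.name, allowedDowntime(level, w.d).Round(time.Second))
		}
		fmt.Println()
	}
}
```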

1. What's the difference between DevOps and SRE?

DevOps

  • Reduce Organizational Silos
  • Accept Failure as Normal
  • Implement Gradual Change
  • Leverage Tooling & Automation
  • Measure "Everything"

SRE is a prescriptive way to accomplish the DevOps philosophy, and you could say that class SRE implements DevOps:

  • Share ownership
  • SLOs & Blameless PMs
  • Reduce the cost of failure - canary deploys
  • Automate this year's job away
  • Measure toil and reliability

2. SLIs, SLOs, SLAs, oh my!

SRE

We must first define what availability means for our service, then set target levels of availability, and finally plan what to do in case of failure.

SLIs drive SLOs which inform SLAs

  • SLIs are Service Level Indicators: metrics over time that inform about the health of a service.
  • SLOs are Service Level Objectives: agreed-upon bounds for how often those SLIs must be met.
  • SLAs are business-level agreements which define the service availability for a customer and the penalties for breaking that availability.

SLI - Service Level Indicator

Request latency

95th percentile latency of homepage request over past 5 minutes < 300 ms

Failures per request

The number of errors / total requests < 1%

Batch throughput

???
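
As a concrete illustration of the latency and error-rate SLIs above, here is a minimal sketch in Go that computes a 95th percentile latency and an error rate from a batch of observed requests; the sample data and thresholds are made up for the example:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// request is one observed request: how long it took and whether it failed.
type request struct {
	latency time.Duration
	failed  bool
}

// percentile returns the p-th percentile latency using the nearest-rank method.
func percentile(reqs []request, p float64) time.Duration {
	latencies := make([]time.Duration, len(reqs))
	for i, r := range reqs {
		latencies[i] = r.latency
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	rank := int(math.Ceil(float64(len(latencies)) * p / 100))
	if rank < 1 {
		rank = 1
	}
	return latencies[rank-1]
}

// errorRate returns the fraction of requests that failed.
func errorRate(reqs []request) float64 {
	failed := 0
	for _, r := range reqs {
		if r.failed {
			failed++
		}
	}
	return float64(failed) / float64(len(reqs))
}

func main() {
	// Hypothetical sample of homepage requests from the past 5 minutes.
	sample := []request{
		{90 * time.Millisecond, false},
		{120 * time.Millisecond, false},
		{180 * time.Millisecond, false},
		{250 * time.Millisecond, false},
		{400 * time.Millisecond, true},
	}
	fmt.Println("95th percentile latency:", percentile(sample, 95)) // compare against the 300 ms target
	fmt.Printf("error rate: %.1f%%\n", errorRate(sample)*100)       // compare against the 1% target
}
```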

SLO - Service Level Objectives

SLOs are binding targets for a collection of SLIs

Example:

95th percentile homepage SLI will succeed 99.9% of the time over the trailing year.

SLA - Service Level Agreement

Business agreement between a customer and a service provider, typically based on SLOs.

Example:

Service credits if the 95th percentile homepage SLI succeeds less than 99.5% of the time over the trailing year.

3. Risk and Error Budgets

While we want to reduce the risk of system failure, we also have to accept risk in order to deliver new products and features.

In the SRE discipline, error budgets are the prescriptive, quantitative measurements for how much risk a service is willing to tolerate. Error budgets are the byproduct of the agreed-upon SLOs (Service Level Objectives) between product owners and systems engineers.

Risk and error budgets are directly related to many DevOps principles. Error budgets clearly define that "accidents are normal" by quantifying accidents and risk. Error budgets also enforce that "change should be gradual", because non-gradual changes could quickly break the SLO and prevent further development for the quarter. This is why we say class SRE implements DevOps.

4. Toil and Toil Budgets

Toil is the kind of work, tied to running a production service, that tends to be:

  • Manual
  • Repetitive
  • Automatable
  • Tactical
  • Devoid of long-term value
  • Scaling linearly as the service grows

Overhead (not toil)

  • Email
  • Expense reports
  • Meetings
  • Traveling

5. Now SRE Everyone Else with CRE!

We need to teach our customers about SRE so they understand our API SLIs and SLOs.

Additionally, it relates closely to "accepting failure as normal" by setting clear expectations with users about service availability. Lastly, it relates to "measuring everything" by sharing service level measurements in a single pane of glass between customers and platform providers. This is why we say class SRE implements DevOps.

6. Managing Risks as a Site Reliability Engineer

Risk Analysis

List of items that may cause SLO violations:

  • Database backup downtime: 120 min * 12 times/year * 100% of users = 1,440 bad minutes / year
  • Slow response times every second Friday: 60 min * 26 times/year * 50% of users = 780 bad minutes / year

Expected Cost

| Risk | TTD | TTR | Freq / Year | Users Affected | Bad Minutes / Year |
|------|-----|-----|-------------|----------------|--------------------|
| Production database | 0 | 2h | 12 | 100% | 1440m |
| Degraded code push during deployment | 30m | 30m | 26 | 50% | 780m |
| Datacenter failure | 15m | 1h | 1 | 100% | 495m |
| Bad code deploy | 45m | 15m | 150 | 25% | 2250m |
| Upstream provider failure | 5m | 30m | 6 | 75% | 157m |

99.5% Error Budget

365 * 24 * 60 * (100 - 99.5) / 100 = 2,628 bad minutes / year
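
Putting the two together: each risk's expected cost is roughly (TTD + TTR) * frequency per year * fraction of users affected, and the sum can be compared against the error budget. A minimal sketch in Go, using a subset of the rows above and assuming that impact model:

```go
package main

import "fmt"

// risk models one row of the risk analysis: expected time to detect (TTD),
// expected time to repair (TTR), incidents per year, and fraction of users hit.
type risk struct {
	name      string
	ttd, ttr  float64 // minutes
	freqYear  float64
	userShare float64 // 0.0 - 1.0
}

// badMinutes is the expected user-impacting minutes per year for this risk.
func (r risk) badMinutes() float64 {
	return (r.ttd + r.ttr) * r.freqYear * r.userShare
}

func main() {
	const slo = 99.5
	budget := 365 * 24 * 60 * (100 - slo) / 100 // about 2,628 bad minutes / year

	risks := []risk{
		{"Production database", 0, 120, 12, 1.0},
		{"Degraded code push during deployment", 30, 30, 26, 0.5},
		{"Bad code deploy", 45, 15, 150, 0.25},
		{"Upstream provider failure", 5, 30, 6, 0.75},
	}

	total := 0.0
	for _, r := range risks {
		m := r.badMinutes()
		total += m
		fmt.Printf("%-40s %6.0f bad minutes/year\n", r.name, m)
	}
	fmt.Printf("total: %.0f, budget: %.0f, within budget: %v\n", total, budget, total <= budget)
}
```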

7. Actionable Alerting for SRE

Alerting on low-level metrics such as CPU usage or disk space doesn't actually show whether our users are experiencing issues with our product or service. Instead, we should build our alerts using our SLOs. By integrating our remaining error budget over time, we can see how outages or partial outages will affect our SLO.

  • Fast burn alert
  • Slow burn alert
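
Both alerts can be expressed in terms of the error-budget burn rate: the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends exactly the whole budget over the SLO window. A fast-burn alert pages on a high burn rate over a short window; a slow-burn alert opens a ticket on a lower burn rate over a long window. A minimal sketch in Go; the windows and thresholds here are illustrative assumptions, not prescribed values:

```go
package main

import "fmt"

// burnRate is the observed error rate divided by the error rate the SLO allows.
// A burn rate of 1 consumes exactly the whole error budget over the SLO window;
// higher values exhaust it proportionally faster.
func burnRate(observedErrorRate, sloPercent float64) float64 {
	allowed := 1 - sloPercent/100
	return observedErrorRate / allowed
}

func main() {
	const slo = 99.9 // SLO for the window, in percent

	// Hypothetical measurements: error rate averaged over a short and a long window.
	shortWindowErrRate := 0.02 // 2% errors over the last 5 minutes
	longWindowErrRate := 0.003 // 0.3% errors over the last 6 hours

	// Illustrative thresholds: page on a fast burn, open a ticket on a slow burn.
	const fastBurnThreshold = 14.0
	const slowBurnThreshold = 2.0

	fast := burnRate(shortWindowErrRate, slo)
	slow := burnRate(longWindowErrRate, slo)

	fmt.Printf("fast burn rate (5m window): %.1f, page if > %.0f\n", fast, fastBurnThreshold)
	fmt.Printf("slow burn rate (6h window): %.1f, ticket if > %.0f\n", slow, slowBurnThreshold)
}
```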

8. Observability of Distributed Systems

We use structured logs, metrics, and traces to help SRE and DevOps practitioners find out where systems are broken. We'll use metrics to find slow or erroring queries, traces to find interactions between components, and logs to understand the errors in more detail.

  • Structured logging
  • Metrics (aggregated, typed data: counters such as request counts, gauges such as CPU load, distributions such as latency); see the instrumentation sketch below
  • Traces - execution flows
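
For the metrics part, a minimal white-box instrumentation sketch using the Prometheus Go client, with a counter for requests and a histogram for the latency distribution; the metric names, handler, and port are assumptions for the example:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: monotonically increasing number of handled requests.
	requests = promauto.NewCounter(prometheus.CounterOpts{
		Name: "homepage_requests_total",
		Help: "Total homepage requests handled.",
	})
	// Histogram: latency distribution, from which percentiles can be derived.
	latency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "homepage_request_duration_seconds",
		Help:    "Homepage request latency.",
		Buckets: prometheus.DefBuckets,
	})
)

func homepage(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(latency)
	defer timer.ObserveDuration()
	requests.Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", homepage)
	// Expose the metrics for a Prometheus server to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```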

9. Incident Management

  1. Process for declaring incidents
  2. Dashboard for viewing current incidents
  3. Database of who to contact for each kind of incident

Roles

  • Incident Commander ("IC") - owns the overall plan for the response
  • Operations Lead - does the hands-on operational work
  • Communications Lead
  • Planning Lead
  • Logistics Lead

10. Postmortem and Retrospectives

Blameless Postmortem Metadata

  1. What systems were affected?
  2. Who was involved in responding?
  3. How did we find out about the event?
  4. When did we start responding?
  5. What mitigations did we deploy?
  6. When did the incident conclude?

End of course

Monitoring

The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active (not predicted) problems.

  • Tracing
  • Stats/Metrics
  • zPages
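
A black-box probe only sees what an external user sees: whether the request succeeded and how long it took. A minimal synthetic-probe sketch in Go; the URL, timeout, and latency threshold are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probe issues a single synthetic request, the way an external user would,
// and reports only externally visible symptoms: status code and latency.
func probe(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	elapsed := time.Since(start)

	if resp.StatusCode >= 500 {
		return fmt.Errorf("server error: %s", resp.Status)
	}
	if elapsed > 300*time.Millisecond {
		return fmt.Errorf("too slow: %v", elapsed)
	}
	return nil
}

func main() {
	// Placeholder target; a real probe would run on a schedule and alert on failures.
	if err := probe("https://example.com/", 2*time.Second); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe ok")
}
```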

Why black-box monitoring is not sufficient

  1. Your probe is not real user traffic.
  2. Long-tail latency matters.
  3. You still need to investigate causes.

White-box

  • Instrumentation

Postmortem

Services

Tools

Prometheus, instrumenting, metrics

Logging

OpenCensus Go

OpenCensus Java

Grafana

Video