Day 4 SLI, SLO, SLA - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 4 — SLI, SLO, SLA: The Backbone of Reliability Engineering

(GitHub Wiki Edition — Structured, Clean, Vendor-Neutral)


πŸ“ Overview

SLI, SLO, and SLA are the core pillars of Site Reliability Engineering (SRE). They define:

  • SLI → What we measure

  • SLO → What target we expect

  • SLA → What we promise contractually

After today, you will be able to:

  • Define meaningful SLIs

  • Set realistic and measurable SLOs

  • Understand error budgets

  • Understand how SLOs guide deployment decisions

  • Connect Golden Signals → SLIs → SLOs → Alerts

  • Answer SRE architect–level interview questions


1️⃣ What Are SLI, SLO, SLA? (Simple & Clear)


✔ SLI — Service Level Indicator

A measurable metric that reflects service health.

Examples:

  • Latency (p95)

  • Error rate (%)

  • Availability (%)

  • Throughput (RPS)

  • Queue delay

  • Apdex Score

SLI answers the question:

“What exactly should we measure to know if the system is healthy?”


✔ SLO — Service Level Objective

A target or goal for the SLI.

Examples:

  • 95% of API requests under 300 ms

  • Error rate < 1%

  • Availability ≥ 99.9%

  • Queue delay < 2 sec

SLO answers the question:

“What is the performance target we want to maintain?”


✔ SLA — Service Level Agreement

A business contract with penalties if reliability drops below a threshold.

Examples:

  • If availability < 99.9%, the customer receives a credit

  • If API is down > 40 minutes/month → refund

SLA is a legal contract.
SLO is an internal goal.
SLI is the measurement.


2️⃣ How SLIs Connect to Golden Signals

Golden Signal | Example SLI
-- | --
Latency | p95 < 300 ms
Traffic | RPS handled without degradation
Errors | Error rate < 1%
Saturation | CPU usage < 85%

Golden Signals → measurable SLIs → form SLOs → drive alerts.


3️⃣ Error Budget — The MOST Important SRE Concept

An error budget is the allowed failure window based on SLO.

Example SLO:

Availability SLO: 99.9%

Error budget:

Allowed downtime per month ≈ 43.2 minutes (0.1% of a 30-day month)

⚠️ If error budget is exhausted:

  • Freeze deployments

  • Engineering focuses on reliability

  • Architecture must be fixed

⚡ If error budget is healthy:

  • Faster deployments

  • More experiments allowed

The error budget lets SRE teams balance feature velocity against reliability.
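
The arithmetic behind these budgets is simple enough to sketch. The helper below assumes a 30-day month, matching the figures used on this page:

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return round(total_minutes * (1 - slo_percent / 100), 1)

print(error_budget_minutes(99.9))   # → 43.2 minutes/month
print(error_budget_minutes(99.99))  # → 4.3 minutes/month
print(error_budget_minutes(99.0))   # → 432.0 minutes, i.e. 7.2 hours/month
```

The same formula gives the 7.2-hour and 3h 36m figures used later in this page for 99% and 99.5% SLOs.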


4️⃣ Examples of Real SLOs Across Industries

API Service

  • SLI: p95 latency

  • SLO: < 300ms

  • SLI: Error rate

  • SLO: < 1%

Mobile App

  • SLI: Crash-free sessions

  • SLO: > 99.5%

Function Apps (AWS Lambda / Azure Functions)

  • SLI: Cold start time

  • SLO: < 800ms for 95% of requests

Database

  • SLI: Query success rate

  • SLO: > 99.99%

E-commerce Checkout

  • SLI: Steps completed successfully

  • SLO: > 99.8%


5️⃣ How to Build an SLO (Step-by-Step)


Step 1 — Identify the user journey

Example:

  • Login

  • Add to Cart

  • Checkout

Step 2 — Choose SLIs

Examples:

  • Latency

  • Availability

  • Error rate

Step 3 — Set SLO Targets

Examples:

  • 99% availability

  • p95 < 200ms latency

Step 4 — Calculate Error Budget

Example:

99% SLO = 7.2 hours downtime allowed per month

Step 5 — Monitor continuously

  • Dashboards

  • Alerts

  • Logs + metrics + traces

Step 6 — Use SLOs to guide releases

SRE teams pause deployments when SLOs degrade.
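
One way to make this gating concrete is a simple check in the release pipeline. This is a hypothetical sketch: the function name and the 80% freeze threshold are illustrative, and real policies are agreed between product and SRE:

```python
def can_deploy(budget_consumed: float, freeze_threshold: float = 0.8) -> bool:
    """Allow a release only while error-budget consumption stays below the threshold.

    budget_consumed: fraction of the window's error budget already burned (0..1).
    freeze_threshold: illustrative policy knob, not a standard value.
    """
    return budget_consumed < freeze_threshold

print(can_deploy(0.25))  # healthy budget → True, releases proceed
print(can_deploy(0.92))  # budget nearly gone → False, freeze deployments
```

In practice the consumed fraction would be queried from the monitoring system rather than passed in by hand.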


6️⃣ Vendor-Neutral Examples of SLI Measurement


Prometheus (PromQL) Examples

Latency (p95):

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Error rate (fraction of requests returning 5xx):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

ELK / OpenSearch Example

```json
{ "range": { "latency_ms": { "gte": 300 } } }
```

OpenTelemetry Semantic Conventions

  • `http.server.duration`

  • `http.client.duration`

  • `db.system`

  • `messaging.system`

AWS CloudWatch

  • API availability

  • Error 5xx counts

  • Latency percentile metrics


Azure Monitor (optional)

```kusto
requests
| summarize percentile(duration, 95)
```

7️⃣ SLO Dashboards (What to Include)

✔ Latency

  • p50, p95, p99

  • Per endpoint

✔ Errors

  • 4xx vs 5xx

  • Error budget burn rate

✔ Availability

  • Uptime %

  • Region-level health

✔ Saturation

  • CPU, memory, I/O thresholds

✔ Burn Rate Alerts

  • Fast burn (minutes)

  • Slow burn (hours/days)
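
Burn rate quantifies how fast the error budget is being consumed: a rate of 1 exhausts the budget exactly at the end of the SLO window, and higher rates exhaust it proportionally sooner. A sketch of the arithmetic; the 14.4× figure is the commonly cited fast-burn threshold from the Google SRE Workbook, used here for illustration:

```python
def burn_rate(error_fraction: float, slo_percent: float) -> float:
    """Observed error fraction divided by the error fraction the SLO allows."""
    allowed = 1 - slo_percent / 100
    return error_fraction / allowed

# A 99.9% SLO allows a 0.1% error fraction; a sustained 1.44% error rate
# therefore burns budget at 14.4x the sustainable pace.
rate = burn_rate(0.0144, 99.9)
hours_to_exhaustion = 30 * 24 / rate  # empties a 30-day budget in ~50 h
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaustion:.0f} hours")
# → burn rate 14.4x, budget gone in 50 hours
```

Fast-burn alerts page on a high rate over a short window; slow-burn alerts catch a modest rate sustained over hours or days.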


8️⃣ Real-World Example — Using SLOs to Solve a Major Issue

Incident:

Checkout API latency increases suddenly.

SLIs detected:

  • p95 latency → 2.8 seconds

  • Error rate → 3%

  • Availability → 98.4% (below SLO)

Error budget:

Consumed 60% in 1 day.

SRE Action:

  • Freeze deployments

  • Added caching

  • Increased DB read replicas

  • Tuned connection pool

Result:

SLO restored within 2 hours.
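
A rough check of the incident numbers above, assuming the 99.9% availability SLO used elsewhere on this page (illustrative arithmetic only):

```python
slo = 99.9
observed_availability = 98.4  # from the incident above

allowed_error = 1 - slo / 100                      # 0.1% unavailability allowed
observed_error = 1 - observed_availability / 100   # 1.6% observed

burn = observed_error / allowed_error
print(f"burn rate ≈ {burn:.0f}x")                            # → burn rate ≈ 16x
print(f"budget exhausted in ≈ {30 * 24 / burn:.0f} hours")   # → ≈ 45 hours
```

At roughly 16× burn, losing well over half the monthly budget in a single day is exactly what the incident reports, which is why freezing deployments was the first action.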


9️⃣ Hands-On Labs — Day 4


🔧 Lab 1 — Create Your First SLI

Pick a service and define:

  • Latency SLI

  • Error rate SLI

  • Availability SLI

🔧 Lab 2 — Define an SLO

Example:

99.5% availability over 30 days

🔧 Lab 3 — Calculate Error Budget

Example:

99.5% = 3h 36m downtime allowed per month

🔧 Lab 4 — Build an SLO Panel in Grafana

Panels to add:

  • API latency (p95)

  • Error rate

  • Availability run chart

  • Burn rate


🔟 Interview Questions — Day 4


Beginner

  1. What is an SLI?

  2. What is an SLO?

  3. What is an SLA?


Intermediate

  1. Why is p95 more useful than average latency?

  2. What is an error budget?

  3. How does SRE use SLOs to reduce outages?


Senior

  1. Design SLOs for a microservices-based API.

  2. How do you calculate error-budget burn rate?

  3. How do SLOs influence deployment decisions?


Architect

  1. Build an SLO strategy for 500+ microservices.

  2. How do you align SLOs with business KPIs?

  3. How do you enforce SLO adoption across multiple teams?


πŸ“ Your Learning Notes

  • What I understood clearly:

  • Which SLI is most important for my workload:

  • How I will define SLOs for my system:

  • Questions to revisit: