Day 4 SLI, SLO, SLA - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 4 — SLI, SLO, SLA: The Backbone of Reliability Engineering

(GitHub Wiki Edition — Structured, Clean, Vendor-Neutral)


πŸ“ Overview

SLI, SLO, and SLA are the core pillars of Site Reliability Engineering (SRE). They define:

  • SLI → What we measure

  • SLO → What target we expect

  • SLA → What we promise contractually

After today, you will be able to:

  • Define meaningful SLIs

  • Set realistic and measurable SLOs

  • Understand error budgets

  • Understand how SLOs guide deployment decisions

  • Connect Golden Signals → SLIs → SLOs → Alerts

  • Answer SRE architect–level interview questions


1️⃣ What Are SLI, SLO, SLA? (Simple & Clear)


✔ SLI — Service Level Indicator

A measurable metric that reflects service health.

Examples:

  • Latency (p95)

  • Error rate (%)

  • Availability (%)

  • Throughput (RPS)

  • Queue delay

  • Apdex Score

SLI answers the question:

“What exactly should we measure to know if the system is healthy?”


✔ SLO — Service Level Objective

A target or goal for the SLI.

Examples:

  • 95% of API requests under 300 ms

  • Error rate < 1%

  • Availability ≥ 99.9%

  • Queue delay < 2 sec

SLO answers the question:

“What is the performance target we want to maintain?”


✔ SLA — Service Level Agreement

A business contract with penalties if reliability drops below a threshold.

Examples:

  • If availability < 99.9%, the customer receives a credit

  • If API is down > 40 minutes/month → refund

SLA is a legal contract.
SLO is an internal goal.
SLI is the measurement.


2️⃣ How SLIs Connect to Golden Signals

Golden Signal | Example SLI
-- | --
Latency | p95 < 300 ms
Traffic | RPS handled without degradation
Errors | Error rate < 1%
Saturation | CPU usage < 85%

Golden Signals → measurable SLIs → form SLOs → drive alerts.


3️⃣ Error Budget — The MOST Important SRE Concept

An error budget is the allowed failure window based on SLO.

Example SLO:

Availability SLO: 99.9%

Error budget:

Allowed downtime per month ≈ 43.2 minutes (0.1% of a 30-day month)

⚠️ If error budget is exhausted:

  • Freeze deployments

  • Engineering focuses on reliability

  • Architecture must be fixed

⚡ If error budget is healthy:

  • Faster deployments

  • More experiments allowed

The error budget lets SRE teams balance feature velocity against reliability.
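
The arithmetic behind these budgets is simple enough to sketch. The helper below assumes a 30-day month, matching the figures used on this page:

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return round(total_minutes * (1 - slo_percent / 100), 1)

print(error_budget_minutes(99.9))   # → 43.2 minutes/month
print(error_budget_minutes(99.99))  # → 4.3 minutes/month
print(error_budget_minutes(99.0))   # → 432.0 minutes, i.e. 7.2 hours/month
```

The same formula gives the 7.2-hour and 3h 36m figures used later in this page for 99% and 99.5% SLOs.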


4️⃣ Examples of Real SLOs Across Industries

API Service

  • SLI: p95 latency

  • SLO: < 300ms

  • SLI: Error rate

  • SLO: < 1%

Mobile App

  • SLI: Crash-free sessions

  • SLO: > 99.5%

Function Apps (AWS Lambda / Azure Functions)

  • SLI: Cold start time

  • SLO: < 800ms for 95% of requests

Database

  • SLI: Query success rate

  • SLO: > 99.99%

E-commerce Checkout

  • SLI: Steps completed successfully

  • SLO: > 99.8%


5️⃣ How to Build an SLO (Step-by-Step)


Step 1 — Identify the user journey

Example:

  • Login

  • Add to Cart

  • Checkout

Step 2 — Choose SLIs

Examples:

  • Latency

  • Availability

  • Error rate

Step 3 — Set SLO Targets

Examples:

  • 99% availability

  • p95 < 200ms latency

Step 4 — Calculate Error Budget

Example:

99% SLO = 7.2 hours downtime allowed per month

Step 5 — Monitor continuously

  • Dashboards

  • Alerts

  • Logs + metrics + traces

Step 6 — Use SLOs to guide releases

SRE teams pause deployments when SLOs degrade.
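
One way to make this gating concrete is a simple check in the release pipeline. This is a hypothetical sketch: the function name and the 80% freeze threshold are illustrative, and real policies are agreed between product and SRE:

```python
def can_deploy(budget_consumed: float, freeze_threshold: float = 0.8) -> bool:
    """Allow a release only while error-budget consumption stays below the threshold.

    budget_consumed: fraction of the window's error budget already burned (0..1).
    freeze_threshold: illustrative policy knob, not a standard value.
    """
    return budget_consumed < freeze_threshold

print(can_deploy(0.25))  # healthy budget → True, releases proceed
print(can_deploy(0.92))  # budget nearly gone → False, freeze deployments
```

In practice the consumed fraction would be queried from the monitoring system rather than passed in by hand.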


6️⃣ Vendor-Neutral Examples of SLI Measurement


Prometheus (PromQL) Examples

Latency (p95):

```promql
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Error rate (fraction of requests returning 5xx):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

ELK / OpenSearch Example

```json
{ "range": { "latency_ms": { "gte": 300 } } }
```

OpenTelemetry Semantic Conventions

  • `http.server.duration`

  • `http.client.duration`

  • `db.system`

  • `messaging.system`

AWS CloudWatch

  • API availability

  • Error 5xx counts

  • Latency percentile metrics


Azure Monitor (optional)

```kusto
requests
| summarize percentile(duration, 95)
```

7️⃣ SLO Dashboards (What to Include)

✔ Latency

  • p50, p95, p99

  • Per endpoint

✔ Errors

  • 4xx vs 5xx

  • Error budget burn rate

✔ Availability

  • Uptime %

  • Region-level health

✔ Saturation

  • CPU, memory, I/O thresholds

✔ Burn Rate Alerts

  • Fast burn (minutes)

  • Slow burn (hours/days)
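
Burn rate quantifies how fast the error budget is being consumed: a rate of 1 exhausts the budget exactly at the end of the SLO window, and higher rates exhaust it proportionally sooner. A sketch of the arithmetic; the 14.4× figure is the commonly cited fast-burn threshold from the Google SRE Workbook, used here for illustration:

```python
def burn_rate(error_fraction: float, slo_percent: float) -> float:
    """Observed error fraction divided by the error fraction the SLO allows."""
    allowed = 1 - slo_percent / 100
    return error_fraction / allowed

# A 99.9% SLO allows a 0.1% error fraction; a sustained 1.44% error rate
# therefore burns budget at 14.4x the sustainable pace.
rate = burn_rate(0.0144, 99.9)
hours_to_exhaustion = 30 * 24 / rate  # empties a 30-day budget in ~50 h
print(f"burn rate {rate:.1f}x, budget gone in {hours_to_exhaustion:.0f} hours")
# → burn rate 14.4x, budget gone in 50 hours
```

Fast-burn alerts page on a high rate over a short window; slow-burn alerts catch a modest rate sustained over hours or days.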


8️⃣ Real-World Example — Using SLOs to Solve a Major Issue

Incident:

Checkout API latency increases suddenly.

SLIs detected:

  • p95 latency → 2.8 seconds

  • Error rate → 3%

  • Availability → 98.4% (below SLO)

Error budget:

Consumed 60% in 1 day.

SRE Action:

  • Freeze deployments

  • Added caching

  • Increased DB read replicas

  • Tuned connection pool

Result:

SLO restored within 2 hours.
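
A rough check of the incident numbers above, assuming the 99.9% availability SLO used elsewhere on this page (illustrative arithmetic only):

```python
slo = 99.9
observed_availability = 98.4  # from the incident above

allowed_error = 1 - slo / 100                      # 0.1% unavailability allowed
observed_error = 1 - observed_availability / 100   # 1.6% observed

burn = observed_error / allowed_error
print(f"burn rate ≈ {burn:.0f}x")                            # → burn rate ≈ 16x
print(f"budget exhausted in ≈ {30 * 24 / burn:.0f} hours")   # → ≈ 45 hours
```

At roughly 16× burn, losing well over half the monthly budget in a single day is exactly what the incident reports, which is why freezing deployments was the first action.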


9️⃣ Hands-On Labs — Day 4


🔧 Lab 1 — Create Your First SLI

Pick a service and define:

  • Latency SLI

  • Error rate SLI

  • Availability SLI

🔧 Lab 2 — Define an SLO

Example:

99.5% availability over 30 days

🔧 Lab 3 — Calculate Error Budget

Example:

99.5% = 3h 36m downtime allowed per month

🔧 Lab 4 — Build an SLO Panel in Grafana

Panels to add:

  • API latency (p95)

  • Error rate

  • Availability run chart

  • Burn rate


🔟 Interview Questions — Day 4


Beginner

  1. What is an SLI?

  2. What is an SLO?

  3. What is an SLA?


Intermediate

  1. Why is p95 more useful than average latency?

  2. What is an error budget?

  3. How does SRE use SLOs to reduce outages?


Senior

  1. Design SLOs for a microservices-based API.

  2. How do you calculate error-budget burn rate?

  3. How do SLOs influence deployment decisions?


Architect

  1. Build an SLO strategy for 500+ microservices.

  2. How do you align SLOs with business KPIs?

  3. How do you enforce SLO adoption across multiple teams?


πŸ“ Your Learning Notes

  • What I understood clearly:

  • Which SLI is most important for my workload:

  • How I will define SLOs for my system:

  • Questions to revisit: