# Day 4: SLI, SLO, SLA

*From the vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki*
SLI, SLO, and SLA are the core pillars of Site Reliability Engineering (SRE). They define:

- **SLI**: what we measure
- **SLO**: what target we expect
- **SLA**: what we promise contractually
After today, you will be able to:

- Define meaningful SLIs
- Set realistic and measurable SLOs
- Understand error budgets
- Understand how SLOs guide deployment decisions
- Connect Golden Signals → SLIs → SLOs → Alerts
- Answer SRE architect-level interview questions
**SLI (Service Level Indicator):** a measurable metric that reflects service health.

Examples:

- Latency (p95)
- Error rate (%)
- Availability (%)
- Throughput (RPS)
- Queue delay
- Apdex score

SLI answers the question: *"What exactly should we measure to know if the system is healthy?"*
**SLO (Service Level Objective):** a target or goal for the SLI.

Examples:

- 95% of API requests under 300 ms
- Error rate < 1%
- Availability ≥ 99.9%
- Queue delay < 2 sec

SLO answers the question: *"What is the performance target we want to maintain?"*
**SLA (Service Level Agreement):** a business contract with penalties if reliability drops below a threshold.

Examples:

- If availability < 99.9%, the customer receives a credit
- If the API is down > 40 minutes/month → refund

In short: the SLA is the legal commitment, the SLO is the internal goal, and the SLI is the measurement.

Golden Signals → measurable SLIs → form SLOs → drive alerts.
**Error budgets.** An error budget is the allowed failure window implied by the SLO.

Example:

- Availability SLO: 99.9%
- Error budget: allowed downtime per month ≈ 43 minutes

If the error budget is exhausted:

- Freeze deployments
- Engineering focuses on reliability
- Architecture must be fixed

If the error budget is still healthy:

- Faster deployments
- More experiments allowed

The error budget lets SRE balance feature velocity against reliability.
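The downtime arithmetic above (99.9% ≈ 43 minutes per month) is easy to reproduce. A minimal Python sketch, not tied to any particular monitoring tool:

```python
# Convert an availability SLO into an allowed-downtime error budget.
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

for slo in (99.9, 99.5, 99.0):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.1f} min/month")
# 99.9% -> 43.2 min, 99.5% -> 216.0 min (3 h 36 m), 99.0% -> 432.0 min (7.2 h)
```

The same function works for any window, e.g. `error_budget_minutes(99.9, window_days=7)` for a weekly budget.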
Common SLI/SLO pairs:

| SLI | Example SLO |
| --- | --- |
| p95 latency | < 300 ms |
| Error rate | < 1% |
| Crash-free sessions | > 99.5% |
| Cold start time | < 800 ms for 95% of requests |
| Query success rate | > 99.99% |
| Steps completed successfully | > 99.8% |
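SLI/SLO pairs like these can be made executable as a simple compliance check. A hypothetical sketch: the thresholds mirror the pairs above, but the measured values and names are invented for illustration.

```python
# Check measured SLI values against SLO thresholds.
# Each SLO is (comparison, threshold); the measurements are made up.
slos = {
    "p95_latency_ms":     ("<", 300),
    "error_rate_percent": ("<", 1.0),
    "crash_free_percent": (">", 99.5),
}

measured = {
    "p95_latency_ms": 275.0,
    "error_rate_percent": 1.4,
    "crash_free_percent": 99.7,
}

def meets_slo(value: float, op: str, threshold: float) -> bool:
    """True if the measured SLI value satisfies the SLO threshold."""
    return value < threshold if op == "<" else value > threshold

for sli, (op, threshold) in slos.items():
    ok = meets_slo(measured[sli], op, threshold)
    print(f"{sli}: {measured[sli]} (SLO {op} {threshold}) -> {'OK' if ok else 'VIOLATED'}")
```

In this sample run only `error_rate_percent` violates its SLO (1.4% against a < 1% target).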
How to define SLOs, step by step:

1. Identify critical user journeys. Example:
   - Login
   - Add to Cart
   - Checkout
2. Choose SLIs for each journey. Examples:
   - Latency
   - Availability
   - Error rate
3. Set SLO targets. Examples:
   - 99% availability
   - p95 < 200 ms latency
4. Calculate the error budget. Example:
   - 99% SLO = 7.2 hours of downtime allowed per month
5. Monitor continuously with:
   - Dashboards
   - Alerts
   - Logs + metrics + traces

SRE teams pause deployments when SLOs degrade.
Example queries for measuring SLIs.

Prometheus latency (p95):

```promql
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

Prometheus error rate (5xx requests as a fraction of all requests — the ratio is what matters, not the raw 5xx rate alone):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```

Elasticsearch range filter for slow requests (latency ≥ 300 ms):

```json
{ "range": { "latency_ms": { "gte": 300 } } }
```

OpenTelemetry semantic-convention names useful as SLI sources:

- http.server.duration
- http.client.duration
- db.system
- messaging.system
Signals to track:

- API availability
- Error 5xx counts
- Latency percentile metrics

KQL (Azure Application Insights) p95 latency:

```kql
requests | summarize percentile(duration, 95)
```
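Percentile queries like the ones above matter because an average hides tail latency. A quick illustration using only Python's standard library and made-up latencies:

```python
import statistics

# Made-up latencies (ms): most requests are fast, a few are very slow.
latencies = [100] * 95 + [3000] * 5

mean = statistics.mean(latencies)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95 = statistics.quantiles(latencies, n=100)[94]

print(f"mean = {mean:.0f} ms, p95 = {p95:.0f} ms")
```

The mean (245 ms) looks healthy, while the p95 lands in the thousands of milliseconds — exactly the slow requests your users actually feel.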
Dashboard panels:

- p50, p95, p99
- Per endpoint
- 4xx vs 5xx
- Error budget burn rate
- Uptime %
- Region-level health
- CPU, memory, I/O thresholds

Burn-rate alert windows:

- Fast burn (minutes)
- Slow burn (hours/days)
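Burn rate is the fraction of error budget consumed divided by the fraction of the SLO window elapsed; fast-burn and slow-burn alerts differ mainly in the window they look at. A sketch assuming a 30-day window — the sample fractions and durations below are illustrative, not standard thresholds:

```python
# Burn rate = (budget fraction consumed) / (window fraction elapsed).
# A burn rate of 1.0 means the budget will last exactly the SLO window;
# much higher values mean the budget will be exhausted early.

SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window

def burn_rate(budget_fraction_consumed: float, hours_elapsed: float) -> float:
    return budget_fraction_consumed / (hours_elapsed / SLO_WINDOW_HOURS)

# Fast burn: 5% of the monthly budget gone in 1 hour.
fast = burn_rate(0.05, 1)    # ~36.0 -> page immediately
# Slow burn: 10% of the monthly budget gone in 3 days.
slow = burn_rate(0.10, 72)   # ~1.0  -> ticket, not a page

print(f"fast burn rate: {fast:.1f}, slow burn rate: {slow:.1f}")
```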
Real-world scenario: checkout API latency increases suddenly.

- p95 latency → 2.8 seconds
- Error rate → 3%
- Availability → 98.4% (below SLO)

Error budget consumed: 60% in one day.

Response:

- Froze deployments
- Added caching
- Increased DB read replicas
- Tuned connection pool

The SLO was restored within 2 hours.
Hands-on exercise. Pick a service and define:

- Latency SLI
- Error rate SLI
- Availability SLI

Set an SLO. Example:

- 99.5% availability over 30 days

Calculate the error budget. Example:

- 99.5% = 3 h 36 m of downtime allowed per month

Panels to add:

- API latency (p95)
- Error rate
- Availability run chart
- Burn rate
Interview questions:

- What is an SLI?
- What is an SLO?
- What is an SLA?
- Why is p95 more useful than average latency?
- What is an error budget?
- How does SRE use SLOs to reduce outages?
- Design SLOs for a microservices-based API.
- How do you calculate error-budget burn rate?
- How do SLOs influence deployment decisions?
- Build an SLO strategy for 500+ microservices.
- How do you align SLOs with business KPIs?
- How do you enforce SLO adoption across multiple teams?
Reflection:

- What I understood clearly:
- Which SLI is most important for my workload:
- How I will define SLOs for my system:
- Questions to revisit: