📘 Day 3 — Golden Signals (Latency, Traffic, Errors, Saturation)

The universal SRE model for understanding system health at scale.

🖼️ Golden Signals Visual

Use this image in your wiki:



![Golden Signals](A_2D_digital_infographic_showcases_the_Golden_Sign.png)

🎯 Learning Objective

By the end of Day 3, you should deeply understand:

The four Golden Signals used globally across SRE and Observability
Why they say more about user-facing health than CPU/memory monitoring alone
How they help detect and prevent outages
How to interpret them in distributed systems
Where they appear in logs/metrics/traces
How to create Golden Signals dashboards
Interview questions for different experience levels

This is a key foundation before we move into SLI/SLO/SLA tomorrow.

1️⃣ What Are the Golden Signals?

Golden Signals are the minimum essential metrics required to understand a service’s health.

They apply to any stack, including:

Prometheus
Grafana
ELK / OpenSearch
CloudWatch / Datadog / Dynatrace
Azure Monitor / App Insights
GCP Cloud Monitoring
OpenTelemetry

The 4 signals are:

Latency — How long it takes to process a request
Traffic — How much load the system receives
Errors — How many requests fail
Saturation — How “full” the system is (resource pressure)

These work for APIs, microservices, serverless, databases, queues, and any distributed system.

2️⃣ Latency — How long does it take?

Latency measures response time.

Examples:

API response time
p95 latency of a microservice
Database query duration
Queue processing delays
Function execution duration

⚠️ IMPORTANT: Percentiles matter

Averages hide problems — percentiles reveal them.
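To make the point concrete, here is a minimal sketch comparing the mean with percentiles. The latency values are hypothetical, and the nearest-rank `percentile` helper is an illustration rather than the method any particular monitoring tool uses.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

# prints: mean=390 p50=100 p95=3000 p99=3000
print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

The mean of 390 ms suggests a mildly slow service, while p95 shows that one request in ten takes 3 seconds.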

3️⃣ Traffic — How much load?

Traffic measures the demand on your system.

Examples:

Requests per second (RPS)
Transactions per second (TPS)
Kafka / Event Hubs (EH) message rate
Concurrent sessions
Data ingestion rate

A sudden traffic spike can cascade into:

High latency
Queue buildup
Increased error rates
System overload
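One way to reason about this cascade is Little's Law: the average number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below uses hypothetical numbers (200 RPS, 0.25 s latency, 100 workers) to show how a 4x spike can exceed worker capacity even before latency degrades.

```python
def in_flight_requests(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x average time in system."""
    return arrival_rate_rps * avg_latency_s

# Hypothetical service: 200 RPS at 0.25 s average latency, served by 100 workers.
WORKERS = 100
baseline = in_flight_requests(200, 0.25)  # 50 requests in flight
spike = in_flight_requests(800, 0.25)     # 200 requests in flight at 4x traffic

for label, concurrency in (("baseline", baseline), ("4x spike", spike)):
    status = "queueing" if concurrency > WORKERS else "ok"
    print(f"{label}: {concurrency:.0f} in flight vs {WORKERS} workers -> {status}")
```

Once requests start queueing, latency rises, which raises concurrency further. That feedback loop is why a traffic spike often shows up in all four signals.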

4️⃣ Errors — What is failing?

Errors represent failed or degraded requests.

Explicit Errors

HTTP 5xx, and 4xx where the fault lies with the service rather than the client
Exceptions
Failed DB queries
Message retry failures

Implicit Errors

Requests so slow they violate SLO
Partial failures in distributed systems
Timeouts due to downstream services

Errors show the quality of the service.
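The explicit/implicit split can be sketched as a small classifier. The `Request` type and the 1000 ms objective are assumptions made for illustration; a real service would take the threshold from its own SLO.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # observed response time

SLO_LATENCY_MS = 1000  # hypothetical latency objective

def is_error(req: Request) -> bool:
    """Explicit error: 5xx status. Implicit error: a response too slow for the SLO."""
    explicit = req.status >= 500
    implicit = req.duration_ms > SLO_LATENCY_MS
    return explicit or implicit

def error_ratio(requests: list[Request]) -> float:
    """Fraction of requests counted as errors (0.0 for an empty list)."""
    if not requests:
        return 0.0
    return sum(is_error(r) for r in requests) / len(requests)

sample = [
    Request(200, 120),
    Request(200, 2400),  # implicit: HTTP 200 but slower than the SLO
    Request(502, 80),    # explicit: gateway error
    Request(404, 30),    # client error, not counted against the service here
]
print(error_ratio(sample))  # prints 0.5
```

Counting only status codes would report one error in four here. Including the slow success doubles the ratio.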

5️⃣ Saturation — How full is the system?

Saturation measures how close a resource is to exhaustion.

Common indicators:

CPU > 85–90%
Memory nearing limit
Disk IOPS saturated
DB connection pool full
Thread pool exhaustion
Kafka consumer lag

Rising saturation is often the earliest warning of a failure that has not happened yet.
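As a rough illustration of saturation as an early warning, the sketch below computes utilization and a straight-line projection of when a resource fills up. The connection pool numbers are hypothetical, and real growth is rarely this linear.

```python
from typing import Optional

def utilization(used: float, capacity: float) -> float:
    """Fraction of a resource in use, from 0.0 (idle) to 1.0 (full)."""
    return used / capacity

def seconds_until_full(samples: list[tuple[float, float]], capacity: float) -> Optional[float]:
    """Linear projection from the first and last (timestamp_s, used) samples.

    Returns None when usage is flat or falling, so no exhaustion is projected.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if u1 <= u0:
        return None
    return (capacity - u1) * (t1 - t0) / (u1 - u0)

# Hypothetical DB connection pool of 100 connections, sampled one minute apart.
pool_samples = [(0, 60), (60, 70), (120, 80)]
print(utilization(80, 100))                   # prints 0.8
print(seconds_until_full(pool_samples, 100))  # prints 120.0
```

At 80% the pool still works, but the projection says it will be full in about two minutes, which is the window in which an alert is still useful.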

6️⃣ Golden Signals Architecture (Text Diagram)



                     ┌──────────────────────────┐
                     │      Golden Signals      │
                     │  Latency                 │
Client → System →    │  Traffic                 │ → Observability → Alerts/SLOs/RCA
                     │  Errors                  │
                     │  Saturation              │
                     └──────────────────────────┘
                               ↓
                   Telemetry Collectors (OpenTelemetry/Agents)
                               ↓
                      Metrics + Logs + Traces
                               ↓
            Dashboards (Grafana / ELK / Datadog / CloudWatch / Azure)
                               ↓
                     Notifications + Auto-Remediation

Most modern observability platforms follow this same flow.

7️⃣ Tool-Neutral Golden Signal Examples

These apply broadly across systems:

Latency

API p95 latency
SQL query duration
Function execution time
External API call slowdowns

Traffic

Requests per second
Messages ingested per second
Active connections
Page views

Errors

HTTP 5xx
Timeout exceptions
Dependency failures
Retry storms

Saturation

CPU high
Memory leak
Disk I/O bottleneck
DB connection exhaustion

8️⃣ Hands-On Exercises (Vendor-Neutral)

🔧 Exercise A — Measure Latency

Prometheus:



histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
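To see what this query estimates, here is a simplified Python sketch of quantile estimation from cumulative histogram buckets using linear interpolation. It follows the general approach described in the Prometheus documentation but is not the Prometheus implementation, and the bucket counts are hypothetical.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # beyond the highest finite bucket
            if count == prev_count:
                return bound       # empty bucket, nothing to interpolate
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for bucket bounds of 0.1 s, 0.5 s, 1 s and +Inf.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))  # roughly 0.81 seconds
```

The result depends on bucket boundaries: the estimate can only be as precise as the bucket that contains the quantile, so bucket bounds should bracket the latency targets you care about.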

ELK/OpenSearch:



{
  "query": {
    "range": { "latency_ms": { "gte": 500 } }
  }
}

🔧 Exercise B — Measure Traffic

Prometheus:



sum(rate(http_requests_total[1m]))
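The idea behind a rate over a counter can be sketched in a few lines of Python. This is a simplified illustration with hypothetical samples: it handles counter resets but omits the extrapolation to window boundaries that Prometheus applies.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a monotonically increasing counter.

    samples are (timestamp_s, counter_value) pairs in time order. A drop in
    value is treated as a counter reset, for example after a process restart.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so curr is the increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
samples = [(0, 1000), (15, 1300), (30, 1600), (45, 300), (60, 600)]
print(counter_rate(samples))  # prints 20.0
```

Without reset handling, the drop from 1600 to 300 would look like negative traffic, which is why raw counter differences should not be graphed directly.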

🔧 Exercise C — Measure Errors



Prometheus:

sum(rate(http_requests_total{status=~"5.."}[5m]))

🔧 Exercise D — Measure Saturation



Prometheus (node_exporter):

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

9️⃣ Real-World Outage Example (Golden Signals in Action)

❌ Incident

The checkout system was slow during the evening peak.

Golden Signals:

Latency: p99 = 3.5 sec
Traffic: RPS increased ×4
Errors: 502 and DB_TIMEOUT started appearing
Saturation: DB CPU = 92%, threads blocked
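A snapshot like this can be checked against thresholds programmatically. The sketch below is illustrative only: the `Snapshot` type and the threshold values are hypothetical, and real thresholds would come from the service's SLOs.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    p99_latency_s: float
    traffic_multiplier: float  # current RPS divided by normal RPS
    error_codes: list[str]
    db_cpu_percent: float

# Hypothetical alert thresholds.
THRESHOLDS = {"p99_latency_s": 1.0, "traffic_multiplier": 2.0, "db_cpu_percent": 85.0}

def breached_signals(s: Snapshot) -> list[str]:
    """Return the names of the Golden Signals that exceed their thresholds."""
    breached = []
    if s.p99_latency_s > THRESHOLDS["p99_latency_s"]:
        breached.append("latency")
    if s.traffic_multiplier > THRESHOLDS["traffic_multiplier"]:
        breached.append("traffic")
    if s.error_codes:
        breached.append("errors")
    if s.db_cpu_percent > THRESHOLDS["db_cpu_percent"]:
        breached.append("saturation")
    return breached

# Values taken from the incident above.
incident = Snapshot(3.5, 4.0, ["502", "DB_TIMEOUT"], 92.0)
print(breached_signals(incident))  # prints ['latency', 'traffic', 'errors', 'saturation']
```

All four signals breaching at once, with saturation concentrated on the database, narrows the search to the data tier.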

🎯 Root Cause

A long-running SQL report was blocking new queries during peak traffic → DB threads blocked (saturation) → high latency → timeouts and 502 errors.

📈 Result

Fix applied → latency dropped → errors disappeared → saturation returned to normal.

Golden Signals made the RCA fast: all four signals pointed at the database.

🔟 Interview Questions (Beginner → Architect)

Beginner

What are the four Golden Signals?
What is latency?
Why is p95 more useful than average latency?
What is an example of traffic?
What does saturation indicate?

Intermediate

Explain the relationship between traffic and latency.
What are implicit errors?
What happens when a DB connection pool saturates?
How can latency hide downstream failures?
How do you detect error spikes?

Senior

Design a Golden Signals dashboard for a microservice.
How do you detect hidden latency (tail latency)?
Explain saturation in thread pools, DB pools, and queues.
What is a retry storm and how does it relate to Golden Signals?
How do Golden Signals shape auto-scaling decisions?

Architect

How would you implement Golden Signals across 300 microservices in a hybrid cloud?
How do you enforce Golden Signal standards across teams?
How do you connect Golden Signals to SLOs and business KPIs?
Explain how Golden Signals reduce MTTR.
How do you design multi-region signal aggregation?

1️⃣1️⃣ Your Learning Notes



Today I learned:
Which Golden Signal was easiest to understand:
Which one was hardest:
Which tools I tested today:
What I need to review:

📢 Next → Day 4

SLI / SLO / SLA — The backbone of reliability engineering
