
📘 Day 3: Golden Signals (Latency, Traffic, Errors, Saturation)

The universal SRE model for understanding system health at scale.


🖼️ Golden Signals Visual

![Golden Signals](A_2D_digital_infographic_showcases_the_Golden_Sign.png)

🎯 Learning Objective

By the end of Day 3, you should deeply understand:

  • The four Golden Signals used globally across SRE and Observability

  • Why they are more important than CPU/Memory monitoring

  • How they help detect and prevent outages

  • How to interpret them in distributed systems

  • Where they appear in logs/metrics/traces

  • How to create Golden Signals dashboards

  • Interview questions for different experience levels

This is a key foundation before we move into SLI/SLO/SLA tomorrow.


1️⃣ What Are the Golden Signals?

Golden Signals are the minimum essential metrics required to understand a service's health.

They apply to any stack, including:

  • Prometheus

  • Grafana

  • ELK / OpenSearch

  • CloudWatch / Datadog / Dynatrace

  • Azure Monitor / App Insights

  • GCP Cloud Monitoring

  • OpenTelemetry

The 4 signals are:

  1. Latency: how long it takes to process a request

  2. Traffic: how much load the system receives

  3. Errors: how many requests fail

  4. Saturation: how "full" the system is (resource pressure)

These work for APIs, microservices, serverless, databases, queues, and any distributed system.


2️⃣ Latency: How long does it take?

Latency measures response time.

Examples:

  • API response time

  • p95 latency of microservice

  • Database query duration

  • Queue processing delays

  • Function execution duration

⚠️ IMPORTANT: Percentiles matter

| Percentile | Meaning |
| --- | --- |
| p50 | Typical user |
| p95 | Slower users |
| p99 | Edge cases / performance tail |

Averages hide problems; percentiles reveal them.
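To make the point about averages concrete, here is a minimal sketch that compares the mean with p50, p95, and p99. The sample data and the `percentile` helper (nearest-rank method) are illustrative assumptions, not part of any monitoring tool.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: most requests are fast, a few are very slow.
latencies_ms = [100] * 90 + [800] * 8 + [5000] * 2

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# prints: mean=254ms p50=100ms p95=800ms p99=5000ms
print(f"mean={mean_ms:.0f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

In this sample no request actually took 254 ms. The mean sits between the fast majority and the slow tail, while p95 and p99 show what the slowest users experienced.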


3️⃣ Traffic: How much load?

Traffic measures the demand on your system.

Examples:

  • Requests per second (RPS)

  • Transactions per second (TPS)

  • Kafka / Event Hubs message rate

  • Concurrent sessions

  • Data ingestion rate

A sudden traffic spike can cascade into:

  • High latency

  • Queue buildup

  • Increased error rates

  • System overload
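Traffic numbers such as RPS are usually derived from a counter that only goes up. The sketch below shows one way to turn counter samples into a per-second rate. The function name, the sample values, and the reset handling are assumptions for illustration and only loosely mirror what monitoring systems do.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, counter_value) samples.

    A drop in the counter is treated as a reset to zero (for example a
    process restart), so the new value counts as the increase for that step.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed


# Hypothetical request counter scraped every 15 seconds.
samples = [(0, 1000), (15, 1600), (30, 2500), (45, 3100), (60, 4000)]
rps = per_second_rate(samples)  # 3000 requests over 60 seconds
print(f"{rps:.1f} requests/second")
```

Comparing this rate with a normal baseline for the same time of day is what reveals a spike.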


4️⃣ Errors: What is failing?

Errors represent failed or degraded requests.

Explicit Errors

  • HTTP 4xx, 5xx

  • Exceptions

  • Failed DB queries

  • Message retry failures

Implicit Errors

  • Requests so slow they violate the SLO

  • Partial failures in distributed systems

  • Timeouts due to downstream services

Errors show the quality of the service.
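The explicit and implicit split can be sketched as a small classifier. Everything named here is hypothetical: the `Request` type, the 500 ms threshold, and the choice to count only 5xx responses as explicit errors. Whether 4xx responses count depends on the service.

```python
from dataclasses import dataclass

SLO_LATENCY_MS = 500  # hypothetical latency threshold taken from an SLO


@dataclass(frozen=True)
class Request:
    status: int
    latency_ms: float


def classify(req, slo_ms=SLO_LATENCY_MS):
    """Return 'explicit', 'implicit', or 'ok' for a single request."""
    if req.status >= 500:
        return "explicit"  # the server reported a failure
    if req.latency_ms > slo_ms:
        return "implicit"  # looked successful, but too slow for the SLO
    return "ok"


def error_ratio(requests, slo_ms=SLO_LATENCY_MS):
    """Fraction of requests that are explicit or implicit errors."""
    if not requests:
        return 0.0
    bad = sum(1 for r in requests if classify(r, slo_ms) != "ok")
    return bad / len(requests)


requests = [
    Request(200, 120),  # ok
    Request(200, 900),  # implicit: slower than the threshold
    Request(503, 40),   # explicit: server error
    Request(404, 30),   # counted as ok under the 5xx-only assumption
]
print(error_ratio(requests))  # prints 0.5
```

A dashboard that counts only status codes would report one failure out of four here. Counting the slow request as well doubles the error ratio.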


5️⃣ Saturation: How full is the system?

Saturation indicates how close a resource is to exhaustion.

Common indicators:

  • CPU > 85–90%

  • Memory nearing limit

  • Disk IOPS saturated

  • DB connection pool full

  • Thread pool exhaustion

  • Kafka consumer lag

Rising saturation often warns of failure before it happens.
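One simple way to express saturation is utilization against a known capacity. The sketch below applies that to a connection pool. The function names and the 80% and 95% thresholds are assumptions chosen for illustration, not recommended values.

```python
def utilization(in_use, capacity):
    """Fraction of a bounded resource currently in use."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return in_use / capacity


def saturation_level(in_use, capacity, warn=0.80, critical=0.95):
    """Map utilization to 'ok', 'warning', or 'critical' using hypothetical thresholds."""
    used = utilization(in_use, capacity)
    if used >= critical:
        return "critical"
    if used >= warn:
        return "warning"
    return "ok"


# Hypothetical DB connection pool with 50 connections.
print(saturation_level(10, 50))  # prints ok
print(saturation_level(46, 50))  # prints warning
print(saturation_level(50, 50))  # prints critical
```

The warning band is the useful part: it fires while the pool still has headroom, before requests start to queue or fail.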


6️⃣ Golden Signals Architecture (Text Diagram)

                        ┌──────────────────────────┐
                        │      Golden Signals      │
                        │  Latency                 │
    Client → System →   │  Traffic                 │   → Observability → Alerts/SLOs/RCA
                        │  Errors                  │
                        │  Saturation              │
                        └──────────────────────────┘
                                     ↓
                Telemetry Collectors (OpenTelemetry/Agents)
                                     ↓
                          Metrics + Logs + Traces
                                     ↓
         Dashboards (Grafana / ELK / Datadog / CloudWatch / Azure)
                                     ↓
                      Notifications + Auto-Remediation

Most modern observability platforms follow this flow.


7️⃣ Tool-Neutral Golden Signal Examples

These apply broadly across systems:

Latency

  • API p95 latency

  • SQL query duration

  • Function execution time

  • External API call slowdowns

Traffic

  • Requests per second

  • Messages ingested per second

  • Active connections

  • Page views

Errors

  • HTTP 5xx

  • Timeout exceptions

  • Dependency failures

  • Retry storms

Saturation

  • CPU high

  • Memory leak

  • Disk I/O bottleneck

  • DB connection exhaustion


8️⃣ Hands-On Exercises (Vendor-Neutral)

🔧 Exercise A: Measure Latency

Prometheus:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
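The query above estimates p95 from cumulative histogram buckets, not from raw request timings. The following is a simplified sketch of that idea using linear interpolation inside the matching bucket. It is not Prometheus's actual implementation, and `estimate_quantile` and the bucket values are hypothetical.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes the last bucket has upper bound +inf and that counts are
    cumulative, as in a classic Prometheus-style histogram.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("last bucket must have upper bound +inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank and count > prev_count:
            if bound == math.inf:
                # The quantile lies beyond the last finite bucket,
                # so the best available answer is that bucket's bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical buckets in seconds: 94 of 100 requests finished within 0.5 s.
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
p95 = estimate_quantile(0.95, buckets)
print(f"estimated p95 = {p95:.2f}s")
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit. Boundaries close to the latency target give a tighter estimate.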

ELK/OpenSearch:

{ "query": { "range": { "latency_ms": { "gte": 500 } } } }

🔧 Exercise B: Measure Traffic

Prometheus:

rate(http_requests_total[1m])

🔧 Exercise C: Measure Errors

rate(http_requests_total{status=~"5.."}[5m])

🔧 Exercise D: Measure Saturation

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

9️⃣ Real-World Outage Example (Golden Signals in Action)

❌ Incident

Checkout system slow during evening peak.

Golden Signals:

  • Latency: p99 = 3.5 sec

  • Traffic: RPS increased ×4

  • Errors: 502 and DB_TIMEOUT started appearing

  • Saturation: DB CPU = 92%, threads blocked

🎯 Root Cause

A long-running SQL report was blocking new queries → high latency → errors → saturation.

📈 Result

Fix applied → latency dropped → errors disappeared → saturation returned to normal.

Golden Signals made the RCA immediate.
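The incident readings above can be checked against alert thresholds in a few lines. This is a minimal sketch: the threshold values, the field names, and the 3% error ratio are hypothetical, while the latency, traffic, and CPU figures come from the incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    p99_latency_s: float
    traffic_vs_baseline: float  # current RPS divided by normal RPS
    error_ratio: float
    db_cpu_percent: float


# Hypothetical alert thresholds, one per Golden Signal.
THRESHOLDS = {
    "p99_latency_s": 1.0,
    "traffic_vs_baseline": 2.0,
    "error_ratio": 0.01,
    "db_cpu_percent": 85.0,
}


def breached_signals(snapshot):
    """Return the names of all signals above their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if getattr(snapshot, name) > limit]


incident = Snapshot(
    p99_latency_s=3.5,        # from the incident
    traffic_vs_baseline=4.0,  # RPS increased x4
    error_ratio=0.03,         # hypothetical; the incident gives no exact figure
    db_cpu_percent=92.0,      # from the incident
)
healthy = Snapshot(0.3, 1.0, 0.001, 40.0)

print(breached_signals(incident))  # all four signals are listed
print(breached_signals(healthy))   # prints []
```

Seeing all four signals breach at once points toward a shared bottleneck, which in this incident was the database.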


🔟 Interview Questions (Beginner → Architect)

Beginner

  1. What are the four Golden Signals?

  2. What is latency?

  3. Why is p95 more useful than average latency?

  4. What is an example of traffic?

  5. What does saturation indicate?

Intermediate

  1. Explain the relationship between traffic and latency.

  2. What are implicit errors?

  3. What happens when a DB connection pool saturates?

  4. How can latency hide downstream failures?

  5. How do you detect error spikes?

Senior

  1. Design a Golden Signals dashboard for a microservice.

  2. How do you detect hidden latency (tail latency)?

  3. Explain saturation in thread pools, DB pools, and queues.

  4. What is a retry storm and how does it relate to Golden Signals?

  5. How do Golden Signals shape auto-scaling decisions?

Architect

  1. How would you implement Golden Signals across 300 microservices in a hybrid cloud?

  2. How do you enforce Golden Signal standards across teams?

  3. How do you connect Golden Signals to SLOs and business KPIs?

  4. Explain how Golden Signals reduce MTTR.

  5. How do you design multi-region signal aggregation?


1️⃣1️⃣ Your Learning Notes

  • Today I learned:

  • Which Golden Signal was easiest to understand:

  • Which one was hardest:

  • Which tools I tested today:

  • What I need to review:

📢 Next → Day 4

SLI / SLO / SLA: the backbone of reliability engineering
