Day 3 Golden Signals - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

By the end of Day 3, you should deeply understand:
- The four Golden Signals used globally across SRE and Observability
- Why they are more important than CPU/Memory monitoring
- How they help detect and prevent outages
- How to interpret them in distributed systems
- Where they appear in logs/metrics/traces
- How to create Golden Signals dashboards
- Interview questions for different experience levels
This is a key foundation before we move into SLI/SLO/SLA tomorrow.
Golden Signals are the minimum essential metrics required to understand a service's health.
They apply to any stack, including:
- Prometheus
- Grafana
- ELK / OpenSearch
- CloudWatch / Datadog / Dynatrace
- Azure Monitor / App Insights
- GCP Cloud Monitoring
- OpenTelemetry
The 4 signals are:
- Latency: how long it takes to process a request
- Traffic: how much load the system receives
- Errors: how many requests fail
- Saturation: how "full" the system is (resource pressure)
These work for APIs, microservices, serverless, databases, queues, and any distributed system.
Latency measures response time.
Examples:
- API response time
- p95 latency of a microservice
- Database query duration
- Queue processing delays
- Function execution duration
Averages hide problems; percentiles reveal them.
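To see why averages can mislead, here is a minimal sketch that compares the mean with percentiles on a made-up sample. The latency values and the nearest-rank `percentile` helper are assumptions for illustration; monitoring tools may use a different percentile method.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 90 fast requests and 10 slow ones (milliseconds).
latencies_ms = [100] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

In this sample the mean sits between the two groups and describes neither of them, while p50 shows the typical request and p95/p99 expose the slow tail that one in ten users actually experiences.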
Traffic measures the demand on your system.
Examples:
- Requests per second (RPS)
- Transactions per second (TPS)
- Kafka / Event Hubs message rate
- Concurrent sessions
- Data ingestion rate
A sudden traffic spike can cascade into:
- High latency
- Queue buildup
- Increased error rates
- System overload
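As a rough illustration of measuring traffic, the sketch below counts requests per one-second bucket and flags seconds where the rate reaches a multiple of a baseline. The timestamps, the baseline, and the spike factor of 4 are all made-up assumptions; in practice this number usually comes from a metrics backend rather than raw timestamps.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests in each whole-second bucket (timestamps are in seconds)."""
    return Counter(int(ts) for ts in timestamps)


def detect_spikes(rps_by_second, baseline_rps, factor=4):
    """Return the seconds whose request count is at least factor x the baseline."""
    threshold = baseline_rps * factor
    return sorted(sec for sec, count in rps_by_second.items() if count >= threshold)


# Hypothetical request arrival times: quiet for two seconds, then a burst.
arrivals = [0.1, 0.5, 1.2, 1.7, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.9]

rps = requests_per_second(arrivals)
spikes = detect_spikes(rps, baseline_rps=2)

print(dict(rps))
print("spike seconds:", spikes)
```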
Errors represent failed or degraded requests.
- HTTP 4xx, 5xx
- Exceptions
- Failed DB queries
- Message retry failures
- Requests so slow they violate the SLO
- Partial failures in distributed systems
- Timeouts due to downstream services
Errors show the quality of the service.
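The list above mixes explicit errors (HTTP 4xx/5xx) with implicit ones (requests that succeed but are too slow). A minimal sketch of an error rate that counts both is shown below. The `Request` type, the sample data, and the 500 ms SLO threshold are hypothetical; whether 4xx responses should count against the service is a per-team decision.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def is_error(req, slo_ms):
    """Explicit error (status >= 400) or implicit error (slower than the SLO threshold)."""
    return req.status >= 400 or req.duration_ms > slo_ms


def error_rate(requests, slo_ms):
    """Fraction of requests counted as errors; 0.0 for an empty sample."""
    if not requests:
        return 0.0
    failed = sum(1 for req in requests if is_error(req, slo_ms))
    return failed / len(requests)


# Hypothetical sample of ten requests.
sample = [
    Request(200, 120), Request(200, 80), Request(500, 40), Request(404, 30),
    Request(200, 900), Request(200, 100), Request(200, 95), Request(200, 110),
    Request(503, 2000), Request(200, 105),
]

rate = error_rate(sample, slo_ms=500)
print(f"error rate: {rate:.0%}")
```

Only three requests returned an error status, but the 900 ms response also counts because it breaks the assumed SLO, so the rate is four out of ten.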
Saturation indicates resource exhaustion.
Common indicators:
- CPU > 85-90%
- Memory nearing limit
- Disk IOPS saturated
- DB connection pool full
- Thread pool exhaustion
- Kafka consumer lag
Saturation predicts failure before it happens.
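One simple way to reason about saturation is as used capacity divided by total capacity for each bounded resource. The sketch below applies that idea with an 85% warning threshold, taken from the CPU indicator above; the resource names and numbers are made up for illustration.

```python
def saturation(used, capacity):
    """Return utilization as a fraction of capacity (0.0 to 1.0 for a bounded resource)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def saturated_resources(resources, threshold=0.85):
    """Return the names of resources at or above the threshold."""
    return [
        name
        for name, (used, capacity) in resources.items()
        if saturation(used, capacity) >= threshold
    ]


# Hypothetical snapshot: (used, capacity) per resource.
snapshot = {
    "cpu_percent": (92, 100),
    "db_connections": (48, 50),
    "worker_threads": (120, 200),
}

at_risk = saturated_resources(snapshot)
print("approaching exhaustion:", at_risk)
```

Alerting on the ratio rather than on failures is what gives saturation its early-warning value: the pool at 48 of 50 connections is still serving requests, but it has almost no headroom left.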
Client -> System -> Golden Signals (Latency, Traffic, Errors, Saturation) -> Observability -> Alerts/SLOs/RCA
Telemetry Collectors (OpenTelemetry/Agents) -> Metrics + Logs + Traces -> Dashboards (Grafana / ELK / Datadog / CloudWatch / Azure) -> Notifications + Auto-Remediation
Most modern observability platforms follow this pipeline.
These apply broadly across systems:
Latency:
- API p95 latency
- SQL query duration
- Function execution time
- External API call slowdowns
Traffic:
- Requests per second
- Messages ingested per second
- Active connections
- Page views
Errors:
- HTTP 5xx
- Timeout exceptions
- Dependency failures
- Retry storms
Saturation:
- CPU high
- Memory leak
- Disk I/O bottleneck
- DB connection exhaustion
Latency (Prometheus):
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Latency (ELK/OpenSearch):
{ "range": { "latency_ms": { "gte": 500 } } }
Traffic (Prometheus):
rate(http_requests_total[1m])
Errors (Prometheus):
rate(http_requests_total{status=~"5.."}[5m])
Saturation (Prometheus):
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
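To build intuition for what the `histogram_quantile` query above estimates, here is a simplified sketch of quantile estimation from cumulative histogram buckets using linear interpolation inside the matching bucket. The bucket bounds and counts are made up, and this is an approximation of the idea only: the real Prometheus function handles more edge cases than shown here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with the last upper bound being +inf (like the le="+Inf" bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assume the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the open-ended bucket: report the
                # highest finite upper bound instead of infinity.
                return prev_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds: (le, cumulative count).
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p95 = histogram_quantile(0.95, duration_buckets)
print(f"estimated p95: {p95:.4f} s")
```

Because the estimate is interpolated, its accuracy depends on how well the bucket boundaries match the latencies you care about; a p95 that lands in a wide bucket is only a rough figure.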
The checkout system was slow during the evening peak.
- Latency: p99 = 3.5 sec
- Traffic: RPS increased 4x
- Errors: 502 and DB_TIMEOUT started appearing
- Saturation: DB CPU = 92%, threads blocked
A long-running SQL report was blocking new queries → high latency → errors → saturation.
Fix applied → latency dropped → errors disappeared → saturation returned to normal.
Golden Signals made the RCA immediate.
- What are the four Golden Signals?
- What is latency?
- Why is p95 more useful than average latency?
- What is an example of traffic?
- What does saturation indicate?
- Explain the relationship between traffic and latency.
- What are implicit errors?
- What happens when a DB connection pool saturates?
- How can latency hide downstream failures?
- How do you detect error spikes?
- Design a Golden Signals dashboard for a microservice.
- How do you detect hidden latency (tail latency)?
- Explain saturation in thread pools, DB pools, and queues.
- What is a retry storm and how does it relate to Golden Signals?
- How do Golden Signals shape auto-scaling decisions?
- Implement Golden Signals across 300 microservices in hybrid cloud.
- How do you enforce Golden Signal standards across teams?
- How do you connect Golden Signals to SLOs and business KPIs?
- Explain how Golden Signals reduce MTTR.
- How do you design multi-region signal aggregation?
Today I learned:
Which Golden Signal was easiest to understand:
Which one was hardest:
Which tools I tested today:
What I need to review: