# Day 2: Logs, Metrics, Traces

*vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki*
Understand:

- What logs, metrics, and traces are
- Why all three are required for observability
- How each signal type helps during troubleshooting
- How data flows through Azure Monitor, App Insights, Prometheus, Grafana
- How to query telemetry using KQL and PromQL
- Real enterprise examples
- Interview questions for different job levels
Traditional monitoring focused only on infra metrics like CPU and RAM. Modern systems are:

- Distributed
- Containerized
- Event-driven
- Microservices-oriented

This complexity requires deeper, correlated visibility.
- Logs → What happened
- Metrics → How much / How often
- Traces → Where / Why it happened

Together, they form the observability foundation.
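The split above can be sketched in a few lines of Python. The in-memory stores and field names are made up, standing in for a real log pipeline, time-series DB, and trace backend:

```python
import time
import uuid

# Hypothetical in-memory stores standing in for a log pipeline,
# a time-series DB, and a trace store.
logs, metrics, traces = [], {}, []

def handle_request(path):
    request_id = str(uuid.uuid4())
    start = time.monotonic()

    # Log: a discrete event (what happened)
    logs.append({"level": "INFO", "msg": f"handling {path}", "request_id": request_id})

    duration = time.monotonic() - start

    # Metric: an aggregate counter (how much / how often)
    metrics["http_requests_total"] = metrics.get("http_requests_total", 0) + 1

    # Trace span: timing plus identity (where / why)
    traces.append({"span": path, "duration_s": duration, "request_id": request_id})

handle_request("/api/orders")
print(metrics["http_requests_total"])  # → 1
```

Note that the log record and the trace span share `request_id`; that shared identity is what makes the three signals correlatable later.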
Logs capture discrete events happening inside the system:

- Exceptions
- API requests
- Function app failures
- SQL timeouts
- Kubernetes pod crashes
- Security events
A single incident seen through all three signals:

- Metric spike: latency increased
- Trace: shows DB taking 2 seconds
- Log: shows SQL timeout

This triangulation gives you true observability.
```
[ Application / Container / VM ]
                ↓
┌───────────────────────────┐
│     Telemetry Signals     │
│ Logs    → Data Collector  │
│ Metrics → Time-Series DB  │
│ Traces  → Trace Store     │
└───────────────────────────┘
                ↓
    Query Layer (KQL / PromQL)
                ↓
  Dashboards (Grafana / Azure)
                ↓
       Alerts + SLO + RCA
```
Paste this in Azure Log Analytics:

```kql
AppTraces
| where SeverityLevel > 2
| project TimeGenerated, Message, OperationName
| take 20
```
Useful PromQL starting points:

```promql
# CPU time spent in non-idle modes
node_cpu_seconds_total{mode!="idle"}

# Container restarts per pod (via kube-state-metrics)
kube_pod_container_status_restarts_total

# p95 request latency over a 5-minute window
histogram_quantile(0.95, rate(http_server_request_duration_seconds_bucket[5m]))
```
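The `histogram_quantile` query is worth demystifying. Prometheus stores cumulative bucket counts and linearly interpolates inside the bucket where the target rank falls. A small Python sketch of that interpolation, with made-up bucket counts:

```python
# Cumulative buckets as Prometheus exposes them: pairs of
# (upper bound "le", count of observations <= that bound). Counts are made up.
buckets = [(0.1, 240), (0.25, 610), (0.5, 870), (1.0, 980), (float("inf"), 1000)]

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # rank of the observation we are estimating
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile fell into the open-ended bucket
            # Linear interpolation inside the bucket, as Prometheus does
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(histogram_quantile(0.95, buckets))  # ≈ 0.864
```

Real PromQL first applies `rate()` per bucket and aggregates across instances; the interpolation step at the end is the same. It also shows why quantile accuracy depends on bucket boundaries.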
Steps:

1. Go to Application Insights
2. Select Transaction Search
3. Filter by:
   - Duration > 1s
   - Operation = GET /api/orders
4. You will see each span and the total duration.
During the incident, infrastructure monitoring looked healthy:

- CPU normal
- Memory normal
- No alerts triggered

The telemetry told a different story:

- Trace → OrderService → SQL → 2.8 sec delay
- Logs → SQL: deadlock
- Metrics → DB DTU @ 90%

Root cause: missing database index.
Fix: index added → latency dropped from 2.8s to 180ms.
Common anti-patterns:
❌ Logging everything (cost explosion)
❌ Metrics without labels (incomplete)
❌ Missing correlation IDs
❌ No tracing in microservices
❌ Alerts only on CPU/memory
❌ No sampling strategy
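The last two anti-patterns suggest their own fix. A minimal head-sampling sketch, assuming log records carry a correlation ID (the field names here are illustrative, not from any specific library):

```python
import zlib

def should_keep(record, info_sample_rate=0.1):
    # Always keep warnings and errors; the cost explosion comes from
    # high-volume INFO/DEBUG traffic.
    if record["level"] in ("WARN", "ERROR"):
        return True
    # Hash the correlation ID so every log line of a sampled request is
    # kept together: a partial request is useless for troubleshooting.
    bucket = zlib.crc32(record["correlation_id"].encode()) % 100
    return bucket < info_sample_rate * 100
```

Because the decision is derived from the correlation ID rather than a random roll, every service that sees the same request makes the same keep/drop decision.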
Every request should carry a unique ID:

```
X-Correlation-ID: <GUID>
```

This ID must appear in:

- Logs
- Metrics labels
- Trace context

This is how telemetry becomes connected.
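A minimal sketch of how the ID might be minted at the edge and stamped onto every record; the helper names are hypothetical, not a specific framework's API:

```python
import uuid

def with_correlation_id(headers):
    """Reuse an incoming X-Correlation-ID, or mint one at the edge."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    headers["X-Correlation-ID"] = cid
    return cid

def log(level, msg, cid):
    # Structured log record: the correlation ID is a first-class field,
    # so it can be joined against metric labels and trace context later.
    return {"level": level, "msg": msg, "correlation_id": cid}

headers = {}
cid = with_correlation_id(headers)           # edge service mints the ID
entry = log("INFO", "order created", cid)    # every log record carries it
assert with_correlation_id(headers) == cid   # downstream hop reuses it
```

The key property: the edge mints the ID once, and every downstream hop reuses the header instead of generating its own.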
- What is a log?
- What is a metric?
- What is a trace?
- Example of a time-series metric?
- When would you use logs vs metrics?
- What is structured logging?
- Why do we need three pillars for observability?
- What is a span in distributed tracing?
- What is the difference between p95 and p99 latency?
- What is metric cardinality?
- What is log sampling?
- How do you correlate logs, metrics, and traces?
- Explain tracing for microservices.
- How does Prometheus store metrics?
- How do you optimize logging cost?
- How do you design log retention policies?
- What causes cardinality explosion in metrics?
- Design a telemetry pipeline for 500 microservices.
- Describe your approach to building an observability platform.
- How do you unify logs, metrics, and traces in a distributed ecosystem?
- What is your strategy for sampling traces at scale?
- How do you build SLO dashboards using metrics?
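The cardinality-explosion question above has a useful back-of-the-envelope answer: a metric's time-series count is the product of its label value counts. A Python sketch with made-up label sets:

```python
from math import prod

# Hypothetical label sets for a single metric, http_requests_total
labels = {
    "method":  ["GET", "POST", "PUT", "DELETE"],
    "status":  ["200", "400", "404", "500"],
    "service": [f"svc-{i}" for i in range(50)],
}

series = prod(len(values) for values in labels.values())
print(series)  # → 800 (4 * 4 * 50): manageable

# One unbounded label multiplies every existing combination:
labels["user_id"] = [f"u{i}" for i in range(10_000)]
exploded = prod(len(values) for values in labels.values())
print(exploded)  # → 8000000: a cardinality explosion
```

This is why per-user and per-request IDs belong in logs and trace context, never in metric labels.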
Key things I learned today:

What confused me initially:

How logs differ from traces:

Where metrics help the most:

Concepts I need to revise:
👉 Golden Signals (Latency, Traffic, Errors, Saturation)