Day 2 Logs, Metrics, Traces - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 2 — Logs, Metrics, and Traces (The Three Pillars of Observability)

A deep dive into the core of modern observability systems.


🎯 Learning Objective

Understand:

  • What logs, metrics, and traces are

  • Why all three are required for observability

  • How each signal type helps during troubleshooting

  • How data flows through Azure Monitor, Application Insights, Prometheus, and Grafana

  • How to query telemetry using KQL and PromQL

  • Real enterprise examples

  • Interview questions for different job levels


1️⃣ Why Logs, Metrics, and Traces Matter

Traditional monitoring focused only on infrastructure metrics such as CPU and memory. Modern systems are:

  • Distributed

  • Containerized

  • Event-driven

  • Microservices-oriented

This complexity requires deeper, correlated visibility.

The three pillars provide:

  • Logs → What happened

  • Metrics → How much / How often

  • Traces → Where / Why it happened

Together, they form the observability foundation.


2️⃣ Logs — Text Events, Errors, Exceptions

Logs capture discrete events happening inside the system.

📌 Examples:

  • Exceptions

  • API requests

  • Function app failures

  • SQL timeout

  • Kubernetes pod crash

  • Security events

✔ Characteristics

| Property | Details |
| -- | -- |
| Data Type | Text / JSON |
| Good For | Debugging, application errors |
| Volume | High |
| Storage | Log Analytics, Loki, Elasticsearch |
| Query | KQL, LogQL |
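The "Text / JSON" data type is easier to query when every log line is a single JSON object. The following is a minimal sketch using only Python's standard `logging` module; the field names `operation` and `duration_ms` and the helper `build_logger` are illustrative assumptions, not part of any Azure or Loki API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        # These two field names are hypothetical examples.
        for key in ("operation", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger("orders", sys.stdout).error("SQL timeout", extra={"operation": "GET /api/orders", "duration_ms": 2000})` then emits one JSON line that a log backend can filter by field instead of by substring.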

A real workflow:

  • Metric spike: latency increased

  • Trace: shows DB taking 2 seconds

  • Log: shows SQL timeout

This triangulation gives you true observability.


6️⃣ Architecture Flow (GitHub Wiki Friendly)

    [ Application / Container / VM ]
                  ↓
    ┌───────────────────────────┐
    │ Telemetry Signals         │
    │ Logs → Data Collector     │
    │ Metrics → Time-Series DB  │
    │ Traces → Trace Store      │
    └───────────────────────────┘
                  ↓
    Query Layer (KQL / PromQL)
                  ↓
    Dashboards (Grafana / Azure)
                  ↓
    Alerts + SLO + RCA

You may add the image version in the wiki as well.
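The first stage of this flow, sending each signal type to its own backend, can be sketched in a few lines. This is only an illustration of the routing idea: the backend names are labels copied from the diagram, and `Signal`, `ROUTES`, and `route` are hypothetical names, not real collector APIs.

```python
from collections import defaultdict
from dataclasses import dataclass

# Labels taken from the diagram; they stand in for real backends.
ROUTES = {
    "log": "data_collector",
    "metric": "time_series_db",
    "trace": "trace_store",
}


@dataclass(frozen=True)
class Signal:
    kind: str       # "log", "metric", or "trace"
    payload: dict


def route(signals) -> dict:
    """Group signal payloads by the backend that should store them."""
    stores = defaultdict(list)
    for signal in signals:
        backend = ROUTES.get(signal.kind)
        if backend is None:
            raise ValueError(f"unknown signal kind: {signal.kind}")
        stores[backend].append(signal.payload)
    return dict(stores)
```

In a real pipeline this role is usually played by an agent or collector process rather than application code.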


7️⃣ Hands-On Labs (Practical Work)

🔧 Lab 1 — Query Logs in KQL

Paste this in Azure Log Analytics:

    AppTraces
    | where SeverityLevel > 2
    | project TimeGenerated, Message, OperationName
    | take 20
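To make the three KQL operators concrete, here is a rough Python equivalent that runs over in-memory rows. It assumes each row is a dict keyed by the column names from the query and that `SeverityLevel` is an integer where values above 2 mean Error or Critical; the function name `query_traces` is made up for this sketch.

```python
from typing import Iterable


def query_traces(rows: Iterable[dict], min_severity: int = 2,
                 limit: int = 20) -> list:
    """Mimic: AppTraces | where SeverityLevel > 2 | project ... | take 20."""
    columns = ("TimeGenerated", "Message", "OperationName")
    result = []
    for row in rows:
        if row["SeverityLevel"] > min_severity:               # where
            result.append({c: row.get(c) for c in columns})   # project
            if len(result) == limit:                          # take
                break
    return result
```

Note that `take` makes no ordering promise in KQL; this sketch simply returns the first matching rows it encounters.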

🔧 Lab 2 — Query Metrics in PromQL

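Node CPU (per-second non-idle CPU time per instance; the raw counter only ever increases, so wrap it in rate()):

sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))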
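`node_cpu_seconds_total` is a counter, so dashboards normally show its per-second increase rather than its raw value. The sketch below shows the basic idea behind a rate calculation over one window, including the usual handling of counter resets. It is a simplification: PromQL's real `rate()` also extrapolates to the window boundaries, so its results differ slightly. The function name `simple_rate` is an assumption for this example.

```python
def simple_rate(samples: list) -> float:
    """Per-second increase of a counter over the sampled window.

    samples: (timestamp_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example after a restart),
        # so the new value itself is the increase since the reset.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed
```

For samples taken every 15 seconds with the counter growing by 3 each time, the function returns 0.2, meaning 0.2 CPU-seconds consumed per second.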

Pod restarts:

kube_pod_container_status_restarts_total

API latency (p95 across all series; aggregate the buckets by le before taking the quantile):

histogram_quantile(0.95, sum by (le) (rate(http_server_request_duration_seconds_bucket[5m])))
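It helps to know how a quantile is estimated from histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. The following is a simplified sketch of that approach for classic cumulative buckets. It assumes the lowest bucket starts at 0 and that the input includes a `+Inf` bucket; it is not the Prometheus source code.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total_count) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # The +Inf bucket has no finite upper edge, so fall back to
            # the highest finite bound. Also guard against empty buckets.
            if math.isinf(bound) or in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]`, the p95 rank is 95, which falls in the 0.5 to 1.0 bucket, and the estimate is 0.8125 seconds. The result is only as precise as the bucket boundaries allow.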

🔧 Lab 3 — View Traces in Application Insights

Steps:

  1. Go to Application Insights

  2. Select Transaction Search

  3. Filter by:

    • Duration > 1s

    • Operation = GET /api/orders

You will see each span and total duration.
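A trace is a tree of spans, each with a parent, a start time, and an end time. The sketch below models that structure and shows how the total duration and the slowest leaf span could be read from it. The `Span` class and its field names are illustrative assumptions, not the Application Insights data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def total_duration_ms(spans: list) -> float:
    """Duration of the whole request, taken from the root span."""
    root = next(s for s in spans if s.parent_id is None)
    return root.duration_ms


def slowest_leaf(spans: list) -> Span:
    """Leaf spans have no children; the slowest is a good first suspect."""
    parents = {s.parent_id for s in spans if s.parent_id is not None}
    leaves = [s for s in spans if s.span_id not in parents]
    return max(leaves, key=lambda s: s.duration_ms)
```

For a request where `GET /api/orders` calls an order service that in turn runs a SQL query, `slowest_leaf` points at the SQL span, which is the same conclusion you would reach by reading the trace view.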


8️⃣ Enterprise Example — Real Outage

Scenario: Checkout API slow

Monitoring shows:

  • CPU normal

  • Memory normal

  • No alerts triggered

Observability shows:

  • Trace → OrderService → SQL → 2.8 sec delay

  • Logs → SQL: deadlock

  • Metrics → DB DTU @ 90%

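Root cause: Missing database index (queries fell back to scans, which is consistent with the high DTU usage and the lock contention seen in the logs)

Fix: Index added → latency dropped from 2.8 s to 180 ms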


9️⃣ Common Mistakes to Avoid

❌ Logging everything (cost explosion)
❌ Metrics without labels (no way to slice by service, endpoint, or status)
❌ Missing correlation IDs
❌ No tracing in microservices
❌ Alerts only on CPU/memory
❌ No sampling strategy
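One common sampling strategy is deterministic head sampling: hash the trace ID and keep the trace only if the hash falls below the sample rate. Because the decision depends only on the trace ID, every service keeps or drops the same traces. This is a minimal sketch; the function name `keep_trace` and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Return True if this trace should be kept at the given rate (0.0-1.0)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Head sampling is cheap but can drop the rare slow or failed requests; tail sampling, which decides after the trace completes, trades extra buffering for the ability to keep those.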


🔟 Bonus: Correlation ID Pattern

Every request should carry a unique ID:

X-Correlation-ID: <GUID>

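This ID should appear in:

  • Logs

  • Trace context

Keep it out of metric labels: a value that is unique per request creates a new time series for every request and causes cardinality explosion. To jump from a metric to the matching trace, attach trace IDs as exemplars instead.

This is how telemetry becomes connected.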
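The pattern can be sketched with Python's standard library: read the header (or generate a new ID), store it in a context variable, stamp it on every log line, and forward it on outgoing calls. The names `correlation_id`, `CorrelationFilter`, `build_logger`, and `handle_request` are hypothetical, and the request handling is reduced to plain dicts rather than a real web framework.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being processed.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.addFilter(CorrelationFilter())
    logger.propagate = False
    return logger


def handle_request(headers: dict, logger: logging.Logger) -> dict:
    """Reuse the caller's ID or create one, then pass it downstream."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("processing request")
        # Headers to send on any outgoing call made for this request.
        return {"X-Correlation-ID": cid}
    finally:
        correlation_id.reset(token)
```

Using a context variable rather than a global keeps concurrent requests from overwriting each other's ID. For traces specifically, the W3C Trace Context `traceparent` header plays a similar role in a standardized form.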


1️⃣1️⃣ Interview Questions (Beginner → Architect)


🎯 Beginner-Level

  1. What is a log?

  2. What is a metric?

  3. What is a trace?

  4. Example of a time-series metric?

  5. When would you use logs vs metrics?

  6. What is structured logging?


🎯 Intermediate-Level

  1. Why do we need three pillars for observability?

  2. What is a span in distributed tracing?

  3. What is the difference between p95 and p99 latency?

  4. What is metric cardinality?

  5. What is log sampling?

  6. How do you correlate logs, metrics, and traces?


🎯 Senior-Level

  1. Explain tracing for microservices.

  2. How does Prometheus store metrics?

  3. How do you optimize logging cost?

  4. How do you design log retention policies?

  5. What causes cardinality explosion in metrics?


🎯 Architect-Level

  1. Design a telemetry pipeline for 500 microservices.

  2. Describe your approach to building an observability platform.

  3. How do you unify logs, metrics, and traces in a distributed ecosystem?

  4. What is your strategy for sampling traces at scale?

  5. How do you build SLO dashboards using metrics?


1️⃣2️⃣ Your Learning Notes

(Add this section to encourage readers)

  • Key things I learned today:

  • What confused me initially:

  • How logs differ from traces:

  • Where metrics help the most:

  • Concepts I need to revise:

📢 Next → Day 3

👉 Golden Signals (Latency, Traffic, Errors, Saturation)
