# Day 2: Logs, Metrics, Traces

*vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki*
Understand:

- What logs, metrics, and traces are
- Why all three are required for observability
- How each signal type helps during troubleshooting
- How data flows through Azure Monitor, App Insights, Prometheus, Grafana
- How to query telemetry using KQL and PromQL
- Real enterprise examples
- Interview questions for different job levels
Traditional monitoring focused only on infra metrics like CPU and RAM. Modern systems are:

- Distributed
- Containerized
- Event-driven
- Microservices-oriented

This complexity requires deeper, correlated visibility.
- Logs → What happened
- Metrics → How much / How often
- Traces → Where / Why it happened

Together, they form the observability foundation.
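The split above can be sketched in a few lines of Python. The in-memory stores and field names are made up, standing in for a real log pipeline, time-series DB, and trace backend:

```python
import time
import uuid

# Hypothetical in-memory stores standing in for a log pipeline,
# a time-series DB, and a trace store.
logs, metrics, traces = [], {}, []

def handle_request(path):
    request_id = str(uuid.uuid4())
    start = time.monotonic()

    # Log: a discrete event (what happened)
    logs.append({"level": "INFO", "msg": f"handling {path}", "request_id": request_id})

    duration = time.monotonic() - start

    # Metric: an aggregate counter (how much / how often)
    metrics["http_requests_total"] = metrics.get("http_requests_total", 0) + 1

    # Trace span: timing plus identity (where / why)
    traces.append({"span": path, "duration_s": duration, "request_id": request_id})

handle_request("/api/orders")
print(metrics["http_requests_total"])  # → 1
```

Note that the log record and the trace span share `request_id`; that shared identity is what makes the three signals correlatable later.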
Logs capture discrete events happening inside the system:

- Exceptions
- API requests
- Function app failures
- SQL timeouts
- Kubernetes pod crashes
- Security events
A single incident seen through all three signals:

- Metric spike: latency increased
- Trace: shows DB taking 2 seconds
- Log: shows SQL timeout

This triangulation gives you true observability.
```
[ Application / Container / VM ]
                ↓
┌───────────────────────────┐
│     Telemetry Signals     │
│ Logs    → Data Collector  │
│ Metrics → Time-Series DB  │
│ Traces  → Trace Store     │
└───────────────────────────┘
                ↓
    Query Layer (KQL / PromQL)
                ↓
  Dashboards (Grafana / Azure)
                ↓
       Alerts + SLO + RCA
```
Paste this in Azure Log Analytics:

```kql
AppTraces
| where SeverityLevel > 2
| project TimeGenerated, Message, OperationName
| take 20
```
Useful PromQL starting points:

```promql
# CPU time spent in non-idle modes
node_cpu_seconds_total{mode!="idle"}

# Container restarts per pod (via kube-state-metrics)
kube_pod_container_status_restarts_total

# p95 request latency over a 5-minute window
histogram_quantile(0.95, rate(http_server_request_duration_seconds_bucket[5m]))
```
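The `histogram_quantile` query is worth demystifying. Prometheus stores cumulative bucket counts and linearly interpolates inside the bucket where the target rank falls. A small Python sketch of that interpolation, with made-up bucket counts:

```python
# Cumulative buckets as Prometheus exposes them: pairs of
# (upper bound "le", count of observations <= that bound). Counts are made up.
buckets = [(0.1, 240), (0.25, 610), (0.5, 870), (1.0, 980), (float("inf"), 1000)]

def histogram_quantile(q, buckets):
    total = buckets[-1][1]
    rank = q * total  # rank of the observation we are estimating
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le  # quantile fell into the open-ended bucket
            # Linear interpolation inside the bucket, as Prometheus does
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count

print(histogram_quantile(0.95, buckets))  # ≈ 0.864
```

Real PromQL first applies `rate()` per bucket and aggregates across instances; the interpolation step at the end is the same. It also shows why quantile accuracy depends on bucket boundaries.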
Steps:

1. Go to Application Insights
2. Select Transaction Search
3. Filter by:
   - Duration > 1s
   - Operation = GET /api/orders
4. You will see each span and the total duration.
During the incident, infrastructure monitoring looked healthy:

- CPU normal
- Memory normal
- No alerts triggered

The telemetry told a different story:

- Trace → OrderService → SQL → 2.8 sec delay
- Logs → SQL: deadlock
- Metrics → DB DTU @ 90%

Root cause: missing database index.
Fix: index added → latency dropped from 2.8s to 180ms.
Common anti-patterns:
❌ Logging everything (cost explosion)
❌ Metrics without labels (incomplete)
❌ Missing correlation IDs
❌ No tracing in microservices
❌ Alerts only on CPU/memory
❌ No sampling strategy
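The last two anti-patterns suggest their own fix. A minimal head-sampling sketch, assuming log records carry a correlation ID (the field names here are illustrative, not from any specific library):

```python
import zlib

def should_keep(record, info_sample_rate=0.1):
    # Always keep warnings and errors; the cost explosion comes from
    # high-volume INFO/DEBUG traffic.
    if record["level"] in ("WARN", "ERROR"):
        return True
    # Hash the correlation ID so every log line of a sampled request is
    # kept together: a partial request is useless for troubleshooting.
    bucket = zlib.crc32(record["correlation_id"].encode()) % 100
    return bucket < info_sample_rate * 100
```

Because the decision is derived from the correlation ID rather than a random roll, every service that sees the same request makes the same keep/drop decision.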
Every request should carry a unique ID:

```
X-Correlation-ID: <GUID>
```

This ID must appear in:

- Logs
- Metrics labels
- Trace context

This is how telemetry becomes connected.
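A minimal sketch of how the ID might be minted at the edge and stamped onto every record; the helper names are hypothetical, not a specific framework's API:

```python
import uuid

def with_correlation_id(headers):
    """Reuse an incoming X-Correlation-ID, or mint one at the edge."""
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    headers["X-Correlation-ID"] = cid
    return cid

def log(level, msg, cid):
    # Structured log record: the correlation ID is a first-class field,
    # so it can be joined against metric labels and trace context later.
    return {"level": level, "msg": msg, "correlation_id": cid}

headers = {}
cid = with_correlation_id(headers)           # edge service mints the ID
entry = log("INFO", "order created", cid)    # every log record carries it
assert with_correlation_id(headers) == cid   # downstream hop reuses it
```

The key property: the edge mints the ID once, and every downstream hop reuses the header instead of generating its own.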
- What is a log?
- What is a metric?
- What is a trace?
- Example of a time-series metric?
- When would you use logs vs metrics?
- What is structured logging?
- Why do we need three pillars for observability?
- What is a span in distributed tracing?
- What is the difference between p95 and p99 latency?
- What is metric cardinality?
- What is log sampling?
- How do you correlate logs, metrics, and traces?
- Explain tracing for microservices.
- How does Prometheus store metrics?
- How do you optimize logging cost?
- How do you design log retention policies?
- What causes cardinality explosion in metrics?
- Design a telemetry pipeline for 500 microservices.
- Describe your approach to building an observability platform.
- How do you unify logs, metrics, and traces in a distributed ecosystem?
- What is your strategy for sampling traces at scale?
- How do you build SLO dashboards using metrics?
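The cardinality-explosion question above has a useful back-of-the-envelope answer: a metric's time-series count is the product of its label value counts. A Python sketch with made-up label sets:

```python
from math import prod

# Hypothetical label sets for a single metric, http_requests_total
labels = {
    "method":  ["GET", "POST", "PUT", "DELETE"],
    "status":  ["200", "400", "404", "500"],
    "service": [f"svc-{i}" for i in range(50)],
}

series = prod(len(values) for values in labels.values())
print(series)  # → 800 (4 * 4 * 50): manageable

# One unbounded label multiplies every existing combination:
labels["user_id"] = [f"u{i}" for i in range(10_000)]
exploded = prod(len(values) for values in labels.values())
print(exploded)  # → 8000000: a cardinality explosion
```

This is why per-user and per-request IDs belong in logs and trace context, never in metric labels.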
Key things I learned today:

What confused me initially:

How logs differ from traces:

Where metrics help the most:

Concepts I need to revise:
👉 Golden Signals (Latency, Traffic, Errors, Saturation)