Day 10 Hands on: Azure Monitor Grafana Basics - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 10 — Logging Deep Dive

Structured Logging • JSON • Log Levels • Enrichment • Indexing • Retention • Cost Optimization


📌 Overview

Logging is one of the three pillars of Observability (Logs, Metrics, Traces).
Today we go deep into how to design logs that are:

  • Searchable

  • Machine-readable

  • Cost-efficient

  • Correlated with traces

  • Useful during incidents

  • Helpful for RCA and SRE workflows

Logs are often the richest telemetry signal and usually the most expensive to store and index, so designing them well is critical.


1️⃣ What Is Logging?

Logging is the process of recording important events happening inside your system.

Types of logs:

  • Application logs

  • System logs

  • Access logs

  • Security/Audit logs

  • Transaction logs

  • Cloud platform logs

  • Container & Runtime logs

Logs explain why something happened inside your system.


2️⃣ Structured Logging (The Right Way to Log)

❌ Bad (Unstructured)

User login failed for John due to invalid token

✔ Correct (Structured JSON)

{ "event": "login_failed", "user": "john", "reason": "invalid_token", "timestamp": "2025-01-01T12:22:10Z", "trace_id": "93ab12f1df", "service": "auth-service" }
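
One way to produce a record shaped like the one above is a small JSON formatter on top of Python's standard `logging` module. This is a minimal sketch, not a prescribed implementation: the `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are hypothetical names chosen for illustration, and the output is written to an in-memory stream only so the result is easy to inspect.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
        }
        # Assumption: callers pass structured fields as extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream, service="auth-service"):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(f"structured.{service}")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


# A real service would typically pass sys.stdout instead of StringIO.
stream = io.StringIO()
log = build_logger(stream)
log.warning(
    "login_failed",
    extra={"fields": {"user": "john", "reason": "invalid_token", "trace_id": "93ab12f1df"}},
)
print(stream.getvalue().strip())
```

Keeping the event name short and stable (`login_failed`) and moving the variable parts into fields is what makes the record searchable later.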

Why JSON Log Format?

  • Easy to index

  • Easy to parse

  • Easy to analyze

  • Supported by the major log platforms (ELK, Datadog, Splunk, Loki)

  • Enables ML-based anomaly detection

Most modern logging stacks ingest JSON natively, which makes it a sensible default format.
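
To illustrate the "easy to parse" and "easy to analyze" points, the sketch below reads a few log lines, separates the JSON records from unstructured text, and filters on a field. The sample lines reuse the two examples from this section plus one made-up record (`maria`, `hypothetical-2`), and `parse_lines` is a hypothetical helper name.

```python
import json

# Two structured lines and one unstructured line; the second record is invented for the demo.
raw_lines = [
    '{"event": "login_failed", "user": "john", "reason": "invalid_token", '
    '"trace_id": "93ab12f1df", "service": "auth-service"}',
    '{"event": "login_succeeded", "user": "maria", '
    '"trace_id": "hypothetical-2", "service": "auth-service"}',
    "User login failed for John due to invalid token",
]


def parse_lines(lines):
    """Split lines into parsed JSON records and lines that are not JSON objects."""
    records, unparsed = [], []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            unparsed.append(line)
            continue
        if isinstance(rec, dict):
            records.append(rec)
        else:
            unparsed.append(line)
    return records, unparsed


records, unparsed = parse_lines(raw_lines)

# Filtering on a field is a plain dictionary lookup; no regex is needed.
failed = [r for r in records if r.get("event") == "login_failed"]

print(f"{len(records)} structured, {len(unparsed)} unstructured")
for r in failed:
    print(r["user"], r["reason"], r["trace_id"])
```

The unstructured line can only be matched with text search or a regular expression, which is the practical cost of the "bad" format shown earlier.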


3️⃣ Log Levels (Critical for SRE & Developers)

Use correct log levels:

Level | Meaning | Example
-- | -- | --
DEBUG | Low-level details | Request body, function steps
INFO | Normal operations | User login successful
WARN | Something unexpected | Retry attempt #2
ERROR | Something failed | Payment timeout
FATAL | System unusable | DB connection lost
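
The effect of a level threshold can be sketched with Python's standard `logging` module. Note that Python names the top two levels `WARNING` and `CRITICAL` (with `logging.FATAL` as an alias for `CRITICAL`), so the output labels differ slightly from the table. The `emit_all_levels` helper is a hypothetical name, and the messages are taken from the table's examples.

```python
import io
import logging


def emit_all_levels(threshold):
    """Log one message per level and return the lines that pass the threshold."""
    stream = io.StringIO()
    logger = logging.getLogger(f"levels.demo.{threshold}")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(threshold)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.debug("request body parsed")
    logger.info("user login successful")
    logger.warning("retry attempt #2")
    logger.error("payment timeout")
    logger.critical("db connection lost")
    return stream.getvalue().splitlines()


# A verbose threshold keeps everything; a stricter one drops DEBUG and INFO.
dev_lines = emit_all_levels(logging.DEBUG)
prod_lines = emit_all_levels(logging.WARNING)

print(len(dev_lines), "lines at DEBUG threshold")
print(len(prod_lines), "lines at WARNING threshold")
for line in prod_lines:
    print(line)
```

Raising the threshold in production is one of the simplest ways to reduce log volume, at the price of losing the lower-level detail when something goes wrong.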

🔟 Real-World Example (How Logs Solve Incidents)

Issue:

Checkout API returns 502 errors.

Logs show:

payment-service timeout after 5000ms
retrying request...
circuit-breaker OPEN

Correlated with:

  • Traces → Payment service slow

  • Metrics → DB latency high

  • Deployment → new version deployed 30 mins ago

Root Cause:

Slow external payment gateway → retry storm → DB saturation.

Logs + metrics + traces = much faster RCA.
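
The correlation step in this example depends on every log record carrying a trace ID. The sketch below groups records by `trace_id` and keeps only the traces that contain an error, which lines up the failing request across services. All records, field names, and trace IDs (`t-1`, `t-2`) are invented for illustration; only the payment-service messages mirror the log lines quoted above.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records from several services.
records = [
    {"service": "checkout-api", "level": "ERROR", "trace_id": "t-1",
     "message": "upstream returned 502"},
    {"service": "payment-service", "level": "ERROR", "trace_id": "t-1",
     "message": "timeout after 5000ms"},
    {"service": "payment-service", "level": "WARN", "trace_id": "t-1",
     "message": "retrying request"},
    {"service": "payment-service", "level": "ERROR", "trace_id": "t-1",
     "message": "circuit-breaker OPEN"},
    {"service": "auth-service", "level": "INFO", "trace_id": "t-2",
     "message": "login ok"},
]


def group_by_trace(recs):
    """Collect records under their trace_id, preserving input order."""
    grouped = defaultdict(list)
    for rec in recs:
        grouped[rec["trace_id"]].append(rec)
    return dict(grouped)


def failing_traces(grouped):
    """Keep only traces that contain at least one ERROR record."""
    return {
        trace_id: recs
        for trace_id, recs in grouped.items()
        if any(r["level"] == "ERROR" for r in recs)
    }


failing = failing_traces(group_by_trace(records))
for trace_id, recs in failing.items():
    print(f"trace {trace_id}:")
    for r in recs:
        print(f"  [{r['level']}] {r['service']}: {r['message']}")
```

In a real pipeline this grouping is usually done by the log backend's query language rather than in application code; the sketch only shows why the shared field matters.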


1️⃣1️⃣ Interview Questions (GitHub Wiki Style)


Beginner-Level

  • What is structured logging?

  • Why use JSON logs?

  • What are log levels?


Intermediate-Level

  • How do you enrich logs with trace IDs?

  • Explain why indexing affects log costs.

  • What is a retention policy?


Senior-Level

  • Design a logging pipeline for a microservice architecture.

  • How do you optimize logs for cost without losing insight?

  • What fields should be indexed in an enterprise system?


Architect-Level

  • Build a multi-cloud logging architecture using OTel + FluentBit + ES + S3.

  • Define governance rules for logging across 50+ teams.

  • Explain the role of logs in SLOs and SLIs.


📝 Your Learning Notes

  • What I learned today:

  • My current logging anti-patterns:

  • Improvements I will apply:

  • My retention plan:

  • Tools I want to test: