Day 10 Hands on: Azure Monitor Grafana Basics - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki
Logging is one of the three pillars of Observability (Logs, Metrics, Traces).
Today we go deep into how to design logs that are:
- Searchable
- Machine-readable
- Cost-efficient
- Correlated with traces
- Useful during incidents
- Helpful for RCA and SRE workflows
Logs are the most detailed telemetry signal, but often also the most expensive to store and index, so understanding them is critical.
Logging is the process of recording important events happening inside your system.
- Application logs
- System logs
- Access logs
- Security/Audit logs
- Transaction logs
- Cloud platform logs
- Container & Runtime logs
Logs explain why something happened inside your system.
Unstructured log: User login failed for John due to invalid token
Structured (JSON) log: { "event": "login_failed", "user": "john", "reason": "invalid_token", "timestamp": "2025-01-01T12:22:10Z", "trace_id": "93ab12f1df", "service": "auth-service" }
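A record like the JSON one above could be produced with Python's standard `logging` module. The following is a minimal sketch, not a definitive implementation: the `JsonFormatter` class, the `build_logger` helper, and the list of copied fields are assumptions made for illustration, and the field names simply follow the example record.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    # Assumed set of context fields, taken from the example record.
    EXTRA_FIELDS = ("event", "user", "reason", "trace_id", "service")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Copy only the context fields that the caller supplied via `extra`.
        for field in self.EXTRA_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def build_logger(name: str = "auth-service") -> logging.Logger:
    """Hypothetical helper: a logger whose output is one JSON object per line."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.warning(
        "User login failed",
        extra={
            "event": "login_failed",
            "user": "john",
            "reason": "invalid_token",
            "trace_id": "93ab12f1df",
            "service": "auth-service",
        },
    )
```

Passing context through `extra` keeps the message text stable while the searchable details live in separate fields, which is what makes the record machine-readable.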
- Easy to index
- Easy to parse
- Easy to analyze
- Works with the major log tools (ELK, Datadog, Splunk, Loki)
- Enables ML-based anomaly detection
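To make "easy to parse" and "easy to analyze" concrete, here is a small sketch that reads JSON log lines and counts them by field. The sample lines in `RAW_LOGS` and the helper names `parse_lines` and `count_by` are hypothetical and exist only for this illustration.

```python
import json
from collections import Counter

# Hypothetical sample input: three JSON records and one unstructured line.
RAW_LOGS = [
    '{"event": "login_failed", "user": "john", "reason": "invalid_token", "service": "auth-service"}',
    '{"event": "login_ok", "user": "maria", "service": "auth-service"}',
    '{"event": "login_failed", "user": "ana", "reason": "expired_token", "service": "auth-service"}',
    "plain text line that is not JSON",
]


def parse_lines(lines):
    """Yield a dict for each line holding a JSON object; skip everything else."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def count_by(records, field):
    """Count records by the value of one field, ignoring records that lack it."""
    return Counter(r[field] for r in records if field in r)


if __name__ == "__main__":
    records = list(parse_lines(RAW_LOGS))
    print(count_by(records, "event"))
    print(count_by(records, "reason"))
```

With unstructured text the same question ("how many logins failed, and why?") would need a hand-written regular expression per message format.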
Most modern logging pipelines use JSON or a similar structured format.
Use the correct log level for each event.
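As one illustration of log levels, Python's standard `logging` module defines five, in increasing severity: DEBUG, INFO, WARNING, ERROR, and CRITICAL. The sketch below shows how a logger's threshold decides which records are kept. The "typical use" descriptions, the logger name, the sample messages, and the `ListHandler` class are assumptions for this example, not rules from any particular tool.

```python
import logging

# Standard library levels with an assumed, typical use for each.
LEVEL_GUIDE = {
    logging.DEBUG: "diagnostic detail, usually disabled in production",
    logging.INFO: "normal business events, such as a completed order",
    logging.WARNING: "unexpected but recoverable, such as a retry",
    logging.ERROR: "an operation failed and needs attention",
    logging.CRITICAL: "the service cannot continue",
}


class ListHandler(logging.Handler):
    """Collect records in memory so the effect of the threshold is visible."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(f"{record.levelname}: {record.getMessage()}")


def demo(threshold=logging.WARNING):
    """Log one message per level and return the ones that pass the threshold."""
    logger = logging.getLogger("checkout-demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(threshold)

    logger.debug("cart contents loaded")
    logger.info("order placed")
    logger.warning("payment retry 1 of 3")
    logger.error("payment failed after retries")
    logger.critical("database unreachable")
    return handler.lines


if __name__ == "__main__":
    for line in demo():
        print(line)
```

Raising the threshold in production is also one of the simplest cost controls, since DEBUG and INFO records usually make up most of the volume.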
Incident: the Checkout API returns 502 errors.
Logs show: payment-service timeout after 5000ms, retrying request..., circuit-breaker OPEN
- Traces → Payment service slow
- Metrics → DB latency high
- Deployment → new version deployed 30 mins ago
Root cause: slow external payment gateway → retry storm → DB saturation.
Logs + metrics + traces = fast RCA.
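Correlating logs with traces depends on every service writing the same `trace_id` into its records. The sketch below groups JSON log lines by `trace_id` and orders them by time to rebuild one request's path. The sample records, the `checkout-api` service name, the second trace ID, and the helper names are hypothetical; only `payment-service`, the trace ID `93ab12f1df`, and the message texts come from the examples in this page.

```python
import json
from collections import defaultdict

# Hypothetical log lines from two services, deliberately out of order.
RAW_LOGS = [
    '{"timestamp": "2025-01-01T12:22:16Z", "service": "payment-service", "trace_id": "93ab12f1df", "level": "ERROR", "message": "circuit-breaker OPEN"}',
    '{"timestamp": "2025-01-01T12:22:10Z", "service": "checkout-api", "trace_id": "93ab12f1df", "level": "INFO", "message": "checkout started"}',
    '{"timestamp": "2025-01-01T12:22:11Z", "service": "checkout-api", "trace_id": "77c0de4410", "level": "INFO", "message": "checkout started"}',
    '{"timestamp": "2025-01-01T12:22:13Z", "service": "payment-service", "trace_id": "93ab12f1df", "level": "WARNING", "message": "timeout after 5000ms"}',
    '{"timestamp": "2025-01-01T12:22:14Z", "service": "payment-service", "trace_id": "93ab12f1df", "level": "WARNING", "message": "retrying request"}',
    '{"timestamp": "2025-01-01T12:22:17Z", "service": "checkout-api", "trace_id": "93ab12f1df", "level": "ERROR", "message": "returned 502"}',
]


def group_by_trace(lines):
    """Map each trace_id to its records, sorted by timestamp."""
    traces = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        traces[record["trace_id"]].append(record)
    for records in traces.values():
        # UTC timestamps in one fixed ISO 8601 format sort correctly as strings.
        records.sort(key=lambda r: r["timestamp"])
    return dict(traces)


def timeline(records):
    """Render one trace as readable lines for an incident review."""
    return [
        f'{r["timestamp"]} {r["service"]} {r["level"]} {r["message"]}'
        for r in records
    ]


if __name__ == "__main__":
    traces = group_by_trace(RAW_LOGS)
    for line in timeline(traces["93ab12f1df"]):
        print(line)
```

In a real pipeline the log backend does this grouping with a query on the indexed `trace_id` field; the point here is only that the field has to be present in every record.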
- What is structured logging?
- Why use JSON logs?
- What are log levels?
- How do you enrich logs with trace IDs?
- Explain why indexing affects log costs.
- What is a retention policy?
- Design a logging pipeline for a microservice architecture.
- How do you optimize logs for cost without losing insight?
- What fields should be indexed in an enterprise system?
- Build a multi-cloud logging architecture using OTel + FluentBit + ES + S3.
- Define governance rules for logging across 50+ teams.
- Explain the role of logs in SLOs and SLIs.
- What I learned today:
- My current logging anti-patterns:
- Improvements I will apply:
- My retention plan:
- Tools I want to test: