# Day 1: Monitoring vs Observability

vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki
Understand the difference between Monitoring and Observability, why observability is critical for modern systems, and how logs, metrics, and traces form the foundation.
Traditional systems were simple:

- One VM
- One app
- One DB
- One network
Monitoring only needed to cover:

- CPU
- Memory
- Disk
- Uptime
Modern systems are distributed, dynamic, and cloud-native, so they require deeper visibility.
### What is Monitoring?

Monitoring answers: “Is the system working?”
Monitoring focuses on:

- Metrics (CPU, memory, latency)
- Logs
- Alerts
- Thresholds
- Uptime checks
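The classic monitoring loop boils down to comparing sampled metrics against fixed thresholds and raising alerts. A minimal sketch (the metric names and threshold values here are illustrative assumptions, not taken from any specific tool):

```python
# Minimal threshold-based monitoring sketch (illustrative, not a real agent).
# Metric names and limits are assumptions; a real agent samples these from
# the OS or a metrics exporter.
THRESHOLDS = {"cpu_percent": 90, "memory_percent": 85, "disk_percent": 80}

def check_thresholds(metrics: dict) -> list:
    """Return an alert message for every metric above its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {name}={value} exceeds threshold {limit}")
    return alerts

# Example: CPU and disk are fine, memory is over the limit.
print(check_thresholds({"cpu_percent": 78, "memory_percent": 91, "disk_percent": 40}))
# → ['ALERT: memory_percent=91 exceeds threshold 85']
```

This is exactly why monitoring is reactive: the alert fires only once a value has already crossed the line.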
You are notified after something breaks.

Monitoring is good for:

- Basic infra health
- Detecting outages
- Triggering alerts

Monitoring’s limitations:

- Does NOT show root cause
- Cannot show the flow of requests
- Lacks end-to-end visibility
### What is Observability?

Observability answers: “WHY is the system not working?”
Observability includes:

- Distributed tracing
- Correlating logs, metrics, and traces
- Understanding request paths
- Identifying bottlenecks
- Detecting anomalies
- End-to-end dependency mapping
Observability = Monitoring + Tracing + Correlation + Context + Insights
Monitoring shows:

- CPU OK
- Memory OK
- DB online

Observability shows:

- Trace: Web → API → OrderService → SQL
- SQL query = 3.8 seconds
- Caused by a long-running report job
- Which in turn is caused by a locking issue

Without observability → you guess.
With observability → you KNOW.
### Logs

Text events, errors, and exceptions.

Examples:

- “NullReferenceException…”
- “Login failed for user…”
### Metrics

Numeric values over time.

Examples:

- CPU 78%
- P95 latency = 1.4s
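A P95 latency like the one above is just the 95th percentile of observed request durations. A minimal sketch of computing it from raw samples (real metric backends use histograms or sketches rather than sorting every sample):

```python
import math

# Nearest-rank percentile over raw request durations (seconds).
# Illustrative only; metric backends compute this from histogram buckets.
def percentile(samples: list, p: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # nearest-rank index
    return ordered[rank - 1]

latencies = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.9, 1.1, 1.4, 2.0]
print(f"P95 = {percentile(latencies, 95)}s")  # → P95 = 2.0s
```

Percentiles matter because averages hide tail latency: the mean of these samples is well under a second, yet 1 request in 20 takes 2 seconds.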
### Traces

End-to-end transaction flow.

Example:

- User → API → DB → Cache → Response

Observability = combining all three.
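Combining the three signals usually hinges on a shared trace ID: the same identifier is stamped on log lines, metric exemplars, and spans, so tooling can join them. A sketch of that join (field names and data shapes here are illustrative, not any vendor’s schema):

```python
# Sketch: join logs, metrics, and spans on a shared trace_id.
# Data shapes are illustrative assumptions; real backends differ.
logs = [
    {"trace_id": "abc123", "message": "Login failed for user 42"},
    {"trace_id": "def456", "message": "Order placed"},
]
spans = [
    {"trace_id": "abc123", "name": "POST /login", "duration_ms": 1400},
    {"trace_id": "def456", "name": "POST /orders", "duration_ms": 80},
]
metrics = [
    {"trace_id": "abc123", "name": "http.request.duration", "value": 1.4},
]

def correlate(trace_id: str) -> dict:
    """Gather everything recorded under one trace ID."""
    return {
        "logs": [l for l in logs if l["trace_id"] == trace_id],
        "spans": [s for s in spans if s["trace_id"] == trace_id],
        "metrics": [m for m in metrics if m["trace_id"] == trace_id],
    }

view = correlate("abc123")
print(view["spans"][0]["name"], "-", view["logs"][0]["message"])
```

The join is why correlation is listed as an observability capability: none of the three signals alone tells you that the slow span and the failed login belong to the same request.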
Tracing provides:

- Call flow across services
- Latency of each dependency
- Root-cause signals
- Context propagation
- Span hierarchy

Tools:

- Azure Application Insights
- OpenTelemetry
- Jaeger
- Grafana Tempo
- AWS X-Ray
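The span hierarchy these tools build can be sketched with plain objects: each child span records its parent, which is what lets a trace show per-dependency latency. This is a toy model of the data shape, not the OpenTelemetry API:

```python
import time
import uuid

# Toy span model: parent/child hierarchy plus per-span latency.
# Mimics the shape of real tracing data; not a tracing SDK.
class Span:
    def __init__(self, name: str, parent=None):
        self.name = name
        self.span_id = uuid.uuid4().hex[:8]
        self.parent_id = parent.span_id if parent else None  # root has no parent
        self.start = time.perf_counter()
        self.duration_ms = 0.0

    def end(self):
        self.duration_ms = (time.perf_counter() - self.start) * 1000

root = Span("GET /checkout")          # entry-point span
db = Span("SQL SELECT orders", root)  # child span: database call
db.end()
root.end()

print(f"{root.name} -> {db.name} (child of root: {db.parent_id == root.span_id})")
```

In a real SDK the parent is carried implicitly through context propagation, so a service never has to pass span objects around by hand.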
User Request
↓
Frontend → API → Microservices → Database → External API
↓
Logs, Metrics, Traces
↓
Collectors → Processors → Telemetry Store
↓
Visualization (Grafana / Kibana / Azure Workbooks)
↓
Alerts → Incident → Automation
(Add the diagram as a PNG or draw.io file in the GitHub repo.)
Scenario 1:

- Monitoring: API latency high
- Observability: trace reveals the delay is in PaymentGateway → the external provider is slow

Scenario 2:

- Monitoring: error count high
- Observability: trace shows cold starts; logs show a malformed event payload; metrics show retry storms

Scenario 3:

- Monitoring: CPU normal
- Observability: trace reveals a cascade failure caused by a cache timeout
Go to: Azure Portal → VM → Metrics

Add charts:

- CPU %
- Disk Queue
- Outbound/Inbound traffic
Run this KQL query in Application Insights:

`traces | take 10`
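Once data is flowing, a slightly richer query is worth trying. A hedged example against the standard Application Insights `traces` table (columns as in the default schema):

```kusto
// Last hour of traces, newest first, trimmed to the useful columns.
traces
| where timestamp > ago(1h)
| project timestamp, message, severityLevel, operation_Id
| order by timestamp desc
| take 20
```

Note `operation_Id` here: it is the correlation key that ties log rows to requests and dependencies, the same trace-ID idea discussed above.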
Connect Azure Monitor → Add CPU graph.
Write down the differences between monitoring and observability based on your own system.

Reflect and document:

“If CPU, memory, and logs all look normal, what else could be wrong?”

Hint: dependencies, latency, and tracing.
(Add this section in your GitHub wiki so readers can follow your learning journey)
### What I learned today:

### What was confusing but now clear:

### What real-world example I understood:

### Questions I still have:
- What is monitoring?
- What are metrics? Give examples.
- What are logs?
- What is alerting?
- What is uptime? How do you measure it?
- What tools are used for monitoring?
- Why do we monitor CPU and memory?
- Define observability. How is it different from monitoring?
- Explain logs, metrics, and traces.
- What are the four Golden Signals?
- What is the difference between black-box and white-box monitoring?
- How do you detect root cause in distributed systems?
- What are SLO, SLI, and SLA?
- Why are traces important?
- Design an observability stack for a microservice architecture.
- How do you correlate logs, metrics, and traces?
- Explain USE vs RED methodology.
- How do you control metric cardinality explosion?
- Why is OpenTelemetry important for observability?
- How do you break down API latency end-to-end?
- How do you prevent alert fatigue?
- Design a full observability platform for an enterprise.
- How do you enable observability for Azure Functions at scale?
- Define SLOs for an API, a DB, and a login service.
- How do you build an observability maturity model for an org?
- How do you unify monitoring for 500+ microservices?
- How do you reduce observability cost in the cloud?
- How do you design alerts that map to business KPIs?
- Your monitoring shows everything green, but users report slowness. What do you check?
- The API returns 200 OK but the page still errors out. Why?
- A Function App fails once in every batch. How do you debug it?
- You cannot reproduce a production issue — what next?
- How do you detect memory leaks in production?
- “If monitoring is good, do I need observability?”
- “Can logs replace traces?”
- “Is observability a tool or a capability?”
- “Can you create SLOs without observability?”
- “Can metrics alone tell you root cause?”
📢 Next → Day 2

👉 [Day 2 - Logs, Metrics, Traces (Deep Dive)](https://github.com/vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery/wiki/Day-2---Logs%2C-Metrics%2C-Traces)