Day 9 Monitoring Maturity Model - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 9: Distributed Tracing Deep Dive

Spans, Context Propagation, Trace Trees, Sampling, Flame Graphs


1. Overview

Distributed tracing allows you to track a single request as it flows through multiple services.
It is useful for debugging microservices, identifying latency sources, and performing root cause analysis (RCA).


2. Key Terminology

Trace

The complete journey of a single request through the system, identified by a trace ID.
Example:

Trace ID: fbc6b2ef87d17cba8c40f389e1d3411

Span

A single operation within a trace.
Each span has:

  • span_id

  • parent_span_id

  • start/end time

  • duration

  • status

  • metadata (attributes)

Span Tree (Trace Tree)

Span A (Frontend)
├── Span B (API)
│   ├── Span C (Auth)
│   └── Span D (Cart)
└── Span E (Logging)
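The tree above follows directly from the span_id and parent_span_id fields. As a minimal sketch (the `Span` class, `build_tree`, and `render` are hypothetical helpers written for this page, not part of any tracing library), a flat list of spans can be linked into a tree like this:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; real spans also carry timing, status, attributes."""
    span_id: str
    name: str
    parent_span_id: Optional[str] = None
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Attach each span to its parent and return the root (the span with no parent)."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_span_id is None:
            root = s
        else:
            by_id[s.parent_span_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, prefix: str = "", is_last: bool = True, is_root: bool = True) -> list[str]:
    """Return the tree as text lines, one span per line."""
    if is_root:
        lines = [span.name]
        child_prefix = ""
    else:
        lines = [prefix + ("└── " if is_last else "├── ") + span.name]
        child_prefix = prefix + ("    " if is_last else "│   ")
    for i, child in enumerate(span.children):
        lines += render(child, child_prefix, i == len(span.children) - 1, False)
    return lines


# Span IDs here are short placeholders; real span IDs are 16 hex characters.
spans = [
    Span("a", "Span A (Frontend)"),
    Span("b", "Span B (API)", parent_span_id="a"),
    Span("c", "Span C (Auth)", parent_span_id="b"),
    Span("d", "Span D (Cart)", parent_span_id="b"),
    Span("e", "Span E (Logging)", parent_span_id="a"),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

Tracing backends do essentially this grouping when they display a trace: spans arrive independently from each service and are joined by trace ID and parent span ID.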

Context Propagation

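Context propagation keeps a trace continuous across service boundaries: each outgoing request carries the trace ID and the caller's span ID, typically in the W3C Trace Context traceparent header.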

Example HTTP header:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
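The header value has four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and trace flags (the lowest bit means "sampled"). The sketch below parses a version 00 header and builds the header a service would send downstream. The names `TraceParent`, `parse_traceparent`, and `child_header` are hypothetical; in practice an OpenTelemetry propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, lowercase hex (W3C Trace Context, version 00 only)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # Bit 0 of the flags byte is the "sampled" flag.
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> TraceParent:
    """Parse a traceparent value; raise ValueError if it is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    tp = TraceParent(*m.groups())
    # The spec treats version ff and all-zero IDs as invalid.
    if tp.version == "ff" or tp.trace_id == "0" * 32 or tp.parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return tp


def child_header(incoming: TraceParent, new_span_id: Optional[str] = None) -> str:
    """Build the outgoing header: same trace ID and flags, this service's span ID."""
    span_id = new_span_id or secrets.token_hex(8)  # 8 random bytes = 16 hex chars
    return f"{incoming.version}-{incoming.trace_id}-{span_id}-{incoming.flags}"


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp.trace_id, tp.parent_id, tp.sampled)
print(child_header(tp))
```

The trace ID stays constant across every hop, while the span ID field is replaced at each service. That is what lets the backend reassemble the spans into one tree.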

3. Why Distributed Tracing Is Important

Benefit | Explanation
-- | --
End-to-end visibility | See every service involved in a request
Latency attribution | Identify slow components
Dependency understanding | Map how services communicate
RCA | Faster root-cause detection
Reduces MTTR | Immediate bottleneck identification

9. Hands-On Labs

Lab 1: Run Jaeger Locally

docker run -d --name jaeger \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

Open Jaeger UI:
http://localhost:16686


Lab 2: Auto-Instrument a Python Application

pip install opentelemetry-instrumentation
opentelemetry-instrument python app.py

Lab 3: Send Traces to Jaeger via OTel Collector

Collector config:

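exporters:
  jaeger:
    endpoint: jaeger:14250

Note: newer Collector releases no longer ship the jaeger exporter. With those, use the otlp exporter pointed at Jaeger's OTLP gRPC port (4317) instead.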

Lab 4: Visualize Flame Graphs

In Jaeger UI:

  • Select a service

  • View traces

  • Open “Flame Graph” and “Trace Timeline”


10. Real-World Example

Issue: Checkout API slow (4 seconds)

Trace Breakdown:

  • API Gateway: 20ms

  • Cart Service: 80ms

  • Payment Service: 2900ms

  • Database (inside Payment): 800ms

Root Cause: Slow downstream payment provider + inefficient DB query.

With tracing, the root cause was found in minutes.
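The breakdown above can be turned into a latency attribution by computing each span's self time, meaning its duration minus the time spent in its child spans. This is a minimal sketch: the durations come from the example, the only parent/child link assumed is "Database inside Payment" as stated there, and `self_times` is a hypothetical helper that assumes child spans do not overlap in time.

```python
from collections import defaultdict

# (name, duration_ms, parent) using the numbers from the breakdown above.
# Only the Database -> Payment Service link is given; the rest are treated as independent.
spans = [
    ("API Gateway", 20, None),
    ("Cart Service", 80, None),
    ("Payment Service", 2900, None),
    ("Database", 800, "Payment Service"),
]


def self_times(spans):
    """Return {span name: duration minus total duration of its direct children}."""
    child_total = defaultdict(int)
    for _name, duration, parent in spans:
        if parent is not None:
            child_total[parent] += duration
    return {name: duration - child_total[name] for name, duration, _ in spans}


# Rank spans by self time so the biggest contributors come first.
ranked = sorted(self_times(spans).items(), key=lambda kv: kv[1], reverse=True)
for name, ms in ranked:
    print(f"{name}: {ms} ms")
```

Under these assumptions the Payment Service spends 2100 ms outside its database call (consistent with waiting on the downstream provider) and the database query accounts for another 800 ms, which lines up with the two root causes listed.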


11. Interview Questions

Beginner

  • What is a trace?

  • What is a span?

  • What is a trace ID?

Intermediate

  • Why is context propagation important?

  • Explain head vs tail sampling.

  • What is a span tree?

Senior

  • How do you integrate tracing with logs and metrics?

  • How can tracing reduce MTTR?

  • How do retries show up in tracing?

Architect

  • Design a tracing pipeline using OTel Collector.

  • Role of semantic conventions in large orgs.

  • How to implement distributed tracing across 500+ microservices.


12. Learning Notes

  • Key learnings from Day 9:

  • Concepts I need to revisit:

  • Tools I will test:

  • How tracing can help my project:

  • Questions I still have: