Day 11 Metrics Deep Dive — Counters, Gauges, Histograms, Summaries, Cardinality, Aggregations, Labels - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki

📘 Day 11 — Metrics Deep Dive: Counters, Gauges, Histograms, Summaries, Labels, Aggregations & Cardinality

Metrics are the backbone of monitoring — but only when designed correctly.


🎯 Learning Objective

By the end of Day 11, you will understand:

  • The 4 core metric types (Counter, Gauge, Histogram, Summary)

  • Metric labels & cardinality (the most misunderstood topic)

  • How metrics are collected, aggregated, and queried

  • Differences between metric storage engines (Prometheus, CloudWatch, Datadog, Azure, GCP)

  • How to design metrics for dashboards & alerts

  • Common mistakes (that break production systems!)

  • Enterprise best practices

After this, you will be able to design metrics like a Senior SRE/Observability Architect.


1️⃣ What Are Metrics?

Metrics are numeric measurements over time.

Examples:

  • CPU Usage

  • Memory Usage

  • HTTP Requests/Second

  • Error Rate

  • Latency Percentiles

  • Queue Depth

  • Cache Hit Ratio

  • Database Connection Count

Metrics are:

  • Lightweight

  • Fast

  • Cheap

  • Perfect for alerting

  • Perfect for dashboards

  • Perfect for detecting trends


2️⃣ The 4 Metric Types (Critical for All Observability Designers)


🔹 1. Counter (Monotonic Increasing Value)

A counter only goes up, never down.

Used for:

  • Number of requests

  • Number of errors

  • Bytes received

  • Bytes sent

  • Cache misses

  • Queue messages processed

Example (PromQL):

rate(http_requests_total[1m])

Counters reset only when the process restarts.
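Since a counter only increases, dashboards almost never chart the raw value; they chart its rate of increase, which is what `rate()` computes. A minimal pure-Python sketch of those semantics (illustrative class and function names, NOT the real `prometheus_client` API):

```python
# Minimal sketch of counter semantics (illustrative names,
# NOT the real prometheus_client API).
class SimpleCounter:
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

def per_second_rate(old_sample, new_sample, window_seconds):
    """Roughly what PromQL's rate() computes: increase over the window."""
    return (new_sample - old_sample) / window_seconds

c = SimpleCounter()
for _ in range(300):
    c.inc()                             # 300 requests in a 60s window
print(per_second_rate(0, c.value, 60))  # → 5.0 requests/second
```

The real `rate()` also handles counter resets after restarts, which this sketch ignores.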


🔹 2. Gauge (Value That Goes Up or Down)

A gauge represents a value at a point in time.

Used for:

  • CPU usage

  • Memory usage

  • Active sessions

  • Queue depth

  • Temperature

  • Thread count

Example:

node_memory_Active_bytes

Gauges change dynamically.
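Unlike a counter, a gauge supports set, increment, and decrement. A tiny sketch of the semantics (hypothetical class, not the real client library):

```python
# Sketch of gauge semantics: a value that moves in both directions
# (hypothetical class, not the real client library).
class SimpleGauge:
    def __init__(self):
        self.value = 0

    def set(self, v):
        self.value = v

    def inc(self, n=1):
        self.value += n

    def dec(self, n=1):
        self.value -= n

active_sessions = SimpleGauge()
active_sessions.inc()   # user logs in
active_sessions.inc()   # another user logs in
active_sessions.dec()   # one user logs out
print(active_sessions.value)  # → 1
```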


🔹 3. Histogram (Distribution of Values)

MOST IMPORTANT metric type for latency.

Histograms store:

  • Buckets (ranges)

  • Count

  • Sum

Used for:

  • Latency measurements

  • API duration

  • DB query duration

  • Request size

  • CPU load distribution

Example:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

This is how you calculate p95 and p99 latency.
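Under the hood, `histogram_quantile()` finds the bucket where the target rank falls and linearly interpolates inside it. A rough pure-Python sketch of that estimation (simplified, and the bucket data below is made up):

```python
# Sketch of how a quantile is estimated from cumulative histogram buckets.
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within the bucket, as Prometheus does.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical latencies: 90 requests <= 0.1s, 98 <= 0.3s, all 100 <= 1.5s
buckets = [(0.1, 90), (0.3, 98), (1.5, 100)]
print(histogram_quantile(0.95, buckets))  # p95 lands inside the 0.1-0.3s bucket
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the percentile falls into.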


🔹 4. Summary (Client-Side Percentile Tracking)

Summaries calculate:

  • Percentiles

  • Count

  • Sum

But they:

  • Cannot be aggregated across services or instances

  • Are a poor fit for microservices

  • Are rarely used in large-scale distributed systems

Histograms > Summaries
(Especially in microservice architectures)
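The aggregation problem is easy to demonstrate: the p95 of combined traffic is not the average of per-instance p95s, so pre-computed summary percentiles from two instances cannot be merged. A small sketch with made-up latency samples:

```python
import math

# Why summary percentiles can't be aggregated: the p95 of combined
# traffic is NOT the average of per-instance p95s.
def p95(samples):
    s = sorted(samples)
    idx = math.ceil(0.95 * len(s)) - 1  # nearest-rank percentile (simplified)
    return s[idx]

instance_a = [0.1] * 90 + [2.0] * 10   # 10% slow requests
instance_b = [0.1] * 100               # all fast

avg_of_p95s = (p95(instance_a) + p95(instance_b)) / 2
true_p95 = p95(instance_a + instance_b)
# Averaging per-instance p95s (1.05s) wildly overstates the
# p95 of the combined traffic (0.1s).
print(avg_of_p95s, true_p95)
```

Histogram buckets, by contrast, are plain counters that sum correctly across instances, which is why `histogram_quantile()` works on aggregated data.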


3️⃣ Labels (Dimensions) — POWERFUL but Dangerous

Labels add dimensions to metrics.

Example metric without labels:

http_requests_total 1000

With labels:

http_requests_total{method="GET", status="200", region="us-east"} 500
http_requests_total{method="POST", status="500", region="us-east"} 30

Labels let you slice and filter metrics for dashboards & alerts.

BUT labels increase cardinality — the number of unique time series (one per label combination).


4️⃣ Cardinality — The #1 Cause of Monitoring Outages

Cardinality =
# of metrics × # of label combinations

Example:

Label | Values
-- | --
status | 200, 400, 500
method | GET, POST
region | us-east, us-west

→ 3 × 2 × 2 = 12 unique time series for this one metric.
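The multiplication is easy to verify: every unique label combination becomes its own time series. A quick sketch using the label values from the example above:

```python
from itertools import product

# Every unique label combination becomes its own time series.
statuses = ["200", "400", "500"]
methods = ["GET", "POST"]
regions = ["us-east", "us-west"]

series = [
    f'http_requests_total{{status="{s}",method="{m}",region="{r}"}}'
    for s, m, r in product(statuses, methods, regions)
]
print(len(series))  # → 12 (3 x 2 x 2)
```

Now imagine adding a `user_id` label with 1 million values: 12 series becomes 12 million. That is a cardinality explosion.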

5️⃣ Aggregations — Common PromQL Queries

p95 latency:

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Error rate:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

CPU utilization:

100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

6️⃣ Metric Storage Engines (Vendor-Neutral)


✔ Prometheus

  • Fast

  • Lightweight

  • Pull-based

  • Used with exporters (Node Exporter, Kube-State-Metrics)


✔ CloudWatch

  • Fully managed

  • Cardinality shows up as cost rather than errors (each unique dimension combination is billed as a separate custom metric)

  • Cost is based on metrics ingested and API calls


✔ Azure Monitor Metrics

  • Near real-time

  • Good time-series database performance


✔ Datadog

  • High cardinality support

  • Expensive but powerful


✔ GCP Cloud Monitoring

  • Widely used in Kubernetes GKE environments


✔ ClickHouse / VictoriaMetrics / Thanos / Mimir

  • Used for large-scale metric storage

  • Very cost-efficient


7️⃣ Common Metric Anti-Patterns (Avoid These!)

❌ High cardinality labels (user_id, request_id)
❌ Creating per-request metrics
❌ Using Summary metrics in microservices
❌ Exposing too many metrics
❌ Scraping huge endpoints
❌ Logging metrics instead of exporting metrics
❌ No aggregation (raw metrics everywhere)


8️⃣ Best Practices (Enterprise Standards)

✔ Use counters for all operations

✔ Use gauges for resource usage

✔ Use histograms for latency tracking

✔ Avoid summaries unless necessary

✔ Avoid high-cardinality labels

✔ Keep metrics lightweight

✔ Add consistent labels:

  • environment

  • region

  • service_name

  • version

✔ Export metrics via OpenTelemetry whenever possible

✔ Build dashboards around Golden Signals:

  • Latency

  • Traffic

  • Errors

  • Saturation


9️⃣ Hands-On Labs (Day 11)


🔧 Lab 1 — Create Histogram Metrics (Python)

from prometheus_client import Histogram

H = Histogram('request_latency_seconds', 'Latency',
              buckets=[0.1, 0.3, 1.5, 5])

def process_request():
    with H.time():
        pass  # request handling goes here

🔧 Lab 2 — Export Metrics to Prometheus from a Flask App

from prometheus_client import Counter, generate_latest
from flask import Flask

app = Flask(__name__)
c = Counter('requests_total', 'Total requests')

@app.route('/')
def home():
    c.inc()
    return "OK"

@app.route('/metrics')
def metrics():
    return generate_latest(), 200

🔧 Lab 3 — Query p95 latency in Prometheus

histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m]))

🔧 Lab 4 — Detect Cardinality Issue

If a metric shows:

> 1 million series

→ You MUST fix labels immediately.
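One quick, low-tech way to spot a noisy metric is to count series per metric name in a scrape payload. A sketch (the payload below is made up; in practice you would query the TSDB instead, e.g. `topk(10, count by (__name__)({__name__=~".+"}))` or Prometheus's TSDB status page):

```python
from collections import Counter

# Hypothetical scrape payload in Prometheus text exposition format.
payload = """\
# HELP http_requests_total Total requests
http_requests_total{method="GET",status="200"} 512
http_requests_total{method="POST",status="500"} 3
request_latency_seconds_bucket{le="0.1"} 90
request_latency_seconds_bucket{le="0.3"} 98
"""

series_per_metric = Counter()
for line in payload.splitlines():
    if line.startswith("#") or not line.strip():
        continue  # skip HELP/TYPE comments and blank lines
    name = line.split("{")[0].split()[0]  # metric name precedes labels
    series_per_metric[name] += 1

print(series_per_metric.most_common())  # metrics ranked by series count
```

A metric name that dominates this ranking is usually the one carrying a high-cardinality label.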


🔟 Real Example (How Metrics Saved the Company)

Scenario:

Checkout latency increased from 200ms → 2s.

Metrics showed:

  • p99 latency huge

  • retries increasing

  • DB connection pool saturation

  • payment-service CPU at 95%

Root cause:
Connection leak in payment-service.

Without metrics:

  • RCA takes 2–3 hours

With metrics:

  • RCA takes 10 minutes


1️⃣1️⃣ Interview Questions (Day 11)


Beginner

  • What are the four types of metrics?

  • What is a counter vs a gauge?

  • What is cardinality?


Intermediate

  • Why are histograms used for latency?

  • Why are summaries NOT recommended in microservices?

  • How do labels increase cardinality?


Senior

  • Design metrics for an API service.

  • How do you detect metric cardinality explosion?

  • Explain rate() vs irate() in PromQL.


Architect

  • Build a scalable metric architecture for 1000 microservices.

  • Choose Prometheus vs Datadog vs CloudWatch for metrics.

  • Define label governance for enterprise observability.


📝 Your Learning Notes

  • Key metric types I understood:

  • What cardinality issues I must fix:

  • Which metric type I want to experiment with:

  • My next steps for implementing metrics:

  • Questions I still have: