Day 11 Metrics Deep Dive — Counters, Gauges, Histograms, Summaries, Cardinality, Aggregations, Labels - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki
# 📘 Day 11 — Metrics Deep Dive: Counters, Gauges, Histograms, Summaries, Labels, Aggregations & Cardinality
By the end of Day 11, you will understand:

- The 4 core metric types (Counter, Gauge, Histogram, Summary)
- Metric labels & cardinality (the most misunderstood topic)
- How metrics are collected, aggregated, and queried
- Differences between metric storage engines (Prometheus, CloudWatch, Datadog, Azure, GCP)
- How to design metrics for dashboards & alerts
- Common mistakes (that break production systems!)
- Enterprise best practices
After this, you will be able to design metrics like a Senior SRE/Observability Architect.
## What Are Metrics?

Metrics are numeric measurements recorded over time.

Examples:

- CPU usage
- Memory usage
- HTTP requests/second
- Error rate
- Latency percentiles
- Queue depth
- Cache hit ratio
- Database connection count

Metrics are:

- Lightweight
- Fast
- Cheap
- Perfect for alerting
- Perfect for dashboards
- Perfect for detecting trends
## Counter

A counter is a cumulative value that only goes up, never down.

Used for:

- Number of requests
- Number of errors
- Bytes received
- Bytes sent
- Cache misses
- Queue messages processed

Example (PromQL):

```
rate(http_requests_total[1m])
```

Counters reset to zero only when the process restarts.
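As a rough sketch of what `rate()` computes conceptually (the numbers here are made up for illustration, not Prometheus' actual implementation):

```python
# Sketch: rate() turns a monotonic counter into a per-second rate
# between two scrapes (illustrative numbers).
t0, v0 = 0, 1000     # scrape at t=0s: counter reads 1000 requests
t1, v1 = 60, 1600    # scrape at t=60s: counter reads 1600 requests

requests_per_second = (v1 - v0) / (t1 - t0)
print(requests_per_second)  # 10.0
```

This is why you almost always query a counter through `rate()` rather than reading its raw value, which only grows.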
## Gauge

A gauge represents a value at a point in time; it can go up or down.

Used for:

- CPU usage
- Memory usage
- Active sessions
- Queue depth
- Temperature
- Thread count

Example:

```
node_memory_Active_bytes
```

Gauges change dynamically.
## Histogram

The MOST IMPORTANT metric type for latency.

Histograms store:

- Buckets (cumulative value ranges)
- Count (total number of observations)
- Sum (total of all observed values)

Used for:

- Latency measurements
- API duration
- DB query duration
- Request size
- CPU load distribution

Example:

```
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

This is how you calculate p95 and p99 latency.
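To see why histograms can answer percentile queries, here is a plain-Python sketch of how Prometheus-style cumulative buckets record observations (the bucket boundaries and latencies are illustrative):

```python
import math

# Cumulative buckets: every observation increments ALL buckets whose
# upper bound ("le") is >= the observed value.
buckets = [0.1, 0.3, 1.5, 5.0, math.inf]
counts = [0] * len(buckets)
total, observations = 0.0, 0

def observe(value):
    global total, observations
    total += value
    observations += 1
    for i, le in enumerate(buckets):
        if value <= le:
            counts[i] += 1

for latency in [0.05, 0.2, 0.2, 0.4, 2.0]:
    observe(latency)

print(counts)  # [1, 3, 4, 5, 5] -- cumulative counts per bucket
```

Because buckets are simple counters, they can be summed across instances, which is exactly what `histogram_quantile()` relies on.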
## Summary

Summaries calculate, on the client side:

- Percentiles
- Count
- Sum

But they:

- Cannot be aggregated across services or instances
- Are less suitable for microservices
- Are rarely used in large-scale distributed systems

Rule of thumb: Histograms > Summaries (especially in microservice architectures).
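Why can't summaries be aggregated? A small sketch with made-up latencies shows that averaging per-instance percentiles does not give the fleet-wide percentile:

```python
import math

def p99(samples):
    # Simple nearest-rank p99 (illustrative, not Prometheus' algorithm)
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

instance_a = list(range(100))  # latencies 0..99 ms
instance_b = [0] * 100         # all requests fast

avg_of_p99s = (p99(instance_a) + p99(instance_b)) / 2
true_p99 = p99(instance_a + instance_b)

print(avg_of_p99s, true_p99)  # 49.0 vs 97 -- very different answers
```

A summary exports only the pre-computed per-instance percentiles, so the true fleet-wide p99 is unrecoverable; a histogram exports the raw bucket counts, which aggregate correctly.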
## Labels & Cardinality

Labels add dimensions to metrics.

Example metric without labels:

```
http_requests_total 1000
```

With labels:

```
http_requests_total{method="GET", status="200", region="us-east"} 500
http_requests_total{method="POST", status="500", region="us-east"} 30
```

Labels let you slice and filter metrics for dashboards & alerts.

BUT labels increase cardinality: the number of unique time series a metric produces.
Cardinality = number of metric names × number of unique label-value combinations
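For example, a back-of-the-envelope sketch (the label counts here are made up):

```python
# Each unique label-value combination produces its own time series.
methods = 5        # GET, POST, PUT, DELETE, PATCH
status_codes = 10  # distinct statuses actually emitted
regions = 4
endpoints = 50

series = methods * status_codes * regions * endpoints
print(series)  # 10000 series -- from a SINGLE metric name
```

Adding one more label with even a handful of values multiplies this total again, which is why cardinality grows explosively.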
## Common PromQL Aggregations

```
# p95 request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 5xx error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
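As a sanity check, the 5xx error-ratio query above boils down to simple division (example numbers):

```python
# sum(rate(http_requests_total{status=~"5.."}[5m]))  -> errors per second
# sum(rate(http_requests_total[5m]))                 -> total per second
errors_per_second = 3.0
total_per_second = 150.0

error_ratio = errors_per_second / total_per_second
print(error_ratio)  # 0.02 -> a 2% error rate
```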
## Metric Storage Engines

### Prometheus

- Fast
- Lightweight
- Pull-based
- Used with exporters (Node Exporter, Kube-State-Metrics)

### AWS CloudWatch

- Fully managed
- No metric cardinality errors
- Cost based on API ingestion
- Near real-time

### Datadog

- Good time-series database performance
- High cardinality support
- Expensive but powerful

### GCP Cloud Monitoring

- Widely used in Kubernetes GKE environments
- Used for large-scale metric storage
- Very cost-efficient
## Common Mistakes

❌ High-cardinality labels (user_id, request_id)
❌ Creating per-request metrics
❌ Using Summary metrics in microservices
❌ Exposing too many metrics
❌ Scraping huge endpoints
❌ Logging metrics instead of exporting metrics
❌ No aggregation (raw metrics everywhere)
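To make the first mistake concrete, here is a sketch of how a per-request label inflates the series count (a plain dict stands in for the TSDB's series index):

```python
# Each unique (metric, labels) pair becomes its own stored series.
series = {}

def inc(metric, **labels):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + 1

# BAD: labeling by request_id creates one series per request
for request_id in range(1000):
    inc('http_requests_total', request_id=str(request_id))

print(len(series))  # 1000 series for 1000 requests -- unbounded growth
```

A bounded label like `method` would have capped this at a handful of series no matter how much traffic arrives.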
## Best Practices

Standard labels to apply consistently:

- environment
- region
- service_name
- version

Track the four golden signals:

- Latency
- Traffic
- Errors
- Saturation
Example: instrumenting request latency with a Histogram (Python):

```python
from prometheus_client import Histogram

H = Histogram('request_latency_seconds', 'Latency',
              buckets=[0.1, 0.3, 1.5, 5])

def process_request():
    # H.time() is a context manager that observes the elapsed time
    with H.time():
        pass  # handle the request here
```
Example: exposing a Counter from a Flask app:

```python
from prometheus_client import Counter, generate_latest
from flask import Flask

app = Flask(__name__)
c = Counter('requests_total', 'Total requests')

@app.route('/')
def home():
    c.inc()
    return "OK"

@app.route('/metrics')
def metrics():
    return generate_latest(), 200
```
Query the resulting latency histogram with:

```
histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m]))
```
## Cardinality Red Flag

If a single metric shows more than 1 million series, you MUST fix its labels immediately.
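One way to spot an explosion (assuming you run Prometheus) is to rank metric names by active series count; this is a common troubleshooting query, shown here as a sketch:

```
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))
```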
## Real-World Example

Checkout latency increased from 200ms → 2s.

Metrics showed:

- p99 latency spiking
- Retries increasing
- DB connection pool saturation
- payment-service CPU at 95%

Root cause: a connection leak in payment-service.

Without metrics, RCA takes 2–3 hours. With metrics, it takes 10 minutes.
## Interview Questions

- What are the four types of metrics?
- What is a counter vs a gauge?
- What is cardinality?
- Why are histograms used for latency?
- Why are summaries NOT recommended in microservices?
- How do labels increase cardinality?
- Design metrics for an API service.
- How do you detect metric cardinality explosion?
- Explain rate() vs irate() in PromQL.
- Build a scalable metric architecture for 1000 microservices.
- Choose Prometheus vs Datadog vs CloudWatch for metrics.
- Define label governance for enterprise observability.
## Reflection

- Key metric types I understood:
- What cardinality issues I must fix:
- Which metric type I want to experiment with:
- My next steps for implementing metrics:
- Questions I still have: