Day 11 Metrics Deep Dive — Counters, Gauges, Histograms, Summaries, Cardinality, Aggregations, Labels - vinoji2005/GitHub-Repository-Structure-90-Days-Observability-Mastery GitHub Wiki
# 📘 Day 11 — Metrics Deep Dive: Counters, Gauges, Histograms, Summaries, Labels, Aggregations & Cardinality
By the end of Day 11, you will understand:

- The 4 core metric types (Counter, Gauge, Histogram, Summary)
- Metric labels & cardinality (the most misunderstood topic)
- How metrics are collected, aggregated, and queried
- Differences between metric storage engines (Prometheus, CloudWatch, Datadog, Azure, GCP)
- How to design metrics for dashboards & alerts
- Common mistakes (that break production systems!)
- Enterprise best practices
After this, you will be able to design metrics like a Senior SRE/Observability Architect.
## What Are Metrics?

Metrics are numeric measurements recorded over time.

Examples:

- CPU usage
- Memory usage
- HTTP requests/second
- Error rate
- Latency percentiles
- Queue depth
- Cache hit ratio
- Database connection count

Metrics are:

- Lightweight
- Fast
- Cheap
- Perfect for alerting
- Perfect for dashboards
- Perfect for detecting trends
## Counter

A counter is a cumulative value that only goes up, never down.

Used for:

- Number of requests
- Number of errors
- Bytes received
- Bytes sent
- Cache misses
- Queue messages processed

Example (PromQL):

```
rate(http_requests_total[1m])
```

Counters reset to zero only when the process restarts.
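As a rough sketch of what `rate()` computes conceptually (the numbers here are made up for illustration, not Prometheus' actual implementation):

```python
# Sketch: rate() turns a monotonic counter into a per-second rate
# between two scrapes (illustrative numbers).
t0, v0 = 0, 1000     # scrape at t=0s: counter reads 1000 requests
t1, v1 = 60, 1600    # scrape at t=60s: counter reads 1600 requests

requests_per_second = (v1 - v0) / (t1 - t0)
print(requests_per_second)  # 10.0
```

This is why you almost always query a counter through `rate()` rather than reading its raw value, which only grows.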
## Gauge

A gauge represents a value at a point in time; it can go up or down.

Used for:

- CPU usage
- Memory usage
- Active sessions
- Queue depth
- Temperature
- Thread count

Example:

```
node_memory_Active_bytes
```

Gauges change dynamically.
## Histogram

The MOST IMPORTANT metric type for latency.

Histograms store:

- Buckets (cumulative value ranges)
- Count (total number of observations)
- Sum (total of all observed values)

Used for:

- Latency measurements
- API duration
- DB query duration
- Request size
- CPU load distribution

Example:

```
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```

This is how you calculate p95 and p99 latency.
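To see why histograms can answer percentile queries, here is a plain-Python sketch of how Prometheus-style cumulative buckets record observations (the bucket boundaries and latencies are illustrative):

```python
import math

# Cumulative buckets: every observation increments ALL buckets whose
# upper bound ("le") is >= the observed value.
buckets = [0.1, 0.3, 1.5, 5.0, math.inf]
counts = [0] * len(buckets)
total, observations = 0.0, 0

def observe(value):
    global total, observations
    total += value
    observations += 1
    for i, le in enumerate(buckets):
        if value <= le:
            counts[i] += 1

for latency in [0.05, 0.2, 0.2, 0.4, 2.0]:
    observe(latency)

print(counts)  # [1, 3, 4, 5, 5] -- cumulative counts per bucket
```

Because buckets are simple counters, they can be summed across instances, which is exactly what `histogram_quantile()` relies on.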
## Summary

Summaries calculate, on the client side:

- Percentiles
- Count
- Sum

But they:

- Cannot be aggregated across services or instances
- Are less suitable for microservices
- Are rarely used in large-scale distributed systems

Rule of thumb: Histograms > Summaries (especially in microservice architectures).
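Why can't summaries be aggregated? A small sketch with made-up latencies shows that averaging per-instance percentiles does not give the fleet-wide percentile:

```python
import math

def p99(samples):
    # Simple nearest-rank p99 (illustrative, not Prometheus' algorithm)
    s = sorted(samples)
    return s[math.ceil(0.99 * len(s)) - 1]

instance_a = list(range(100))  # latencies 0..99 ms
instance_b = [0] * 100         # all requests fast

avg_of_p99s = (p99(instance_a) + p99(instance_b)) / 2
true_p99 = p99(instance_a + instance_b)

print(avg_of_p99s, true_p99)  # 49.0 vs 97 -- very different answers
```

A summary exports only the pre-computed per-instance percentiles, so the true fleet-wide p99 is unrecoverable; a histogram exports the raw bucket counts, which aggregate correctly.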
## Labels & Cardinality

Labels add dimensions to metrics.

Example metric without labels:

```
http_requests_total 1000
```

With labels:

```
http_requests_total{method="GET", status="200", region="us-east"} 500
http_requests_total{method="POST", status="500", region="us-east"} 30
```

Labels let you slice and filter metrics for dashboards & alerts.

BUT labels increase cardinality: the number of unique time series a metric produces.
Cardinality = number of metric names × number of unique label-value combinations
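For example, a back-of-the-envelope sketch (the label counts here are made up):

```python
# Each unique label-value combination produces its own time series.
methods = 5        # GET, POST, PUT, DELETE, PATCH
status_codes = 10  # distinct statuses actually emitted
regions = 4
endpoints = 50

series = methods * status_codes * regions * endpoints
print(series)  # 10000 series -- from a SINGLE metric name
```

Adding one more label with even a handful of values multiplies this total again, which is why cardinality grows explosively.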
## Common PromQL Aggregations

```
# p95 request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# 5xx error ratio
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
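As a sanity check, the 5xx error-ratio query above boils down to simple division (example numbers):

```python
# sum(rate(http_requests_total{status=~"5.."}[5m]))  -> errors per second
# sum(rate(http_requests_total[5m]))                 -> total per second
errors_per_second = 3.0
total_per_second = 150.0

error_ratio = errors_per_second / total_per_second
print(error_ratio)  # 0.02 -> a 2% error rate
```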
## Metric Storage Engines

### Prometheus

- Fast
- Lightweight
- Pull-based
- Used with exporters (Node Exporter, Kube-State-Metrics)

### AWS CloudWatch

- Fully managed
- No metric cardinality errors
- Cost based on API ingestion
- Near real-time

### Datadog

- Good time-series database performance
- High cardinality support
- Expensive but powerful

### GCP Cloud Monitoring

- Widely used in Kubernetes GKE environments
- Used for large-scale metric storage
- Very cost-efficient
## Common Mistakes

❌ High-cardinality labels (user_id, request_id)
❌ Creating per-request metrics
❌ Using Summary metrics in microservices
❌ Exposing too many metrics
❌ Scraping huge endpoints
❌ Logging metrics instead of exporting metrics
❌ No aggregation (raw metrics everywhere)
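To make the first mistake concrete, here is a sketch of how a per-request label inflates the series count (a plain dict stands in for the TSDB's series index):

```python
# Each unique (metric, labels) pair becomes its own stored series.
series = {}

def inc(metric, **labels):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + 1

# BAD: labeling by request_id creates one series per request
for request_id in range(1000):
    inc('http_requests_total', request_id=str(request_id))

print(len(series))  # 1000 series for 1000 requests -- unbounded growth
```

A bounded label like `method` would have capped this at a handful of series no matter how much traffic arrives.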
## Best Practices

Standard labels to apply consistently:

- environment
- region
- service_name
- version

Track the four golden signals:

- Latency
- Traffic
- Errors
- Saturation
Example: instrumenting request latency with a Histogram (Python):

```python
from prometheus_client import Histogram

H = Histogram('request_latency_seconds', 'Latency',
              buckets=[0.1, 0.3, 1.5, 5])

def process_request():
    # H.time() is a context manager that observes the elapsed time
    with H.time():
        pass  # handle the request here
```
Example: exposing a Counter from a Flask app:

```python
from prometheus_client import Counter, generate_latest
from flask import Flask

app = Flask(__name__)
c = Counter('requests_total', 'Total requests')

@app.route('/')
def home():
    c.inc()
    return "OK"

@app.route('/metrics')
def metrics():
    return generate_latest(), 200
```
Query the resulting latency histogram with:

```
histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m]))
```
## Cardinality Red Flag

If a single metric shows more than 1 million series, you MUST fix its labels immediately.
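One way to spot an explosion (assuming you run Prometheus) is to rank metric names by active series count; this is a common troubleshooting query, shown here as a sketch:

```
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))
```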
## Real-World Example

Checkout latency increased from 200ms → 2s.

Metrics showed:

- p99 latency spiking
- Retries increasing
- DB connection pool saturation
- payment-service CPU at 95%

Root cause: a connection leak in payment-service.

Without metrics, RCA takes 2–3 hours. With metrics, it takes 10 minutes.
## Interview Questions

- What are the four types of metrics?
- What is a counter vs a gauge?
- What is cardinality?
- Why are histograms used for latency?
- Why are summaries NOT recommended in microservices?
- How do labels increase cardinality?
- Design metrics for an API service.
- How do you detect metric cardinality explosion?
- Explain rate() vs irate() in PromQL.
- Build a scalable metric architecture for 1000 microservices.
- Choose Prometheus vs Datadog vs CloudWatch for metrics.
- Define label governance for enterprise observability.
## Reflection

- Key metric types I understood:
- What cardinality issues I must fix:
- Which metric type I want to experiment with:
- My next steps for implementing metrics:
- Questions I still have: