Additional Metrics - clue2-sose25/Clue2 GitHub Wiki
Additional Metrics
This page documents additional sustainability and performance metrics that we are considering adding to CLUE’s benchmarking results. It covers:
- Prometheus-native metrics not yet scraped
- Kubernetes/cluster metrics from kube-state, kubelet, scheduler
- Research-inspired metrics (carbon, PUE, tail latency…)
- Recommendations on integration effort and value
1. Prometheus-Native Metrics
CLUE already ingests basic CPU/memory and energy readings. The following additional metrics come “for free” from the existing cAdvisor, kubelet and node-exporter scrape targets, so no new exporters are needed. Every metric listed here is a cAdvisor container_* series:
Metric | Purpose | Category | Effort |
---|---|---|---|
container_cpu_cfs_throttled_seconds_total | Cumulative time containers were throttled by the CFS quota; indicates CPU pressure and waste. | Performance wastage | Easy |
container_memory_failures_total | Count of memory allocation failures (page faults, labelled by failure_type and scope). OOM kills are counted separately in container_oom_events_total. | Reliability | Easy |
container_network_receive_bytes_total, container_network_transmit_bytes_total | Total network I/O per container; correlates with NIC power draw and latency. | I/O performance | Easy |
container_network_receive_packets_dropped_total, container_network_transmit_packets_dropped_total | Dropped packet counts; drops trigger retries that waste energy. | Quality | Easy |
container_fs_reads_bytes_total, container_fs_writes_bytes_total | Disk I/O volume per container; impacts storage energy use. | I/O performance | Easy |
container_fs_io_time_seconds_total | Time spent in block I/O; reveals storage busy-time overhead. | I/O performance | Easy |
container_cpu_user_seconds_total, container_cpu_system_seconds_total | Breakdown of CPU time in user vs kernel space; helps isolate OS overhead. | Performance | Easy |
container_memory_working_set_bytes | Working-set memory (usage minus inactive file cache); a closer indicator of memory pressure than raw usage. | Performance | Easy |
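
Most of the series above are cumulative counters, so they only become useful once converted to a rate over the benchmark window. The following is a minimal sketch of that conversion, assuming samples arrive as (timestamp, value) pairs. The function names and sample readings are hypothetical, and the logic is a simplification of what PromQL's rate() does (it ignores extrapolation at the window edges).

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase of a monotonic counter, tolerating resets to zero."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (e.g. container restart),
        # so the whole current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase over the sampled window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return counter_increase(samples) / elapsed


# Hypothetical readings of container_cpu_cfs_throttled_seconds_total,
# including one counter reset between t=30 and t=60.
throttled = [(0.0, 10.0), (30.0, 13.0), (60.0, 1.5), (90.0, 4.5)]
throttled_seconds_per_second = counter_rate(throttled)
```

For a throttling counter, the result reads as the average number of seconds throttled per wall-clock second over the window.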
2. Kubernetes & Cluster Metrics
Enable kube-state-metrics, kubelet and scheduler endpoints to gather these higher-level insights:
Metric | Purpose | Source | Effort |
---|---|---|---|
kube_pod_container_status_restarts_total | Total restarts per container; reliability overhead and wasted restart energy. | kube-state-metrics | Easy |
kube_horizontalpodautoscaler_status_current_replicas, kube_horizontalpodautoscaler_status_desired_replicas | HPA scaling behavior; correlates with scaling waste (RE). | kube-state-metrics | Easy |
kube_pod_container_resource_requests_cpu_cores, kube_pod_container_resource_requests_memory_bytes | Configured resource requests; refines RU and RO calculations. kube-state-metrics v2 and later expose these as kube_pod_container_resource_requests with a resource label, and limits as kube_pod_container_resource_limits. | kube-state-metrics | Easy |
kube_node_status_allocatable_cpu_cores, kube_node_status_allocatable_memory_bytes | Node allocatable capacity vs actual use; cluster-wide utilization. kube-state-metrics v2 and later expose these as kube_node_status_allocatable with a resource label. | kube-state-metrics / Metrics API | Easy |
scheduler_pod_scheduling_duration_seconds_bucket | Pod scheduling latency histogram; measures orchestration delay. | kube-scheduler metrics endpoint | Moderate |
kubelet_volume_stats_used_bytes | Volume usage per PersistentVolumeClaim; storage efficiency and I/O overhead. | Kubelet metrics endpoint | Moderate |
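
Histogram series such as scheduler_pod_scheduling_duration_seconds_bucket only expose cumulative bucket counts, so a latency quantile has to be estimated from them. The sketch below shows one way to do that with linear interpolation inside the matching bucket, which is the same basic idea as PromQL's histogram_quantile(). The function name mirrors the PromQL one for readability only, and the bucket values are made up for illustration.

```python
import math
from typing import Sequence, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: Sequence[Bucket]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or not math.isinf(ordered[-1][0]):
        raise ValueError("buckets must end with an le=+Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the open-ended bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up scheduling-latency buckets (seconds, cumulative pod count).
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95_scheduling_latency = histogram_quantile(0.95, example_buckets)
```

The estimate is only as precise as the bucket boundaries, so a quantile that lands in a wide bucket should be treated as approximate.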
3. Research-Informed Metrics
Drawing on recent energy-efficiency and sustainability studies, consider these additions:
Metric | Formula | Description | Effort |
---|---|---|---|
CO₂ emissions per request | CO2e/request (g) = (Energy per request Wh × Carbon intensity gCO2e/kWh) ÷ 1000 | Translates energy use into grams of CO₂ emitted per request. Requires a regional carbon-intensity data feed. | Moderate – needs external API/data |
Power Usage Effectiveness (PUE) | PUE = Total facility power ÷ IT equipment power | Data-center efficiency metric capturing cooling/PDU overhead. | High – needs facility-level power data |
Performance-per-Watt (RPS/W) | RPS/W = (Requests/sec) ÷ (Avg. total power W) | Shows how many requests per second are served per watt of power. | Easy – derived from existing metrics |
Idle Power Ratio | IdleRatio = Idle host power W ÷ Active host power W | Quantifies energy proportionality by comparing idle vs active host draw. | Easy – uses existing Scaphandre/Tapo data |
Tail Latency (p99, p999) | p99 = 99th-percentile latency; p999 = 99.9th-percentile latency | Captures near-worst-case request latencies for SLA compliance. | Easy – extend Locust/PromQL queries |
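
The formulas in this table can be sketched as a few small helpers. This is an illustration under stated assumptions, not CLUE's implementation. All function and parameter names are hypothetical. Carbon intensity is assumed to be given in gCO2e/kWh (the unit most grid-intensity feeds use), energy is assumed to be the total for the run divided by the request count, and percentiles use the nearest-rank method on raw latency samples.

```python
import math
from typing import Sequence


def co2_grams_per_request(energy_wh: float, intensity_g_per_kwh: float, requests: int) -> float:
    """Grams of CO2e per request: total energy in Wh, grid intensity in gCO2e/kWh."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    # Wh -> kWh, then kWh * g/kWh = g, spread over all requests.
    return (energy_wh / 1000.0) * intensity_g_per_kwh / requests


def requests_per_watt(requests_per_second: float, avg_power_w: float) -> float:
    """RPS/W: throughput divided by average total power draw."""
    if avg_power_w <= 0:
        raise ValueError("avg_power_w must be positive")
    return requests_per_second / avg_power_w


def idle_power_ratio(idle_power_w: float, active_power_w: float) -> float:
    """IdleRatio: idle host power divided by active host power."""
    if active_power_w <= 0:
        raise ValueError("active_power_w must be positive")
    return idle_power_w / active_power_w


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of raw samples, with p in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 < p <= 100.0:
        raise ValueError("p must be within (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100.0)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 1, 2, ..., 1000.
latencies_ms = [float(i) for i in range(1, 1001)]
p99 = percentile(latencies_ms, 99.0)
p999 = percentile(latencies_ms, 99.9)
```

A p999 value is only meaningful with enough samples. With fewer than 1000 requests, it is simply the maximum observed latency.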
4. Integration Recommendations
- Immediate-win metrics (Easy):
  - All cAdvisor-exported counters (throttling, memory failures, I/O stats).
  - kube-state-metrics fields (restarts, HPA replicas, resource requests).
- Moderate-effort metrics:
  - Scheduler and Kubelet endpoints (scheduling latency, volume stats).
  - Node-exporter collectors for CPU freq, C-states, temperature.
- High-value, external metrics:
  - CO₂ per request (via carbon intensity API).
  - PUE (via facility power feeds).
- Derived efficiency scores:
  - RPS/W, Idle Ratio, CPU-util/Watt: computed in a notebook from existing series.
  - Tail-latency percentiles (p99, p999) in Locust or PromQL.
Last updated: May 2025