Additional Metrics - clue2-sose25/Clue2 GitHub Wiki

Additional Metrics

This page documents new sustainability and performance metrics we are considering adding to CLUE’s benchmarking results. We cover:

- Prometheus-native metrics not yet scraped
- Kubernetes/cluster metrics from kube-state-metrics, the kubelet, and the scheduler
- Research-inspired metrics (carbon, PUE, tail latency…)
- Recommendations on integration effort and value

1. Prometheus-Native Metrics

CLUE already ingests basic CPU/memory and energy readings. The following additional metrics come “for free” from cAdvisor, kubelet and node-exporter—no new exporters needed:

Metric	Purpose	Category	Effort
`container_cpu_cfs_throttled_seconds_total`	Cumulative time containers were throttled by the CFS quota; indicates CPU pressure and waste.	Performance wastage	Easy
`container_memory_failures_total`	Cumulative count of memory allocation failures, reported by cAdvisor as page faults (`failure_type` label: `pgfault`, `pgmajfault`). OOM kills are counted separately in `container_oom_events_total`.	Reliability	Easy
`container_network_receive_bytes_total` / `container_network_transmit_bytes_total`	Total network I/O per container; correlates with NIC power draw and latency.	I/O performance	Easy
`container_network_receive_packets_dropped_total` / `container_network_transmit_packets_dropped_total`	Dropped packet counts; dropped packets cause retries and wasted energy.	Quality	Easy
`container_fs_reads_bytes_total` / `container_fs_writes_bytes_total`	Disk I/O volume per container; impacts storage energy use.	I/O performance	Easy
`container_fs_io_time_seconds_total`	Time spent in block I/O; reveals storage busy-time overhead.	I/O performance	Easy
`container_cpu_user_seconds_total` / `container_cpu_system_seconds_total`	Breakdown of CPU time in user vs kernel space; helps isolate OS overhead.	Performance	Easy
`container_memory_working_set_bytes`	Working-set memory (usage minus inactive file cache); the figure the kubelet uses for eviction decisions, so it reflects real memory pressure.	Performance	Easy
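
Most of the series above are cumulative counters, so a benchmark run needs the increase over its time window rather than the raw value. The following is a minimal Python sketch of that calculation, assuming samples have already been fetched as `(timestamp, value)` pairs; the helper names and the sample readings are hypothetical and not part of CLUE.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, cumulative counter value)
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase of a cumulative counter, tolerating resets to zero."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. container restart),
        # so the new value is counted as an increase from zero.
        total += (curr - prev) if curr >= prev else curr
    return total


def throttled_fraction(samples: Sequence[Sample]) -> float:
    """Throttled seconds per elapsed second over the sampled window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return counter_increase(samples) / elapsed


# Hypothetical readings of container_cpu_cfs_throttled_seconds_total,
# scraped every 15 s, with one counter reset between t=30 and t=45.
readings = [(0.0, 10.0), (15.0, 11.5), (30.0, 13.0), (45.0, 0.5), (60.0, 2.0)]

if __name__ == "__main__":
    print(f"throttled seconds in window: {counter_increase(readings):.2f}")
    print(f"throttled fraction: {throttled_fraction(readings):.3f}")
```

The same `counter_increase` helper applies to the byte, packet, and I/O-time counters in the table; only `container_memory_working_set_bytes` is a gauge and should be averaged or sampled instead.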

2. Kubernetes & Cluster Metrics

Enable kube-state-metrics, kubelet and scheduler endpoints to gather these higher-level insights:

Metric	Purpose	Source	Effort
`kube_pod_container_status_restarts_total`	Total restarts per container; reliability overhead and wasted restart energy.	kube-state-metrics	Easy
`kube_horizontalpodautoscaler_status_current_replicas` / `kube_horizontalpodautoscaler_status_desired_replicas`	HPA scaling behavior; correlates with scaling waste (RE).	kube-state-metrics	Easy
`kube_pod_container_resource_requests_cpu_cores` / `kube_pod_container_resource_requests_memory_bytes`	Configured requests (limits come from the matching `kube_pod_container_resource_limits` series); refines RU and RO calculations. kube-state-metrics v2 exposes these as `kube_pod_container_resource_requests` with a `resource` label.	kube-state-metrics	Easy
`kube_node_status_allocatable_cpu_cores` / `kube_node_status_allocatable_memory_bytes`	Node allocatable capacity, compared against actual use for cluster-wide utilization. kube-state-metrics v2 exposes these as `kube_node_status_allocatable` with a `resource` label.	kube-state-metrics / Metrics API	Easy
`scheduler_pod_scheduling_duration_seconds_bucket`	Pod scheduling latency histogram; measures orchestration delay.	kube-scheduler metrics endpoint	Moderate
`kubelet_volume_stats_used_bytes`	Volume usage per PersistentVolumeClaim; storage efficiency and I/O overhead.	Kubelet metrics endpoint	Moderate
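
The scheduling-latency series is a Prometheus histogram, so it arrives as cumulative bucket counts rather than individual durations. As an illustration of how a quantile can be estimated from such buckets, here is a small Python sketch that interpolates linearly inside the target bucket, similar in spirit to PromQL's `histogram_quantile`. The function name and the bucket values are made up for the example.

```python
import math
from typing import List, Tuple

# (upper bound "le" in seconds, cumulative observation count)
Bucket = Tuple[float, float]


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("no buckets given")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # The +Inf bucket has no width; fall back to the
                # highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Linear interpolation inside the bucket.
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical bucket counts for scheduler_pod_scheduling_duration_seconds_bucket.
example_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]

if __name__ == "__main__":
    for q in (0.5, 0.95, 0.999):
        print(f"q={q}: {bucket_quantile(q, example_buckets):.3f} s")
```

The estimate is only as fine-grained as the bucket boundaries, which is worth keeping in mind when comparing scheduling delay across runs.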

3. Research-Informed Metrics

Drawing on recent energy-efficiency and sustainability studies, consider these additions:

Metric	Formula	Description	Effort
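CO₂ emissions per request	`CO2e/request [g] = (Energy per request [Wh] × Carbon intensity [gCO2e/kWh]) ÷ 1000`	Translates energy use into grams of CO₂-equivalent emitted per request. The division by 1000 converts Wh to kWh. Requires a regional carbon-intensity data feed.	Moderate – needs external API/data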
Power Usage Effectiveness (PUE)	`PUE = Total facility power ÷ IT equipment power`	Data-center efficiency metric capturing cooling/PDU overhead.	High – needs facility-level power data
Performance-per-Watt (RPS/W)	`RPS/W = (Requests/sec) ÷ (Avg. total power W)`	Shows how many requests are served per watt of power.	Easy – derived from existing metrics
Idle Power Ratio	`IdleRatio = Idle host power W ÷ Active host power W`	Quantifies energy proportionality by comparing idle vs active host draw.	Easy – uses existing Scaphandre/Tapo data
Tail Latency (p99, p999)	`p99 = 99th-percentile latency`; `p999 = 99.9th-percentile latency`	Captures worst-case request latencies for SLA compliance.	Easy – extend Locust/PromQL queries
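
To make the formulas above concrete, the following Python sketch computes each derived metric from plain numbers. It assumes carbon intensity is given in gCO2e/kWh and uses the nearest-rank method for percentiles; all function names and input values are illustrative and not taken from CLUE's codebase.

```python
import math
from typing import Sequence


def co2_grams_per_request(energy_wh: float, requests: int,
                          intensity_g_per_kwh: float) -> float:
    """Grams of CO2e per request; dividing by 1000 converts Wh to kWh."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return (energy_wh / requests) * intensity_g_per_kwh / 1000.0


def rps_per_watt(requests: int, duration_s: float, avg_power_w: float) -> float:
    """Requests per second served per watt of average total power."""
    return (requests / duration_s) / avg_power_w


def idle_ratio(idle_power_w: float, active_power_w: float) -> float:
    """Idle host power relative to active host power (lower is more proportional)."""
    return idle_power_w / active_power_w


def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99 and pct=99.9 for p999."""
    if not latencies_ms:
        raise ValueError("no latencies given")
    ordered = sorted(latencies_ms)
    # The small epsilon guards against floating-point error pushing an
    # exact rank up to the next integer.
    rank = max(1, math.ceil(pct * len(ordered) / 100.0 - 1e-9))
    return ordered[min(rank, len(ordered)) - 1]


if __name__ == "__main__":
    # Hypothetical run: 100,000 requests in 10 minutes using 50 Wh,
    # on a grid with 400 gCO2e/kWh.
    print(co2_grams_per_request(50.0, 100_000, 400.0))
    print(rps_per_watt(100_000, 600.0, 300.0))
    print(idle_ratio(120.0, 300.0))
    latencies = [float(ms) for ms in range(1, 1001)]
    print(percentile(latencies, 99), percentile(latencies, 99.9))
```

PUE is left out because it depends on facility-level power readings that are not available from the cluster itself.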

4. Integration Recommendations

Immediate-win metrics (Easy):
- All cAdvisor-exported counters (throttling, memory failures, I/O stats).
- kube-state-metrics fields (restarts, HPA replicas, resource requests).
Moderate-effort metrics:
- Scheduler and Kubelet endpoints (scheduling latency, volume stats).
- Node-exporter collectors for CPU freq, C-states, temperature.
High-value, external metrics:
- CO₂ per request (via carbon intensity API).
- PUE (via facility power feeds).
Derived efficiency scores:
- RPS/W, Idle Ratio, CPU-util/Watt: computed in a notebook from existing series.
- Tail-latency percentiles (p99, p999) in Locust or PromQL.

Last updated: May 2025