Additional Metrics - clue2-sose25/Clue2 GitHub Wiki

Additional Metrics

This page documents additional sustainability and performance metrics that we are considering adding to CLUE’s benchmarking results. It covers:

  1. Prometheus-native metrics not yet scraped
  2. Kubernetes/cluster metrics from kube-state-metrics, the kubelet, and the scheduler
  3. Research-inspired metrics (carbon, PUE, tail latency…)
  4. Recommendations on integration effort and value

1. Prometheus-Native Metrics

CLUE already ingests basic CPU/memory and energy readings. The following additional metrics come “for free” from cAdvisor, the kubelet, and node-exporter, so no new exporters are needed:

| Metric | Purpose | Category | Effort |
| --- | --- | --- | --- |
| container_cpu_cfs_throttled_seconds_total | Cumulative time containers were throttled by the CFS quota; indicates CPU pressure and waste. | Performance wastage | Easy |
| container_memory_failures_total | Cumulative count of memory allocation failures inside containers, labeled by failure type and scope. | Reliability | Easy |
| container_network_receive_bytes_total, container_network_transmit_bytes_total | Total network I/O per container; correlates with NIC power draw and latency. | I/O performance | Easy |
| container_network_receive_packets_dropped_total, container_network_transmit_packets_dropped_total | Dropped packet counts; dropped packets cause retries and wasted energy. | Quality | Easy |
| container_fs_reads_bytes_total, container_fs_writes_bytes_total | Disk I/O volume per container; impacts storage energy use. | I/O performance | Easy |
| container_fs_io_time_seconds_total | Time spent in block I/O; reveals storage busy-time overhead. | I/O performance | Easy |
| container_cpu_user_seconds_total, container_cpu_system_seconds_total | Breakdown of CPU time in user vs. kernel space; helps isolate OS overhead. | Performance | Easy |
| container_memory_working_set_bytes | Working-set memory use (excludes inactive file cache); shows true memory pressure. | Performance | Easy |
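As a minimal sketch of how one of these counters could be turned into a per-pod rate, the snippet below builds a PromQL `rate()` expression and parses a response shaped like the Prometheus HTTP API instant-query result (`/api/v1/query`). The helper names, the 5-minute window, the `pod` label, the Prometheus address, and the sample payload are illustrative assumptions, not part of CLUE.

```python
import json
from urllib.parse import urlencode


def rate_query(metric: str, window: str = "5m") -> str:
    """Build a PromQL expression: per-second rate of a counter, summed per pod."""
    return f"sum by (pod) (rate({metric}[{window}]))"


def query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: str) -> dict[str, float]:
    """Map pod name -> sample value from an instant-vector response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return {
        sample["metric"].get("pod", ""): float(sample["value"][1])
        for sample in body["data"]["result"]
    }


# Hypothetical response body, shaped like a Prometheus instant-vector result.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "frontend-0"}, "value": [1715000000, "0.25"]},
            {"metric": {"pod": "backend-0"}, "value": [1715000000, "0.0"]},
        ],
    },
})

promql = rate_query("container_cpu_cfs_throttled_seconds_total")
url = query_url("http://prometheus:9090", promql)  # assumed address
throttled = parse_vector(SAMPLE)  # seconds throttled per second, per pod
```

In a real run the payload would come from an HTTP GET on `url`; the parsing step stays the same for any of the counters listed above.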

2. Kubernetes & Cluster Metrics

Enable scraping of kube-state-metrics and of the kubelet and scheduler metrics endpoints to gather these higher-level insights:

| Metric | Purpose | Source | Effort |
| --- | --- | --- | --- |
| kube_pod_container_status_restarts_total | Total restarts per container; reflects reliability overhead and energy wasted on restarts. | kube-state-metrics | Easy |
| kube_horizontalpodautoscaler_status_current_replicas, kube_horizontalpodautoscaler_status_desired_replicas | HPA scaling behavior; correlates with scaling waste (RE). | kube-state-metrics | Easy |
| kube_pod_container_resource_requests_cpu_cores, kube_pod_container_resource_requests_memory_bytes | Configured requests & limits; refines RU and RO calculations. | kube-state-metrics | Easy |
| kube_node_status_allocatable_cpu_cores, kube_node_status_allocatable_memory_bytes | Allocatable node capacity vs. actual use; gives cluster-wide utilization. | kube-state-metrics / Metrics API | Easy |
| scheduler_pod_scheduling_duration_seconds_bucket | Pod scheduling latency histogram; measures orchestration delay. | kube-scheduler’s Prometheus endpoint | Moderate |
| kubelet_volume_stats_used_bytes | Volume usage per PersistentVolumeClaim; indicates storage efficiency and I/O overhead. | Kubelet metrics endpoint | Moderate |
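A histogram such as scheduler_pod_scheduling_duration_seconds_bucket only exposes cumulative bucket counts, so latency quantiles have to be estimated. The sketch below shows the linear-interpolation idea behind PromQL's `histogram_quantile()` in simplified form; the function name and the bucket values are hypothetical, and the code is not a drop-in replacement for the PromQL function.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include the +Inf bucket, whose count is the total.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies in the overflow bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts: (le upper bound in seconds, count).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p95 = histogram_quantile(0.95, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
```

The estimate is only as precise as the bucket boundaries: any quantile that lands in the +Inf bucket is capped at the largest finite bound.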

3. Research-Informed Metrics

Drawing on recent energy-efficiency and sustainability studies, consider these additions:

| Metric | Formula | Description | Effort |
| --- | --- | --- | --- |
| CO₂ emissions per request | CO₂e/request (g) = (Energy per request in Wh × carbon intensity in gCO₂e/kWh) ÷ 1000 | Translates energy use into grams of CO₂ emitted per request. Requires a regional carbon data feed. | Moderate (needs external API/data) |
| Power Usage Effectiveness (PUE) | PUE = Total facility power ÷ IT equipment power | Data-center efficiency metric capturing cooling/PDU overhead. | High (needs facility-level power data) |
| Performance-per-Watt (RPS/W) | RPS/W = (Requests/sec) ÷ (Avg. total power in W) | Shows how many requests per second are served per watt of power. | Easy (derived from existing metrics) |
| Idle Power Ratio | IdleRatio = Idle host power (W) ÷ Active host power (W) | Quantifies energy proportionality by comparing idle vs. active host draw. | Easy (uses existing Scaphandre/Tapo data) |
| Tail Latency (p99, p999) | p99 = 99th-percentile latency; p999 = 99.9th-percentile latency | Captures worst-case request latencies for SLA compliance. | Easy (extend Locust/PromQL queries) |
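The derived metrics can be computed from a few per-run aggregates. The following is a minimal sketch, assuming carbon intensity is given in gCO₂e/kWh and that average power under load stands in for "active host power". The `RunSummary` structure, its field names, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregates for one benchmark run (hypothetical structure)."""
    requests: int         # total requests served
    duration_s: float     # wall-clock duration of the run, in seconds
    energy_wh: float      # total energy drawn during the run, in Wh
    idle_power_w: float   # host power measured without load, in W

    @property
    def avg_power_w(self) -> float:
        # Wh -> J (x 3600), then J / s = W
        return self.energy_wh * 3600.0 / self.duration_s


def rps_per_watt(run: RunSummary) -> float:
    """Requests per second served per watt of average power."""
    return (run.requests / run.duration_s) / run.avg_power_w


def idle_ratio(run: RunSummary) -> float:
    """Idle host power divided by average power under load."""
    return run.idle_power_w / run.avg_power_w


def co2_g_per_request(run: RunSummary, intensity_g_per_kwh: float) -> float:
    """Grams of CO2e per request; dividing by 1000 converts Wh to kWh."""
    energy_per_request_wh = run.energy_wh / run.requests
    return energy_per_request_wh * intensity_g_per_kwh / 1000.0


# Made-up example: 36,000 requests in 10 minutes using 50 Wh, idle draw 120 W.
run = RunSummary(requests=36_000, duration_s=600.0, energy_wh=50.0, idle_power_w=120.0)
efficiency = rps_per_watt(run)
proportionality = idle_ratio(run)
carbon = co2_g_per_request(run, intensity_g_per_kwh=400.0)
```

An idle ratio close to 1 means the host draws almost as much power idle as under load, i.e. poor energy proportionality.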

4. Integration Recommendations

  1. Immediate-win metrics (Easy):

    • All cAdvisor-exported counters (throttling, memory failures, I/O stats).
    • kube-state-metrics fields (restarts, HPA replicas, resource requests).
  2. Moderate-effort metrics:

    • Scheduler and Kubelet endpoints (scheduling latency, volume stats).
    • Node-exporter collectors for CPU freq, C-states, temperature.
  3. High-value, external metrics:

    • CO₂ per request (via carbon intensity API).
    • PUE (via facility power feeds).
  4. Derived efficiency scores:

    • RPS/W, Idle Ratio, CPU-util/Watt: computed in a notebook from existing series.
    • Tail-latency percentiles (p99, p999) in Locust or PromQL.
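When raw per-request latencies are available (for example, exported from a load-test run), the tail percentiles can also be computed directly in the notebook. The sketch below uses the nearest-rank definition, which is one of several common percentile conventions; the function name and the sample data are hypothetical, and Locust or PromQL may use a different estimation method.

```python
import math
from fractions import Fraction


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Exact rational arithmetic avoids float rounding pushing the rank up by one.
    rank = math.ceil(Fraction(str(p)) * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 1000.
latencies_ms = list(range(1, 1001))
p99 = percentile(latencies_ms, 99)
p999 = percentile(latencies_ms, 99.9)
```

With fewer than 1000 samples, p999 collapses to the maximum, so very short runs cannot distinguish it from the worst observed request.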

Last updated: May 2025