Performance Metrics

The Antimetal System Agent collects comprehensive performance metrics from Linux systems to enable effective diagnosis of latency and throughput issues. This document details the available metrics, their diagnostic value, and how they map to the USE methodology for performance analysis.

USE Methodology Overview

The USE methodology, developed by Brendan Gregg, provides a systematic approach to performance analysis by examining three key metric categories for every system resource:

  • Utilization - How busy a resource is (percentage of time busy)
  • Saturation - How much work is queued waiting for the resource
  • Errors - Count of error events that occurred

This methodology helps identify bottlenecks quickly and avoid common performance analysis pitfalls.
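
For illustration, each resource's metrics can be grouped into one utilization/saturation/errors sample. The sketch below is a minimal, hypothetical Go representation; the struct and field names are not the agent's actual data model:

```go
package main

import "fmt"

// USESample is a hypothetical grouping of one resource's metrics into the
// three USE categories; the agent's real data model may differ.
type USESample struct {
	Resource    string  // e.g. "cpu", "memory", "disk:sda"
	Utilization float64 // fraction of time the resource was busy (0.0-1.0)
	Saturation  float64 // queued work, e.g. run-queue length or I/O queue depth
	Errors      uint64  // error events since the previous sample
}

func main() {
	s := USESample{Resource: "cpu", Utilization: 0.72, Saturation: 3.1, Errors: 0}
	fmt.Printf("%s util=%.0f%% sat=%.1f err=%d\n",
		s.Resource, s.Utilization*100, s.Saturation, s.Errors)
}
```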

System Resources and USE Metrics

CPU Resources

Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| cpu.user | /proc/stat | Utilization | Application workload intensity |
| cpu.system | /proc/stat | Utilization | Kernel/syscall overhead |
| cpu.iowait | /proc/stat | Utilization | Storage I/O blocking CPU |
| cpu.steal | /proc/stat | Utilization | Hypervisor stealing CPU (cloud) |
| cpu.per_core.* | /proc/stat | Utilization | Individual core utilization patterns |
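
The raw counters behind these metrics are the cumulative per-category jiffies on the first line of /proc/stat; utilization is the per-category delta between two samples divided by the total delta. A minimal Go sketch (function names are illustrative, not the agent's collectors):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUTicks returns the aggregate "cpu" line from /proc/stat as
// user, nice, system, idle, iowait, irq, softirq, steal, ... (in jiffies).
func readCPUTicks() []uint64 {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return nil
	}
	fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])[1:]
	ticks := make([]uint64, len(fields))
	for i, f := range fields {
		ticks[i], _ = strconv.ParseUint(f, 10, 64)
	}
	return ticks
}

func main() {
	a := readCPUTicks()
	time.Sleep(time.Second)
	b := readCPUTicks()

	var total uint64
	delta := make([]uint64, len(a))
	for i := range a {
		delta[i] = b[i] - a[i]
		total += delta[i]
	}
	// Indices follow the /proc/stat column order:
	// 0 user, 2 system, 4 iowait, 7 steal.
	fmt.Printf("user=%.1f%% system=%.1f%% iowait=%.1f%% steal=%.1f%%\n",
		100*float64(delta[0])/float64(total),
		100*float64(delta[2])/float64(total),
		100*float64(delta[4])/float64(total),
		100*float64(delta[7])/float64(total),
	)
}
```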

Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| load.avg_1min | /proc/loadavg | Saturation | Short-term CPU pressure |
| load.avg_5min | /proc/loadavg | Saturation | Medium-term CPU pressure |
| load.avg_15min | /proc/loadavg | Saturation | Long-term CPU pressure |
| load.running_procs | /proc/loadavg | Saturation | Currently executing processes |
| load.total_procs | /proc/loadavg | Saturation | Total processes in system |
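
These values map directly onto the fields of /proc/loadavg. A quick sketch of reading them and putting them next to the core count for saturation checks (names are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		panic(err)
	}
	// /proc/loadavg looks like "0.52 0.58 0.59 2/613 12345":
	// 1-, 5-, 15-minute load averages, running/total tasks, last PID.
	f := strings.Fields(string(data))
	fmt.Printf("load1=%s load5=%s load15=%s tasks=%s cores=%d\n",
		f[0], f[1], f[2], f[3], runtime.NumCPU())
}
```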

Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| kernel.errors | /dev/kmsg | Errors | CPU-related kernel errors |
| cpu.thermal_throttling | /sys/devices | Errors | CPU frequency reduction due to heat |

CPU Diagnostic Patterns:

  • High Utilization + Low Saturation = CPU-bound workload, consider scaling
  • Low Utilization + High Saturation = I/O or lock contention
  • High IOWait = Storage bottleneck affecting CPU efficiency
  • High Steal Time = Noisy neighbor in cloud environment

Memory Resources

Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| memory.used_percent | /proc/meminfo | Utilization | Overall memory pressure |
| memory.available | /proc/meminfo | Utilization | Memory available to applications |
| memory.buffers | /proc/meminfo | Utilization | File system buffer cache usage |
| memory.cached | /proc/meminfo | Utilization | Page cache utilization |
| memory.anon_pages | /proc/meminfo | Utilization | Anonymous (heap/stack) memory |
| memory.shmem | /proc/meminfo | Utilization | Shared memory usage |
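
All of these fields come from /proc/meminfo, which reports values in kB. A minimal parsing sketch, assuming used_percent is derived as (MemTotal - MemAvailable) / MemTotal; the agent's exact derivation may differ:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// meminfo parses /proc/meminfo into a map of field name -> value in kB.
func meminfo() (map[string]uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	m := make(map[string]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like "MemAvailable:   12345678 kB".
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		if val, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			m[key] = val
		}
	}
	return m, sc.Err()
}

func main() {
	m, err := meminfo()
	if err != nil {
		panic(err)
	}
	total, avail := m["MemTotal"], m["MemAvailable"]
	fmt.Printf("used_percent=%.1f cached_kb=%d dirty_kb=%d\n",
		100*float64(total-avail)/float64(total), m["Cached"], m["Dirty"])
}
```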

Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| memory.swap_used | /proc/meminfo | Saturation | Memory pressure forcing swap |
| memory.swap_in_rate | /proc/vmstat | Saturation | Pages swapped in per second |
| memory.swap_out_rate | /proc/vmstat | Saturation | Pages swapped out per second |
| memory.dirty_pages | /proc/meminfo | Saturation | Pages waiting to be written |
| memory.writeback_pages | /proc/meminfo | Saturation | Pages currently being written |

Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| memory.oom_kills | /proc/vmstat | Errors | Out-of-memory killer invocations |
| memory.allocation_failures | /proc/vmstat | Errors | Failed memory allocations |
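
The swap and OOM metrics above are cumulative counters in /proc/vmstat (pswpin, pswpout, and, on kernels 4.13 and later, oom_kill); per-second rates are deltas between two samples. A rough sketch, assuming those counter names:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// vmstat reads the named counters from /proc/vmstat.
func vmstat(keys ...string) map[string]uint64 {
	want := make(map[string]bool, len(keys))
	for _, k := range keys {
		want[k] = true
	}
	out := make(map[string]uint64)
	f, err := os.Open("/proc/vmstat")
	if err != nil {
		return out
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && want[fields[0]] {
			out[fields[0]], _ = strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return out
}

func main() {
	// pswpin/pswpout are cumulative page counts; sample twice to get a rate.
	// oom_kill is only present on kernels >= 4.13.
	a := vmstat("pswpin", "pswpout", "oom_kill")
	time.Sleep(time.Second)
	b := vmstat("pswpin", "pswpout", "oom_kill")
	fmt.Printf("swap_in_rate=%d/s swap_out_rate=%d/s oom_kills_total=%d\n",
		b["pswpin"]-a["pswpin"], b["pswpout"]-a["pswpout"], b["oom_kill"])
}
```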

Memory Diagnostic Patterns:

  • High Utilization + No Swapping = Efficient memory usage
  • Active Swapping = Memory pressure, consider adding RAM
  • High Dirty Pages = Write I/O bottleneck
  • OOM Kills = Memory limits exceeded, investigate memory leaks

Storage Resources

Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| disk.util_percent | /proc/diskstats | Utilization | Device busy percentage |
| disk.read_bytes_rate | /proc/diskstats | Utilization | Read throughput |
| disk.write_bytes_rate | /proc/diskstats | Utilization | Write throughput |
| disk.read_ops_rate | /proc/diskstats | Utilization | Read IOPS |
| disk.write_ops_rate | /proc/diskstats | Utilization | Write IOPS |

Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| disk.avg_queue_size | /proc/diskstats | Saturation | Average I/O queue depth |
| disk.await_time_ms | /proc/diskstats | Saturation | Average I/O wait time |
| disk.read_await_ms | /proc/diskstats | Saturation | Read operation latency |
| disk.write_await_ms | /proc/diskstats | Saturation | Write operation latency |
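
Utilization and await are both derived from the cumulative counters in /proc/diskstats: busy time over the sampling interval gives util_percent, and total I/O time divided by completed I/Os gives await. A sketch of that derivation, using a hypothetical device name "sda":

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// diskstats returns the raw counter fields for one block device from
// /proc/diskstats (the columns after major, minor, and device name).
func diskstats(dev string) []uint64 {
	f, err := os.Open("/proc/diskstats")
	if err != nil {
		return nil
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 14 && fields[2] == dev {
			vals := make([]uint64, len(fields)-3)
			for i, s := range fields[3:] {
				vals[i], _ = strconv.ParseUint(s, 10, 64)
			}
			return vals
		}
	}
	return nil
}

func main() {
	const dev = "sda" // hypothetical device name; adjust for your system
	interval := time.Second
	a := diskstats(dev)
	time.Sleep(interval)
	b := diskstats(dev)
	if a == nil || b == nil {
		fmt.Println("device not found")
		return
	}
	// Field offsets (after major/minor/name): 0 reads completed, 3 ms reading,
	// 4 writes completed, 7 ms writing, 9 ms spent doing I/O.
	ios := float64((b[0] - a[0]) + (b[4] - a[4]))
	ioMs := float64((b[3] - a[3]) + (b[7] - a[7]))
	busyMs := float64(b[9] - a[9])

	util := 100 * busyMs / float64(interval.Milliseconds())
	await := 0.0
	if ios > 0 {
		await = ioMs / ios
	}
	fmt.Printf("%s util=%.1f%% await=%.2fms\n", dev, util, await)
}
```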

Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| disk.read_errors | /proc/diskstats | Errors | Failed read operations |
| disk.write_errors | /proc/diskstats | Errors | Failed write operations |
| disk.io_errors | /proc/diskstats | Errors | General I/O errors |

Storage Diagnostic Patterns:

  • High Utilization + Low Queue Depth = Sequential I/O pattern
  • High Queue Depth + High Latency = Storage device saturation
  • High IOPS + Small Transfer Size = Random I/O pattern
  • I/O Errors = Hardware issues or filesystem corruption

Network Resources

Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| network.rx_bytes_rate | /proc/net/dev | Utilization | Receive bandwidth usage |
| network.tx_bytes_rate | /proc/net/dev | Utilization | Transmit bandwidth usage |
| network.rx_packets_rate | /proc/net/dev | Utilization | Receive packet rate |
| network.tx_packets_rate | /proc/net/dev | Utilization | Transmit packet rate |

Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| network.rx_dropped | /proc/net/dev | Saturation | Receive buffer overflows |
| network.tx_dropped | /proc/net/dev | Saturation | Transmit buffer overflows |
| tcp.retrans_rate | /proc/net/snmp | Saturation | TCP retransmission rate |
| tcp.listen_drops | /proc/net/netstat | Saturation | Dropped connection attempts |

Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
| --- | --- | --- | --- |
| network.rx_errors | /proc/net/dev | Errors | Receive errors (CRC, frame) |
| network.tx_errors | /proc/net/dev | Errors | Transmit errors |
| network.collisions | /proc/net/dev | Errors | Ethernet collisions |
| tcp.failed_conns | /proc/net/netstat | Errors | Failed TCP connections |
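
The byte, packet, drop, and error counters all come from per-interface columns in /proc/net/dev; rates are again deltas between two reads. A minimal sketch (column offsets follow the standard /proc/net/dev layout; the TCP counters from /proc/net/snmp and /proc/net/netstat are parsed separately and not shown here):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// netdev returns the per-interface counter columns from /proc/net/dev.
func netdev() map[string][]uint64 {
	out := make(map[string][]uint64)
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		return out
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		i := strings.Index(line, ":")
		if i < 0 {
			continue // skip the two header lines
		}
		iface := strings.TrimSpace(line[:i])
		cols := strings.Fields(line[i+1:])
		vals := make([]uint64, len(cols))
		for j, c := range cols {
			vals[j], _ = strconv.ParseUint(c, 10, 64)
		}
		out[iface] = vals
	}
	return out
}

func main() {
	a := netdev()
	time.Sleep(time.Second)
	b := netdev()
	for iface, cur := range b {
		prev, ok := a[iface]
		if !ok || len(cur) < 12 || len(prev) < 12 {
			continue
		}
		// Offsets: 0 rx_bytes, 2 rx_errs, 3 rx_drop, 8 tx_bytes, 10 tx_errs, 11 tx_drop.
		fmt.Printf("%s rx=%dB/s tx=%dB/s rx_drop=%d tx_drop=%d\n",
			iface, cur[0]-prev[0], cur[8]-prev[8], cur[3]-prev[3], cur[11]-prev[11])
	}
}
```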

Network Diagnostic Patterns:

  • High Bandwidth + No Drops = Efficient network usage
  • Packet Drops = Buffer exhaustion or network congestion
  • High Retransmissions = Network quality issues
  • Connection Errors = Service unavailability or firewall issues

Advanced Performance Metrics

NUMA Topology Metrics

| Metric | Source | Diagnostic Value |
| --- | --- | --- |
| numa.node_memory_usage | /sys/devices/system/node | Memory locality efficiency |
| numa.node_cpu_usage | /sys/devices/system/node | CPU locality patterns |
| numa.memory_migrations | /proc/vmstat | Cross-node memory access cost |
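
Per-node locality counters such as numa_hit and numa_miss are exposed under /sys/devices/system/node/node&lt;N&gt;/numastat. A rough reading sketch (the printed fields are chosen for illustration and may not match the agent's metric names):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func main() {
	// Each NUMA node exposes counters under
	// /sys/devices/system/node/node<N>/numastat (numa_hit, numa_miss, ...).
	paths, err := filepath.Glob("/sys/devices/system/node/node*/numastat")
	if err != nil || len(paths) == 0 {
		fmt.Println("no NUMA nodes found")
		return
	}
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		stats := make(map[string]uint64)
		for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
			f := strings.Fields(line)
			if len(f) == 2 {
				stats[f[0]], _ = strconv.ParseUint(f[1], 10, 64)
			}
		}
		node := filepath.Base(filepath.Dir(path))
		fmt.Printf("%s hit=%d miss=%d local=%d other=%d\n",
			node, stats["numa_hit"], stats["numa_miss"],
			stats["local_node"], stats["other_node"])
	}
}
```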

Process-Level Metrics

| Metric | Source | Diagnostic Value |
| --- | --- | --- |
| process.cpu_percent | /proc/[pid]/stat | Per-process CPU usage |
| process.memory_rss | /proc/[pid]/status | Process memory footprint |
| process.io_read_bytes | /proc/[pid]/io | Process I/O patterns |
| process.voluntary_switches | /proc/[pid]/status | Process cooperation level |
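
The RSS and context-switch values can be read from /proc/[pid]/status, while CPU and I/O figures come from /proc/[pid]/stat and /proc/[pid]/io. A sketch that samples the current process (the helper name is illustrative):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// procStatus returns selected fields from /proc/<pid>/status as raw strings.
func procStatus(pid int, keys ...string) map[string]string {
	want := make(map[string]bool)
	for _, k := range keys {
		want[k] = true
	}
	out := make(map[string]string)
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return out
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		k, v, ok := strings.Cut(sc.Text(), ":")
		if ok && want[k] {
			out[k] = strings.TrimSpace(v)
		}
	}
	return out
}

func main() {
	pid := os.Getpid() // sample the current process; any PID works
	s := procStatus(pid, "VmRSS", "voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
	fmt.Printf("pid=%d rss=%s voluntary=%s nonvoluntary=%s\n",
		pid, s["VmRSS"], s["voluntary_ctxt_switches"], s["nonvoluntary_ctxt_switches"])
}
```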

eBPF-Enhanced Metrics

| Metric | Source | Diagnostic Value |
| --- | --- | --- |
| execsnoop.exec_rate | eBPF program | Process creation overhead |
| execsnoop.short_lived_procs | eBPF program | Fork/exec thrashing detection |

Diagnostic Workflows

Latency Issues

1. High Application Latency

Check: cpu.iowait, disk.await_time_ms, memory.swap_in_rate
Cause: I/O or memory pressure affecting response time
Action: Optimize I/O patterns or add memory/storage capacity

2. Network Latency

Check: tcp.retrans_rate, network.rx_dropped, network.tx_dropped
Cause: Network congestion or quality issues  
Action: Investigate network path, adjust buffer sizes

3. Lock Contention

Check: load.avg vs cpu.utilization, process.voluntary_switches
Cause: High load but low CPU usage indicates blocking
Action: Profile application for lock contention
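
The lock-contention check boils down to comparing the 1-minute load average against CPU busy time: many runnable or blocked tasks combined with idle CPUs points at blocking rather than CPU shortage. A rough sketch of that heuristic (the 50% cut-off is an arbitrary example, not a recommended threshold):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// cpuBusyFraction samples /proc/stat twice and returns the non-idle share.
func cpuBusyFraction(interval time.Duration) float64 {
	read := func() (busy, total uint64) {
		data, _ := os.ReadFile("/proc/stat")
		fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])[1:]
		for i, f := range fields {
			v, _ := strconv.ParseUint(f, 10, 64)
			total += v
			if i != 3 && i != 4 { // skip idle and iowait columns
				busy += v
			}
		}
		return
	}
	b1, t1 := read()
	time.Sleep(interval)
	b2, t2 := read()
	return float64(b2-b1) / float64(t2-t1)
}

func main() {
	busy := cpuBusyFraction(time.Second)

	data, _ := os.ReadFile("/proc/loadavg")
	load1, _ := strconv.ParseFloat(strings.Fields(string(data))[0], 64)
	cores := float64(runtime.NumCPU())

	fmt.Printf("load1=%.2f cores=%.0f cpu_busy=%.0f%%\n", load1, cores, busy*100)
	if load1 > cores && busy < 0.5 {
		fmt.Println("hint: high load with low CPU usage; suspect lock or I/O blocking")
	}
}
```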

Throughput Issues

1. CPU-Bound Throughput

Check: cpu.user + cpu.system approaching 100%, load.avg > core_count  
Cause: Insufficient CPU capacity
Action: Scale horizontally or optimize CPU-intensive code

2. I/O-Bound Throughput

Check: disk.util_percent > 80%, disk.avg_queue_size > 2
Cause: Storage device saturation
Action: Add storage devices, optimize I/O patterns, use caching

3. Memory-Bound Throughput

Check: memory.swap_out_rate > 0, memory.dirty_pages high
Cause: Memory pressure forcing swapping or delayed writes
Action: Add memory, optimize memory usage patterns

Metric Collection Architecture

Collection Frequency

  • High-frequency (1s): CPU, Memory, Load metrics for real-time alerting
  • Medium-frequency (15s): Network, Disk metrics for trend analysis
  • Low-frequency (60s): Hardware info, NUMA topology for inventory
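
A minimal sketch of how such a tiered schedule could be driven with Go tickers; the intervals mirror the tiers listed above, but the real agent's scheduler and collector grouping may differ:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	high := time.NewTicker(1 * time.Second)    // CPU, memory, load
	medium := time.NewTicker(15 * time.Second) // network, disk
	low := time.NewTicker(60 * time.Second)    // hardware info, NUMA topology
	defer high.Stop()
	defer medium.Stop()
	defer low.Stop()

	// Runs forever; each case stands in for the corresponding collector group.
	for {
		select {
		case t := <-high.C:
			fmt.Println(t.Format("15:04:05"), "collect cpu/memory/load")
		case t := <-medium.C:
			fmt.Println(t.Format("15:04:05"), "collect network/disk")
		case t := <-low.C:
			fmt.Println(t.Format("15:04:05"), "collect hardware/numa inventory")
		}
	}
}
```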

Data Retention Strategy

  • Raw metrics: 24 hours for detailed troubleshooting
  • 1-minute aggregates: 7 days for trend analysis
  • 5-minute aggregates: 30 days for capacity planning
  • 1-hour aggregates: 1 year for historical analysis

Alerting Thresholds

Critical Thresholds

  • CPU Utilization > 90% for 5 minutes
  • Memory Utilization > 95% for 2 minutes
  • Disk Utilization > 95% for 5 minutes
  • Load Average > 2x core count for 5 minutes

Warning Thresholds

  • CPU Utilization > 80% for 10 minutes
  • Memory Utilization > 85% for 5 minutes
  • Disk Await Time > 50ms for 5 minutes
  • Network Errors > 0.1% of packets for 5 minutes
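
Each threshold above is a sustained condition: it should fire only after the value has stayed above the limit for the full duration, and a single sample below the limit resets the timer. A small sketch of that evaluation logic (type and field names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// sustainedThreshold fires only after the value has stayed above the limit
// for the whole duration; any sample at or below the limit resets the timer.
type sustainedThreshold struct {
	limit    float64
	duration time.Duration
	since    time.Time // zero while the value is below the limit
}

func (s *sustainedThreshold) observe(value float64, now time.Time) bool {
	if value <= s.limit {
		s.since = time.Time{}
		return false
	}
	if s.since.IsZero() {
		s.since = now
	}
	return now.Sub(s.since) >= s.duration
}

func main() {
	// Example: "CPU Utilization > 90% for 5 minutes" from the critical thresholds.
	alert := &sustainedThreshold{limit: 90, duration: 5 * time.Minute}
	now := time.Now()
	samples := []float64{95, 96, 93, 92, 97, 94, 91} // one sample per minute
	for i, v := range samples {
		t := now.Add(time.Duration(i) * time.Minute)
		fmt.Printf("t+%dm cpu=%.0f%% firing=%v\n", i, v, alert.observe(v, t))
	}
}
```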

Integration with Antimetal Platform

Cost Optimization Insights

  • Right-sizing: CPU and memory utilization trends inform instance sizing
  • Storage Optimization: I/O patterns guide storage type selection
  • Network Optimization: Bandwidth usage patterns guide network configuration choices

Performance Monitoring Integration

  • Baseline Establishment: Historical metrics establish normal performance ranges
  • Anomaly Detection: Statistical analysis identifies performance regressions
  • Capacity Planning: Growth trends predict future resource needs

Next Steps


For detailed implementation of metric collection, see the Performance Monitoring documentation and individual collector guides.