# Performance Metrics
The Antimetal System Agent collects comprehensive performance metrics from Linux systems to enable effective diagnosis of latency and throughput issues. This document details the available metrics, their diagnostic value, and how they map to the USE methodology for performance analysis.
## USE Methodology Overview
The USE methodology, developed by Brendan Gregg, provides a systematic approach to performance analysis by examining three key metric categories for every system resource:
- **Utilization** - How busy a resource is (percentage of time busy)
- **Saturation** - How much work is queued waiting for the resource
- **Errors** - Count of error events that occurred
This methodology helps identify bottlenecks quickly and avoid common performance analysis pitfalls.
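As a rough illustration of how the metric catalog below maps onto this checklist, the following sketch (hypothetical Go types and groupings, not the agent's actual code) walks each resource through all three questions in order:

```go
package main

import "fmt"

// USECategory is one of the three questions the USE method asks of a resource.
type USECategory string

const (
	Utilization USECategory = "utilization"
	Saturation  USECategory = "saturation"
	Errors      USECategory = "errors"
)

// useChecklist maps each resource to the metrics that answer each USE question.
// The metric names mirror the tables in this document.
var useChecklist = map[string]map[USECategory][]string{
	"cpu": {
		Utilization: {"cpu.user", "cpu.system", "cpu.iowait", "cpu.steal"},
		Saturation:  {"load.avg_1min", "load.avg_5min", "load.avg_15min"},
		Errors:      {"kernel.errors", "cpu.thermal_throttling"},
	},
	"memory": {
		Utilization: {"memory.used_percent", "memory.available"},
		Saturation:  {"memory.swap_in_rate", "memory.swap_out_rate"},
		Errors:      {"memory.oom_kills", "memory.allocation_failures"},
	},
}

func main() {
	// Walk every resource and ask all three USE questions, in order.
	for resource, categories := range useChecklist {
		for _, cat := range []USECategory{Utilization, Saturation, Errors} {
			fmt.Printf("%s / %s: check %v\n", resource, cat, categories[cat])
		}
	}
}
```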
## System Resources and USE Metrics
### CPU Resources

#### Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| cpu.user | /proc/stat | Utilization | Application workload intensity |
| cpu.system | /proc/stat | Utilization | Kernel/syscall overhead |
| cpu.iowait | /proc/stat | Utilization | Storage I/O blocking CPU |
| cpu.steal | /proc/stat | Utilization | Hypervisor stealing CPU (cloud) |
| cpu.per_core.* | /proc/stat | Utilization | Individual core utilization patterns |
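These percentages are derived from two samples of the aggregate `cpu` line in /proc/stat, since the kernel only exposes cumulative jiffy counters. A minimal sketch of that delta calculation (field layout per proc(5); error handling trimmed):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuSample holds the cumulative jiffy counters from the aggregate "cpu" line.
type cpuSample struct {
	user, nice, system, idle, iowait, irq, softirq, steal uint64
}

func readCPUSample() (cpuSample, error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return cpuSample{}, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 9 || fields[0] != "cpu" {
			continue // only the aggregate line, not cpu0, cpu1, ...
		}
		var s cpuSample
		vals := []*uint64{&s.user, &s.nice, &s.system, &s.idle, &s.iowait, &s.irq, &s.softirq, &s.steal}
		for i, v := range vals {
			*v, _ = strconv.ParseUint(fields[i+1], 10, 64)
		}
		return s, nil
	}
	return cpuSample{}, fmt.Errorf("no cpu line in /proc/stat")
}

func main() {
	a, _ := readCPUSample()
	time.Sleep(time.Second)
	b, _ := readCPUSample()

	// Total elapsed jiffies across all categories in the interval.
	total := (b.user + b.nice + b.system + b.idle + b.iowait + b.irq + b.softirq + b.steal) -
		(a.user + a.nice + a.system + a.idle + a.iowait + a.irq + a.softirq + a.steal)
	if total == 0 {
		return
	}
	pct := func(d uint64) float64 { return 100 * float64(d) / float64(total) }

	fmt.Printf("cpu.user=%.1f%% cpu.system=%.1f%% cpu.iowait=%.1f%% cpu.steal=%.1f%%\n",
		pct(b.user-a.user), pct(b.system-a.system), pct(b.iowait-a.iowait), pct(b.steal-a.steal))
}
```

Per-core metrics (cpu.per_core.*) follow the same calculation against the cpu0, cpu1, ... lines.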
#### Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| load.avg_1min | /proc/loadavg | Saturation | Short-term CPU pressure |
| load.avg_5min | /proc/loadavg | Saturation | Medium-term CPU pressure |
| load.avg_15min | /proc/loadavg | Saturation | Long-term CPU pressure |
| load.running_procs | /proc/loadavg | Saturation | Currently runnable processes/threads |
| load.total_procs | /proc/loadavg | Saturation | Total processes/threads in the system |
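All five of these come from the single line in /proc/loadavg; the fourth field packs the runnable and total scheduling-entity counts as `running/total`. A minimal parsing sketch:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// /proc/loadavg looks like: "0.52 0.58 0.59 2/1103 48392"
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		panic(err)
	}
	fields := strings.Fields(string(data))

	avg1, _ := strconv.ParseFloat(fields[0], 64)
	avg5, _ := strconv.ParseFloat(fields[1], 64)
	avg15, _ := strconv.ParseFloat(fields[2], 64)

	// Fourth field is "runnable/total" kernel scheduling entities.
	procs := strings.SplitN(fields[3], "/", 2)
	running, _ := strconv.Atoi(procs[0])
	total, _ := strconv.Atoi(procs[1])

	fmt.Printf("load.avg_1min=%.2f load.avg_5min=%.2f load.avg_15min=%.2f\n", avg1, avg5, avg15)
	fmt.Printf("load.running_procs=%d load.total_procs=%d\n", running, total)
}
```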
#### Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| kernel.errors | /dev/kmsg | Errors | CPU-related kernel errors |
| cpu.thermal_throttling | /sys/devices | Errors | CPU frequency reduction due to heat |
**CPU Diagnostic Patterns:**
- High Utilization + Low Saturation = CPU-bound workload, consider scaling
- Low Utilization + High Saturation = I/O or lock contention
- High IOWait = Storage bottleneck affecting CPU efficiency
- High Steal Time = Noisy neighbor in cloud environment
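These rules can be turned into a first-pass classifier over one sample. A hedged sketch (the thresholds here are illustrative defaults, not the agent's alerting logic):

```go
package main

import "fmt"

// classifyCPU applies the diagnostic patterns above to one sample.
// busyPct is user+system, loadPerCore is load.avg_1min divided by core count.
func classifyCPU(busyPct, loadPerCore, iowaitPct, stealPct float64) string {
	switch {
	case stealPct > 10:
		return "high steal time: noisy neighbor in cloud environment"
	case iowaitPct > 20:
		return "high iowait: storage bottleneck affecting CPU efficiency"
	case busyPct > 80 && loadPerCore <= 1:
		return "CPU-bound workload: consider scaling"
	case busyPct < 40 && loadPerCore > 2:
		return "low utilization, high saturation: suspect I/O or lock contention"
	default:
		return "no dominant CPU pattern"
	}
}

func main() {
	// Low CPU usage but a long run queue points at blocking, not compute.
	fmt.Println(classifyCPU(25, 3.5, 2, 0))
}
```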
### Memory Resources

#### Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| memory.used_percent | /proc/meminfo | Utilization | Overall memory pressure |
| memory.available | /proc/meminfo | Utilization | Memory available to applications |
| memory.buffers | /proc/meminfo | Utilization | File system buffer cache usage |
| memory.cached | /proc/meminfo | Utilization | Page cache utilization |
| memory.anon_pages | /proc/meminfo | Utilization | Anonymous (heap/stack) memory |
| memory.shmem | /proc/meminfo | Utilization | Shared memory usage |
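memory.used_percent is best derived from MemTotal and MemAvailable rather than MemFree, so reclaimable page cache is not counted as pressure. A minimal sketch of parsing /proc/meminfo (the kernel reports values in kB):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readMeminfo returns /proc/meminfo fields in kB, keyed by name (e.g. "MemTotal").
func readMeminfo() (map[string]uint64, error) {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		return nil, err
	}
	info := make(map[string]uint64)
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		if val, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			info[key] = val // kB
		}
	}
	return info, nil
}

func main() {
	m, err := readMeminfo()
	if err != nil {
		panic(err)
	}
	total, avail := m["MemTotal"], m["MemAvailable"]
	usedPct := 100 * float64(total-avail) / float64(total)
	fmt.Printf("memory.used_percent=%.1f memory.available=%d kB\n", usedPct, avail)
	fmt.Printf("memory.cached=%d kB memory.buffers=%d kB memory.shmem=%d kB\n",
		m["Cached"], m["Buffers"], m["Shmem"])
}
```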
#### Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| memory.swap_used | /proc/meminfo | Saturation | Memory pressure forcing swap |
| memory.swap_in_rate | /proc/vmstat | Saturation | Pages swapped in per second |
| memory.swap_out_rate | /proc/vmstat | Saturation | Pages swapped out per second |
| memory.dirty_pages | /proc/meminfo | Saturation | Pages waiting to be written |
| memory.writeback_pages | /proc/meminfo | Saturation | Pages currently being written |
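The swap rates are deltas of the cumulative pswpin/pswpout counters in /proc/vmstat divided by the sampling interval. A minimal sketch:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readVMStat returns /proc/vmstat counters keyed by name (cumulative since boot).
func readVMStat() map[string]uint64 {
	data, err := os.ReadFile("/proc/vmstat")
	if err != nil {
		panic(err)
	}
	counters := make(map[string]uint64)
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 {
			v, _ := strconv.ParseUint(fields[1], 10, 64)
			counters[fields[0]] = v
		}
	}
	return counters
}

func main() {
	interval := 5 * time.Second
	before := readVMStat()
	time.Sleep(interval)
	after := readVMStat()

	secs := interval.Seconds()
	fmt.Printf("memory.swap_in_rate=%.1f pages/s memory.swap_out_rate=%.1f pages/s\n",
		float64(after["pswpin"]-before["pswpin"])/secs,
		float64(after["pswpout"]-before["pswpout"])/secs)
}
```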
#### Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| memory.oom_kills | /proc/vmstat | Errors | Out-of-memory killer invocations |
| memory.allocation_failures | /proc/vmstat | Errors | Failed memory allocations |
**Memory Diagnostic Patterns:**
- High Utilization + No Swapping = Efficient memory usage
- Active Swapping = Memory pressure, consider adding RAM
- High Dirty Pages = Write I/O bottleneck
- OOM Kills = Memory limits exceeded, investigate memory leaks
### Storage Resources

#### Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| disk.util_percent | /proc/diskstats | Utilization | Device busy percentage |
| disk.read_bytes_rate | /proc/diskstats | Utilization | Read throughput |
| disk.write_bytes_rate | /proc/diskstats | Utilization | Write throughput |
| disk.read_ops_rate | /proc/diskstats | Utilization | Read IOPS |
| disk.write_ops_rate | /proc/diskstats | Utilization | Write IOPS |
#### Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| disk.avg_queue_size | /proc/diskstats | Saturation | Average I/O queue depth |
| disk.await_time_ms | /proc/diskstats | Saturation | Average I/O wait time |
| disk.read_await_ms | /proc/diskstats | Saturation | Read operation latency |
| disk.write_await_ms | /proc/diskstats | Saturation | Write operation latency |
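disk.util_percent and the await metrics come from deltas of two /proc/diskstats samples: the io_ticks field gives device busy time, and read/write time divided by completed operations gives average latency. A minimal sketch for a single device (field layout per the kernel's iostats documentation; device name is an example and error handling is trimmed):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// diskSample holds cumulative counters for one block device from /proc/diskstats.
type diskSample struct {
	reads, readTicks, writes, writeTicks, ioTicks, weighted uint64
}

func readDisk(dev string) (diskSample, error) {
	data, err := os.ReadFile("/proc/diskstats")
	if err != nil {
		return diskSample{}, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) < 14 || f[2] != dev {
			continue
		}
		p := func(i int) uint64 { v, _ := strconv.ParseUint(f[i], 10, 64); return v }
		// Field layout: 3=reads, 6=read ms, 7=writes, 10=write ms, 12=io_ticks ms, 13=weighted ms.
		return diskSample{p(3), p(6), p(7), p(10), p(12), p(13)}, nil
	}
	return diskSample{}, fmt.Errorf("device %s not found", dev)
}

func main() {
	const dev = "sda" // example device; adjust for your system
	interval := time.Second

	a, _ := readDisk(dev)
	time.Sleep(interval)
	b, _ := readDisk(dev)

	ms := float64(interval.Milliseconds())
	ios := float64((b.reads - a.reads) + (b.writes - a.writes))

	utilPct := 100 * float64(b.ioTicks-a.ioTicks) / ms
	queue := float64(b.weighted-a.weighted) / ms
	await := 0.0
	if ios > 0 {
		await = float64((b.readTicks-a.readTicks)+(b.writeTicks-a.writeTicks)) / ios
	}
	fmt.Printf("disk.util_percent=%.1f disk.avg_queue_size=%.2f disk.await_time_ms=%.2f\n",
		utilPct, queue, await)
}
```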
#### Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| disk.read_errors | /proc/diskstats | Errors | Failed read operations |
| disk.write_errors | /proc/diskstats | Errors | Failed write operations |
| disk.io_errors | /proc/diskstats | Errors | General I/O errors |
**Storage Diagnostic Patterns:**
- High Utilization + Low Queue Depth = Sequential I/O pattern
- High Queue Depth + High Latency = Storage device saturation
- High IOPS + Small Transfer Size = Random I/O pattern
- I/O Errors = Hardware issues or filesystem corruption
### Network Resources

#### Utilization Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| network.rx_bytes_rate | /proc/net/dev | Utilization | Receive bandwidth usage |
| network.tx_bytes_rate | /proc/net/dev | Utilization | Transmit bandwidth usage |
| network.rx_packets_rate | /proc/net/dev | Utilization | Receive packet rate |
| network.tx_packets_rate | /proc/net/dev | Utilization | Transmit packet rate |
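These rates are deltas of the cumulative per-interface counters in /proc/net/dev. A minimal sketch for one interface (the interface name is an example; header lines are skipped):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// netSample holds cumulative RX/TX counters for one interface from /proc/net/dev.
type netSample struct {
	rxBytes, rxPackets, txBytes, txPackets uint64
}

func readNet(iface string) (netSample, error) {
	data, err := os.ReadFile("/proc/net/dev")
	if err != nil {
		return netSample{}, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		name, rest, found := strings.Cut(line, ":")
		if !found || strings.TrimSpace(name) != iface {
			continue
		}
		f := strings.Fields(rest)
		if len(f) < 16 {
			continue
		}
		p := func(i int) uint64 { v, _ := strconv.ParseUint(f[i], 10, 64); return v }
		// Columns 0-7 are receive counters, 8-15 are transmit counters.
		return netSample{rxBytes: p(0), rxPackets: p(1), txBytes: p(8), txPackets: p(9)}, nil
	}
	return netSample{}, fmt.Errorf("interface %s not found", iface)
}

func main() {
	const iface = "eth0" // example interface name
	interval := time.Second

	a, _ := readNet(iface)
	time.Sleep(interval)
	b, _ := readNet(iface)

	secs := interval.Seconds()
	fmt.Printf("network.rx_bytes_rate=%.0f B/s network.tx_bytes_rate=%.0f B/s\n",
		float64(b.rxBytes-a.rxBytes)/secs, float64(b.txBytes-a.txBytes)/secs)
	fmt.Printf("network.rx_packets_rate=%.0f pkt/s network.tx_packets_rate=%.0f pkt/s\n",
		float64(b.rxPackets-a.rxPackets)/secs, float64(b.txPackets-a.txPackets)/secs)
}
```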
#### Saturation Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| network.rx_dropped | /proc/net/dev | Saturation | Receive buffer overflows |
| network.tx_dropped | /proc/net/dev | Saturation | Transmit buffer overflows |
| tcp.retrans_rate | /proc/net/snmp | Saturation | TCP retransmission rate |
| tcp.listen_drops | /proc/net/netstat | Saturation | Dropped connection attempts |
#### Error Metrics

| Metric | Source | USE Category | Diagnostic Value |
|--------|--------|--------------|------------------|
| network.rx_errors | /proc/net/dev | Errors | Receive errors (CRC, frame) |
| network.tx_errors | /proc/net/dev | Errors | Transmit errors |
| network.collisions | /proc/net/dev | Errors | Ethernet collisions |
| tcp.failed_conns | /proc/net/netstat | Errors | Failed TCP connections |
**Network Diagnostic Patterns:**
- High Bandwidth + No Drops = Efficient network usage
- Packet Drops = Buffer exhaustion or network congestion
- High Retransmissions = Network quality issues
- Connection Errors = Service unavailability or firewall issues
## Advanced Performance Metrics

### NUMA Topology Metrics

| Metric | Source | Diagnostic Value |
|--------|--------|------------------|
| numa.node_memory_usage | /sys/devices/system/node | Memory locality efficiency |
| numa.node_cpu_usage | /sys/devices/system/node | CPU locality patterns |
| numa.memory_migrations | /proc/vmstat | Cross-node memory access cost |
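Per-node memory figures come from /sys/devices/system/node/node*/meminfo, which uses the same key/value layout as /proc/meminfo but with a "Node N" prefix, while numa_pages_migrated in /proc/vmstat tracks cross-node migrations. A minimal sketch of the per-node read:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// nodeMeminfo returns selected fields (in kB) from one NUMA node's meminfo file.
// Lines look like: "Node 0 MemTotal:       65464700 kB".
func nodeMeminfo(path string) (map[string]uint64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	fieldsKB := make(map[string]uint64)
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) < 4 {
			continue
		}
		key := strings.TrimSuffix(f[2], ":")
		if v, err := strconv.ParseUint(f[3], 10, 64); err == nil {
			fieldsKB[key] = v
		}
	}
	return fieldsKB, nil
}

func main() {
	nodes, _ := filepath.Glob("/sys/devices/system/node/node*/meminfo")
	for _, path := range nodes {
		m, err := nodeMeminfo(path)
		if err != nil {
			continue
		}
		total, free := m["MemTotal"], m["MemFree"]
		if total == 0 {
			continue // skip memoryless nodes
		}
		usedPct := 100 * float64(total-free) / float64(total)
		fmt.Printf("%s: used=%.1f%% of %d kB\n", filepath.Dir(path), usedPct, total)
	}
}
```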
### Process-Level Metrics

| Metric | Source | Diagnostic Value |
|--------|--------|------------------|
| process.cpu_percent | /proc/[pid]/stat | Per-process CPU usage |
| process.memory_rss | /proc/[pid]/status | Process memory footprint |
| process.io_read_bytes | /proc/[pid]/io | Process I/O patterns |
| process.voluntary_switches | /proc/[pid]/status | Process cooperation level |
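Per-process figures are read from that process's /proc/[pid] files; the status file carries both VmRSS and voluntary_ctxt_switches as labeled lines. A minimal sketch (process.cpu_percent would additionally need two timed samples of /proc/[pid]/stat, omitted here):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// procStatus returns labeled fields from /proc/<pid>/status as strings.
func procStatus(pid int) (map[string]string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return nil, err
	}
	fields := make(map[string]string)
	for _, line := range strings.Split(string(data), "\n") {
		key, val, found := strings.Cut(line, ":")
		if found {
			fields[key] = strings.TrimSpace(val)
		}
	}
	return fields, nil
}

func main() {
	pid := os.Getpid() // sample ourselves; any visible PID works
	s, err := procStatus(pid)
	if err != nil {
		panic(err)
	}

	// VmRSS is reported like "123456 kB"; keep just the number.
	rssKB, _ := strconv.Atoi(strings.Fields(s["VmRSS"])[0])
	volSwitches, _ := strconv.Atoi(s["voluntary_ctxt_switches"])

	fmt.Printf("process.memory_rss=%d kB process.voluntary_switches=%d\n", rssKB, volSwitches)
}
```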
### eBPF-Enhanced Metrics

| Metric | Source | Diagnostic Value |
|--------|--------|------------------|
| execsnoop.exec_rate | eBPF program | Process creation overhead |
| execsnoop.short_lived_procs | eBPF program | Fork/exec thrashing detection |
## Diagnostic Workflows

### Latency Issues

1. **High Application Latency**
   - Check: cpu.iowait, disk.await_time_ms, memory.swap_in_rate
   - Cause: I/O or memory pressure affecting response time
   - Action: Optimize I/O patterns or add memory/storage capacity
2. **Network Latency**
   - Check: tcp.retrans_rate, network.rx_dropped, network.tx_dropped
   - Cause: Network congestion or quality issues
   - Action: Investigate network path, adjust buffer sizes
3. **Lock Contention**
   - Check: load.avg vs cpu.utilization, process.voluntary_switches
   - Cause: High load but low CPU usage indicates blocking
   - Action: Profile application for lock contention
### Throughput Issues

1. **CPU-Bound Throughput**
   - Check: cpu.user + cpu.system approaching 100%, load.avg > core_count
   - Cause: Insufficient CPU capacity
   - Action: Scale horizontally or optimize CPU-intensive code
2. **I/O-Bound Throughput**
   - Check: disk.util_percent > 80%, disk.avg_queue_size > 2
   - Cause: Storage device saturation
   - Action: Add storage devices, optimize I/O patterns, use caching
3. **Memory-Bound Throughput**
   - Check: memory.swap_out_rate > 0, memory.dirty_pages high
   - Cause: Memory pressure forcing swapping or delayed writes
   - Action: Add memory, optimize memory usage patterns
## Metric Collection Architecture

### Collection Frequency

- **High-frequency (1s)**: CPU, memory, and load metrics for real-time alerting
- **Medium-frequency (15s)**: Network and disk metrics for trend analysis
- **Low-frequency (60s)**: Hardware info and NUMA topology for inventory
### Data Retention Strategy

- **Raw metrics**: 24 hours for detailed troubleshooting
- **1-minute aggregates**: 7 days for trend analysis
- **5-minute aggregates**: 30 days for capacity planning
- **1-hour aggregates**: 1 year for historical analysis
## Alerting Thresholds

### Critical Thresholds

- CPU Utilization > 90% for 5 minutes
- Memory Utilization > 95% for 2 minutes
- Disk Utilization > 95% for 5 minutes
- Load Average > 2x core count for 5 minutes

### Warning Thresholds

- CPU Utilization > 80% for 10 minutes
- Memory Utilization > 85% for 5 minutes
- Disk Await Time > 50ms for 5 minutes
- Network Errors > 0.1% of packets for 5 minutes
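The "for N minutes" qualifiers mean a threshold only fires after the condition has held continuously for that duration, which avoids paging on single-sample spikes. A hedged sketch of that sustained-breach logic (names and values illustrative, not the platform's alerting engine):

```go
package main

import (
	"fmt"
	"time"
)

// sustainedThreshold fires only after value has stayed above limit for holdFor.
type sustainedThreshold struct {
	limit      float64
	holdFor    time.Duration
	breachedAt time.Time // zero while the condition is not met
}

// observe records one sample and reports whether the alert should fire now.
func (t *sustainedThreshold) observe(value float64, now time.Time) bool {
	if value <= t.limit {
		t.breachedAt = time.Time{} // condition cleared, reset the timer
		return false
	}
	if t.breachedAt.IsZero() {
		t.breachedAt = now // first sample above the limit
	}
	return now.Sub(t.breachedAt) >= t.holdFor
}

func main() {
	// Critical: CPU utilization > 90% for 5 minutes, sampled every minute.
	cpuCritical := &sustainedThreshold{limit: 90, holdFor: 5 * time.Minute}

	now := time.Now()
	samples := []float64{95, 96, 92, 97, 93, 94} // six consecutive minutes above 90%
	for i, v := range samples {
		at := now.Add(time.Duration(i) * time.Minute)
		if cpuCritical.observe(v, at) {
			fmt.Printf("t+%dm: CRITICAL cpu utilization %.0f%% sustained > 90%%\n", i, v)
		}
	}
}
```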
## Integration with Antimetal Platform

### Cost Optimization Insights

- **Right-sizing**: CPU and memory utilization trends inform instance sizing
- **Storage Optimization**: I/O patterns guide storage type selection
- **Network Optimization**: Bandwidth usage patterns optimize network configurations

### Performance Monitoring Integration

- **Baseline Establishment**: Historical metrics establish normal performance ranges
- **Anomaly Detection**: Statistical analysis identifies performance regressions
- **Capacity Planning**: Growth trends predict future resource needs
## Next Steps
For detailed implementation of metric collection, see the Performance Monitoring documentation and individual collector guides.