CPU Collector - antimetal/system-agent GitHub Wiki

CPU Collector

Overview

The CPU Collector is a performance monitoring component of the Antimetal System Agent that collects CPU time statistics from the Linux /proc/stat file. It provides detailed insights into how CPU time is distributed across different states (user, system, idle, I/O wait, etc.) for both the aggregate system CPU and individual CPU cores.

This collector is essential for:

  • Performance monitoring: Track CPU utilization patterns and identify bottlenecks
  • Resource optimization: Understand CPU time distribution across different states
  • Capacity planning: Monitor per-core utilization for workload distribution
  • Virtualization awareness: Track steal time in virtualized environments
  • Troubleshooting: Identify high interrupt rates or I/O wait issues

Technical Details

MetricType

  • Enum Value: MetricTypeCPU
  • String Value: "cpu"
  • Registry: Automatically registered with ContinuousPointCollector wrapper for periodic collection

Data Source

  • Primary Source: /proc/stat
  • Format: Space-separated values for CPU time in different states
  • Units: All values are in USER_HZ ticks, often called "jiffies" (typically 100 ticks per second, so one tick is 10 ms)

Capabilities

| Capability | Value | Description |
|---|---|---|
| SupportsOneShot | true | Can perform single point-in-time collections |
| SupportsContinuous | false | Native continuous collection not implemented (wrapped by framework) |
| RequiresRoot | false | Can run as non-root user |
| RequiresEBPF | false | No eBPF kernel modules required |
| MinKernelVersion | 2.6.0 | /proc/stat has been available since early Linux versions |

Collected Metrics

The collector returns an array of CPUStats structures, one for each CPU plus an aggregate entry:

| Field | Type | Description | /proc/stat Field |
|---|---|---|---|
| CPUIndex | int32 | CPU identifier (-1 for aggregate, 0+ for individual cores) | Derived from line prefix |
| User | uint64 | Time spent in user mode | Field 1 |
| Nice | uint64 | Time spent in user mode with low priority (nice) | Field 2 |
| System | uint64 | Time spent in system/kernel mode | Field 3 |
| Idle | uint64 | Time spent idle | Field 4 |
| IOWait | uint64 | Time waiting for I/O completion | Field 5 |
| IRQ | uint64 | Time servicing hardware interrupts | Field 6 |
| SoftIRQ | uint64 | Time servicing software interrupts | Field 7 |
| Steal | uint64 | Time stolen by hypervisor (virtualization) | Field 8 (optional) |
| Guest | uint64 | Time spent running virtual CPUs for guests | Field 9 (optional) |
| GuestNice | uint64 | Time spent running niced guests | Field 10 (optional) |
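To make the field mapping concrete, here is a minimal sketch of parsing a single CPU line from /proc/stat. The `cpuStats` type and `parseCPULine` function are hypothetical names chosen for illustration; they mirror the table above but are not taken from the collector's source.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuStats is a hypothetical stand-in for the collector's CPUStats type;
// the field names follow the table above.
type cpuStats struct {
	CPUIndex                         int32
	User, Nice, System, Idle, IOWait uint64
	IRQ, SoftIRQ                     uint64
	Steal, Guest, GuestNice          uint64
}

// parseCPULine parses one "cpu" or "cpuN" line from /proc/stat.
// Optional trailing fields (Steal, Guest, GuestNice) stay 0 when absent.
func parseCPULine(line string) (cpuStats, error) {
	fields := strings.Fields(line)
	// Require the label plus the seven mandatory counters.
	if len(fields) < 8 || !strings.HasPrefix(fields[0], "cpu") {
		return cpuStats{}, fmt.Errorf("not a CPU line: %q", line)
	}
	s := cpuStats{CPUIndex: -1} // -1 marks the aggregate "cpu" line
	if suffix := strings.TrimPrefix(fields[0], "cpu"); suffix != "" {
		n, err := strconv.ParseInt(suffix, 10, 32)
		if err != nil {
			return cpuStats{}, fmt.Errorf("bad CPU index in %q: %w", fields[0], err)
		}
		s.CPUIndex = int32(n)
	}
	// Targets are listed in /proc/stat column order (fields 1 to 10).
	targets := []*uint64{&s.User, &s.Nice, &s.System, &s.Idle, &s.IOWait,
		&s.IRQ, &s.SoftIRQ, &s.Steal, &s.Guest, &s.GuestNice}
	for i, t := range targets {
		if i+1 >= len(fields) {
			break // optional trailing fields are missing on this kernel
		}
		v, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return cpuStats{}, fmt.Errorf("bad value %q: %w", fields[i+1], err)
		}
		*t = v
	}
	return s, nil
}

func main() {
	s, err := parseCPULine("cpu0 600 30 400 5000 100 15 20 25 30 35")
	fmt.Println(s, err)
}
```

Filling the struct through a slice of pointers keeps the column order in one place, so a short line simply stops the loop early and leaves the remaining fields at zero.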

Understanding CPU Time Values

All time values are cumulative counters in "jiffies" since system boot. To calculate CPU utilization:

  1. Take two samples at different times
  2. Calculate the delta for each field
  3. Sum the deltas to get total time elapsed (leave out Guest and GuestNice, which the kernel already counts in User and Nice)
  4. Calculate percentage: (delta_field / total_delta) * 100

To convert jiffies to seconds: divide by USER_HZ (typically 100)
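The steps above can be sketched as follows. This is an illustration only: the `sample` type, the `percentages` function, and the fixed `userHZ` value of 100 are assumptions made for the example, not part of the collector.

```go
package main

import "fmt"

// userHZ is assumed to be 100, the usual value on Linux
// (`getconf CLK_TCK` reports the actual value).
const userHZ = 100

// sample holds the cumulative counters of one CPU line (hypothetical type).
// Guest and GuestNice are omitted because the kernel already counts them
// in User and Nice.
type sample struct {
	User, Nice, System, Idle, IOWait, IRQ, SoftIRQ, Steal uint64
}

var stateNames = [8]string{"user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"}

func (s sample) values() [8]uint64 {
	return [8]uint64{s.User, s.Nice, s.System, s.Idle, s.IOWait, s.IRQ, s.SoftIRQ, s.Steal}
}

// percentages returns each state's share of the time elapsed between two
// samples, plus the elapsed time in seconds (CPU-seconds summed over all
// cores when the samples come from the aggregate line).
func percentages(prev, curr sample) (map[string]float64, float64) {
	p, c := prev.values(), curr.values()
	var deltas [8]uint64
	var total uint64
	for i := range c {
		if c[i] > p[i] { // a counter that went backwards counts as no change
			deltas[i] = c[i] - p[i]
		}
		total += deltas[i]
	}
	out := make(map[string]float64, len(deltas))
	for i, d := range deltas {
		if total == 0 {
			out[stateNames[i]] = 0
			continue
		}
		out[stateNames[i]] = float64(d) / float64(total) * 100
	}
	return out, float64(total) / userHZ
}

func main() {
	prev := sample{User: 1000, Idle: 8000}
	curr := sample{User: 1050, Idle: 8150}
	pct, secs := percentages(prev, curr)
	fmt.Printf("user=%.1f%% idle=%.1f%% elapsed=%.1fs\n", pct["user"], pct["idle"], secs)
}
```

With these two samples the deltas are 50 ticks of user time and 150 ticks of idle time, so user time accounts for 25% of the 200 elapsed ticks, which is 2 seconds at 100 ticks per second.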

Data Structure

The collector implementation can be found at:

Configuration

The CPU Collector is configured through the CollectionConfig structure:

config := performance.CollectionConfig{
    HostProcPath: "/proc",     // Path to proc filesystem (required)
    Interval:     time.Second, // Collection interval (when wrapped as continuous)
}

Container Environments

When running in containers, the HostProcPath should be set to the mounted host proc filesystem:

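# Container-level fields
env:
  - name: HOST_PROC
    value: /host/proc
volumeMounts:
  - name: proc
    mountPath: /host/proc
    readOnly: true
# Pod-level field: the hostPath volume that backs the mount above
volumes:
  - name: proc
    hostPath:
      path: /proc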
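As a rough illustration of how a configured proc root turns into the file the collector reads, the sketch below joins the root with `stat` and falls back to `/proc` when no root is given. The `statPath` helper is hypothetical, and reading `HOST_PROC` directly from the environment is an assumption about how the value reaches `HostProcPath`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// statPath returns the location of the stat file under the given proc root.
// An empty root falls back to "/proc". Hypothetical helper for illustration.
func statPath(procRoot string) string {
	if procRoot == "" {
		procRoot = "/proc"
	}
	return filepath.Join(procRoot, "stat")
}

func main() {
	// HOST_PROC matches the environment variable in the manifest above;
	// how the agent itself consumes it is an assumption here.
	fmt.Println(statPath(os.Getenv("HOST_PROC")))
}
```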

Platform Considerations

Linux Kernel Requirements

  • Minimum Version: 2.6.0 (effectively all modern Linux systems)
  • Required Files: /proc/stat must be readable
  • Optional Fields:
    • Steal was added in kernel 2.6.11, Guest in 2.6.24, and GuestNice in 2.6.33
    • Older kernels omit these fields from /proc/stat; the collector reports them as 0

Container Considerations

  • Must mount host /proc filesystem to access system-wide CPU stats
  • A container's own /proc/stat may not reflect the whole host (for example, when a tool such as LXCFS virtualizes it), so read the host mount instead
  • No special privileges required beyond filesystem access

Virtualization Environments

  • Steal Time: Important metric in VMs showing CPU time taken by hypervisor
  • Guest Time: Relevant when this system itself runs virtual machines (for example as a KVM host, including nested virtualization)
  • Helps identify "noisy neighbor" problems in cloud environments

Common Issues

Troubleshooting Guide

| Issue | Symptoms | Solution |
|---|---|---|
| No CPU stats collected | Error: "no CPU statistics found" | Verify /proc/stat exists and is readable |
| Missing CPU cores | Some CPUs not in output | Check for CPU hotplug events or offline CPUs |
| Zero values | All metrics show 0 | System just booted or /proc/stat format issue |
| Permission denied | Cannot read /proc/stat | Check file permissions and container mounts |
| Wrong proc path | File not found errors | Ensure HostProcPath is absolute path to actual /proc |
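Several rows in this table come down to the proc path being relative, missing, or unreadable. A minimal pre-flight check along those lines might look like the sketch below; `checkProcPath` is a hypothetical helper, not a function the agent is known to provide.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// checkProcPath reports why the stat file under procRoot cannot be used,
// or nil if it can be opened. Hypothetical helper for illustration.
func checkProcPath(procRoot string) error {
	if !filepath.IsAbs(procRoot) {
		return fmt.Errorf("proc path %q is not absolute", procRoot)
	}
	f, err := os.Open(filepath.Join(procRoot, "stat"))
	switch {
	case err == nil:
		return f.Close()
	case errors.Is(err, os.ErrNotExist):
		return fmt.Errorf("no stat file under %q (wrong path or missing mount?): %w", procRoot, err)
	case errors.Is(err, os.ErrPermission):
		return fmt.Errorf("stat file under %q is not readable: %w", procRoot, err)
	default:
		return err
	}
}

func main() {
	if err := checkProcPath("/proc"); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("proc path looks usable")
}
```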

Debugging Commands

# Check /proc/stat format
head -20 /proc/stat

# Verify CPU count
grep -c ^processor /proc/cpuinfo

# Check for offline CPUs
cat /sys/devices/system/cpu/offline

# Monitor the raw CPU counters in real time
watch -n 1 'grep "^cpu" /proc/stat'

Examples

Sample Output

Raw /proc/stat content:

cpu  1234 56 789 10000 200 30 40 50 60 70
cpu0 600 30 400 5000 100 15 20 25 30 35
cpu1 634 26 389 5000 100 15 20 25 30 35

Collected CPUStats array:

[
  {
    "CPUIndex": -1,
    "User": 1234,
    "Nice": 56,
    "System": 789,
    "Idle": 10000,
    "IOWait": 200,
    "IRQ": 30,
    "SoftIRQ": 40,
    "Steal": 50,
    "Guest": 60,
    "GuestNice": 70
  },
  {
    "CPUIndex": 0,
    "User": 600,
    "Nice": 30,
    "System": 400,
    "Idle": 5000,
    "IOWait": 100,
    "IRQ": 15,
    "SoftIRQ": 20,
    "Steal": 25,
    "Guest": 30,
    "GuestNice": 35
  },
  {
    "CPUIndex": 1,
    "User": 634,
    "Nice": 26,
    "System": 389,
    "Idle": 5000,
    "IOWait": 100,
    "IRQ": 15,
    "SoftIRQ": 20,
    "Steal": 25,
    "Guest": 30,
    "GuestNice": 35
  }
]

Usage Patterns

Common patterns for consuming CPU statistics:

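// calculateCPUUsage returns the percentage of non-idle CPU time between two
// samples of the same CPU (both must have the same CPUIndex).
func calculateCPUUsage(prev, curr *performance.CPUStats) float64 {
    // Guest and GuestNice are left out of the totals: the kernel already
    // includes them in User and Nice.
    totalPrev := prev.User + prev.Nice + prev.System + prev.Idle +
                 prev.IOWait + prev.IRQ + prev.SoftIRQ + prev.Steal
    totalCurr := curr.User + curr.Nice + curr.System + curr.Idle +
                 curr.IOWait + curr.IRQ + curr.SoftIRQ + curr.Steal

    // The counters are unsigned, so subtracting a larger previous value
    // would wrap around. Return 0 if the counters did not advance.
    if totalCurr <= totalPrev || curr.Idle < prev.Idle {
        return 0
    }

    totalDelta := float64(totalCurr - totalPrev)
    idleDelta := float64(curr.Idle - prev.Idle)

    // IOWait counts as busy time here; add its delta to idleDelta to
    // treat it as idle instead.
    return (1.0 - idleDelta/totalDelta) * 100
}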
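Because the collector returns one entry per CPU, a consumer usually needs to pair entries from two collections before computing usage. The sketch below shows one way to do that by CPUIndex. The `cpuStats` type and both functions are hypothetical and self-contained so the example runs on its own; skipping CPUs that appear in only one sample is an assumption about how hotplugged or offline cores should be handled.

```go
package main

import "fmt"

// cpuStats is a trimmed, hypothetical stand-in for the collector's CPUStats.
type cpuStats struct {
	CPUIndex                                              int32
	User, Nice, System, Idle, IOWait, IRQ, SoftIRQ, Steal uint64
}

func total(s cpuStats) uint64 {
	return s.User + s.Nice + s.System + s.Idle + s.IOWait + s.IRQ + s.SoftIRQ + s.Steal
}

// usage returns the non-idle percentage between two samples of one CPU.
func usage(prev, curr cpuStats) float64 {
	t0, t1 := total(prev), total(curr)
	if t1 <= t0 || curr.Idle < prev.Idle {
		return 0 // counters did not advance or were reset
	}
	return (1 - float64(curr.Idle-prev.Idle)/float64(t1-t0)) * 100
}

// perCoreUsage pairs entries by CPUIndex and computes usage for each pair.
// CPUs present in only one of the two samples are skipped.
func perCoreUsage(prev, curr []cpuStats) map[int32]float64 {
	byIndex := make(map[int32]cpuStats, len(prev))
	for _, s := range prev {
		byIndex[s.CPUIndex] = s
	}
	out := make(map[int32]float64, len(curr))
	for _, c := range curr {
		if p, ok := byIndex[c.CPUIndex]; ok {
			out[c.CPUIndex] = usage(p, c)
		}
	}
	return out
}

func main() {
	prev := []cpuStats{{CPUIndex: -1, User: 100, Idle: 900}, {CPUIndex: 0, User: 100, Idle: 900}}
	curr := []cpuStats{
		{CPUIndex: -1, User: 150, Idle: 950},
		{CPUIndex: 0, User: 150, Idle: 950},
		{CPUIndex: 1, User: 10, Idle: 10}, // came online between samples
	}
	for idx, pct := range perCoreUsage(prev, curr) {
		fmt.Printf("cpu %d: %.1f%%\n", idx, pct)
	}
}
```

Keying on CPUIndex rather than slice position keeps the pairing correct when the set of online CPUs changes between collections.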

Performance Impact

Resource Usage

  • CPU: Negligible - single file read operation
  • Memory: ~2KB for typical 8-core system (scales with CPU count)
  • I/O: One read of /proc/stat per collection
  • Frequency: Default 1 second interval when used as continuous collector

Optimization Notes

  • File read is buffered, no seek operations required
  • Parsing is done in single pass with minimal allocations
  • Missing CPUs are logged but do not fail the collection
  • No system calls beyond file read

Related Collectors

Direct Relationships

Complementary Metrics

Analysis Combinations

  • Combine with Load Collector for complete CPU pressure picture
  • Cross-reference with Process Collector to identify CPU-intensive processes
  • Use with CPU Info to understand performance relative to hardware capabilities

References