CPU Collector - antimetal/system-agent GitHub Wiki
CPU Collector
Overview
The CPU Collector is a performance monitoring component of the Antimetal System Agent that collects CPU time statistics from the Linux /proc/stat file. It reports how CPU time is distributed across different states (user, system, idle, I/O wait, etc.) for both the aggregate system CPU and individual CPU cores.
This collector is essential for:
- Performance monitoring: Track CPU utilization patterns and identify bottlenecks
- Resource optimization: Understand CPU time distribution across different states
- Capacity planning: Monitor per-core utilization for workload distribution
- Virtualization awareness: Track steal time in virtualized environments
- Troubleshooting: Identify high interrupt rates or I/O wait issues
Technical Details
MetricType
- Enum Value: MetricTypeCPU
- String Value: "cpu"
- Registry: Automatically registered with the ContinuousPointCollector wrapper for periodic collection
Data Source
- Primary Source: /proc/stat
- Format: Space-separated values for CPU time in different states
- Units: All values are cumulative clock ticks, commonly called "jiffies" (units of 1/USER_HZ seconds; USER_HZ is typically 100, so one tick is 10 ms)
Capabilities
Capability | Value | Description |
---|---|---|
SupportsOneShot | true | Can perform single point-in-time collections |
SupportsContinuous | false | Native continuous collection not implemented (wrapped by framework) |
RequiresRoot | false | Can run as non-root user |
RequiresEBPF | false | No eBPF kernel modules required |
MinKernelVersion | 2.6.0 | /proc/stat has been available since early Linux versions |
Collected Metrics
The collector returns an array of CPUStats structures, one for each CPU plus an aggregate entry:
Field | Type | Description | /proc/stat Field |
---|---|---|---|
CPUIndex | int32 | CPU identifier (-1 for aggregate, 0+ for individual cores) | Derived from line prefix |
User | uint64 | Time spent in user mode | Field 1 |
Nice | uint64 | Time spent in user mode with low priority (nice) | Field 2 |
System | uint64 | Time spent in system/kernel mode | Field 3 |
Idle | uint64 | Time spent idle | Field 4 |
IOWait | uint64 | Time waiting for I/O completion | Field 5 |
IRQ | uint64 | Time servicing hardware interrupts | Field 6 |
SoftIRQ | uint64 | Time servicing software interrupts | Field 7 |
Steal | uint64 | Time stolen by hypervisor (virtualization) | Field 8 (optional) |
Guest | uint64 | Time spent running virtual CPUs for guests (already included in User) | Field 9 (optional) |
GuestNice | uint64 | Time spent running niced guests (already included in Nice) | Field 10 (optional) |
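As an illustration of how the table maps onto a /proc/stat line, here is a minimal parsing sketch. It is not the collector's actual parser: parseCPULine is a hypothetical helper, and CPUStats is redeclared locally (the real type lives in pkg/performance/types.go) so the example compiles on its own. Fields 8 to 10 are treated as optional and default to 0 when absent.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CPUStats mirrors the fields in the table above. It is redeclared here only
// to keep the sketch self-contained.
type CPUStats struct {
	CPUIndex                              int32
	User, Nice, System, Idle, IOWait      uint64
	IRQ, SoftIRQ, Steal, Guest, GuestNice uint64
}

// parseCPULine (hypothetical) converts one "cpu" or "cpuN" line from
// /proc/stat into a CPUStats value.
func parseCPULine(line string) (CPUStats, error) {
	fields := strings.Fields(line)
	// The prefix plus the seven mandatory counters (User through SoftIRQ).
	if len(fields) < 8 || !strings.HasPrefix(fields[0], "cpu") {
		return CPUStats{}, fmt.Errorf("not a cpu line: %q", line)
	}

	stats := CPUStats{CPUIndex: -1} // -1 marks the aggregate "cpu" line
	if suffix := strings.TrimPrefix(fields[0], "cpu"); suffix != "" {
		idx, err := strconv.ParseInt(suffix, 10, 32)
		if err != nil || idx < 0 {
			return CPUStats{}, fmt.Errorf("bad cpu index in %q", fields[0])
		}
		stats.CPUIndex = int32(idx)
	}

	// Destination for each positional field, in /proc/stat order.
	targets := []*uint64{
		&stats.User, &stats.Nice, &stats.System, &stats.Idle, &stats.IOWait,
		&stats.IRQ, &stats.SoftIRQ, &stats.Steal, &stats.Guest, &stats.GuestNice,
	}
	for i, dst := range targets {
		if i+1 >= len(fields) {
			break // optional trailing fields stay 0
		}
		v, err := strconv.ParseUint(fields[i+1], 10, 64)
		if err != nil {
			return CPUStats{}, fmt.Errorf("field %d of %q: %w", i+1, fields[0], err)
		}
		*dst = v
	}
	return stats, nil
}
```

Calling parseCPULine on each line that starts with "cpu" yields one entry per core plus the aggregate entry with CPUIndex -1.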
Understanding CPU Time Values
All time values are cumulative counters in "jiffies" since system boot. To calculate CPU utilization:
- Take two samples at different times
- Calculate the delta for each field
- Sum the deltas for User through Steal to get the total time elapsed (Guest and GuestNice are already included in User and Nice, so adding them would double count)
- Calculate the percentage for each field: (delta_field / total_delta) * 100
To convert jiffies to seconds, divide by USER_HZ (typically 100).
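The steps above can be sketched as follows. This is an illustration only: stateBreakdown, delta, and ticksToDuration are hypothetical helpers, CPUStats is redeclared locally, and assumedUserHZ hard-codes the typical value of 100 rather than querying the system (the authoritative value comes from sysconf(_SC_CLK_TCK)).

```go
package main

import "time"

// CPUStats is redeclared here so the sketch compiles on its own; the real
// type is defined in pkg/performance/types.go.
type CPUStats struct {
	CPUIndex                              int32
	User, Nice, System, Idle, IOWait      uint64
	IRQ, SoftIRQ, Steal, Guest, GuestNice uint64
}

// assumedUserHZ is an assumption: USER_HZ is 100 on typical Linux builds.
const assumedUserHZ = 100

// ticksToDuration converts a tick count to a duration under assumedUserHZ.
func ticksToDuration(ticks uint64) time.Duration {
	return time.Duration(ticks) * (time.Second / assumedUserHZ)
}

// delta returns curr-prev, or 0 if the counter moved backwards, so that the
// unsigned subtraction never wraps around.
func delta(prev, curr uint64) uint64 {
	if curr < prev {
		return 0
	}
	return curr - prev
}

// stateBreakdown returns the percentage of elapsed ticks spent in each state
// between two samples. Guest and GuestNice are left out of the total because
// the kernel already counts them in User and Nice.
func stateBreakdown(prev, curr CPUStats) map[string]float64 {
	deltas := map[string]uint64{
		"user":    delta(prev.User, curr.User),
		"nice":    delta(prev.Nice, curr.Nice),
		"system":  delta(prev.System, curr.System),
		"idle":    delta(prev.Idle, curr.Idle),
		"iowait":  delta(prev.IOWait, curr.IOWait),
		"irq":     delta(prev.IRQ, curr.IRQ),
		"softirq": delta(prev.SoftIRQ, curr.SoftIRQ),
		"steal":   delta(prev.Steal, curr.Steal),
	}
	var total uint64
	for _, d := range deltas {
		total += d
	}
	out := make(map[string]float64, len(deltas))
	if total == 0 {
		return out // no time elapsed between the samples
	}
	for name, d := range deltas {
		out[name] = float64(d) / float64(total) * 100
	}
	return out
}
```

With a delta of 50 user ticks and 150 idle ticks, the breakdown is 25% user and 75% idle; ticksToDuration(250) corresponds to 2.5 seconds at 100 ticks per second.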
Data Structure
The collector implementation can be found at:
- Source: pkg/performance/collectors/cpu.go
- Tests: pkg/performance/collectors/cpu_test.go
- Type Definition: pkg/performance/types.go (see CPUStats)
Configuration
The CPU Collector is configured through the CollectionConfig structure:
config := performance.CollectionConfig{
	HostProcPath: "/proc",     // Path to proc filesystem (required)
	Interval:     time.Second, // Collection interval (when wrapped as continuous)
}
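To show how a HostProcPath value might translate into the file that is actually read, here is a small sketch. statPath is a hypothetical helper, not part of the agent's API; the checks simply reflect the requirements stated on this page (the path is required and should be absolute).

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// statPath (hypothetical) derives the stat file location from a HostProcPath
// value, rejecting empty or relative paths.
func statPath(hostProcPath string) (string, error) {
	if hostProcPath == "" {
		return "", errors.New("HostProcPath is required")
	}
	if !filepath.IsAbs(hostProcPath) {
		return "", fmt.Errorf("HostProcPath must be absolute, got %q", hostProcPath)
	}
	// Join also cleans the result, so "/proc/" and "/proc" behave the same.
	return filepath.Join(hostProcPath, "stat"), nil
}
```

For example, a HostProcPath of /host/proc resolves to /host/proc/stat.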
Container Environments
When running in containers, the HostProcPath should be set to the mounted host proc filesystem:
env:
  - name: HOST_PROC
    value: /host/proc
volumeMounts:
  - name: proc
    mountPath: /host/proc
    readOnly: true
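One way the HOST_PROC variable from the manifest could feed HostProcPath is sketched below. This is an assumption about wiring, not a description of how the agent reads its configuration: hostProcPathFrom and hostProcPath are hypothetical helpers, and the fallback to /proc is chosen for illustration.

```go
package main

import "os"

// hostProcPathFrom (hypothetical) resolves the proc mount point using the
// supplied lookup function, falling back to /proc when HOST_PROC is unset.
// Taking the lookup as a parameter keeps the logic easy to test.
func hostProcPathFrom(getenv func(string) string) string {
	if v := getenv("HOST_PROC"); v != "" {
		return v
	}
	return "/proc"
}

// hostProcPath resolves the mount point from the real process environment.
func hostProcPath() string {
	return hostProcPathFrom(os.Getenv)
}
```

The returned value would then be assigned to CollectionConfig.HostProcPath before the collector is created.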
Platform Considerations
Linux Kernel Requirements
- Minimum Version: 2.6.0 (effectively all modern Linux systems)
- Required Files: /proc/stat must be readable
- Optional Fields:
  - Steal was added in kernel 2.6.11, Guest in 2.6.24, and GuestNice in 2.6.33
  - On older kernels that omit these fields, the collector reports 0 for them
Container Considerations
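- Must mount the host /proc filesystem to access system-wide CPU stats
- The container's own /proc/stat may present a container-specific view (for example, when the runtime virtualizes it to match container limits), so it should not be relied on for host-wide stats
- No special privileges required beyond filesystem access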
Virtualization Environments
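- Steal Time: Important metric in VMs, showing time during which this VM was ready to run but the hypervisor was servicing other work
- Guest Time: Relevant when the monitored system itself runs virtual machines (for example, a KVM host or nested virtualization)
- Steal time helps identify "noisy neighbor" problems in cloud environments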
Common Issues
Troubleshooting Guide
Issue | Symptoms | Solution |
---|---|---|
No CPU stats collected | Error: "no CPU statistics found" | Verify /proc/stat exists and is readable |
Missing CPU cores | Some CPUs not in output | Check for CPU hotplug events or offline CPUs |
Zero values | All metrics show 0 | System just booted or /proc/stat format issue |
Permission denied | Cannot read /proc/stat | Check file permissions and container mounts |
Wrong proc path | File not found errors | Ensure HostProcPath is an absolute path to the actual /proc mount |
Debugging Commands
# Check /proc/stat format
head -20 /proc/stat
# Verify CPU count
grep -c ^processor /proc/cpuinfo
# Check for offline CPUs
cat /sys/devices/system/cpu/offline
# Monitor CPU usage in real-time
watch -n 1 'grep "^cpu" /proc/stat'
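When diagnosing missing cores programmatically, the contents of /sys/devices/system/cpu/offline need to be expanded from the kernel's CPU list notation (for example "2-3,7") into individual indices. The sketch below shows one way to do that; parseCPUList is a hypothetical helper and is not part of the collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList (hypothetical) expands a kernel CPU list such as "0-3,5" into
// individual CPU indices. An empty list, which is what the offline file
// contains when every CPU is online, yields no entries.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad cpu list entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad cpu list entry %q: %w", part, err)
			}
		}
		if first < 0 || last < first {
			return nil, fmt.Errorf("invalid cpu range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}
```

Any index returned here that is absent from the collected CPUStats array is an offline CPU rather than a collection failure.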
Examples
Sample Output
Raw /proc/stat content:
cpu 1234 56 789 10000 200 30 40 50 60 70
cpu0 600 30 400 5000 100 15 20 25 30 35
cpu1 634 26 389 5000 100 15 20 25 30 35
Collected CPUStats array:
[
  {
    "CPUIndex": -1,
    "User": 1234,
    "Nice": 56,
    "System": 789,
    "Idle": 10000,
    "IOWait": 200,
    "IRQ": 30,
    "SoftIRQ": 40,
    "Steal": 50,
    "Guest": 60,
    "GuestNice": 70
  },
  {
    "CPUIndex": 0,
    "User": 600,
    "Nice": 30,
    "System": 400,
    "Idle": 5000,
    "IOWait": 100,
    "IRQ": 15,
    "SoftIRQ": 20,
    "Steal": 25,
    "Guest": 30,
    "GuestNice": 35
  },
  {
    "CPUIndex": 1,
    "User": 634,
    "Nice": 26,
    "System": 389,
    "Idle": 5000,
    "IOWait": 100,
    "IRQ": 15,
    "SoftIRQ": 20,
    "Steal": 25,
    "Guest": 30,
    "GuestNice": 35
  }
]
Usage Patterns
Common patterns for consuming CPU statistics:
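// Calculate CPU usage (percentage of non-idle time) between two samples.
// Guest and GuestNice are omitted from the totals because the kernel already
// includes them in User and Nice.
func calculateCPUUsage(prev, curr *performance.CPUStats) float64 {
	totalPrev := prev.User + prev.Nice + prev.System + prev.Idle +
		prev.IOWait + prev.IRQ + prev.SoftIRQ + prev.Steal
	totalCurr := curr.User + curr.Nice + curr.System + curr.Idle +
		curr.IOWait + curr.IRQ + curr.SoftIRQ + curr.Steal
	// The counters are unsigned: return 0 instead of wrapping around when no
	// time has elapsed or a counter moved backwards.
	if totalCurr <= totalPrev || curr.Idle < prev.Idle {
		return 0
	}
	totalDelta := float64(totalCurr - totalPrev)
	idleDelta := float64(curr.Idle - prev.Idle)
	return (1.0 - idleDelta/totalDelta) * 100
}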
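Because the collector returns one entry per CPU, a consumer usually needs to pair entries from two collections before computing usage. The sketch below matches entries by CPUIndex rather than by slice position, so a CPU that appears or disappears between samples is skipped instead of being compared against the wrong core. perCPUUsage and total are hypothetical helpers, and CPUStats is redeclared locally to keep the example self-contained.

```go
package main

// CPUStats is redeclared here so the sketch compiles on its own; the real
// type is defined in pkg/performance/types.go.
type CPUStats struct {
	CPUIndex                              int32
	User, Nice, System, Idle, IOWait      uint64
	IRQ, SoftIRQ, Steal, Guest, GuestNice uint64
}

// total sums the states that make up elapsed time. Guest and GuestNice are
// left out because the kernel already counts them in User and Nice.
func total(s CPUStats) uint64 {
	return s.User + s.Nice + s.System + s.Idle +
		s.IOWait + s.IRQ + s.SoftIRQ + s.Steal
}

// perCPUUsage returns the non-idle percentage for every CPUIndex present in
// both samples. Index -1 is the aggregate entry.
func perCPUUsage(prev, curr []CPUStats) map[int32]float64 {
	before := make(map[int32]CPUStats, len(prev))
	for _, s := range prev {
		before[s.CPUIndex] = s
	}

	usage := make(map[int32]float64, len(curr))
	for _, c := range curr {
		p, ok := before[c.CPUIndex]
		if !ok {
			continue // CPU came online after the first sample
		}
		tp, tc := total(p), total(c)
		if tc <= tp || c.Idle < p.Idle {
			continue // no elapsed time, or counters moved backwards
		}
		idle := float64(c.Idle - p.Idle)
		usage[c.CPUIndex] = (1 - idle/float64(tc-tp)) * 100
	}
	return usage
}
```

Keying the result by CPUIndex lets callers look up the aggregate with index -1 and individual cores with 0 and above, mirroring the CPUStats layout.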
Performance Impact
Resource Usage
- CPU: Negligible - single file read operation
- Memory: ~2KB for typical 8-core system (scales with CPU count)
- I/O: One read of /proc/stat per collection
- Frequency: Default 1 second interval when used as continuous collector
Optimization Notes
- File read is buffered, no seek operations required
- Parsing is done in single pass with minimal allocations
- Missing CPUs are logged but do not fail the collection
- No system calls beyond file read
Related Collectors
Direct Relationships
- CPU Info Collector: Provides static CPU hardware information (model, frequency, cache)
- Load Collector: System load averages that correlate with CPU pressure
- Process Collector: Per-process CPU usage statistics
Complementary Metrics
- Memory Collector: Memory pressure can cause CPU wait states
- Disk Collector: High I/O wait often indicates disk bottlenecks
- Network Collector: Network interrupts contribute to CPU overhead
Analysis Combinations
- Combine with Load Collector for complete CPU pressure picture
- Cross-reference with Process Collector to identify CPU-intensive processes
- Use with CPU Info to understand performance relative to hardware capabilities