# Load Collector
The Load Collector is a critical component of the Antimetal System Agent that monitors system load statistics, providing real-time insights into system workload and process activity. It collects load averages, process counts, and system uptime information by reading from Linux kernel interfaces.
- Performance Indicators: Load averages provide a quick snapshot of system performance and resource utilization
- Capacity Planning: Helps identify when systems are becoming overloaded and need scaling
- Troubleshooting: High load values can indicate CPU bottlenecks, I/O wait issues, or runaway processes
- SLA Monitoring: Essential for maintaining performance service level agreements
- Autoscaling Triggers: Load metrics are commonly used to trigger horizontal pod autoscaling in Kubernetes
- **Type**: `MetricTypeLoad`
- **Value**: `"load"`
- **Primary**: `/proc/loadavg` - System load averages and process information
- **Secondary**: `/proc/uptime` - System uptime information (optional, gracefully degrades if unavailable)
- **SupportsOneShot**: `true`
- **SupportsContinuous**: `false` (wrapped by `ContinuousPointCollector` for continuous operation)
- **RequiresRoot**: `false`
- **RequiresEBPF**: `false`
- **MinKernelVersion**: `2.6.0` (though `/proc/loadavg` has been available since much earlier)
The Load Collector implements the `PointCollector` interface and is automatically wrapped as a continuous collector using `PartialNewContinuousPointCollector`. This means it performs point-in-time collections at regular intervals (default: 1 second).
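To make the point-vs-continuous split concrete, here is a minimal sketch of how a point collector can be wrapped for interval-based collection. The `PointCollector` method set, the `wrapContinuous` helper, and the `loadOnce` stub are simplified assumptions for illustration, not the agent's actual API:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// PointCollector is a simplified stand-in for the agent's interface:
// one point-in-time collection per call.
type PointCollector interface {
	Collect(ctx context.Context) (any, error)
}

// loadOnce is a dummy point collector standing in for the Load Collector.
type loadOnce struct{}

func (loadOnce) Collect(ctx context.Context) (any, error) {
	return "load sample", nil // a real collector would parse /proc/loadavg here
}

// wrapContinuous turns a PointCollector into a continuous stream by
// collecting on a fixed interval (the agent's default is 1 second).
func wrapContinuous(ctx context.Context, pc PointCollector, interval time.Duration) <-chan any {
	out := make(chan any)
	go func() {
		defer close(out)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if v, err := pc.Collect(ctx); err == nil {
					out <- v
				}
			}
		}
	}()
	return out
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	for sample := range wrapContinuous(ctx, loadOnce{}, time.Second) {
		fmt.Println(sample)
	}
}
```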
| Metric | Type | Source | Description |
|---|---|---|---|
| `Load1Min` | `float64` | `/proc/loadavg` field 1 | System load average over the last 1 minute |
| `Load5Min` | `float64` | `/proc/loadavg` field 2 | System load average over the last 5 minutes |
| `Load15Min` | `float64` | `/proc/loadavg` field 3 | System load average over the last 15 minutes |
| `RunningProcs` | `int32` | `/proc/loadavg` field 4 (numerator) | Number of currently running/runnable processes |
| `TotalProcs` | `int32` | `/proc/loadavg` field 4 (denominator) | Total number of processes/threads in the system |
| `LastPID` | `int32` | `/proc/loadavg` field 5 | Most recently assigned process ID |
| `Uptime` | `time.Duration` | `/proc/uptime` field 1 | System uptime since boot |
Load averages represent the average number of processes that are either:
- Running on a CPU
- Waiting for CPU time (runnable)
- In uninterruptible sleep (typically waiting for I/O)
A load average of 1.0 means:
- On a single-core system: the CPU is fully utilized
- On a 4-core system: the system is roughly 25% utilized

Values above the number of CPU cores indicate queuing/waiting, as the sketch below illustrates.
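Because raw load values are only meaningful relative to core count, a common first step is to normalize them. A minimal sketch (the severity thresholds here are illustrative, not agent defaults):

```go
package main

import (
	"fmt"
	"runtime"
)

// classifyLoad normalizes a load average by the CPU core count and maps
// the ratio to a rough severity. Thresholds are illustrative only.
func classifyLoad(load float64, cores int) string {
	ratio := load / float64(cores)
	switch {
	case ratio < 0.7:
		return "healthy"
	case ratio < 1.0:
		return "busy"
	default:
		return "saturated: processes are queuing"
	}
}

func main() {
	cores := runtime.NumCPU()
	for _, load := range []float64{0.75, 3.9, 9.2} {
		fmt.Printf("load %.2f on %d cores -> %s\n", load, cores, classifyLoad(load, cores))
	}
}
```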
The collector returns a `LoadStats` struct defined in `pkg/performance/types.go`:
```go
type LoadStats struct {
    // Load averages from /proc/loadavg (1st, 2nd, 3rd fields)
    Load1Min  float64
    Load5Min  float64
    Load15Min float64

    // Running/total processes from /proc/loadavg (4th field, e.g., "2/1234")
    RunningProcs int32
    TotalProcs   int32

    // Last PID from /proc/loadavg (5th field)
    LastPID int32

    // System uptime from /proc/uptime (1st field in seconds)
    Uptime time.Duration
}
```
Source code: `pkg/performance/collectors/load.go`
The Load Collector is enabled by default in the performance monitoring system. Configuration is managed through the `CollectionConfig`:

- **`HOST_PROC`**: Path to the proc filesystem (default: `/proc`). In containerized environments, this is typically mounted from the host at `/host/proc`.
- **Collection interval**: Default: 1 second; configurable via performance manager settings.
```go
config := performance.CollectionConfig{
    EnabledCollectors: map[performance.MetricType]bool{
        performance.MetricTypeLoad: true, // or false to disable
    },
}
```
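The `HOST_PROC` override can be honored with a small path helper along these lines; `procPath` is a hypothetical name used only to make the documented behavior concrete:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// procPath resolves a proc file against HOST_PROC, falling back to /proc.
// This mirrors the documented behavior; the helper itself is hypothetical.
func procPath(file string) string {
	root := os.Getenv("HOST_PROC")
	if root == "" {
		root = "/proc"
	}
	return filepath.Join(root, file)
}

func main() {
	fmt.Println(procPath("loadavg")) // /proc/loadavg, or /host/proc/loadavg in a container
}
```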
- **Minimum Version**: 2.6.0 (though `/proc/loadavg` predates this significantly)
- **Required Files**:
  - `/proc/loadavg` (critical - collector fails without this)
  - `/proc/uptime` (optional - collector continues without uptime data)
When running in containers (Docker, Kubernetes):

- **Proc Filesystem Access**: The container must mount the host's `/proc` filesystem:

  ```yaml
  volumeMounts:
    - name: proc
      mountPath: /host/proc
      readOnly: true
  ```

- **Environment Variables**: Set `HOST_PROC=/host/proc` to point to the mounted location.
- **Uptime Behavior**:
  - Container uptime may differ from host uptime
  - Some container runtimes may not provide `/proc/uptime`
  - The collector gracefully handles missing uptime data
`/proc/loadavg` format:

```
0.50 1.25 2.75 2/1234 12345
```

Fields: `load1 load5 load15 running/total last_pid`

`/proc/uptime` format:

```
1234.56 5678.90
```

Fields: `uptime_seconds idle_seconds`
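To make these formats concrete, here is a minimal, self-contained sketch of parsing both files, including the graceful degradation when `/proc/uptime` is unavailable. It follows the field layout above but is not the agent's actual parser:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// LoadSample mirrors the fields of /proc/loadavg and /proc/uptime.
type LoadSample struct {
	Load1Min, Load5Min, Load15Min float64
	RunningProcs, TotalProcs      int32
	LastPID                       int32
	Uptime                        time.Duration
}

func readLoad(procRoot string) (*LoadSample, error) {
	data, err := os.ReadFile(procRoot + "/loadavg")
	if err != nil {
		// Missing loadavg is fatal, matching the documented behavior.
		return nil, fmt.Errorf("failed to read %s/loadavg: %w", procRoot, err)
	}
	fields := strings.Fields(string(data)) // e.g. "0.50 1.25 2.75 2/1234 12345"
	if len(fields) < 5 {
		return nil, fmt.Errorf("unexpected loadavg format: %q", data)
	}

	var s LoadSample
	s.Load1Min, _ = strconv.ParseFloat(fields[0], 64)
	s.Load5Min, _ = strconv.ParseFloat(fields[1], 64)
	s.Load15Min, _ = strconv.ParseFloat(fields[2], 64)

	// Field 4 is "running/total", e.g. "2/1234".
	if parts := strings.SplitN(fields[3], "/", 2); len(parts) == 2 {
		running, _ := strconv.ParseInt(parts[0], 10, 32)
		total, _ := strconv.ParseInt(parts[1], 10, 32)
		s.RunningProcs, s.TotalProcs = int32(running), int32(total)
	}
	lastPID, _ := strconv.ParseInt(fields[4], 10, 32)
	s.LastPID = int32(lastPID)

	// /proc/uptime is optional: ignore errors and leave Uptime at zero.
	if up, err := os.ReadFile(procRoot + "/uptime"); err == nil {
		if f := strings.Fields(string(up)); len(f) > 0 {
			secs, _ := strconv.ParseFloat(f[0], 64)
			s.Uptime = time.Duration(secs * float64(time.Second))
		}
	}
	return &s, nil
}

func main() {
	sample, err := readLoad("/proc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", *sample)
}
```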
**Error**: `failed to read /proc/loadavg: no such file or directory`

Causes:
- Running on a non-Linux system
- Incorrect `HOST_PROC` path
- Container missing the proc mount

**Solution**: Ensure proper proc filesystem mounting and `HOST_PROC` configuration.
**Symptom**: Load averages consistently above the CPU core count

Common causes:
- CPU-bound processes
- I/O wait (check with `iostat`)
- Too many processes competing for resources
- Memory pressure causing swapping

Investigation:

```bash
# Check CPU usage and wait states
top -b -n 1

# Check I/O wait
iostat -x 1

# Find high-load processes
ps aux | sort -nrk 3,3 | head -10
```
**Symptom**: Very high `TotalProcs` values

Causes:
- Thread leaks in applications
- Zombie processes not being reaped
- Fork bombs or runaway process creation

Investigation:

```bash
# Check for zombies
ps aux | grep -E "Z|<defunct>"

# Thread count by process
ps -eLf | awk '{print $2}' | sort | uniq -c | sort -nr | head
```
Normal system load:

```json
{
  "Load1Min": 0.75,
  "Load5Min": 1.23,
  "Load15Min": 1.45,
  "RunningProcs": 2,
  "TotalProcs": 523,
  "LastPID": 28934,
  "Uptime": "72h15m30s"
}
```

High load scenario:

```json
{
  "Load1Min": 15.82,
  "Load5Min": 12.45,
  "Load15Min": 8.93,
  "RunningProcs": 18,
  "TotalProcs": 1847,
  "LastPID": 65432,
  "Uptime": "5h22m18s"
}
```
Example Prometheus alerting rules based on these load metrics, with thresholds normalized by CPU core count:

```yaml
groups:
  - name: load_alerts
    rules:
      - alert: HighSystemLoad
        expr: antimetal_load_5min / antimetal_cpu_cores > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.node }}"
          description: "5-minute load average is {{ $value }} times the number of CPU cores"
      - alert: SystemOverloaded
        expr: antimetal_load_1min / antimetal_cpu_cores > 4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "System overloaded on {{ $labels.node }}"
          description: "1-minute load average is {{ $value }} times the number of CPU cores"
```
The Load Collector has minimal performance overhead:
- CPU Usage: Negligible - only reads two small files
- Memory Usage: ~1KB per collection (small struct)
- I/O Operations: 2 file reads per collection interval
- Collection Time: Typically < 1ms
Typical collection times on various systems:
- Modern server (NVMe): ~0.1ms
- Cloud VM (SSD): ~0.2ms
- Container with mounted proc: ~0.3ms
- Older hardware (HDD): ~0.5ms
The Load Collector works in conjunction with other collectors to provide comprehensive system monitoring:
- CPU Collector: Provides CPU core count for load ratio calculations and detailed CPU usage statistics
- Memory Collector: Memory pressure can cause swapping, increasing load
- Process Collector: Detailed per-process information to identify high-load culprits
- Network Collector: Network I/O can contribute to system load
- Always consider CPU count: A load of 4.0 is normal on an 8-core system but critical on a 2-core system
- Watch trends: The three averages (1/5/15 min) show whether load is increasing, decreasing, or stable (see the sketch after this list)
- Correlate metrics: High load with low CPU usage often indicates I/O wait
- Set appropriate thresholds: Alert thresholds should be based on CPU cores, not absolute values
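As a rough illustration of trend-watching, the 1-, 5-, and 15-minute averages can be compared directly: shorter windows above longer ones mean load is building. The comparison rule below is illustrative, not taken from the agent:

```go
package main

import "fmt"

// loadTrend compares the 1-, 5-, and 15-minute averages: when the shorter
// windows exceed the longer ones, load is rising; the reverse means it is
// tapering off. Anything else is treated as stable or mixed.
func loadTrend(load1, load5, load15 float64) string {
	switch {
	case load1 > load5 && load5 > load15:
		return "rising"
	case load1 < load5 && load5 < load15:
		return "falling"
	default:
		return "stable or mixed"
	}
}

func main() {
	fmt.Println(loadTrend(15.82, 12.45, 8.93)) // rising (the high-load example above)
	fmt.Println(loadTrend(0.75, 1.23, 1.45))   // falling (the normal-load example above)
}
```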