
Load Collector

Overview

The Load Collector is a critical component of the Antimetal System Agent that monitors system load statistics, providing real-time insights into system workload and process activity. It collects load averages, process counts, and system uptime information by reading from Linux kernel interfaces.

Why Load Monitoring Matters

  • Performance Indicators: Load averages provide a quick snapshot of system performance and resource utilization
  • Capacity Planning: Helps identify when systems are becoming overloaded and need scaling
  • Troubleshooting: High load values can indicate CPU bottlenecks, I/O wait issues, or runaway processes
  • SLA Monitoring: Essential for maintaining performance service level agreements
  • Autoscaling Triggers: Load metrics are commonly used to trigger horizontal pod autoscaling in Kubernetes

Technical Details

MetricType

  • Type: MetricTypeLoad
  • Value: "load"

Data Sources

  • Primary: /proc/loadavg - System load averages and process information
  • Secondary: /proc/uptime - System uptime information (optional, gracefully degrades if unavailable)

Collector Capabilities

  • SupportsOneShot: true
  • SupportsContinuous: false (wrapped by ContinuousPointCollector for continuous operation)
  • RequiresRoot: false
  • RequiresEBPF: false
  • MinKernelVersion: 2.6.0 (though /proc/loadavg has been available since much earlier)

Collection Mode

The Load Collector implements the PointCollector interface and is automatically wrapped as a continuous collector using PartialNewContinuousPointCollector. This means it performs point-in-time collections at regular intervals (default: 1 second).
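
The concrete interfaces live in the agent's pkg/performance package; the sketch below uses simplified, assumed shapes (the interface method, the fake collector, and the runContinuously helper are illustrative names, not the real API) to show the point-to-continuous wrapping pattern described above.

```go
// A minimal sketch of the point-to-continuous wrapping pattern.
// Names and signatures are simplified assumptions for illustration only.
package main

import (
	"context"
	"fmt"
	"time"
)

// PointCollector performs a single point-in-time collection (assumed shape).
type PointCollector interface {
	Collect(ctx context.Context) (any, error)
}

// fakeLoadCollector stands in for the real load collector.
type fakeLoadCollector struct{}

func (fakeLoadCollector) Collect(ctx context.Context) (any, error) {
	return map[string]float64{"Load1Min": 0.42}, nil
}

// runContinuously re-invokes a PointCollector on a fixed interval, which is
// roughly what wrapping with a continuous point collector achieves.
func runContinuously(ctx context.Context, c PointCollector, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			stats, err := c.Collect(ctx)
			if err != nil {
				fmt.Println("collection failed:", err)
				continue
			}
			fmt.Printf("collected: %+v\n", stats)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	runContinuously(ctx, fakeLoadCollector{}, time.Second)
}
```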

Collected Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| Load1Min | float64 | /proc/loadavg field 1 | System load average over the last 1 minute |
| Load5Min | float64 | /proc/loadavg field 2 | System load average over the last 5 minutes |
| Load15Min | float64 | /proc/loadavg field 3 | System load average over the last 15 minutes |
| RunningProcs | int32 | /proc/loadavg field 4 (numerator) | Number of currently running/runnable processes |
| TotalProcs | int32 | /proc/loadavg field 4 (denominator) | Total number of processes/threads in the system |
| LastPID | int32 | /proc/loadavg field 5 | Most recently assigned process ID |
| Uptime | time.Duration | /proc/uptime field 1 | System uptime since boot |

Understanding Load Averages

Load averages represent the average number of processes that are either:

  • Running on a CPU
  • Waiting for CPU time (runnable)
  • In uninterruptible sleep (typically waiting for I/O)

A load average of 1.0 means:

  • On a single-core system: CPU is fully utilized
  • On a 4-core system: System is 25% utilized
  • Values above the number of CPU cores indicate queuing/waiting
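
To make the per-core interpretation concrete, here is a minimal sketch that normalizes a load average by the number of logical CPUs reported by runtime.NumCPU(); the thresholds are example values, not agent behavior.

```go
// A small sketch of the per-core interpretation above: divide a load average
// by the number of logical CPUs to judge saturation. Illustrative only.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	load1 := 4.0                       // example 1-minute load average
	cores := float64(runtime.NumCPU()) // logical CPUs visible to the process

	ratio := load1 / cores
	switch {
	case ratio < 0.7:
		fmt.Printf("load %.2f on %.0f cores: headroom available\n", load1, cores)
	case ratio <= 1.0:
		fmt.Printf("load %.2f on %.0f cores: near full utilization\n", load1, cores)
	default:
		fmt.Printf("load %.2f on %.0f cores: processes are queuing\n", load1, cores)
	}
}
```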

Data Structure

The collector returns a LoadStats struct defined in pkg/performance/types.go:

type LoadStats struct {
    // Load averages from /proc/loadavg (1st, 2nd, 3rd fields)
    Load1Min  float64
    Load5Min  float64
    Load15Min float64
    // Running/total processes from /proc/loadavg (4th field, e.g., "2/1234")
    RunningProcs int32
    TotalProcs   int32
    // Last PID from /proc/loadavg (5th field)
    LastPID int32
    // System uptime from /proc/uptime (1st field in seconds)
    Uptime time.Duration
}

Source code: pkg/performance/collectors/load.go

Configuration

The Load Collector is enabled by default in the performance monitoring system. Configuration is managed through the CollectionConfig:

Environment Variables

  • HOST_PROC: Path to the proc filesystem (default: /proc)
    • In containerized environments, typically mounted from host at /host/proc
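
As an illustration, here is a minimal sketch of resolving proc file paths from HOST_PROC with a /proc fallback; the procPath helper is hypothetical and not necessarily how the agent implements this.

```go
// A hedged sketch of resolving a proc path from HOST_PROC, defaulting to
// /proc when the variable is unset. The helper name is hypothetical.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// procPath joins the configured proc root with a relative file name.
func procPath(file string) string {
	root := os.Getenv("HOST_PROC")
	if root == "" {
		root = "/proc" // default when not reading a mounted host proc
	}
	return filepath.Join(root, file)
}

func main() {
	fmt.Println(procPath("loadavg")) // e.g. /host/proc/loadavg in a container
	fmt.Println(procPath("uptime"))
}
```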

Collection Interval

  • Default: 1 second
  • Configurable via performance manager settings

Enable/Disable

config := performance.CollectionConfig{
    EnabledCollectors: map[performance.MetricType]bool{
        performance.MetricTypeLoad: true,  // or false to disable
    },
}

Platform Considerations

Linux Kernel Requirements

  • Minimum Version: 2.6.0 (though /proc/loadavg predates this significantly)
  • Required Files:
    • /proc/loadavg (critical - collector fails without this)
    • /proc/uptime (optional - collector continues without uptime data)

Container Considerations

When running in containers (Docker, Kubernetes):

  1. Proc Filesystem Access: The container must mount the host's /proc filesystem:

    volumeMounts:
    - name: proc
      mountPath: /host/proc
      readOnly: true
  2. Environment Variables: Set HOST_PROC=/host/proc to point to the mounted location

  3. Uptime Behavior:

    • Container uptime may differ from host uptime
    • Some container runtimes may not provide /proc/uptime
    • The collector gracefully handles missing uptime data

File Format References

/proc/loadavg format:

0.50 1.25 2.75 2/1234 12345
  • Fields: load1 load5 load15 running/total last_pid

/proc/uptime format:

1234.56 5678.90
  • Fields: uptime_seconds idle_seconds
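
For reference, here is a hedged sketch of parsing these two formats into the LoadStats fields shown earlier, with uptime treated as optional to mirror the graceful degradation described above. This is illustrative only, not the collector's actual implementation in pkg/performance/collectors/load.go.

```go
// A hedged sketch of parsing /proc/loadavg and /proc/uptime into the
// LoadStats fields shown earlier. Uptime is optional; parsing continues
// without it, mirroring the behavior described on this page.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

type LoadStats struct {
	Load1Min, Load5Min, Load15Min float64
	RunningProcs, TotalProcs      int32
	LastPID                       int32
	Uptime                        time.Duration
}

func parseLoad(loadavgPath, uptimePath string) (*LoadStats, error) {
	raw, err := os.ReadFile(loadavgPath)
	if err != nil {
		return nil, fmt.Errorf("failed to read %s: %w", loadavgPath, err)
	}
	// Expected: "0.50 1.25 2.75 2/1234 12345"
	fields := strings.Fields(string(raw))
	if len(fields) < 5 {
		return nil, fmt.Errorf("unexpected loadavg format: %q", raw)
	}
	var s LoadStats
	s.Load1Min, _ = strconv.ParseFloat(fields[0], 64)
	s.Load5Min, _ = strconv.ParseFloat(fields[1], 64)
	s.Load15Min, _ = strconv.ParseFloat(fields[2], 64)
	if running, total, ok := strings.Cut(fields[3], "/"); ok {
		r, _ := strconv.ParseInt(running, 10, 32)
		t, _ := strconv.ParseInt(total, 10, 32)
		s.RunningProcs, s.TotalProcs = int32(r), int32(t)
	}
	pid, _ := strconv.ParseInt(fields[4], 10, 32)
	s.LastPID = int32(pid)

	// Uptime is optional: keep going if /proc/uptime is unavailable.
	if raw, err := os.ReadFile(uptimePath); err == nil {
		// Expected: "1234.56 5678.90" (uptime_seconds idle_seconds)
		if f := strings.Fields(string(raw)); len(f) >= 1 {
			secs, _ := strconv.ParseFloat(f[0], 64)
			s.Uptime = time.Duration(secs * float64(time.Second))
		}
	}
	return &s, nil
}

func main() {
	stats, err := parseLoad("/proc/loadavg", "/proc/uptime")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", stats)
}
```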

Common Issues

1. Missing /proc/loadavg

Error: failed to read /proc/loadavg: no such file or directory

Causes:

  • Running on non-Linux system
  • Incorrect HOST_PROC path
  • Container missing proc mount

Solution: Ensure proper proc filesystem mounting and HOST_PROC configuration

2. High Load Values

Symptom: Load averages consistently above CPU core count

Common Causes:

  • CPU-bound processes
  • I/O wait (check with iostat)
  • Too many processes competing for resources
  • Memory pressure causing swapping

Investigation:

# Check CPU usage and wait states
top -b -n 1

# Check I/O wait
iostat -x 1

# Find high-load processes
ps aux | sort -nrk 3,3 | head -10

3. Process Count Anomalies

Symptom: Very high TotalProcs values

Causes:

  • Thread leaks in applications
  • Zombie processes not being reaped
  • Fork bombs or runaway process creation

Investigation:

# Check for zombies
ps aux | grep -E "Z|<defunct>"

# Thread count by process
ps -eLf | awk '{print $2}' | sort | uniq -c | sort -nr | head

Examples

Sample Output

Normal system load:

{
  "Load1Min": 0.75,
  "Load5Min": 1.23,
  "Load15Min": 1.45,
  "RunningProcs": 2,
  "TotalProcs": 523,
  "LastPID": 28934,
  "Uptime": "72h15m30s"
}

High load scenario:

{
  "Load1Min": 15.82,
  "Load5Min": 12.45,
  "Load15Min": 8.93,
  "RunningProcs": 18,
  "TotalProcs": 1847,
  "LastPID": 65432,
  "Uptime": "5h22m18s"
}

Alerting Rules

groups:
- name: load_alerts
  rules:
  - alert: HighSystemLoad
    expr: antimetal_load_5min / antimetal_cpu_cores > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High system load on {{ $labels.node }}"
      description: "5-minute load average is {{ $value }} times the number of CPU cores"

  - alert: SystemOverloaded
    expr: antimetal_load_1min / antimetal_cpu_cores > 4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "System overloaded on {{ $labels.node }}"
      description: "1-minute load average is {{ $value }} times the number of CPU cores"

Performance Impact

The Load Collector has minimal performance overhead:

  • CPU Usage: Negligible - only reads two small files
  • Memory Usage: ~1KB per collection (small struct)
  • I/O Operations: 2 file reads per collection interval
  • Collection Time: Typically < 1ms

Benchmarks

Typical collection times on various systems:

  • Modern server (NVMe): ~0.1ms
  • Cloud VM (SSD): ~0.2ms
  • Container with mounted proc: ~0.3ms
  • Older hardware (HDD): ~0.5ms
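
These figures will vary by system. A rough way to reproduce them on your own hardware is a Go benchmark that reads the same two files each iteration, sketched below; this is not the project's benchmark suite, just an approximation of the per-collection I/O cost.

```go
// Approximates the per-collection cost by reading the two source files once
// per iteration. Run with: go test -bench=LoadRead
package collectors_test

import (
	"os"
	"testing"
)

func BenchmarkLoadRead(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := os.ReadFile("/proc/loadavg"); err != nil {
			b.Fatal(err)
		}
		// Uptime is optional in the collector, but included here to match
		// the two-file cost described above.
		_, _ = os.ReadFile("/proc/uptime")
	}
}
```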

Related Collectors

The Load Collector works in conjunction with other collectors to provide comprehensive system monitoring:

  • CPU Collector: Provides CPU core count for load ratio calculations and detailed CPU usage statistics
  • Memory Collector: Memory pressure can cause swapping, increasing load
  • Process Collector: Detailed per-process information to identify high-load culprits
  • Network Collector: Network I/O can contribute to system load

Load Analysis Best Practices

  1. Always consider CPU count: A load of 4.0 is normal on an 8-core system but critical on a 2-core system
  2. Watch trends: The three averages (1/5/15 min) show if load is increasing, decreasing, or stable (see the sketch after this list)
  3. Correlate metrics: High load with low CPU usage often indicates I/O wait
  4. Set appropriate thresholds: Alert thresholds should be based on CPU cores, not absolute values
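
As a small illustration of practice #2, the sketch below compares the three averages to classify whether load is rising, falling, or steady; the margin is an arbitrary example value, not agent behavior.

```go
// Compares the 1-, 5-, and 15-minute averages to estimate the load trend.
// A newer average meaningfully above an older one suggests rising load.
package main

import "fmt"

func trend(load1, load5, load15 float64) string {
	const margin = 0.1 // ignore small fluctuations (arbitrary example value)
	switch {
	case load1 > load5+margin && load5 > load15+margin:
		return "rising"
	case load1 < load5-margin && load5 < load15-margin:
		return "falling"
	default:
		return "steady"
	}
}

func main() {
	fmt.Println(trend(15.82, 12.45, 8.93)) // rising (matches the high-load example above)
	fmt.Println(trend(0.75, 1.23, 1.45))   // falling
}
```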
