
Load Collector

Overview

The Load Collector is a critical component of the Antimetal System Agent that monitors system load statistics, providing real-time insights into system workload and process activity. It collects load averages, process counts, and system uptime information by reading from Linux kernel interfaces.

Why Load Monitoring Matters

  • Performance Indicators: Load averages provide a quick snapshot of system performance and resource utilization
  • Capacity Planning: Helps identify when systems are becoming overloaded and need scaling
  • Troubleshooting: High load values can indicate CPU bottlenecks, I/O wait issues, or runaway processes
  • SLA Monitoring: Essential for maintaining performance service level agreements
  • Autoscaling Triggers: Load metrics are commonly used to trigger horizontal pod autoscaling in Kubernetes

Technical Details

MetricType

  • Type: MetricTypeLoad
  • Value: "load"

Data Sources

  • Primary: /proc/loadavg - System load averages and process information
  • Secondary: /proc/uptime - System uptime information (optional, gracefully degrades if unavailable)

Collector Capabilities

  • SupportsOneShot: true
  • SupportsContinuous: false (wrapped by ContinuousPointCollector for continuous operation)
  • RequiresRoot: false
  • RequiresEBPF: false
  • MinKernelVersion: 2.6.0 (though /proc/loadavg has been available since much earlier)

Collection Mode

The Load Collector implements the PointCollector interface and is automatically wrapped as a continuous collector using PartialNewContinuousPointCollector. This means it performs point-in-time collections at regular intervals (default: 1 second).
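
The concrete interfaces live in the agent's pkg/performance package; the sketch below uses simplified, assumed shapes (the interface method, the fake collector, and the runContinuously helper are illustrative names, not the real API) to show the point-to-continuous wrapping pattern described above.

```go
// A minimal sketch of the point-to-continuous wrapping pattern.
// Names and signatures are simplified assumptions for illustration only.
package main

import (
	"context"
	"fmt"
	"time"
)

// PointCollector performs a single point-in-time collection (assumed shape).
type PointCollector interface {
	Collect(ctx context.Context) (any, error)
}

// fakeLoadCollector stands in for the real load collector.
type fakeLoadCollector struct{}

func (fakeLoadCollector) Collect(ctx context.Context) (any, error) {
	return map[string]float64{"Load1Min": 0.42}, nil
}

// runContinuously re-invokes a PointCollector on a fixed interval, which is
// roughly what wrapping with a continuous point collector achieves.
func runContinuously(ctx context.Context, c PointCollector, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			stats, err := c.Collect(ctx)
			if err != nil {
				fmt.Println("collection failed:", err)
				continue
			}
			fmt.Printf("collected: %+v\n", stats)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	runContinuously(ctx, fakeLoadCollector{}, time.Second)
}
```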

Collected Metrics

| Metric | Type | Source | Description |
|--------|------|--------|-------------|
| Load1Min | float64 | /proc/loadavg field 1 | System load average over the last 1 minute |
| Load5Min | float64 | /proc/loadavg field 2 | System load average over the last 5 minutes |
| Load15Min | float64 | /proc/loadavg field 3 | System load average over the last 15 minutes |
| RunningProcs | int32 | /proc/loadavg field 4 (numerator) | Number of currently running/runnable processes |
| TotalProcs | int32 | /proc/loadavg field 4 (denominator) | Total number of processes/threads in the system |
| LastPID | int32 | /proc/loadavg field 5 | Most recently assigned process ID |
| Uptime | time.Duration | /proc/uptime field 1 | System uptime since boot |

Understanding Load Averages

Load averages represent the average number of processes that are either:

  • Running on a CPU
  • Waiting for CPU time (runnable)
  • In uninterruptible sleep (typically waiting for I/O)

A load average of 1.0 means:

  • On a single-core system: CPU is fully utilized
  • On a 4-core system: System is 25% utilized
  • Values above the number of CPU cores indicate queuing/waiting
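
To make the per-core interpretation concrete, here is a minimal sketch that normalizes a load average by the number of logical CPUs reported by runtime.NumCPU(); the thresholds are example values, not agent behavior.

```go
// A small sketch of the per-core interpretation above: divide a load average
// by the number of logical CPUs to judge saturation. Illustrative only.
package main

import (
	"fmt"
	"runtime"
)

func main() {
	load1 := 4.0                       // example 1-minute load average
	cores := float64(runtime.NumCPU()) // logical CPUs visible to the process

	ratio := load1 / cores
	switch {
	case ratio < 0.7:
		fmt.Printf("load %.2f on %.0f cores: headroom available\n", load1, cores)
	case ratio <= 1.0:
		fmt.Printf("load %.2f on %.0f cores: near full utilization\n", load1, cores)
	default:
		fmt.Printf("load %.2f on %.0f cores: processes are queuing\n", load1, cores)
	}
}
```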

Data Structure

The collector returns a LoadStats struct defined in pkg/performance/types.go:

type LoadStats struct {
    // Load averages from /proc/loadavg (1st, 2nd, 3rd fields)
    Load1Min  float64
    Load5Min  float64
    Load15Min float64
    // Running/total processes from /proc/loadavg (4th field, e.g., "2/1234")
    RunningProcs int32
    TotalProcs   int32
    // Last PID from /proc/loadavg (5th field)
    LastPID int32
    // System uptime from /proc/uptime (1st field in seconds)
    Uptime time.Duration
}

Source code: pkg/performance/collectors/load.go

Configuration

The Load Collector is enabled by default in the performance monitoring system. Configuration is managed through the CollectionConfig:

Environment Variables

  • HOST_PROC: Path to the proc filesystem (default: /proc)
    • In containerized environments, typically mounted from host at /host/proc
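
As an illustration, here is a minimal sketch of resolving proc file paths from HOST_PROC with a /proc fallback; the procPath helper is hypothetical and not necessarily how the agent implements this.

```go
// A hedged sketch of resolving a proc path from HOST_PROC, defaulting to
// /proc when the variable is unset. The helper name is hypothetical.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// procPath joins the configured proc root with a relative file name.
func procPath(file string) string {
	root := os.Getenv("HOST_PROC")
	if root == "" {
		root = "/proc" // default when not reading a mounted host proc
	}
	return filepath.Join(root, file)
}

func main() {
	fmt.Println(procPath("loadavg")) // e.g. /host/proc/loadavg in a container
	fmt.Println(procPath("uptime"))
}
```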

Collection Interval

  • Default: 1 second
  • Configurable via performance manager settings

Enable/Disable

config := performance.CollectionConfig{
    EnabledCollectors: map[performance.MetricType]bool{
        performance.MetricTypeLoad: true,  // or false to disable
    },
}

Platform Considerations

Linux Kernel Requirements

  • Minimum Version: 2.6.0 (though /proc/loadavg predates this significantly)
  • Required Files:
    • /proc/loadavg (critical - collector fails without this)
    • /proc/uptime (optional - collector continues without uptime data)

Container Considerations

When running in containers (Docker, Kubernetes):

  1. Proc Filesystem Access: The container must mount the host's /proc filesystem:

    volumeMounts:
    - name: proc
      mountPath: /host/proc
      readOnly: true
  2. Environment Variables: Set HOST_PROC=/host/proc to point to the mounted location

  3. Uptime Behavior:

    • Container uptime may differ from host uptime
    • Some container runtimes may not provide /proc/uptime
    • The collector gracefully handles missing uptime data

File Format References

/proc/loadavg format:

0.50 1.25 2.75 2/1234 12345
  • Fields: load1 load5 load15 running/total last_pid

/proc/uptime format:

1234.56 5678.90
  • Fields: uptime_seconds idle_seconds
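
For reference, here is a hedged sketch of parsing these two formats into the LoadStats fields shown earlier, with uptime treated as optional to mirror the graceful degradation described above. This is illustrative only, not the collector's actual implementation in pkg/performance/collectors/load.go.

```go
// A hedged sketch of parsing /proc/loadavg and /proc/uptime into the
// LoadStats fields shown earlier. Uptime is optional; parsing continues
// without it, mirroring the behavior described on this page.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

type LoadStats struct {
	Load1Min, Load5Min, Load15Min float64
	RunningProcs, TotalProcs      int32
	LastPID                       int32
	Uptime                        time.Duration
}

func parseLoad(loadavgPath, uptimePath string) (*LoadStats, error) {
	raw, err := os.ReadFile(loadavgPath)
	if err != nil {
		return nil, fmt.Errorf("failed to read %s: %w", loadavgPath, err)
	}
	// Expected: "0.50 1.25 2.75 2/1234 12345"
	fields := strings.Fields(string(raw))
	if len(fields) < 5 {
		return nil, fmt.Errorf("unexpected loadavg format: %q", raw)
	}
	var s LoadStats
	s.Load1Min, _ = strconv.ParseFloat(fields[0], 64)
	s.Load5Min, _ = strconv.ParseFloat(fields[1], 64)
	s.Load15Min, _ = strconv.ParseFloat(fields[2], 64)
	if running, total, ok := strings.Cut(fields[3], "/"); ok {
		r, _ := strconv.ParseInt(running, 10, 32)
		t, _ := strconv.ParseInt(total, 10, 32)
		s.RunningProcs, s.TotalProcs = int32(r), int32(t)
	}
	pid, _ := strconv.ParseInt(fields[4], 10, 32)
	s.LastPID = int32(pid)

	// Uptime is optional: keep going if /proc/uptime is unavailable.
	if raw, err := os.ReadFile(uptimePath); err == nil {
		// Expected: "1234.56 5678.90" (uptime_seconds idle_seconds)
		if f := strings.Fields(string(raw)); len(f) >= 1 {
			secs, _ := strconv.ParseFloat(f[0], 64)
			s.Uptime = time.Duration(secs * float64(time.Second))
		}
	}
	return &s, nil
}

func main() {
	stats, err := parseLoad("/proc/loadavg", "/proc/uptime")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%+v\n", stats)
}
```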

Common Issues

1. Missing /proc/loadavg

Error: failed to read /proc/loadavg: no such file or directory

Causes:

  • Running on non-Linux system
  • Incorrect HOST_PROC path
  • Container missing proc mount

Solution: Ensure proper proc filesystem mounting and HOST_PROC configuration

2. High Load Values

Symptom: Load averages consistently above CPU core count

Common Causes:

  • CPU-bound processes
  • I/O wait (check with iostat)
  • Too many processes competing for resources
  • Memory pressure causing swapping

Investigation:

# Check CPU usage and wait states
top -b -n 1

# Check I/O wait
iostat -x 1

# Find high-load processes
ps aux | sort -nrk 3,3 | head -10

3. Process Count Anomalies

Symptom: Very high TotalProcs values

Causes:

  • Thread leaks in applications
  • Zombie processes not being reaped
  • Fork bombs or runaway process creation

Investigation:

# Check for zombies
ps aux | grep -E "Z|<defunct>"

# Thread count by process
ps -eLf | awk '{print $2}' | sort | uniq -c | sort -nr | head

Examples

Sample Output

Normal system load:

{
  "Load1Min": 0.75,
  "Load5Min": 1.23,
  "Load15Min": 1.45,
  "RunningProcs": 2,
  "TotalProcs": 523,
  "LastPID": 28934,
  "Uptime": "72h15m30s"
}

High load scenario:

{
  "Load1Min": 15.82,
  "Load5Min": 12.45,
  "Load15Min": 8.93,
  "RunningProcs": 18,
  "TotalProcs": 1847,
  "LastPID": 65432,
  "Uptime": "5h22m18s"
}

Alerting Rules

groups:
- name: load_alerts
  rules:
  - alert: HighSystemLoad
    expr: antimetal_load_5min / antimetal_cpu_cores > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High system load on {{ $labels.node }}"
      description: "5-minute load average is {{ $value }} times the number of CPU cores"

  - alert: SystemOverloaded
    expr: antimetal_load_1min / antimetal_cpu_cores > 4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "System overloaded on {{ $labels.node }}"
      description: "1-minute load average is {{ $value }} times the number of CPU cores"

Performance Impact

The Load Collector has minimal performance overhead:

  • CPU Usage: Negligible - only reads two small files
  • Memory Usage: ~1KB per collection (small struct)
  • I/O Operations: 2 file reads per collection interval
  • Collection Time: Typically < 1ms

Benchmarks

Typical collection times on various systems:

  • Modern server (NVMe): ~0.1ms
  • Cloud VM (SSD): ~0.2ms
  • Container with mounted proc: ~0.3ms
  • Older hardware (HDD): ~0.5ms
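
These figures will vary by system. A rough way to reproduce them on your own hardware is a Go benchmark that reads the same two files each iteration, sketched below; this is not the project's benchmark suite, just an approximation of the per-collection I/O cost.

```go
// Approximates the per-collection cost by reading the two source files once
// per iteration. Run with: go test -bench=LoadRead
package collectors_test

import (
	"os"
	"testing"
)

func BenchmarkLoadRead(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := os.ReadFile("/proc/loadavg"); err != nil {
			b.Fatal(err)
		}
		// Uptime is optional in the collector, but included here to match
		// the two-file cost described above.
		_, _ = os.ReadFile("/proc/uptime")
	}
}
```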

Related Collectors

The Load Collector works in conjunction with other collectors to provide comprehensive system monitoring:

  • CPU Collector: Provides CPU core count for load ratio calculations and detailed CPU usage statistics
  • Memory Collector: Memory pressure can cause swapping, increasing load
  • Process Collector: Detailed per-process information to identify high-load culprits
  • Network Collector: Network I/O can contribute to system load

Load Analysis Best Practices

  1. Always consider CPU count: A load of 4.0 is normal on an 8-core system but critical on a 2-core system
  2. Watch trends: The three averages (1/5/15 min) show if load is increasing, decreasing, or stable (see the sketch after this list)
  3. Correlate metrics: High load with low CPU usage often indicates I/O wait
  4. Set appropriate thresholds: Alert thresholds should be based on CPU cores, not absolute values
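
As a small illustration of practice #2, the sketch below compares the three averages to classify whether load is rising, falling, or steady; the margin is an arbitrary example value, not agent behavior.

```go
// Compares the 1-, 5-, and 15-minute averages to estimate the load trend.
// A newer average meaningfully above an older one suggests rising load.
package main

import "fmt"

func trend(load1, load5, load15 float64) string {
	const margin = 0.1 // ignore small fluctuations (arbitrary example value)
	switch {
	case load1 > load5+margin && load5 > load15+margin:
		return "rising"
	case load1 < load5-margin && load5 < load15-margin:
		return "falling"
	default:
		return "steady"
	}
}

func main() {
	fmt.Println(trend(15.82, 12.45, 8.93)) // rising (matches the high-load example above)
	fmt.Println(trend(0.75, 1.23, 1.45))   // falling
}
```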
