Process Collector

The Process Collector monitors per-process statistics in the Antimetal System Agent, tracking the top N processes by CPU usage along with their memory consumption, file descriptor usage, context switches, and other vital metrics. This collector is essential for identifying resource-intensive processes and understanding system workload distribution.

Overview

The Process Collector provides real-time visibility into process-level resource consumption by:

  • Tracking CPU usage - Monitors CPU time and calculates percentage utilization over time
  • Memory profiling - Reports RSS, VSZ, PSS, and USS memory metrics
  • Process relationships - Captures parent-child relationships and process groups
  • Resource consumption - Tracks file descriptors, threads, and context switches
  • Performance analysis - Identifies top resource consumers for optimization

This collector is particularly important for:

  • Detecting runaway processes consuming excessive CPU or memory
  • Understanding application resource requirements
  • Capacity planning based on actual workload patterns
  • Troubleshooting performance issues at the process level
  • Container and pod resource attribution in Kubernetes environments

Technical Details

| Property | Value |
|----------|-------|
| MetricType | process |
| Primary Data Source | /proc/[pid]/stat, /proc/[pid]/status, /proc/[pid]/smaps_rollup |
| Collection Mode | Continuous only |
| Default Interval | 1 second |
| Default Top Process Count | 20 |

Capabilities

CollectorCapabilities{
    SupportsOneShot:    false,  // Continuous only for CPU% calculation
    SupportsContinuous: true,
    RequiresRoot:       false,
    RequiresEBPF:       false,
    MinKernelVersion:   "2.6.0",
}

Collection Strategy

The collector uses a two-phase approach for efficiency:

  1. Minimal Collection Phase - Reads only /proc/[pid]/stat for all processes to calculate CPU percentages
  2. Full Collection Phase - Collects comprehensive data only for the top N processes by CPU usage

This strategy minimizes overhead when monitoring systems with thousands of processes.
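
The selection step of this strategy can be sketched as follows. This is an illustrative outline only, using hypothetical names (minimalStat, listPIDs, topNByCPU) rather than the agent's actual identifiers:

// Illustrative outline of the two-phase strategy; names such as
// minimalStat, listPIDs, and topNByCPU are hypothetical, not the
// agent's actual identifiers.
package process

import (
    "os"
    "sort"
    "strconv"
)

// minimalStat holds the phase-1 data: just enough to rank processes.
type minimalStat struct {
    pid        int32
    cpuPercent float64 // derived from the delta against the previous sample
}

// listPIDs enumerates the numeric directories under the proc filesystem.
func listPIDs(procPath string) ([]int32, error) {
    entries, err := os.ReadDir(procPath)
    if err != nil {
        return nil, err
    }
    var pids []int32
    for _, e := range entries {
        if !e.IsDir() {
            continue
        }
        if pid, err := strconv.ParseInt(e.Name(), 10, 32); err == nil {
            pids = append(pids, int32(pid))
        }
    }
    return pids, nil
}

// topNByCPU ends phase 1: rank every process by CPU% and keep the PIDs
// that will receive the expensive phase-2 (status, fd, smaps_rollup) reads.
func topNByCPU(stats []minimalStat, n int) []int32 {
    sort.Slice(stats, func(i, j int) bool { return stats[i].cpuPercent > stats[j].cpuPercent })
    if len(stats) > n {
        stats = stats[:n]
    }
    pids := make([]int32, 0, len(stats))
    for _, s := range stats {
        pids = append(pids, s.pid)
    }
    return pids
}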

Collected Metrics

| Metric | Field Name | Description | Source |
|--------|------------|-------------|--------|
| Process ID | PID | Process identifier | /proc/[pid]/stat field 1 |
| Parent PID | PPID | Parent process ID | /proc/[pid]/stat field 4 |
| Process Group | PGID | Process group ID | /proc/[pid]/stat field 5 |
| Session ID | SID | Session ID | /proc/[pid]/stat field 6 |
| Command | Command | Process command name | /proc/[pid]/stat field 2 |
| State | State | Process state (R/S/D/Z/T) | /proc/[pid]/stat field 3 |
| CPU Time | CPUTime | Total CPU time (utime + stime) | /proc/[pid]/stat fields 14+15 |
| CPU Percent | CPUPercent | CPU usage percentage | Calculated from CPU time delta |
| Virtual Memory | MemoryVSZ | Virtual memory size (bytes) | /proc/[pid]/stat field 23 |
| Resident Memory | MemoryRSS | Resident set size (bytes) | /proc/[pid]/stat field 24 × page size |
| Proportional Memory | MemoryPSS | Proportional set size (bytes) | /proc/[pid]/smaps_rollup |
| Unique Memory | MemoryUSS | Unique set size (bytes) | /proc/[pid]/smaps_rollup |
| Thread Count | Threads | Number of threads | /proc/[pid]/stat field 20 |
| Nice Value | Nice | Process nice value | /proc/[pid]/stat field 19 |
| Priority | Priority | Process priority | /proc/[pid]/stat field 18 |
| Start Time | StartTime | Process start timestamp | Calculated from boot time + stat field 22 |
| Minor Faults | MinorFaults | Page faults without disk I/O | /proc/[pid]/stat field 10 |
| Major Faults | MajorFaults | Page faults requiring disk I/O | /proc/[pid]/stat field 12 |
| File Descriptors | NumFds | Open file descriptor count | Count of /proc/[pid]/fd/ entries |
| Thread Count | NumThreads | Thread count from status | /proc/[pid]/status |
| Voluntary Context Switches | VoluntaryCtxt | Voluntary context switches | /proc/[pid]/status |
| Involuntary Context Switches | InvoluntaryCtxt | Forced context switches | /proc/[pid]/status |
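
As a concrete illustration of how the stat-derived fields above are obtained, the sketch below parses a few of them from a raw /proc/[pid]/stat line. It is not the agent's actual parser; note that the command (field 2) is wrapped in parentheses and may itself contain spaces, so the line is split on the last ')':

// Illustrative parser for a subset of the /proc/[pid]/stat fields listed
// above; field indices follow proc(5). Not the agent's actual code.
package process

import (
    "fmt"
    "strconv"
    "strings"
)

type statSample struct {
    PID     int32
    Command string
    State   string
    PPID    int32
    UTime   uint64 // field 14, in clock ticks
    STime   uint64 // field 15, in clock ticks
    RSS     int64  // field 24, in pages (multiply by page size for bytes)
}

// parseStat splits on the last ')' because the command field can itself
// contain spaces and parentheses.
func parseStat(line string) (*statSample, error) {
    open := strings.IndexByte(line, '(')
    closeIdx := strings.LastIndexByte(line, ')')
    if open < 0 || closeIdx < open {
        return nil, fmt.Errorf("malformed stat line")
    }
    pid, err := strconv.ParseInt(strings.TrimSpace(line[:open]), 10, 32)
    if err != nil {
        return nil, err
    }
    rest := strings.Fields(line[closeIdx+1:]) // rest[0] is field 3 (state)
    if len(rest) < 22 {
        return nil, fmt.Errorf("truncated stat line")
    }
    ppid, _ := strconv.ParseInt(rest[1], 10, 32)    // field 4
    utime, _ := strconv.ParseUint(rest[11], 10, 64) // field 14
    stime, _ := strconv.ParseUint(rest[12], 10, 64) // field 15
    rss, _ := strconv.ParseInt(rest[21], 10, 64)    // field 24
    return &statSample{
        PID:     int32(pid),
        Command: line[open+1 : closeIdx],
        State:   rest[0],
        PPID:    int32(ppid),
        UTime:   utime,
        STime:   stime,
        RSS:     rss,
    }, nil
}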

Memory Metrics Explained

  • VSZ (Virtual Size): Total virtual memory allocated to the process
  • RSS (Resident Set Size): Physical memory currently used by the process
  • PSS (Proportional Set Size): RSS with each shared page divided by the number of processes sharing it
  • USS (Unique Set Size): Memory private to the process (not shared with any other process)

For example, a process with 10 MB of private memory plus 30 MB of a shared library mapped by three processes in total has USS = 10 MB, PSS = 10 + 30/3 = 20 MB, and RSS = 40 MB.

Data Structure

The Process Collector returns a []*performance.ProcessStats slice, one entry per reported process; the authoritative type definition lives in the agent repository.
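
As a rough guide to the shape of that type, here is a sketch of the fields implied by the metrics table above. The field names come from the table; the types are assumptions for illustration, not the repository's definition:

// Sketch of the fields implied by the metrics table above. The authoritative
// definition is performance.ProcessStats in the repository; the types here
// are assumptions for illustration.
package performance

import "time"

type ProcessStats struct {
    PID             int32
    PPID            int32
    PGID            int32
    SID             int32
    Command         string
    State           string // R, S, D, Z, or T
    CPUTime         uint64 // utime + stime, in clock ticks
    CPUPercent      float64
    MemoryVSZ       uint64 // bytes
    MemoryRSS       uint64 // bytes
    MemoryPSS       uint64 // bytes, requires smaps_rollup
    MemoryUSS       uint64 // bytes, requires smaps_rollup
    Threads         int32
    Nice            int32
    Priority        int32
    StartTime       time.Time
    MinorFaults     uint64
    MajorFaults     uint64
    NumFds          int32
    NumThreads      int32
    VoluntaryCtxt   uint64
    InvoluntaryCtxt uint64
}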

Configuration

Via Command Line Flags

antimetal-agent \
  --performance-top-process-count=50 \
  --performance-interval=2s

Via Environment Variables

export PERFORMANCE_TOP_PROCESS_COUNT=50
export PERFORMANCE_INTERVAL=2s

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| TopProcessCount | 20 | Number of top processes to return |
| Interval | 1s | Collection interval for continuous mode |
| HostProcPath | /proc | Path to the proc filesystem |

Example Configuration

config := performance.CollectionConfig{
    HostProcPath:     "/proc",
    TopProcessCount:  30,        // Return top 30 processes
    Interval:         2 * time.Second,
}

Platform Considerations

Linux Kernel Requirements

  • Minimum Kernel: 2.6.0
  • Recommended: 4.14+ for /proc/[pid]/smaps_rollup support
  • Required Files:
    • /proc/[pid]/stat - Basic process statistics
    • /proc/[pid]/status - Additional process information
    • /proc/[pid]/fd/ - File descriptor directory
    • /proc/[pid]/smaps_rollup - Memory details (optional, 4.14+)

Container Considerations

When running in containers:

  1. Proc Mount: Ensure /proc from the host is mounted:

    volumes:
    - name: host-proc
      hostPath:
        path: /proc
        type: Directory
    volumeMounts:
    - name: host-proc
      mountPath: /host/proc
      readOnly: true
    
  2. Environment Variable: Set the proc path (a resolution sketch follows this list):

    env:
    - name: HOST_PROC
      value: /host/proc
    
  3. PID Namespace: For container-specific monitoring, the agent must run in the host PID namespace:

    hostPID: true
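
How the agent consumes this variable is configuration-specific; the snippet below is a minimal sketch only, assuming the HOST_PROC variable from step 2 and a /proc fallback for non-containerized deployments:

// Minimal sketch of resolving the proc path from the HOST_PROC variable
// shown above, falling back to /proc on bare metal. The agent's actual
// configuration wiring may differ.
package process

import "os"

func resolveProcPath() string {
    if p := os.Getenv("HOST_PROC"); p != "" {
        return p // e.g. /host/proc when the host's /proc is mounted read-only
    }
    return "/proc"
}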
    

Performance Overhead

The Process Collector is optimized for minimal overhead:

  • Two-phase collection reduces unnecessary I/O
  • Only top N processes get full data collection
  • Efficient CPU percentage calculation with state tracking
  • Typical overhead: <0.5% CPU on systems with ~1000 processes

Common Issues

Issue: CPU Percentage Always Zero

Symptom: All processes show 0% CPU usage

Causes:

  1. First collection always shows 0% (no previous data for delta)
  2. Collection interval too short for measurable CPU time changes

Solution:

  • Wait for second collection cycle
  • Increase collection interval if needed (default 1s is usually sufficient)
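
The zero-on-first-sample behavior follows directly from the delta-based calculation. A minimal sketch of that calculation, with illustrative names (the agent's internal implementation may differ):

// Sketch of the delta-based CPU% calculation that explains the zero first
// sample: without a previous reading there is no delta to divide by the
// elapsed interval. Names are illustrative, not the agent's actual code.
package process

import "time"

// clkTck is USER_HZ; 100 is typical on Linux, but the real value should
// come from sysconf(_SC_CLK_TCK).
const clkTck = 100.0

type cpuSample struct {
    cpuTicks uint64 // utime + stime from /proc/[pid]/stat
    when     time.Time
}

// cpuPercent returns the CPU utilization of one process between two samples.
// A value of 100 means one full core; multi-threaded processes can exceed it.
func cpuPercent(prev, curr cpuSample) float64 {
    elapsed := curr.when.Sub(prev.when).Seconds()
    if elapsed <= 0 || curr.cpuTicks < prev.cpuTicks {
        return 0 // first sample, clock skew, or PID reuse
    }
    usedSeconds := float64(curr.cpuTicks-prev.cpuTicks) / clkTck
    return usedSeconds / elapsed * 100
}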

Issue: Missing Processes

Symptom: Expected processes not in top N list

Causes:

  1. Process has low CPU usage
  2. TopProcessCount set too low
  3. Process disappeared between collection phases

Solution:

  • Increase TopProcessCount configuration
  • Check process CPU usage with top or ps
  • Enable debug logging to see total process count

Issue: Memory Metrics Missing PSS/USS

Symptom: MemoryPSS and MemoryUSS are zero

Cause: Kernel doesn't support /proc/[pid]/smaps_rollup (requires kernel 4.14+)

Solution:

  • Upgrade the kernel to 4.14 or later
  • Use RSS as fallback metric
  • PSS/USS are optional enhancements
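
A minimal sketch of the fallback check, assuming a hypothetical helper (not the agent's actual code): probe for smaps_rollup per process and rely on RSS when it is absent:

// Sketch of the fallback described above: read PSS/USS from smaps_rollup
// when the kernel provides it, otherwise leave them at zero and rely on RSS.
// Hypothetical helper; not the agent's actual code.
package process

import (
    "fmt"
    "os"
)

// hasSmapsRollup reports whether the running kernel (4.14+) exposes
// /proc/[pid]/smaps_rollup for the given process.
func hasSmapsRollup(procPath string, pid int32) bool {
    _, err := os.Stat(fmt.Sprintf("%s/%d/smaps_rollup", procPath, pid))
    return err == nil
}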

Issue: Permission Denied Errors

Symptom: Failed to read process information

Causes:

  1. Attempting to read other users' processes
  2. SELinux/AppArmor restrictions
  3. Container security policies

Solution:

  • Run agent with appropriate permissions
  • Configure security policies to allow /proc access
  • Check audit logs for security denials
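
In practice, per-process read failures are usually tolerated rather than treated as fatal: a PID owned by another user or one that exits mid-collection should be skipped so the rest of the cycle completes. A sketch of that pattern, with hypothetical helper names:

// Sketch of tolerating per-process read failures: a PID owned by another
// user (EACCES) or one that exited mid-collection (ENOENT/ESRCH) is skipped
// rather than failing the whole collection cycle. Illustrative only.
package process

import (
    "errors"
    "fmt"
    "io/fs"
    "os"
    "syscall"
)

// shouldSkip reports whether a per-process read error is expected noise.
func shouldSkip(err error) bool {
    return errors.Is(err, fs.ErrPermission) || // EACCES: another user's process
        errors.Is(err, fs.ErrNotExist) || // ENOENT: process exited
        errors.Is(err, syscall.ESRCH) // ESRCH: process exited mid-read
}

// readProcFile wraps a read so callers can skip the process instead of
// aborting the whole cycle.
func readProcFile(procPath string, pid int32, name string) ([]byte, error) {
    data, err := os.ReadFile(fmt.Sprintf("%s/%d/%s", procPath, pid, name))
    if err != nil && shouldSkip(err) {
        return nil, nil // treat as "no data" and keep collecting
    }
    return data, err
}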

Examples

Sample Output

[
  {
    "PID": 1234,
    "PPID": 1,
    "PGID": 1234,
    "SID": 1234,
    "Command": "nginx",
    "State": "S",
    "CPUTime": 15000,
    "CPUPercent": 25.5,
    "MemoryVSZ": 125829120,
    "MemoryRSS": 20971520,
    "MemoryPSS": 15728640,
    "MemoryUSS": 10485760,
    "Threads": 4,
    "Nice": 0,
    "Priority": 20,
    "StartTime": "2024-01-15T10:30:00Z",
    "MinorFaults": 5000,
    "MajorFaults": 10,
    "NumFds": 45,
    "NumThreads": 4,
    "VoluntaryCtxt": 1000,
    "InvoluntaryCtxt": 500
  }
]

Performance Impact

Resource Usage

| Metric | Typical Value | Notes |
|--------|---------------|-------|
| CPU Usage | 0.1-0.5% | Depends on process count |
| Memory Usage | 5-10 MB | Includes tracking state |
| I/O Operations | ~3N reads/sec | N = number of processes |
| Collection Time | 10-50 ms | For 1000 processes |

Optimization Strategies

  1. Adjust Top Process Count: Lower values reduce full collection overhead
  2. Increase Collection Interval: Reduce frequency for less critical systems
  3. CPU Time Caching: Internal optimization tracks all processes efficiently
  4. Selective Collection: Only top N processes get full data collection

Scaling Considerations

  • < 100 processes: Negligible overhead
  • 100-1000 processes: Normal overhead (~0.2% CPU)
  • 1000-5000 processes: Moderate overhead (~0.5% CPU)
  • > 5000 processes: Consider increasing interval or reducing top count

Related Collectors

Complementary Metrics

Process Monitoring Stack

  1. Process Collector - Individual process metrics (this collector)
  2. CPU/Memory Collectors - System-wide resource usage
  3. Cgroup Collector - Container resource limits and usage
  4. eBPF Collectors - Deep process behavior analysis

Integration Points

  • Kubernetes pod attribution via container PIDs
  • Correlation with cgroup limits for container processes
  • Process genealogy tracking for security monitoring
  • Resource usage trends for capacity planning
