Process Collector

The Process Collector monitors per-process statistics in the Antimetal System Agent, tracking the top N processes by CPU usage along with their memory consumption, file descriptor usage, context switches, and other vital metrics. This collector is essential for identifying resource-intensive processes and understanding system workload distribution.

Overview

The Process Collector provides real-time visibility into process-level resource consumption by:

  • Tracking CPU usage - Monitors CPU time and calculates percentage utilization over time
  • Memory profiling - Reports RSS, VSZ, PSS, and USS memory metrics
  • Process relationships - Captures parent-child relationships and process groups
  • Resource consumption - Tracks file descriptors, threads, and context switches
  • Performance analysis - Identifies top resource consumers for optimization

This collector is particularly important for:

  • Detecting runaway processes consuming excessive CPU or memory
  • Understanding application resource requirements
  • Capacity planning based on actual workload patterns
  • Troubleshooting performance issues at the process level
  • Container and pod resource attribution in Kubernetes environments

Technical Details

| Property | Value |
|----------|-------|
| MetricType | process |
| Primary Data Source | /proc/[pid]/stat, /proc/[pid]/status, /proc/[pid]/smaps_rollup |
| Collection Mode | Continuous only |
| Default Interval | 1 second |
| Default Top Process Count | 20 |

Capabilities

CollectorCapabilities{
    SupportsOneShot:    false,  // Continuous only for CPU% calculation
    SupportsContinuous: true,
    RequiresRoot:       false,
    RequiresEBPF:       false,
    MinKernelVersion:   "2.6.0",
}

Collection Strategy

The collector uses a two-phase approach for efficiency:

  1. Minimal Collection Phase - Reads only /proc/[pid]/stat for all processes to calculate CPU percentages
  2. Full Collection Phase - Collects comprehensive data only for the top N processes by CPU usage

This strategy minimizes overhead when monitoring systems with thousands of processes.
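
The selection step of this strategy can be sketched as follows. This is an illustrative outline only, using hypothetical names (minimalStat, listPIDs, topNByCPU) rather than the agent's actual identifiers:

// Illustrative outline of the two-phase strategy; names such as
// minimalStat, listPIDs, and topNByCPU are hypothetical, not the
// agent's actual identifiers.
package process

import (
    "os"
    "sort"
    "strconv"
)

// minimalStat holds the phase-1 data: just enough to rank processes.
type minimalStat struct {
    pid        int32
    cpuPercent float64 // derived from the delta against the previous sample
}

// listPIDs enumerates the numeric directories under the proc filesystem.
func listPIDs(procPath string) ([]int32, error) {
    entries, err := os.ReadDir(procPath)
    if err != nil {
        return nil, err
    }
    var pids []int32
    for _, e := range entries {
        if !e.IsDir() {
            continue
        }
        if pid, err := strconv.ParseInt(e.Name(), 10, 32); err == nil {
            pids = append(pids, int32(pid))
        }
    }
    return pids, nil
}

// topNByCPU ends phase 1: rank every process by CPU% and keep the PIDs
// that will receive the expensive phase-2 (status, fd, smaps_rollup) reads.
func topNByCPU(stats []minimalStat, n int) []int32 {
    sort.Slice(stats, func(i, j int) bool { return stats[i].cpuPercent > stats[j].cpuPercent })
    if len(stats) > n {
        stats = stats[:n]
    }
    pids := make([]int32, 0, len(stats))
    for _, s := range stats {
        pids = append(pids, s.pid)
    }
    return pids
}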

Collected Metrics

| Metric | Field Name | Description | Source |
|--------|------------|-------------|--------|
| Process ID | PID | Process identifier | /proc/[pid]/stat field 1 |
| Parent PID | PPID | Parent process ID | /proc/[pid]/stat field 4 |
| Process Group | PGID | Process group ID | /proc/[pid]/stat field 5 |
| Session ID | SID | Session ID | /proc/[pid]/stat field 6 |
| Command | Command | Process command name | /proc/[pid]/stat field 2 |
| State | State | Process state (R/S/D/Z/T) | /proc/[pid]/stat field 3 |
| CPU Time | CPUTime | Total CPU time (utime + stime) | /proc/[pid]/stat fields 14+15 |
| CPU Percent | CPUPercent | CPU usage percentage | Calculated from CPU time delta |
| Virtual Memory | MemoryVSZ | Virtual memory size (bytes) | /proc/[pid]/stat field 23 |
| Resident Memory | MemoryRSS | Resident set size (bytes) | /proc/[pid]/stat field 24 × page size |
| Proportional Memory | MemoryPSS | Proportional set size (bytes) | /proc/[pid]/smaps_rollup |
| Unique Memory | MemoryUSS | Unique set size (bytes) | /proc/[pid]/smaps_rollup |
| Thread Count | Threads | Number of threads | /proc/[pid]/stat field 20 |
| Nice Value | Nice | Process nice value | /proc/[pid]/stat field 19 |
| Priority | Priority | Process priority | /proc/[pid]/stat field 18 |
| Start Time | StartTime | Process start timestamp | Calculated from boot time + stat field 22 |
| Minor Faults | MinorFaults | Page faults without disk I/O | /proc/[pid]/stat field 10 |
| Major Faults | MajorFaults | Page faults requiring disk I/O | /proc/[pid]/stat field 12 |
| File Descriptors | NumFds | Open file descriptor count | Count of /proc/[pid]/fd/ entries |
| Thread Count | NumThreads | Thread count from status | /proc/[pid]/status |
| Voluntary Context Switches | VoluntaryCtxt | Voluntary context switches | /proc/[pid]/status |
| Involuntary Context Switches | InvoluntaryCtxt | Forced context switches | /proc/[pid]/status |
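
As a concrete illustration of how the stat-derived fields above are obtained, the sketch below parses a few of them from a raw /proc/[pid]/stat line. It is not the agent's actual parser; note that the command (field 2) is wrapped in parentheses and may itself contain spaces, so the line is split on the last ')':

// Illustrative parser for a subset of the /proc/[pid]/stat fields listed
// above; field indices follow proc(5). Not the agent's actual code.
package process

import (
    "fmt"
    "strconv"
    "strings"
)

type statSample struct {
    PID     int32
    Command string
    State   string
    PPID    int32
    UTime   uint64 // field 14, in clock ticks
    STime   uint64 // field 15, in clock ticks
    RSS     int64  // field 24, in pages (multiply by page size for bytes)
}

// parseStat splits on the last ')' because the command field can itself
// contain spaces and parentheses.
func parseStat(line string) (*statSample, error) {
    open := strings.IndexByte(line, '(')
    closeIdx := strings.LastIndexByte(line, ')')
    if open < 0 || closeIdx < open {
        return nil, fmt.Errorf("malformed stat line")
    }
    pid, err := strconv.ParseInt(strings.TrimSpace(line[:open]), 10, 32)
    if err != nil {
        return nil, err
    }
    rest := strings.Fields(line[closeIdx+1:]) // rest[0] is field 3 (state)
    if len(rest) < 22 {
        return nil, fmt.Errorf("truncated stat line")
    }
    ppid, _ := strconv.ParseInt(rest[1], 10, 32)    // field 4
    utime, _ := strconv.ParseUint(rest[11], 10, 64) // field 14
    stime, _ := strconv.ParseUint(rest[12], 10, 64) // field 15
    rss, _ := strconv.ParseInt(rest[21], 10, 64)    // field 24
    return &statSample{
        PID:     int32(pid),
        Command: line[open+1 : closeIdx],
        State:   rest[0],
        PPID:    int32(ppid),
        UTime:   utime,
        STime:   stime,
        RSS:     rss,
    }, nil
}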

Memory Metrics Explained

  • VSZ (Virtual Size): Total virtual memory allocated to the process
  • RSS (Resident Set Size): Physical memory currently used by the process
  • PSS (Proportional Set Size): RSS with each shared page divided by the number of processes sharing it
  • USS (Unique Set Size): Memory private to the process (not shared with any other process)

For example, a process with 10 MB of private memory plus 30 MB of a shared library mapped by three processes in total has USS = 10 MB, PSS = 10 + 30/3 = 20 MB, and RSS = 40 MB.

Data Structure

The Process Collector returns a []*performance.ProcessStats slice, one entry per reported process; the authoritative type definition lives in the agent repository.
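
As a rough guide to the shape of that type, here is a sketch of the fields implied by the metrics table above. The field names come from the table; the types are assumptions for illustration, not the repository's definition:

// Sketch of the fields implied by the metrics table above. The authoritative
// definition is performance.ProcessStats in the repository; the types here
// are assumptions for illustration.
package performance

import "time"

type ProcessStats struct {
    PID             int32
    PPID            int32
    PGID            int32
    SID             int32
    Command         string
    State           string // R, S, D, Z, or T
    CPUTime         uint64 // utime + stime, in clock ticks
    CPUPercent      float64
    MemoryVSZ       uint64 // bytes
    MemoryRSS       uint64 // bytes
    MemoryPSS       uint64 // bytes, requires smaps_rollup
    MemoryUSS       uint64 // bytes, requires smaps_rollup
    Threads         int32
    Nice            int32
    Priority        int32
    StartTime       time.Time
    MinorFaults     uint64
    MajorFaults     uint64
    NumFds          int32
    NumThreads      int32
    VoluntaryCtxt   uint64
    InvoluntaryCtxt uint64
}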

Configuration

Via Command Line Flags

antimetal-agent \
  --performance-top-process-count=50 \
  --performance-interval=2s

Via Environment Variables

export PERFORMANCE_TOP_PROCESS_COUNT=50
export PERFORMANCE_INTERVAL=2s

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| TopProcessCount | 20 | Number of top processes to return |
| Interval | 1s | Collection interval for continuous mode |
| HostProcPath | /proc | Path to the proc filesystem |

Example Configuration

config := performance.CollectionConfig{
    HostProcPath:     "/proc",
    TopProcessCount:  30,        // Return top 30 processes
    Interval:         2 * time.Second,
}

Platform Considerations

Linux Kernel Requirements

  • Minimum Kernel: 2.6.0
  • Recommended: 4.14+ for /proc/[pid]/smaps_rollup support
  • Required Files:
    • /proc/[pid]/stat - Basic process statistics
    • /proc/[pid]/status - Additional process information
    • /proc/[pid]/fd/ - File descriptor directory
    • /proc/[pid]/smaps_rollup - Memory details (optional, 4.14+)

Container Considerations

When running in containers:

  1. Proc Mount: Ensure /proc from the host is mounted:

    volumes:
    - name: host-proc
      hostPath:
        path: /proc
        type: Directory
    volumeMounts:
    - name: host-proc
      mountPath: /host/proc
      readOnly: true
    
  2. Environment Variable: Set the proc path (a resolution sketch follows this list):

    env:
    - name: HOST_PROC
      value: /host/proc
    
  3. PID Namespace: For container-specific monitoring, the agent must run in the host PID namespace:

    hostPID: true
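
How the agent consumes this variable is configuration-specific; the snippet below is a minimal sketch only, assuming the HOST_PROC variable from step 2 and a /proc fallback for non-containerized deployments:

// Minimal sketch of resolving the proc path from the HOST_PROC variable
// shown above, falling back to /proc on bare metal. The agent's actual
// configuration wiring may differ.
package process

import "os"

func resolveProcPath() string {
    if p := os.Getenv("HOST_PROC"); p != "" {
        return p // e.g. /host/proc when the host's /proc is mounted read-only
    }
    return "/proc"
}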
    

Performance Overhead

The Process Collector is optimized for minimal overhead:

  • Two-phase collection reduces unnecessary I/O
  • Only top N processes get full data collection
  • Efficient CPU percentage calculation with state tracking
  • Typical overhead: <0.5% CPU on systems with ~1000 processes

Common Issues

Issue: CPU Percentage Always Zero

Symptom: All processes show 0% CPU usage

Causes:

  1. First collection always shows 0% (no previous data for delta)
  2. Collection interval too short for measurable CPU time changes

Solution:

  • Wait for second collection cycle
  • Increase collection interval if needed (default 1s is usually sufficient)
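
The zero-on-first-sample behavior follows directly from the delta-based calculation. A minimal sketch of that calculation, with illustrative names (the agent's internal implementation may differ):

// Sketch of the delta-based CPU% calculation that explains the zero first
// sample: without a previous reading there is no delta to divide by the
// elapsed interval. Names are illustrative, not the agent's actual code.
package process

import "time"

// clkTck is USER_HZ; 100 is typical on Linux, but the real value should
// come from sysconf(_SC_CLK_TCK).
const clkTck = 100.0

type cpuSample struct {
    cpuTicks uint64 // utime + stime from /proc/[pid]/stat
    when     time.Time
}

// cpuPercent returns the CPU utilization of one process between two samples.
// A value of 100 means one full core; multi-threaded processes can exceed it.
func cpuPercent(prev, curr cpuSample) float64 {
    elapsed := curr.when.Sub(prev.when).Seconds()
    if elapsed <= 0 || curr.cpuTicks < prev.cpuTicks {
        return 0 // first sample, clock skew, or PID reuse
    }
    usedSeconds := float64(curr.cpuTicks-prev.cpuTicks) / clkTck
    return usedSeconds / elapsed * 100
}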

Issue: Missing Processes

Symptom: Expected processes not in top N list

Causes:

  1. Process has low CPU usage
  2. TopProcessCount set too low
  3. Process disappeared between collection phases

Solution:

  • Increase TopProcessCount configuration
  • Check process CPU usage with top or ps
  • Enable debug logging to see total process count

Issue: Memory Metrics Missing PSS/USS

Symptom: MemoryPSS and MemoryUSS are zero

Cause: Kernel doesn't support /proc/[pid]/smaps_rollup (requires kernel 4.14+)

Solution:

  • Upgrade the kernel to 4.14 or later
  • Use RSS as fallback metric
  • PSS/USS are optional enhancements
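
A minimal sketch of the fallback check, assuming a hypothetical helper (not the agent's actual code): probe for smaps_rollup per process and rely on RSS when it is absent:

// Sketch of the fallback described above: read PSS/USS from smaps_rollup
// when the kernel provides it, otherwise leave them at zero and rely on RSS.
// Hypothetical helper; not the agent's actual code.
package process

import (
    "fmt"
    "os"
)

// hasSmapsRollup reports whether the running kernel (4.14+) exposes
// /proc/[pid]/smaps_rollup for the given process.
func hasSmapsRollup(procPath string, pid int32) bool {
    _, err := os.Stat(fmt.Sprintf("%s/%d/smaps_rollup", procPath, pid))
    return err == nil
}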

Issue: Permission Denied Errors

Symptom: Failed to read process information

Causes:

  1. Attempting to read other users' processes
  2. SELinux/AppArmor restrictions
  3. Container security policies

Solution:

  • Run agent with appropriate permissions
  • Configure security policies to allow /proc access
  • Check audit logs for security denials
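
In practice, per-process read failures are usually tolerated rather than treated as fatal: a PID owned by another user or one that exits mid-collection should be skipped so the rest of the cycle completes. A sketch of that pattern, with hypothetical helper names:

// Sketch of tolerating per-process read failures: a PID owned by another
// user (EACCES) or one that exited mid-collection (ENOENT/ESRCH) is skipped
// rather than failing the whole collection cycle. Illustrative only.
package process

import (
    "errors"
    "fmt"
    "io/fs"
    "os"
    "syscall"
)

// shouldSkip reports whether a per-process read error is expected noise.
func shouldSkip(err error) bool {
    return errors.Is(err, fs.ErrPermission) || // EACCES: another user's process
        errors.Is(err, fs.ErrNotExist) || // ENOENT: process exited
        errors.Is(err, syscall.ESRCH) // ESRCH: process exited mid-read
}

// readProcFile wraps a read so callers can skip the process instead of
// aborting the whole cycle.
func readProcFile(procPath string, pid int32, name string) ([]byte, error) {
    data, err := os.ReadFile(fmt.Sprintf("%s/%d/%s", procPath, pid, name))
    if err != nil && shouldSkip(err) {
        return nil, nil // treat as "no data" and keep collecting
    }
    return data, err
}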

Examples

Sample Output

[
  {
    "PID": 1234,
    "PPID": 1,
    "PGID": 1234,
    "SID": 1234,
    "Command": "nginx",
    "State": "S",
    "CPUTime": 15000,
    "CPUPercent": 25.5,
    "MemoryVSZ": 125829120,
    "MemoryRSS": 20971520,
    "MemoryPSS": 15728640,
    "MemoryUSS": 10485760,
    "Threads": 4,
    "Nice": 0,
    "Priority": 20,
    "StartTime": "2024-01-15T10:30:00Z",
    "MinorFaults": 5000,
    "MajorFaults": 10,
    "NumFds": 45,
    "NumThreads": 4,
    "VoluntaryCtxt": 1000,
    "InvoluntaryCtxt": 500
  }
]

Performance Impact

Resource Usage

| Metric | Typical Value | Notes |
|--------|---------------|-------|
| CPU Usage | 0.1-0.5% | Depends on process count |
| Memory Usage | 5-10 MB | Includes tracking state |
| I/O Operations | ~3N reads/sec | N = number of processes |
| Collection Time | 10-50 ms | For 1000 processes |

Optimization Strategies

  1. Adjust Top Process Count: Lower values reduce full collection overhead
  2. Increase Collection Interval: Reduce frequency for less critical systems
  3. CPU Time Caching: Internal optimization tracks all processes efficiently
  4. Selective Collection: Only top N processes get full data collection

Scaling Considerations

  • < 100 processes: Negligible overhead
  • 100-1000 processes: Normal overhead (~0.2% CPU)
  • 1000-5000 processes: Moderate overhead (~0.5% CPU)
  • > 5000 processes: Consider increasing interval or reducing top count

Related Collectors

Complementary Metrics

Process Monitoring Stack

  1. Process Collector - Individual process metrics (this collector)
  2. CPU/Memory Collectors - System-wide resource usage
  3. Cgroup Collector - Container resource limits and usage
  4. eBPF Collectors - Deep process behavior analysis

Integration Points

  • Kubernetes pod attribution via container PIDs
  • Correlation with cgroup limits for container processes
  • Process genealogy tracking for security monitoring
  • Resource usage trends for capacity planning
