Process Collector
The Process Collector monitors per-process statistics in the Antimetal System Agent, tracking the top N processes by CPU usage along with their memory consumption, file descriptor and thread counts, and other vital metrics. This collector is essential for identifying resource-intensive processes and understanding system workload distribution.
Overview
The Process Collector provides real-time visibility into process-level resource consumption by:
- Tracking CPU usage - Monitors CPU time and calculates percentage utilization over time
- Memory profiling - Reports RSS, VSZ, PSS, and USS memory metrics
- Process relationships - Captures parent-child relationships and process groups
- Resource consumption - Tracks file descriptors, threads, and context switches
- Performance analysis - Identifies top resource consumers for optimization
This collector is particularly important for:
- Detecting runaway processes consuming excessive CPU or memory
- Understanding application resource requirements
- Capacity planning based on actual workload patterns
- Troubleshooting performance issues at the process level
- Container and pod resource attribution in Kubernetes environments
Technical Details
Property | Value |
---|---|
MetricType | process |
Primary Data Source | /proc/[pid]/stat, /proc/[pid]/status, /proc/[pid]/smaps_rollup |
Collection Mode | Continuous only |
Default Interval | 1 second |
Default Top Process Count | 20 |
Capabilities
```go
CollectorCapabilities{
    SupportsOneShot:    false, // Continuous only for CPU% calculation
    SupportsContinuous: true,
    RequiresRoot:       false,
    RequiresEBPF:       false,
    MinKernelVersion:   "2.6.0",
}
```
Collection Strategy
The collector uses a two-phase approach for efficiency:
- Minimal Collection Phase - Reads only /proc/[pid]/stat for all processes to calculate CPU percentages
- Full Collection Phase - Collects comprehensive data only for the top N processes by CPU usage
This strategy minimizes overhead when monitoring systems with thousands of processes.
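The two-phase idea can be illustrated with the Go sketch below. This is not the agent's actual code: the helper name, the hard-coded clock-tick value, and the three-cycle loop are assumptions for demonstration only.

```go
// Illustrative sketch of two-phase collection: a cheap pass reads only
// utime+stime for every PID, CPU% is computed against the previous sample,
// and only the top N entries would then get full collection.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
	"time"
)

// clkTck is the kernel's USER_HZ (clock ticks per second); it is almost
// always 100, but a real implementation would query it via sysconf.
const clkTck = 100.0

// readCPUTicks returns utime+stime (fields 14 and 15 of /proc/[pid]/stat)
// in clock ticks. The comm field is wrapped in parentheses and may contain
// spaces, so fields are split only after the last ')'.
func readCPUTicks(procPath string, pid int) (uint64, error) {
	data, err := os.ReadFile(filepath.Join(procPath, strconv.Itoa(pid), "stat"))
	if err != nil {
		return 0, err
	}
	s := string(data)
	rest := s[strings.LastIndexByte(s, ')')+2:]
	f := strings.Fields(rest) // f[0] is field 3 (state); utime/stime are f[11]/f[12]
	if len(f) < 13 {
		return 0, fmt.Errorf("unexpected /proc/[pid]/stat format")
	}
	utime, _ := strconv.ParseUint(f[11], 10, 64)
	stime, _ := strconv.ParseUint(f[12], 10, 64)
	return utime + stime, nil
}

func main() {
	const topN = 5
	procPath := "/proc"
	interval := time.Second
	prev := map[int]uint64{} // phase-1 state carried between samples

	for cycle := 0; cycle < 3; cycle++ {
		type sample struct {
			pid int
			pct float64
		}
		var samples []sample

		// Phase 1: cheap pass over every PID, reading only stat.
		entries, err := os.ReadDir(procPath)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		for _, e := range entries {
			pid, convErr := strconv.Atoi(e.Name())
			if convErr != nil {
				continue // not a PID directory
			}
			ticks, readErr := readCPUTicks(procPath, pid)
			if readErr != nil {
				continue // process exited or is unreadable
			}
			// The first cycle has no baseline, so CPU% stays 0 until cycle two.
			if last, ok := prev[pid]; ok {
				deltaSec := float64(ticks-last) / clkTck
				samples = append(samples, sample{pid, 100 * deltaSec / interval.Seconds()})
			}
			prev[pid] = ticks
		}

		// Phase 2 (in the real collector): gather full details for the top N only.
		sort.Slice(samples, func(i, j int) bool { return samples[i].pct > samples[j].pct })
		if len(samples) > topN {
			samples = samples[:topN]
		}
		for _, s := range samples {
			fmt.Printf("pid %d: %.1f%% CPU\n", s.pid, s.pct)
		}
		time.Sleep(interval)
	}
}
```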
Collected Metrics
Metric | Field Name | Description | Source |
---|---|---|---|
Process ID | PID | Process identifier | /proc/[pid]/stat field 1 |
Parent PID | PPID | Parent process ID | /proc/[pid]/stat field 4 |
Process Group | PGID | Process group ID | /proc/[pid]/stat field 5 |
Session ID | SID | Session ID | /proc/[pid]/stat field 6 |
Command | Command | Process command name | /proc/[pid]/stat field 2 |
State | State | Process state (R/S/D/Z/T) | /proc/[pid]/stat field 3 |
CPU Time | CPUTime | Total CPU time (utime + stime) | /proc/[pid]/stat fields 14+15 |
CPU Percent | CPUPercent | CPU usage percentage | Calculated from CPU time delta |
Virtual Memory | MemoryVSZ | Virtual memory size (bytes) | /proc/[pid]/stat field 23 |
Resident Memory | MemoryRSS | Resident set size (bytes) | /proc/[pid]/stat field 24 × page_size |
Proportional Memory | MemoryPSS | Proportional set size (bytes) | /proc/[pid]/smaps_rollup |
Unique Memory | MemoryUSS | Unique set size (bytes) | /proc/[pid]/smaps_rollup |
Thread Count | Threads | Number of threads | /proc/[pid]/stat field 20 |
Nice Value | Nice | Process nice value | /proc/[pid]/stat field 19 |
Priority | Priority | Process priority | /proc/[pid]/stat field 18 |
Start Time | StartTime | Process start timestamp | Calculated from boot time + stat field 22 |
Minor Faults | MinorFaults | Page faults without disk I/O | /proc/[pid]/stat field 10 |
Major Faults | MajorFaults | Page faults requiring disk I/O | /proc/[pid]/stat field 12 |
File Descriptors | NumFds | Open file descriptor count | Count of /proc/[pid]/fd/ entries |
Thread Count | NumThreads | Thread count from status | /proc/[pid]/status |
Voluntary Context Switches | VoluntaryCtxt | Voluntary context switches | /proc/[pid]/status |
Involuntary Context Switches | InvoluntaryCtxt | Forced context switches | /proc/[pid]/status |
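As a concrete example of the last few rows, the file descriptor count is simply the number of entries under /proc/[pid]/fd/. A minimal sketch, not the agent's implementation (the package and function names are hypothetical):

```go
package procutil

import (
	"os"
	"path/filepath"
	"strconv"
)

// countFds returns the number of entries in /proc/[pid]/fd/, i.e. the
// process's open file descriptor count. Illustrative sketch only.
func countFds(procPath string, pid int) (int, error) {
	entries, err := os.ReadDir(filepath.Join(procPath, strconv.Itoa(pid), "fd"))
	if err != nil {
		return 0, err // typically a permission error for other users' processes
	}
	return len(entries), nil
}
```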
Memory Metrics Explained
- VSZ (Virtual Size): Total virtual memory allocated to the process
- RSS (Resident Set Size): Physical memory currently used by the process
- PSS (Proportional Set Size): RSS where shared pages are divided by the number of processes sharing them
- USS (Unique Set Size): Memory unique to a process (not shared)
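PSS and USS come from /proc/[pid]/smaps_rollup. A minimal sketch of that parsing, assuming the conventional derivation USS = Private_Clean + Private_Dirty (package and function names are hypothetical, not the agent's API):

```go
package procutil

import (
	"bufio"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readPssUss parses /proc/[pid]/smaps_rollup and returns PSS and USS in bytes.
// PSS is reported directly; USS is derived as Private_Clean + Private_Dirty.
// Minimal sketch only; the file is absent on kernels without smaps_rollup.
func readPssUss(procPath string, pid int) (pss, uss uint64, err error) {
	f, err := os.Open(filepath.Join(procPath, strconv.Itoa(pid), "smaps_rollup"))
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text()) // e.g. "Pss:  1234 kB"
		if len(fields) < 2 {
			continue
		}
		kb, convErr := strconv.ParseUint(fields[1], 10, 64)
		if convErr != nil {
			continue // skip the address-range header line
		}
		switch fields[0] {
		case "Pss:":
			pss = kb * 1024
		case "Private_Clean:", "Private_Dirty:":
			uss += kb * 1024
		}
	}
	return pss, uss, scanner.Err()
}
```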
Data Structure
The Process Collector returns []*performance.ProcessStats. See the implementation at:
- Collector: pkg/performance/collectors/process.go
- Data Types: pkg/performance/types.go
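The type in pkg/performance/types.go is authoritative. Purely as an orientation aid, the struct below approximates its shape from the field names documented above and in the sample output; the field types are guesses, not the real definition:

```go
package performance

import "time"

// ProcessStats (approximation): field names follow the metrics table above;
// types are illustrative guesses. See pkg/performance/types.go for the
// actual definition.
type ProcessStats struct {
	PID             int32
	PPID            int32
	PGID            int32
	SID             int32
	Command         string
	State           string
	CPUTime         uint64 // utime + stime, in clock ticks
	CPUPercent      float64
	MemoryVSZ       uint64 // bytes
	MemoryRSS       uint64 // bytes
	MemoryPSS       uint64 // bytes; zero if smaps_rollup is unavailable
	MemoryUSS       uint64 // bytes; zero if smaps_rollup is unavailable
	Threads         int32
	Nice            int32
	Priority        int32
	StartTime       time.Time
	MinorFaults     uint64
	MajorFaults     uint64
	NumFds          int32
	NumThreads      int32
	VoluntaryCtxt   uint64
	InvoluntaryCtxt uint64
}
```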
Configuration
Via Command Line Flags
```bash
antimetal-agent \
  --performance-top-process-count=50 \
  --performance-interval=2s
```
Via Environment Variables
```bash
export PERFORMANCE_TOP_PROCESS_COUNT=50
export PERFORMANCE_INTERVAL=2s
```
Configuration Options
Option | Default | Description |
---|---|---|
TopProcessCount | 20 | Number of top processes to return |
Interval | 1s | Collection interval for continuous mode |
HostProcPath | /proc | Path to proc filesystem |
Example Configuration
```go
config := performance.CollectionConfig{
    HostProcPath:    "/proc",
    TopProcessCount: 30, // Return top 30 processes
    Interval:        2 * time.Second,
}
```
Platform Considerations
Linux Kernel Requirements
- Minimum Kernel: 2.6.0
- Recommended: 4.14+ for /proc/[pid]/smaps_rollup support
- Required Files:
  - /proc/[pid]/stat - Basic process statistics
  - /proc/[pid]/status - Additional process information
  - /proc/[pid]/fd/ - File descriptor directory
  - /proc/[pid]/smaps_rollup - Memory details (optional, 4.14+)
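Whether smaps_rollup is available can be probed once at startup, for example (hypothetical helper, not the agent's API):

```go
package procutil

import (
	"os"
	"path/filepath"
)

// smapsRollupSupported reports whether the kernel exposes
// /proc/[pid]/smaps_rollup, probed via PID 1 which always exists in the
// host proc mount. When absent, PSS/USS stay zero and RSS is used instead.
func smapsRollupSupported(procPath string) bool {
	_, err := os.Stat(filepath.Join(procPath, "1", "smaps_rollup"))
	return err == nil
}
```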
Container Considerations
When running in containers:

1. Proc Mount: Ensure /proc from the host is mounted:

   ```yaml
   volumes:
     - name: host-proc
       hostPath:
         path: /proc
         type: Directory
   volumeMounts:
     - name: host-proc
       mountPath: /host/proc
       readOnly: true
   ```

2. Environment Variable: Set the proc path:

   ```yaml
   env:
     - name: HOST_PROC
       value: /host/proc
   ```

3. PID Namespace: For container-specific monitoring, the agent must run in the host PID namespace:

   ```yaml
   hostPID: true
   ```
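Inside the agent process, the effective proc root can then be resolved from that variable, roughly as follows (sketch; the helper name is hypothetical):

```go
package procutil

import "os"

// resolveProcPath returns the proc filesystem root to read from,
// honoring the HOST_PROC override used in containerized deployments.
func resolveProcPath() string {
	if p := os.Getenv("HOST_PROC"); p != "" {
		return p // e.g. /host/proc when the host's /proc is mounted read-only
	}
	return "/proc"
}
```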
Performance Overhead
The Process Collector is optimized for minimal overhead:
- Two-phase collection reduces unnecessary I/O
- Only top N processes get full data collection
- Efficient CPU percentage calculation with state tracking
- Typical overhead: <0.5% CPU on systems with ~1000 processes
Common Issues
Issue: CPU Percentage Always Zero
Symptom: All processes show 0% CPU usage
Causes:
- First collection always shows 0% (no previous data for delta)
- Collection interval too short for measurable CPU time changes
Solution:
- Wait for second collection cycle
- Increase collection interval if needed (default 1s is usually sufficient)
Issue: Missing Processes
Symptom: Expected processes not in top N list
Causes:
- Process has low CPU usage
- TopProcessCount set too low
- Process disappeared between collection phases
Solution:
- Increase TopProcessCount configuration
- Check process CPU usage with top or ps
- Enable debug logging to see total process count
Issue: Memory Metrics Missing PSS/USS
Symptom: MemoryPSS and MemoryUSS are zero
Cause: Kernel doesn't support /proc/[pid]/smaps_rollup (requires kernel 4.14+)
Solution:
- Upgrade kernel to 4.14 or later
- Use RSS as fallback metric
- PSS/USS are optional enhancements
Issue: Permission Denied Errors
Symptom: Failed to read process information
Causes:
- Attempting to read other users' processes
- SELinux/AppArmor restrictions
- Container security policies
Solution:
- Run agent with appropriate permissions
- Configure security policies to allow /proc access
- Check audit logs for security denials
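In practice, a collector loop skips unreadable PIDs rather than failing the whole cycle, along these lines (sketch; the helper name is hypothetical):

```go
package procutil

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
)

// readStatus reads /proc/[pid]/status, returning (nil, nil) when the process
// is unreadable (permission denied) or has already exited, so the caller can
// skip it instead of aborting the whole collection cycle.
func readStatus(procPath string, pid int) ([]byte, error) {
	data, err := os.ReadFile(filepath.Join(procPath, strconv.Itoa(pid), "status"))
	if errors.Is(err, fs.ErrPermission) || errors.Is(err, fs.ErrNotExist) {
		return nil, nil // skip: restricted by security policy, or the process exited
	}
	return data, err
}
```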
Examples
Sample Output
```json
[
  {
    "PID": 1234,
    "PPID": 1,
    "PGID": 1234,
    "SID": 1234,
    "Command": "nginx",
    "State": "S",
    "CPUTime": 15000,
    "CPUPercent": 25.5,
    "MemoryVSZ": 125829120,
    "MemoryRSS": 20971520,
    "MemoryPSS": 15728640,
    "MemoryUSS": 10485760,
    "Threads": 4,
    "Nice": 0,
    "Priority": 20,
    "StartTime": "2024-01-15T10:30:00Z",
    "MinorFaults": 5000,
    "MajorFaults": 10,
    "NumFds": 45,
    "NumThreads": 4,
    "VoluntaryCtxt": 1000,
    "InvoluntaryCtxt": 500
  }
]
```
Performance Impact
Resource Usage
Metric | Typical Value | Notes |
---|---|---|
CPU Usage | 0.1-0.5% | Depends on process count |
Memory Usage | 5-10 MB | Includes tracking state |
I/O Operations | ~3N reads/sec | N = number of processes |
Collection Time | 10-50ms | For 1000 processes |
Optimization Strategies
- Adjust Top Process Count: Lower values reduce full collection overhead
- Increase Collection Interval: Reduce frequency for less critical systems
- CPU Time Caching: Internal optimization tracks all processes efficiently
- Selective Collection: Only top N processes get full data collection
Scaling Considerations
- < 100 processes: Negligible overhead
- 100-1000 processes: Normal overhead (~0.2% CPU)
- 1000-5000 processes: Moderate overhead (~0.5% CPU)
- > 5000 processes: Consider increasing interval or reducing top count
Related Collectors
Complementary Metrics
- CPU Collector - System-wide CPU statistics
- Memory Collector - System-wide memory usage
- Load Collector - System load and running process count
Process Monitoring Stack
- Process Collector - Individual process metrics (this collector)
- CPU/Memory Collectors - System-wide resource usage
- Cgroup Collector - Container resource limits and usage
- eBPF Collectors - Deep process behavior analysis
Integration Points
- Kubernetes pod attribution via container PIDs
- Correlation with cgroup limits for container processes
- Process genealogy tracking for security monitoring
- Resource usage trends for capacity planning