Performance Collectors - antimetal/system-agent GitHub Wiki

Performance Collectors

⚠️ Work in Progress: This documentation is currently being developed and may be incomplete or subject to change.

Overview

Performance collectors are the core components of the Antimetal System Agent responsible for gathering system metrics and hardware information. This page provides an overview of all available collectors, their architecture, and how to work with them.

Collector Architecture

Collector Types

The system supports two main collector types:

Point Collectors - Gather metrics at a single point in time (listed as "Point" or "One-shot" in the tables below)
Continuous Collectors - Run continuously and calculate rates and deltas between successive samples
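
To illustrate what "calculate rates/deltas" means for a continuous collector, here is a minimal sketch that derives CPU busy percentage from two samples of the aggregate `cpu` line in `/proc/stat`. The `cpuSample`, `parseCPULine`, and `busyPercent` names are hypothetical and are not taken from the agent's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative jiffy counters taken from the aggregate "cpu"
// line of /proc/stat. The type and its fields are illustrative names.
type cpuSample struct {
	total uint64 // user+nice+system+idle+iowait+irq+softirq+steal
	idle  uint64 // idle+iowait
}

// parseCPULine parses a line such as "cpu  100 0 50 800 50 0 0 0 0 0".
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected cpu line: %q", line)
	}
	var s cpuSample
	for i, f := range fields[1:] {
		if i >= 8 {
			// guest and guest_nice are already counted in user and nice.
			break
		}
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("cpu field %d: %w", i, err)
		}
		s.total += v
		if i == 3 || i == 4 { // idle and iowait
			s.idle += v
		}
	}
	return s, nil
}

// busyPercent returns the share of non-idle time between two samples.
// It returns 0 when the counters did not advance or went backwards.
func busyPercent(prev, cur cpuSample) float64 {
	if cur.total <= prev.total || cur.idle < prev.idle {
		return 0
	}
	totalDelta := cur.total - prev.total
	idleDelta := cur.idle - prev.idle
	if idleDelta > totalDelta {
		idleDelta = totalDelta
	}
	return 100 * float64(totalDelta-idleDelta) / float64(totalDelta)
}
```

A point collector would return the raw counters from a single read. The percentage only exists once a previous sample is available, which is why this kind of metric suits a continuous collector.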

Collector Interfaces

// Basic collector interface
type Collector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}

// Point collector for one-shot collection
type PointCollector interface {
    Collector
    Capabilities() CollectorCapabilities
}

// Continuous collector with lifecycle management
type ContinuousCollector interface {
    Collector
    Start(ctx context.Context) error
    Stop() error
}

Available Collectors

System Metrics Collectors

Collector	Metric Type	Source	Collection Mode	Description
CPU Collector	`cpu`	`/proc/stat`	Continuous	CPU usage and time distribution
Memory Collector	`memory`	`/proc/meminfo`	Continuous	Memory usage, buffers, cache
Load Collector	`load`	`/proc/loadavg`	Continuous	System load averages
Network Collector	`network`	`/proc/net/dev`	Continuous	Network interface statistics
Disk Collector	`disk`	`/proc/diskstats`	Continuous	Disk I/O statistics
TCP Collector	`tcp`	`/proc/net/tcp`	Point	TCP connection states
Process Collector	`process`	`/proc/[pid]/*`	Continuous	Per-process metrics
Kernel Collector	`kernel`	`/proc/sys/kernel`	Point	Kernel parameters

Hardware Information Collectors

Collector	Metric Type	Source	Collection Mode	Description
CPU Info Collector	`cpu_info`	`/proc/cpuinfo`	One-shot	CPU hardware details
Memory Info Collector	`memory_info`	`/proc/meminfo`	One-shot	Memory hardware configuration
Disk Info Collector	`disk_info`	`/sys/block`	One-shot	Disk hardware details
Network Info Collector	`network_info`	`/sys/class/net`	One-shot	Network hardware info
NUMA Collector	`numa`	`/sys/devices/system/node`	Mixed	NUMA topology and stats

eBPF Collectors

Collector	Metric Type	Source	Collection Mode	Description
Execsnoop Collector	`execsnoop`	eBPF	Continuous	Process execution tracking

Collector Capabilities

Each collector declares its capabilities:

type CollectorCapabilities struct {
    SupportsOneShot    bool   // Can do point-in-time collection
    SupportsContinuous bool   // Can run continuously
    RequiresRoot       bool   // Needs root privileges
    RequiresEBPF       bool   // Needs eBPF support
    MinKernelVersion   string // Minimum Linux kernel version
}
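
One way the declared capabilities could be checked against the host before a collector is started is sketched below. `hostEnv`, `checkCapabilities`, and `parseKernelVersion` are hypothetical helpers, and how the agent actually performs this check is not described on this page.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// CollectorCapabilities is copied from the definition above.
type CollectorCapabilities struct {
	SupportsOneShot    bool
	SupportsContinuous bool
	RequiresRoot       bool
	RequiresEBPF       bool
	MinKernelVersion   string
}

// hostEnv describes the running host (hypothetical type).
type hostEnv struct {
	KernelVersion string // e.g. "5.15.0-91-generic"
	IsRoot        bool
	HasEBPF       bool
}

// parseKernelVersion extracts major and minor from "3.10" or "5.15.0-91-generic".
func parseKernelVersion(v string) (major, minor int, err error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("invalid kernel version %q", v)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("invalid kernel version %q: %w", v, err)
	}
	// The minor part may carry a suffix such as "1-rc3"; keep leading digits.
	digits := parts[1]
	if i := strings.IndexFunc(digits, func(r rune) bool { return r < '0' || r > '9' }); i >= 0 {
		digits = digits[:i]
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, fmt.Errorf("invalid kernel version %q: %w", v, err)
	}
	return major, minor, nil
}

// checkCapabilities returns nil if the host satisfies every requirement.
func checkCapabilities(caps CollectorCapabilities, env hostEnv) error {
	if caps.RequiresRoot && !env.IsRoot {
		return fmt.Errorf("collector requires root privileges")
	}
	if caps.RequiresEBPF && !env.HasEBPF {
		return fmt.Errorf("collector requires eBPF support")
	}
	if caps.MinKernelVersion == "" {
		return nil
	}
	wantMajor, wantMinor, err := parseKernelVersion(caps.MinKernelVersion)
	if err != nil {
		return err
	}
	gotMajor, gotMinor, err := parseKernelVersion(env.KernelVersion)
	if err != nil {
		return err
	}
	if gotMajor < wantMajor || (gotMajor == wantMajor && gotMinor < wantMinor) {
		return fmt.Errorf("kernel %s is older than required %s", env.KernelVersion, caps.MinKernelVersion)
	}
	return nil
}
```

Versions are compared numerically rather than as strings because a plain string comparison would rank "3.2" above "3.10".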

Collector Registration

Collectors are registered at startup:

// Register a point collector, wrapped so it can run continuously (as the helper's name suggests)
performance.Register(
    performance.MetricTypeCPU,
    performance.PartialNewContinuousPointCollector(
        collectors.NewCPUCollector,
    ),
)

// Register a continuous collector
performance.Register(
    performance.MetricTypeExecsnoop,
    collectors.NewExecsnoopCollector,
)
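
The registration calls above imply a mapping from metric type to a constructor. A minimal sketch of such a registry follows. It is an assumption about the general shape, not the agent's implementation: the `registry` type, the simplified `Factory` signature (the real constructors also take a logger and config), and the error-returning `Register` are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricType identifies a collector, e.g. "cpu" or "execsnoop".
type MetricType string

// Collector is reduced to Name() to keep the sketch short.
type Collector interface {
	Name() string
}

// Factory builds a collector instance (simplified signature).
type Factory func() (Collector, error)

// registry maps metric types to factories and is safe for concurrent use.
type registry struct {
	mu        sync.RWMutex
	factories map[MetricType]Factory
}

func newRegistry() *registry {
	return &registry{factories: make(map[MetricType]Factory)}
}

// Register stores a factory and rejects nil factories and duplicates.
func (r *registry) Register(t MetricType, f Factory) error {
	if f == nil {
		return fmt.Errorf("nil factory for metric type %q", t)
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.factories[t]; exists {
		return fmt.Errorf("metric type %q already registered", t)
	}
	r.factories[t] = f
	return nil
}

// New builds a collector for the given metric type.
func (r *registry) New(t MetricType) (Collector, error) {
	r.mu.RLock()
	f, ok := r.factories[t]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no collector registered for metric type %q", t)
	}
	return f()
}

// namedCollector is a trivial Collector used for demonstration.
type namedCollector string

func (n namedCollector) Name() string { return string(n) }
```

Storing factories rather than instances defers construction until a collector is actually enabled, so a missing data source only matters for collectors that are in use.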

Data Collection Flow

graph LR
    A[Manager] --> B[Collector] --> C[Store]
    A --> D[Schedule Collection]
    B --> E[Read Data from Source]
    C --> F[Aggregate & Buffer]

Configuration

Global Configuration

performance:
  enabled: true
  interval: 10s
  collectors:
    - cpu
    - memory
    - disk
    - network

Per-Collector Configuration

collectors:
  cpu:
    enabled: true
    interval: 10s
    per_core: true
  
  memory:
    enabled: true
    interval: 10s
    include_swap: true
  
  process:
    enabled: true
    interval: 30s
    top_count: 20

Writing Custom Collectors

Basic Structure

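package collectors

import (
    "context"

    "github.com/antimetal/system-agent/pkg/performance"
)

// MyMetrics is the payload returned by MyCollector.Collect.
type MyMetrics struct {
    Value int
}

// MyCollector is a minimal point collector.
// CollectionConfig and CollectorCapabilities are assumed to be defined in the
// performance package imported above; Logger is the agent's logger type.
type MyCollector struct {
    logger Logger
    config performance.CollectionConfig
}

func NewMyCollector(logger Logger, config performance.CollectionConfig) (*MyCollector, error) {
    return &MyCollector{
        logger: logger,
        config: config,
    }, nil
}

// Name returns the collector's identifier.
func (c *MyCollector) Name() string {
    return "my_collector"
}

// Capabilities declares what the collector supports and requires.
func (c *MyCollector) Capabilities() performance.CollectorCapabilities {
    return performance.CollectorCapabilities{
        SupportsOneShot:    true,
        SupportsContinuous: false,
        RequiresRoot:       false,
        RequiresEBPF:       false,
        MinKernelVersion:   "3.10",
    }
}

// Collect gathers one sample. Replace the placeholder with real collection logic.
func (c *MyCollector) Collect(ctx context.Context) (any, error) {
    // Stop early if the caller has cancelled or timed out.
    if err := ctx.Err(); err != nil {
        return nil, err
    }
    data := &MyMetrics{
        Value: 42,
    }
    return data, nil
}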

Best Practices

Error Handling
- Return meaningful errors
- Log warnings for non-fatal issues
- Gracefully handle missing data sources
Performance
- Minimize allocations
- Reuse buffers
- Cache file handles when appropriate
Context Handling
- Check context cancellation
- Respect timeouts
- Clean up resources
Testing
- Mock file systems
- Test error conditions
- Benchmark performance

Collector Lifecycle

Point Collectors

Initialization - Create collector instance
Collection - Call Collect() method
Cleanup - Automatic after collection

Continuous Collectors

Initialization - Create collector instance
Start - Begin background collection
Running - Periodic data collection
Stop - Graceful shutdown
Cleanup - Release resources

Monitoring Collectors

Metrics

Each collector exposes Prometheus-style metrics:

# Collection success/failure
antimetal_collector_collections_total{collector="cpu",status="success"} 1234
antimetal_collector_collections_total{collector="cpu",status="error"} 5

# Collection duration
antimetal_collector_duration_seconds{collector="cpu",quantile="0.5"} 0.001
antimetal_collector_duration_seconds{collector="cpu",quantile="0.99"} 0.005

# Last collection timestamp
antimetal_collector_last_collection_timestamp{collector="cpu"} 1642598400

Health Checks

// Check collector health
health := manager.GetCollectorHealth("cpu")
if !health.Healthy {
    log.Errorf("CPU collector unhealthy: %v", health.LastError)
}

Troubleshooting

Common Issues

Permission Denied
- Check if collector requires root
- Verify file permissions
- Check SELinux/AppArmor policies
File Not Found
- Verify kernel version compatibility
- Check whether the agent is running in a container
- Ensure /proc and /sys are mounted correctly
High CPU Usage
- Increase collection interval
- Reduce number of metrics
- Check for inefficient parsing

Debug Mode

Enable debug logging for collectors:

logging:
  level: debug
  collectors:
    - cpu
    - memory

Performance Considerations

Collection Overhead

Collector	CPU Impact	Memory Impact	I/O Impact
CPU	Negligible	Minimal	One file read
Memory	Negligible	Minimal	One file read
Process	Low-Medium	Proportional to process count	Multiple file reads per process
Network	Low	Minimal	One file read
Disk	Low	Minimal	One file read

Optimization Tips

Batch Operations - Collect multiple metrics in one pass
Caching - Cache static information
Filtering - Only collect needed metrics
Intervals - Lengthen the collection interval for collectors that do not need fine-grained data
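
The caching tip applies to values that do not change while the agent runs, such as hardware details. A minimal sketch using `sync.Once` follows. The generic `cached` type is a hypothetical helper and requires Go 1.18 or later.

```go
package main

import "sync"

// cached loads a value at most once and returns the stored result afterwards.
type cached[T any] struct {
	once sync.Once
	load func() (T, error)
	val  T
	err  error
}

// newCached wraps a loader function, for example one that parses /proc/cpuinfo.
func newCached[T any](load func() (T, error)) *cached[T] {
	return &cached[T]{load: load}
}

// Get runs the loader on the first call only. An error from that first call
// is cached as well, so use this only where a retry is not expected to help.
func (c *cached[T]) Get() (T, error) {
	c.once.Do(func() {
		c.val, c.err = c.load()
	})
	return c.val, c.err
}
```

For example, `newCached(readCoreCount)` with a hypothetical `readCoreCount func() (int, error)` would read the source once however many times `Get` is called, including from multiple goroutines.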

Future Collectors

Planned collectors include:

GPU Metrics - NVIDIA/AMD GPU utilization
Container Metrics - Docker/containerd statistics
Application Metrics - JVM, Python, Node.js metrics
Storage Metrics - Advanced filesystem statistics
Security Metrics - SELinux, AppArmor events