Performance Collectors - antimetal/system-agent GitHub Wiki

Performance Collectors

⚠️ Work in Progress: This documentation is currently being developed and may be incomplete or subject to change.

Overview

Performance collectors are the core components of the Antimetal System Agent responsible for gathering system metrics and hardware information. This page provides an overview of all available collectors, their architecture, and how to work with them.

Collector Architecture

Collector Types

The system supports two main collector types:

  1. Point Collectors - Gather metrics at a single point in time (one-shot collection)
  2. Continuous Collectors - Run continuously and calculate rates/deltas
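To make "rates/deltas" concrete, here is a minimal sketch of deriving CPU utilization from two successive samples of the aggregate `cpu` line in /proc/stat. The names `cpuSample`, `parseCPULine`, and `busyFraction` are hypothetical and are not taken from the agent's code; the sample lines are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuSample holds cumulative tick counts taken from the aggregate "cpu" line
// of /proc/stat at one moment. (Hypothetical type for this sketch.)
type cpuSample struct {
	total uint64 // user+nice+system+idle+iowait+irq+softirq+steal
	idle  uint64 // idle+iowait
}

// parseCPULine parses a line such as "cpu 100 0 50 800 50 0 0 0 0 0".
func parseCPULine(line string) (cpuSample, error) {
	fields := strings.Fields(line)
	if len(fields) < 5 || fields[0] != "cpu" {
		return cpuSample{}, fmt.Errorf("unexpected cpu line: %q", line)
	}
	var s cpuSample
	for i, f := range fields[1:] {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return cpuSample{}, fmt.Errorf("cpu field %d: %w", i+1, err)
		}
		if i < 8 { // guest and guest_nice are already counted in user and nice
			s.total += v
		}
		if i == 3 || i == 4 { // idle and iowait
			s.idle += v
		}
	}
	return s, nil
}

// busyFraction returns the share of non-idle time between two samples,
// in the range [0, 1]. It returns 0 if the counters did not advance.
func busyFraction(prev, cur cpuSample) float64 {
	if cur.total <= prev.total {
		return 0
	}
	dTotal := cur.total - prev.total
	var dIdle uint64
	if cur.idle > prev.idle {
		dIdle = cur.idle - prev.idle
	}
	if dIdle > dTotal {
		dIdle = dTotal
	}
	return float64(dTotal-dIdle) / float64(dTotal)
}

func main() {
	prev, _ := parseCPULine("cpu 100 0 50 800 50 0 0 0 0 0")
	cur, _ := parseCPULine("cpu 150 0 75 900 75 0 0 0 0 0")
	fmt.Printf("busy: %.1f%%\n", 100*busyFraction(prev, cur))
}
```

A point collector would return only the raw counters from one read; the delta between two reads is what requires a collector that keeps state across collections.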

Collector Interfaces

// Basic collector interface
type Collector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}

// Point collector for one-shot collection
type PointCollector interface {
    Collector
    Capabilities() CollectorCapabilities
}

// Continuous collector with lifecycle management
type ContinuousCollector interface {
    Collector
    Start(ctx context.Context) error
    Stop() error
}
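As an illustration of how these interfaces compose, the sketch below repeats them locally and uses a type switch to decide how a caller might drive a given collector. `uptimeCollector`, `eventCollector`, and `mode` are hypothetical names introduced for this example, and `CollectorCapabilities` is trimmed to a single field.

```go
package main

import (
	"context"
	"fmt"
)

// The interfaces below are copied from this page; CollectorCapabilities is
// trimmed to one field to keep the sketch short.
type CollectorCapabilities struct{ SupportsOneShot bool }

type Collector interface {
	Name() string
	Collect(ctx context.Context) (any, error)
}

type PointCollector interface {
	Collector
	Capabilities() CollectorCapabilities
}

type ContinuousCollector interface {
	Collector
	Start(ctx context.Context) error
	Stop() error
}

// uptimeCollector is a hypothetical point collector.
type uptimeCollector struct{}

func (uptimeCollector) Name() string                         { return "uptime" }
func (uptimeCollector) Collect(context.Context) (any, error) { return 42.0, nil }
func (uptimeCollector) Capabilities() CollectorCapabilities {
	return CollectorCapabilities{SupportsOneShot: true}
}

// eventCollector is a hypothetical continuous collector.
type eventCollector struct{ running bool }

func (e *eventCollector) Name() string                         { return "events" }
func (e *eventCollector) Collect(context.Context) (any, error) { return e.running, nil }
func (e *eventCollector) Start(context.Context) error          { e.running = true; return nil }
func (e *eventCollector) Stop() error                          { e.running = false; return nil }

// Compile-time checks that each type satisfies the intended interface.
var (
	_ PointCollector      = uptimeCollector{}
	_ ContinuousCollector = (*eventCollector)(nil)
)

// mode reports which of the two interfaces a collector satisfies.
func mode(c Collector) string {
	switch c.(type) {
	case ContinuousCollector:
		return "continuous"
	case PointCollector:
		return "point"
	default:
		return "unknown"
	}
}

func main() {
	for _, c := range []Collector{uptimeCollector{}, &eventCollector{}} {
		fmt.Printf("%s: %s\n", c.Name(), mode(c))
	}
}
```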

Available Collectors

System Metrics Collectors

| Collector | Type | Source | Collection Mode | Description |
| --- | --- | --- | --- | --- |
| CPU Collector | cpu | /proc/stat | Continuous | CPU usage and time distribution |
| Memory Collector | memory | /proc/meminfo | Continuous | Memory usage, buffers, cache |
| Load Collector | load | /proc/loadavg | Continuous | System load averages |
| Network Collector | network | /proc/net/dev | Continuous | Network interface statistics |
| Disk Collector | disk | /proc/diskstats | Continuous | Disk I/O statistics |
| TCP Collector | tcp | /proc/net/tcp | Point | TCP connection states |
| Process Collector | process | /proc/[pid]/* | Continuous | Per-process metrics |
| Kernel Collector | kernel | /proc/sys/kernel | Point | Kernel parameters |

Hardware Information Collectors

| Collector | Type | Source | Collection Mode | Description |
| --- | --- | --- | --- | --- |
| CPU Info Collector | cpu_info | /proc/cpuinfo | One-shot | CPU hardware details |
| Memory Info Collector | memory_info | /proc/meminfo | One-shot | Memory hardware configuration |
| Disk Info Collector | disk_info | /sys/block | One-shot | Disk hardware details |
| Network Info Collector | network_info | /sys/class/net | One-shot | Network hardware info |
| NUMA Collector | numa | /sys/devices/system/node | Mixed | NUMA topology and stats |

eBPF Collectors

| Collector | Type | Source | Collection Mode | Description |
| --- | --- | --- | --- | --- |
| Execsnoop Collector | execsnoop | eBPF | Continuous | Process execution tracking |

Collector Capabilities

Each collector declares its capabilities:

type CollectorCapabilities struct {
    SupportsOneShot    bool   // Can do point-in-time collection
    SupportsContinuous bool   // Can run continuously
    RequiresRoot       bool   // Needs root privileges
    RequiresEBPF       bool   // Needs eBPF support
    MinKernelVersion   string // Minimum Linux kernel version
}
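One way such declared capabilities could be checked against the host before a collector is enabled is sketched below. The `hostEnv` type and the `kernelAtLeast` and `checkCapabilities` helpers are assumptions made for this example; the page does not describe how the agent performs this check. Versions are compared numerically by major and minor number, because a plain string comparison would rank "3.2" above "3.10".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// CollectorCapabilities mirrors the struct shown on this page.
type CollectorCapabilities struct {
	SupportsOneShot    bool
	SupportsContinuous bool
	RequiresRoot       bool
	RequiresEBPF       bool
	MinKernelVersion   string
}

// hostEnv describes the machine the agent runs on (hypothetical type).
type hostEnv struct {
	IsRoot  bool
	HasEBPF bool
	Kernel  string // e.g. "5.15.0-91-generic"
}

// leadingInt parses the decimal digits at the start of s.
func leadingInt(s string) (int, error) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	return strconv.Atoi(s[:end])
}

// majorMinor extracts the first two numeric components of a kernel version.
func majorMinor(v string) (int, int, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed kernel version %q", v)
	}
	major, err := leadingInt(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("malformed kernel version %q", v)
	}
	minor, err := leadingInt(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("malformed kernel version %q", v)
	}
	return major, minor, nil
}

// kernelAtLeast reports whether current >= minimum, comparing major.minor.
// An empty minimum means "no requirement".
func kernelAtLeast(current, minimum string) (bool, error) {
	if minimum == "" {
		return true, nil
	}
	cMaj, cMin, err := majorMinor(current)
	if err != nil {
		return false, err
	}
	mMaj, mMin, err := majorMinor(minimum)
	if err != nil {
		return false, err
	}
	if cMaj != mMaj {
		return cMaj > mMaj, nil
	}
	return cMin >= mMin, nil
}

// checkCapabilities returns nil if env satisfies caps, or the first unmet requirement.
func checkCapabilities(caps CollectorCapabilities, env hostEnv) error {
	if caps.RequiresRoot && !env.IsRoot {
		return errors.New("collector requires root privileges")
	}
	if caps.RequiresEBPF && !env.HasEBPF {
		return errors.New("collector requires eBPF support")
	}
	ok, err := kernelAtLeast(env.Kernel, caps.MinKernelVersion)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("kernel %s is older than required %s", env.Kernel, caps.MinKernelVersion)
	}
	return nil
}

func main() {
	caps := CollectorCapabilities{RequiresEBPF: true, MinKernelVersion: "3.10"}
	fmt.Println(checkCapabilities(caps, hostEnv{HasEBPF: true, Kernel: "5.15.0-91-generic"}))
	fmt.Println(checkCapabilities(caps, hostEnv{HasEBPF: true, Kernel: "3.2.0"}))
}
```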

Collector Registration

Collectors are registered at startup:

// Register a point collector, wrapped so that it can be run continuously
performance.Register(
    performance.MetricTypeCPU,
    performance.PartialNewContinuousPointCollector(
        collectors.NewCPUCollector,
    ),
)

// Register a continuous collector
performance.Register(
    performance.MetricTypeExecsnoop,
    collectors.NewExecsnoopCollector,
)
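The general pattern behind a call like `performance.Register` is a registry that maps a metric type to a factory function. The sketch below is a simplified model of that pattern, not the agent's implementation: `Registry`, `Factory`, and their method signatures are assumptions, and the real factory signature is not shown on this page.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricType identifies a collector kind, e.g. "cpu".
type MetricType string

// Collector is reduced to Name() for this sketch.
type Collector interface{ Name() string }

// Factory builds a collector on demand (hypothetical signature).
type Factory func() (Collector, error)

// Registry maps metric types to factories and is safe for concurrent use.
type Registry struct {
	mu        sync.RWMutex
	factories map[MetricType]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[MetricType]Factory)}
}

// Register adds a factory; registering the same type twice is an error.
func (r *Registry) Register(t MetricType, f Factory) error {
	if f == nil {
		return fmt.Errorf("nil factory for %q", t)
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.factories[t]; exists {
		return fmt.Errorf("collector %q already registered", t)
	}
	r.factories[t] = f
	return nil
}

// New instantiates the collector registered for t.
func (r *Registry) New(t MetricType) (Collector, error) {
	r.mu.RLock()
	f, ok := r.factories[t]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("no collector registered for %q", t)
	}
	return f()
}

// named is a trivial Collector used for the demonstration.
type named string

func (n named) Name() string { return string(n) }

func main() {
	reg := NewRegistry()
	if err := reg.Register("cpu", func() (Collector, error) { return named("cpu"), nil }); err != nil {
		fmt.Println("register:", err)
		return
	}
	c, err := reg.New("cpu")
	if err != nil {
		fmt.Println("new:", err)
		return
	}
	fmt.Println("created collector:", c.Name())
}
```

Storing factories rather than instances lets the manager defer construction until a collector is actually enabled.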

Data Collection Flow

graph LR
    A[Manager] --> B[Collector] --> C[Store]
    A --> D[Schedule Collection]
    B --> E[Read Data from Source]
    C --> F[Aggregate & Buffer]

Configuration

Global Configuration

performance:
  enabled: true
  interval: 10s
  collectors:
    - cpu
    - memory
    - disk
    - network

Per-Collector Configuration

collectors:
  cpu:
    enabled: true
    interval: 10s
    per_core: true
  
  memory:
    enabled: true
    interval: 10s
    include_swap: true
  
  process:
    enabled: true
    interval: 30s
    top_count: 20

Writing Custom Collectors

Basic Structure

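package collectors

import (
    "context"

    "github.com/antimetal/system-agent/pkg/performance"
)

// MyMetrics is the data returned by MyCollector.Collect.
type MyMetrics struct {
    Value int
}

// MyCollector is a minimal point collector. Logger stands for the logger
// type the collectors package already uses; it is not defined in this snippet.
// CollectionConfig and CollectorCapabilities are assumed to live in the
// imported performance package.
type MyCollector struct {
    logger Logger
    config performance.CollectionConfig
}

// NewMyCollector builds the collector; it has the constructor shape passed
// to performance.Register.
func NewMyCollector(logger Logger, config performance.CollectionConfig) (*MyCollector, error) {
    return &MyCollector{
        logger: logger,
        config: config,
    }, nil
}

// Name returns the identifier used in configuration and metrics labels.
func (c *MyCollector) Name() string {
    return "my_collector"
}

// Capabilities declares what the collector supports and requires.
func (c *MyCollector) Capabilities() performance.CollectorCapabilities {
    return performance.CollectorCapabilities{
        SupportsOneShot:    true,
        SupportsContinuous: false,
        RequiresRoot:       false,
        MinKernelVersion:   "3.10",
    }
}

// Collect gathers the metrics once and returns them.
func (c *MyCollector) Collect(ctx context.Context) (any, error) {
    // Stop early if the caller has cancelled or timed out.
    if err := ctx.Err(); err != nil {
        return nil, err
    }
    // Implementation: read and parse the data source here.
    data := &MyMetrics{
        Value: 42,
    }
    return data, nil
}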

Best Practices

  1. Error Handling

    • Return meaningful errors
    • Log warnings for non-fatal issues
    • Gracefully handle missing data sources
  2. Performance

    • Minimize allocations
    • Reuse buffers
    • Cache file handles when appropriate
  3. Context Handling

    • Check context cancellation
    • Respect timeouts
    • Clean up resources
  4. Testing

    • Mock file systems
    • Test error conditions
    • Benchmark performance

Collector Lifecycle

Point Collectors

  1. Initialization - Create collector instance
  2. Collection - Call Collect() method
  3. Cleanup - Happens automatically after collection; no explicit stop call is needed

Continuous Collectors

  1. Initialization - Create collector instance
  2. Start - Begin background collection
  3. Running - Periodic data collection
  4. Stop - Graceful shutdown
  5. Cleanup - Release resources

Monitoring Collectors

Metrics

Each collector exposes metrics:

# Collection success/failure
antimetal_collector_collections_total{collector="cpu",status="success"} 1234
antimetal_collector_collections_total{collector="cpu",status="error"} 5

# Collection duration
antimetal_collector_duration_seconds{collector="cpu",quantile="0.5"} 0.001
antimetal_collector_duration_seconds{collector="cpu",quantile="0.99"} 0.005

# Last collection timestamp
antimetal_collector_last_collection_timestamp{collector="cpu"} 1642598400

Health Checks

// Check collector health
health := manager.GetCollectorHealth("cpu")
if !health.Healthy {
    log.Errorf("CPU collector unhealthy: %v", health.LastError)
}

Troubleshooting

Common Issues

  1. Permission Denied

    • Check if collector requires root
    • Verify file permissions
    • Check SELinux/AppArmor policies
  2. File Not Found

    • Verify kernel version compatibility
    • Check if running in container
    • Ensure /proc and /sys are mounted correctly
  3. High CPU Usage

    • Increase collection interval
    • Reduce number of metrics
    • Check for inefficient parsing

Debug Mode

Enable debug logging for collectors:

logging:
  level: debug
  collectors:
    - cpu
    - memory

Performance Considerations

Collection Overhead

| Collector | CPU Impact | Memory Impact | I/O Impact |
| --- | --- | --- | --- |
| CPU | Negligible | Minimal | One file read |
| Memory | Negligible | Minimal | One file read |
| Process | Low-Medium | Proportional to processes | Multiple file reads |
| Network | Low | Minimal | One file read |
| Disk | Low | Minimal | One file read |

Optimization Tips

  1. Batch Operations - Collect multiple metrics in one pass
  2. Caching - Cache static information
  3. Filtering - Only collect needed metrics
  4. Intervals - Adjust based on requirements
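As an example of the caching tip, static information such as the CPU model only needs to be read once per process. A minimal sketch using `sync.Once` follows; `staticCache` is a hypothetical helper, and the loader here returns a fixed string instead of reading a real file.

```go
package main

import (
	"fmt"
	"sync"
)

// staticCache loads a value on first use and returns the cached result
// afterwards. Note that a load error is cached as well, so this suits data
// sources that are either present for the process lifetime or not at all.
type staticCache struct {
	once sync.Once
	load func() (string, error)
	val  string
	err  error
}

// Get returns the cached value, invoking load only on the first call.
func (c *staticCache) Get() (string, error) {
	c.once.Do(func() { c.val, c.err = c.load() })
	return c.val, c.err
}

func main() {
	reads := 0
	cpuModel := &staticCache{load: func() (string, error) {
		reads++ // stands in for an expensive file read and parse
		return "Example CPU Model", nil
	}}
	for i := 0; i < 3; i++ {
		model, _ := cpuModel.Get()
		fmt.Println(model)
	}
	fmt.Println("loader calls:", reads)
}
```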

Future Collectors

Planned collectors include:

  • GPU Metrics - NVIDIA/AMD GPU utilization
  • Container Metrics - Docker/containerd statistics
  • Application Metrics - JVM, Python, Node.js metrics
  • Storage Metrics - Advanced filesystem statistics
  • Security Metrics - SELinux, AppArmor events
