Data Flow

This document describes how data moves through the Antimetal System Agent, from collection sources to the Antimetal platform. Understanding these flows is essential for troubleshooting, performance optimization, and extending the system.

Overview

The System Agent processes two primary types of data:

  1. Kubernetes Resources - Objects from the K8s API (pods, nodes, services, etc.)
  2. Performance Metrics - System metrics from Linux kernel interfaces

Both data types flow through the Resource Store event system before being streamed to the Antimetal platform via gRPC.

Primary Data Flows

1. Kubernetes Resource Collection Flow

flowchart LR
    K8S["Kubernetes<br/>API Server<br/><br/>REST API<br/>HTTP/JSON<br/>Resources"] --> INF["Informers<br/>(Watchers)<br/><br/>Add/Update/<br/>Delete<br/>Events"]
    INF --> CTRL["Controller<br/>Reconcile<br/><br/>Normalize<br/>Transform<br/>Enrich"]
    CTRL --> STORE["Resource<br/>Store<br/><br/>Store<br/>BadgerDB<br/>Transaction"]

Detailed Steps:

  1. Kubernetes API - The System Agent watches K8s resources using informers
  2. Informers - Cache and detect changes, emit Add/Update/Delete events
  3. Controller Reconcile - Process events, normalize data, add metadata
  4. Resource Store - Persist to BadgerDB, emit internal events for subscribers
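
A minimal client-go sketch of steps 1-2, assuming a pod informer; the agent's actual controllers cover a broader resource set, so treat the handler bodies as placeholders:

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

// watchPods wires a shared informer that emits Add/Update/Delete events for
// pods; the handlers would hand objects to the controller reconcile step.
func watchPods(stopCh <-chan struct{}) error {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }

    factory := informers.NewSharedInformerFactory(client, 30*time.Second)
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { /* enqueue Add for reconciliation */ },
        UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue Update */ },
        DeleteFunc: func(obj interface{}) { /* enqueue Delete */ },
    })

    factory.Start(stopCh)
    factory.WaitForCacheSync(stopCh)
    return nil
}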

2. Performance Metrics Collection Flow

flowchart LR
    KERNEL["Linux<br/>Kernel<br/><br/>/proc/stat<br/>/proc/meminfo<br/>/sys/block<br/>eBPF progs"] --> COLL["Collectors<br/>(14 types)<br/><br/>Parse &<br/>Structure<br/>Data"]
    COLL --> PERF["Performance<br/>Manager<br/><br/>Batch &<br/>Aggregate<br/>Metrics<br/>Transforms"]
    PERF --> STORE["Resource<br/>Store<br/><br/>Store as<br/>Events<br/>(DB)"]

Detailed Steps:

  1. Linux Kernel - Data sources: /proc, /sys filesystems, eBPF programs
  2. Collectors - 14 different collectors parse and structure the raw data
  3. Performance Manager - Coordinates collection, batches metrics
  4. Resource Store - Stores metrics as events for upstream consumption
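
The collector contract below is a hypothetical sketch (the agent's real interface may differ); it illustrates the parse-and-structure step using /proc/meminfo as an example:

import (
    "context"
    "os"
    "strconv"
    "strings"
)

// Collector is a hypothetical shape for the 14 collectors described above.
type Collector interface {
    Name() string
    Collect(ctx context.Context) (interface{}, error)
}

// memInfoCollector reads /proc/meminfo and returns values (in kB) by field name.
type memInfoCollector struct{}

func (memInfoCollector) Name() string { return "meminfo" }

func (memInfoCollector) Collect(_ context.Context) (interface{}, error) {
    raw, err := os.ReadFile("/proc/meminfo")
    if err != nil {
        return nil, err // missing proc files degrade gracefully upstream
    }
    stats := make(map[string]uint64)
    for _, line := range strings.Split(string(raw), "\n") {
        fields := strings.Fields(line)
        if len(fields) < 2 {
            continue
        }
        if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
            stats[strings.TrimSuffix(fields[0], ":")] = v
        }
    }
    return stats, nil
}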

3. Event Processing and Routing Flow

flowchart LR
    STORE["Resource<br/>Store<br/><br/>Add/Update/<br/>Delete<br/>Operations"] --> EVENT["Event<br/>Generation<br/><br/>Event with<br/>Resource<br/>Metadata"]
    EVENT --> FILTER["Type<br/>Filtering<br/><br/>Route to<br/>Interested<br/>Subscribers"]
    FILTER --> SUB["Subscriber<br/>Components<br/><br/>Intake<br/>Worker<br/>(Primary)"]

Detailed Steps:

  1. Resource Store Transaction - Any change to stored data triggers an event
  2. Event Generation - Creates an event with the operation type and resource data
  3. Type Filtering - Routes events to subscribers based on resource type
  4. Subscribers - Intake Worker receives events for upstream transmission
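
A simplified sketch of type-based routing, assuming a broker that fans events out to per-type subscriber channels (the agent's real event API is not shown here):

import "sync"

type Operation string

const (
    OpAdd    Operation = "add"
    OpUpdate Operation = "update"
    OpDelete Operation = "delete"
)

// Event pairs an operation with the affected resource and its type.
type Event struct {
    Op       Operation
    Type     string // e.g. "Pod" or "cpu_stats"
    Resource interface{}
}

// Broker routes events only to subscribers interested in that resource type.
type Broker struct {
    mu   sync.RWMutex
    subs map[string][]chan Event
}

func NewBroker() *Broker {
    return &Broker{subs: make(map[string][]chan Event)}
}

// Subscribe registers interest in one resource type and returns a buffered channel.
func (b *Broker) Subscribe(resourceType string, buffer int) <-chan Event {
    ch := make(chan Event, buffer)
    b.mu.Lock()
    defer b.mu.Unlock()
    b.subs[resourceType] = append(b.subs[resourceType], ch)
    return ch
}

// Publish delivers an event to every subscriber of its type without blocking.
func (b *Broker) Publish(ev Event) {
    b.mu.RLock()
    defer b.mu.RUnlock()
    for _, ch := range b.subs[ev.Type] {
        select {
        case ch <- ev:
        default: // slow subscriber: drop here, or block to apply back-pressure
        }
    }
}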

4. Upstream Data Transmission Flow

flowchart LR
    EVENTS["Event<br/>Subscribers<br/><br/>K8s + Perf<br/>Events<br/>(Mixed)"] --> INTAKE["Intake<br/>Worker<br/><br/>Convert to<br/>Protobuf<br/>Messages"]
    INTAKE --> BATCH["Batch<br/>Queue<br/><br/>Size/Time<br/>Based<br/>Batching"]
    BATCH --> PLATFORM["Antimetal<br/>Platform<br/><br/>gRPC Stream<br/>with TLS<br/>& Auth"]

Detailed Steps:

  1. Event Subscribers - Receive both K8s resource and performance events
  2. Intake Worker - Converts events to protobuf messages for transmission
  3. Batch Queue - Accumulates messages based on size/time thresholds
  4. Antimetal Platform - Receives data via secure gRPC stream
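
A hedged sketch of the intake path, reusing the Event type from the routing sketch above; encodeEvent is a hypothetical stand-in for the agent's actual protobuf conversion:

import "context"

// encodeEvent is a hypothetical stand-in for converting an Event into the
// agent's protobuf wire format.
var encodeEvent func(ev Event) ([]byte, error)

// runIntakeWorker consumes mixed K8s and performance events, encodes each
// one, and hands the payload to the batch queue for size/time batching.
func runIntakeWorker(ctx context.Context, events <-chan Event, batch chan<- []byte) error {
    for {
        select {
        case ev, ok := <-events:
            if !ok {
                return nil
            }
            payload, err := encodeEvent(ev)
            if err != nil {
                continue // skip events that cannot be encoded; not fatal
            }
            batch <- payload
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}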

Data Types and Transformations

Kubernetes Resource Data

Input Format (from K8s API):

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "example-pod",
    "namespace": "default",
    "labels": {...},
    "annotations": {...}
  },
  "spec": {...},
  "status": {...}
}

Normalized Format (in Resource Store):

type Resource struct {
    ID          string            // unique identifier within the store
    Type        string            // resource type, e.g. "Pod"
    Name        string
    Namespace   string
    Cluster     string            // added during enrichment
    Region      string            // added during enrichment
    Provider    string            // added during enrichment
    Metadata    map[string]string
    Spec        interface{}
    Status      interface{}
    Timestamp   time.Time
}

Performance Metrics Data

Input Format (from /proc/stat):

cpu  1234 56 789 10000 200 30 40 50 60 70
cpu0 600 30 400 5000 100 15 20 25 30 35

Structured Format (from CPU Collector):

// CPUStats holds cumulative CPU time counters parsed from /proc/stat,
// measured in USER_HZ ticks (jiffies).
type CPUStats struct {
    CPUIndex    int32
    User        uint64
    Nice        uint64
    System      uint64
    Idle        uint64
    IOWait      uint64
    IRQ         uint64
    SoftIRQ     uint64
    Steal       uint64
    Guest       uint64
    GuestNice   uint64
}
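
A minimal parser sketch for the per-CPU lines above, producing the CPUStats struct; the agent's real collector also handles the aggregate "cpu" line and any extra fields:

import (
    "fmt"
    "strconv"
    "strings"
)

// parseCPULine converts one "cpuN ..." line from /proc/stat into CPUStats.
func parseCPULine(line string) (CPUStats, error) {
    fields := strings.Fields(line)
    if len(fields) < 11 || !strings.HasPrefix(fields[0], "cpu") {
        return CPUStats{}, fmt.Errorf("unexpected /proc/stat line: %q", line)
    }
    idx, err := strconv.ParseInt(strings.TrimPrefix(fields[0], "cpu"), 10, 32)
    if err != nil {
        return CPUStats{}, fmt.Errorf("not a per-CPU line: %q", line)
    }
    vals := make([]uint64, 10)
    for i := range vals {
        if vals[i], err = strconv.ParseUint(fields[i+1], 10, 64); err != nil {
            return CPUStats{}, err
        }
    }
    return CPUStats{
        CPUIndex:  int32(idx),
        User:      vals[0],
        Nice:      vals[1],
        System:    vals[2],
        Idle:      vals[3],
        IOWait:    vals[4],
        IRQ:       vals[5],
        SoftIRQ:   vals[6],
        Steal:     vals[7],
        Guest:     vals[8],
        GuestNice: vals[9],
    }, nil
}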

Error Handling and Retry Logic

Collection Failures

  • Informer Disconnection: Automatic reconnection with exponential backoff
  • Missing Proc Files: Graceful degradation, log warnings, continue collection
  • Collector Errors: Individual collector failures don't affect others

Transmission Failures

  • Network Issues: Exponential backoff with jitter
  • gRPC Stream Errors: Automatic stream recreation
  • Batch Failures: Retry individual messages, dead letter queue for persistent failures
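
As a sketch, retrying transmission with exponential backoff and full jitter might look like the following (attempt limits and delays are illustrative, not the agent's actual defaults):

import (
    "context"
    "math/rand"
    "time"
)

// retryWithBackoff retries fn with exponentially growing, jittered delays.
func retryWithBackoff(ctx context.Context, maxAttempts int, fn func() error) error {
    const maxDelay = 30 * time.Second
    delay := 500 * time.Millisecond
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = fn(); err == nil {
            return nil
        }
        // Full jitter: sleep a random duration in [0, delay).
        sleep := time.Duration(rand.Int63n(int64(delay)))
        select {
        case <-time.After(sleep):
        case <-ctx.Done():
            return ctx.Err()
        }
        if delay *= 2; delay > maxDelay {
            delay = maxDelay
        }
    }
    return err
}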

Performance Considerations

Batching Strategy

  • Size Threshold: 1MB default batch size
  • Time Threshold: 30 second maximum batch age
  • Adaptive: Adjusts based on event rate and network conditions
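
A sketch of the size/time flush decision using the defaults above (1MB, 30 seconds); the adaptive tuning mentioned in the last bullet is omitted:

import (
    "context"
    "time"
)

// flushLoop accumulates encoded messages and flushes a batch when it reaches
// maxBytes or on a periodic timer, whichever comes first.
func flushLoop(ctx context.Context, in <-chan []byte, send func([][]byte) error) {
    const (
        maxBytes = 1 << 20          // 1MB size threshold
        maxAge   = 30 * time.Second // time threshold
    )
    var batch [][]byte
    var size int

    flush := func() {
        if len(batch) == 0 {
            return
        }
        _ = send(batch) // retries/dead-lettering handled by the sender
        batch, size = nil, 0
    }

    ticker := time.NewTicker(maxAge)
    defer ticker.Stop()
    for {
        select {
        case msg, ok := <-in:
            if !ok {
                flush()
                return
            }
            batch = append(batch, msg)
            size += len(msg)
            if size >= maxBytes {
                flush()
            }
        case <-ticker.C:
            flush() // time-based flush even when the size threshold is not hit
        case <-ctx.Done():
            flush()
            return
        }
    }
}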

Memory Management

  • Event Queues: Bounded queues with back-pressure
  • Batch Buffers: Pre-allocated pools to reduce GC pressure
  • Resource Store: LRU eviction for old data
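
Two small illustrations of these points, assuming the Event type from the routing sketch and a sync.Pool for batch buffers (the agent's actual sizes and pooling strategy may differ):

import (
    "context"
    "sync"
)

// enqueue sends to a bounded (buffered) queue: the buffer caps memory, and a
// blocking send applies back-pressure to producers when consumers fall behind.
func enqueue(ctx context.Context, q chan<- Event, ev Event) error {
    select {
    case q <- ev:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// Pre-allocated batch buffers, reused between flushes to reduce GC pressure.
var batchBufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 0, 1<<20) },
}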

Throughput Optimization

  • Concurrent Processing: Parallel collector execution
  • Pipeline Parallelism: Overlapping collection, processing, and transmission
  • Compression: gRPC message compression reduces bandwidth
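
A sketch of parallel collector execution, reusing the hypothetical Collector interface from the earlier collector example; one failing collector does not affect the others:

import (
    "context"
    "log"
    "sync"
)

// collectAll runs every collector concurrently and gathers successful results.
func collectAll(ctx context.Context, collectors []Collector) map[string]interface{} {
    var (
        mu      sync.Mutex
        wg      sync.WaitGroup
        results = make(map[string]interface{}, len(collectors))
    )
    for _, c := range collectors {
        wg.Add(1)
        go func(c Collector) {
            defer wg.Done()
            data, err := c.Collect(ctx)
            if err != nil {
                log.Printf("collector %s failed: %v", c.Name(), err)
                return
            }
            mu.Lock()
            results[c.Name()] = data
            mu.Unlock()
        }(c)
    }
    wg.Wait()
    return results
}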

Monitoring Data Flow

Key Metrics

  • events_generated_total{type, operation} - Events created, by type and operation
  • batch_queue_size - Current items waiting for transmission
  • upload_duration_seconds - Time to transmit batches
  • collector_errors_total{collector} - Collection failure rates
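
Assuming these are exposed as Prometheus metrics, the definitions might look like the following (label sets and help text are illustrative):

import "github.com/prometheus/client_golang/prometheus"

var (
    eventsGenerated = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "events_generated_total", Help: "Events created, by type and operation."},
        []string{"type", "operation"},
    )
    batchQueueSize = prometheus.NewGauge(
        prometheus.GaugeOpts{Name: "batch_queue_size", Help: "Items currently waiting for transmission."},
    )
    uploadDuration = prometheus.NewHistogram(
        prometheus.HistogramOpts{Name: "upload_duration_seconds", Help: "Time to transmit a batch upstream."},
    )
    collectorErrors = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "collector_errors_total", Help: "Collection failures, by collector."},
        []string{"collector"},
    )
)

func init() {
    prometheus.MustRegister(eventsGenerated, batchQueueSize, uploadDuration, collectorErrors)
}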

Troubleshooting

  • Slow Collection: Check collector-specific metrics and /proc filesystem access
  • Event Backlog: Monitor batch queue size and upstream connectivity
  • Missing Data: Verify informer health and RBAC permissions

Next Steps


This document describes the data flow patterns as implemented. For configuration options, see the Configuration Guide.
