
Architecture Overview

The Antimetal System Agent is built with a modular, event-driven architecture designed for reliability, performance, and extensibility.

Core Design Principles

  1. Event-Driven Architecture - Components communicate through events and channels
  2. Separation of Concerns - Each component has a single, well-defined responsibility
  3. Cloud-Agnostic Design - Abstractions allow for multiple cloud providers
  4. Fault Tolerance - Automatic recovery and retry mechanisms throughout
  5. Performance First - Efficient batching, caching, and concurrent processing

High-Level Architecture

graph TB
    Platform["Antimetal Platform<br/>(Intake Service)"]

    subgraph SystemAgent ["System Agent"]
        subgraph MainController ["Main Controller"]
            K8sController["K8s<br/>Controller"]
            IntakeWorker["Intake<br/>Worker"]
            PerfManager["Performance<br/>Manager"]

            subgraph ResourceStore ["Resource Store (BadgerDB)"]
                Resources["Resources"]
                Relationships["Relationships"]
                EventRouter["Event Router"]
            end
        end

        subgraph CloudProvider ["Cloud Provider Abstraction"]
            EKS["EKS"]
            GKE["GKE"]
            AKS["AKS"]
            KIND["KIND"]
        end
    end

    K8sAPI["Kubernetes<br/>API Server"]
    LinuxKernel["Linux Kernel<br/>/proc, /sys, eBPF"]


    IntakeWorker -.->|gRPC Stream| Platform
    K8sController --> ResourceStore
    ResourceStore --> IntakeWorker
    PerfManager --> ResourceStore

    SystemAgent --> K8sAPI
    SystemAgent --> LinuxKernel

Component Overview

1. Main Controller (cmd/main.go)

The orchestrator that initializes and manages all components:

  • Parses configuration and command-line flags
  • Sets up the controller-runtime manager
  • Initializes the resource store
  • Starts all subsystems
  • Manages graceful shutdown
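
A minimal sketch of that wiring, assuming the standard controller-runtime entrypoint pattern (the probe address and error handling are illustrative, not the agent's actual configuration):

package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
    // Build the controller-runtime manager that hosts every controller.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        HealthProbeBindAddress: ":8081", // liveness/readiness endpoint (illustrative port)
    })
    if err != nil {
        panic(err)
    }

    // The Kubernetes controller, resource store, intake worker, and performance
    // manager would be constructed and registered with the manager here.

    // Start blocks until the signal-handler context is cancelled, which drives graceful shutdown.
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        panic(err)
    }
}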

2. Kubernetes Controller

Monitors Kubernetes resources using the controller-runtime framework:

  • Watches: Nodes, Pods, Services, Deployments, StatefulSets, DaemonSets, ReplicaSets, PVs, PVCs, Jobs
  • Features: Leader election, concurrent reconciliation, index-based lookups
  • Output: Normalized resources sent to the Resource Store
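
For illustration, a reconciler in this framework follows the usual controller-runtime shape; the ResourceStore interface and its Put method below are hypothetical stand-ins for the store integration:

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// ResourceStore is a hypothetical handle to the agent's Resource Store.
type ResourceStore interface {
    Put(ctx context.Context, obj client.Object) error
}

type NodeReconciler struct {
    client.Client
    Store ResourceStore
}

func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var node corev1.Node
    if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
        // NotFound means the object was deleted; that is handled elsewhere, not retried.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Normalize the object and hand it to the Resource Store, which emits events.
    return ctrl.Result{}, r.Store.Put(ctx, &node)
}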

3. Resource Store

Central data hub built on BadgerDB:

  • Storage: In-memory key-value store for fast access
  • Resources: Generic cloud/Kubernetes resources with metadata
  • Relationships: RDF-style triples (subject-predicate-object)
  • Events: Publishes Add/Update/Delete events to subscribers
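
The shape of the stored data and its change events can be pictured with the following hypothetical types (field and constant names are illustrative, not the agent's actual schema):

// Resource is a normalized cloud or Kubernetes object plus metadata.
type Resource struct {
    Key      string            // unique key within the store
    Kind     string            // e.g. "Pod", "Node", "EC2Instance"
    Metadata map[string]string
}

// Relationship is an RDF-style triple linking two stored resources.
type Relationship struct {
    Subject   string // e.g. a Pod's key
    Predicate string // e.g. "ScheduledOn"
    Object    string // e.g. a Node's key
}

type EventType int

const (
    EventAdd EventType = iota
    EventUpdate
    EventDelete
)

// Event is published to subscribers whenever a resource is added, updated, or deleted.
type Event struct {
    Type     EventType
    Resource Resource
}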

4. Intake Worker

Streams data to the Antimetal platform:

  • Protocol: gRPC with protobuf messages
  • Batching: Configurable batch size and time windows
  • Reliability: Exponential backoff, automatic reconnection
  • Features: Heartbeat mechanism, stream health monitoring
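
A simplified version of the size-or-time batching loop; the Delta type and send function are placeholders for the real protobuf messages and gRPC stream:

import (
    "context"
    "time"
)

// Delta is a placeholder for the protobuf message sent upstream.
type Delta struct{}

func runBatcher(ctx context.Context, in <-chan *Delta, send func([]*Delta) error, maxBatch int, window time.Duration) {
    ticker := time.NewTicker(window)
    defer ticker.Stop()

    var batch []*Delta
    flush := func() {
        if len(batch) == 0 {
            return
        }
        // In the real worker a send error triggers exponential backoff and reconnection.
        _ = send(batch)
        batch = nil
    }

    for {
        select {
        case <-ctx.Done():
            flush() // drain what is left before shutting down
            return
        case d := <-in:
            batch = append(batch, d)
            if len(batch) >= maxBatch {
                flush() // size-based flush
            }
        case <-ticker.C:
            flush() // time-window flush
        }
    }
}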

5. Performance Manager

Collects system and hardware metrics:

  • Architecture: Pluggable collector system
  • Data Sources: /proc, /sys, eBPF programs
  • Collectors: CPU, Memory, Network, Disk, NUMA, and more
  • Patterns: PointCollector (one-shot) and ContinuousCollector (streaming)
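
The two collector patterns can be expressed roughly like this (exact method signatures are assumptions; the project's real Collector interface is shown under Key Design Patterns below):

// PointCollector produces a single snapshot per call.
type PointCollector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}

// ContinuousCollector streams samples until the context is cancelled.
type ContinuousCollector interface {
    Name() string
    Start(ctx context.Context) (<-chan any, error)
    Stop() error
}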

6. Cloud Provider Abstraction

Provides cloud-specific functionality:

  • Interface: Simple provider contract (Name, ClusterName, Region)
  • Implementations: EKS (full), KIND (local), GKE/AKS (planned)
  • Discovery: Auto-detects cloud environment
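
Auto-detection can be thought of as probing each known provider in turn; the Detector interface and discoverProvider function below are illustrative, building on the Provider interface shown under Key Design Patterns:

// Detector pairs a Provider with a probe that checks the local environment.
type Detector interface {
    Provider
    Detect(ctx context.Context) bool
}

// discoverProvider returns the first provider whose probe matches, or nil if none do.
func discoverProvider(ctx context.Context, candidates []Detector) Provider {
    for _, c := range candidates {
        // e.g. EKS probes the EC2 instance metadata service, KIND inspects node labels.
        if c.Detect(ctx) {
            return c
        }
    }
    return nil
}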

Data Flow Patterns

Resource Collection Flow

graph LR
    K8sAPI["K8s API Server"] --> Informers --> Controller --> Reconcile --> ResourceStore["Resource Store"] --> Event

Event Processing Flow

graph LR
    ResourceChange["Resource Change"] --> StoreTransaction["Store Transaction"] --> EventGeneration["Event Generation"] --> TypeFiltering["Type Filtering"] --> Subscribers

Intake Upload Flow

graph LR
    Event --> IntakeWorker["Intake Worker"] --> BatchQueue["Batch Queue"] --> gRPCStream["gRPC Stream"] --> AntimetalPlatform["Antimetal Platform"]

Performance Collection Flow

graph LR
    Kernel["Kernel (/proc, /sys)"] --> Collector --> PerformanceManager["Performance Manager"] --> MetricsStore["Metrics Store"]

Key Design Patterns

1. Event-Driven Communication

  • Components are loosely coupled through events
  • Resource Store acts as an event bus
  • Subscribers filter events by resource type
  • Non-blocking channel operations
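
A sketch of how the store might fan events out to type-filtered subscribers without blocking, reusing the hypothetical Event type from the Resource Store sketch above:

type subscriber struct {
    kinds map[string]bool // resource kinds this subscriber cares about
    ch    chan Event
}

func (s *subscriber) notify(ev Event) {
    if len(s.kinds) > 0 && !s.kinds[ev.Resource.Kind] {
        return // type filtering: skip kinds the subscriber did not ask for
    }
    select {
    case s.ch <- ev:
    default:
        // Non-blocking send: a slow subscriber misses the event instead of stalling the store.
    }
}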

2. Interface-Based Abstractions

// Cloud Provider
type Provider interface {
    Name() string
    ClusterName(ctx context.Context) (string, error)
    Region(ctx context.Context) (string, error)
}

// Collector
type Collector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}

3. Graceful Degradation

  • Missing optional data doesn't cause failures
  • Collectors handle missing files gracefully
  • Network failures trigger exponential backoff
  • Partial data is better than no data
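
For example, a collector helper might treat a missing file as "no data" rather than a failure (a generic pattern, not the agent's exact code):

import (
    "errors"
    "os"
    "strings"
)

// readOptionalFile returns the file's lines, or nil with no error when the file
// does not exist, so callers can emit partial data instead of failing outright.
func readOptionalFile(path string) ([]string, error) {
    data, err := os.ReadFile(path)
    if errors.Is(err, os.ErrNotExist) {
        return nil, nil
    }
    if err != nil {
        return nil, err
    }
    return strings.Split(strings.TrimSpace(string(data)), "\n"), nil
}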

4. Resource-Efficient Design

  • Batching reduces network overhead
  • In-memory caching minimizes I/O
  • Concurrent processing maximizes throughput
  • Careful memory management prevents leaks

Concurrency Model

Goroutine Organization

  • Main: Orchestration and lifecycle management
  • Controller Workers: Parallel Kubernetes reconciliation
  • Intake Worker: Dedicated gRPC streaming
  • Collectors: Independent metric collection
  • Event Router: Fan-out event distribution

Synchronization

  • Channels: Primary communication mechanism
  • Mutexes: Protecting shared state (sparingly used)
  • Context: Propagating cancellation and deadlines
  • WaitGroups: Coordinating shutdown
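
The context-plus-WaitGroup shutdown pattern looks roughly like this (a generic sketch, not the agent's exact code):

import (
    "context"
    "sync"
)

// runWorkers starts each worker in its own goroutine and waits for all of them
// to return after ctx is cancelled.
func runWorkers(ctx context.Context, workers []func(context.Context)) {
    var wg sync.WaitGroup
    for _, w := range workers {
        wg.Add(1)
        go func(run func(context.Context)) {
            defer wg.Done()
            run(ctx) // each worker exits when ctx is cancelled
        }(w)
    }
    wg.Wait() // graceful shutdown: wait for every goroutine to finish
}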

Extension Points

1. Adding New Collectors

Implement the Collector interface and register:

func init() {
    performance.Register(MetricTypeCustom, NewCustomCollector)
}
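
A minimal collector behind that registration might look like the following; the constructor name matches the snippet above, but the body is purely illustrative:

// CustomCollector is a hypothetical point collector satisfying the Collector interface.
type CustomCollector struct{}

func NewCustomCollector() *CustomCollector { return &CustomCollector{} }

func (c *CustomCollector) Name() string { return "custom" }

func (c *CustomCollector) Collect(ctx context.Context) (any, error) {
    // Read the data source (e.g. a /proc file) and return a typed snapshot.
    return map[string]uint64{"example_metric": 42}, nil
}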

2. Adding Cloud Providers

Implement the Provider interface:

type CustomProvider struct{}
func (p *CustomProvider) Name() string { return "custom" }
func (p *CustomProvider) ClusterName(ctx context.Context) (string, error) { ... }
func (p *CustomProvider) Region(ctx context.Context) (string, error) { ... }

3. Custom Resource Types

Define new protobuf messages and register handlers.

Performance Characteristics

  • Startup Time: < 10 seconds typical
  • Memory Usage: 100-500MB depending on cluster size
  • CPU Usage: < 100m (0.1 core) typical; spikes during reconciliation
  • Network: Batched uploads reduce bandwidth
  • Storage: In-memory with optional persistence

Security Architecture

  • Container: Runs as a non-root user (UID 65532)
  • Base Image: Distroless for minimal attack surface
  • Network: TLS for all external connections
  • Auth: API key-based authentication
  • RBAC: Minimal Kubernetes permissions

High Availability

  • Leader Election: Single active instance per cluster
  • State Recovery: Rebuilds from Kubernetes API
  • Graceful Handoff: Clean leader transitions
  • Health Checks: Liveness and readiness probes
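
With controller-runtime, leader election is enabled through the manager options passed to ctrl.NewManager (as in the Main Controller sketch above); the lock ID here is illustrative:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    LeaderElection:   true,
    LeaderElectionID: "system-agent.example.com", // unique lock identity for this agent
})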

Next: Component Diagram | Data Flow
