# Architecture Overview
The Antimetal System Agent is built with a modular, event-driven architecture designed for reliability, performance, and extensibility.

## Design Principles
- Event-Driven Architecture - Components communicate through events and channels
- Separation of Concerns - Each component has a single, well-defined responsibility
- Cloud-Agnostic Design - Abstractions allow for multiple cloud providers
- Fault Tolerance - Automatic recovery and retry mechanisms throughout
- Performance First - Efficient batching, caching, and concurrent processing
## High-Level Architecture

```mermaid
graph TB
    Platform["Antimetal Platform<br/>(Intake Service)"]

    subgraph SystemAgent ["System Agent"]
        subgraph MainController ["Main Controller"]
            K8sController["K8s<br/>Controller"]
            IntakeWorker["Intake<br/>Worker"]
            PerfManager["Performance<br/>Manager"]

            subgraph ResourceStore ["Resource Store (BadgerDB)"]
                Resources["Resources"]
                Relationships["Relationships"]
                EventRouter["Event Router"]
            end
        end

        subgraph CloudProvider ["Cloud Provider Abstraction"]
            EKS["EKS"]
            GKE["GKE"]
            AKS["AKS"]
            KIND["KIND"]
        end
    end

    K8sAPI["Kubernetes<br/>API Server"]
    LinuxKernel["Linux Kernel<br/>/proc, /sys, eBPF"]

    IntakeWorker -.->|gRPC Stream| Platform
    K8sController --> ResourceStore
    ResourceStore --> IntakeWorker
    PerfManager --> ResourceStore
    SystemAgent --> K8sAPI
    SystemAgent --> LinuxKernel
```
## Main Controller

The orchestrator that initializes and manages all components:
- Parses configuration and command-line flags
- Sets up the controller-runtime manager
- Initializes the resource store
- Starts all subsystems
- Manages graceful shutdown
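
A minimal sketch of that startup sequence using controller-runtime's standard entrypoints; the flag name and the omitted subsystem wiring are illustrative, not the agent's actual configuration surface.

```go
package main

import (
    "flag"
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    // Parse configuration and command-line flags (hypothetical flag shown).
    intakeAddr := flag.String("intake-address", "", "intake service endpoint")
    flag.Parse()
    _ = intakeAddr

    ctrl.SetLogger(zap.New())

    // Set up the controller-runtime manager.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        os.Exit(1)
    }

    // Initialize the resource store and start the other subsystems here
    // (omitted); they would typically be registered with the manager.

    // SetupSignalHandler cancels the context on SIGINT/SIGTERM, which is
    // what drives graceful shutdown of everything started above.
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}
```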
## Kubernetes Controller

Monitors Kubernetes resources using the controller-runtime framework:
- Watches: Nodes, Pods, Services, Deployments, StatefulSets, DaemonSets, ReplicaSets, PVs, PVCs, Jobs
- Features: Leader election, concurrent reconciliation, index-based lookups
- Output: Normalized resources sent to the Resource Store
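
A sketch of what one such watch looks like with controller-runtime, shown here for Nodes only; `NodeReconciler` and the omitted store wiring are illustrative, not the agent's actual types.

```go
package controller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type NodeReconciler struct {
    client.Client
    // A reference to the Resource Store would live here as the output sink.
}

func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var node corev1.Node
    if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
        // Deleted objects surface as NotFound; a delete would be emitted to the store here.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Normalize the node and write it to the Resource Store (omitted).
    return ctrl.Result{}, nil
}

func (r *NodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&corev1.Node{}).
        Complete(r)
}
```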
## Resource Store

Central data hub built on BadgerDB:
- Storage: In-memory key-value store for fast access
- Resources: Generic cloud/Kubernetes resources with metadata
- Relationships: RDF triplets (subject-predicate-object)
- Events: Publishes Add/Update/Delete events to subscribers
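
The store's exact API is not reproduced here; a hypothetical shape that matches the description (generic resources, RDF-style relationships, typed event subscriptions) could look like this:

```go
package store

import "context"

type EventType int

const (
    EventAdd EventType = iota
    EventUpdate
    EventDelete
)

type Resource struct {
    Type     string // e.g. "k8s/Pod"
    Name     string
    Metadata map[string]string
}

// Relationship is an RDF-style triplet: subject -predicate-> object.
type Relationship struct {
    Subject   string
    Predicate string // e.g. "runs-on", "owns"
    Object    string
}

type Event struct {
    Type     EventType
    Resource Resource
}

type Store interface {
    AddResource(ctx context.Context, r Resource) error
    AddRelationship(ctx context.Context, rel Relationship) error
    // Subscribe returns a channel of events filtered by resource type.
    Subscribe(ctx context.Context, resourceTypes ...string) (<-chan Event, error)
}
```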
## Intake Worker

Streams data to the Antimetal platform:
- Protocol: gRPC with protobuf messages
- Batching: Configurable batch size and time windows
- Reliability: Exponential backoff, automatic reconnection
- Features: Heartbeat mechanism, stream health monitoring
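
A sketch of the batch-and-retry loop this implies, assuming events arrive already encoded and a `send` callback writes one batch to the gRPC stream; the batch size, flush window, and backoff bounds are illustrative defaults, not the agent's real settings.

```go
package intake

import (
    "context"
    "time"
)

func runBatcher(ctx context.Context, events <-chan []byte, send func(context.Context, [][]byte) error) {
    const (
        maxBatch   = 100
        maxWait    = 5 * time.Second
        minBackoff = time.Second
        maxBackoff = 2 * time.Minute
    )
    var batch [][]byte
    timer := time.NewTimer(maxWait)
    defer timer.Stop()

    flush := func() {
        if len(batch) == 0 {
            return
        }
        // Retry with exponential backoff until the send succeeds or the
        // context is cancelled.
        backoff := minBackoff
        for send(ctx, batch) != nil {
            select {
            case <-ctx.Done():
                return
            case <-time.After(backoff):
            }
            if backoff *= 2; backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
        batch = nil
    }

    for {
        select {
        case <-ctx.Done():
            flush() // best-effort final flush on shutdown
            return
        case e := <-events:
            batch = append(batch, e)
            if len(batch) >= maxBatch {
                flush()
            }
        case <-timer.C:
            flush()
            timer.Reset(maxWait)
        }
    }
}
```

Flushing on either a full batch or an elapsed time window is what keeps upload latency bounded while still amortizing the per-message overhead of the stream.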
## Performance Manager

Collects system and hardware metrics:
- Architecture: Pluggable collector system
- Data Sources: `/proc`, `/sys`, eBPF programs
- Collectors: CPU, Memory, Network, Disk, NUMA, and more
- Patterns: PointCollector (one-shot) and ContinuousCollector (streaming)
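
As an example of the point-collector pattern, a one-shot collector reading `/proc/loadavg` might look like the sketch below; `LoadAvgCollector` is illustrative and implements the Collector interface shown under Key Interfaces.

```go
package collectors

import (
    "context"
    "fmt"
    "os"
)

type LoadAvg struct {
    One, Five, Fifteen float64
}

type LoadAvgCollector struct{}

func (c *LoadAvgCollector) Name() string { return "loadavg" }

// Collect reads the load averages once; a ContinuousCollector would instead
// stream samples until its context is cancelled.
func (c *LoadAvgCollector) Collect(ctx context.Context) (any, error) {
    data, err := os.ReadFile("/proc/loadavg")
    if err != nil {
        return nil, err
    }
    var l LoadAvg
    if _, err := fmt.Sscanf(string(data), "%f %f %f", &l.One, &l.Five, &l.Fifteen); err != nil {
        return nil, err
    }
    return l, nil
}
```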
## Cloud Provider Abstraction

Provides cloud-specific functionality:
- Interface: Simple provider contract (Name, ClusterName, Region)
- Implementations: EKS (full), KIND (local), GKE/AKS (planned)
- Discovery: Auto-detects cloud environment
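
Discovery can be sketched as trying a list of environment-specific detectors in order; the `detector` type and `AutoDetect` helper here are hypothetical, and the agent's real probing logic is not shown.

```go
package cloud

import "context"

// Provider mirrors the contract listed under Key Interfaces.
type Provider interface {
    Name() string
    ClusterName(ctx context.Context) (string, error)
    Region(ctx context.Context) (string, error)
}

// detector reports whether its provider matches the current environment
// (for example via instance metadata or well-known environment variables).
type detector func(ctx context.Context) (Provider, bool)

// AutoDetect returns the first provider whose detector matches, or nil.
func AutoDetect(ctx context.Context, detectors []detector) Provider {
    for _, detect := range detectors {
        if p, ok := detect(ctx); ok {
            return p
        }
    }
    return nil // in practice, fall back to a local/KIND provider
}
```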
## Data Flow

### Kubernetes Resources

```mermaid
graph LR
    K8sAPI["K8s API Server"] --> Informers --> Controller --> Reconcile --> ResourceStore["Resource Store"] --> Event
```

### Store Events

```mermaid
graph LR
    ResourceChange["Resource Change"] --> StoreTransaction["Store Transaction"] --> EventGeneration["Event Generation"] --> TypeFiltering["Type Filtering"] --> Subscribers
```

### Intake Upload

```mermaid
graph LR
    Event --> IntakeWorker["Intake Worker"] --> BatchQueue["Batch Queue"] --> gRPCStream["gRPC Stream"] --> AntimetalPlatform["Antimetal Platform"]
```

### Performance Metrics

```mermaid
graph LR
    Kernel["Kernel (/proc, /sys)"] --> Collector --> PerformanceManager["Performance Manager"] --> MetricsStore["Metrics Store"]
```
## Event-Driven Communication

- Components are loosely coupled through events
- Resource Store acts as an event bus
- Subscribers filter events by resource type
- Non-blocking channel operations
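
A sketch of what non-blocking fan-out looks like in Go: the router never waits on a slow subscriber. The `Event` type and the drop-on-full policy are illustrative.

```go
package events

// Event is a minimal stand-in for the store event described above.
type Event struct {
    Type     string
    Resource string
}

// fanOut distributes each incoming event to every subscriber without
// blocking the router on any single channel.
func fanOut(in <-chan Event, subscribers []chan Event) {
    for ev := range in {
        for _, sub := range subscribers {
            select {
            case sub <- ev:
            default:
                // Subscriber is not keeping up; drop rather than block the router.
            }
        }
    }
}
```

Whether to drop, buffer, or coalesce on a full subscriber channel is a policy choice; the essential point is that one slow consumer cannot stall event delivery to the others.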
## Key Interfaces

```go
// Cloud Provider
type Provider interface {
    Name() string
    ClusterName(ctx context.Context) (string, error)
    Region(ctx context.Context) (string, error)
}

// Collector
type Collector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}
```
## Fault Tolerance

- Missing optional data doesn't cause failures
- Collectors handle missing files gracefully
- Network failures trigger exponential backoff
- Partial data is better than no data
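
For example, a collector can treat an absent `/proc` or `/sys` entry as "no data" rather than an error; a small helper in that spirit (names are illustrative):

```go
package collectors

import (
    "errors"
    "io/fs"
    "os"
)

// readOptional returns the file contents, or nil with no error when the
// file simply doesn't exist on this kernel or distribution.
func readOptional(path string) ([]byte, error) {
    data, err := os.ReadFile(path)
    if errors.Is(err, fs.ErrNotExist) {
        return nil, nil
    }
    return data, err
}
```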
## Performance Optimizations

- Batching reduces network overhead
- In-memory caching minimizes I/O
- Concurrent processing maximizes throughput
- Careful memory management prevents leaks
## Concurrency Model

### Goroutines

- Main: Orchestration and lifecycle management
- Controller Workers: Parallel Kubernetes reconciliation
- Intake Worker: Dedicated gRPC streaming
- Collectors: Independent metric collection
- Event Router: Fan-out event distribution
### Synchronization

- Channels: Primary communication mechanism
- Mutexes: Protecting shared state (sparingly used)
- Context: Propagating cancellation and deadlines
- WaitGroups: Coordinating shutdown
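
Put together, shutdown coordination typically looks like the sketch below: one cancelled context signals every goroutine, and a WaitGroup blocks until they have all returned. The `runWorkers` helper is illustrative.

```go
package agent

import (
    "context"
    "sync"
)

// runWorkers starts each worker in its own goroutine and waits for all of
// them to exit after the context is cancelled.
func runWorkers(ctx context.Context, workers []func(context.Context)) {
    var wg sync.WaitGroup
    for _, w := range workers {
        wg.Add(1)
        go func(work func(context.Context)) {
            defer wg.Done()
            work(ctx) // each worker returns when ctx is cancelled
        }(w)
    }
    wg.Wait() // coordinated shutdown: block until every worker has finished
}
```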
## Extending the Agent

### Adding a Collector

Implement the Collector interface and register its constructor:

```go
func init() {
    performance.Register(MetricTypeCustom, NewCustomCollector)
}
```
### Adding a Cloud Provider

Implement the Provider interface:

```go
type CustomProvider struct{}

func (p *CustomProvider) Name() string { return "custom" }
func (p *CustomProvider) ClusterName(ctx context.Context) (string, error) { ... }
func (p *CustomProvider) Region(ctx context.Context) (string, error) { ... }
```
### Adding a Resource Type

Define new protobuf messages and register handlers for them.
## Performance Characteristics

- Startup Time: < 10 seconds typical
- Memory Usage: 100-500MB depending on cluster size
- CPU Usage: < 100m typical, spikes during reconciliation
- Network: Batched uploads reduce bandwidth
- Storage: In-memory with optional persistence
## Security

- Container: Runs as a non-root user (UID 65532)
- Base Image: Distroless for minimal attack surface
- Network: TLS for all external connections
- Auth: API key-based authentication
- RBAC: Minimal Kubernetes permissions
## High Availability

- Leader Election: Single active instance per cluster
- State Recovery: Rebuilds from Kubernetes API
- Graceful Handoff: Clean leader transitions
- Health Checks: Liveness and readiness probes
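
With controller-runtime, leader election and the health endpoints are enabled on the manager. A sketch under the assumption that the agent uses the standard manager options; the election ID and probe address here are illustrative.

```go
package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func newManager() (ctrl.Manager, error) {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:         true,
        LeaderElectionID:       "system-agent-leader-election", // hypothetical lease name
        HealthProbeBindAddress: ":8081",
    })
    if err != nil {
        return nil, err
    }
    // Liveness and readiness probes backing the health checks listed above.
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        return nil, err
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        return nil, err
    }
    return mgr, nil
}
```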
Next: Component Diagram | Data Flow