# Architecture Overview
The Antimetal System Agent is built with a modular, event-driven architecture designed for reliability, performance, and extensibility.

## Design Principles
- Event-Driven Architecture - Components communicate through events and channels
- Separation of Concerns - Each component has a single, well-defined responsibility
- Cloud-Agnostic Design - Abstractions allow for multiple cloud providers
- Fault Tolerance - Automatic recovery and retry mechanisms throughout
- Performance First - Efficient batching, caching, and concurrent processing
## High-Level Architecture

```mermaid
graph TB
    Platform["Antimetal Platform<br/>(Intake Service)"]

    subgraph SystemAgent ["System Agent"]
        subgraph MainController ["Main Controller"]
            K8sController["K8s<br/>Controller"]
            IntakeWorker["Intake<br/>Worker"]
            PerfManager["Performance<br/>Manager"]

            subgraph ResourceStore ["Resource Store (BadgerDB)"]
                Resources["Resources"]
                Relationships["Relationships"]
                EventRouter["Event Router"]
            end
        end

        subgraph CloudProvider ["Cloud Provider Abstraction"]
            EKS["EKS"]
            GKE["GKE"]
            AKS["AKS"]
            KIND["KIND"]
        end
    end

    K8sAPI["Kubernetes<br/>API Server"]
    LinuxKernel["Linux Kernel<br/>/proc, /sys, eBPF"]

    IntakeWorker -.->|gRPC Stream| Platform
    K8sController --> ResourceStore
    ResourceStore --> IntakeWorker
    PerfManager --> ResourceStore
    SystemAgent --> K8sAPI
    SystemAgent --> LinuxKernel
```
## Main Controller

The orchestrator that initializes and manages all components:
- Parses configuration and command-line flags
- Sets up the controller-runtime manager
- Initializes the resource store
- Starts all subsystems
- Manages graceful shutdown
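
A minimal sketch of that startup sequence using controller-runtime's standard entrypoints; the flag name and the omitted subsystem wiring are illustrative, not the agent's actual configuration surface.

```go
package main

import (
    "flag"
    "os"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    // Parse configuration and command-line flags (hypothetical flag shown).
    intakeAddr := flag.String("intake-address", "", "intake service endpoint")
    flag.Parse()
    _ = intakeAddr

    ctrl.SetLogger(zap.New())

    // Set up the controller-runtime manager.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
    if err != nil {
        os.Exit(1)
    }

    // Initialize the resource store and start the other subsystems here
    // (omitted); they would typically be registered with the manager.

    // SetupSignalHandler cancels the context on SIGINT/SIGTERM, which is
    // what drives graceful shutdown of everything started above.
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}
```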
## Kubernetes Controller

Monitors Kubernetes resources using the controller-runtime framework:
- Watches: Nodes, Pods, Services, Deployments, StatefulSets, DaemonSets, ReplicaSets, PVs, PVCs, Jobs
- Features: Leader election, concurrent reconciliation, index-based lookups
- Output: Normalized resources sent to the Resource Store
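
A sketch of what one such watch looks like with controller-runtime, shown here for Nodes only; `NodeReconciler` and the omitted store wiring are illustrative, not the agent's actual types.

```go
package controller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type NodeReconciler struct {
    client.Client
    // A reference to the Resource Store would live here as the output sink.
}

func (r *NodeReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var node corev1.Node
    if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
        // Deleted objects surface as NotFound; a delete would be emitted to the store here.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Normalize the node and write it to the Resource Store (omitted).
    return ctrl.Result{}, nil
}

func (r *NodeReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&corev1.Node{}).
        Complete(r)
}
```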
## Resource Store

Central data hub built on BadgerDB:
- Storage: In-memory key-value store for fast access
- Resources: Generic cloud/Kubernetes resources with metadata
- Relationships: RDF triplets (subject-predicate-object)
- Events: Publishes Add/Update/Delete events to subscribers
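
The store's exact API is not reproduced here; a hypothetical shape that matches the description (generic resources, RDF-style relationships, typed event subscriptions) could look like this:

```go
package store

import "context"

type EventType int

const (
    EventAdd EventType = iota
    EventUpdate
    EventDelete
)

type Resource struct {
    Type     string // e.g. "k8s/Pod"
    Name     string
    Metadata map[string]string
}

// Relationship is an RDF-style triplet: subject -predicate-> object.
type Relationship struct {
    Subject   string
    Predicate string // e.g. "runs-on", "owns"
    Object    string
}

type Event struct {
    Type     EventType
    Resource Resource
}

type Store interface {
    AddResource(ctx context.Context, r Resource) error
    AddRelationship(ctx context.Context, rel Relationship) error
    // Subscribe returns a channel of events filtered by resource type.
    Subscribe(ctx context.Context, resourceTypes ...string) (<-chan Event, error)
}
```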
## Intake Worker

Streams data to the Antimetal platform:
- Protocol: gRPC with protobuf messages
- Batching: Configurable batch size and time windows
- Reliability: Exponential backoff, automatic reconnection
- Features: Heartbeat mechanism, stream health monitoring
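
A sketch of the batch-and-retry loop this implies, assuming events arrive already encoded and a `send` callback writes one batch to the gRPC stream; the batch size, flush window, and backoff bounds are illustrative defaults, not the agent's real settings.

```go
package intake

import (
    "context"
    "time"
)

func runBatcher(ctx context.Context, events <-chan []byte, send func(context.Context, [][]byte) error) {
    const (
        maxBatch   = 100
        maxWait    = 5 * time.Second
        minBackoff = time.Second
        maxBackoff = 2 * time.Minute
    )
    var batch [][]byte
    timer := time.NewTimer(maxWait)
    defer timer.Stop()

    flush := func() {
        if len(batch) == 0 {
            return
        }
        // Retry with exponential backoff until the send succeeds or the
        // context is cancelled.
        backoff := minBackoff
        for send(ctx, batch) != nil {
            select {
            case <-ctx.Done():
                return
            case <-time.After(backoff):
            }
            if backoff *= 2; backoff > maxBackoff {
                backoff = maxBackoff
            }
        }
        batch = nil
    }

    for {
        select {
        case <-ctx.Done():
            flush() // best-effort final flush on shutdown
            return
        case e := <-events:
            batch = append(batch, e)
            if len(batch) >= maxBatch {
                flush()
            }
        case <-timer.C:
            flush()
            timer.Reset(maxWait)
        }
    }
}
```

Flushing on either a full batch or an elapsed time window is what keeps upload latency bounded while still amortizing the per-message overhead of the stream.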
## Performance Manager

Collects system and hardware metrics:
- Architecture: Pluggable collector system
- Data Sources: `/proc`, `/sys`, eBPF programs
- Collectors: CPU, Memory, Network, Disk, NUMA, and more
- Patterns: PointCollector (one-shot) and ContinuousCollector (streaming)
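
As an example of the point-collector pattern, a one-shot collector reading `/proc/loadavg` might look like the sketch below; `LoadAvgCollector` is illustrative and implements the Collector interface shown under Key Interfaces.

```go
package collectors

import (
    "context"
    "fmt"
    "os"
)

type LoadAvg struct {
    One, Five, Fifteen float64
}

type LoadAvgCollector struct{}

func (c *LoadAvgCollector) Name() string { return "loadavg" }

// Collect reads the load averages once; a ContinuousCollector would instead
// stream samples until its context is cancelled.
func (c *LoadAvgCollector) Collect(ctx context.Context) (any, error) {
    data, err := os.ReadFile("/proc/loadavg")
    if err != nil {
        return nil, err
    }
    var l LoadAvg
    if _, err := fmt.Sscanf(string(data), "%f %f %f", &l.One, &l.Five, &l.Fifteen); err != nil {
        return nil, err
    }
    return l, nil
}
```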
## Cloud Provider Abstraction

Provides cloud-specific functionality:
- Interface: Simple provider contract (Name, ClusterName, Region)
- Implementations: EKS (full), KIND (local), GKE/AKS (planned)
- Discovery: Auto-detects cloud environment
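
Discovery can be sketched as trying a list of environment-specific detectors in order; the `detector` type and `AutoDetect` helper here are hypothetical, and the agent's real probing logic is not shown.

```go
package cloud

import "context"

// Provider mirrors the contract listed under Key Interfaces.
type Provider interface {
    Name() string
    ClusterName(ctx context.Context) (string, error)
    Region(ctx context.Context) (string, error)
}

// detector reports whether its provider matches the current environment
// (for example via instance metadata or well-known environment variables).
type detector func(ctx context.Context) (Provider, bool)

// AutoDetect returns the first provider whose detector matches, or nil.
func AutoDetect(ctx context.Context, detectors []detector) Provider {
    for _, detect := range detectors {
        if p, ok := detect(ctx); ok {
            return p
        }
    }
    return nil // in practice, fall back to a local/KIND provider
}
```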
## Data Flow

### Kubernetes Resources

```mermaid
graph LR
    K8sAPI["K8s API Server"] --> Informers --> Controller --> Reconcile --> ResourceStore["Resource Store"] --> Event
```

### Store Events

```mermaid
graph LR
    ResourceChange["Resource Change"] --> StoreTransaction["Store Transaction"] --> EventGeneration["Event Generation"] --> TypeFiltering["Type Filtering"] --> Subscribers
```

### Intake Upload

```mermaid
graph LR
    Event --> IntakeWorker["Intake Worker"] --> BatchQueue["Batch Queue"] --> gRPCStream["gRPC Stream"] --> AntimetalPlatform["Antimetal Platform"]
```

### Performance Metrics

```mermaid
graph LR
    Kernel["Kernel (/proc, /sys)"] --> Collector --> PerformanceManager["Performance Manager"] --> MetricsStore["Metrics Store"]
```
## Event-Driven Communication

- Components are loosely coupled through events
- Resource Store acts as an event bus
- Subscribers filter events by resource type
- Non-blocking channel operations
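
A sketch of what non-blocking fan-out looks like in Go: the router never waits on a slow subscriber. The `Event` type and the drop-on-full policy are illustrative.

```go
package events

// Event is a minimal stand-in for the store event described above.
type Event struct {
    Type     string
    Resource string
}

// fanOut distributes each incoming event to every subscriber without
// blocking the router on any single channel.
func fanOut(in <-chan Event, subscribers []chan Event) {
    for ev := range in {
        for _, sub := range subscribers {
            select {
            case sub <- ev:
            default:
                // Subscriber is not keeping up; drop rather than block the router.
            }
        }
    }
}
```

Whether to drop, buffer, or coalesce on a full subscriber channel is a policy choice; the essential point is that one slow consumer cannot stall event delivery to the others.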
## Key Interfaces

```go
// Cloud Provider
type Provider interface {
    Name() string
    ClusterName(ctx context.Context) (string, error)
    Region(ctx context.Context) (string, error)
}

// Collector
type Collector interface {
    Name() string
    Collect(ctx context.Context) (any, error)
}
```
## Fault Tolerance

- Missing optional data doesn't cause failures
- Collectors handle missing files gracefully
- Network failures trigger exponential backoff
- Partial data is better than no data
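
For example, a collector can treat an absent `/proc` or `/sys` entry as "no data" rather than an error; a small helper in that spirit (names are illustrative):

```go
package collectors

import (
    "errors"
    "io/fs"
    "os"
)

// readOptional returns the file contents, or nil with no error when the
// file simply doesn't exist on this kernel or distribution.
func readOptional(path string) ([]byte, error) {
    data, err := os.ReadFile(path)
    if errors.Is(err, fs.ErrNotExist) {
        return nil, nil
    }
    return data, err
}
```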
## Performance Optimizations

- Batching reduces network overhead
- In-memory caching minimizes I/O
- Concurrent processing maximizes throughput
- Careful memory management prevents leaks
## Concurrency Model

### Goroutines

- Main: Orchestration and lifecycle management
- Controller Workers: Parallel Kubernetes reconciliation
- Intake Worker: Dedicated gRPC streaming
- Collectors: Independent metric collection
- Event Router: Fan-out event distribution
### Synchronization

- Channels: Primary communication mechanism
- Mutexes: Protecting shared state (sparingly used)
- Context: Propagating cancellation and deadlines
- WaitGroups: Coordinating shutdown
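
Put together, shutdown coordination typically looks like the sketch below: one cancelled context signals every goroutine, and a WaitGroup blocks until they have all returned. The `runWorkers` helper is illustrative.

```go
package agent

import (
    "context"
    "sync"
)

// runWorkers starts each worker in its own goroutine and waits for all of
// them to exit after the context is cancelled.
func runWorkers(ctx context.Context, workers []func(context.Context)) {
    var wg sync.WaitGroup
    for _, w := range workers {
        wg.Add(1)
        go func(work func(context.Context)) {
            defer wg.Done()
            work(ctx) // each worker returns when ctx is cancelled
        }(w)
    }
    wg.Wait() // coordinated shutdown: block until every worker has finished
}
```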
## Extending the Agent

### Adding a Collector

Implement the Collector interface and register its constructor:

```go
func init() {
    performance.Register(MetricTypeCustom, NewCustomCollector)
}
```
### Adding a Cloud Provider

Implement the Provider interface:

```go
type CustomProvider struct{}

func (p *CustomProvider) Name() string { return "custom" }
func (p *CustomProvider) ClusterName(ctx context.Context) (string, error) { ... }
func (p *CustomProvider) Region(ctx context.Context) (string, error) { ... }
```
### Adding a Resource Type

Define new protobuf messages and register handlers for them.
## Performance Characteristics

- Startup Time: < 10 seconds typical
- Memory Usage: 100-500MB depending on cluster size
- CPU Usage: < 100m typical, spikes during reconciliation
- Network: Batched uploads reduce bandwidth
- Storage: In-memory with optional persistence
## Security

- Container: Runs as a non-root user (UID 65532)
- Base Image: Distroless for minimal attack surface
- Network: TLS for all external connections
- Auth: API key-based authentication
- RBAC: Minimal Kubernetes permissions
## High Availability

- Leader Election: Single active instance per cluster
- State Recovery: Rebuilds from Kubernetes API
- Graceful Handoff: Clean leader transitions
- Health Checks: Liveness and readiness probes
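
With controller-runtime, leader election and the health endpoints are enabled on the manager. A sketch under the assumption that the agent uses the standard manager options; the election ID and probe address here are illustrative.

```go
package main

import (
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func newManager() (ctrl.Manager, error) {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:         true,
        LeaderElectionID:       "system-agent-leader-election", // hypothetical lease name
        HealthProbeBindAddress: ":8081",
    })
    if err != nil {
        return nil, err
    }
    // Liveness and readiness probes backing the health checks listed above.
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        return nil, err
    }
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        return nil, err
    }
    return mgr, nil
}
```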
Next: Component Diagram | Data Flow