Runtime Discovery
✅ COMPLETE: Core runtime discovery is implemented and operational. The system collects comprehensive container and process information and builds a graph representation with typed relationships; container-process and container-hardware relationships are still planned (see status below).
Implementation Status:
- ✅ Runtime graph builder
- ✅ Protobuf data models
- ✅ Resource store integration
- ✅ Container discovery system
- ✅ Process topology building
- ✅ Parent-child process relationships
- 🚧 Container-process relationships (planned)
- 🚧 Container-hardware affinity relationships (planned)
The Runtime Graph feature adds container and process discovery with graph representation to the Antimetal Agent, enabling runtime resources to be represented as nodes and relationships in "The Graph" alongside Kubernetes and hardware resources. This creates a complete topology from hardware → containers → processes.
```
┌─────────────────────────────────────────────────────────────┐
│ Container Discovery │
│ (Cgroup filesystem scanning, multi-runtime support) │
└──────────────────────┬──────────────────────────────────────┘
│ Container information
▼
┌─────────────────────────────────────────────────────────────┐
│ Runtime Manager │
│ - Periodic collection orchestration (30s intervals) │
│ - Runtime snapshot aggregation │
│ - Integration with performance collectors │
└──────────────────────┬──────────────────────────────────────┘
│ Runtime snapshot
▼
┌─────────────────────────────────────────────────────────────┐
│ Runtime Graph Builder │
│ - Converts discovery data to graph nodes │
│ - Creates RDF triplet relationships │
│ - Process topology building │
└──────────────────────┬──────────────────────────────────────┘
│ Resources & Relationships
▼
┌─────────────────────────────────────────────────────────────┐
│ Resource Store │
│ (BadgerDB - stores nodes and relationships) │
└─────────────────────────────────────────────────────────────┘
```
- Container Discovery scans cgroup filesystem for running containers
- Performance Collectors provide process information (future integration)
- Runtime Manager orchestrates periodic collection (default: 30 seconds)
- Graph Builder transforms raw data into graph nodes and relationships
- Resource Store persists the runtime graph using RDF triplets (see the sketch below)
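Conceptually, each relationship is stored as a subject-predicate-object triplet. A minimal Go sketch of that shape, with hypothetical field names and node-key formats (the agent's actual protobuf schema is not shown here):

```go
package runtimegraph

// Triplet is an illustrative subject-predicate-object record; the agent's
// actual protobuf schema and key format may differ.
type Triplet struct {
	Subject   string // node key, e.g. "process:1"
	Predicate string // relationship type, e.g. "ParentOf"
	Object    string // node key, e.g. "process:1234"
}

// Example: systemd (PID 1) is the parent of nginx-master (PID 1234).
var example = Triplet{Subject: "process:1", Predicate: "ParentOf", Object: "process:1234"}
```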
```mermaid
graph TB
%% Node Style Definitions
classDef containerNode fill:#e3f2fd,stroke:#0277bd,stroke-width:3px
classDef processNode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef systemProcess fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef k8sProcess fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
%% Container Nodes (discovered from cgroups)
NGINX_CTR["ContainerNode<br/>🐳 nginx-web<br/>📦 nginx:1.21<br/>🏃 containerd<br/>💾 512MB limit<br/>🧮 CPUs: 0-3<br/>📁 /sys/fs/cgroup/system.slice/docker-abc123.scope"]
POSTGRES_CTR["ContainerNode<br/>🐳 postgres-db<br/>📦 postgres:13<br/>🏃 docker<br/>💾 2GB limit<br/>🧮 CPUs: 4-7<br/>📁 /sys/fs/cgroup/docker/def456"]
REDIS_CTR["ContainerNode<br/>🐳 redis-cache<br/>📦 redis:6.2<br/>🏃 cri-o<br/>💾 256MB limit<br/>🧮 CPUs: 0-1<br/>📁 /sys/fs/cgroup/machine.slice/crio-ghi789.scope"]
%% Process Nodes (discovered from /proc)
NGINX_MASTER["ProcessNode<br/>👑 nginx-master<br/>🆔 PID: 1234<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:00"]
NGINX_WORKER1["ProcessNode<br/>👷 nginx-worker<br/>🆔 PID: 1235<br/>🔗 PPID: 1234<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:01"]
NGINX_WORKER2["ProcessNode<br/>👷 nginx-worker<br/>🆔 PID: 1236<br/>🔗 PPID: 1234<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:01"]
POSTGRES_MAIN["ProcessNode<br/>🗄️ postgres<br/>🆔 PID: 2345<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 09:15:00"]
POSTGRES_WRITER["ProcessNode<br/>✍️ postgres-writer<br/>🆔 PID: 2346<br/>🔗 PPID: 2345<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 09:15:02"]
REDIS_MAIN["ProcessNode<br/>⚡ redis-server<br/>🆔 PID: 3456<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 11:00:00"]
SYSTEMD["ProcessNode<br/>🏗️ systemd<br/>🆔 PID: 1<br/>🔗 PPID: 0<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 08:00:00"]
KUBELET["ProcessNode<br/>☸️ kubelet<br/>🆔 PID: 4567<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 08:30:00"]
%% Container-Process Relationships (Planned)
NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_MASTER
NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_WORKER1
NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_WORKER2
POSTGRES_CTR -.->|"Contains<br/>(future)"| POSTGRES_MAIN
POSTGRES_CTR -.->|"Contains<br/>(future)"| POSTGRES_WRITER
REDIS_CTR -.->|"Contains<br/>(future)"| REDIS_MAIN
%% Process Parent-Child Relationships (Implemented)
SYSTEMD -->|"ParentOf<br/>(implemented)"| NGINX_MASTER
SYSTEMD -->|"ParentOf<br/>(implemented)"| POSTGRES_MAIN
SYSTEMD -->|"ParentOf<br/>(implemented)"| REDIS_MAIN
SYSTEMD -->|"ParentOf<br/>(implemented)"| KUBELET
NGINX_MASTER -->|"ParentOf<br/>(implemented)"| NGINX_WORKER1
NGINX_MASTER -->|"ParentOf<br/>(implemented)"| NGINX_WORKER2
POSTGRES_MAIN -->|"ParentOf<br/>(implemented)"| POSTGRES_WRITER
%% Apply Styles
class NGINX_CTR,POSTGRES_CTR,REDIS_CTR containerNode
class NGINX_MASTER,NGINX_WORKER1,NGINX_WORKER2,POSTGRES_MAIN,POSTGRES_WRITER,REDIS_MAIN processNode
class SYSTEMD systemProcess
class KUBELET k8sProcess
```
| Relationship Type | Visual Style | Status | Description | Examples |
|---|---|---|---|---|
| ParentOf | Solid arrow | ✅ Implemented | Process parent-child relationships via PPID | systemd → nginx-master, nginx-master → nginx-worker |
| Contains | Dotted arrow | 🚧 Planned | Container-to-process containment | nginx-container → nginx-process |
| RunsOn | Dotted arrow | 🚧 Planned | Container-to-hardware affinity | nginx-container → CPU cores 0-3 |
| AllocatedTo | Dotted arrow | 🚧 Planned | Container memory allocation | postgres-container → NUMA node 1 |
ContainerNode represents a discovered container with its runtime configuration and resource constraints.
Properties:
- `container_id`: Unique container identifier (may be truncated)
- `runtime`: Container runtime (docker, containerd, cri-containerd, cri-o, podman)
- `cgroup_version`: Cgroup version (v1 or v2)
- `cgroup_path`: Filesystem path to the container's cgroup directory
- `image_name`: Container image name (e.g., "nginx", "postgres")
- `image_tag`: Container image tag (e.g., "latest", "13")
- `labels`: Runtime-specific labels and annotations
- `created_at`: Container creation timestamp
- `started_at`: Container start timestamp
Resource Limits:
- `cpu_shares`: Relative CPU weight (cgroup v1 cpu.shares)
- `cpu_quota_us`: CPU quota in microseconds per period
- `cpu_period_us`: CPU quota enforcement period
- `memory_limit_bytes`: Memory limit in bytes
- `cpuset_cpus`: CPU cores allowed (e.g., "0-3,8")
- `cpuset_mems`: NUMA nodes allowed (e.g., "0,1")
ProcessNode represents an individual process running on the system.
Properties:
- `pid`: Process identifier
- `ppid`: Parent process identifier
- `pgid`: Process group identifier
- `sid`: Session identifier
- `command`: Process command name
- `cmdline`: Full command line with arguments
- `state`: Process state (R, S, D, Z, T, t, W, X, K, P)
- `start_time`: Process start timestamp
ParentOf is a process parent-child relationship based on PPID.
Properties:
- `containment_type`: "process"
- `relationship`: "parent_of"
Usage:
- Parent Process → Child Process
Contains is a container-to-process containment relationship.
Properties:
- `containment_type`: "process"
- `relationship`: "contains"
Usage:
- Container → Process (via cgroup membership)
RunsOn and AllocatedTo are container-to-hardware affinity relationships.
Properties:
- `connection_type`: "cpu_affinity" | "numa_affinity"
- `relationship`: "runs_on" | "allocated_to"
Usage:
- Container → CPU Core (via cpuset.cpus)
- Container → NUMA Node (via cpuset.mems)
The container discovery system supports all major container runtimes through cgroup filesystem scanning:
- Docker: Standalone Docker daemon
- containerd: Standalone containerd
- CRI-containerd: Kubernetes with containerd CRI
- CRI-O: Kubernetes with CRI-O runtime
- Podman: Rootless and rootful Podman
Cgroup v1 Search Paths:
```
/sys/fs/cgroup/cpu/docker/        # Docker containers
/sys/fs/cgroup/cpu/containerd/    # containerd containers
/sys/fs/cgroup/cpu/system.slice/  # systemd-managed containers
/sys/fs/cgroup/cpu/machine.slice/ # systemd machines (Podman)
/sys/fs/cgroup/cpu/crio/          # CRI-O containers
```
Cgroup v2 Unified Hierarchy:
```
/sys/fs/cgroup/                      # All containers in unified hierarchy
├── system.slice/
│   ├── docker-abc123.scope          # Docker containers
│   ├── cri-containerd-def456.scope  # CRI-containerd containers
│   └── crio-ghi789.scope            # CRI-O containers
└── machine.slice/
    └── libpod-jkl012.scope          # Podman containers
```
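As an illustration of how the runtime can be inferred from these naming conventions, here is a minimal Go sketch; the function name and exact pattern set are assumptions, not the agent's actual code:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// detectRuntime is a hypothetical helper that maps a cgroup path to a
// runtime name using the directory conventions listed above.
func detectRuntime(cgroupPath string) string {
	base := path.Base(cgroupPath)
	switch {
	case strings.HasPrefix(base, "docker-"), strings.Contains(cgroupPath, "/docker/"):
		return "docker"
	case strings.HasPrefix(base, "cri-containerd-"):
		return "cri-containerd"
	case strings.HasPrefix(base, "crio-"), strings.Contains(cgroupPath, "/crio/"):
		return "cri-o"
	case strings.HasPrefix(base, "libpod-"):
		return "podman"
	case strings.Contains(cgroupPath, "/containerd/"):
		return "containerd"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(detectRuntime("/sys/fs/cgroup/system.slice/docker-abc123.scope"))  // docker
	fmt.Println(detectRuntime("/sys/fs/cgroup/machine.slice/libpod-jkl012.scope")) // podman
}
```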
The system handles various container ID formats:
```
// Systemd scope units
"docker-abc123def456.scope"          → "abc123def456"
"cri-containerd-abc123def456.scope"  → "abc123def456"
"crio-abc123def456.scope"            → "abc123def456"
"libpod-abc123def456.scope"          → "abc123def456"

// Directory paths
"/docker/abc123def456"      → "abc123def456"
"/containerd/abc123def456"  → "abc123def456"
"/crio/abc123def456"        → "abc123def456"

// Kubernetes pod paths
"/kubepods.slice/kubepods-pod123.slice/abc123def456" → "abc123def456"
```
The discovery system handles mixed environments where different runtimes use different cgroup versions:
Example Mixed Environment:
- Docker using cgroup v1 (cgroupfs driver)
- Podman using cgroup v2 (systemd driver)
- Kubernetes using cgroup v2 (systemd driver)
Discovery Strategy:
1. Scan the cgroup v2 unified hierarchy first
2. Scan cgroup v1 subsystems (cpu, memory, etc.)
3. Deduplicate containers by ID across versions (sketched below)
4. Preserve runtime-specific metadata
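A sketch of the deduplication step, assuming a hypothetical `Container` record and scan results passed in v2-first order:

```go
package runtimegraph

// Container is an illustrative stand-in for the agent's container record.
type Container struct {
	ID            string
	Runtime       string
	CgroupVersion string // "v1" or "v2"
}

// dedupe merges scan results across cgroup versions: the v2 list is walked
// first, so when the same ID appears in both hierarchies the v2 record wins
// and the v1 duplicate is dropped.
func dedupe(v2, v1 []Container) []Container {
	seen := make(map[string]bool, len(v2)+len(v1))
	out := make([]Container, 0, len(v2)+len(v1))
	for _, list := range [][]Container{v2, v1} {
		for _, c := range list {
			if seen[c.ID] {
				continue
			}
			seen[c.ID] = true
			out = append(out, c)
		}
	}
	return out
}
```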
The system builds complete process trees using PPID (Parent Process ID) relationships:
```
systemd (PID: 1)
├── kubelet (PID: 4567, PPID: 1)
├── nginx-master (PID: 1234, PPID: 1)
│   ├── nginx-worker (PID: 1235, PPID: 1234)
│   └── nginx-worker (PID: 1236, PPID: 1234)
└── postgres (PID: 2345, PPID: 1)
    └── postgres-writer (PID: 2346, PPID: 2345)
```
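Topology building reduces to grouping processes by PPID; a compact sketch with an illustrative `Process` type:

```go
package runtimegraph

// Process is an illustrative subset of the fields parsed from /proc/[pid]/stat.
type Process struct {
	PID, PPID int
	Command   string
}

// buildTopology groups processes by parent PID. Walking the returned map
// from PID 1 reproduces the tree above, and each (parent, child) pair
// becomes a ParentOf relationship in the graph.
func buildTopology(procs []Process) map[int][]Process {
	children := make(map[int][]Process, len(procs))
	for _, p := range procs {
		children[p.PPID] = append(children[p.PPID], p)
	}
	return children
}
```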
Process states from `/proc/[pid]/stat` are mapped to enums:
| Linux State | Enum Value | Description |
|---|---|---|
| R | ProcessStateRunning | Running or runnable |
| S | ProcessStateSleeping | Interruptible sleep |
| D | ProcessStateDiskSleep | Uninterruptible sleep (usually I/O) |
| Z | ProcessStateZombie | Zombie process |
| T | ProcessStateStopped | Stopped (by signal) |
| t | ProcessStateTracingStop | Tracing stop |
| W | ProcessStatePaging | Paging (obsolete) |
| X | ProcessStateDead | Dead |
| K | ProcessStateWakeKill | Wake kill |
| P | ProcessStateParked | Parked |
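A Go sketch of this mapping, using the enum names from the table (the `ProcessStateUnknown` fallback is an assumption):

```go
package runtimegraph

// ProcessState mirrors the enum names in the table above.
type ProcessState int

const (
	ProcessStateUnknown ProcessState = iota
	ProcessStateRunning
	ProcessStateSleeping
	ProcessStateDiskSleep
	ProcessStateZombie
	ProcessStateStopped
	ProcessStateTracingStop
	ProcessStatePaging
	ProcessStateDead
	ProcessStateWakeKill
	ProcessStateParked
)

// stateFromChar maps the single-character state field of /proc/[pid]/stat
// to the enum; unrecognized characters fall back to ProcessStateUnknown
// via the map's zero value.
var stateFromChar = map[byte]ProcessState{
	'R': ProcessStateRunning,
	'S': ProcessStateSleeping,
	'D': ProcessStateDiskSleep,
	'Z': ProcessStateZombie,
	'T': ProcessStateStopped,
	't': ProcessStateTracingStop,
	'W': ProcessStatePaging,
	'X': ProcessStateDead,
	'K': ProcessStateWakeKill,
	'P': ProcessStateParked,
}
```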
The runtime graph integrates with performance collectors for comprehensive process information:
Current Integration:
- Basic process discovery from cgroup scanning
- Process metadata from `/proc/[pid]/stat`
- Parent-child relationship building
Planned Integration:
- Full process metrics from Process Collector
- CPU usage and memory consumption
- File descriptor and thread counts
- Context switch statistics
Runtime Graph Structure:
```
├── ContainerNode (pause-container-k8s-pod-abc123)
│   └── [Contains] → ProcessNode (pause, PID: 5678)
├── ContainerNode (nginx-app-container-def456)
│   ├── [Contains] → ProcessNode (nginx-master, PID: 1234)
│   ├── [Contains] → ProcessNode (nginx-worker, PID: 1235)
│   └── [Contains] → ProcessNode (nginx-worker, PID: 1236)
├── ContainerNode (postgres-db-container-ghi789)
│   ├── [Contains] → ProcessNode (postgres-main, PID: 2345)
│   └── [Contains] → ProcessNode (postgres-writer, PID: 2346)
└── System Processes:
    ├── ProcessNode (systemd, PID: 1)
    │   ├── [ParentOf] → ProcessNode (kubelet, PID: 4567)
    │   ├── [ParentOf] → ProcessNode (nginx-master, PID: 1234)
    │   └── [ParentOf] → ProcessNode (postgres-main, PID: 2345)
    └── ProcessNode (nginx-master, PID: 1234)
        ├── [ParentOf] → ProcessNode (nginx-worker, PID: 1235)
        └── [ParentOf] → ProcessNode (nginx-worker, PID: 1236)
```
KIND Cluster Example (44 containers discovered):
- 21 Kubernetes pods
- 44 total containers (including pause containers)
- Multiple container runtimes detected:
  - containerd (Kubernetes workloads)
  - Direct container deployments
Runtime Breakdown:
```
Runtime Distribution:
├── containerd: 35 containers
├── docker: 6 containers
├── cri-containerd: 2 containers
└── unknown: 1 container
```
Container resource limits are extracted from cgroup files:
Cgroup v1:
```
/sys/fs/cgroup/cpu/docker/abc123/cpu.shares               → cpu_shares
/sys/fs/cgroup/cpu/docker/abc123/cpu.cfs_quota_us         → cpu_quota_us
/sys/fs/cgroup/cpu/docker/abc123/cpu.cfs_period_us        → cpu_period_us
/sys/fs/cgroup/memory/docker/abc123/memory.limit_in_bytes → memory_limit_bytes
/sys/fs/cgroup/cpuset/docker/abc123/cpuset.cpus           → cpuset_cpus
/sys/fs/cgroup/cpuset/docker/abc123/cpuset.mems           → cpuset_mems
```
Cgroup v2:
```
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.weight  → cpu_shares (converted)
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max     → cpu_quota_us/cpu_period_us
/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max  → memory_limit_bytes
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpuset.cpus → cpuset_cpus
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpuset.mems → cpuset_mems
```
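For cgroup v2, `cpu.max` packs quota and period into one file, and `cpu.weight` must be converted back to a v1-style share value. A minimal sketch; the conversion shown is the inverse of the shares→weight mapping used by runc, and the agent's exact formula may differ:

```go
package runtimegraph

import (
	"fmt"
	"strconv"
	"strings"
)

// weightToShares converts a cgroup v2 cpu.weight (1-10000) back to a
// v1-style cpu.shares value (2-262144). Inverse of runc's shares→weight
// mapping; lossy due to integer rounding.
func weightToShares(weight uint64) uint64 {
	return 2 + ((weight-1)*262142)/9999
}

// parseCPUMax parses cgroup v2 cpu.max, which holds "<quota> <period>" or
// "max <period>". A quota of -1 is used here to mean "unlimited".
func parseCPUMax(contents string) (quotaUS, periodUS int64, err error) {
	fields := strings.Fields(contents)
	if len(fields) != 2 {
		return 0, 0, fmt.Errorf("unexpected cpu.max format: %q", contents)
	}
	if periodUS, err = strconv.ParseInt(fields[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if fields[0] == "max" {
		return -1, periodUS, nil
	}
	quotaUS, err = strconv.ParseInt(fields[0], 10, 64)
	return quotaUS, periodUS, err
}
```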
Container hardware affinity will be determined by parsing cgroup constraints:
CPU Affinity:
```
cpuset.cpus = "0-3,8" → container runs on CPU cores 0, 1, 2, 3, 8
Creates RunsOn relationships: Container → CPU cores 0, 1, 2, 3, 8
```
NUMA Affinity:
```
cpuset.mems = "0,1" → container uses NUMA nodes 0 and 1
Creates AllocatedTo relationships: Container → NUMA nodes 0, 1
```
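A sketch of expanding a cpuset list into individual IDs, from which the planned RunsOn/AllocatedTo edges would be created (the helper name is illustrative):

```go
package runtimegraph

import (
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset list such as "0-3,8" into individual IDs;
// the same parser applies to cpuset.mems for NUMA nodes.
func parseCPUSet(s string) ([]int, error) {
	var ids []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for id := start; id <= end; id++ {
				ids = append(ids, id)
			}
		} else {
			id, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			ids = append(ids, id)
		}
	}
	return ids, nil
}
```

For example, `parseCPUSet("0-3,8")` yields `[0 1 2 3 8]`, producing one RunsOn edge per core.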
- Container discovery: Filesystem scanning, ~50ms on typical systems
- Process topology: process table enumeration via `/proc`
- Update interval: 30 seconds (vs 5 minutes for hardware)
- Incremental updates: Only changed containers/processes
- Container node: ~300-800 bytes per container
- Process node: ~200-400 bytes per process
- Typical system: 10-50 containers, 100-500 processes
- Total storage: <50KB per system
- Runtime snapshot: Held temporarily during collection
- Graph builder: Processes incrementally
- Deduplication: Prevents duplicate container discovery
Link containers to their processes via cgroup membership:
```
Container → [Contains] → Process (via /proc/[pid]/cgroup)
```
Implementation Plan:
- Parse `/proc/[pid]/cgroup` for each process
- Match cgroup paths to discovered containers
- Create Contains relationships
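A sketch of the first two steps of this plan, assuming cgroup v2 (where `/proc/[pid]/cgroup` contains a single `0::<path>` entry) and a hypothetical path-to-ID lookup:

```go
package runtimegraph

import (
	"fmt"
	"os"
	"strings"
)

// cgroupPathForPID reads /proc/[pid]/cgroup and returns the cgroup v2 path
// (the "0::<path>" entry). Handling v1 entries would need per-controller logic.
func cgroupPathForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(data), "\n") {
		// Each line is "hierarchy-ID:controllers:path"; v2 uses hierarchy ID 0.
		parts := strings.SplitN(line, ":", 3)
		if len(parts) == 3 && parts[0] == "0" {
			return parts[2], nil
		}
	}
	return "", fmt.Errorf("no cgroup v2 entry for pid %d", pid)
}

// matchContainer is a hypothetical lookup: a container whose cgroup path is
// a prefix of the process's cgroup path contains that process.
func matchContainer(procPath string, pathToID map[string]string) (string, bool) {
	for cgPath, id := range pathToID {
		if strings.HasPrefix(procPath, cgPath) {
			return id, true
		}
	}
	return "", false
}
```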
Link containers to hardware resources via cgroup constraints:
```
Container → [RunsOn] → CPU Core (via cpuset.cpus)
Container → [AllocatedTo] → NUMA Node (via cpuset.mems)
```
Implementation Plan:
- Parse cpuset.cpus and cpuset.mems from container cgroups
- Cross-reference with hardware topology
- Create RunsOn and AllocatedTo relationships
Bridge runtime graph with Kubernetes and hardware graphs:
```
K8s Pod → [ScheduledOn] → ContainerNode
ContainerNode → [RunsOn] → SystemNode
ProcessNode → [ExecutesOn] → CPUCoreNode
```
Integration with performance collectors for rich process data:
- Real-time CPU and memory usage
- I/O statistics and file descriptor counts
- Network connection tracking
- Container-specific cgroup metrics
- Image metadata: Layer information, vulnerabilities
- Runtime configuration: Environment variables, volumes
- Resource usage: Real-time metrics from cgroup stats
- Network topology: Container networking relationships
Modern Kubernetes environments can have extremely high container density:
- Large nodes: 100-500 containers per node
- Microservices: 1000s of containers per cluster
- Pod churn: Frequent container creation/deletion
- Result: millions of containers across large fleets
Similar to hardware deduplication, but optimized for container lifecycle:
- Container template deduplication - Common image+configuration patterns stored once
- Ephemeral instance tracking - Only store container-specific data (ID, start time, node)
- Lifecycle-aware storage - Automatic cleanup of terminated containers
- Process tree caching - Common process patterns (nginx master+workers) deduplicated
For 1 million containers:
- Without deduplication: 500 bytes × 1M = 500MB
- With template catalog:
  - Unique container templates: 10,000 × 500 bytes = 5MB
  - Container instances: 1M × 50 bytes = 50MB
  - Total: 55MB (~90% reduction)
The approach scales with container diversity, not container count, making it ideal for standardized microservice deployments.
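A sketch of what such a template catalog could look like, with an assumed field set for the dedup key (image plus limits); identical replicas then share one template record:

```go
package runtimegraph

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// ContainerTemplate holds the fields shared by identical replicas; this
// field set is an assumption for illustration.
type ContainerTemplate struct {
	ImageName, ImageTag string
	MemoryLimitBytes    int64
	CPUQuotaUS          int64
}

// ContainerInstance stores only per-container data, referencing its template.
type ContainerInstance struct {
	ID          string
	TemplateKey string
	Node        string
	StartedAt   time.Time
}

// templateKey derives a stable catalog key from the template fields, so N
// identical replicas share one ~500-byte template record and each instance
// costs only the small per-instance struct.
func templateKey(t ContainerTemplate) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%s|%d|%d",
		t.ImageName, t.ImageTag, t.MemoryLimitBytes, t.CPUQuotaUS)))
	return hex.EncodeToString(sum[:8])
}
```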
- Linux cgroups documentation
- Cgroup v2 documentation
- Container Runtime Interface (CRI)
- systemd scope units
- Process states in Linux
- BadgerDB documentation
- Protocol Buffers
This document was created as part of PR #167 container-process-hardware topology implementation. Last updated: 2025-08-21