
Runtime Discovery

COMPLETE: Core runtime discovery is implemented and operational. The system collects comprehensive container and process information and builds a graph representation with typed relationships; container-process and container-hardware relationships are still planned (see status below).

Implementation Status:

  • ✅ Runtime graph builder
  • ✅ Protobuf data models
  • ✅ Resource store integration
  • ✅ Container discovery system
  • ✅ Process topology building
  • ✅ Parent-child process relationships
  • 🚧 Container-process relationships (planned)
  • 🚧 Container-hardware affinity relationships (planned)

Overview

The Runtime Graph feature adds container and process discovery with graph representation to the Antimetal Agent, enabling runtime resources to be represented as nodes and relationships in "The Graph" alongside Kubernetes and hardware resources. This creates a complete topology from hardware → containers → processes.

Architecture

Components

┌─────────────────────────────────────────────────────────────┐
│                     Container Discovery                     │
│     (Cgroup filesystem scanning, multi-runtime support)     │
└──────────────────────────────┬──────────────────────────────┘
                               │ Container information
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Runtime Manager                       │
│  - Periodic collection orchestration (30s intervals)        │
│  - Runtime snapshot aggregation                             │
│  - Integration with performance collectors                  │
└──────────────────────────────┬──────────────────────────────┘
                               │ Runtime snapshot
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Runtime Graph Builder                    │
│  - Converts discovery data to graph nodes                   │
│  - Creates RDF triplet relationships                        │
│  - Process topology building                                │
└──────────────────────────────┬──────────────────────────────┘
                               │ Resources & Relationships
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Resource Store                        │
│         (BadgerDB - stores nodes and relationships)         │
└─────────────────────────────────────────────────────────────┘

Data Flow

  1. Container Discovery scans cgroup filesystem for running containers
  2. Performance Collectors provide process information (future integration)
  3. Runtime Manager orchestrates periodic collection (default: 30 seconds)
  4. Graph Builder transforms raw data into graph nodes and relationships
  5. Resource Store persists the runtime graph using RDF triplets
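
The orchestration step (3) is easiest to see as code. Below is a minimal sketch of a 30-second collection loop, assuming a hypothetical `Discoverer` interface and `publish` callback; the agent's actual Runtime Manager types will differ:

```go
package manager

import (
	"context"
	"log"
	"time"
)

// Snapshot is a simplified stand-in for the agent's runtime snapshot.
type Snapshot struct {
	ContainerIDs []string
	CollectedAt  time.Time
}

// Discoverer abstracts the container discovery step (hypothetical).
type Discoverer interface {
	Discover(ctx context.Context) (Snapshot, error)
}

// RunCollectionLoop polls discovery on a fixed interval (30s in this
// design) and hands each snapshot to the graph builder via publish.
func RunCollectionLoop(ctx context.Context, d Discoverer, interval time.Duration, publish func(Snapshot)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			snap, err := d.Discover(ctx)
			if err != nil {
				log.Printf("runtime discovery failed: %v", err)
				continue // keep the loop alive; retry next tick
			}
			publish(snap)
		}
	}
}
```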

Runtime Ontology

Complete Runtime Graph Diagram

```mermaid
graph TB
    %% Node Style Definitions
    classDef containerNode fill:#e3f2fd,stroke:#0277bd,stroke-width:3px
    classDef processNode fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef systemProcess fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef k8sProcess fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    
    %% Container Nodes (discovered from cgroups)
    NGINX_CTR["ContainerNode<br/>🐳 nginx-web<br/>📦 nginx:1.21<br/>🏃 containerd<br/>💾 512MB limit<br/>🧮 CPUs: 0-3<br/>📁 /sys/fs/cgroup/system.slice/docker-abc123.scope"]
    
    POSTGRES_CTR["ContainerNode<br/>🐳 postgres-db<br/>📦 postgres:13<br/>🏃 docker<br/>💾 2GB limit<br/>🧮 CPUs: 4-7<br/>📁 /sys/fs/cgroup/docker/def456"]
    
    REDIS_CTR["ContainerNode<br/>🐳 redis-cache<br/>📦 redis:6.2<br/>🏃 cri-o<br/>💾 256MB limit<br/>🧮 CPUs: 0-1<br/>📁 /sys/fs/cgroup/machine.slice/crio-ghi789.scope"]
    
    %% Process Nodes (discovered from /proc)
    NGINX_MASTER["ProcessNode<br/>👑 nginx-master<br/>🆔 PID: 1234<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:00"]
    
    NGINX_WORKER1["ProcessNode<br/>👷 nginx-worker<br/>🆔 PID: 1235<br/>🔗 PPID: 1234<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:01"]
    
    NGINX_WORKER2["ProcessNode<br/>👷 nginx-worker<br/>🆔 PID: 1236<br/>🔗 PPID: 1234<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 10:30:01"]
    
    POSTGRES_MAIN["ProcessNode<br/>🗄️ postgres<br/>🆔 PID: 2345<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 09:15:00"]
    
    POSTGRES_WRITER["ProcessNode<br/>✍️ postgres-writer<br/>🆔 PID: 2346<br/>🔗 PPID: 2345<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 09:15:02"]
    
    REDIS_MAIN["ProcessNode<br/>⚡ redis-server<br/>🆔 PID: 3456<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 11:00:00"]
    
    SYSTEMD["ProcessNode<br/>🏗️ systemd<br/>🆔 PID: 1<br/>🔗 PPID: 0<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 08:00:00"]
    
    KUBELET["ProcessNode<br/>☸️ kubelet<br/>🆔 PID: 4567<br/>🔗 PPID: 1<br/>📊 State: S<br/>⏱️ Start: 2025-01-20 08:30:00"]
    
    %% Container-Process Relationships (Planned)
    NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_MASTER
    NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_WORKER1
    NGINX_CTR -.->|"Contains<br/>(future)"| NGINX_WORKER2
    
    POSTGRES_CTR -.->|"Contains<br/>(future)"| POSTGRES_MAIN
    POSTGRES_CTR -.->|"Contains<br/>(future)"| POSTGRES_WRITER
    
    REDIS_CTR -.->|"Contains<br/>(future)"| REDIS_MAIN
    
    %% Process Parent-Child Relationships (Implemented)
    SYSTEMD -->|"ParentOf<br/>(implemented)"| NGINX_MASTER
    SYSTEMD -->|"ParentOf<br/>(implemented)"| POSTGRES_MAIN
    SYSTEMD -->|"ParentOf<br/>(implemented)"| REDIS_MAIN
    SYSTEMD -->|"ParentOf<br/>(implemented)"| KUBELET
    
    NGINX_MASTER -->|"ParentOf<br/>(implemented)"| NGINX_WORKER1
    NGINX_MASTER -->|"ParentOf<br/>(implemented)"| NGINX_WORKER2
    
    POSTGRES_MAIN -->|"ParentOf<br/>(implemented)"| POSTGRES_WRITER
    
    %% Apply Styles
    class NGINX_CTR,POSTGRES_CTR,REDIS_CTR containerNode
    class NGINX_MASTER,NGINX_WORKER1,NGINX_WORKER2,POSTGRES_MAIN,POSTGRES_WRITER,REDIS_MAIN processNode
    class SYSTEMD systemProcess
    class KUBELET k8sProcess
```

Relationship Types in the Diagram

| Relationship Type | Visual Style | Status | Description | Examples |
|-------------------|--------------|--------|-------------|----------|
| ParentOf | Solid arrow | ✅ Implemented | Process parent-child relationships via PPID | systemd → nginx-master<br/>nginx-master → nginx-worker |
| Contains | Dotted arrow | 🚧 Planned | Container-to-process containment | nginx-container → nginx-process |
| RunsOn | Dotted arrow | 🚧 Planned | Container-to-hardware affinity | nginx-container → CPU Core 0-3 |
| AllocatedTo | Dotted arrow | 🚧 Planned | Container memory allocation | postgres-container → NUMA Node 1 |

Node Types

ContainerNode

Represents a discovered container with its runtime configuration and resource constraints.

Properties:

  • container_id: Unique container identifier (may be truncated)
  • runtime: Container runtime (docker, containerd, cri-containerd, cri-o, podman)
  • cgroup_version: Cgroup version (v1 or v2)
  • cgroup_path: Filesystem path to container's cgroup directory
  • image_name: Container image name (e.g., "nginx", "postgres")
  • image_tag: Container image tag (e.g., "latest", "13")
  • labels: Runtime-specific labels and annotations
  • created_at: Container creation timestamp
  • started_at: Container start timestamp

Resource Limits:

  • cpu_shares: Relative CPU weight (cgroup v1 cpu.shares)
  • cpu_quota_us: CPU quota in microseconds per period
  • cpu_period_us: CPU quota enforcement period
  • memory_limit_bytes: Memory limit in bytes
  • cpuset_cpus: CPU cores allowed (e.g., "0-3,8")
  • cpuset_mems: NUMA nodes allowed (e.g., "0,1")
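
As a rough illustration, a ContainerNode shaped after these properties might look like the Go struct below; the actual protobuf-generated model will differ in field types and naming:

```go
package runtimegraph

import "time"

// ContainerNode sketches the documented properties; illustrative only.
type ContainerNode struct {
	ContainerID   string            // unique ID, possibly truncated
	Runtime       string            // docker, containerd, cri-containerd, cri-o, podman
	CgroupVersion string            // "v1" or "v2"
	CgroupPath    string            // path to the container's cgroup directory
	ImageName     string            // e.g. "nginx"
	ImageTag      string            // e.g. "13"
	Labels        map[string]string // runtime-specific labels and annotations
	CreatedAt     time.Time         // container creation timestamp
	StartedAt     time.Time         // container start timestamp

	// Resource limits read from cgroup files.
	CPUShares        uint64 // relative CPU weight (cgroup v1 cpu.shares)
	CPUQuotaUs       int64  // CPU quota in microseconds per period (-1 = unlimited)
	CPUPeriodUs      uint64 // quota enforcement period
	MemoryLimitBytes uint64 // memory limit in bytes
	CpusetCpus       string // allowed CPU cores, e.g. "0-3,8"
	CpusetMems       string // allowed NUMA nodes, e.g. "0,1"
}
```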

ProcessNode

Individual process running on the system.

Properties:

  • pid: Process identifier
  • ppid: Parent process identifier
  • pgid: Process group identifier
  • sid: Session identifier
  • command: Process command name
  • cmdline: Full command line with arguments
  • state: Process state (R, S, D, Z, T, t, W, X, K, P)
  • start_time: Process start timestamp
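
A minimal sketch of populating such a node from /proc/[pid]/stat is shown below (field layout per proc(5)); note the parse-after-last-')' step, since the command name is parenthesized and may itself contain spaces or ')'. The helper name is illustrative:

```go
package runtimegraph

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ProcessNode mirrors the properties listed above (simplified).
type ProcessNode struct {
	PID, PPID, PGID, SID int
	Command              string
	State                string // single-letter state, e.g. "S"
}

// ReadProcessNode fills a node from /proc/[pid]/stat.
func ReadProcessNode(pid int) (*ProcessNode, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return nil, err
	}
	s := string(data)
	// comm is parenthesized and may contain spaces or ')', so split
	// on the *last* ')' before reading the space-separated fields.
	open, end := strings.IndexByte(s, '('), strings.LastIndexByte(s, ')')
	if open < 0 || end < open {
		return nil, fmt.Errorf("malformed stat for pid %d", pid)
	}
	fields := strings.Fields(s[end+1:]) // [0]=state [1]=ppid [2]=pgrp [3]=session
	if len(fields) < 4 {
		return nil, fmt.Errorf("short stat for pid %d", pid)
	}
	ppid, _ := strconv.Atoi(fields[1])
	pgid, _ := strconv.Atoi(fields[2])
	sid, _ := strconv.Atoi(fields[3])
	return &ProcessNode{PID: pid, PPID: ppid, PGID: pgid, SID: sid,
		Command: s[open+1 : end], State: fields[0]}, nil
}
```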

Relationship Types

ParentOfPredicate

Process parent-child relationship based on PPID.

Properties:

  • containment_type: "process"
  • relationship: "parent_of"

Usage:

  • Parent Process → Child Process

ContainsProcessPredicate (Planned)

Container-to-process containment relationship.

Properties:

  • containment_type: "process"
  • relationship: "contains"

Usage:

  • Container → Process (via cgroup membership)

ContainerHardwareAffinityPredicate (Planned)

Container-to-hardware affinity relationships.

Properties:

  • connection_type: "cpu_affinity" | "numa_affinity"
  • relationship: "runs_on" | "allocated_to"

Usage:

  • Container → CPU Core (via cpuset.cpus)
  • Container → NUMA Node (via cpuset.mems)
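
Conceptually, each of these predicates becomes a subject-predicate-object triplet in the store. The sketch below is an illustrative simplification; the real predicates are protobuf messages, and the string encoding shown here is an assumption:

```go
package runtimegraph

// Triplet is a simplified stand-in for a stored RDF-style relationship.
type Triplet struct {
	Subject   string            // e.g. "process:1"
	Predicate string            // "parent_of", "contains", "runs_on", "allocated_to"
	Object    string            // e.g. "process:1234"
	Props     map[string]string // predicate properties such as containment_type
}

// Example: systemd (PID 1) is the parent of nginx-master (PID 1234).
var example = Triplet{
	Subject:   "process:1",
	Predicate: "parent_of",
	Object:    "process:1234",
	Props:     map[string]string{"containment_type": "process"},
}
```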

Container Discovery System

Multi-Runtime Support

The container discovery system supports all major container runtimes through cgroup filesystem scanning:

Supported Runtimes

  • Docker: Standalone Docker daemon
  • containerd: Standalone containerd
  • CRI-containerd: Kubernetes with containerd CRI
  • CRI-O: Kubernetes with CRI-O runtime
  • Podman: Rootless and rootful Podman

Detection Methodology

Cgroup v1 Search Paths:

/sys/fs/cgroup/cpu/docker/          # Docker containers
/sys/fs/cgroup/cpu/containerd/      # containerd containers  
/sys/fs/cgroup/cpu/system.slice/    # systemd-managed containers
/sys/fs/cgroup/cpu/machine.slice/   # systemd machines (Podman)
/sys/fs/cgroup/cpu/crio/            # CRI-O containers

Cgroup v2 Unified Hierarchy:

/sys/fs/cgroup/                     # All containers in unified hierarchy
├── system.slice/
│   ├── docker-abc123.scope         # Docker containers
│   ├── cri-containerd-def456.scope # CRI-containerd containers
│   └── crio-ghi789.scope          # CRI-O containers
└── machine.slice/
    └── libpod-jkl012.scope        # Podman containers

Container ID Extraction Patterns

The system handles various container ID formats:

// Systemd scope units
"docker-abc123def456.scope""abc123def456"
"cri-containerd-abc123def456.scope""abc123def456"  
"crio-abc123def456.scope""abc123def456"
"libpod-abc123def456.scope""abc123def456"

// Directory paths
"/docker/abc123def456""abc123def456"
"/containerd/abc123def456""abc123def456"
"/crio/abc123def456""abc123def456"

// Kubernetes pod paths
"/kubepods.slice/kubepods-pod123.slice/abc123def456""abc123def456"

Cross-Cgroup Version Compatibility

The discovery system handles mixed environments where different runtimes use different cgroup versions:

Example Mixed Environment:

  • Docker using cgroup v1 (cgroupfs driver)
  • Podman using cgroup v2 (systemd driver)
  • Kubernetes using cgroup v2 (systemd driver)

Discovery Strategy:

  1. Scan cgroup v2 unified hierarchy first
  2. Scan cgroup v1 subsystems (cpu, memory, etc.)
  3. Deduplicate containers by ID across versions
  4. Preserve runtime-specific metadata
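
Step 3 (deduplication) reduces to keeping the first sighting per container ID, as sketched below with a minimal local type standing in for the discovered record:

```go
package discovery

// container is a minimal local stand-in for a discovered record.
type container struct{ ID, Runtime string }

// dedupeByID keeps the first sighting of each container ID, so a
// container found in both v1 and v2 hierarchies is stored once.
func dedupeByID(found []container) []container {
	seen := make(map[string]bool, len(found))
	out := found[:0] // reuse the backing array
	for _, c := range found {
		if seen[c.ID] {
			continue // already discovered under another cgroup version
		}
		seen[c.ID] = true
		out = append(out, c)
	}
	return out
}
```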

Process Topology Building

Parent-Child Relationship Discovery

The system builds complete process trees using PPID (Parent Process ID) relationships:

systemd (PID: 1)
├── kubelet (PID: 4567, PPID: 1)
├── nginx-master (PID: 1234, PPID: 1)
│   ├── nginx-worker (PID: 1235, PPID: 1234)
│   └── nginx-worker (PID: 1236, PPID: 1234)
└── postgres (PID: 2345, PPID: 1)
    └── postgres-writer (PID: 2346, PPID: 2345)
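
Building these edges amounts to a single pass over a PID-indexed snapshot, as sketched below (illustrative types, not the agent's own):

```go
package runtimegraph

type proc struct{ PID, PPID int }

// buildParentOf returns child-PID → parent-PID edges for every process
// whose parent is also present in the snapshot.
func buildParentOf(procs []proc) map[int]int {
	present := make(map[int]bool, len(procs))
	for _, p := range procs {
		present[p.PID] = true
	}
	edges := make(map[int]int)
	for _, p := range procs {
		if present[p.PPID] { // skip orphans whose parent already exited
			edges[p.PID] = p.PPID
		}
	}
	return edges
}
```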

Process State Mapping

Process states from /proc/[pid]/stat are mapped to enums:

| Linux State | Enum Value | Description |
|-------------|------------|-------------|
| R | ProcessStateRunning | Running or runnable |
| S | ProcessStateSleeping | Interruptible sleep |
| D | ProcessStateDiskSleep | Uninterruptible sleep (usually I/O) |
| Z | ProcessStateZombie | Zombie process |
| T | ProcessStateStopped | Stopped (by signal) |
| t | ProcessStateTracingStop | Tracing stop |
| W | ProcessStatePaging | Paging (obsolete) |
| X | ProcessStateDead | Dead |
| K | ProcessStateWakeKill | Wake kill |
| P | ProcessStateParked | Parked |
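
A direct transcription of the table as a Go lookup (enum names taken verbatim from the table; the agent's generated enums will be typed constants rather than strings):

```go
package runtimegraph

// stateNames maps the /proc state letter to its enum name.
var stateNames = map[byte]string{
	'R': "ProcessStateRunning", 'S': "ProcessStateSleeping",
	'D': "ProcessStateDiskSleep", 'Z': "ProcessStateZombie",
	'T': "ProcessStateStopped", 't': "ProcessStateTracingStop",
	'W': "ProcessStatePaging", 'X': "ProcessStateDead",
	'K': "ProcessStateWakeKill", 'P': "ProcessStateParked",
}
```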

Integration with Performance Collectors

The runtime graph integrates with performance collectors for comprehensive process information:

Current Integration:

  • Basic process discovery from cgroup scanning
  • Process metadata from /proc/[pid]/stat
  • Parent-child relationship building

Planned Integration:

  • Full process metrics from Process Collector
  • CPU usage and memory consumption
  • File descriptor and thread counts
  • Context switch statistics

Example Graph Structure

Real-World Kubernetes Node

Runtime Graph Structure:
├── ContainerNode (pause-container-k8s-pod-abc123)
│   └── [Contains] → ProcessNode (pause, PID: 5678)
├── ContainerNode (nginx-app-container-def456)
│   ├── [Contains] → ProcessNode (nginx-master, PID: 1234)
│   ├── [Contains] → ProcessNode (nginx-worker, PID: 1235)
│   └── [Contains] → ProcessNode (nginx-worker, PID: 1236)
├── ContainerNode (postgres-db-container-ghi789)
│   ├── [Contains] → ProcessNode (postgres-main, PID: 2345)
│   └── [Contains] → ProcessNode (postgres-writer, PID: 2346)
└── System Processes:
    ├── ProcessNode (systemd, PID: 1)
    │   ├── [ParentOf] → ProcessNode (kubelet, PID: 4567)
    │   ├── [ParentOf] → ProcessNode (nginx-master, PID: 1234)
    │   └── [ParentOf] → ProcessNode (postgres-main, PID: 2345)
    └── ProcessNode (nginx-master, PID: 1234)
        ├── [ParentOf] → ProcessNode (nginx-worker, PID: 1235)
        └── [ParentOf] → ProcessNode (nginx-worker, PID: 1236)

Container Discovery Results

KIND Cluster Example (44 containers discovered):

  • 21 Kubernetes pods
  • 44 total containers (including pause containers)
  • Multiple container runtimes detected:
    • containerd (Kubernetes workloads)
    • Direct container deployments

Runtime Breakdown:

Runtime Distribution:
├── containerd: 35 containers
├── docker: 6 containers  
├── cri-containerd: 2 containers
└── unknown: 1 container

Cgroup Integration

Resource Constraint Discovery

Container resource limits are extracted from cgroup files:

Cgroup v1 Files

/sys/fs/cgroup/cpu/docker/abc123/cpu.shares        → cpu_shares
/sys/fs/cgroup/cpu/docker/abc123/cpu.cfs_quota_us  → cpu_quota_us  
/sys/fs/cgroup/cpu/docker/abc123/cpu.cfs_period_us → cpu_period_us
/sys/fs/cgroup/memory/docker/abc123/memory.limit_in_bytes → memory_limit_bytes
/sys/fs/cgroup/cpuset/docker/abc123/cpuset.cpus    → cpuset_cpus
/sys/fs/cgroup/cpuset/docker/abc123/cpuset.mems    → cpuset_mems

Cgroup v2 Files

/sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.weight     → cpu_shares (converted)
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpu.max        → cpu_quota_us/cpu_period_us
/sys/fs/cgroup/system.slice/docker-abc123.scope/memory.max     → memory_limit_bytes
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpuset.cpus    → cpuset_cpus
/sys/fs/cgroup/system.slice/docker-abc123.scope/cpuset.mems    → cpuset_mems
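
A sketch of reading these v2 files: parseCPUMax handles both the "max <period>" and "<quota> <period>" forms of cpu.max, and the weight-to-shares conversion inverts the shares→weight mapping used by runc and systemd — an assumption about how the agent back-fills the v1-style field:

```go
package discovery

import (
	"fmt"
	"os"
	"strings"
)

// parseCPUMax parses a cgroup v2 "cpu.max" file, whose contents are
// either "<quota> <period>" or "max <period>".
func parseCPUMax(path string) (quotaUs int64, periodUs uint64, err error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, 0, err
	}
	parts := strings.Fields(string(data))
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("unexpected cpu.max contents %q", data)
	}
	fmt.Sscan(parts[1], &periodUs)
	if parts[0] == "max" {
		return -1, periodUs, nil // -1 = unlimited quota
	}
	fmt.Sscan(parts[0], &quotaUs)
	return quotaUs, periodUs, nil
}

// weightToShares inverts runc's shares→weight mapping
// (weight = 1 + ((shares-2)*9999)/262142).
func weightToShares(weight uint64) uint64 {
	return 2 + ((weight-1)*262142)/9999
}
```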

Cgroup-to-Hardware Mapping (Planned)

Container hardware affinity will be determined by parsing cgroup constraints:

CPU Affinity:

cpuset.cpus = "0-3,8" → Container runs on CPU cores 0, 1, 2, 3, 8
Creates RunsOn relationships: Container → CPU Core 0, 1, 2, 3, 8

NUMA Affinity:

cpuset.mems = "0,1" → Container uses NUMA nodes 0 and 1
Creates AllocatedTo relationships: Container → NUMA Node 0, 1
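
Expanding a cpuset list into individual core or node IDs is the first step toward these relationships; a minimal sketch:

```go
package discovery

import (
	"strconv"
	"strings"
)

// parseCpusetList expands a cpuset list such as "0-3,8" into IDs.
// parseCpusetList("0-3,8") → [0 1 2 3 8]
func parseCpusetList(s string) ([]int, error) {
	var ids []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue // empty file or trailing comma
		}
		if lo, hi, ok := strings.Cut(part, "-"); ok {
			start, err := strconv.Atoi(lo)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(hi)
			if err != nil {
				return nil, err
			}
			for i := start; i <= end; i++ {
				ids = append(ids, i)
			}
		} else {
			id, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			ids = append(ids, id)
		}
	}
	return ids, nil
}
```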

Performance Considerations

Collection Overhead

  • Container discovery: Filesystem scanning, ~50ms on typical systems
  • Process topology: Process table enumeration via /proc
  • Update interval: 30 seconds (vs 5 minutes for hardware)
  • Incremental updates: Only changed containers/processes

Storage Impact

  • Container node: ~300-800 bytes per container
  • Process node: ~200-400 bytes per process
  • Typical system: 10-50 containers, 100-500 processes
  • Total storage: <50KB per system

Memory Usage

  • Runtime snapshot: Held temporarily during collection
  • Graph builder: Processes incrementally
  • Deduplication: Prevents duplicate container discovery

Future Enhancements

Container-Process Relationships

Link containers to their processes via cgroup membership:

Container → [Contains] → Process (via /proc/[pid]/cgroup)

Implementation Plan:

  1. Parse /proc/[pid]/cgroup for each process
  2. Match cgroup paths to discovered containers
  3. Create Contains relationships
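
A sketch of steps 1-2, assuming containers are indexed by cgroup path (helper names are hypothetical):

```go
package runtimegraph

import (
	"fmt"
	"os"
	"strings"
)

// cgroupPathForPID returns the path from /proc/[pid]/cgroup: the v2
// unified line ("0::<path>") or, on v1, the first subsystem line.
func cgroupPathForPID(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/cgroup", pid))
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) == 3 {
			return parts[2], nil
		}
	}
	return "", fmt.Errorf("no cgroup line for pid %d", pid)
}

// matchContainer links a process to a container when the process's
// cgroup path falls under the container's cgroup directory.
func matchContainer(procPath string, containersByPath map[string]string) (containerID string, ok bool) {
	for path, id := range containersByPath {
		if strings.HasPrefix(procPath, path) {
			return id, true
		}
	}
	return "", false
}
```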

Container-Hardware Affinity

Link containers to hardware resources via cgroup constraints:

Container → [RunsOn] → CPU Core (via cpuset.cpus)
Container → [AllocatedTo] → NUMA Node (via cpuset.mems)

Implementation Plan:

  1. Parse cpuset.cpus and cpuset.mems from container cgroups
  2. Cross-reference with hardware topology
  3. Create RunsOn and AllocatedTo relationships

Cross-Graph Integration

Bridge runtime graph with Kubernetes and hardware graphs:

K8s Pod → [ScheduledOn] → ContainerNode
ContainerNode → [RunsOn] → SystemNode
ProcessNode → [ExecutesOn] → CPUCoreNode

Extended Process Metrics

Integration with performance collectors for rich process data:

  • Real-time CPU and memory usage
  • I/O statistics and file descriptor counts
  • Network connection tracking
  • Container-specific cgroup metrics

Advanced Container Discovery

  • Image metadata: Layer information, vulnerabilities
  • Runtime configuration: Environment variables, volumes
  • Resource usage: Real-time metrics from cgroup stats
  • Network topology: Container networking relationships

Scalable Storage Architecture for Million-Container Deployments

The Container Density Challenge

Modern Kubernetes environments can have extremely high container density:

  • Large nodes: 100-500 containers per node
  • Microservices: 1000s of containers per cluster
  • Pod churn: Frequent container creation/deletion
  • Result: Millions of containers across large fleets

Proposed Approach

Similar to hardware deduplication, but optimized for container lifecycle:

  1. Container template deduplication - Common image+configuration patterns stored once
  2. Ephemeral instance tracking - Only store container-specific data (ID, start time, node)
  3. Lifecycle-aware storage - Automatic cleanup of terminated containers
  4. Process tree caching - Common process patterns (nginx master+workers) deduplicated
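
A hypothetical template key illustrating step 1: containers sharing an image and configuration hash to the same catalog entry, so each instance stores only its own ID, node, and start time. Everything below is a sketch of the proposal, not existing agent code:

```go
package store

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Template holds the configuration shared across identical containers.
type Template struct {
	Image            string // name:tag
	MemoryLimitBytes uint64
	CpusetCpus       string
}

// key hashes the shared configuration; one million identical nginx
// containers would all map to the same catalog entry.
func (t Template) key() string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s|%d|%s", t.Image, t.MemoryLimitBytes, t.CpusetCpus)))
	return hex.EncodeToString(sum[:8])
}
```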

Storage Impact

For 1 million containers:

  • Without deduplication: 500 bytes × 1M = 500MB
  • With template catalog:
    • Unique container templates: 10,000 × 500 bytes = 5MB
    • Container instances: 1M × 50 bytes = 50MB
    • Total: 55MB (90% reduction)

The approach scales with container diversity, not container count, making it ideal for standardized microservice deployments.

This document was created as part of the PR #167 container-process-hardware topology implementation. Last updated: 2025-08-21
