Testing - antimetal/system-agent GitHub Wiki
This guide covers testing cgroup collectors in development and production environments.
The project includes automated test scripts that validate:
- Container discovery across different runtimes
- Metric collection from cgroup files
- Compatibility with both cgroup v1 and v2
Steps to verify:
- Check mount points to identify cgroup version
- For v2: Look for the presence of the `cgroup.controllers` file in the root
- For v1: Check for separate controller directories (cpu, memory, etc.)
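The version check above can be sketched as a small shell function. This is a minimal sketch, not the agent's own detection logic; the helper name `detect_cgroup_version` and the default root `/sys/fs/cgroup` are assumptions made for illustration.

```shell
# Hypothetical helper (not part of the agent): report the cgroup version
# found under a given root directory, defaulting to /sys/fs/cgroup.
detect_cgroup_version() {
    root="${1:-/sys/fs/cgroup}"
    if [ -f "$root/cgroup.controllers" ]; then
        # The unified (v2) hierarchy exposes cgroup.controllers at its root.
        echo "v2"
    elif [ -d "$root/cpu" ] || [ -d "$root/memory" ]; then
        # v1 mounts each controller as its own directory.
        echo "v1"
    else
        echo "unknown"
    fi
}
```

Calling `detect_cgroup_version` with no argument inspects the host mount. On hybrid setups, where v1 controller directories sit next to a separate unified mount, this sketch reports `v1`.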
Validation process:
- List running containers using your container runtime
- Verify corresponding cgroup directories exist for each container
- Check both systemd scope units and direct runtime directories
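One way to run this validation is to search the cgroup tree for each container ID. The sketch below assumes the container ID appears in the directory name, as it typically does for both systemd scope units (for example `docker-<id>.scope`) and direct runtime directories (for example `docker/<id>`). Both function names are hypothetical.

```shell
# Hypothetical helper: print every cgroup directory under ROOT ($1)
# whose name contains the container ID ($2).
find_container_cgroups() {
    find "$1" -type d -name "*$2*" 2>/dev/null
}

# Hypothetical helper: read container IDs from stdin, one per line, and
# report those with no matching directory under ROOT ($1).
# Returns non-zero if any container is missing.
verify_containers() {
    missing=0
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        if [ -z "$(find_container_cgroups "$1" "$id")" ]; then
            echo "missing cgroup directory for $id"
            missing=1
        fi
    done
    return "$missing"
}
```

With Docker, for example, the IDs could be supplied as `docker ps -q --no-trunc | verify_containers /sys/fs/cgroup`. Other runtimes need their own listing command.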
Test setup:
- Start a test container with specific resource limits (e.g., 256MB memory, 0.5 CPUs)
- Verify metrics files are created in the container's cgroup directory
- Confirm CPU and memory statistics are being populated
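As a rough sketch of this setup on cgroup v2, the example limits translate into fixed values in the container's cgroup files: 256MB is 268435456 bytes in `memory.max`, and 0.5 CPUs is a quota of 50000 microseconds per 100000 microsecond period in `cpu.max`. The function name `check_limits` is hypothetical, and the `docker run` line in the comment is only one possible way to start such a container.

```shell
# Expected cgroup v2 values for a container started with limits such as:
#   docker run -d --memory 256m --cpus 0.5 <image>
EXPECTED_MEMORY_MAX=$((256 * 1024 * 1024))   # 268435456 bytes
EXPECTED_CPU_MAX="50000 100000"              # quota and period, microseconds

# Hypothetical helper: compare the limit files in a container's cgroup
# directory ($1) against the expected values. Returns non-zero on mismatch.
check_limits() {
    dir="$1"
    status=0
    mem=$(cat "$dir/memory.max" 2>/dev/null) || mem=""
    cpu=$(cat "$dir/cpu.max" 2>/dev/null) || cpu=""
    if [ "$mem" != "$EXPECTED_MEMORY_MAX" ]; then
        echo "memory.max: got '$mem', want '$EXPECTED_MEMORY_MAX'"
        status=1
    fi
    if [ "$cpu" != "$EXPECTED_CPU_MAX" ]; then
        echo "cpu.max: got '$cpu', want '$EXPECTED_CPU_MAX'"
        status=1
    fi
    return "$status"
}
```

On cgroup v1 the equivalent files have different names (for example `memory.limit_in_bytes`), so this sketch applies to v2 hosts only.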
Key metrics to verify:
- CPU usage and throttling statistics
- Memory current usage and limits
- Container identification in cgroup paths
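The metrics above can be spot-checked by hand with two small helpers. This is an illustrative sketch for cgroup v2, where `cpu.stat` is a list of `key value` lines with keys such as `usage_usec`, `nr_throttled`, and `throttled_usec`. The function names are hypothetical, and the ID extraction assumes a full 64-character hexadecimal container ID in the path.

```shell
# Hypothetical helper: print the value of KEY ($2) from a "key value"
# file ($1) such as cpu.stat. Exits non-zero if the key is absent.
stat_field() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# Hypothetical helper: extract the first 64-character hex container ID
# from a cgroup path ($1). Prints nothing if the path contains none.
container_id_from_path() {
    printf '%s\n' "$1" | grep -oE '[0-9a-f]{64}' | head -n 1
}
```

For example, `stat_field "$dir/cpu.stat" nr_throttled` prints how many periods the container was throttled, and `cat "$dir/memory.current"` can be compared against `cat "$dir/memory.max"` for the memory check.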
Local testing workflow:
- Create a local Kubernetes cluster using KIND
- Deploy the agent to the cluster
- Monitor agent logs for successful container discovery
- Verify metrics collection across multiple containers
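The workflow above might look like the following when scripted. This is only a sketch: the function name, cluster name, manifest path, namespace, and label selector are all placeholders, since the agent's actual deployment manifest and labels are not given here.

```shell
# Hypothetical wrapper around the local workflow.
# Arguments (all placeholders supplied by the caller):
#   $1 cluster name, $2 manifest path, $3 namespace, $4 pod label selector
local_test_run() {
    cluster="$1"; manifest="$2"; namespace="$3"; selector="$4"
    kind create cluster --name "$cluster" || return 1
    # kind names the kubeconfig context "kind-<cluster name>".
    kubectl --context "kind-$cluster" apply -f "$manifest" || return 1
    # Print recent agent logs to inspect container discovery messages.
    kubectl --context "kind-$cluster" --namespace "$namespace" \
        logs --selector "$selector" --tail 100
}
```

The logs may be empty until the agent pods have started, so the final command may need to be repeated after a short wait.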
- Container detection
- Metrics collection
- Runtime compatibility
- Missing permissions
- Incomplete cgroup mounts
- Container restarts
- High container churn
- Many containers (100+)
- Rapid container creation/deletion
- Memory pressure scenarios
- Check cgroup mount: `mount | grep cgroup`
- Verify paths: `ls -la /sys/fs/cgroup/`
- Check permissions: `ls -la /sys/fs/cgroup/*/`
- Verify controller enabled: `cat /sys/fs/cgroup/cgroup.controllers`
- Check specific files exist
- Review agent logs for errors
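The controller and file checks can be scripted roughly as follows. This is a sketch for cgroup v2; the function names are hypothetical, and the list of expected files (`cpu.stat`, `memory.current`, `memory.max`) is an assumption about which metrics matter, not the agent's actual file list.

```shell
# Hypothetical helper: succeed if controller NAME ($2) is listed in
# cgroup.controllers under ROOT ($1). The file holds one space-separated
# line, for example "cpu io memory pids".
controller_enabled() {
    for c in $(cat "$1/cgroup.controllers" 2>/dev/null); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}

# Hypothetical helper: print the name of each assumed metric file that
# is missing from a container's cgroup directory ($1).
missing_metric_files() {
    for f in cpu.stat memory.current memory.max; do
        [ -e "$1/$f" ] || echo "$f"
    done
}
```

`missing_metric_files` is meant for a container's cgroup directory rather than the root, because the root cgroup does not expose `memory.current` or `memory.max`.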
- Ensure agent has read access to `/sys`
- Check SecurityContext in deployment
- Verify volume mounts are correct
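A quick way to confirm read access is to run a check like the one below from inside the agent's container, for example through `kubectl exec`. It is a minimal sketch and the function name is hypothetical.

```shell
# Hypothetical helper: print each given path that the current user
# cannot read (missing paths are reported as unreadable too).
unreadable_paths() {
    for p in "$@"; do
        [ -r "$p" ] || echo "$p"
    done
}
```

For example, `unreadable_paths /sys /sys/fs/cgroup` prints nothing when both paths are readable. Any output points to a SecurityContext or volume mount problem.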
- Cgroup version correctly detected
- All running containers discovered
- CPU metrics collected
- Memory metrics collected
- No permission errors in logs
- Metrics update at expected intervals
- Graceful handling of missing files