Troubleshooting - antimetal/system-agent GitHub Wiki
Troubleshooting
This guide helps diagnose and resolve common issues with the Antimetal System Agent.
Quick Diagnostics
Health Check Script
#!/bin/bash
# antimetal-diagnostics.sh
NAMESPACE="antimetal-system"
echo "=== Antimetal Agent Diagnostics ==="
echo
echo "1. Checking Pod Status:"
kubectl get pods -n $NAMESPACE
echo -e "\n2. Checking Recent Events:"
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo -e "\n3. Checking Logs (last 50 lines):"
kubectl logs -n $NAMESPACE -l app=antimetal-agent --tail=50
echo -e "\n4. Checking RBAC:"
kubectl auth can-i --list --as system:serviceaccount:$NAMESPACE:antimetal-agent | grep -E "(pods|nodes|services)" | head -5
echo -e "\n5. Checking Connectivity:"
kubectl exec -n $NAMESPACE deployment/antimetal-agent -- wget -O- -T 5 https://intake.antimetal.com/health 2>/dev/null || echo "Connection failed"
echo -e "\n6. Checking Resource Usage:"
kubectl top pod -n $NAMESPACE 2>/dev/null || echo "Metrics server not available"
echo -e "\n7. Checking Leader Election:"
kubectl get lease -n $NAMESPACE
echo -e "\n=== End Diagnostics ==="
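When the pod list from step 1 is long, a small filter can make the unhealthy pods stand out. The sketch below is illustrative only: `unhealthy_pods` is a hypothetical helper name, not part of the agent, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: print pods that are not Running or not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")                 # READY column, e.g. "0/1"
    if ($3 == "Completed") next           # finished pods are not a problem
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3, $2
  }'
}
```

It could be used as `kubectl get pods -n antimetal-system --no-headers | unhealthy_pods`; empty output means every pod is Running and ready.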
Common Issues
1. Agent Not Starting
Symptoms
- Pod in CrashLoopBackOff or Error state
- Container exits immediately
Diagnosis
# Check pod status
kubectl describe pod -n antimetal-system -l app=antimetal-agent
# Check logs
kubectl logs -n antimetal-system -l app=antimetal-agent --previous
# Check events
kubectl get events -n antimetal-system --field-selector reason=Failed
Common Causes and Solutions
Invalid API Key
Error: intake authentication failed: invalid API key
Solution:
# Update secret with correct API key
kubectl create secret generic antimetal-credentials \
-n antimetal-system \
--from-literal=api-key="YOUR_CORRECT_API_KEY" \
--dry-run=client -o yaml | kubectl apply -f -
# Restart pod
kubectl rollout restart deployment/antimetal-agent -n antimetal-system
Network Connectivity Issues
Error: failed to connect to intake service: dial tcp: i/o timeout
Solution:
# Test connectivity from a throwaway pod (removed on exit)
kubectl run -n antimetal-system test-connection --image=busybox --restart=Never --rm -it -- \
wget -O- https://intake.antimetal.com/health
# Check network policies
kubectl get networkpolicies -n antimetal-system
# Check egress rules (if using network policies)
kubectl describe networkpolicies -n antimetal-system
Insufficient Permissions
Error: pods is forbidden: User "system:serviceaccount:antimetal-system:antimetal-agent" cannot list resource "pods"
Solution:
# Apply RBAC manifests
kubectl apply -f https://raw.githubusercontent.com/antimetal/system-agent/main/config/rbac/role.yaml
# Verify permissions
kubectl auth can-i list pods --as system:serviceaccount:antimetal-system:antimetal-agent
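The three error messages above can be matched mechanically against the agent logs. The following is a minimal sketch under that assumption: `classify_agent_error` and its output labels are hypothetical names chosen for illustration, and the patterns are taken only from the messages quoted in this section.

```shell
# Hypothetical triage helper: map the error messages listed above to a cause.
# Reads agent log text on stdin and prints one label per recognized cause.
classify_agent_error() {
  agent_logs="$(cat)"
  matched=0
  case "$agent_logs" in
    *"invalid API key"*) echo "invalid-api-key"; matched=1 ;;
  esac
  case "$agent_logs" in
    *"i/o timeout"*) echo "network-timeout"; matched=1 ;;
  esac
  case "$agent_logs" in
    *"is forbidden"*) echo "rbac-forbidden"; matched=1 ;;
  esac
  [ "$matched" -eq 1 ] || echo "unknown"
}
```

For example, `kubectl logs -n antimetal-system -l app=antimetal-agent --previous | classify_agent_error` prints the labels that apply to the last crashed container, or `unknown` if none of the known messages appear.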
2. High Memory Usage
Symptoms
- Pod using more memory than expected
- OOMKilled errors
Diagnosis
# Check memory usage
kubectl top pod -n antimetal-system
# Check for memory limits
kubectl get pod -n antimetal-system -l app=antimetal-agent -o yaml | grep -A5 resources:
# Check store size
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
sh -c 'du -sh /var/lib/antimetal/*'
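To judge how close the agent is to being OOMKilled, it helps to express the usage reported by `kubectl top` as a percentage of the configured limit. The helpers below are a sketch, not part of the agent: `to_bytes` and `usage_pct` are hypothetical names, and only whole-number quantities with the binary suffixes Ki, Mi, and Gi (or plain bytes) are handled.

```shell
# Convert a Kubernetes binary quantity (Ki, Mi, Gi, or plain bytes) to bytes.
to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

# Print usage ($1) as an integer percentage of the limit ($2), rounded down.
usage_pct() {
  echo $(( $(to_bytes "$1") * 100 / $(to_bytes "$2") ))
}
```

With a reported usage of 900Mi and a limit of 1Gi, `usage_pct 900Mi 1Gi` prints 87, which suggests the limit leaves little headroom.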
Solutions
Increase Memory Limits
resources:
requests:
memory: 512Mi
limits:
memory: 1Gi
Reduce Cache Size
storage:
resource:
cacheSize: 5000 # Reduce from default 10000
Filter Resources
kubernetes:
namespaces:
exclude:
- kube-system
- kube-public
- testing
resources:
exclude:
- events
- endpoints
3. No Data in Platform
Symptoms
- Agent running but no data visible in Antimetal platform
- Metrics show no data sent
Diagnosis
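# Check intake worker logs (--tail=-1 because a label selector otherwise limits output to the last 10 lines per pod)
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -i intake
# Check metrics (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n antimetal-system deployment/antimetal-agent 8080:8080
curl -s http://localhost:8080/metrics | grep antimetal_intake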
# Check event flow
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
sh -c 'kill -USR1 1' # Dumps goroutine stack traces
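A single look at the metrics endpoint does not show whether batches are still being sent; comparing the counter at two points in time does. The sketch below sums one metric from Prometheus text-format output. `metric_sum` is a hypothetical helper, and it assumes sample lines carry no trailing timestamp, so the value is the last field.

```shell
# Sum every sample of one metric in Prometheus text-format output read from stdin.
# Assumes no trailing timestamps on sample lines (the value is the last field).
metric_sum() {
  awk -v name="$1" '
    /^#/ || NF == 0 { next }          # skip HELP/TYPE comments and blank lines
    {
      metric = $0
      sub(/[{ ].*/, "", metric)       # keep only the metric name
      if (metric == name) total += $NF
    }
    END { print total + 0 }
  '
}
```

With the port-forward running, `curl -s http://localhost:8080/metrics | metric_sum antimetal_intake_batches_sent_total` can be run twice a minute or so apart; an unchanged value suggests nothing is being sent.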
Solutions
Check Batch Configuration
intake:
batchSize: 50 # Reduce for faster sends
batchInterval: "5s" # Send more frequently
Verify API Endpoint
# Test direct connection
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
sh -c 'nc -zv intake.antimetal.com 443'
Enable Debug Logging
operational:
logging:
level: debug
verbosity:
intake: 3
4. Leader Election Issues
Symptoms
- Multiple pods but none active
- Logs show leader election failures
Diagnosis
# Check lease object
kubectl get lease -n antimetal-system antimetal-agent-leader -o yaml
# Check which pod is leader
kubectl get lease -n antimetal-system antimetal-agent-leader \
-o jsonpath='{.spec.holderIdentity}'
# Check for split-brain
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -i "leader"
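A lease that names a holder but is no longer being renewed is a common reason for "multiple pods but none active". The check below is a sketch of that comparison: `lease_is_stale` is a hypothetical helper that works on Unix timestamps, with the inputs expected to come from the `spec.renewTime` and `spec.leaseDurationSeconds` fields of the Lease object.

```shell
# Succeeds (exit 0) when the lease was last renewed more than
# leaseDurationSeconds ago, i.e. no pod is actively holding it.
#   $1 = renew time (Unix seconds)
#   $2 = leaseDurationSeconds
#   $3 = current time (Unix seconds)
lease_is_stale() {
  [ $(( $3 - $1 )) -gt "$2" ]
}
```

Both fields can be read with `-o jsonpath`, in the same way as `holderIdentity` above. Converting `renewTime` to Unix seconds with `date -d "<renewTime>" +%s` assumes GNU date; other platforms need a different conversion.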
Solutions
Force New Leader
# Delete lease to force re-election
kubectl delete lease -n antimetal-system antimetal-agent-leader
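# Or disable leader election for single replica (--reuse-values keeps previously set values such as the API key)
helm upgrade antimetal-agent antimetal/system-agent \
-n antimetal-system \
--reuse-values \
--set leaderElection.enabled=false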
5. Performance Collector Failures
Symptoms
- Missing performance metrics
- Collector errors in logs
Diagnosis
# Check collector status
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -i collector
# Check host mounts
kubectl get pod -n antimetal-system -l app=antimetal-agent -o yaml | grep -A10 volumes:
# Test file access
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
ls -la /host/proc/stat
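The single-file check above can be extended to several paths at once. This is a sketch only: `check_collector_paths` is a hypothetical helper, it assumes the container image provides a `test` binary, and which files each collector actually reads is not specified here, so the paths passed in are up to the operator.

```shell
# Report whether each given path is readable inside the agent container.
# Returns non-zero if any path is missing or unreadable.
check_collector_paths() {
  status=0
  for path in "$@"; do
    if kubectl exec -n antimetal-system deployment/antimetal-agent -- test -r "$path"; then
      echo "ok $path"
    else
      echo "missing $path"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_collector_paths /host/proc/stat /host/sys` reports on the two mounts used in this section.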
Solutions
Fix Host Mounts
volumes:
- name: proc
hostPath:
path: /proc
type: Directory
- name: sys
hostPath:
path: /sys
type: Directory
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
Disable Failing Collectors
performance:
collectors:
- cpu
- memory
# Remove problematic collectors
6. Cloud Provider Detection Failures
Symptoms
- Wrong or "unknown" cluster name
- Missing region information
Diagnosis
# Check detected provider
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -i "provider"
# Force provider detection
kubectl set env deployment/antimetal-agent -n antimetal-system \
CLOUD_PROVIDER=eks
# Check IMDS access (AWS)
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
wget -O- -T 2 http://169.254.169.254/latest/meta-data/instance-id
Solutions
Override Provider
kubernetes:
cloudProvider: eks # or gke, aks, kind
clusterName: "my-cluster"
Fix IAM/Metadata Access
# For EKS with IRSA
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/antimetal-agent
Debug Mode
Enable Comprehensive Debugging
# debug-values.yaml
operational:
logging:
level: debug
format: text
verbosity:
controller: 2
intake: 3
performance: 2
store: 1
# Disable optimizations
leaderElection:
enabled: false
# Reduce intervals for testing
performance:
interval: "10s"
intake:
batchInterval: "5s"
Deploy with debug mode:
helm upgrade antimetal-agent antimetal/system-agent \
-n antimetal-system \
--reuse-values \
-f debug-values.yaml
Interactive Debugging
# Get shell access
kubectl exec -it -n antimetal-system deployment/antimetal-agent -- sh
# Inside the container:
# Check connectivity
wget -O- https://intake.antimetal.com/health
# Check file access
ls -la /host/proc/
cat /host/proc/stat
# Check environment
env | grep ANTIMETAL
# Test API access
wget -O- --header="Authorization: Bearer $ANTIMETAL_INTAKE_API_KEY" \
https://intake.antimetal.com/v1/health
Memory Profiling
# Enable pprof
operational:
pprof:
enabled: true
bindAddress: ":6060"
# Port forward (blocks, so run the pprof commands from a second terminal)
kubectl port-forward -n antimetal-system deployment/antimetal-agent 6060:6060
# Capture heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Capture CPU profile (URL quoted so the shell does not treat ? as a glob)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
Log Analysis
Log Levels
Understanding log levels:
- ERROR: Critical issues requiring attention
- WARN: Issues that may need investigation
- INFO: Normal operational messages
- DEBUG: Detailed diagnostic information
Useful Log Queries
# With a label selector, kubectl logs returns only the last 10 lines per pod unless --tail is set
# Errors only
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep ERROR
# Intake-related logs
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -E "(intake|batch|stream)"
# Resource processing
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -E "(reconcile|resource)"
# Performance collection
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | grep -E "(collector|performance)"
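A per-level count gives a quick picture of how noisy the agent is before digging into individual lines. The sketch below assumes text-format logs in which the level appears as a standalone uppercase word (ERROR, WARN, INFO, DEBUG), as the grep for ERROR above implies. `count_by_level` is a hypothetical helper name.

```shell
# Count log lines per level. Reads log text on stdin and uses the first
# field on each line that is exactly ERROR, WARN, INFO, or DEBUG.
count_by_level() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i == "ERROR" || $i == "WARN" || $i == "INFO" || $i == "DEBUG") { n[$i]++; break }
  }
  END { printf "ERROR=%d WARN=%d INFO=%d DEBUG=%d\n", n["ERROR"], n["WARN"], n["INFO"], n["DEBUG"] }'
}
```

Piping `kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1` into `count_by_level` prints one summary line; lines without a recognizable level are ignored.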
Structured Log Parsing
# Parse JSON logs with jq (-R with fromjson? skips any non-JSON lines)
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | \
jq -R 'fromjson? | select(.level=="error") | {time: .ts, error: .error, component: .component}'
# Count errors by type
kubectl logs -n antimetal-system -l app=antimetal-agent --tail=-1 | \
jq -rR 'fromjson? | select(.level=="error") | .error' | sort | uniq -c
Recovery Procedures
Full Reset
# 1. Delete existing deployment
kubectl delete deployment -n antimetal-system antimetal-agent
# 2. Clean up PVCs if any
kubectl delete pvc -n antimetal-system -l app=antimetal-agent
# 3. Delete and recreate secrets
kubectl delete secret -n antimetal-system antimetal-credentials
kubectl create secret generic antimetal-credentials \
-n antimetal-system \
--from-literal=api-key="YOUR_API_KEY"
# 4. Redeploy (upgrade --install also works if the Helm release still exists)
helm upgrade --install antimetal-agent antimetal/system-agent \
-n antimetal-system \
--set intake.apiKey="YOUR_API_KEY"
Partial Recovery
# Just restart pods
kubectl rollout restart deployment/antimetal-agent -n antimetal-system
# Force new leader
kubectl delete lease -n antimetal-system antimetal-agent-leader
# Clear store cache (if using persistent storage); sh -c makes the glob expand inside the container
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
sh -c 'rm -rf /var/lib/antimetal/store/*'
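After a restart it is worth waiting for the rollout to finish before checking logs or metrics again. The wrapper below is a sketch: `restart_and_wait` is a hypothetical name, and the 120s default timeout is an arbitrary choice rather than a recommended value.

```shell
# Restart the agent and block until the new pods are ready or the timeout expires.
# $1 = optional timeout for `kubectl rollout status` (default 120s, an arbitrary choice).
restart_and_wait() {
  kubectl rollout restart deployment/antimetal-agent -n antimetal-system &&
    kubectl rollout status deployment/antimetal-agent -n antimetal-system --timeout="${1:-120s}"
}
```

A non-zero exit status from `restart_and_wait 60s` means the pods did not become ready in time, which points back to the "Agent Not Starting" checks.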
Monitoring and Alerts
Key Metrics to Monitor
# Agent health
up{job="antimetal-agent"} == 0
# High error rate
rate(antimetal_errors_total[5m]) > 0.1
# No data being sent
rate(antimetal_intake_batches_sent_total[10m]) == 0
# Memory usage
container_memory_usage_bytes{pod=~"antimetal-agent.*"} > 500000000
# CPU throttling
rate(container_cpu_throttled_periods_total{pod=~"antimetal-agent.*"}[5m]) > 0
Example Alerts
groups:
- name: antimetal-agent
rules:
- alert: AntimetalAgentDown
expr: up{job="antimetal-agent"} == 0
for: 5m
annotations:
summary: "Antimetal agent is down"
- alert: AntimetalAgentHighErrorRate
expr: rate(antimetal_errors_total[5m]) > 0.1
for: 10m
annotations:
summary: "High error rate in Antimetal agent"
- alert: AntimetalAgentNoData
expr: rate(antimetal_intake_batches_sent_total[10m]) == 0
for: 15m
annotations:
summary: "No data being sent to Antimetal"
Getting Help
Before Contacting Support
Gather this information:
# Save diagnostic bundle
kubectl cluster-info dump --namespaces antimetal-system --output-directory ./antimetal-diagnostics
# Get agent version
kubectl get deployment -n antimetal-system antimetal-agent -o jsonpath='{.spec.template.spec.containers[0].image}'
# Export recent logs (--tail=-1 so the label selector does not cap output at 10 lines per pod)
kubectl logs -n antimetal-system -l app=antimetal-agent --since=1h --tail=-1 > antimetal-agent.log
# Get configuration
kubectl get configmap -n antimetal-system -o yaml > antimetal-config.yaml
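The files produced by the commands above are easier to attach to a support request as a single archive. The helper below is a sketch: `make_bundle` is a hypothetical name, and it simply skips any listed file or directory that does not exist.

```shell
# Pack whichever diagnostic artifacts exist into one gzip-compressed tar archive.
# $1 = output archive; remaining arguments = files or directories to include.
make_bundle() {
  archive="$1"; shift
  # Rebuild the argument list, keeping only paths that exist.
  for item in "$@"; do
    shift
    [ -e "$item" ] && set -- "$@" "$item"
  done
  [ "$#" -gt 0 ] || { echo "nothing to bundle" >&2; return 1; }
  tar -czf "$archive" "$@"
}
```

For example: `make_bundle antimetal-bundle.tar.gz ./antimetal-diagnostics antimetal-agent.log antimetal-config.yaml`. Review the archive before sharing it, since the cluster dump and ConfigMaps may contain sensitive values.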
Support Channels
- GitHub Issues: github.com/antimetal/system-agent/issues
- Email: [email protected]
- Slack: antimetal-community.slack.com
Include:
- Diagnostic bundle
- Agent version
- Kubernetes version
- Cloud provider
- Error messages
Next Steps
- FAQ - Frequently asked questions
- Configuration Guide - Configuration options
- Performance Monitoring - Understanding collectors
For urgent issues, contact [email protected] with your diagnostic bundle.