
Troubleshooting

This guide helps diagnose and resolve common issues with the Antimetal System Agent.

Quick Diagnostics

Health Check Script

#!/bin/bash
# antimetal-diagnostics.sh

NAMESPACE="antimetal-system"

echo "=== Antimetal Agent Diagnostics ==="
echo

echo "1. Checking Pod Status:"
kubectl get pods -n $NAMESPACE

echo -e "\n2. Checking Recent Events:"
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo -e "\n3. Checking Logs (last 50 lines):"
kubectl logs -n $NAMESPACE -l app=antimetal-agent --tail=50

echo -e "\n4. Checking RBAC:"
kubectl auth can-i --list --as system:serviceaccount:$NAMESPACE:antimetal-agent | grep -E "(pods|nodes|services)" | head -5

echo -e "\n5. Checking Connectivity:"
kubectl exec -n $NAMESPACE deployment/antimetal-agent -- wget -O- -T 5 https://intake.antimetal.com/health 2>/dev/null || echo "Connection failed"

echo -e "\n6. Checking Resource Usage:"
kubectl top pod -n $NAMESPACE 2>/dev/null || echo "Metrics server not available"

echo -e "\n7. Checking Leader Election:"
kubectl get lease -n $NAMESPACE

echo -e "\n=== End Diagnostics ==="
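
Save this as antimetal-diagnostics.sh, make it executable, and run it against the cluster where the agent is installed (adjust NAMESPACE at the top if you deployed to a different namespace). Capturing the output to a file makes it easy to attach to a support request:

chmod +x antimetal-diagnostics.sh
./antimetal-diagnostics.sh | tee antimetal-diagnostics.txt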

Common Issues

1. Agent Not Starting

Symptoms

  • Pod in CrashLoopBackOff or Error state
  • Container exits immediately

Diagnosis

# Check pod status
kubectl describe pod -n antimetal-system -l app=antimetal-agent

# Check logs
kubectl logs -n antimetal-system -l app=antimetal-agent --previous

# Check events
kubectl get events -n antimetal-system --field-selector reason=Failed

Common Causes and Solutions

Invalid API Key

Error: intake authentication failed: invalid API key

Solution:

# Update secret with correct API key
kubectl create secret generic antimetal-credentials \
  -n antimetal-system \
  --from-literal=api-key="YOUR_CORRECT_API_KEY" \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart pod
kubectl rollout restart deployment/antimetal-agent -n antimetal-system

Network Connectivity Issues

Error: failed to connect to intake service: dial tcp: i/o timeout

Solution:

# Test connectivity from pod
kubectl run -n antimetal-system test-connection --image=busybox --rm -it -- \
  wget -O- https://intake.antimetal.com/health

# Check network policies
kubectl get networkpolicies -n antimetal-system

# Check egress rules (if using network policies) - see the example policy below
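
If a default-deny egress policy is in place, the agent needs an explicit allow rule for HTTPS traffic to the intake endpoint, plus DNS. The policy below is a minimal sketch assuming the agent pods carry the app=antimetal-agent label used throughout this guide; adjust selectors and ports to match your environment:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: antimetal-agent-egress
  namespace: antimetal-system
spec:
  podSelector:
    matchLabels:
      app: antimetal-agent
  policyTypes:
    - Egress
  egress:
    # Allow HTTPS to the intake endpoint (destination IPs vary, so allow all destinations on 443)
    - ports:
        - port: 443
          protocol: TCP
    # Allow DNS lookups
    - ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
EOF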

Insufficient Permissions

Error: pods is forbidden: User "system:serviceaccount:antimetal-system:antimetal-agent" cannot list resource "pods"

Solution:

# Apply RBAC manifests
kubectl apply -f https://raw.githubusercontent.com/antimetal/system-agent/main/config/rbac/role.yaml

# Verify permissions
kubectl auth can-i list pods --as system:serviceaccount:antimetal-system:antimetal-agent
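
Listing pods is only part of what the agent needs; to spot-check the other read permissions (the exact set depends on your configuration, so treat this list as illustrative), loop over the core resources it typically watches:

for r in pods nodes services deployments daemonsets statefulsets; do
  echo -n "$r: "
  kubectl auth can-i list "$r" \
    --as system:serviceaccount:antimetal-system:antimetal-agent
done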

2. High Memory Usage

Symptoms

  • Pod using more memory than expected
  • OOMKilled errors

Diagnosis

# Check memory usage
kubectl top pod -n antimetal-system

# Check for memory limits
kubectl get pod -n antimetal-system -l app=antimetal-agent -o yaml | grep -A5 resources:

# Check store size
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  sh -c 'du -sh /var/lib/antimetal/*'

Solutions

Increase Memory Limits

resources:
  requests:
    memory: 512Mi
  limits:
    memory: 1Gi

Reduce Cache Size

storage:
  resource:
    cacheSize: 5000  # Reduce from default 10000

Filter Resources

kubernetes:
  namespaces:
    exclude:
      - kube-system
      - kube-public
      - testing
  resources:
    exclude:
      - events
      - endpoints
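
Assuming the agent's configuration is managed through the Helm chart, as in the examples later in this guide, put the overrides above in a values file (the file name here is just an example), upgrade the release, and restart the deployment so the new settings take effect:

helm upgrade antimetal-agent antimetal/system-agent \
  -n antimetal-system \
  -f memory-tuning-values.yaml

kubectl rollout restart deployment/antimetal-agent -n antimetal-system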

3. No Data in Platform

Symptoms

  • Agent running but no data visible in Antimetal platform
  • Metrics show no data sent

Diagnosis

# Check intake worker logs
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -i intake

# Check metrics
kubectl port-forward -n antimetal-system deployment/antimetal-agent 8080:8080
curl -s http://localhost:8080/metrics | grep antimetal_intake

# Check event flow
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  sh -c 'kill -USR1 1'  # Dumps goroutine stack traces

Solutions

Check Batch Configuration

intake:
  batchSize: 50        # Reduce for faster sends
  batchInterval: "5s"  # Send more frequently

Verify API Endpoint

# Test direct connection
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  sh -c 'nc -zv intake.antimetal.com 443'

Enable Debug Logging

operational:
  logging:
    level: debug
    verbosity:
      intake: 3
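
Once debug logging is enabled, it is also worth confirming on the metrics endpoint that batches are actually leaving the agent; the counter name below is the one used in the alerting rules later in this guide:

kubectl port-forward -n antimetal-system deployment/antimetal-agent 8080:8080 &

# The counter should increase between samples when data is flowing
watch -n 5 "curl -s http://localhost:8080/metrics | grep antimetal_intake_batches_sent_total"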

4. Leader Election Issues

Symptoms

  • Multiple pods but none active
  • Logs show leader election failures

Diagnosis

# Check lease object
kubectl get lease -n antimetal-system antimetal-agent-leader -o yaml

# Check which pod is leader
kubectl get lease -n antimetal-system antimetal-agent-leader \
  -o jsonpath='{.spec.holderIdentity}'

# Check for split-brain
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -i "leader"

Solutions

Force New Leader

# Delete lease to force re-election
kubectl delete lease -n antimetal-system antimetal-agent-leader

# Or disable leader election for single replica
helm upgrade antimetal-agent antimetal/system-agent \
  --set leaderElection.enabled=false
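
After deleting the lease, one of the running replicas should re-acquire it within a few seconds. Confirm that the new holder corresponds to a running pod:

# Watch until a new holder is recorded on the lease
kubectl get lease -n antimetal-system antimetal-agent-leader -w

# The holder identity should match one of these pods
kubectl get pods -n antimetal-system -l app=antimetal-agent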

5. Performance Collector Failures

Symptoms

  • Missing performance metrics
  • Collector errors in logs

Diagnosis

# Check collector status
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -i collector

# Check host mounts
kubectl get pod -n antimetal-system -l app=antimetal-agent -o yaml | grep -A10 volumes:

# Test file access
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  ls -la /host/proc/stat

Solutions

Fix Host Mounts

volumes:
- name: proc
  hostPath:
    path: /proc
    type: Directory
- name: sys
  hostPath:
    path: /sys
    type: Directory

volumeMounts:
- name: proc
  mountPath: /host/proc
  readOnly: true
- name: sys
  mountPath: /host/sys
  readOnly: true

Disable Failing Collectors

performance:
  collectors:
    - cpu
    - memory
    # Remove problematic collectors

6. Cloud Provider Detection Failures

Symptoms

  • Wrong or "unknown" cluster name
  • Missing region information

Diagnosis

# Check detected provider
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -i "provider"

# Override the detected provider (example: set it to EKS)
kubectl set env deployment/antimetal-agent -n antimetal-system \
  CLOUD_PROVIDER=eks

# Check IMDS access (AWS)
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  wget -O- -T 2 http://169.254.169.254/latest/meta-data/instance-id
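
On AWS instances where IMDSv2 is enforced, the plain request above returns a 401; a session token has to be fetched first. A sketch, assuming curl is available in the container image (busybox wget cannot issue the PUT request):

kubectl exec -n antimetal-system deployment/antimetal-agent -- sh -c '
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/instance-id
'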

Solutions

Override Provider

kubernetes:
  cloudProvider: eks  # or gke, aks, kind
  clusterName: "my-cluster"

Fix IAM/Metadata Access

# For EKS with IRSA
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/antimetal-agent
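
The annotation only takes effect for pods created after it is added, so restart the deployment, then verify that the role was picked up. This assumes the chart names the service account antimetal-agent, as in the RBAC examples above:

# The service account should carry the role annotation
kubectl describe sa antimetal-agent -n antimetal-system | grep role-arn

# New pods should have the IRSA variables injected by the EKS webhook
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  sh -c 'env | grep -E "AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE"'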

Debug Mode

Enable Comprehensive Debugging

# debug-values.yaml
operational:
  logging:
    level: debug
    format: text
    verbosity:
      controller: 2
      intake: 3
      performance: 2
      store: 1

# Disable optimizations
leaderElection:
  enabled: false

# Reduce intervals for testing
performance:
  interval: "10s"

intake:
  batchInterval: "5s"

Deploy with debug mode:

helm upgrade antimetal-agent antimetal/system-agent \
  -n antimetal-system \
  -f debug-values.yaml

Interactive Debugging

# Get shell access
kubectl exec -it -n antimetal-system deployment/antimetal-agent -- sh

# Inside the container:
# Check connectivity
wget -O- https://intake.antimetal.com/health

# Check file access
ls -la /host/proc/
cat /host/proc/stat

# Check environment
env | grep ANTIMETAL

# Test API access
wget -O- --header="Authorization: Bearer $ANTIMETAL_INTAKE_API_KEY" \
  https://intake.antimetal.com/v1/health

Memory Profiling

# Enable pprof
operational:
  pprof:
    enabled: true
    bindAddress: ":6060"

# Port forward
kubectl port-forward -n antimetal-system deployment/antimetal-agent 6060:6060

# Capture heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Capture CPU profile
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'
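
The profiles can also be rendered without the interactive prompt; these are standard Go pprof invocations, nothing agent-specific:

# Top consumers of retained heap memory
go tool pprof -top http://localhost:6060/debug/pprof/heap

# Or explore flame graphs in a local web UI
go tool pprof -http=:8081 http://localhost:6060/debug/pprof/heap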

Log Analysis

Log Levels

Understanding log levels:

  • ERROR: Critical issues requiring attention
  • WARN: Issues that may need investigation
  • INFO: Normal operational messages
  • DEBUG: Detailed diagnostic information

Useful Log Queries

# Errors only
kubectl logs -n antimetal-system -l app=antimetal-agent | grep ERROR

# Intake-related logs
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -E "(intake|batch|stream)"

# Resource processing
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -E "(reconcile|resource)"

# Performance collection
kubectl logs -n antimetal-system -l app=antimetal-agent | grep -E "(collector|performance)"

Structured Log Parsing

# Parse JSON logs with jq
kubectl logs -n antimetal-system -l app=antimetal-agent | \
  jq 'select(.level=="error") | {time: .ts, error: .error, component: .component}'

# Count errors by type
kubectl logs -n antimetal-system -l app=antimetal-agent | \
  jq -r 'select(.level=="error") | .error' | sort | uniq -c

Recovery Procedures

Full Reset

# 1. Delete existing deployment
kubectl delete deployment -n antimetal-system antimetal-agent

# 2. Clean up PVCs if any
kubectl delete pvc -n antimetal-system -l app=antimetal-agent

# 3. Delete and recreate secrets
kubectl delete secret -n antimetal-system antimetal-credentials
kubectl create secret generic antimetal-credentials \
  -n antimetal-system \
  --from-literal=api-key="YOUR_API_KEY"

# 4. Redeploy
helm install antimetal-agent antimetal/system-agent \
  -n antimetal-system \
  --set intake.apiKey="YOUR_API_KEY"

Partial Recovery

# Just restart pods
kubectl rollout restart deployment/antimetal-agent -n antimetal-system

# Force new leader
kubectl delete lease -n antimetal-system antimetal-agent-leader

# Clear store cache (if using persistent storage)
kubectl exec -n antimetal-system deployment/antimetal-agent -- \
  sh -c 'rm -rf /var/lib/antimetal/store/*'

Monitoring and Alerts

Key Metrics to Monitor

# Agent health
up{job="antimetal-agent"} == 0

# High error rate
rate(antimetal_errors_total[5m]) > 0.1

# No data being sent
rate(antimetal_intake_batches_sent_total[10m]) == 0

# Memory usage
container_memory_usage_bytes{pod=~"antimetal-agent.*"} > 500000000

# CPU throttling
rate(container_cpu_throttled_periods_total{pod=~"antimetal-agent.*"}[5m]) > 0

Example Alerts

groups:
- name: antimetal-agent
  rules:
  - alert: AntimetalAgentDown
    expr: up{job="antimetal-agent"} == 0
    for: 5m
    annotations:
      summary: "Antimetal agent is down"
      
  - alert: AntimetalAgentHighErrorRate
    expr: rate(antimetal_errors_total[5m]) > 0.1
    for: 10m
    annotations:
      summary: "High error rate in Antimetal agent"
      
  - alert: AntimetalAgentNoData
    expr: rate(antimetal_intake_batches_sent_total[10m]) == 0
    for: 15m
    annotations:
      summary: "No data being sent to Antimetal"

Getting Help

Before Contacting Support

Gather this information:

# Save diagnostic bundle
kubectl cluster-info dump --namespaces antimetal-system --output-directory ./antimetal-diagnostics

# Get agent version
kubectl get deployment -n antimetal-system antimetal-agent -o jsonpath='{.spec.template.spec.containers[0].image}'

# Export recent logs
kubectl logs -n antimetal-system -l app=antimetal-agent --since=1h > antimetal-agent.log

# Get configuration
kubectl get configmap -n antimetal-system -o yaml > antimetal-config.yaml

Support Channels

When you reach out through any support channel, include:

  • Diagnostic bundle
  • Agent version
  • Kubernetes version
  • Cloud provider
  • Error messages

Next Steps

For urgent issues, contact [email protected] with your diagnostic bundle.