Integration Guide

⚠️ Work in Progress: This documentation is currently being developed and may be incomplete or subject to change.

Overview

This guide explains how to integrate the Antimetal System Agent with your infrastructure, monitoring systems, and workflows. It covers API integration, webhook configuration, and automation scenarios.

Integration Architecture

graph TD
    A[System Agent] --> B[Antimetal API]
    B --> C[Your Systems]
    A --> D[Metrics Export]
    B --> D
    C --> D

API Integration

Accessing Collected Data

The Antimetal API provides programmatic access to data collected by the System Agent.

Authentication

# Using API key
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.antimetal.com/v1/clusters/my-cluster/metrics

API Endpoints

Get Cluster Metrics

GET /v1/clusters/{cluster_name}/metrics

Response:

{
  "cluster_name": "production-eks",
  "timestamp": "2024-01-15T10:30:00Z",
  "metrics": {
    "cpu": {
      "usage_cores": 45.2,
      "capacity_cores": 100,
      "utilization_percent": 45.2
    },
    "memory": {
      "usage_bytes": 68719476736,
      "capacity_bytes": 137438953472,
      "utilization_percent": 50.0
    }
  }
}

Get Resource Inventory

GET /v1/clusters/{cluster_name}/resources

Get Cost Analysis

GET /v1/clusters/{cluster_name}/costs
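
These endpoints can also be called directly over HTTPS. Below is a minimal sketch using the Python requests library; the response field names parsed here (resources, monthly_total) are illustrative assumptions, not a documented schema.

import os
import requests

API_BASE = "https://api.antimetal.com/v1"
API_KEY = os.environ["ANTIMETAL_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def get_json(path):
    # Thin helper around the REST endpoints shown above
    resp = requests.get(f"{API_BASE}{path}", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

inventory = get_json("/clusters/production-eks/resources")
costs = get_json("/clusters/production-eks/costs")

# Field names below are assumptions for illustration only
print(f"Resources tracked: {len(inventory.get('resources', []))}")
print(f"Estimated monthly cost: {costs.get('monthly_total', 'n/a')}")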

Client Libraries

Python Client

from antimetal import Client

client = Client(api_key="YOUR_API_KEY")

# Get cluster metrics
metrics = client.get_metrics("production-eks")
print(f"CPU Usage: {metrics.cpu.usage_cores} cores")

# Get resource recommendations
recommendations = client.get_recommendations("production-eks")
for rec in recommendations:
    print(f"{rec.resource}: {rec.action}")

Go Client

import "github.com/antimetal/go-client"

client := antimetal.NewClient("YOUR_API_KEY")

// Get cluster metrics
metrics, err := client.GetMetrics(ctx, "production-eks")
if err != nil {
    log.Fatal(err)
}

fmt.Printf("CPU Usage: %.2f cores\n", metrics.CPU.UsageCores)

Node.js Client

const { AntimetalClient } = require('@antimetal/client');

const client = new AntimetalClient({ apiKey: 'YOUR_API_KEY' });

// Get cluster metrics (await needs an async context in CommonJS)
async function main() {
  const metrics = await client.getMetrics('production-eks');
  console.log(`CPU Usage: ${metrics.cpu.usageCores} cores`);
}

main();

Webhook Integration

Configure webhooks to receive real-time notifications about your infrastructure.

Webhook Configuration

# In Antimetal platform
webhooks:
  - name: cost-alerts
    url: https://your-domain.com/webhooks/antimetal
    secret: YOUR_WEBHOOK_SECRET
    events:
      - cost.anomaly.detected
      - resource.optimization.available
      - cluster.health.degraded

Webhook Handler Example

from flask import Flask, request
import hmac
import hashlib

app = Flask(__name__)
WEBHOOK_SECRET = "YOUR_WEBHOOK_SECRET"

@app.route('/webhooks/antimetal', methods=['POST'])
def handle_webhook():
    # Verify signature
    signature = request.headers.get('X-Antimetal-Signature', '')
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        request.data,
        hashlib.sha256
    ).hexdigest()
    
    if not hmac.compare_digest(signature, expected):
        return "Unauthorized", 401
    
    # Process event
    event = request.json
    if event['type'] == 'cost.anomaly.detected':
        send_alert_to_slack(
            f"Cost anomaly detected in {event['cluster']}: "
            f"${event['data']['amount']} over budget"
        )
    
    return "OK", 200

Webhook Events

| Event Type | Description | Payload |
| --- | --- | --- |
| cost.anomaly.detected | Unusual cost spike | Cost details, threshold |
| resource.optimization.available | Optimization found | Resource, savings estimate |
| cluster.health.degraded | Cluster issues | Health status, affected resources |
| quota.limit.approaching | Near quota limits | Resource type, current usage |
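
A minimal sketch of routing these events inside the Flask handler above, assuming the payload carries the type, cluster, and data fields shown earlier; send_alert_to_slack and page_oncall are placeholder helpers, not part of any published SDK.

# Hypothetical event router for the handler above; payload fields other than
# 'type', 'cluster', and 'data' are assumptions for illustration.
def process_event(event):
    event_type = event.get("type")
    data = event.get("data", {})

    if event_type == "cost.anomaly.detected":
        send_alert_to_slack(f"Cost anomaly in {event['cluster']}: {data}")
    elif event_type == "resource.optimization.available":
        send_alert_to_slack(f"Optimization available: {data}")
    elif event_type == "cluster.health.degraded":
        page_oncall(f"Cluster health degraded: {event['cluster']}")
    elif event_type == "quota.limit.approaching":
        send_alert_to_slack(f"Quota limit approaching: {data}")
    else:
        # Unknown events are logged and acknowledged so the platform stops retrying
        print(f"Unhandled event type: {event_type}")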

Monitoring Integration

Prometheus Metrics

The System Agent exposes Prometheus metrics for monitoring its own health:

# Prometheus scrape config
scrape_configs:
  - job_name: 'antimetal-agent'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: antimetal-agent

Available Metrics

# Agent health
antimetal_agent_up{cluster="production"} 1
antimetal_agent_collection_errors_total{collector="cpu"} 0

# Collection metrics
antimetal_agent_collections_total{collector="cpu",status="success"} 1234
antimetal_agent_collection_duration_seconds{collector="cpu"} 0.023

# Data transmission
antimetal_agent_bytes_sent_total{cluster="production"} 12345678
antimetal_agent_send_errors_total{cluster="production"} 0

Grafana Dashboards

Import our pre-built dashboards:

{
  "dashboard": {
    "title": "Antimetal System Agent",
    "panels": [
      {
        "title": "Collection Success Rate",
        "targets": [{
          "expr": "rate(antimetal_agent_collections_total[5m])"
        }]
      }
    ]
  }
}

Alerting Rules

groups:
  - name: antimetal_agent
    rules:
      - alert: AntimetalAgentDown
        expr: up{job="antimetal-agent"} == 0
        for: 5m
        annotations:
          summary: "Antimetal agent is down"
      
      - alert: AntimetalCollectionFailures
        expr: rate(antimetal_agent_collection_errors_total[5m]) > 0.1
        annotations:
          summary: "High collection error rate"

CI/CD Integration

GitHub Actions

name: Cost Analysis
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  cost-impact:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Analyze Cost Impact
        id: cost
        uses: antimetal/cost-analysis-action@v1
        with:
          api-key: ${{ secrets.ANTIMETAL_API_KEY }}
          cluster: production-eks
      
      - name: Comment PR
        uses: actions/github-script@v6
        with:
          script: |
            const impact = ${{ steps.cost.outputs.impact }};
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              body: `💰 Estimated cost impact: $${impact}/month`
            });

GitLab CI

cost_analysis:
  stage: test
  image: antimetal/cli:latest
  script:
    - antimetal analyze --cluster=$CLUSTER_NAME
  only:
    - merge_requests

Jenkins Pipeline

pipeline {
    agent any
    stages {
        stage('Cost Analysis') {
            steps {
                sh '''
                    antimetal analyze \
                        --cluster=${CLUSTER_NAME} \
                        --format=json > cost-report.json
                '''
                
                publishHTML([
                    reportDir: '.',
                    reportFiles: 'cost-report.json',
                    reportName: 'Cost Analysis'
                ])
            }
        }
    }
}

Automation Examples

Auto-scaling Integration

# Scale based on Antimetal recommendations
import kubernetes
from antimetal import Client

antimetal = Client(api_key="YOUR_API_KEY")

# Load cluster credentials (use load_incluster_config() when running inside the cluster)
kubernetes.config.load_kube_config()
k8s = kubernetes.client.AppsV1Api()

def auto_scale():
    recommendations = antimetal.get_recommendations("production")
    
    for rec in recommendations:
        if rec.type == "scale" and rec.confidence > 0.8:
            k8s.patch_namespaced_deployment_scale(
                name=rec.resource_name,
                namespace=rec.namespace,
                body={"spec": {"replicas": rec.target_replicas}}
            )
            
            print(f"Scaled {rec.resource_name} to {rec.target_replicas}")

Cost Alerting

#!/bin/bash
# Daily cost alert script

COST=$(curl -s -H "Authorization: Bearer $API_KEY" \
  https://api.antimetal.com/v1/clusters/production/costs/daily | \
  jq -r '.total')

if (( $(echo "$COST > 1000" | bc -l) )); then
  slack-notify "⚠️ Daily cost exceeded \$1000: \$${COST}"
fi

Resource Cleanup

# Clean up unused resources based on Antimetal data
# (reuses the antimetal and kubernetes clients from the auto-scaling example)
core_v1 = kubernetes.client.CoreV1Api()

def cleanup_unused_resources():
    unused = antimetal.get_unused_resources("production")

    for resource in unused:
        if resource.type == "PersistentVolume" and resource.unused_days > 30:
            if confirm_deletion(resource):
                core_v1.delete_persistent_volume(resource.name)
                log_deletion(resource)

Platform Integrations

Slack Integration

# Slack bot for cost queries
@slack_command("/antimetal cost")
def handle_cost_command(command):
    cluster = command.text or "production"
    
    metrics = antimetal.get_cost_summary(cluster)
    
    return {
        "text": f"💰 {cluster} costs",
        "attachments": [{
            "fields": [
                {"title": "Daily", "value": f"${metrics.daily}"},
                {"title": "Monthly", "value": f"${metrics.monthly}"},
                {"title": "Trend", "value": metrics.trend}
            ]
        }]
    }

PagerDuty Integration

# PagerDuty integration key
integrations:
  pagerduty:
    routing_key: YOUR_ROUTING_KEY
    triggers:
      - event: cost.spike
        severity: warning
        threshold: 150  # 150% of normal
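
If you would rather trigger PagerDuty incidents from your own webhook handler instead of the platform-side integration, the routing key can be used with PagerDuty's Events API v2. The sketch below forwards a cost-spike event; the shape of the incoming Antimetal event is assumed from the Webhook Events table above.

import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_ROUTING_KEY"

def trigger_cost_spike_incident(event):
    # 'event' is an Antimetal webhook payload; field access here is illustrative
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"Cost spike in {event.get('cluster', 'unknown')}",
            "source": "antimetal-system-agent",
            "severity": "warning",
            "custom_details": event.get("data", {}),
        },
    }
    resp = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
    resp.raise_for_status()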

Datadog Integration

# Send Antimetal metrics to Datadog
from datadog import statsd

def sync_metrics():
    metrics = antimetal.get_metrics("production")
    
    statsd.gauge('antimetal.cpu.usage', metrics.cpu.usage_cores)
    statsd.gauge('antimetal.memory.usage', metrics.memory.usage_bytes)
    statsd.gauge('antimetal.cost.hourly', metrics.cost.hourly)

Best Practices

API Usage

  1. Rate Limiting: Respect the API rate limit (1000 requests/hour)
  2. Caching: Cache responses where freshness requirements allow
  3. Error Handling: Retry transient failures with exponential backoff (see the sketch after this list)
  4. Pagination: Use pagination when fetching large datasets
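
A minimal sketch of backoff and pagination, assuming the API signals rate limiting with HTTP 429 and pages results with a page parameter and next_page field (both assumptions; adjust to the actual API contract):

import time
import requests

def get_with_backoff(url, headers, params=None, max_retries=5):
    # Retry rate-limited (429) and transient server errors with exponential backoff
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        if resp.status_code not in (429, 500, 502, 503):
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

def iter_resources(cluster, headers):
    # Pagination scheme ('page' parameter, 'next_page' field) is an assumption
    page = 1
    while page:
        body = get_with_backoff(
            f"https://api.antimetal.com/v1/clusters/{cluster}/resources",
            headers, params={"page": page},
        )
        yield from body.get("resources", [])
        page = body.get("next_page")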

Security

  1. API Keys: Store securely, rotate regularly
  2. Webhooks: Validate signatures
  3. Network: Use HTTPS only
  4. Access Control: Limit API key permissions

Monitoring

  1. Health Checks: Monitor both agent and API health (see the sketch after this list)
  2. Metrics: Track API usage and error rates
  3. Alerts: Set up alerts for critical failures
  4. Logging: Log all API interactions
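
A small health-check sketch built on the /v1/health endpoint shown in the Troubleshooting section below; only the HTTP status code is checked, since the response body format is not documented here:

import os
import requests

def check_api_health():
    # Calls the /v1/health endpoint and treats anything other than HTTP 200 as unhealthy
    headers = {"Authorization": f"Bearer {os.environ['ANTIMETAL_API_KEY']}"}
    try:
        resp = requests.get("https://api.antimetal.com/v1/health",
                            headers=headers, timeout=10)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if not check_api_health():
    print("Antimetal API health check failed")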

Troubleshooting

Common Issues

API Connection Errors

# Test API connectivity
curl -H "Authorization: Bearer $API_KEY" \
  https://api.antimetal.com/v1/health

Webhook Delivery Failures

  • Check that the webhook URL is reachable from the public internet
  • Verify signature validation (see the sketch after this list)
  • Review webhook delivery logs in the platform
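
To verify signature validation locally, you can sign a sample payload the same way the Flask handler above does and post it to your endpoint; the payload body here is a made-up example.

import hmac
import hashlib
import json
import requests

WEBHOOK_SECRET = "YOUR_WEBHOOK_SECRET"

# Sample body; the exact payload schema is an assumption for testing purposes
body = json.dumps({"type": "cost.anomaly.detected",
                   "cluster": "production-eks",
                   "data": {"amount": 123}}).encode()

# Same HMAC-SHA256 scheme as the handler in the Webhook Handler Example
signature = hmac.new(WEBHOOK_SECRET.encode(), body, hashlib.sha256).hexdigest()

resp = requests.post(
    "http://localhost:5000/webhooks/antimetal",
    data=body,
    headers={"Content-Type": "application/json",
             "X-Antimetal-Signature": signature},
)
print(resp.status_code, resp.text)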

Data Lag

  • Check agent health
  • Verify network connectivity
  • Review agent logs

See Also