Traffic & Monitoring - eunki-7/llm-rdma-mlops-lab GitHub Wiki

Traffic & Monitoring

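The stack combines HAProxy (traffic), Prometheus with Alertmanager (metrics and alerting), Grafana (dashboards), and the DCGM exporter (GPU telemetry) to monitor system and model health.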


▶️ Start Monitoring Stack

cd 60-traffic-monitoring
docker compose up -d

🌐 Access

  • Grafana: http://<router>:3000 (default login: admin/admin)
  • Prometheus: http://<router>:9090
  • Alertmanager: http://<router>:9093
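
Once the stack is up, a quick reachability probe can confirm the three UIs respond. The sketch below is only an illustration: `ROUTER`, `health_url`, and `check_all` are hypothetical names, and it assumes the services expose their usual health endpoints (`/api/health` for Grafana, `/-/healthy` for Prometheus and Alertmanager) on the ports listed above.

```shell
#!/bin/sh
# Hypothetical helper; ROUTER stands in for the <router> host listed above.
ROUTER="${ROUTER:-localhost}"

# Print the health endpoint for a given service name.
health_url() {
  case "$1" in
    grafana)      echo "http://${ROUTER}:3000/api/health" ;;
    prometheus)   echo "http://${ROUTER}:9090/-/healthy" ;;
    alertmanager) echo "http://${ROUTER}:9093/-/healthy" ;;
    *)            return 1 ;;
  esac
}

# Probe each service once and report whether it answered with a 2xx status.
check_all() {
  for svc in grafana prometheus alertmanager; do
    if curl -fsS --max-time 5 "$(health_url "$svc")" >/dev/null 2>&1; then
      echo "$svc: ok"
    else
      echo "$svc: unreachable"
    fi
  done
}

# Only probe when asked to, so the file can also be sourced for its functions.
if [ "${1:-}" = "--run" ]; then
  check_all
fi
```

Saved under any file name, the script can be run with `--run` and `ROUTER` set to the router host; without `--run` it only defines the functions.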

📊 Metrics

  • GPU: DCGM exporter (utilization, memory, power)
  • CPU/Memory: node_exporter
  • Containers: cAdvisor
  • Traffic: HAProxy exporter
  • Model: vLLM /metrics
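
To check that each source is actually being scraped, one representative series per source can be queried through the Prometheus HTTP API (`/api/v1/query`). This is a rough sketch under assumptions: the metric names are the ones these exporters commonly publish and may differ by version or configuration, and `PROM_URL`, `sample_queries`, `prom_query`, and `run_samples` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical default; point PROM_URL at http://<router>:9090 in practice.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# One "source expression" pair per line. Metric names are assumptions
# based on common exporter defaults, not taken from this repository.
sample_queries() {
  cat <<'EOF'
gpu DCGM_FI_DEV_GPU_UTIL
node node_memory_MemAvailable_bytes
containers container_memory_usage_bytes
traffic haproxy_frontend_http_requests_total
model vllm:num_requests_running
EOF
}

# Run an instant query against Prometheus and print the raw JSON response.
prom_query() {
  curl -fsS -G --max-time 5 "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Query every sample expression and label the output by source.
run_samples() {
  sample_queries | while read -r source expr; do
    printf '== %s: %s\n' "$source" "$expr"
    prom_query "$expr" || echo "query failed"
    echo
  done
}

# Only hit the network when explicitly requested.
if [ "${1:-}" = "--run" ]; then
  run_samples
fi
```

An empty `result` array in a response usually means the corresponding target is not being scraped or uses a different metric name.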

🚨 Alerts

  • GPU utilization > 95% for 3 minutes
  • vLLM p95 latency > 1.5 s
  • Available disk space < 10%
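
The thresholds above could be expressed as Prometheus alerting rules roughly as follows. This is a sketch, not the repository's actual rule file: the alert names, metric names, the 5-minute rate window, and the `RULES_FILE` path are all assumptions. Only the thresholds and the 3-minute GPU hold time come from the list above. The script writes a rules file and validates it with `promtool check rules` when `promtool` is installed.

```shell
#!/bin/sh
# Hypothetical output path for the sketched rules file.
RULES_FILE="${RULES_FILE:-${TMPDIR:-/tmp}/alerts.sketch.yml}"

# Write the three alert rules. Names and metrics are illustrative assumptions.
write_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: lab-alerts-sketch
    rules:
      # GPU utilization above 95% sustained for 3 minutes
      - alert: GpuUtilizationHigh
        expr: DCGM_FI_DEV_GPU_UTIL > 95
        for: 3m
      # vLLM p95 end-to-end latency above 1.5 s (5m rate window is assumed)
      - alert: VllmLatencyP95High
        expr: histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))) > 1.5
      # Less than 10% of filesystem space available
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
EOF
}

write_rules "$RULES_FILE"

# Validate the syntax if promtool is on the PATH; otherwise just report the path.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_FILE"
else
  echo "promtool not found; wrote $RULES_FILE without validation"
fi
```

In practice the disk rule would likely need a label filter to skip pseudo-filesystems such as tmpfs; that is left out here to keep the sketch short.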

🖼️ Monitoring Diagram
