Traffic & Monitoring - eunki-7/llm-rdma-mlops-lab GitHub Wiki
We use HAProxy + Prometheus + Grafana + DCGM to monitor system and model health.
Start the monitoring stack:
cd 60-traffic-monitoring
docker compose up -d
- Grafana: http://<router>:3000 (default: admin/admin)
- Prometheus: http://<router>:9090
- Alertmanager: http://<router>:9093
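A quick way to confirm the three UIs came up is to poll their health endpoints. The sketch below assumes the ports listed above and the stock health paths (/api/health for Grafana, /-/healthy for Prometheus and Alertmanager). The function names health_urls and check_health are illustrative and not part of the repository.

```shell
# Health-check sketch; health_urls and check_health are hypothetical helpers.

# Print one health URL per service for the given host.
health_urls() {
  host="$1"
  echo "http://${host}:3000/api/health"   # Grafana
  echo "http://${host}:9090/-/healthy"    # Prometheus
  echo "http://${host}:9093/-/healthy"    # Alertmanager
}

# Probe each URL, print OK or FAIL per service,
# and return non-zero if any probe fails.
check_health() {
  rc=0
  for url in $(health_urls "$1"); do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}
```

Run check_health with the router's hostname once docker compose up -d has returned; a non-zero exit status means at least one service did not answer within five seconds.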
Metrics sources:
- GPU: DCGM exporter (utilization, memory, power)
- CPU/Memory: node_exporter
- Containers: cAdvisor
- Traffic: HAProxy exporter
- Model: vLLM /metrics
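For orientation, the sources above could be wired into Prometheus roughly as follows. This is a sketch, not the repository's actual configuration: the hostnames gpu-node and router and the output file name are hypothetical, and the ports are each component's upstream default (9400 for DCGM exporter, 9100 for node_exporter, 8080 for cAdvisor, 9101 for the standalone HAProxy exporter, 8000 for the vLLM server). Adjust them to match the compose file.

```shell
# Write an illustrative scrape config.
# Hostnames and the file name are assumptions; ports are upstream defaults.
cat > prometheus.scrape.example.yml <<'EOF'
scrape_configs:
  - job_name: dcgm            # GPU utilization, memory, power
    static_configs:
      - targets: ['gpu-node:9400']
  - job_name: node            # CPU and memory via node_exporter
    static_configs:
      - targets: ['gpu-node:9100']
  - job_name: cadvisor        # per-container metrics
    static_configs:
      - targets: ['gpu-node:8080']
  - job_name: haproxy         # traffic via the HAProxy exporter
    static_configs:
      - targets: ['router:9101']
  - job_name: vllm            # model metrics served at /metrics
    metrics_path: /metrics
    static_configs:
      - targets: ['gpu-node:8000']
EOF
```

If promtool is installed, running promtool check config on a complete configuration file that includes these jobs should catch syntax mistakes before Prometheus is restarted.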
Alerts fire on:
- GPU utilization > 95% for 3 minutes
- vLLM p95 latency > 1.5 s
- Available disk space < 10%
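The three thresholds above might be expressed as Prometheus alerting rules along these lines. This is a sketch under stated assumptions: the rule names, group name and file name are hypothetical; "latency" is taken to mean vLLM's end-to-end request latency histogram; and the 5-minute windows on the latency and disk rules are placeholders, since only the GPU rule has a stated duration.

```shell
# Write illustrative alerting rules.
# Rule names, group name, file name and the 5m windows are assumptions.
cat > alerts.example.yml <<'EOF'
groups:
  - name: lab-alerts
    rules:
      - alert: GpuUtilizationHigh
        # DCGM exporter reports GPU utilization as a percentage.
        expr: DCGM_FI_DEV_GPU_UTIL > 95
        for: 3m
      - alert: VllmLatencyP95High
        # Assumes end-to-end request latency; p95 over a 5m rate window.
        expr: histogram_quantile(0.95, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m]))) > 1.5
        for: 5m
      - alert: DiskSpaceLow
        # Fraction of filesystem space still available, from node_exporter.
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 5m
EOF
```

promtool check rules alerts.example.yml validates the file syntax. In practice the disk rule usually needs a label filter to exclude tmpfs and overlay mounts.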
