System Architecture Overview - datnguyendv/monitoring_tools GitHub Wiki
This page outlines the architecture of our monitoring platform, which is built from modular, scalable, cloud-native components. The system collects, stores, visualizes, and alerts on telemetry data from distributed services running in one or more clusters.
🔧 Core Components
Component | Role |
---|---|
Prometheus | Time-series data collector; scrapes (pulls) metrics from targets and can forward them to long-term storage via remote_write. |
Grafana | Dashboard and visualization layer for metrics and alert inspection. |
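Alertmanager | Receives alerts fired by Prometheus alerting rules; deduplicates them and routes notifications. |
VictoriaMetrics / Thanos | Long-term storage backends; VictoriaMetrics targets high-throughput writes and reads, Thanos targets object-storage-backed, highly available setups. |
vmagent / node_exporter / custom exporters | Exporters expose metrics from services and nodes; vmagent scrapes them and forwards them to storage. |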
Consul | Service discovery for dynamic target registration. |
🧱 Architecture Layers
1. Data Collection Layer
- Exporters (e.g., node_exporter, redis_exporter, blackbox_exporter)
- vmagent / Prometheus scrape targets
- Dynamic target discovery via Consul or static configs
2. Data Ingestion Layer
- Prometheus scrapes metrics
- Optional: Prometheus pushes samples via remote_write to VictoriaMetrics or Thanos Receive
- vmagent acts as a scrape-and-forward gateway in large-scale setups
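The collection and ingestion steps above could be wired together in a Prometheus configuration along the following lines. This is a minimal sketch that builds the configuration as a Python dictionary and prints it as JSON for inspection; the same keys would appear in `prometheus.yml`. The host names, service names, job name, and 30s scrape interval are assumptions made for the example, not values taken from this repository.

```python
import json


def build_prometheus_config(consul_addr, remote_write_url, services):
    """Build a minimal Prometheus config: Consul discovery plus remote_write."""
    return {
        "global": {"scrape_interval": "30s"},  # assumed interval
        "scrape_configs": [
            {
                "job_name": "consul-services",  # hypothetical job name
                # Targets are discovered from services registered in Consul.
                "consul_sd_configs": [{"server": consul_addr, "services": services}],
                # Copy the Consul service name into a regular "service" label.
                "relabel_configs": [
                    {
                        "source_labels": ["__meta_consul_service"],
                        "target_label": "service",
                    }
                ],
            }
        ],
        # Optional forwarding of scraped samples to long-term storage.
        "remote_write": [{"url": remote_write_url}],
    }


config = build_prometheus_config(
    consul_addr="consul.example.internal:8500",  # hypothetical address
    remote_write_url="http://vminsert.example.internal:8480/insert/0/prometheus/api/v1/write",
    services=["node-exporter", "redis-exporter"],  # hypothetical service names
)
print(json.dumps(config, indent=2))
```

The remote_write URL shown follows the VictoriaMetrics cluster (vminsert) path layout with tenant `0`; a Thanos Receive endpoint would use its own `/api/v1/receive` path instead.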
3. Storage Layer
- Short-term: Prometheus TSDB
- Long-term:
  - VictoriaMetrics Cluster (for high-performance write/read at scale)
  - Thanos with S3/GCS/Object Storage (for federated, HA setups)
4. Query & Visualization Layer
- Grafana dashboards
- Thanos Query (aggregates from sidecars or store APIs)
- VMSelect (VictoriaMetrics cluster query component)
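Both Thanos Query and VMSelect expose a Prometheus-compatible HTTP query API, which is what lets Grafana treat either one as a Prometheus data source. The sketch below only builds the instant-query URLs; the host names are hypothetical, and the VMSelect path prefix assumes tenant `0`.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, path_prefix=""):
    """Build a Prometheus-compatible instant query URL for a query backend."""
    query = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}{path_prefix}/api/v1/query?{query}"


# Thanos Query serves the Prometheus API at the root of its HTTP address.
thanos_url = instant_query_url("http://thanos-query.example.internal:10902/", "up")

# VMSelect serves it under a per-tenant prefix (tenant 0 assumed here).
vmselect_url = instant_query_url(
    "http://vmselect.example.internal:8481", "up", path_prefix="/select/0/prometheus"
)

print(thanos_url)
print(vmselect_url)
```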
5. Alerting Layer
- Prometheus alert rules
- Alertmanager for routing & deduplication
- Notification channels: Slack, Telegram, Email, Opsgenie, etc.
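To make "routing & deduplication" more concrete, here is a deliberately simplified model of how alerts with identical label sets collapse into one and how the remaining alerts are grouped before notification. It is not Alertmanager's actual implementation; the alert names, the labels, and the choice of `alertname` and `cluster` as grouping labels are assumptions for the example.

```python
from collections import defaultdict


def fingerprint(alert):
    """Identify an alert by its full label set."""
    return tuple(sorted(alert["labels"].items()))


def group_alerts(alerts, group_by):
    """Group alerts by selected labels and drop duplicates within each group."""
    groups = defaultdict(dict)
    for alert in alerts:
        key = tuple((label, alert["labels"].get(label, "")) for label in group_by)
        # An alert with the same label set overwrites the earlier copy (dedup).
        groups[key][fingerprint(alert)] = alert
    return {key: list(by_fp.values()) for key, by_fp in groups.items()}


# Hypothetical alerts; the first two are the same alert sent by two HA replicas.
alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "prod-a", "instance": "node1"}},
    {"labels": {"alertname": "HighCPU", "cluster": "prod-a", "instance": "node1"}},
    {"labels": {"alertname": "HighCPU", "cluster": "prod-a", "instance": "node2"}},
    {"labels": {"alertname": "DiskFull", "cluster": "prod-b", "instance": "node9"}},
]

groups = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in groups.items():
    print(dict(key), "->", len(members), "alert(s)")
```

Each resulting group would correspond to one notification sent to a channel such as Slack or Email.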
🌐 Multi-Cluster / Multi-Region Support
We support scalable observability across multiple Kubernetes clusters or VM regions:
- A centralized Thanos Query aggregates metrics from different clusters
- Cross-cluster alerting is handled via a unified Alertmanager
- vmagent or Prometheus runs in each cluster for local collection
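The central aggregation step can be pictured with the simplified model below: series arriving from several clusters are merged, and copies that differ only in a replica label are collapsed. This is a sketch of the idea, not Thanos's real deduplication algorithm. The `cluster` and `replica` label names are assumptions; in Thanos the replica label is whatever is passed to the querier's `--query.replica-label` flag.

```python
def merge_series(series, replica_label="replica"):
    """Merge series from many clusters, keeping one copy per replica set."""
    merged = {}
    for sample in series:
        # Identity is the label set without the replica label.
        labels = {k: v for k, v in sample["labels"].items() if k != replica_label}
        identity = tuple(sorted(labels.items()))
        if identity not in merged:  # keep the first replica seen
            merged[identity] = {"labels": labels, "value": sample["value"]}
    return list(merged.values())


# Hypothetical results returned by per-cluster collectors.
incoming = [
    {"labels": {"__name__": "up", "cluster": "eu", "replica": "a"}, "value": 1},
    {"labels": {"__name__": "up", "cluster": "eu", "replica": "b"}, "value": 1},
    {"labels": {"__name__": "up", "cluster": "us", "replica": "a"}, "value": 1},
]

result = merge_series(incoming)
print(sorted(s["labels"]["cluster"] for s in result))
```

Giving every cluster a distinct external label is what keeps otherwise identical series from different clusters apart after the merge.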
🔐 Security & Access Control
- Grafana: Role-Based Access Control (RBAC), data source restrictions
- Prometheus/Thanos endpoints: mTLS, basic auth, IP allowlists
- Object storage access via IAM / service accounts
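From a client's point of view, the endpoint protections listed above might look like the following sketch, which uses only the Python standard library. The URL, user name, password, and certificate paths are placeholders, and which mechanism applies depends on how the reverse proxy or component in front of Prometheus/Thanos is configured.

```python
import base64
import ssl
import urllib.request


def basic_auth_request(url, username, password):
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def mtls_context(ca_file, cert_file, key_file):
    """Build a TLS context that verifies the server and presents a client cert."""
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


# Placeholder endpoint and credentials, for illustration only.
request = basic_auth_request(
    "https://prometheus.example.internal/api/v1/query?query=up",
    "reader",
    "example-password",
)
print(request.get_header("Authorization").split()[0])

# With real certificate files, the request could then be sent as:
# urllib.request.urlopen(request, context=mtls_context("ca.pem", "client.pem", "client.key"))
```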