System Architecture Overview - datnguyendv/monitoring_tools GitHub Wiki

This page outlines the architecture of our monitoring platform, which is built using modular, scalable, and cloud-native components. The system is designed to collect, store, visualize, and alert on telemetry data from distributed services across single or multiple clusters.

🔧 Core Components

Component	Role
Prometheus	Scrapes (pulls) metrics from targets, stores them in its local TSDB, evaluates alert rules, and can optionally forward samples to long-term storage via remote_write.
Grafana	Dashboard and visualization layer for metrics and alert inspection.
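Alertmanager	Receives alerts fired by Prometheus alert rules; groups, deduplicates, and routes them to notification channels.
VictoriaMetrics / Thanos	Long-term storage backends that retain metrics beyond Prometheus's local retention and scale horizontally.
vmagent / node_exporter / custom exporters	Exporters expose metrics from nodes and services; vmagent scrapes them and forwards the samples to storage.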
Consul	Service discovery for dynamic target registration.

🧱 Architecture Layers

1. Data Collection Layer

Exporters (e.g., node_exporter, redis_exporter, blackbox_exporter)
vmagent / Prometheus scrape targets
Dynamic target discovery via Consul or static configs
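
As a rough illustration of how the two discovery modes might sit side by side, the sketch below builds a structure shaped like the `scrape_configs` section of a Prometheus configuration. The job names, the Consul address, the service name, and the static target are assumptions made for this example; they are not taken from this project's configuration.

```python
import json


def build_scrape_configs(consul_server, consul_services, static_targets):
    """Return a list shaped like the `scrape_configs` section of prometheus.yml."""
    return [
        {
            # Targets registered in Consul are discovered dynamically.
            "job_name": "consul-services",  # hypothetical job name
            "consul_sd_configs": [
                {"server": consul_server, "services": list(consul_services)}
            ],
        },
        {
            # Targets that rarely change can be listed statically.
            "job_name": "static-nodes",  # hypothetical job name
            "static_configs": [{"targets": list(static_targets)}],
        },
    ]


if __name__ == "__main__":
    configs = build_scrape_configs(
        consul_server="consul.example.internal:8500",  # assumed address
        consul_services=["redis-exporter"],            # assumed service name
        static_targets=["10.0.0.5:9100"],              # assumed node_exporter target
    )
    # JSON is printed only to inspect the structure; the real file is YAML.
    print(json.dumps({"scrape_configs": configs}, indent=2))
```

Keeping both job types in one list mirrors the "Consul or static configs" choice: either entry can be dropped without affecting the other.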

2. Data Ingestion Layer

Prometheus scrapes metrics from the discovered targets
Optional: push samples via remote_write to VictoriaMetrics or Thanos Receive
vmagent can act as a scrape-and-forward gateway in large-scale setups
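
The optional remote_write step can be sketched as a small helper that picks the write endpoint for the chosen backend. The host names are placeholders, and the ports and paths shown are the upstream defaults for single-node VictoriaMetrics and Thanos Receive; a cluster deployment (for example one using vminsert) would use a different URL, so treat these values as assumptions to check against the actual deployment.

```python
def remote_write_section(backend, host):
    """Return a list shaped like the `remote_write` section of prometheus.yml."""
    if backend == "victoriametrics":
        # Single-node VictoriaMetrics accepts Prometheus remote write here.
        url = f"http://{host}:8428/api/v1/write"
    elif backend == "thanos":
        # Thanos Receive accepts remote write here (default port 19291).
        url = f"http://{host}:19291/api/v1/receive"
    else:
        raise ValueError(f"unknown backend: {backend}")
    return [{"url": url}]


if __name__ == "__main__":
    # "vm.example.internal" is a placeholder host, not a real endpoint.
    print(remote_write_section("victoriametrics", "vm.example.internal"))
```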

3. Storage Layer

Short-term: Prometheus TSDB
Long-term:
- VictoriaMetrics Cluster (for high-performance write/read at scale)
- Thanos with S3/GCS/Object Storage (for federated, HA setups)

4. Query & Visualization Layer

Grafana dashboards
Thanos Query (fans out to sidecars, store gateways, and other Store API endpoints, then merges the results)
VMSelect (VictoriaMetrics cluster query component)
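
Both Thanos Query and VMSelect expose a Prometheus-compatible HTTP query API, so a client such as Grafana can talk to either in the same way. The sketch below only builds the instant-query URL and sends nothing over the network; the base URLs are assumed placeholders.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build a URL for the Prometheus-compatible instant query endpoint."""
    params = {"query": promql}
    if time is not None:
        # Optional evaluation timestamp (Unix seconds or RFC 3339).
        params["time"] = str(time)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed Thanos Query address (placeholder host).
    print(instant_query_url("http://thanos-query.example.internal:10902", "up"))
    # Assumed VMSelect address; tenant 0 is used as an example.
    print(instant_query_url(
        "http://vmselect.example.internal:8481/select/0/prometheus", "up"
    ))
```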

5. Alerting Layer

Prometheus alert rules
Alertmanager for routing & deduplication
Notification channels: Slack, Telegram, Email, Opsgenie, etc.
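
To make "routing & deduplication" a little more concrete, here is a deliberately simplified sketch of the idea: alerts with an identical label set are collapsed, and the remainder are grouped by a chosen set of labels so that one notification can cover a whole group. This illustrates the concept only and is not how Alertmanager is implemented; the alert names and labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with duplicate label sets, then group the rest by `group_by`."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        labels = alert["labels"]
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # an alert with the same label set was already kept
        seen.add(fingerprint)
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    alerts = [
        {"labels": {"alertname": "HighCPU", "instance": "a"}},
        {"labels": {"alertname": "HighCPU", "instance": "a"}},  # duplicate
        {"labels": {"alertname": "HighCPU", "instance": "b"}},
    ]
    for key, members in group_alerts(alerts, ["alertname"]).items():
        print(key, len(members))
```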

🌐 Multi-Cluster / Multi-Region Support

We support scalable observability across multiple Kubernetes clusters or VM regions:
Centralized Thanos Query aggregates metrics from different clusters
Cross-cluster alerting handled via unified Alertmanager
Use of vmagent or Prometheus in each cluster for local collection
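
When several clusters (and possibly several replicas per cluster) feed one central query layer, series are usually told apart by external labels. The sketch below assumes each collector attaches a `cluster` label and a `replica` label (both names are assumptions for this example) and shows the general idea of replica deduplication: series that differ only in the replica label are treated as one. Thanos Query offers this kind of behaviour through its `--query.replica-label` flag; the merge rule used here (keep the replica with the most samples) is a simplification, not Thanos's actual algorithm.

```python
def deduplicate_series(series, replica_label="replica"):
    """Keep one series per label set after ignoring the replica label."""
    merged = {}
    for labels, samples in series:
        identity = tuple(
            sorted((k, v) for k, v in labels.items() if k != replica_label)
        )
        # Simplification: prefer whichever replica has more samples.
        if identity not in merged or len(samples) > len(merged[identity][1]):
            merged[identity] = (dict(identity), samples)
    return list(merged.values())


if __name__ == "__main__":
    series = [
        ({"__name__": "up", "cluster": "a", "replica": "0"}, [1, 1]),
        ({"__name__": "up", "cluster": "a", "replica": "1"}, [1, 1, 1]),
        ({"__name__": "up", "cluster": "b", "replica": "0"}, [1]),
    ]
    for labels, samples in deduplicate_series(series):
        print(labels, samples)
```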

🔐 Security & Access Control

Grafana: Role-Based Access Control (RBAC), data source restrictions
Prometheus/Thanos endpoints: mTLS, basic auth, IP allowlists
Object storage access via IAM / service accounts
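
Two of the endpoint controls above, IP allowlists and basic auth, can be illustrated with standard-library helpers. In practice these checks are typically enforced by a reverse proxy or by the component's own web configuration rather than by custom code, so the sketch below is only meant to show the logic; the CIDR ranges and credentials are placeholders.

```python
import base64
import ipaddress


def is_allowed(client_ip, allowlist):
    """Return True if client_ip falls inside any CIDR range in allowlist."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowlist)


def basic_auth_header(username, password):
    """Return the value of an HTTP Authorization header for basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    allowlist = ["10.0.0.0/8"]  # placeholder range
    print(is_allowed("10.1.2.3", allowlist))
    print(is_allowed("192.168.1.1", allowlist))
    print(basic_auth_header("user", "pass"))  # placeholder credentials
```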