System Architecture Overview - datnguyendv/monitoring_tools GitHub Wiki

This page outlines the architecture of our monitoring platform, which is built using modular, scalable, and cloud-native components. The system is designed to collect, store, visualize, and alert on telemetry data from distributed services across single or multiple clusters.

🔧 Core Components

| Component | Role |
| --- | --- |
| Prometheus | Time-series data collector; pulls metrics from targets and sends them to long-term storage (via remote_write). |
| Grafana | Dashboard and visualization layer for metrics and alert inspection. |
| Alertmanager | Routes, groups, and deduplicates alerts fired by Prometheus alerting rules, then sends notifications. |
| VictoriaMetrics / Thanos | Long-term storage backend; supports high performance and scalability. |
| vmagent / node_exporter / custom exporters | Exporters expose metrics from services and nodes; vmagent scrapes and forwards them. |
| Consul | Service discovery for dynamic target registration. |

🧱 Architecture Layers

1. Data Collection Layer

  • Exporters (e.g., node_exporter, redis_exporter, blackbox_exporter)
  • vmagent / Prometheus scrape targets
  • Dynamic target discovery via Consul or static configs
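
To make the "custom exporter" idea concrete, here is a minimal sketch of what such an exporter serves on its /metrics endpoint, using only the Python standard library. The metric name `queue_depth`, its labels, and the port are hypothetical; in practice a Prometheus client library would usually handle this formatting.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics at /metrics for a scraper to pull."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Hypothetical metric; a real exporter would read this from the service.
        body = render_gauge(
            "queue_depth", "Jobs waiting in the queue.", [({"queue": "emails"}, 3)]
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9105) -> None:
    """Start the exporter; the port number is an arbitrary assumption."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus or vmagent would then scrape it like any other target, whether it is listed statically or discovered through Consul.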

2. Data Ingestion Layer

  • Prometheus scrapes metrics from the discovered targets
  • Optional: Prometheus pushes samples via remote_write to VictoriaMetrics or Thanos Receive
  • vmagent acts as a scrape-and-forward gateway in large-scale setups

3. Storage Layer

  • Short-term: Prometheus TSDB
  • Long-term:
    • VictoriaMetrics Cluster (for high-performance write/read at scale)
    • Thanos with S3/GCS/Object Storage (for federated, HA setups)

4. Query & Visualization Layer

  • Grafana dashboards
  • Thanos Query (aggregates results from sidecars and other Store API endpoints)
  • vmselect (VictoriaMetrics cluster query component)
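
Both query components listed above can be queried through a Prometheus-compatible HTTP API, which is also how Grafana talks to them. The following is a minimal sketch of an instant query against such an endpoint; the base URL is a hypothetical placeholder, and on a VictoriaMetrics cluster the API sits under a tenant-specific path prefix that would need to be part of `base_url`.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> list:
    """Turn a decoded API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value-as-string"]}.
    return [(series["metric"], float(series["value"][1])) for series in data["result"]]


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run an instant query and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `instant_query("http://thanos-query.example.internal", "up == 0")` would return the targets currently reported as down, assuming that host exists and is reachable.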

5. Alerting Layer

  • Prometheus alert rules
  • Alertmanager for routing & deduplication
  • Notification channels: Slack, Telegram, Email, Opsgenie, etc.
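
The grouping and deduplication step can be illustrated with a small conceptual sketch. This is not Alertmanager's implementation; it only shows the idea that alerts with an identical label set collapse into one, and that the remaining alerts are batched by a chosen set of labels before notification. The alert dictionaries and the `group_by` labels are assumptions made for the example.

```python
from collections import defaultdict


def fingerprint(labels: dict) -> tuple:
    """Identify an alert by its full label set, independent of key order."""
    return tuple(sorted(labels.items()))


def group_alerts(alerts: list, group_by: list) -> dict:
    """Deduplicate alerts by label set, then group them by the `group_by` labels."""
    groups = defaultdict(dict)
    for alert in alerts:
        labels = alert["labels"]
        group_key = tuple((name, labels.get(name, "")) for name in group_by)
        # An alert with the same label set overwrites the earlier copy (deduplication).
        groups[group_key][fingerprint(labels)] = alert
    return {key: list(by_fp.values()) for key, by_fp in groups.items()}


alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "a"}},
    {"labels": {"alertname": "InstanceDown", "instance": "a"}},  # duplicate from a second Prometheus
    {"labels": {"alertname": "InstanceDown", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "instance": "a"}},
]
grouped = group_alerts(alerts, group_by=["alertname"])
```

Here `grouped` holds one group per alert name, so a notification channel would receive one message for `InstanceDown` covering both instances rather than three separate messages.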

🌐 Multi-Cluster / Multi-Region Support

We support scalable observability across multiple Kubernetes clusters or VM regions:

  • A centralized Thanos Query aggregates metrics from different clusters
  • Cross-cluster alerting is handled via a unified Alertmanager
  • vmagent or Prometheus runs in each cluster for local collection
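
As a rough illustration of what a centralized query layer does with per-cluster results, the sketch below tags each series with its cluster and collapses copies that differ only in a replica label. The label names `cluster` and `replica` and the input shape are assumptions for the example; Thanos Query's actual deduplication is more involved than keeping the first copy.

```python
def merge_cluster_results(results_by_cluster: dict, replica_label: str = "replica") -> list:
    """Merge per-cluster (labels, value) samples into one deduplicated list.

    Series that differ only in `replica_label` are treated as the same series,
    and the first sample seen for each is kept.
    """
    merged = {}
    for cluster, samples in results_by_cluster.items():
        for metric, value in samples:
            # Drop the replica label and record which cluster the series came from.
            labels = {k: v for k, v in metric.items() if k != replica_label}
            labels["cluster"] = cluster
            key = tuple(sorted(labels.items()))
            merged.setdefault(key, (labels, value))
    return list(merged.values())


merged = merge_cluster_results({
    "eu": [
        ({"__name__": "up", "job": "node", "replica": "a"}, 1.0),
        ({"__name__": "up", "job": "node", "replica": "b"}, 1.0),
    ],
    "us": [
        ({"__name__": "up", "job": "node", "replica": "a"}, 0.0),
    ],
})
```

The two replicas in the hypothetical `eu` cluster collapse into a single series, while the `us` series stays separate because its `cluster` label differs.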

🔐 Security & Access Control

  • Grafana: Role-Based Access Control (RBAC), data source restrictions
  • Prometheus/Thanos endpoints: mTLS, basic auth, IP allowlists
  • Object storage access via IAM / service accounts
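
From a client's point of view, the mTLS and basic auth protections on the Prometheus/Thanos endpoints could look like the sketch below. The certificate paths, credentials, and URL are placeholders, and how the endpoints are actually fronted (reverse proxy, native TLS configuration, and so on) is not specified on this page.

```python
import base64
import ssl
import urllib.request


def build_mtls_context(ca_file: str, cert_file: str, key_file: str) -> ssl.SSLContext:
    """Create a TLS context that verifies the server and presents a client certificate."""
    context = ssl.create_default_context(cafile=ca_file)
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def build_authed_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Attach an HTTP Basic Authorization header to the request."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch(url: str, username: str, password: str, context: ssl.SSLContext,
          timeout: float = 10.0) -> bytes:
    """Fetch a protected endpoint using both mTLS and basic auth."""
    request = build_authed_request(url, username, password)
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        return resp.read()
```

An IP allowlist needs no client-side code; it is enforced by the network or proxy in front of the endpoint.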