Scaling Strategies for Monitoring Systems

This guide outlines scalable architecture patterns and tactics for monitoring infrastructure that must support growing workloads, increasing metric volume, and high availability across multiple environments. It covers Prometheus, Thanos, and VictoriaMetrics use cases.

Scaling Challenges in Monitoring

Common bottlenecks and problems that arise as systems scale:

  • High ingestion rates (100K+ samples/sec)
  • Too many time series (high cardinality)
  • Long retention requirements
  • Multiple teams or tenants
  • Metrics stored across clusters or regions
  • Query latency under load

Scaling Strategies by Component

1. Prometheus

  • Vertical scaling: Increase CPU/RAM for a single instance
  • Horizontal sharding: Deploy multiple Prometheus instances, each scraping a subset of targets
  • Federation: Pull aggregated, summary-level series from child Prometheus servers into a parent instance (useful only for small, pre-aggregated subsets)
  • Remote write: Offload long-term storage to Thanos Receive, VictoriaMetrics, or Cortex
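
The horizontal sharding above is usually implemented with hashmod relabeling, so each instance keeps only the targets assigned to its shard. A minimal sketch of a prometheus.yml fragment, assuming two shards; the job name and target list are placeholders:

```yaml
# Fragment for shard 0 of 2 (shard 1 uses regex: "1")
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node1:9100", "node2:9100"]   # placeholder targets
    relabel_configs:
      # Hash each target address and keep only targets that map to this shard
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```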

Limits:

  • ~2M active series per instance (beyond this, memory pressure and stability issues tend to appear)
  • No native support for clustering or horizontal HA

2. Thanos

  • Sidecars per Prometheus instance to expose TSDB to Thanos
  • Thanos Receive for native remote write ingestion
  • Thanos Query federates stores and enables global view
  • Thanos Store Gateway scales horizontally with object storage
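
Sidecar, Receive, Store Gateway, and Compactor all share the same object-storage configuration, typically passed with --objstore.config-file. A minimal S3 sketch; the bucket name, endpoint, and credentials are placeholders:

```yaml
# objstore.yml -- shared by Sidecar, Receive, Store Gateway, Compactor
type: S3
config:
  bucket: thanos-metrics                 # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  access_key: <ACCESS_KEY>               # or rely on IAM roles instead of static keys
  secret_key: <SECRET_KEY>
```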

Scaling Tips:

  • Use bucket index cache for Store Gateway performance
  • Place Query Frontend before Thanos Query for heavy Grafana usage

3. VictoriaMetrics

  • Single-node VictoriaMetrics scales up to ~500K samples/sec
  • Cluster mode: Split across vminsert, vmstorage, vmselect
  • Add more vminsert for ingestion scaling
  • Add more vmselect for concurrent read scaling
  • Set -replicationFactor=2 (or higher) on vminsert for HA at the storage level
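
A minimal docker-compose sketch of the cluster split above, trimmed to two vmstorage nodes with replication factor 2; service names, image tags, and flag values are assumptions to adapt:

```yaml
services:
  vmstorage-1:
    image: victoriametrics/vmstorage      # pin a specific version tag in practice
    command: ["-retentionPeriod=12", "-storageDataPath=/storage"]
  vmstorage-2:
    image: victoriametrics/vmstorage
    command: ["-retentionPeriod=12", "-storageDataPath=/storage"]
  vminsert:
    image: victoriametrics/vminsert
    command:
      - "-storageNode=vmstorage-1:8400"   # vminsert talks to vmstorage on port 8400
      - "-storageNode=vmstorage-2:8400"
      - "-replicationFactor=2"
  vmselect:
    image: victoriametrics/vmselect
    command:
      - "-storageNode=vmstorage-1:8401"   # vmselect talks to vmstorage on port 8401
      - "-storageNode=vmstorage-2:8401"
      - "-dedup.minScrapeInterval=30s"    # deduplicate replicated samples at query time
```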

Scaling Tips:

  • Keep ingestion stateless via vmagent
  • Load-balance vminsert behind HAProxy/Nginx or use consistent hashing
  • Run vmstorage on disks with good IOPS (NVMe or SSD)

Scaling Design Patterns

Pattern 1: Small Single-Cluster Deployment

  • 1 Prometheus + 1 Grafana
  • Local TSDB + short retention (7–15d)
  • Remote write to single-node VictoriaMetrics
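
The remote-write half of this pattern is a single block in prometheus.yml. A sketch, assuming a single-node VictoriaMetrics reachable at a placeholder hostname on its default port 8428; local retention is controlled separately with the --storage.tsdb.retention.time flag:

```yaml
remote_write:
  - url: http://victoriametrics:8428/api/v1/write   # single-node VictoriaMetrics write endpoint
    queue_config:
      max_shards: 10                # tuning values are illustrative, not prescriptive
      max_samples_per_send: 5000
```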

Pattern 2: Mid-Sized Multi-Team Environment

  • Multiple Prometheus shards (per team/namespace)
  • Thanos Sidecars + Query + Store Gateway
  • Long-term metrics stored in S3/GCS
  • Alertmanager with global config
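
For Thanos Query to merge and deduplicate series from multiple shards and replicas, each Prometheus needs distinguishing external labels. A sketch of the prometheus.yml fragment; the label names and values are conventions, not requirements, and the replica label is then passed to Thanos Query via --query.replica-label:

```yaml
# prometheus.yml fragment for one shard/replica
global:
  external_labels:
    cluster: prod-eu1          # which cluster or shard this Prometheus covers
    prometheus_replica: "0"    # replica index, dropped at query time for deduplication
```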

Pattern 3: High Throughput VM-Based Setup

  • vmagent on each VM/node
  • vmagent → vminsert (load-balanced) → vmstorage (x3)
  • vmselect cluster for Grafana queries
  • Retention: 6–12 months
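
A sketch of the vmagent layer as a Kubernetes DaemonSet; the names, namespace, and load-balancer hostname are assumptions, and /insert/0/prometheus/api/v1/write is the write path exposed by vminsert:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vmagent
  namespace: monitoring
spec:
  selector:
    matchLabels: {app: vmagent}
  template:
    metadata:
      labels: {app: vmagent}
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent        # pin a specific version tag in practice
          args:
            - -promscrape.config=/etc/vmagent/scrape.yml
            - -remoteWrite.url=http://vminsert-lb:8480/insert/0/prometheus/api/v1/write
          volumeMounts:
            - {name: config, mountPath: /etc/vmagent}
      volumes:
        - name: config
          configMap: {name: vmagent-config}     # holds scrape.yml
```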

Pattern 4: Global Observability Platform

  • Prometheus in each cluster with sidecar
  • Thanos Receive + S3
  • Centralized Thanos Query, Query Frontend, Rule
  • Global Grafana + multi-tenant dashboards
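
In the Receive-based variant of this pattern, each cluster's Prometheus ships data with ordinary remote write. A sketch, assuming Receive's default remote-write port (19291) and an in-cluster DNS name:

```yaml
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    # If Receive runs multi-tenant, a tenant header can be added, e.g.:
    # headers:
    #   THANOS-TENANT: team-a
```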

Resource Planning Guidelines

| Load Type | Recommended Setup |
| --- | --- |
| < 100K samples/sec | Single-node Prometheus + remote_write |
| 100K–500K samples/sec | Single-node VictoriaMetrics, or a small cluster with 2 vminsert |
| > 1M samples/sec | VictoriaMetrics cluster or Thanos Receive |
| Multi-cluster (> 5 clusters) | Thanos Query + Sidecars + object storage |
| High retention (1 yr+) | Thanos + S3/GCS, or VictoriaMetrics with compression tuning |

Best Practices

  • Avoid high cardinality metrics (labels with dynamic or unbounded values)
  • Use relabeling to drop unnecessary targets/metrics early (see the sketch after this list)
  • Enforce per-job or per-namespace sharding
  • Enable query caching (Thanos Query Frontend, vmselect cache)
  • Alert on ingestion errors and dropped samples
  • Automate service discovery (Consul, Kubernetes)
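
For the relabeling point above, a sketch of dropping a high-cardinality metric at scrape time with metric_relabel_configs; the job, target, and metric names are illustrative:

```yaml
scrape_configs:
  - job_name: kubelet
    static_configs:
      - targets: ["kubelet:10250"]        # placeholder target
    metric_relabel_configs:
      # Drop a known high-cardinality histogram before it is written to the TSDB
      - source_labels: [__name__]
        regex: apiserver_request_duration_seconds_bucket
        action: drop
```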