Scaling Strategies for Monitoring Systems

This guide outlines scalable architecture patterns and tactics for monitoring infrastructure that must support growing workloads, increasing metric volume, and high availability across multiple environments. It covers Prometheus, Thanos, and VictoriaMetrics use cases.

Scaling Challenges in Monitoring

Common bottlenecks and problems that arise as systems scale:

  • High ingestion rates (100K+ samples/sec)
  • Too many time series (high cardinality)
  • Long retention requirements
  • Multiple teams or tenants
  • Metrics stored across clusters or regions
  • Query latency under load

Scaling Strategies by Component

1. Prometheus

  • Vertical scaling: Increase CPU/RAM for a single instance
  • Horizontal sharding: Deploy multiple Prometheus instances, each scraping a subset of targets
  • Federation: Pull aggregated, summary-level series from child Prometheus servers into a parent instance (useful only for small, pre-aggregated subsets)
  • Remote write: Offload long-term storage to Thanos Receive, VictoriaMetrics, or Cortex
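
The horizontal sharding above is usually implemented with hashmod relabeling, so each instance keeps only the targets assigned to its shard. A minimal sketch of a prometheus.yml fragment, assuming two shards; the job name and target list are placeholders:

```yaml
# Fragment for shard 0 of 2 (shard 1 uses regex: "1")
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node1:9100", "node2:9100"]   # placeholder targets
    relabel_configs:
      # Hash each target address and keep only targets that map to this shard
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```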

Limits:

  • ~2M active series per instance (beyond this, memory pressure and stability issues tend to appear)
  • No native support for clustering or horizontal HA

2. Thanos

  • Sidecars per Prometheus instance to expose TSDB to Thanos
  • Thanos Receive for native remote write ingestion
  • Thanos Query federates stores and enables global view
  • Thanos Store Gateway scales horizontally with object storage
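
Sidecar, Receive, Store Gateway, and Compactor all share the same object-storage configuration, typically passed with --objstore.config-file. A minimal S3 sketch; the bucket name, endpoint, and credentials are placeholders:

```yaml
# objstore.yml -- shared by Sidecar, Receive, Store Gateway, Compactor
type: S3
config:
  bucket: thanos-metrics                 # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  access_key: <ACCESS_KEY>               # or rely on IAM roles instead of static keys
  secret_key: <SECRET_KEY>
```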

Scaling Tips:

  • Use bucket index cache for Store Gateway performance
  • Place Query Frontend before Thanos Query for heavy Grafana usage

3. VictoriaMetrics

  • Single-node VictoriaMetrics scales up to ~500K samples/sec
  • Cluster mode: Split across vminsert, vmstorage, vmselect
  • Add more vminsert for ingestion scaling
  • Add more vmselect for concurrent read scaling
  • Set -replicationFactor=2 (or higher) on vminsert for HA at the storage level
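
A minimal docker-compose sketch of the cluster split above, trimmed to two vmstorage nodes with replication factor 2; service names, image tags, and flag values are assumptions to adapt:

```yaml
services:
  vmstorage-1:
    image: victoriametrics/vmstorage      # pin a specific version tag in practice
    command: ["-retentionPeriod=12", "-storageDataPath=/storage"]
  vmstorage-2:
    image: victoriametrics/vmstorage
    command: ["-retentionPeriod=12", "-storageDataPath=/storage"]
  vminsert:
    image: victoriametrics/vminsert
    command:
      - "-storageNode=vmstorage-1:8400"   # vminsert talks to vmstorage on port 8400
      - "-storageNode=vmstorage-2:8400"
      - "-replicationFactor=2"
  vmselect:
    image: victoriametrics/vmselect
    command:
      - "-storageNode=vmstorage-1:8401"   # vmselect talks to vmstorage on port 8401
      - "-storageNode=vmstorage-2:8401"
      - "-dedup.minScrapeInterval=30s"    # deduplicate replicated samples at query time
```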

Scaling Tips:

  • Keep ingestion stateless via vmagent
  • Load-balance vminsert behind HAProxy/Nginx or use consistent hashing
  • Run vmstorage on disks with good IOPS (NVMe or SSD)

Scaling Design Patterns

Pattern 1: Small Single-Cluster Deployment

  • 1 Prometheus + 1 Grafana
  • Local TSDB + short retention (7–15d)
  • Remote write to single-node VictoriaMetrics
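
The remote-write half of this pattern is a single block in prometheus.yml. A sketch, assuming a single-node VictoriaMetrics reachable at a placeholder hostname on its default port 8428; local retention is controlled separately with the --storage.tsdb.retention.time flag:

```yaml
remote_write:
  - url: http://victoriametrics:8428/api/v1/write   # single-node VictoriaMetrics write endpoint
    queue_config:
      max_shards: 10                # tuning values are illustrative, not prescriptive
      max_samples_per_send: 5000
```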

Pattern 2: Mid-Sized Multi-Team Environment

  • Multiple Prometheus shards (per team/namespace)
  • Thanos Sidecars + Query + Store Gateway
  • Long-term metrics stored in S3/GCS
  • Alertmanager with global config
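
For Thanos Query to merge and deduplicate series from multiple shards and replicas, each Prometheus needs distinguishing external labels. A sketch of the prometheus.yml fragment; the label names and values are conventions, not requirements, and the replica label is then passed to Thanos Query via --query.replica-label:

```yaml
# prometheus.yml fragment for one shard/replica
global:
  external_labels:
    cluster: prod-eu1          # which cluster or shard this Prometheus covers
    prometheus_replica: "0"    # replica index, dropped at query time for deduplication
```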

Pattern 3: High Throughput VM-Based Setup

  • vmagent on each VM/node
  • vmagent → vminsert (load-balanced) → vmstorage (x3)
  • vmselect cluster for Grafana queries
  • Retention: 6–12 months
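
A sketch of the vmagent layer as a Kubernetes DaemonSet; the names, namespace, and load-balancer hostname are assumptions, and /insert/0/prometheus/api/v1/write is the write path exposed by vminsert:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vmagent
  namespace: monitoring
spec:
  selector:
    matchLabels: {app: vmagent}
  template:
    metadata:
      labels: {app: vmagent}
    spec:
      containers:
        - name: vmagent
          image: victoriametrics/vmagent        # pin a specific version tag in practice
          args:
            - -promscrape.config=/etc/vmagent/scrape.yml
            - -remoteWrite.url=http://vminsert-lb:8480/insert/0/prometheus/api/v1/write
          volumeMounts:
            - {name: config, mountPath: /etc/vmagent}
      volumes:
        - name: config
          configMap: {name: vmagent-config}     # holds scrape.yml
```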

Pattern 4: Global Observability Platform

  • Prometheus in each cluster with sidecar
  • Thanos Receive + S3
  • Centralized Thanos Query, Query Frontend, Rule
  • Global Grafana + multi-tenant dashboards
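
In the Receive-based variant of this pattern, each cluster's Prometheus ships data with ordinary remote write. A sketch, assuming Receive's default remote-write port (19291) and an in-cluster DNS name:

```yaml
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    # If Receive runs multi-tenant, a tenant header can be added, e.g.:
    # headers:
    #   THANOS-TENANT: team-a
```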

Resource Planning Guidelines

| Load Type | Recommended Setup |
| --- | --- |
| < 100K samples/sec | Single-node Prometheus + remote_write |
| 100K–500K samples/sec | Single-node VictoriaMetrics, or a small cluster with 2 vminsert |
| > 1M samples/sec | VictoriaMetrics cluster or Thanos Receive |
| Multi-cluster (> 5 clusters) | Thanos Query + Sidecars + object storage |
| High retention (1 yr+) | Thanos + S3/GCS, or VictoriaMetrics with compression tuning |

Best Practices

  • Avoid high cardinality metrics (labels with dynamic or unbounded values)
  • Use relabeling to drop unnecessary targets/metrics early (see the sketch after this list)
  • Enforce per-job or per-namespace sharding
  • Enable query caching (Thanos Query Frontend, vmselect cache)
  • Alert on ingestion errors and dropped samples
  • Automate service discovery (Consul, Kubernetes)
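
For the relabeling point above, a sketch of dropping a high-cardinality metric at scrape time with metric_relabel_configs; the job, target, and metric names are illustrative:

```yaml
scrape_configs:
  - job_name: kubelet
    static_configs:
      - targets: ["kubelet:10250"]        # placeholder target
    metric_relabel_configs:
      # Drop a known high-cardinality histogram before it is written to the TSDB
      - source_labels: [__name__]
        regex: apiserver_request_duration_seconds_bucket
        action: drop
```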