# Scaling Strategies for Monitoring Systems
This guide outlines scalable architecture patterns and tactics for monitoring infrastructure that must support growing workloads, increasing metric volume, and high availability across multiple environments. It covers Prometheus, Thanos, and VictoriaMetrics use cases.
## Scaling Challenges in Monitoring
Common bottlenecks and problems that arise as systems scale:
- High ingestion rates (100K+ samples/sec)
- Too many time series (high cardinality)
- Long retention requirements
- Multiple teams or tenants
- Metrics stored across clusters or regions
- Query latency under load
## Scaling Strategies by Component

### 1. Prometheus
- **Vertical scaling**: increase CPU/RAM for a single instance
- **Horizontal sharding**: deploy multiple Prometheus instances, each scraping a subset of targets (see the config sketch below)
- **Federation**: aggregate summaries from child Prometheus servers into a parent instance (limited use)
- **Remote write**: offload long-term storage to Thanos Receive, VictoriaMetrics, or Cortex
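A minimal `prometheus.yml` sketch combining hashmod-based horizontal sharding with remote write. The job name, Kubernetes service discovery, shard count, and the VictoriaMetrics endpoint are illustrative assumptions, not settings from this repo.

```yaml
# prometheus.yml for shard 0 of 2 (illustrative values, not a tested config)
global:
  external_labels:
    shard: "0"                      # lets downstream storage tell shards apart

scrape_configs:
  - job_name: node                  # hypothetical job; any SD mechanism works
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Hash each target address into one of 2 buckets...
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      # ...and keep only the bucket owned by this shard.
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep

remote_write:
  - url: http://victoriametrics.example.internal:8428/api/v1/write   # assumed endpoint
```

Running the same file with `regex: "1"` (and `shard: "1"`) on a second instance splits the scrape load roughly in half.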
**Limits:**
- ~2M active series per instance (pushing beyond this typically causes stability issues)
- No native support for horizontal scaling or HA
### 2. Thanos
- Sidecars per Prometheus instance to expose TSDB to Thanos
- Thanos Receive for native remote write ingestion
- Thanos Query federates all stores and enables a global view
- Thanos Store Gateway scales horizontally with object storage
**Scaling Tips:**
- Use the bucket index cache to improve Store Gateway performance
- Place Query Frontend in front of Thanos Query for heavy Grafana usage
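A docker-compose style sketch of the read path described above (Sidecar, Store Gateway, Query, Query Frontend). The image tag, ports, and the `bucket.yaml` object-storage config file are assumptions for illustration.

```yaml
services:
  sidecar:
    image: quay.io/thanos/thanos:v0.34.1        # assumed tag
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus:9090
      - --objstore.config-file=/etc/thanos/bucket.yaml
      - --grpc-address=0.0.0.0:10901
  store-gateway:
    image: quay.io/thanos/thanos:v0.34.1
    command:
      - store
      - --data-dir=/var/thanos/store            # local cache of bucket index/chunks
      - --objstore.config-file=/etc/thanos/bucket.yaml
      - --grpc-address=0.0.0.0:10902
  query:
    image: quay.io/thanos/thanos:v0.34.1
    command:
      - query
      - --http-address=0.0.0.0:10904
      - --endpoint=sidecar:10901                # live Prometheus TSDB
      - --endpoint=store-gateway:10902          # long-term data in object storage
  query-frontend:
    image: quay.io/thanos/thanos:v0.34.1
    command:
      - query-frontend
      - --http-address=0.0.0.0:9090
      - --query-range.split-interval=24h        # split long Grafana range queries
      - --query-frontend.downstream-url=http://query:10904
```

Grafana points at the Query Frontend, which splits and caches range queries before they reach Thanos Query.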
### 3. VictoriaMetrics
- Single-node VictoriaMetrics scales up to ~500K samples/sec
- Cluster mode: split across `vminsert`, `vmstorage`, and `vmselect` (see the sketch after these tips)
- Add more `vminsert` instances for ingestion scaling
- Add more `vmselect` instances for concurrent read scaling
- Use `replicationFactor: 2` or higher for HA at the storage level

**Scaling Tips:**
- Keep ingestion stateless via `vmagent`
- Load-balance `vminsert` behind HAProxy/Nginx or use consistent hashing
- Run `vmstorage` on disks with good IOPS (NVMe or SSD)
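A docker-compose style sketch of the cluster layout above. Image tags, ports, and retention are placeholders; two `vmstorage` nodes are shown because `-replicationFactor=2` needs at least two storage nodes.

```yaml
services:
  vmstorage-1:
    image: victoriametrics/vmstorage:v1.97.1-cluster     # assumed tag
    command:
      - -retentionPeriod=12                              # months
      - -storageDataPath=/storage                        # put /storage on NVMe/SSD
  vmstorage-2:
    image: victoriametrics/vmstorage:v1.97.1-cluster
    command:
      - -retentionPeriod=12
      - -storageDataPath=/storage
  vminsert:
    image: victoriametrics/vminsert:v1.97.1-cluster
    command:
      - -storageNode=vmstorage-1:8400
      - -storageNode=vmstorage-2:8400
      - -replicationFactor=2                             # each sample written to 2 storage nodes
  vmselect:
    image: victoriametrics/vmselect:v1.97.1-cluster
    command:
      - -storageNode=vmstorage-1:8401
      - -storageNode=vmstorage-2:8401
  vmagent:
    image: victoriametrics/vmagent:v1.97.1
    command:
      - -promscrape.config=/etc/vmagent/scrape.yaml      # hypothetical scrape config
      - -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write
```

To scale ingestion or reads, add more `vminsert` or `vmselect` replicas behind a load balancer; `vmstorage` nodes are added when disk or series capacity runs out.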
## Scaling Design Patterns

### Pattern 1: Small Single-Cluster Deployment
- 1 Prometheus + 1 Grafana
- Local TSDB + short retention (7–15d)
- Remote write to single-node VictoriaMetrics
### Pattern 2: Mid-Sized Multi-Team Environment
- Multiple Prometheus shards (per team/namespace)
- Thanos Sidecars + Query + Store Gateway
- Long-term metrics stored in S3/GCS
- Alertmanager with global config
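One common reading of "Alertmanager with global config" is that every Prometheus shard points at the same Alertmanager, so routing and silencing live in one place. A minimal per-shard `prometheus.yml` fragment, with the Alertmanager address as an assumption:

```yaml
# Added to every Prometheus shard so all teams alert through one Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.monitoring.svc:9093   # assumed shared Alertmanager address
```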
### Pattern 3: High-Throughput VM-Based Setup
- vmagent on each VM/node
- vmagent → vminsert (load-balanced) → vmstorage (x3)
- vmselect cluster for Grafana queries
- Retention: 6–12 months
### Pattern 4: Global Observability Platform
- Prometheus in each cluster with sidecar
- Thanos Receive + S3
- Centralized Thanos Query, Query Frontend, Rule
- Global Grafana + multi-tenant dashboards
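For the write path in this pattern, each cluster's Prometheus typically remote-writes into the central Thanos Receive tier. A sketch of that fragment; the Receive URL and tenant value are assumptions (`THANOS-TENANT` is Receive's default tenant header):

```yaml
# In each cluster's prometheus.yml: ship samples to the central Receive tier
remote_write:
  - url: http://thanos-receive.global.example.internal:19291/api/v1/receive
    headers:
      THANOS-TENANT: cluster-eu-west-1   # hypothetical per-cluster tenant ID
```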
## Resource Planning Guidelines
| Load Type | Recommended Setup |
|---|---|
| < 100K samples/sec | Single-node Prometheus + remote_write |
| 100K–500K samples/sec | VictoriaMetrics single-node or 2 vminsert |
| > 1M samples/sec | VictoriaMetrics cluster or Thanos Receive |
| Multi-cluster (>5 clusters) | Thanos Query + Sidecars + object storage |
| High retention (1yr+) | Thanos + S3/GCS, or VictoriaMetrics with compression tuning |
## Best Practices
- Avoid high-cardinality metrics (labels with dynamic or unbounded values)
- Use relabeling to drop unnecessary targets/metrics early (see the sketch after this list)
- Enforce per-job or per-namespace sharding
- Enable query caching (Thanos Query Frontend, vmselect cache)
- Alert on ingestion errors and dropped samples
- Automate service discovery (Consul, Kubernetes)
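As an example of dropping unneeded or high-cardinality series early (the relabeling bullet above), a `metric_relabel_configs` sketch. The job, target, metric name, and label are only illustrations of typically expensive data:

```yaml
scrape_configs:
  - job_name: example-app             # hypothetical job
    static_configs:
      - targets: ["app:8080"]         # placeholder target
    metric_relabel_configs:
      # Drop an expensive histogram before it reaches the TSDB
      - source_labels: [__name__]
        regex: apiserver_request_duration_seconds_bucket
        action: drop
      # Strip a high-cardinality label instead of dropping the whole series
      - regex: pod_ip
        action: labeldrop
```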