Architecture Decisions & Comparisons - datnguyendv/monitoring_tools GitHub Wiki
This section compares the key architectural choices for building a scalable monitoring stack. It covers when to use remote_write versus remote_read, and how to choose between Thanos and VictoriaMetrics based on your use case, performance needs, and tolerance for operational complexity.
⚙️ Remote Write vs Remote Read
Remote Write
Definition: Prometheus or vmagent sends collected metrics to a remote storage backend (e.g., VictoriaMetrics, Thanos Receive, Cortex) asynchronously.
Use cases:
- Offloading time-series data to long-term storage
- Reducing load on Prometheus
- Decoupling scraping and querying
- Multi-tenant ingestion setups
Pros:
- Efficient and low-latency ingestion
- Prometheus can run with short local retention (or in agent mode), which makes it close to stateless and easier to scale or run in HA
- Works well with vmagent to push from edge nodes or sidecars
Cons:
- remote_write is one-way: on its own it does not let Prometheus query the data it has shipped to remote storage
- Querying the remote data requires an additional query layer (e.g., VMSelect, Thanos Query)
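As a rough illustration of the remote_write setup described above, the sketch below renders a `prometheus.yml` fragment with a single remote_write target. The `queue_config` keys are real Prometheus settings, but the numeric values are illustrative placeholders rather than defaults or tuning advice, and the `victoriametrics` hostname is an assumption (port 8428 with `/api/v1/write` is the single-node VictoriaMetrics write endpoint).

```python
def render_remote_write(url, capacity=10000, max_shards=50, max_samples_per_send=2000):
    """Return a prometheus.yml fragment (as text) with one remote_write target.

    The numeric values are illustrative assumptions, not recommendations.
    """
    lines = [
        "remote_write:",
        f"  - url: {url}",
        "    queue_config:",
        f"      capacity: {capacity}",
        f"      max_shards: {max_shards}",
        f"      max_samples_per_send: {max_samples_per_send}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical hostname; adjust to your own backend.
print(render_remote_write("http://victoriametrics:8428/api/v1/write"))
```

The queue settings control how samples are batched and sharded before being sent, which is where most remote_write tuning tends to happen.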
Remote Read
Definition: Prometheus queries historical metrics back from the remote storage (typically the same backend used in remote_write, provided that backend exposes the remote read API).
Use cases:
- Display historical data directly in Prometheus UI
- Minimal integration: no external query layer
Pros:
- Seamless integration with existing Prometheus dashboards
- Simple for backfilling or hybrid queries
Cons:
- Can be slow or inefficient for large ranges
- Not optimized for high cardinality or concurrent queries
- Tight coupling between Prometheus and backend performance
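A back-of-the-envelope calculation shows why large ranges are costly. With remote_read, the backend returns raw series data and Prometheus evaluates the PromQL expression locally, so the data shipped grows with the number of series times the range. The function below is a rough estimate under that assumption; it ignores compression and any backend-side optimizations, and the function name is made up for this example.

```python
def raw_samples_for_range(series, range_seconds, scrape_interval_seconds):
    """Rough count of raw samples a remote_read request has to return."""
    samples_per_series = range_seconds // scrape_interval_seconds
    return series * samples_per_series


# 10,000 series over 30 days at a 15 s scrape interval.
thirty_days = 30 * 24 * 3600
print(raw_samples_for_range(10_000, thirty_days, 15))  # 1728000000
```

Even a moderately sized dashboard query over a month can therefore pull on the order of a billion samples through Prometheus.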
Recommendation Summary
- Use remote_write for scalable, long-term storage setups and HA designs.
- Use remote_read only when you need legacy UI compatibility or minimal architectural changes.
- Do not rely solely on remote_read in high-scale or distributed environments.
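When remote_write is the primary path, dashboards and scripts usually query the backend's Prometheus-compatible HTTP API instead of Prometheus itself. The sketch below only builds the instant-query URL and sends no request. The hostnames are hypothetical, and the base URLs assume the commonly used defaults (Thanos Query HTTP on port 10902, vmselect on port 8481 under `/select/<accountID>/prometheus`).

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql, time=None):
    """Build a Prometheus-compatible /api/v1/query URL for the given backend."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Hypothetical hostnames; tenant 0 is assumed for the VictoriaMetrics cluster.
print(instant_query_url("http://thanos-query:10902", "up"))
print(instant_query_url("http://vmselect:8481/select/0/prometheus", "up"))
```

Because both backends expose the same API shape, switching a Grafana data source or script between them is mostly a matter of changing the base URL.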
🔍 Thanos vs VictoriaMetrics
Overview Table
Feature | Thanos | VictoriaMetrics |
---|---|---|
Architecture | Modular (Sidecar, Store, Query) | Monolithic or cluster-based |
Storage backend | Object storage (S3, GCS, etc.) | Local disk (S3/GCS for backups via vmbackup) |
Query latency | Medium | Very low (especially with VMSelect) |
Query language | Full PromQL | MetricsQL (largely PromQL-compatible, with extensions) |
Deduplication | Yes (at query time, via replica labels in Thanos Query) | Yes (via -dedup.minScrapeInterval) |
High availability | Built-in (multi-store/query) | Built-in (via cluster mode) |
Alerting | Thanos Rule | vmalert (VMRule with the operator) plus external Alertmanager |
Setup complexity | High | Low to medium |
Scaling model | Horizontal via component shards | Horizontal via cluster roles |
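The Deduplication row can be made more concrete with a simplified sketch of replica-label deduplication: series that differ only in a replica label are treated as one series and their samples are merged. This is not how either project implements it. Real replicas rarely share exact timestamps, and the production algorithms are more involved. The function name, data layout, and the `replica` label name are assumptions made for this example.

```python
def dedup_replicas(series_list, replica_label="replica"):
    """Merge series that differ only by replica_label.

    series_list: list of (labels dict, list of (timestamp, value)) pairs.
    Returns {sorted label tuple without the replica label: sorted samples}.
    Simplification: assumes replicas share exact timestamps.
    """
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            # On a timestamp clash, the first replica seen wins.
            by_ts.setdefault(ts, value)
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


a = ({"__name__": "up", "job": "api", "replica": "a"}, [(10, 1.0), (20, 1.0)])
b = ({"__name__": "up", "job": "api", "replica": "b"}, [(20, 1.0), (30, 0.0)])
print(dedup_replicas([a, b]))
```

Here the two HA replicas collapse into a single `up{job="api"}` series with three samples, and the gap in each replica is filled by the other.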
When to Use Thanos
Use Thanos if:
- You have multiple Prometheus instances and want federated query
- You need to store metrics in object storage (S3/GCS)
- You want to maintain Prometheus-native format and separation of concerns
- You are operating in a cloud-native Kubernetes environment
- You need multi-tenant metrics aggregation across environments
- Your infra has >10 Prometheus shards across clusters and requires unified long-term query access
When to Use VictoriaMetrics
Use VictoriaMetrics if:
- You require ultra-fast ingestion and querying at scale
- You prefer a simpler deployment model with fewer components
- You want to avoid operational complexity of Thanos sidecars/stores
- You are using vmagent at edge locations or VMs and want efficient remote_write
- Your metric volume exceeds 1M samples/sec and you need optimized TSDB performance
- You want to deploy in a hybrid environment (K8s + bare-metal/VM)
Recommendation Summary
- Choose Thanos for long-term, cloud-native, multi-cluster setups, particularly when object storage and query federation across regions/clusters are important.
- Choose VictoriaMetrics for high-throughput, cost-effective, and low-latency metric pipelines with simpler operations.
- Suggested guideline:
  - Small-to-medium scale (<100K samples/sec, single cluster): Prometheus + remote_write to VM single-node is sufficient.
  - Medium-to-large (100K–1M samples/sec): VM Cluster is preferred over Thanos for performance.
  - Multi-cluster, cloud-first teams (>5 clusters, 10+ Prometheus): Thanos offers better federation and query unification.
  - Low-ops team with few infra engineers: VictoriaMetrics simplifies lifecycle management.
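The guideline above can be encoded as a small decision helper. The thresholds come from the list. The order in which the rules are checked, the `"undecided"` fallback for combinations the guideline does not cover, and all function and return names are assumptions made for this sketch. The ingestion rate is estimated as active series divided by the scrape interval.

```python
def estimate_samples_per_sec(active_series, scrape_interval_seconds):
    """Approximate ingestion rate: each active series yields one sample per scrape."""
    return active_series / scrape_interval_seconds


def suggest_backend(samples_per_sec, clusters=1, prometheus_instances=1, low_ops_team=False):
    """Map the guideline's thresholds to a suggestion (rule order is an assumption)."""
    multi_cluster = clusters > 5 and prometheus_instances >= 10
    if multi_cluster and not low_ops_team:
        return "thanos"
    if samples_per_sec >= 100_000:
        return "victoriametrics-cluster"
    if clusters == 1 or low_ops_team:
        return "victoriametrics-single-node"
    return "undecided"  # combination not covered by the guideline


rate = estimate_samples_per_sec(active_series=3_000_000, scrape_interval_seconds=15)
print(rate, suggest_backend(rate))
```

Treat the output as a starting point for discussion, not a verdict: retention, query patterns, and existing tooling are not modelled here.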