Architecture Decisions & Comparisons - datnguyendv/monitoring_tools GitHub Wiki

This section compares the key architectural choices for building a scalable monitoring stack. It explains when to use remote_write versus remote_read, and how to choose between Thanos and VictoriaMetrics based on your use case, performance needs, and operational complexity.

⚙️ Remote Write vs Remote Read

Remote Write

Definition: Prometheus or vmagent sends collected metrics to a remote storage backend (e.g., VictoriaMetrics, Thanos Receive, Cortex) asynchronously.

Use cases:

  • Offloading time-series data to long-term storage
  • Reducing load on Prometheus
  • Decoupling scraping and querying
  • Multi-tenant ingestion setups

Pros:

  • Efficient and low-latency ingestion
  • Prometheus can run with short local retention, or in agent mode, so it is close to stateless (good for scaling or HA)
  • Works well with vmagent to push from edge nodes or sidecars
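As a rough illustration of what enabling remote_write involves, the sketch below renders a minimal `remote_write` fragment for prometheus.yml. The helper name `render_remote_write` and the host `victoriametrics.example` are hypothetical. The port and path shown are the single-node VictoriaMetrics write endpoint, and the batch size is only a placeholder to tune.

```python
from textwrap import dedent


def render_remote_write(url: str, max_samples_per_send: int = 2000) -> str:
    """Render a prometheus.yml fragment that forwards scraped samples to `url`."""
    return dedent(f"""\
        remote_write:
          - url: "{url}"
            queue_config:
              # Samples batched into one request; tune per backend.
              max_samples_per_send: {max_samples_per_send}
        """)


if __name__ == "__main__":
    # Hypothetical host; adjust the URL to the backend you actually run.
    print(render_remote_write("http://victoriametrics.example:8428/api/v1/write"))
```

For Thanos Receive the same fragment applies with the receiver's own write URL in place of the VictoriaMetrics one.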

Cons:

  • remote_write only ships data out; Prometheus cannot query the remote backend through it
  • Queries against the remote data need a separate query layer (e.g., vmselect, Thanos Query)
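Both query layers named above expose the Prometheus-compatible HTTP API, so dashboards and scripts can target them much as they would Prometheus itself. The sketch below builds a `query_range` URL. The function name and the example hosts are hypothetical, and the base URLs assume default ports and, for VictoriaMetrics cluster, tenant 0.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str, start: int, end: int, step: str) -> str:
    """Build a Prometheus-compatible /api/v1/query_range URL for a query layer."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical hosts; only the path layout matters here.
    thanos = "http://thanos-query.example:10902"
    vmselect = "http://vmselect.example:8481/select/0/prometheus"
    for base in (thanos, vmselect):
        print(query_range_url(base, "up", 1700000000, 1700003600, "60s"))
```

Because the API shape is shared, moving a dashboard from one query layer to the other is mostly a matter of changing the base URL.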

Remote Read

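Definition: Prometheus queries historical metrics back from the remote storage, typically the same backend used in remote_write. This only works with backends that implement the remote read API; VictoriaMetrics, for example, does not.

Use cases: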

  • Display historical data directly in Prometheus UI
  • Minimal integration: no external query layer

Pros:

  • Seamless integration with existing Prometheus dashboards
  • Simple way to run hybrid queries that span local and remote data, or to read back backfilled history

Cons:

  • Slow and memory-hungry for large time ranges, because the backend returns raw samples and Prometheus evaluates PromQL locally
  • Not optimized for high-cardinality or highly concurrent queries
  • Couples Prometheus query performance tightly to backend performance
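A back-of-the-envelope estimate shows why large ranges hurt. With remote_read, the raw samples for every matching series travel to Prometheus before PromQL is evaluated. The sketch below counts those samples; the function name and the workload figures are illustrative assumptions, not measurements.

```python
def samples_transferred(series: int, range_seconds: int, scrape_interval_seconds: int) -> int:
    """Approximate number of raw samples a remote_read request must return."""
    samples_per_series = range_seconds // scrape_interval_seconds
    return series * samples_per_series


if __name__ == "__main__":
    thirty_days = 30 * 24 * 3600
    # 10,000 matching series scraped every 15 s, queried over 30 days.
    print(samples_transferred(series=10_000, range_seconds=thirty_days, scrape_interval_seconds=15))
    # prints 1728000000
```

A dedicated query layer can evaluate the expression next to the data and return only the result, which is why the estimate above matters mainly for remote_read.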

Recommendation Summary

  • Use remote_write for scalable, long-term storage setups and HA designs.
  • Use remote_read only when you need legacy UI compatibility or minimal architectural changes.
  • Do not rely solely on remote_read in high-scale or distributed environments.

🔍 Thanos vs VictoriaMetrics

Overview Table

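| Feature | Thanos | VictoriaMetrics |
| --- | --- | --- |
| Architecture | Modular (Sidecar, Store Gateway, Querier, Compactor) | Single-node binary or cluster (vminsert, vmselect, vmstorage) |
| Storage backend | Object storage (S3, GCS, etc.) | Local disk; S3/GCS for backups (vmbackup) |
| Query latency | Medium | Very low (especially with vmselect) |
| Query language | Full PromQL | MetricsQL (PromQL-compatible, with some extensions) |
| Deduplication | Yes (at query time in the Querier via replica labels; offline in the Compactor) | Yes (via -dedup.minScrapeInterval; replicationFactor controls replication) |
| High availability | Built-in (multiple Store Gateway/Querier instances) | Built-in (via cluster mode) |
| Alerting | Thanos Ruler | vmalert (VMRule when using the operator), with an external Alertmanager |
| Setup complexity | High | Low to medium |
| Scaling model | Horizontal via component shards | Horizontal via cluster roles |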
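To make the deduplication row more concrete, here is a deliberately simplified model of replica-label deduplication for an HA Prometheus pair. It drops an assumed `replica` label and merges series that become identical, keeping one value per timestamp. It is not the algorithm either project actually uses; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def deduplicate(series, replica_label="replica"):
    """Merge series that differ only by `replica_label` (simplified model).

    `series` is a list of (labels_dict, [(timestamp, value), ...]) pairs.
    """
    merged = defaultdict(dict)
    for labels, samples in series:
        # Identity of a series once the replica label is ignored.
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        for ts, value in samples:
            # The first replica seen wins for each timestamp.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    ha_pair = [
        ({"job": "api", "replica": "a"}, [(0, 1.0), (15, 2.0)]),             # replica a missed t=30
        ({"job": "api", "replica": "b"}, [(0, 1.0), (15, 2.0), (30, 3.0)]),
    ]
    print(deduplicate(ha_pair))
```

The gap left by one replica is filled from the other, which is the behaviour an HA pair relies on regardless of which backend performs the merge.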

When to Use Thanos

Use Thanos if:

  • You have multiple Prometheus instances and want federated query
  • You need to store metrics in object storage (S3/GCS)
  • You want to maintain Prometheus-native format and separation of concerns
  • You are operating in a cloud-native Kubernetes environment
  • You need multi-tenant metrics aggregation across environments
  • Your infra has >10 Prometheus shards across clusters and requires unified long-term query access

When to Use VictoriaMetrics

Use VictoriaMetrics if:

  • You require ultra-fast ingestion and querying at scale
  • You prefer a simpler deployment model with fewer components
  • You want to avoid operational complexity of Thanos sidecars/stores
  • You are using vmagent at edge locations or VMs and want efficient remote_write
  • Your metric volume exceeds 1M samples/sec and you need optimized TSDB performance
  • You want to deploy in a hybrid environment (K8s + bare-metal/VM)

Recommendation Summary

  • Choose Thanos for long-term, cloud-native, multi-cluster setups, particularly when object storage and query federation across regions/clusters are important.
  • Choose VictoriaMetrics for high-throughput, cost-effective, and low-latency metric pipelines with simpler operations.
  • Suggested guideline:
    • Small-to-medium scale (<100K samples/sec, single cluster): Prometheus + remote_write to single-node VictoriaMetrics is sufficient.
    • Medium-to-large (100K–1M samples/sec): a VictoriaMetrics cluster is preferred over Thanos for performance.
    • Multi-cluster, cloud-first teams (>5 clusters, 10+ Prometheus): Thanos offers better federation and query unification.
    • Low-ops team with few infra engineers: VictoriaMetrics simplifies lifecycle management.
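The guideline above can be sketched as a small decision helper. The thresholds come from the list, but the function name, the precedence between rules, and the handling of cases the list does not cover (for example, low volume spread over a few clusters) are assumptions made for illustration.

```python
def suggest_backend(samples_per_sec: int, clusters: int, prometheus_instances: int,
                    low_ops_team: bool = False) -> str:
    """Map the rough sizing guideline to a suggested backend (illustrative only)."""
    multi_cluster = clusters > 5 and prometheus_instances >= 10
    # Assumption: a low-ops team outweighs the multi-cluster preference for Thanos.
    if multi_cluster and not low_ops_team:
        return "Thanos"
    if samples_per_sec < 100_000 and clusters == 1:
        return "VictoriaMetrics single-node"
    # Assumption: everything else, including >1M samples/sec, goes to the cluster version.
    return "VictoriaMetrics cluster"


if __name__ == "__main__":
    print(suggest_backend(50_000, clusters=1, prometheus_instances=1))
    print(suggest_backend(500_000, clusters=2, prometheus_instances=4))
    print(suggest_backend(200_000, clusters=8, prometheus_instances=12))
```

Treat the output as a starting point for discussion rather than a verdict; real decisions also depend on retention, budget, and existing tooling.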