
Key Prometheus Queries for Developers

Overview

This guide provides essential PromQL queries for developers working with the QR Code Generator monitoring system. These queries are battle-tested and used in our production Grafana dashboards to monitor system health, performance, and user experience.

Quick Reference

Most Critical Queries

# QR Redirect P95 Latency (Business Critical)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)

# Overall System Success Rate
(1 - (sum(rate(app_http_requests_total{status=~"5.."}[1h])) / sum(rate(app_http_requests_total[1h])))) * 100 or vector(100)

# QR Redirect Success Rate
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) / sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[1h])) * 100 or vector(0)

QR Redirect Performance

QR Redirect Latency Analysis

# P95 Latency for QR Redirects (milliseconds)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)

What it monitors: 95th percentile response time for QR redirect requests
Key patterns:

  • endpoint=~"/r/.*" matches all QR redirect endpoints
  • [1h] time window provides stable percentile calculations
  • * 1000 converts seconds to milliseconds
  • or vector(0) prevents NaN when no data exists

Expected output: ~4.75ms (excellent baseline performance)
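
For comparison against the median or the tail, the same histogram can be read at other quantiles; a sketch that only changes the quantile argument:

# P50 (median) latency for QR redirects (milliseconds)
histogram_quantile(0.50, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)

# P99 latency for QR redirects (milliseconds)
histogram_quantile(0.99, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)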

QR Redirect Success Rate

# Percentage of successful QR redirects
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) / sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[1h])) * 100 or vector(0)

What it monitors: Percentage of QR redirects that return HTTP 302 (successful redirect)
Key patterns:

  • status="302" identifies successful redirects
  • Division calculates success rate as percentage
  • Numerator: successful redirects, Denominator: all redirect attempts

Expected output: Varies by traffic mix, typically 20-30%
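
When the success rate looks lower than expected, breaking redirect traffic down by status code shows what the remaining responses are; a sketch using the same labels:

# QR redirect responses per second, broken down by status code
sum by (status) (rate(app_http_requests_total{endpoint=~"/r/.*"}[1h]))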

QR Redirect Volume

# QR redirects per minute
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[5m])) * 60 or vector(0)

What it monitors: Rate of successful QR redirects per minute
Key patterns:

  • [5m] shorter window for real-time monitoring
  • * 60 converts per-second rate to per-minute
  • Focuses only on successful redirects (status="302")
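
For capacity planning it can also be useful to track all redirect attempts, successful or not; a sketch that simply drops the status filter:

# All QR redirect attempts per minute, regardless of status
sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[5m])) * 60 or vector(0)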

System Health Monitoring

Overall System Health

# System-wide success rate (percentage)
(1 - (sum(rate(app_http_requests_total{status=~"5.."}[1h])) / sum(rate(app_http_requests_total[1h])))) * 100 or vector(100)

What it monitors: Overall system health by measuring absence of server errors
Key patterns:

  • status=~"5.." matches all 5xx server errors
  • 1 - inverts error rate to get success rate
  • or vector(100) returns 100% when no requests (healthy default)

Expected output: 100% (no server errors)

Service Availability

# Count of healthy services
sum(up{job=~"prometheus|qr-app|traefik"})

# Individual service status
up{job="qr-app"}

What it monitors: Which services are responding to health checks
Key patterns:

  • up metric indicates service reachability (1=up, 0=down)
  • job=~"prometheus|qr-app|traefik" matches our core services
  • Sum gives total count of healthy services

Expected output: 3 (all services healthy)
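
To list only the services that are currently down (for example as an alert expression), the same up metric can be filtered with a comparison; a sketch:

# Services that are currently down (returns no series when all are healthy)
up{job=~"prometheus|qr-app|traefik"} == 0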

API Performance Monitoring

API Response Time Percentiles

# P95 latency for all endpoints (milliseconds)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket[1h])) by (le)) * 1000 or vector(0)

# P95 latency by specific endpoint (seconds)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket[1h])) by (le, endpoint)) or vector(0)

What it monitors: Response time distribution across API endpoints
Key patterns:

  • histogram_quantile(0.95, ...) calculates 95th percentile
  • by (le) groups by histogram bucket boundaries
  • by (le, endpoint) adds endpoint grouping for per-endpoint analysis
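
The same percentile pattern can be scoped to just the API surface by adding the endpoint matcher used in the volume queries below; a sketch:

# P95 latency for API v1 endpoints only (milliseconds)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/api/v1/.*"}[1h])) by (le)) * 1000 or vector(0)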

API Request Volume

# Total API requests per minute
sum(rate(app_http_requests_total{endpoint=~"/api/v1/.*"}[5m])) * 60 or vector(0)

# QR image generation requests per minute
sum(rate(app_http_requests_total{endpoint=~"/api/v1/qr/.*/image"}[5m])) * 60 or vector(0)

What it monitors: API usage patterns and load distribution
Key patterns:

  • endpoint=~"/api/v1/.*" matches all API v1 endpoints
  • endpoint=~"/api/v1/qr/.*/image" matches QR image generation endpoints
  • rate() returns requests per second; * 60 converts to requests per minute
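
To see how that load is distributed across individual endpoints, the same rate can be grouped per endpoint; a sketch:

# API requests per minute, broken down by endpoint
sum by (endpoint) (rate(app_http_requests_total{endpoint=~"/api/v1/.*"}[5m])) * 60 or vector(0)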

Error Monitoring

Error Rate by Endpoint

# 4xx client error percentage by endpoint
sum(rate(app_http_requests_total{status=~"4.."}[5m])) by (endpoint) / sum(rate(app_http_requests_total[5m])) by (endpoint) * 100 or vector(0)

# 5xx server error percentage by endpoint
sum(rate(app_http_requests_total{status=~"5.."}[5m])) by (endpoint) / sum(rate(app_http_requests_total[5m])) by (endpoint) * 100 or vector(0)

What it monitors: Error patterns across different endpoints
Key patterns:

  • status=~"4.." matches 400-499 client errors
  • status=~"5.." matches 500-599 server errors
  • by (endpoint) groups results by endpoint for detailed analysis
  • Dividing the error rate by the endpoint's total request rate and multiplying by 100 yields the error percentage
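
Percentages can look alarming on low-traffic endpoints, so it helps to pair them with absolute error counts; a sketch:

# Absolute 5xx errors per minute, by endpoint
sum(rate(app_http_requests_total{status=~"5.."}[5m])) by (endpoint) * 60 or vector(0)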

QR-Specific Error Monitoring

# QR not found errors per minute
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="404"}[5m])) * 60 or vector(0)

# Failed QR redirects (non-302 responses)
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status!="302"}[5m])) * 60 or vector(0)

What it monitors: QR-specific error conditions
Key patterns:

  • status="404" identifies QR codes that don't exist
  • status!="302" identifies any non-successful redirect response
  • Helps distinguish missing QR codes from system errors, as in the split shown below
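
To make that distinction explicit, the non-302 traffic can be split into missing codes (404, above) and server-side failures (5xx); a sketch:

# QR redirect server errors per minute (system problems, not missing codes)
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status=~"5.."}[5m])) * 60 or vector(0)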

Advanced Query Patterns

Top K Analysis

# Top 10 most active QR endpoints
topk(10, sum by (endpoint) (increase(app_http_requests_total{endpoint=~"/r/.*",status="302"}[24h]))) or vector(0)

What it monitors: Most frequently scanned QR codes
Key patterns:

  • topk(10, ...) returns top 10 results
  • increase(...[24h]) shows total count over 24 hours
  • sum by (endpoint) groups by specific QR endpoint
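
The same topk pattern can surface the most frequently missed QR codes, which is useful for spotting deleted or mistyped short links; a sketch:

# Top 10 QR endpoints returning 404 over the last 24 hours
topk(10, sum by (endpoint) (increase(app_http_requests_total{endpoint=~"/r/.*",status="404"}[24h]))) or vector(0)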

Time-based Analysis

# Daily QR creation count
sum(increase(app_http_requests_total{endpoint="/qr-create",status="200"}[24h])) or vector(0)

# Hourly QR scan pattern
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) * 3600 or vector(0)

What it monitors: Usage patterns over different time periods
Key patterns:

  • increase(...[24h]) shows total count over 24 hours
  • rate(...[1h]) * 3600 shows hourly rate
  • Useful for understanding usage patterns and capacity planning
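
For week-over-week comparisons, PromQL's offset modifier shifts the same query back in time; a sketch pairing with the hourly scan query above:

# Hourly QR scan rate one week ago, for week-over-week comparison
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h] offset 1w)) * 3600 or vector(0)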

Query Best Practices

Time Window Selection

  • Real-time monitoring: [5m] for dashboards and alerts
  • Stable percentiles: [1h] for histogram calculations
  • Daily summaries: [24h] with increase() function
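
As a concrete illustration, here is the same counter read over each recommended window; which one you pick depends on whether you need responsiveness or stability:

# Real-time: requests per second over the last 5 minutes
sum(rate(app_http_requests_total[5m]))

# Stable trend: requests per second averaged over the last hour
sum(rate(app_http_requests_total[1h]))

# Daily summary: total requests over the last 24 hours
sum(increase(app_http_requests_total[24h]))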

Fallback Patterns

  • Counts/Rates: or vector(0) returns 0 when no data
  • Success rates: or vector(100) returns 100% when no errors
  • Always include fallbacks to prevent NaN values in dashboards
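
Because or binds more loosely than the arithmetic operators, the fallback can simply be appended to the end of an expression and it covers the whole result; a sketch reusing the redirect volume query:

# The fallback applies to the entire expression, not just the last term,
# because or has lower precedence than * and /
sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[5m])) * 60 or vector(0)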

Label Matching

  • Exact match: endpoint="/health"
  • Regex patterns: endpoint=~"/r/.*" for QR redirects
  • Negative match: status!="302" for non-redirects
  • Multiple values: status=~"[45].." for 4xx and 5xx errors
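
Matchers combine freely inside a single selector; an illustrative sketch (not taken from the production dashboards) mixing a regex match with a negative regex match:

# API requests that did not succeed (anything outside 2xx/3xx)
sum(rate(app_http_requests_total{endpoint=~"/api/v1/.*", status!~"[23].."}[5m])) * 60 or vector(0)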

Performance Considerations

  • Use appropriate time windows for your use case
  • Group by relevant labels only (by (endpoint), by (status))
  • Prefer rate() over increase() for real-time monitoring
  • Use sum() to aggregate across instances
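
For example, instances can be aggregated away while every other label is preserved by summing without the instance label; a sketch (whether by or without fits better depends on which labels you need):

# Sum away only the instance label; other labels such as endpoint, method, and status are kept
sum without (instance) (rate(app_http_requests_total[5m]))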

Metric Sources Reference

Application Metrics

  • Source: FastAPI application (qr-app job)
  • Key metrics: app_http_requests_total, app_http_request_duration_seconds_bucket
  • Labels: endpoint, method, status, instance, job

Infrastructure Metrics

  • Source: Prometheus, Traefik
  • Key metrics: up, traefik_service_requests_total
  • Jobs: prometheus, qr-app, traefik
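
To inspect raw series and their labels directly (for example in the Prometheus expression browser), plain selectors against these metrics work; a minimal sketch:

# Raw application series with all labels, filtered to the FastAPI job
app_http_requests_total{job="qr-app"}

# Scrape status for a single infrastructure target
up{job="traefik"}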

Common Troubleshooting

Query Returns No Data

  1. Check if the metric exists: app_http_requests_total
  2. Verify label values: {endpoint="/health"}
  3. Adjust time window: [5m] vs [1h]
  4. Add fallback: or vector(0)
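
Those steps translate into a simple debugging progression in the Prometheus expression browser; a sketch:

# Step 1: does the metric exist at all? (instant query, no filters)
app_http_requests_total

# Step 2: do any series carry the label values you expect?
app_http_requests_total{endpoint="/health"}

# Step 3: widen the window if rate() over a short range returns nothing, and add a fallback
sum(rate(app_http_requests_total{endpoint="/health"}[1h])) or vector(0)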

NaN Values in Dashboard

  • Always include fallback values (or vector(0))
  • Check for division by zero in rate calculations
  • Ensure histogram buckets exist for percentile calculations

Performance Issues

  • Reduce time window for high-cardinality queries
  • Limit grouping labels (by (endpoint) instead of by (endpoint, method, status))
  • Use recording rules for complex, frequently-used queries
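
A heavy expression such as the P95 latency calculation, evaluated on every dashboard refresh, is a natural recording-rule candidate; the sketch below shows the expression such a rule would precompute (the rule itself lives in a Prometheus rule file, and the suggested rule name is illustrative only):

# Candidate expression for a recording rule
# (e.g. recorded as job:app_http_request_duration_seconds:p95_ms, name is illustrative)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket[1h])) by (le)) * 1000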
