Key Prometheus Queries for Developers - gsinghjay/mvp_qr_gen GitHub Wiki
Key Prometheus Queries for Developers
Overview
This guide provides essential PromQL queries for developers working with the QR Code Generator monitoring system. These queries are battle-tested and used in our production Grafana dashboards to monitor system health, performance, and user experience.
Quick Reference
Most Critical Queries
# QR Redirect P95 Latency (Business Critical)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)
# Overall System Success Rate
(1 - (sum(rate(app_http_requests_total{status=~"5.."}[1h])) / sum(rate(app_http_requests_total[1h])))) * 100 or vector(100)
# QR Redirect Success Rate
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) / sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[1h])) * 100 or vector(0)
QR Redirect Performance
QR Redirect Latency Analysis
# P95 Latency for QR Redirects (milliseconds)
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket{endpoint=~"/r/.*"}[1h])) by (le)) * 1000 or vector(0)
What it monitors: 95th percentile response time for QR redirect requests
Key patterns:
- endpoint=~"/r/.*": matches all QR redirect endpoints
- [1h]: time window provides stable percentile calculations
- * 1000: converts seconds to milliseconds
- or vector(0): returns 0 when the query yields no series (it does not replace a NaN sample)
Expected output: ~4.75ms (excellent baseline performance)
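To make the percentile calculation concrete, here is a minimal Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified illustration, not Prometheus' actual implementation: it assumes positive bucket bounds and a final +Inf bucket, and the bucket values in `sample` are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes positive upper bounds and a final +Inf
    bucket that holds the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        return buckets[-2][0]  # cap at the highest finite bound
    lower, below = buckets[index - 1] if index > 0 else (0.0, 0.0)
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative bucket rates for /r/.* requests (bounds in seconds).
sample = [(0.005, 90.0), (0.01, 98.0), (0.025, 100.0), (math.inf, 100.0)]
p95_ms = histogram_quantile(0.95, sample) * 1000  # seconds to milliseconds
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the bucket boundaries are spaced around the value of interest.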
QR Redirect Success Rate
# Percentage of successful QR redirects
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) / sum(rate(app_http_requests_total{endpoint=~"/r/.*"}[1h])) * 100 or vector(0)
What it monitors: Percentage of QR redirects that return HTTP 302 (successful redirect)
Key patterns:
- status="302": identifies successful redirects
- Division calculates success rate as percentage
- Numerator: successful redirects, Denominator: all redirect attempts
Expected output: Varies by traffic mix, typically 20-30%
QR Redirect Volume
# QR redirects per minute
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[5m])) * 60 or vector(0)
What it monitors: Rate of successful QR redirects per minute
Key patterns:
- [5m]: shorter window for real-time monitoring
- * 60: converts per-second rate to per-minute
- Focuses only on successful redirects (status="302")
System Health Monitoring
Overall System Health
# System-wide success rate (percentage)
(1 - (sum(rate(app_http_requests_total{status=~"5.."}[1h])) / sum(rate(app_http_requests_total[1h])))) * 100 or vector(100)
What it monitors: Overall system health by measuring absence of server errors
Key patterns:
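- status=~"5..": matches all 5xx server errors
- 1 -: inverts error rate to get success rate
- or vector(100): returns 100% when the left-hand side yields no series, for example before any request or any 5xx response has been recorded (healthy default)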
Expected output: 100% (no server errors)
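The three outcomes of this query can be sketched in Python. This is an illustrative model only: `None` stands for "the selector returned no series", and the function name is hypothetical rather than part of any Prometheus client.

```python
import math


def system_success_rate(error_rate, total_rate):
    """Mirror `(1 - errors / total) * 100 or vector(100)` for one evaluation.

    None stands for "no series returned"; floats are per-second rates.
    """
    if error_rate is None or total_rate is None:
        # The left-hand side is empty, so `or vector(100)` supplies the value.
        return 100.0
    if total_rate == 0:
        # Series exist but saw no traffic: 0/0 is NaN, which `or` does not replace.
        return math.nan
    return (1 - error_rate / total_rate) * 100
```

The middle case is the one to watch for on idle systems: the series exist, so the fallback never fires, and the panel shows NaN instead of 100.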
Service Availability
# Count of healthy services
sum(up{job=~"prometheus|qr-app|traefik"})
# Individual service status
up{job="qr-app"}
What it monitors: Which services Prometheus can scrape successfully
Key patterns:
- up: metric indicates service reachability (1=up, 0=down)
- job=~"prometheus|qr-app|traefik": matches our core services
- Sum gives total count of healthy services
Expected output: 3 (all services healthy)
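When the count drops below 3, the per-job result shows which service is affected. The sketch below parses an instant-query response in the shape returned by the Prometheus HTTP API and lists jobs that are down or missing. The sample payload, its instance names, and the helper name are made up for illustration.

```python
import json

# Hypothetical response for `up{job=~"prometheus|qr-app|traefik"}`.
# Instance values are invented; traefik is deliberately absent.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "job": "prometheus", "instance": "prometheus:9090"},
       "value": [1700000000, "1"]},
      {"metric": {"__name__": "up", "job": "qr-app", "instance": "api:8000"},
       "value": [1700000000, "0"]}
    ]
  }
}
""")

EXPECTED_JOBS = {"prometheus", "qr-app", "traefik"}


def unhealthy_jobs(response, expected_jobs):
    """Return jobs that report 0 or are missing from the result entirely."""
    up_jobs = {
        sample["metric"]["job"]
        for sample in response["data"]["result"]
        if float(sample["value"][1]) == 1.0  # sample values arrive as strings
    }
    return sorted(set(expected_jobs) - up_jobs)


down = unhealthy_jobs(SAMPLE_RESPONSE, EXPECTED_JOBS)
```

A job that is missing from the result usually means Prometheus has no scrape target configured for it, which is a different problem from a target that reports 0.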
API Performance Monitoring
API Response Time Percentiles
# P95 latency for all endpoints
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket[1h])) by (le)) * 1000 or vector(0)
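# P95 latency per endpoint (milliseconds); no or vector(0) here, because the unlabeled fallback would always show up as an extra series next to the per-endpoint results
histogram_quantile(0.95, sum(rate(app_http_request_duration_seconds_bucket[1h])) by (le, endpoint)) * 1000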
What it monitors: Response time distribution across API endpoints
Key patterns:
- histogram_quantile(0.95, ...): calculates 95th percentile
- by (le): groups by histogram bucket boundaries
- by (le, endpoint): adds endpoint grouping for per-endpoint analysis
API Request Volume
# Total API requests per minute
sum(rate(app_http_requests_total{endpoint=~"/api/v1/.*"}[5m])) * 60 or vector(0)
# QR image generation requests per minute
sum(rate(app_http_requests_total{endpoint=~"/api/v1/qr/.*/image"}[5m])) * 60 or vector(0)
What it monitors: API usage patterns and load distribution
Key patterns:
- endpoint=~"/api/v1/.*": matches all API v1 endpoints
- endpoint=~"/api/v1/qr/.*/image": matches QR image generation endpoints
- Rate calculation shows requests per second, multiplied by 60 for per-minute
Error Monitoring
Error Rate by Endpoint
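# 4xx client error percentage by endpoint
sum(rate(app_http_requests_total{status=~"4.."}[5m])) by (endpoint) / sum(rate(app_http_requests_total[5m])) by (endpoint) * 100
# 5xx server error percentage by endpoint
sum(rate(app_http_requests_total{status=~"5.."}[5m])) by (endpoint) / sum(rate(app_http_requests_total[5m])) by (endpoint) * 100
What it monitors: Error patterns across different endpoints
Key patterns:
- status=~"4..": matches 400-499 client errors
- status=~"5..": matches 500-599 server errors
- by (endpoint): groups results by endpoint for detailed analysis
- Dividing by each endpoint's total request rate and multiplying by 100 converts the error rate to a percentage
- No or vector(0) fallback: on a grouped query the unlabeled fallback series would always be added next to the per-endpoint results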
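An error percentage per endpoint is that endpoint's error rate divided by its total request rate, times 100. The Python sketch below spells out the arithmetic with hypothetical per-second rates; the endpoint label values are invented for illustration.

```python
def error_percent_by_endpoint(error_rates, total_rates):
    """Divide each endpoint's error rate by its total rate and scale to percent.

    Endpoints without an error entry are left out of the result, much like
    one-to-one vector matching drops unmatched series.
    """
    return {
        endpoint: error_rates[endpoint] / total_rates[endpoint] * 100
        for endpoint in error_rates
        if total_rates.get(endpoint, 0) > 0
    }


# Hypothetical per-second rates keyed by the endpoint label.
errors_4xx = {"/r/{short_id}": 0.5, "/api/v1/qr": 0.25}
totals = {"/r/{short_id}": 2.0, "/api/v1/qr": 5.0, "/health": 1.0}

percentages = error_percent_by_endpoint(errors_4xx, totals)
```

Without the division, a busy endpoint with a low error ratio can look worse than a quiet endpoint where every request fails.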
QR-Specific Error Monitoring
# QR not found errors per minute
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="404"}[5m])) * 60 or vector(0)
# Failed QR redirects (non-302 responses)
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status!="302"}[5m])) * 60 or vector(0)
What it monitors: QR-specific error conditions
Key patterns:
- status="404": identifies QR codes that don't exist
- status!="302": identifies any non-successful redirect response
- Helps distinguish between missing QR codes vs. system errors
Advanced Query Patterns
Top K Analysis
# Top 10 most active QR endpoints (no or vector(0) here: the unlabeled fallback would always be added to the per-endpoint results)
topk(10, sum by (endpoint) (increase(app_http_requests_total{endpoint=~"/r/.*",status="302"}[24h])))
What it monitors: Most frequently scanned QR codes
Key patterns:
- topk(10, ...): returns top 10 results
- increase(...[24h]): shows total count over 24 hours
- sum by (endpoint): groups by specific QR endpoint
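The same selection can be reproduced client-side, for example when post-processing exported counts. A minimal Python sketch, with hypothetical endpoint names and scan counts:

```python
import heapq


def top_k(counts, k):
    """Return the k (endpoint, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


# Hypothetical 24h scan counts per endpoint label value.
scans = {"/r/abc": 120.0, "/r/def": 45.0, "/r/ghi": 300.0, "/r/jkl": 7.0}

top_two = top_k(scans, 2)
```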
Time-based Analysis
# Daily QR creation count
sum(increase(app_http_requests_total{endpoint="/qr-create",status="200"}[24h])) or vector(0)
# Hourly QR scan pattern
sum(rate(app_http_requests_total{endpoint=~"/r/.*",status="302"}[1h])) * 3600 or vector(0)
What it monitors: Usage patterns over different time periods
Key patterns:
- increase(...[24h]): shows total count over 24 hours
- rate(...[1h]) * 3600: shows hourly rate
- Useful for understanding usage patterns and capacity planning
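The relationship between increase() and rate() can be sketched in Python on raw counter samples. This is a deliberately simplified model: it handles counter resets by treating any drop as a restart from zero, but it ignores the extrapolation to window boundaries that Prometheus applies. The sample values are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples; any drop counts as a reset."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts at zero, so the new value is the delta.
        increase += curr - prev if curr >= prev else curr
    return increase


def per_second_rate(samples):
    """Average per-second rate between the first and last sample."""
    span = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / span


# Hypothetical counter samples over one hour: (unix offset in seconds, value).
hour = [(0, 100.0), (1800, 160.0), (3600, 220.0)]
hourly_total = per_second_rate(hour) * 3600  # same quantity as the increase over the hour
```

In this model, multiplying the per-second rate by the window length gives back the increase, which is why `rate(...[1h]) * 3600` reads as an hourly count.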
Query Best Practices
Time Window Selection
- Real-time monitoring: [5m] for dashboards and alerts
- Stable percentiles: [1h] for histogram calculations
- Daily summaries: [24h] with the increase() function
Fallback Patterns
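- Counts/Rates: or vector(0) returns 0 when the query yields no series
- Success rates: or vector(100) returns 100% when the query yields no series (for example, no matching error series has been recorded)
- Include a fallback on queries that aggregate to a single unlabeled series so panels do not show "No data"; a fallback fills in only for an empty result and does not replace a NaN sample
- Leave the fallback off queries grouped with by (...), where the unlabeled fallback series is always added next to the labeled results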
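The behaviour of the fallback follows from how the or operator matches label sets: it keeps every series on the left and adds series from the right whose label set does not appear on the left. A small Python model of that rule, using frozensets of label pairs as keys (the helper and constant names are hypothetical):

```python
import math


def promql_or(left, right):
    """Model PromQL `or`: all of left, plus right entries whose label set is not in left."""
    result = dict(left)
    for labels, value in right.items():
        if labels not in left:
            result[labels] = value
    return result


NO_LABELS = frozenset()
FALLBACK = {NO_LABELS: 0.0}  # vector(0): a single sample without labels

# sum(...) without by (...) yields one unlabeled series, so the fallback is shadowed.
ungrouped = promql_or({NO_LABELS: 42.0}, FALLBACK)

# An empty result lets the fallback through.
empty = promql_or({}, FALLBACK)

# A grouped result never matches the unlabeled fallback, so both appear.
grouped = promql_or({frozenset({("endpoint", "/health")}): 1.5}, FALLBACK)

# A NaN sample still occupies its label set, so it is not replaced.
nan_kept = promql_or({NO_LABELS: math.nan}, FALLBACK)
```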
Label Matching
- Exact match: endpoint="/health"
- Regex patterns: endpoint=~"/r/.*" for QR redirects
- Negative match: status!="302" for non-redirects
- Multiple values: status=~"[45].." for 4xx and 5xx errors
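Prometheus regex matchers are fully anchored, so a pattern has to match the entire label value. The patterns above can be checked quickly in Python with `re.fullmatch`. Prometheus uses RE2 syntax, which differs from Python's `re` in some advanced features, but these simple patterns behave the same; the label values below are examples only.

```python
import re


def label_matches(pattern, value):
    """Emulate an anchored =~ matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, value) is not None


checks = {
    ("/r/.*", "/r/abc123"): label_matches("/r/.*", "/r/abc123"),
    ("/r/.*", "/api/v1/r/abc"): label_matches("/r/.*", "/api/v1/r/abc"),
    ("[45]..", "404"): label_matches("[45]..", "404"),
    ("[45]..", "302"): label_matches("[45]..", "302"),
    ("5..", "5000"): label_matches("5..", "5000"),
}
```

Anchoring is why `endpoint=~"/r/.*"` needs the trailing `.*`: a bare `/r/` would match only a label value that is exactly `/r/`.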
Performance Considerations
- Use appropriate time windows for your use case
- Group by relevant labels only (by (endpoint), by (status))
- Prefer rate() over increase() for real-time monitoring
- Use sum() to aggregate across instances
Metric Sources Reference
Application Metrics
- Source: FastAPI application (qr-app job)
- Key metrics: app_http_requests_total, app_http_request_duration_seconds_bucket
- Labels: endpoint, method, status, instance, job
Infrastructure Metrics
- Source: Prometheus, Traefik
- Key metrics: up, traefik_service_requests_total
- Jobs: prometheus, qr-app, traefik
Common Troubleshooting
Query Returns No Data
- Check if the metric exists: app_http_requests_total
- Verify label values: {endpoint="/health"}
- Adjust time window: [5m] vs [1h]
- Add fallback: or vector(0)
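These checks can also be run outside Grafana against the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below only builds the request URL with proper escaping; the base URL `http://localhost:9090` is an assumption about where Prometheus is reachable, and the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql):
    """Build an instant-query URL, URL-encoding the PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Assumed Prometheus address; adjust to your deployment.
selector = 'app_http_requests_total{endpoint="/health"}'
url = instant_query_url("http://localhost:9090", selector)

# Round-trip check: decoding the query string returns the original selector.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Fetching the URL (for example with `urllib.request.urlopen`) and finding an empty `data.result` list means the selector matched nothing, which points at a wrong metric name or label value rather than at the time window.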
NaN Values in Dashboard
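- Add fallback values (or vector(0)) so empty results show a number; note that or does not replace a NaN sample
- Check for division by zero in rate calculations (a 0/0 ratio evaluates to NaN)
- Ensure histogram buckets exist and received observations in the window; histogram_quantile() returns NaN otherwise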
Performance Issues
- Reduce time window for high-cardinality queries
- Limit grouping labels (by (endpoint) instead of by (endpoint, method, status))
- Use recording rules for complex, frequently-used queries
Related Documentation:
- Dashboard Suite - Complete dashboard overview
- System Architecture - System design and components
- Observatory Overview - Monitoring infrastructure details