PromQL Rates and Derivatives - the-code-camp/hpe-sre GitHub Wiki

Counter metrics only ever increase, since they maintain a running count of the tracked events - so the absolute value of a counter is rarely interesting on its own.

For example, if you just graph demo_api_request_duration_seconds_count, you will simply see an ever-growing line.

But this information is not that useful. Instead, you typically want to know how fast a counter increases. PromQL provides several functions for this.

Rate function

The most common function for calculating the rate of change is rate(). It calculates the per-second increase of a counter as averaged over a specified window of time. To tell rate() the window of time to average over, we have to add a range selector after the series selector, like [5m] for a five-minute window.

For example, this tells us the per-second increase of all the demo_api_request_duration_seconds_count series as averaged over a five-minute window:

rate(demo_api_request_duration_seconds_count[5m])

The resulting graph is a lot more useful, since it shows the request rate rather than an ever-growing total.

rate() and related functions have two behaviors that are important to understand:

- Counter metrics can reset to 0 when the scraped process restarts. rate() handles this automatically by treating any decrease in the counter value as a reset and adjusting subsequent samples accordingly.
- Rates are extrapolated from the first and last samples under the window to the actual window boundaries. This means you may get non-integer rates even for counters that only ever increase by integer amounts.
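Prometheus performs this reset compensation internally, but the idea can be sketched in a few lines of Python (a toy illustration, not Prometheus's actual implementation; `adjust_for_resets` is an invented name):

```python
def adjust_for_resets(samples):
    """Compensate for counter resets: any decrease is assumed to be a
    restart at 0, so the previous value is added back onto all later samples."""
    adjusted = []
    correction = 0
    prev = None
    for v in samples:
        if prev is not None and v < prev:
            # Counter dropped: assume the process restarted from 0.
            correction += prev
        adjusted.append(v + correction)
        prev = v
    return adjusted

# The drop from 30 to 5 is treated as a reset, so 5 becomes 35, 15 becomes 45:
print(adjust_for_resets([10, 20, 30, 5, 15]))  # [10, 20, 30, 35, 45]
```

Once the samples are monotonic again, the per-second rate over the window can be computed from them as usual.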

Instead of calculating the per-second rate, you can also query the total increase over a given time window using the increase() function:

increase(demo_api_request_duration_seconds_count{job="demo"}[1h])
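increase() is simply rate() multiplied by the number of seconds in the window, so the two differ only by a constant factor. A toy calculation with invented numbers:

```python
# Hypothetical values for illustration (not real query results):
window_seconds = 3600       # the [1h] range selector
per_second_rate = 0.5       # what rate(...[1h]) might return

# increase(...[1h]) would then report the total growth over the hour:
total_increase = per_second_rate * window_seconds
print(total_increase)  # 1800.0
```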

Rate only works with counter metrics

The rate(), irate(), and increase() functions only work for counter metrics, since they treat any decrease in value as a counter reset and can only output non-negative results.

For gauge metrics that track a value that can go up or down (like temperature, used memory, or disk space), use the delta() and deriv() functions instead.

The deriv() function calculates the per-second derivative of a set of series over a specified time window, using simple linear regression. For example, to calculate how quickly disk usage is going up or down per second when looking at a 15-minute window, query:

deriv(demo_disk_usage_bytes{job="demo"}[15m])
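Conceptually, deriv() fits a least-squares line through the samples in the window and returns its slope. A toy Python sketch (invented samples and function name, not Prometheus source code):

```python
def least_squares_slope(points):
    """Slope of the best-fit line through (timestamp, value) points, per second."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return num / den

# Invented samples: disk usage growing by 100 bytes every second.
samples = [(0, 1000), (15, 2500), (30, 4000), (45, 5500)]
print(least_squares_slope(samples))  # 100.0
```

A positive slope means the gauge is rising over the window; a negative slope means it is falling.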

The result is a per-second slope: positive while disk usage is growing, negative while it is shrinking.

The predict_linear() function extends this idea and lets you predict what the value of a gauge will be at a given point in the future. For example, the following query predicts what the disk usage will be in one hour (3600 seconds), based on its linear trend over the last 15 minutes:

predict_linear(demo_disk_usage_bytes{job="demo"}[15m], 3600)

This can be useful for building alerts that warn you hours in advance that a disk is about to fill up.
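predict_linear() fits the same kind of least-squares line over the window and extrapolates it forward. A self-contained toy sketch (invented samples and names, not the actual Prometheus implementation - Prometheus extrapolates relative to the query evaluation time, while this sketch uses the last sample's timestamp):

```python
def predict_linear(points, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) points and
    extrapolate it seconds_ahead past the last sample."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    intercept = mean_v - slope * mean_t
    # Simplification: predict relative to the last sample's timestamp.
    last_t = points[-1][0]
    return intercept + slope * (last_t + seconds_ahead)

# Disk filling at 100 bytes/s; where will it be in one hour?
samples = [(0, 1000), (15, 2500), (30, 4000), (45, 5500)]
print(predict_linear(samples, 3600))  # 365500.0
```

An alert could then compare the predicted value against the disk's capacity.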

Aggregating Over Dimensions

Prometheus' time series data can be highly dimensional. That is helpful for drilling down into details, but often we want to aggregate over dimensions to get a less detailed view. To achieve this, Prometheus offers a number of aggregation operators (sum, avg, min, max, and more). Note that these operators do not aggregate over time, but across multiple series at each point in time.

For example, if we wanted to know the total number of requests that our demo service is handling per second, we could first calculate the individual per-series rates, and then sum up those rates:

# A single aggregated series over all dimensions.
sum(
  # Many series, representing the individual rates for each dimension.
  rate(demo_api_request_duration_seconds_count{job="demo"}[5m])
)

This sums up all individual rates into a single total rate series for the service. Try running it in the Prometheus expression browser.

However, typically we want to preserve some dimensions. For example, we might want to calculate total rates broken out by instance and path, but not care about individual method or status values. To do that, we can add a without() modifier to the sum() aggregator that lists the labels we want to aggregate away (exclude-list approach):

sum without(method, status) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))

This is equivalent to using the by() modifier with the opposite set of labels (the ones you want to keep) as an include-list approach:

sum by(instance, path, job) (rate(demo_api_request_duration_seconds_count{job="demo"}[5m]))
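The grouping semantics of without() and by() can be sketched in Python: drop (for without) or keep (for by) the listed labels, then sum all series that end up with an identical label set. The labels, values, and the `sum_without` helper below are invented for illustration:

```python
from collections import defaultdict

def sum_without(series, drop):
    """Sum series values, grouping by all labels NOT in `drop`."""
    grouped = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        grouped[key] += value
    return dict(grouped)

series = [
    ({"instance": "a", "path": "/api", "method": "GET", "status": "200"}, 3.0),
    ({"instance": "a", "path": "/api", "method": "POST", "status": "200"}, 2.0),
    ({"instance": "b", "path": "/api", "method": "GET", "status": "500"}, 1.0),
]

# Aggregating away method and status leaves one series per instance/path:
print(sum_without(series, {"method", "status"}))
```

A by()-style helper would be the mirror image: keep only the listed labels instead of dropping them.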

The resulting sum is now grouped by instance, path, and job. Note that with by() you have to list every label you want to keep, including job; if your setup uses a different job label value, adjust the selector accordingly (the expression browser's autocompletion will suggest valid label values).

You now know how to aggregate over selected dimensions.