Monitoring Stack

My monitoring stack consists of Grafana with InfluxDB as the backing time-series datastore. They are hosted as a single Docker Compose stack.
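For reference, a minimal docker-compose.yml for a stack like this might look like the sketch below. This is not the actual file from the repository; the image tags, ports, and volume names are assumptions, chosen so that the service names match what the rest of this page expects (in particular, the influxdb service name is what the Grafana data source URL resolves against).

# Minimal sketch of a Grafana + InfluxDB Compose stack (assumed values)
services:
  influxdb:
    image: influxdb:2.0
    ports:
      - "8086:8086"
    volumes:
      - influxdb-data:/var/lib/influxdb2
      - ./influxdb/config.yml:/etc/influxdb2/config.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - influxdb

volumes:
  influxdb-data:
  grafana-data: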

Installation

After Docker is installed on the host, copy the monitoring directory over, including the docker-compose.yml, .env, and influxdb/config.yml files. Then run docker compose up -d from the directory containing the docker-compose.yml file.
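As a concrete sketch, the whole installation amounts to something like the following (the source path and SSH destination are placeholders):

# Copy the monitoring directory (docker-compose.yml, .env, influxdb/config.yml) to the host
scp -r monitoring/ user@docker-host:~/monitoring

# On the host, start the stack in the background and confirm both containers are up
cd ~/monitoring
docker compose up -d
docker compose ps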

If necessary, the InfluxDB basic config can be recreated by running docker run --rm influxdb:2.0 influxd print-config > config.yml in the directory where you would like to create it. See the InfluxDB Docker page for more information.

Configuration

The Grafana UI can be reached at http://<docker-ip>:3000 and InfluxDB can be reached at http://<docker-ip>:8086.

First, go to the InfluxDB UI and complete the setup steps, specifying the organization and bucket names. Once complete, change the bucket's retention policy to something like 30 days. From there, create an API token for Grafana with read permission on the bucket(s) to monitor.
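If you prefer the CLI over the UI, a read-only token can also be created with the influx client inside the container. This is a hedged sketch using the InfluxDB 2.x CLI; the organization name is a placeholder, and --read-buckets grants read access to all buckets in the organization:

# Create a read-only token for Grafana (pass an operator/admin token with --token
# if the CLI inside the container is not already authenticated)
docker compose exec influxdb influx auth create \
  --org my-org \
  --read-buckets \
  --description "grafana read-only"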

Now, open the Grafana UI and log in with the default username of admin and password of admin. Change the password when prompted, then go to Connections > Data Sources, select Add New Data Source, and select InfluxDB. The data source can be set up with one of two query languages: the legacy InfluxQL or the newer Flux. A data source for each can be created using the same API token, if desired.

Flux

Select Flux as the query language and http://influxdb:8086 as the URL, assuming the config here was used to set up InfluxDB and both containers are running in the same Compose stack (so the influxdb service name resolves). Set the token to the one generated in InfluxDB and the organization/default bucket to the values created during InfluxDB setup. Save and test the configuration.
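If you would rather provision this data source from a file than click through the UI, Grafana can load data sources from YAML placed under /etc/grafana/provisioning/datasources/ in the container. A sketch with placeholder organization, bucket, and token values:

apiVersion: 1

datasources:
  # InfluxDB data source queried with Flux
  - name: InfluxDB (Flux)
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    jsonData:
      version: Flux
      organization: my-org
      defaultBucket: my-bucket
    secureJsonData:
      token: my-api-token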

InfluxQL

Select InfluxQL as the query language and http://influxdb:8086 as the URL, again assuming the config here was used to set up InfluxDB and both containers are running in the same Compose stack. Add a custom HTTP header called Authorization with the value Token <API token>, where <API token> is replaced by the token generated in InfluxDB. Set the database to the bucket that was created and the HTTP method to GET. Save and test the connection.
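The InfluxQL variant can be provisioned from a file in the same way; the Authorization header is carried in jsonData/secureJsonData (bucket name and token are placeholders):

apiVersion: 1

datasources:
  # InfluxDB data source queried with InfluxQL
  - name: InfluxDB (InfluxQL)
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    jsonData:
      dbName: my-bucket
      httpMode: GET
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: Token my-api-token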

Storage Space

Metrics with high sample rates can consume a large amount of disk space quickly. If the server running InfluxDB is running out of storage, the directories under /var/lib/docker/volumes/monitoring_influxdb-data/_data/engine/data (where monitoring_influxdb-data is replaced by the volume name used) can be inspected to determine which bucket(s) are using the space. Each directory name here corresponds to a bucket ID shown in the InfluxDB UI. The ncdu tool can also be helpful for finding where the space is going (sudo ncdu /).
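For example, something along these lines will show which bucket directory is largest (adjust the volume name to match yours):

# Per-bucket disk usage, sorted smallest to largest
sudo du -sh /var/lib/docker/volumes/monitoring_influxdb-data/_data/engine/data/* | sort -h

# Or explore interactively
sudo ncdu /var/lib/docker/volumes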

If a bucket (or buckets) turns out to be using most of the storage space, the retention policy for that bucket should be changed to keep data for less time. This can be done from the bucket options in the UI. Retention policies are enforced every 30 minutes, so it may take that long for the effects of the change to be seen.
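The same change can also be made with the influx CLI inside the container if you prefer; the bucket ID and token below are placeholders:

# Find the bucket ID, then shorten its retention period to 30 days
docker compose exec influxdb influx bucket list --token <operator-token>
docker compose exec influxdb influx bucket update --id <bucket-id> --retention 30d --token <operator-token>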

Even so, it may still be useful to retain some of that data for a longer time period for historical context or long-term metrics. In that case, the data can be downsampled into a different bucket automatically using an InfluxDB task. See the documentation here for more details.

In short, this can be accomplished by creating a new bucket with a longer-term retention policy. Then, in the Data Explorer, craft a query that returns the metric(s) you would like to downsample, complete with the new downsample resolution (aggregate window). For example, in order to downsample the zpool_stats measurement (and all of its contained fields) to once an hour, the query might look like:

from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)

The filter can also be more specific, such as:

from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> filter(fn: (r) => r["_field"] == "alloc")
|> filter(fn: (r) => r["state"] == "ONLINE")
|> filter(fn: (r) => r["host"] == "truenas")
|> filter(fn: (r) => r["name"] == "tank")
|> filter(fn: (r) => r["vdev"] == "root")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)

Once the data is being returned in a way you are happy with, change the query window to the timeframe you would like to backfill (a one-time downsampling copy) and then add a to() line to the end of the query telling InfluxDB to copy the data to the destination bucket. For example:

from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
|> to(bucket: "truenas-telegraf-long-term", org: "Brewer")

Once this is done, use a different Data Explorer/Query Builder instance to ensure that the data looks as expected in the new bucket.
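For example, a quick query against the destination bucket from the earlier example should now return the downsampled points:

from(bucket: "truenas-telegraf-long-term")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")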

Finally, create a Task in InfluxDB to perform this downsampling automatically. This can be done by selecting Save As from the Query Builder where the downsampling query was constructed. Select Task in the resulting dialog box, give it a name, give it a period (the Every field) greater than the downsampling window, and select the output bucket. Click Save As Task and then, from the Tasks page, click on the gear icon next to the Task you just created and select Edit.

In the editor, remove any options relating to the start/end time and change the range directive to |> range(start: -task.every) (instead of |> range(start: v.timeRangeStart, stop: v.timeRangeStop)). Any duplicate to directives should also be removed (saving as a Task adds one). At this point, it should look something like:

option task = {name: "Downsample zpool_stats", every: 4h}

from(bucket: "truenas-telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r["_measurement"] == "zpool_stats")
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> to(bucket: "truenas-telegraf-long-term", org: "Brewer")

Save the Task and, again under the gear icon on the Tasks page, select Run. Take a look at the logs to make sure that it ran successfully. Assuming it did, a Task is now set up to downsample the source data into a separate bucket with a longer retention policy. This new bucket (and its measurements) can now be used in Grafana dashboards and elsewhere.

Grafana alerts

Contact points

Contact points in Grafana define where alerts will be sent. The two options I use are email and Pushover.

Email

To configure email for Grafana, add the following under the environment section for Grafana in the Docker Compose file:

# Configuration for email
- GF_SMTP_ENABLED=true
- GF_SMTP_HOST=smtp.gmail.com:465
- GF_SMTP_USER=your.email@gmail.com
- GF_SMTP_PASSWORD=${GMAIL_PASSWORD}
- GF_SMTP_FROM_ADDRESS=your.email@gmail.com
- GF_SMTP_FROM_NAME=Grafana

With your.email@gmail.com replaced by the actual Gmail address that will be sending the email notifications. See my docker compose file for an example. You will also need to create an app password for the Gmail account and put it in the .env file. This will look like GMAIL_PASSWORD="password", with password replaced by the actual app password.

After that is set up, restart Grafana and go to Alerting > Contact points in the web interface. Create a new contact point using the Email integration and specify the addresses you would like alerts sent to. Click Test and, if all went well, you should receive a test email alert.

Pushover

Pushover is very simple to configure in Grafana. Create a new Application for Grafana in the Pushover UI and copy the Application/API token. Then, in Grafana, create a new Contact Point under Alerting > Contact points and set the integration type to Pushover. Paste in the API token copied from Pushover along with your Pushover User key. Send a test notification and, once you have confirmed that it works, save the Contact Point.

Global Configuration

There is really only one global configuration option that I had to set before using alerts in Grafana: the GF_SERVER_ROOT_URL environment variable. This tells Grafana what its root URL is so that it can be used in notification templates. Specifically, Grafana's default notification template includes a link to the firing alert and a link to create a silence for it. By default, these links point to a localhost address which, while not wrong, makes them inconvenient to use. Setting the base URL that you actually use makes the links clickable and much more useful.

The GF_SERVER_ROOT_URL variable can be set from the environment section of the Docker Compose file, like the other environment variables. See my configuration for an example. It is important to note that the URL should start with the protocol (such as https://) and end without a trailing slash since Grafana just appends the path to this string to generate URLs.
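For example, under the Grafana environment section (the domain is a placeholder for whatever URL you actually use to reach Grafana):

# Root URL used to build links in alert notifications
- GF_SERVER_ROOT_URL=https://grafana.example.com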

Alert Backups

I have exported the alert rules I'm using to a file in this repository. Note that, if these need to be restored, file-based provisioning must be used as Grafana doesn't support importing alerts from YAML/JSON through the UI. Alternatively, the YAML can be manually inspected to reconstruct the alerts.
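As a sketch, file-based provisioning amounts to mounting the exported rules into Grafana's provisioning directory before it starts; the local file name and path here are hypothetical:

# Under the grafana service in docker-compose.yml
volumes:
  - ./alerting/alert-rules.yaml:/etc/grafana/provisioning/alerting/alert-rules.yaml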
