# Monitoring Stack
My monitoring stack consists of Grafana with InfluxDB as the backing time-series datastore. They are hosted as a single Docker Compose stack.
After Docker is installed on the host, copy the monitoring directory over, including the `docker-compose.yml`, `.env`, and `influxdb/config.yml` files. Then run `docker compose up -d` from the directory containing the `docker-compose.yml` file.
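For reference, a minimal `docker-compose.yml` for this kind of stack might look roughly like the sketch below. This is not the actual compose file from this repository (which also includes the environment settings described later); the service names, ports, and volume names are illustrative.

```yaml
services:
  influxdb:
    image: influxdb:2.0
    ports:
      - "8086:8086"
    volumes:
      # InfluxDB 2.x stores its data under /var/lib/influxdb2 in the container
      - influxdb-data:/var/lib/influxdb2
      # Mount the config generated with `influxd print-config`
      - ./influxdb/config.yml:/etc/influxdb2/config.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  influxdb-data:
  grafana-data:
```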
If necessary, the InfluxDB basic config can be recreated by running `docker run --rm influxdb:2.0 influxd print-config > config.yml` in the directory you would like to create it in. See the InfluxDB Docker page for more information.
The Grafana UI can be reached at `http://<docker-ip>:3000` and InfluxDB can be reached at `http://<docker-ip>:8086`.
First, go to the InfluxDB UI and complete the setup steps, specifying the organization and bucket names. Once complete, change the bucket retention policy to something like 30 days. From there, create an API key for Grafana with read permissions on the buckets to monitor.
Now, open the Grafana UI and log in with the default username of `admin` and password of `admin`. Change the password when prompted and then go to `Connections > Data Sources`, select `Add New Data Source`, and then select `InfluxDB`. The data source can be set up with one of two query languages: the legacy InfluxQL or the newer Flux. A data source for each can be set up using the same API key, if desired.
To use Flux, select `Flux` as the query language and `http://influxdb:8086` as the URL, assuming the config here was used to set up InfluxDB and they are running on the same server. Set the token to the one generated in InfluxDB and the organization/default bucket to the ones created during InfluxDB setup. Save and test the configuration.
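If you would rather not click through the UI, Grafana can also provision this data source from a YAML file placed under `/etc/grafana/provisioning/datasources/`. A rough sketch, with placeholder names for the organization, bucket, and token:

```yaml
apiVersion: 1

datasources:
  - name: InfluxDB (Flux)
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    jsonData:
      version: Flux
      organization: my-org        # organization created during InfluxDB setup
      defaultBucket: my-bucket    # default bucket created during InfluxDB setup
    secureJsonData:
      token: my-api-token         # API token generated in InfluxDB
```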
To use InfluxQL, select `InfluxQL` as the query language and `http://influxdb:8086` as the URL, assuming the config here was used to set up InfluxDB and they are running on the same server. Add a custom HTTP header called `Authorization` with the value `Token <API token>`, where `<API token>` is replaced by the token generated in InfluxDB. Set the database to the bucket that was created and the HTTP method to `GET`. Save and test the connection.
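The InfluxQL variant can likewise be provisioned from a file; the main difference is that the token is passed as an `Authorization` header and the bucket is given as the database name. Again, a rough sketch with placeholder values:

```yaml
apiVersion: 1

datasources:
  - name: InfluxDB (InfluxQL)
    type: influxdb
    access: proxy
    url: http://influxdb:8086
    jsonData:
      dbName: my-bucket           # bucket created during InfluxDB setup
      httpMode: GET
      httpHeaderName1: Authorization
    secureJsonData:
      httpHeaderValue1: Token my-api-token   # API token generated in InfluxDB
```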
Metrics with high sample rates may start to consume a large amount of storage quickly. If the server running InfluxDB is running out of storage, the directories under `/var/lib/docker/volumes/monitoring_influxdb-data/_data/engine/data` (where `monitoring_influxdb-data` is replaced by the volume name used) can be inspected in order to determine which bucket(s) are using all of the space. Each of the directory names here corresponds to a bucket ID seen in the InfluxDB UI. The `ncdu` tool can also be helpful for determining where space is going (`sudo ncdu /`).
If it is determined that one or more buckets are using most of the storage space, the retention policy for each of those buckets should be changed to keep data for less time. This can be done from the bucket options in the UI. The retention policies are applied every 30 minutes, so it may take that long for the effects of the change to be seen.
That being said, it may still be useful to retain some of that data for a longer time period for historical context or long-term metrics. In that case, the data can be downsampled into a different bucket automatically using an InfluxDB task. See the documentation here for more details.
In short, this can be accomplished by creating a new bucket with a longer-term retention policy. Then, in the `Data Explorer`, craft a query that returns the metric(s) you would like to downsample, complete with the new downsample resolution (aggregate window). For example, in order to downsample the `zpool_stats` measurement (and all of its contained fields) to once an hour, the query might look like:
from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
The filter could also be more complicated, such as:
from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> filter(fn: (r) => r["_field"] == "alloc")
|> filter(fn: (r) => r["state"] == "ONLINE")
|> filter(fn: (r) => r["host"] == "truenas")
|> filter(fn: (r) => r["name"] == "tank")
|> filter(fn: (r) => r["vdev"] == "root")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
Once the data is being returned in a way you are happy with, change the query window to the timeframe you'd like to backfill (perform a one-time downsampling copy) and then add a `to` line to the end of the query telling InfluxDB to copy the data to the destination bucket. For example:
from(bucket: "truenas-telegraf")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "zpool_stats")
|> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
|> to(bucket: "truenas-telegraf-long-term", org: "Brewer")
Once this is done, use a different Data Explorer/Query Builder instance to ensure that the data looks as expected in the new bucket.
Finally, create a Task in InfluxDB to perform this downsampling automatically. This can be done by selecting `Save As` from the Query Builder where the downsampling query was constructed. Select `Task` in the resulting dialog box, give it a name, give it a period (the `Every` field) greater than the downsampling window, and select the output bucket. Click `Save As Task` and then, from the `Tasks` page, click on the gear icon next to the Task you just created and select `Edit`.
In the editor, remove any options relating to the start/end time and change the `range` directive to `|> range(start: -task.every)` (instead of `|> range(start: v.timeRangeStart, stop: v.timeRangeStop)`). Any duplicate `to` directives should also be removed (saving as a Task adds one automatically). At this point, it should look something like:
```
option task = {name: "Downsample zpool_stats", every: 4h}

from(bucket: "truenas-telegraf")
  |> range(start: -task.every)
  |> filter(fn: (r) => r["_measurement"] == "zpool_stats")
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
  |> to(bucket: "truenas-telegraf-long-term", org: "Brewer")
```
Save the Task and, again under the gear icon on the `Tasks` page, select `Run`. Take a look at the logs to make sure that it ran successfully. Assuming it did, a Task is now set up to downsample the source data into a separate bucket with a longer retention policy. This new bucket (and its measurements) can now be used in Grafana dashboards and elsewhere.
Contact points in Grafana define where alerts will be sent. The two options I use are email and Pushover.
To configure email for Grafana, add the following under the `environment` section for Grafana in the Docker Compose file:
```yaml
# Configuration for email
- GF_SMTP_ENABLED=true
- GF_SMTP_HOST=smtp.gmail.com:465
- GF_SMTP_USER=user@gmail.com
- GF_SMTP_PASSWORD=${GMAIL_PASSWORD}
- GF_SMTP_FROM_ADDRESS=user@gmail.com
- GF_SMTP_FROM_NAME=Grafana
```
With `user@gmail.com` replaced by the actual Gmail address that will be sending the email notifications. See my docker compose file for an example. You will also need to create an app password for the Gmail account and put it in the `.env` file. This will look like `GMAIL_PASSWORD="password"` with `password` being replaced by the actual app password.
After that is set up, restart Grafana and go to `Alerting > Contact points` in the web interface. Create a new contact point using the Email integration and specify the `Addresses` you would like the alert sent to. Click `Test` and, if all went well, you should receive a test email alert.
Pushover is very simple to configure in Grafana. Create a new `Application` for Grafana in the Pushover UI and copy the Application/API token. Then, in Grafana, create a new Contact Point under `Alerting > Contact points` and set the integration type to `Pushover`. Paste in the `API Token` that you copied from Pushover and then copy the `User key` from Pushover and paste it in as well. Send a test notification and, once you have confirmed that it works, save the Contact Point.
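Contact points (including this Pushover one) can also be provisioned from a YAML file under `/etc/grafana/provisioning/alerting/` instead of being created in the UI. A rough sketch with placeholder values (the `uid`, token, and key below are illustrative):

```yaml
apiVersion: 1

contactPoints:
  - orgId: 1
    name: Pushover
    receivers:
      - uid: pushover-grafana            # arbitrary unique ID
        type: pushover
        settings:
          apiToken: my-application-token # Application/API token from Pushover
          userKey: my-user-key           # User key from Pushover
```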
There is really only one global configuration option that I had to set before using alerts in Grafana: the `GF_SERVER_ROOT_URL` environment variable. This tells Grafana what its root URL is so that it can be used in notification templates. Specifically, Grafana's default notification template will include a link to the firing alert and a link to create a silence for it. By default, these links point to a `localhost` address which, while not wrong, doesn't make them easy to use. Instead, setting the base URL that you actually use will make the links clickable and much more useful.
The `GF_SERVER_ROOT_URL` variable can be set from the `environment` section of the Docker Compose file, like the other environment variables. See my configuration for an example. It is important to note that the URL should start with the protocol (such as `https://`) and end without a trailing slash, since Grafana just appends the path to this string to generate URLs.
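For example (with a placeholder URL), the entry in Grafana's `environment` section might look like:

```yaml
environment:
  # Placeholder - use the address you actually browse Grafana at, with no trailing slash
  - GF_SERVER_ROOT_URL=https://grafana.example.com
```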
I have exported the alert rules I'm using to a file in this repository. Note that, if these need to be restored, file-based provisioning must be used, as Grafana doesn't support importing alerts from YAML/JSON through the UI. Alternatively, the YAML can be manually inspected to reconstruct the alerts.
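If restoring them this way, the general idea is to mount the exported rules into Grafana's alerting provisioning directory so they are loaded at startup. A sketch, assuming a hypothetical local `provisioning/alerting` directory containing the exported YAML:

```yaml
services:
  grafana:
    volumes:
      # Grafana reads alerting provisioning files from this directory at startup
      - ./provisioning/alerting:/etc/grafana/provisioning/alerting
```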