health configuration examples - hilbix/netdata GitHub Wiki

Check the health.d directory for all alarms shipped with netdata.

Here are a few examples:

Example 1

A simple check if an apache server is alive:

template: apache_last_collected_secs
      on: apache.requests
    calc: $now - $last_collected_t
   every: 10s
    warn: $this > ( 5 * $update_every)
    crit: $this > (10 * $update_every)

The above checks that netdata is able to collect data from apache. In detail:

template: apache_last_collected_secs

The above defines a template named apache_last_collected_secs. The name is important, since the variable $apache_last_collected_secs resolves to the result of the calc line. So, try to give it a descriptive name.

      on: apache.requests

The above applies the template to all charts that have context = apache.requests (i.e. all your apache servers).

    calc: $now - $last_collected_t

$now is a standard variable that resolves to the current timestamp. $last_collected_t is the last data collection timestamp of the chart. So this calculation gives the number of seconds passed since the last data collection.

   every: 10s

The alarm will be evaluated every 10 seconds.

    warn: $this > ( 5 * $update_every)
    crit: $this > (10 * $update_every)

If either of these expressions evaluates to non-zero or true, it triggers the alarm at the corresponding level (warning or critical).

$this refers to the value of this alarm (i.e. the result of the calc line). We could also use $apache_last_collected_secs.

$update_every is the update frequency of the chart, in seconds.

So, the warning condition checks if we have not collected data from apache for 5 iterations and the critical condition checks for 10 iterations.
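To make the thresholds concrete, the sketch below re-implements the calc, warn and crit lines in Python. It only illustrates the arithmetic and is not netdata code: the function name staleness_status and the sample timestamps are made up, and CLEAR, WARNING and CRITICAL are used simply as labels for the three outcomes.

```python
def staleness_status(now, last_collected_t, update_every):
    """Mimic the calc/warn/crit lines of Example 1 (illustrative only)."""
    this = now - last_collected_t        # calc: $now - $last_collected_t
    if this > 10 * update_every:         # crit: $this > (10 * $update_every)
        return this, "CRITICAL"
    if this > 5 * update_every:          # warn: $this > ( 5 * $update_every)
        return this, "WARNING"
    return this, "CLEAR"


# A chart that updates every second, last collected 2, 7 and 20 seconds ago.
print(staleness_status(now=1000, last_collected_t=998, update_every=1))  # (2, 'CLEAR')
print(staleness_status(now=1000, last_collected_t=993, update_every=1))  # (7, 'WARNING')
print(staleness_status(now=1000, last_collected_t=980, update_every=1))  # (20, 'CRITICAL')
```

Because the thresholds are multiples of $update_every, the same template behaves sensibly for charts collected at different frequencies.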

Example 2

Check if any of the disks is critically low on disk space:

template: disk_full_percent
      on: disk.space
    calc: $used * 100 / ($avail + $used)
   every: 1m
    warn: $this > 80
    crit: $this > 95

$used and $avail are the used and avail chart dimensions as shown on the dashboard.

So, the calc line finds the percentage of used space. $this resolves to this percentage.
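As a worked example of the percentage calculation, here is a small Python sketch. The function names and the sample sizes are hypothetical, and the code only mirrors the arithmetic of the template; it is not how netdata evaluates expressions internally.

```python
def disk_full_percent(used, avail):
    """calc: $used * 100 / ($avail + $used)"""
    return used * 100 / (avail + used)


def disk_full_status(used, avail):
    """Apply the warn (> 80) and crit (> 95) thresholds of Example 2."""
    this = disk_full_percent(used, avail)
    if this > 95:
        return "CRITICAL"
    if this > 80:
        return "WARNING"
    return "CLEAR"


print(disk_full_percent(used=90, avail=10))   # 90.0
print(disk_full_status(used=90, avail=10))    # WARNING
print(disk_full_status(used=480, avail=20))   # CRITICAL (96% used)
print(disk_full_status(used=40, avail=60))    # CLEAR
```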

Example 3

Predict if any disk will run out of space in the near future.

We do this in 2 steps:

1. Calculate the disk fill rate

  template: disk_fill_rate
        on: disk.space
    lookup: max -1s at -30m unaligned of avail
      calc: ($this - $avail) / (30 * 60)
     every: 15s

In the calc line, $this is the result of the lookup line (i.e. the free space 30 minutes ago) and $avail is the current free disk space. So the calc line yields a positive number of GB/second if the disk is filling up, or a negative number of GB/second if the disk is freeing up space.

There are no warn or crit lines here. So, this template will just do the calculation and nothing more.

2. Predict the hours after which the disk will run out of space

 template: disk_full_after_hours
       on: disk.space
     calc: $avail / $disk_fill_rate / 3600
    every: 10s
     warn: $this > 0 and $this < 48
     crit: $this > 0 and $this < 24

The calc line estimates the number of hours until we run out of disk space. Only positive values are interesting for this check, so the warning and critical conditions require a positive value and trigger when the disk is predicted to fill up in less than 48 and 24 hours respectively.
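The prediction can be sketched in Python as below. The function names and sample values are hypothetical. One assumption is labeled in the code: a fill rate of exactly zero is treated as "never fills up", which is a choice made for this sketch and not a statement about how netdata handles division by zero.

```python
import math


def disk_full_after_hours(avail, fill_rate):
    """calc: $avail / $disk_fill_rate / 3600 (avail in GB, fill_rate in GB/second)."""
    if fill_rate == 0:
        return math.inf  # assumption of this sketch: a static disk never fills up
    return avail / fill_rate / 3600


def hours_status(hours):
    """warn: $this > 0 and $this < 48, crit: $this > 0 and $this < 24"""
    if 0 < hours < 24:
        return "CRITICAL"
    if 0 < hours < 48:
        return "WARNING"
    return "CLEAR"


for avail, rate in [(91.0, 0.005), (500.0, 0.005), (91.0, -0.005), (91.0, 0.0)]:
    hours = disk_full_after_hours(avail, rate)
    print(avail, rate, round(hours, 2), hours_status(hours))
```

With 91 GB free and a fill rate of 0.005 GB/second the disk is full in roughly 5 hours (critical). With 500 GB free it takes roughly 28 hours (warning). A negative or zero rate never raises the alarm.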

Once this alarm triggers, we will receive an email notification.

Example 4

Check if any network interface is dropping packets:

template: 30min_packet_drops
      on: net.drops
  lookup: sum -30m unaligned absolute
   every: 10s
    crit: $this > 0

The lookup line will calculate the sum of all dropped packets in the last 30 minutes.

The crit line will issue a critical alarm if even a single packet has been dropped.
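The lookup and crit lines can be illustrated with the following Python sketch. The sample layout (timestamp, inbound, outbound) and the function name are hypothetical. The absolute option is assumed here to mean that every value is made positive before summing, so that inbound and outbound drops, which a chart may plot with opposite signs, do not cancel each other out.

```python
def drops_last_30m(samples, now, window_secs=30 * 60):
    """Sum the absolute values of all dimensions over the last 30 minutes.

    samples: iterable of (timestamp, inbound, outbound) tuples (hypothetical layout).
    """
    return sum(
        abs(inbound) + abs(outbound)
        for t, inbound, outbound in samples
        if now - window_secs < t <= now
    )


samples = [
    (100, 3, 0),     # older than 30 minutes, ignored
    (2500, 0, -2),   # inside the window, outbound plotted as negative
    (3000, 1, 0),    # inside the window
]
total = drops_last_30m(samples, now=3600)
print(total, "CRITICAL" if total > 0 else "CLEAR")  # 3 CRITICAL
```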

Note that the drops chart does not exist if a network interface has never dropped a single packet. When netdata detects a dropped packet, it will add the chart and it will automatically attach this alarm to it.