Nagios Hilary errors

Summary

There's a Nagios check that starts to complain once a configured number of errors has been generated in the Hilary cluster. An "error" is any log().error invocation.

Every time an error is logged, the logger:error.count key of the oae-telemetry:counts:data hash will be incremented. Nagios will check that value periodically and complain if it goes over a certain threshold.
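The threshold comparison described above can be sketched as a small shell function. This is a minimal illustration, not the deployed check: the function name check_error_count and the threshold values are hypothetical, and reading the counter with redis-cli assumes the oae-telemetry:counts:data hash lives in Redis. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical sketch of the threshold logic; not the deployed Nagios check.
# Usage: check_error_count <count> <warning-threshold> <critical-threshold>
# Returns the standard Nagios plugin exit codes:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
check_error_count() {
    count="$1"
    warn="$2"
    crit="$3"

    # An empty or non-numeric value (for example a missing key) is UNKNOWN.
    case "$count" in
        ''|*[!0-9]*)
            echo "UNKNOWN - error count is not a number: '$count'"
            return 3
            ;;
    esac

    # The check complains only when the count goes over a threshold.
    if [ "$count" -gt "$crit" ]; then
        echo "CRITICAL - $count errors logged (threshold $crit)"
        return 2
    elif [ "$count" -gt "$warn" ]; then
        echo "WARNING - $count errors logged (threshold $warn)"
        return 1
    fi

    echo "OK - $count errors logged"
    return 0
}

# Example invocation (assumes Redis; thresholds 10 and 50 are made up):
#   check_error_count "$(redis-cli HGET oae-telemetry:counts:data logger:error.count)" 10 50
```

Treating a missing or malformed value as UNKNOWN rather than OK keeps a broken counter from silently masking errors.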

Actions to take on warning/error

When the check goes into a warning or error state, check the logs on the syslog machines like so (increase the number of lines to find more errors):

tail-hilary -n 400 | filter-bunyan -l error

Once you've resolved the issue, reset the count to 0. There's a script on cache0 at /root/reset-error-count that does this for you.
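If the script is unavailable, a manual reset might look like the sketch below. It assumes the oae-telemetry:counts:data hash is stored in Redis and that redis-cli can reach that instance. The contents of /root/reset-error-count are not shown on this page, so this is not necessarily what that script does. REDIS_CLI and reset_error_count are hypothetical names introduced for illustration.

```shell
# Hypothetical sketch; the real /root/reset-error-count script may differ.
# Assumption: the telemetry hash lives in Redis and redis-cli is on the PATH.
# REDIS_CLI can be overridden, e.g. to add connection options or for a dry run.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

reset_error_count() {
    # Set the logger:error.count field of the telemetry hash back to 0.
    "$REDIS_CLI" HSET oae-telemetry:counts:data logger:error.count 0
}
```

Setting REDIS_CLI=echo before calling reset_error_count prints the command instead of running it, which is a cheap way to check what would be sent.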