Service monitoring - PanDAWMS/panda-harvester GitHub Wiki
Harvester can push out service metrics. To enable this, add the following section to your harvester configuration file (requires psutil >= 5.4.8 and harvester code from after 12 Feb 2019):
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
- disk_volumes is optional and takes a comma-separated list of volumes
- pidfile is mandatory only when using uWSGI
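To illustrate how these options are interpreted, here is a minimal sketch that reads the `[service_monitor]` section with Python's standard `configparser`. This is not harvester's own configuration loader; the function name `read_service_monitor` and the returned dictionary layout are hypothetical and only meant to show the optional, comma-separated `disk_volumes` value.

```python
import configparser

# Same section as in the example above
EXAMPLE_CFG = """
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
"""


def read_service_monitor(cfg_text):
    """Hypothetical helper: parse the [service_monitor] section from a string."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["service_monitor"]

    # disk_volumes is optional: a missing or empty value gives an empty list
    raw_volumes = section.get("disk_volumes", fallback="")
    volumes = [v.strip() for v in raw_volumes.split(",") if v.strip()]

    return {
        "active": section.getboolean("active", fallback=False),
        "disk_volumes": volumes,
        # pidfile may be absent when uWSGI is not used
        "pidfile": section.get("pidfile", fallback=None),
    }


settings = read_service_monitor(EXAMPLE_CFG)
print(settings)
```

Splitting on commas and stripping whitespace means `data, data1` and `data,data1` are treated the same in this sketch; whether harvester itself tolerates the extra space is not stated here.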
The logs will be written to panda-service_monitor.log. A healthy snippet is:
2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG Running service monitor
2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0
2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG Disk usage of data: 69.0 %
...
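If you want to check these values by hand before the alerting agent is configured, the log lines can be parsed with a couple of regular expressions. The sketch below assumes the message layout is exactly as in the snippet above; the helper name `parse_service_monitor_lines` and the result structure are hypothetical.

```python
import re

# Patterns derived from the log snippet above (assumed stable message layout)
MEM_CPU_RE = re.compile(
    r"Memory usage: (?P<mem_mib>[\d.]+) MiB/(?P<mem_pct>[\d.]+)%, "
    r"CPU usage: (?P<cpu>[\d.]+)"
)
DISK_RE = re.compile(r"Disk usage of (?P<volume>[^:]+): (?P<pct>[\d.]+) %")

SAMPLE_LINES = [
    "2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG Running service monitor",
    "2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG Memory usage: "
    "178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0",
    "2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG Disk usage of data: 69.0 %",
]


def parse_service_monitor_lines(lines):
    """Hypothetical helper: collect the latest memory, CPU and disk readings."""
    readings = {"memory_mib": None, "memory_pct": None, "cpu": None, "disk": {}}
    for line in lines:
        mem_cpu = MEM_CPU_RE.search(line)
        if mem_cpu:
            readings["memory_mib"] = float(mem_cpu.group("mem_mib"))
            readings["memory_pct"] = float(mem_cpu.group("mem_pct"))
            readings["cpu"] = float(mem_cpu.group("cpu"))
            continue
        disk = DISK_RE.search(line)
        if disk:
            # One entry per configured volume in disk_volumes
            readings["disk"][disk.group("volume")] = float(disk.group("pct"))
    return readings


readings = parse_service_monitor_lines(SAMPLE_LINES)
print(readings)
```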
Once harvester is pushing out service metrics, you need to configure the thresholds and alerts on the alerting agent. Fill in the XML file below and send it to the service managers, who will add it to the configuration directory:
<?xml version="1.0"?>
<instances>
  <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
    <hostlist>
      <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
        <contacts>
          <email>WHO TO NOTIFY 1</email>
          <email>WHO TO NOTIFY 2</email>
        </contacts>
        <metrics>
          <metric name="lastsubmittedworker" enable="True">
            <value>30</value>
          </metric>
          <metric name="lastheartbeat" enable="True">
            <value>30</value>
          </metric>
          <metric name="memory" enable="True">
            <memory_warning>50</memory_warning>
            <memory_critical>80</memory_critical>
          </metric>
          <metric name="cpu" enable="True">
            <cpu_warning>50</cpu_warning>
            <cpu_critical>80</cpu_critical>
          </metric>
          <metric name="disk" enable="True">
            <disk_warning>75</disk_warning>
            <disk_critical>80</disk_critical>
          </metric>
        </metrics>
      </host>
      <!-- ... YOU CAN ADD MULTIPLE HOSTS -->
    </hostlist>
  </instance>
</instances>
- lastsubmittedworker and lastheartbeat: example values are 30 (minutes), 60d... You can disable a metric by setting enable="False" in cases where you don't expect regular worker submission.
- disk_warning/critical, cpu_warning/critical, memory_warning/critical: thresholds expressed in %, for example 50.
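The percentage thresholds can be sanity-checked before sending the file. The sketch below reads the warning/critical pairs from a `<metrics>` fragment with `xml.etree.ElementTree` and classifies a measured value. The alerting agent's actual comparison logic is not described here, so the "greater than or equal" rule and the helper names `load_thresholds` and `classify` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Fragment of the configuration above, reduced to the percentage metrics
EXAMPLE_METRICS = """
<metrics>
  <metric name="memory" enable="True">
    <memory_warning>50</memory_warning>
    <memory_critical>80</memory_critical>
  </metric>
  <metric name="disk" enable="True">
    <disk_warning>75</disk_warning>
    <disk_critical>80</disk_critical>
  </metric>
</metrics>
"""


def load_thresholds(xml_text):
    """Hypothetical helper: map metric name -> (warning, critical) in %."""
    root = ET.fromstring(xml_text.strip())
    thresholds = {}
    for metric in root.iter("metric"):
        if metric.get("enable") != "True":
            continue  # disabled metrics are skipped
        name = metric.get("name")
        warning = metric.findtext(name + "_warning")
        critical = metric.findtext(name + "_critical")
        if warning is None or critical is None:
            continue  # e.g. lastheartbeat, which uses <value> instead
        thresholds[name] = (float(warning), float(critical))
    return thresholds


def classify(value, warning, critical):
    """Assumed rule: a value at or above a threshold triggers that level."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


thresholds = load_thresholds(EXAMPLE_METRICS)
# 69.0 % is the disk usage from the healthy log snippet
print(classify(69.0, *thresholds["disk"]))
```

With the thresholds above, the disk reading of 69.0 % from the healthy log snippet stays below the 75 % warning level.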