Service monitoring - PanDAWMS/panda-harvester GitHub Wiki

Configuration of the Harvester agent to collect metrics

Harvester can push out its service metrics. To enable this, add the section below to your harvester configuration file. The feature requires psutil >= 5.4.8 and harvester code from after 12 Feb 2019.

[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
  • disk_volumes is optional and accepts a comma-separated list of volumes
  • pidfile is mandatory only when running under uwsgi
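To check the section before restarting the service, it can be read with Python's standard configparser. This is a minimal sketch, not Harvester's own parsing code: the helper name read_service_monitor and the inline sample text are assumptions made for illustration.

```python
import configparser

# Sample text mirroring the [service_monitor] section shown above.
SAMPLE = """
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
"""


def read_service_monitor(text):
    """Return the service_monitor settings as a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["service_monitor"]
    # disk_volumes is optional: a missing or empty value means no volumes.
    raw_volumes = section.get("disk_volumes", "")
    volumes = [v.strip() for v in raw_volumes.split(",") if v.strip()]
    return {
        "active": section.getboolean("active", fallback=False),
        "disk_volumes": volumes,
        # pidfile is only needed under uwsgi, so it may be absent.
        "pidfile": section.get("pidfile", None),
    }


settings = read_service_monitor(SAMPLE)
print(settings)
```

Splitting on commas and stripping whitespace means that both "data,data1" and "data, data1" yield the same list of volumes.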

The logs will be written to panda-service_monitor.log. A healthy snippet is:

2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG    Running service monitor
2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG    Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0
2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG    Disk usage of data: 69.0 %
...
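If you want to inspect these values programmatically, the lines can be parsed with regular expressions. The following is only a sketch that assumes the message layout of the snippet above; the function name parse_monitor_lines and the returned dictionary keys are hypothetical.

```python
import re

# Patterns derived from the log snippet above (layout is an assumption).
MEM_CPU = re.compile(
    r"Memory usage: (?P<mib>[\d.]+) MiB/(?P<pct>[\d.]+)%, "
    r"CPU usage: (?P<cpu>[\d.]+)"
)
DISK = re.compile(r"Disk usage of (?P<volume>[^:\s]+): (?P<pct>[\d.]+) %")


def parse_monitor_lines(lines):
    """Collect memory, CPU and per-volume disk usage from log lines."""
    metrics = {"disk": {}}
    for line in lines:
        mem = MEM_CPU.search(line)
        if mem:
            metrics["memory_mib"] = float(mem.group("mib"))
            metrics["memory_pct"] = float(mem.group("pct"))
            metrics["cpu_pct"] = float(mem.group("cpu"))
            continue
        disk = DISK.search(line)
        if disk:
            metrics["disk"][disk.group("volume")] = float(disk.group("pct"))
    return metrics


sample_lines = [
    "2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG    Running service monitor",
    "2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG    Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0",
    "2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG    Disk usage of data: 69.0 %",
]
parsed = parse_monitor_lines(sample_lines)
print(parsed)
```

Lines that match neither pattern, such as "Running service monitor", are simply skipped.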

Configuration of the central alerting agent

Once harvester is pushing out service metrics, you need to configure the thresholds and alerts on the alerting agent. Complete the XML file below and send it to the service managers, who will add it to the configuration directory:

<?xml version="1.0"?>
<instances>
    <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
        <hostlist>
            <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
                <contacts>
                    <email>WHO TO NOTIFY 1</email>
                    <email>WHO TO NOTIFY 2</email>
                </contacts>
                <metrics>
                    <metric name="lastsubmittedworker" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="lastheartbeat" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="cpu" enable="True">
                        <cpu_warning>50</cpu_warning>
                        <cpu_critical>80</cpu_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
            <!-- ... YOU CAN ADD MULTIPLE HOSTS -->
        </hostlist>
    </instance>
</instances>
  • lastsubmittedworker and lastheartbeat: example values are 30 (minutes), 60d... You can disable a metric if you do not expect regular worker submission.
  • disk_warning/critical, cpu_warning/critical, memory_warning/critical: thresholds expressed in %, for example 50
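Before sending the file to the service managers, it may help to confirm that it is well-formed and that each warning threshold sits below its critical threshold. The sketch below uses Python's xml.etree.ElementTree. The rule that warning must be lower than critical is an assumption, not something the alerting agent is documented to enforce, and find_threshold_problems is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened sample following the structure shown above; values are placeholders.
SAMPLE = """<?xml version="1.0"?>
<instances>
    <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
        <hostlist>
            <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
                <metrics>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
        </hostlist>
    </instance>
</instances>
"""


def find_threshold_problems(xml_text):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    for host in root.iter("host"):
        hostname = host.get("hostname", "?")
        for metric in host.iter("metric"):
            name = metric.get("name")
            # Only memory, cpu and disk carry warning/critical pairs.
            if name not in ("memory", "cpu", "disk"):
                continue
            if metric.get("enable") != "True":
                continue
            warning = metric.findtext(f"{name}_warning")
            critical = metric.findtext(f"{name}_critical")
            if warning is None or critical is None:
                problems.append(f"{hostname}: {name} is missing a threshold")
            elif float(warning) >= float(critical):
                # Assumed rule: warning should fire before critical.
                problems.append(f"{hostname}: {name} warning >= critical")
    return problems


problems = find_threshold_problems(SAMPLE)
print(problems)
```

An empty list means no problems were found under these assumed rules; disabled metrics are skipped.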