Icinga event_handler (event handlers) - HAAdmin/HAcluster GitHub Wiki

Introduction
When are event handlers triggered?
Event handler types
Enabling event handlers
Event handler execution order
Writing event handler commands
Permissions for event handler commands
Example: a service event handler

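1. Introduction

Event handlers are optional system commands (scripts or executables) that run whenever a host or service changes state. They are executed on the system where the check was scheduled (initiated). An obvious use for event handlers in Icinga is to fix problems proactively before anyone is notified. Some other uses include: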

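    Restarting a failed service
    Entering a trouble ticket into a helpdesk system
    Logging event information to a database
    Cycling power on a host*
    And so on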

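*Cycling power on a host with an automated script should not be implemented lightly. Consider the consequences carefully before implementing automatic reboots.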

2. When are event handlers triggered?

Event handlers are executed when a service or host:

    Is in a SOFT problem state
    Initially goes into a HARD problem state
    Initially recovers from a SOFT or HARD problem state

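In plain terms: an event handler is triggered whenever the service state changes. Which action to take for which state is up to the script you write.

3. Event handler types

There are several types of optional event handlers you can define to handle host and service state changes: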

Global host event handler

Global service event handler

Host-specific event handlers

Service-specific event handlers

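Global host and service event handlers run for every host or service state change, immediately before any host-specific or service-specific event handler. You specify the global event handler commands with the global_host_event_handler and global_service_event_handler options in the main configuration file (icinga.cfg).

Individual hosts and services can also have their own event handler command for handling state changes. You specify it with the event_handler directive in the host or service definition. Host-specific and service-specific event handlers are executed only after the (optional) global host or service event handler has run.

4. Enabling event handlers

Event handlers can be enabled or disabled program-wide with the enable_event_handlers option in the main configuration file (icinga.cfg). Host-specific and service-specific event handlers can be enabled or disabled with the event_handler_enabled directive in each host and service definition. If the global enable_event_handlers option is disabled, host-specific and service-specific event handlers will not be executed.

5. Event handler execution order

As mentioned earlier, global host and service event handlers are executed immediately before host-specific or service-specific event handlers. For HARD problem and recovery states, event handlers are executed after notifications have been sent out.

Note that there is a big pitfall here:

| Global event handler | Host- or service-specific event handler | Execution order |
| --- | --- | --- |
| Enabled | Not specified (the system defaults to not enabled) | The global event handler is executed |
| Enabled | Disabled | No event handler is executed |
| Enabled | Enabled | The global one first, then the specific one |
| Disabled | Enabled | The specific event handler is executed |
| Disabled | Disabled | No event handler is executed |

Pay special attention to the case where the global event handler is enabled: if the specific handler is not enabled, the system treats it as disabled!

A gripe: how can the official documentation be like this? It leaves a trap for everyone. I spent ages hunting for this "bug", thinking the problem was somewhere else. Sigh.

6. Writing event handler commands

Event handler commands are best written as shell or Perl scripts, but they can be any kind of executable that runs from the command line. At a minimum, the scripts should take the following macros as arguments:

    For services: $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$
    For hosts: $HOSTSTATE$, $HOSTSTATETYPE$, $HOSTATTEMPT$

The scripts should examine the values of the arguments passed to them and take any necessary action based on those values. The best way to understand how event handlers work is to look at an example (see section 8).

TIP: Additional sample event handler scripts can be found in the contrib/eventhandlers/ subdirectory of the Icinga distribution. Some of them demonstrate the use of external commands to implement redundant and distributed monitoring environments.

7. Permissions for event handler commands

Event handler commands normally execute with the same permissions as the user Icinga runs as. This is a problem if you want to write an event handler that restarts a system service, because such tasks generally require root privileges. Ideally, you should evaluate the types of event handlers you will be implementing and grant the Icinga user just enough permissions to execute the necessary system commands. You might want to try using sudo to accomplish this.

8. Example: a service event handler

Suppose you are monitoring the HTTP server on the local machine and have specified restart-httpd as the event handler command in the HTTP service definition. Also assume that max_check_attempts is set to 4 or greater (that is, the service is checked 4 times before it is considered to have a real problem). An abbreviated service definition looks like this: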

    define service{
        host_name               somehost
        service_description     HTTP
        max_check_attempts      4
        event_handler           restart-httpd
        ...
        }

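Once the event handler is specified in the service definition, it must also be defined as a command. An example definition of the restart-httpd command follows. Note the macros passed on the command line to the event handler script: these are important!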

    define command{
        command_name    restart-httpd
        command_line    /usr/local/icinga/libexec/eventhandlers/restart-httpd $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
        }
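
To make the link between the command line and the script concrete: the three macros arrive in the script as the positional parameters $1, $2 and $3, in the order they are listed. A minimal sketch, where show_args is a hypothetical stand-in for the real handler and the sample values are made up for illustration:

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for the event handler script.
# It only reports how the three macros arrive as positional parameters.
show_args() {
    echo "state=$1 type=$2 attempt=$3"
}

# Sample values for $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
show_args CRITICAL SOFT 3
```

Calling the real script by hand with sample values in the same order is a simple way to rehearse it before Icinga does.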

Now let's actually write the event handler script (/usr/local/icinga/libexec/eventhandlers/restart-httpd):

    #!/bin/sh
    #
    # Event handler script for restarting the web server on the local machine
    #
    # Note: This script will only restart the web server if the service is
    #       retried 3 times (in a "soft" state) or if the web service somehow
    #       manages to fall into a "hard" error state.
    #

    # What state is the HTTP service in?
    case "$1" in
    OK)
        # The service just came back up, so don't do anything...
        ;;
    WARNING)
        # We don't really care about warning states, since the service is probably still running...
        ;;
    UNKNOWN)
        # We don't know what might be causing an unknown error, so don't do anything...
        ;;
    CRITICAL)
        # Aha! The HTTP service appears to have a problem - perhaps we should restart the server...

        # Is this a "soft" or a "hard" state?
        case "$2" in

        # We're in a "soft" state, meaning that Icinga is in the middle of retrying the
        # check before it turns into a "hard" state and contacts get notified...
        SOFT)

            # What check attempt are we on?  We don't want to restart the web server on the first
            # check, because it may just be a fluke!
            case "$3" in

            # Wait until the check has been tried 3 times before restarting the web server.
            # If the check fails on the 4th time (after we restart the web server), the state
            # type will turn to "hard" and contacts will be notified of the problem.
            # Hopefully this will restart the web server successfully, so the 4th check will
            # result in a "soft" recovery.  If that happens no one gets notified because we
            # fixed the problem!
            3)
                echo -n "Restarting HTTP service (3rd soft critical state)..."
                # Call the init script to restart the HTTPD server
                /etc/rc.d/init.d/httpd restart
                ;;
            esac
            ;;

        # The HTTP service somehow managed to turn into a hard error without getting fixed.
        # It should have been restarted by the code above, but for some reason it didn't.
        # Let's give it one last try, shall we?
        # Note: Contacts have already been notified of a problem with the service at this
        # point (unless you disabled notifications for this service)
        HARD)
            echo -n "Restarting HTTP service..."
            # Call the init script to restart the HTTPD server
            /etc/rc.d/init.d/httpd restart
            ;;
        esac
        ;;
    esac
    exit 0
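
Because the handler runs with the Icinga user's permissions (section 7), the init script call in restart-httpd would usually fail without extra privileges. Below is a minimal sketch of the sudo approach that section suggests. It assumes a hypothetical account name icinga and a sudoers entry that you add yourself; restart_httpd and the SUDO override are illustrative names, not part of Icinga.

```shell
#!/bin/sh
# Assumed sudoers entry (hypothetical user name "icinga"), added with visudo:
#   icinga ALL=(root) NOPASSWD: /etc/rc.d/init.d/httpd restart

# SUDO can be overridden (for example SUDO=echo) to rehearse the call
# without touching the system.
SUDO="${SUDO:-sudo -n}"

restart_httpd() {
    # -n makes sudo fail instead of prompting for a password,
    # since no terminal is attached when Icinga runs the handler.
    # $SUDO is left unquoted on purpose so "sudo -n" splits into two words.
    $SUDO /etc/rc.d/init.d/httpd restart
}
```

With this in place, the two direct calls to the init script could be replaced by restart_httpd. Limiting the sudoers entry to the single restart command keeps the Icinga user's extra privileges as narrow as possible.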

The restart-httpd script demonstrates two different strategies for restarting the local web server:

After the service has been checked for the third time and is in a SOFT CRITICAL state
When the service first goes into a HARD CRITICAL state
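
The two rules above can be isolated into a small function, which makes the decision logic easy to rehearse without restarting anything. This is a sketch rather than a replacement for the script; should_restart is a hypothetical helper name.

```shell
#!/bin/sh
# should_restart STATE STATETYPE ATTEMPT
# Returns 0 (true) when the restart-httpd rules say a restart is due:
#   - CRITICAL, SOFT, third check attempt
#   - CRITICAL, HARD
should_restart() {
    state=$1 state_type=$2 attempt=$3
    [ "$state" = "CRITICAL" ] || return 1
    case "$state_type" in
        SOFT) [ "$attempt" = "3" ] ;;   # third soft retry
        HARD) return 0 ;;               # hard problem state
        *)    return 1 ;;
    esac
}

# Dry run: print the decision for a few sample inputs
for args in "CRITICAL SOFT 1" "CRITICAL SOFT 3" "CRITICAL HARD 4" "OK HARD 1"; do
    # $args is intentionally unquoted so it splits into three parameters
    if should_restart $args; then
        echo "$args -> restart"
    else
        echo "$args -> no action"
    fi
done
```

Keeping the decision separate from the restart command means the rules can be checked on any machine, including one without a web server.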

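In theory the script should restart the web server and fix the problem before the service goes into a HARD problem state, but it includes a fallback (the HARD case) in case the first restart does not work. Note that the event handler is executed only the first time the service goes into a HARD problem state. (You can confirm this in the icinga_servicestate table: its last_hard_state_change field records when the HARD state change happened, and in my tests the event handler ran right after that change.) This prevents Icinga from running the script over and over to restart the web server while the service stays in a HARD problem state. Trust me, you don't want that.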
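
Besides checking the database, another way to see exactly when the handler fires is to have it record every invocation. This is a minimal sketch and not part of the original script: log_invocation is a hypothetical helper, and the log path is an assumption, so pick a file the Icinga user can write to.

```shell
#!/bin/sh
# LOG is a hypothetical path; choose a file the Icinga user can write to.
LOG="${LOG:-/tmp/restart-httpd.log}"

log_invocation() {
    # One line per call: timestamp, state, state type, attempt number
    printf '%s state=%s type=%s attempt=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$3" >> "$LOG"
}

# Sample call with made-up values, then show the line just written
log_invocation CRITICAL HARD 4
tail -n 1 "$LOG"
```

In the real handler the call would be log_invocation "$1" "$2" "$3" near the top of the script, so that each SOFT attempt and the first HARD state leave a timestamped line to compare against last_hard_state_change.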