Ticket ID #334: Self-Healing NRPE Service Implementation

NRPE Self-Healing via Nagios Event Handler

Summary

The nagios-nrpe-server service on monitored hosts may crash unexpectedly. Nagios then shows the service as CRITICAL and generates false alerts, and restarting it by hand is slow and error-prone. To resolve this, we implemented an automated self-healing mechanism: a Nagios event handler restarts the NRPE service when a failure is detected.


Implementation Steps

Define Nagios Service in Puppet

In the nagios module's config.pp, a service definition was added to monitor the NRPE daemon process:

nagios_service { "check-nrpe-daemon":
  service_description   => "NRPE Daemon Process",
  hostgroup_name        => "remote-disks",
  check_command         => "check_nrpe!check_nrpe_daemon",
  event_handler         => "restart-nrpe-handler",
  event_handler_enabled => 1,
  target                => "/etc/nagios4/conf.d/apps_services.cfg",
  max_check_attempts    => 3,
  retry_check_interval  => 1,
  normal_check_interval => 5,
  check_period          => "24x7",
  notification_interval => 30,
  notification_period   => "24x7",
  notification_options  => "w,u,c",
  contact_groups        => "admins,slackgroup",
  mode                  => '0644'
}

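This ensures the service is checked every 5 minutes, and every minute while a problem is being retried. Nagios runs the event handler whenever the service changes state or is retried in a soft problem state; the handler script itself restarts NRPE only when the state is CRITICAL.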


Define Event Handler Command in commands.cfg

In /etc/nagios4/objects/commands.cfg:

define command {
  command_name    restart-nrpe-handler
  command_line    sudo -u group-a /usr/local/bin/restart_nrpe.sh "$HOSTNAME$" "$SERVICESTATE$" "$SERVICESTATETYPE$" "$SERVICEATTEMPT$"
}

This links the service event to the actual script using Nagios macros for host and service state information.


Define NRPE Check in nrpe.cfg

To check the NRPE process itself, this line was added to nrpe.cfg in the nrpe module's files directory:

command[check_nrpe_daemon]=/usr/lib/nagios/plugins/check_procs -w 1: -c 1: -C nrpe

Puppet ensures this command is deployed to all relevant hosts.
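The `-w 1: -c 1:` thresholds use the Nagios plugin range syntax, where `1:` means "alert when the value is below 1". Because the warning and critical ranges are identical, a missing nrpe process is reported as CRITICAL. The sketch below only illustrates that rule; `status_for_count` is a hypothetical helper and is not part of check_procs.

```shell
#!/bin/bash
# Sketch only: status_for_count is a hypothetical helper, not part of check_procs.
# It evaluates a Nagios-style "MIN:" range against a process count.
status_for_count() {
    local count="$1" range="$2"
    local min="${range%:}"   # strip the trailing colon: "1:" -> "1"

    # A value below the minimum is outside the range, so the check alerts.
    if [ "$count" -lt "$min" ]; then
        echo "CRITICAL"
        return 2
    fi
    echo "OK"
    return 0
}
```

For example, `status_for_count 0 1:` corresponds to a host where no nrpe process is running.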


Create the Event Handler Script

Path: /usr/local/bin/restart_nrpe.sh on the Nagios master (mgmt-a):

#!/bin/bash

REMOTE_HOST="$1"
SERVICE_STATE="$2"
STATE_TYPE="$3"
SERVICE_ATTEMPT="$4"

REMOTE_CMD="sudo systemctl restart nagios-nrpe-server.service"

if [[ "$SERVICE_STATE" == "CRITICAL" ]]; then
    echo "$(date): Restarting NRPE on $REMOTE_HOST due to CRITICAL state..."
    ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no group-a@"$REMOTE_HOST" "$REMOTE_CMD"

    if [ $? -eq 0 ]; then
        echo "NRPE restarted successfully on $REMOTE_HOST"
        exit 0
    else
        echo "Failed to restart NRPE on $REMOTE_HOST"
        exit 2
    fi
else
    echo "$(date): No restart needed for state $SERVICE_STATE"
    exit 0
fi
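The script above restarts NRPE on every CRITICAL event it receives and does not use STATE_TYPE or SERVICE_ATTEMPT. With max_check_attempts set to 3, Nagios may call the handler on each soft retry and again on the hard state. The following is a minimal sketch of how those two arguments could gate the restart, assuming the third argument carries SOFT or HARD. `should_restart` and the "act from the second soft attempt" policy are illustrative assumptions, not what is deployed.

```shell
#!/bin/bash
# Sketch only: should_restart is a hypothetical helper, not part of the deployed script.
# Returns 0 (restart) or 1 (skip) based on the Nagios macros passed to the handler.
should_restart() {
    local state="$1" state_type="$2" attempt="$3"

    # Only a CRITICAL service is a candidate for a restart.
    [ "$state" = "CRITICAL" ] || return 1

    case "$state_type" in
        HARD)
            # All retries are exhausted; restart as a last resort.
            return 0
            ;;
        SOFT)
            # Assumed policy: ignore the first soft failure, act from the second on.
            if [ "$attempt" -ge 2 ]; then
                return 0
            fi
            return 1
            ;;
        *)
            # Unknown or unexpanded state type: do nothing.
            return 1
            ;;
    esac
}
```

In the handler, the existing CRITICAL test could then become `if should_restart "$SERVICE_STATE" "$STATE_TYPE" "$SERVICE_ATTEMPT"; then ...`.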

Ownership and permissions:

sudo chown nagios:nagios /usr/local/bin/restart_nrpe.sh
sudo chmod 755 /usr/local/bin/restart_nrpe.sh

SSH Access Configuration

To allow the script to SSH into monitored hosts without a password:

An SSH key pair already existed on mgmt-a (public key: /home/group-a/.ssh/id_rsa.pub). It was reused, and the public key was appended to authorized_keys on each remote host.

Commands used to copy the SSH key:

sudo cat ~/.ssh/id_rsa.pub | ssh group-a@apps-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@db-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@backup-a "cat >> ~/.ssh/authorized_keys"

This ensures passwordless login for group-a.
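The three copy commands can also be written as a loop. The sketch below assumes `ssh-copy-id` is installed on mgmt-a; the `DRY_RUN` switch and the `copy_key_to_hosts` function are hypothetical names introduced for illustration. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch only: copy_key_to_hosts and DRY_RUN are hypothetical names.
# The host list and key path are taken from the commands above.
HOSTS="apps-a db-a backup-a"
PUBKEY="/home/group-a/.ssh/id_rsa.pub"

copy_key_to_hosts() {
    local host
    for host in $HOSTS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: print the command instead of running it.
            echo "ssh-copy-id -i $PUBKEY group-a@$host"
        else
            # ssh-copy-id skips keys that already allow login on the remote host.
            ssh-copy-id -i "$PUBKEY" "group-a@$host"
        fi
    done
}
```

Running `DRY_RUN=0 copy_key_to_hosts` as group-a on mgmt-a would perform the copy; unlike the `cat >> authorized_keys` form, repeated runs should not add duplicate entries.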


Configure Sudoers via Puppet

The Puppet sudo module was edited to include:

group-a ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server.service
nagios ALL=(group-a) NOPASSWD: /usr/local/bin/restart_nrpe.sh

This allows group-a to restart the nagios-nrpe-server service (and nothing else) without a password, and allows nagios to run the handler script as group-a.
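A syntax error in sudoers can lock out sudo entirely, so it can help to check the rules before Puppet deploys them. The sketch below writes the two rules to a scratch file and checks them with `visudo -c -f`. `write_fragment` and `check_fragment` are hypothetical helpers, and how the Puppet sudo module actually stores these rules is not shown in this page.

```shell
#!/bin/bash
# Sketch only: write_fragment and check_fragment are hypothetical helpers.
write_fragment() {
    # Write the two rules from this section into the file given as $1.
    cat > "$1" <<'EOF'
group-a ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server.service
nagios ALL=(group-a) NOPASSWD: /usr/local/bin/restart_nrpe.sh
EOF
}

check_fragment() {
    # visudo -c performs a check-only run; -f points it at a file
    # other than the default /etc/sudoers.
    visudo -c -f "$1"
}
```

A typical run would be `tmp=$(mktemp); write_fragment "$tmp"; check_fragment "$tmp"`. After deployment, `sudo -l -U group-a` on a monitored host lists the commands group-a is allowed to run.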


Validation Procedures

  • Confirm NRPE is Active

  • Simulate Failure

  • Wait 1–2 minutes and check the Nagios web UI

    • Confirm service status goes to CRITICAL
    • This means the failure was detected

  • Confirm Recovery

    • Wait for the next check (about a minute with retry_check_interval set to 1)
    • The status should return to OK in the Nagios web UI

  • The logs also confirm the failure and recovery process:
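The steps above can also be walked through from a shell on a monitored host. This is a minimal sketch, assuming systemd manages the service and that the failure is simulated by stopping the unit; the exact commands used during validation are not recorded in the text. `wait_until` is a hypothetical polling helper.

```shell
#!/bin/bash
# Sketch only: wait_until is a hypothetical helper that polls a command.
# Usage: wait_until TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
    local tries="$1" delay="$2" i=0
    shift 2
    while [ "$i" -lt "$tries" ]; do
        # Succeed as soon as the probe command exits 0.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example run (commented out so the sketch has no side effects):
#   systemctl is-active nagios-nrpe-server        # confirm NRPE is active
#   sudo systemctl stop nagios-nrpe-server        # simulate the failure (assumed method)
#   wait_until 30 10 systemctl is-active --quiet nagios-nrpe-server \
#       && echo "NRPE recovered"
```

The polling loop gives up after TRIES attempts, so a handler that never fires shows up as a non-zero exit status instead of an indefinite wait.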


Ticket Reference: https://rt.dataraster.com/Ticket/Display.html?id=334