Ticket ID #334: Self-Healing NRPE Service Implementation

NRPE Self-Healing via Nagios Event Handler

Summary

The nagios-nrpe-server service on monitored hosts may crash unexpectedly, causing Nagios to show the service as CRITICAL and generate false alerts. Restarting the daemon by hand each time is slow and error-prone. To resolve this, we implemented an automated self-healing mechanism that uses a Nagios event handler to restart the NRPE service when a failure is detected.

Implementation Steps

Define Nagios Service in Puppet

In the nagios module's config.pp, a service definition was added to monitor the NRPE daemon process:

nagios_service { "check-nrpe-daemon":
  service_description   => "NRPE Daemon Process",
  hostgroup_name        => "remote-disks",
  check_command         => "check_nrpe!check_nrpe_daemon",
  event_handler         => "restart-nrpe-handler",
  event_handler_enabled => 1,
  target                => "/etc/nagios4/conf.d/apps_services.cfg",
  max_check_attempts    => 3,
  retry_check_interval  => 1,
  normal_check_interval => 5,
  check_period          => "24x7",
  notification_interval => 30,
  notification_period   => "24x7",
  notification_options  => "w,u,c",
  contact_groups        => "admins,slackgroup",
  mode                  => '0644'
}

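With these settings the service is checked every 5 minutes while healthy and every minute while it is failing (assuming the default interval_length of 60 seconds). Nagios invokes the event handler on state changes, including when the service enters a CRITICAL state; the handler script then decides whether to act.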
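As a rough worked example of what these intervals imply, the sketch below computes the worst-case detection times. It assumes the default interval_length of 60 seconds (so the values are minutes); the variable names first_failure and hard_state are illustrative only.

```shell
#!/bin/sh
# Values copied from the nagios_service definition above.
normal_check_interval=5   # minutes between checks while the service is OK
retry_check_interval=1    # minutes between re-checks after a failure
max_check_attempts=3      # failed checks needed to reach a HARD state

# Worst case: NRPE dies just after a successful check, so the first
# failed check (SOFT CRITICAL, attempt 1) comes one normal interval later.
first_failure=$normal_check_interval

# The remaining attempts run at the retry interval until the state is HARD.
hard_state=$(( first_failure + (max_check_attempts - 1) * retry_check_interval ))

echo "first failed check after at most ${first_failure} min"
echo "HARD CRITICAL after at most ${hard_state} min"
```

Because Nagios runs event handlers on SOFT as well as HARD state changes, the handler can act at the first failed check rather than waiting for the HARD state.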

Define Event Handler Command in commands.cfg

In /etc/nagios4/objects/commands.cfg:

define command {
  command_name    restart-nrpe-handler
  command_line    sudo -u group-a /usr/local/bin/restart_nrpe.sh "$HOSTNAME$" "$SERVICESTATE$" "$STATETYPE$" "$SERVICEATTEMPT$"
}

This links the service event to the actual script using Nagios macros for host and service state information.
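To illustrate how the macros reach the script as positional arguments, the sketch below maps three of them to a restart decision. It is an illustrative variant, not the handler used in this ticket: the function name should_restart and the policy of skipping the first soft failure are assumptions, loosely following the pattern of the event handler example in the Nagios documentation.

```shell
#!/bin/sh
# should_restart STATE STATETYPE ATTEMPT
# Returns 0 (true) when a restart should be attempted, 1 otherwise.
# Hypothetical policy: ignore the first SOFT failure (it may be a blip),
# act on the second SOFT failure, and act again if the state goes HARD.
should_restart() {
    # Only CRITICAL ever warrants a restart.
    [ "$1" = "CRITICAL" ] || return 1

    case "$2" in
        SOFT) [ "$3" -ge 2 ] ;;   # second or later soft attempt
        HARD) return 0 ;;         # retries exhausted
        *)    return 1 ;;         # unknown state type: do nothing
    esac
}
```

Inside a handler this could be called as should_restart "$SERVICE_STATE" "$STATE_TYPE" "$SERVICE_ATTEMPT" before issuing the remote restart. The trade-off is a slower recovery in exchange for fewer restarts on transient failures.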

Define NRPE Check in nrpe.cfg

To check the NRPE process itself, this line was added to the nrpe.cfg kept in the nrpe module's files directory:

command[check_nrpe_daemon]=/usr/lib/nagios/plugins/check_procs -w 1: -c 1: -C nrpe

Puppet ensures this command is deployed to all relevant hosts.
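The thresholds -w 1: and -c 1: use the Nagios plugin range notation, in which N: means "alert when the value is below N". The sketch below mimics that decision for a process count. It only illustrates the semantics; classify_proc_count is a made-up helper, not part of check_procs.

```shell
#!/bin/sh
# classify_proc_count COUNT WARN_MIN CRIT_MIN
# Mimics a "MIN:" range: the value is acceptable when COUNT >= MIN.
# Critical is evaluated first, so equal thresholds never yield WARNING.
classify_proc_count() {
    if [ "$1" -lt "$3" ]; then
        echo "CRITICAL"; return 2
    elif [ "$1" -lt "$2" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"; return 0
}

# One nrpe process running: within both ranges.
classify_proc_count 1 1 1
# No nrpe process: below the critical minimum; the exit status follows
# the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL).
classify_proc_count 0 1 1 || echo "exit status: $?"
```

With both thresholds set to 1:, a missing nrpe process goes straight to CRITICAL, which is the state the event handler acts on.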

Create the Event Handler Script

Path: /usr/local/bin/restart_nrpe.sh on the Nagios master (mgmt-a):

#!/bin/bash

REMOTE_HOST="$1"
SERVICE_STATE="$2"
STATE_TYPE="$3"
SERVICE_ATTEMPT="$4"

REMOTE_CMD="sudo systemctl restart nagios-nrpe-server.service"

if [[ "$SERVICE_STATE" == "CRITICAL" ]]; then
    echo "$(date): Restarting NRPE on $REMOTE_HOST due to CRITICAL state..."
    ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no group-a@"$REMOTE_HOST" "$REMOTE_CMD"

    if [ $? -eq 0 ]; then
        echo "NRPE restarted successfully on $REMOTE_HOST"
        exit 0
    else
        echo "Failed to restart NRPE on $REMOTE_HOST"
        exit 2
    fi
else
    echo "$(date): No restart needed for state $SERVICE_STATE"
    exit 0
fi

Ownership and permissions:

sudo chown nagios:nagios /usr/local/bin/restart_nrpe.sh
sudo chmod 755 /usr/local/bin/restart_nrpe.sh

SSH Access Configuration

To allow the script to SSH into monitored hosts without a password:

An SSH key pair had already been generated on mgmt-a. Its public key, /home/group-a/.ssh/id_rsa.pub, was reused and copied to each remote host.

Commands used to copy the SSH key:

sudo cat ~/.ssh/id_rsa.pub | ssh group-a@apps-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@db-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@backup-a "cat >> ~/.ssh/authorized_keys"

This enables key-based (passwordless) login for group-a on those hosts.
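One way to confirm the key-based login before relying on it in the handler is sketched below. It reuses the ssh options from the handler script and the host names from the copy commands; the function names check_key_login and report_key_logins are illustrative.

```shell
#!/bin/sh
# Hosts taken from the copy commands above.
HOSTS="apps-a db-a backup-a"

# BatchMode=yes makes ssh fail instead of prompting for a password,
# so a zero exit status means key-based login worked.
check_key_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "group-a@$1" true
}

# Print one line per host and return non-zero if any host failed.
report_key_logins() {
    failed=0
    for host in $HOSTS; do
        if check_key_login "$host"; then
            echo "$host: key login OK"
        else
            echo "$host: key login FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Running report_key_logins as group-a on mgmt-a should print an OK line for each host if the keys were installed correctly.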

Configure Sudoers via Puppet

Puppet's sudo module was edited to include:

group-a ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server.service
nagios ALL=(group-a) NOPASSWD: /usr/local/bin/restart_nrpe.sh

The first rule lets group-a restart nagios-nrpe-server (and only that service) through sudo without a password. The second lets the nagios user run the handler script as group-a without a password.
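A possible way to check that the rules took effect is to ask sudo whether each command is permitted, without actually running it. The sketch below assumes the sudo on these hosts supports listing a single command with -l; the function names are illustrative.

```shell
#!/bin/sh
# Run as group-a on a monitored host.
# "sudo -l COMMAND" exits 0 when the policy permits the command and
# non-zero when it does not; -n makes sudo fail rather than prompt.
can_restart_nrpe() {
    sudo -n -l /usr/bin/systemctl restart nagios-nrpe-server.service >/dev/null 2>&1
}

# Run as nagios on mgmt-a: may the handler be run as group-a?
can_run_handler() {
    sudo -n -l -u group-a /usr/local/bin/restart_nrpe.sh >/dev/null 2>&1
}
```

For example, can_restart_nrpe && echo "restart rule in place" reports success only when the first rule is active for the calling user.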

Validation Procedures

Confirm NRPE is Active

Simulate Failure

Wait 1–2 minutes and check the Nagios Web UI
- Confirm service status goes to CRITICAL
- This means the failure was detected

Confirm Recovery
- Wait for the next check (retries run every minute, regular checks every 5 minutes)
- The status should return to OK in the Nagios UI
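The commands for the "Confirm NRPE is Active" and "Simulate Failure" steps are not recorded on this page. One plausible way to carry them out on a monitored host is sketched below, assuming systemd and the unit name used in the handler script. The function names are illustrative, and stopping the unit needs an account with broader sudo rights than the restart-only rule shown earlier.

```shell
#!/bin/sh
# Service unit name taken from the handler script above.
NRPE_UNIT="nagios-nrpe-server.service"

# "Confirm NRPE is Active": exit status 0 means the unit is running.
# The same check can be repeated after the handler has fired to
# confirm the recovery on the host itself.
nrpe_is_active() {
    systemctl is-active --quiet "$NRPE_UNIT"
}

# "Simulate Failure": stop the daemon so the next Nagios check fails.
simulate_nrpe_failure() {
    sudo systemctl stop "$NRPE_UNIT"
}
```

A typical sequence would be nrpe_is_active, then simulate_nrpe_failure, then nrpe_is_active again once Nagios shows the service as OK.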

The logs also confirm the failure and recovery process: