Ticket 337 ‐ Self‐Healing NRPE Service Implementation - SupaHotBall/OE2-Group-D GitHub Wiki

Task

  • Scenario: The nagios-nrpe-server service on a monitored node may unexpectedly stop, causing monitoring failures. Instead of relying on manual fixes, your task is to simulate a self-healing mechanism using Nagios event handlers to automatically detect the failure and restart the service.

  • Task Objective

  • Configure Nagios to monitor the nagios-nrpe-server service on a remote host.
  • Set up an event handler that automatically detects and restarts crashed NRPE services on monitored hosts, reducing manual intervention for critical monitoring failures.
  • Log the recovery action for verification.
  • Acceptance Criteria
  • An event handler script named restart_nrpe.sh is written and tested.
  • Nagios is configured to trigger the script when NRPE becomes CRITICAL.
  • Stopping the service triggers auto-restart by Nagios.
  • Logs confirm the healing action.
  • Verification performed using Nagios web UI and system logs.
  • Validation Procedures
  • Confirm NRPE Is Active
  • Simulate Failure (Use the Nagios web UI or check)
  • Confirm the NRPE Agent Status goes to CRITICAL.
  • Confirm Recovery. Nagios web UI should now show NRPE Agent Status = OK.
  • Documentation & Deliverables:
  • Working restart_nrpe.sh event handler.
  • Logs showing detection and self-healing.
  • Screenshots/logs from Nagios UI showing CRITICAL → OK recovery.
  • Explanation of how your script works

Steps Taken

Create the event handler script at /usr/local/bin/restart_nrpe.sh on the management server

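#!/bin/bash
# Nagios event handler: restart nagios-nrpe-server on the host passed as the first argument.
HOST="$1"
LOGFILE="/var/log/nagios/nrpe_recovery.log"

# Restart NRPE remotely as group-d; skip the host key prompt so the handler cannot hang.
sudo -u group-d ssh -o StrictHostKeyChecking=no "$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"

# Log the outcome with a timestamp; exit non-zero if the restart failed.
if [ $? -eq 0 ]; then
    echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"
else
    echo "$(date): FAILED to restart NRPE on $HOST" >> "$LOGFILE"
    exit 2
fi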

Run sudo chmod +x /usr/local/bin/restart_nrpe.sh to make the script executable
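
According to the Nagios event handler documentation, a handler is run for SOFT problem states, for the first HARD problem state, and for recoveries, so the script above attempts a restart each time it is invoked. A minimal sketch of a guard, assuming the command definition is extended to pass $SERVICESTATE$ and $SERVICESTATETYPE$ as the second and third arguments (an assumption, not part of the original setup); should_restart is a hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: restart NRPE solely on a confirmed (HARD) CRITICAL state.
# Assumed command_line (not in the original config):
#   /usr/local/bin/restart_nrpe.sh $HOSTADDRESS$ $SERVICESTATE$ $SERVICESTATETYPE$

# Hypothetical helper: succeeds only when the state is CRITICAL and HARD.
should_restart() {
    local state="$1" state_type="$2"
    [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]
}

# Inside the handler, the existing ssh restart would be wrapped like this:
#   if should_restart "$2" "$3"; then
#       ... ssh restart and logging from the script above ...
#   fi
```

Waiting for HARD avoids restarts on transient blips and on the recovery event; acting on SOFT CRITICAL as well would heal faster at the cost of more restarts.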

Define the Nagios command in its own cfg file. First create the file:

sudo nano /etc/nagios4/conf.d/restart_handlers.cfg

Then define the command in this file.

define command {
    command_name    restart-nrpe
    command_line    /usr/local/bin/restart_nrpe.sh $HOSTADDRESS$
}

In the nagios Puppet module, add the check-nrpe-server service resource to the config.pp file

nagios_service { "check-nrpe-server":
  use                     => "generic-service",
  host_name               => [ "apps-d", "backup-d", "db-d" ],
  service_description     => "NRPE Server Health",
  check_command           => "check_nrpe!check_users",
  max_check_attempts      => 4,
  check_interval          => 1,
  retry_interval          => 1,
  check_period            => "24x7",
  notification_interval   => 30,
  notification_period     => "24x7",
  notification_options    => "w,u,c,r",
  contact_groups          => "slackgroup",
  event_handler           => "restart-nrpe",
  event_handler_enabled   => 1,
  target                  => "/etc/nagios4/conf.d/ppt_services.cfg",
  mode                    => "0644",
}

The "NRPE Server Health" service check is used by the Nagios server to monitor whether the NRPE daemon is:

  • Running and reachable on the remote host
  • Responding to plugin requests (like check_users, check_load, etc.)
  • Not down
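
The same check can be run by hand from the Nagios server to see what the service check sees. A minimal sketch, assuming the Debian/Ubuntu plugin path /usr/lib/nagios/plugins/check_nrpe (adjust if the plugin lives elsewhere); state_name and probe_nrpe are hypothetical helper names, and the exit-code mapping follows the standard Nagios plugin convention:

```shell
#!/bin/bash
# Assumed plugin location; override CHECK_NRPE if it differs on your system.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# Map a plugin exit code to its Nagios state name.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Run check_users through NRPE on one host and print the resulting state.
probe_nrpe() {
    local host="$1" rc
    "$CHECK_NRPE" -H "$host" -c check_users
    rc=$?
    echo "$host: $(state_name "$rc")"
}

# Usage (run on the Nagios server):
#   for h in apps-d backup-d db-d; do probe_nrpe "$h"; done
```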

Enable SSH Key Access from nagios to Remote Hosts

sudo -u nagios ssh-keygen
sudo -u nagios ssh-copy-id -i /home/nagios/.ssh/id_rsa group-d@<host>

Repeat for backup-d, apps-d, db-d

Allow passwordless sudo for restart command

On each monitored node, run sudo visudo and add

Defaults:group-d !requiretty
group-d ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server
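
Before relying on the handler, it can help to confirm that both the key-based login and the passwordless sudo rule work without any prompt. A minimal sketch, assuming it is run as the nagios user on the management server and uses the group-d@<host> form from the ssh-copy-id step; preflight and the SSH_BIN override are hypothetical names introduced for illustration. BatchMode=yes makes ssh fail rather than prompt, and sudo -n -l followed by a command reports whether that command is permitted without asking for a password:

```shell
#!/bin/bash
# SSH_BIN is a hypothetical override so the function can be dry-run.
SSH_BIN="${SSH_BIN:-ssh}"
RESTART_CMD="/usr/bin/systemctl restart nagios-nrpe-server"

# Succeeds only if key login works and sudo permits the restart without a password.
preflight() {
    local host="$1"
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "group-d@$host" \
        "sudo -n -l $RESTART_CMD" >/dev/null
}

# Usage (as the nagios user):
#   for h in apps-d backup-d db-d; do
#       preflight "$h" && echo "$h: ready" || echo "$h: NOT ready"
#   done
```

Because sudo -l only lists the permission, this check does not actually restart the service.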

Create log file for output on the management server

sudo mkdir -p /var/log/nagios
sudo touch /var/log/nagios/nrpe_recovery.log
sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log

Logs showing detection and self-healing

image

These logs show a successful automatic restart of nagios-nrpe-server after Nagios detected that a check had become CRITICAL and ran the script

To validate this again, stop the NRPE service on any monitored node (here we test with backup-d)

image

image

Only two CRITICAL states appear in total before the event handler kicks in and restarts the service, so the UI shows few critical errors before everything is green again. Usually only the logged-in users check and NRPE Server Health switch to CRITICAL before the script is run and nagios-nrpe-server is restarted

The log can be checked for timestamps which show that nagios-nrpe-server was successfully restarted again at 10:55pm, around 4 minutes after it was manually stopped

image

Afterwards, the UI shows that the states have switched from CRITICAL back to OK

image

Explanation of Script

  • #!/bin/bash

  • HOST="$1"

    • Captures the first argument passed to the script (the target host's address, supplied by Nagios as $HOSTADDRESS$)
  • LOGFILE="/var/log/nagios/nrpe_recovery.log"

    • Defines the file where the script logs success or failure messages
  • sudo -u group-d ssh -o StrictHostKeyChecking=no "$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"

    • SSHs into the specified host as user group-d
    • Remotely runs sudo systemctl restart nagios-nrpe-server to restart the NRPE service
    • Bypasses SSH host fingerprint prompts to avoid hanging the script
  • if [ $? -eq 0 ]; then

    • Checks if the SSH command succeeded (exit code 0)
  • echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"

    • Logs a success message with timestamp if the restart succeeded
  • else

    • If the SSH command failed:

    • echo "$(date): FAILED to restart NRPE on $HOST" >> "$LOGFILE"

      • Logs a failure message with timestamp
    • exit 2

      • Exits with status code 2 to signal an error to Nagios

Purpose

This script serves as a Nagios event handler to:

  • Automatically restart the nagios-nrpe-server service on remote hosts
  • Log the outcome for auditing
  • Enable self-healing for NRPE service outages

Challenges

Challenges and Troubleshooting Summary

Configuration & Pre-Flight Errors

  • Nagios failed to start due to undefined commands

    • restart-nrpe and check_owncloud_login were not defined
    • Fix: Removed obsolete check_owncloud_login from services and defined restart-nrpe in restart_handlers.cfg
  • Changes in ppt_services.cfg overwritten

    • Puppet re-applied old configs, reintroducing deleted lines
    • Fix: Updated nagios_service definitions in Puppet module (config.pp) and ran puppet agent --test

SSH & Authentication Issues

  • Event handler failed due to missing SSH access

    • nagios user didn’t have SSH key or access to remote hosts
    • Fix:
      • Generated key with sudo -u nagios ssh-keygen
      • Used ssh-copy-id to push the public key to group-d@<host>
      • Ensured /home/nagios/.ssh existed and was owned by nagios
  • Host fingerprint prompt blocked event handler

    • nagios couldn't confirm new host keys via ssh
    • Fix: Added -o StrictHostKeyChecking=no to suppress the interactive prompt
  • Permission denied when nagios tried to use group-d's key

    • nagios couldn’t access /home/group-d/.ssh/id_rsa
    • Fix: Copied the SSH key to /home/nagios/.ssh/, adjusted ownership and permissions

Sudo Problems on Remote Hosts

  • sudo: a terminal is required to read the password

    • When group-d ran sudo systemctl restart via SSH
    • Fix: Updated /etc/sudoers on each monitored node:
      Defaults:group-d !requiretty
      group-d ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server
      
  • Wrong path to systemctl caused restart to fail

    • Initially assumed /bin/systemctl, but which systemctl returned /usr/bin/systemctl
    • Fix: Updated script to use full path /usr/bin/systemctl

Logging Issues

  • nrpe_recovery.log not created automatically
    • Script failed silently or wrote nothing
    • Fix: Created log file manually and ensured proper ownership:
      sudo touch /var/log/nagios/nrpe_recovery.log
      sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log
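
One way to avoid losing messages when the log file is missing or not writable is to fall back to syslog. A minimal sketch, assuming the standard logger utility is available; log_event is a hypothetical helper name and the restart_nrpe tag is illustrative:

```shell
#!/bin/bash
LOGFILE="${LOGFILE:-/var/log/nagios/nrpe_recovery.log}"

# Append a timestamped message to LOGFILE, or send it to syslog if that fails.
log_event() {
    local msg="$1"
    if ! { echo "$(date): $msg" >> "$LOGFILE"; } 2>/dev/null; then
        logger -t restart_nrpe "$msg"
    fi
}

# Usage inside the handler:
#   log_event "Successfully restarted NRPE on $HOST"
```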

Event Handler Not Triggering

  • NRPE was stopped, but nothing logged or restarted
    • Service hadn’t reached a HARD CRITICAL state yet
    • Fix: Waited ~1–2 minutes (based on check_interval and max_check_attempts) for Nagios to escalate and trigger handler
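
As a rough check on that wait, the time to reach a HARD state can be estimated from the service settings. A sketch of the arithmetic, assuming the default interval_length of 60 seconds and that a service turns HARD on attempt number max_check_attempts, with retries spaced retry_interval apart; minutes_to_hard is a hypothetical helper name:

```shell
#!/bin/bash
# Minutes from the first failed check to the HARD state:
# (max_check_attempts - 1) retries, each retry_interval minutes apart.
minutes_to_hard() {
    local max_attempts="$1" retry_interval="$2"
    echo $(( (max_attempts - 1) * retry_interval ))
}

# Values from the nagios_service definition: max_check_attempts=4, retry_interval=1
minutes_to_hard 4 1   # prints 3
```

Adding up to one check_interval (1 minute) before the first failed check gives roughly 3 to 4 minutes with these settings, which is in line with the delay of around 4 minutes recorded in the recovery log.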

Debugging Strategy

  • Regularly ran:
    sudo nagios4 -v /etc/nagios4/nagios.cfg
    
  • Checked nagios recovery log

External Resources

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html
https://exchange.nagios.org/directory/Addons/Monitoring-Agents


Ticket Reference

https://rt.dataraster.com/Ticket/Display.html?id=337

