#336: Self‐Healing NRPE Service Implementation - Rmhibbert/oe2-group-c GitHub Wiki

Step by step

Create the Event Handler Script

Description: This script connects to the target host and restarts the NRPE service.

Command: sudo nano /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh

#!/bin/bash

HOST="$1"
LOGFILE="/var/log/nagios/nrpe_recovery.log"

echo "$(date): Script triggered for host $HOST" >> "$LOGFILE"

# Run the restart command and log both stdout and stderr
/usr/bin/ssh -i /home/group-c/.ssh/id_rsa -o StrictHostKeyChecking=no group-c@"$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server" >> "$LOGFILE" 2>&1
SSH_EXIT=$?

if [ $SSH_EXIT -eq 0 ]; then
    echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"
else
    echo "$(date): FAILED to restart NRPE on $HOST (exit code $SSH_EXIT)" >> "$LOGFILE"
    exit 2
fi

| Line | Explanation |
| --- | --- |
| `#!/bin/bash` | Specifies that the script should be run using the Bash shell. |
| `HOST="$1"` | Stores the first argument (the Nagios host address) in a variable named `HOST`. |
| `LOGFILE=...` | Defines the path where recovery attempts are logged. |
| `echo "$(date)...` | Logs the current date and the target host when the script is triggered. |
| `ssh -i ...` | Connects to the remote host using a specific SSH key and runs `sudo systemctl restart nagios-nrpe-server`. |
| `SSH_EXIT=$?` | Captures the exit status of the SSH command. |
| `if [ $SSH_EXIT -eq 0 ]; then ...` | Checks whether the SSH command succeeded and logs the result. On failure, it logs the error and exits with code `2`. |
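The script above assumes it is always called with a host argument. As a minimal sketch of how that could be guarded, the helpers below wrap the logging line and reject an empty argument. The names `log_event` and `require_host` are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch only: log_event and require_host are hypothetical helper names.

# LOGFILE defaults to the path used by restart_nrpe.sh; override it for testing.
LOGFILE="${LOGFILE:-/var/log/nagios/nrpe_recovery.log}"

# Append one timestamped line to the recovery log.
log_event() {
    echo "$(date): $*" >> "$LOGFILE"
}

# Return 0 if a host argument was supplied; otherwise log the problem and return 1.
require_host() {
    if [ -z "$1" ]; then
        log_event "No host argument supplied, nothing to restart"
        return 1
    fi
    return 0
}
```

In the event handler this could be used as `require_host "$1" || exit 2` before the SSH call, so a misconfigured command definition leaves a trace in the log instead of attempting to connect to an empty host name.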

Make the Script Executable

Description: Grants execute permissions to the restart_nrpe.sh script so that Nagios can run it as an event handler during service failures.

Command: sudo chmod +x /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh

Define the Event Handler in Nagios

Description: Creates a custom Nagios command definition named restart-nrpe that tells Nagios how to invoke the event handler script when a service check fails. This command will be referenced later in the service configuration.

Command: sudo nano /etc/nagios4/conf.d/event_handler_nrpe.cfg

define command {
    command_name    restart-nrpe
    command_line    /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh $HOSTADDRESS$
}
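Nagios runs event handlers on every soft problem state, on the first hard problem state, and again when the service recovers, so a handler that only receives `$HOSTADDRESS$` cannot tell a failure from a recovery. The sketch below shows one way to gate the restart on the service state. It assumes `command_line` is extended to pass `$SERVICESTATE$` and `$SERVICEATTEMPT$` after `$HOSTADDRESS$`; `should_restart` and `MIN_ATTEMPT` are hypothetical names.

```shell
#!/bin/bash
# Sketch only: should_restart is a hypothetical helper. It assumes the Nagios
# command_line passes $SERVICESTATE$ and $SERVICEATTEMPT$ as extra arguments.

# Return 0 (restart) only when the service is CRITICAL and the check attempt
# has reached MIN_ATTEMPT (default 1, i.e. react to the first failed check).
# OK, WARNING and UNKNOWN always return non-zero.
should_restart() {
    local state="$1" attempt="${2:-1}" min_attempt="${MIN_ATTEMPT:-1}"
    [ "$state" = "CRITICAL" ] && [ "$attempt" -ge "$min_attempt" ]
}
```

Inside `restart_nrpe.sh` this could be called as `should_restart "$2" "$3" || exit 0` before the SSH command. Raising `MIN_ATTEMPT` delays the restart by one check interval per attempt, which trades recovery time for fewer restarts on one-off failures.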

Add the Service Configuration in Puppet

Description: Defines a Nagios service resource in Puppet that monitors the NRPE server health for all hosts in the core-servers host group. It uses the check_nrpe plugin and links the restart-nrpe event handler to automatically recover the service if it fails. This configuration ensures the setup is managed and deployed consistently across environments.

Command: sudo nano /etc/puppetlabs/code/modules/nagios/manifests/config.pp

nagios_service { "check-nrpe-server":
  use                     => "generic-service",
  hostgroup_name          => "core-servers",
  service_description     => "NRPE Server Health",
  check_command           => "check_nrpe!check_users",
  max_check_attempts      => 4,
  check_interval          => 1,
  retry_interval          => 1,
  check_period            => "24x7",
  notification_interval   => 30,
  notification_period     => "24x7",
  notification_options    => "w,u,c,r",
  contact_groups          => "slackgroup",
  event_handler           => "restart-nrpe",
  event_handler_enabled   => 1,
  target                  => "/etc/nagios4/conf.d/ppt_services.cfg",
  mode                    => "0644",
}

Apply Puppet and Restart Services

Description: Applies the latest Puppet configuration to ensure the Nagios service definition and event handler are deployed. Then, it restarts both the nagios-nrpe-server and nagios4 services to apply changes and ensure that monitoring and event handling are functioning as intended.

Command:

sudo puppet agent --test
sudo systemctl restart nagios-nrpe-server.service
sudo systemctl restart nagios4.service
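Restarting Nagios with a broken configuration stops monitoring altogether, so it can help to verify the configuration first. The following is a rough sketch, assuming the Debian/Ubuntu `nagios4` package layout (`/usr/sbin/nagios4` and `/etc/nagios4/nagios.cfg`); `restart_if_valid`, `VERIFY_CMD`, and `RESTART_CMD` are hypothetical names.

```shell
#!/bin/bash
# Sketch only: restart_if_valid is a hypothetical helper. VERIFY_CMD and
# RESTART_CMD can be overridden so the logic can be tried without touching Nagios.
# The default paths assume the Debian/Ubuntu nagios4 package layout.
VERIFY_CMD="${VERIFY_CMD:-sudo /usr/sbin/nagios4 -v /etc/nagios4/nagios.cfg}"
RESTART_CMD="${RESTART_CMD:-sudo systemctl restart nagios4.service}"

# Run the configuration check and restart only if it passes.
# The command strings are intentionally unquoted so they split into words.
restart_if_valid() {
    if $VERIFY_CMD > /dev/null; then
        $RESTART_CMD && echo "restarted"
    else
        echo "config invalid, restart skipped"
        return 1
    fi
}
```

Calling `restart_if_valid` after `sudo puppet agent --test` would leave the running Nagios instance untouched if the generated `ppt_services.cfg` or the new command definition contains an error.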

Prepare SSH Access (Group-C and Nagios Users)

Description: Ensures that both the group-c and nagios users can access the monitored hosts (backup-c, apps-c, db-c) via SSH without being prompted for a password. This is critical for automated remote command execution by the event handler script. For the nagios user, it also involves setting up the home directory and generating SSH keys if not already present.

Command:

sudo -u group-c ssh-copy-id group-c@backup-c
sudo -u group-c ssh-copy-id group-c@apps-c
sudo -u group-c ssh-copy-id group-c@db-c

Set up a home directory and SSH key for the nagios user:

sudo mkdir -p /home/nagios
sudo chown nagios:nagios /home/nagios
sudo -u nagios ssh-keygen

sudo -u nagios ssh-copy-id group-c@backup-c
sudo -u nagios ssh-copy-id group-c@apps-c
sudo -u nagios ssh-copy-id group-c@db-c
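Once the keys are copied, passwordless access can be confirmed before relying on the event handler. The sketch below is one way to do that, assuming it is run in a shell belonging to the user being tested; `check_ssh_access` and `SSH_CMD` are hypothetical names.

```shell
#!/bin/bash
# Sketch only: check_ssh_access is a hypothetical helper.
# SSH_CMD can be overridden for testing.
SSH_CMD="${SSH_CMD:-ssh}"

# Run a no-op command on each host as the given remote user.
# BatchMode=yes makes ssh fail instead of prompting for a password,
# which mirrors the non-interactive conditions of an event handler.
check_ssh_access() {
    local user="$1" host failed=0
    shift
    for host in "$@"; do
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, `check_ssh_access group-c backup-c apps-c db-c` prints one line per host and returns non-zero if any host still asks for a password or cannot be reached.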

Create Log File for Event Handler Output

Description: Creates a dedicated log file that the event handler script uses to record recovery attempts and outcomes. This log is essential for troubleshooting and verifying that the script executed as expected. Permissions are set to allow the nagios user to write to the log.

Command:

sudo mkdir -p /var/log/nagios
sudo touch /var/log/nagios/nrpe_recovery.log
sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log

Simulate Failure

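Description: Stops the NRPE service on any one of the monitored servers (backup-c, apps-c, or db-c) so that the next NRPE Server Health check fails and triggers the event handler.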

Command: sudo systemctl stop nagios-nrpe-server
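After stopping the service, recovery can be watched from the monitored server itself rather than only from the Nagios side. The following sketch polls the unit state until it is active again; `wait_for_recovery` and `CHECK_CMD` are hypothetical names, and the default timeout and interval are arbitrary.

```shell
#!/bin/bash
# Sketch only: wait_for_recovery is a hypothetical helper.
# CHECK_CMD can be overridden for testing.
CHECK_CMD="${CHECK_CMD:-systemctl is-active --quiet nagios-nrpe-server}"

# Poll until the service is active again or the timeout (in seconds) expires.
# The reported time counts sleep intervals only, so it is approximate.
wait_for_recovery() {
    local timeout="${1:-180}" interval="${2:-5}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if $CHECK_CMD; then
            echo "recovered after ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "not recovered within ${timeout}s"
    return 1
}
```

Running `wait_for_recovery 180 5` right after the stop command gives a rough measurement of how long the self-healing takes on that host.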

Monitor Log on Management Server

Description: Allows real-time monitoring of the event handler’s activity by viewing the contents of the recovery log. This helps confirm whether the restart_nrpe.sh script is being triggered correctly and if NRPE restarts are succeeding or failing.

Command: sudo tail -f /var/log/nagios/nrpe_recovery.log
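For a quick overview instead of a live tail, the log can also be summarized. This sketch counts successful and failed attempts, relying on the exact message text written by `restart_nrpe.sh`; `summarize_recovery_log` is a hypothetical name.

```shell
#!/bin/bash
# Sketch only: summarize_recovery_log is a hypothetical helper.
# It depends on the exact message text that restart_nrpe.sh writes.
summarize_recovery_log() {
    local logfile="${1:-/var/log/nagios/nrpe_recovery.log}"
    local ok failed
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    ok=$(grep -c "Successfully restarted NRPE" "$logfile" || true)
    failed=$(grep -c "FAILED to restart NRPE" "$logfile" || true)
    echo "successful=$ok failed=$failed"
}
```

A rising `failed` count with no matching successes is a hint to check the SSH and sudo setup on the affected host.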

Check Nagios Web UI

Description: Use the Nagios Web Interface to verify that the automatic recovery is working as expected.

Steps:

1. Navigate to Services → NRPE Server Health.
2. When the service on a monitored server goes into a CRITICAL state, the event handler should be triggered.
3. Within approximately 1 minute, the service should automatically recover and return to OK.

Before Recovery
The service is in a CRITICAL state:

After Recovery
The service has been successfully restarted and is now OK:

The NRPE service is restarted automatically about 1 minute after the failure is detected.

Challenges

🛠️ Challenge: Script Prompted for Password on Remote Host

Problem:

The event handler script failed to restart the NRPE service because sudo on the remote host prompted for a password. This caused the automatic recovery to break, as no password can be provided in a non-interactive script.

🔍 Root Cause:

The group-c user did not have passwordless sudo access for the restart command on the remote host, so the original SSH command stalled at the sudo password prompt.

✅ Solution:

The problem was resolved by updating this critical line in the script:

# ✅ Final working version:
/usr/bin/ssh -i /home/group-c/.ssh/id_rsa -o StrictHostKeyChecking=no group-c@"$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"
SSH_EXIT=$?
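The page does not show how passwordless sudo was granted on the monitored hosts. One common approach, sketched here purely as an assumption, is a sudoers drop-in that allows only the exact restart command. The helper name `sudoers_rule` and the file name `/etc/sudoers.d/group-c-nrpe` are hypothetical.

```shell
#!/bin/bash
# Sketch only: sudoers_rule is a hypothetical helper, and the drop-in file name
# below is an assumption. The source does not show how passwordless sudo was set up.

# Print a sudoers line that lets one user run one exact command as root
# without a password.
sudoers_rule() {
    local user="$1" cmd="$2"
    printf '%s ALL=(root) NOPASSWD: %s\n' "$user" "$cmd"
}

# Example, to be run on each monitored host (commented out because it needs root):
# sudoers_rule group-c "/usr/bin/systemctl restart nagios-nrpe-server" \
#     | sudo tee /etc/sudoers.d/group-c-nrpe > /dev/null
# sudo chmod 0440 /etc/sudoers.d/group-c-nrpe
# sudo visudo -cf /etc/sudoers.d/group-c-nrpe
```

The rule can then be checked from the management server by adding `-n` to the remote command (`sudo -n /usr/bin/systemctl restart nagios-nrpe-server`), which makes sudo fail immediately instead of prompting if a password would still be required.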