#336: Self‐Healing NRPE Service Implementation - Rmhibbert/oe2-group-c GitHub Wiki
Step by step
Create the Event Handler Script
Description: This script connects to the target host and restarts the NRPE service.
Command: sudo nano /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh
#!/bin/bash
HOST="$1"
LOGFILE="/var/log/nagios/nrpe_recovery.log"
echo "$(date): Script triggered for host $HOST" >> "$LOGFILE"
# Run the restart command and log both stdout and stderr
/usr/bin/ssh -i /home/group-c/.ssh/id_rsa -o StrictHostKeyChecking=no group-c@"$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server" >> "$LOGFILE" 2>&1
SSH_EXIT=$?
if [ $SSH_EXIT -eq 0 ]; then
    echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"
else
    echo "$(date): FAILED to restart NRPE on $HOST (exit code $SSH_EXIT)" >> "$LOGFILE"
    exit 2
fi
Line | Explanation |
---|---|
#!/bin/bash | Specifies that the script should be run using the Bash shell. |
HOST="$1" | Stores the first argument (the Nagios host address) in a variable named HOST. |
LOGFILE=... | Defines the path where recovery attempts will be logged. |
echo "$(date)... | Logs the current date and target host when the script is triggered. |
ssh -i ... | Connects to the remote host using a specific SSH key and runs sudo systemctl restart nagios-nrpe-server. |
SSH_EXIT=$? | Captures the exit status of the SSH command. |
if [ $SSH_EXIT -eq 0 ]; then ... | Checks whether the SSH command succeeded and logs the result. On failure, it logs the error and exits with code 2. |
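Nagios runs an event handler on every state change of the service, including soft failures and the recovery back to OK, so the script above restarts NRPE each time it is called. A common refinement is to pass the service state to the handler and act only on failures. The following is a minimal sketch of such a guard, not part of the original script. It assumes the command_line is extended to pass $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ after $HOSTADDRESS$. The function name should_restart and the choice of which attempts to act on are illustrative.

```shell
#!/bin/bash
# Hypothetical guard: decide whether a restart is warranted.
# Assumes the handler is invoked as:
#   restart_nrpe.sh $HOSTADDRESS$ $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
should_restart() {
    local state="$1" state_type="$2" attempt="$3"

    case "$state" in
        CRITICAL)
            # Act on the first soft failure and on the hard state;
            # skip the retries in between to avoid repeated restarts.
            if [ "$state_type" = "HARD" ] || [ "$attempt" -eq 1 ]; then
                return 0
            fi
            ;;
    esac
    # OK, WARNING, UNKNOWN and intermediate retries: do nothing.
    return 1
}

# Example: a first CRITICAL result takes the restart path.
if should_restart "CRITICAL" "SOFT" 1; then
    echo "restart would run"
fi
```

In the real script, the ssh call would sit inside a block guarded by should_restart "$2" "$3" "$4", so a recovery to OK no longer triggers another restart.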
Make the Script Executable
Description: Grants execute permission on the restart_nrpe.sh script so that Nagios can run it as an event handler during service failures.
Command: sudo chmod +x /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh
Define the Event Handler in Nagios
Description: Creates a custom Nagios command definition named restart-nrpe that tells Nagios how to invoke the event handler script when a service check fails. This command will be referenced later in the service configuration.
Command: sudo nano /etc/nagios4/conf.d/event_handler_nrpe.cfg
define command {
command_name restart-nrpe
command_line /usr/local/nagios/libexec/eventhandlers/restart_nrpe.sh $HOSTADDRESS$
}
Add the Service Configuration in Puppet
Description: Defines a Nagios service resource in Puppet that monitors the NRPE server health for all hosts in the core-servers host group. It uses the check_nrpe plugin and links the restart-nrpe event handler to automatically recover the service if it fails. This configuration ensures the setup is managed and deployed consistently across environments.
Command: sudo nano /etc/puppetlabs/code/modules/nagios/manifests/config.pp
nagios_service { "check-nrpe-server":
use => "generic-service",
hostgroup_name => "core-servers",
service_description => "NRPE Server Health",
check_command => "check_nrpe!check_users",
max_check_attempts => 4,
check_interval => 1,
retry_interval => 1,
check_period => "24x7",
notification_interval => 30,
notification_period => "24x7",
notification_options => "w,u,c,r",
contact_groups => "slackgroup",
event_handler => "restart-nrpe",
event_handler_enabled => 1,
target => "/etc/nagios4/conf.d/ppt_services.cfg",
mode => "0644",
}
Apply Puppet and Restart Services
Description: Applies the latest Puppet configuration to ensure the Nagios service definition and event handler are deployed. Then, it restarts both the nagios-nrpe-server and nagios4 services to apply changes and ensure that monitoring and event handling are functioning as intended.
Command:
sudo puppet agent --test
sudo systemctl restart nagios-nrpe-server.service
sudo systemctl restart nagios4.service
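Before restarting Nagios, it can help to run its built-in configuration check so that a typo in the new command or service definition does not take monitoring down. The sketch below wraps the -v pre-flight check. The binary path /usr/sbin/nagios4 and the main config path /etc/nagios4/nagios.cfg are assumptions based on the Debian/Ubuntu nagios4 package, and the function name config_is_valid is illustrative.

```shell
#!/bin/bash
# Assumed defaults for the Debian/Ubuntu nagios4 package; override if yours differ.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/sbin/nagios4}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios4/nagios.cfg}"

# Run the Nagios pre-flight check (-v) and report whether a restart is safe.
config_is_valid() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" > /dev/null 2>&1; then
        echo "config OK"
    else
        echo "config has errors; not restarting"
        return 1
    fi
}
```

A possible use is: config_is_valid && sudo systemctl restart nagios4.service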
Prepare SSH Access (Group-C and Nagios Users)
Description: Ensures that both the group-c and nagios users can access the monitored hosts (backup-c, apps-c, db-c) via SSH without being prompted for a password. This is critical for automated remote command execution by the event handler script. For the nagios user, it also involves setting up the home directory and generating SSH keys if not already present.
Command:
sudo -u group-c ssh-copy-id group-c@backup-c
sudo -u group-c ssh-copy-id group-c@apps-c
sudo -u group-c ssh-copy-id group-c@db-c
# For the nagios user: home directory and SSH key setup
sudo mkdir -p /home/nagios
sudo chown nagios:nagios /home/nagios
sudo -u nagios ssh-keygen
sudo -u nagios ssh-copy-id group-c@backup-c
sudo -u nagios ssh-copy-id group-c@apps-c
sudo -u nagios ssh-copy-id group-c@db-c
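Once the keys are copied, key-based login can be confirmed without risking an interactive prompt. The sketch below is one way to do that, using ssh with BatchMode=yes so that a missing key causes a failure instead of a password prompt. The function name check_key_login and the SSH_BIN override are illustrative and not part of the original setup.

```shell
#!/bin/bash
# SSH_BIN is overridable so the function can be exercised without real hosts.
SSH_BIN="${SSH_BIN:-ssh}"

# Print one line per host; return non-zero if any host refused key-based login.
check_key_login() {
    local failed=0 host
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "group-c@$host" true; then
            echo "$host: key login OK"
        else
            echo "$host: key login FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling check_key_login backup-c apps-c db-c as the nagios user is the more telling test, since Nagios runs the event handler as that user.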
Create Log File for Event Handler Output
Description: Creates a dedicated log file that the event handler script uses to record recovery attempts and outcomes. This log is essential for troubleshooting and verifying that the script executed as expected. Permissions are set to allow the nagios user to write to the log.
Command:
sudo mkdir -p /var/log/nagios
sudo touch /var/log/nagios/nrpe_recovery.log
sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log
Simulate Failure
Description: Stops the NRPE service on any one of the monitored servers to simulate a failure.
Command: sudo systemctl stop nagios-nrpe-server
Monitor Log on Management Server
Description: Allows real-time monitoring of the event handler's activity by viewing the contents of the recovery log. This helps confirm whether the restart_nrpe.sh script is being triggered correctly and whether NRPE restarts are succeeding or failing.
Command: sudo tail -f /var/log/nagios/nrpe_recovery.log
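For a quick summary instead of a live tail, the log can be counted by outcome. The sketch below relies only on the two message formats written by restart_nrpe.sh ("Successfully restarted NRPE" and "FAILED to restart NRPE"). The function name summarise_recovery_log is illustrative.

```shell
#!/bin/bash
# Summarise a recovery log written in the format used by restart_nrpe.sh.
summarise_recovery_log() {
    local logfile="$1" ok failed
    # grep -c prints 0 but exits 1 when nothing matches, hence the "|| true".
    ok=$(grep -c "Successfully restarted NRPE" "$logfile" || true)
    failed=$(grep -c "FAILED to restart NRPE" "$logfile" || true)
    echo "restarts succeeded: $ok, failed: $failed"
}
```

For example: summarise_recovery_log /var/log/nagios/nrpe_recovery.log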
Check Nagios Web UI
Description: Use the Nagios web interface to verify that the automatic recovery is working as expected.
Steps:
- Navigate to Services → NRPE Server Health
- When a monitored server goes into a CRITICAL state, the event handler should be triggered
- Within approximately 1 minute, the service should automatically recover and return to OK
Before Recovery
The service is in a CRITICAL state:
After Recovery
The service has been successfully restarted and is now OK:
As shown above, the NRPE service is automatically restarted around 1 minute after a failure is detected.
Challenges
🛠️ Challenge: Script Prompted for Password on Remote Host
Problem:
The event handler script failed to restart the NRPE service because sudo on the remote host prompted for a password. This caused the automatic recovery to break, as no password can be provided in a non-interactive script.
🔍 Root Cause:
The group-c user did not have passwordless sudo access on the remote host, so the command sent over SSH by the original script stalled at the sudo password prompt.
✅ Solution:
The problem was resolved by updating this critical line in the script:
# ✅ Final working version:
/usr/bin/ssh -i /home/group-c/.ssh/id_rsa -o StrictHostKeyChecking=no group-c@"$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"
SSH_EXIT=$?
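To confirm that the prompt is gone without actually restarting anything, the sudo permission can be probed from the management server. The sketch below is a hedged example, not the fix used by the authors. It uses sudo -n (never prompt) with -l and the command, which only asks sudo whether that command is permitted. The function name can_restart_without_password is illustrative. The sudoers line in the comment is an assumed example of how passwordless access is commonly granted and does not come from the original setup.

```shell
#!/bin/bash
# SSH_BIN is overridable so the function can be exercised without real hosts.
SSH_BIN="${SSH_BIN:-ssh}"
RESTART_CMD="/usr/bin/systemctl restart nagios-nrpe-server"

# Assumed example of a sudoers rule on each monitored host (edit with visudo):
#   group-c ALL=(root) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server

# Succeeds only if group-c may run the restart command on the host
# without a password prompt. Nothing is restarted by this check.
can_restart_without_password() {
    local host="$1"
    "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "group-c@$host" \
        "sudo -n -l $RESTART_CMD" > /dev/null 2>&1
}
```

For example: can_restart_without_password db-c || echo "sudo on db-c still requires a password"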