Ticket 337 ‐ Self‐Healing NRPE Service Implementation - SupaHotBall/OE2-Group-D GitHub Wiki
- Scenario: The nagios-nrpe-server service on a monitored node may unexpectedly stop, causing monitoring failures. Instead of relying on manual fixes, your task is to simulate a self-healing mechanism using Nagios event handlers to automatically detect the failure and restart the service.
- Task Objective
  - Configure Nagios to monitor the nagios-nrpe-server service on a remote host.
  - Set up an event handler that automatically detects and restarts crashed NRPE services on monitored hosts, reducing manual intervention for critical monitoring failures.
  - Log the recovery action for verification.
- Acceptance Criteria
  - An event handler script named restart_nrpe.sh is written and tested.
  - Nagios is configured to trigger the script when NRPE becomes CRITICAL.
  - Stopping the service triggers auto-restart by Nagios.
  - Logs confirm the healing action.
  - Verification performed using Nagios web UI and system logs.
- Validation Procedures
  - Confirm NRPE Is Active
  - Simulate Failure (Use the Nagios web UI or check)
  - Confirm the NRPE Agent Status goes to CRITICAL.
  - Confirm Recovery. Nagios web UI should now show NRPE Agent Status = OK.
- Documentation & Deliverables:
  - Working restart_nrpe.sh event handler.
  - Logs showing detection and self-healing.
  - Screenshots/logs from Nagios UI showing CRITICAL → OK recovery.
  - Explanation of how your script works
Create the event handler script
#!/bin/bash
HOST="$1"
LOGFILE="/var/log/nagios/nrpe_recovery.log"
sudo -u group-d ssh -o StrictHostKeyChecking=no "$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"
if [ $? -eq 0 ]; then
    echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"
else
    echo "$(date): FAILED to restart NRPE on $HOST" >> "$LOGFILE"
    exit 2
fi
Save the script as /usr/local/bin/restart_nrpe.sh, then run sudo chmod +x /usr/local/bin/restart_nrpe.sh to make it executable.
Define the nagios command by first creating a cfg file which contains it
sudo nano /etc/nagios4/conf.d/restart_handlers.cfg
Then define the command in this file.
define command {
    command_name restart-nrpe
    command_line /usr/local/bin/restart_nrpe.sh $HOSTADDRESS$
}
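Nagios runs a service event handler each time the service enters a SOFT problem state, when it first reaches a HARD problem state, and when it recovers. Because the command above passes only $HOSTADDRESS$, the script restarts NRPE on every one of those invocations. The following is a minimal sketch of a state-aware variant, not what this ticket deployed. It assumes command_line is extended to `/usr/local/bin/restart_nrpe.sh $HOSTADDRESS$ $SERVICESTATE$ $SERVICESTATETYPE$`, and the function name `should_restart` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: decide whether the handler should act.
# Arguments are assumed to arrive as $SERVICESTATE$ and $SERVICESTATETYPE$.
should_restart() {
    local state="$1" state_type="$2"
    # Act only once the problem is confirmed (HARD) and CRITICAL.
    [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]
}

# Possible wiring inside restart_nrpe.sh (HOST="$1" as in the original):
#   if should_restart "$2" "$3"; then
#       ... existing ssh restart and logging ...
#   fi
```

Restricting the restart to HARD CRITICAL avoids a needless restart when the handler is called for a recovery (OK) event.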
In the Nagios Puppet module, add the check-nrpe-server service to the config.pp file
nagios_service { "check-nrpe-server":
  use => "generic-service",
  host_name => [ "apps-d", "backup-d", "db-d" ],
  service_description => "NRPE Server Health",
  check_command => "check_nrpe!check_users",
  max_check_attempts => 4,
  check_interval => 1,
  retry_interval => 1,
  check_period => "24x7",
  notification_interval => 30,
  notification_period => "24x7",
  notification_options => "w,u,c,r",
  contact_groups => "slackgroup",
  event_handler => "restart-nrpe",
  event_handler_enabled => 1,
  target => "/etc/nagios4/conf.d/ppt_services.cfg",
  mode => "0644",
}
The "NRPE Server Health" service check is used by the Nagios server to monitor whether the NRPE daemon is:
- Running and reachable on the remote host
- Responding to plugin requests (like check_users, check_load, etc.)
- Not down
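The same check can be run by hand from the Nagios server to see what Nagios sees. This sketch assumes the check_nrpe plugin lives at /usr/lib/nagios/plugins/check_nrpe, which is an assumption about the local install. The function names are hypothetical. Plugin exit codes map to states as 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

```shell
#!/bin/bash
# Translate a Nagios plugin exit code into its state name.
# Any code outside 0-2 is reported as UNKNOWN (3 is the official UNKNOWN code).
plugin_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Plugin path is an assumption; override CHECK_NRPE to match the local install.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# Hypothetical wrapper: run check_users through NRPE on one host
# and print the resulting state next to the host name.
probe_nrpe() {
    local host="$1" rc=0
    "$CHECK_NRPE" -H "$host" -c check_users || rc=$?
    echo "$host: $(plugin_state "$rc")"
}
```

For example, `probe_nrpe backup-d` prints the plugin output followed by a line such as `backup-d: OK` while the daemon is answering.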
Enable SSH Key Access from nagios to Remote Hosts
sudo -u nagios ssh-keygen
sudo -u nagios ssh-copy-id -i /home/nagios/.ssh/id_rsa group-d@<host>
Repeat for backup-d, apps-d, and db-d.
Allow passwordless sudo for restart command
Run sudo visudo on each monitored node and add
Defaults:group-d !requiretty
group-d ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server
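Before relying on the handler, the key-based login and the sudo rule can be checked together without any prompt. This is a rough sketch with a hypothetical function name, meant to be run as the user that executes the handler. `BatchMode=yes` makes ssh fail instead of asking for a password, and `sudo -n -l <command>` asks sudo whether the command is permitted without ever prompting.

```shell
#!/bin/bash
# Hypothetical pre-flight check for one monitored node.
preflight() {
    local host="$1"
    # BatchMode=yes: fail instead of prompting for a password or passphrase.
    # sudo -n -l CMD: report whether CMD is permitted, never prompting.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "group-d@$host" \
        "sudo -n -l /usr/bin/systemctl restart nagios-nrpe-server" >/dev/null 2>&1; then
        echo "$host: ready"
    else
        echo "$host: NOT ready"
        return 1
    fi
}
```

A loop such as `for h in apps-d backup-d db-d; do preflight "$h"; done` covers all three nodes. A host whose key has not been accepted yet also reports NOT ready, because BatchMode cannot answer the fingerprint prompt.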
Create log file for output on the management server
sudo mkdir -p /var/log/nagios
sudo touch /var/log/nagios/nrpe_recovery.log
sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log
These logs show a successful automatic restart of nagios-nrpe-server after the script detected that a check had become critical.
To validate this again, stop the NRPE service on any monitored node (here, backup-d).
Only two CRITICAL states appear before the event handler restarts the service, so the UI shows few critical errors before everything is green again. Usually only the logged-in users check and NRPE Server Health switch to CRITICAL before the script runs and nagios-nrpe-server is restarted.
The log timestamps show that nagios-nrpe-server was successfully restarted at 10:55pm, around 4 minutes after it was manually stopped.
Afterwards, the states have switched from CRITICAL back to OK.
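The validation can also be scripted by counting success entries in the recovery log before and after stopping the service. The sketch below assumes the log path and message format written by restart_nrpe.sh. The function names are hypothetical, and the host argument must match what the handler logs, which is the value of $HOSTADDRESS$.

```shell
#!/bin/bash
LOGFILE="${LOGFILE:-/var/log/nagios/nrpe_recovery.log}"

# Count log lines that end with the success message for exactly this host.
restart_count() {
    local host="$1"
    [ -r "$LOGFILE" ] || { echo 0; return 0; }
    awk -v suffix=": Successfully restarted NRPE on $host" '
        length($0) >= length(suffix) &&
        substr($0, length($0) - length(suffix) + 1) == suffix { n++ }
        END { print n + 0 }' "$LOGFILE"
}

# Poll every 5 seconds until a new success entry appears or the timeout expires.
wait_for_restart() {
    local host="$1" timeout="${2:-600}" before waited=0
    before=$(restart_count "$host")
    while [ "$waited" -lt "$timeout" ]; do
        if [ "$(restart_count "$host")" -gt "$before" ]; then
            echo "restart logged for $host after about ${waited}s"
            return 0
        fi
        sleep 5
        waited=$((waited + 5))
    done
    echo "no restart logged for $host within ${timeout}s"
    return 1
}
```

One way to use it is to start `wait_for_restart backup-d` on the management server and then stop nagios-nrpe-server on backup-d.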
- #!/bin/bash
- HOST="$1"
  - Captures the first argument passed to the script (the target hostname)
- LOGFILE="/var/log/nagios/nrpe_recovery.log"
  - Defines the file where the script logs success or failure messages
- sudo -u group-d ssh -o StrictHostKeyChecking=no "$HOST" "sudo /usr/bin/systemctl restart nagios-nrpe-server"
  - SSHs into the specified host as user group-d
  - Remotely runs sudo systemctl restart nagios-nrpe-server to restart the NRPE service
  - Bypasses SSH host fingerprint prompts to avoid hanging the script
- if [ $? -eq 0 ]; then
  - Checks if the SSH command succeeded (exit code 0)
- echo "$(date): Successfully restarted NRPE on $HOST" >> "$LOGFILE"
  - Logs a success message with timestamp if the restart succeeded
- else
  - If the SSH command failed:
- echo "$(date): FAILED to restart NRPE on $HOST" >> "$LOGFILE"
  - Logs a failure message with timestamp
- exit 2
  - Exits with status code 2 to signal an error to Nagios
Purpose
This script serves as a Nagios event handler to:
- Automatically restart the nagios-nrpe-server service on remote hosts
- Log the outcome for auditing
- Enable self-healing for NRPE service outages
Challenges and Troubleshooting Summary
Configuration & Pre-Flight Errors
- Nagios failed to start due to undefined commands
  - restart-nrpe and check_owncloud_login were not defined
  - Fix: Removed obsolete check_owncloud_login from services and defined restart-nrpe in restart_handlers.cfg
- Changes in ppt_services.cfg overwritten
  - Puppet re-applied old configs, reintroducing deleted lines
  - Fix: Updated nagios_service definitions in Puppet module (config.pp) and ran puppet agent --test
SSH & Authentication Issues
- Event handler failed due to missing SSH access
  - nagios user didn’t have SSH key or access to remote hosts
  - Fix:
    - Generated key with sudo -u nagios ssh-keygen
    - Used ssh-copy-id to push the public key to group-d@<host>
    - Ensured /home/nagios/.ssh existed and was owned by nagios
- Host fingerprint prompt blocked event handler
  - nagios couldn't confirm new host keys via ssh
  - Fix: Added -o StrictHostKeyChecking=no to suppress the interactive prompt
- Permission denied when nagios tried to use group-d's key
  - nagios couldn’t access /home/group-d/.ssh/id_rsa
  - Fix: Copied the SSH key to /home/nagios/.ssh/, adjusted ownership and permissions
Sudo Problems on Remote Hosts
- sudo: a terminal is required to read the password
  - When group-d ran sudo systemctl restart via SSH
  - Fix: Updated /etc/sudoers on each monitored node:
    Defaults:group-d !requiretty
    group-d ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server
- Wrong path to systemctl caused restart to fail
  - Initially assumed /bin/systemctl, but which systemctl returned /usr/bin/systemctl
  - Fix: Updated script to use full path /usr/bin/systemctl
Logging Issues
- nrpe_recovery.log not created automatically
  - Script failed silently or wrote nothing
  - Fix: Created log file manually and ensured proper ownership:
    sudo touch /var/log/nagios/nrpe_recovery.log
    sudo chown nagios:nagios /var/log/nagios/nrpe_recovery.log
Event Handler Not Triggering
- NRPE was stopped, but nothing logged or restarted
  - Service hadn’t reached a HARD CRITICAL state yet
  - Fix: Waited ~1–2 minutes (based on check_interval and max_check_attempts) for Nagios to escalate and trigger handler
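How long Nagios takes to reach HARD CRITICAL can be estimated from the service definition. A service goes HARD after max_check_attempts consecutive non-OK results, and after the first failure the rechecks run every retry_interval. The sketch below is a rough estimate that ignores check execution time and scheduling latency. It assumes the default interval_length of 60 seconds, which is an assumption about nagios.cfg, and the function names are hypothetical.

```shell
#!/bin/bash
# Estimate seconds from the first failed check until the HARD state.
# interval_length (seconds per interval unit) is assumed to be the default 60.
seconds_to_hard() {
    local max_attempts="$1" retry_interval="$2" interval_length="${3:-60}"
    echo $(( (max_attempts - 1) * retry_interval * interval_length ))
}

# Worst case: add the wait until the next regular check notices the failure.
worst_case_seconds() {
    local max_attempts="$1" retry_interval="$2" check_interval="$3" interval_length="${4:-60}"
    local to_hard
    to_hard=$(seconds_to_hard "$max_attempts" "$retry_interval" "$interval_length")
    echo $(( check_interval * interval_length + to_hard ))
}

# Values from the check-nrpe-server definition:
# max_check_attempts=4, retry_interval=1, check_interval=1
seconds_to_hard 4 1        # prints 180
worst_case_seconds 4 1 1   # prints 240
```

With these settings the estimate is roughly three to four minutes between stopping the service and the HARD state.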
Debugging Strategy
- Regularly ran: sudo nagios4 -v /etc/nagios4/nagios.cfg
- Checked the NRPE recovery log (/var/log/nagios/nrpe_recovery.log)
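The configuration check can be chained with the reload so that a broken config never reaches the running daemon. This is a minimal sketch with a hypothetical function name. The systemd unit name nagios4 is an assumption based on the binary and paths used in this ticket.

```shell
#!/bin/bash
# Hypothetical wrapper: reload Nagios only if the pre-flight check passes.
# The unit name "nagios4" is an assumption; adjust it to the local install.
safe_reload() {
    local cfg="${1:-/etc/nagios4/nagios.cfg}"
    if sudo nagios4 -v "$cfg" >/dev/null; then
        # reload-or-restart falls back to a restart if the unit has no reload action.
        sudo systemctl reload-or-restart nagios4 && echo "reloaded"
    else
        echo "config check failed; not reloading" >&2
        return 1
    fi
}
```

When the check fails, running sudo nagios4 -v /etc/nagios4/nagios.cfg directly shows the full error output that the wrapper hides.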
https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/eventhandlers.html
https://exchange.nagios.org/directory/Addons/Monitoring-Agents
https://rt.dataraster.com/Ticket/Display.html?id=337