Ticket ID #334: Self‐Healing NRPE Service Implementation - GriffinKat/group-a GitHub Wiki
NRPE Self-Healing via Nagios Event Handler
Summary
The nagios-nrpe-server service on monitored hosts may crash unexpectedly, leading to Nagios showing the service as CRITICAL and generating false alerts. Manual restarts are inefficient and error-prone. To resolve this, we implemented an automated self-healing mechanism using Nagios event handlers to restart the NRPE service when failure is detected.
Implementation Steps
Define Nagios Service in Puppet
In the nagios module's config.pp, a service definition was added to monitor the NRPE daemon process:
nagios_service { "check-nrpe-daemon":
  service_description => "NRPE Daemon Process",
  hostgroup_name => "remote-disks",
  check_command => "check_nrpe!check_nrpe_daemon",
  event_handler => "restart-nrpe-handler",
  event_handler_enabled => 1,
  target => "/etc/nagios4/conf.d/apps_services.cfg",
  max_check_attempts => 3,
  retry_check_interval => 1,
  normal_check_interval => 5,
  check_period => "24x7",
  notification_interval => 30,
  notification_period => "24x7",
  notification_options => "w,u,c",
  contact_groups => "admins,slackgroup",
  mode => '0644'
}
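This ensures the service is checked every 5 intervals (5 minutes with the default interval_length of 60 seconds) and retried every interval, up to 3 attempts. Nagios runs the event handler whenever the service is in a soft problem state, first enters a hard state, or recovers; the handler script only acts when the state is CRITICAL.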
Define Event Handler Command in commands.cfg
In /etc/nagios4/objects/commands.cfg:
define command {
    command_name restart-nrpe-handler
    command_line sudo -u group-a /usr/local/bin/restart_nrpe.sh "$HOSTNAME$" "$SERVICESTATE$" "$STATETYPE$" "$SERVICEATTEMPT$"
}
This links the service event to the actual script using Nagios macros for host and service state information.
Define NRPE Check in nrpe.cfg
To check the NRPE process itself, this line was added to nrpe.cfg in the nrpe module's files directory:
command[check_nrpe_daemon]=/usr/lib/nagios/plugins/check_procs -w 1: -c 1: -C nrpe
Puppet ensures this command is deployed to all relevant hosts.
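The range `1:` passed to check_procs is satisfied by any count of 1 or more, and because the warning and critical thresholds are identical, a count of 0 maps straight to CRITICAL. A minimal sketch of that decision in bash follows; the function name `nrpe_proc_status` and its output text are hypothetical, and it takes the process count as an argument instead of calling the plugin.

```shell
#!/bin/bash
# Hypothetical helper mirroring the thresholds in "check_procs -w 1: -c 1: -C nrpe".
# A range of "1:" is satisfied by any count >= 1, so only a count of 0 alerts.
nrpe_proc_status() {
    local count="$1"
    if [ "$count" -ge 1 ]; then
        echo "OK: $count nrpe process(es) running"
        return 0    # Nagios plugin exit code for OK
    fi
    # Warning and critical share the same range, so critical takes precedence.
    echo "CRITICAL: $count nrpe process(es) running"
    return 2        # Nagios plugin exit code for CRITICAL
}
```

On a monitored host it could be fed with `nrpe_proc_status "$(pgrep -c -x nrpe)"`, assuming the procps version of pgrep is installed.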
Create the Event Handler Script
Path: /usr/local/bin/restart_nrpe.sh on the Nagios master (mgmt-a):
#!/bin/bash
# Event handler: restart NRPE on a remote host when Nagios reports CRITICAL.
REMOTE_HOST="$1"
SERVICE_STATE="$2"
STATE_TYPE="$3"        # passed by Nagios, not used in the decision below
SERVICE_ATTEMPT="$4"   # passed by Nagios, not used in the decision below
REMOTE_CMD="sudo systemctl restart nagios-nrpe-server.service"

if [[ "$SERVICE_STATE" == "CRITICAL" ]]; then
    echo "$(date): Restarting NRPE on $REMOTE_HOST due to CRITICAL state..."
    ssh -o BatchMode=yes -o ConnectTimeout=5 -o StrictHostKeyChecking=no group-a@"$REMOTE_HOST" "$REMOTE_CMD"
    if [ $? -eq 0 ]; then
        echo "NRPE restarted successfully on $REMOTE_HOST"
        exit 0
    else
        echo "Failed to restart NRPE on $REMOTE_HOST"
        exit 2
    fi
else
    echo "$(date): No restart needed for state $SERVICE_STATE"
    exit 0
fi
Ownership and permissions:
sudo chown nagios:nagios /usr/local/bin/restart_nrpe.sh
sudo chmod 755 /usr/local/bin/restart_nrpe.sh
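The handler receives $STATETYPE$ and $SERVICEATTEMPT$ but decides on the service state alone, so it restarts NRPE on every CRITICAL event. If one wanted to skip the first soft failure (which may be transient) and act only from the second attempt onward, the decision could be factored out roughly as below. This is a sketch, not part of the deployed script: the function name `should_restart` and the threshold of 2 are assumptions chosen to fit max_check_attempts => 3.

```shell
#!/bin/bash
# Hypothetical helper: decide whether the handler should restart NRPE.
# Arguments mirror the macros passed to restart_nrpe.sh:
#   $1 = $SERVICESTATE$, $2 = $STATETYPE$, $3 = $SERVICEATTEMPT$
should_restart() {
    local state="$1" state_type="$2" attempt="$3"

    # Only a CRITICAL state is treated as a failure worth acting on.
    [ "$state" = "CRITICAL" ] || return 1

    case "$state_type" in
        # Assumption: ignore the first soft failure, act from attempt 2 onward.
        SOFT) [ "$attempt" -ge 2 ] ;;
        # A hard CRITICAL state always triggers a restart.
        HARD) return 0 ;;
        # Unknown state type: do nothing.
        *) return 1 ;;
    esac
}
```

Inside the script the check would then read `if should_restart "$SERVICE_STATE" "$STATE_TYPE" "$SERVICE_ATTEMPT"; then ...`. The handler can also be exercised by hand on mgmt-a with sample arguments such as `apps-a CRITICAL HARD 3` in place of the Nagios macros.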
SSH Access Configuration
To allow the script to SSH into monitored hosts without a password:
The SSH key had already been generated on mgmt-a under /home/group-a/.ssh/id_rsa.pub. It was reused and copied to each remote host.
Commands used to copy SSH key:
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@apps-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@db-a "cat >> ~/.ssh/authorized_keys"
sudo cat ~/.ssh/id_rsa.pub | ssh group-a@backup-a "cat >> ~/.ssh/authorized_keys"
This ensures passwordless login for group-a.
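Before relying on the handler, it may be worth confirming that key-based login really works for every host, since the handler runs ssh with BatchMode=yes and will fail instead of prompting. A minimal sketch, assuming it is run as group-a on mgmt-a; the function name `check_ssh_access` and the `SSH_CMD` override are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report which hosts accept key-based login for group-a.
# SSH_CMD can be overridden (for example for a dry run); it defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

check_ssh_access() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # stdin is redirected so ssh cannot consume the caller's input.
        if "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "group-a@${host}" true </dev/null; then
            echo "OK: ${host}"
        else
            echo "FAILED: ${host}"
            failed=1
        fi
    done
    # Non-zero when at least one host rejected the login.
    return "$failed"
}
```

For the hosts named above, the call would be `check_ssh_access apps-a db-a backup-a`.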
Configure Sudoers via Puppet
Edited Puppet's sudo module to include:
group-a ALL=(ALL) NOPASSWD: /usr/bin/systemctl restart nagios-nrpe-server.service
nagios ALL=(group-a) NOPASSWD: /usr/local/bin/restart_nrpe.sh
This allows group-a to restart the nagios-nrpe-server service without a password, and allows nagios to run the handler script as group-a.
Validation Procedures
- Confirm NRPE is Active
- Simulate Failure
- Wait 1–2 minutes and check Nagios Web UI
  - Confirm service status goes to CRITICAL
  - This means the failure was detected
- Confirm Recovery
  - Wait a bit longer
  - The status should return to OK in the Nagios UI
  - The logs also confirm the failure and recovery process:
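
The validation steps above could also be scripted instead of watched by hand. The sketch below is one way to do the waiting part: it polls a command until it prints an expected value. The function name `wait_for_output` and its argument order are hypothetical, not part of the ticket.

```shell
#!/bin/bash
# Hypothetical helper: poll a command until it prints the expected value.
# Usage: wait_for_output EXPECTED TRIES DELAY_SECONDS COMMAND [ARGS...]
wait_for_output() {
    local expected="$1" tries="$2" delay="$3"
    shift 3
    local i actual=""
    for ((i = 1; i <= tries; i++)); do
        # The polled command may exit non-zero (e.g. service inactive); ignore that.
        actual="$("$@" 2>/dev/null)" || true
        if [ "$actual" = "$expected" ]; then
            echo "attempt $i: got '$expected'"
            return 0
        fi
        sleep "$delay"
    done
    echo "gave up after $tries attempts, last value: '$actual'"
    return 1
}
```

On a monitored host, `systemctl is-active nagios-nrpe-server` prints `active` while the service runs. After stopping the service (which needs more privileges than the restart-only sudoers rule grants), `wait_for_output active 30 10 systemctl is-active nagios-nrpe-server` would wait up to five minutes for the event handler to bring it back.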