Ticket ID #318 NRPE Log Monitoring Implementation - GriffinKat/group-a GitHub Wiki
Implementing NRPE Log Monitoring for Syslog
Summary
As part of our proactive monitoring enhancement for database infrastructure, we identified a critical observability gap:
- No existing monitoring for critical log patterns in /var/log/syslog
- Delayed visibility into runtime errors such as ERROR, CRITICAL, and WARNING
- No alerting integration for log anomalies impacting database availability
This violates our operational goals for early detection and fast response to infrastructure incidents.
Steps to Implement NRPE Log Monitoring for Syslog
Identify Critical Log Patterns
Define the log patterns to detect in /var/log/syslog:
PATTERNS="ERROR|CRITICAL|WARNING"
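To see what this pattern catches, the sketch below runs it against a few made-up lines. The hostnames, process names, and messages are illustrative assumptions, not real syslog output, and count_matches is a hypothetical helper name. The match is case-sensitive, so a lowercase "error" is not counted.

```shell
PATTERNS="ERROR|CRITICAL|WARNING"

# Count the lines in a file that match the pattern (case-sensitive).
count_matches() {
    grep -cE "$PATTERNS" "$1"
}

# Hypothetical sample lines for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jan  1 00:00:01 db-a mariadbd[101]: ERROR: connection refused
Jan  1 00:00:02 db-a systemd[1]: Started Daily apt activities.
Jan  1 00:00:03 db-a kernel: WARNING: CPU temperature above threshold
Jan  1 00:00:04 db-a app[202]: error: lowercase is not matched
EOF

count_matches "$sample"   # prints 2 (the ERROR and WARNING lines)
rm -f "$sample"
```

If lowercase variants should also count, grep's -i flag would make the match case-insensitive, at the cost of more false positives.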
Document Normal vs. Abnormal Log Volumes
Define volume thresholds:
- OK: Less than 3 matches ✅
- WARNING: 3 to 9 matches ⚠️
- CRITICAL: 10 or more matches ❌
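The thresholds can be expressed as a small function, which makes the boundary values easy to check before they go into the plugin. This is a minimal sketch; classify_count is a hypothetical name, and the optional second and third arguments are an assumption added so the thresholds can be tuned without editing the function.

```shell
# classify_count COUNT [WARN] [CRIT]
# Prints the status name and returns the matching Nagios plugin exit code.
# WARN and CRIT default to the thresholds above (3 and 10).
classify_count() {
    if [ "$1" -ge "${3:-10}" ]; then
        echo "CRITICAL"; return 2
    elif [ "$1" -ge "${2:-3}" ]; then
        echo "WARNING"; return 1
    else
        echo "OK"; return 0
    fi
}

# Boundary values on either side of each threshold.
for n in 2 3 9 10; do
    echo "$n -> $(classify_count "$n")"
done
```

Running the loop should report OK for 2, WARNING for 3 and 9, and CRITICAL for 10.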
Create a custom script to monitor critical log patterns
Create the script at /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
#!/bin/bash
# Count how many lines in /var/log/syslog contain the words ERROR, CRITICAL, or WARNING
LOGFILE="/var/log/syslog"
PATTERNS="ERROR|CRITICAL|WARNING"

# If the log cannot be read, return UNKNOWN (exit code 3) instead of a false OK
if [ ! -r "$LOGFILE" ]; then
    echo "UNKNOWN - cannot read $LOGFILE"
    exit 3
fi

COUNT=$(grep -cE "$PATTERNS" "$LOGFILE")

# If 10 or more matching lines, return CRITICAL (exit code 2)
if [ "$COUNT" -ge 10 ]; then
    echo "CRITICAL - $COUNT log entries"
    exit 2
# If 3 to 9 matching lines, return WARNING (exit code 1)
elif [ "$COUNT" -ge 3 ]; then
    echo "WARNING - $COUNT log entries"
    exit 1
# If fewer than 3 matches, return OK (exit code 0)
else
    echo "OK - $COUNT log entries"
    exit 0
fi
Make it executable:
sudo chmod +x /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
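Because the script counts every matching line in the file, the count can only grow until the log is rotated or truncated. One possible refinement, which is not part of the implementation described here, is to count only the lines added since the previous check. The sketch below assumes a hypothetical state file that the nagios user can write to; the function name and any paths passed to it are illustrative.

```shell
# count_new_matches LOGFILE STATEFILE
# Prints how many lines added since the previous call match the pattern.
# STATEFILE (hypothetical) remembers how many lines had been seen last time.
count_new_matches() {
    last=$(cat "$2" 2>/dev/null)
    last=${last:-0}
    total=$(($(wc -l < "$1")))
    # If the log shrank (rotation or truncation), start again from the top.
    if [ "$total" -lt "$last" ]; then
        last=0
    fi
    # head caps the range at the line count just measured, and tail -n +K
    # starts at line K, so only lines last+1 through total are examined.
    head -n "$total" "$1" | tail -n "+$((last + 1))" | grep -cE "ERROR|CRITICAL|WARNING"
    echo "$total" > "$2"
}
```

A call such as count_new_matches /var/log/syslog /var/tmp/check_db_syslog.state (the state path is an assumption) would return 0 on a quiet interval even if earlier errors are still in the file.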
Ensure Script Runs as nagios user
Make sure your script has the correct permissions:
sudo chown nagios:nagios /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
sudo chmod 755 /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
Grant Read Access to /var/log/syslog for nagios user:
sudo usermod -aG adm nagios
NOTE: This allows the nagios user to read /var/log/syslog, which is group-owned by adm.
Then reboot the system, or log out and back in, so that a new session is started:
sudo reboot
Restarting nagios-nrpe-server alone is not enough: group membership only applies after a new session is started (i.e., reboot or re-login).
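Once the new session has started, it is worth confirming that the group change took effect before moving on. The following is a minimal sketch; has_group and check_log_access are hypothetical helper names, and check_log_access assumes the caller is allowed to use sudo -u.

```shell
# has_group "GROUP LIST" GROUP
# Succeeds if GROUP is one of the space-separated names in GROUP LIST.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_log_access USER GROUP LOGFILE
# Reports whether USER is in GROUP and can actually read LOGFILE.
check_log_access() {
    if ! has_group "$(id -nG "$1" 2>/dev/null)" "$2"; then
        echo "$1 is not in group $2"
        return 1
    fi
    if ! sudo -u "$1" test -r "$3"; then
        echo "$1 cannot read $3"
        return 1
    fi
    echo "$1 can read $3"
}

# Example (run on db-a): check_log_access nagios adm /var/log/syslog
```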
Add NRPE Command Configuration
Update the nrpe.cfg file within the nrpe module on the mgmt-a server and add a new command definition for the syslog monitoring check:
command[check_db_syslog_patterns]=/usr/lib/nagios/plugins/check_db_syslog_patterns.sh
Run the Puppet agent on the db-a server to apply the updated nrpe module configuration:
sudo /opt/puppetlabs/puppet/bin/puppet agent --server=mgmt-a --no-daemonize --verbose --onetime
Restart the nagios-nrpe-server:
sudo systemctl restart nagios-nrpe-server.service
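After the agent run and the restart, the command definition should be present in the NRPE configuration on db-a. The sketch below looks up a command by name in a config file; nrpe_command is a hypothetical helper, and the file path in the usage line is an assumption (the Debian default), since the Puppet module may deploy the file elsewhere.

```shell
# nrpe_command CONFIG_FILE CHECK_NAME
# Prints the command line defined for CHECK_NAME, or nothing if it is absent.
nrpe_command() {
    sed -n "s/^command\[$2\]=//p" "$1"
}

# Example (path is an assumption):
# nrpe_command /etc/nagios/nrpe.cfg check_db_syslog_patterns
```

An empty result would suggest the Puppet run did not deliver the new definition.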
Trigger a Manual Test
Run script manually:
sudo -u nagios /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
Evidence before simulating an error:
Simulate an error:
logger "ERROR: TEST ERROR MESSAGE"
Run Remote NRPE Test
From the mgmt-a server:
/usr/lib/nagios/plugins/check_nrpe -H db-a -c check_db_syslog_patterns
Evidence: NRPE responds to checks within 3-10 seconds
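When running the remote check by hand, the exit code is what Nagios acts on, not the printed text. The sketch below maps the standard plugin exit codes to their status names; status_name is a hypothetical helper.

```shell
# status_name EXIT_CODE
# Translates a Nagios plugin exit code into its status name.
status_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNEXPECTED ($1)" ;;
    esac
}

# Example (run on mgmt-a):
# /usr/lib/nagios/plugins/check_nrpe -H db-a -c check_db_syslog_patterns; status_name $?
```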
Define Nagios Service in Puppet
Modify the config.pp file within the nagios module to add a service definition for monitoring syslog alerts:
# Syslog alert service for db-a
nagios_service { "syslog-alerts":
  service_description   => "Syslog Log Alerts",
  host_name             => "db-a",
  check_command         => "check_nrpe!check_db_syslog_patterns",
  max_check_attempts    => 3,
  retry_interval        => 1,
  check_interval        => 5,
  check_period          => "24x7",
  notification_interval => 30,
  notification_period   => "24x7",
  notification_options  => "w,u,c,r",
  contact_groups        => "admins,slackgroup",
  target                => "/etc/nagios4/conf.d/ppt_services.cfg",
  mode                  => "0644",
}
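The interval settings determine how long a problem can exist before Nagios treats it as confirmed (a hard state) and notifies. As a rough worked example, assuming Nagios's default interval_length of 60 seconds (so the intervals are in minutes) and ignoring check execution time and scheduling latency; hard_state_delay is a hypothetical helper:

```shell
# hard_state_delay CHECK_INTERVAL RETRY_INTERVAL MAX_CHECK_ATTEMPTS
# Worst-case minutes from a problem appearing to the service reaching a hard
# state: wait for the next regular check, then retry until the attempt limit.
hard_state_delay() {
    echo $(( $1 + ($3 - 1) * $2 ))
}

hard_state_delay 5 1 3   # prints 7
```

With check_interval 5, retry_interval 1, and max_check_attempts 3, the first notification can therefore lag the first matching log line by up to about 7 minutes.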
Apply with:
sudo /opt/puppetlabs/bin/puppet agent --test
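Note that puppet agent --test enables detailed exit codes, so a non-zero exit status does not necessarily mean the run failed. The sketch below interprets the documented codes; puppet_result is a hypothetical helper name.

```shell
# puppet_result EXIT_CODE
# Explains the exit code of a "puppet agent --test" run.
puppet_result() {
    case "$1" in
        0) echo "no changes needed" ;;
        2) echo "changes applied" ;;
        4) echo "failures during the run" ;;
        6) echo "changes applied, with failures" ;;
        *) echo "run failed (exit code $1)" ;;
    esac
}

# Example:
# sudo /opt/puppetlabs/bin/puppet agent --test; puppet_result $?
```

A result of 2 is the expected outcome the first time the new service definition is applied.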
Restart Nagios and Apache to Apply Service:
sudo systemctl restart nagios4.service
sudo systemctl restart apache2.service
Validation Procedures for Syslog Monitoring
Use the logger command to simulate log entries:
logger "WARNING: TEST ERROR MESSAGE"
logger "ERROR: TEST ERROR MESSAGE"
logger "CRITICAL: TEST ERROR MESSAGE"
Repeat the above to increase the number of matching entries (e.g., 10+ to reach CRITICAL threshold)
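Rather than repeating the commands by hand, the test messages can be generated in a loop and piped to logger, which logs each line of standard input as a separate entry. This is a minimal sketch; gen_test_entries is a hypothetical helper, and the nrpe-test tag in the usage line is an arbitrary label chosen for illustration.

```shell
# gen_test_entries N
# Prints N test messages, cycling through WARNING, ERROR, and CRITICAL.
gen_test_entries() {
    i=1
    while [ "$i" -le "$1" ]; do
        case $((i % 3)) in
            1) level="WARNING" ;;
            2) level="ERROR" ;;
            0) level="CRITICAL" ;;
        esac
        echo "$level: TEST ERROR MESSAGE $i"
        i=$((i + 1))
    done
}

# Preview the messages without writing to syslog.
gen_test_entries 3

# To write 10 entries to syslog (enough for the CRITICAL threshold):
# gen_test_entries 10 | logger -t nrpe-test
```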
Nagios web UI with OK status for 1 entry
Nagios web UI with WARNING status for 4 entries
Nagios web UI with CRITICAL status for 10 entries
Slack notifications received:
Cleanup After Testing
sudo truncate -s 0 /var/log/syslog