Ticket ID #318 NRPE Log Monitoring Implementation - GriffinKat/group-a GitHub Wiki

Implementing NRPE Log Monitoring for Syslog

Summary

As part of our proactive monitoring enhancement for database infrastructure, we identified a critical observability gap:

  • No existing monitoring for critical log patterns in /var/log/syslog

  • Delayed visibility into log entries marked ERROR, CRITICAL, or WARNING

  • No alerting integration for log anomalies impacting database availability

This violates our operational goals for early detection and fast response to infrastructure incidents.


Steps to Implement NRPE Log Monitoring for Syslog

Identify Critical Log Patterns

Define the log patterns to detect in /var/log/syslog:

PATTERNS="ERROR|CRITICAL|WARNING"
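To see what this pattern does and does not match, it can help to run it against a few sample lines first. The sketch below writes invented syslog-style lines to a temporary file (the messages are made up for illustration) and counts the matches. The match is case-sensitive, so a lowercase "error" is not counted.

```shell
PATTERNS="ERROR|CRITICAL|WARNING"

# Hypothetical sample lines; real syslog content will differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
May 21 13:00:01 db-a mysqld: ERROR: connection refused
May 21 13:00:02 db-a cron[101]: job finished
May 21 13:00:03 db-a kernel: WARNING: disk nearly full
May 21 13:00:04 db-a app: error in lowercase is not matched
EOF

# -c prints the number of matching lines, -E enables the | alternation.
match_count=$(grep -cE "$PATTERNS" "$sample")
echo "matching lines: $match_count"

rm -f "$sample"
```

Only the first and third sample lines contain one of the uppercase keywords.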

Document Normal vs. Abnormal Log Volumes

Define volume thresholds:

  • OK: Fewer than 3 matches ✅

  • WARNING: 3 to 9 matches ⚠️

  • CRITICAL: 10 or more matches ❌
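The boundaries between the three bands can be checked with a small helper. This is a minimal sketch; the function name classify_count is an assumption made for illustration and is not part of the deployed plugin.

```shell
# Map a match count to a Nagios status name using the thresholds above.
classify_count() {
  count=$1
  if [ "$count" -ge 10 ]; then
    echo "CRITICAL"
  elif [ "$count" -ge 3 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Print the status at each boundary value.
for n in 0 2 3 9 10; do
  echo "$n -> $(classify_count "$n")"
done
```

Testing the higher threshold first keeps the conditions simple: any count that reaches the second test is already known to be below 10.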


Create a Custom Script to Monitor Critical Log Patterns

Create the script at /usr/lib/nagios/plugins/check_db_syslog_patterns.sh:

#!/bin/bash

LOGFILE="/var/log/syslog"
PATTERNS="ERROR|CRITICAL|WARNING"

# If the log cannot be read, return UNKNOWN (exit code 3) instead of a false OK
if [ ! -r "$LOGFILE" ]; then
  echo "UNKNOWN - cannot read $LOGFILE"
  exit 3
fi

# Count the lines in the log that contain ERROR, CRITICAL, or WARNING
COUNT=$(grep -cE "$PATTERNS" "$LOGFILE")

# If 10 or more matching lines, return CRITICAL (exit code 2)
if [ "$COUNT" -ge 10 ]; then
  echo "CRITICAL - $COUNT log entries"
  exit 2

# If 3 to 9 matching lines, return WARNING (exit code 1)
elif [ "$COUNT" -ge 3 ]; then
  echo "WARNING - $COUNT log entries"
  exit 1

# If fewer than 3 matches, return OK (exit code 0)
else
  echo "OK - $COUNT log entries"
  exit 0
fi

Make it executable:

sudo chmod +x /usr/lib/nagios/plugins/check_db_syslog_patterns.sh

Screenshot 2025-05-21 133512
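Because check_db_syslog_patterns.sh counts matches across the whole of /var/log/syslog, the count only drops when the file is rotated or emptied. One possible variation, which is not part of this ticket's implementation, is to count only the most recent lines. In the sketch below, the function name count_recent_matches and the default window of 500 lines are illustrative assumptions.

```shell
# Count pattern matches in the last N lines of a log file.
# $1 = log file, $2 = number of trailing lines to inspect (hypothetical default: 500)
count_recent_matches() {
  tail -n "${2:-500}" "$1" | grep -cE "ERROR|CRITICAL|WARNING"
}

# Demonstrate on a temporary file with two old errors and one recent warning.
demo=$(mktemp)
printf 'ERROR: old\nERROR: old\ninfo\ninfo\nWARNING: new\n' > "$demo"

recent=$(count_recent_matches "$demo" 3)
all=$(count_recent_matches "$demo" 100)
echo "last 3 lines: $recent, whole file: $all"

rm -f "$demo"
```

A line-based window is only a rough stand-in for a time window; on a busy host the last 500 lines may cover just a few minutes.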


Ensure the Script Runs as the nagios User

Make sure the script has the correct ownership and permissions:

sudo chown nagios:nagios /usr/lib/nagios/plugins/check_db_syslog_patterns.sh
sudo chmod 755 /usr/lib/nagios/plugins/check_db_syslog_patterns.sh

Screenshot 2025-05-21 134009

Grant the nagios user read access to /var/log/syslog:

sudo usermod -aG adm nagios

NOTE: This allows the nagios user to read /var/log/syslog, which is group-owned by adm.
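The group change can be confirmed from the account database before going further. The sketch below relies on id -nG, which lists a user's group names; the helper names group_list_contains and user_in_group are assumptions introduced for illustration.

```shell
# Succeeds if a space-separated list of group names contains the given group.
# $1 = group list, $2 = group name
group_list_contains() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Succeeds if the user is a member of the group according to `id -nG`.
# $1 = user name, $2 = group name
user_in_group() {
  group_list_contains "$(id -nG "$1" 2>/dev/null)" "$2"
}

if user_in_group nagios adm; then
  echo "nagios is in adm"
else
  echo "nagios is NOT in adm (or the user does not exist)"
fi
```

This reports what the account database says. Processes that were already running as nagios keep the group list they started with.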

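Then reboot the system so that every process running as nagios starts with the new group membership:

sudo reboot

In this setup, restarting nagios-nrpe-server alone was not enough. Supplementary groups are read when a process or session starts, so a reboot (or a fresh login for interactive sessions) is the reliable way to apply the change.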


Add NRPE Command Configuration

In the nrpe module on the mgmt-a server, add a command definition for the syslog check to the nrpe.cfg file:

command[check_db_syslog_patterns]=/usr/lib/nagios/plugins/check_db_syslog_patterns.sh

image

Run the Puppet agent on the db-a server to apply the updated NRPE module configuration:

sudo /opt/puppetlabs/puppet/bin/puppet agent --server=mgmt-a --no-daemonize --verbose --onetime

image
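After the agent run, it is worth confirming that the rendered config on db-a actually contains the new command. The sketch below is a minimal check; the helper name nrpe_command_defined is hypothetical, and /etc/nagios/nrpe.cfg is the usual Debian/Ubuntu location, assumed here rather than taken from this setup. The demonstration runs against a temporary file so that it does not depend on a real NRPE install.

```shell
# Succeeds if an NRPE config file defines the given command name.
# $1 = config file, $2 = command name
nrpe_command_defined() {
  grep -qE "^[[:space:]]*command\[$2\][[:space:]]*=" "$1"
}

# Demonstrate against a temporary file that mimics nrpe.cfg.
cfg=$(mktemp)
echo 'command[check_db_syslog_patterns]=/usr/lib/nagios/plugins/check_db_syslog_patterns.sh' > "$cfg"

if nrpe_command_defined "$cfg" check_db_syslog_patterns; then
  echo "command is defined"
fi

# On db-a, the same check would be (path is an assumption):
# nrpe_command_defined /etc/nagios/nrpe.cfg check_db_syslog_patterns
```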

Restart the nagios-nrpe-server:

sudo systemctl restart nagios-nrpe-server.service

Trigger a Manual Test

Run the script manually as the nagios user:

sudo -u nagios /usr/lib/nagios/plugins/check_db_syslog_patterns.sh

Evidence before simulating an error:

Screenshot 2025-05-21 141216

Simulate an error:

logger "ERROR: TEST ERROR MESSAGE"

image


Run Remote NRPE Test

From the mgmt-a server:

/usr/lib/nagios/plugins/check_nrpe -H db-a -c check_db_syslog_patterns

image
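check_nrpe passes the remote plugin's exit code through, so the result can also be read from the shell's exit status rather than from the output text. The helper below is a small sketch of that mapping; the name nagios_status_name is an assumption. A stand-in command is used in place of check_nrpe so the example runs anywhere.

```shell
# Translate a Nagios plugin exit code into its status name.
nagios_status_name() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *) echo "out of range" ;;
  esac
}

# Stand-in for a plugin that exits with CRITICAL.
code=0
sh -c 'exit 2' || code=$?
echo "exit code $code means $(nagios_status_name "$code")"

# On the mgmt server, the same pattern would be:
# /usr/lib/nagios/plugins/check_nrpe -H db-a -c check_db_syslog_patterns
# echo "status: $(nagios_status_name $?)"
```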

Evidence that NRPE responds to checks within 3 to 10 seconds:

image


Define Nagios Service in Puppet

Modify the config.pp file within the nagios module to add a service definition for monitoring syslog alerts on db-a:

  # Syslog alert service for db-a
  nagios_service { "syslog-alerts":
    service_description     => "Syslog Log Alerts",
    host_name               => "db-a",
    check_command           => "check_nrpe!check_db_syslog_patterns",
    max_check_attempts      => 3,
    retry_interval          => 1,
    check_interval          => 5,
    check_period            => "24x7",
    notification_interval   => 30,
    notification_period     => "24x7",
    notification_options    => "w,u,c,r",
    contact_groups          => "admins,slackgroup",
    target                  => "/etc/nagios4/conf.d/ppt_services.cfg",
    mode                    => "0644",
  }

image
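The three timing parameters in this definition determine how quickly a problem turns into a notification. As a rough sketch, assuming the Nagios default interval_length of 60 seconds (so the intervals are in minutes) and ignoring scheduling jitter, the delay can be worked out as follows:

```shell
check_interval=5      # minutes between regular checks
retry_interval=1      # minutes between rechecks while the state is soft
max_check_attempts=3  # failing checks needed before the state becomes hard

# After the first failing check, (max_check_attempts - 1) rechecks follow.
confirm_minutes=$(( (max_check_attempts - 1) * retry_interval ))

# Worst case: the problem appears just after a regular check has passed.
worst_case_minutes=$(( check_interval + confirm_minutes ))

echo "first failure to hard state: ${confirm_minutes} min"
echo "worst case from log entry to hard state: ${worst_case_minutes} min"
```

With these values, a problem is confirmed about 2 minutes after the first failing check, and at most about 7 minutes after the log entry is written.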

Apply with:

sudo /opt/puppetlabs/bin/puppet agent --test

image

Restart Nagios and Apache to apply the service:

sudo systemctl restart nagios4.service
sudo systemctl restart apache2.service

Validation Procedures for Syslog Monitoring

Use the logger command to simulate log entries:

logger "WARNING: TEST ERROR MESSAGE"
logger "ERROR: TEST ERROR MESSAGE"
logger "CRITICAL: TEST ERROR MESSAGE"

Repeat the commands above to increase the number of matching entries (e.g., 10 or more to reach the CRITICAL threshold).
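Repeating the commands by hand gets tedious, so a loop can generate the entries instead. In this sketch, the function name emit_test_entries and its optional second argument are assumptions: the second argument selects the command that receives each message, defaulting to logger, so the loop can be dry-run with echo before writing to the real syslog.

```shell
# Send N test messages to a sink command (default: logger).
# $1 = number of messages, $2 = command that receives each message
emit_test_entries() {
  count=$1
  sink=${2:-logger}
  i=1
  while [ "$i" -le "$count" ]; do
    "$sink" "ERROR: TEST ERROR MESSAGE $i"
    i=$((i + 1))
  done
}

# Dry run: print the messages instead of logging them.
emit_test_entries 3 echo

# To write 10 entries to syslog and reach the CRITICAL threshold:
# emit_test_entries 10
```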

Nagios web UI showing OK status with 1 entry:

image

Nagios web UI showing WARNING status with 4 entries:

image

image

Nagios web UI showing CRITICAL status with 10 entries:

image

image

Slack notifications received:

image


Cleanup After Testing

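Clear the test entries so that the check returns to OK. This empties the whole file, including any genuine log entries, so use it only in a test environment:

sudo truncate -s 0 /var/log/syslog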

image


Ticket Reference: https://rt.dataraster.com/Ticket/Display.html?id=318