Ticket 321 - NRPE Log Monitoring Implementation
Task
Implementation Tasks
- Identify critical log patterns requiring alerts
- Document normal vs. abnormal log volumes
- Establish severity thresholds (Warning: 3+ matches, Critical: 10+ matches)
- Use efficient grep pattern matching
- Ensure the script runs as the nagios user
Validation Procedures
- Test Cases:
- Force test errors: logger "TEST ERROR MESSAGE"
- Verify Nagios alert transitions (OK → WARNING → CRITICAL)
Acceptance Criteria
- ✔ Custom script detects all specified log patterns
- ✔ NRPE responds to checks within 3-10 seconds
- ✔ Nagios generates alerts within 5 minutes of threshold breach
- ✔ Uses efficient grep pattern matching
- ✔ Documentation includes:
- Script parameters and thresholds
Documentation & Deliverables:
- Clear documentation with evidence of before and after logs
- Script and clear description of how script works
- Evidence of log/syslog and alert generated in Slack
Steps Taken
Normal vs Abnormal Log Volumes
- Normal Log Volume: < 3 matching entries (No issue, everything normal)
- Abnormal Volume (Warning): 3 to 9 entries matching ERROR/CRITICAL/WARNING
- Abnormal Volume (Critical): 10 or more entries matching ERROR/CRITICAL/WARNING
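The volume bands above can be sketched as a small shell function. This is only an illustration of the threshold logic: `classify_count` is a hypothetical helper name, not part of the deployed plugin, and the defaults of 3 and 10 are taken from the ticket.

```shell
#!/bin/bash
# Map a match count to a Nagios state name.
# Arguments: count, warning threshold (default 3), critical threshold (default 10).
# classify_count is a hypothetical helper used only for illustration.
classify_count() {
    if [ "$1" -ge "${3:-10}" ]; then
        echo "CRITICAL"
    elif [ "$1" -ge "${2:-3}" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Print the state at the boundaries of each band.
for n in 0 2 3 9 10; do
    echo "$n -> $(classify_count "$n")"
done
```

The critical threshold is tested first so that a count of 10 or more is never reported as a warning.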
Step 1: Create the custom log check script
sudo nano /usr/lib/nagios/plugins/check_syslog_alerts.sh
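#!/bin/bash
# Nagios plugin: count syslog lines matching the alert patterns.
LOG_FILE="/var/log/syslog"
PATTERNS="ERROR|CRITICAL|WARNING"

# grep -c counts matching lines directly, so no pipe to wc is needed.
COUNT=$(grep -cE "$PATTERNS" "$LOG_FILE")

# Compare against the thresholds: 10+ is CRITICAL, 3+ is WARNING.
if [ "$COUNT" -ge 10 ]; then
    echo "CRITICAL - $COUNT matching log entries found"
    exit 2
elif [ "$COUNT" -ge 3 ]; then
    echo "WARNING - $COUNT matching log entries found"
    exit 1
else
    echo "OK - $COUNT matching log entries found"
    exit 0
fi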
Parameter Explanation:
- LOG_FILE: Target log file to scan
- PATTERNS: Regex match for ERROR|CRITICAL|WARNING
- COUNT: Number of matching lines
- Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
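The script above hardcodes its thresholds and reports an unreadable log the same way as an empty one. A possible variant is sketched below. It assumes a hypothetical function `check_log` with the argument order file, warning threshold, critical threshold, and it adds exit code 3 (UNKNOWN in the Nagios plugin convention) for a log file that cannot be read. On many Debian/Ubuntu systems /var/log/syslog is readable only by root and the adm group, so this case may matter when the check runs as the nagios user.

```shell
#!/bin/bash
# Sketch: same check, with thresholds as arguments and an UNKNOWN state.
# check_log and its argument order are assumptions, not the deployed plugin.
check_log() {
    log_file="$1"
    warn="${2:-3}"
    crit="${3:-10}"
    patterns="ERROR|CRITICAL|WARNING"

    # An unreadable file should not be reported as "0 matches".
    if [ ! -r "$log_file" ]; then
        echo "UNKNOWN - cannot read $log_file"
        return 3
    fi

    # grep -c prints the number of matching lines (0 when none match).
    count=$(grep -cE "$patterns" "$log_file")

    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count matching log entries found"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count matching log entries found"
        return 1
    fi
    echo "OK - $count matching log entries found"
    return 0
}

# Example (commented out; a real plugin would end with: check_log ...; exit $?):
# check_log /var/log/syslog 3 10
```

Running the real plugin with `sudo -u nagios` instead of plain `sudo` would exercise the same file permissions that NRPE sees.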
Step 2: Make the script executable
sudo chmod +x /usr/lib/nagios/plugins/check_syslog_alerts.sh
Step 3: Test script manually
sudo /usr/lib/nagios/plugins/check_syslog_alerts.sh
Step 4: Simulate a log entry
logger "ERROR: TEST ERROR MESSAGE"
sudo /usr/lib/nagios/plugins/check_syslog_alerts.sh
Expected output (assuming the log holds no other matching lines; a single match is below the WARNING threshold of 3):
OK - 1 matching log entries found
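A single test message is not enough to cross either threshold, so forcing the OK → WARNING → CRITICAL transitions needs several entries. The sketch below wraps `logger` in a loop. The function name `emit_test_errors` and the `nrpe-test` tag are hypothetical choices for illustration; `-t` is the standard `logger` option for setting a tag.

```shell
#!/bin/bash
# Sketch: write N tagged test errors to syslog so the match count rises by N.
# emit_test_errors and the "nrpe-test" tag are hypothetical names.
emit_test_errors() {
    n="$1"
    i=1
    while [ "$i" -le "$n" ]; do
        logger -t nrpe-test "ERROR: TEST ERROR MESSAGE $i of $n"
        i=$((i + 1))
    done
}

# Examples (commented out so that sourcing this file writes nothing to syslog):
# emit_test_errors 3    # adds 3 matches, enough for WARNING on a log with none
# emit_test_errors 10   # adds 10 matches, enough for CRITICAL
```

Because the plugin counts every matching line in the file, the state only drops back to OK once the test entries leave /var/log/syslog, for example after log rotation.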
Step 5: Open the NRPE config file (on mgmt server):
sudo nano /etc/puppetlabs/code/modules/nagios_nrpe/files/nrpe.cfg
Add this line at the bottom of the file:
command[check_syslog_alerts]=/usr/lib/nagios/plugins/check_syslog_alerts.sh
Then apply:
sudo puppet agent --test
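Since a missing command definition makes NRPE answer "Command not defined", it can help to confirm that the line actually reached the deployed config before testing remotely. The sketch below assumes a hypothetical helper `nrpe_command_defined`; the deployed path `/etc/nagios/nrpe.cfg` in the usage comment is the usual Debian/Ubuntu location and is an assumption about this environment.

```shell
#!/bin/bash
# Sketch: succeed if an NRPE config file defines the given command.
# nrpe_command_defined is a hypothetical helper name.
nrpe_command_defined() {
    name="$1"
    cfg="$2"
    # Match lines of the form: command[<name>]=...
    grep -Eq "^command\[$name\]=" "$cfg"
}

# Example on the monitored host (path is an assumption):
# nrpe_command_defined check_syslog_alerts /etc/nagios/nrpe.cfg && echo "defined"
```

If the line is present but NRPE still reports the command as undefined, the NRPE daemon may be running with the old config and need a restart, unless the Puppet module already takes care of that.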
Step 6: Remote NRPE Test (on mgmt server)
/usr/lib/nagios/plugins/check_nrpe -H db-d -c check_syslog_alerts
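The remote test prints the plugin's message, but Nagios acts on the exit code. A small sketch for reading that code after the command above is shown here; `state_name` is a hypothetical helper, and treating every code other than 0, 1, and 2 as UNKNOWN is a simplification of the plugin convention, where 3 means UNKNOWN.

```shell
#!/bin/bash
# Sketch: translate a Nagios plugin exit code into its state name.
# state_name is a hypothetical helper; codes other than 0-2 are shown as UNKNOWN.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Example on the mgmt server (commented out; needs check_nrpe and host db-d):
# /usr/lib/nagios/plugins/check_nrpe -H db-d -c check_syslog_alerts
# echo "state: $(state_name $?)"
```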
Step 7: Define service in Nagios config (on mgmt)
sudo nano /etc/puppetlabs/code/modules/nagios/manifests/config.pp
Add:
nagios_service { "syslog-alerts":
service_description => "Syslog Log Alerts",
host_name => "db-d",
check_command => "check_nrpe!check_syslog_alerts",
max_check_attempts => 3,
retry_interval => 1,
check_interval => 5,
check_period => "24x7",
notification_interval => 30,
notification_period => "24x7",
notification_options => "w,u,c,r",
contact_groups => "slackgroup",
target => "/etc/nagios4/conf.d/ppt_services.cfg",
mode => "0644",
}
Apply config:
sudo puppet agent --test
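The timing values in the service definition determine how quickly an alert can fire. As a rough worked example, the sketch below estimates the worst-case delay between a threshold breach and the first notification. It assumes the Nagios default of one-minute interval units, and that notifications are sent only once the service reaches a hard state after `max_check_attempts` checks; `worst_case_minutes` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: worst-case minutes from threshold breach to a hard (notifying) state.
# Assumes 1-minute interval units and notification on hard state only.
# Arguments: check_interval, retry_interval, max_check_attempts.
worst_case_minutes() {
    # Up to one full check_interval until the first failing check,
    # then (max_check_attempts - 1) retries spaced by retry_interval.
    echo $(( $1 + ($3 - 1) * $2 ))
}

# Values from the service definition above: 5 + (3 - 1) * 1
worst_case_minutes 5 1 3   # prints 7
```

Under these assumptions the worst case is 7 minutes, which would be slightly over the 5-minute acceptance criterion; a shorter `check_interval` would be one way to close that gap.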
Step 8: Restart Nagios (on mgmt server)
sudo systemctl restart nagios4
Then you should be able to see this:
🚀 Validation Results
✅ Nagios Service View:
- Status: OK
- Last Check: Shows successful script execution
- State Info: OK - X matching log entries found
💬 Slack Alert Transitions:
10:01 AM - UNKNOWN: NRPE: Command not defined
10:06 AM - OK: Syslog Log Alerts is OK
10:33 AM - UNKNOWN: Command not defined (before fix)
11:03 AM - UNKNOWN: Command not defined (before fix)
✉️ Final Slack Message:
db-d/Syslog Log Alerts is OK
Challenges
External Resources
Ticket Reference
https://rt.dataraster.com/Ticket/Display.html?id=321