Resolving High System Load Alert

Alert Description:

This alert triggers when the 1-minute load average, expressed as a percentage of the number of available CPU cores, exceeds the configured threshold.

Alert Rule:

scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
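
To make the rule concrete, the same arithmetic can be reproduced on the host. This is a minimal sketch, assuming a POSIX shell with awk; the function name load_percent is hypothetical and not part of the alert configuration.

```shell
# load_percent LOAD CORES
# Prints LOAD as a percentage of CORES, mirroring
# node_load1 * 100 / <number of CPU cores> from the alert rule.
load_percent() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load * 100 / cores }'
}

# Live values on Linux (assumes /proc/loadavg and nproc are available):
# load_percent "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

For example, load_percent 6 4 prints 150.0: a 1-minute load of 6 on a 4-core machine is 150% of core capacity.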

Step 1: Verify the Alert

Log into the monitoring system and confirm the alert details.
Check if the alert is still active or if it was a temporary spike.

Step 2: Assess the Situation

SSH into the affected system.
Run uptime to view the current load averages.
Use top or htop to get an overview of system resource usage.

Step 3: Identify High Resource Consumers

In top/htop, sort processes by CPU usage ('%CPU' column).
Identify any processes consuming an unusually high amount of CPU.
Note the process IDs (PIDs) of high consumers.
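
When an interactive top/htop session is not practical, the PIDs of heavy consumers can be pulled from ps output instead. The sketch below is one possible approach, assuming the procps version of ps found on most Linux systems; top_cpu_pids is a hypothetical helper name and the 50% cutoff is only an example. Note that ps reports CPU usage averaged over the lifetime of the process, so the numbers can differ from what top shows.

```shell
# top_cpu_pids THRESHOLD
# Reads lines of the form "PID %CPU COMMAND" on stdin and prints the PIDs
# whose %CPU value is at or above THRESHOLD.
top_cpu_pids() {
    awk -v min="$1" '$2 + 0 >= min + 0 { print $1 }'
}

# Live usage (procps ps): list processes sorted by CPU, without a header line.
# ps -eo pid,pcpu,comm --sort=-pcpu --no-headers | top_cpu_pids 50
```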

Step 4: Investigate Problematic Processes

For each high-consuming process:
a. Run ps -fp <PID> to get more details (ps aux | grep <PID> also works, but can match unrelated lines that merely contain the number).
b. Check whether the process is expected to be running and consuming this much CPU.
c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).
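
The full command line of a process often shows which application or config file it belongs to, which helps locate the right logs. As a rough sketch, assuming a Linux host with /proc mounted, it can be read directly; proc_cmdline is a hypothetical helper name.

```shell
# proc_cmdline PID
# Prints the full command line of PID as recorded in /proc.
# Arguments are NUL-separated in /proc/<PID>/cmdline, so they are
# converted to spaces for display. Returns non-zero if the PID is gone.
proc_cmdline() {
    [ -r "/proc/$1/cmdline" ] || return 1
    tr '\0' ' ' < "/proc/$1/cmdline"
    echo
}

# Example: proc_cmdline 1234
```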

Step 5: Address Issues

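If a process is misbehaving:
a. Try restarting it: sudo systemctl restart <service-name> for a managed service, or kill -15 <PID> to request a graceful shutdown (the process only comes back if a supervisor restarts it).
b. If a restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name>, or kill -9 <PID> as a last resort, since SIGKILL gives the process no chance to clean up.
c. If the high load is due to expected behavior (e.g., a batch job), consider rescheduling or optimizing the task.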
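
The escalation from kill -15 to kill -9 can be scripted so that SIGKILL is only sent when the process ignores SIGTERM. This is a minimal sketch under stated assumptions: stop_gracefully is a hypothetical helper, the 10-second default wait is arbitrary, and it is meant for processes not managed by systemd (for services, prefer systemctl).

```shell
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS (default 10) for the process
# to exit, and only then escalates to SIGKILL.
stop_gracefully() {
    pid="$1"
    timeout="${2:-10}"
    # If SIGTERM cannot be delivered, the process is already gone
    # (or we lack permission); nothing more to do here.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$timeout" -gt 0 ]; do
        # kill -0 only checks whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    # Still alive after the grace period: force termination.
    kill -KILL "$pid" 2>/dev/null
    return 0
}

# Example: stop_gracefully 1234 15
```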

Step 6: Check System Resources

Run free -h to check memory usage. If memory is low, swapping can drive up the load average, because on Linux processes blocked waiting on disk I/O are counted in the load.
Use df -h to check disk usage. Full disks can cause various issues.
Check I/O wait using iostat -x 1. A high %iowait value or long device wait times might indicate disk issues.
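
The disk usage check can be reduced to a single filter over df output. The sketch below assumes the POSIX output format of df -P and mount points without spaces; full_filesystems is a hypothetical helper name and the 90% cutoff is only an example.

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem whose usage is at or above THRESHOLD percent.
full_filesystems() {
    awk -v min="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= min + 0) print $6, $5
    }'
}

# Live usage: df -P | full_filesystems 90
```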

Step 7: Review Recent Changes

Check recent system or application updates that might have caused the issue.
Review any recent configuration changes.
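
One quick way to spot recent configuration changes is to list files modified within the last few days. This is a rough sketch, assuming configuration lives under a directory such as /etc; recent_changes is a hypothetical helper name, and application-specific config directories would need to be checked the same way.

```shell
# recent_changes DIR DAYS
# Lists regular files under DIR that were modified less than DAYS days ago.
recent_changes() {
    find "$1" -type f -mtime -"$2" 2>/dev/null
}

# Example: recent_changes /etc 2
```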

Step 8: Implement Short-term Fix

Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.

Step 9: Monitor the Situation

Continue monitoring the system load using top or htop.
Verify that the alert resolves in the monitoring system.
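
Instead of watching top by hand, the load can be polled until it drops back under a chosen percentage of core capacity. The sketch below assumes a Linux host with /proc/loadavg and nproc; wait_for_load_below is a hypothetical helper, and the threshold passed to it should match whatever the alert is configured with.

```shell
# wait_for_load_below THRESHOLD_PERCENT [ATTEMPTS] [INTERVAL_SECONDS]
# Polls the 1-minute load average and returns 0 as soon as it is below
# THRESHOLD_PERCENT of the core count, or 1 if it never drops in time.
wait_for_load_below() {
    threshold="$1"
    attempts="${2:-30}"
    interval="${3:-10}"
    cores="$(nproc)"
    while [ "$attempts" -gt 0 ]; do
        # First field of /proc/loadavg is the 1-minute load average.
        load="$(cut -d ' ' -f 1 /proc/loadavg)"
        if awk -v l="$load" -v c="$cores" -v t="$threshold" \
            'BEGIN { exit !(l * 100 / c < t) }'; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep "$interval"
    done
    return 1
}

# Example: wait_for_load_below 80 30 10 && echo "load is back to normal"
```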

Step 10: Plan Long-term Solution

If the issue is recurring, plan for a long-term solution. This might include:

Upgrading hardware resources
Optimizing application code
Load balancing or scaling out the service

Troubleshooting High System Load Alert - nwanoch/hng_boilerplate_python_fastapi_web GitHub Wiki

Prepared By Devops Python Team
