Troubleshooting High System Load Alert
This alert triggers when the 1-minute load average, expressed as a percentage of the number of CPU cores, exceeds the configured threshold.
scalar(node_load1{instance="localhost:9100",job="node_exporter"}) * 100 / count(count(node_cpu_seconds_total{instance="localhost:9100",job="node_exporter"}) by (cpu))
Step 1: Verify the Alert
- Log into the monitoring system and confirm the alert details.
- Check if the alert is still active or if it was a temporary spike.
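To cross-check the value reported by the alert, the same ratio can be recomputed on the host itself. The following is a minimal sketch, assuming a Linux host where /proc/loadavg and nproc are available; load_percent is a hypothetical helper name, not part of the monitoring setup. Note that nproc reports the CPUs available to the current process, which can differ from the per-CPU count that node_exporter exposes.

```shell
# load_percent LOAD1 CORES: print LOAD1 * 100 / CORES with one decimal place,
# mirroring the arithmetic of the alert expression.
load_percent() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", load * 100 / cores }'
}

# Live value (assumption: Linux). The first field of /proc/loadavg is the
# 1-minute load average; nproc prints the number of usable CPUs.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  load1=$(cut -d ' ' -f 1 /proc/loadavg)
  load_percent "$load1" "$(nproc)"
fi
```

For example, load_percent 6 4 prints 150.0: a 1-minute load of 6 on a 4-core machine means more work is queued than the cores can run at once.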
Step 2: Assess the Situation
- SSH into the affected system.
- Run uptime to view the current load averages.
- Use top or htop to get an overview of system resource usage.
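The three numbers printed by uptime are the 1-, 5- and 15-minute load averages, and comparing the first with the last gives a rough idea of whether the load is still climbing or already easing off. A small sketch of that comparison, where load_trend is a hypothetical helper and the reading of /proc/loadavg assumes Linux:

```shell
# load_trend LOAD1 LOAD5 LOAD15: compare the 1-minute and 15-minute averages.
# LOAD5 is accepted so the three values can be passed in uptime's order.
load_trend() {
  awk -v one="$1" -v fifteen="$3" 'BEGIN {
    if (one + 0 > fifteen + 0)      print "rising"
    else if (one + 0 < fifteen + 0) print "falling"
    else                            print "steady"
  }'
}

# Live reading (assumption: Linux). The first three fields of /proc/loadavg
# are the same averages that uptime prints.
if [ -r /proc/loadavg ]; then
  read -r l1 l5 l15 rest < /proc/loadavg
  load_trend "$l1" "$l5" "$l15"
fi
```

A "falling" result soon after the alert fired suggests a temporary spike, while "rising" suggests the cause is still active.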
Step 3: Identify High Resource Consumers
- In top/htop, sort processes by CPU usage ('%CPU' column).
- Identify any processes consuming an unusually high amount of CPU.
- Note the process IDs (PIDs) of high consumers.
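When an interactive top session is impractical, for example when capturing output for an incident note, a one-shot listing with ps can serve the same purpose. The sketch below assumes the procps version of ps found on most Linux distributions; rank_by_cpu and top_cpu_now are hypothetical helper names. One caveat: the %CPU that ps reports is averaged over each process's lifetime, so a process that only recently started spinning may rank lower here than it does in top.

```shell
# rank_by_cpu N: read "PID %CPU COMMAND" lines on stdin and print the N
# lines with the highest %CPU (second column), highest first.
rank_by_cpu() {
  LC_ALL=C sort -k2,2nr | head -n "$1"
}

# top_cpu_now [N]: one-shot listing of the current top N CPU consumers.
# The trailing "=" after each column name suppresses the header line.
top_cpu_now() {
  ps -eo pid=,pcpu=,comm= | rank_by_cpu "${1:-10}"
}
```

Running top_cpu_now 5 prints the five highest entries; the first column holds the PIDs to carry into the next step.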
Step 4: Investigate Problematic Processes
For each high-consuming process:
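a. Run ps -fp <PID> to get more details. (ps aux | grep <PID> also works, but it matches any line that happens to contain the same digits, including the grep command itself.)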
b. Check if the process is expected to be running and consuming high resources.
c. Investigate logs related to the process (usually in /var/log/ or application-specific locations).
Step 5: Address Issues
If a process is misbehaving:
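a. Try restarting it: sudo systemctl restart <service-name> for a systemd service. For a process that is not managed as a service, kill -15 <PID> sends SIGTERM, which asks it to exit cleanly; start it again afterwards if it is still needed.
b. If a restart doesn't help, consider stopping the process temporarily: sudo systemctl stop <service-name>, or kill -9 <PID> as a last resort. SIGKILL cannot be caught, so the process gets no chance to clean up.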
c. If the high load is due to expected behavior (e.g., batch job), consider rescheduling or optimizing the task.
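For a process that is not managed by systemd, the escalation from kill -15 to kill -9 described above can be scripted so that the forceful signal is only sent after a grace period. A minimal sketch; stop_gracefully is a hypothetical helper and the 10-second default is an arbitrary assumption. For systemd units, prefer systemctl stop, since a unit configured with Restart= may simply respawn a process that was killed by hand.

```shell
# stop_gracefully PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 10, an arbitrary choice) for the process to exit, then send SIGKILL.
stop_gracefully() {
  pid=$1
  timeout=${2:-10}

  # If SIGTERM cannot be delivered, the process is already gone
  # (or belongs to another user); nothing more to do here.
  kill -TERM "$pid" 2>/dev/null || return 0

  while [ "$timeout" -gt 0 ]; do
    # kill -0 sends no signal; it only checks that the PID still exists.
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    timeout=$((timeout - 1))
  done

  # Last resort: SIGKILL cannot be caught, so no cleanup runs in the target.
  kill -KILL "$pid" 2>/dev/null || true
}
```

Usage: stop_gracefully <PID> 30 gives the process up to 30 seconds to shut down on its own before it is killed.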
Step 6: Check System Resources
- Run free -h to check memory usage. If memory is low, swapping can raise the load average through I/O wait as well as extra CPU usage.
- Use df -h to check disk usage. Full disks can cause various issues.
- Check I/O wait using iostat -x 1 (part of the sysstat package). High wait times might indicate disk issues.
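On Linux the load average counts processes in uninterruptible sleep (state D, usually waiting on disk or swap I/O) as well as runnable ones (state R), so a quick tally of both helps tell I/O-bound load from CPU-bound load. A sketch, assuming the procps version of ps; count_run_states is a hypothetical helper name:

```shell
# count_run_states: read one process state per line on stdin (as printed by
# "ps -eo state=") and report how many processes are runnable (R) and how
# many are in uninterruptible sleep (D). Only the first character is used,
# so multi-character states such as "R+" or "Ss" are handled too.
count_run_states() {
  awk '{ s = substr($1, 1, 1) }
       s == "R" { r++ }
       s == "D" { d++ }
       END { printf "R=%d D=%d\n", r, d }'
}
```

Live use: ps -eo state= | count_run_states. A D count that stays high while R is low tends to point toward storage or memory pressure rather than a runaway CPU consumer.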
Step 7: Review Recent Changes
- Check recent system or application updates that might have caused the issue.
- Review any recent configuration changes.
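One way to spot recent configuration changes is to list files under a configuration directory that were modified within the last few days. A minimal sketch; recently_modified is a hypothetical helper, and /etc in the usage note is only an assumed location for the relevant configuration:

```shell
# recently_modified DIR [DAYS]: print regular files under DIR whose contents
# were modified less than DAYS * 24 hours ago (default: 1 day).
# Errors such as unreadable subdirectories are silenced, so run it with
# sufficient privileges if a complete listing matters.
recently_modified() {
  find "$1" -type f -mtime "-${2:-1}" 2>/dev/null
}
```

Usage: recently_modified /etc 2 lists files changed in roughly the last two days. Package updates are recorded elsewhere and the location varies by distribution (for example /var/log/dpkg.log on Debian-based systems).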
Step 8: Implement Short-term Fix
Based on findings, implement a short-term fix to reduce system load. This might include stopping non-critical services, killing runaway processes, or adding resources.
Step 9: Monitor the Situation
- Continue monitoring the system load using top or htop.
- Verify that the alert resolves in the monitoring system.
Step 10: Plan Long-term Solution
If the issue is recurring, plan for a long-term solution. This might include:
- Upgrading hardware resources
- Optimizing application code
- Load balancing or scaling out the service