Monitoring - MSF-OCG/LIME-EMR GitHub Wiki

Summary of monitors

| Component | Main Metric/KPI | Alert Threshold | Actions | HQ Infra / Field Infra / App Team / Dev Team / Local Key User |
|---|---|---|---|---|
| Linux VM & network | CPU usage | CPU > 80% | Investigate high-usage processes | X X X |
| | Memory usage | Memory > 75% | Adjust resource allocation | X X X X |
| | Disk space | Disk < 20% free | Free or expand disk | |
| | Network availability | Instance not reachable | Investigate network and cell availability | |
| EMR Frontend (OpenMRS) | Uptime & health checks | Downtime > 5 min | Restart container if health check fails | X X X |
| URL: http://hostname, Port 80 | HTTP error rates (4xx, 5xx) | 4xx/5xx > 2% of requests | Investigate application logs for performance/config issues | X X X |
| | Response time | Response time > 10s | Investigate performance issues | X X X |
| EMR Admin (OpenMRS) | Uptime & health checks | Downtime > 5 min | Restart container if health check fails | X X X |
| URL: http://hostname/openmrs/admin, Port 80 | HTTP error rates (4xx, 5xx) | 4xx/5xx > 2% of requests | Investigate application logs for performance/config issues | X X X |
| | Response time | Response time > 20s | Investigate performance issues | X X X |
| Interoperability Service (OpenFN) | Uptime & health checks | Downtime > 5 min | Restart container if health check fails | X X X |
| URL: http://hostname, Port 4000 | HTTP error rates (4xx, 5xx) | 4xx/5xx > 2% of requests | Investigate application logs for performance/config issues | X X X |
| | Response time | Response time > 10s | Investigate performance issues | X X X |
| Backups & Storage | Docker volume backup success/failure (cron job) | Any backup job failure | Rerun or fix failing backups | X X |
| (Full & Incremental) | Storage capacity usage (NAS) | Storage usage > 80% | Clean up or expand storage | X X |
| | Restore test results | Restore test failure | Soft backup restore upon backup process completion | |
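
The thresholds in the table above can be read as simple comparison rules. The following is a minimal Python sketch of that idea, not the project's actual monitoring setup: the names `VmSample`, `vm_alerts`, `http_error_rate_percent`, and `service_alerts` are hypothetical, and the sketch assumes the metrics have already been collected by some other tool. The response-time limit is passed as a parameter because it differs per service (10s for the frontend and OpenFN, 20s for the admin page).

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the summary table; constant names are illustrative.
CPU_MAX_PERCENT = 80.0
MEMORY_MAX_PERCENT = 75.0
DISK_MIN_FREE_PERCENT = 20.0
HTTP_ERROR_MAX_PERCENT = 2.0


@dataclass
class VmSample:
    """One hypothetical reading of VM metrics, all in percent."""
    cpu_percent: float
    memory_percent: float
    disk_free_percent: float


def vm_alerts(sample: VmSample) -> List[str]:
    """Return the alert messages triggered by a VM sample."""
    alerts = []
    if sample.cpu_percent > CPU_MAX_PERCENT:
        alerts.append("CPU > 80%: investigate high-usage processes")
    if sample.memory_percent > MEMORY_MAX_PERCENT:
        alerts.append("Memory > 75%: adjust resource allocation")
    if sample.disk_free_percent < DISK_MIN_FREE_PERCENT:
        alerts.append("Disk < 20% free: free or expand disk")
    return alerts


def http_error_rate_percent(status_codes: List[int]) -> float:
    """Share of 4xx/5xx responses among all requests, in percent."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return 100.0 * errors / len(status_codes)


def service_alerts(status_codes: List[int],
                   response_time_s: float,
                   max_response_time_s: float) -> List[str]:
    """Return the alert messages triggered by one service's HTTP metrics."""
    alerts = []
    if http_error_rate_percent(status_codes) > HTTP_ERROR_MAX_PERCENT:
        alerts.append("4xx/5xx > 2% of requests: investigate application logs")
    if response_time_s > max_response_time_s:
        alerts.append("Response time too high: investigate performance issues")
    return alerts
```

For example, `vm_alerts(VmSample(cpu_percent=85, memory_percent=50, disk_free_percent=30))` would report only the CPU alert, and `service_alerts(codes, 12.0, 10.0)` would flag the response time for a service with a 10s limit. Uptime and backup checks are left out because they depend on how health checks and cron jobs are wired in a given deployment.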