Monitoring & Alerting model - yurkka23/iMusic_team GitHub Wiki

Monitoring

| Metric | Measures | Infrastructure Connection | Collection Method |
| --- | --- | --- | --- |
| CPU Load | Percentage Utilization | API Servers | System agents (Zabbix, Prometheus, or Nagios). |
| Memory Usage | Usage Percentage, GB | API and Database Servers | System agents or tools (Zabbix, Nagios). |
| API Request Latency | Milliseconds | API Servers | API performance monitoring tools (New Relic, AppDynamics) analyze response times from logs or telemetry. |
| Disk Usage | GB, Percentage | Storage Servers | Disk monitoring tools (Nagios, Zabbix) track disk capacity and storage consumption using agent-based data collection. |
| Active Users | Count of Concurrent Sessions | Web Application Servers | Session tracking and real-time analytics platforms (Mixpanel, Google Analytics, or Amplitude). |
| Network Throughput | Mbps, Packet Count | Network Interfaces | Network monitoring tools (SolarWinds, Wireshark) analyze traffic and packet data on network interfaces. |
| Database Query Latency | Milliseconds | Database Servers | Database monitoring tools (Percona, AWS CloudWatch) measure query performance. |
| Login Success Rate | Percentage | Authorization Servers | Authentication logs (Active Directory, Okta) track login attempts and success rates. |
| Song Processing Latency | Milliseconds | Media Processing Servers | Media processing logs or job queue monitoring tools (RabbitMQ, Kafka) analyze job completion times. |
| Failed Logins | Count of Failed Attempts | Authorization Servers | Security monitoring tools (Fail2Ban, Splunk) analyze logs for failed login attempts. |
| Crash Rate | Percentage or Count | Application and API Servers | Crash reporting tools (Crashlytics, Sentry) capture app crash events and generate real-time alerts. |
| Error Rate | Percentage of Errors | Application, API, Database Servers | Error tracking platforms (Sentry, Raygun) log and categorize errors across the stack. |
| Failed Database Queries | Count of Failed Queries | Database Servers | Query profiling tools (Percona) monitor and log failed queries for analysis. |
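
To make the "GB, Percentage" measures for Disk Usage concrete, here is a minimal sketch of how a collection agent might sample them. It uses only Python's standard-library `shutil.disk_usage`; the function name `disk_usage_report` and the returned field names are illustrative assumptions, not part of Zabbix, Nagios, or any tool listed above.

```python
import shutil

BYTES_PER_GB = 1024 ** 3


def disk_usage_report(path="/"):
    """Return used space in GB and as a percentage for the filesystem holding `path`.

    Hypothetical helper: the name and the returned keys are assumptions
    made for illustration only.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    used_gb = usage.used / BYTES_PER_GB
    # Guard against pseudo-filesystems that report a total size of zero.
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "path": path,
        "used_gb": round(used_gb, 2),
        "used_percent": round(used_percent, 1),
    }


if __name__ == "__main__":
    print(disk_usage_report("/"))
```

In practice an agent would run such a sample on a schedule and ship the values to the monitoring backend; the sketch covers only the measurement step.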

Alerting

| Metric | Min/Max | Type | Criticality | Mitigation Plan |
| --- | --- | --- | --- | --- |
| CPU Load | > 90% | Critical | High | Optimize load balancing, reschedule tasks. |
| Memory Usage | > 85% | Critical | High | Add more memory, clear cache. |
| API Request Latency | > 200 ms | Critical | High | Optimize API queries, increase resources. |
| Disk Usage | < 15% or > 90% | Warning | Medium | Archive old data, expand disk capacity. |
| Active Users | < 50 Users | Warning | Medium | Investigate possible service outage. |
| Network Throughput | < 50 Mbps | Warning | Medium | Check network interfaces, optimize traffic flow. |
| Database Query Latency | > 300 ms | Critical | High | Optimize queries, scale database infrastructure. |
| Login Success Rate | < 80% | Warning | Medium | Investigate potential issues with authentication. |
| Song Processing Latency | > 300 ms | Warning | Medium | Monitor load on servers, check for inefficient encoding pipelines, and preemptively allocate resources. |
| Failed Logins | > 50 attempts/minute | Critical | High | Block suspicious IPs, implement CAPTCHA. |
| Crash Rate | > 3 crashes/hour | Critical | High | Debug crash logs, stabilize application. |
| Error Rate | > 5% | Critical | High | Review error logs, fix recurring issues. |
| Failed Database Queries | > 50 queries/hour | Critical | High | Investigate database schema, optimize connections. |
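
The thresholds above can be read as simple rules evaluated against the latest metric samples. The following is a minimal sketch of that idea, not the team's actual alerting implementation: the `AlertRule` class, the `evaluate` function, and the metric key names (such as `cpu_load_percent`) are hypothetical, and only a subset of the table's rows is encoded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    """One row of the alerting table (hypothetical structure)."""
    metric: str                        # assumed metric key, not from the wiki
    breached: Callable[[float], bool]  # True when the Min/Max condition is violated
    alert_type: str
    criticality: str
    mitigation: str


# A subset of the table rows, with thresholds copied from the table.
RULES: List[AlertRule] = [
    AlertRule("cpu_load_percent", lambda v: v > 90, "Critical", "High",
              "Optimize load balancing, reschedule tasks."),
    AlertRule("memory_usage_percent", lambda v: v > 85, "Critical", "High",
              "Add more memory, clear cache."),
    AlertRule("api_request_latency_ms", lambda v: v > 200, "Critical", "High",
              "Optimize API queries, increase resources."),
    AlertRule("disk_usage_percent", lambda v: v < 15 or v > 90, "Warning", "Medium",
              "Archive old data, expand disk capacity."),
    AlertRule("active_users", lambda v: v < 50, "Warning", "Medium",
              "Investigate possible service outage."),
    AlertRule("error_rate_percent", lambda v: v > 5, "Critical", "High",
              "Review error logs, fix recurring issues."),
]


def evaluate(samples: Dict[str, float]) -> List[AlertRule]:
    """Return every rule whose metric is present in `samples` and breached."""
    return [
        rule for rule in RULES
        if rule.metric in samples and rule.breached(samples[rule.metric])
    ]


if __name__ == "__main__":
    fired = evaluate({"cpu_load_percent": 95.0, "active_users": 120})
    for rule in fired:
        print(f"[{rule.alert_type}/{rule.criticality}] {rule.metric}: {rule.mitigation}")
```

Keeping each condition as a small predicate lets two-sided limits such as the Disk Usage row sit alongside one-sided ones without special cases. Rate-based rows (for example, failed logins per minute) would additionally need a time window, which this sketch leaves out.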