Resiliency model - yurkka23/iMusic_team GitHub Wiki

Overview

Component Interaction Diagram (CID)

[Figure: CID]

Resilience Modeling and Analysis (RMA)

Discover Phase: Identify Failures

| ID | Interaction | Failure Short Name | Failure Description | Response |
|----|-------------|--------------------|---------------------|----------|
| 1 | Listener → Authentication Server | Authentication Timeout | The server fails to respond within the expected time due to high traffic. | Notify the user and retry the request automatically after a delay. Monitor server performance and scale up if necessary. |
| 2 | Listener → Recommendation API | Data Unavailable | Recommendation data is missing due to a failure in the Analytics Database. | Provide fallback recommendations or display a user-friendly error. Alert the database team to resolve the issue. |
| 3 | Streaming Server → Music Database | Query Failure | The database fails to return metadata for the requested track due to connection issues. | Retry the query. If the failure persists, display a "track unavailable" message. Log the incident for further investigation. |
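
The retry-based responses for failures 1 and 3 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the helper names (`call_with_retry`, `track_metadata_or_message`), the attempt count, and the delay values are all assumptions, and `print` stands in for a real logger.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,           # assumed value, tune per service
    base_delay: float = 0.5,     # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run operation, retrying on connection errors and timeouts.

    The delay doubles after each failed attempt. Returns the result,
    or None once every attempt has failed.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as exc:
            # Log the incident for further investigation.
            print(f"attempt {attempt + 1} failed: {exc}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


def track_metadata_or_message(fetch: Callable[[], dict]) -> dict:
    """Return track metadata, or a "track unavailable" placeholder if retries fail."""
    result = call_with_retry(fetch)
    return result if result is not None else {"status": "track unavailable"}
```

The `sleep` parameter is injectable so the retry schedule can be checked in tests without actually waiting.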

Rate Phase: Analyze Failures

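| ID | Interaction | Impact | Likelihood | Time to Detect (TTD) | Time to Recover (TTR) | Risk (Impact × Likelihood) |
|----|-------------|--------|------------|----------------------|-----------------------|----------------------------|
| 1 | Listener → Authentication Server | High | Medium | < 5 minutes | 10 minutes | High |
| 2 | Listener → Recommendation API | Medium | High | 5–15 minutes | 20 minutes | Medium |
| 3 | Streaming Server → Music Database | High | Low | < 5 minutes | 15 minutes | Medium |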

Act Phase: Mitigation Strategies

Scaling for High Traffic:

  • Introduce auto-scaling for the Authentication Server during peak traffic.
  • Implement caching for frequent authentication requests.
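
One way to read "caching for frequent authentication requests" is a short-lived cache of token validation results, so repeated checks for the same token do not hit the Authentication Server each time. The sketch below assumes a hypothetical `validate` callable that performs the real server check; the class name `TokenCache` and the 60-second TTL are illustrative choices, not part of the iMusic design.

```python
import time
from typing import Callable, Dict, Tuple


class TokenCache:
    """Caches token validation results for a limited time (TTL)."""

    def __init__(
        self,
        validate: Callable[[str], bool],   # hypothetical call to the Authentication Server
        ttl_seconds: float = 60.0,         # assumed TTL; keep short for security
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._validate = validate
        self._ttl = ttl_seconds
        self._clock = clock
        # token -> (validation result, expiry timestamp)
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_valid(self, token: str) -> bool:
        now = self._clock()
        hit = self._entries.get(token)
        if hit is not None and hit[1] > now:
            return hit[0]                  # fresh cached answer, no server call
        result = self._validate(token)     # cache miss or expired entry
        self._entries[token] = (result, now + self._ttl)
        return result
```

A short TTL limits how long a revoked token could still be accepted from the cache, which is the main trade-off of this approach.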

Fallback for Recommendations:

  • Cache popular recommendations to serve users when the Analytics Database is unavailable.
  • Regularly back up recommendation data.
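
The fallback described above might look like the following sketch. It assumes a hypothetical `fetch_personalised` callable backed by the Analytics Database that raises `ConnectionError` when the database is unreachable; the class and method names are illustrative only.

```python
from typing import Callable, List


class RecommendationService:
    """Serves personalised recommendations, falling back to cached popular tracks."""

    def __init__(self, fetch_personalised: Callable[[str], List[str]]) -> None:
        # Hypothetical call that reads from the Analytics Database.
        self._fetch = fetch_personalised
        self._popular: List[str] = []

    def refresh_popular(self, tracks: List[str]) -> None:
        """Store a snapshot of popular tracks while the database is healthy."""
        self._popular = list(tracks)

    def recommend(self, user_id: str) -> List[str]:
        try:
            return self._fetch(user_id)
        except ConnectionError:
            # Analytics Database unavailable: serve the cached popular list.
            return list(self._popular)
```

If the cache is still empty when the database fails, `recommend` returns an empty list, at which point the caller could show the user-friendly error mentioned in the Discover phase.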

Database Redundancy:

  • Use database replication and load balancing for the Music Database.
  • Monitor database health and configure automated failover.
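
In production, failover is normally handled by the database or its proxy layer, but the idea can be illustrated at the application level. In this sketch each node is a hypothetical callable that returns track metadata or raises `ConnectionError`; `query_with_failover` is an assumed helper name, not an existing API.

```python
from typing import Callable, Optional, Sequence

# A node is any callable that takes a track ID and returns its metadata,
# raising ConnectionError when that node is unreachable (assumption).
Node = Callable[[str], dict]


def query_with_failover(nodes: Sequence[Node], track_id: str) -> Optional[dict]:
    """Try the primary first, then each replica in order.

    Returns the first successful result, or None if every node fails.
    """
    for node in nodes:
        try:
            return node(track_id)
        except ConnectionError:
            continue  # this node is down, try the next one
    return None
```

For example, `query_with_failover([primary, replica], "track-42")` would return the replica's answer whenever the primary raises `ConnectionError`.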