Project Concept ‐ Error Handling & Analysis - Campus-Castolo/m300 GitHub Wiki

⚠️ Error Handling & Analysis

This document outlines how errors are handled within the cloud infrastructure. It covers logging standards, simulated fault conditions, corresponding CloudWatch insights, alert responses, and structured recovery procedures. It fulfills the F1 competency of the M300 matrix.


📝 ERROR LOGGING FORMAT

To ensure errors are both machine- and human-readable, logs follow a structured JSON format:

{
  "timestamp": "2025-04-15T10:00:00Z",
  "level": "error",
  "source": "ecs-task/wordpress",
  "message": "Database connection failed",
  "context": {
    "db_endpoint": "rds-instance.xyz.eu-central-1.rds.amazonaws.com",
    "task_id": "abcd1234"
  }
}

✅ Logs are streamed to CloudWatch via the /ecs/wordpress log group.
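
A minimal sketch of how an application could emit a log line in the format above. The function name `format_error_log` is hypothetical, and the sketch assumes the container's log driver forwards stdout to the /ecs/wordpress log group, which this page does not confirm.

```python
import json
from datetime import datetime, timezone


def format_error_log(source, message, context=None, now=None):
    """Return one JSON log line with the fields shown in the sample above."""
    # Use UTC with a trailing "Z" to match the sample timestamp.
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": "error",
        "source": source,
        "message": message,
        "context": context or {},
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Printing to stdout is assumed to be enough for the log driver to pick up.
    print(format_error_log(
        "ecs-task/wordpress",
        "Database connection failed",
        {
            "db_endpoint": "rds-instance.xyz.eu-central-1.rds.amazonaws.com",
            "task_id": "abcd1234",
        },
    ))
```

Keeping each record on a single line lets CloudWatch treat it as one log event, so the JSON fields stay queryable.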


🧪 SETTING UP SIMULATED FAILURE SCENARIOS

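| Scenario | How to Simulate | Expected Error | Resolution |
|---|---|---|---|
| ECS Task Crash | Manually stop the ECS task or deploy a container with a faulty entrypoint | Crash loop in ECS events and logs: the task exits and the service keeps relaunching it (ECS has no literal CrashLoopBackOff status; that term comes from Kubernetes) | Revert image version or fix entrypoint |
| ALB Health Check Failure | Temporarily stop the ECS container or block port 80/443 | ALB shows "Unhealthy Targets", 5xx errors | Restart ECS task or adjust health check |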

📝 For each scenario:

  • Run test
  • Take CloudWatch snapshot
  • Observe alarms and log data
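
The "ECS Task Crash" scenario can be triggered from a script instead of the console. The sketch below assumes a boto3 ECS client is passed in. The function name and the cluster and service names in the usage note are hypothetical. `list_tasks` and `stop_task` are the boto3 ECS calls used.

```python
def stop_one_task(ecs_client, cluster, service):
    """Stop the first running task of a service; return its ARN, or None."""
    listed = ecs_client.list_tasks(
        cluster=cluster,
        serviceName=service,
        desiredStatus="RUNNING",
    )
    task_arns = listed.get("taskArns", [])
    if not task_arns:
        # Nothing is running, so there is nothing to stop.
        return None
    ecs_client.stop_task(
        cluster=cluster,
        task=task_arns[0],
        reason="Simulated failure test",
    )
    return task_arns[0]
```

With real credentials this could be called as `stop_one_task(boto3.client("ecs"), "m300-cluster", "wordpress")`, where both names are placeholders. The `reason` string then appears on the stopped task, which makes the test run easy to tell apart from a real failure.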

📈 CloudWatch Evidence & Logs

[Two screenshots: CloudWatch evidence and logs]


📡 Alert Response Documentation

  1. Trigger condition: e.g., ECS CPU < 5% or RDS CPU > 80%
  2. Alarm: CloudWatch Alarm pushes to SNS Topic cloudwatch-alerts
  3. Notification: Email and/or SMS sent to admin
  4. Response Actions:
    • Admin reviews CloudWatch logs
    • Error is diagnosed
    • Fix is deployed (restart ECS, optimize DB, etc.)
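
Steps 1 and 2 can be sketched as the parameters for a CloudWatch alarm on the RDS CPU > 80% condition. The keys are real `put_metric_alarm` parameters in boto3. The function name, the alarm name, the 5-minute period, and the two evaluation periods are assumptions, not values taken from this project.

```python
def rds_cpu_alarm_params(db_instance_id, sns_topic_arn, threshold=80.0):
    """Build keyword arguments for cloudwatch.put_metric_alarm (illustrative)."""
    return {
        "AlarmName": f"rds-cpu-high-{db_instance_id}",  # hypothetical naming
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id},
        ],
        "Statistic": "Average",
        "Period": 300,            # assumed: 5-minute datapoints
        "EvaluationPeriods": 2,   # assumed: two breaching periods in a row
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Without AlarmActions the alarm changes state but notifies nobody.
        "AlarmActions": [sns_topic_arn],
    }
```

The result would be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`, with `sns_topic_arn` set to the ARN of the cloudwatch-alerts topic.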

[Screenshot: alert response]


🔧 Troubleshooting Flowchart

flowchart TD
  A[Alert Triggered] --> B[Check CloudWatch Logs]
  B --> C{Is ECS failing?}
  C -- Yes --> D[Review container logs & restart task]
  C -- No --> E{Is RDS under load?}
  E -- Yes --> F[Inspect slow queries or scale RDS]
  E -- No --> G[Check ALB Health & Targets]
  G --> H[Restart ECS or fix ALB routing]
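
The same decision path can be written as a small function, for example for use in a runbook script. This is a sketch: the function name and the two boolean inputs are hypothetical, and how "ECS failing" or "RDS under load" is determined is left to the caller.

```python
def next_step(ecs_failing, rds_under_load):
    """Return the action the flowchart reaches after checking CloudWatch logs."""
    if ecs_failing:
        return "Review container logs & restart task"
    if rds_under_load:
        return "Inspect slow queries or scale RDS"
    # Neither ECS nor RDS is the culprit, so fall through to the ALB branch.
    return "Check ALB Health & Targets, then restart ECS or fix ALB routing"
```

As in the flowchart, the ECS check takes priority: if both conditions hold, the ECS branch is followed first.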

🧩 Error Categorization Table

| Error Type | Cause | Detection | Resolution |
|---|---|---|---|
| ECS task crash loop | Broken container startup | ECS console, logs | Fix image, restart |
| RDS CPU Spike | Inefficient query | CloudWatch metric | Optimize or scale |
| ALB 5xx Surge | Unhealthy ECS targets | ALB metrics + logs | Restart ECS or fix app |
| Lambda Timeout | Misconfigured backup logic | Lambda logs | Adjust timeout/code |
| SNS Not Triggering | Missing alarm actions | Alarm shows OK only | Link alarm to SNS topic |
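
The "SNS Not Triggering" case can be checked programmatically. The sketch below takes a response shaped like the one boto3's `describe_alarms` returns (a `MetricAlarms` list whose entries carry `AlarmName` and `AlarmActions`) and lists the alarms that have no action attached. The function name is hypothetical.

```python
def alarms_without_actions(describe_alarms_response):
    """Return names of metric alarms that have no AlarmActions configured."""
    alarms = describe_alarms_response.get("MetricAlarms", [])
    # A missing key and an empty list are both treated as "no actions".
    return [a["AlarmName"] for a in alarms if not a.get("AlarmActions")]
```

A typical call would be `alarms_without_actions(boto3.client("cloudwatch").describe_alarms())`. Note that `describe_alarms` is paginated, so a single call may not cover every alarm in a large account.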