Project Concept ‐ Error Handling & Analysis - Campus-Castolo/m300 GitHub Wiki

⚠️ Error Handling & Analysis

This document outlines how errors are handled within the cloud infrastructure. It covers logging standards, simulated fault conditions, corresponding CloudWatch insights, alert responses, and structured recovery procedures. It fulfills the F1 competency of the M300 matrix.

📝 ERROR LOGGING FORMAT

To ensure errors are both machine- and human-readable, logs follow a structured JSON format:

{
  "timestamp": "2025-04-15T10:00:00Z",
  "level": "error",
  "source": "ecs-task/wordpress",
  "message": "Database connection failed",
  "context": {
    "db_endpoint": "rds-instance.xyz.eu-central-1.rds.amazonaws.com",
    "task_id": "abcd1234"
  }
}

✅ Logs are streamed to the CloudWatch log group /ecs/wordpress.
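
As an illustration of the format above, the following is a minimal Python sketch that produces one such log line. The helper name `format_error_log` is hypothetical and not part of the project. It also assumes the container writes to stdout and that the ECS task definition forwards stdout to the log group, for example through the `awslogs` log driver.

```python
import json
from datetime import datetime, timezone


def format_error_log(source, message, context, now=None):
    """Return one JSON log line matching the structure shown above."""
    # Use the current UTC time unless a fixed time is passed in (handy for tests).
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": "error",
        "source": source,
        "message": message,
        "context": context,
    }
    return json.dumps(record)


if __name__ == "__main__":
    # Printing to stdout is enough when the log driver ships stdout to CloudWatch.
    print(format_error_log(
        "ecs-task/wordpress",
        "Database connection failed",
        {"db_endpoint": "rds-instance.xyz.eu-central-1.rds.amazonaws.com",
         "task_id": "abcd1234"},
    ))
```

Keeping each record on a single line makes it easier to filter on fields such as `level` or `source` in CloudWatch Logs.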

🧪 SETTING UP SIMULATED FAILURE SCENARIOS

Scenario	How to Simulate	Expected Error	Resolution
ECS Task Crash	Manually stop ECS task or deploy container with faulty entrypoint	Task enters STOPPED state; stopped reason such as "Essential container in task exited" in ECS events	Revert image version or fix entrypoint
ALB Health Check Failure	Temporarily stop ECS container or block port 80/443	ALB shows "Unhealthy Targets", 5xx errors	Restart ECS task or adjust health check

📝 For each scenario:

- Run the test
- Take a CloudWatch snapshot
- Observe alarms and log data
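
The "manually stop ECS task" simulation could also be scripted. The sketch below assumes a boto3 ECS client is passed in. The function name `stop_one_task` and the cluster and service names in the usage comment are hypothetical placeholders.

```python
def stop_one_task(ecs, cluster, service, reason="Simulated failure test"):
    """Stop the first running task of a service and return its ARN, or None."""
    # list_tasks returns the ARNs of tasks that match the filter.
    listed = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="RUNNING"
    )
    task_arns = listed.get("taskArns", [])
    if not task_arns:
        return None
    # stop_task records the reason so the test is identifiable in ECS events.
    ecs.stop_task(cluster=cluster, task=task_arns[0], reason=reason)
    return task_arns[0]


# Usage (placeholder names, requires boto3 and AWS credentials):
#   import boto3
#   stop_one_task(boto3.client("ecs"), "m300-cluster", "wordpress-service")
```

If the task belongs to an ECS service, the service scheduler should start a replacement task, which gives a convenient window to observe the alarms and log data.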

📈 CloudWatch Evidence & Logs

📡 Alert Response Documentation

Trigger condition: e.g., ECS CPU utilization < 5% or RDS CPU utilization > 80%
Alarm: CloudWatch alarm publishes to the SNS topic cloudwatch-alerts
Notification: Email and/or SMS is sent to the admin
Response Actions:
- Admin reviews CloudWatch logs
- Error is diagnosed
- Fix is deployed (restart ECS, optimize DB, etc.)
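
To make the trigger-to-notification chain concrete, here is a hedged sketch of the parameters for an RDS CPU alarm that publishes to an SNS topic. The function name, alarm name, period, and evaluation periods are assumptions chosen for illustration. Only the 80% threshold and the topic name cloudwatch-alerts come from the description above.

```python
def rds_cpu_alarm_params(db_instance_id, sns_topic_arn, threshold=80.0):
    """Build keyword arguments for CloudWatch put_metric_alarm (RDS CPU > threshold)."""
    return {
        "AlarmName": f"rds-cpu-high-{db_instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
        ],
        "Statistic": "Average",
        "Period": 300,            # assumed: 5-minute datapoints
        "EvaluationPeriods": 2,   # assumed: two consecutive breaches
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Without AlarmActions the alarm changes state but notifies nobody.
        "AlarmActions": [sns_topic_arn],
    }


# Usage (placeholder account ID and instance name, requires boto3):
#   import boto3
#   params = rds_cpu_alarm_params(
#       "wordpress-db",
#       "arn:aws:sns:eu-central-1:123456789012:cloudwatch-alerts",
#   )
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```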

🔧 Troubleshooting Flowchart

flowchart TD
  A[Alert Triggered] --> B[Check CloudWatch Logs]
  B --> C{Is ECS failing?}
  C -- Yes --> D[Review container logs & restart task]
  C -- No --> E{Is RDS under load?}
  E -- Yes --> F[Inspect slow queries or scale RDS]
  E -- No --> G[Check ALB Health & Targets]
  G --> H[Restart ECS or fix ALB routing]
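
The same decision path can be written as a small function, for example as the basis of a runbook script. This is only a sketch. The function name and the two boolean inputs are hypothetical, and how they are determined (log review, metrics) is left to the operator.

```python
def next_step(ecs_failing, rds_under_load):
    """Return the next troubleshooting action, following the flowchart above."""
    if ecs_failing:
        return "Review container logs & restart task"
    if rds_under_load:
        return "Inspect slow queries or scale RDS"
    # Neither ECS nor RDS is the culprit, so the ALB is checked last.
    return "Check ALB Health & Targets, then restart ECS or fix ALB routing"


if __name__ == "__main__":
    print(next_step(ecs_failing=False, rds_under_load=True))
```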

🧩 Error Categorization Table

Error Type	Cause	Detection	Resolution
ECS Task Crash Loop	Broken container startup	ECS console (stopped tasks), logs	Fix image, restart
RDS CPU Spike	Inefficient query	CloudWatch metric	Optimize or scale
ALB 5xx Surge	Unhealthy ECS targets	ALB metrics + logs	Restart ECS or fix app
Lambda Timeout	Misconfigured backup logic	Lambda logs	Adjust timeout/code
SNS Not Triggering	Missing alarm actions	Alarm changes state but no notification is sent	Link alarm to SNS topic
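
For the "SNS Not Triggering" row, alarms that lack actions can be found programmatically. The sketch below assumes a response dictionary in the shape returned by the CloudWatch `describe_alarms` call. The helper name `alarms_without_actions` is hypothetical.

```python
def alarms_without_actions(describe_alarms_response):
    """Return the names of metric alarms that have no AlarmActions configured."""
    alarms = describe_alarms_response.get("MetricAlarms", [])
    # An empty or missing AlarmActions list means no SNS topic is notified.
    return [a["AlarmName"] for a in alarms if not a.get("AlarmActions")]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("cloudwatch").describe_alarms()
#   print(alarms_without_actions(response))
```

Note that `describe_alarms` is paginated, so an account with many alarms would need to follow the `NextToken` values or use a paginator.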