⚠️ Error Handling & Analysis
This document outlines how errors are handled within the cloud infrastructure. It covers logging standards, simulated fault conditions, corresponding CloudWatch insights, alert responses, and structured recovery procedures. It fulfills the F1 competency of the M300 matrix.
📝 ERROR LOGGING FORMAT
To ensure errors are both machine- and human-readable, logs follow a structured JSON format:
{
  "timestamp": "2025-04-15T10:00:00Z",
  "level": "error",
  "source": "ecs-task/wordpress",
  "message": "Database connection failed",
  "context": {
    "db_endpoint": "rds-instance.xyz.eu-central-1.rds.amazonaws.com",
    "task_id": "abcd1234"
  }
}
✅ Logs are streamed to the CloudWatch log group /ecs/wordpress.
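As an illustration of the format above, here is a minimal Python sketch that builds one such log line. The function name `format_error_log` and its parameters are hypothetical and not part of the project. It assumes the application writes one JSON object per line to stdout, which the container log driver then forwards to CloudWatch.

```python
import json
from datetime import datetime, timezone


def format_error_log(source, message, context, level="error", now=None):
    """Return one JSON log line using the fields of the documented format.

    `format_error_log` is a hypothetical helper, not an existing project API.
    `now` can be injected so that output is reproducible in tests.
    """
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "source": source,
        "message": message,
        "context": context,
    }
    # One compact JSON object per line keeps each log event self-contained.
    return json.dumps(record)
```

Calling `print(format_error_log("ecs-task/wordpress", "Database connection failed", {"task_id": "abcd1234"}))` would write a single line in the structure shown above.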
🧪 SETTING UP SIMULATED FAILURE SCENARIOS
Scenario | How to Simulate | Expected Error | Resolution |
---|---|---|---|
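ECS Task Crash | Manually stop ECS task or deploy container with faulty entrypoint | Task stops (STOPPED status with a stopped reason or non-zero exit code) and the service keeps relaunching it, the ECS equivalent of a Kubernetes CrashLoopBackOff | Revert image version or fix entrypoint |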
ALB Health Check Failure | Temporarily stop ECS container or block port 80/443 | ALB shows "Unhealthy Targets", 5xx errors | Restart ECS task or adjust health check |
📝 For each scenario:
- Run test
- Take CloudWatch snapshot
- Observe alarms and log data
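For the "observe log data" step, a small filter can help pull the relevant entries out of an exported log stream. The sketch below is only an assumption about how that review could be scripted. The function name `error_records` is hypothetical, and it assumes the log events have already been exported as text lines in the structured JSON format described earlier.

```python
import json


def error_records(lines, source=None):
    """Yield parsed log records whose level is "error".

    Hypothetical helper: `lines` is any iterable of log lines, for example
    the messages of a CloudWatch log export. Lines that are not JSON
    objects are skipped, since container output may mix plain text in.
    """
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON at all (e.g. a raw stack trace line)
        if not isinstance(record, dict):
            continue  # valid JSON, but not a structured log object
        if record.get("level") != "error":
            continue
        if source is not None and record.get("source") != source:
            continue
        yield record
```

Passing `source="ecs-task/wordpress"` narrows the result to one component, which keeps the evidence collected for each scenario focused.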
📈 CloudWatch Evidence & Logs
📡 Alert Response Documentation
- Trigger condition: e.g., ECS CPU < 5% or RDS CPU > 80%
- Alarm: CloudWatch Alarm pushes to SNS Topic cloudwatch-alerts
- Notification: Email and/or SMS sent to admin
- Response Actions:
  - Admin reviews CloudWatch logs
  - Error is diagnosed
  - Fix is deployed (restart ECS, optimize DB, etc.)
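To speed up the review step, the alarm notification itself can be condensed into a one-line summary. The sketch below assumes the SNS message body is the JSON document that CloudWatch alarms publish (with fields such as `AlarmName`, `NewStateValue`, `NewStateReason`, and `Trigger`). The function name `summarize_alarm` is hypothetical.

```python
import json


def summarize_alarm(sns_message):
    """Condense a CloudWatch alarm notification (JSON string) to one line.

    Hypothetical helper. Missing fields fall back to "?" so that an
    unexpected message shape still produces a readable summary.
    """
    alarm = json.loads(sns_message)
    trigger = alarm.get("Trigger", {})
    return "{state}: {name} ({namespace}/{metric}, threshold {threshold}) - {reason}".format(
        state=alarm.get("NewStateValue", "?"),
        name=alarm.get("AlarmName", "?"),
        namespace=trigger.get("Namespace", "?"),
        metric=trigger.get("MetricName", "?"),
        threshold=trigger.get("Threshold", "?"),
        reason=alarm.get("NewStateReason", "?"),
    )
```

A summary in this form tells the admin which metric crossed which threshold before they open the CloudWatch console.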
🔧 Troubleshooting Flowchart
flowchart TD
    A[Alert Triggered] --> B[Check CloudWatch Logs]
    B --> C{Is ECS failing?}
    C -- Yes --> D[Review container logs & restart task]
    C -- No --> E{Is RDS under load?}
    E -- Yes --> F[Inspect slow queries or scale RDS]
    E -- No --> G[Check ALB Health & Targets]
    G --> H[Restart ECS or fix ALB routing]
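The same decision path can be written as a short function, which may be useful if the triage is later scripted. This is a minimal sketch: the name `next_action` and its two boolean inputs are hypothetical, and determining whether ECS is failing or RDS is under load is left to the operator or to separate checks.

```python
def next_action(ecs_failing, rds_under_load):
    """Return the recommended step, following the flowchart top to bottom.

    Hypothetical helper: the ECS check comes first, then RDS, and the
    ALB branch is the fallback when neither applies.
    """
    if ecs_failing:
        return "Review container logs & restart task"
    if rds_under_load:
        return "Inspect slow queries or scale RDS"
    return "Check ALB Health & Targets, then restart ECS or fix ALB routing"
```

Because the ECS check is evaluated first, a failing task takes priority even when RDS is also under load, matching the order of the flowchart.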
🧩 Error Categorization Table
Error Type | Cause | Detection | Resolution |
---|---|---|---|
ECS Task Crash Loop | Broken container startup | ECS console (stopped tasks, service events), logs | Fix image, restart |
RDS CPU Spike | Inefficient query | CloudWatch metric | Optimize or scale |
ALB 5xx Surge | Unhealthy ECS targets | ALB metrics + logs | Restart ECS or fix app |
Lambda Timeout | Misconfigured backup logic | Lambda logs | Adjust timeout/code |
SNS Not Triggering | Missing alarm actions | Alarm shows OK only | Link alarm to SNS topic |
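The "SNS Not Triggering" case can be checked by looking for alarms that have no action attached. The sketch below assumes its input is a dictionary shaped like the response of the CloudWatch `DescribeAlarms` API (a `MetricAlarms` list whose entries carry `AlarmName`, `AlarmActions`, and `ActionsEnabled`). The function name `alarms_without_actions` is hypothetical, and fetching the response (for example with an AWS SDK) is outside the sketch.

```python
def alarms_without_actions(describe_alarms_response):
    """List alarm names that cannot notify anyone when entering ALARM.

    Hypothetical helper. An alarm is flagged when its AlarmActions list is
    empty or missing, or when actions are disabled on the alarm.
    """
    flagged = []
    for alarm in describe_alarms_response.get("MetricAlarms", []):
        has_actions = bool(alarm.get("AlarmActions"))
        enabled = alarm.get("ActionsEnabled", True)
        if not has_actions or not enabled:
            flagged.append(alarm.get("AlarmName", "?"))
    return flagged
```

Any name returned here is a candidate for the resolution in the table: link the alarm to the SNS topic.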