Assessment ‐ Advanced Actionplan - Campus-Castolo/m300 GitHub Wiki

🚀 Advanced Action Plan – M300 Competency Completion Guide

This action plan outlines the precise technical steps I need to take to upgrade all remaining intermediate competencies to the Advanced level in the M300 matrix.


🔷 D1 – Network Configuration & Testing (Netzwerkverbindungen konfigurieren und testen)

🎯 Goals:

  • Establish and document proof of network connectivity and traffic control between the services in the VPC.
  • Monitor and document VPC traffic and security posture.

✅ Tasks:

  1. Enable VPC Flow Logs

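    # Log group that receives the flow log records (kept for 14 days).
    resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
      name              = "/aws/vpc/flow-logs"
      retention_in_days = 14
    }
    
    # Capture accepted and rejected traffic for the whole VPC.
    resource "aws_flow_log" "vpc" {
      vpc_id               = aws_vpc.main.id
      traffic_type         = "ALL"
      log_destination      = aws_cloudwatch_log_group.vpc_flow_logs.arn
      log_destination_type = "cloud-watch-logs"
      # Delivery to CloudWatch Logs needs an IAM role. The role name is an
      # assumption and must be defined elsewhere in the configuration.
      iam_role_arn         = aws_iam_role.vpc_flow_logs.arn
    }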
    
  2. ECS ↔ RDS Test

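    • Use ECS Exec (aws ecs execute-command) to open a shell in a running ECS task, then test the MySQL port (netcat must be available in the container image):
      nc -vz <rds-endpoint> 3306
      
    • Take a screenshot of the output and document the result.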
  3. Lambda ↔ RDS Snapshot Verification

    • In CloudWatch, verify snapshot logs:
      • Log group: /aws/lambda/rds-backup
      • Look for CreateDBSnapshot events.
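  4. (Optional) Add an ICMP allow rule and attempt a ping test between ECS and RDS. RDS endpoints generally do not answer ICMP echo requests, so a failed ping is inconclusive; the TCP test in task 2 remains the authoritative check.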

  5. Document all results in corrected_network_topology.md.
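
The flow log in task 1 delivers to CloudWatch Logs, which requires an IAM role that the VPC Flow Logs service can assume. The following is a minimal sketch of such a role, not the project's actual configuration: the resource names (vpc_flow_logs, flow_logs_assume, flow_logs_write) and the wildcard resource scope are assumptions. The role would be wired in through the flow log's iam_role_arn argument.

```hcl
# Trust policy: lets the VPC Flow Logs service assume the role.
data "aws_iam_policy_document" "flow_logs_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["vpc-flow-logs.amazonaws.com"]
    }
  }
}

# Role name is hypothetical; pick one that matches the project's naming.
resource "aws_iam_role" "vpc_flow_logs" {
  name               = "vpc-flow-logs"
  assume_role_policy = data.aws_iam_policy_document.flow_logs_assume.json
}

# Permissions needed to write flow log records to CloudWatch Logs.
data "aws_iam_policy_document" "flow_logs_write" {
  statement {
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams",
    ]

    # Assumption: broad scope for brevity; narrow to the log group ARN if desired.
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "vpc_flow_logs" {
  name   = "vpc-flow-logs-write"
  role   = aws_iam_role.vpc_flow_logs.id
  policy = data.aws_iam_policy_document.flow_logs_write.json
}
```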

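To back the goal of monitoring the VPC's security posture with something measurable, rejected connections in the flow logs could be counted as a CloudWatch metric. This is a sketch only: the filter name, metric name, and the M300/Network namespace are assumptions, and the pattern assumes the default flow log record format.

```hcl
# Counts every flow log record whose action field is REJECT.
resource "aws_cloudwatch_log_metric_filter" "vpc_rejects" {
  name           = "vpc-rejected-connections" # hypothetical name
  log_group_name = aws_cloudwatch_log_group.vpc_flow_logs.name

  # Space-delimited pattern matching the default flow log format (assumption).
  pattern = "[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, windowstart, windowend, action=\"REJECT\", flowlogstatus]"

  metric_transformation {
    name      = "VpcRejectedConnections" # hypothetical metric name
    namespace = "M300/Network"           # hypothetical namespace
    value     = "1"
  }
}
```

The resulting metric can be graphed or alarmed on, and a screenshot of it fits alongside the connectivity results in corrected_network_topology.md.
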

🔷 E2 – Service Operations & Observability (Betrieb und Überwachung von Services)

🎯 Goals:

  • Validate observability, restore capability, and automated recovery behaviors.

✅ Tasks:

  1. Create CloudWatch Dashboards

    • ECS Metrics:
      • CPUUtilization, MemoryUtilization
    • RDS Metrics:
      • FreeStorageSpace, DatabaseConnections
  2. Configure Autoscaling for ECS

    resource "aws_appautoscaling_target" "ecs" {
      max_capacity       = 4
      min_capacity       = 1
      resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.wordpress.name}"
      scalable_dimension = "ecs:service:DesiredCount"
      service_namespace  = "ecs"
    }
    
    resource "aws_appautoscaling_policy" "cpu" {
      policy_type = "TargetTrackingScaling"
      ...
    }
    
  3. Trigger and Document CloudWatch Alarm

    • Temporarily lower the alarm threshold (e.g., to 5% CPU) so that normal load triggers the alarm.
    • Confirm that the SNS alert arrives via email or SMS.
    • Take a screenshot of the received notification.
  4. Restore from Snapshot

    • Create a new DB instance from an RDS snapshot via the console or Terraform.
    • Validate login via MySQL client.
    • Document procedure in updated_security_concept.md.
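
Task 2 leaves the body of the scaling policy elided. Purely as an illustration, and not the author's actual configuration, a target-tracking policy on average CPU could look like the sketch below. The resource name cpu_example, the 60% target, and both cooldown values are assumptions.

```hcl
# Hypothetical target-tracking policy attached to the scalable target from task 2.
resource "aws_appautoscaling_policy" "cpu_example" {
  name               = "ecs-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      # Built-in metric: average CPU across the service's tasks.
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value       = 60  # assumption: keep average CPU near 60%
    scale_in_cooldown  = 300 # assumption: seconds to wait before removing tasks
    scale_out_cooldown = 60  # assumption: seconds to wait before adding more tasks
  }
}
```

With target tracking, Application Auto Scaling creates and manages the underlying CloudWatch alarms itself, so no separate scaling alarm has to be declared.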

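The alarm test in task 3 needs an alarm wired to an SNS topic. A minimal sketch under stated assumptions: the topic name, the email address, and the alarm name are placeholders, and the threshold of 5 is the temporary test value that should be raised again after the screenshot is taken.

```hcl
# Hypothetical topic that fans alerts out to subscribers.
resource "aws_sns_topic" "alerts" {
  name = "m300-alerts"
}

# Email subscriptions must be confirmed via the link AWS sends to the address.
resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "alerts@example.com" # placeholder address
}

# Fires when the service's average CPU exceeds the threshold for one minute.
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "ecs-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5 # temporary test value; restore the real threshold afterwards

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.wordpress.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```
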
🔷 F1 – Error Analysis & Logging (Fehleranalyse und Protokollierung)

🎯 Goals:

  • Demonstrate the ability to detect, log, and resolve system-level failures, with documented evidence for each step.

✅ Tasks:

  1. Structured Log Format for ECS

    • Modify the application or container to emit one JSON object per log line:
      {"timestamp":"...", "level":"error", "message":"Database unreachable"}
      
  2. Simulate Common Failures

    • ECS task crash: stop the task manually or deploy a deliberately broken configuration.
    • RDS CPU spike: run a stress-test query.
    • ALB health check failure: temporarily block the health check port.
  3. Observe Logs + Alerts

    • Confirm that each simulated failure triggers its alert.
    • Copy the relevant log output (timestamp, error, affected system) into the documentation.
  4. Error Classification Table

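    | Error Type          | Cause                           | Detection                      | Resolution          |
    |---------------------|---------------------------------|--------------------------------|---------------------|
    | ECS task crash loop | Bad image or environment config | ECS logs, alarm                | Redeploy image      |
    | RDS CPU spike       | Query overload                  | CloudWatch metric              | Optimize / restart  |
    | ALB 5xx surge       | No healthy ECS tasks            | ALB logs, failed health checks | Restart ECS service |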
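
Once the containers emit JSON logs as in task 1, error lines can be turned into a metric and an alarm so that the simulated failures in task 2 raise an alert on their own. The sketch below is illustrative only: the log group /ecs/wordpress, the metric and alarm names, the M300/App namespace, and the SNS topic aws_sns_topic.alerts are all assumptions that must match resources defined elsewhere.

```hcl
# Counts log events whose JSON "level" field equals "error".
resource "aws_cloudwatch_log_metric_filter" "app_errors" {
  name           = "app-error-count"
  log_group_name = "/ecs/wordpress" # assumption: the ECS task's awslogs group
  pattern        = "{ $.level = \"error\" }"

  metric_transformation {
    name      = "AppErrorCount" # hypothetical metric name
    namespace = "M300/App"      # hypothetical namespace
    value     = "1"
  }
}

# Alerts as soon as at least one error line is logged within a minute.
resource "aws_cloudwatch_metric_alarm" "app_errors" {
  alarm_name          = "app-errors-detected"
  namespace           = "M300/App"
  metric_name         = "AppErrorCount"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  treat_missing_data  = "notBreaching" # no log lines means no errors

  # Assumption: an SNS topic named "alerts" exists in the configuration.
  alarm_actions = [aws_sns_topic.alerts.arn]
}
```

The JSON pattern only matches well-formed JSON log lines, which is one practical reason for the structured format in task 1.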

✅ Summary Table – Final Verification

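| Category | Action | Status         | Notes                             |
|----------|--------|----------------|-----------------------------------|
| D1       |        | ⏳ In Progress | Add flow logs, connectivity tests |
| E2       |        | ⏳ In Progress | Add dashboards, restore test      |
| F1       |        | ⏳ In Progress | Simulate & log error resolution   |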