5. Sprints 5 and 6 Kick Off - ITA-Dnipro/PyDataCenter5000 GitHub Wiki

Sprint 5: Smart Alerting & Observability Enhancements

Focus: Making alerts intelligent, actionable, and tunable; enriching observability features.


Task Ideas

  1. Alert Rules Engine (DSL-based or YAML)

    • Type: backend

    • Description: Let users define flexible alert conditions like:

      - name: HighCPU
        condition: cpu_usage > 90
        duration: 3m
        severity: critical
      
    • AC: Support for thresholds, durations, severity levels, and notification mapping.
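
A minimal sketch of how a rule such as HighCPU could be evaluated. It assumes conditions of the form `<metric> <operator> <number>`, durations written as a number followed by `s`, `m`, or `h`, and samples supplied as `(timestamp, {metric: value})` pairs. `AlertRule`, `parse_duration`, and `is_firing` are hypothetical names, not part of the existing codebase, and notification mapping is left out.

```python
import operator
import re
from dataclasses import dataclass

# Operators and duration units supported by this sketch (assumptions, not a spec).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '3m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


@dataclass
class AlertRule:
    name: str
    condition: str
    duration: str
    severity: str

    def _parse_condition(self):
        match = re.fullmatch(
            r"\s*(\w+)\s*(>=|<=|==|>|<)\s*(\d+(?:\.\d+)?)\s*", self.condition)
        if match is None:
            raise ValueError(f"unsupported condition: {self.condition!r}")
        return match.group(1), OPS[match.group(2)], float(match.group(3))

    def is_firing(self, samples, now):
        """True if the condition held for every sample in the last `duration`.

        The samples must reach back at least to the start of the window,
        otherwise there is not enough history to decide and the rule stays quiet.
        """
        metric, op, threshold = self._parse_condition()
        window_start = now - parse_duration(self.duration)
        observed = [(ts, values[metric]) for ts, values in samples
                    if ts <= now and metric in values]
        if not observed or min(ts for ts, _ in observed) > window_start:
            return False
        return all(op(v, threshold) for ts, v in observed if ts >= window_start)
```

Parsing the condition with a small regular expression instead of `eval` keeps user-supplied rules from running arbitrary code; a fuller DSL would need a real parser.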

  2. Alert Deduplication & Suppression Logic

    • Type: backend
    • Description: Avoid alert storms from repetitive failures. Add logic to suppress duplicates within N minutes or during blackout windows.
    • AC: Alerts are deduplicated; suppression logs are traceable.
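
One way the suppression logic could look, as a sketch only. It assumes each alert carries a string key (for example rule name plus agent ID), that timestamps are plain seconds, and that blackout windows are `(start, end)` pairs. `AlertDeduplicator` and its attributes are hypothetical names.

```python
class AlertDeduplicator:
    """Suppress repeats of the same alert key within a time window."""

    def __init__(self, window_seconds, blackouts=()):
        self.window_seconds = window_seconds
        self.blackouts = list(blackouts)   # (start, end) timestamps, end exclusive
        self._last_sent = {}               # alert key -> time it was last sent
        self.suppression_log = []          # (time, key, reason) for traceability

    def should_send(self, key, now):
        """Return True if the alert should go out, False if it is suppressed."""
        for start, end in self.blackouts:
            if start <= now < end:
                self.suppression_log.append((now, key, "blackout"))
                return False
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            self.suppression_log.append((now, key, "duplicate"))
            return False
        self._last_sent[key] = now
        return True
```

Every suppressed alert is recorded with a reason, which is one simple way to keep suppression traceable; a real implementation would likely persist this log rather than hold it in memory.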
  3. Integration with Notification Channels (Slack, Email, Telegram)

    • Type: backend
    • Description: Allow plug-in alert handlers to notify via common channels.
    • AC: Configurable alert channels per environment or agent group.
  4. Auto-Tagging of Events with Context

    • Type: backend
    • Description: Enrich alerts with tags like env=prod, region=eu-west, or role=db.
    • AC: Tags included in webhook payloads or alert messages.
  5. Real-Time Agent Dashboard (WebSocket or Polling)

    • Type: frontend + backend
    • Description: Visualize agent status and latest health pings live.
    • AC: No manual refresh needed; shows latest per-agent data.
  6. Latency Heatmap and Outlier Detection

    • Type: backend + visualization
    • Description: Analyze network or CPU latency over time and highlight abnormal spikes.
    • AC: At least one working heatmap and spike detection algorithm (e.g., z-score).
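
A minimal z-score spike detector, assuming latencies arrive as a plain list of numbers and that a fixed threshold of 3 standard deviations is acceptable. `find_spikes` is a hypothetical helper; the heatmap itself is not covered here.

```python
from statistics import mean, pstdev


def find_spikes(latencies, threshold=3.0):
    """Return indexes of samples whose z-score magnitude exceeds `threshold`."""
    if len(latencies) < 2:
        return []
    mu = mean(latencies)
    sigma = pstdev(latencies)  # population standard deviation
    if sigma == 0:
        return []  # all samples identical, nothing stands out
    return [i for i, value in enumerate(latencies)
            if abs(value - mu) / sigma > threshold]
```

With a population standard deviation, a single sample's z-score cannot exceed the square root of n - 1, so a threshold of 3 only becomes reachable once the series has at least 11 samples. Short windows would need a lower threshold or a different method.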
  7. Dead Man's Switch Monitoring

    • Type: backend
    • Description: For critical agents, fire an alert if no health report is received in X time.
    • AC: Triggers alert only once per missing window.
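
The dead man's switch from task 7 can be sketched as follows. This assumes "once per missing window" means one alert per outage, re-armed when the agent reports again, and that a periodic job calls `check`. `DeadMansSwitch` and its methods are hypothetical names.

```python
class DeadMansSwitch:
    """Flag agents that have not reported within `timeout_seconds`."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}    # agent ID -> time of last health report
        self._alerted = set()   # agents already alerted for the current outage

    def record_report(self, agent_id, now):
        """Store a health report and re-arm the switch for this agent."""
        self._last_seen[agent_id] = now
        self._alerted.discard(agent_id)

    def check(self, now):
        """Return agents that just went missing; each outage is reported once."""
        newly_missing = []
        for agent_id, last in self._last_seen.items():
            overdue = now - last > self.timeout_seconds
            if overdue and agent_id not in self._alerted:
                self._alerted.add(agent_id)
                newly_missing.append(agent_id)
        return newly_missing
```

Tracking alerted agents in a separate set is what keeps repeated checks during the same outage from re-firing the alert.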

Sprint 6: Scalability, Security & DevOps Hardening

Focus: Making the platform cloud-ready, secure, and CI/CD-friendly.


✅ Task Ideas

  1. JWT or OAuth2-Based Agent Authentication

    • Type: agent + backend
    • Description: Replace basic auth with tokens issued by the controller.
    • AC: Expired tokens rejected; renewals logged.
  2. Rate Limiting and Throttling

    • Type: backend
    • Description: Protect APIs from abuse or misbehaving agents.
    • AC: Defined limits (e.g., 60 requests/min) with appropriate error codes (e.g., HTTP 429 Too Many Requests).
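
A sliding-window limiter is one possible way to enforce a limit such as 60 requests/min. The sketch below is framework-independent and keeps state in memory; `SlidingWindowLimiter` and `allow` are hypothetical names, and callers pass the current time explicitly so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client in any `window_seconds` span."""

    def __init__(self, limit=60, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client ID -> timestamps of accepted requests

    def allow(self, client_id, now):
        """Return True if the request is within the limit, False otherwise."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

A rejected request would typically be answered with HTTP 429. With several controller replicas the counters would have to live in a shared store rather than in process memory.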
  3. Horizontal Scaling for Controller (Gunicorn / Uvicorn + Load Balancer)

    • Type: backend + deployment
    • Description: Run multiple controller replicas with shared DB and cache (Redis).
    • AC: Load balancer distributes traffic correctly; sessions stateless.
  4. CI/CD Pipeline for Agent & Backend (GitHub Actions or GitLab CI)

    • Type: DevOps
    • Description: Automate testing, packaging, versioning, and container pushing.
    • AC: PR merge triggers build, test, and image deploy.
  5. Config-as-Code for Agent Setup

    • Type: agent
    • Description: Allow agents to be fully configured from a YAML file.
    • AC: Agent reads config at startup, supports overrides.
  6. End-to-End Encryption of Agent-Controller Traffic

    • Type: agent + backend
    • Description: Enforce HTTPS with optional mutual TLS.
    • AC: TLS is default; controller rejects HTTP.
  7. Benchmark Suite for Performance Regression

    • Type: QA
    • Description: Add automated stress benchmarks to measure response times under load.
    • AC: Benchmarks included in CI; the run is flagged if latency regresses.
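
A latency regression check for task 7 could be sketched as below. It assumes the metric of interest is the 95th-percentile latency, that a stored baseline value is available from an earlier run, and that a 10% tolerance is acceptable. All function names and the tolerance are illustrative assumptions.

```python
import statistics
import time


def measure_latencies(func, runs=50):
    """Call `func` repeatedly and return each call's duration in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def p95(samples):
    """95th percentile: the last of the 19 cut points that split data into 20 parts."""
    return statistics.quantiles(samples, n=20)[-1]


def has_regressed(current_p95, baseline_p95, tolerance=0.10):
    """True if the current p95 exceeds the baseline by more than `tolerance`."""
    return current_p95 > baseline_p95 * (1 + tolerance)
```

In CI, a job could run `measure_latencies` against the endpoint under test, compare `p95` of the result with the stored baseline, and exit non-zero when `has_regressed` returns True.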