5. Sprints 5 and 6 Kick Off - ITA-Dnipro/PyDataCenter5000 GitHub Wiki
Sprint 5: Smart Alerting & Observability Enhancements
Focus: Making alerts intelligent, actionable, and tunable; enriching observability features.
Task Ideas
- Alert Rules Engine (DSL-based or YAML)
  - Type: backend
  - Description: Let users define flexible alert conditions like:

        - name: HighCPU
          condition: cpu_usage > 90
          duration: 3m
          severity: critical

  - AC: Support for thresholds, durations, severity levels, and notification mapping.
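One way the rule format above could be evaluated is sketched below. It assumes rules arrive as already-parsed dictionaries (for example from a YAML loader) and that a condition is always `metric operator number`. The `AlertRule` class, the supported operators, and the `s`/`m`/`h` duration suffixes are illustrative assumptions, not a settled design; notification mapping is left out.

```python
import operator
import re
from dataclasses import dataclass

# Operators this sketch understands (assumption: the wiki example only shows ">").
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "==": operator.eq}
_CONDITION = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")
_DURATION = re.compile(r"^(\d+)([smh])$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert a duration such as '3m' to seconds."""
    match = _DURATION.match(text.strip())
    if not match:
        raise ValueError(f"bad duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


@dataclass
class AlertRule:
    name: str
    metric: str
    op: str
    threshold: float
    duration_s: int
    severity: str

    @classmethod
    def from_dict(cls, raw):
        """Build a rule from one parsed YAML entry."""
        match = _CONDITION.match(raw["condition"])
        if not match:
            raise ValueError(f"bad condition: {raw['condition']!r}")
        metric, op, threshold = match.groups()
        return cls(raw["name"], metric, op, float(threshold),
                   parse_duration(raw["duration"]), raw["severity"])

    def is_firing(self, samples, now):
        """samples: (timestamp, value) pairs for this metric, oldest first.

        Fires when the most recent unbroken run of breaching samples
        started at least `duration_s` seconds before `now`.
        """
        breach = _OPS[self.op]
        streak_start = None
        for timestamp, value in reversed(samples):
            if timestamp > now:
                continue  # ignore samples from the future
            if not breach(value, self.threshold):
                break
            streak_start = timestamp
        return streak_start is not None and now - streak_start >= self.duration_s


rule = AlertRule.from_dict({"name": "HighCPU", "condition": "cpu_usage > 90",
                            "duration": "3m", "severity": "critical"})
```

With this sketch, `rule.is_firing(samples, now)` only returns `True` once CPU usage has stayed above 90 for the full three minutes, so a single short spike does not raise an alert.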
- Alert Deduplication & Suppression Logic
  - Type: backend
  - Description: Avoid alert storms from repetitive failures. Add logic to suppress duplicates within N minutes or during blackout windows.
  - AC: Alerts are deduplicated; suppression logs are traceable.
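The deduplication and blackout logic could start from something like the sketch below. It assumes each alert has a string key (for example rule name plus agent ID) and that timestamps are plain seconds; the `AlertSuppressor` name and the in-memory storage are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class AlertSuppressor:
    window_s: float = 300.0                        # the "N minutes" dedup window
    blackouts: list = field(default_factory=list)  # (start, end) timestamp pairs
    log: list = field(default_factory=list)        # one record per suppressed alert
    _last_sent: dict = field(default_factory=dict, init=False, repr=False)

    def should_send(self, key, now):
        """Return True if the alert identified by `key` should go out."""
        for start, end in self.blackouts:
            if start <= now < end:
                self.log.append((now, key, "blackout"))
                return False
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_s:
            self.log.append((now, key, "duplicate"))
            return False
        self._last_sent[key] = now
        return True
```

Every suppressed alert leaves a `(timestamp, key, reason)` record in `log`, which is one simple way to keep suppression decisions traceable.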
- Integration with Notification Channels (Slack, Email, Telegram)
  - Type: backend
  - Description: Allow plug-in alert handlers to notify via common channels.
  - AC: Configurable alert channels per environment or agent group.
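A plug-in style handler registry might look like the following sketch. The handlers here only append to a list instead of calling Slack or an SMTP server, and the `register_channel` decorator, the `ROUTES` table, and the `env` field on alerts are all hypothetical names chosen for the example.

```python
# Registry of channel name -> handler function.
_HANDLERS = {}


def register_channel(name):
    """Decorator that registers a notification handler under `name`."""
    def decorator(func):
        _HANDLERS[name] = func
        return func
    return decorator


sent = []  # stands in for real network calls in this sketch


@register_channel("slack")
def notify_slack(alert):
    sent.append(("slack", alert["name"]))


@register_channel("email")
def notify_email(alert):
    sent.append(("email", alert["name"]))


# Hypothetical routing table: environment -> channel names.
ROUTES = {"prod": ["slack", "email"], "staging": ["slack"]}


def dispatch(alert, routes=ROUTES):
    """Send `alert` to every channel configured for its environment."""
    delivered = []
    for channel in routes.get(alert.get("env"), []):
        handler = _HANDLERS.get(channel)
        if handler is None:
            continue  # unknown channel in the config: skipped in this sketch
        handler(alert)
        delivered.append(channel)
    return delivered
```

Adding Telegram would then mean writing one more decorated function, with no change to `dispatch`.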
- Auto-Tagging of Events with Context
  - Type: backend
  - Description: Enrich alerts with tags like `env=prod`, `region=eu-west`, or `role=db`.
  - AC: Tags included in webhook payloads or alert messages.
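Tag enrichment could be as small as the sketch below. It assumes the controller already holds per-agent metadata as a dictionary; `enrich_alert`, `format_tags`, and the choice to let tags set on the alert itself win over agent metadata are assumptions for illustration.

```python
def enrich_alert(alert, agent_meta, tag_keys=("env", "region", "role")):
    """Return a copy of `alert` with context tags taken from agent metadata."""
    tags = {key: agent_meta[key] for key in tag_keys if key in agent_meta}
    tags.update(alert.get("tags", {}))  # tags already on the alert take priority
    return {**alert, "tags": tags}


def format_tags(tags):
    """Render tags as 'key=value' pairs for a plain-text alert message."""
    return ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
```

The enriched dictionary can be serialized directly into a webhook payload, while `format_tags` covers plain-text channels.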
- Real-Time Agent Dashboard (WebSocket or Polling)
  - Type: frontend + backend
  - Description: Visualize agent status and latest health pings live.
  - AC: No manual refresh needed; shows latest per-agent data.
- Latency Heatmap and Outlier Detection
  - Type: backend + visualization
  - Description: Analyze network or CPU latency over time, highlight abnormal spikes.
  - AC: At least one working heatmap and spike detection algorithm (e.g., z-score).
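For the z-score option mentioned in the AC, a minimal sketch using only the standard library is shown below. It scores each sample against the mean and population standard deviation of the whole series; the `zscore_outliers` name and the default threshold of 3 are assumptions, and a rolling window may suit long-running series better.

```python
from statistics import mean, pstdev


def zscore_outliers(samples, threshold=3.0):
    """Return indexes of samples whose absolute z-score exceeds `threshold`."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return []  # all samples identical: nothing stands out
    return [index for index, value in enumerate(samples)
            if abs(value - mu) / sigma > threshold]
```

Note that with very short series a single spike cannot reach a high z-score, because the spike itself inflates the standard deviation, so the threshold may need tuning for small windows.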
- Dead Man's Switch Monitoring
  - Type: backend
  - Description: For critical agents, fire an alert if no health report is received in X time.
  - AC: Triggers alert only once per missing window.
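The "only once per missing window" requirement could be handled as in the sketch below. It reads "missing window" as one continuous outage: an agent is reported once when it goes silent and is re-armed when it reports again. The `DeadMansSwitch` class and its in-memory state are assumptions for illustration.

```python
class DeadMansSwitch:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # the "X time" allowed between reports
        self._last_seen = {}
        self._alerted = set()

    def report(self, agent_id, now):
        """Record a health report and re-arm the switch for this agent."""
        self._last_seen[agent_id] = now
        self._alerted.discard(agent_id)

    def check(self, now):
        """Return agents that have just gone silent, once per outage."""
        newly_missing = []
        for agent_id, seen in self._last_seen.items():
            if now - seen > self.timeout_s and agent_id not in self._alerted:
                self._alerted.add(agent_id)
                newly_missing.append(agent_id)
        return newly_missing
```

A periodic job would call `check` and raise one alert per returned agent ID.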
Sprint 6: Scalability, Security & DevOps Hardening
Focus: Making the platform cloud-ready, secure, and CI/CD-friendly.
Task Ideas
- JWT or OAuth2-Based Agent Authentication
  - Type: agent + backend
  - Description: Replace basic auth with tokens issued by the controller.
  - AC: Expired tokens rejected; renewals logged.
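To make the token flow concrete, here is a hand-rolled HS256 token in the JWT compact format, built only from the standard library. It is a sketch for discussion: a maintained JWT library would normally be preferable in production, and `issue_token`, `verify_token`, the 15-minute default lifetime, and the shared-secret setup are all assumptions. Renewal logging is not covered.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data):
    """Base64url-encode bytes without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input, secret):
    return _b64url(hmac.new(secret, signing_input.encode("ascii"),
                            hashlib.sha256).digest())


def issue_token(agent_id, secret, now, ttl_s=900):
    """Controller side: issue a signed token that expires after `ttl_s` seconds."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": agent_id,
                                  "exp": int(now) + ttl_s}).encode())
    signing_input = f"{header}.{payload}"
    return f"{signing_input}.{_sign(signing_input, secret)}"


def verify_token(token, secret, now):
    """Return the claims if the token is valid, otherwise raise ValueError."""
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    # The signature is always recomputed as HS256, whatever the header claims.
    expected = _sign(f"{header}.{payload}", secret)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

Passing `now` explicitly keeps the expiry check easy to test; a real controller would use the current time.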
- Rate Limiting and Throttling
  - Type: backend
  - Description: Protect APIs from abuse or misbehaving agents.
  - AC: Defined limits (e.g., 60 requests/min) with appropriate error codes.
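A sliding-window limiter is one simple way to enforce a limit such as 60 requests per minute. The sketch below keeps counters in process memory, which would not be shared across controller replicas; the `SlidingWindowLimiter` name and the `(status, retry_after)` return shape are assumptions. HTTP 429 (Too Many Requests) is the standard status code for this case.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, limit=60, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self._hits = defaultdict(deque)  # client ID -> timestamps of accepted requests

    def check(self, client_id, now):
        """Return (status_code, retry_after_seconds) for one incoming request."""
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            retry_after = self.window_s - (now - hits[0])
            return 429, retry_after
        hits.append(now)
        return 200, 0.0
```

The second value can be sent back in a `Retry-After` header so well-behaved agents know when to try again.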
- Horizontal Scaling for Controller (Gunicorn / Uvicorn + Load Balancer)
  - Type: backend + deployment
  - Description: Run multiple controller replicas with shared DB and cache (Redis).
  - AC: Load balancer distributes traffic correctly; sessions stateless.
- CI/CD Pipeline for Agent & Backend (GitHub Actions or GitLab CI)
  - Type: DevOps
  - Description: Automate testing, packaging, versioning, and container pushing.
  - AC: PR merge triggers build, test, and image deploy.
- Config-as-Code for Agent Setup
  - Type: agent
  - Description: Allow agents to be fully configured from a YAML file.
  - AC: Agent reads config at startup, supports overrides.
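The layering of defaults, file config, and overrides could work roughly as sketched below. It assumes the YAML file has already been parsed into a dictionary by a YAML loader, and that overrides arrive as `dotted.key=value` strings (for example from CLI flags). `DEFAULTS`, its keys, and the function names are hypothetical.

```python
import copy
import json

# Hypothetical built-in defaults for an agent.
DEFAULTS = {"controller": {"url": "https://localhost:8443", "timeout_s": 10},
            "report_interval_s": 30}


def deep_merge(base, extra):
    """Return a new dict where values from `extra` override `base`, recursively."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_overrides(config, overrides):
    """Apply 'dotted.key=value' strings on top of `config`."""
    result = copy.deepcopy(config)
    for item in overrides:
        path, _, raw = item.partition("=")
        try:
            value = json.loads(raw)  # numbers and booleans keep their type
        except ValueError:
            value = raw              # anything else stays a string
        *parents, leaf = path.split(".")
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


def load_config(file_config, overrides=()):
    """Precedence in this sketch: defaults, then file, then overrides."""
    return apply_overrides(deep_merge(DEFAULTS, file_config), overrides)
```

For example, `load_config(parsed_yaml, ["controller.timeout_s=3"])` changes only the timeout and leaves the rest of the file config intact.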
- End-to-End Encryption of Agent-Controller Traffic
  - Type: agent + backend
  - Description: Enforce HTTPS with optional mutual TLS.
  - AC: TLS is default; controller rejects HTTP.
- Benchmark Suite for Performance Regression
  - Type: QA
  - Description: Add automated stress benchmarks to measure response times under load.
  - AC: Benchmarks included in CI; red flags if latency regresses.
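The regression check itself could be a small comparison of a latency percentile against a stored baseline, as in the sketch below. The choice of p95, the 10% tolerance, and the function names are assumptions; how the samples are collected (load generator, number of runs) is left open.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_regression(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """Compare the chosen percentile of two runs and flag a regression."""
    base = percentile(baseline_ms, pct)
    current = percentile(current_ms, pct)
    return {"baseline": base, "current": current,
            "regressed": current > base * (1 + tolerance)}
```

A CI step could call `check_regression` and exit with a non-zero status when `regressed` is true, which fails the pipeline.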