4. Sprint 3‐4 Kickoff "Data Center Management Platform" - ITA-Dnipro/PyDataCenter5000 GitHub Wiki

Sprint 3: Automation & Observability

Task 1: Implement Agent Self-Healing

Type: agent

Add logic to the agent that attempts to restart services (e.g., SSH, Nginx) when they are detected as down. Use subprocess to run safe restart commands, and log each attempt.

Goal: Agents not only detect issues but try to fix them.
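A minimal sketch of the restart step, assuming systemd-managed services. The `RESTART_COMMANDS` allowlist, the `restart_service` name, and the injectable `run` parameter are illustrative assumptions, not part of the existing agent.

```python
import logging
import subprocess

logger = logging.getLogger("agent.selfheal")

# Hypothetical allowlist: service name -> fixed command (no shell, no user input).
RESTART_COMMANDS = {
    "ssh": ["systemctl", "restart", "ssh"],
    "nginx": ["systemctl", "restart", "nginx"],
}


def restart_service(name, run=subprocess.run, timeout=30):
    """Try to restart an allowlisted service; return True on success."""
    command = RESTART_COMMANDS.get(name)
    if command is None:
        logger.warning("Refusing to restart unknown service %r", name)
        return False
    try:
        result = run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        logger.error("Restart of %s could not run: %s", name, exc)
        return False
    if result.returncode != 0:
        logger.error("Restart of %s exited with %d: %s",
                     name, result.returncode, result.stderr.strip())
        return False
    logger.info("Restarted %s", name)
    return True
```

Keeping the commands in a fixed allowlist and passing them as argument lists (no `shell=True`) is one way to keep the restart "safe": nothing from the network ends up in the command line.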

Task 2: Add Resource Usage Reporting

Type: agent

Agents should periodically collect and report:

CPU usage
RAM usage
Disk space
Load average

Add these to the POST /api/status/ request payload.
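The collection step could look roughly like the sketch below. The payload keys and the `collect_metrics` name are assumptions. CPU and RAM are read through the third-party `psutil` package when it is installed, and reported as `None` otherwise.

```python
import os
import shutil
import time

try:
    import psutil  # third-party and optional in this sketch
except ImportError:
    psutil = None


def collect_metrics(path="/"):
    """Return a dict of resource metrics; values that cannot be read are None."""
    disk = shutil.disk_usage(path)
    try:
        load = list(os.getloadavg())
    except (AttributeError, OSError):  # load average is not available everywhere
        load = [None, None, None]
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=0.1) if psutil else None,
        "ram_percent": psutil.virtual_memory().percent if psutil else None,
        "disk_percent": round(disk.used / disk.total * 100, 1) if disk.total else None,
        "load_average": load,  # 1, 5 and 15 minute averages
    }
```

The returned dict can be merged into the existing status payload before it is serialized to JSON.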

Task 3: Store Extended Metrics in the Backend

Type: backend

Update the Django models to store the additional metrics per timestamp. Extend the admin view or dashboard to show system resource usage over time.

Task 4: Historical Metrics & Graphing

Type: backend + frontend

Use a library like Plotly.js or Chart.js in the dashboard to show:

CPU/RAM trend
Process failures over time
Uptime tracking

Optional: Integrate with SQLite or TimescaleDB for time-series support.

Task 5: Role-Based Access Control (RBAC)

Type: backend

Add user roles:

Admin: Full control
Operator: Can send commands
Viewer: Read-only

Use Django’s permissions or django-guardian.
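Independent of which Django mechanism is chosen, the three roles reduce to a small mapping from role to allowed actions. The sketch below is plain Python for illustration only; the action names (`view`, `send_command`, `manage`) are assumptions.

```python
# Hypothetical action names; in Django these would map to permissions or groups.
ROLE_PERMISSIONS = {
    "admin": {"view", "send_command", "manage"},
    "operator": {"view", "send_command"},
    "viewer": {"view"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles get an empty permission set, so access is denied by default.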

Task 6: Add RESTful API Docs (Swagger/OpenAPI)

Type: backend

Generate and expose interactive API documentation using drf-yasg or drf-spectacular.

Task 7: Alert Rules Engine

Type: backend

Define alert rules in the DB, e.g.:

“If CPU > 80% for 5 mins, send alert”
“If nginx goes down more than 3 times per hour, send alert”

Evaluate in background jobs (e.g., Celery or custom cron).
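A rule of the "CPU > 80% for 5 mins" kind can be evaluated against recent samples as sketched below. The `ThresholdRule` fields and the `min_samples` guard are assumptions; in the real backend the rule would be a DB row and the samples a query result.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    metric: str        # e.g. "cpu_percent"
    threshold: float   # e.g. 80.0
    duration_s: int    # e.g. 300 for 5 minutes
    min_samples: int   # guards against firing on a nearly empty window


def rule_fires(rule, samples, now):
    """samples is a list of (timestamp, {metric: value}) pairs.

    The rule fires only if the window holds enough samples and every
    one of them exceeds the threshold.
    """
    window_start = now - rule.duration_s
    values = [
        metrics[rule.metric]
        for ts, metrics in samples
        if window_start <= ts <= now and rule.metric in metrics
    ]
    return len(values) >= rule.min_samples and all(v > rule.threshold for v in values)
```

A background job would call `rule_fires` for each rule and agent on every tick and hand positive results to the alert sender.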

Task 8: External Webhook Integration

Type: communication

Allow sending alerts to external systems:

Slack
PagerDuty
Email/SMS (via SMTP)

Use webhook configuration from admin or .ini.
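As an illustration of the .ini option, the sketch below reads a webhook URL with `configparser` and builds a JSON POST request. The `[webhooks]` section, the `slack_url` key, and the URL are placeholders. Slack incoming webhooks accept a JSON body with a `text` field; other targets need their own payload format.

```python
import configparser
import json
import urllib.request

# Placeholder configuration; section and key names are assumptions.
EXAMPLE_INI = """
[webhooks]
slack_url = https://example.com/hooks/placeholder
"""


def build_slack_request(config, message):
    """Build (but do not send) a POST request carrying the alert text."""
    url = config["webhooks"]["slack_url"]
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
request = build_slack_request(config, "CPU above 80 percent for 5 mins on agent-01")
# urllib.request.urlopen(request, timeout=5) would actually send it.
```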

Task 9: Implement Retry Logic for Agent Communication

Type: agent

Agents should retry sending data if the controller is temporarily unavailable. Add exponential backoff and log each failure.

Goal: Improve fault tolerance in communication.
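One possible shape for the retry loop is sketched below. `send_with_retry`, its parameters, and the added jitter are assumptions. The `send` callable stands in for whatever function the agent already uses to POST its payload.

```python
import logging
import random
import time

logger = logging.getLogger("agent.transport")


def send_with_retry(send, payload, max_attempts=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call send(payload), retrying on OSError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except OSError as exc:  # URLError and socket errors are OSError subclasses
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 50% random jitter
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay / 2)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)
```

The jitter spreads out retries so that many agents do not hit a recovering controller at the same instant.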

Task 10: Add Agent Health Endpoint

Type: agent + backend

Expose an HTTP endpoint on each agent (e.g., /health) that returns a JSON health snapshot. The controller or a load balancer can periodically poll it for liveness checks.

Goal: Enable easier monitoring and debugging.
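A standard-library sketch of such an endpoint follows. The port, the JSON fields, and the `AGENT_VERSION` placeholder are assumptions; a real agent might serve this from whatever HTTP framework it already uses.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
AGENT_VERSION = "0.0.0"  # placeholder value


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        snapshot = {
            "status": "ok",
            "version": AGENT_VERSION,
            "uptime_s": round(time.time() - START_TIME, 1),
        }
        body = json.dumps(snapshot).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep liveness polling out of stderr


def make_server(host="127.0.0.1", port=8081):
    """Create (but do not start) the health server."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` in a background thread would keep the endpoint available while the agent's main loop runs.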

Sprint 4: Intelligence & Extensibility

Task 1: Predictive Health (Optional AI module)

Type: backend

Use simple ML/statistical methods to:

Forecast load
Detect anomalies (e.g., agent stops reporting or latency spikes)

Could be a separate microservice.
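As one example of a "simple statistical method", a z-score test flags a value that lies far from the recent mean. The function name and the threshold of 3 standard deviations are assumptions; this is a sketch, not a tuned detector.

```python
import statistics


def is_anomaly(history, value, z_threshold=3.0):
    """Return True if value deviates strongly from the recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean  # flat history: any change stands out
    return abs(value - mean) / stdev > z_threshold
```

The same check could be applied to reporting intervals to catch an agent that has stopped reporting, by feeding it the gaps between consecutive status timestamps.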

Task 2: Plugin System for Custom Agents

Type: agent + backend

Define a plugin folder where new checks can be added (e.g., check_postgres.py). The controller can enable or disable checks per agent.
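Folder-based discovery could work roughly as follows. The convention that every `check_*.py` file exposes a `run()` function, and the `enabled` set standing in for the controller's per-agent configuration, are assumptions of this sketch.

```python
import importlib.util
from pathlib import Path


def load_checks(plugin_dir, enabled=None):
    """Load check_*.py files from plugin_dir and return {name: run_callable}.

    If enabled is given, only checks whose name is in it are loaded.
    """
    checks = {}
    for path in sorted(Path(plugin_dir).glob("check_*.py")):
        name = path.stem
        if enabled is not None and name not in enabled:
            continue
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the plugin file
        run = getattr(module, "run", None)
        if callable(run):
            checks[name] = run
    return checks
```

Because loading a plugin executes its code, the plugin folder should be writable only by trusted users.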

Task 3: Docker & Cloud VM Support

Type: deployment

Enable deploying agents as Docker containers or on cloud VMs such as EC2 instances. Add templates for a Dockerfile and for Vagrant/AWS CLI.

Task 4: Load Testing and Stress Simulation

Type: QA

Simulate:

Dozens of agent connections
Massive command dispatch
Network interruptions

Use locust or pytest-benchmark.

Task 5: Centralized Log Aggregation

Type: backend

Agents send log snippets (stdout, syslog) to the controller for aggregation. Store them in the DB or forward them to ELK/Graylog.

Task 6: Plugin-Based Monitoring Framework

Type: agent

Refactor the agent to support pluggable monitors (CPU, memory, disk, etc.) using a config file or folder-based plugin architecture.

Goal: Allow future extensibility without changing core agent logic.

Task 7: Multi-Controller Support (HA Mode)

Type: backend

Allow agents to be configured with multiple controller URLs (primary/secondary). Implement the failover logic on the agent side.

Goal: Improve resilience and support high availability.
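Agent-side failover can be as simple as trying the configured URLs in order. In the sketch below, `send_with_failover` and the `post(url, payload)` callable are hypothetical names standing in for the agent's real transport function.

```python
import logging

logger = logging.getLogger("agent.failover")


def send_with_failover(controller_urls, payload, post):
    """Try each controller in order; return the URL that accepted the payload."""
    last_error = None
    for url in controller_urls:
        try:
            post(url, payload)
            return url
        except OSError as exc:  # connection refused, timeout, DNS failure, ...
            logger.warning("Controller %s unreachable: %s", url, exc)
            last_error = exc
    raise ConnectionError("All controllers are unreachable") from last_error
```

Listing the primary first means the agent returns to it automatically on the next send once it is reachable again.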

Task 8: Build Package & Installer for Agent

Type: agent + deployment

Create a .deb or .rpm package (or a zip plus install script) so that users can easily install the agent on new VMs.

Goal: Ease real-world deployment and CI integration.

Task 9: Add Environment Tagging Support

Type: backend + agent

Allow agents to be tagged with metadata:

env: staging / prod
role: db / app / cache

Useful for filtering in dashboard and alerts.
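Filtering by tag then becomes a simple match on key/value pairs. The agent record shape below (a dict with a `tags` mapping) is an assumption for illustration; in the backend this would more likely be a queryset filter.

```python
def filter_agents(agents, **required_tags):
    """Return the agents whose tags contain every required key/value pair."""
    return [
        agent
        for agent in agents
        if all(agent.get("tags", {}).get(key) == value
               for key, value in required_tags.items())
    ]


# Hypothetical agent records
AGENTS = [
    {"name": "agent-01", "tags": {"env": "prod", "role": "db"}},
    {"name": "agent-02", "tags": {"env": "staging", "role": "app"}},
    {"name": "agent-03", "tags": {"env": "prod", "role": "cache"}},
]

prod_agents = filter_agents(AGENTS, env="prod")
```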

Task 10: Remote Agent Upgrade Support

Type: backend + agent

Include the agent version in the status payload so the controller can check it. The controller can push an update command if the agent version is outdated.

Goal: Enable centralized control of updates.
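The controller-side check could be sketched as below, assuming purely numeric dotted versions such as "1.10.0". The function names are illustrative. Versions are compared as integer tuples because plain string comparison would rank "1.9.0" above "1.10.0".

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0); assumes numeric dotted versions only."""
    return tuple(int(part) for part in text.split("."))


def needs_update(agent_version, latest_version):
    """Return True if the agent reports an older version than the latest one."""
    return parse_version(agent_version) < parse_version(latest_version)
```

Versions with suffixes such as "1.2.0rc1" would need a fuller parser than this sketch provides.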