4. Sprint 3‐4 Kickoff "Data Center Management Platform" - ITA-Dnipro/PyDataCenter5000 GitHub Wiki

Sprint 3: Automation & Observability

Task 1: Implement Agent Self-Healing

Type: agent

Add logic to the agent that attempts to restart services (e.g., SSH, Nginx) when they are detected as down. Use subprocess to run safe restart commands, and log every attempt.

Goal: Agents not only detect issues but try to fix them.
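A minimal sketch of the restart logic, assuming the services run as systemd units. The allow-list, the logger name, and the `heal` helper are hypothetical names chosen for illustration, not part of the existing agent.

```python
import logging
import subprocess

logger = logging.getLogger("agent.selfheal")  # hypothetical logger name

# Assumption: only units on this allow-list may ever be restarted.
RESTARTABLE_SERVICES = {"ssh", "nginx"}


def is_active(service, run=subprocess.run):
    """Return True if systemd reports the unit as active."""
    result = run(["systemctl", "is-active", "--quiet", service], check=False)
    return result.returncode == 0


def heal(service, run=subprocess.run):
    """Restart `service` if it is down. Return True if it is up afterwards."""
    if service not in RESTARTABLE_SERVICES:
        logger.warning("Refusing to restart non-allow-listed service %r", service)
        return False
    if is_active(service, run):
        return True
    logger.warning("Service %s is down, attempting restart", service)
    # An argument list (no shell=True) keeps the command safe from shell injection.
    result = run(["systemctl", "restart", service], check=False, timeout=30)
    if result.returncode != 0:
        logger.error("Restart of %s failed with exit code %s", service, result.returncode)
        return False
    logger.info("Service %s restarted", service)
    return is_active(service, run)
```

The `run` parameter exists only so the logic can be exercised with a fake runner in tests; in production the default `subprocess.run` is used.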


Task 2: Add Resource Usage Reporting

Type: agent

Agents should periodically collect and report:

  • CPU usage
  • RAM usage
  • Disk space
  • Load average

Add these to the POST /api/status/ request payload.
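One possible way to collect these values, sketched under a few assumptions: the agent runs on a Unix-like host, the third-party psutil package is optional, and the payload field names below are illustrative rather than the actual /api/status/ schema.

```python
import json
import os
import shutil
import time

try:
    import psutil  # third-party and optional; supplies CPU and RAM percentages
except ImportError:
    psutil = None


def collect_metrics(path="/"):
    """Build a resource snapshot. Field names are assumptions, not the real schema."""
    disk = shutil.disk_usage(path)
    load1, load5, load15 = os.getloadavg()  # available on Unix only
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=0.1) if psutil else None,
        "ram_percent": psutil.virtual_memory().percent if psutil else None,
        "disk_total_bytes": disk.total,
        "disk_free_bytes": disk.free,
        "load_average": [load1, load5, load15],
    }


# The snapshot is plain JSON-serializable data, ready to merge into the status payload.
payload_fragment = json.dumps(collect_metrics())
```

Without psutil the CPU and RAM fields are reported as null, so the backend should tolerate missing values.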


Task 3: Store Extended Metrics in the Backend

Type: backend

Update the Django models to store the additional metrics per timestamp. Extend the admin view or dashboard to show system resource usage over time.


Task 4: Historical Metrics & Graphing

Type: backend + frontend

Use a library like Plotly.js or Chart.js in the dashboard to show:

  • CPU/RAM trend
  • Process failures over time
  • Uptime tracking

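Optional: For time-series support, integrate with TimescaleDB (a PostgreSQL extension); plain SQLite is adequate for small deployments but offers no time-series-specific features.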


Task 5: Role-Based Access Control (RBAC)

Type: backend

Add user roles:

  • Admin: Full control
  • Operator: Can send commands
  • Viewer: Read-only

Use Django’s built-in permissions framework or django-guardian (for per-object permissions).


Task 6: Add RESTful API Docs (Swagger/OpenAPI)

Type: backend

Generate and expose interactive API documentation using drf-yasg or drf-spectacular.


Task 7: Alert Rules Engine

Type: backend

Define alert rules in the DB, e.g.:

  • “If CPU > 80% for 5 mins, send alert”
  • “If nginx is down more than 3 times/hour, send alert”

Evaluate the rules in background jobs (e.g., Celery tasks or a custom cron job).
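A rule such as “If CPU > 80% for 5 mins” could be evaluated roughly as follows. This is a sketch only: `ThresholdRule` and `rule_fires` are hypothetical names, and the samples are assumed to arrive as (timestamp, value) pairs sorted by time.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical in-memory form of a rule row stored in the DB."""
    metric: str
    threshold: float
    duration_s: float


def rule_fires(rule, samples, now):
    """Return True if the metric stayed above the threshold for the whole window.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    window_start = now - rule.duration_s
    # Without history reaching back to the window start, the rule cannot be confirmed.
    if not samples or samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return bool(recent) and all(value > rule.threshold for value in recent)


cpu_rule = ThresholdRule(metric="cpu_percent", threshold=80.0, duration_s=300)
```

A background job would call `rule_fires` per agent and rule on each tick and hand positive results to the alerting channel.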


Task 8: External Webhook Integration

Type: communication

Allow sending alerts to external systems:

  • Slack
  • PagerDuty
  • Email (via SMTP) and SMS (e.g., via an email-to-SMS gateway)

Read the webhook configuration from the admin interface or an .ini file.
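For the Slack case, a sketch using only the standard library might look like this. The `[slack]` section and `webhook_url` key are assumed names for the .ini layout; the JSON body with a `text` field is the format Slack incoming webhooks accept.

```python
import configparser
import json
import urllib.request


def load_webhook_url(ini_text, section="slack"):
    """Read the webhook URL from .ini content (section and key names are assumptions)."""
    parser = configparser.ConfigParser(interpolation=None)  # URLs may contain '%'
    parser.read_string(ini_text)
    return parser[section]["webhook_url"]


def build_slack_request(webhook_url, message):
    """Build the POST request for a Slack incoming webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, message, timeout=5):
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Other targets such as PagerDuty expect different payloads, so each integration would need its own request builder.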


Task 9: Implement Retry Logic for Agent Communication

Type: agent

Agents should retry sending data if the controller is temporarily unavailable. Add exponential backoff and log each failure.

Goal: Improve fault tolerance in communication.
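The retry behaviour could be sketched as below. `send_with_retry` and its parameters are hypothetical; `send` stands for whatever function the agent already uses to POST its payload, and it is assumed to raise an `OSError` subclass (for example `urllib.error.URLError` or `ConnectionError`) when the controller is unreachable.

```python
import logging
import random
import time

logger = logging.getLogger("agent.comm")  # hypothetical logger name


def send_with_retry(send, payload, max_attempts=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call send(payload), retrying with exponential backoff on network errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except OSError as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so many agents do not retry in lockstep.
            delay += random.uniform(0, delay * 0.1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)
```

The injectable `sleep` is there for testing; an agent would normally leave the default.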


Task 10: Add Agent Health Endpoint

Type: agent + backend

Expose an HTTP endpoint on each agent (e.g., /health) that returns a JSON health snapshot. The controller or a load balancer can poll it periodically for liveness checks.

Goal: Enable easier monitoring and debugging.
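A standard-library sketch of such an endpoint is shown here. The port, the `AGENT_VERSION` placeholder, and the fields in the snapshot are assumptions; a real agent would report whatever health data it already tracks.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.time()
AGENT_VERSION = "0.0.0"  # placeholder value, not the real agent version


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps({
            "status": "ok",
            "version": AGENT_VERSION,
            "uptime_s": round(time.time() - START_TIME, 1),
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request stderr logging


def make_server(host="127.0.0.1", port=8081):
    """Create the health server; call serve_forever() on it to start serving."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Running `make_server().serve_forever()` in a background thread keeps the endpoint available while the agent's main loop continues.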


Sprint 4: Intelligence & Extensibility

Task 1: Predictive Health (Optional AI module)

Type: backend

Use simple ML/statistical methods to:

  • Forecast load
  • Detect anomalies (e.g., agent stops reporting or latency spikes)

This could run as a separate microservice.
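As a starting point for the anomaly part, a simple z-score test and a silence check may be enough. Both functions are illustrative sketches; the threshold of 3 standard deviations and the 120-second silence limit are assumed defaults, not tuned values.

```python
import statistics


def is_latency_spike(history, latest, z_threshold=3.0):
    """Flag `latest` if it lies more than z_threshold standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest > mean
    return (latest - mean) / stdev > z_threshold


def stale_agents(last_seen, now, max_silence_s=120):
    """Return agents whose last report is older than max_silence_s seconds."""
    return sorted(agent for agent, ts in last_seen.items() if now - ts > max_silence_s)
```

Load forecasting would need a separate model (for example a moving average or regression) and is not covered by this sketch.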


Task 2: Plugin System for Custom Agents

Type: agent + backend

Define a plugin folder where new checks can be added (e.g., check_postgres.py). The controller can enable or disable checks per agent.
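Folder-based discovery could work along these lines. The convention that every check_*.py file defines a `run()` function, and the `enabled` set supplied by the controller, are assumptions made for this sketch.

```python
import importlib.util
from pathlib import Path


def load_checks(plugin_dir, enabled=None):
    """Load check_*.py files from plugin_dir and return {name: run_function}.

    `enabled` is an optional set of check names; when given, other checks are skipped.
    """
    checks = {}
    for path in sorted(Path(plugin_dir).glob("check_*.py")):
        name = path.stem
        if enabled is not None and name not in enabled:
            continue
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Assumed plugin contract: the module exposes a callable named run.
        if callable(getattr(module, "run", None)):
            checks[name] = module.run
    return checks
```

Because plugin files are executed on import, the plugin folder should be writable only by trusted users.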


Task 3: Docker & Cloud VM Support

Type: deployment

Enable deploying agents as Docker containers or on EC2 instances. Add templates for a Dockerfile and for Vagrant/AWS CLI provisioning.


Task 4: Load Testing and Stress Simulation

Type: QA

Simulate:

  • Dozens of agent connections
  • Massive command dispatch
  • Network interruptions

Use locust or pytest-benchmark.


Task 5: Centralized Log Aggregation

Type: backend

Agents send log snippets (stdout, syslog) to the controller for aggregation. Store them in the DB or forward them to ELK/Graylog.


Task 6: Plugin-Based Monitoring Framework

Type: agent

Refactor the agent to support pluggable monitors (CPU, memory, disk, etc.) using a config file or folder-based plugin architecture.

Goal: Allow future extensibility without changing core agent logic.


Task 7: Multi-Controller Support (HA Mode)

Type: backend

Allow agents to be configured with multiple controller URLs (primary/secondary). Failover logic should be implemented on the agent side.

Goal: Improve resilience and support high availability.
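Agent-side failover can be as simple as trying the configured URLs in order. In this sketch `send_with_failover` is a hypothetical helper and `post` stands for the agent's existing HTTP call, assumed to raise an `OSError` subclass when a controller is unreachable.

```python
import logging

logger = logging.getLogger("agent.failover")  # hypothetical logger name


def send_with_failover(controller_urls, payload, post):
    """Try each controller in priority order; return (url, response) of the first success."""
    last_error = None
    for url in controller_urls:
        try:
            return url, post(url, payload)
        except OSError as exc:
            logger.warning("Controller %s unreachable: %s", url, exc)
            last_error = exc
    raise ConnectionError("all controllers unreachable") from last_error
```

Since the primary URL is tried first on every call, the agent returns to the primary automatically once it recovers.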


Task 8: Build Package & Installer for Agent

Type: agent + deployment

Create a .deb or .rpm package (or a zip archive with an install script) so users can easily install the agent on new VMs.

Goal: Ease real-world deployment and CI integration.


Task 9: Add Environment Tagging Support

Type: backend + agent

Allow agents to be tagged with metadata:

  • env: staging / prod
  • role: db / app / cache

Tags are useful for filtering in the dashboard and in alert rules.
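Tag-based filtering could be sketched as follows, assuming each agent's tags are stored as a flat dictionary. The agent names and the `filter_agents` helper are illustrative only.

```python
def matches_tags(agent_tags, wanted):
    """Return True if the agent carries every wanted key/value pair."""
    return all(agent_tags.get(key) == value for key, value in wanted.items())


def filter_agents(agents, **wanted):
    """Return the names of agents whose tags match all wanted pairs."""
    return [name for name, tags in agents.items() if matches_tags(tags, wanted)]


# Hypothetical inventory keyed by agent name.
agents = {
    "vm-01": {"env": "prod", "role": "db"},
    "vm-02": {"env": "prod", "role": "app"},
    "vm-03": {"env": "staging", "role": "db"},
}
prod_dbs = filter_agents(agents, env="prod", role="db")
```

In the Django backend the same idea would translate to a query filter on a tag model or a JSON field.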


Task 10: Remote Agent Upgrade Support

Type: backend + agent

Include the agent version in the status payload. The controller can push an update command if the agent version is outdated.

Goal: Enable centralized control of updates.
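The controller-side version check might look like the sketch below. It assumes purely numeric dotted versions (such as 1.10.0), a payload field named `agent_version`, and a command format that is made up for illustration.

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def needs_update(agent_version, latest_version):
    return parse_version(agent_version) < parse_version(latest_version)


def handle_status(payload, latest_version):
    """Return an update command for outdated agents, otherwise None."""
    if needs_update(payload["agent_version"], latest_version):
        return {"command": "update", "target_version": latest_version}
    return None
```

Comparing tuples matters here: as plain strings, "1.9.0" would sort after "1.10.0" and the outdated agent would never be updated.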