4. Sprint 3‐4 Kickoff "Data Center Management Platform" - ITA-Dnipro/PyDataCenter5000 GitHub Wiki
Sprint 3: Automation & Observability
Task 1: Implement Agent Self-Healing
Type: agent
Add logic in the agent to attempt service restarts (e.g., SSH, Nginx) if they are detected as down. Use subprocess to run safe restart commands with logging.
Goal: Agents not only detect issues but try to fix them.
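A minimal sketch of the restart logic, assuming a systemd host. The RESTART_COMMANDS allowlist, the restart_service name, and the unit names are illustrative assumptions (the SSH unit is "ssh" on Debian-style systems and "sshd" elsewhere), not the project's actual code.

```python
import logging
import subprocess

logger = logging.getLogger("agent.selfheal")

# Hypothetical allowlist: only these fixed commands may ever be executed.
RESTART_COMMANDS = {
    "ssh": ["systemctl", "restart", "ssh"],
    "nginx": ["systemctl", "restart", "nginx"],
}


def restart_service(name, run=subprocess.run, timeout=30):
    """Try to restart an allowlisted service; return True on success."""
    command = RESTART_COMMANDS.get(name)
    if command is None:
        logger.warning("Refusing to restart unknown service %r", name)
        return False
    try:
        # Argument list with no shell, so nothing is interpolated into a shell string.
        result = run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        logger.error("Restart of %s could not run: %s", name, exc)
        return False
    if result.returncode != 0:
        logger.error("Restart of %s exited with %s: %s",
                     name, result.returncode, result.stderr.strip())
        return False
    logger.info("Restarted %s", name)
    return True
```

The run parameter exists only so the function can be exercised without touching real services; in the agent it would keep its default.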
Task 2: Add Resource Usage Reporting
Type: agent
Agents should periodically collect and report:
- CPU usage
- RAM usage
- Disk space
- Load average
Add these to the POST /api/status/ request payload.
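One possible shape for the collection step is sketched below. Disk space and load average come from the standard library; CPU and RAM use the third-party psutil package if it is installed, which is an assumption about the agent's dependencies. The field names are hypothetical and would need to match whatever the backend expects.

```python
import json
import os
import shutil
import time

try:
    import psutil  # third-party; assumed, not confirmed, as an agent dependency
except ImportError:
    psutil = None


def _load_average():
    """Return [1m, 5m, 15m] load averages, or None where unsupported."""
    try:
        return list(os.getloadavg())
    except (AttributeError, OSError):
        return None


def collect_metrics(path="/"):
    """Build a JSON-serializable snapshot of resource usage."""
    disk = shutil.disk_usage(path)
    disk_percent = round(disk.used / disk.total * 100, 1) if disk.total else None
    metrics = {
        "timestamp": time.time(),
        "disk_percent": disk_percent,
        "load_average": _load_average(),
        "cpu_percent": None,
        "ram_percent": None,
    }
    if psutil is not None:
        metrics["cpu_percent"] = psutil.cpu_percent(interval=0.1)
        metrics["ram_percent"] = psutil.virtual_memory().percent
    return metrics


def metrics_as_json():
    """Serialize the snapshot so it can be merged into the status payload."""
    return json.dumps(collect_metrics())
```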
Task 3: Store Extended Metrics in the Backend
Type: backend
Update Django models to store additional metrics per timestamp. Extend admin view or dashboard to show system resource usage over time.
Task 4: Historical Metrics & Graphing
Type: backend + frontend
Use a library like Plotly.js or Chart.js in the dashboard to show:
- CPU/RAM trend
- Process failures over time
- Uptime tracking
Optional: Integrate with SQLite or TimescaleDB for time-series support.
Task 5: Role-Based Access Control (RBAC)
Type: backend
Add user roles:
- Admin: Full control
- Operator: Can send commands
- Viewer: Read-only
Use Django’s permissions or django-guardian.
Task 6: Add RESTful API Docs (Swagger/OpenAPI)
Type: backend
Generate and expose interactive API documentation using drf-yasg or drf-spectacular.
Task 7: Alert Rules Engine
Type: backend
Define alert rules in the DB, e.g.:
- “If CPU > 80% for 5 mins, send alert”
- “If nginx is down more than 3 times/hour, send alert”
Evaluate in background jobs (e.g., Celery or custom cron).
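The first example rule could be evaluated roughly as follows. This is a sketch in plain Python, independent of Celery or the Django models; ThresholdRule and rule_fires are hypothetical names, and the samples are assumed to be (timestamp, value) pairs read from the stored metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    metric: str        # e.g. "cpu_percent"
    threshold: float   # fire when every sample is above this value
    duration_s: int    # ...for at least this many seconds


def rule_fires(rule, samples, now):
    """Return True if all samples in the last duration_s seconds exceed the threshold.

    samples: iterable of (timestamp, value) pairs for rule.metric, in any order.
    """
    window_start = now - rule.duration_s
    ordered = sorted(samples)
    if not ordered or ordered[0][0] > window_start:
        return False  # history does not yet cover the whole window
    window = [value for ts, value in ordered if window_start <= ts <= now]
    return bool(window) and all(value > rule.threshold for value in window)
```

For example, with one sample per minute, `ThresholdRule("cpu_percent", 80.0, 300)` fires only once five full minutes of readings above 80 are available; a single dip inside the window resets it.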
Task 8: External Webhook Integration
Type: communication
Allow sending alerts to external systems:
- Slack
- PagerDuty
- Email/SMS (via SMTP)
Read the webhook configuration from the admin or from an .ini file.
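As an illustration of the Slack case, the sketch below reads a webhook URL from .ini text and posts a message. Slack incoming webhooks accept a JSON body with a "text" field; the [slack] section name, the webhook_url key, and the function names are assumptions made for this example.

```python
import configparser
import json
import urllib.request


def load_webhook_url(ini_text, section="slack"):
    """Read the webhook URL from .ini content (section and key names are hypothetical)."""
    # interpolation=None so a literal "%" in the URL is not treated as a placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser[section]["webhook_url"]


def build_slack_request(webhook_url, message):
    """Build the POST request without sending it, which keeps it easy to test."""
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(webhook_url, message, timeout=10):
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

PagerDuty and email would need their own builders; only the delivery step differs, so the alert text can be produced once and handed to each channel.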
Task 9: Implement Retry Logic for Agent Communication
Type: agent
Agents should retry sending data if the controller is temporarily unavailable. Add exponential backoff and logging of failures.
Goal: Improve fault tolerance in communication.
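A sketch of retry with exponential backoff follows. The send callable stands in for whatever function the agent uses to post its status, and the default delays are illustrative assumptions. Network failures from urllib and sockets are OSError subclasses, so that is what the loop catches.

```python
import logging
import random
import time

logger = logging.getLogger("agent.transport")


def send_with_retry(send, payload, max_attempts=5, base_delay=1.0,
                    max_delay=60.0, sleep=time.sleep):
    """Call send(payload), retrying on network errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload)
        except OSError as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 10% random jitter
            # so that many agents do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay * 0.1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)
```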
Task 10: Add Agent Health Endpoint
Type: agent + backend
Expose an HTTP endpoint on each agent (e.g., /health) that returns a JSON health snapshot.
The controller or a load balancer can poll it periodically as a liveness check.
Goal: Enable easier monitoring and debugging.
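The endpoint and the controller-side ping can be sketched with the standard library alone. The snapshot fields, the AGENT_VERSION placeholder, and the default port are assumptions for illustration; a real agent might serve this from whatever framework it already uses.

```python
import http.client
import json
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

START_TIME = time.time()
AGENT_VERSION = "0.0.0"  # placeholder value, not the project's real version


def health_snapshot():
    """Hypothetical snapshot fields; extend with whatever the agent tracks."""
    return {
        "status": "ok",
        "version": AGENT_VERSION,
        "uptime_s": round(time.time() - START_TIME, 1),
    }


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(health_snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep liveness polling out of stderr


def make_server(host="127.0.0.1", port=8081):
    """Create the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), HealthHandler)


def ping_health(host, port, path="/health", timeout=5):
    """Controller-side check: return (status_code, snapshot or None)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        raw = response.read()
        snapshot = json.loads(raw) if response.status == 200 else None
        return response.status, snapshot
    finally:
        conn.close()
```

In the agent this would typically run in a background thread, e.g. `threading.Thread(target=make_server().serve_forever, daemon=True).start()`.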
Sprint 4: Intelligence & Extensibility
Task 1: Predictive Health (Optional AI module)
Type: backend
Use simple ML/statistical methods to:
- Forecast load
- Detect anomalies (e.g., agent stops reporting or latency spikes)
Could be a separate microservice.
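As a starting point for the anomaly case, a z-score test over recent values and a missed-heartbeat check might look like this. Both are deliberately simple statistical stand-ins, not a forecasting model; the function names and the thresholds (3 standard deviations, 3 missed reports) are assumptions.

```python
import statistics


def is_anomalous(history, value, z_threshold=3.0):
    """Flag value if it lies more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold


def has_stopped_reporting(last_seen, now, report_interval_s, missed_reports=3):
    """Flag an agent that has missed several consecutive report intervals."""
    return (now - last_seen) > report_interval_s * missed_reports
```

For latency spikes, history would hold recent round-trip times for one agent and value the newest measurement.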
Task 2: Plugin System for Custom Agents
Type: agent + backend
Define a plugin folder where new checks can be added (e.g., check_postgres.py). Controller can enable/disable checks per agent.
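Folder-based discovery could work along these lines. The sketch assumes every plugin file is named check_*.py and exposes a run() function; that contract and the enabled parameter (standing in for the per-agent list sent by the controller) are assumptions, not an existing interface.

```python
import importlib.util
from pathlib import Path


def load_checks(plugin_dir, enabled=None):
    """Load check_*.py files from plugin_dir and return {name: run_function}.

    enabled: optional set of check names to load; None loads everything.
    """
    checks = {}
    for path in sorted(Path(plugin_dir).glob("check_*.py")):
        name = path.stem
        if enabled is not None and name not in enabled:
            continue
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes the plugin's top-level code
        run = getattr(module, "run", None)
        if callable(run):
            checks[name] = run
    return checks
```

Because loading a plugin executes its code, the plugin folder should be writable only by trusted users.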
Task 3: Docker & Cloud VM Support
Type: deployment
Enable deploying agents as Docker containers or EC2 instances. Add templates for Dockerfile and Vagrant/AWS CLI.
Task 4: Load Testing and Stress Simulation
Type: QA
Simulate:
- Dozens of agent connections
- Massive command dispatch
- Network interruptions
Use locust or pytest-benchmark.
Task 5: Centralized Log Aggregation
Type: backend
Agents send log snippets to controller for aggregation (stdout, syslog). Store in DB or forward to ELK/Graylog.
Task 6: Plugin-Based Monitoring Framework
Type: agent
Refactor the agent to support pluggable monitors (CPU, memory, disk, etc.) using a config file or folder-based plugin architecture.
Goal: Allow future extensibility without changing core agent logic.
Task 7: Multi-Controller Support (HA Mode)
Type: backend
Add ability for agents to be configured with multiple controller URLs (primary/secondary).
Failover logic should be implemented on the agent side.
Goal: Improve resilience and support high availability.
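Agent-side failover can be as small as trying each configured URL in order. In this sketch, post is a stand-in for the agent's existing HTTP call and send_with_failover is a hypothetical name; a fuller version would also remember the last working controller.

```python
import logging

logger = logging.getLogger("agent.failover")


def send_with_failover(controller_urls, payload, post):
    """Try each controller in priority order; return (url_used, response).

    post(url, payload) is expected to raise OSError on network failure.
    """
    last_error = None
    for url in controller_urls:
        try:
            return url, post(url, payload)
        except OSError as exc:
            logger.warning("Controller %s unavailable: %s", url, exc)
            last_error = exc
    raise ConnectionError("all controllers unreachable") from last_error
```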
Task 8: Build Package & Installer for Agent
Type: agent + deployment
Create a .deb or .rpm package (or zip+install script) to allow users to install the agent easily on new VMs.
Goal: Ease real-world deployment and CI integration.
Task 9: Add Environment Tagging Support
Type: backend + agent
Allow agents to be tagged with metadata:
- env: staging / prod
- role: db / app / cache
Useful for filtering in dashboard and alerts.
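Filtering by tags reduces to a dictionary match. The sketch below assumes each agent record carries a "tags" dict; the record layout and function names are hypothetical.

```python
def matches_tags(agent_tags, wanted):
    """True if the agent has every wanted key with exactly the wanted value."""
    return all(agent_tags.get(key) == value for key, value in wanted.items())


def filter_agents(agents, **wanted):
    """Return the agents whose tags match all keyword filters."""
    return [agent for agent in agents if matches_tags(agent["tags"], wanted)]
```

For example, `filter_agents(agents, env="prod", role="db")` would select only production database hosts for a dashboard view or an alert rule.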
Task 10: Remote Agent Upgrade Support
Type: backend + agent
Include the agent version in the status payload so the controller can check it. The controller can push an update command if the agent version is outdated.
Goal: Enable centralized control of updates.
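The controller-side comparison might look like the following sketch. It assumes purely numeric dotted versions with the same number of components (e.g. "1.4.2"); the shape of the update command is hypothetical.

```python
def parse_version(text):
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def update_command_for(agent_version, latest_version):
    """Return an update command dict if the agent is outdated, else None."""
    if parse_version(agent_version) < parse_version(latest_version):
        return {"command": "update", "target_version": latest_version}
    return None
```

Comparing tuples of integers avoids the string-comparison trap where "1.9.0" would sort after "1.10.0".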