Best Practices - fleXRPL/datadog-monitor-deployer GitHub Wiki
Best Practices
This guide outlines best practices for using the DataDog Monitor Deployer effectively in production environments.
Monitor Design
Naming Conventions
- Consistent Format
- Use descriptive, action-oriented names
- Include service/component name
- Indicate environment if relevant
- Example:
[Service] [Resource] [Condition] Alert
monitors:
  - name: "API Service High Latency Alert"
  - name: "Web Frontend Error Rate Alert"
  - name: "Database Connection Pool Saturation Alert"
- Hierarchical Organization
- Group related monitors
- Use consistent prefixes
- Maintain clear categorization
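The naming convention above can be checked mechanically before deployment. The following is a minimal sketch, not part of the deployer itself: `NAME_PATTERN` and `name_problems` are hypothetical names, and the pattern only assumes that a name consists of at least two words followed by the literal suffix `Alert`.

```python
import re

# Assumed reading of "[Service] [Resource] [Condition] Alert":
# two or more space-separated words, then the literal word "Alert".
NAME_PATTERN = re.compile(r"^\S+(?: \S+)+ Alert$")


def name_problems(names):
    """Return (name, reason) pairs for names that break the convention."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.match(name):
            problems.append((name, "does not match '<words> Alert'"))
        if name in seen:
            problems.append((name, "duplicate name"))
        seen.add(name)
    return problems


if __name__ == "__main__":
    for name, reason in name_problems(["API Service High Latency Alert", "cpu"]):
        print(f"{name}: {reason}")
```

A stricter pattern (for example, one that requires a known service prefix) may suit teams that already maintain a service catalog.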
Query Construction
- Performance
- Use appropriate time windows
- Avoid excessive tag combinations
- Consider evaluation delay
- Use efficient aggregation methods
monitors:
  - name: "Efficient Query Example"
    query: "avg(last_5m):avg:system.cpu.user{service:api} by {host} > 80"
    options:
      evaluation_delay: 900
- Reliability
- Include relevant tags
- Use stable metrics
- Handle missing data appropriately
- Consider seasonality
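To review time windows and tag combinations across many monitors, it can help to extract them from the query string. This is a rough sketch that assumes queries follow the `<aggregator>(last_<n><unit>):... by {tags}` shape used in the example above; the helper names are hypothetical and the regular expressions do not cover every Datadog query form.

```python
import re

UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}
WINDOW_PATTERN = re.compile(r"^\w+\(last_(\d+)([mhd])\)")
GROUP_BY_PATTERN = re.compile(r"by \{([^}]*)\}")


def evaluation_window_seconds(query):
    """Return the evaluation window of a query such as 'avg(last_5m):...' in seconds."""
    match = WINDOW_PATTERN.match(query)
    if match is None:
        raise ValueError(f"no 'last_<n><unit>' window found in query: {query!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


def group_by_tags(query):
    """Return the tags listed in the 'by {...}' clause, or an empty list."""
    match = GROUP_BY_PATTERN.search(query)
    if match is None:
        return []
    return [tag.strip() for tag in match.group(1).split(",") if tag.strip()]
```

A review script could, for instance, flag monitors that group by more than two or three tags, since each tag combination is evaluated separately.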
Alert Configuration
Thresholds
- Multiple Levels
- Set warning and critical thresholds
- Include recovery thresholds
- Use appropriate margins
options:
  thresholds:
    critical: 90
    warning: 80
    ok: 70
    critical_recovery: 85
    warning_recovery: 75
- Validation
- Test thresholds in non-production first
- Document threshold rationale
- Review and adjust periodically
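Threshold margins are easy to get wrong when values are edited by hand. The sketch below checks the ordering for a monitor that alerts when the value rises above the threshold (a `>` query); it is an illustration only, `threshold_problems` is a hypothetical helper, and the `ok` value is deliberately not checked.

```python
def threshold_problems(thresholds):
    """Check threshold ordering for an 'above' (>) monitor.

    Returns a list of human-readable problems; an empty list means the
    ordering looks consistent.
    """
    critical = thresholds.get("critical")
    if critical is None:
        return ["critical threshold is required"]

    problems = []
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    if warning is not None and warning >= critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and critical_recovery >= critical:
        problems.append("critical_recovery must be below critical")
    if warning is not None and warning_recovery is not None and warning_recovery >= warning:
        problems.append("warning_recovery must be below warning")
    return problems
```

For monitors that alert when a value falls below a threshold, the comparisons would be reversed.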
Notifications
- Message Content
- Include clear problem description
- Add troubleshooting steps
- Provide context and impact
- Include relevant links
message: |
  {{#is_alert}}
  High CPU usage detected on {{host.name}}
  Current Value: {{value}}%
  Threshold: {{threshold}}%
  Impact: Potential service degradation
  Actions:
  1. Check system load
  2. Review recent deployments
  3. Scale if necessary
  Runbook: https://wiki.example.com/runbooks/high-cpu
  Dashboard: https://datadog.example.com/dashboard/cpu
  {{/is_alert}}
- Routing
- Define clear escalation paths
- Use appropriate urgency levels
- Consider time zones
- Implement notification delays
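Message content guidelines can be partly enforced with a small lint step. The following sketch assumes only two rules: every conditional block such as `{{#is_alert}}` has a matching close, and the message contains at least one link. `message_problems` is a hypothetical helper, not a deployer feature.

```python
import re

# Opening tags may carry arguments, e.g. {{#is_match "host.name" "web"}}.
OPEN_TAG = re.compile(r"\{\{#(\w+)[^}]*\}\}")
CLOSE_TAG = re.compile(r"\{\{/(\w+)\}\}")
LINK = re.compile(r"https?://\S+")


def message_problems(message):
    """Return a list of problems found in a notification message."""
    problems = []
    opened = sorted(OPEN_TAG.findall(message))
    closed = sorted(CLOSE_TAG.findall(message))
    if opened != closed:
        problems.append("unbalanced conditional blocks")
    if not LINK.search(message):
        problems.append("no link to a runbook or dashboard")
    return problems
```

The check counts tags rather than parsing nesting, so it catches missing closes but not blocks closed in the wrong order.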
Project Organization
File Structure
monitors/
├── infrastructure/
│   ├── compute.yaml
│   ├── network.yaml
│   └── storage.yaml
├── applications/
│   ├── frontend.yaml
│   ├── backend.yaml
│   └── database.yaml
├── business/
│   ├── transactions.yaml
│   └── users.yaml
└── templates/
    ├── base.yaml
    └── common.yaml
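With a layout like the one above, tooling often needs the list of monitor definition files while leaving templates aside. This is a small sketch using only the standard library; `monitor_files` and its `skip_dirs` parameter are hypothetical, and it assumes templates live in a top-level `templates/` directory as shown.

```python
from pathlib import Path


def monitor_files(root, skip_dirs=("templates",)):
    """Return monitor definition files under root as sorted POSIX-style relative paths."""
    root = Path(root)
    files = []
    for path in sorted(root.rglob("*.yaml")):
        relative = path.relative_to(root)
        # Skip anything whose top-level directory holds templates, not monitors.
        if relative.parts[0] in skip_dirs:
            continue
        files.append(relative.as_posix())
    return files


if __name__ == "__main__":
    for name in monitor_files("monitors"):
        print(name)
```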
Version Control
- Repository Management
- Use meaningful commit messages
- Implement branch protection
- Require code reviews
- Maintain changelog
- Change Process
- Create feature branches
- Use pull requests
- Include tests
- Document changes
Templating
Template Design
- Reusability
- Create base templates
- Use variables effectively
- Implement inheritance
- Maintain consistency
# templates/base.yaml
template:
  defaults:
    tags:
      - "team:platform"
      - "env:production"
    options:
      notify_no_data: true
      evaluation_delay: 900
# monitors/service.yaml
monitors:
  - template: base
    name: "Service Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{service}}} > {{threshold}}"
- Variables
- Use descriptive names
- Provide default values
- Document requirements
- Validate inputs
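To make the inheritance and variable rules concrete, here is one way template defaults and `{{variable}}` placeholders could be resolved. This is a sketch under stated assumptions, not the deployer's actual implementation: `render_monitor` is a hypothetical function, nested dictionaries are merged one level deep, and only the `query` field is substituted so that Datadog's own message variables such as `{{value}}` are left untouched.

```python
import copy
import re

VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_monitor(defaults, monitor, variables):
    """Merge template defaults into a monitor and fill in query variables."""
    rendered = copy.deepcopy(defaults)
    for key, value in monitor.items():
        if isinstance(value, dict) and isinstance(rendered.get(key), dict):
            # Monitor-level settings override defaults key by key.
            rendered[key] = {**rendered[key], **value}
        else:
            rendered[key] = copy.deepcopy(value)

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Fail loudly instead of deploying a query with a blank value.
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    rendered["query"] = VARIABLE.sub(substitute, rendered["query"])
    return rendered
```

Raising on a missing variable is a deliberate choice here: a silently empty tag filter would produce a monitor that evaluates the wrong scope.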
Testing and Validation
Monitor Testing
- Validation Process
- Test in development environment
- Verify thresholds
- Check notification delivery
- Simulate failure conditions
- Quality Checks
- Validate configuration syntax
- Check query performance
- Review alert messages
- Test recovery scenarios
Automation
- CI/CD Integration
- Automate validation
- Implement deployment gates
- Include rollback procedures
- Monitor deployment success
# .github/workflows/monitors.yml
name: Monitor Deployment
on:
  push:
    paths:
      - "monitors/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Monitors
        run: |
          dd-monitor validate monitors/**/*.yaml
Security
Access Control
- Authentication
- Use environment variables
- Rotate credentials regularly
- Implement least privilege
- Audit access
- Sensitive Data
- Never commit credentials
- Use secret management
- Encrypt sensitive values
- Implement access logs
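Reading credentials from the environment keeps them out of the repository. The sketch below assumes the conventional Datadog variable names `DD_API_KEY` and `DD_APP_KEY`; check which names your deployment actually uses. `load_credentials` and `mask` are hypothetical helpers.

```python
import os

# Assumed variable names; adjust to match your environment.
REQUIRED = ("DD_API_KEY", "DD_APP_KEY")


def load_credentials(environ=None):
    """Return the required credentials, failing early if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED}


def mask(secret, visible=4):
    """Hide all but the last few characters so a key can be logged safely."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Logging `mask(key)` instead of the key itself still lets an operator confirm which credential is in use after a rotation.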
Maintenance
Monitor Lifecycle
- Regular Review
- Audit active monitors
- Update thresholds
- Remove obsolete monitors
- Document changes
- Performance Optimization
- Review query efficiency
- Optimize evaluation timing
- Adjust aggregation methods
- Monitor API usage
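A periodic audit can start from a simple comparison between the monitors defined in the repository and those that exist in the account. The sketch below works on plain lists of names; how those lists are obtained (for example, from parsed YAML files and from the Datadog API) is left out, and `audit_monitors` is a hypothetical helper.

```python
def audit_monitors(defined_names, deployed_names):
    """Compare monitor names defined in configuration with those deployed.

    'obsolete' lists monitors that are deployed but no longer defined;
    'missing' lists monitors that are defined but not yet deployed.
    """
    defined = set(defined_names)
    deployed = set(deployed_names)
    return {
        "obsolete": sorted(deployed - defined),
        "missing": sorted(defined - deployed),
    }
```

Monitors reported as obsolete may have been created by hand or by another tool, so they are candidates for review rather than automatic deletion.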
Documentation
- Required Documentation
- Monitor purpose
- Alert thresholds
- Response procedures
- Contact information
- Knowledge Base
- Maintain runbooks
- Document common issues
- Include troubleshooting guides
- Update regularly