Best Practices - fleXRPL/datadog-monitor-deployer GitHub Wiki

Best Practices

This guide outlines best practices for using the DataDog Monitor Deployer effectively in production environments.

Monitor Design

Naming Conventions

Consistent Format
- Use descriptive, action-oriented names
- Include service/component name
- Indicate environment if relevant
- Follow a fixed pattern, for example: [Service] [Resource] [Condition] Alert

monitors:
  - name: "API Service High Latency Alert"
  - name: "Web Frontend Error Rate Alert"
  - name: "Database Connection Pool Saturation Alert"
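
A naming convention is easier to keep when it is checked automatically. The sketch below is one possible check, not part of the deployer: it assumes the convention means "two or more capitalized words followed by Alert", and both `NAME_PATTERN` and `check_monitor_names` are hypothetical names introduced for illustration.

```python
import re

# Assumed convention (not defined by the tool): capitalized words separated by
# single spaces, at least two words before the final "Alert".
NAME_PATTERN = re.compile(r"^[A-Z][A-Za-z0-9]*(?: [A-Z][A-Za-z0-9]*)+ Alert$")


def check_monitor_names(names):
    """Return the names that do not follow the assumed convention."""
    return [name for name in names if not NAME_PATTERN.match(name)]


if __name__ == "__main__":
    names = [
        "API Service High Latency Alert",
        "Web Frontend Error Rate Alert",
        "cpu high",
    ]
    for bad in check_monitor_names(names):
        print(f"non-conforming monitor name: {bad}")
```

A check like this could run in a pre-commit hook or CI job; adjust the pattern if your names carry an environment prefix or other punctuation.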

Hierarchical Organization
- Group related monitors
- Use consistent prefixes
- Maintain clear categorization

Query Construction

Performance
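- Match the evaluation window to how quickly the metric changes and how fast you need to respond (the example below uses last_5m)
- Avoid grouping by many tags at once; every tag combination is evaluated as a separate group
- Set evaluation_delay (in seconds) for metrics that arrive late
- Use efficient aggregation methods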

monitors:
  - name: "Efficient Query Example"
    query: "avg(last_5m):avg:system.cpu.user{service:api} by {host} > 80"
    options:
      # Delay evaluation by 900 seconds (15 minutes) to allow for late-arriving data
      evaluation_delay: 900

Reliability
- Include relevant tags
- Use stable metrics
- Decide explicitly how missing data is handled (for example, with notify_no_data)
- Consider seasonality

Alert Configuration

Thresholds

Multiple Levels
- Set warning and critical thresholds
- Include recovery thresholds
- Leave a margin between each threshold and its recovery threshold so alerts do not flap

options:
  thresholds:
    critical: 90
    warning: 80
    ok: 70
    critical_recovery: 85
    warning_recovery: 75
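
The relationships between these values can be checked before deployment. The following is a minimal sketch, assuming a monitor that alerts when the value rises above the threshold (as in the CPU examples in this guide); for "below" monitors the comparisons would be inverted. `validate_thresholds` is a hypothetical helper, not a function of the deployer.

```python
def validate_thresholds(thresholds):
    """Return a list of problems for an 'above' (>) monitor's thresholds."""
    problems = []
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    if critical is None:
        # Nothing else can be compared without a critical threshold.
        return ["critical threshold is required"]
    if warning is not None and warning >= critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and critical_recovery >= critical:
        problems.append("critical_recovery must be below critical")
    if warning_recovery is not None:
        if warning is None:
            problems.append("warning_recovery requires a warning threshold")
        elif warning_recovery >= warning:
            problems.append("warning_recovery must be below warning")
    return problems


if __name__ == "__main__":
    example = {
        "critical": 90,
        "warning": 80,
        "critical_recovery": 85,
        "warning_recovery": 75,
    }
    print(validate_thresholds(example) or "thresholds look consistent")
```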

Validation
- Test thresholds in non-production first
- Document threshold rationale
- Review and adjust periodically

Notifications

Message Content
- Include clear problem description
- Add troubleshooting steps
- Provide context and impact
- Include relevant links

message: |
  {{#is_alert}}
  High CPU usage detected on {{host.name}}

  Current Value: {{value}}%
  Threshold: {{threshold}}%

  Impact: Potential service degradation

  Actions:
  1. Check system load
  2. Review recent deployments
  3. Scale if necessary

  Runbook: https://wiki.example.com/runbooks/high-cpu
  Dashboard: https://datadog.example.com/dashboard/cpu
  {{/is_alert}}

Routing
- Define clear escalation paths
- Use appropriate urgency levels
- Consider time zones
- Implement notification delays

Project Organization

File Structure

monitors/
├── infrastructure/
│   ├── compute.yaml
│   ├── network.yaml
│   └── storage.yaml
├── applications/
│   ├── frontend.yaml
│   ├── backend.yaml
│   └── database.yaml
├── business/
│   ├── transactions.yaml
│   └── users.yaml
└── templates/
    ├── base.yaml
    └── common.yaml

Version Control

Repository Management
- Use meaningful commit messages
- Implement branch protection
- Require code reviews
- Maintain changelog

Change Process
- Create feature branches
- Use pull requests
- Include tests
- Document changes

Templating

Template Design

Reusability
- Create base templates
- Use variables effectively
- Implement inheritance
- Maintain consistency

# templates/base.yaml
template:
  defaults:
    tags:
      - "team:platform"
      - "env:production"
    options:
      notify_no_data: true
      evaluation_delay: 900

# monitors/service.yaml
monitors:
  - template: base
    name: "Service Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{service}}} > {{threshold}}"

Variables
- Use descriptive names
- Provide default values
- Document requirements
- Validate inputs
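
To illustrate defaults and input validation for variables, here is a small sketch of how `{{name}}` placeholders in a query could be resolved. How the deployer itself resolves variables is not described here, so treat `render` and `PLACEHOLDER` as hypothetical. The sketch is meant for query strings only: notification messages use Datadog's own `{{value}}` and `{{threshold}}` variables, which must not be substituted this way.

```python
import re

# Matches {{name}} where name is a plain identifier (no dots or '#').
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(template, variables, defaults=None):
    """Substitute {{name}} placeholders, raising if any variable is missing."""
    values = {**(defaults or {}), **variables}
    needed = {match.group(1) for match in PLACEHOLDER.finditer(template)}
    missing = sorted(needed - set(values))
    if missing:
        raise KeyError(f"missing template variables: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: str(values[match.group(1)]), template)


if __name__ == "__main__":
    query = "avg(last_5m):avg:system.cpu.user{service:{{service}}} > {{threshold}}"
    # 'threshold' falls back to its default; 'service' must be supplied.
    print(render(query, {"service": "api"}, defaults={"threshold": 80}))
```

Failing on a missing variable, rather than leaving the placeholder in place, keeps a half-rendered query from being deployed.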

Testing and Validation

Monitor Testing

Validation Process
- Test in development environment
- Verify thresholds
- Check notification delivery
- Simulate failure conditions

Quality Checks
- Validate configuration syntax
- Check query performance
- Review alert messages
- Test recovery scenarios
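
Some quality checks can run on the parsed configuration before anything reaches Datadog. The sketch below assumes monitor definitions have already been loaded into Python dictionaries and that `name`, `type`, and `query` are the minimum required fields (taken from the examples in this guide, not from the tool's schema). `lint_monitors` is a hypothetical helper and complements, rather than replaces, the tool's own validation.

```python
# Assumed minimum set of fields, based on the examples in this guide.
REQUIRED_FIELDS = ("name", "type", "query")


def lint_monitors(monitors):
    """Return human-readable problems found in a list of monitor dicts."""
    problems = []
    seen_names = set()
    for index, monitor in enumerate(monitors):
        label = monitor.get("name") or f"monitor #{index}"
        for field in REQUIRED_FIELDS:
            if not monitor.get(field):
                problems.append(f"{label}: missing '{field}'")
        name = monitor.get("name")
        if name:
            if name in seen_names:
                problems.append(f"{label}: duplicate name")
            seen_names.add(name)
    return problems


if __name__ == "__main__":
    monitors = [
        {
            "name": "API Service High CPU Alert",
            "type": "metric alert",
            "query": "avg(last_5m):avg:system.cpu.user{service:api} by {host} > 80",
        },
        {"name": "API Service High CPU Alert", "type": "metric alert"},
    ]
    for problem in lint_monitors(monitors):
        print(problem)
```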

Automation

CI/CD Integration
- Automate validation
- Implement deployment gates
- Include rollback procedures
- Monitor deployment success

# .github/workflows/monitors.yml
name: Monitor Deployment
on:
  push:
    paths:
      - "monitors/**"
jobs:
  validate:
    runs-on: ubuntu-latest
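    steps:
      - uses: actions/checkout@v4
      # Install the dd-monitor CLI before the next step (installation not shown here)
      - name: Validate Monitors
        run: |
          # Without globstar, bash expands ** like *, so this matches files one directory below monitors/
          dd-monitor validate monitors/**/*.yaml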

Security

Access Control

Authentication
- Use environment variables
- Rotate credentials regularly
- Implement least privilege
- Audit access

Sensitive Data
- Never commit credentials
- Use secret management
- Encrypt sensitive values
- Implement access logs
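
As an illustration of keeping credentials out of the repository, the sketch below reads keys from the environment and fails fast when one is absent. `DD_API_KEY` and `DD_APP_KEY` are the names conventionally used by Datadog tooling; whether the deployer reads exactly these names is an assumption, and `load_datadog_credentials` is a hypothetical helper.

```python
import os


def load_datadog_credentials(env=None):
    """Read Datadog keys from the environment and fail fast if one is missing."""
    env = os.environ if env is None else env
    # Treat empty strings the same as unset variables.
    missing = [key for key in ("DD_API_KEY", "DD_APP_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {"api_key": env["DD_API_KEY"], "app_key": env["DD_APP_KEY"]}


if __name__ == "__main__":
    try:
        load_datadog_credentials()
        print("Datadog credentials found in environment")
    except RuntimeError as error:
        # Report which variables are missing without printing any secret values.
        print(error)
```

In CI, the variables would typically be injected from the platform's secret store rather than written into the workflow file.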

Maintenance

Monitor Lifecycle

Regular Review
- Audit active monitors
- Update thresholds
- Remove obsolete monitors
- Document changes

Performance Optimization
- Review query efficiency
- Optimize evaluation timing
- Adjust aggregation methods
- Monitor API usage

Documentation

Required Documentation
- Monitor purpose
- Alert thresholds
- Response procedures
- Contact information

Knowledge Base
- Maintain runbooks
- Document common issues
- Include troubleshooting guides
- Update regularly
