Best Practices - fleXRPL/datadog-monitor-deployer GitHub Wiki

Best Practices

This guide outlines best practices for using the DataDog Monitor Deployer effectively in production environments.

Monitor Design

Naming Conventions

  1. Consistent Format
    • Use descriptive, action-oriented names
    • Include service/component name
    • Indicate environment if relevant
    • Example: [Service] [Resource] [Condition] Alert
monitors:
  - name: "API Service High Latency Alert"
  - name: "Web Frontend Error Rate Alert"
  - name: "Database Connection Pool Saturation Alert"
  2. Hierarchical Organization
    • Group related monitors
    • Use consistent prefixes
    • Maintain clear categorization
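A sketch of the prefix-and-tag pattern (the service name and category tags here are illustrative):

```yaml
monitors:
  - name: "[API] Latency - p95 Above Target"
    tags:
      - "service:api"
      - "category:latency"
  - name: "[API] Errors - 5xx Rate Elevated"
    tags:
      - "service:api"
      - "category:errors"
```

A shared prefix keeps related monitors adjacent in the DataDog UI, and the category tag supports filtering across services.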

Query Construction

  1. Performance
    • Use appropriate time windows
    • Avoid excessive tag combinations
    • Consider evaluation delay
    • Use efficient aggregation methods
monitors:
  - name: "Efficient Query Example"
    query: "avg(last_5m):avg:system.cpu.user{service:api} by {host} > 80"
    options:
      evaluation_delay: 900
  2. Reliability
    • Include relevant tags
    • Use stable metrics
    • Handle missing data appropriately
    • Consider seasonality
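One way to make the earlier query more resilient, using standard DataDog monitor options (notify_no_data, no_data_timeframe in minutes, require_full_window):

```yaml
monitors:
  - name: "Reliable Query Example"
    query: "avg(last_5m):avg:system.cpu.user{service:api} by {host} > 80"
    options:
      notify_no_data: true
      no_data_timeframe: 20
      require_full_window: false
```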

Alert Configuration

Thresholds

  1. Multiple Levels
    • Set warning and critical thresholds
    • Include recovery thresholds
    • Use appropriate margins
options:
  thresholds:
    critical: 90
    warning: 80
    ok: 70
    critical_recovery: 85
    warning_recovery: 75
  2. Validation
    • Test thresholds in non-production first
    • Document threshold rationale
    • Review and adjust periodically

Notifications

  1. Message Content
    • Include clear problem description
    • Add troubleshooting steps
    • Provide context and impact
    • Include relevant links
message: |
  {{#is_alert}}
  High CPU usage detected on {{host.name}}

  Current Value: {{value}}%
  Threshold: {{threshold}}%

  Impact: Potential service degradation

  Actions:
  1. Check system load
  2. Review recent deployments
  3. Scale if necessary

  Runbook: https://wiki.example.com/runbooks/high-cpu
  Dashboard: https://datadog.example.com/dashboard/cpu
  {{/is_alert}}
  2. Routing
    • Define clear escalation paths
    • Use appropriate urgency levels
    • Consider time zones
    • Implement notification delays
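A sketch of severity-based routing; the @pagerduty and @slack handles are placeholders for whatever integrations your account defines, and renotify_interval (in minutes) is a standard DataDog monitor option:

```yaml
message: |
  {{#is_alert}}
  Critical - page on-call: @pagerduty-platform-oncall
  {{/is_alert}}
  {{#is_warning}}
  Warning - review during business hours: @slack-platform-alerts
  {{/is_warning}}
options:
  renotify_interval: 30
```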

Project Organization

File Structure

monitors/
├── infrastructure/
│   ├── compute.yaml
│   ├── network.yaml
│   └── storage.yaml
├── applications/
│   ├── frontend.yaml
│   ├── backend.yaml
│   └── database.yaml
├── business/
│   ├── transactions.yaml
│   └── users.yaml
└── templates/
    ├── base.yaml
    └── common.yaml

Version Control

  1. Repository Management

    • Use meaningful commit messages
    • Implement branch protection
    • Require code reviews
    • Maintain changelog
  2. Change Process

    • Create feature branches
    • Use pull requests
    • Include tests
    • Document changes

Templating

Template Design

  1. Reusability
    • Create base templates
    • Use variables effectively
    • Implement inheritance
    • Maintain consistency
# templates/base.yaml
template:
  defaults:
    tags:
      - "team:platform"
      - "env:production"
    options:
      notify_no_data: true
      evaluation_delay: 900

# monitors/service.yaml
monitors:
  - template: base
    name: "Service Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{service}}} > {{threshold}}"
  2. Variables
    • Use descriptive names
    • Provide default values
    • Document requirements
    • Validate inputs
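Building on the base template above, a sketch of documented variables with defaults; note that the variables key and its schema are illustrative here — check the deployer's documentation for the exact syntax it supports:

```yaml
monitors:
  - template: base
    name: "{{service}} CPU Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{service}}} > {{threshold}}"
    variables:
      service: "api"     # required: service tag used to scope the query
      threshold: 80      # default: CPU percentage that triggers the alert
```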

Testing and Validation

Monitor Testing

  1. Validation Process

    • Test in development environment
    • Verify thresholds
    • Check notification delivery
    • Simulate failure conditions
  2. Quality Checks

    • Validate configuration syntax
    • Check query performance
    • Review alert messages
    • Test recovery scenarios
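One possible staging pattern for the validation steps above, assuming an env tag distinguishes environments (the Slack handle is a placeholder): deploy the same monitor scoped to dev with non-paging notifications before promoting thresholds to production:

```yaml
monitors:
  - name: "[DEV] API Service High CPU Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:api,env:dev} by {host} > 80"
    message: "@slack-platform-dev-alerts (no paging in dev)"
    tags:
      - "env:dev"
```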

Automation

  1. CI/CD Integration
    • Automate validation
    • Implement deployment gates
    • Include rollback procedures
    • Monitor deployment success
# .github/workflows/monitors.yml
name: Monitor Deployment
on:
  push:
    paths:
      - "monitors/**"
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Monitors
        run: |
          dd-monitor validate monitors/**/*.yaml

Security

Access Control

  1. Authentication

    • Use environment variables
    • Rotate credentials regularly
    • Implement least privilege
    • Audit access
  2. Sensitive Data

    • Never commit credentials
    • Use secret management
    • Encrypt sensitive values
    • Implement access logs
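These practices can be combined with the CI workflow shown earlier. DD_API_KEY and DD_APP_KEY are the standard DataDog credential variable names; the deploy subcommand is assumed here — substitute whatever command your pipeline actually runs:

```yaml
# .github/workflows/monitors.yml (excerpt)
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy Monitors
        env:
          DD_API_KEY: ${{ secrets.DD_API_KEY }}
          DD_APP_KEY: ${{ secrets.DD_APP_KEY }}
        run: dd-monitor deploy monitors/**/*.yaml
```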

Maintenance

Monitor Lifecycle

  1. Regular Review

    • Audit active monitors
    • Update thresholds
    • Remove obsolete monitors
    • Document changes
  2. Performance Optimization

    • Review query efficiency
    • Optimize evaluation timing
    • Adjust aggregation methods
    • Monitor API usage

Documentation

  1. Required Documentation

    • Monitor purpose
    • Alert thresholds
    • Response procedures
    • Contact information
  2. Knowledge Base

    • Maintain runbooks
    • Document common issues
    • Include troubleshooting guides
    • Update regularly

Additional Resources