Configuration Guide - fleXRPL/datadog-monitor-deployer GitHub Wiki

Configuration Guide

This guide provides detailed information about configuring monitors using the DataDog Monitor Deployer.

Configuration File Format

Monitor configurations can be defined in either YAML or JSON format. YAML is recommended for better readability.

Basic Structure

monitors:
  - name: "Monitor Name"
    type: "monitor_type"
    query: "monitor_query"
    message: "alert_message"
    tags: []
    options: {}
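
Since the same structure can also be written as JSON, one quick way to see the equivalent form is to build it as a Python dictionary and serialize it. This is a minimal sketch using only the standard `json` module; it does not call the deployer itself, and the field values are the same placeholders as in the skeleton above.

```python
import json

# Same shape as the YAML skeleton; every value here is a placeholder.
config = {
    "monitors": [
        {
            "name": "Monitor Name",
            "type": "monitor_type",
            "query": "monitor_query",
            "message": "alert_message",
            "tags": [],
            "options": {},
        }
    ]
}

# Serialize to the JSON form of the configuration file.
json_text = json.dumps(config, indent=2)
print(json_text)
```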

Monitor Properties

Required Fields

Field   | Type   | Description
name    | string | Display name of the monitor
type    | string | Type of monitor (e.g., "metric alert", "log alert")
query   | string | Monitor query/condition
message | string | Alert notification message

Optional Fields

Field            | Type    | Description
tags             | array   | List of tags for categorization
priority         | integer | Alert priority (1-5)
restricted_roles | array   | Roles with access to the monitor
options          | object  | Additional monitor options

Monitor Types

Supported Types

  • metric alert - Threshold alerts on metrics
  • service check - Status-based monitoring
  • event alert - Event-based monitoring
  • query alert - Complex query monitoring
  • composite - Combined monitor conditions
  • log alert - Log-based monitoring
  • process alert - Process monitoring
  • trace-analytics alert - APM monitoring
  • slo alert - SLO monitoring
  • event-v2 alert - Enhanced event monitoring
  • audit alert - Audit log monitoring
  • rum alert - Real user monitoring
  • ci-pipelines alert - CI pipeline monitoring
  • error-tracking alert - Error tracking

Options Configuration

Common Options

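options:
  notify_no_data: true        # alert when the metric stops reporting
  no_data_timeframe: 10       # minutes without data before notifying
  notify_audit: false         # notify on changes to the monitor itself
  timeout_h: 0                # hours
  evaluation_delay: 900       # seconds
  new_host_delay: 300         # seconds
  include_tags: true          # include triggering tags in the notification title
  require_full_window: false  # evaluate even if the window is not full
  renotify_interval: 60       # minutes between repeat notifications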

Thresholds Configuration

options:
  thresholds:
    critical: 90
    warning: 80
    ok: 70
    critical_recovery: 85
    warning_recovery: 75
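
For a monitor that alerts when a value rises above its threshold, the values above are only coherent if each level sits below the one it relates to. The following is a minimal sketch of such a consistency check. It assumes an "above" (`>`) comparison and a required `critical` value, and the function name `check_thresholds` is hypothetical, not part of the deployer.

```python
def check_thresholds(thresholds):
    """Return a list of problems for an 'above' (>) threshold set."""
    problems = []
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    # Assumption: a critical threshold is always required.
    if critical is None:
        return ["critical threshold is missing"]

    if warning is not None and not warning < critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and not critical_recovery < critical:
        problems.append("critical_recovery must be below critical")
    if (warning is not None and warning_recovery is not None
            and not warning_recovery < warning):
        problems.append("warning_recovery must be below warning")
    return problems


example = {
    "critical": 90,
    "warning": 80,
    "critical_recovery": 85,
    "warning_recovery": 75,
}
print(check_thresholds(example))
```

For a "below" (`<`) monitor the comparisons would be reversed.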

Notification Configuration

options:
  notification_preset_name: "custom"
  notification_targets:
    - type: "slack"
      channel: "#alerts"
    - type: "email"
      address: "user@example.com"
    - type: "pagerduty"
      service_key: "key123"

Template System

Basic Template

template:
  defaults:
    tags:
      - "team:platform"
      - "env:production"
    options:
      notify_no_data: true
      evaluation_delay: 900

monitors:
  - template: base
    name: "CPU Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{*} > 80"
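
The effect of template defaults can be pictured as a merge in which the monitor's own values take precedence. The sketch below is one plausible reading, not the deployer's actual implementation: it assumes tags are appended without duplicates and options are merged key by key. The guide does not show how the `template: base` reference is resolved, so the hypothetical `apply_defaults` function simply receives the defaults block directly.

```python
import copy


def apply_defaults(defaults, monitor):
    """Merge template defaults into a monitor; monitor values win."""
    merged = copy.deepcopy(defaults)
    for key, value in monitor.items():
        if key == "template":
            continue  # the reference itself is not a monitor field
        if key == "tags":
            # Append monitor tags to default tags, keeping order, dropping duplicates.
            merged["tags"] = list(dict.fromkeys(merged.get("tags", []) + value))
        elif key == "options":
            # Monitor options override defaults with the same key.
            merged["options"] = {**merged.get("options", {}), **value}
        else:
            merged[key] = value
    return merged


defaults = {
    "tags": ["team:platform", "env:production"],
    "options": {"notify_no_data": True, "evaluation_delay": 900},
}
monitor = {
    "template": "base",
    "name": "CPU Alert",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
}
result = apply_defaults(defaults, monitor)
print(result)
```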

Variable Substitution

template:
  variables:
    threshold: 80
    service: "web"
    team: "platform"

monitors:
  - name: "{{ service }} CPU Usage"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{ service }}} > {{ threshold }}"
    tags:
      - "team:{{ team }}"
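
To make the substitution concrete, the sketch below replaces each `{{ name }}` placeholder with its value from the variables block. It is an illustration only: the deployer may well use a full template engine, and the `render` function here is a hypothetical helper that handles plain variable names and nothing else.

```python
import re

# Matches {{ name }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text, variables):
    """Replace every {{ name }} in text with str(variables[name])."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined template variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(replace, text)


variables = {"threshold": 80, "service": "web", "team": "platform"}

print(render("{{ service }} CPU Usage", variables))
print(render(
    "avg(last_5m):avg:system.cpu.user{service:{{ service }}} > {{ threshold }}",
    variables,
))
```

Failing on an undefined variable, rather than leaving the placeholder in place, keeps a typo from reaching a deployed monitor query.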

Advanced Configuration

Composite Monitors

monitors:
  - name: "Service Health"
    type: "composite"
    query: "12345 && 67890"  # IDs of existing monitors, combined with &&
    message: "Multiple conditions met"
    options:
      notify_no_data: false

Scheduled Downtime

downtime:
  scope: "env:production"
  start: "2024-03-01T00:00:00Z"
  end: "2024-03-02T00:00:00Z"
  message: "Scheduled maintenance"

Monitor Groups

groups:
  infrastructure:
    monitors:
      - name: "CPU Alert"
        type: "metric alert"
        query: "avg(last_5m):avg:system.cpu.user{*} > 80"
      - name: "Memory Alert"
        type: "metric alert"
        query: "avg(last_5m):avg:system.mem.used{*} > 90"

Environment-Specific Configuration

Using Environment Variables

monitors:
  - name: "${SERVICE_NAME} Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:${SERVICE_NAME}} > ${THRESHOLD}"
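
The `${NAME}` syntax matches what Python's standard `string.Template` understands, so the expansion can be sketched as follows. This is an assumption about behavior, not the deployer's code. The hypothetical `expand_env` helper reads from the process environment by default and raises `KeyError` when a variable is not set.

```python
import os
from string import Template


def expand_env(text, env=None):
    """Expand ${NAME} references using env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Template(text).substitute(env)


# Explicit mapping used here so the example does not depend on the shell.
env = {"SERVICE_NAME": "web", "THRESHOLD": "80"}

print(expand_env("${SERVICE_NAME} Alert", env))
print(expand_env(
    "avg(last_5m):avg:system.cpu.user{service:${SERVICE_NAME}} > ${THRESHOLD}",
    env,
))
```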

Environment Overrides

environments:
  production:
    threshold: 90
    notification_channel: "#prod-alerts"
  staging:
    threshold: 80
    notification_channel: "#staging-alerts"
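
One way to read these overrides is as a per-environment lookup whose values feed into the monitor definition. The sketch below assumes exactly that. The helper names and the wording of the generated `name` and `message` are hypothetical and only illustrate how `threshold` and `notification_channel` might be consumed.

```python
ENVIRONMENTS = {
    "production": {"threshold": 90, "notification_channel": "#prod-alerts"},
    "staging": {"threshold": 80, "notification_channel": "#staging-alerts"},
}


def settings_for(name, environments=ENVIRONMENTS):
    """Return the override block for one environment."""
    if name not in environments:
        raise ValueError(f"unknown environment: {name}")
    return environments[name]


def cpu_monitor(environment):
    """Build a CPU monitor using the selected environment's overrides."""
    settings = settings_for(environment)
    return {
        "name": f"CPU Alert ({environment})",
        "type": "metric alert",
        # Doubled braces produce a literal {*} scope in the f-string.
        "query": f"avg(last_5m):avg:system.cpu.user{{*}} > {settings['threshold']}",
        "message": f"CPU usage is high. Channel: {settings['notification_channel']}",
    }


print(cpu_monitor("production")["query"])
```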

Validation

Schema Validation

The tool validates configurations against a JSON schema that ensures:

  • Required fields are present
  • Field types are correct
  • Values are within allowed ranges
  • Enum values are valid
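
The four checks listed above can be illustrated with a small hand-written validator. This is not the tool's JSON schema, only a sketch of the same ideas built from the field tables and the supported type list in this guide; `validate_monitor` is a hypothetical name.

```python
REQUIRED = {"name": str, "type": str, "query": str, "message": str}
OPTIONAL = {"tags": list, "priority": int, "restricted_roles": list, "options": dict}
MONITOR_TYPES = {
    "metric alert", "service check", "event alert", "query alert",
    "composite", "log alert", "process alert", "trace-analytics alert",
    "slo alert", "event-v2 alert", "audit alert", "rum alert",
    "ci-pipelines alert", "error-tracking alert",
}


def validate_monitor(monitor):
    """Return a list of validation errors; an empty list means valid."""
    errors = []
    # Required fields must be present and have the right type.
    for field, expected in REQUIRED.items():
        if field not in monitor:
            errors.append(f"missing required field: {field}")
        elif not isinstance(monitor[field], expected):
            errors.append(f"{field} must be of type {expected.__name__}")
    # Optional fields are type-checked only when present.
    for field, expected in OPTIONAL.items():
        if field in monitor and not isinstance(monitor[field], expected):
            errors.append(f"{field} must be of type {expected.__name__}")
    # Enum check on the monitor type.
    if isinstance(monitor.get("type"), str) and monitor["type"] not in MONITOR_TYPES:
        errors.append(f"unsupported monitor type: {monitor['type']}")
    # Range check on priority.
    priority = monitor.get("priority")
    if isinstance(priority, int) and not 1 <= priority <= 5:
        errors.append("priority must be between 1 and 5")
    return errors


valid = {
    "name": "CPU Alert",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
    "message": "CPU is high",
    "priority": 2,
}
print(validate_monitor(valid))
```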

Query Validation

Queries are validated for:

  • Syntax correctness
  • Metric existence
  • Tag validity
  • Function support
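
Of these checks, only syntax can be verified offline; metric existence and tag validity require a lookup against Datadog. The sketch below covers the syntax part for the simple threshold form used in this guide's examples. The pattern is a deliberate simplification and will reject many valid Datadog queries (arithmetic, functions, `by` groupings), so treat it as an illustration rather than a real query parser.

```python
import re

SIMPLE_QUERY = re.compile(
    r"^(avg|sum|min|max)\(last_\d+[mhd]\):"  # time aggregation and window
    r"(avg|sum|min|max):"                     # space aggregation
    r"[A-Za-z][\w.]*"                         # metric name
    r"\{[^{}]*\}"                             # scope, e.g. {*} or {service:web}
    r"\s*(>=|<=|>|<)\s*"                      # comparison operator
    r"-?\d+(\.\d+)?$"                         # numeric threshold
)


def looks_like_simple_query(query):
    """True if query has the form agg(last_N):agg:metric{scope} OP number."""
    return SIMPLE_QUERY.match(query) is not None


print(looks_like_simple_query("avg(last_5m):avg:system.cpu.user{*} > 80"))
print(looks_like_simple_query("avg(last_5m):avg:system.cpu.user{*}"))
```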

Best Practices

  1. Naming Conventions

    • Use descriptive names
    • Include the environment and service in each name
    • Apply the same naming pattern to every monitor

  2. Organization

    • Group related monitors
    • Use templates for common patterns
    • Maintain a clear file structure

  3. Version Control

    • Commit configuration files
    • Write meaningful commit messages
    • Review changes before deploying

  4. Documentation

    • Comment complex queries
    • Include runbooks
    • Document template variables

Additional Resources