Configuration Guide - fleXRPL/datadog-monitor-deployer GitHub Wiki

Configuration Guide

This guide provides detailed information about configuring monitors using the DataDog Monitor Deployer.

Configuration File Format

Monitor configurations can be defined in either YAML or JSON format. YAML is recommended for better readability.

Basic Structure

monitors:
  - name: "Monitor Name"
    type: "monitor_type"
    query: "monitor_query"
    message: "alert_message"
    tags: []
    options: {}

Monitor Properties

Required Fields

Field	Type	Description
`name`	string	Display name of the monitor
`type`	string	Type of monitor (e.g., "metric alert", "log alert")
`query`	string	Monitor query/condition
`message`	string	Alert notification message
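As a rough illustration of the required fields above, the sketch below checks a parsed configuration for missing keys. The helper name `missing_required_fields` is hypothetical and not part of the deployer. JSON is used only so the example needs no third-party parser.

```python
import json

# Required keys, taken from the table above.
REQUIRED_FIELDS = ("name", "type", "query", "message")


def missing_required_fields(monitor: dict) -> list[str]:
    """Return the required keys that are absent or empty in one monitor entry."""
    return [key for key in REQUIRED_FIELDS if not monitor.get(key)]


config = json.loads("""
{
  "monitors": [
    {"name": "CPU Alert",
     "type": "metric alert",
     "query": "avg(last_5m):avg:system.cpu.user{*} > 80"}
  ]
}
""")

for monitor in config["monitors"]:
    print(monitor["name"], "is missing:", missing_required_fields(monitor))
```

A monitor that lists all four keys yields an empty list.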

Optional Fields

Field	Type	Description
`tags`	array	List of tags for categorization
`priority`	integer	Alert priority, from 1 (highest) to 5 (lowest)
`restricted_roles`	array	Roles with access to the monitor
`options`	object	Additional monitor options

Monitor Types

Supported Types

- metric alert: Threshold alerts on metrics
- service check: Status-based monitoring
- event alert: Event-based monitoring
- query alert: Complex query monitoring
- composite: Combined monitor conditions
- log alert: Log-based monitoring
- process alert: Process monitoring
- trace-analytics alert: APM monitoring
- slo alert: SLO monitoring
- event-v2 alert: Enhanced event monitoring
- audit alert: Audit log monitoring
- rum alert: Real user monitoring
- ci-pipelines alert: CI pipeline monitoring
- error-tracking alert: Error tracking
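The type strings must match exactly, including spaces and hyphens. A minimal sketch of such a check follows. The set mirrors the list above, and `is_supported_type` is a hypothetical helper, not the deployer's own API.

```python
# Monitor types copied from the list above.
SUPPORTED_TYPES = frozenset({
    "metric alert",
    "service check",
    "event alert",
    "query alert",
    "composite",
    "log alert",
    "process alert",
    "trace-analytics alert",
    "slo alert",
    "event-v2 alert",
    "audit alert",
    "rum alert",
    "ci-pipelines alert",
    "error-tracking alert",
})


def is_supported_type(monitor_type: str) -> bool:
    """Return True when the type string is one of the supported monitor types."""
    return monitor_type in SUPPORTED_TYPES


print(is_supported_type("metric alert"))   # exact match
print(is_supported_type("metric_alert"))   # underscore instead of space
```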

Options Configuration

Common Options

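options:
  notify_no_data: true        # notify when the monitor stops receiving data
  no_data_timeframe: 10       # minutes without data before notifying
  notify_audit: false         # notify on changes to the monitor itself
  timeout_h: 0                # hours without data before a triggered monitor auto-resolves
  evaluation_delay: 900       # seconds to delay evaluation
  new_host_delay: 300         # seconds to wait before evaluating a newly started host
  include_tags: true          # include triggering tags in the notification title
  require_full_window: false  # evaluate even if the window is not fully populated
  renotify_interval: 60       # minutes between re-notifications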

Thresholds Configuration

options:
  thresholds:
    critical: 90
    warning: 80
    ok: 70
    critical_recovery: 85
    warning_recovery: 75
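The threshold values have to be ordered consistently. The sketch below assumes a monitor that alerts when the value rises above its thresholds (a `>` comparison), so `warning` sits below `critical` and each recovery value sits below its own threshold. The helper name `threshold_problems` is hypothetical, and the deployer's own rules may differ.

```python
def threshold_problems(thresholds: dict) -> list[str]:
    """Check threshold ordering for a monitor that alerts when the value rises ABOVE them."""
    problems = []
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    critical_recovery = thresholds.get("critical_recovery")
    warning_recovery = thresholds.get("warning_recovery")

    if critical is None:
        # Nothing else can be compared without a critical threshold.
        return ["critical threshold is missing"]
    if warning is not None and warning >= critical:
        problems.append("warning must be below critical")
    if critical_recovery is not None and critical_recovery >= critical:
        problems.append("critical_recovery must be below critical")
    if warning is not None and warning_recovery is not None and warning_recovery >= warning:
        problems.append("warning_recovery must be below warning")
    return problems


example = {
    "critical": 90,
    "warning": 80,
    "ok": 70,
    "critical_recovery": 85,
    "warning_recovery": 75,
}
print(threshold_problems(example))
```

For a monitor that alerts when the value drops below its thresholds, the comparisons would be reversed.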

Notification Configuration

options:
  notification_preset_name: "custom"
  notification_targets:
    - type: "slack"
      channel: "#alerts"
    - type: "email"
      address: "[email protected]"
    - type: "pagerduty"
      service_key: "key123"

Template System

Basic Template

template:
  defaults:
    tags:
      - "team:platform"
      - "env:production"
    options:
      notify_no_data: true
      evaluation_delay: 900

monitors:
  - template: base
    name: "CPU Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{*} > 80"
    message: "CPU usage is above 80%"
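One plausible way to apply template defaults is a recursive merge in which values set on the monitor win over the defaults. The sketch below assumes exactly that behaviour. The helper `apply_defaults` is hypothetical, and the deployer may merge lists or resolve conflicts differently.

```python
import copy


def apply_defaults(defaults: dict, monitor: dict) -> dict:
    """Merge template defaults into a monitor; monitor values take precedence."""
    merged = copy.deepcopy(defaults)
    for key, value in monitor.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mappings such as `options` are merged key by key.
            merged[key] = apply_defaults(merged[key], value)
        else:
            # Scalars and lists from the monitor replace the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "tags": ["team:platform", "env:production"],
    "options": {"notify_no_data": True, "evaluation_delay": 900},
}
monitor = {
    "name": "CPU Alert",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{*} > 80",
    "options": {"notify_no_data": False},
}
print(apply_defaults(defaults, monitor))
```

Deep copies keep the shared defaults from being mutated when several monitors use the same template.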

Variable Substitution

template:
  variables:
    threshold: 80
    service: "web"
    team: "platform"

monitors:
  - name: "{{ service }} CPU Usage"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:{{ service }}} > {{ threshold }}"
    message: "CPU usage for {{ service }} is above {{ threshold }}%"
    tags:
      - "team:{{ team }}"
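To make the substitution concrete, here is a minimal sketch that replaces `{{ name }}` placeholders with a regular expression. It only illustrates the idea: the deployer may use a full template engine, and the `render` function is a hypothetical name.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(text: str, variables: dict) -> str:
    """Replace each {{ name }} placeholder with the matching variable value."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined template variable: {name}")
        return str(variables[name])

    return _PLACEHOLDER.sub(replace, text)


variables = {"threshold": 80, "service": "web", "team": "platform"}
query = render(
    "avg(last_5m):avg:system.cpu.user{service:{{ service }}} > {{ threshold }}",
    variables,
)
print(query)  # avg(last_5m):avg:system.cpu.user{service:web} > 80
```

Single braces, such as the tag scope in the query, are left untouched because the pattern only matches doubled braces.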

Advanced Configuration

Composite Monitors

monitors:
  - name: "Service Health"
    type: "composite"
    query: "12345 && 67890"  # IDs of existing monitors, combined with boolean operators
    message: "Multiple conditions met"
    options:
      notify_no_data: false
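Because a composite query refers to other monitors by ID, it can be useful to extract those IDs and confirm the monitors exist before deploying. A small sketch, with `referenced_monitor_ids` as a hypothetical helper name:

```python
import re


def referenced_monitor_ids(composite_query: str) -> list[int]:
    """Return the monitor IDs mentioned in a composite query, in order of appearance."""
    return [int(token) for token in re.findall(r"\d+", composite_query)]


print(referenced_monitor_ids("12345 && 67890"))  # [12345, 67890]
```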

Scheduled Downtime

downtime:
  scope: "env:production"
  start: "2024-03-01T00:00:00Z"
  end: "2024-03-02T00:00:00Z"
  message: "Scheduled maintenance"
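Datadog's v1 downtime endpoint expresses `start` and `end` as POSIX timestamps in seconds, so ISO 8601 values like those above have to be converted at some point. Whether the deployer does this internally is an assumption; the sketch only shows the conversion, using a hypothetical helper `to_epoch_seconds`.

```python
from datetime import datetime


def to_epoch_seconds(timestamp: str) -> int:
    """Convert an ISO 8601 timestamp with a UTC offset or trailing 'Z' to epoch seconds."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no time zone: {timestamp!r}")
    return int(parsed.timestamp())


start = to_epoch_seconds("2024-03-01T00:00:00Z")
end = to_epoch_seconds("2024-03-02T00:00:00Z")
print(start, end, end - start)
```

Rejecting timestamps without a time zone avoids silently interpreting them in the local zone of the machine running the deployment.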

Monitor Groups

groups:
  infrastructure:
    monitors:
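      - name: "CPU Alert"
        type: "metric alert"
        query: "avg(last_5m):avg:system.cpu.user{*} > 80"
        message: "CPU usage is above 80%"
      - name: "Memory Alert"
        type: "metric alert"
        # system.mem.used is reported in bytes; pct_usable is a fraction between 0 and 1
        query: "avg(last_5m):avg:system.mem.pct_usable{*} < 0.1"
        message: "Less than 10% of memory is usable"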
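Grouped monitors eventually have to be deployed one by one. The sketch below flattens a `groups` mapping into a single list and records the group on each monitor. Adding a `group:<name>` tag is an assumption made for illustration, as is the helper name `flatten_groups`.

```python
def flatten_groups(groups: dict) -> list[dict]:
    """Flatten grouped monitors into one list, tagging each with its group name."""
    flat = []
    for group_name, group in groups.items():
        for monitor in group.get("monitors", []):
            entry = dict(monitor)  # shallow copy so the input stays unchanged
            entry["tags"] = list(monitor.get("tags", [])) + [f"group:{group_name}"]
            flat.append(entry)
    return flat


groups = {
    "infrastructure": {
        "monitors": [
            {"name": "CPU Alert", "type": "metric alert"},
            {"name": "Memory Alert", "type": "metric alert", "tags": ["team:platform"]},
        ]
    }
}
for monitor in flatten_groups(groups):
    print(monitor["name"], monitor["tags"])
```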

Environment-Specific Configuration

Using Environment Variables

monitors:
  - name: "${SERVICE_NAME} Alert"
    type: "metric alert"
    query: "avg(last_5m):avg:system.cpu.user{service:${SERVICE_NAME}} > ${THRESHOLD}"
    message: "CPU usage for ${SERVICE_NAME} is above ${THRESHOLD}%"
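The `${NAME}` syntax is the same one Python's `string.Template` understands, which makes the expansion easy to sketch. How the deployer itself resolves these references is not specified here; `expand_env` is a hypothetical helper.

```python
import os
from string import Template


def expand_env(text: str, env=None) -> str:
    """Expand ${NAME} references from a mapping, defaulting to the process environment."""
    values = os.environ if env is None else env
    # substitute() raises KeyError for an undefined name instead of leaving it in place.
    return Template(text).substitute(values)


query = expand_env(
    "avg(last_5m):avg:system.cpu.user{service:${SERVICE_NAME}} > ${THRESHOLD}",
    {"SERVICE_NAME": "web", "THRESHOLD": "80"},
)
print(query)  # avg(last_5m):avg:system.cpu.user{service:web} > 80
```

Failing on an undefined variable is usually preferable to deploying a monitor whose query still contains a literal `${...}`.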

Environment Overrides

environments:
  production:
    threshold: 90
    notification_channel: "#prod-alerts"
  staging:
    threshold: 80
    notification_channel: "#staging-alerts"
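The overrides above can be read as a lookup keyed by environment name. A minimal sketch of selecting one environment's settings and using them in a query follows. The function `settings_for` and the way the threshold is spliced into the query are assumptions for illustration only.

```python
ENVIRONMENTS = {
    "production": {"threshold": 90, "notification_channel": "#prod-alerts"},
    "staging": {"threshold": 80, "notification_channel": "#staging-alerts"},
}


def settings_for(environment: str, environments: dict = ENVIRONMENTS) -> dict:
    """Return a copy of the override values for one environment."""
    try:
        return dict(environments[environment])
    except KeyError:
        known = ", ".join(sorted(environments))
        raise ValueError(
            f"unknown environment {environment!r}; expected one of: {known}"
        ) from None


settings = settings_for("staging")
query = f"avg(last_5m):avg:system.cpu.user{{*}} > {settings['threshold']}"
print(query)  # avg(last_5m):avg:system.cpu.user{*} > 80
```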

Validation

Schema Validation

The tool validates configurations against a JSON schema that ensures:

- Required fields are present
- Field types are correct
- Values are within allowed ranges
- Enum values are valid
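The type and range checks can be pictured with a few plain Python conditions. This is a sketch, not the tool's actual schema: the field names come from the Optional Fields table, while `field_problems` is a hypothetical helper.

```python
def field_problems(monitor: dict) -> list[str]:
    """Report type and range problems for the optional fields of one monitor."""
    problems = []
    for key in ("tags", "restricted_roles"):
        if key in monitor and not isinstance(monitor[key], list):
            problems.append(f"{key} must be an array")
    if "options" in monitor and not isinstance(monitor["options"], dict):
        problems.append("options must be an object")
    if "priority" in monitor:
        priority = monitor["priority"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(priority, int) or isinstance(priority, bool):
            problems.append("priority must be an integer")
        elif not 1 <= priority <= 5:
            problems.append("priority must be between 1 and 5")
    return problems


print(field_problems({"tags": "team:platform", "priority": 7}))
```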

Query Validation

Queries are validated for:

- Syntax correctness
- Metric existence
- Tag validity
- Function support
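A syntax check for the simplest query form used in this guide could look like the sketch below. It covers only `aggregation(last_N):aggregation:metric{scope} operator threshold`. Real Datadog queries support far more (grouping, functions, arithmetic), and checking that a metric or tag exists would require a call to the Datadog API, which is not shown. The pattern and the helper `parse_simple_query` are illustrative assumptions.

```python
import re

# Matches only simple threshold queries such as
# "avg(last_5m):avg:system.cpu.user{*} > 80".
SIMPLE_METRIC_QUERY = re.compile(
    r"^(?P<time_agg>avg|sum|min|max)\(last_(?P<window>\d+[mhd])\):"
    r"(?P<space_agg>avg|sum|min|max):(?P<metric>[A-Za-z][\w.]*)"
    r"\{(?P<scope>[^{}]*)\}\s*(?P<op>>=|<=|>|<)\s*"
    r"(?P<threshold>-?\d+(?:\.\d+)?)$"
)


def parse_simple_query(query: str):
    """Return the parts of a simple metric query, or None if it does not match."""
    match = SIMPLE_METRIC_QUERY.match(query)
    return match.groupdict() if match else None


print(parse_simple_query("avg(last_5m):avg:system.cpu.user{*} > 80"))
print(parse_simple_query("avg(last_5m):system.cpu.user > 80"))
```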

Best Practices

Naming Conventions
- Use descriptive names
- Include the environment and service
- Be consistent

Organization
- Group related monitors
- Use templates for common patterns
- Maintain a clear structure

Version Control
- Commit configurations
- Write meaningful commit messages
- Review changes

Documentation
- Comment complex queries
- Include runbooks
- Document variables

Additional Resources

Monitor Types - Examples of different monitor types
Templating Guide - Advanced templating usage
Best Practices - Configuration best practices
DataDog API Documentation