Dynatrace Monitoring & Alerting Configuration Checklist

Purpose

This checklist is an inventory of our current Dynatrace configuration. Complete one copy per environment to document its current state, then compare the copies to identify configuration drift between environments.

Environments to Audit: LAB | INT | QA | PROD


Environment: [ENVIRONMENT_NAME]

Audit Date: [DATE]
Audited By: [TEAM_MEMBER]
Dynatrace Version: [VERSION]
OneAgent Version: [VERSION]


1. Anomaly Detection Configuration

Application Performance Anomaly Detection

Response Time Anomalies

  • Enabled: Yes/No
  • Sensitivity: Low | Medium | High | Custom
  • Threshold Method: Automatic | Fixed | Relative
  • Custom Thresholds:
    • Response time degradation: ___% increase over ___-minute window
    • Absolute threshold: ___ ms
    • Minimum requests: ___ per minute
  • Alert on:
    • Individual services
    • Service groups
    • All services
  • Detection timeframe: ___ minutes
  • Alert delay: ___ minutes
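
To make the custom threshold fields above concrete, here is a minimal Python sketch of how such a rule could be evaluated. The `ResponseTimeRule` class, its field names, and the choice to require both the relative and the absolute threshold to be exceeded are assumptions made for illustration; they do not describe how Dynatrace itself evaluates anomalies.

```python
from dataclasses import dataclass


@dataclass
class ResponseTimeRule:
    """Hypothetical container for the custom threshold fields in this checklist."""
    degradation_pct: float        # allowed % increase over the baseline
    absolute_ms: float            # absolute response time threshold in ms
    min_requests_per_min: float   # below this load the rule is not evaluated


def response_time_violation(rule, baseline_ms, observed_ms, requests_per_min):
    """Return True when the observed response time breaks the rule.

    Assumption: both the relative and the absolute threshold must be exceeded.
    """
    if requests_per_min < rule.min_requests_per_min:
        return False  # too little traffic to judge
    relative_limit = baseline_ms * (1 + rule.degradation_pct / 100)
    return observed_ms > relative_limit and observed_ms > rule.absolute_ms
```

Recording the three values together per environment makes it easier to see when, for example, PROD requires more traffic before alerting than QA does.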

Error Rate Anomalies

  • Enabled: Yes/No
  • Sensitivity: Low | Medium | High | Custom
  • Threshold Configuration:
    • Error rate increase: ___% over ___-minute baseline
    • Absolute error rate: ___%
    • Minimum error count: ___ errors
  • Error categories monitored:
    • HTTP 4xx errors
    • HTTP 5xx errors
    • Custom exceptions
    • Database errors
  • Alert delay: ___ minutes

Traffic Anomalies

  • Enabled: Yes/No
  • Traffic drop detection:
    • Threshold: ___% decrease over ___ minutes
    • Minimum traffic volume: ___ requests/minute
  • Traffic spike detection:
    • Threshold: ___% increase over ___ minutes
    • Maximum traffic threshold: ___ requests/minute
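
The drop and spike thresholds above are both percentage changes against a baseline. The sketch below shows one way to classify a traffic reading; the function names, the `"ignored"` result for low-volume services, and the use of a single baseline value are assumptions for illustration only.

```python
def traffic_change_pct(baseline_rpm, current_rpm):
    """Percentage change of current traffic relative to the baseline."""
    if baseline_rpm <= 0:
        raise ValueError("baseline_rpm must be positive")
    return (current_rpm - baseline_rpm) / baseline_rpm * 100.0


def classify_traffic(baseline_rpm, current_rpm, drop_pct, spike_pct, min_volume_rpm):
    """Classify a reading as 'drop', 'spike', 'normal', or 'ignored'.

    Assumption: services whose baseline is below the minimum traffic
    volume are skipped entirely.
    """
    if baseline_rpm < min_volume_rpm:
        return "ignored"
    change = traffic_change_pct(baseline_rpm, current_rpm)
    if change <= -drop_pct:
        return "drop"
    if change >= spike_pct:
        return "spike"
    return "normal"
```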

Infrastructure Anomaly Detection

Host Resource Anomalies

  • CPU Usage: Enabled/Disabled
    • Threshold: ___% for ___ minutes
    • Alert on: Average | Peak | 95th percentile
  • Memory Usage: Enabled/Disabled
    • Threshold: ___% for ___ minutes
    • Alert on: Average | Peak | 95th percentile
  • Disk Usage: Enabled/Disabled
    • Threshold: ___% for ___ minutes
    • Include swap: Yes/No
  • Network Anomalies: Enabled/Disabled
    • Retransmission rate: ___%
    • Connection failure rate: ___%

Process Anomalies

  • Process Crashes: Enabled/Disabled
    • Alert on first crash: Yes/No
    • Alert after ___ crashes in ___ minutes
  • Process Unavailability: Enabled/Disabled
    • Detection timeout: ___ minutes
  • Process Resource Consumption: Enabled/Disabled
    • CPU threshold: ___%
    • Memory threshold: ___ MB
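
The "Alert after ___ crashes in ___ minutes" setting describes a sliding-window count. A minimal sketch of that check follows; `crash_alert` and its parameters are hypothetical names, and the logic is an illustration rather than the product's implementation.

```python
from datetime import datetime, timedelta


def crash_alert(crash_times, max_crashes, window_minutes):
    """Return True if any window of `window_minutes` contains at least
    `max_crashes` crash timestamps."""
    times = sorted(crash_times)
    window = timedelta(minutes=window_minutes)
    start = 0
    for end, current in enumerate(times):
        # Drop crashes that fall outside the window ending at `current`.
        while current - times[start] > window:
            start += 1
        if end - start + 1 >= max_crashes:
            return True
    return False
```

With "Alert on first crash: Yes", the equivalent call would use `max_crashes=1`.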

Database Anomaly Detection

  • Database Connection Failures: Enabled/Disabled
    • Failure rate threshold: ___%
    • Minimum connections: ___
  • Database Response Time: Enabled/Disabled
    • Threshold: ___ ms increase over baseline
    • Detection window: ___ minutes
  • Database Resource Consumption: Enabled/Disabled
    • CPU threshold: ___%
    • Lock wait time: ___ ms

2. Custom Alerting Rules

Business Logic Alerts

API-Specific Thresholds

| API Endpoint | Response Time SLA | Error Rate SLA | Current Threshold | Alert Enabled |
|---|---|---|---|---|
| /api/v1/authenticate | ___ ms | ___% | ___ ms | Y/N |
| /api/v1/products | ___ ms | ___% | ___ ms | Y/N |
| /api/v1/orders | ___ ms | ___% | ___ ms | Y/N |
| /api/v1/analytics | ___ ms | ___% | ___ ms | Y/N |
| [Add more endpoints] | | | | |

Custom Metrics Alerts

  • Tenant Provisioning Success Rate
    • Current threshold: ___%
    • Time window: ___ minutes
    • Alert enabled: Y/N
  • API Gateway Rate Limiting
    • Threshold hit rate: ___%
    • Alert on tenant: All/Specific
    • Notification delay: ___ minutes
  • License Utilization
    • Warning threshold: ___%
    • Critical threshold: ___%
    • Check frequency: ___ minutes
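
Warning/critical threshold pairs such as the License Utilization entry are easy to record in the wrong order. The sketch below classifies a utilization reading and rejects a pair where the warning threshold is not below the critical one; the function name and the returned labels are hypothetical.

```python
def classify_utilization(percent_used, warning_pct, critical_pct):
    """Return 'ok', 'warning', or 'critical' for a utilization percentage."""
    if warning_pct >= critical_pct:
        raise ValueError("warning threshold must be below critical threshold")
    if percent_used >= critical_pct:
        return "critical"
    if percent_used >= warning_pct:
        return "warning"
    return "ok"
```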

Infrastructure-Specific Alerts

Kubernetes Cluster Health

  • Pod Restart Rate
    • Threshold: ___ restarts per ___ minutes
    • Namespace scope: All/Specific
    • Alert enabled: Y/N
  • Node Resource Pressure
    • Memory pressure threshold: ___%
    • Disk pressure threshold: ___%
    • CPU pressure threshold: ___%
  • Persistent Volume Usage
    • Warning threshold: ___%
    • Critical threshold: ___%
    • Check frequency: ___ minutes

AWS Service Integration Alerts

  • RDS Performance
    • CPU utilization: ___%
    • Connection count: ___
    • Read/Write latency: ___ ms
  • ElastiCache Performance
    • CPU utilization: ___%
    • Memory utilization: ___%
    • Cache hit ratio: ___%
  • Load Balancer Health
    • Unhealthy target threshold: ___
    • Response time threshold: ___ ms
    • Error rate threshold: ___%

3. Notification Configuration

Alerting Profiles

| Profile Name | Severity Levels | Notification Delay | Active Hours | Enabled |
|---|---|---|---|---|
| Critical Production | P1 | ___ minutes | 24/7 | Y/N |
| Business Hours | P1, P2 | ___ minutes | 8AM-6PM | Y/N |
| Development Team | P2, P3 | ___ minutes | 9AM-5PM | Y/N |
| [Add more profiles] | | | | |
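
The Active Hours column determines when a profile may send notifications. As a rough illustration, the sketch below checks whether a given time of day falls inside a profile's active hours; `profile_is_active` is a hypothetical helper, and treating a missing start or end as 24/7 is an assumption.

```python
from datetime import time


def profile_is_active(now, start=None, end=None):
    """Return True if `now` (a datetime.time) is inside the active hours.

    Assumption: start/end of None means the profile is active 24/7.
    The end time is exclusive.
    """
    if start is None or end is None:
        return True
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, for example 22:00 to 06:00.
    return now >= start or now < end
```

For the "Business Hours" row, the call would be `profile_is_active(now, time(8), time(18))`.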

Notification Channels

  • Email Notifications
    • Recipients: [List email addresses]
    • Alert formats: HTML/Text
    • Frequency limits: ___ per hour
  • Slack Integration
    • Channel: #[channel-name]
    • Webhook URL: [Configured/Not Configured]
    • Message format: Standard/Custom
  • PagerDuty Integration
    • Service key: [Configured/Not Configured]
    • Escalation policy: [Policy name]
    • Auto-resolution: Enabled/Disabled
  • Teams Integration
    • Channel: [channel-name]
    • Webhook URL: [Configured/Not Configured]

Problem Notification Rules

  • Immediate Notification (P1)
    • Services: [List critical services]
    • Notification delay: ___ minutes
    • Channels: [List channels]
  • Delayed Notification (P2/P3)
    • Notification delay: ___ minutes
    • Business hours only: Y/N
    • Channels: [List channels]

4. Maintenance Windows

Scheduled Maintenance Windows

| Window Name | Schedule | Duration | Affected Services | Alerting Suppressed |
|---|---|---|---|---|
| Weekly Maintenance | [Day/Time] | ___ hours | [Services] | Y/N |
| Deployment Window | [Day/Time] | ___ hours | [Services] | Y/N |
| Patch Window | [Day/Time] | ___ hours | [Services] | Y/N |

Ad-hoc Maintenance Configuration

  • Manual maintenance window creation: Enabled/Disabled
  • Default duration: ___ hours
  • Auto-extension capability: Enabled/Disabled
  • Approval required: Y/N

5. Dashboard Configuration

Built-in Dashboards in Use

  • Application Performance Overview: Active/Inactive
  • Infrastructure Overview: Active/Inactive
  • Database Performance: Active/Inactive
  • Kubernetes Overview: Active/Inactive
  • AWS Services: Active/Inactive

Custom Dashboards

| Dashboard Name | Owner | Last Updated | Shared With | Purpose |
|---|---|---|---|---|
| [Dashboard name] | [Owner] | [Date] | [Team/Role] | [Purpose] |
| [Dashboard name] | [Owner] | [Date] | [Team/Role] | [Purpose] |

Dashboard Sharing & Access

  • Public dashboard count: ___
  • Team-restricted dashboard count: ___
  • Role-based access control: Enabled/Disabled

6. Service-Level Monitoring

Current SLI/SLO Configuration

| Service | SLI Metric | Current SLO | Measurement Window | Error Budget | Alerting |
|---|---|---|---|---|---|
| [Service name] | [Metric] | [Target] | [Timeframe] | [Budget] | Y/N |
| [Service name] | [Metric] | [Target] | [Timeframe] | [Budget] | Y/N |

Error Budget Configuration

  • Error budget calculation: Enabled/Disabled
  • Budget period: Daily/Weekly/Monthly
  • Budget burn rate alerts: Enabled/Disabled
  • Fast burn threshold: ___% of budget in ___ hours
  • Slow burn threshold: ___% of budget in ___ days
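
The fast and slow burn thresholds are both expressed as "percent of budget consumed in a time window". The sketch below shows the arithmetic behind that figure, assuming the error rate stays constant over the window; the function name and parameters are hypothetical and chosen only for illustration.

```python
def budget_consumed_pct(slo_target_pct, observed_error_pct, window_hours, period_hours):
    """Percent of the error budget consumed during `window_hours`.

    The error budget is (100 - SLO target). The burn rate is the observed
    error rate divided by that budget; a burn rate of 1 uses up the budget
    exactly over the whole budget period.
    """
    budget_pct = 100.0 - slo_target_pct
    if budget_pct <= 0:
        raise ValueError("SLO target must be below 100%")
    burn_rate = observed_error_pct / budget_pct
    return burn_rate * (window_hours / period_hours) * 100.0
```

For example, with a 99.9% SLO over a 30-day (720-hour) period, a 1.44% error rate sustained for one hour corresponds to a burn rate of about 14.4 and consumes roughly 2% of the budget.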

7. Synthetic Monitoring

Active Synthetic Tests

| Test Name | Type | Frequency | Locations | Alert Threshold | Enabled |
|---|---|---|---|---|---|
| [Test name] | HTTP/Browser | ___ minutes | [Locations] | [Threshold] | Y/N |
| [Test name] | HTTP/Browser | ___ minutes | [Locations] | [Threshold] | Y/N |

Synthetic Monitor Configuration

  • Global outage detection: Enabled/Disabled
    • Locations required: ___ out of ___
  • Performance regression detection: Enabled/Disabled
    • Threshold: ___% slower than baseline
  • Content verification: Enabled/Disabled
    • Text checks configured: Y/N
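
The "Locations required: ___ out of ___" field can be read as a quorum: an outage counts as global only when enough locations fail at the same time. The sketch below illustrates that reading; `is_global_outage` and the shape of its input are assumptions, not a Dynatrace API.

```python
def is_global_outage(location_ok, required_failures):
    """Return True when at least `required_failures` locations report failure.

    `location_ok` maps a location name to True (test passed) or False (failed).
    """
    if not 1 <= required_failures <= len(location_ok):
        raise ValueError("required_failures must be between 1 and the location count")
    failed = sum(1 for ok in location_ok.values() if not ok)
    return failed >= required_failures
```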

8. Log Monitoring

Log Analysis Configuration

  • Log ingestion: Enabled/Disabled
    • Volume per day: ___ GB
    • Retention period: ___ days
  • Log-based alert count: ___
  • Custom log metric count: ___

Log-Based Alerting Rules

| Rule Name | Log Source | Pattern/Filter | Threshold | Alert Frequency | Enabled |
|---|---|---|---|---|---|
| [Rule name] | [Source] | [Pattern] | [Threshold] | [Frequency] | Y/N |
| [Rule name] | [Source] | [Pattern] | [Threshold] | [Frequency] | Y/N |
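
When filling in the Pattern/Filter and Threshold columns, it can help to test a pattern against sample log lines first. The sketch below counts regular-expression matches and compares the count to a threshold; it uses Python's `re` module, and `log_rule_fires` is a hypothetical helper rather than anything Dynatrace provides.

```python
import re


def log_rule_fires(lines, pattern, threshold):
    """Return True when at least `threshold` lines match the regex `pattern`."""
    regex = re.compile(pattern)
    matches = sum(1 for line in lines if regex.search(line))
    return matches >= threshold
```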

9. Integration Health

Monitoring Tool Integrations

  • AWS CloudWatch: Connected/Disconnected
    • Last sync: [Date/Time]
    • Metrics imported: ___
  • Kubernetes API: Connected/Disconnected
    • Cluster count: ___
    • Namespace monitoring: All/Selective
  • Prometheus: Connected/Disconnected
    • Endpoint count: ___
    • Custom metrics: ___

External Service Dependencies

| Service | Connection Status | Monitoring Method | Alert on Failure | Last Verified |
|---|---|---|---|---|
| [Service] | [Status] | [Method] | Y/N | [Date] |
| [Service] | [Status] | [Method] | Y/N | [Date] |

10. Configuration Management

Tagging Strategy

  • Environment tags: Consistent/Inconsistent
    • Tag format: [Format used]
  • Application tags: Consistent/Inconsistent
    • Tag format: [Format used]
  • Team ownership tags: Consistent/Inconsistent
    • Tag format: [Format used]
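
Whether tags are "Consistent" or "Inconsistent" is easier to judge once the tag format is written as a pattern. The sketch below lists the tags that do not match a given format; the example format `env:(lab|int|qa|prod)` is purely hypothetical and should be replaced with whatever format is recorded above.

```python
import re


def inconsistent_tags(tags, tag_format):
    """Return the tags that do not fully match the regex `tag_format`."""
    regex = re.compile(tag_format)
    return [tag for tag in tags if not regex.fullmatch(tag)]


# Hypothetical format: lowercase environment name after an "env:" prefix.
EXAMPLE_ENV_FORMAT = r"env:(lab|int|qa|prod)"
```

An empty result means the sampled tags are consistent with the recorded format.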

Configuration Backup

  • Configuration export last performed: [Date]
  • Alert rule backup last performed: [Date]
  • Dashboard backup last performed: [Date]

Review Summary

Configuration Completeness Score

Overall Score: ___/100

Breakdown by Category:

  • Anomaly Detection: ___/25
  • Custom Alerting: ___/20
  • Notifications: ___/15
  • SLO Monitoring: ___/15
  • Synthetic Monitoring: ___/10
  • Integration Health: ___/10
  • Documentation: ___/5
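
The category maximums above add up to 100, so the overall score is simply the sum of the category scores. A small sketch that performs the sum and rejects a category score above its maximum follows; the `CATEGORY_MAX` dictionary mirrors the breakdown above, and the function name is hypothetical.

```python
# Maximum points per category, copied from the breakdown in this checklist.
CATEGORY_MAX = {
    "Anomaly Detection": 25,
    "Custom Alerting": 20,
    "Notifications": 15,
    "SLO Monitoring": 15,
    "Synthetic Monitoring": 10,
    "Integration Health": 10,
    "Documentation": 5,
}


def overall_score(scores):
    """Sum the category scores; categories missing from `scores` count as 0."""
    total = 0
    for category, maximum in CATEGORY_MAX.items():
        value = scores.get(category, 0)
        if not 0 <= value <= maximum:
            raise ValueError(f"{category} score must be between 0 and {maximum}")
        total += value
    return total
```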

Critical Gaps Identified

  1. [Gap description and priority]
  2. [Gap description and priority]
  3. [Gap description and priority]

Immediate Action Items

  • [Action item with owner and deadline]
  • [Action item with owner and deadline]
  • [Action item with owner and deadline]

Environment Parity Issues

(To be completed after all environments are audited)

  • [List differences between environments]
  • [Priority level for standardization]
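
Once each environment's checklist has been transcribed into key/value form, the differences can be listed mechanically. The sketch below is one way to do that, assuming each environment is represented as a flat dictionary of setting names to values; the setting names in the usage example are hypothetical.

```python
def find_drift(configs):
    """Return the settings whose values differ between environments.

    `configs` maps an environment name to a dict of setting -> value.
    A setting missing from an environment is reported as None.
    """
    all_settings = set()
    for settings in configs.values():
        all_settings.update(settings)

    drift = {}
    for name in sorted(all_settings):
        values = {env: settings.get(name) for env, settings in configs.items()}
        # Compare by repr so unhashable values such as lists also work.
        if len({repr(value) for value in values.values()}) > 1:
            drift[name] = values
    return drift
```

For example, `find_drift({"QA": {"alert_delay_min": 5}, "PROD": {"alert_delay_min": 10}})` reports `alert_delay_min` with the value recorded for each environment.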

Audit Completed By: [Name]
Date: [Date]
Next Review Date: [Date]