Dynatrace Monitoring & Alerting Configuration Checklist
Purpose
This checklist serves as a comprehensive inventory of our current Dynatrace configuration across all environments. Use this template to document the current state and identify configuration drift between environments.
Environments to Audit: LAB | INT | QA | PROD
Environment: [ENVIRONMENT_NAME]
Audit Date: [DATE]
Audited By: [TEAM_MEMBER]
Dynatrace Version: [VERSION]
OneAgent Version: [VERSION]
1. Anomaly Detection Configuration
Application Performance Anomaly Detection
Response Time Anomalies
- Enabled: Yes/No
- Sensitivity: Low | Medium | High | Custom
- Threshold Method: Automatic | Fixed | Relative
- Custom Thresholds:
  - Response time degradation: ___% increase over ___-minute window
  - Absolute threshold: ___ ms
  - Minimum requests: ___ per minute
- Alert on:
  - Individual services
  - Service groups
  - All services
- Detection timeframe: ___ minutes
- Alert delay: ___ minutes
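When filling in the custom threshold fields above, it can help to see how they might combine into one rule. The following is a minimal sketch, not Dynatrace's detection logic: the `ResponseTimeRule` class, its field names, and the decision to require both the relative and the absolute threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResponseTimeRule:
    # Hypothetical stand-ins for the blanks in the checklist.
    degradation_pct: float        # "% increase" over the baseline window
    absolute_ms: float            # "Absolute threshold" in milliseconds
    min_requests_per_min: float   # "Minimum requests" per minute


def violates(rule: ResponseTimeRule, baseline_ms: float,
             observed_ms: float, requests_per_min: float) -> bool:
    """Return True when the observed response time breaks the rule.

    Assumption: both the relative and the absolute threshold must be
    exceeded, and low-traffic services are never flagged.
    """
    if requests_per_min < rule.min_requests_per_min:
        return False  # too little traffic to judge
    if baseline_ms <= 0:
        return False  # no usable baseline
    increase_pct = (observed_ms - baseline_ms) / baseline_ms * 100.0
    return (increase_pct >= rule.degradation_pct
            and observed_ms >= rule.absolute_ms)
```

Recording the values in this shape per environment makes it easier to compare LAB, INT, QA, and PROD later.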
Error Rate Anomalies
- Enabled: Yes/No
- Sensitivity: Low | Medium | High | Custom
- Threshold Configuration:
  - Error rate increase: ___% over ___-minute baseline
  - Absolute error rate: ___%
  - Minimum error count: ___ errors
- Error categories monitored:
  - HTTP 4xx errors
  - HTTP 5xx errors
  - Custom exceptions
  - Database errors
- Alert delay: ___ minutes
Traffic Anomalies
- Enabled: Yes/No
- Traffic drop detection:
  - Threshold: ___% decrease over ___ minutes
  - Minimum traffic volume: ___ requests/minute
- Traffic spike detection:
  - Threshold: ___% increase over ___ minutes
  - Maximum traffic threshold: ___ requests/minute
Infrastructure Anomaly Detection
Host Resource Anomalies
- CPU Usage: Enabled/Disabled
  - Threshold: ___% for ___ minutes
  - Alert on: Average | Peak | 95th percentile
- Memory Usage: Enabled/Disabled
  - Threshold: ___% for ___ minutes
  - Alert on: Average | Peak | 95th percentile
- Disk Usage: Enabled/Disabled
  - Threshold: ___% for ___ minutes
  - Include swap: Yes/No
- Network Anomalies: Enabled/Disabled
  - Retransmission rate: ___%
  - Connection failure rate: ___%
Process Anomalies
- Process Crashes: Enabled/Disabled
  - Alert on first crash: Yes/No
  - Alert after ___ crashes in ___ minutes
- Process Unavailability: Enabled/Disabled
  - Detection timeout: ___ minutes
- Process Resource Consumption: Enabled/Disabled
  - CPU threshold: ___%
  - Memory threshold: ___ MB
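The "Alert after ___ crashes in ___ minutes" setting describes a count over a sliding time window. A rough sketch of that idea follows; the `CrashWindow` class is hypothetical, timestamps are assumed to be seconds in non-decreasing order, and this is not a description of how Dynatrace evaluates crashes internally.

```python
from collections import deque


class CrashWindow:
    """Track crash timestamps and report when the configured count is reached."""

    def __init__(self, max_crashes: int, window_minutes: float) -> None:
        self.max_crashes = max_crashes
        self.window_seconds = window_minutes * 60
        self._events = deque()  # crash timestamps, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one crash; return True if an alert should fire."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop crashes that fell out of the sliding window.
        while self._events and self._events[0] < cutoff:
            self._events.popleft()
        return len(self._events) >= self.max_crashes
```

Setting `max_crashes=1` reproduces the "Alert on first crash: Yes" choice.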
Database Anomaly Detection
- Database Connection Failures: Enabled/Disabled
  - Failure rate threshold: ___%
  - Minimum connections: ___
- Database Response Time: Enabled/Disabled
  - Threshold: ___ ms increase over baseline
  - Detection window: ___ minutes
- Database Resource Consumption: Enabled/Disabled
  - CPU threshold: ___%
  - Lock wait time: ___ ms
2. Custom Alerting Rules
Business Logic Alerts
API-Specific Thresholds
| API Endpoint | Response Time SLA | Error Rate SLA | Current Threshold | Alert Enabled |
|---|---|---|---|---|
| `/api/v1/authenticate` | ___ ms | ___% | ___ ms | Y/N |
| `/api/v1/products` | ___ ms | ___% | ___ ms | Y/N |
| `/api/v1/orders` | ___ ms | ___% | ___ ms | Y/N |
| `/api/v1/analytics` | ___ ms | ___% | ___ ms | Y/N |
| [Add more endpoints] | | | | |
Custom Metrics Alerts
- Tenant Provisioning Success Rate
  - Current threshold: ___%
  - Time window: ___ minutes
  - Alert enabled: Y/N
- API Gateway Rate Limiting
  - Threshold hit rate: ___%
  - Alert on tenant: All/Specific
  - Notification delay: ___ minutes
- License Utilization
  - Warning threshold: ___%
  - Critical threshold: ___%
  - Check frequency: ___ minutes
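Several items in this checklist pair a warning threshold with a critical threshold. As a sanity check on the values you record, the sketch below shows one plausible way such a pair maps a utilization reading to a severity. The function name `classify_utilization` and the "ok" / "warning" / "critical" labels are assumptions for illustration only.

```python
def classify_utilization(value_pct: float, warning_pct: float,
                         critical_pct: float) -> str:
    """Map a utilization percentage to a severity label."""
    if warning_pct > critical_pct:
        # A warning threshold above the critical one is almost certainly a typo.
        raise ValueError("warning threshold must not exceed critical threshold")
    if value_pct >= critical_pct:
        return "critical"
    if value_pct >= warning_pct:
        return "warning"
    return "ok"
```

The `ValueError` branch is useful during an audit: it catches thresholds entered in the wrong order.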
Infrastructure-Specific Alerts
Kubernetes Cluster Health
- Pod Restart Rate
  - Threshold: ___ restarts per ___ minutes
  - Namespace scope: All/Specific
  - Alert enabled: Y/N
- Node Resource Pressure
  - Memory pressure threshold: ___%
  - Disk pressure threshold: ___%
  - CPU pressure threshold: ___%
- Persistent Volume Usage
  - Warning threshold: ___%
  - Critical threshold: ___%
  - Check frequency: ___ minutes
AWS Service Integration Alerts
- RDS Performance
  - CPU utilization: ___%
  - Connection count: ___
  - Read/Write latency: ___ ms
- ElastiCache Performance
  - CPU utilization: ___%
  - Memory utilization: ___%
  - Cache hit ratio: ___%
- Load Balancer Health
  - Unhealthy target threshold: ___
  - Response time threshold: ___ ms
  - Error rate threshold: ___%
3. Notification Configuration
Alerting Profiles
| Profile Name | Severity Levels | Notification Delay | Active Hours | Enabled |
|---|---|---|---|---|
| Critical Production | P1 | ___ minutes | 24/7 | Y/N |
| Business Hours | P1, P2 | ___ minutes | 8AM-6PM | Y/N |
| Development Team | P2, P3 | ___ minutes | 9AM-5PM | Y/N |
| [Add more profiles] |
Notification Channels
- Email Notifications
  - Recipients: [List email addresses]
  - Alert formats: HTML/Text
  - Frequency limits: ___ per hour
- Slack Integration
  - Channel: #[channel-name]
  - Webhook URL: [Configured/Not Configured]
  - Message format: Standard/Custom
- PagerDuty Integration
  - Service key: [Configured/Not Configured]
  - Escalation policy: [Policy name]
  - Auto-resolution: Enabled/Disabled
- Teams Integration
  - Channel: [channel-name]
  - Webhook URL: [Configured/Not Configured]
Problem Notification Rules
- Immediate Notification (P1)
  - Services: [List critical services]
  - Notification delay: ___ minutes
  - Channels: [List channels]
- Delayed Notification (P2/P3)
  - Notification delay: ___ minutes
  - Business hours only: Y/N
  - Channels: [List channels]
4. Maintenance Windows
Scheduled Maintenance Windows
| Window Name | Schedule | Duration | Affected Services | Alerting Suppressed |
|---|---|---|---|---|
| Weekly Maintenance | [Day/Time] | ___ hours | [Services] | Y/N |
| Deployment Window | [Day/Time] | ___ hours | [Services] | Y/N |
| Patch Window | [Day/Time] | ___ hours | [Services] | Y/N |
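To verify that a documented window would actually have suppressed a given alert, the start time and duration from the table can be checked against the alert timestamp. This is a minimal sketch under stated assumptions: the function `is_suppressed` is hypothetical, windows are treated as half-open intervals (start inclusive, end exclusive), and all datetimes are assumed to share one time zone.

```python
from datetime import datetime, timedelta


def is_suppressed(alert_time: datetime, window_start: datetime,
                  duration_hours: float, suppress_alerts: bool = True) -> bool:
    """Return True if the alert falls inside a window that suppresses alerting."""
    if not suppress_alerts:
        return False  # "Alerting Suppressed" column is N
    window_end = window_start + timedelta(hours=duration_hours)
    return window_start <= alert_time < window_end
```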
Ad-hoc Maintenance Configuration
- Manual maintenance window creation: Enabled/Disabled
- Default duration: ___ hours
- Auto-extension capability: Enabled/Disabled
- Approval required: Y/N
5. Dashboard Configuration
Built-in Dashboards in Use
- Application Performance Overview: Active/Inactive
- Infrastructure Overview: Active/Inactive
- Database Performance: Active/Inactive
- Kubernetes Overview: Active/Inactive
- AWS Services: Active/Inactive
Custom Dashboards
| Dashboard Name | Owner | Last Updated | Shared With | Purpose |
|---|---|---|---|---|
| [Dashboard name] | [Owner] | [Date] | [Team/Role] | [Purpose] |
| [Dashboard name] | [Owner] | [Date] | [Team/Role] | [Purpose] |
Dashboard Sharing & Access
- Public dashboards: Count: ___
- Team-restricted dashboards: Count: ___
- Role-based access control: Enabled/Disabled
6. Service-Level Monitoring
Current SLI/SLO Configuration
| Service | SLI Metric | Current SLO | Measurement Window | Error Budget | Alerting |
|---|---|---|---|---|---|
| [Service name] | [Metric] | [Target] | [Timeframe] | [Budget] | Y/N |
| [Service name] | [Metric] | [Target] | [Timeframe] | [Budget] | Y/N |
Error Budget Configuration
- Error budget calculation: Enabled/Disabled
- Budget period: Daily/Weekly/Monthly
- Budget burn rate alerts: Enabled/Disabled
  - Fast burn threshold: ___% of budget in ___ hours
  - Slow burn threshold: ___% of budget in ___ days
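The burn thresholds above are expressed as a share of the error budget consumed within a time window. The sketch below shows the arithmetic for an event-based SLO, assuming the budget is `(1 - SLO target)` times the expected event count for the budget period. The function names and the event-based model are illustrative assumptions; the configured SLOs may be computed differently.

```python
def budget_consumed_pct(slo_target: float, period_total_events: int,
                        window_bad_events: int) -> float:
    """Percentage of the period's error budget used by failures in one window.

    slo_target is a fraction, for example 0.999 for a 99.9% SLO.
    """
    allowed_bad = (1.0 - slo_target) * period_total_events
    if allowed_bad <= 0:
        raise ValueError("SLO target leaves no error budget")
    return window_bad_events / allowed_bad * 100.0


def burn_alert(consumed_pct: float, threshold_pct: float) -> bool:
    """Return True when the window consumed at least the threshold share."""
    return consumed_pct >= threshold_pct
```

For example, with a 99% SLO and 100,000 expected events in the budget period, the budget is about 1,000 failed events, so 50 failures in a single window consume roughly 5% of it.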
7. Synthetic Monitoring
Active Synthetic Tests
| Test Name | Type | Frequency | Locations | Alert Threshold | Enabled |
|---|---|---|---|---|---|
| [Test name] | HTTP/Browser | ___ minutes | [Locations] | [Threshold] | Y/N |
| [Test name] | HTTP/Browser | ___ minutes | [Locations] | [Threshold] | Y/N |
Synthetic Monitor Configuration
- Global outage detection: Enabled/Disabled
  - Locations required: ___ out of ___
- Performance regression detection: Enabled/Disabled
  - Threshold: ___% slower than baseline
- Content verification: Enabled/Disabled
  - Text checks configured: Y/N
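The "Locations required: ___ out of ___" field can be read as a quorum: an outage is declared only when enough locations fail at once. The sketch below illustrates that reading. The function `global_outage` and the location names in the usage line are hypothetical, and interpreting the first blank as the number of failing locations is an assumption.

```python
from typing import Mapping


def global_outage(results: Mapping[str, bool], required_failures: int) -> bool:
    """Return True when at least `required_failures` locations report failure.

    `results` maps a location name to True (check passed) or False (failed).
    """
    failed = sum(1 for passed in results.values() if not passed)
    return failed >= required_failures
```

With `{"loc-a": False, "loc-b": False, "loc-c": True}` and a requirement of 2 out of 3, the function reports an outage; a requirement of 3 out of 3 would not.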
8. Log Monitoring
Log Analysis Configuration
- Log ingestion: Enabled/Disabled
  - Volume per day: ___ GB
  - Retention period: ___ days
- Log-based alerts: Count: ___
- Custom log metrics: Count: ___
Log-Based Alerting Rules
| Rule Name | Log Source | Pattern/Filter | Threshold | Alert Frequency | Enabled |
|---|---|---|---|---|---|
| [Rule name] | [Source] | [Pattern] | [Threshold] | [Frequency] | Y/N |
| [Rule name] | [Source] | [Pattern] | [Threshold] | [Frequency] | Y/N |
9. Integration Health
Monitoring Tool Integrations
- AWS CloudWatch: Connected/Disconnected
  - Last sync: [Date/Time]
  - Metrics imported: ___ count
- Kubernetes API: Connected/Disconnected
  - Cluster count: ___
  - Namespace monitoring: All/Selective
- Prometheus: Connected/Disconnected
  - Endpoint count: ___
  - Custom metrics: ___ count
External Service Dependencies
| Service | Connection Status | Monitoring Method | Alert on Failure | Last Verified |
|---|---|---|---|---|
| [Service] | [Status] | [Method] | Y/N | [Date] |
| [Service] | [Status] | [Method] | Y/N | [Date] |
10. Configuration Management
Tagging Strategy
- Environment tags: Consistent/Inconsistent
  - Tag format: [Format used]
- Application tags: Consistent/Inconsistent
  - Tag format: [Format used]
- Team ownership tags: Consistent/Inconsistent
  - Tag format: [Format used]
Configuration Backup
- Configuration export: Last performed: [Date]
- Alert rule backup: Last performed: [Date]
- Dashboard backup: Last performed: [Date]
Review Summary
Configuration Completeness Score
Overall Score: ___/100
Breakdown by Category:
- Anomaly Detection: ___/25
- Custom Alerting: ___/20
- Notifications: ___/15
- SLO Monitoring: ___/15
- Synthetic Monitoring: ___/10
- Integration Health: ___/10
- Documentation: ___/5
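The category maximums above sum to 100, so the overall score is the sum of the category scores. A small helper like the following can catch arithmetic slips when filling in the summary; the `CATEGORY_MAX` mapping mirrors the breakdown above, while the function name `overall_score` is a hypothetical choice.

```python
CATEGORY_MAX = {
    "Anomaly Detection": 25,
    "Custom Alerting": 20,
    "Notifications": 15,
    "SLO Monitoring": 15,
    "Synthetic Monitoring": 10,
    "Integration Health": 10,
    "Documentation": 5,
}


def overall_score(scores: dict) -> int:
    """Validate each category score against its maximum and return the total."""
    for name, value in scores.items():
        if name not in CATEGORY_MAX:
            raise KeyError(f"unknown category: {name}")
        if not 0 <= value <= CATEGORY_MAX[name]:
            raise ValueError(f"{name} must be between 0 and {CATEGORY_MAX[name]}")
    return sum(scores.values())
```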
Critical Gaps Identified
- [Gap description and priority]
- [Gap description and priority]
- [Gap description and priority]
Immediate Action Items
- [Action item with owner and deadline]
- [Action item with owner and deadline]
- [Action item with owner and deadline]
Environment Parity Issues
(To be completed after all environments are audited)
- [List differences between environments]
- [Priority level for standardization]
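Once each environment's checklist is captured as key/value pairs, parity issues can be listed mechanically. The sketch below is one possible approach, not part of any Dynatrace tooling: the function `find_drift`, the `MISSING` marker, and the setting names used in the example are all hypothetical.

```python
from typing import Any, Dict

MISSING = "<not set>"  # marker for a setting absent in an environment


def find_drift(environments: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return every setting whose value differs between environments.

    `environments` maps an environment name (LAB, INT, QA, PROD) to its
    audited settings. The result maps each drifting setting to its value
    per environment.
    """
    all_keys = set().union(*(cfg.keys() for cfg in environments.values()))
    drift = {}
    for key in sorted(all_keys):
        values = {env: cfg.get(key, MISSING) for env, cfg in environments.items()}
        # Compare by repr so unhashable values (lists, dicts) are handled too.
        if len({repr(v) for v in values.values()}) > 1:
            drift[key] = values
    return drift
```

For instance, if QA records `cpu_threshold_pct` as 90 and PROD records 80, the result contains that one key with both values, while settings that match everywhere are omitted.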
Audit Completed By: [Name]
Date: [Date]
Next Review Date: [Date]