Governance Policies Operational Excellence Monitoring And Observability - Azure/az-prototype GitHub Wiki
Governance policies for Monitoring and Observability
Domain: performance
| Name | Description |
|---|---|
| Full observability stack | Application Insights (auto-instrumentation) + metric alerts (P95, errors, CPU) + distributed tracing (W3C) + saved KQL queries + availability tests |
| Three pillars of observability | Metrics (alerts, dashboards), Logs (KQL queries, saved searches), Traces (distributed tracing, service map) |
| Description | Instead |
|---|---|
| Do not deploy applications without Application Insights | Enable auto-instrumentation via APPLICATIONINSIGHTS_CONNECTION_STRING on all compute resources |
| Do not use InstrumentationKey for Application Insights configuration | Use ConnectionString — InstrumentationKey is deprecated and does not support regional ingestion |
| Do not create alerts without action groups | Configure action groups with email, webhook, or Logic App receivers for all metric alerts |
| Do not rely solely on internal health probes | Add external availability tests from multiple global locations to detect network-level outages |
- Application Insights overview
- KQL query language reference
- Metric alerts
- Distributed tracing
- Availability tests
| Check | Severity | Description |
|---|---|---|
| WAF-OPEX-OBS-001 | Required | Configure Application Insights with auto-instrumentation for .NET, Python, and Node.js — use connection string, not instrumentation key |
| WAF-OPEX-OBS-002 | Required | Configure custom metric alerts for key performance indicators — P95 latency, error rate, throughput, and resource utilization |
| WAF-OPEX-OBS-003 | Required | Enable W3C distributed tracing with trace context propagation across all services in the request chain |
| WAF-OPEX-OBS-004 | Recommended | Create standard KQL queries for performance monitoring — P95 latency, error rates, throughput, and slow dependency calls |
| WAF-OPEX-OBS-005 | Recommended | Configure availability tests for public endpoints — standard URL ping test and multi-step web tests |
Configure Application Insights with auto-instrumentation for .NET, Python, and Node.js — use connection string, not instrumentation key
Severity: Required
Rationale: Application Insights provides request tracking, dependency tracing, and performance metrics. Connection strings support regional ingestion endpoints
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer, monitoring-agent
- Microsoft.Insights/components
- Microsoft.OperationalInsights/workspaces
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
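
The connection-string requirement above can be checked mechanically before deployment. The following is a minimal sketch, not part of the policy tooling: the function names and the shape of the findings list are hypothetical, and it assumes app settings arrive as a plain dictionary. It also treats a missing `IngestionEndpoint` segment as a finding, which is a stricter reading of the policy than Application Insights itself enforces.

```python
from typing import Dict, List

# Setting names: the first is the legacy key-based setting, the second is the
# connection-string setting named in the policy above.
LEGACY_SETTING = "APPINSIGHTS_INSTRUMENTATIONKEY"
CONNECTION_SETTING = "APPLICATIONINSIGHTS_CONNECTION_STRING"


def parse_connection_string(value: str) -> Dict[str, str]:
    """Split 'Key=Value;Key=Value' segments into a dictionary."""
    parts: Dict[str, str] = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, sep, val = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {pair!r}")
        parts[key.strip()] = val.strip()
    return parts


def check_app_settings(settings: Dict[str, str]) -> List[str]:
    """Return human-readable findings; an empty list means the check passed."""
    findings: List[str] = []
    if LEGACY_SETTING in settings:
        findings.append(f"{LEGACY_SETTING} is set; use {CONNECTION_SETTING} instead")
    raw = settings.get(CONNECTION_SETTING)
    if not raw:
        findings.append(f"{CONNECTION_SETTING} is missing")
        return findings
    parts = parse_connection_string(raw)
    if "InstrumentationKey" not in parts:
        findings.append("connection string has no InstrumentationKey segment")
    # Assumption of this sketch: require an explicit regional ingestion endpoint.
    if not parts.get("IngestionEndpoint", "").startswith("https://"):
        findings.append("connection string has no https IngestionEndpoint")
    return findings
```

A check like this could run in CI against rendered Terraform or Bicep app settings, failing the build when the list of findings is non-empty.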
Configure custom metric alerts for key performance indicators — P95 latency, error rate, throughput, and resource utilization
Severity: Required
Rationale: Metric alerts provide proactive notification before performance degradation becomes user-visible; without alerts, issues are discovered by users
Agents: terraform-agent, bicep-agent, cloud-architect, monitoring-agent
- Microsoft.Insights/components
- Microsoft.OperationalInsights/workspaces
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Insights/actionGroups | ag-ops | Action group for alert notifications — required for metric alerts to trigger email/webhook/Logic App notifications |
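
To make the alert conditions concrete, here is a small sketch of how P95 latency and error rate might be computed over a window of requests and compared against thresholds. Everything in it is illustrative: the `Request` type, the function names, and the default limits (1000 ms, 5%) are assumptions rather than values from the policy, and real metric alerts evaluate these server-side in Azure Monitor.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Request:
    duration_ms: float
    success: bool


def p95(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def error_rate(requests: Sequence[Request]) -> float:
    """Fraction of requests that failed; 0.0 for an empty window."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.success)
    return failed / len(requests)


def breached_alerts(
    requests: Sequence[Request],
    p95_limit_ms: float = 1000.0,    # hypothetical threshold
    error_rate_limit: float = 0.05,  # hypothetical threshold
) -> List[str]:
    """Return the names of the alert conditions this window violates."""
    alerts: List[str] = []
    if requests and p95([r.duration_ms for r in requests]) > p95_limit_ms:
        alerts.append("p95_latency")
    if error_rate(requests) > error_rate_limit:
        alerts.append("error_rate")
    return alerts
```

Each name returned by `breached_alerts` corresponds to one metric alert rule, and each rule should reference the `ag-ops` action group so that a breach actually notifies someone.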
Enable W3C distributed tracing with trace context propagation across all services in the request chain
Severity: Required
Rationale: Without distributed tracing, diagnosing performance issues in microservices requires manually correlating logs across multiple systems. The W3C traceparent header provides automatic correlation
Agents: app-developer, csharp-developer, python-developer, cloud-architect, monitoring-agent
- Microsoft.Insights/components
- Microsoft.OperationalInsights/workspaces
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
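
The propagation rule is easiest to see on the header itself. A W3C `traceparent` value has four dash-separated fields: version, a 32-hex-digit trace ID, a 16-hex-digit parent (span) ID, and 2 hex digits of flags. The sketch below uses only the standard library and handles version `00` only; the function names are hypothetical, and in practice the Application Insights or OpenTelemetry SDK does this for you.

```python
import re
import secrets
from typing import Tuple

# version "00" - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Tuple[str, str, str]:
    """Return (trace_id, parent_id, flags) or raise ValueError."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"invalid traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID and flags, new span ID."""
    trace_id, _parent_id, flags = parse_traceparent(incoming)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the trace ID and only replaces the span ID, all telemetry for one user request can be joined on a single identifier regardless of how many services it crossed.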
Create standard KQL queries for performance monitoring — P95 latency, error rates, throughput, and slow dependency calls
Severity: Recommended
Rationale: Pre-built KQL queries enable rapid diagnosis during incidents; without them, engineers spend 15-30 minutes writing queries instead of investigating
Agents: monitoring-agent, cloud-architect, qa-engineer
- Microsoft.Insights/components
- Microsoft.OperationalInsights/workspaces
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
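
One way to keep such queries ready is to version them alongside the infrastructure code. The sketch below stores four KQL templates in a Python dictionary and renders them with a time window. It assumes the classic Application Insights table and column names (`requests`, `dependencies`, `timestamp`, `duration`, `success`, `target`, `name`); workspace-based resources expose differently named tables, so verify the schema before saving these. The dictionary keys and the `render` helper are hypothetical.

```python
import re

# KQL templates; {window} is replaced with a timespan literal such as 1h.
SAVED_QUERIES = {
    "p95_latency": (
        "requests\n"
        "| where timestamp > ago({window})\n"
        "| summarize p95_ms = percentile(duration, 95) by bin(timestamp, 5m)\n"
        "| order by timestamp asc"
    ),
    "error_rate": (
        "requests\n"
        "| where timestamp > ago({window})\n"
        "| summarize total = count(), failed = countif(tobool(success) == false)"
        " by bin(timestamp, 5m)\n"
        "| extend error_rate = todouble(failed) / total"
    ),
    "throughput": (
        "requests\n"
        "| where timestamp > ago({window})\n"
        "| summarize request_count = count() by bin(timestamp, 1m)"
    ),
    "slow_dependencies": (
        "dependencies\n"
        "| where timestamp > ago({window})\n"
        "| summarize p95_ms = percentile(duration, 95), calls = count() by target, name\n"
        "| top 20 by p95_ms desc"
    ),
}


def render(name: str, window: str = "1h") -> str:
    """Fill in the time window, accepting only simple timespans like 30m or 2d."""
    if not re.fullmatch(r"\d+[smhd]", window):
        raise ValueError(f"unsupported window: {window!r}")
    return SAVED_QUERIES[name].format(window=window)
```

For example, `render("slow_dependencies", "30m")` yields a query that can be pasted into the Logs blade during an incident or deployed as a saved search.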
Configure availability tests for public endpoints — standard URL ping test and multi-step web tests
Severity: Recommended
Rationale: Availability tests detect outages from an external perspective (outside the Azure network); internal health checks may pass while external access fails
Agents: terraform-agent, bicep-agent, cloud-architect, monitoring-agent
- Microsoft.Insights/components
- Microsoft.OperationalInsights/workspaces
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
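
The value of probing from several locations is that a single failing probe can be distinguished from a real outage. The sketch below shows one possible decision rule over a round of probe results. The `ProbeResult` type, the location names, the 30-second timeout, and the "two or more failed locations" threshold are all assumptions made for illustration; Application Insights availability alerts apply their own configurable version of this logic.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    location: str               # hypothetical label for the probing region
    status_code: Optional[int]  # None means no response (timeout, DNS failure)
    elapsed_s: float


def probe_passed(result: ProbeResult, timeout_s: float = 30.0) -> bool:
    """A probe passes on a 2xx/3xx response received within the timeout."""
    return (
        result.status_code is not None
        and 200 <= result.status_code < 400
        and result.elapsed_s <= timeout_s
    )


def failed_locations(results: Sequence[ProbeResult]) -> List[str]:
    """Names of the locations whose probe did not pass, in input order."""
    return [r.location for r in results if not probe_passed(r)]


def endpoint_is_down(results: Sequence[ProbeResult], min_failed_locations: int = 2) -> bool:
    """Treat the endpoint as down only when enough locations agree."""
    return len(failed_locations(results)) >= min_failed_locations
```

Requiring agreement between locations trades a little detection speed for fewer false alarms caused by a transient problem at one probe site.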