Governance Policies Operational Excellence Monitoring And Observability - Azure/az-prototype GitHub Wiki

Monitoring & Observability

Governance policies for Monitoring & Observability

Domain: performance

Patterns

| Name | Description |
| --- | --- |
| Full observability stack | Application Insights (auto-instrumentation) + metric alerts (P95, errors, CPU) + distributed tracing (W3C) + saved KQL queries + availability tests |
| Three pillars of observability | Metrics (alerts, dashboards), Logs (KQL queries, saved searches), Traces (distributed tracing, service map) |

Anti-Patterns

| Description | Instead |
| --- | --- |
| Do not deploy applications without Application Insights | Enable auto-instrumentation via APPLICATIONINSIGHTS_CONNECTION_STRING on all compute resources |
| Do not use InstrumentationKey for Application Insights configuration | Use ConnectionString; InstrumentationKey is deprecated and does not support regional ingestion |
| Do not create alerts without action groups | Configure action groups with email, webhook, or Logic App receivers for all metric alerts |
| Do not rely solely on internal health probes | Add external availability tests from multiple global locations to detect network-level outages |

Checks (5)

| Check | Severity | Description |
| --- | --- | --- |
| WAF-OPEX-OBS-001 | Required | Configure Application Insights with auto-instrumentation for .NET, Python, and Node.js; use connection string, not instrumentation key |
| WAF-OPEX-OBS-002 | Required | Configure custom metric alerts for key performance indicators: P95 latency, error rate, throughput, and resource utilization |
| WAF-OPEX-OBS-003 | Required | Enable W3C distributed tracing with trace context propagation across all services in the request chain |
| WAF-OPEX-OBS-004 | Recommended | Create standard KQL queries for performance monitoring: P95 latency, error rates, throughput, and slow dependency calls |
| WAF-OPEX-OBS-005 | Recommended | Configure availability tests for public endpoints: standard URL ping test and multi-step web tests |

WAF-OPEX-OBS-001

Configure Application Insights with auto-instrumentation for .NET, Python, and Node.js — use connection string, not instrumentation key

Severity: Required
Rationale: Application Insights provides request tracking, dependency tracing, and performance metrics. Connection strings support regional ingestion endpoints, which instrumentation keys do not.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer, monitoring-agent

Targets

  • Microsoft.Insights/components
  • Microsoft.OperationalInsights/workspaces
  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ApiManagement/service
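
A check like this could be approximated by inspecting a compute resource's app settings before deployment. The sketch below is only an illustration, not the tooling's actual implementation: `parse_connection_string` and `check_app_settings` are hypothetical helper names, and treating a missing `IngestionEndpoint` segment as a finding is an assumption of this sketch (the segment is what carries the regional endpoint).

```python
from urllib.parse import urlparse


def parse_connection_string(value: str) -> dict[str, str]:
    """Split 'Key=Value;Key=Value' segments into a dict with lower-cased keys."""
    pairs = {}
    for part in value.split(";"):
        if not part.strip():
            continue  # tolerate a trailing semicolon
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {part!r}")
        pairs[key.strip().lower()] = val.strip()
    return pairs


def check_app_settings(settings: dict[str, str]) -> list[str]:
    """Return a list of findings; an empty list means the settings pass."""
    findings = []
    conn = settings.get("APPLICATIONINSIGHTS_CONNECTION_STRING")
    if not conn:
        findings.append("APPLICATIONINSIGHTS_CONNECTION_STRING is missing")
        # Legacy setting name used by key-only configurations.
        if "APPINSIGHTS_INSTRUMENTATIONKEY" in settings:
            findings.append("instrumentation key only; switch to a connection string")
        return findings

    parsed = parse_connection_string(conn)
    if "instrumentationkey" not in parsed:
        findings.append("connection string has no InstrumentationKey segment")
    # Assumption of this sketch: require an explicit https regional endpoint.
    if urlparse(parsed.get("ingestionendpoint", "")).scheme != "https":
        findings.append("connection string has no https IngestionEndpoint")
    return findings
```

An empty result means the settings satisfy the sketch's rules; each string in the list describes one problem to fix.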

WAF-OPEX-OBS-002

Configure custom metric alerts for key performance indicators — P95 latency, error rate, throughput, and resource utilization

Severity: Required
Rationale: Metric alerts notify operators before performance degradation becomes user-visible; without alerts, users are the first to discover issues.
Agents: terraform-agent, bicep-agent, cloud-architect, monitoring-agent

Targets

  • Microsoft.Insights/components
  • Microsoft.OperationalInsights/workspaces
  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ApiManagement/service

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Insights/actionGroups | ag-ops | Action group for alert notifications; required for metric alerts to trigger email/webhook/Logic App notifications |
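
To make the alert conditions concrete, here is a minimal sketch of how a P95 latency and error-rate evaluation over one time window might work. It is not how Azure Monitor computes percentiles internally; the nearest-rank method, the function names, and the thresholds (1000 ms, 5%) are all illustrative assumptions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate_window(
    durations_ms: list[float],
    failed: int,
    total: int,
    p95_limit_ms: float = 1000.0,    # hypothetical threshold
    error_rate_limit: float = 0.05,  # hypothetical threshold
) -> list[str]:
    """Return the alert conditions breached in this window."""
    breaches = []
    latency = p95(durations_ms)
    if latency > p95_limit_ms:
        breaches.append(f"P95 latency {latency:.0f} ms exceeds {p95_limit_ms:.0f} ms")
    rate = failed / total if total else 0.0
    if rate > error_rate_limit:
        breaches.append(f"error rate {rate:.1%} exceeds {error_rate_limit:.1%}")
    return breaches
```

In a real deployment each breach would fire a metric alert wired to the `ag-ops` action group; an alert with no action group would evaluate the same way but notify nobody.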

WAF-OPEX-OBS-003

Enable W3C distributed tracing with trace context propagation across all services in the request chain

Severity: Required
Rationale: Without distributed tracing, diagnosing performance issues in microservices requires manually correlating logs across multiple systems. The W3C traceparent header provides automatic correlation.
Agents: app-developer, csharp-developer, python-developer, cloud-architect, monitoring-agent

Targets

  • Microsoft.Insights/components
  • Microsoft.OperationalInsights/workspaces
  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ApiManagement/service
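
For readers unfamiliar with the header, the sketch below shows the `traceparent` layout defined by W3C Trace Context (`version-traceid-parentid-flags`, lowercase hex) and what propagation means: a downstream call keeps the trace ID and flags but gets a fresh parent (span) ID. The function names are hypothetical and the sketch handles only the version `00` field layout; in practice the Application Insights or OpenTelemetry SDKs do this automatically.

```python
import re
import secrets

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the request chain."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict[str, str]:
    """Validate a traceparent header and return its fields."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"invalid traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID and flags, new span ID."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

Because every hop shares the trace ID, telemetry from all services in the request chain can be joined on that one value.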

WAF-OPEX-OBS-004

Create standard KQL queries for performance monitoring — P95 latency, error rates, throughput, and slow dependency calls

Severity: Recommended
Rationale: Pre-built KQL queries enable rapid diagnosis during incidents; without them, engineers spend 15-30 minutes writing queries instead of investigating.
Agents: monitoring-agent, cloud-architect, qa-engineer

Targets

  • Microsoft.Insights/components
  • Microsoft.OperationalInsights/workspaces
  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ApiManagement/service
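
As a starting point, a saved-query set covering the four areas named above might look like the following. This is a sketch under assumptions: the table and column names (`requests`, `dependencies`, `timestamp`, `duration`, `success`, `target`) follow the classic Application Insights schema, workspace-based tables use different names, and the 1000 ms slow-dependency cutoff, the bin sizes, and the `with_lookback` helper are hypothetical choices.

```python
# KQL text keyed by a short name; durations are in milliseconds.
SAVED_QUERIES = {
    "p95_latency": """
requests
| summarize p95_ms = percentile(duration, 95) by bin(timestamp, 5m)
| order by timestamp asc
""",
    "error_rate": """
requests
| summarize total = count(), failed = countif(success == false) by bin(timestamp, 5m)
| extend error_rate = todouble(failed) / total
""",
    "throughput": """
requests
| summarize requests_per_min = count() by bin(timestamp, 1m)
""",
    "slow_dependencies": """
dependencies
| where duration > 1000
| summarize calls = count(), p95_ms = percentile(duration, 95) by target, name
| top 20 by p95_ms desc
""",
}


def with_lookback(name: str, lookback: str = "1h") -> str:
    """Insert a time filter right after the table name of a saved query."""
    table, _, rest = SAVED_QUERIES[name].strip().partition("\n")
    return f"{table}\n| where timestamp > ago({lookback})\n{rest}"
```

Calling `with_lookback("slow_dependencies", "24h")` yields query text that can be pasted into Logs or stored as a saved search.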

WAF-OPEX-OBS-005

Configure availability tests for public endpoints — standard URL ping test and multi-step web tests

Severity: Recommended
Rationale: Availability tests detect outages from an external perspective (outside the Azure network); internal health checks may pass while external access fails.
Agents: terraform-agent, bicep-agent, cloud-architect, monitoring-agent

Targets

  • Microsoft.Insights/components
  • Microsoft.OperationalInsights/workspaces
  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ApiManagement/service
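
The multi-location idea can be illustrated with a small sketch: each location runs a simple HTTP probe, and an alert is raised only when enough locations fail, which filters out a single flaky vantage point. This is not the Application Insights availability-test implementation. The function names, the location labels, and the two-location threshold are hypothetical, and `http_probe` only measures reachability from wherever it happens to run.

```python
from urllib.request import urlopen


def http_probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, HTTPError, timeouts, and connection failures.
        return False


def failed_locations(results: dict[str, bool]) -> list[str]:
    """Names of the locations whose probe did not succeed, sorted."""
    return sorted(loc for loc, ok in results.items() if not ok)


def should_alert(results: dict[str, bool], min_failed_locations: int = 2) -> bool:
    """Alert only when at least `min_failed_locations` locations failed."""
    return len(failed_locations(results)) >= min_failed_locations
```

For example, `should_alert({"westeurope": True, "eastus": False})` stays quiet with the default threshold, while two or more failing locations trigger the alert.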
