Governance Policies Reliability High Availability - Azure/az-prototype GitHub Wiki

High Availability

Governance policies for High Availability

Domain: reliability

Patterns

| Name | Description |
| --- | --- |
| Health endpoint pattern | Implement a `/healthz` endpoint in every service that checks all downstream dependencies (database connectivity, cache availability, external API reachability) and returns structured health status. |
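As a sketch of the platform side of this pattern, the Terraform fragment below points App Service's built-in health check at such an endpoint (the `azurerm` provider is assumed, resource names are placeholders, and the `/healthz` handler itself lives in application code):

```hcl
# Hypothetical example: App Service pings /healthz and rotates
# instances that fail the check out of the load balancer.
resource "azurerm_linux_web_app" "api" {
  name                = "app-api-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  service_plan_id     = azurerm_service_plan.main.id

  site_config {
    health_check_path                 = "/healthz"
    # Minutes an instance may stay unhealthy before being evicted.
    health_check_eviction_time_in_min = 5
  }
}
```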

Anti-Patterns

| Description | Instead |
| --- | --- |
| Deploying all resources in a single availability zone without zone redundancy | Spread resources across availability zones 1, 2, and 3 for datacenter-level fault tolerance |
| Using TCP health probes that only check port availability | Use HTTP/HTTPS health probes with a `/healthz` endpoint that validates application and dependency health |
| Relying on single-region deployment for production workloads | Deploy to at least two regions with Front Door or Traffic Manager for automatic failover |
| Using Standard_LRS storage for production data | Use Standard_ZRS (zone-redundant) or Standard_GZRS (geo-zone-redundant) for production storage |
| Deploying databases without geo-replication | Configure SQL failover groups, Cosmos DB multi-region, or PostgreSQL read replicas for DR |
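The storage anti-pattern above has a one-line fix in Terraform. A hedged sketch, assuming the `azurerm` provider (the account name is a placeholder):

```hcl
resource "azurerm_storage_account" "data" {
  name                = "stdataprod001"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  account_tier        = "Standard"
  # ZRS replicates synchronously across three availability zones;
  # use "GZRS" to additionally replicate to the paired region.
  account_replication_type = "ZRS"
}
```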

Checks (5)

| Check | Severity | Description |
| --- | --- | --- |
| WAF-REL-HA-001 | Recommended | Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise. |
| WAF-REL-HA-002 | Recommended | Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier. |
| WAF-REL-HA-003 | Required | Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments. |
| WAF-REL-HA-004 | Required | Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance. |
| WAF-REL-HA-005 | Recommended | Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery. |

WAF-REL-HA-001

Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise.

Severity: Recommended
Rationale: Azure availability zones are physically separated datacenters within a region. Zone-redundant deployments survive a full datacenter failure (power, cooling, networking). Without zone redundancy, a single datacenter outage takes down the entire service. Azure SLA improves from 99.9% to 99.95%-99.99% with zone redundancy.
Agents: terraform-agent, bicep-agent, cloud-architect
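A few of the per-service zone properties named above, sketched in Terraform. This is illustrative only — the `azurerm` provider is assumed, resource names, SKUs, and the referenced network/server resources are placeholders, and zone redundancy requires service tiers that support it:

```hcl
# SQL Database: zone-redundant replicas within the region
# (requires a tier that supports zone redundancy, e.g. Premium).
resource "azurerm_mssql_database" "main" {
  name           = "sqldb-prod"
  server_id      = azurerm_mssql_server.main.id
  sku_name       = "P1"
  zone_redundant = true
}

# Container Apps environment: zone redundancy requires VNet injection.
resource "azurerm_container_app_environment" "main" {
  name                     = "cae-prod"
  resource_group_name      = azurerm_resource_group.main.name
  location                 = azurerm_resource_group.main.location
  infrastructure_subnet_id = azurerm_subnet.aca.id
  zone_redundancy_enabled  = true
}

# PostgreSQL Flexible Server: synchronous standby in another zone.
resource "azurerm_postgresql_flexible_server" "main" {
  name                   = "psql-prod"
  resource_group_name    = azurerm_resource_group.main.name
  location               = azurerm_resource_group.main.location
  version                = "16"
  sku_name               = "GP_Standard_D2s_v3"
  storage_mb             = 32768
  administrator_login    = "psqladmin"
  administrator_password = var.psql_admin_password

  high_availability {
    mode = "ZoneRedundant"
  }
}
```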

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.Storage/storageAccounts
  • Microsoft.ContainerService/managedClusters
  • Microsoft.App/managedEnvironments
  • Microsoft.App/containerApps
  • Microsoft.Cache/redis
  • Microsoft.ServiceBus/namespaces
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.Web/sites
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
  • Microsoft.Network/loadBalancers
  • Microsoft.Network/applicationGateways
  • Microsoft.Network/frontDoors
  • Microsoft.Network/trafficManagerProfiles

WAF-REL-HA-002

Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier.

Severity: Recommended
Rationale: Multi-region deployment protects against region-wide outages (natural disasters, regional Azure incidents). Azure SLA for multi-region architectures can reach 99.99%+. Without multi-region deployment, a regional outage causes complete service unavailability.
Agents: terraform-agent, bicep-agent, cloud-architect
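An active-passive Traffic Manager setup can be sketched in Terraform as follows (a hedged example: the `azurerm` provider is assumed, and the profile, DNS name, and the two web app references are placeholders):

```hcl
# DNS-based active-passive failover: priority 1 takes traffic until
# its health probe fails, then priority 2 takes over.
resource "azurerm_traffic_manager_profile" "main" {
  name                   = "tm-prod"
  resource_group_name    = azurerm_resource_group.main.name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "myapp-prod"
    ttl           = 30 # low TTL shortens DNS-based failover time
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/healthz" # application-level probe
    interval_in_seconds          = 30
    timeout_in_seconds           = 10
    tolerated_number_of_failures = 3
  }
}

resource "azurerm_traffic_manager_azure_endpoint" "primary" {
  name               = "ep-primary"
  profile_id         = azurerm_traffic_manager_profile.main.id
  priority           = 1
  target_resource_id = azurerm_linux_web_app.primary.id
}

resource "azurerm_traffic_manager_azure_endpoint" "secondary" {
  name               = "ep-secondary"
  profile_id         = azurerm_traffic_manager_profile.main.id
  priority           = 2
  target_resource_id = azurerm_linux_web_app.secondary.id
}
```

Front Door would be the analogous (and preferred) choice for HTTP workloads, with an origin group health probe in place of `monitor_config`.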

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.Cdn/profiles
  • Microsoft.Cdn/profiles/afdEndpoints
  • Microsoft.Cdn/profiles/originGroups
  • Microsoft.Cdn/profiles/originGroups/origins
  • Microsoft.Cdn/profiles/afdEndpoints/routes
  • Microsoft.Network/trafficManagerProfiles
  • Microsoft.Network/trafficManagerProfiles/azureEndpoints
  • Microsoft.ContainerService/managedClusters
  • Microsoft.App/containerApps
  • Microsoft.Cache/redis
  • Microsoft.ServiceBus/namespaces
  • Microsoft.Web/sites
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
  • Microsoft.Network/loadBalancers
  • Microsoft.Network/applicationGateways
  • Microsoft.Network/frontDoors
  • Microsoft.DBforPostgreSQL/flexibleServers

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Cdn/profiles/securityPolicies | waf-security-policy | WAF policy attached to Front Door endpoint for DDoS and bot protection |
| Microsoft.Network/privateLinkServices | pls-origin | Private Link service for Front Door to origin connectivity (Private Link origin) |
| Microsoft.Insights/diagnosticSettings | diag-frontdoor | Diagnostic settings for Front Door access logs and health probe logs |

WAF-REL-HA-003

Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments.

Severity: Required
Rationale: VMs without zone placement risk co-location in a single datacenter. Availability zones provide 99.99% SLA vs 99.95% for availability sets. Zone-redundant VMSS automatically balances instances across zones.
Agents: terraform-agent, bicep-agent, cloud-architect
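The zone-spreading settings above map directly onto a scale set resource. A hedged Terraform sketch (assuming the `azurerm` provider; names, SKU, image, and the referenced subnet are placeholders):

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "main" {
  name                = "vmss-prod"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  sku                 = "Standard_D2s_v5"
  instances           = 3

  # One fault domain per zone = "max spreading"; zone_balance keeps
  # instance counts even across the three zones.
  zones                       = ["1", "2", "3"]
  zone_balance                = true
  platform_fault_domain_count = 1

  admin_username = "azureadmin"
  admin_ssh_key {
    username   = "azureadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  network_interface {
    name    = "nic"
    primary = true
    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = azurerm_subnet.vmss.id
    }
  }
}
```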

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
  • Microsoft.ContainerService/managedClusters
  • Microsoft.App/containerApps
  • Microsoft.Cache/redis
  • Microsoft.ServiceBus/namespaces
  • Microsoft.Web/sites
  • Microsoft.Network/loadBalancers
  • Microsoft.Network/applicationGateways
  • Microsoft.Network/frontDoors
  • Microsoft.Network/trafficManagerProfiles
  • Microsoft.DBforPostgreSQL/flexibleServers

WAF-REL-HA-004

Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance.

Severity: Required
Rationale: Health probes are the foundation of automatic failover. Without application-level health checks, traffic continues flowing to unhealthy backends. TCP-only probes miss application-level failures (database down, disk full, deadlock).
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
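For a standard Load Balancer, an application-level probe is a small standalone resource. A hedged Terraform sketch (the `azurerm` provider, the referenced load balancer, and the port are assumptions); Application Gateway and Front Door take analogous probe settings in their `probe` and `health_probe` blocks:

```hcl
resource "azurerm_lb_probe" "healthz" {
  name            = "probe-healthz"
  loadbalancer_id = azurerm_lb.main.id
  protocol        = "Http" # application-level, not "Tcp"
  port            = 8080
  request_path    = "/healthz"
  # Trade-off: shorter intervals detect failures faster but raise
  # probe load and the chance of transient false positives.
  interval_in_seconds = 15
  number_of_probes    = 2 # consecutive failures before marking unhealthy
}
```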

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.Network/loadBalancers/probes
  • Microsoft.Network/applicationGateways
  • Microsoft.ContainerService/managedClusters
  • Microsoft.App/containerApps
  • Microsoft.Cache/redis
  • Microsoft.ServiceBus/namespaces
  • Microsoft.Web/sites
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
  • Microsoft.Network/loadBalancers
  • Microsoft.Network/frontDoors
  • Microsoft.Network/trafficManagerProfiles
  • Microsoft.DBforPostgreSQL/flexibleServers

WAF-REL-HA-005

Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery.

Severity: Recommended
Rationale: Geo-replication protects against region-wide outages and reduces read latency for geographically distributed users. Without geo-replication, a regional outage causes complete data unavailability with potential data loss up to the last backup (RPO of hours).
Agents: terraform-agent, bicep-agent, cloud-architect
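An auto-failover group for SQL Database can be sketched as below, assuming the `azurerm` provider and that the primary and secondary servers (in paired regions) already exist; all names are placeholders:

```hcl
resource "azurerm_mssql_failover_group" "main" {
  name      = "fog-prod"
  server_id = azurerm_mssql_server.primary.id
  databases = [azurerm_mssql_database.main.id]

  partner_server {
    id = azurerm_mssql_server.secondary.id
  }

  read_write_endpoint_failover_policy {
    mode = "Automatic"
    # Data-loss window tolerated on a forced failover; 60 is the
    # minimum allowed value for automatic policies.
    grace_minutes = 60
  }
}
```

Applications should connect through the failover group's listener endpoint rather than the server FQDN so failover is transparent to clients.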

Targets

  • Microsoft.Sql/servers/databases
  • Microsoft.DocumentDB/databaseAccounts
  • Microsoft.Sql/servers/failoverGroups
  • Microsoft.DBforPostgreSQL/flexibleServers
  • Microsoft.ContainerService/managedClusters
  • Microsoft.App/containerApps
  • Microsoft.Cache/redis
  • Microsoft.ServiceBus/namespaces
  • Microsoft.Web/sites
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
  • Microsoft.Network/loadBalancers
  • Microsoft.Network/applicationGateways
  • Microsoft.Network/frontDoors
  • Microsoft.Network/trafficManagerProfiles

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Sql/servers | sql-server | Secondary SQL Server in paired region for failover group partner |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoints for secondary region database servers |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for replication lag monitoring and failover events |
