Governance Policies Reliability High Availability - Azure/az-prototype GitHub Wiki
# Governance policies for High Availability
Domain: reliability
| Name | Description |
|---|---|
| Health endpoint pattern | Implement a /healthz endpoint in every service that checks all downstream dependencies (database connectivity, cache availability, external API reachability) and returns structured health status. |
| Anti-pattern | Instead |
|---|---|
| Deploying all resources in a single availability zone without zone redundancy | Spread resources across availability zones 1, 2, and 3 for datacenter-level fault tolerance |
| Using TCP health probes that only check port availability | Use HTTP/HTTPS health probes with /healthz endpoint that validates application and dependency health |
| Relying on single-region deployment for production workloads | Deploy to at least two regions with Front Door or Traffic Manager for automatic failover |
| Using Standard_LRS storage for production data | Use Standard_ZRS (zone-redundant) or Standard_GZRS (geo-zone-redundant) for production storage |
| Deploying databases without geo-replication | Configure SQL failover groups, Cosmos DB multi-region, or PostgreSQL read replicas for DR |
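The Front Door / Traffic Manager row above can be sketched in Bicep for the DNS-based option. This is a minimal sketch of a priority-routed (active-passive) Traffic Manager profile probing `/healthz`; the resource names and the `primaryAppId` parameter are illustrative placeholders, not prescribed values.

```bicep
// Sketch: DNS-based active-passive failover with Traffic Manager, the
// alternative to Front Door for non-HTTP workloads.
param primaryAppId string // resource ID of the primary-region endpoint target (placeholder)

resource tm 'Microsoft.Network/trafficManagerProfiles@2022-04-01' = {
  name: 'tm-prod'
  location: 'global'
  properties: {
    trafficRoutingMethod: 'Priority' // lowest priority number receives traffic
    dnsConfig: {
      relativeName: 'app-prod-placeholder'
      ttl: 30 // low TTL shortens DNS-based failover time
    }
    monitorConfig: {
      protocol: 'HTTPS'
      port: 443
      path: '/healthz' // application-level probe, not just port reachability
      intervalInSeconds: 30
      toleratedNumberOfFailures: 3
      timeoutInSeconds: 10
    }
  }
}

resource primaryEndpoint 'Microsoft.Network/trafficManagerProfiles/azureEndpoints@2022-04-01' = {
  parent: tm
  name: 'primary'
  properties: {
    targetResourceId: primaryAppId
    endpointStatus: 'Enabled'
    priority: 1 // a secondary endpoint with priority 2 takes over on failure
  }
}
```

A paired secondary endpoint with `priority: 2` follows the same shape; Traffic Manager shifts DNS answers to it once the primary fails the configured number of probes.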
- Azure Well-Architected Framework — Reliability pillar
- Availability zones and regions
- Azure Front Door — origins and origin groups
- SQL Database auto-failover groups
- Health endpoint monitoring pattern
| Check | Severity | Description |
|---|---|---|
| WAF-REL-HA-001 | Recommended | Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise. |
| WAF-REL-HA-002 | Recommended | Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier. |
| WAF-REL-HA-003 | Required | Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments. |
| WAF-REL-HA-004 | Required | Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance. |
| WAF-REL-HA-005 | Recommended | Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery. |
## WAF-REL-HA-001

Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise.
Severity: Recommended
Rationale: Azure availability zones are physically separated datacenters within a region. Zone-redundant deployments survive a full datacenter failure (power, cooling, networking). Without zone redundancy, a single datacenter outage takes down the entire service. Azure SLA improves from 99.9% to 99.95%-99.99% with zone redundancy.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Storage/storageAccounts
- Microsoft.ContainerService/managedClusters
- Microsoft.App/managedEnvironments
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
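The per-service zone properties named in this check can be sketched in Bicep for a few representative resource types. This is a minimal sketch: resource names, API versions, and the `infraSubnetId` parameter are illustrative placeholders.

```bicep
// Sketch: zone-redundant configuration for three of the listed service types.
param infraSubnetId string // delegated subnet required for a zone-redundant Container Apps environment (placeholder)

resource containerAppEnv 'Microsoft.App/managedEnvironments@2023-05-01' = {
  name: 'cae-prod'
  location: resourceGroup().location
  properties: {
    zoneRedundant: true // spreads environment infrastructure across zones
    vnetConfiguration: {
      infrastructureSubnetId: infraSubnetId
    }
  }
}

resource serviceBus 'Microsoft.ServiceBus/namespaces@2022-10-01-preview' = {
  name: 'sbns-prod'
  location: resourceGroup().location
  sku: {
    name: 'Premium' // zone redundancy requires the Premium tier
    tier: 'Premium'
  }
  properties: {
    zoneRedundant: true
  }
}

resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stprod001'
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_ZRS' // zone-redundant replication, per the storage guidance above
  }
}
```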
## WAF-REL-HA-002

Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier.
Severity: Recommended
Rationale: Multi-region deployment protects against region-wide outages (natural disasters, regional Azure incidents). Azure SLA for multi-region architectures can reach 99.99%+. Without multi-region, a regional outage causes complete service unavailability.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Cdn/profiles
- Microsoft.Cdn/profiles/afdEndpoints
- Microsoft.Cdn/profiles/originGroups
- Microsoft.Cdn/profiles/originGroups/origins
- Microsoft.Cdn/profiles/afdEndpoints/routes
- Microsoft.Network/trafficManagerProfiles
- Microsoft.Network/trafficManagerProfiles/azureEndpoints
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.DBforPostgreSQL/flexibleServers
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Cdn/profiles/securityPolicies | waf-security-policy | WAF policy attached to Front Door endpoint for DDoS and bot protection |
| Microsoft.Network/privateLinkServices | pls-origin | Private Link service for Front Door to origin connectivity (Private Link origin) |
| Microsoft.Insights/diagnosticSettings | diag-frontdoor | Diagnostic settings for Front Door access logs and health probe logs |
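The Front Door side of this check (origin groups with health probes and priority-based failover) can be sketched in Bicep against the Microsoft.Cdn resource types listed above. This is a minimal sketch: profile, origin group, and origin names, hostnames, and probe values are illustrative placeholders.

```bicep
// Sketch: Front Door Standard with an origin group probing /healthz and
// two regional origins in an active-passive arrangement.
resource profile 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'afd-prod'
  location: 'global'
  sku: {
    name: 'Standard_AzureFrontDoor'
  }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: profile
  name: 'og-app'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/healthz' // application-level health, per WAF-REL-HA-004
      probeProtocol: 'Https'
      probeRequestType: 'GET'
      probeIntervalInSeconds: 30
    }
  }
}

resource primaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroup
  name: 'origin-primary'
  properties: {
    hostName: 'app-primary.example.com' // placeholder hostname
    priority: 1 // active
    weight: 1000
  }
}

resource secondaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroup
  name: 'origin-secondary'
  properties: {
    hostName: 'app-secondary.example.com' // placeholder hostname
    priority: 2 // passive failover target
    weight: 1000
  }
}
```

With equal priorities this same shape becomes active-active; Front Door then load-balances across healthy origins instead of failing over.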
## WAF-REL-HA-003

Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments.
Severity: Required
Rationale: VMs without zone placement risk co-location in a single datacenter. Availability zones provide 99.99% SLA vs 99.95% for availability sets. Zone-redundant VMSS automatically balances instances across zones.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
- Microsoft.DBforPostgreSQL/flexibleServers
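The zones and max-spreading settings this check requires can be sketched in Bicep. This is a minimal sketch: the VM profile (image, NIC, credentials) is omitted to keep the zone settings in focus, and the name, SKU, and capacity are illustrative.

```bicep
// Sketch: zone-spread Flexible VM Scale Set per WAF-REL-HA-003.
resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = {
  name: 'vmss-prod'
  location: resourceGroup().location
  zones: ['1', '2', '3'] // spread instances across all three availability zones
  sku: {
    name: 'Standard_D2s_v5'
    tier: 'Standard'
    capacity: 6 // two instances per zone at steady state
  }
  properties: {
    orchestrationMode: 'Flexible'
    platformFaultDomainCount: 1 // max spreading: one fault domain per zone
    // virtualMachineProfile omitted for brevity
  }
}
```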
## WAF-REL-HA-004

Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance.
Severity: Required
Rationale: Health probes are the foundation of automatic failover. Without application-level health checks, traffic continues flowing to unhealthy backends. TCP-only probes miss application-level failures (database down, disk full, deadlock).
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Network/loadBalancers/probes
- Microsoft.Network/applicationGateways
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
- Microsoft.DBforPostgreSQL/flexibleServers
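An application-level probe of the kind this check mandates can be sketched in Bicep for a Standard Load Balancer. This is a minimal sketch: frontend, backend pool, and rules are omitted to keep the probe in focus, and the interval/threshold values are starting points, not mandates.

```bicep
// Sketch: Standard Load Balancer with an HTTPS probe against /healthz.
resource lb 'Microsoft.Network/loadBalancers@2023-09-01' = {
  name: 'lb-prod'
  location: resourceGroup().location
  sku: {
    name: 'Standard' // Standard SKU is required for HTTPS probes and zone redundancy
  }
  properties: {
    probes: [
      {
        name: 'probe-healthz'
        properties: {
          protocol: 'Https' // application-level, not TCP port reachability
          port: 443
          requestPath: '/healthz' // endpoint should validate downstream dependencies
          intervalInSeconds: 15
          numberOfProbes: 2 // marked unhealthy after ~30s of consecutive failures
        }
      }
    ]
  }
}
```

Shorter intervals detect failures faster but amplify transient blips; the 15s/2-probe pairing here is one common trade-off between detection speed and false positives.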
## WAF-REL-HA-005

Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery.
Severity: Recommended
Rationale: Geo-replication protects against region-wide outages and reduces read latency for geographically distributed users. Without geo-replication, a regional outage causes complete data unavailability with potential data loss up to the last backup (RPO of hours).
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/failoverGroups
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Sql/servers | sql-server | Secondary SQL Server in paired region for failover group partner |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoints for secondary region database servers |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for replication lag monitoring and failover events |
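The SQL auto-failover-group option from this check can be sketched in Bicep against the Microsoft.Sql/servers/failoverGroups type listed above. This is a minimal sketch: the server name and the `secondaryServerId`/`databaseId` parameters are illustrative placeholders for an existing primary server and its paired-region partner.

```bicep
// Sketch: auto-failover group pairing a primary SQL server with a
// secondary server in the paired region (the sql-server related resource).
param secondaryServerId string // resource ID of the SQL server in the paired region (placeholder)
param databaseId string        // resource ID of the database to replicate (placeholder)

resource primaryServer 'Microsoft.Sql/servers@2023-05-01-preview' existing = {
  name: 'sql-prod-primary' // placeholder name of the existing primary server
}

resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2023-05-01-preview' = {
  parent: primaryServer
  name: 'fog-prod'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic' // Azure initiates failover after the grace period
      failoverWithDataLossGracePeriodMinutes: 60 // minimum supported grace period
    }
    partnerServers: [
      {
        id: secondaryServerId
      }
    ]
    databases: [
      databaseId
    ]
  }
}
```

The failover group exposes stable read-write and read-only listener endpoints, so connection strings survive a failover without change.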