Governance Policies Reliability High Availability - Azure/az-prototype GitHub Wiki
# Governance policies for High Availability
Domain: reliability
| Name | Description |
|---|---|
| Health endpoint pattern | Implement a /healthz endpoint in every service that checks all downstream dependencies (database connectivity, cache availability, external API reachability) and returns structured health status. |
| Anti-pattern | Instead |
|---|---|
| Deploying all resources in a single availability zone without zone redundancy | Spread resources across availability zones 1, 2, and 3 for datacenter-level fault tolerance |
| Using TCP health probes that only check port availability | Use HTTP/HTTPS health probes with /healthz endpoint that validates application and dependency health |
| Relying on single-region deployment for production workloads | Deploy to at least two regions with Front Door or Traffic Manager for automatic failover |
| Using Standard_LRS storage for production data | Use Standard_ZRS (zone-redundant) or Standard_GZRS (geo-zone-redundant) for production storage |
| Deploying databases without geo-replication | Configure SQL failover groups, Cosmos DB multi-region, or PostgreSQL read replicas for DR |
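The Front Door / Traffic Manager row above can be sketched in Bicep for the DNS-based option. This is a minimal sketch of a priority-routed (active-passive) Traffic Manager profile probing `/healthz`; the resource names and the `primaryAppId` parameter are illustrative placeholders, not prescribed values.

```bicep
// Sketch: DNS-based active-passive failover with Traffic Manager, the
// alternative to Front Door for non-HTTP workloads.
param primaryAppId string // resource ID of the primary-region endpoint target (placeholder)

resource tm 'Microsoft.Network/trafficManagerProfiles@2022-04-01' = {
  name: 'tm-prod'
  location: 'global'
  properties: {
    trafficRoutingMethod: 'Priority' // lowest priority number receives traffic
    dnsConfig: {
      relativeName: 'app-prod-placeholder'
      ttl: 30 // low TTL shortens DNS-based failover time
    }
    monitorConfig: {
      protocol: 'HTTPS'
      port: 443
      path: '/healthz' // application-level probe, not just port reachability
      intervalInSeconds: 30
      toleratedNumberOfFailures: 3
      timeoutInSeconds: 10
    }
  }
}

resource primaryEndpoint 'Microsoft.Network/trafficManagerProfiles/azureEndpoints@2022-04-01' = {
  parent: tm
  name: 'primary'
  properties: {
    targetResourceId: primaryAppId
    endpointStatus: 'Enabled'
    priority: 1 // a secondary endpoint with priority 2 takes over on failure
  }
}
```

A paired secondary endpoint with `priority: 2` follows the same shape; Traffic Manager shifts DNS answers to it once the primary fails the configured number of probes.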
- Azure Well-Architected Framework — Reliability pillar
- Availability zones and regions
- Azure Front Door — origins and origin groups
- SQL Database auto-failover groups
- Health endpoint monitoring pattern
| Check | Severity | Description |
|---|---|---|
| WAF-REL-HA-001 | Recommended | Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise. |
| WAF-REL-HA-002 | Recommended | Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier. |
| WAF-REL-HA-003 | Required | Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments. |
| WAF-REL-HA-004 | Required | Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance. |
| WAF-REL-HA-005 | Recommended | Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery. |
## WAF-REL-HA-001

Enable zone redundancy for ALL production PaaS services. Every service that supports availability zones MUST be configured with zone-redundant deployment. This is the single most impactful reliability control — it protects against datacenter-level failures with zero application changes. Configure the exact zone properties per service type: zoneRedundant for Container Apps and Service Bus Premium; zones for AKS node pools, VMs, and Public IPs; ZRS replication for Storage; zone-redundant HA for SQL and PostgreSQL Flexible; multi-AZ writes for Cosmos DB; zone redundancy for Redis Enterprise.
Severity: Recommended
Rationale: Azure availability zones are physically separated datacenters within a region. Zone-redundant deployments survive a full datacenter failure (power, cooling, networking). Without zone redundancy, a single datacenter outage takes down the entire service. Azure SLA improves from 99.9% to 99.95%-99.99% with zone redundancy.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Storage/storageAccounts
- Microsoft.ContainerService/managedClusters
- Microsoft.App/managedEnvironments
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
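The per-service zone properties named in this check can be sketched in Bicep for a few representative resource types. This is a minimal sketch: resource names, API versions, and the `infraSubnetId` parameter are illustrative placeholders.

```bicep
// Sketch: zone-redundant configuration for three of the listed service types.
param infraSubnetId string // delegated subnet required for a zone-redundant Container Apps environment (placeholder)

resource containerAppEnv 'Microsoft.App/managedEnvironments@2023-05-01' = {
  name: 'cae-prod'
  location: resourceGroup().location
  properties: {
    zoneRedundant: true // spreads environment infrastructure across zones
    vnetConfiguration: {
      infrastructureSubnetId: infraSubnetId
    }
  }
}

resource serviceBus 'Microsoft.ServiceBus/namespaces@2022-10-01-preview' = {
  name: 'sbns-prod'
  location: resourceGroup().location
  sku: {
    name: 'Premium' // zone redundancy requires the Premium tier
    tier: 'Premium'
  }
  properties: {
    zoneRedundant: true
  }
}

resource storage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stprod001'
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_ZRS' // zone-redundant replication, per the storage guidance above
  }
}
```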
## WAF-REL-HA-002

Deploy critical workloads across multiple Azure regions using Azure Front Door or Traffic Manager for active-active or active-passive failover. Front Door is preferred for HTTP workloads (global load balancing with WAF, SSL offload, and sub-second failover). Traffic Manager is for non-HTTP protocols (DNS-based routing, 30-60s failover). Each region must be independently deployable with its own data tier.
Severity: Recommended
Rationale: Multi-region deployment protects against region-wide outages (natural disasters, regional Azure incidents). Azure SLA for multi-region architectures can reach 99.99%+. Without multi-region, a regional outage causes complete service unavailability.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Cdn/profiles
- Microsoft.Cdn/profiles/afdEndpoints
- Microsoft.Cdn/profiles/originGroups
- Microsoft.Cdn/profiles/originGroups/origins
- Microsoft.Cdn/profiles/afdEndpoints/routes
- Microsoft.Network/trafficManagerProfiles
- Microsoft.Network/trafficManagerProfiles/azureEndpoints
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.DBforPostgreSQL/flexibleServers
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Cdn/profiles/securityPolicies | waf-security-policy | WAF policy attached to Front Door endpoint for DDoS and bot protection |
| Microsoft.Network/privateLinkServices | pls-origin | Private Link service for Front Door to origin connectivity (Private Link origin) |
| Microsoft.Insights/diagnosticSettings | diag-frontdoor | Diagnostic settings for Front Door access logs and health probe logs |
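The Front Door side of this check (origin groups with health probes and priority-based failover) can be sketched in Bicep against the Microsoft.Cdn resource types listed above. This is a minimal sketch: profile, origin group, and origin names, hostnames, and probe values are illustrative placeholders.

```bicep
// Sketch: Front Door Standard with an origin group probing /healthz and
// two regional origins in an active-passive arrangement.
resource profile 'Microsoft.Cdn/profiles@2023-05-01' = {
  name: 'afd-prod'
  location: 'global'
  sku: {
    name: 'Standard_AzureFrontDoor'
  }
}

resource originGroup 'Microsoft.Cdn/profiles/originGroups@2023-05-01' = {
  parent: profile
  name: 'og-app'
  properties: {
    loadBalancingSettings: {
      sampleSize: 4
      successfulSamplesRequired: 3
      additionalLatencyInMilliseconds: 50
    }
    healthProbeSettings: {
      probePath: '/healthz' // application-level health, per WAF-REL-HA-004
      probeProtocol: 'Https'
      probeRequestType: 'GET'
      probeIntervalInSeconds: 30
    }
  }
}

resource primaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroup
  name: 'origin-primary'
  properties: {
    hostName: 'app-primary.example.com' // placeholder hostname
    priority: 1 // active
    weight: 1000
  }
}

resource secondaryOrigin 'Microsoft.Cdn/profiles/originGroups/origins@2023-05-01' = {
  parent: originGroup
  name: 'origin-secondary'
  properties: {
    hostName: 'app-secondary.example.com' // placeholder hostname
    priority: 2 // passive failover target
    weight: 1000
  }
}
```

With equal priorities this same shape becomes active-active; Front Door then load-balances across healthy origins instead of failing over.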
## WAF-REL-HA-003

Deploy production VMs and VM Scale Sets across availability zones. Single VMs MUST specify a zones property. VM Scale Sets MUST use zones = ["1", "2", "3"] with max spreading (platformFaultDomainCount = 1) for optimal zone distribution. Availability sets are legacy — use zones instead for new deployments.
Severity: Required
Rationale: VMs without zone placement risk co-location in a single datacenter. Availability zones provide 99.99% SLA vs 99.95% for availability sets. Zone-redundant VMSS automatically balances instances across zones.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
- Microsoft.DBforPostgreSQL/flexibleServers
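The zones and max-spreading settings this check requires can be sketched in Bicep. This is a minimal sketch: the VM profile (image, NIC, credentials) is omitted to keep the zone settings in focus, and the name, SKU, and capacity are illustrative.

```bicep
// Sketch: zone-spread Flexible VM Scale Set per WAF-REL-HA-003.
resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2023-09-01' = {
  name: 'vmss-prod'
  location: resourceGroup().location
  zones: ['1', '2', '3'] // spread instances across all three availability zones
  sku: {
    name: 'Standard_D2s_v5'
    tier: 'Standard'
    capacity: 6 // two instances per zone at steady state
  }
  properties: {
    orchestrationMode: 'Flexible'
    platformFaultDomainCount: 1 // max spreading: one fault domain per zone
    // virtualMachineProfile omitted for brevity
  }
}
```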
## WAF-REL-HA-004

Configure health probes for ALL load-balanced services. Every Load Balancer, Application Gateway, and Front Door MUST have health probes that check application-level health (not just TCP connectivity). Use HTTP/HTTPS probes with a dedicated /healthz endpoint that validates downstream dependencies. Probes must have appropriate intervals and thresholds to balance detection speed with false-positive avoidance.
Severity: Required
Rationale: Health probes are the foundation of automatic failover. Without application-level health checks, traffic continues flowing to unhealthy backends. TCP-only probes miss application-level failures (database down, disk full, deadlock).
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Network/loadBalancers/probes
- Microsoft.Network/applicationGateways
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
- Microsoft.DBforPostgreSQL/flexibleServers
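An application-level probe of the kind this check mandates can be sketched in Bicep for a Standard Load Balancer. This is a minimal sketch: frontend, backend pool, and rules are omitted to keep the probe in focus, and the interval/threshold values are starting points, not mandates.

```bicep
// Sketch: Standard Load Balancer with an HTTPS probe against /healthz.
resource lb 'Microsoft.Network/loadBalancers@2023-09-01' = {
  name: 'lb-prod'
  location: resourceGroup().location
  sku: {
    name: 'Standard' // Standard SKU is required for HTTPS probes and zone redundancy
  }
  properties: {
    probes: [
      {
        name: 'probe-healthz'
        properties: {
          protocol: 'Https' // application-level, not TCP port reachability
          port: 443
          requestPath: '/healthz' // endpoint should validate downstream dependencies
          intervalInSeconds: 15
          numberOfProbes: 2 // marked unhealthy after ~30s of consecutive failures
        }
      }
    ]
  }
}
```

Shorter intervals detect failures faster but amplify transient blips; the 15s/2-probe pairing here is one common trade-off between detection speed and false positives.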
## WAF-REL-HA-005

Configure geo-replication for all production databases. SQL Database must have active geo-replication or auto-failover groups to a paired region. Cosmos DB must have multi-region writes enabled with automatic failover. PostgreSQL Flexible must have read replicas in a secondary region. Geo-replication provides both read scaling and disaster recovery.
Severity: Recommended
Rationale: Geo-replication protects against region-wide outages and reduces read latency for geographically distributed users. Without geo-replication, a regional outage causes complete data unavailability with potential data loss up to the last backup (RPO of hours).
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Sql/servers/databases
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/failoverGroups
- Microsoft.DBforPostgreSQL/flexibleServers
- Microsoft.ContainerService/managedClusters
- Microsoft.App/containerApps
- Microsoft.Cache/redis
- Microsoft.ServiceBus/namespaces
- Microsoft.Web/sites
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
- Microsoft.Network/frontDoors
- Microsoft.Network/trafficManagerProfiles
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Sql/servers | sql-server | Secondary SQL Server in paired region for failover group partner |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoints for secondary region database servers |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for replication lag monitoring and failover events |
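The SQL auto-failover-group option from this check can be sketched in Bicep against the Microsoft.Sql/servers/failoverGroups type listed above. This is a minimal sketch: the server name and the `secondaryServerId`/`databaseId` parameters are illustrative placeholders for an existing primary server and its paired-region partner.

```bicep
// Sketch: auto-failover group pairing a primary SQL server with a
// secondary server in the paired region (the sql-server related resource).
param secondaryServerId string // resource ID of the SQL server in the paired region (placeholder)
param databaseId string        // resource ID of the database to replicate (placeholder)

resource primaryServer 'Microsoft.Sql/servers@2023-05-01-preview' existing = {
  name: 'sql-prod-primary' // placeholder name of the existing primary server
}

resource failoverGroup 'Microsoft.Sql/servers/failoverGroups@2023-05-01-preview' = {
  parent: primaryServer
  name: 'fog-prod'
  properties: {
    readWriteEndpoint: {
      failoverPolicy: 'Automatic' // Azure initiates failover after the grace period
      failoverWithDataLossGracePeriodMinutes: 60 // minimum supported grace period
    }
    partnerServers: [
      {
        id: secondaryServerId
      }
    ]
    databases: [
      databaseId
    ]
  }
}
```

The failover group exposes stable read-write and read-only listener endpoints, so connection strings survive a failover without change.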