Governance Policies: Reliability / Fault Tolerance (Azure/az-prototype GitHub Wiki)
Governance policies for Fault Tolerance
Domain: reliability
| Name | Description |
|---|---|
| Circuit breaker with retry composition | Compose circuit breaker and retry policies correctly: retry wraps the circuit breaker, so transient failures are retried but sustained failures trip the circuit. |
| Competing consumers pattern | Scale consumers independently from producers using queue-based load leveling. Multiple consumers process from the same queue concurrently, each handling one message at a time. |
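The competing consumers row above can be sketched with the Python standard library; `queue.Queue` stands in for a Service Bus queue, and `start_consumers`, `handler`, and the worker count are illustrative names and choices, not part of any Azure SDK:

```python
import queue
import threading

def start_consumers(work_queue, handler, worker_count=4):
    """Start competing consumers: each worker pulls one message at a time
    from the shared queue, so throughput scales with worker_count alone,
    independently of how fast producers enqueue."""
    def worker():
        while True:
            message = work_queue.get()
            if message is None:  # sentinel: shut this worker down
                work_queue.task_done()
                return
            try:
                handler(message)
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    return threads
```

Producers enqueue at their own rate and consumers drain at theirs; in production the queue would be a Service Bus queue and each worker a separate replica.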
Anti-patterns
| Anti-pattern | Instead |
|---|---|
| Making synchronous calls to external services without timeout or circuit breaker | Wrap all external calls with circuit breaker + retry + timeout using Polly, resilience4j, or Dapr |
| Deploying containers without CPU and memory resource limits | Set explicit CPU and memory limits on every container to prevent resource starvation |
| Processing bursty workloads synchronously without a message queue | Use Service Bus or Event Hub as a buffer between producers and consumers |
| Hardcoding feature flags in application code | Use Azure App Configuration for centralized feature flag management with instant toggle capability |
| Using Service Bus connection strings instead of managed identity | Disable local auth (disableLocalAuth: true) and use RBAC with managed identity |
| Deploying AKS workloads without Pod Disruption Budgets | Create PDBs with minAvailable or maxUnavailable to protect availability during voluntary disruptions |
Related guidance
- Azure Well-Architected Framework — Design for resilience
- Circuit breaker pattern
- Retry pattern with exponential backoff
- Bulkhead pattern
- Queue-based load leveling pattern
- Graceful degradation pattern
Checks
| Check | Severity | Description |
|---|---|---|
| WAF-REL-FT-001 | Required | Implement the circuit breaker pattern for ALL external service calls. Circuit breakers prevent cascading failures by stopping calls to a failing dependency after a threshold of consecutive errors. Use Dapr resiliency policies for Container Apps, Polly for .NET applications, resilience4j for Java, and APIM circuit breaker policy for API gateway-level protection. Every circuit breaker MUST define: failure threshold, open duration (timeout), and half-open probe count. |
| WAF-REL-FT-002 | Required | Configure retry policies with exponential backoff and jitter for ALL external service calls. Azure SDK clients have built-in retry policies — configure them explicitly rather than relying on defaults. For custom HTTP calls, implement exponential backoff with jitter to avoid thundering herd effects. Maximum retry count MUST be bounded (3-5 retries). Base delay MUST start at 1-2 seconds. Jitter MUST be added to prevent synchronized retries. |
| WAF-REL-FT-003 | Required | Implement bulkhead isolation to prevent a single failing component from consuming all system resources. Container Apps and AKS MUST have resource limits (CPU/memory) per container. AKS MUST have Pod Disruption Budgets (PDBs) to ensure minimum availability during voluntary disruptions. Thread pools and connection pools MUST be bounded. Separate critical and non-critical workloads into different compute instances. |
| WAF-REL-FT-004 | Recommended | Implement graceful degradation patterns so that partial failures do not cause total service unavailability. Use feature flags to disable non-critical features when dependencies fail. Configure fallback endpoints and cached responses. Implement degraded mode that serves stale data or reduced functionality rather than returning errors. Azure App Configuration with feature filters provides centralized feature flag management. |
| WAF-REL-FT-005 | Required | Use queue-based load leveling for all workloads with variable or bursty traffic patterns. Place Service Bus queues or Event Hubs between producers and consumers to absorb traffic spikes and decouple processing rate from arrival rate. Service Bus Premium tier provides zone redundancy, large message support, and FIFO ordering. Event Hubs is for high-throughput streaming (millions of events/sec). NEVER process high-volume workloads synchronously without a buffer. |
WAF-REL-FT-001
Implement the circuit breaker pattern for ALL external service calls. Circuit breakers prevent cascading failures by stopping calls to a failing dependency after a threshold of consecutive errors. Use Dapr resiliency policies for Container Apps, Polly for .NET applications, resilience4j for Java, and APIM circuit breaker policy for API gateway-level protection. Every circuit breaker MUST define: failure threshold, open duration (timeout), and half-open probe count.
Severity: Required
Rationale: Without circuit breakers, a single failing dependency causes all callers to block on timeout, exhausting connection pools and thread pools, which cascades failure to the entire system. Circuit breakers fail fast, preserve resources, and allow recovery.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
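A minimal Python sketch of the three parameters this check requires (failure threshold, open duration, half-open probe count); the class, state names, and defaults are illustrative and are not the Polly or Dapr API:

```python
import time

class CircuitOpenError(Exception):
    """Raised while the circuit rejects calls without trying the dependency."""

class CircuitBreaker:
    """Three-state breaker: closed -> open on repeated failure ->
    half-open after the open duration -> closed after enough probes."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, half_open_probes=1):
        self.failure_threshold = failure_threshold  # consecutive failures that trip the circuit
        self.open_seconds = open_seconds            # open duration before probing again
        self.half_open_probes = half_open_probes    # successful probes needed to close
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "half_open"       # open duration elapsed: allow probes
            self.successes = 0
        try:
            result = fn()
        except Exception:
            self._trip()                   # a failed probe re-opens immediately
            raise
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.half_open_probes:
                self.state = "closed"      # enough probes succeeded
                self.failures = 0
        else:
            self.failures = 0
        return result

    def _trip(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

In production the same three knobs map onto the circuit breaker options of Polly, resilience4j, Dapr resiliency policies, or the APIM circuit breaker policy rather than hand-rolled state.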
WAF-REL-FT-002
Configure retry policies with exponential backoff and jitter for ALL external service calls. Azure SDK clients have built-in retry policies — configure them explicitly rather than relying on defaults. For custom HTTP calls, implement exponential backoff with jitter to avoid thundering herd effects. Maximum retry count MUST be bounded (3-5 retries). Base delay MUST start at 1-2 seconds. Jitter MUST be added to prevent synchronized retries.
Severity: Required
Rationale: Transient failures (network glitches, throttling, brief service restarts) are inevitable in distributed systems. Without retry, every transient failure becomes a user-visible error. Without backoff, rapid retries overwhelm the recovering service. Without jitter, synchronized retries from multiple clients create load spikes.
Agents: cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
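The rules above (bounded retries, 1-2 second base delay, jitter) can be sketched as capped exponential backoff with full jitter; `retry_with_backoff` and the injectable `sleep` parameter are illustrative names, not an Azure SDK API:

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Retry fn up to max_retries times after the initial attempt.
    The delay before retry n is a random value in
    [0, min(max_delay, base_delay * 2**n)] ("full jitter"), which spreads
    out retries from many clients and avoids synchronized load spikes."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, ceiling))  # full jitter
```

The `sleep` parameter exists so the backoff schedule can be observed in tests; real callers keep the `time.sleep` default.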
WAF-REL-FT-003
Implement bulkhead isolation to prevent a single failing component from consuming all system resources. Container Apps and AKS MUST have resource limits (CPU/memory) per container. AKS MUST have Pod Disruption Budgets (PDBs) to ensure minimum availability during voluntary disruptions. Thread pools and connection pools MUST be bounded. Separate critical and non-critical workloads into different compute instances.
Severity: Required
Rationale: Without bulkhead isolation, a single runaway process can consume all CPU/memory, starving healthy workloads. Connection pool exhaustion from one dependency blocks all other outbound calls. PDBs prevent Kubernetes evictions from violating availability requirements.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
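A minimal bulkhead sketch, assuming a bounded semaphore per dependency is an acceptable stand-in for bounded thread and connection pools; the `Bulkhead` class and its reject-when-full behavior are illustrative:

```python
import threading

class Bulkhead:
    """Bound concurrent calls to one dependency so its slowness or failure
    cannot exhaust the caller's threads or connections. Give each
    dependency its own Bulkhead so failures stay compartmentalized."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):
            # Reject immediately instead of queueing: shed load fast
            # rather than letting waiting callers pile up.
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting rather than blocking is a design choice: a caller that cannot get a slot fails fast and can fall back, instead of tying up its own thread.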
WAF-REL-FT-004
Implement graceful degradation patterns so that partial failures do not cause total service unavailability. Use feature flags to disable non-critical features when dependencies fail. Configure fallback endpoints and cached responses. Implement degraded mode that serves stale data or reduced functionality rather than returning errors. Azure App Configuration with feature filters provides centralized feature flag management.
Severity: Recommended
Rationale: Users prefer a degraded experience over a complete outage. If the recommendation engine fails, the e-commerce site should still show products without recommendations — not return a 500 error. Feature flags enable instant degradation without redeployment.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
Supporting resources
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for App Configuration (groupId: configurationStores) |
| Microsoft.Network/privateDnsZones | privatelink.azconfig.io | Private DNS zone privatelink.azconfig.io for App Configuration private endpoint |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for App Configuration audit and request logs |
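The degraded-mode behavior above can be sketched as a fallback to last-known-good data behind a feature flag; `flags` and `cache` are plain-dict stand-ins for Azure App Configuration and a response cache, and all names are illustrative:

```python
def get_product_page(product_id, fetch_recommendations, flags, cache):
    """Serve the core page even when the recommendation dependency fails.
    flags stands in for a centralized feature-flag store (e.g. Azure App
    Configuration); cache holds the last good recommendations per product."""
    page = {"product_id": product_id, "recommendations": []}
    if not flags.get("recommendations_enabled", True):
        return page  # feature flagged off centrally: degrade silently
    try:
        recs = fetch_recommendations(product_id)
        cache[product_id] = recs  # refresh the last-known-good copy
        page["recommendations"] = recs
    except Exception:
        # Dependency failed: serve stale data rather than a 500.
        page["recommendations"] = cache.get(product_id, [])
    return page
```

An operator can flip the flag in the central store to stop calling the failing dependency entirely, without a redeployment.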
WAF-REL-FT-005
Use queue-based load leveling for all workloads with variable or bursty traffic patterns. Place Service Bus queues or Event Hubs between producers and consumers to absorb traffic spikes and decouple processing rate from arrival rate. Service Bus Premium tier provides zone redundancy, large message support, and FIFO ordering. Event Hubs is designed for high-throughput streaming (millions of events/sec). NEVER process high-volume workloads synchronously without a buffer.
Severity: Required
Rationale: Synchronous processing of bursty traffic causes cascading failures when arrival rate exceeds processing capacity. Queues absorb spikes, enable independent scaling of producers and consumers, and provide at-least-once delivery guarantees. Without queues, every traffic spike risks service overload and data loss.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
Supporting resources
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.ServiceBus/namespaces/queues | sb-namespace | Dead-letter queue (automatic sub-queue) — monitor for poison messages |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Service Bus / Event Hub namespace |
| Microsoft.Insights/diagnosticSettings | diag-metrics | Diagnostic settings for queue depth, dead-letter count, and throughput metrics |
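The dead-letter row above assumes poison messages are parked after a bounded number of delivery attempts. This sketch models that behavior in plain Python, with `queue.Queue` standing in for a Service Bus queue and `MAX_DELIVERIES` for its MaxDeliveryCount setting:

```python
import queue

MAX_DELIVERIES = 3  # illustrative value for the queue's max delivery count

def drain(work_queue, dead_letter_queue, handler):
    """Process messages, re-queueing failures until the delivery budget is
    exhausted, then dead-lettering them so a poison message cannot block
    the queue forever. Messages are (delivery_count, body) pairs."""
    while True:
        try:
            deliveries, body = work_queue.get_nowait()
        except queue.Empty:
            return
        try:
            handler(body)
        except Exception:
            if deliveries + 1 >= MAX_DELIVERIES:
                dead_letter_queue.put(body)             # poison: park for inspection
            else:
                work_queue.put((deliveries + 1, body))  # retry on a later delivery
```

Service Bus performs this dead-lettering automatically; the point of monitoring the DLQ is that every message landing there represents a handler bug or malformed payload to investigate.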