Governance Policies: Reliability / Fault Tolerance (Azure/az-prototype GitHub Wiki)
Governance policies for Fault Tolerance
Domain: reliability
| Name | Description |
|---|---|
| Circuit breaker with retry composition | Compose circuit breaker and retry policies correctly: retry wraps the circuit breaker, so transient failures are retried but sustained failures trip the circuit. |
| Competing consumers pattern | Scale consumers independently from producers using queue-based load leveling. Multiple consumers process from the same queue concurrently, each handling one message at a time. |
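The competing consumers row above can be sketched with the Python standard library; `queue.Queue` stands in for a Service Bus queue, and `start_consumers`, `handler`, and the worker count are illustrative names and choices, not part of any Azure SDK:

```python
import queue
import threading

def start_consumers(work_queue, handler, worker_count=4):
    """Start competing consumers: each worker pulls one message at a time
    from the shared queue, so throughput scales with worker_count alone,
    independently of how fast producers enqueue."""
    def worker():
        while True:
            message = work_queue.get()
            if message is None:  # sentinel: shut this worker down
                work_queue.task_done()
                return
            try:
                handler(message)
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    return threads
```

Producers enqueue at their own rate and consumers drain at theirs; in production the queue would be a Service Bus queue and each worker a separate replica.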
Anti-patterns
| Anti-pattern | Instead |
|---|---|
| Making synchronous calls to external services without timeout or circuit breaker | Wrap all external calls with circuit breaker + retry + timeout using Polly, resilience4j, or Dapr |
| Deploying containers without CPU and memory resource limits | Set explicit CPU and memory limits on every container to prevent resource starvation |
| Processing bursty workloads synchronously without a message queue | Use Service Bus or Event Hub as a buffer between producers and consumers |
| Hardcoding feature flags in application code | Use Azure App Configuration for centralized feature flag management with instant toggle capability |
| Using Service Bus connection strings instead of managed identity | Disable local auth (disableLocalAuth: true) and use RBAC with managed identity |
| Deploying AKS workloads without Pod Disruption Budgets | Create PDBs with minAvailable or maxUnavailable to protect availability during voluntary disruptions |
Related guidance
- Azure Well-Architected Framework — Design for resilience
- Circuit breaker pattern
- Retry pattern with exponential backoff
- Bulkhead pattern
- Queue-based load leveling pattern
- Graceful degradation pattern
Checks
| Check | Severity | Description |
|---|---|---|
| WAF-REL-FT-001 | Required | Implement the circuit breaker pattern for ALL external service calls. Circuit breakers prevent cascading failures by stopping calls to a failing dependency after a threshold of consecutive errors. Use Dapr resiliency policies for Container Apps, Polly for .NET applications, resilience4j for Java, and APIM circuit breaker policy for API gateway-level protection. Every circuit breaker MUST define: failure threshold, open duration (timeout), and half-open probe count. |
| WAF-REL-FT-002 | Required | Configure retry policies with exponential backoff and jitter for ALL external service calls. Azure SDK clients have built-in retry policies — configure them explicitly rather than relying on defaults. For custom HTTP calls, implement exponential backoff with jitter to avoid thundering herd effects. Maximum retry count MUST be bounded (3-5 retries). Base delay MUST start at 1-2 seconds. Jitter MUST be added to prevent synchronized retries. |
| WAF-REL-FT-003 | Required | Implement bulkhead isolation to prevent a single failing component from consuming all system resources. Container Apps and AKS MUST have resource limits (CPU/memory) per container. AKS MUST have Pod Disruption Budgets (PDBs) to ensure minimum availability during voluntary disruptions. Thread pools and connection pools MUST be bounded. Separate critical and non-critical workloads into different compute instances. |
| WAF-REL-FT-004 | Recommended | Implement graceful degradation patterns so that partial failures do not cause total service unavailability. Use feature flags to disable non-critical features when dependencies fail. Configure fallback endpoints and cached responses. Implement degraded mode that serves stale data or reduced functionality rather than returning errors. Azure App Configuration with feature filters provides centralized feature flag management. |
| WAF-REL-FT-005 | Required | Use queue-based load leveling for all workloads with variable or bursty traffic patterns. Place Service Bus queues or Event Hubs between producers and consumers to absorb traffic spikes and decouple processing rate from arrival rate. Service Bus Premium tier provides zone redundancy, large message support, and FIFO ordering. Event Hubs is for high-throughput streaming (millions of events/sec). NEVER process high-volume workloads synchronously without a buffer. |
WAF-REL-FT-001
Implement the circuit breaker pattern for ALL external service calls. Circuit breakers prevent cascading failures by stopping calls to a failing dependency after a threshold of consecutive errors. Use Dapr resiliency policies for Container Apps, Polly for .NET applications, resilience4j for Java, and APIM circuit breaker policy for API gateway-level protection. Every circuit breaker MUST define: failure threshold, open duration (timeout), and half-open probe count.
Severity: Required
Rationale: Without circuit breakers, a single failing dependency causes all callers to block on timeout, exhausting connection pools and thread pools, which cascades failure to the entire system. Circuit breakers fail fast, preserve resources, and allow recovery.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
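A minimal Python sketch of the three parameters this check requires (failure threshold, open duration, half-open probe count); the class, state names, and defaults are illustrative and are not the Polly or Dapr API:

```python
import time

class CircuitOpenError(Exception):
    """Raised while the circuit rejects calls without trying the dependency."""

class CircuitBreaker:
    """Three-state breaker: closed -> open on repeated failure ->
    half-open after the open duration -> closed after enough probes."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, half_open_probes=1):
        self.failure_threshold = failure_threshold  # consecutive failures that trip the circuit
        self.open_seconds = open_seconds            # open duration before probing again
        self.half_open_probes = half_open_probes    # successful probes needed to close
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "half_open"       # open duration elapsed: allow probes
            self.successes = 0
        try:
            result = fn()
        except Exception:
            self._trip()                   # a failed probe re-opens immediately
            raise
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.half_open_probes:
                self.state = "closed"      # enough probes succeeded
                self.failures = 0
        else:
            self.failures = 0
        return result

    def _trip(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

In production the same three knobs map onto the circuit breaker options of Polly, resilience4j, Dapr resiliency policies, or the APIM circuit breaker policy rather than hand-rolled state.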
WAF-REL-FT-002
Configure retry policies with exponential backoff and jitter for ALL external service calls. Azure SDK clients have built-in retry policies — configure them explicitly rather than relying on defaults. For custom HTTP calls, implement exponential backoff with jitter to avoid thundering herd effects. Maximum retry count MUST be bounded (3-5 retries). Base delay MUST start at 1-2 seconds. Jitter MUST be added to prevent synchronized retries.
Severity: Required
Rationale: Transient failures (network glitches, throttling, brief service restarts) are inevitable in distributed systems. Without retry, every transient failure becomes a user-visible error. Without backoff, rapid retries overwhelm the recovering service. Without jitter, synchronized retries from multiple clients create load spikes.
Agents: cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
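The rules above (bounded retries, 1-2 second base delay, jitter) can be sketched as capped exponential backoff with full jitter; `retry_with_backoff` and the injectable `sleep` parameter are illustrative names, not an Azure SDK API:

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Retry fn up to max_retries times after the initial attempt.
    The delay before retry n is a random value in
    [0, min(max_delay, base_delay * 2**n)] ("full jitter"), which spreads
    out retries from many clients and avoids synchronized load spikes."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0.0, ceiling))  # full jitter
```

The `sleep` parameter exists so the backoff schedule can be observed in tests; real callers keep the `time.sleep` default.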
WAF-REL-FT-003
Implement bulkhead isolation to prevent a single failing component from consuming all system resources. Container Apps and AKS MUST have resource limits (CPU/memory) per container. AKS MUST have Pod Disruption Budgets (PDBs) to ensure minimum availability during voluntary disruptions. Thread pools and connection pools MUST be bounded. Separate critical and non-critical workloads into different compute instances.
Severity: Required
Rationale: Without bulkhead isolation, a single runaway process can consume all CPU/memory, starving healthy workloads. Connection pool exhaustion from one dependency blocks all other outbound calls. PDBs prevent Kubernetes evictions from violating availability requirements.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
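A minimal bulkhead sketch, assuming a bounded semaphore per dependency is an acceptable stand-in for bounded thread and connection pools; the `Bulkhead` class and its reject-when-full behavior are illustrative:

```python
import threading

class Bulkhead:
    """Bound concurrent calls to one dependency so its slowness or failure
    cannot exhaust the caller's threads or connections. Give each
    dependency its own Bulkhead so failures stay compartmentalized."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):
            # Reject immediately instead of queueing: shed load fast
            # rather than letting waiting callers pile up.
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()
```

Rejecting rather than blocking is a design choice: a caller that cannot get a slot fails fast and can fall back, instead of tying up its own thread.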
WAF-REL-FT-004
Implement graceful degradation patterns so that partial failures do not cause total service unavailability. Use feature flags to disable non-critical features when dependencies fail. Configure fallback endpoints and cached responses. Implement degraded mode that serves stale data or reduced functionality rather than returning errors. Azure App Configuration with feature filters provides centralized feature flag management.
Severity: Recommended
Rationale: Users prefer a degraded experience over a complete outage. If the recommendation engine fails, the e-commerce site should still show products without recommendations — not return a 500 error. Feature flags enable instant degradation without redeployment.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
Supporting resources
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for App Configuration (groupId: configurationStores) |
| Microsoft.Network/privateDnsZones | privatelink.azconfig.io | Private DNS zone privatelink.azconfig.io for App Configuration private endpoint |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for App Configuration audit and request logs |
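The degraded-mode behavior above can be sketched as a fallback to last-known-good data behind a feature flag; `flags` and `cache` are plain-dict stand-ins for Azure App Configuration and a response cache, and all names are illustrative:

```python
def get_product_page(product_id, fetch_recommendations, flags, cache):
    """Serve the core page even when the recommendation dependency fails.
    flags stands in for a centralized feature-flag store (e.g. Azure App
    Configuration); cache holds the last good recommendations per product."""
    page = {"product_id": product_id, "recommendations": []}
    if not flags.get("recommendations_enabled", True):
        return page  # feature flagged off centrally: degrade silently
    try:
        recs = fetch_recommendations(product_id)
        cache[product_id] = recs  # refresh the last-known-good copy
        page["recommendations"] = recs
    except Exception:
        # Dependency failed: serve stale data rather than a 500.
        page["recommendations"] = cache.get(product_id, [])
    return page
```

An operator can flip the flag in the central store to stop calling the failing dependency entirely, without a redeployment.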
WAF-REL-FT-005
Use queue-based load leveling for all workloads with variable or bursty traffic patterns. Place Service Bus queues or Event Hubs between producers and consumers to absorb traffic spikes and decouple processing rate from arrival rate. Service Bus Premium tier provides zone redundancy, large message support, and FIFO ordering. Event Hubs is designed for high-throughput streaming (millions of events/sec). NEVER process high-volume workloads synchronously without a buffer.
Severity: Required
Rationale: Synchronous processing of bursty traffic causes cascading failures when arrival rate exceeds processing capacity. Queues absorb spikes, enable independent scaling of producers and consumers, and provide at-least-once delivery guarantees. Without queues, every traffic spike risks service overload and data loss.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
Applies to:
- Microsoft.App/containerApps
- Microsoft.Web/sites
- Microsoft.ContainerService/managedClusters
- Microsoft.ApiManagement/service
- Microsoft.ServiceBus/namespaces
- Microsoft.EventHub/namespaces
- Microsoft.Cache/redis
- Microsoft.DocumentDB/databaseAccounts
- Microsoft.Sql/servers/databases
- Microsoft.Network/loadBalancers
- Microsoft.Network/applicationGateways
Supporting resources
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.ServiceBus/namespaces/queues | sb-namespace | Dead-letter queue (automatic sub-queue) — monitor for poison messages |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Service Bus / Event Hub namespace |
| Microsoft.Insights/diagnosticSettings | diag-metrics | Diagnostic settings for queue depth, dead-letter count, and throughput metrics |
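The dead-letter row above assumes poison messages are parked after a bounded number of delivery attempts. This sketch models that behavior in plain Python, with `queue.Queue` standing in for a Service Bus queue and `MAX_DELIVERIES` for its MaxDeliveryCount setting:

```python
import queue

MAX_DELIVERIES = 3  # illustrative value for the queue's max delivery count

def drain(work_queue, dead_letter_queue, handler):
    """Process messages, re-queueing failures until the delivery budget is
    exhausted, then dead-lettering them so a poison message cannot block
    the queue forever. Messages are (delivery_count, body) pairs."""
    while True:
        try:
            deliveries, body = work_queue.get_nowait()
        except queue.Empty:
            return
        try:
            handler(body)
        except Exception:
            if deliveries + 1 >= MAX_DELIVERIES:
                dead_letter_queue.put(body)             # poison: park for inspection
            else:
                work_queue.put((deliveries + 1, body))  # retry on a later delivery
```

Service Bus performs this dead-lettering automatically; the point of monitoring the DLQ is that every message landing there represents a handler bug or malformed payload to investigate.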