Governance Policies Reliability Deployment Safety - Azure/az-prototype GitHub Wiki

Deployment Safety

Governance policies for Deployment Safety

Domain: reliability

Patterns

| Name | Description |
| --- | --- |
| Slot swap deployment script | Deploy to staging slot, validate health, swap to production, and provide rollback capability in a single deploy.sh script. |
| Container Apps canary deployment script | Deploy a new Container App revision with canary traffic splitting, validate, then shift all traffic. |
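
The slot swap pattern above can be sketched as a small control flow. This is only an illustration in Python, not the deploy.sh script the table refers to: the four callables (`deploy_to_staging`, `staging_is_healthy`, `swap_slots`, `production_is_healthy`) are hypothetical hooks that would wrap whatever tooling actually performs each step.

```python
from typing import Callable


def slot_swap_deploy(
    deploy_to_staging: Callable[[], None],
    staging_is_healthy: Callable[[], bool],
    swap_slots: Callable[[], None],
    production_is_healthy: Callable[[], bool],
) -> str:
    """Run deploy -> validate -> swap -> verify, rolling back on failure.

    All four callables are hypothetical hooks supplied by the caller.
    """
    deploy_to_staging()
    if not staging_is_healthy():
        # Production was never touched, so there is nothing to roll back.
        return "aborted"

    swap_slots()
    if not production_is_healthy():
        # Swapping a second time returns the previous build to production.
        swap_slots()
        return "rolled_back"
    return "swapped"
```

Passing the steps in as callables keeps the ordering logic testable without touching a real environment.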

Anti-Patterns

| Description | Instead |
| --- | --- |
| Deploying directly to production without a staging phase | Use deployment slots (App Service), revision traffic splitting (Container Apps), or rolling updates (AKS) |
| Using mutable image tags like 'latest' in production | Tag images with immutable identifiers (git SHA, build number, semantic version) |
| Making manual changes to production infrastructure | Define all infrastructure as code and apply changes through CI/CD pipelines |
| Storing Terraform state locally | Use Azure Storage remote backend with versioning, locking, and Entra ID authentication |
| Deploying without rollback capability | Ensure every deployment has a tested rollback path executable within 5 minutes |
| Rebuilding container images for each environment | Build once, promote the same image artifact through dev, staging, production |

Checks (5)

| Check | Severity | Description |
| --- | --- | --- |
| WAF-REL-DEPLOY-001 | Required | Implement blue-green or canary deployment for ALL production services. App Service MUST use deployment slots (staging slot with auto-swap or manual swap). Container Apps MUST use revision-based traffic splitting (route percentage of traffic to new revision). AKS MUST use rolling update strategy with max surge and max unavailable. Functions MUST use deployment slots for premium/dedicated plans. NEVER deploy directly to production without a staging phase. |
| WAF-REL-DEPLOY-002 | Required | Validate application health BEFORE shifting production traffic to a new deployment. App Service slots MUST pass health check validation before swap. Container Apps canary revisions MUST pass readiness probes before receiving traffic. AKS deployments MUST have readiness probes that validate application health including downstream dependencies. Health validation MUST check database connectivity, cache availability, and external API reachability, not just HTTP 200 from the root endpoint. |
| WAF-REL-DEPLOY-003 | Required | Ensure every production deployment has a tested rollback path. App Service MUST be able to swap back to the previous slot. Container Apps MUST be able to shift 100% traffic back to the previous revision. AKS MUST have previous deployment revision history preserved. Container Registry MUST retain previous image versions. Terraform state MUST be stored remotely with versioning to enable state rollback. Rollback MUST be executable within 5 minutes. |
| WAF-REL-DEPLOY-004 | Required | ALL infrastructure MUST be defined as code (Terraform or Bicep). NEVER make manual changes to production infrastructure; all changes MUST go through the IaC pipeline. Terraform state MUST be stored in a remote backend (Azure Storage) with locking (Azure Blob lease). Enable drift detection to identify manual changes. Use separate state files per environment (dev, staging, production) to isolate blast radius. |
| WAF-REL-DEPLOY-005 | Required | Use immutable infrastructure patterns for ALL containerized workloads. Container images MUST be versioned with unique tags (git SHA, build number, or semantic version); NEVER use mutable tags like 'latest'. Images MUST be built once and promoted through environments (dev -> staging -> production) without rebuilding. NEVER modify running containers in place; deploy new immutable images instead. ACR MUST have content trust and image quarantine for production images. |

WAF-REL-DEPLOY-001

Implement blue-green or canary deployment for ALL production services. App Service MUST use deployment slots (staging slot with auto-swap or manual swap). Container Apps MUST use revision-based traffic splitting (route percentage of traffic to new revision). AKS MUST use rolling update strategy with max surge and max unavailable. Functions MUST use deployment slots for premium/dedicated plans. NEVER deploy directly to production without a staging phase.

Severity: Required
Rationale: Direct-to-production deployments are a leading cause of production incidents. Blue-green deployment enables zero-downtime releases with instant rollback. Canary deployment validates changes with a subset of traffic before full rollout. Without staging, a bad deploy reaches 100% of users immediately.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer

Targets

  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ContainerRegistry/registries
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.Web/sites/slots | app | Staging deployment slot for App Service blue-green deployment |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for deployment slot swap events and health check logs |
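
As an illustration of the revision-based traffic splitting that WAF-REL-DEPLOY-001 requires for Container Apps, the sketch below raises the new revision's traffic share in steps and sends everything back to the previous revision when validation fails. The hooks `set_new_revision_weight` and `new_revision_is_healthy` and the step percentages (10, 50, 100) are assumptions chosen for the example, not values taken from this policy.

```python
from typing import Callable, Sequence


def canary_rollout(
    set_new_revision_weight: Callable[[int], None],
    new_revision_is_healthy: Callable[[], bool],
    steps: Sequence[int] = (10, 50, 100),  # hypothetical percentages
) -> bool:
    """Raise the new revision's traffic share step by step.

    Health is validated before every increase, including the first one, and
    once more after the last step. Any failed validation sets the new
    revision's share to 0 so the previous revision serves all traffic.
    Returns True only if every validation passed.
    """
    for weight in steps:
        if not new_revision_is_healthy():
            set_new_revision_weight(0)
            return False
        set_new_revision_weight(weight)

    if not new_revision_is_healthy():
        set_new_revision_weight(0)
        return False
    return True
```

Checking health before the first shift means a revision that never becomes ready receives no production traffic at all.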

WAF-REL-DEPLOY-002

Validate application health BEFORE shifting production traffic to a new deployment. App Service slots MUST pass health check validation before swap. Container Apps canary revisions MUST pass readiness probes before receiving traffic. AKS deployments MUST have readiness probes that validate application health including downstream dependencies. Health validation MUST check database connectivity, cache availability, and external API reachability — not just HTTP 200 from the root endpoint.

Severity: Required
Rationale: Deploying code that passes build/test but fails at runtime (wrong connection strings, missing config, incompatible schema) is a common failure mode. Health gates catch these failures before they affect users. Without gates, the first sign of failure is user-facing errors.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer

Targets

  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ContainerRegistry/registries
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
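
The requirement that health validation cover downstream dependencies, not just an HTTP 200 from the root endpoint, could look roughly like the following. It is a minimal sketch: the function name `deep_health`, the check names, and the choice of 503 for a failed gate are assumptions, and each check is a caller-supplied callable rather than a real database or cache client.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check report).

    The status is 200 only when all checks succeed; otherwise 503, so a
    probe or swap gate treats the instance as not ready.
    """
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises (timeout, refused connection) is a failure.
            report[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in report.values()) else 503
    return status, report
```

A health endpoint built this way might register one entry each for database connectivity, cache availability, and external API reachability, matching the three dependencies the check names.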

WAF-REL-DEPLOY-003

Ensure every production deployment has a tested rollback path. App Service MUST be able to swap back to the previous slot. Container Apps MUST be able to shift 100% traffic back to the previous revision. AKS MUST have previous deployment revision history preserved. Container Registry MUST retain previous image versions. Terraform state MUST be stored remotely with versioning to enable state rollback. Rollback MUST be executable within 5 minutes.

Severity: Required
Rationale: Rollback is the emergency brake for deployments. If a deployment causes issues that health checks miss (performance degradation, data corruption, business logic bugs), rollback is the only way to restore service quickly. Without rollback, the only option is a forward fix under pressure.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer

Targets

  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ContainerRegistry/registries
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.ContainerRegistry/registries | acr | Container Registry with retention policy for image version history |
| Microsoft.Storage/storageAccounts | st-data | Storage account with versioning for Terraform state rollback |
| Microsoft.Authorization/roleAssignments | Storage Blob Data Contributor | RBAC for state storage: Storage Blob Data Contributor for the deployment identity |
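
Two parts of WAF-REL-DEPLOY-003 lend themselves to a short sketch: finding the rollback target in the retained revision history, and checking that a rollback drill finishes within the 5-minute limit. The function names and the injected `rollback` and `clock` callables are hypothetical; only the 5-minute budget comes from the check itself.

```python
import time
from typing import Callable, Sequence

ROLLBACK_BUDGET_SECONDS = 5 * 60  # "within 5 minutes" from the check


def previous_revision(history: Sequence[str], active: str) -> str:
    """Return the revision deployed immediately before `active`.

    `history` is assumed to be ordered oldest to newest.
    """
    index = history.index(active)
    if index == 0:
        raise ValueError("no earlier revision retained; rollback is impossible")
    return history[index - 1]


def timed_rollback(
    rollback: Callable[[], None],
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run `rollback` and fail loudly if it exceeds the time budget."""
    start = clock()
    rollback()
    elapsed = clock() - start
    if elapsed > ROLLBACK_BUDGET_SECONDS:
        raise RuntimeError(f"rollback took {elapsed:.0f}s, budget is {ROLLBACK_BUDGET_SECONDS}s")
    return elapsed
```

Running `timed_rollback` as part of a regular drill is one way to keep the rollback path "tested" rather than merely documented.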

WAF-REL-DEPLOY-004

ALL infrastructure MUST be defined as code (Terraform or Bicep). NEVER make manual changes to production infrastructure; all changes MUST go through the IaC pipeline. Terraform state MUST be stored in a remote backend (Azure Storage) with locking (Azure Blob lease). Enable drift detection to identify manual changes. Use separate state files per environment (dev, staging, production) to isolate blast radius.

Severity: Required
Rationale: Manual infrastructure changes are untraceable, unreproducible, and unreviewable. IaC provides version control, peer review, audit trail, and reproducible environments. Remote state with locking prevents concurrent modifications that corrupt state.
Agents: terraform-agent, bicep-agent, cloud-architect

Targets

  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ContainerRegistry/registries
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets
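
The per-environment state separation and drift detection in WAF-REL-DEPLOY-004 can be illustrated with two small helpers. The key layout (`<project>/<environment>.tfstate`), the default project name `myapp`, and the result labels are assumptions made for this sketch. The exit-code mapping follows `terraform plan -detailed-exitcode`, which returns 0 for an empty diff, 1 for an error, and 2 for a non-empty diff.

```python
ENVIRONMENTS = ("dev", "staging", "production")


def state_key(environment: str, project: str = "myapp") -> str:
    """Build a distinct state blob name per environment (hypothetical layout)."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return f"{project}/{environment}.tfstate"


def classify_plan_exit_code(code: int) -> str:
    """Interpret the exit code of `terraform plan -detailed-exitcode`.

    When the plan runs against configuration that has already been applied,
    a non-empty diff suggests someone changed the infrastructure manually.
    """
    if code == 0:
        return "in_sync"
    if code == 2:
        return "changes_detected"
    return "error"
```

A scheduled pipeline job could run the plan for each environment and raise an alert whenever the result is `changes_detected`.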

WAF-REL-DEPLOY-005

Use immutable infrastructure patterns for ALL containerized workloads. Container images MUST be versioned with unique tags (git SHA, build number, or semantic version) — NEVER use mutable tags like 'latest'. Images MUST be built once and promoted through environments (dev -> staging -> production) without rebuilding. NEVER modify running containers in place — deploy new immutable images. ACR MUST have content trust and image quarantine for production images.

Severity: Required
Rationale: Mutable infrastructure (in-place updates, SSH patches, config changes on running servers) causes configuration drift, makes debugging impossible, and prevents reliable rollback. Immutable infrastructure ensures every deployment is reproducible and traceable to a specific build artifact.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer

Targets

  • Microsoft.Web/sites
  • Microsoft.App/containerApps
  • Microsoft.ContainerService/managedClusters
  • Microsoft.ContainerRegistry/registries
  • Microsoft.Compute/virtualMachines
  • Microsoft.Compute/virtualMachineScaleSets

Companion Resources

| Resource | Name | Purpose |
| --- | --- | --- |
| Microsoft.ContainerRegistry/registries | acr | Container Registry with Premium SKU for content trust, quarantine, and retention policies |
| Microsoft.Authorization/roleAssignments | role-assignment | AcrPush role for CI/CD identity, AcrPull role for application identity |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Container Registry (groupId: registry) |
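
A pipeline gate for the tagging rule in WAF-REL-DEPLOY-005 might resemble the sketch below. The regular expressions, the `build-` prefix, and the list of rejected tag names are illustrative assumptions; the check only names git SHA, build number, and semantic version as acceptable and 'latest' as forbidden.

```python
import re

# Assumed tag formats; adjust to the conventions the pipeline really uses.
GIT_SHA = re.compile(r"[0-9a-f]{7,40}")
SEMVER = re.compile(r"v?\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?")
BUILD_NUMBER = re.compile(r"(build-)?\d+")
MUTABLE_TAGS = {"latest", "stable", "dev", "staging", "production"}


def is_immutable_reference(image: str) -> bool:
    """Return True if `image` is pinned by digest or by a unique-looking tag."""
    if "@sha256:" in image:
        return True  # digest references always resolve to the same content
    _, separator, tag = image.rpartition(":")
    if not separator or "/" in tag:
        # No tag at all (implicit 'latest'), or the colon was a registry port.
        return False
    if tag in MUTABLE_TAGS:
        return False
    return any(p.fullmatch(tag) for p in (GIT_SHA, SEMVER, BUILD_NUMBER))
```

For example, `is_immutable_reference("myregistry.azurecr.io/app:latest")` returns `False`, while the same image tagged `3f2a9c1` or `1.4.2` returns `True`. A tag that merely looks unique can still be overwritten unless the registry prevents it, so this check complements registry-side controls rather than replacing them.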
