Governance Policies Reliability Deployment Safety - Azure/az-prototype GitHub Wiki
Governance policies for Deployment Safety
Domain: reliability
| Name | Description |
|---|---|
| Slot swap deployment script | Deploy to staging slot, validate health, swap to production, and provide rollback capability in a single deploy.sh script. |
| Container Apps canary deployment script | Deploy a new Container App revision with canary traffic splitting, validate, then shift all traffic. |
| Description | Instead |
|---|---|
| Deploying directly to production without a staging phase | Use deployment slots (App Service), revision traffic splitting (Container Apps), or rolling updates (AKS) |
| Using mutable image tags like 'latest' in production | Tag images with immutable identifiers (git SHA, build number, semantic version) |
| Making manual changes to production infrastructure | Define all infrastructure as code and apply changes through CI/CD pipelines |
| Storing Terraform state locally | Use Azure Storage remote backend with versioning, locking, and Entra ID authentication |
| Deploying without rollback capability | Ensure every deployment has a tested rollback path executable within 5 minutes |
| Rebuilding container images for each environment | Build once, promote the same image artifact through dev, staging, production |
- Azure Well-Architected Framework — Keep it simple
- App Service deployment slots
- Container Apps traffic splitting
- Terraform remote state in Azure
- Immutable infrastructure pattern
- Blue-green deployment pattern
| Check | Severity | Description |
|---|---|---|
| WAF-REL-DEPLOY-001 | Required | Implement blue-green or canary deployment for ALL production services. App Service MUST use deployment slots (staging slot with auto-swap or manual swap). Container Apps MUST use revision-based traffic splitting (route percentage of traffic to new revision). AKS MUST use rolling update strategy with max surge and max unavailable. Functions MUST use deployment slots for premium/dedicated plans. NEVER deploy directly to production without a staging phase. |
| WAF-REL-DEPLOY-002 | Required | Validate application health BEFORE shifting production traffic to a new deployment. App Service slots MUST pass health check validation before swap. Container Apps canary revisions MUST pass readiness probes before receiving traffic. AKS deployments MUST have readiness probes that validate application health including downstream dependencies. Health validation MUST check database connectivity, cache availability, and external API reachability — not just HTTP 200 from the root endpoint. |
| WAF-REL-DEPLOY-003 | Required | Ensure every production deployment has a tested rollback path. App Service MUST be able to swap back to the previous slot. Container Apps MUST be able to shift 100% traffic back to the previous revision. AKS MUST have previous deployment revision history preserved. Container Registry MUST retain previous image versions. Terraform state MUST be stored remotely with versioning to enable state rollback. Rollback MUST be executable within 5 minutes. |
| WAF-REL-DEPLOY-004 | Required | ALL infrastructure MUST be defined as code (Terraform or Bicep). NEVER make manual changes to production infrastructure — all changes must go through the IaC pipeline. Terraform state MUST be stored in a remote backend (Azure Storage) with locking (Azure Blob lease). Enable drift detection to identify manual changes. Use separate state files per environment (dev, staging, production) to isolate blast radius. |
| WAF-REL-DEPLOY-005 | Required | Use immutable infrastructure patterns for ALL containerized workloads. Container images MUST be versioned with unique tags (git SHA, build number, or semantic version) — NEVER use mutable tags like 'latest'. Images MUST be built once and promoted through environments (dev -> staging -> production) without rebuilding. NEVER modify running containers in place — deploy new immutable images. ACR MUST have content trust and image quarantine for production images. |
Implement blue-green or canary deployment for ALL production services. App Service MUST use deployment slots (staging slot with auto-swap or manual swap). Container Apps MUST use revision-based traffic splitting (route percentage of traffic to new revision). AKS MUST use rolling update strategy with max surge and max unavailable. Functions MUST use deployment slots for premium/dedicated plans. NEVER deploy directly to production without a staging phase.
Severity: Required
Rationale: Direct-to-production deployments are a leading cause of production incidents. Blue-green deployment enables zero-downtime releases with instant rollback. Canary deployment validates changes with a subset of traffic before full rollout. Without staging, a bad deploy affects 100% of users immediately.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ContainerRegistry/registries
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.Web/sites/slots | app | Staging deployment slot for App Service blue-green deployment |
| Microsoft.Insights/diagnosticSettings | diag-resource | Diagnostic settings for deployment slot swap events and health check logs |
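
The canary requirement above (route a percentage of traffic to the new revision, validate, then shift all traffic) can be sketched as a small planning helper. This is only an illustration: `TrafficSplit` and `plan_canary_rollout` are hypothetical names, not part of any Azure SDK, and the percentages are example values. Each planned step would still need to be applied with your deployment tooling and gated on health validation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrafficSplit:
    """Percentage of traffic routed to each revision at one rollout step."""
    stable_percent: int
    canary_percent: int


def plan_canary_rollout(steps: List[int]) -> List[TrafficSplit]:
    """Turn canary percentages into stable/canary splits, ending at 100% canary."""
    if not steps:
        raise ValueError("at least one canary step is required")
    if any(not 0 < p <= 100 for p in steps):
        raise ValueError("each canary percentage must be in 1..100")
    if steps != sorted(steps):
        raise ValueError("canary percentages must not decrease")
    if steps[-1] != 100:
        # The final step always shifts all traffic to the new revision.
        steps = steps + [100]
    return [TrafficSplit(100 - p, p) for p in steps]


if __name__ == "__main__":
    for split in plan_canary_rollout([10, 50]):
        print(f"stable={split.stable_percent}% canary={split.canary_percent}%")
```

Keeping the plan as data makes it easy to review in a pull request and to stop between steps if validation fails.
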
Validate application health BEFORE shifting production traffic to a new deployment. App Service slots MUST pass health check validation before swap. Container Apps canary revisions MUST pass readiness probes before receiving traffic. AKS deployments MUST have readiness probes that validate application health including downstream dependencies. Health validation MUST check database connectivity, cache availability, and external API reachability — not just HTTP 200 from the root endpoint.
Severity: Required
Rationale: Deploying code that passes build/test but fails at runtime (wrong connection strings, missing config, incompatible schema) is a common failure mode. Health gates catch these failures before they affect users. Without gates, the first sign of failure is user-facing errors.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ContainerRegistry/registries
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
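
As a rough sketch of the health gate described for WAF-REL-DEPLOY-002, the function below aggregates several dependency checks and passes only when all of them succeed. The check names (`database`, `cache`, `external_api`) mirror the policy text, but the callables are placeholders; real probes against your database, cache, and external APIs are assumed and not shown.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_health_gate(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[bool, List[str]]:
    """Run every dependency check; the gate passes only if all of them succeed.

    Returns (passed, names_of_failed_checks).
    """
    if not checks:
        raise ValueError("a health gate needs at least one check")
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return (not failed, failed)


if __name__ == "__main__":
    # Placeholder probes; replace with real connectivity checks.
    passed, failed = evaluate_health_gate({
        "database": lambda: True,
        "cache": lambda: True,
        "external_api": lambda: False,
    })
    print("gate passed" if passed else f"gate failed: {failed}")
```

A deployment script could refuse to swap slots or shift traffic whenever the first element of the result is false.
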
Ensure every production deployment has a tested rollback path. App Service MUST be able to swap back to the previous slot. Container Apps MUST be able to shift 100% traffic back to the previous revision. AKS MUST have previous deployment revision history preserved. Container Registry MUST retain previous image versions. Terraform state MUST be stored remotely with versioning to enable state rollback. Rollback MUST be executable within 5 minutes.
Severity: Required
Rationale: Rollback is the emergency brake for deployments. If a deployment causes issues that health checks miss (performance degradation, data corruption, business logic bugs), rollback is the only way to restore service quickly. Without rollback, the only option is a forward fix under pressure.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ContainerRegistry/registries
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.ContainerRegistry/registries | acr | Container Registry with retention policy for image version history |
| Microsoft.Storage/storageAccounts | st-data | Storage account with versioning for Terraform state rollback |
| Microsoft.Authorization/roleAssignments | Storage Blob Data Contributor | RBAC for state storage — Storage Blob Data Contributor for deployment identity |
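
One way to make the rollback requirement testable is to choose the rollback target in code and to measure rollback drills against the 5-minute budget. The sketch below assumes a hypothetical `Revision` record with a creation timestamp and a health flag; how revisions are listed and how traffic is actually shifted depends on the platform and is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional

ROLLBACK_BUDGET_SECONDS = 5 * 60  # policy: rollback executable within 5 minutes


@dataclass(frozen=True)
class Revision:
    name: str
    created_at: int  # Unix timestamp (hypothetical field)
    healthy: bool


def pick_rollback_target(
    revisions: List[Revision], active_name: str
) -> Optional[Revision]:
    """Return the most recent healthy revision created before the active one."""
    active = next((r for r in revisions if r.name == active_name), None)
    if active is None:
        raise ValueError(f"unknown active revision: {active_name}")
    candidates = [
        r for r in revisions if r.healthy and r.created_at < active.created_at
    ]
    return max(candidates, key=lambda r: r.created_at, default=None)


def within_rollback_budget(elapsed_seconds: float) -> bool:
    """Check a measured rollback duration against the 5-minute budget."""
    return elapsed_seconds <= ROLLBACK_BUDGET_SECONDS


if __name__ == "__main__":
    history = [
        Revision("app--v1", 100, True),
        Revision("app--v2", 200, True),
        Revision("app--v3", 300, True),
    ]
    print(pick_rollback_target(history, "app--v3"))
```

If `pick_rollback_target` returns `None`, there is no rollback path, which the policy treats as a violation in its own right.
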
ALL infrastructure MUST be defined as code (Terraform or Bicep). NEVER make manual changes to production infrastructure — all changes must go through the IaC pipeline. Terraform state MUST be stored in a remote backend (Azure Storage) with locking (Azure Blob lease). Enable drift detection to identify manual changes. Use separate state files per environment (dev, staging, production) to isolate blast radius.
Severity: Required
Rationale: Manual infrastructure changes are untraceable, unreproducible, and unreviewable. IaC provides version control, peer review, audit trail, and reproducible environments. Remote state with locking prevents concurrent modifications that corrupt state.
Agents: terraform-agent, bicep-agent, cloud-architect
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ContainerRegistry/registries
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
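
Drift detection and per-environment state separation can be sketched as follows. The exit-code mapping relies on `terraform plan -detailed-exitcode`, which returns 0 for an empty diff, 1 for an error, and 2 for a non-empty diff. A non-empty diff only indicates drift when the plan runs against configuration that has already been applied, for example in a scheduled pipeline. The state key naming in `state_key_for` is an assumption for illustration, not a convention taken from this policy.

```python
from enum import Enum


class DriftStatus(Enum):
    IN_SYNC = "in_sync"
    DRIFT_DETECTED = "drift_detected"
    PLAN_FAILED = "plan_failed"


def classify_plan_exit_code(exit_code: int) -> DriftStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if exit_code == 0:
        return DriftStatus.IN_SYNC  # plan succeeded with an empty diff
    if exit_code == 2:
        return DriftStatus.DRIFT_DETECTED  # plan succeeded with changes present
    return DriftStatus.PLAN_FAILED  # exit code 1 (or anything unexpected)


ENVIRONMENTS = ("dev", "staging", "production")


def state_key_for(environment: str) -> str:
    """Return a separate state blob name per environment (hypothetical naming)."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return f"{environment}.terraform.tfstate"


if __name__ == "__main__":
    print(classify_plan_exit_code(2).value)
    print(state_key_for("staging"))
```
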
Use immutable infrastructure patterns for ALL containerized workloads. Container images MUST be versioned with unique tags (git SHA, build number, or semantic version) — NEVER use mutable tags like 'latest'. Images MUST be built once and promoted through environments (dev -> staging -> production) without rebuilding. NEVER modify running containers in place — deploy new immutable images. ACR MUST have content trust and image quarantine for production images.
Severity: Required
Rationale: Mutable infrastructure (in-place updates, SSH patches, config changes on running servers) causes configuration drift, makes debugging impossible, and prevents reliable rollback. Immutable infrastructure ensures every deployment is reproducible and traceable to a specific build artifact.
Agents: terraform-agent, bicep-agent, cloud-architect, app-developer, csharp-developer, python-developer
- Microsoft.Web/sites
- Microsoft.App/containerApps
- Microsoft.ContainerService/managedClusters
- Microsoft.ContainerRegistry/registries
- Microsoft.Compute/virtualMachines
- Microsoft.Compute/virtualMachineScaleSets
| Resource | Name | Purpose |
|---|---|---|
| Microsoft.ContainerRegistry/registries | acr | Container Registry with Premium SKU for content trust, quarantine, and retention policies |
| Microsoft.Authorization/roleAssignments | role-assignment | AcrPush role for CI/CD identity, AcrPull role for application identity |
| Microsoft.Network/privateEndpoints | pe-resource | Private endpoint for Container Registry (groupId: registry) |
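
A CI step could reject mutable tags before an image reference reaches production. The check below is a minimal sketch: the accepted formats (hex git SHA, semantic version, and a hypothetical `build-<number>` convention) and the list of mutable tag names are assumptions to adapt to your own tagging scheme. A tag that merely looks unique is not enforced as immutable by the registry, so this complements registry-side controls and does not replace them.

```python
import re

_GIT_SHA = re.compile(r"^[0-9a-f]{7,40}$")
_SEMVER = re.compile(r"^v?\d+\.\d+\.\d+$")
_BUILD = re.compile(r"^build-\d+$")  # hypothetical build-number convention
MUTABLE_TAGS = {"latest", "stable", "dev", "main"}  # assumed deny list


def is_immutable_tag(image_ref: str) -> bool:
    """Return True if the image reference is pinned to an immutable-looking tag."""
    if "@sha256:" in image_ref:
        return True  # digest references are content-addressed
    _, sep, tag = image_ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present (the colon, if any, belongs to a registry port),
        # so the reference would resolve to 'latest' by default.
        return False
    if tag.lower() in MUTABLE_TAGS:
        return False
    return bool(_GIT_SHA.match(tag) or _SEMVER.match(tag) or _BUILD.match(tag))


if __name__ == "__main__":
    for ref in ("myacr.azurecr.io/api:3f2a9c1", "myacr.azurecr.io/api:latest"):
        print(ref, is_immutable_tag(ref))
```
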