eks_cp - robjcook/sync GitHub Wiki
Document version: 1.0
Last updated: Sep 11, 2025
Cluster(s) covered: {{CLUSTER_NAME}} (replace)
Primary region: {{PRIMARY_REGION}} (replace)
DR owner: {{NAME / TEAM}} (replace)
Contact: {{pager / phone / email}} (replace)
Purpose: ensure rapid, repeatable recovery of applications and cluster resources running on Amazon EKS after outages, data loss, or security incidents. Scope: EKS control plane and worker node recovery, Kubernetes resource and persistent data recovery (PVCs/EBS/EFS), cross-AZ and cross-region failover strategies, and runbooks for common incidents.
Key assumptions:
- EKS control plane is managed by AWS (HA by AWS across AZs) — plan focuses on cluster state, worker nodes, applications, and persistent data. ([AWS Documentation]1)
- Recovery Time Objective (RTO): {{hours}} for full app recovery (set per app).
- Recovery Point Objective (RPO): {{minutes/hours}} for critical data (set per app).
- Primary success criteria: apps restored, traffic routed, data within RPO, and verification tests passed.
(Define these per-application and store in the Application DR matrix in Appendix A.) See DR strategy options from AWS (Pilot Light, Warm Standby, Active-Active). ([AWS Documentation]2)
- DR Owner: overall coordination and sign-off.
- Platform/Cluster Owner: cluster creation, control-plane health checks, node management.
- Application Owners: validate app-level backups, test recoveries, app config.
- Network/Infra: VPC, routing, Load Balancers, DNS failover.
- SecOps: security, access revocation, forensic data capture.
- Communications: stakeholder updates and external notifications.
Include on-call roster and escalation matrix (phone/pager/email).
- Kubernetes objects: Namespaces, Deployments, StatefulSets, DaemonSets, Services, Ingress, ConfigMaps, Secrets, RBAC, CRDs.
- Persistent data: EBS volumes (PVC-backed), EFS, S3 buckets used by apps, RDS/managed DBs (if used).
- Cluster configuration: IAM roles, node groups, VPC/subnets, security groups, cluster-autoscaler/Karpenter config.
- Cluster add-ons: CNI config, CoreDNS, metrics servers, Helm releases, cert-manager, ingress controllers.
- CI/CD pipelines, image registry (ECR), and external dependencies.
- Backup & Restore (primary) — Periodic and scheduled backups of Kubernetes resources and persistent volumes using Velero (backup of cluster objects + snapshots of EBS volumes or object-level copy to S3). Test restores regularly. Velero is widely recommended for EKS backups. ([Amazon Web Services, Inc.]3)
- AZ-resilience — Spread node groups across multiple AZs; use Pod topology spread and multi-AZ nodegroups to survive single AZ failures. ([Amazon Web Services, Inc.]4)
- Cross-region strategy — For higher SLAs, maintain a warm standby or pilot-light cluster in another region, or use multi-region storage (replicated S3, cross-region EFS solutions) depending on RTO/RPO. AWS offers templates and guidance for multi-region EKS DR. ([Amazon Web Services, Inc.]5)
- Immutable artifacts & IaC — Store all cluster manifests and infra as code (Terraform/CloudFormation/eksctl) in source control so a cluster can be recreated quickly.
- Protect backups — S3 Object Lock, encryption, IAM restrictions and offline/cross-account copies to mitigate ransomware. ([Trilio]6)
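A hedged sketch of the backup-protection step above, assuming an S3 bucket named my-velero-backups in us-east-1 (both placeholders) and a 30-day compliance-mode retention; Object Lock must be enabled when the bucket is created.

```bash
# Create a versioned backup bucket with Object Lock enabled (bucket name and
# retention period are placeholders; us-east-1 shown to match the Velero install).
aws s3api create-bucket \
  --bucket my-velero-backups \
  --object-lock-enabled-for-bucket

# Apply a default retention rule so backup objects cannot be deleted early.
aws s3api put-object-lock-configuration \
  --bucket my-velero-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'
```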
- Velero for cluster resource + PVC backups and restore workflows. ([velero.io]7)
- AWS EBS Snapshots (for large PVs) or CSI snapshotter integration via Velero. ([devopscube.com]8)
- S3 for backup storage (versioning, Object Lock for immutability).
- EFS for stateful workloads that need cross-AZ availability and easier multi-AZ failover. ([Amazon Web Services, Inc.]5)
- Infrastructure as Code: eksctl / Terraform / CloudFormation.
- Observability: CloudWatch, Prometheus + Alertmanager, logs in centralized store.
- DNS failover: Route 53 health checks & failover policies.
- Optional vendors: Portworx, Trilio, Veeam — for enterprises needing application-consistent or high-performance snapshots. ([Portworx]9)
- Cluster resource backups (Velero): daily full backups; hourly incremental for critical namespaces. Retention: 30 days (adjust per compliance).
- PV backups / EBS snapshots: schedule hourly/daily snapshots for critical PVs depending on RPO. Use cross-region copies for critical datasets.
- Object stores & DBs: snapshot DBs via native backup (RDS snapshots/Aurora backups); S3 versioning & replication for critical buckets.
- Secrets & Config: include in Velero (encrypted) or store secrets in external KMS-protected store; rotate keys per policy.
- Backup retention & immutability: use S3 Object Lock for at-risk data (ransomware). ([NetApp]10)
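A hedged sketch of how the schedules above could be expressed with the Velero CLI; schedule names, namespaces, cron expressions, and TTLs are illustrative and should be tuned to your RPO and retention policy. Velero resource backups are full backups, so the hourly job below approximates the "hourly incremental" cadence for critical namespaces.

```bash
# Daily full backup of all cluster resources, retained 30 days (720h).
velero schedule create daily-full \
  --schedule="0 2 * * *" --ttl 720h

# Hourly backup of critical namespaces, retained 7 days (168h).
# Namespace list is a placeholder; align with the Application DR matrix.
velero schedule create hourly-critical \
  --schedule="0 * * * *" --include-namespaces prod,shared --ttl 168h
```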
Symptoms: API server errors, `kubectl` times out.
Actions:
- Check AWS EKS control plane health in the AWS Console, `eksctl`, or CloudWatch. (AWS manages control plane HA; partial outages are rare.) ([AWS Documentation]1)
- Confirm IAM permissions and network ACLs (sometimes network misconfiguration causes access issues).
- Contact AWS Support if outage persists; follow AWS status page and region announcements.
- If cluster is entirely unrecoverable and RTO demands, fail over to warm standby cluster (see 8.4).
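Before escalating, a quick hedged control-plane status check from the CLI; cluster and region are the placeholders defined at the top of this document.

```bash
# Confirm the control plane reports ACTIVE and the endpoint/version look sane.
aws eks describe-cluster --name {{CLUSTER_NAME}} --region {{PRIMARY_REGION}} \
  --query "cluster.{status:status,endpoint:endpoint,version:version}" --output table

# Confirm which IAM principal kubectl is actually using.
aws sts get-caller-identity
```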
Symptoms: Pod eviction, pending pods, node NotReady.
Actions:
- Verify node status: `kubectl get nodes -o wide`.
- If autoscaling: check cluster-autoscaler/Karpenter logs and AWS EC2 instance limits. Restart or increase capacity; create a new node group in a healthy AZ. ([Amazon Web Services, Inc.]4)
- Cordon and drain unhealthy nodes: `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`.
- Replace the node group (`eksctl create nodegroup`) or scale the managed node group.
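A hedged sketch of replacing or scaling capacity with eksctl, assuming a managed node group; the cluster, node group names, instance type, and AZ are placeholders, and flags should be checked against your eksctl version.

```bash
# Add a temporary managed node group in a known-healthy AZ.
eksctl create nodegroup \
  --cluster {{CLUSTER_NAME}} \
  --name recovery-ng-1 \
  --node-type m5.large \
  --nodes 3 --nodes-min 3 --nodes-max 6 \
  --node-zones {{HEALTHY_AZ}} \
  --managed

# Or scale an existing managed node group to absorb the evicted workloads.
eksctl scale nodegroup --cluster {{CLUSTER_NAME}} --name {{NODEGROUP}} --nodes 6
```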
Symptoms: application data missing or corrupted.
Actions:
- Immediately snapshot the remaining volumes (prevent further changes).
- Identify the last good backup (Velero backup/snapshot).
- Restore the PV from a snapshot to a new PVC, or restore the Velero backup of PVCs: `velero restore create --from-backup <backup-name>`. Test in an isolated namespace before replacing prod. ([velero.io]7)
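A hedged sketch of the isolated-namespace validation step above; the backup name, source namespace, and scratch namespace are placeholders.

```bash
# Restore the prod namespace from the last good backup into an isolated
# namespace so data can be validated without touching production.
velero restore create prod-validate-$(date +%F) \
  --from-backup <backup-name> \
  --include-namespaces prod \
  --namespace-mappings prod:prod-restore-test

# Watch restore progress and inspect any warnings or errors before promoting.
velero restore describe prod-validate-$(date +%F) --details
```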
Prereq: secondary region cluster is pre-provisioned (pilot light contains minimal infra; warm standby runs scaled-down apps).
Actions:
- Validate DNS failover plan (Route 53 health checks). Update weights / failover policy.
- Promote secondary environment: scale up node groups, ensure image registry access (ECR replication or mirrored registry), and restore latest backups (Velero + PV snapshots) into the secondary cluster.
- Test critical transaction flows and confirm data integrity.
- Repoint Route 53 or LB to secondary region. ([Amazon Web Services, Inc.]5)
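One hedged way to execute the Route 53 repoint manually, assuming a failover record set already exists; the hosted zone ID, health check ID, record name, and standby LB hostname are all placeholders.

```bash
# Ensure the SECONDARY failover record points at the standby region's LB.
cat > failover-change.json <<'EOF'
{
  "Comment": "DR failover: route app traffic to the secondary region",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com.",
      "Type": "CNAME",
      "SetIdentifier": "secondary",
      "Failover": "SECONDARY",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "standby-lb-1234.{{SECONDARY_REGION}}.elb.amazonaws.com" }]
    }
  }]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id {{HOSTED_ZONE_ID}} \
  --change-batch file://failover-change.json

# Confirm the primary health check is actually reporting unhealthy.
aws route53 get-health-check-status --health-check-id {{HEALTH_CHECK_ID}}
```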
Actions:
- Isolate affected nodes and revoke/rotate compromised credentials and access keys.
- Freeze writes to critical data stores (snapshots and S3 object lock).
- Restore from immutable backups (S3 Object Lock / cross-account backups).
- Engage Security/IR processes; preserve logs (CloudTrail, VPC Flow Logs, app logs) for forensics.
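A hedged sketch of the credential-revocation step for an IAM user access key; the user name, key ID, and Kubernetes secret are placeholders, and role-based or OIDC credentials need the equivalent steps from your IR procedures.

```bash
# Find and disable the compromised access key (values are placeholders).
aws iam list-access-keys --user-name compromised-user
aws iam update-access-key --user-name compromised-user \
  --access-key-id AKIAEXAMPLEKEYID --status Inactive

# Rotate any Kubernetes secrets that held the key, then redeploy consumers.
kubectl delete secret <secret-name> -n <namespace>
```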
After any restore, run these checks:
- All pods in critical namespaces are in Running or Completed state: `kubectl get pods -n <ns>`.
- Application health checks and smoke tests pass (end-to-end).
- Data integrity check (row counts, checksums) against the expected dataset from before the incident.
- External traffic flows as expected (DNS, load balancer).
- Postmortem and lessons captured.
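A minimal, hedged verification sketch that can be scripted and run after any restore; the namespace and health endpoint are placeholders.

```bash
# Wait for every pod in the restored namespace to become Ready.
NS=prod
kubectl wait --for=condition=Ready pod --all -n "$NS" --timeout=300s

# List anything that is not Running/Succeeded so it can be investigated.
kubectl get pods -n "$NS" --field-selector=status.phase!=Running,status.phase!=Succeeded

# Hypothetical app health endpoint; replace with the app's real smoke test.
curl -fsS https://app.example.com/healthz && echo "smoke test passed"
```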
- Quarterly: restore test from Velero into an isolated test cluster.
- Bi-annually: failover exercise to warm standby region (DNS failover + data restore).
- Annually: full DR test (simulate region outage, failover, operations). Record test dates, outcomes, RTO/RPO met, and action items.
- Initial notification: internal Slack channel + email to execs & stakeholders. Template: include incident time, impact, initial mitigation, expected next update.
- Status cadence: every 30 min until stabilised; then every 2 hours.
- Post-incident: full incident report and remediation timeline within 3 business days.
- Capture root cause, timeline, actions taken, success/failure of runbooks, and long-term mitigations.
- Update runbooks and playbooks; schedule follow-up tasks with owners and deadlines.
For each application, list:
- App name:
- Namespace:
- RTO:
- RPO:
- Critical resources:
- Backup frequency:
- Recovery owner:
- Test history:
# Assume AWS creds present and S3 bucket created for backups
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket my-velero-backups \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1,s3ForcePathStyle="true",s3Url=https://s3.amazonaws.com
(Create VolumeSnapshotLocation and backup schedules per Velero docs.) ([velero.io]7)
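A hedged example of the VolumeSnapshotLocation referenced above, matching the region used in the install command.

```bash
# Register the default EBS snapshot location for Velero in us-east-1.
velero snapshot-location create default \
  --provider aws \
  --config region=us-east-1
```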
velero backup create daily-backup-$(date +%F) --include-namespaces prod,shared
velero restore create --from-backup daily-backup-2025-09-10
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-restore snapshot"
(Always test commands in non-prod first. Provide IAM policies least privilege for Velero and snapshot operations — see Velero and AWS docs for required permissions.) ([Amazon Web Services, Inc.]3)
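An illustrative, non-exhaustive least-privilege policy sketch for Velero's EBS snapshot and S3 operations, assuming the my-velero-backups bucket; verify the required actions against the velero-plugin-for-aws documentation before use.

```bash
# Draft the policy document (bucket ARN is a placeholder).
cat > velero-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
EOF

# Create the managed policy and attach it to the Velero IAM role/user.
aws iam create-policy --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json
```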
- Encrypt backups at rest and in transit (S3 encryption + KMS for secrets).
- Use cross-account backups and S3 Object Lock to protect against accidental/malicious deletion. ([NetApp]10)
- Audit backup/restore operations (CloudTrail). Document retention per retention policy and regulatory needs.
- After each test or incident, update RTO/RPO and runbooks.
- Automate restore verification tests and integrate into CI pipelines where possible.
- Regularly review third-party tool versions and AWS best-practices pages (EKS docs and AWS DR whitepaper).
Authoritative references & further reading:
- Amazon EKS disaster recovery & resilience guidance. ([AWS Documentation]1)
- AWS blog: Backup and restore Amazon EKS with Velero. ([Amazon Web Services, Inc.]3)
- AWS blog: Multi-region DR patterns for EKS + EFS. ([Amazon Web Services, Inc.]5)
- Velero documentation (file system & volume snapshot locations). ([velero.io]7)
- AWS Whitepaper: Disaster Recovery on AWS (strategies: pilot light, warm standby, active-active). ([AWS Documentation]2)
- Install and configure Velero with S3 backup bucket and a VolumeSnapshotLocation for CSI snapshots. ([velero.io]7)
- Add backup schedules for critical namespaces & PVs; set retention and immutability as required.
- Implement IaC scripts to recreate cluster infra (eksctl/Terraform) and store in Git.
- Provision a pilot-light cluster in a secondary region for critical workloads. ([Amazon Web Services, Inc.]5)
- Run first restore test in isolated test cluster and document outcomes.
Cluster: {{CLUSTER_NAME}}
Primary Region: {{PRIMARY_REGION}}
DR Owner Contact: {{NAME / PHONE / PAGER}}
- Acknowledge alert in PagerDuty/Slack.
- Notify DR Owner + relevant app owners.
- Assess scope:
kubectl get nodes -o wide
kubectl get pods --all-namespaces
- Check AWS EKS Console + CloudWatch for control plane health.
- Verify AWS status page & region health.
- If confirmed AWS outage → escalate to DR Owner + AWS Support.
- If unrecoverable → trigger Failover to Secondary Region (Section 3).
kubectl get nodes
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- Scale or replace the node group (`eksctl scale nodegroup`).
- Verify workloads reschedule in a healthy AZ.
- Restore the latest Velero backup:
velero restore create --from-backup <latest-backup>
- Or restore a specific PV snapshot:
aws ec2 create-volume --snapshot-id <snap-id> --availability-zone <az>
- Validate restored data with the app owner.
- Isolate affected nodes:
kubectl cordon <node>; kubectl drain <node>
- Revoke compromised IAM keys.
- Restore from immutable backup (S3 Object Lock / Velero).
- Engage SecOps.
- Scale up standby cluster/nodegroups in {{SECONDARY_REGION}}.
- Restore latest backup:
velero restore create --from-backup <latest>
- Switch DNS in Route 53 (failover policy).
- Run smoke tests → confirm app health before declaring recovery.
- `kubectl get pods -n <critical-ns>` → all Running.
- App endpoints return 200.
- Data integrity verified (checksum/row counts).
- External traffic routed correctly (DNS/LB).
- First update: 15 min after incident start.
- Update cadence: every 30 min until stable.
- Final report: after resolution → incident ticket + postmortem.
velero backup create manual-$(date +%F) --include-namespaces prod
velero restore create --from-backup <backup-name>
kubectl get events --sort-by='.lastTimestamp'
✅ Golden Rule: If unsure → escalate to DR Owner immediately. 📌 Keep this runbook printed or bookmarked in the on-call doc hub.
- Trigger/Symptoms
- Impact
- Resolution steps
Trigger/Symptoms:
- `kubectl` times out or fails across all users.
- AWS EKS console shows control plane degraded.
- CloudWatch alarms for API errors.
Impact:
- No ability to deploy new workloads, modify resources, or scale manually.
- Running workloads continue, but cluster operations blocked.
Resolution:
- Check the [AWS Health Dashboard](https://status.aws.amazon.com/) for region-wide issues.
- If AWS-managed outage:
  - Notify stakeholders, escalate to AWS Support.
  - Monitor until AWS resolves (EKS control plane HA is AWS-managed).
- If caused by IAM/network misconfig:
  - Validate kubeconfig & IAM roles.
  - Check VPC routing, NACLs, SGs blocking the API endpoint.
  - Revert last infra change via IaC.
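Hedged commands for the kubeconfig/IAM checks above; cluster and region values are the placeholders defined at the top of this runbook.

```bash
# Confirm which principal kubectl is authenticating as.
aws sts get-caller-identity

# Regenerate kubeconfig for the cluster in case it is stale.
aws eks update-kubeconfig --name {{CLUSTER_NAME}} --region {{PRIMARY_REGION}}

# Confirm whether the API endpoint is exposed publicly/privately as expected.
aws eks describe-cluster --name {{CLUSTER_NAME}} \
  --query "cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}"
```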
Trigger/Symptoms:
- Pods stuck in Pending state.
- Nodes show NotReady.
- Events indicate scheduling failures due to no capacity.
Impact:
- Workloads on failed nodes evicted.
- Reduced capacity, degraded performance.
Resolution:
- Check node health:
kubectl get nodes
- If AZ outage suspected:
  - Scale the node group in a healthy AZ.
  - Add a temporary node group via eksctl or Terraform in another AZ within the same region.
- If EC2 instance limits hit:
  - Verify AWS EC2 service quotas; request a quota increase if needed.
- Drain unrecoverable nodes:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
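For the EC2 quota step above, a hedged sketch of checking and raising the limit; the quota code shown is believed to be the Running On-Demand Standard instances limit and must be verified for your instance families.

```bash
# Check the current vCPU quota for On-Demand Standard instances (verify quota code).
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# Request an increase if capacity is blocked by the quota (value is a placeholder).
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-1216C47A --desired-value 256
```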
Trigger/Symptoms:
- Application logs show I/O errors or missing data.
- Pod restarts continuously due to corrupted storage.
Impact:
- Application downtime or degraded service.
- Possible data loss if last backup is old.
Resolution:
- Snapshot the affected volume to preserve a forensic copy:
aws ec2 create-snapshot --volume-id <vol-id>
- Restore from the last good Velero backup or EBS snapshot:
velero restore create --from-backup <backup-name>
- Attach the restored volume to a new PVC and redeploy the workload.
- Validate data integrity and confirm the RPO is met.
Trigger/Symptoms:
- External users cannot access apps.
- ALB/NLB health checks failing.
- `kubectl get ingress` shows errors in status.
Impact:
- All or partial app unavailability for users.
Resolution:
- Verify ingress controller pod status (e.g., NGINX, AWS Load Balancer Controller):
kubectl get pods -n kube-system | grep ingress
- Restart failing controller pods.
- If AWS Load Balancer issue:
  - Delete and redeploy Service/Ingress → triggers new LB provisioning.
  - Update DNS if LB hostname changes.
- Validate app endpoints via curl or synthetic monitoring.
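Hedged diagnostics for the AWS Load Balancer Controller path; the deployment name assumes the default install, and the ingress name and hostname are placeholders.

```bash
# Look for reconcile errors from the controller (default deployment name assumed).
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=100

# Inspect ingress events and the provisioned LB address.
kubectl describe ingress <ingress-name> -n <namespace>

# Synthetic check against the app endpoint (placeholder URL).
curl -sv https://app.example.com/healthz
```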
Trigger/Symptoms:
- Sudden deletion of namespaces or workloads.
- Audit logs show unusual user activity.
- Data inaccessible or encrypted at app level.
Impact:
- Business-critical services offline.
- Possible loss of data beyond RPO.
Resolution:
- Immediately revoke compromised IAM/K8s credentials.
- Isolate suspicious nodes (cordon, drain).
- Restore from immutable Velero backups stored in S3 with Object Lock:
velero restore create --from-backup <backup-name>
- Run smoke tests to confirm recovery.
- Engage the Security/IR team for root cause investigation.
Trigger/Symptoms:
- Pods stuck in ContainerCreating due to networking errors.
- DNS resolution inside the cluster fails.
- Cluster-wide service discovery breaks.
Impact:
- Service-to-service communication disrupted.
- App deployments fail.
Resolution:
- Check add-on pods:
kubectl get pods -n kube-system
- Restart failed CoreDNS/CNI pods.
- If misconfiguration suspected (e.g., CNI upgrade):
  - Roll back to the last known working add-on version via eksctl or Helm.
  - Apply tested IaC configs.
- Validate by deploying a test pod and resolving DNS / making service requests.
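A hedged validation and rollback sketch for managed add-ons; the busybox image tag, cluster name, and add-on version are placeholders, and Helm-installed add-ons should be rolled back with helm rollback instead.

```bash
# Validate in-cluster DNS with a throwaway test pod.
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup kubernetes.default.svc.cluster.local

# Check the currently installed managed add-on version.
aws eks describe-addon --cluster-name {{CLUSTER_NAME}} --addon-name coredns \
  --query "addon.addonVersion"

# Roll back to a known-good version (placeholder) if a recent upgrade is suspect.
aws eks update-addon --cluster-name {{CLUSTER_NAME}} --addon-name coredns \
  --addon-version <known-good-version> --resolve-conflicts OVERWRITE
```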