
AWS EKS Contingency (Disaster Recovery) Plan

Document version: 1.0
Last updated: Sep 11, 2025
Cluster(s) covered: {{CLUSTER_NAME}} (replace)
Primary region: {{PRIMARY_REGION}} (replace)
DR owner: {{NAME / TEAM}} (replace)
Contact: {{pager / phone / email}} (replace)


1. Purpose & Scope

Purpose: ensure rapid, repeatable recovery of applications and cluster resources running on Amazon EKS after outages, data loss, or security incidents.

Scope: EKS control plane and worker node recovery, Kubernetes resource and persistent data recovery (PVCs/EBS/EFS), cross-AZ and cross-region failover strategies, and runbooks for common incidents.

Key assumptions:

  • The EKS control plane is managed by AWS and runs highly available across multiple AZs, so this plan focuses on cluster state, worker nodes, applications, and persistent data. ([AWS Documentation]1)

2. Objectives & Metrics

  • Recovery Time Objective (RTO): {{hours}} for full app recovery (set per app).
  • Recovery Point Objective (RPO): {{minutes/hours}} for critical data (set per app).
  • Primary success criteria: apps restored, traffic routed, data within RPO, and verification tests passed.

(Define these per-application and store in the Application DR matrix in Appendix A.) See DR strategy options from AWS (Pilot Light, Warm Standby, Active-Active). ([AWS Documentation]2)


3. Roles & Responsibilities

  • DR Owner: overall coordination and sign-off.
  • Platform/Cluster Owner: cluster creation, control-plane health checks, node management.
  • Application Owners: validate app-level backups, test recoveries, app config.
  • Network/Infra: VPC, routing, Load Balancers, DNS failover.
  • SecOps: security, access revocation, forensic data capture.
  • Communications: stakeholder updates and external notifications.

Include on-call roster and escalation matrix (phone/pager/email).


4. Inventory (What to protect)

  • Kubernetes objects: Namespaces, Deployments, StatefulSets, DaemonSets, Services, Ingress, ConfigMaps, Secrets, RBAC, CRDs.
  • Persistent data: EBS volumes (PVC-backed), EFS, S3 buckets used by apps, RDS/managed DBs (if used).
  • Cluster configuration: IAM roles, node groups, VPC/subnets, security groups, cluster-autoscaler/Karpenter config.
  • Cluster add-ons: CNI config, CoreDNS, metrics servers, Helm releases, cert-manager, ingress controllers.
  • CI/CD pipelines, image registry (ECR), and external dependencies.

5. DR Strategy Summary (Recommended)

  1. Backup & Restore (primary) — Scheduled backups of Kubernetes resources and persistent volumes using Velero (cluster objects plus EBS snapshots, or file-level copies to S3). Test restores regularly. Velero is widely recommended for EKS backups. ([Amazon Web Services, Inc.]3)
  2. AZ-resilience — Spread node groups across multiple AZs; use Pod topology spread and multi-AZ nodegroups to survive single AZ failures. ([Amazon Web Services, Inc.]4)
  3. Cross-region strategy — For higher SLAs, maintain a warm standby or pilot-light cluster in another region, or use multi-region storage (replicated S3, cross-region EFS solutions) depending on RTO/RPO. AWS offers templates and guidance for multi-region EKS DR. ([Amazon Web Services, Inc.]5)
  4. Immutable artifacts & IaC — Store all cluster manifests and infra as code (Terraform/CloudFormation/eksctl) in source control so a cluster can be recreated quickly (see the eksctl sketch after this list).
  5. Protect backups — S3 Object Lock, encryption, IAM restrictions and offline/cross-account copies to mitigate ransomware. ([Trilio]6)
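
To make the AZ-resilience and IaC points above concrete, below is a minimal eksctl cluster config sketch with a managed node group spread across multiple AZs. The cluster name, region, zones, and instance type are placeholders, not values from this plan; adapt them and keep the file in source control.

# Hypothetical multi-AZ cluster config (all values are placeholders).
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
managedNodeGroups:
  - name: workers-multi-az
    instanceType: m5.large
    minSize: 3
    desiredCapacity: 3
    maxSize: 6
    # Managed node groups span the cluster AZs above, so a single-AZ failure
    # leaves capacity in the remaining zones.
EOF

# Recreate (or originally create) the cluster from the versioned config:
eksctl create cluster -f cluster.yaml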

6. Tools & Technologies (recommended)

  • Velero for cluster resource + PVC backups and restore workflows. ([velero.io]7)
  • AWS EBS Snapshots (for large PVs) or CSI snapshotter integration via Velero. ([devopscube.com]8)
  • S3 for backup storage (versioning, Object Lock for immutability).
  • EFS for stateful workloads that need cross-AZ availability and easier multi-AZ failover. ([Amazon Web Services, Inc.]5)
  • Infrastructure as Code: eksctl / Terraform / CloudFormation.
  • Observability: CloudWatch, Prometheus + Alertmanager, logs in centralized store.
  • DNS failover: Route 53 health checks & failover policies.
  • Optional vendors: Portworx, Trilio, Veeam — for enterprises needing application-consistent or high-performance snapshots. ([Portworx]9)

7. Backup Policy (template)

  • Cluster resource backups (Velero): daily full backups; hourly backups of critical namespaces (see the schedule sketch after this list). Retention: 30 days (adjust per compliance).
  • PV backups / EBS snapshots: schedule hourly/daily snapshots for critical PVs depending on RPO. Use cross-region copies for critical datasets.
  • Object stores & DBs: snapshot DBs via native backup (RDS snapshots/Aurora backups); S3 versioning & replication for critical buckets.
  • Secrets & Config: include in Velero (encrypted) or store secrets in external KMS-protected store; rotate keys per policy.
  • Backup retention & immutability: use S3 Object Lock for at-risk data (ransomware). ([NetApp]10)
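
A minimal sketch of the policy above using the Velero CLI. Schedule names, cron expressions, and the namespace list are illustrative placeholders; 720h corresponds to the 30-day retention.

# Daily full-cluster backup at 02:00 UTC, retained 30 days.
velero schedule create daily-full --schedule="0 2 * * *" --ttl 720h0m0s

# Hourly backup of critical namespaces (placeholder list), retained 30 days.
velero schedule create hourly-critical --schedule="0 * * * *" \
  --include-namespaces prod,shared --ttl 720h0m0s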

8. Runbooks — rapid actions for common incidents

8.1 Control plane degraded (AWS-managed)

Symptoms: API server errors, kubectl times out. Actions:

  1. Check EKS control plane health in the AWS Console, via eksctl, or in CloudWatch (see the check commands after this list). AWS manages control plane HA; partial outages are rare. ([AWS Documentation]1)
  2. Confirm IAM permissions and network ACLs (sometimes network misconfiguration causes access issues).
  3. Contact AWS Support if outage persists; follow AWS status page and region announcements.
  4. If cluster is entirely unrecoverable and RTO demands, fail over to warm standby cluster (see 8.4).
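
Quick checks for steps 1-2, assuming the AWS CLI is configured; the cluster name and region are placeholders.

# Control plane status as reported by EKS (expect ACTIVE).
aws eks describe-cluster --name my-cluster --region us-east-1 --query 'cluster.status'

# Confirm which identity kubectl is using; IAM/auth problems often look like an outage.
aws sts get-caller-identity

# Refresh kubeconfig and probe the API server directly.
aws eks update-kubeconfig --name my-cluster --region us-east-1
kubectl get --raw /healthz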

8.2 Node group or AZ failure

Symptoms: Pod eviction, pending pods, node NotReady. Actions:

  1. Verify node status: kubectl get nodes -o wide.
  2. If autoscaling: check cluster-autoscaler/Karpenter logs and AWS EC2 instance limits. Restart or increase capacity; create new nodegroup in healthy AZ. ([Amazon Web Services, Inc.]4)
  3. Cordon and drain unhealthy nodes: kubectl drain <node> --ignore-daemonsets --delete-emptydir-data (drain cordons the node automatically).
  4. Replace node group (eksctl create nodegroup) or scale managed nodegroup.
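
A hedged sketch of step 4: create a temporary node group pinned to a healthy AZ, or scale an existing managed node group. Cluster name, node group names, instance type, and zone are placeholders; verify the flags against your eksctl version.

# Create a recovery node group in a healthy AZ (placeholders throughout).
eksctl create nodegroup \
  --cluster my-cluster \
  --name recovery-ng \
  --node-type m5.large \
  --nodes 3 --nodes-min 3 --nodes-max 6 \
  --node-zones us-east-1b

# Or scale an existing managed node group.
eksctl scale nodegroup --cluster my-cluster --name workers --nodes 6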

8.3 Persistent volume data corruption or deletion (app-level)

Symptoms: application data missing or corrupted. Actions:

  1. Immediately snapshot the remaining volumes (prevent further changes).
  2. Identify last good backup (Velero backup/snapshot).
  3. Restore PV from snapshot to new PVC or restore Velero backup of PVCs: velero restore create --from-backup <backup-name>. Test in isolated namespace before replacing prod. ([velero.io]7)
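
One way to stage the test restore from step 3 without touching prod is Velero's namespace mapping; the backup name and target namespace below are placeholders.

# Restore the prod namespace from a backup into an isolated test namespace.
velero restore create --from-backup <backup-name> \
  --include-namespaces prod \
  --namespace-mappings prod:prod-restore-test

# Watch progress and inspect results before promoting the restore to prod.
velero restore get
kubectl get pods,pvc -n prod-restore-test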

8.4 Region outage — failover to secondary region (pilot light / warm standby)

Prereq: secondary region cluster is pre-provisioned (pilot light contains minimal infra; warm standby runs scaled-down apps). Actions:

  1. Validate DNS failover plan (Route 53 health checks). Update weights / failover policy.
  2. Promote secondary environment: scale up node groups, ensure image registry access (ECR replication or mirrored registry), and restore latest backups (Velero + PV snapshots) into the secondary cluster.
  3. Test critical transaction flows and confirm data integrity.
  4. Repoint Route 53 or LB to secondary region. ([Amazon Web Services, Inc.]5)
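
A sketch of step 4 using the Route 53 CLI. The hosted zone ID, record name, and standby load balancer hostname are placeholders; if you already use failover routing with health checks, the switch happens automatically once the primary health check fails, and for zone-apex or existing A/alias records use an alias A record instead of a CNAME.

# Repoint app.example.com (placeholder) at the secondary region's load balancer.
cat > failover-change.json <<'EOF'
{
  "Comment": "DR failover to secondary region",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "secondary-lb.example.com" }]
    }
  }]
}
EOF
aws route53 change-resource-record-sets \
  --hosted-zone-id ZPLACEHOLDER123 \
  --change-batch file://failover-change.json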

8.5 Ransomware / data exfiltration suspected

Actions:

  1. Isolate affected nodes and revoke/rotate compromised credentials and access keys (see the command sketch after this list).
  2. Freeze writes to critical data stores (snapshots and S3 object lock).
  3. Restore from immutable backups (S3 Object Lock / cross-account backups).
  4. Engage Security/IR processes; preserve logs (CloudTrail, VPC Flow Logs, app logs) for forensics.
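
Illustrative commands for steps 1-2; user names, key IDs, node names, and volume IDs are placeholders.

# Deactivate a compromised IAM access key (rotate it afterwards).
aws iam update-access-key --user-name <iam-user> --access-key-id <access-key-id> --status Inactive

# Keep suspect nodes from receiving new pods while forensics proceeds.
kubectl cordon <node-name>

# Preserve the current state of critical volumes before any remediation.
aws ec2 create-snapshot --volume-id <vol-id> --description "IR freeze $(date -u +%FT%TZ)"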

9. Recovery Verification & Acceptance Criteria

After any restore, run these checks (a scripted sketch of the first two follows the list):

  • All pods in critical namespaces in Running or Completed state. kubectl get pods -n <ns>.
  • Application health checks and smoke tests pass (end-to-end).
  • Data integrity check (row counts, checksums) against the expected dataset from before the incident.
  • External traffic flows as expected (DNS, load balancer).
  • Postmortem and lessons captured.
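
A scripted sketch of the first two checks; the namespace and endpoint URL are placeholders for your critical workloads.

#!/usr/bin/env bash
set -euo pipefail
NS="prod"                                    # placeholder critical namespace
ENDPOINT="https://app.example.com/healthz"   # placeholder health endpoint

# 1. Fail if any pod in the namespace is not Running or Completed.
bad=$(kubectl get pods -n "$NS" --no-headers | awk '$3 != "Running" && $3 != "Completed" {print $1}')
if [ -n "$bad" ]; then
  echo "Pods not healthy in $NS:"; echo "$bad"; exit 1
fi

# 2. Basic end-to-end smoke test: the endpoint must return HTTP 200.
code=$(curl -s -o /dev/null -w '%{http_code}' "$ENDPOINT")
[ "$code" = "200" ] || { echo "Smoke test failed: HTTP $code"; exit 1; }

echo "Verification checks passed."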

10. Testing & Exercises

  • Quarterly: restore test from Velero into an isolated test cluster.
  • Twice a year: failover exercise to warm standby region (DNS failover + data restore).
  • Annually: full DR test (simulate region outage, failover, operations). Record test dates, outcomes, RTO/RPO met, and action items.

11. Communication Plan

  • Initial notification: internal Slack channel + email to execs & stakeholders. Template: include incident time, impact, initial mitigation, expected next update.
  • Status cadence: every 30 min until stabilised; then every 2 hours.
  • Post-incident: full incident report and remediation timeline within 3 business days.

12. Post-Incident Review

  • Capture root cause, timeline, actions taken, success/failure of runbooks, and long-term mitigations.
  • Update runbooks and playbooks; schedule follow-up tasks with owners and deadlines.

13. Appendix A — Application DR Matrix (template)

For each application, list:

  • App name:
  • Namespace:
  • RTO:
  • RPO:
  • Critical resources:
  • Backup frequency:
  • Recovery owner:
  • Test history:

14. Appendix B — Useful commands & sample manifests

Velero: install (example)

# Assumes AWS credentials are present and the S3 backup bucket already exists
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket my-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1
# For S3-compatible endpoints (not native S3), also set s3ForcePathStyle="true",s3Url=<endpoint>

(Create VolumeSnapshotLocation and backup schedules per Velero docs.) ([velero.io]7)
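
A hedged example of those follow-up objects, applied as manifests: a VolumeSnapshotLocation for EBS snapshots and a daily Schedule. The region and namespace list are placeholders.

# VolumeSnapshotLocation + daily Schedule (values are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-east-1
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-prod
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - prod
    ttl: 720h0m0s
EOF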

Create a Velero backup

velero backup create daily-backup-$(date +%F) --include-namespaces prod,shared

Restore a Velero backup

velero restore create --from-backup daily-backup-2025-09-10

Kubernetes node actions

kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

EBS snapshot (CLI example)

aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-restore snapshot"

(Always test commands in non-prod first. Grant Velero and snapshot operations least-privilege IAM policies; see Velero and AWS docs for the required permissions.) ([Amazon Web Services, Inc.]3)


15. Security & Compliance Considerations

  • Encrypt backups at rest and in transit (S3 encryption + KMS for secrets).
  • Use cross-account backups and S3 Object Lock to protect against accidental/malicious deletion. ([NetApp]10)
  • Audit backup/restore operations (CloudTrail). Document retention per retention policy and regulatory needs.

16. Continuous Improvement

  • After each test or incident, update RTO/RPO and runbooks.
  • Automate restore verification tests and integrate into CI pipelines where possible.
  • Regularly review third-party tool versions and AWS best-practices pages (EKS docs and AWS DR whitepaper).

Authoritative references & further reading:


17. Next steps (actionable checklist to implement now)

  1. Install and configure Velero with S3 backup bucket and a VolumeSnapshotLocation for CSI snapshots. ([velero.io]7)
  2. Add backup schedules for critical namespaces & PVs; set retention and immutability as required.
  3. Implement IaC scripts to recreate cluster infra (eksctl/Terraform) and store in Git.
  4. Provision a pilot-light cluster in a secondary region for critical workloads. ([Amazon Web Services, Inc.]5)
  5. Run first restore test in isolated test cluster and document outcomes.


🚨 EKS Disaster Recovery Runbook (On-Call Quick Guide)

Cluster: {{CLUSTER_NAME}} Primary Region: {{PRIMARY_REGION}} DR Owner Contact: {{NAME / PHONE / PAGER}}


1. Initial Response

  1. Acknowledge alert in PagerDuty/Slack.

  2. Notify DR Owner + relevant app owners.

  3. Assess scope:

    kubectl get nodes -o wide
    kubectl get pods --all-namespaces
  4. Check AWS EKS Console + CloudWatch for control plane health.


2. Common Scenarios & Actions

🔹 API/Control Plane Down

  • Verify AWS status page & region health.
  • If confirmed AWS outage → escalate to DR Owner + AWS Support.
  • If unrecoverable → trigger Failover to Secondary Region (Section 3).

🔹 Node/AZ Failure

kubectl get nodes
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
  • Scale or replace node group (eksctl scale nodegroup).
  • Verify workloads reschedule in healthy AZ.

🔹 Data Corruption / Lost PV

velero restore create --from-backup <latest-backup>
  • Or restore specific PV snapshot:

    aws ec2 create-volume --snapshot-id <snap-id> --availability-zone <az>
  • Validate restored data with app owner.

🔹 Ransomware / Security Incident

  • Isolate affected nodes:

    kubectl cordon <node>; kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
  • Revoke compromised IAM keys.

  • Restore from immutable backup (S3 Object Lock / Velero).

  • Engage SecOps.


3. Region Failover (Pilot Light / Warm Standby)

  1. Scale up standby cluster/nodegroups in {{SECONDARY_REGION}}.

  2. Restore latest backup:

    velero restore create --from-backup <latest>
  3. Switch DNS in Route 53 (failover policy).

  4. Run smoke tests → confirm app health before declaring recovery.


4. Verification (Post-Recovery)

  • kubectl get pods -n <critical-ns> → all Running.
  • App endpoints return 200.
  • Data integrity verified (checksum/row counts).
  • External traffic routed correctly (DNS/LB).

5. Communication

  • First update: 15 min after incident start.
  • Update cadence: every 30 min until stable.
  • Final report: after resolution → incident ticket + postmortem.

6. Useful Commands

velero backup create manual-$(date +%F) --include-namespaces prod
velero restore create --from-backup <backup-name>
kubectl get events --sort-by='.lastTimestamp'

Golden Rule: If unsure → escalate to DR Owner immediately. 📌 Keep this runbook printed or bookmarked in the on-call doc hub.


🔐 High-Value Contingency Scenarios — Single Region EKS

Each scenario below is documented with:

  • Trigger/Symptoms
  • Impact
  • Resolution steps


1. EKS Control Plane API Unreachable (Regional Service Issue)

Trigger/Symptoms:

  • kubectl times out or fails across all users.
  • AWS EKS console shows control plane degraded.
  • CloudWatch alarms for API errors.

Impact:

  • No ability to deploy new workloads, modify resources, or scale manually.
  • Running workloads continue, but cluster operations blocked.

Resolution:

  1. Check [AWS Health Dashboard](https://status.aws.amazon.com/) for region-wide issues.

  2. If AWS-managed outage:

    • Notify stakeholders, escalate to AWS Support.
    • Monitor until AWS resolves (EKS control plane HA is AWS-managed).
  3. If caused by IAM/network misconfig:

    • Validate kubeconfig & IAM roles.
    • Check VPC routing, NACLs, SGs blocking API endpoint.
    • Revert last infra change via IaC.

2. Node Group Failure (e.g., AZ Outage or EC2 Capacity Issue)

Trigger/Symptoms:

  • Pods stuck in Pending state.
  • Nodes show NotReady.
  • Events indicate scheduling failures due to no capacity.

Impact:

  • Workloads on failed nodes evicted.
  • Reduced capacity, degraded performance.

Resolution:

  1. Check node health:

    kubectl get nodes
  2. If AZ outage suspected:

    • Scale node group in healthy AZ.
    • Add temporary node group via eksctl or Terraform in another AZ within the same region.
  3. If EC2 instance limits hit:

    • Verify AWS EC2 service quotas; request a quota increase if needed (see the quota check sketch after this list).
  4. Drain unrecoverable nodes:

    kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
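
An illustrative quota check for step 3 using Service Quotas; the name filter is a placeholder and can be narrowed to the instance families you run.

# List EC2 On-Demand instance quotas (filter is illustrative).
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand')].{Name:QuotaName,Value:Value}" \
  --output table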

3. Persistent Volume Data Corruption (EBS-backed PVC)

Trigger/Symptoms:

  • Application logs show I/O errors or missing data.
  • Pod restarts continuously due to corrupted storage.

Impact:

  • Application downtime or degraded service.
  • Possible data loss if last backup is old.

Resolution:

  1. Snapshot the affected volume to preserve forensic copy:

    aws ec2 create-snapshot --volume-id <vol-id>
  2. Restore from last good Velero backup or EBS snapshot:

    velero restore create --from-backup <backup-name>
  3. Attach the restored volume to a new PVC and redeploy the workload (see the static-provisioning sketch after this list).

  4. Validate data integrity and confirm RPO met.
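
For step 3, one hedged approach is static provisioning of the restored volume with the EBS CSI driver; the volume ID, size, AZ, namespace, and object names below are placeholders.

# Bind a restored EBS volume to a new PVC via static provisioning (placeholders throughout).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: restored-data-pv
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ebs.csi.aws.com
    fsType: ext4
    volumeHandle: vol-0123456789abcdef0   # volume created from the restored snapshot
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values: ["us-east-1a"]      # AZ where the restored volume lives
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: prod
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ""                    # empty class so it binds to the static PV
  volumeName: restored-data-pv
  resources:
    requests:
      storage: 100Gi
EOF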


4. Ingress Controller / Load Balancer Failure

Trigger/Symptoms:

  • External users cannot access apps.
  • ALB/NLB health checks failing.
  • kubectl get ingress shows errors in status.

Impact:

  • All or partial app unavailability for users.

Resolution:

  1. Verify ingress controller pod status (e.g., NGINX, typically in the ingress-nginx namespace; AWS Load Balancer Controller, typically in kube-system).

    kubectl get pods -n kube-system | grep -i 'ingress\|load-balancer'
    kubectl get pods -n ingress-nginx
  2. Restart failing controller pods (a rollout-restart sketch follows this list).

  3. If AWS Load Balancer issue:

    • Delete and redeploy Service/Ingress → triggers new LB provisioning.
    • Update DNS if LB hostname changes.
  4. Validate app endpoints via curl or synthetic monitoring.
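
For step 2, a rollout restart is usually the least disruptive option. The deployment names and namespaces below are common chart defaults, not guaranteed to match your install; the endpoint URL is a placeholder.

# AWS Load Balancer Controller (commonly installed in kube-system).
kubectl -n kube-system rollout restart deployment aws-load-balancer-controller
kubectl -n kube-system rollout status deployment aws-load-balancer-controller

# NGINX ingress controller (commonly installed in ingress-nginx).
kubectl -n ingress-nginx rollout restart deployment ingress-nginx-controller

# Quick endpoint check once the controller is back.
curl -sS -o /dev/null -w '%{http_code}\n' https://app.example.com/healthz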


5. Ransomware / Malicious Deletion of Kubernetes Resources

Trigger/Symptoms:

  • Sudden deletion of namespaces or workloads.
  • Audit logs show unusual user activity.
  • Data inaccessible or encrypted at app level.

Impact:

  • Business-critical services offline.
  • Possible loss of data beyond RPO.

Resolution:

  1. Immediately revoke compromised IAM/K8s credentials.

  2. Isolate suspicious nodes (cordon, drain).

  3. Restore from immutable Velero backups stored in S3 with Object Lock.

    velero restore create --from-backup <backup-name>
  4. Run smoke tests to confirm recovery.

  5. Engage Security/IR team for root cause investigation.


6. Cluster Add-on Failure (e.g., CNI or CoreDNS)

Trigger/Symptoms:

  • Pods stuck in ContainerCreating due to networking errors.
  • DNS resolution inside cluster fails.
  • Cluster-wide service discovery breaks.

Impact:

  • Service-to-service communication disrupted.
  • App deployments fail.

Resolution:

  1. Check add-on pods:

    kubectl get pods -n kube-system
  2. Restart failed CoreDNS/CNI pods.

  3. If misconfiguration suspected (e.g., CNI upgrade):

    • Roll back to last known working add-on version via eksctl or Helm.
    • Apply tested IaC configs.
  4. Validate by deploying a test pod and resolving DNS / making service requests.
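
A quick validation sketch for step 4; the image tag is a placeholder.

# In-cluster DNS check using a throwaway pod.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local

# Confirm CoreDNS and the VPC CNI daemonset (aws-node) are healthy.
kubectl -n kube-system get deployment coredns
kubectl -n kube-system get daemonset aws-node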
