eks_cp - robjcook/sync GitHub Wiki
Document version: 1.0
Last updated: Sep 11, 2025
Cluster(s) covered: {{CLUSTER_NAME}} (replace)
Primary region: {{PRIMARY_REGION}} (replace)
DR owner: {{NAME / TEAM}} (replace)
Contact: {{pager / phone / email}} (replace)
Purpose: ensure rapid, repeatable recovery of applications and cluster resources running on Amazon EKS after outages, data loss, or security incidents. Scope: EKS control plane and worker node recovery, Kubernetes resource and persistent data recovery (PVCs/EBS/EFS), cross-AZ and cross-region failover strategies, and runbooks for common incidents.
Key assumptions:
- EKS control plane is managed by AWS (HA by AWS across AZs) — plan focuses on cluster state, worker nodes, applications, and persistent data. ([AWS Documentation]1)
- Recovery Time Objective (RTO): {{hours}} for full app recovery (set per app).
- Recovery Point Objective (RPO): {{minutes/hours}} for critical data (set per app).
- Primary success criteria: apps restored, traffic routed, data within RPO, and verification tests passed.
(Define these per-application and store in the Application DR matrix in Appendix A.) See DR strategy options from AWS (Pilot Light, Warm Standby, Active-Active). ([AWS Documentation]2)
- DR Owner: overall coordination and sign-off.
- Platform/Cluster Owner: cluster creation, control-plane health checks, node management.
- Application Owners: validate app-level backups, test recoveries, app config.
- Network/Infra: VPC, routing, Load Balancers, DNS failover.
- SecOps: security, access revocation, forensic data capture.
- Communications: stakeholder updates and external notifications.
Include on-call roster and escalation matrix (phone/pager/email).
- Kubernetes objects: Namespaces, Deployments, StatefulSets, DaemonSets, Services, Ingress, ConfigMaps, Secrets, RBAC, CRDs.
- Persistent data: EBS volumes (PVC-backed), EFS, S3 buckets used by apps, RDS/managed DBs (if used).
- Cluster configuration: IAM roles, node groups, VPC/subnets, security groups, cluster-autoscaler/Karpenter config.
- Cluster add-ons: CNI config, CoreDNS, metrics servers, Helm releases, cert-manager, ingress controllers.
- CI/CD pipelines, image registry (ECR), and external dependencies.
- Backup & Restore (primary) — Periodic and scheduled backups of Kubernetes resources and persistent volumes using Velero (backup of cluster objects + snapshots of EBS volumes or object-level copy to S3). Test restores regularly. Velero is widely recommended for EKS backups. ([Amazon Web Services, Inc.]3)
- AZ-resilience — Spread node groups across multiple AZs; use Pod topology spread and multi-AZ nodegroups to survive single AZ failures. ([Amazon Web Services, Inc.]4)
- Cross-region strategy — For higher SLAs, maintain a warm standby or pilot-light cluster in another region, or use multi-region storage (replicated S3, cross-region EFS solutions) depending on RTO/RPO. AWS offers templates and guidance for multi-region EKS DR. ([Amazon Web Services, Inc.]5)
- Immutable artifacts & IaC — Store all cluster manifests and infra as code (Terraform/CloudFormation/eksctl) in source control so a cluster can be recreated quickly.
- Protect backups — S3 Object Lock, encryption, IAM restrictions and offline/cross-account copies to mitigate ransomware. ([Trilio]6)
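A hedged sketch of the backup-protection step above, assuming an S3 bucket named my-velero-backups in us-east-1 (both placeholders) and a 30-day compliance-mode retention; Object Lock must be enabled when the bucket is created.

```bash
# Create a versioned backup bucket with Object Lock enabled (bucket name and
# retention period are placeholders; us-east-1 shown to match the Velero install).
aws s3api create-bucket \
  --bucket my-velero-backups \
  --object-lock-enabled-for-bucket

# Apply a default retention rule so backup objects cannot be deleted early.
aws s3api put-object-lock-configuration \
  --bucket my-velero-backups \
  --object-lock-configuration \
  '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":30}}}'
```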
- Velero for cluster resource + PVC backups and restore workflows. ([velero.io]7)
- AWS EBS Snapshots (for large PVs) or CSI snapshotter integration via Velero. ([devopscube.com]8)
- S3 for backup storage (versioning, Object Lock for immutability).
- EFS for stateful workloads that need cross-AZ availability and easier multi-AZ failover. ([Amazon Web Services, Inc.]5)
- Infrastructure as Code: eksctl / Terraform / CloudFormation.
- Observability: CloudWatch, Prometheus + Alertmanager, logs in centralized store.
- DNS failover: Route 53 health checks & failover policies.
- Optional vendors: Portworx, Trilio, Veeam — for enterprises needing application-consistent or high-performance snapshots. ([Portworx]9)
- Cluster resource backups (Velero): daily full backups; hourly incremental for critical namespaces. Retention: 30 days (adjust per compliance).
- PV backups / EBS snapshots: schedule hourly/daily snapshots for critical PVs depending on RPO. Use cross-region copies for critical datasets.
- Object stores & DBs: snapshot DBs via native backup (RDS snapshots/Aurora backups); S3 versioning & replication for critical buckets.
- Secrets & Config: include in Velero (encrypted) or store secrets in external KMS-protected store; rotate keys per policy.
- Backup retention & immutability: use S3 Object Lock for at-risk data (ransomware). ([NetApp]10)
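A hedged sketch of how the schedules above could be expressed with the Velero CLI; schedule names, namespaces, cron expressions, and TTLs are illustrative and should be tuned to your RPO and retention policy. Velero resource backups are full backups, so the hourly job below approximates the "hourly incremental" cadence for critical namespaces.

```bash
# Daily full backup of all cluster resources, retained 30 days (720h).
velero schedule create daily-full \
  --schedule="0 2 * * *" --ttl 720h

# Hourly backup of critical namespaces, retained 7 days (168h).
# Namespace list is a placeholder; align with the Application DR matrix.
velero schedule create hourly-critical \
  --schedule="0 * * * *" --include-namespaces prod,shared --ttl 168h
```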
Symptoms: API server errors, `kubectl` times out.
Actions:
- Check AWS EKS control plane health in the AWS Console, `eksctl`, or CloudWatch. (AWS manages control plane HA; partial outages are rare.) ([AWS Documentation]1)
- Confirm IAM permissions and network ACLs (sometimes network misconfiguration causes access issues).
- Contact AWS Support if outage persists; follow AWS status page and region announcements.
- If cluster is entirely unrecoverable and RTO demands, fail over to warm standby cluster (see 8.4).
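Before escalating, a quick hedged control-plane status check from the CLI; cluster and region are the placeholders defined at the top of this document.

```bash
# Confirm the control plane reports ACTIVE and the endpoint/version look sane.
aws eks describe-cluster --name {{CLUSTER_NAME}} --region {{PRIMARY_REGION}} \
  --query "cluster.{status:status,endpoint:endpoint,version:version}" --output table

# Confirm which IAM principal kubectl is actually using.
aws sts get-caller-identity
```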
Symptoms: Pod eviction, pending pods, node NotReady.
Actions:
- Verify node status: `kubectl get nodes -o wide`.
- If autoscaling: check cluster-autoscaler/Karpenter logs and AWS EC2 instance limits. Restart or increase capacity; create a new node group in a healthy AZ. ([Amazon Web Services, Inc.]4)
- Cordon and drain unhealthy nodes: `kubectl drain <node> --ignore-daemonsets --delete-emptydir-data`.
- Replace the node group (`eksctl create nodegroup`) or scale the managed node group.
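A hedged sketch of replacing or scaling capacity with eksctl, assuming a managed node group; the cluster, node group names, instance type, and AZ are placeholders, and flags should be checked against your eksctl version.

```bash
# Add a temporary managed node group in a known-healthy AZ.
eksctl create nodegroup \
  --cluster {{CLUSTER_NAME}} \
  --name recovery-ng-1 \
  --node-type m5.large \
  --nodes 3 --nodes-min 3 --nodes-max 6 \
  --node-zones {{HEALTHY_AZ}} \
  --managed

# Or scale an existing managed node group to absorb the evicted workloads.
eksctl scale nodegroup --cluster {{CLUSTER_NAME}} --name {{NODEGROUP}} --nodes 6
```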
Symptoms: application data missing or corrupted.
Actions:
- Immediately snapshot the remaining volumes (prevent further changes).
- Identify the last good backup (Velero backup/snapshot).
- Restore the PV from a snapshot to a new PVC, or restore the Velero backup of PVCs: `velero restore create --from-backup <backup-name>`. Test in an isolated namespace before replacing prod. ([velero.io]7)
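A hedged sketch of the isolated-namespace validation step above; the backup name, source namespace, and scratch namespace are placeholders.

```bash
# Restore the prod namespace from the last good backup into an isolated
# namespace so data can be validated without touching production.
velero restore create prod-validate-$(date +%F) \
  --from-backup <backup-name> \
  --include-namespaces prod \
  --namespace-mappings prod:prod-restore-test

# Watch restore progress and inspect any warnings or errors before promoting.
velero restore describe prod-validate-$(date +%F) --details
```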
Prereq: secondary region cluster is pre-provisioned (pilot light contains minimal infra; warm standby runs scaled-down apps).
Actions:
- Validate DNS failover plan (Route 53 health checks). Update weights / failover policy.
- Promote secondary environment: scale up node groups, ensure image registry access (ECR replication or mirrored registry), and restore latest backups (Velero + PV snapshots) into the secondary cluster.
- Test critical transaction flows and confirm data integrity.
- Repoint Route 53 or LB to secondary region. ([Amazon Web Services, Inc.]5)
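One hedged way to execute the Route 53 repoint manually, assuming a failover record set already exists; the hosted zone ID, health check ID, record name, and standby LB hostname are all placeholders.

```bash
# Ensure the SECONDARY failover record points at the standby region's LB.
cat > failover-change.json <<'EOF'
{
  "Comment": "DR failover: route app traffic to the secondary region",
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "app.example.com.",
      "Type": "CNAME",
      "SetIdentifier": "secondary",
      "Failover": "SECONDARY",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "standby-lb-1234.{{SECONDARY_REGION}}.elb.amazonaws.com" }]
    }
  }]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id {{HOSTED_ZONE_ID}} \
  --change-batch file://failover-change.json

# Confirm the primary health check is actually reporting unhealthy.
aws route53 get-health-check-status --health-check-id {{HEALTH_CHECK_ID}}
```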
Actions:
- Isolate affected nodes and revoke/rotate compromised credentials and access keys.
- Freeze writes to critical data stores (snapshots and S3 object lock).
- Restore from immutable backups (S3 Object Lock / cross-account backups).
- Engage Security/IR processes; preserve logs (CloudTrail, VPC Flow Logs, app logs) for forensics.
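A hedged sketch of the credential-revocation step for an IAM user access key; the user name, key ID, and Kubernetes secret are placeholders, and role-based or OIDC credentials need the equivalent steps from your IR procedures.

```bash
# Find and disable the compromised access key (values are placeholders).
aws iam list-access-keys --user-name compromised-user
aws iam update-access-key --user-name compromised-user \
  --access-key-id AKIAEXAMPLEKEYID --status Inactive

# Rotate any Kubernetes secrets that held the key, then redeploy consumers.
kubectl delete secret <secret-name> -n <namespace>
```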
After any restore, run these checks:
- All pods in critical namespaces are in Running or Completed state: `kubectl get pods -n <ns>`.
- Application health checks and smoke tests pass (end-to-end).
- Data integrity check (row counts, checksums) against the expected dataset from before the incident.
- External traffic flows as expected (DNS, load balancer).
- Postmortem and lessons captured.
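A minimal, hedged verification sketch that can be scripted and run after any restore; the namespace and health endpoint are placeholders.

```bash
# Wait for every pod in the restored namespace to become Ready.
NS=prod
kubectl wait --for=condition=Ready pod --all -n "$NS" --timeout=300s

# List anything that is not Running/Succeeded so it can be investigated.
kubectl get pods -n "$NS" --field-selector=status.phase!=Running,status.phase!=Succeeded

# Hypothetical app health endpoint; replace with the app's real smoke test.
curl -fsS https://app.example.com/healthz && echo "smoke test passed"
```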
- Quarterly: restore test from Velero into an isolated test cluster.
- Bi-annually: failover exercise to warm standby region (DNS failover + data restore).
- Annually: full DR test (simulate region outage, failover, operations). Record test dates, outcomes, RTO/RPO met, and action items.
- Initial notification: internal Slack channel + email to execs & stakeholders. Template: include incident time, impact, initial mitigation, expected next update.
- Status cadence: every 30 min until stabilised; then every 2 hours.
- Post-incident: full incident report and remediation timeline within 3 business days.
- Capture root cause, timeline, actions taken, success/failure of runbooks, and long-term mitigations.
- Update runbooks and playbooks; schedule follow-up tasks with owners and deadlines.
For each application, list:
- App name:
- Namespace:
- RTO:
- RPO:
- Critical resources:
- Backup frequency:
- Recovery owner:
- Test history:
# Assume AWS creds present and S3 bucket created for backups
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--bucket my-velero-backups \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1,s3ForcePathStyle="true",s3Url=https://s3.amazonaws.com
(Create VolumeSnapshotLocation and backup schedules per Velero docs.) ([velero.io]7)
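A hedged example of the VolumeSnapshotLocation referenced above, matching the region used in the install command.

```bash
# Register the default EBS snapshot location for Velero in us-east-1.
velero snapshot-location create default \
  --provider aws \
  --config region=us-east-1
```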
velero backup create daily-backup-$(date +%F) --include-namespaces prod,shared
velero restore create --from-backup daily-backup-2025-09-10
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "pre-restore snapshot"
(Always test commands in non-prod first. Provide IAM policies least privilege for Velero and snapshot operations — see Velero and AWS docs for required permissions.) ([Amazon Web Services, Inc.]3)
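An illustrative, non-exhaustive least-privilege policy sketch for Velero's EBS snapshot and S3 operations, assuming the my-velero-backups bucket; verify the required actions against the velero-plugin-for-aws documentation before use.

```bash
# Draft the policy document (bucket ARN is a placeholder).
cat > velero-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
EOF

# Create the managed policy and attach it to the Velero IAM role/user.
aws iam create-policy --policy-name VeleroBackupPolicy \
  --policy-document file://velero-policy.json
```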
- Encrypt backups at rest and in transit (S3 encryption + KMS for secrets).
- Use cross-account backups and S3 Object Lock to protect against accidental/malicious deletion. ([NetApp]10)
- Audit backup/restore operations (CloudTrail). Document retention per retention policy and regulatory needs.
- After each test or incident, update RTO/RPO and runbooks.
- Automate restore verification tests and integrate into CI pipelines where possible.
- Regularly review third-party tool versions and AWS best-practices pages (EKS docs and AWS DR whitepaper).
Authoritative references & further reading:
- Amazon EKS disaster recovery & resilience guidance. ([AWS Documentation]1)
- AWS blog: Backup and restore Amazon EKS with Velero. ([Amazon Web Services, Inc.]3)
- AWS blog: Multi-region DR patterns for EKS + EFS. ([Amazon Web Services, Inc.]5)
- Velero documentation (file system & volume snapshot locations). ([velero.io]7)
- AWS Whitepaper: Disaster Recovery on AWS (strategies: pilot light, warm standby, active-active). ([AWS Documentation]2)
- Install and configure Velero with S3 backup bucket and a VolumeSnapshotLocation for CSI snapshots. ([velero.io]7)
- Add backup schedules for critical namespaces & PVs; set retention and immutability as required.
- Implement IaC scripts to recreate cluster infra (eksctl/Terraform) and store in Git.
- Provision a pilot-light cluster in a secondary region for critical workloads. ([Amazon Web Services, Inc.]5)
- Run first restore test in isolated test cluster and document outcomes.
Cluster: {{CLUSTER_NAME}}
Primary Region: {{PRIMARY_REGION}}
DR Owner Contact: {{NAME / PHONE / PAGER}}
- Acknowledge alert in PagerDuty/Slack.
- Notify DR Owner + relevant app owners.
- Assess scope:
kubectl get nodes -o wide
kubectl get pods --all-namespaces
- Check AWS EKS Console + CloudWatch for control plane health.
- Verify AWS status page & region health.
- If confirmed AWS outage → escalate to DR Owner + AWS Support.
- If unrecoverable → trigger Failover to Secondary Region (Section 3).
kubectl get nodes
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
- Scale or replace the node group (`eksctl scale nodegroup`).
- Verify workloads reschedule in a healthy AZ.
- Restore the latest Velero backup:
velero restore create --from-backup <latest-backup>
- Or restore a specific PV snapshot:
aws ec2 create-volume --snapshot-id <snap-id> --availability-zone <az>
- Validate restored data with the app owner.
- Isolate affected nodes:
kubectl cordon <node>; kubectl drain <node>
- Revoke compromised IAM keys.
- Restore from immutable backup (S3 Object Lock / Velero).
- Engage SecOps.
- Scale up standby cluster/nodegroups in {{SECONDARY_REGION}}.
- Restore latest backup:
velero restore create --from-backup <latest>
- Switch DNS in Route 53 (failover policy).
- Run smoke tests → confirm app health before declaring recovery.
- `kubectl get pods -n <critical-ns>` → all Running.
- App endpoints return 200.
- Data integrity verified (checksum/row counts).
- External traffic routed correctly (DNS/LB).
- First update: 15 min after incident start.
- Update cadence: every 30 min until stable.
- Final report: after resolution → incident ticket + postmortem.
velero backup create manual-$(date +%F) --include-namespaces prod
velero restore create --from-backup <backup-name>
kubectl get events --sort-by='.lastTimestamp'
✅ Golden Rule: If unsure → escalate to DR Owner immediately. 📌 Keep this runbook printed or bookmarked in the on-call doc hub.
- Trigger/Symptoms
- Impact
- Resolution steps
Trigger/Symptoms:
- `kubectl` times out or fails across all users.
- AWS EKS console shows control plane degraded.
- CloudWatch alarms for API errors.
Impact:
- No ability to deploy new workloads, modify resources, or scale manually.
- Running workloads continue, but cluster operations blocked.
Resolution:
- Check the [AWS Health Dashboard](https://status.aws.amazon.com/) for region-wide issues.
- If AWS-managed outage:
  - Notify stakeholders, escalate to AWS Support.
  - Monitor until AWS resolves (EKS control plane HA is AWS-managed).
- If caused by IAM/network misconfig:
  - Validate kubeconfig & IAM roles.
  - Check VPC routing, NACLs, SGs blocking the API endpoint.
  - Revert last infra change via IaC.
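Hedged commands for the kubeconfig/IAM checks above; cluster and region values are the placeholders defined at the top of this runbook.

```bash
# Confirm which principal kubectl is authenticating as.
aws sts get-caller-identity

# Regenerate kubeconfig for the cluster in case it is stale.
aws eks update-kubeconfig --name {{CLUSTER_NAME}} --region {{PRIMARY_REGION}}

# Confirm whether the API endpoint is exposed publicly/privately as expected.
aws eks describe-cluster --name {{CLUSTER_NAME}} \
  --query "cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess}"
```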
Trigger/Symptoms:
- Pods stuck in Pending state.
- Nodes show NotReady.
- Events indicate scheduling failures due to no capacity.
Impact:
- Workloads on failed nodes evicted.
- Reduced capacity, degraded performance.
Resolution:
- Check node health:
kubectl get nodes
- If AZ outage suspected:
  - Scale the node group in a healthy AZ.
  - Add a temporary node group via eksctl or Terraform in another AZ within the same region.
- If EC2 instance limits hit:
  - Verify AWS EC2 service quotas; request a quota increase if needed.
- Drain unrecoverable nodes:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
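For the EC2 quota step above, a hedged sketch of checking and raising the limit; the quota code shown is believed to be the Running On-Demand Standard instances limit and must be verified for your instance families.

```bash
# Check the current vCPU quota for On-Demand Standard instances (verify quota code).
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A

# Request an increase if capacity is blocked by the quota (value is a placeholder).
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-1216C47A --desired-value 256
```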
Trigger/Symptoms:
- Application logs show I/O errors or missing data.
- Pod restarts continuously due to corrupted storage.
Impact:
- Application downtime or degraded service.
- Possible data loss if last backup is old.
Resolution:
- Snapshot the affected volume to preserve a forensic copy:
aws ec2 create-snapshot --volume-id <vol-id>
- Restore from the last good Velero backup or EBS snapshot:
velero restore create --from-backup <backup-name>
- Attach the restored volume to a new PVC and redeploy the workload.
- Validate data integrity and confirm the RPO is met.
Trigger/Symptoms:
- External users cannot access apps.
- ALB/NLB health checks failing.
- `kubectl get ingress` shows errors in status.
Impact:
- All or partial app unavailability for users.
Resolution:
- Verify ingress controller pod status (e.g., NGINX, AWS Load Balancer Controller):
kubectl get pods -n kube-system | grep ingress
- Restart failing controller pods.
- If AWS Load Balancer issue:
  - Delete and redeploy Service/Ingress → triggers new LB provisioning.
  - Update DNS if LB hostname changes.
- Validate app endpoints via curl or synthetic monitoring.
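Hedged diagnostics for the AWS Load Balancer Controller path; the deployment name assumes the default install, and the ingress name and hostname are placeholders.

```bash
# Look for reconcile errors from the controller (default deployment name assumed).
kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=100

# Inspect ingress events and the provisioned LB address.
kubectl describe ingress <ingress-name> -n <namespace>

# Synthetic check against the app endpoint (placeholder URL).
curl -sv https://app.example.com/healthz
```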
Trigger/Symptoms:
- Sudden deletion of namespaces or workloads.
- Audit logs show unusual user activity.
- Data inaccessible or encrypted at app level.
Impact:
- Business-critical services offline.
- Possible loss of data beyond RPO.
Resolution:
- Immediately revoke compromised IAM/K8s credentials.
- Isolate suspicious nodes (cordon, drain).
- Restore from immutable Velero backups stored in S3 with Object Lock:
velero restore create --from-backup <backup-name>
- Run smoke tests to confirm recovery.
- Engage the Security/IR team for root cause investigation.
Trigger/Symptoms:
- Pods stuck in ContainerCreating due to networking errors.
- DNS resolution inside the cluster fails.
- Cluster-wide service discovery breaks.
Impact:
- Service-to-service communication disrupted.
- App deployments fail.
Resolution:
- Check add-on pods:
kubectl get pods -n kube-system
- Restart failed CoreDNS/CNI pods.
- If misconfiguration suspected (e.g., CNI upgrade):
  - Roll back to the last known working add-on version via eksctl or Helm.
  - Apply tested IaC configs.
- Validate by deploying a test pod and resolving DNS / making service requests.
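A hedged validation and rollback sketch for managed add-ons; the busybox image tag, cluster name, and add-on version are placeholders, and Helm-installed add-ons should be rolled back with helm rollback instead.

```bash
# Validate in-cluster DNS with a throwaway test pod.
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.36 \
  -- nslookup kubernetes.default.svc.cluster.local

# Check the currently installed managed add-on version.
aws eks describe-addon --cluster-name {{CLUSTER_NAME}} --addon-name coredns \
  --query "addon.addonVersion"

# Roll back to a known-good version (placeholder) if a recent upgrade is suspect.
aws eks update-addon --cluster-name {{CLUSTER_NAME}} --addon-name coredns \
  --addon-version <known-good-version> --resolve-conflicts OVERWRITE
```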