Technical ‐ AWS ‐ Disaster Recovery - Yves-Guduszeit/Interview GitHub Wiki
Disaster recovery (DR) on AWS involves designing systems, policies, and procedures to minimize downtime and data loss in the event of a disaster.
AWS provides a range of services and features to facilitate disaster recovery.
Start by defining two recovery objectives:
- Recovery Time Objective (RTO): How quickly must systems be restored?
- Recovery Point Objective (RPO): What is the maximum acceptable data loss?
These two targets drive the design of your DR strategy: the tighter they are, the more standby infrastructure you need and the more it costs.
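As a minimal sketch of how these two objectives can be checked against a backup schedule, the Python below assumes periodic backups, so the worst-case data loss equals the backup interval (ignoring the time a backup itself takes). The class and function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: a failure just before the next backup loses everything
    # written since the previous one, so the loss window is one interval.
    return backup_interval


def meets_objectives(
    objectives: RecoveryObjectives,
    backup_interval: timedelta,
    measured_restore_time: timedelta,
) -> bool:
    """Return True if the schedule and a measured restore satisfy RPO and RTO."""
    return (
        worst_case_data_loss(backup_interval) <= objectives.rpo
        and measured_restore_time <= objectives.rto
    )


objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
hourly_ok = meets_objectives(objectives, timedelta(minutes=30), timedelta(hours=2))
daily_ok = meets_objectives(objectives, timedelta(hours=24), timedelta(hours=2))
```

Here a 30-minute backup interval satisfies a one-hour RPO, while a daily backup does not, even though the restore time is the same in both cases.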
AWS describes four primary DR strategies. In the order listed, each is more complex and more expensive than the one before it but achieves a lower RTO and RPO:
- Backup and Restore
  - Suitable for non-critical workloads.
  - Regularly back up data to Amazon S3, Amazon S3 Glacier, or AWS Backup.
  - Restore data and systems manually during a disaster.
- Pilot Light
  - Keep a minimal version of the system running in AWS.
  - Critical components (e.g., the database) are always live, while other services are brought up as needed.
  - Automate scaling and deployment using AWS CloudFormation or AWS Elastic Beanstalk.
- Warm Standby
  - A scaled-down, fully functional replica of your production environment runs on AWS.
  - During a disaster, scale up the warm standby environment to handle full production load.
- Multi-Site (Active-Active)
  - Fully redundant environments run simultaneously in multiple AWS Regions.
  - Traffic routing between sites is managed with Amazon Route 53 or a similar DNS solution.
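The trade-off between the four strategies can be sketched as a lookup from the tightest objective to the cheapest strategy assumed to satisfy it. The thresholds below are illustrative assumptions, not AWS guidance, and `pick_strategy` is a hypothetical helper; real choices also depend on budget and workload.

```python
from datetime import timedelta

# Ordered from cheapest to most expensive. Each threshold is an ASSUMED
# lower bound on the objective the strategy can reasonably meet.
STRATEGIES = [
    ("Backup and Restore", timedelta(hours=8)),
    ("Pilot Light", timedelta(hours=1)),
    ("Warm Standby", timedelta(minutes=5)),
    ("Multi-Site (Active-Active)", timedelta(0)),
]


def pick_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Return the cheapest strategy whose assumed floor fits the tighter objective."""
    tightest = min(rto, rpo)
    for name, floor in STRATEGIES:
        if tightest >= floor:
            return name
    # Only reachable for negative durations; fall back to the strongest option.
    return STRATEGIES[-1][0]


relaxed = pick_strategy(timedelta(hours=24), timedelta(hours=12))
moderate = pick_strategy(timedelta(minutes=30), timedelta(minutes=10))
strict = pick_strategy(timedelta(minutes=1), timedelta(0))
```

Under these assumed thresholds, `relaxed` resolves to Backup and Restore, `moderate` to Warm Standby, and `strict` to Multi-Site (Active-Active).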
AWS offers several services to implement and enhance DR strategies:
- Storage and Backup:
  - Amazon S3: Durable and scalable storage for backups.
  - Amazon S3 Glacier: Cost-effective, long-term data archiving.
  - AWS Backup: Centralized backup management across AWS services.
  - Amazon RDS: Automated database backups and cross-Region read replicas.
  - Amazon EBS snapshots: Backups of the volumes attached to EC2 instances.
- Compute and Networking:
  - Amazon EC2: Copy AMIs to another Region so that instances can be launched there quickly (an AMI is only usable in the Region that holds it).
  - Elastic Load Balancing (ELB): Distribute traffic across healthy instances.
  - Amazon Route 53: DNS-based routing for failover and disaster recovery.
- Replication:
  - AWS Database Migration Service (DMS): Continuous replication for databases.
  - Amazon Aurora Global Database: Near real-time replication across Regions.
- Automation:
  - AWS CloudFormation: Automate infrastructure deployment.
  - AWS Elastic Disaster Recovery (DRS): Continuous replication and fast recovery for servers.
  - AWS Lambda: Trigger automated recovery workflows.
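To illustrate the last item, here is a minimal sketch of a Lambda handler that reacts to a CloudWatch alarm delivered through an Amazon SNS topic. It assumes the standard SNS event shape, in which the alarm arrives as a JSON string under `Records[].Sns.Message`. The alarm name and the `start_recovery` placeholder are hypothetical; a real workflow would call whatever recovery automation you have in place.

```python
import json

# Hypothetical alarm name; replace with the alarms that guard your primary site.
WATCHED_ALARMS = {"primary-region-health"}


def alarms_in(event):
    """Yield the CloudWatch alarm payload carried by each SNS record."""
    for record in event.get("Records", []):
        yield json.loads(record["Sns"]["Message"])


def start_recovery(alarm_name):
    # Placeholder (assumption): a real implementation would kick off the
    # recovery automation, e.g. scaling up the standby environment.
    return f"recovery started for {alarm_name}"


def handler(event, context):
    """Start recovery for every watched alarm that has entered the ALARM state."""
    results = []
    for alarm in alarms_in(event):
        is_watched = alarm.get("AlarmName") in WATCHED_ALARMS
        if is_watched and alarm.get("NewStateValue") == "ALARM":
            results.append(start_recovery(alarm["AlarmName"]))
    return results


def make_event(alarm_name, state):
    """Build a minimal SNS-style event for local testing."""
    message = json.dumps({"AlarmName": alarm_name, "NewStateValue": state})
    return {"Records": [{"Sns": {"Message": message}}]}
```

Filtering on both the alarm name and the `ALARM` state keeps the handler from reacting to recoveries (`OK`) or to unrelated alarms published on the same topic.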
- Deploy resources across multiple Availability Zones (AZs) within a Region for high availability.
- Use multi-Region replication to prepare for Region-level failures:
  - Amazon S3 Cross-Region Replication (CRR)
  - Amazon DynamoDB global tables
  - Amazon RDS cross-Region read replicas
- Perform regular DR drills using tools like AWS Fault Injection Service (formerly AWS Fault Injection Simulator).
- Test backups, recovery processes, and failover configurations to ensure they meet your RTO and RPO.
- Use Amazon CloudWatch to monitor infrastructure.
- Set up alarms and notifications for failover events or anomalies.
- Continuously optimize your DR plan based on evolving business needs.
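As an example of the multi-Region replication item above, the sketch below builds an S3 Cross-Region Replication configuration as a plain dictionary. The keys follow the shape that boto3's `put_bucket_replication` accepts for `ReplicationConfiguration`; the rule ID, role ARN, and bucket names are hypothetical, and the helper itself makes no AWS calls.

```python
def replication_configuration(role_arn: str, destination_bucket: str) -> dict:
    """Build a one-rule configuration that replicates every object."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "ID": "dr-replicate-all",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = replication_configuration(
    role_arn="arn:aws:iam::123456789012:role/example-replication-role",
    destination_bucket="example-dr-bucket",
)
```

With boto3 installed and credentials configured, this dictionary could be passed as `ReplicationConfiguration=config` to `put_bucket_replication` on the source bucket. Versioning must be enabled on both the source and destination buckets for replication to work.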
An example warm standby architecture:
- Primary Region: Full production setup.
- Secondary Region:
  - A scaled-down version with replicated data (e.g., databases, S3).
  - Services like Amazon Route 53 redirect traffic to it during a failover.
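The DNS side of such a failover can be sketched as a pair of Route 53 failover records: a PRIMARY record tied to a health check and a SECONDARY record that receives traffic when the primary is unhealthy. The helper below only builds the change batch as a dictionary in the shape boto3's `change_resource_record_sets` expects; the domain names, set identifiers, and health check ID are hypothetical.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,  # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,  # a short TTL lets clients pick up the switch quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, secondary_target, primary_health_check_id):
    """Build a change batch with a health-checked primary and a secondary."""
    return {
        "Comment": "Failover routing between primary and secondary Regions",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "primary",
                            health_check_id=primary_health_check_id),
            failover_record(name, secondary_target, "SECONDARY", "secondary"),
        ],
    }


batch = failover_change_batch(
    "app.example.com",
    "primary.example.com",
    "standby.example.com",
    "hypothetical-health-check-id",
)
```

With boto3, `batch` could be passed as `ChangeBatch=batch` together with a `HostedZoneId` to `change_resource_record_sets`. CNAME records are used here for simplicity; they cannot be created at the zone apex, where alias records would be needed instead.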
- Choose cost-effective storage options (e.g., S3 Glacier for archives).
- Consider Spot Instances only for interruption-tolerant parts of a warm standby, since AWS can reclaim Spot capacity at short notice.
- Automate scaling to minimize resource usage during normal operation.
Disaster recovery on AWS is highly customizable and scalable.
By leveraging AWS’s global infrastructure and services, you can build resilient systems tailored to your organization’s needs and budget.