High Availability | Fault Tolerance | Durability - FullstackCodingGuy/Developer-Fundamentals GitHub Wiki
- HA and FT work best together, aiming for zero service disruption and zero downtime
- HA is achieved by using a load balancer (LB) to distribute the workload across multiple servers, preferably in multiple Availability Zones (data centers)
- Auto Scaling is enabled to scale the servers across AZs based on demand
- FT is achieved through additional backup (redundant) infrastructure, which the provider supplies
- If a server goes down or becomes faulty/unhealthy, the Auto Scaling group (ASG) automatically launches a new server to handle the load
- If an Availability Zone itself fails, the remaining AZs handle the load by adding more servers, because their metrics (such as CPU utilization %) shoot up with the additional traffic
- HA targets minimal service interruption
- Designed so that there is no single point of failure (redundancy)
- Uptime is measured in %, e.g. 99.99%, i.e. how many 9s of availability the service is guaranteed to provide
- Replication can be synchronous or asynchronous
- Lower cost compared to FT
- Ways to create HA
- Elastic Load Balancing - for distributing incoming traffic to multiple nodes
- EC2 Auto Scaling
- FT targets no service interruption at all
- This requires specialized hardware with instantaneous failover
- So zero downtime is guaranteed
- Synchronous replication is a must: data is replicated in real time to ensure zero data loss
- Higher cost compared to HA, as the hardware systems themselves are replicated too
- Ways to create FT
- Fault-tolerant network interface cards (NICs): add a second network interface card as a backup
- Disk mirroring (RAID 1): add additional hard drives to mirror the data (data saved to drive 1 is immediately written to drive 2 as well)
- Synchronous DB replication
- Redundant power backup for the data center
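The "how many 9s" measure and the value of redundancy can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, a year is taken as 365 days, and the redundancy formula assumes replicas fail independently, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget implied by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def parallel_availability(single: float, replicas: int) -> float:
    """Availability (0..1) of N redundant replicas, assuming independent failures.

    The system is down only when every replica is down at the same time.
    """
    return 1 - (1 - single) ** replicas


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
    print(f"two replicas at 99% each: {parallel_availability(0.99, 2):.4%}")
```

Under these assumptions, 99.99% uptime leaves roughly 52.6 minutes of downtime per year, and adding a second independent replica turns two 9s into four 9s.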
How does load balancing work?
What is the difference between ASG and ALB?
In the AWS environment, Application Load Balancer (ALB) routes traffic efficiently to the right servers. In other words, ALB does the same job as a conductor's baton directing an orchestra. An Auto Scaling group (ASG) activates additional servers when application traffic increases.
Amazon Web Services (AWS) provides a broad set of tools for building resilient and scalable infrastructure. Application Load Balancer (ALB) and Auto Scaling groups (ASG) are two of the basic components of modern web architectures on AWS.
ALB is the entry point for your web traffic. It distributes incoming user requests across multiple targets, such as EC2 instances, containers, and IP addresses, in multiple Availability Zones. Besides spreading the load, this adds the redundancy that high availability depends on: when a server fails, ALB quickly redirects traffic to healthy instances, so users continue to get a smooth experience.
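The idea of sending traffic only to healthy targets can be sketched in a few lines. This is a toy model, not how ALB is implemented: the `HealthAwareBalancer` class and its method names are hypothetical, and a real load balancer determines health through periodic health checks rather than manual calls.

```python
from itertools import count


class HealthAwareBalancer:
    """Toy balancer: rotates over targets, skipping any marked unhealthy."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.healthy = set(self.targets)
        self._counter = count()

    def mark_unhealthy(self, target):
        # Stand-in for a failed health check.
        self.healthy.discard(target)

    def mark_healthy(self, target):
        # Stand-in for a target passing health checks again.
        if target in self.targets:
            self.healthy.add(target)

    def route(self):
        """Return the next healthy target, or raise if none is available."""
        candidates = [t for t in self.targets if t in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy targets available")
        return candidates[next(self._counter) % len(candidates)]


if __name__ == "__main__":
    lb = HealthAwareBalancer(["server-a", "server-b", "server-c"])
    print([lb.route() for _ in range(3)])
    lb.mark_unhealthy("server-b")  # simulate a server failure
    print([lb.route() for _ in range(4)])  # server-b no longer receives traffic
```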
ASG dynamically adjusts the number of EC2 instances in response to traffic demand. Configuring the ASG well is extremely important: a well-configured ASG does not just scale, it also maintains a good balance between performance and cost. When demand increases, ASG launches new instances to meet the load. When demand drops, ASG scales back in, so you only pay for the resources you need. This lets you use AWS resources cost-effectively.
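One way to picture how a scaling decision balances load against cost is a simple proportional rule with minimum and maximum bounds. The function below is a simplified model written for illustration, not the algorithm AWS uses; `desired_capacity` and its parameters are hypothetical names.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Instance count that would bring the average metric back to the target.

    Simplified proportional model: if 4 instances run at 90% CPU and the
    target is 50%, the same work spread at 50% needs ceil(4 * 90 / 50) servers.
    The result is clamped to the group's min/max size.
    """
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    proposed = math.ceil(current * metric / target)
    return max(min_size, min(max_size, proposed))


if __name__ == "__main__":
    print(desired_capacity(4, 90.0, 50.0, min_size=2, max_size=10))  # scale out
    print(desired_capacity(4, 20.0, 50.0, min_size=2, max_size=10))  # scale in
```

The `min_size` bound keeps enough servers for availability, while `max_size` caps the cost.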
Using Monitoring and Metrics for Auto Scaling: The Role of CloudWatch
To monitor a system you have built in AWS and keep it under observation, use the Amazon CloudWatch service.
CloudWatch plays a critical role in keeping your entire system under control. It monitors your applications through metrics such as network input/output or CPU usage. If a monitored value exceeds a threshold you specify, CloudWatch triggers an alarm.
These alarms can be configured to notify you when any metric crosses its threshold or when an anomaly is detected, and they can even trigger the ASG to scale out, as in the hands-on example.
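The threshold logic behind an alarm can be illustrated without any AWS calls. The sketch below assumes an alarm that fires only when the last few datapoints all breach the threshold; `alarm_state` is a hypothetical helper, and only the state names (`OK`, `ALARM`, `INSUFFICIENT_DATA`) mirror the ones CloudWatch uses.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Evaluate a simple 'greater than threshold' alarm.

    Returns "ALARM" when the most recent `evaluation_periods` datapoints all
    exceed the threshold, "OK" when at least one does not, and
    "INSUFFICIENT_DATA" when there are not enough datapoints yet.
    """
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    cpu_percent = [40, 85, 90, 95]  # made-up average CPU readings per period
    state = alarm_state(cpu_percent, threshold=80, evaluation_periods=3)
    print(state)
    if state == "ALARM":
        print("would notify the operator and ask the ASG to add capacity")
```

Requiring several consecutive breaching periods is a common way to avoid scaling on a single short spike.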
The Practicality of Scalability
It is important to understand how ALB and ASG work together.
Used together effectively, these two services give users a much better experience and keep the system efficient in both cost and time.
ALB and ASG cover a wide range of practical scenarios. Imagine you have an e-commerce website.
For example, there is a Black Friday sale and your website is under heavy traffic. What happens in this case? Thousands of users flock to your website, but you don't need to worry. ALB ensures that no single server carries too much load. At the same time, the ASG receives scaling signals based on CloudWatch metrics and activates additional servers to absorb the increased traffic. This not only prevents outages, but also maintains speed and reliability, which keeps customers satisfied and supports the business.
High Availability
- AWS EC2 Auto Scaling Explained: Ultimate Tutorial + Live Demo - Part 18
- AWS ALB (Application Load Balancer) - Step By Step Tutorial (Part -9)
- Route Traffic to Multiple Target Groups using Load Balancer Listener Rules | AWS Load Balancing
- AWS Load Balancer HTTPS Setup with Route 53 and Certificate Manager & HTTP Redirect to HTTPS
- AWS Load Balancers | ALB vs NLB vs GWLB | Detailed Comparison
- Random
- Round Robin (Basic)
- Weight-Based Round Robin - ex: give a higher weight to a high-performing server; more weight = more requests
- Ratio Based - ex: a server that is twice the size receives twice as much traffic
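The selection strategies listed above can be sketched as small Python helpers. This is a minimal illustration with made-up function names, not the code of any particular load balancer. Ratio-based distribution is modeled here as weighted round robin with weights proportional to server size.

```python
import itertools
import random
from collections import Counter


def round_robin(servers):
    """Basic round robin: hand out servers in a fixed repeating order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Weighted round robin: a server with weight w appears w times per cycle.

    `weights` maps server name -> positive integer weight.
    """
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def pick_random(servers, rng=random):
    """Random selection: every server is equally likely on each request."""
    return rng.choice(servers)


if __name__ == "__main__":
    rr = round_robin(["a", "b", "c"])
    print([next(rr) for _ in range(6)])

    # "big" is twice the size of "small", so it gets twice the traffic.
    wrr = weighted_round_robin({"big": 2, "small": 1})
    print(Counter(next(wrr) for _ in range(300)))

    print(pick_random(["a", "b", "c"]))
```

Expanding the server list by weight is the simplest form of weighting. It sends a burst of consecutive requests to the heavier server, which smoother schemes avoid by interleaving.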
Durability
Fault Tolerance
Follow the design-for-failure principle to make your application fault tolerant.
- Avoid single points of failure
- Assume everything fails, and design backwards
- Goal: applications should continue to function even if the underlying physical hardware fails or is removed/replaced
- Design your recovery process
- Trade off business needs against the cost of high availability
- Use multiple Availability Zones
- Replicate data across multiple AZs
- Use real-time monitoring (CloudWatch)
- Use EBS (Elastic Block Store) for persistent file systems
- Take EBS snapshots and use S3 for backups
AWS takes care of redundancy at the infrastructure level, such as backups for network cards, disk storage, and power.
For issues like intermittent network problems, service throttling, and application timeouts, the application should quickly accept the failure and handle it appropriately.
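One common way to handle transient failures such as timeouts or throttling is to retry a bounded number of times with exponential backoff and then give up. The sketch below is one possible approach, not something the text above prescribes: `TransientError` and `call_with_retries` are hypothetical names, and real code would map the errors of its own client library onto the retryable case.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, throttling)."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    After the last attempt the error is re-raised so the caller can handle it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # fail fast instead of retrying forever
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_call():
        # Fails twice, then succeeds: stands in for an intermittent network issue.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("simulated timeout")
        return "response"

    print(call_with_retries(flaky_call))
```

Capping the number of attempts matters: unbounded retries against a throttled service only add to the load that caused the throttling.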
Each microservice performs transactions on its own database, or on more than one database.
For instance, if one step fails in one of the microservices due to a timeout, the entire distributed transaction should be rolled back so that the data stays consistent, in the spirit of the ACID principles. To achieve this, consider the Saga orchestrator pattern, which executes a series of compensating tasks to reverse the changes made by the preceding steps.
Saga pattern for distributed transactions (similar to Newton's third law: every action has its reaction, and every operation has its compensating operation).
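A minimal in-process sketch of a saga orchestrator is shown below. It is illustrative only: `SagaStep` and `run_saga` are hypothetical names, the steps are plain functions rather than calls to real microservices, and a production orchestrator would also persist its progress and make compensations retryable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the forward operation
    compensation: Callable[[], None]  # the operation that undoes it


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns True if every step succeeded, False if the saga was rolled back.
    """
    completed = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


if __name__ == "__main__":
    log = []

    def charge_payment():
        raise TimeoutError("payment service timed out")  # simulated failure

    order_saga = [
        SagaStep("reserve stock",
                 lambda: log.append("stock reserved"),
                 lambda: log.append("stock released")),
        SagaStep("charge payment",
                 charge_payment,
                 lambda: log.append("payment refunded")),
        SagaStep("ship order",
                 lambda: log.append("order shipped"),
                 lambda: log.append("shipment cancelled")),
    ]
    print(run_saga(order_saga), log)
```

In this run the payment step fails, so only the stock reservation is compensated. The failed step never completed, so it has nothing to undo, and the shipping step never runs.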
- Hands-on: Configuring Application Load Balancer (ALB) with Auto Scaling Group (ASG) using Launch Template
- Understanding High Availability and Fault Tolerance
- Designing Fault Tolerant Applications
- AWS Summit ANZ 2022 - Build resilient microservices using fault-tolerant patterns (DEV5)
- Load Balancers are not Magic - Dissecting Atlassian Outage