Fault-tolerant cloud services - knowlesy/AZ400 GitHub Wiki
Fault tolerance
Failures in large mission-critical systems can result in significant monetary losses to all parties concerned. Cloud computing systems are layered by nature, so a fault in one layer of the cloud resources can trigger failures in the layers above it or cut off access to the layers below.
Profiling and testing
Load and stress testing cloud resources in order to understand possible causes of failure is essential to ensure the availability of services.
Over-provisioning is the practice of deploying resources in volumes that are larger than the general projected utilization of the resources at a given time.
Critical system components can be duplicated by using additional hardware and software components to silently handle failures in parts of the system without the entire system failing. Replication has two basic strategies:

- Active replication, where all replicated resources are alive concurrently and process every request. For any client request, all resources receive the same request, all resources respond to it, and the ordering of requests keeps state consistent across all resources.
- Passive replication, where only the primary unit processes requests and secondary units merely maintain state, taking over once the primary unit fails. The client is only in contact with the primary resource, which relays each state change to all secondary resources. The disadvantage of passive replication is that switching from the primary to a secondary instance may drop requests or degrade QoS.

Replicas can be provisioned at several levels of redundancy:

- N+1: For an application that needs N nodes to function properly, one extra node is provisioned as a fail-safe.
- 2N: One extra node is provisioned for each node required for normal function.
- 2N+1: One extra node is provisioned for each node required for normal function, plus one additional node overall.
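The payoff of these redundancy levels can be estimated with a simple independence assumption. The sketch below uses illustrative numbers and ignores failover time and correlated failures:

```python
def replicated_availability(a: float, n: int) -> float:
    """Probability the service is up when any one of n replicas suffices.

    Assumes independent node failures and instantaneous failover,
    which real deployments only approximate.
    """
    return 1 - (1 - a) ** n

# One 99%-available node vs. replicated deployments of the same node:
single = replicated_availability(0.99, 1)  # 0.99
pair = replicated_availability(0.99, 2)    # ≈ 0.9999 ("four nines")
trio = replicated_availability(0.99, 3)    # ≈ 0.999999 ("six nines")
```

Each added replica multiplies the probability of total failure by another factor of (1 - a), which is why even modest replication sharply improves availability.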
Checks and monitoring
- Ping-echo: The monitoring service asks each resource for its state and is given a time window to respond.
- Heartbeat: Each instance sends status to the monitoring service at regular intervals, without any trigger.
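The two checks can be contrasted in a minimal single-process sketch (hypothetical class and method names, not any real monitoring API):

```python
import time

class Resource:
    """A monitored resource (illustrative)."""
    def __init__(self, name):
        self.name = name

    def ping(self) -> bool:
        # Ping-echo: the monitor actively asks for state and expects
        # an echo within a time window; a healthy resource echoes back.
        return True

    def heartbeat(self, monitor):
        # Heartbeat: the resource pushes its status to the monitor on
        # its own schedule, without any request from the monitor.
        monitor.record(self.name, time.monotonic())

class Monitor:
    TIMEOUT = 5.0  # seconds without a heartbeat before flagging a resource

    def __init__(self):
        self.last_seen = {}

    def record(self, name, ts):
        self.last_seen[name] = ts

    def suspected_failed(self, name) -> bool:
        ts = self.last_seen.get(name)
        return ts is None or time.monotonic() - ts > self.TIMEOUT

web = Resource("web-1")
mon = Monitor()
web.heartbeat(mon)
assert mon.suspected_failed("web-1") is False  # heartbeat just arrived
```

The design trade-off: ping-echo puts the polling burden on the monitor, while heartbeats put it on each instance and let the monitor detect failure passively when the signal stops.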
Cloud services need to be built with redundancy and fault tolerance in mind, as no single component of a large distributed system can guarantee 100% availability or uptime.
All failures (including failures of dependencies in the same node, rack, datacenter, or regionally redundant deployments) need to be handled gracefully without affecting the entirety of the system. Testing the ability of the system to handle catastrophic failures is important, as sometimes even a few seconds of downtime or service degradation can cause hundreds of thousands, if not millions, of dollars in revenue loss.
The need for load balancing in computing stems from two basic requirements: First, high availability can be improved by replication. Second, performance can be improved through parallel processing.
Though all types of network load balancers simply forward the user's request, along with any context, to the back-end servers, when it comes to serving the response back to the client they may employ one of two basic strategies:
- Proxying: In this approach, the load balancer receives the response from the back end and relays it back to the client. The load balancer behaves as a standard web proxy and is involved in both halves of the network transaction: forwarding the client's request to the server and sending the server's response back to the client.
- TCP handoff: In this approach, the TCP connection with the client is handed off to the back-end server, which then sends the response directly to the client without going through the load balancer.
Hash-based distribution
This approach tries to ensure that, at any point, the requests made by a client through the same connection always end up on the same server. To balance the overall traffic, the hash function spreads connections from different clients across the servers in a pseudo-random but repeatable way.
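A minimal sketch of hash-based distribution follows (made-up server addresses). Note that a plain modulo hash remaps most clients when the pool size changes, which is why production balancers often use consistent hashing instead:

```python
import hashlib

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # illustrative back-end pool

def pick_server(client_key: str) -> str:
    # Hash a stable connection identifier (e.g. client IP and port) so
    # the same client always lands on the same back end, while the hash
    # spreads different clients pseudo-randomly across the pool.
    digest = hashlib.sha256(client_key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

# Same key -> same server every time (session affinity):
assert pick_server("203.0.113.7:52311") == pick_server("203.0.113.7:52311")
```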
- SSL offload: Network transactions via SSL carry an extra cost, since they require processing for encryption and authentication. Instead of serving all requests via SSL, the client's connection to the load balancer can be made via SSL, while the requests forwarded to each individual server are made via plain HTTP. This reduces the load on the servers considerably, and security is maintained as long as the forwarded requests do not travel over an open network.
- TCP buffering: This is a strategy to offload clients with slow connections on to the load balancer in order to relieve servers that are serving responses to these clients.
- Caching: In certain scenarios, the load balancer can maintain a cache for the most popular requests (or requests that can be handled without going to the servers, like static content) so that it reduces the load on the servers.
- Traffic shaping: For some applications, a load balancer can be used to delay/reprioritize the flow of packets such that traffic can be molded to suit the server configuration. This does affect the QoS for some requests, but makes sure that the incoming load can be served.
Scaling strategies
Horizontal scaling is a strategy where additional resources can be added to the system or extraneous resources can be removed from the system. This type of scaling is beneficial for the server tier, when the load on the system is unpredictable and fluctuates inconsistently. The nature of the fluctuating load makes it essential to efficiently provision the correct amount of resources to handle the load at all times.
There are certain kinds of loads that are more predictable than others, and these lend themselves to vertical scaling. For example, if historical patterns show that the number of requests always falls between 10k and 15k, then a single server that can serve 20k requests is good enough for the service provider's purposes. These loads may increase in the future, but as long as they increase in a consistent manner, the service can be moved to a larger instance that can serve more requests. This is suitable for small applications that experience a low amount of traffic.
A stateless service design lends itself to a scalable architecture. A stateless service essentially means that the client request contains all the information necessary to serve a request by the server.
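A minimal sketch of such a self-contained request follows (hypothetical field names and handler; the token value is made up):

```python
# A stateless request carries everything the server needs; there is no
# server-side session, so any replica can handle it interchangeably.
request = {
    "user_token": "tok-123",  # hypothetical auth token sent on every call
    "operation": "search",
    "query": "matrix",
    "page": 1,
}

def handle(req, catalog):
    # No session lookup: all state needed to serve the request is in `req`.
    hits = [title for title in catalog if req["query"] in title.lower()]
    start = (req["page"] - 1) * 10
    return hits[start:start + 10]
```

Because `handle` reads nothing but its arguments, new replicas can be added behind a load balancer without any session replication between servers.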
What is tail latency?
Most cloud applications are large, distributed systems that often rely on parallelization to reduce latency. A common technique is to fan out a request received at a root node (for example, a front-end web server) to many leaf nodes (back-end compute servers). The performance improvement is driven by the parallelism of the distributed computation, and also by the fact that extremely expensive data-moving costs are avoided. We simply move the computation to the place where the data is stored. Of course, each leaf node concurrently operates on hundreds or even thousands of parallel requests.
Consider the example of searching for a movie on Netflix. As a user begins to type in the search box, this will generate several parallel events from the root web server. At a minimum, these events include the following requests:
- To the autocomplete engine, to actually predict the search being made based on past trends and the user's profile.
- To the correction engine, which finds errors in the typed query based on a constantly adapting language model.
- Individual search results for each of the component words of a multi-word query, which must be combined based on the rank and relevance of the movies.
- Additional post-processing and filtering of results to meet the user's "safe-search" preferences.
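This fan-out pattern is exactly why tail latency matters: the root must wait for the slowest leaf before it can respond. A back-of-envelope sketch with illustrative numbers (not from the source):

```python
def prob_request_is_slow(p: float, n: int) -> float:
    # A fanned-out request is slow if ANY of its n leaf calls is slow,
    # because the root waits for the slowest response.
    return 1 - (1 - p) ** n

# Suppose each leaf exceeds its 99th-percentile latency 1% of the time:
one_leaf = prob_request_is_slow(0.01, 1)          # 1% of requests hit the tail
hundred_leaves = prob_request_is_slow(0.01, 100)  # ≈ 0.63: most requests do
```

With a fan-out of 100, what is a rare 1-in-100 event at each leaf becomes the common case at the root, which is why tail latency dominates user-perceived performance in large distributed systems.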
To resolve the response-time variability that leads to this tail latency problem, we must understand the sources of performance variability.
- Use of shared resources: Many different VMs (and applications within those VMs) contend for a shared pool of compute resources. In rare cases, this contention can lead to high latency for some requests. For critical tasks, it may make sense to use dedicated instances and periodically run benchmarks while they are idle, to verify that they perform consistently.
- Background daemons and maintenance: We have already spoken about the need for background processes to create checkpoints, create backups, update logs, collect garbage, and handle resource cleanup. However, these can degrade the performance of the system while executing. To mitigate this, it is important to synchronize disruptions due to maintenance threads to minimize the impact on the flow of traffic. This will cause all variation to occur in a short, well-known window rather than randomly over the lifetime of the application.
- Queueing: Another common source of variability is the burstiness of traffic arrival patterns. This variability is exacerbated if the OS uses a scheduling algorithm other than FIFO. Linux systems often schedule threads out of order to optimize overall throughput and maximize utilization of the server. Studies have found that using FIFO scheduling in the OS reduces tail latency, at the cost of lowering the overall throughput of the system.
- All-to-all incast: The pattern shown in Figure 8 above is known as all-to-all communication. Since most network communication is over TCP, this leads to thousands of simultaneous requests and responses between the front-end web server and all the back-end processing nodes. This is an extremely bursty pattern of communication and often leads to a special kind of congestion failure known as TCP incast collapse. The intense, sudden response from thousands of servers leads to many dropped and retransmitted packets, eventually causing a network avalanche of traffic for packets of data that are very small. Large datacenters and cloud applications often need to use custom network drivers to dynamically adjust the TCP receive window and the retransmission timer. Routers may also be configured to drop traffic that exceeds a specific rate and to reduce the size of the sender's window.
- Power and temperature management: Finally, variability is a byproduct of other cost reduction techniques like using idle states or CPU frequency downscaling. A processor may often spend a non-trivial amount of time scaling up from an idle state. Turning off such cost optimizations leads to higher energy usage and costs, but lower variability. This is less of a problem in the public cloud, as pricing models rarely consider internal utilization metrics of the customer's resources.
Pricing models
Cloud providers generally charge for resources based on one of the following three types of parameters:
- Time-based: Resources are charged based on the amount of time they are provisioned to the user. For example, you pay a certain amount per hour/day/month/year to have a virtual machine running on an IaaS cloud. The granularity of the charging period varies from cloud provider to cloud provider. Amazon, for example, charges users per hour on a non-prorated basis.
- Capacity-based: Users are charged based on the amount of a particular resource that is utilized or consumed. This is a popular charging model for cloud storage systems. For example, users are charged a certain amount for storing a gigabyte on cloud object storage systems such as Azure Blob storage.
- Performance-based: In many cloud providers, users can select a higher performance level for resources by paying a higher rate. For virtual machines, larger, more powerful machines with more CPU, memory, and disk capacity can be provisioned at a higher hourly rate.
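The non-prorated, time-based charging described above can be sketched as follows (the hourly rate is a made-up illustration, not any provider's actual price):

```python
import math

def hourly_charge(seconds_used: float, rate_per_hour: float) -> float:
    # Non-prorated billing: any partial hour is rounded up to a full
    # hour before multiplying by the hourly rate.
    hours_billed = math.ceil(seconds_used / 3600)
    return hours_billed * rate_per_hour

charge = hourly_charge(3601, 0.10)  # 61 minutes billed as 2 full hours -> 0.2
```

Under this model, running a VM for one second past the hour boundary costs the same as running it for the entire next hour, which is why instance start and stop times matter for cost optimization.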