10 30 Perspectives on the CAP Theorem - lkuper/CSE232-2024-09 GitHub Wiki
Discussion for: Seth Gilbert and Nancy A. Lynch, "Perspectives on the CAP Theorem" (IEEE Computer, 2012)
Paper overview
In any application domain, we want the strongest possible benefits with the weakest possible costs. However, some desirable sets of benefits are "too good to be true": they're so good that no cost would be sufficient. This is just as true for distributed systems, where we want (C) correct behavior and (A) timely behavior, even (P) in the presence of external bad behavior. Characterizing the boundary between what we can have and what we can't is an essential task for understanding what kinds of systems are feasible.
It turns out that even a relatively well-behaved (but imperfect) environment can force sharp costs on a distributed system design. In this paper, the authors show how, in a network that may experience partitions -- which is to say, messages may be delayed arbitrarily or lost entirely -- it is impossible for a collection of machines to both implement an atomic (linearizable) shared object (i.e., behave as though the service ran on a single machine) and eventually respond to every request.
Since most realistic networks may suffer partitions, this discovery -- known as the CAP theorem -- has led to an increased interest in weaker system models and availability properties that can coexist under partitions. For instance, content-delivery networks (CDNs) and conflict-free replicated data types (CRDTs) typically do not need to export such a stringent consistency model to ensure correct client behavior. On the other hand, clients of services like Chubby (a distributed solution to the mutual exclusion problem) absolutely require the strongest guarantees, so they sacrifice availability and assume (or are deployed in settings where) partitions occur rarely.
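To make the CRDT idea concrete, here is a minimal sketch (our own illustration, not from the paper) of a grow-only counter (G-Counter), one of the simplest CRDTs: each replica increments only its own slot, and replicas merge by taking element-wise maxima, so state converges without coordination even after a partition.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        # A replica only ever bumps its own entry.
        self.counts[self.replica_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so merges can happen in any order, any number of times.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas diverge during a partition...
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()   # a locally sees 2
b.increment()                  # b locally sees 1
# ...and converge once they can exchange state again.
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```

Because merging never needs agreement among replicas, a G-Counter stays available during a partition; the price is that it only supports operations (like increment) whose effects commute.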
Discussion questions
Q1
(question contributed by William Self)
The authors discuss the tradeoffs between safety and liveness in various systems, noting that different types of systems have different requirements and that the tradeoffs depend on the type or segment of the system. I want to know how this tradeoff plays out in different real-world domains. First, can you think of some industry domains where we prioritize safety over liveness, and liveness over safety? Second, how might this tradeoff play out differently in domains like healthcare and financial systems, where both safety and liveness are critical? Are there any modern or industry-specific approaches to tackling that tradeoff for such systems, or must we still make tough compromises based on the CAP theorem?
Lindsey adds: In section 4 of the paper, Gilbert and Lynch note, "Different operations may require different levels of consistency and availability." Suppose that you are implementing a social networking app for photo sharing. What are some operations for which you would want to prioritize consistency? What are some operations for which you would want to prioritize availability? (If you can't think of any, the PNUTS paper, which Gilbert and Lynch cite, might have some ideas for you.)
Discussion summary for Q1
Some industry domains that prioritize safety over liveness could include but are not limited to:
- healthcare, planes, nuclear power plants, banking, and self-driving cars--essentially any domain where human life is at stake
Some industry domains that prioritize liveness over safety could include but are not limited to:
- social media, video streaming, and gaming, where responsiveness and user experience are key
In healthcare and financial systems, balancing safety and liveness is more complex because both are crucial. Healthcare prioritizes safety to protect human life, while financial systems emphasize safety to prevent losses. Modern approaches, such as continuous consistency models (e.g., TACT), allow systems to adjust which properties to prioritize based on the operation's context.
For a photo-sharing social networking app, availability is often prioritized for operations like accessing photos, likes, and comments to ensure a seamless user experience. However, consistency would be prioritized for operations like photo-sharing restrictions (e.g., after blocking a user).
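The per-operation choice above can be sketched as a policy table. This is a hypothetical design (all names invented, not any real service's API), in the spirit of Gilbert and Lynch's observation that different operations may require different levels of consistency and availability:

```python
from enum import Enum

class Level(Enum):
    EVENTUAL = "eventual"   # fast, always answers, may return stale data
    STRONG = "strong"       # linearizable, may block or fail under partition

# Invented policy table for a hypothetical photo-sharing app:
# low-stakes reads stay available; access-control changes stay consistent.
POLICY = {
    "view_photo": Level.EVENTUAL,    # stale likes/comments are tolerable
    "post_comment": Level.EVENTUAL,  # eventual convergence is fine
    "block_user": Level.STRONG,      # must take effect before further reads
    "set_album_private": Level.STRONG,
}

def required_level(op):
    # Default to the stronger guarantee for operations we haven't classified.
    return POLICY.get(op, Level.STRONG)
```

The design choice here is that the safe default is strong consistency: an unclassified operation pays the availability cost rather than risking a safety violation like showing a private album to a blocked user.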
Q2
This paper frames the tradeoff between consistency and availability in a distributed system as a classic safety/liveness tradeoff. A few weeks ago, in Waldo et al.'s A Note on Distributed Computing, we read about "soft mounts" and "hard mounts" in the context of NFS:
Soft mounts expose network or server failure to the client program. Read and write operations return a failure status much more often than in the single-system case, and programs written with no allowance for these failures can easily corrupt the files used by the program. In the early days of NFS, system administrators tried to tune various parameters (time-out length, number of retries) to avoid these problems. These efforts failed. Today, soft mounts are seldom used, and when they are used, their use is generally restricted to read-only file systems or special applications.
Hard mounts mean that the application hangs until the server comes back up. This generally prevents a client program from seeing partial failure, but it leads to a malady familiar to users of workstation networks: one server crashes, and many workstations—even those apparently having nothing to do with that server—freeze. Figuring out the chain of causality is very difficult, and even when the cause of the failure can be determined, the individual user can rarely do anything about it but wait. This kind of brittleness can be reduced only with strong policies and network administration aimed at reducing interdependencies. Nonetheless, hard mounts are now almost universal.
Imagine that you need to choose between using soft mounts and hard mounts in a system you are designing. Compare this tradeoff to the consistency/availability tradeoff. Which choice seems to prioritize safety, and which choice seems to prioritize liveness? Why do soft mounts become a more viable choice when data is read-only?
Discussion summary for Q2
Hard mounts can be viewed as prioritizing consistency/safety, while soft mounts can be viewed as prioritizing availability/liveness. Hard mounts ensure the correctness of the data being read or written, but at the cost of potentially blocking the client for an indefinite amount of time if there is a failure. Soft mounts allow the client to continue operating even if there is a failure, but at the cost of potentially returning incorrect or stale data. Soft mounts are more viable when data is read-only because the client can continue to operate without the risk of corrupting the data, even if the data is stale.
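The soft/hard-mount distinction can be sketched as two retry policies (our own illustration, not NFS code; `read_remote` stands in for any network read that may raise on failure):

```python
import time

class ServerDown(Exception):
    """Stand-in for a network or server failure."""

def soft_read(read_remote, retries=3, backoff=0.1):
    # Soft mount: try a bounded number of times, then surface the failure
    # to the client -- the liveness-leaning choice.
    for _ in range(retries):
        try:
            return read_remote()
        except ServerDown:
            time.sleep(backoff)
    raise IOError("soft mount: server unavailable, returning failure to client")

def hard_read(read_remote, backoff=0.1):
    # Hard mount: retry forever, blocking the client until the server
    # answers -- the safety-leaning choice.
    while True:
        try:
            return read_remote()
        except ServerDown:
            time.sleep(backoff)
```

With `soft_read`, an application that doesn't handle the `IOError` can corrupt its own files on a writable filesystem, which is exactly why soft mounts survive mainly in read-only settings; with `hard_read`, the client never sees partial failure but may hang indefinitely.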
Q3
(question contributed by former CSE232 student Yunqian Cheng)
The discussion around CAP in this paper has centered on consistency and availability; however, why is there so little discussion about partition tolerance? Is it considered a given that partitions will always, eventually, happen in a system whose parts communicate with one another via messages? What would a partition-tolerant system look like?
Discussion summary for Q3
When we talk about "partition tolerance", what we're ultimately assuming is the asynchronous model of distributed systems, in which messages can be arbitrarily delayed or lost entirely. A "partition" is just a way of explaining why messages may get dropped or delayed, but the specific reason doesn't matter as much as the symptoms it causes. Since most distributed systems are designed for a practical setting in which messages can be dropped or delayed, those systems must necessarily already tolerate "partitions" to some extent.
A system that doesn't experience partitions, and thus doesn't need to "tolerate" them, must fall into a small handful of cases. Either (1) the deployment environment does provide some assurances about message delivery, in which case it isn't a totally-asynchronous setting; (2) coordination amongst nodes isn't necessary for their successful implementation of a service, in which case what we really have is a collection of independent services; or (3) we only have one node to begin with.
If a system assumes that network partitions never occur, but is deployed in an environment where they may in fact occur, then a variety of both liveness and safety failures might occur. One common failure is a "split brain" situation, in which both sides of the partition believe they are in a consistent state, but when the partition heals they discover that they are in mutually inconsistent states. Of course, a possible liveness failure is that the system seizes up on some blocking network communication. In other words, such a system might get neither consistency nor availability! This is clearly the worst of all worlds.
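The split-brain failure described above can be shown with a toy last-writer-wins register (our own sketch): during a partition each side accepts writes locally and believes its own state is current, and the conflict only becomes visible when the sides compare state after the partition heals.

```python
class Replica:
    """Naive replicated register with no coordination on writes."""

    def __init__(self):
        self.value = None

    def write(self, value):
        # During a partition, each side happily accepts writes locally,
        # with no way to know the other side is doing the same.
        self.value = value

r1, r2 = Replica(), Replica()
# Partition: clients on each side of the split write different values.
r1.write("alice's update")
r2.write("bob's update")
# When the partition heals, the replicas discover they disagree.
assert r1.value != r2.value  # mutually inconsistent: split brain
```

A system that actually tolerates partitions has to decide in advance what happens at this reconciliation point -- e.g., refuse writes on the minority side (sacrificing availability) or define a merge rule like a CRDT's (sacrificing strong consistency).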