11/29: Dynamo: Amazon's Highly Available Key-value Store

Discussion for: Giuseppe DeCandia et al., "Dynamo: Amazon’s Highly Available Key-value Store" (SOSP 2007)

Paper overview

(to be filled in by scribes)

Discussion questions

Q1

(Contributed by @pan-jessica)

The previous paper we read, "Paxos Made Live," discusses Chubby, Google's fault-tolerant lock service. How does Chubby compare to Dynamo? What are the differences between the two in terms of implementation and design?

Discussion summary for Q1

(to be filled in by scribes)

Dynamo prioritizes availability, whereas Chubby prioritizes consistency. Dynamo's choice to accept all writes and only surface conflicts to clients at read time shows that it sacrifices C(onsistency) from CAP in favor of A(vailability). Another difference is that Chubby has a strict master/replica hierarchy, while Dynamo has no hierarchy among its nodes. This matters because Dynamo is designed to run on a large and varying number of nodes: it arranges nodes on a ring using consistent hashing for better scalability, while Chubby uses only a fixed, statically configured set of nodes and does not allow additions or deletions. Finally, Chubby relies on the Paxos algorithm to reach consensus, while Dynamo leaves it to the client application to decide on the final value, providing only some basic conflict resolution mechanisms like "last write wins."
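To make the contrast in placement schemes concrete, here is a minimal sketch of consistent hashing on a ring. The `ConsistentHashRing` class, the node names, and the hash choice are our own illustrations, not from the paper, and this omits Dynamo's virtual nodes and preference lists.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a position on a 2^32-slot ring (hash choice is illustrative)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    """Toy consistent-hash ring: each key is owned by the first node clockwise from it."""
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def coordinator(self, key: str) -> str:
        """Walk clockwise from the key's position to the first node on the ring."""
        idx = bisect.bisect_right(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.coordinator("shopping-cart:42"))
# Adding a node only remaps the keys on the arc it takes over,
# unlike a fixed, statically partitioned layout.
```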

Q2

(Contributed by @nliittsc)

Dynamo uses what is called a gossip-style method for failure detection. Roughly, for failure detection, if a process P sees that a process Q is not responding to its requests, P can attempt to gossip (broadcast) that Q is down to the rest of the nodes in the network. In this way, failure detection is essentially decentralized. What do you think are the main advantages and disadvantages of a decentralized failure detection model, as opposed to having a centralized observer node which serves as the failure detector?

Discussion summary for Q2

(to be filled in by scribes)

An advantage of a decentralized failure detection model is that every node can gain some perspective on which nodes are currently available, and the ability to have some view of other nodes' availability is not hampered by the scale of the cluster or by the outage of any particular node. However, a disadvantage of this approach is that each node develops its own, separate perspective of other nodes' availability. Even assuming that nodes eventually converge on the same perspective, there are still significant periods of time during which the nodes do not agree about which nodes are available. In the case of a network partition, the nodes on one side of the partition would conclude that the nodes on the other side are down, and nodes on both sides would be wrong. The decentralized failure model prioritizes liveness: nodes will eventually learn which nodes are up or down. There may be some disagreement or uncertainty along the way, but the idea is that the information eventually reaches everyone.
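To illustrate this eventually-convergent flavor, here is a toy sketch of gossip-style dissemination of failure observations. The `Node` class, the version-per-peer merge rule, and the node names are assumptions made for illustration; this is not Dynamo's actual protocol.

```python
import random

class Node:
    """Toy gossip participant that keeps a locally observed status for every peer."""
    def __init__(self, name, peers):
        self.name = name
        # (status, version) per peer; the entry with the higher version wins on merge.
        self.view = {p: ("up", 0) for p in peers}

    def observe_failure(self, peer):
        """Local detection: this peer stopped responding to our requests."""
        _, version = self.view[peer]
        self.view[peer] = ("down", version + 1)

    def gossip_with(self, other):
        """Pairwise exchange: each side keeps the freshest entry for every peer."""
        for peer in self.view:
            if other.view[peer][1] > self.view[peer][1]:
                self.view[peer] = other.view[peer]
            elif self.view[peer][1] > other.view[peer][1]:
                other.view[peer] = self.view[peer]

names = ["a", "b", "c", "d"]
nodes = {n: Node(n, names) for n in names}
nodes["a"].observe_failure("d")        # only "a" has noticed that "d" looks down...
for _ in range(5):                     # ...and a few random gossip rounds spread
    x, y = random.sample(names, 2)     # that observation, so views converge only
    nodes[x].gossip_with(nodes[y])     # eventually (and may disagree in between).
```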

An advantage of a centralized failure detection model is that most nodes will agree about which other nodes are up or down, because most nodes will have an up-to-date view provided by the centralized detector. Disadvantages of this approach include: the detector is a single point of failure; in a large cluster the detector might not fail outright but could become slow and inaccurate; and in the case of a network partition the detector will necessarily end up on one side of the partition. On the detector's side of the partition, nodes will see the other side as down. On the other side, it is less clear what nodes will see; perhaps they just use their stale view, or perhaps they fall back to a decentralized approach. In centralized failure detection, everyone who has information has the same "good" information, so centralized detection prioritizes a safety property.

Q3

What exactly does the paper mean by "at Amazon, SLAs are expressed and measured at the 99.9th percentile of the distribution"? What is an SLA, and what distribution are the authors talking about? What would the alternative be?

Discussion summary for Q3

(to be filled in by scribes)

An SLA, or Service Level Agreement, is a contract specifying what a service promises to deliver to its clients. The distribution the authors are talking about is the distribution of response times over client requests, and measuring the SLA at the 99.9th percentile means that 99.9% of requests must fall within the SLA's bound. So if the SLA is "response time of less than 50 ms," then 99.9% of requests must receive a response in less than 50 ms.

An alternative might be to measure the SLA in terms of the average or the maximum. However, an SLA of "average response time of 50 ms" could admit a very noisy distribution of response times (many requests taking 1 ms, and some taking seconds). Likewise, an SLA of "maximum response time of 50 ms" is too inflexible and would frequently be broken by unpredictable network spikes, occasional OS swapping, and so on.

Measuring the average separately isn't really necessary anyway: if 99.9% of requests meet the time threshold, the average time per request is likely also very low.

A 99.9th-percentile SLA captures the metric that services and clients tend to really care about, i.e., how fast the slowest response times are. It provides an assurance that all but the slowest 0.1% of queries are returned within the specified latency.
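As a concrete illustration of why the tail matters more than the average, here is a small sketch using synthetic latencies (the numbers are made up, not Amazon data):

```python
import random

random.seed(0)
# Synthetic latencies: most requests are fast, a rare few are very slow.
latencies_ms = [random.gauss(10, 2) for _ in range(100_000)]      # typical requests
latencies_ms += [random.uniform(500, 2000) for _ in range(300)]   # rare slow outliers

latencies_ms.sort()
mean = sum(latencies_ms) / len(latencies_ms)
p999 = latencies_ms[int(0.999 * len(latencies_ms))]

print(f"mean  = {mean:6.1f} ms")   # looks healthy despite the slow tail
print(f"p99.9 = {p999:6.1f} ms")   # exposes the tail that the average hides
```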

Q4

Section 2.3 of the paper states, "Many traditional data stores execute conflict resolution during writes and keep the read complexity simple [7]. In such systems, writes may be rejected if the data store cannot reach all (or a majority of) the replicas at a given time. On the other hand, Dynamo targets the design space of an “always writeable” data store (i.e., a data store that is highly available for writes)... this requirement forces us to push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected." What type of system would benefit from being "always readable", with conflict resolution on writes rather than reads? Name a few examples.

Discussion summary for Q4

(to be filled in by scribes)

Version control systems like Git require a file to have a single source of truth, so Git greatly benefits from being "always readable." To make it "always readable," conflict resolution happens on writes (e.g., merge conflicts). This type of system benefits from read-time simplicity because, for popular projects, reads are the more common and important operation: any time someone wants to grab a file, they don't want to check the files they're grabbing for conflicts; they just want to read them.
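To make the write-path/read-path trade-off from the quoted passage concrete, here is a toy sketch contrasting a store that resolves conflicts on writes (rejecting a write unless a majority of replicas are reachable) with a Dynamo-style always-writeable store that surfaces conflicting siblings at read time. The class names and the in-memory replica model are made up for illustration.

```python
class MajorityWriteStore:
    """Conflict resolution on writes: a write is rejected unless a majority of
    replicas acknowledge it, so reads stay simple (a single agreed-upon value)."""
    def __init__(self, replica_status):
        self.replica_status = replica_status   # replica name -> reachable? (bool)
        self.data = {}

    def put(self, key, value):
        acks = sum(1 for up in self.replica_status.values() if up)
        if acks <= len(self.replica_status) // 2:
            raise RuntimeError("write rejected: majority of replicas unreachable")
        self.data[key] = value

class AlwaysWriteableStore:
    """Dynamo-style: never reject a write; reads may return conflicting siblings
    for the client to reconcile. (A real system would use vector clocks to prune
    causally dominated versions; here every write just becomes a sibling.)"""
    def __init__(self):
        self.siblings = {}                     # key -> list of divergent values

    def put(self, key, value):
        self.siblings.setdefault(key, []).append(value)

    def get(self, key):
        return self.siblings.get(key, [])      # client reconciles if len > 1
```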

Errata

Running list of typos / issues found in the paper:

  • (to be filled in by scribes, if any)

"by examine their vector clocks" -> "by examining their vector clocks" in Section 4.4

Other

Any interesting points or questions from the group discussion that didn't fit above:

  • (to be filled in by scribes, if any)