11-19: Paxos Made Live: An Engineering Perspective

Discussion for: Tushar Chandra, Robert Griesemer, and Joshua Redstone, “Paxos Made Live: An Engineering Perspective” (invited talk, PODC 2007)

Paper overview

This paper was written by distributed systems engineers at Google who share their experience of taking the well-established Paxos algorithm, implementing it, and integrating it into Chubby, their fault-tolerant lock service and file system (similar to Apache ZooKeeper and etcd). The authors describe the Chubby implementation in depth, focusing on the parts of the Paxos algorithm that needed refining for use in a real-world application. In particular, they had to add mechanisms on top of Paxos that the original algorithm specification does not cover. The paper concludes with the authors' thoughts on the gap between distributed algorithm specifications and implementations, and a call for better distributed systems tooling to make such systems easier to develop.

Discussion questions

Q1

In section 4, the authors use slightly different terminology from "Paxos Made Simple" to overview the Paxos algorithm. How can we map these concepts to the terminology in "Paxos Made Simple"?

Discussion summary for Q1

The following table presents terms used in "Paxos Made Live" that have direct mappings to terms used in "Paxos Made Simple." The former paper will be referred to as the implementation, and the latter paper will be referred to as the specification.

| Paxos Made Live | Paxos Made Simple |
| --- | --- |
| Coordinator/Master | Proposer |
| Acknowledge/Commit (replicas) | Accepted (acceptors/learners) |
| Sequence number | Proposal number |

There are other terms used in the implementation that do not appear in the specification, such as master leases, epochs, and, by extension, epoch numbers. These terms exist because implementing a specification typically requires refining it: the implementation must address things the specification leaves out, either because the specification silently assumes them or because they are never mentioned at all, yet they are needed for the implementation to work correctly.
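
To make the terminology mapping concrete, here is a minimal single-decree Paxos sketch in Python. It is our own illustration, not code from either paper or from Chubby, and the class and function names are ours. The `seq` parameter plays the role of the implementation's "sequence number" and the specification's "proposal number," and a successful `accept` corresponds to the implementation's "acknowledge."

```python
# Minimal single-decree Paxos sketch (illustration only, not the Chubby code).
# "seq" below is what "Paxos Made Live" calls a sequence number and
# "Paxos Made Simple" calls a proposal number.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest seq we have promised not to go below
        self.accepted = None        # (seq, value) of the last accepted proposal

    def prepare(self, seq):
        """Phase 1b: promise to ignore proposals numbered below seq."""
        if seq > self.promised:
            self.promised = seq
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, seq, value):
        """Phase 2b: accept unless we promised a higher seq in the meantime."""
        if seq >= self.promised:
            self.promised = seq
            self.accepted = (seq, value)
            return "accepted"       # the implementation paper's "acknowledge"
        return "reject"


def propose(acceptors, seq, value):
    """Coordinator/proposer: run both phases against a majority."""
    promises = [a.prepare(seq) for a in acceptors]
    if sum(1 for kind, _ in promises if kind == "promise") <= len(acceptors) // 2:
        return None                 # no majority of promises; retry with a higher seq
    # If some acceptor already accepted a value, we must propose that value.
    prior = [acc for kind, acc in promises if kind == "promise" and acc]
    if prior:
        value = max(prior)[1]
    acks = [a.accept(seq, value) for a in acceptors]
    if sum(1 for r in acks if r == "accepted") > len(acceptors) // 2:
        return value                # value is chosen (committed)
    return None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, seq=1, value="lock-holder=A"))   # -> lock-holder=A
```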

Note on the term master: The paper uses the term "master" in three places:

  1. Section 2: Chubby itself has a master. This is the node that handles read/write requests to Chubby.
  2. Section 4.2: The coordinator of the multi-Paxos algorithm, which sends a propose message for multiple Paxos rounds, is called a master.
  3. Section 5.2: Regarding master leases, where the multi-Paxos master remains the master for longer, so that a) other Paxos nodes cannot successfully propose with higher sequence numbers while the lease holds, and as a result b) the master is guaranteed to have up-to-date data and can serve read requests locally.

Given the layered design of the system, the latter two masters above belong to the fault-tolerant log layer. The middle layer, the fault-tolerant DB, is simply a client of that layer: it uses the log like a write-ahead log and maintains the current state of the DB in its own data structures. The first "master," on the other hand, refers to Chubby itself. While the authors note that the layered architecture makes their code easier to reason about and more reusable, it is not hard to imagine that they added extra interfaces so that the Chubby master runs on the same instance as the log master, reducing bandwidth usage between the two. Another path they could have taken is making the Chubby layer stateless once the reliable Paxos layer was in place, so that the first master (the Chubby master) is no longer required; this would require allowing multiple nodes to talk to the Paxos master at the same time.
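
A rough sketch of that layering is below, under the (assumed) simplification that the log exposes an append/replay interface; the names are our own, not Chubby's actual interfaces. The point of the split is that consensus would live entirely inside `append`, while the DB layer only has to trust that committed entries come back in the same order at every replica.

```python
# Sketch of the layered split described in the paper (illustrative only).
# Bottom: a replicated, fault-tolerant log (Paxos would live inside `append`).
# Middle: a key-value DB that uses the log as a write-ahead log.
# Top: Chubby, a client of the DB (not shown here).

class FaultTolerantLog:
    """Stand-in for the Paxos-replicated log; a real one replicates `append`."""
    def __init__(self):
        self.entries = []

    def append(self, entry):
        # In the real system this is a Paxos instance: the entry only counts
        # as committed once a majority of replicas accept it.
        self.entries.append(entry)
        return len(self.entries) - 1        # log index of the committed entry

    def replay(self):
        return list(self.entries)


class FaultTolerantDB:
    """Middle layer: current state lives here; the log is the source of truth."""
    def __init__(self, log):
        self.log = log
        self.state = {}
        for op, key, value in self.log.replay():   # recover state after a restart
            self._apply(op, key, value)

    def _apply(self, op, key, value):
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

    def put(self, key, value):
        self.log.append(("put", key, value))       # write-ahead, then apply
        self._apply("put", key, value)

    def get(self, key):
        return self.state.get(key)


log = FaultTolerantLog()
db = FaultTolerantDB(log)
db.put("/ls/cell/lockA", "owned-by-client-42")
print(db.get("/ls/cell/lockA"))                     # -> owned-by-client-42
```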

Q2

Section 6.2 of the paper describes a "database consistency check" and various "inconsistency incidents" that have occurred. Since Paxos supposedly provides strongly consistent replication, what's going on here?

Discussion summary for Q2

An explanation for this is simply that Paxos guarantees strong consistency only under the crash fault model. The faults listed in section 6.2 of "Paxos Made Live" are all better categorized as Byzantine faults: operator error, hardware corruption, and illegal memory accesses. Paxos cannot guarantee strong consistency in the Byzantine fault model.

However, it is worth noting that "consistency" as used in "Paxos Made Live" is not the same notion as in distributed data consistency models, which concern ordering guarantees such as causality. The paper's use of the term aligns more closely with the "C" in ACID, the set of database transaction properties.
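
As a concrete (and hedged) illustration of the kind of check section 6.2 describes, the sketch below compares a checksum of the master's database snapshot against each replica's and reports any mismatch; the function and data-structure names are ours, not Google's.

```python
# Hedged sketch of a database consistency check in the spirit of section 6.2:
# the master computes a checksum of its database and the replicas compare it
# against their own. Names and structure are ours, not Google's.
import hashlib

def db_checksum(state):
    """Deterministic checksum over a key-value snapshot."""
    h = hashlib.sha256()
    for key in sorted(state):
        h.update(repr((key, state[key])).encode())
    return h.hexdigest()

def consistency_check(master_state, replica_states):
    """Return the indices of replicas whose checksum disagrees with the master's."""
    expected = db_checksum(master_state)
    return [i for i, state in enumerate(replica_states)
            if db_checksum(state) != expected]

master = {"/ls/cell/lockA": "client-42"}
replicas = [dict(master), {"/ls/cell/lockA": "client-7"}]   # replica 1 is corrupted
print(consistency_check(master, replicas))                   # -> [1]
```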

Q3

(Contributed by @rahula00)

The developers of Chubby (which uses Paxos) decided to build their own compiler that translates state machine specifications to C++ for the purpose of allowing “a full algorithm [to be] rendered on a single screen” and “log state transitions and measure mode coverage to assist in debugging and testing.” Are these benefits enough to outweigh the extra complexity of having to develop a custom compiler vs. writing code directly in C++? What other drawbacks does this approach have?

Discussion summary for Q3

The pros and cons of designing a custom compiler, i.e. taking a domain-specific language (DSL) approach, are as follows:

Pros
  • Concise code.
  • Code is easier to understand at a high level.
  • Once established, can be used for other algorithms that fit the DSL.
Cons
  • Increases the size of the trusted computing base (TCB) by introducing a new, possibly unverified, compiler.
  • DSL error messages are often not great.
  • Upfront cost of designing and implementing compiler.
  • Need to learn how to use the DSL.

The main con is that a DSL approach to implementing state machines requires much more effort up front: the compiler needs to be built and validated for correctness. Having this additional compiler also increases the size of the trusted computing base (TCB), the set of hardware and software components that must be trusted for an application to work properly. The pros, in some sense, outweigh the cons: once established, the DSL compiler can be treated as a general-purpose tool for other algorithms that fit the DSL, and code written in the DSL is 1) more concise (since the language is designed to express exactly this class of problems) and 2) easier to understand at a high level.
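
The sketch below is a hypothetical illustration (not the paper's DSL or its generated C++) of what the claimed benefits look like in miniature: a declarative transition table, a driver that logs every transition, and a coverage measure over the declared transitions.

```python
# Hypothetical sketch (not the paper's actual DSL): a table-driven state
# machine whose driver logs transitions and measures how many of the
# declared transitions have been exercised.

# Declarative "spec": (current_state, event) -> next_state
TRANSITIONS = {
    ("idle",      "start_election"): "candidate",
    ("candidate", "won_election"):   "master",
    ("candidate", "lost_election"):  "idle",
    ("master",    "lease_expired"):  "idle",
}

class StateMachine:
    def __init__(self, start, transitions):
        self.state = start
        self.transitions = transitions
        self.seen = set()                      # which transitions have fired

    def handle(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"no transition for {key}")
        nxt = self.transitions[key]
        print(f"transition: {self.state} --{event}--> {nxt}")   # transition log
        self.seen.add(key)
        self.state = nxt

    def coverage(self):
        return len(self.seen) / len(self.transitions)

sm = StateMachine("idle", TRANSITIONS)
sm.handle("start_election")
sm.handle("won_election")
print(f"coverage: {sm.coverage():.0%}")        # -> 50% of transitions exercised
```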

Q4

(Contributed by @nliittsc)

In the closing paragraphs of the paper, the Google team points out that the Fault-Tolerance community has not bridged the gap between the theory and practice of building fault-tolerant distributed systems, due to a lack of tooling. They contrast this with the Compiler community, which has created a vast amount of tooling to help with the implementation of compilers, and distilled its theory to a level of being taught to undergraduates. Based on your experience reading the papers in this course so far, why do you think this gap has been difficult to bridge? In particular, consider the parsing tool yacc. What would be an analogous tool for a concept in distributed systems?

Discussion summary for Q4

A big reason why distributed systems tooling is not as fleshed out as compiler tooling is simply the difficulty of reasoning in the two fields. Lexing, parsing, and generating intermediate representations (to name a few duties of a compiler) are concrete, well-defined tasks that apply generally across many languages. Distributed systems, on the other hand, tend to be more free-form, making it harder to build generic tools that accommodate a variety of them. Such tools do exist, though; an obvious example is the DSL compiler mentioned in "Paxos Made Live." In fact, this approach of creating a state-machine compiler could probably be used with several of the distributed algorithms we've read so far, as many of them are modeled as state machines. Another tool is Lamport's TLA+, a language designed to describe and model distributed systems.
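
As a toy illustration of what such tooling might look like, loosely in the spirit of TLA+'s model checker but vastly simplified and entirely our own construction, the sketch below exhaustively explores the reachable states of a tiny two-node leader-election protocol and checks an "at most one master" invariant in every reachable configuration.

```python
# Toy explicit-state exploration in the spirit of a model checker (vastly
# simplified, entirely our own construction). Two nodes each try to become
# master; the invariant is "at most one master at a time".

def next_states(config):
    """All configurations reachable in one step from `config`."""
    succs = []
    for i, s in enumerate(config):
        if s == "idle":
            succs.append(config[:i] + ("candidate",) + config[i+1:])
        elif s == "candidate" and "master" not in config:
            succs.append(config[:i] + ("master",) + config[i+1:])
    return succs

def check(initial, invariant):
    """Exhaustive search; return a violating configuration, or None if safe."""
    frontier, seen = [initial], {initial}
    while frontier:
        config = frontier.pop()
        if not invariant(config):
            return config
        for nxt in next_states(config):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None

def at_most_one_master(config):
    return config.count("master") <= 1

print(check(("idle", "idle"), at_most_one_master))   # -> None (invariant holds)
```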

Errata

Running list of typos / issues found in the paper:

  • n/a

Other

Any interesting points or questions from the group discussion that didn't fit above:

  • n/a