10 04 Time Clocks and the Ordering of Events in a Distributed System - lkuper/CSE232-2021-09 GitHub Wiki

Discussion for: Leslie Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" (CACM 1978)

Paper overview

The paper outlines a theoretical framework for understanding how events within a distributed system should be ordered. It starts by outlining the partial ordering of events which essentially says an event a happens before an event b. It denotes this using the happens before relation, a->b. The paper then characterizes logical clocks, which have a useful property called the Clock condition stating that if a -> b, the logical clock of a must be less than the logical clock of b. With these definitions in place, Lamport defines a type of logical clock satisfying this Clock condition, which we will call Lamport Clocks, and shows how these can be used to define a total ordering for events in a distributed system. Finally, Lamport covers various applications of these logical clocks, such as resolving mutual exclusion and implementing a distributed state machine.

Discussion questions

Q0 (warm-up question)

This paper has a section called "Ordering the Events Totally" that contains several distinct ideas. If you were to insert a couple more subsection headings to make the paper easier to digest, what would those headings be?

Discussion summary for Q0

Some of the subsection headings discussed in class were:

  • Creating a Total Ordering
  • Total vs Partial Ordering
  • Why do we need Ordering?
  • Clock Conditions
  • Distributed Algorithm (the main logic)
  • Resolving Mutual Exclusion with Logical Clocks
  • Mutual Exclusion
  • Using Ordering to solve the mutual exclusion problem
  • Implementing State Machines with Logical Clocks
  • Shortcomings of Logical Clocks

Q1

This paper defines the happens-before relation on events, which is the same definition of happens-before that we talked about in class. However, one can also define a happens-before relation on messages. There's a clue about this in the paper ("we identify a message with the event of sending it"), but not a complete definition. How could one define a happens-before relation on messages? (Hint: There should be three parts to the definition, just like for events.)

Discussion summary for Q1

The set of rules that the class agreed upon was:

  1. On a process P, if a message M1 is sent before a message M2 is sent, M1 -> M2.
  2. On a process P, if a message M1 is received before P sends M2, M1 -> M2.
  3. If M1 -> M2 and M2 -> M3, then M1 -> M3.

Q2

(Contributed by @Donnie-Stewart)

The first paper reading in this class by Waldo et al. covered several fundamental differences between local and distributed systems. Which of these differences were also highlighted in this paper and accounted for in Lamport's proposed algorithms (for logical clocks, mutual exclusion, etc.)? Which were left out? How do the problems that were left out affect how Lamport's proposals can be implemented in practice?

Discussion summary for Q2

We all agreed that Lamport did not address partial failure or shared memory in his paper, but did address concurrency. However, there was some debate about whether or not he addressed latency, with some arguing that it was implicitly addressed in what was a more theory-oriented paper while others thought he had not addressed it thoroughly enough. Some of us felt that these shortcomings definitely weakened the paper and concepts as a whole, as they leave pretty substantial failure points in his algorithms. Others thought that while he abstracted away certain issues, the issues he did address were solved very thoroughly. However, other techniques would need to be developed to address the remaining issues before these techniques were feasible in practice instead of in theory.

Q3

Decades after this paper was published, Lamport wrote:

So, I wrote this paper, which is about how to implement an arbitrary distributed state machine. As an illustration, I used the simplest example of a distributed system I could think of--a distributed mutual exclusion algorithm.

This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I've even had to go back and reread it to convince myself that I really did remember what I had written.

My experience matches Lamport's: the stuff about state machines is not the part of the paper that most people who have read it seem to remember! Why do you think it is that what Lamport said was the essential idea in this paper is not actually what most readers remember about it?

(You can read the rest of Lamport's commentary if you like, but it's not necessary to answer this question.)

Discussion summary for Q3

The consensus in our discussions was that this was primarily a structural issue with the paper. It's hard to expect a reader to think that a concept is central to the paper when it is not highlighted in the abstract or even broken out into its own section. We also thought that he was somewhat a victim of his own success, in that he had quite a few groundbreaking ideas in this paper. While the state machines were definitely an interesting concept, many of the other concepts were just as, if not more, interesting. As an author, you can't really control how people interpret your work, and this is definitely the case with this paper.

The video below might be of interest as well, in which Lamport himself says in one of his talks that some people thought that the main idea of the paper was to discuss partial ordering, and some people thought that the paper was about distributed mutual exclusion, but almost nobody thought that it was about state machines. Link to Lamport's talk: https://www.youtube.com/watch?v=nfRouGH0oMg

Errata

Running list of typos/issues found in the paper:

N/A

Other

Any interesting points or questions from the group discussion that didn't fit above:

N/A