10 07 Time Clocks and the Ordering of Events in a Distributed System - lkuper/CSE232-2024-09 GitHub Wiki

Discussion for: Leslie Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System" (CACM 1978)

Paper overview

    The paper addresses the challenge of ordering events in asynchronous, distributed systems with unsynchronized clocks. It introduces the "happens-before" relation to define a partial ordering of events and extends it using logical clocks to create a total ordering, helping to resolve synchronization issues. Furthermore, Lamport demonstrates how the total ordering can be used to facilitate the mutual exclusion problem by ensuring a consistent event sequence across processes. However, the solution has limitations, such as low fault tolerance in failure scenarios and reliance on ideal behavior from processes.

For reference, the "happens before" relation is defined as follows in the paper:

  1. If a and b are events in the same process, then a -> b
  2. If a is the sending of a message by once process and b is the receipt of the same message by another process, then a -> b
  3. If a -> b and b -> c, then a -> c Two distinct events a and b are said to be concurrent if neither a -> b nor b -> a.

Discussion questions

Q1

This paper has a section called "Ordering the Events Totally" that contains several distinct ideas. If you were to insert a few more section headings to break up this section into chunks and make the paper easier to digest, what would those headings be?

Discussion summary for Q1

  • The section called “Ordering the Events Totally” could be split into smaller sections as below:
    • Definition and the need for total ordering
    • Mutual exclusion example
      • include visualization to help readers understand
    • Algorithms, conditions and assumptions for total ordering (request queues for implementation)

Q2

(question contributed by Hamed Tangestanizadeh)

The paper introduces the concept of logical clocks to synchronize events in distributed systems, ensuring a consistent ordering of events based on causality. However, the solution does not account for system failures or message loss, which are common in real-world distributed systems. How do you think the use of logical clocks would fare in modern large-scale distributed systems like cloud infrastructure, where node failures, network delays, and message losses are inevitable? What extensions or additional mechanisms would you propose to address these challenges while maintaining consistent event ordering?

Discussion summary for Q2

Logical clocks, while useful for tracking events, often lack the precision needed to establish the exact ordering of events across large networks of nodes. As we transition to modern, large-scale distributed systems, factors like node failures, network delays, and message losses introduce substantial communication and processing overhead. Fault-tolerant protocols are essential to address these challenges and maintain system reliability.

When a node fails or dies, the system can suffer from the loss of critical information that the node was responsible for. This can result in incomplete data or missing chunks of computation that are needed for the system to function correctly. Recovering from such failures requires mechanisms to detect the missing data, redistribute tasks, or restore state, all of which increase the system's complexity. Without appropriate fault tolerance measures in place, the failure of a single node could cascade into larger system failures, disrupting the overall consistency and integrity of the entire system.

To address these issues, more reliable protocols, such TCP network protocols which provide affirmations, may be used. However, this is not a complete solution as acknowledgements may not always arrive and timeouts are not an accurate representation of system failure.


Q3

This paper defines the happens-before relation on events, which is the same definition of happens-before that we talked about in class. However, one can also define a happens-before relation on messages. There's a clue about this in the paper ("we identify a message with the event of sending it"), but not a complete definition. How could one define a happens-before relation on messages? (Hint: There should be three parts to the definition, just like for events.)

Discussion summary for Q3

To define a “happens-before” relation on messages, we start by identifying each message with the event that created it (aka. the send event). The “happens-before” relation between messages correlates to that of events:

  • If the event that sends message $m_{a}$ happens before the event that sends message $m_b$, then $m_a \rightarrow m_b$ if and only if the event $a \rightarrow b$.
  • If a process receives $m_a$, and sends $m_2$ then we can say $m_1$ happens before $m_2$.
  • Additionally, when a message is received, it counts as an event on the receiving process, which can influence future send events. This approach ensures consistency in defining causality across both messages and events.

Q4 (optional; only do this if you have time)

Decades after this paper was published, Lamport wrote:

So, I wrote this paper, which is about how to implement an arbitrary distributed state machine. As an illustration, I used the simplest example of a distributed system I could think of--a distributed mutual exclusion algorithm.

This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I've even had to go back and reread it to convince myself that I really did remember what I had written.

Lamport's own take on what is important and memorable about this paper is not the part of the paper that most readers seem to find important and memorable. Why do you think this is the case?

Discussion summary for Q4 (if we get to it)

[to be filled in by scribes, if we get to it]