10 11 Detecting Causal Relationships in Distributed Computations In Search of the Holy Grail - lkuper/CSE232-2024-09 GitHub Wiki

Discussion for: Reinhard Schwarz and Friedemann Mattern, "Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail" (Distributed Computing, 1994)

Paper overview

Today's discussion explores several critical concepts related to distributed systems, specifically message ordering, causality, and the view change problem. Through questions and examples, the discussion highlights scenarios where out-of-order message delivery might be acceptable, challenges posed by dynamic participation or error events in distributed systems (i.e., participants joining or leaving the system), and the importance of causality in recovery points for ensuring system correctness.

Out-of-Order Message Delivery: We examined scenarios where allowing out-of-order message delivery could be beneficial. This includes performance optimization in real-time systems (e.g., streaming) where delivering messages as they arrive improves responsiveness, and systems with eventual consistency, where the system can tolerate temporary inconsistencies as long as it reaches a consistent state eventually.

View Changes and Logical Clocks: We discussed how dynamic systems, where participants join or leave, pose challenges for logical clocks like vector clocks, which assume a fixed number of participants. Dynamic membership requires resizing vector clocks and managing historical information. Possible solutions include dynamic vector clocks and hierarchical clocks to adapt to changing system participants.

Causality in Recovery Points: Causality is critical in determining consistent recovery points for distributed systems. Without considering causality, recovery could capture inconsistent states, such as message receipts without corresponding sends. This could lead to data corruption, phantom messages, or inconsistent system states. Techniques like the Chandy-Lamport Snapshot Algorithm ensure snapshots respect causality, preventing these issues and ensuring a coherent recovery process.

[to be filled in by scribes]

Discussion questions

Q1

(question contributed by former CSE232 student Reese Levine)

In Figure 1 of the paper, events e_12 and e_13 are message sends, while e_21 and e_22 are message receives. Notice that from the perspective of P_2, the messages were received (and possibly delivered) in an opposite order than they were sent. Are there scenarios where we might want to allow messages to be delivered "out of order" like this?

Discussion summary for Q1

In some scenarios, allowing messages to be delivered out of order can be beneficial or acceptable. For example:

  • Applications with built-in ordering mechanisms: Some systems, such as multi-threaded downloading, use other protocols (like indexed file chunks) to maintain the correct order of data, making message order less critical.
  • Non-time sensitive events: In cases where the order doesn't affect the system's correctness, like voting systems where each vote is counted regardless of the order in which it arrives.
  • Performance-focused systems: Protocols like UDP prioritize speed over message reliability, accepting out-of-order delivery for real-time performance, such as in media streaming or distributed counting systems where ordering isn't essential.

Q2

(question contributed by Jonathan Castello)

The vector clock discussed in this paper tracks one integer for every participant. However, some systems allow participants to leave the system and new participants to join. (This is called, for some reason, the "view change" problem.) What obstacles does this pose for using logical clocks? Do you have any ideas for working around those obstacles?

Discussion summary for Q2

[to be filled in by scribes]

Q3

(question contributed by Neili Hu (with some rephrasing from Lindsey))

In section 1.2, the authors write that one application of causality is "determining consistent recovery points" of distributed executions. Why is causality important for determining recovery points? When recording a snapshot of a distributed execution that could be used for failure recovery, how might things go wrong if we don't take causality into account?

Discussion summary for Q3

Causality is critical for determining consistent recovery points in distributed systems because it ensures that the relationships between events, such as message sending and receiving, are respected. If causality is not considered, the following issues may arise:

  • Message Reception Without Send: A process might record a state where it has received a message, but the sending process hasn't recorded that it sent it. This breaks the logical order, leading to a corrupted state after recovery.

  • Phantom Messages (Lost or Duplicated Information): If snapshots don't capture the causal sequence, recovery might replay operations that weren’t supposed to happen or fail to replay critical ones. This can cause duplicate messages or missed actions, leading to data loss or inconsistencies.

  • Inconsistent Global State: Without considering causal dependencies, processes may record inconsistent local states. One process might act based on assumptions that aren't reflected in the snapshot, causing system-wide divergence upon recovery.

Causality ensures a coherent system state by guaranteeing that all interdependent actions are recorded in the proper order during snapshot creation for reliable recovery.