10 08 Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail
Discussion for: Reinhard Schwarz and Friedemann Mattern, "Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail" (Distributed Computing, 1994)
Paper overview
This paper discusses the significance of causality in distributed systems, how causality relates to causal history, and how vector time can be used to characterize causality. In particular, it shows that we can determine causality by assigning a causal history to each event (Definition 2.1, Lemma 2.2), that a causal history can be summarized as a vector (Observation 2.3), and it then proposes an algorithm for tracking causality based on that summary (Definition 2.5). Finally, it discusses Lamport time, noting that vector clocks characterize causality, whereas Lamport clocks are merely consistent with it.
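To make the vector-summary idea concrete, here is a minimal, hypothetical Python sketch (not taken from the paper; the event names are made up) illustrating Observation 2.3: an event's causal history is a set of events, and keeping only the highest local index seen from each process summarizes that set as a vector timestamp.

```python
# Hypothetical sketch of Observation 2.3: a causal history (a set of events)
# can be summarized by the latest event seen from each process -- a vector.
# An event is written as (process, local_index), e.g. ("P1", 2).

def vector_summary(causal_history, processes):
    """Return, for each process, the largest local index present in the history."""
    return {p: max((i for (q, i) in causal_history if q == p), default=0)
            for p in processes}

# Example: the causal history of some event on P2 that has seen
# the first two events of P1 and its own first three events.
history = {("P1", 1), ("P1", 2), ("P2", 1), ("P2", 2), ("P2", 3)}
print(vector_summary(history, ["P1", "P2", "P3"]))
# -> {'P1': 2, 'P2': 3, 'P3': 0}
```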
Discussion questions
Q1
(Contributed by @reeselevine)
In Figure 1 of the paper, events e_12 and e_13 are message sends, while e_21 and e_22 are message receives. Notice that from the perspective of P_2, the messages were received (and possibly delivered) in the opposite order from the one in which they were sent. While in class we talked about how to implement FIFO delivery and causal delivery, are there scenarios where we might want to allow messages to be delivered out of order like this?
Discussion summary for Q1
In a very general sense, any message that modifies information independently of other messages (for example, if one message modifies X and the other messages do not use X) can be delivered out of order, as long as it is delivered before that information is later used.
A couple of scenarios where order might not matter:
- A set of associative operations, like (a + b) + c = a + (b + c). If message 'a' is delayed while 'b' and 'c' have already arrived, one could compute (b + c) first instead of waiting for 'a'. Note, though, that even when (a + b) + c = a + (b + c), the intermediate states observed before the final result a + b + c is reached will still differ depending on the order in which the messages arrive. (See the sketch after this list.)
- A scenario where the client-database interactions do not overlap, e.g., the client is updating unrelated keys or stores, or the end result is a set (like adding items to a virtual shopping cart), where order does not matter.
- A scenario that consists only of queries, i.e., read-only events.
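As a concrete illustration of the first bullet, here is a minimal, hypothetical Python sketch (the message names and state shape are made up): when the updates carried by messages are associative and commutative, every delivery order produces the same final state, even though the intermediate states differ.

```python
from itertools import permutations

# Hypothetical sketch: messages carry commutative/associative updates
# (adding items to a cart modeled as a set, adding to a running sum),
# so every delivery order yields the same final state.
messages = [("add_to_cart", "book"), ("add_to_cart", "pen"), ("add_to_sum", 5)]

def apply(state, msg):
    kind, value = msg
    if kind == "add_to_cart":
        return {"cart": state["cart"] | {value}, "sum": state["sum"]}
    else:  # "add_to_sum"
        return {"cart": state["cart"], "sum": state["sum"] + value}

final_states = set()
for order in permutations(messages):          # all possible delivery orders
    state = {"cart": frozenset(), "sum": 0}
    for msg in order:
        state = apply(state, msg)
    final_states.add((state["cart"], state["sum"]))

print(len(final_states))  # 1 -- the final state is order-independent
```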
Q2
(Contributed by @sukeswan)
The last paragraph of section 3 states that Lamport time cannot prove events are causally related and that this is quite important for the analysis of distributed systems. If a distributed system is running thousands of processes, it can become expensive to keep track of large vectors. Is the overhead of vector clocks worth it? Or are Lamport clocks sufficient? Do the benefits of characterizing causality outweigh the costs?
Discussion summary for Q2
Lamport clocks don't characterize causality the way vector clocks do, so if tracking causality is a key requirement of the system, vector clocks are an approach worth taking. Also, because a Lamport clock does not precisely characterize causality, a system relying on it may order events that are not causally related, leading to unnecessary waiting and poor performance.
But there are scenarios where the overhead of vector clocks is not feasible or not worthwhile, for example:
- A grow-only data structure that has no deletions -- this may not even need logical clocks.
- An embedded, low-memory environment that is always on -- one could stick to Lamport clocks here.
- An alternative to vector clocks could be Lamport clocks combined with some additional information about each process.
- Another alternative could be to mix Lamport clocks and vector clocks: leaf nodes with little memory, far from the central hub(s), operate with Lamport clocks, while intermediate and central systems with more memory synchronize using vector clocks.
- If you only need causality information for debugging, or don't expect to need it at runtime, you can simply record which process each message came from and its Lamport timestamp, and reconstruct the Lamport diagram from that record after the fact to determine what the problem was.
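For reference, here is a minimal, hypothetical vector clock sketch (not tied to any particular system) that makes the overhead discussed above visible: each message carries a vector with one entry per process, so the metadata grows linearly with the number of processes, whereas a Lamport clock is a single integer. The comparison function is what Lamport clocks cannot provide: it decides whether two events are causally related.

```python
# Hypothetical sketch of vector clocks for n processes.
# Each message carries an n-entry vector; a Lamport clock would carry 1 integer.

def new_clock(n):
    return [0] * n

def local_event(clock, i):
    """Process i performs a local event (or a send): increment its own entry."""
    clock[i] += 1
    return list(clock)              # timestamp attached to the event/message

def on_receive(clock, i, msg_ts):
    """Process i receives a message stamped msg_ts: pointwise max, then tick."""
    for k in range(len(clock)):
        clock[k] = max(clock[k], msg_ts[k])
    clock[i] += 1
    return list(clock)

def happened_before(a, b):
    """Vector time characterizes causality: a -> b iff a <= b pointwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

# Example with 3 processes: P0 sends to P2.
p0, p2 = new_clock(3), new_clock(3)
send_ts = local_event(p0, 0)            # [1, 0, 0]
recv_ts = on_receive(p2, 2, send_ts)    # [1, 0, 1]
print(happened_before(send_ts, recv_ts))  # True
print(happened_before(recv_ts, send_ts))  # False
```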
Q3
Section 1.2 of the paper says:
the notion of consistency in distributed systems is basically an issue of correctly reflecting causality.
Here, Schwarz and Mattern are using the word "consistency" to refer to the notion of consistent global snapshots, which we will discuss soon. Unfortunately, "consistency" is a very overloaded term in distributed systems. I can think of at least three or four ways in which the word "consistency" is used, and only some of them have to do with causality.
With your group, brainstorm ways in which you've heard the term "consistency" used in discussions of distributed systems, and what it means in those contexts (or, if you're not sure, what you think it might mean). Afterward, as a group, we'll try to sort out and nail down several distinct meanings.
Discussion summary for Q3
A couple of ways to define 'consistency':
- Causal consistency -- "all processes agree on the order of causally-related operations" Source: Wikipedia
- Strong consistency -- "when all the replicas have consensus on the data they have. That is, all the replicas have exactly the same data". It is a safety property.
- Consistent system -- an abstract way of looking at consistency -- a system is consistent if it generates the same output for the same input and algorithm, in a 'reasonably similar' amount of time.
- Eventual (Weak) consistency -- "if updates stop arriving, replicas eventually agree". It is a liveness property.
- Consistency in the previous paper -- mutual exclusion/synchronization.
- Consistency in the ACID properties, denoted by the C -- it concerns the consistency of shared data X that multiple transactions operate on: if those transactions are (a) atomic, (b) isolated, and (c) durable, then the state of X remains consistent.
- Consistent hashing -- a way of partitioning data across multiple nodes with minimal data movement when nodes are added or removed (see the sketch after this list).
- Architectural/protocol consistency -- all nodes use the same architecture/protocol
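To make the consistent hashing bullet above concrete, here is a minimal, hypothetical Python sketch (node names, key names, and the choice of MD5 are arbitrary; real systems usually also add virtual nodes): keys and nodes are hashed onto the same ring, each key is assigned to the next node clockwise, and adding a node only moves the keys that fall in one arc of the ring.

```python
import bisect
import hashlib

# Hypothetical sketch of consistent hashing (no virtual nodes, for brevity).

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        """Assign the key to the first node clockwise from the key's position."""
        h = ring_hash(key)
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, h) % len(self.ring)
        return self.ring[idx][1]

keys = [f"key{i}" for i in range(10)]
before = {k: ConsistentHashRing(["A", "B", "C"]).node_for(k) for k in keys}
after = {k: ConsistentHashRing(["A", "B", "C", "D"]).node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved after adding node D")
```

In expectation, only about a quarter of the keys move when a fourth node is added, whereas a naive `hash(key) % num_nodes` scheme would reshuffle most of them.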
Errata
Running list of typos / issues found in the paper:
- (to be filled in by scribes, if any)
Other
Any interesting points or questions from the group discussion that didn't fit above:
- (to be filled in by scribes, if any)