11 10 Impossibility of Distributed Consensus with One Faulty Process - lkuper/CSE232-2021-09 GitHub Wiki
Discussion for Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, "Impossibility of Distributed Consensus with One Faulty Process" (JACM 1985)
Paper overview
This paper establishes the proof of the impossibility of having a totally-correct consensus protocol in an asynchronous crash-fault model of a distributed system. It models a distributed system where processes change state in steps, each of which is defined by an event. These steps result in a change in the configuration of the system. The paper progressively uses three lemmas to prove the result. Lemma one from the paper says that schedules that are disjoint are commutative, leading to the same end configuration. Lemma two proves that some initial configuration is bi-valent, by disproving a contradiction using two adjacent configurations that differ by only one initial value. Lemma three proves using contradiction, that when starting with an initial bi-valent configuration, it always results in an end state of a bi-valent configuration. These three proofs are sufficient to show that such a consensus protocol cannot be totally correct.
Although a consensus protocol cannot be totally correct, the authors mention that the derived proofs don’t necessarily establish that these problems cannot be solved in practice. Realistic assumptions must be made and concessions need to be built into the system to solve this problem.
Discussion questions
Q1
(Contributed by @pan-jessica)
In lecture, we talked about the properties of consensus being termination, agreement, and validity. How would you define “partially correct” and “totally correct” in terms of these three properties?
Discussion summary for Q1
According to Paper, a partial correct algorithm satisfy the following conditions:-
(1) No accessible configuration has more than one decision value. Condition 1 is equivalent to the agreement.
(2) For each v E (0, I), some accessible configuration has decision value v Condition 2 is equivalent to the validity.
So the partially correct algorithm only guarantees agreement and validity and does not guarantee termination.
According to Paper, a totally correct consensus algorithm is partially correct(validity and agreement), and satisfy the following condition:-
(1) every admissible run is a deciding run. This condition is equivalent to termination.
So, totally correct consensus algorithm guarantees all three properties required to reach consensus - termination, agreement and validity. This is an ideal algorithm and is generally difficult to implement.
Q2
(Contributed by @nliittsc)
When reading theoretical papers, it is important to understand where certain assumptions are being used. In their proof of theorem 1, the authors opt for a proof by contradiction, by assuming that there is a totally correct protocol P in spite of a faulty process. Recall a totally correct process is necessarily partially correct. Where is partial correctness used in the proof? (Cite a paragraph, or lemma.) What is derived from that assumption? Where is the assumption that the a process is faulty used? What is derived as a result? If you have extra time, say why these assumptions are important to the overall proof.
Discussion summary for Q2
The assumption of partial correctness is largely used in the derivation of Lemma 2, in the first paragraph. By assuming partial correctness, the authors are able to claim that the protocol P has both 0-valent and 1-valent initial configurations.
The assumption that a process is faulty is also used in Lemma 2, in the second paragraph. They assume there is an admissible deciding run in which a process p takes no steps. Since a process is faulty if it does not take infinitely many steps, the assumption that a process is faulty is being used here. Using the associated schedule σ from this run, the authors derive a bivalent initial configuration, proving the lemma.
Q3
The authors conclude the paper by suggesting that perhap consensus can be solved "in practice" by altering the model or the problem specification. Can you think of another example of "more refined models of distributed computing that better reflect realistic assumptions" or of "less stringent requirements on the solution to such problems"?
Discussion summary for Q3
It turns out that just because it is impossible to reach consensus in the crash fault, asynchronous model, it does not mean that in-practice, applications don’t attempt to reach consensus. In real-world applications, we can usually get away with having a non-idealistic consensus protocol by relaxing one or more model assumptions - that is, changing the problem specification.
A few examples of this can be:
A) Using timeouts- This adds synchrony to the system (and thus relaxes the asynchronous model specification). Example: Paxos uses timeouts.
B) Exactly-once message delivery/exactly-once processing- Many systems claim to do this. These are effectively exactly-once processing and not exactly once delivered. So even though messages were sent more than once, messages are effectively processed only once if they are idempotent in nature.
C) Probabilistic models/Agreement with probability approaching 1- We can probabilistically solve consensus if all we care about is consensus happening with arbitrarily high probability.
D) Probabilistic models/Termination with probability approaching 1- Here we can assume that the longer the run gets, the less probable it is to terminate. So, we can cut off the algorithm and start over again.
Errata
Running list of typos / issues found in the paper:
- (to be filled in by scribes, if any)
Other
Any interesting points or questions from the group discussion that didn't fit above:
- (to be filled in by scribes, if any)