11/01: Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments
Discussion for Felix C. Gärtner, "Fundamentals of Fault-Tolerant Distributed Computing in Asynchronous Environments" (ACM Computing Surveys, 1999)
Paper overview
The paper aims to establish definitions of faults and fault tolerance in distributed systems, specifically in the asynchronous network model. It models a distributed system as a state machine that transitions between states via guarded commands, and it describes a fault as a possible but unwanted transition. It presents four forms of fault tolerance: masking, fail-safe, nonmasking, and none. Masking fault tolerance means that both the safety and liveness properties of a system are guaranteed in the event of a fault. Fail-safe fault tolerance guarantees safety but not liveness. Nonmasking fault tolerance guarantees liveness but not safety. Systems that guarantee neither safety nor liveness are not fault tolerant. The paper concludes with a discussion of tolerating faults through redundancy in either time or space, detecting faults, and recovering from them. Fault detection is needed to ensure safety properties, and recovery is needed to ensure liveness (eventually transitioning from a bad state back to a good state).
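To make the guarded-command view concrete, here is a minimal Python sketch of our own (not code from the paper): a program is a set of guard/action pairs over a state, and a fault is modeled as an extra guard/action pair that we wish would never fire.

```python
# Minimal sketch (our own, not from the paper) of the guarded-command view:
# a program is a set of (guard, action) pairs over a state, and a fault is
# just an extra, unwanted (guard, action) pair added to the program.
import random

state = {"x": 0}

# Normal program: each command is a guard (predicate on the state) plus an action.
program = [
    (lambda s: s["x"] < 3, lambda s: s.update(x=s["x"] + 1)),  # count up to 3
]

# Fault commands: also guarded commands, but ones we wish would never fire.
faults = [
    (lambda s: True, lambda s: s.update(x=-1)),  # corrupt the counter
]

def step(s, fault_prob=0.1):
    """Pick one enabled command (possibly a fault command) and execute it."""
    pool = program + (faults if random.random() < fault_prob else [])
    enabled = [(guard, action) for (guard, action) in pool if guard(s)]
    if enabled:
        _, action = random.choice(enabled)
        action(s)

for _ in range(10):
    step(state)
    # Safety property for this toy program: x is never negative.
    if state["x"] < 0:
        print("safety violated:", state)
```

In this framing, the question of fault tolerance is which properties (safety, liveness) the program still satisfies once the fault commands are allowed to fire.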
Discussion questions
Q1
(Contributed by @gshen42)
Section 2.2 of the paper mentions several fault models. How do these differ from the fault models we discussed in class? Can we arrange them in a hierarchy like we did for the ones we discussed in class?
Discussion summary for Q1
Section 2.2 of the paper gives examples of the following fault models: crash, fail-stop, and Byzantine. Our class previously discussed the crash, omission, and Byzantine fault models. The paper's crash model and the one from class agree: a process fails by halting at some point in time. We agree on the Byzantine model as well: processes can behave in arbitrary or malicious ways. What wasn't discussed in class is the fail-stop model, in which processes are assumed to fail by halting (as in the crash model), but their failures can be detected by the environment. From these assumptions, it's clear that the fail-stop model is a subset of the crash model. The hierarchy of all the fault models discussed both in the paper and in class is as follows: fail-stop ⊂ crash ⊂ omission ⊂ Byzantine.
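One way to picture the containment (our own toy illustration, not from the paper) is as nested sets of allowed faulty behaviors, where each stronger model permits everything the weaker one does plus more:

```python
# Toy illustration (our own, not from the paper) of the containment hierarchy:
# each stronger model allows every faulty behavior of the weaker ones, plus more.
FAIL_STOP = {"halt (detectably)"}
CRASH     = FAIL_STOP | {"halt (undetectably)"}
OMISSION  = CRASH | {"drop some messages"}
BYZANTINE = OMISSION | {"behave arbitrarily or maliciously"}

assert FAIL_STOP < CRASH < OMISSION < BYZANTINE  # fail-stop ⊂ crash ⊂ omission ⊂ Byzantine
```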
Q2
(Based on questions contributed by several people)
In section 4 of the paper, Gärtner explains that in nonmasking fault tolerance, safety properties may be violated, but at least liveness properties hold. (Contrast with masking fault tolerance, in which both safety and liveness hold, and fail-safe fault tolerance, in which only safety holds.) Gärtner does not seem to be a big fan of nonmasking fault tolerance. He calls it an "ugly duckling" and writes that "application scenarios for this type of fault tolerance are not readily visible".
Let's suppose that masking fault tolerance is infeasible or impossible, and so nonmasking and fail-safe are the only options. What are some real-life situations in which you would opt for nonmasking fault tolerance instead of fail-safe? Conversely, what are some real-life situations in which you definitely wouldn't opt for nonmasking fault tolerance?
Discussion summary for Q2
Nonmasking fault tolerance is desired in applications where being able to continually service requests (liveness) matters more than preventing something unsafe from happening. Examples of such applications are live streaming and video games. Live streaming platforms care more about being able to keep sending out frames for video playback than about guaranteeing that frames are never missed, skipped, or delivered out of order. With video games, it's more important that gameplay continue than that certain glitches never occur.
Fail-safe fault tolerance is desired in applications where faults could cause bad things to happen, and those bad things must not be allowed to occur (safety). An example is a nuclear power plant: if the plant starts exhibiting unwanted behavior, it should be shut down to prevent the possibility of a leak (or even an explosion!) rather than continuing to run in order to produce power.
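A toy Python sketch of our own (not from the paper) that contrasts the two choices on a hypothetical request loop; `detect_fault`, `handle`, and `recover` are stand-ins for application-specific logic:

```python
# Toy sketch (our own, not from the paper) contrasting the two strategies on a
# hypothetical request loop; detect_fault, handle, and recover are stand-ins.

def fail_safe_loop(requests, detect_fault, handle):
    """Fail-safe: give up liveness to preserve safety -- stop once a fault is seen."""
    for req in requests:
        if detect_fault():
            return              # shut down; nothing unsafe can happen after this
        handle(req)

def nonmasking_loop(requests, detect_fault, handle, recover):
    """Nonmasking: keep serving (liveness), even if output is briefly incorrect."""
    for req in requests:
        if detect_fault():
            recover()           # converge back to a good state eventually
        handle(req)             # keep producing output either way
```

The nuclear power plant example corresponds to the first loop (stop rather than risk an unsafe action), while live streaming and video games correspond to the second (keep serving and converge back to a good state).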
Q3
Section 7.2 of the paper discusses how it can be difficult to determine whether a global predicate holds in a distributed system. Gärtner writes, "In fact, settings can easily be constructed in which two nodes observe the same computation but arrive at different decisions on whether a global predicate held or not." Can you come up with such an example? (If your group has trouble constructing an example on your own, you could try looking at the two papers Gärtner cites to support this point.)
Discussion summary for Q3
An example of nodes observing the same computation but arriving at different conclusions about a global predicate is a ghost deadlock. A ghost deadlock can occur when an external process takes an inconsistent global snapshot of the messages exchanged among the other processes. Although the snapshot contains only events that did occur, an incorrect conclusion can be drawn from it. Observe the following diagram:
This diagram is based on Figure 4.2 in Babaoğlu and Marzullo's Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. The snapshot contains the events to the left of the cut (the past) and thus contains none of the events to the right of the cut (the future). From the perspective of this snapshot:
- A sent a request to C and is awaiting a response
- C sent a request to B and is awaiting a response
- B sent a request to A and is awaiting a response
From this cyclic waiting for responses, it appears that a deadlock has occurred. That conclusion would not be reached if the events to the right of the cut were included, which is what makes this a ghost deadlock.
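Here is a small Python sketch of our own construction (not from the paper, and only loosely following Babaoğlu and Marzullo's figure). It shows how a snapshot stitched together from local states observed at different instants, which is the kind of global state an inconsistent cut can produce, reports a wait-for cycle that never actually held:

```python
# Toy sketch (our own construction, not from the paper) of a ghost deadlock.
# A snapshot maps each process to the process it is currently waiting on (or None).

def wait_for_cycle(snapshot):
    """Does the snapshot contain a cycle of processes waiting on each other?"""
    for start in snapshot:
        seen, p = set(), start
        while p is not None and p not in seen:
            seen.add(p)
            p = snapshot.get(p)
        if p is not None:       # we stopped because we revisited a process: a cycle
            return True
    return False

# The actual run as a sequence of global states (one per instant). At no single
# instant are all three processes waiting, so no real deadlock ever exists.
run = [
    {"A": "C", "B": None, "C": None},   # A requests from C
    {"A": "C", "B": None, "C": "B"},    # C requests from B
    {"A": "C", "B": None, "C": None},   # B replies to C
    {"A": None, "B": None, "C": None},  # C replies to A
    {"A": None, "B": "A", "C": None},   # B requests from A
    {"A": None, "B": None, "C": None},  # A replies to B
]
assert not any(wait_for_cycle(g) for g in run)

# The monitor's snapshot combines local states recorded at different instants:
ghost = {"A": run[1]["A"], "C": run[1]["C"], "B": run[4]["B"]}
assert wait_for_cycle(ghost)            # reports a "deadlock" that never occurred
```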
Errata
Running list of typos / issues found in the paper:
- N/A
Other
Any interesting points or questions from the group discussion that didn't fit above:
- Brief note in class about how perfect failure detection is impossible in an asynchronous environment.