09 29 A Note on Distributed Computing
Discussion for: Jim Waldo et al., "A Note on Distributed Computing" (1994)
Paper overview
The paper discusses how the unified object-oriented programming paradigm attempts to treat local and distributed objects uniformly, and the risks of ignoring their differences. The authors are critical of this paradigm, emphasizing the differences between local and distributed systems and showing, through Sun's NFS, how issues arise from differing latencies, memory access, partial failure, and concurrency.
Discussion questions
Q0 (warm-up question)
What do Waldo et al. say are the major differences between local computing and distributed computing? Which of these differences do they say is most fundamental? Which do they say is least fundamental?
Discussion summary for Q0
Waldo et al. discuss four major differences between distributed and local computing: latency, memory access, partial failures, and concurrency. In order from least to most fundamental:
- Latency - There are significant delays when invoking distributed objects compared to local objects, so the time it takes to actually invoke a distributed object needs to be factored into design considerations (a small sketch after this summary makes the gap concrete). In principle, hardware advances can mitigate or even remove the impact of latency.
- Memory Access - This largely comes down to pointer management. If one tries to hide the pointer management of remote objects to ease the burden on programmers, errors are bound to occur. On the other hand, making pointer and memory management more explicit requires that programmers learn how to work with such tools, which makes their task more difficult.
- Partial Failures - In a distributed system, a link or node can fail while the rest of the network continues to operate, which is largely not the case in local computing. Such partial failures are difficult to diagnose: because any common agent designed to observe and report failures is itself a node in the network, it is entirely possible that the agent, or its communication link, fails. In that scenario there is no way to diagnose the cause of failure without yet another common agent (which is still a part of the network!).
- Concurrency - Since distributed objects must handle concurrent invocations, all objects must have some concurrency semantics. The issue is that all objects need the same semantics for dealing with concurrency; otherwise, different distributed objects can end up inconsistent after the same set of concurrent invocations.
Overall, the most fundamental issues are partial failure and concurrency: unlike latency and memory access, they cannot be "papered over" by hardware advances, even in principle.
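To make the latency gap concrete, here is a minimal sketch (in Python; the 1 ms sleep is a made-up stand-in for a network round trip, and the function names are hypothetical) that times in-process calls against simulated remote calls:

```python
import time

def local_add(a, b):
    # An in-process call: typically well under a microsecond.
    return a + b

def simulated_remote_add(a, b, round_trip_s=0.001):
    # Stand-in for a remote invocation: the 1 ms sleep is an assumed
    # placeholder for a network round trip, not a real RPC.
    time.sleep(round_trip_s)
    return a + b

def time_calls(fn, n=1000):
    start = time.perf_counter()
    for _ in range(n):
        fn(1, 2)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"1000 local calls:    {time_calls(local_add):.6f} s")
    print(f"1000 'remote' calls: {time_calls(simulated_remote_add):.6f} s")
```

Even with an optimistic 1 ms round trip, the simulated remote calls are several orders of magnitude slower than the local ones, which is why latency has to be a design consideration rather than an afterthought.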
Q1
(Contributed by @Aditya-1996)
In section 4.3, when discussing partial failure, Waldo et al. write, "Not only is the failure of the distributed components independent, but there is no common agent that is able to determine what component has failed and inform the other components of that failure, no global state that can be examined that allows determination of exactly what error has occurred." What exactly is the reason that a component's failure cannot be detected and communicated in a distributed system? Would it be possible to have 'common agents' installed at several points on a distributed network to ensure that the exact component failure is picked up?
Discussion summary for Q1
In a distributed system it is hard to distinguish between a high-latency response, a failure of the component itself, and a network issue, so identifying failure is unreliable at the best of times. Having multiple common agents to detect component failures only spreads this problem across the network, since any monitoring agent is subject to the same failures as the components it supervises. A possible mitigation is all-to-all monitoring: with every component monitoring every other, the probability of detecting a failure increases. However, this comes at great cost, as it is both expensive and often infeasible to implement, and even with this approach perfect failure detection is impossible.
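A minimal sketch of why detection is unreliable, assuming a monitor that probes a component over TCP (the address, port, and timeout below are placeholders): a failed probe cannot distinguish a slow component, a crashed component, or a broken link, and the monitor itself is just another node that can fail.

```python
import socket

def probe(host, port, timeout_s=1.0):
    # Try a TCP connection to decide whether a component looks alive.
    # A timeout or error only tells us that *something* went wrong: the
    # component may have crashed, the network may be partitioned, or the
    # reply may simply be slower than our timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "looks alive"
    except socket.timeout:
        return "no reply in time (slow? crashed? partitioned?)"
    except OSError:
        return "unreachable (crashed? link down? or just this path?)"

if __name__ == "__main__":
    # Hypothetical target; the monitor running this code can itself fail,
    # in which case no result is reported at all.
    print(probe("example.com", 80))
```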
Q2
(Contributed by @jabowen)
Section 6 of the paper mentions that NFS uses a centralized resource manager, and that because of this it could be argued not to be a "genuinely distributed" system. Do you think that the use of a centralized resource manager makes NFS a non-distributed system? Why or why not?
Discussion summary for Q2
It really depends on how we define "distributed system". If a distributed system is just a collection of interconnected computers, then having a centralized resource manager still counts. However, if a distributed system is characterized by partial failure, then a centralized manager undermines that: the manager can shut the whole system down when it detects a failure, preventing anomalous behavior, and if the manager itself fails, the result is a full system failure rather than a partial one. That contravenes a definition of distributed systems built around partial failure.
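As a toy illustration of the single-point-of-failure argument (hypothetical classes, not how NFS actually works): when every operation is mediated by one central manager, losing the manager stops every server at once, i.e., a total failure rather than a partial one.

```python
class CentralManager:
    def __init__(self):
        self.up = True

    def authorize(self, request):
        if not self.up:
            raise RuntimeError("manager down: every client is blocked")

class FileServer:
    def __init__(self, name, manager):
        self.name = name
        self.manager = manager

    def read(self, path):
        # Every operation is mediated by the central manager, so the
        # manager's failure takes down all servers, not just one.
        self.manager.authorize((self.name, path))
        return f"contents of {path} from {self.name}"

manager = CentralManager()
servers = [FileServer("s1", manager), FileServer("s2", manager)]

manager.up = False  # the central component fails...
for s in servers:
    try:
        s.read("/etc/hosts")
    except RuntimeError as e:
        print(f"{s.name}: {e}")  # ...and no server can serve anything
```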
Q3
In section 4.3, when discussing the need for distributed applications to handle concurrent invocations of operations, Waldo et al. write, "One might argue that a multi-threaded application needs to deal with these same issues. However, there is a subtle difference. In a multi-threaded application, there is no real source of indeterminacy of invocations of operations. The application programmer has complete control over invocation order when desired. A distributed system by its nature introduces truly asynchronous operation invocations."
Based on your experience and intuition, how would you characterize the differences between shared-memory multi-threaded programming and distributed programming? Do you agree with Waldo et al. that "there is no real source of indeterminacy" in single-machine, shared-memory multi-threaded programming, and that it is possible for the programmer to have "complete control over invocation order when desired"?
(Let's assume that "indeterminacy" is a synonym for "nondeterminism", and that the kind of nondeterminism we are talking about is schedule nondeterminism, that is, nondeterminism in the order in which operations are executed, as opposed to other kinds of nondeterminism, for instance, the intentional nondeterminism of a random number generator.)
Discussion summary for Q3
The statement by Waldo et al. that "there is no real source of indeterminacy" in a single-machine, shared-memory, multi-threaded programming environment does not ring true. While the programmer controls the order in which threads are dispatched to perform certain tasks, it is the kernel scheduler (or whatever scheduler is in use) that determines the order in which those threads actually run. A programmer can assign custom thread priorities, but totally ordering thread operations may not be possible with ordinary privileges on some systems, and even when it is possible it is far from trivial. The result is schedule nondeterminism, as the sketch below illustrates.
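A minimal sketch of that schedule nondeterminism on a single machine (plain Python threads, no distribution involved): two threads increment a shared counter without synchronization, and the final value depends on how the scheduler interleaves them, which the programmer does not directly control.

```python
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        # Read-modify-write is not atomic: the scheduler may switch
        # threads between the read and the write, losing updates.
        counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000, but depending on the interpreter and scheduler this
# may print a smaller value, and a different one on each run.
print(counter)
```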
Errata
Running list of typos / issues found in the paper:
- None found
Other
Any interesting points or questions from the group discussion that didn't fit above:
- Section 8 is an interesting read