10 02 A Note on Distributed Computing

Discussion for: Jim Waldo et al., "A Note on Distributed Computing" (1994)

Paper overview

This paper distinguishes the fundamental differences between local and distributed computing. When designing distributed systems, several problems arise that we did not have to think about with local machines, namely latency, memory access, concurrency, and partial failure. The authors challenge the existing assumption that objects in distributed systems can be treated just like objects on a local machine, moving towards the conclusion that novel data structures must be created to serve the needs of distributed systems and address the challenges above. These robust design decisions are key to building scalable systems and cannot be ignored. One example of a key difference is the use of memory pointers: a pointer that is valid in the local address space is meaningless on a remote machine, which creates the need for globally meaningful object references. Similarly, local calls complete with small, predictable latency, but in a distributed system we must also account for the latency introduced by the network. As such, we need a new set of assumptions and controls, such as timeouts, when designing distributed systems.
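As a rough illustration of that last point, here is a minimal sketch in Python (not from the paper; the endpoint URL is a placeholder) of how a remote invocation forces assumptions a local call never needed: a latency budget (timeout) and some policy for partial failure, where the caller may not even know how far the call got before it failed.

```python
import socket
import urllib.error
import urllib.request

def local_get(table, key):
    # A local invocation: it either returns or raises, and it does so quickly.
    return table[key]

def remote_get(url, timeout_s=0.5):
    # A (possibly) remote invocation needs assumptions a local call never did:
    # a latency budget, and a plan for partial failure.
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except socket.timeout:
        return None  # too slow: retry? give up? report?
    except urllib.error.URLError:
        return None  # partial failure: we may not know how far the call got

# Hypothetical endpoint, purely for illustration:
# remote_get("http://example.invalid/object/42")
```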

Discussion questions

Q1: Easy warm-up question

What do Waldo et al. say are the major differences between local computing and distributed computing? Which of these differences do they say is most fundamental, and which do they say is least fundamental?

Discussion summary for Q1

Concurrency and partial failure are emphasized as the most fundamental challenges in distributed systems because these problems occur far more often than in local computing and require careful reasoning. Latency is considered the least fundamental: it is more easily predictable, but it still introduces complications, especially when multiple requests queue up.

Additionally, memory access emerges as an important difference: pointers that are valid in one address space mean nothing on a remote machine, so distributed systems need new techniques for referring to remote objects.

Q2: Has latency gotten better since 1994?

In 1994 (when this paper was published), Waldo et al. wrote:

The most obvious difference between a local object invocation and the invocation of an operation on a remote (or possibly remote) object has to do with the latency of the two calls. The difference between the two is currently between four and five orders of magnitude, and given the relative rates at which processor speed and network latency speeds are changing, the difference in the future promises to be at best no better, and will likely be worse. It is this disparity in efficiency that is often seen as the essential difference between local and distributed computing.

Colin Scott's "Latency Numbers Every Programmer Should Know" is an interactive tool for exploring latencies of various computer operations between 1990 and 2020. If we trust Colin's numbers: which latencies have changed a lot in that 30-year span of time, and which haven't changed? What do you think are the reasons for the change or lack thereof? Do you think Waldo et al. were correct in their 1994 prediction about the future of latency change?

Discussion summary for Q2

Latency improvements over the past 30 years have varied significantly between local and remote operations. Local computation has seen considerable speed gains, largely due to advancements in multi-core processors, memory hierarchies, and faster hardware. Disk I/O, for example, has improved dramatically. However, network latency, particularly over long distances, has not seen the same degree of improvement, constrained by the physical limits of communication mediums like fiber optics and the speed of light.

Remote communication over commodity networks has seen some improvement, but latency on LANs or across global networks remains largely unchanged. This stagnation in network latency is due to physical limitations that are difficult to overcome with hardware alone.

Waldo et al.'s 1994 prediction was largely correct: the disparity between local and remote latency has grown rather than decreased. As hardware approaches its limits, the focus has shifted toward software optimizations, but physical distance will always introduce a latency bottleneck that cannot be completely eliminated. Thus, even if local and remote latency were equalized, it would still be unwise to treat them as functionally equivalent.
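As a back-of-the-envelope check (using rough round numbers, not measurements from the chart), the wide-area numbers are already close to the lower bound that the speed of light imposes, which is why they have barely moved:

```python
# Rough physical lower bound on round-trip latency (assumed round numbers).
SPEED_OF_LIGHT_KM_PER_S = 300_000
FIBER_FRACTION = 2 / 3          # light in optical fiber travels at ~2/3 of c

def min_round_trip_ms(distance_km):
    """Round-trip time imposed by distance alone, ignoring routing and queuing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_FRACTION)
    return 2 * one_way_s * 1000

print(min_round_trip_ms(8_800))   # ~88 ms for roughly California <-> Netherlands
# Colin Scott's chart lists ~150 ms for that round trip, and ~100 ns for a local
# main-memory reference -- about six orders of magnitude apart.
```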

Q3: What does "concurrency" mean?

(question contributed by Gurpreet Dhillon)

On page 8, in the section on concurrency, the paper states “distributed objects by their nature must handle concurrent method invocations” and “since dealing with concurrency can take place only by passing information from one object to another through the agency of the interface.” What exactly does concurrency mean in the context of distributed computing? I understand that in local computing it is the notion of two or more processes taking turns on a core. In a distributed context, a single component must handle multiple tasks simultaneously, and it gets more challenging when we have more than one component. So is concurrency in distributed systems the idea of a collective group of distributed objects interleaving their invocations simultaneously, operating as one unit? What is your definition or understanding of concurrency in distributed computing (maybe give an example)?

Discussion summary for Q3

Traditionally, concurrency means handling independent tasks at the same time, where tasks can overlap and interleave; it is largely about how to share computing resources among different processes. Concurrency in distributed systems, by contrast, is about how to handle conflicts when multiple requests are sent to the same object. In the context of distributed computing, concurrency is about the possible orderings of events: a distributed system is non-deterministic in the order in which events happen, so an object must behave correctly under any interleaving of the invocations it receives.
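One way to see that non-determinism is with a minimal local simulation (a sketch, not from the paper): threads and random sleeps stand in for separate clients and variable network delay, and the shared object sees a different interleaving of invocations on each run.

```python
import random
import threading
import time

event_log = []            # the shared object's record of invocation arrival order
log_lock = threading.Lock()

def client(name, n_requests=3):
    for i in range(n_requests):
        time.sleep(random.uniform(0.001, 0.01))   # stand-in for variable network delay
        with log_lock:
            event_log.append(f"{name}:req{i}")    # arrival order, not sending order

clients = [threading.Thread(target=client, args=(name,)) for name in ("A", "B")]
for t in clients:
    t.start()
for t in clients:
    t.join()

print(event_log)   # interleaving differs from run to run; there is no single "true" order
```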

Q4: How might replicated objects reveal that they are replicated?

(question contributed by Jonathan Castello)

In modern (and not-so-modern) systems, replicated objects (such as CRDTs, Conflict-free Replicated Data Types) have become popular. These objects exist at many locations ("replicas") in the system, and an application at the same location as a replica may interact with it as if it were a local object. Following the authors' conclusions, this abstraction must be leaky. What are some ways in which the concerns of distributed systems might leak through? (One possible hint: can you design such an object as easily as you could a regular local object?)

Discussion summary for Q4

When dealing with replication in distributed systems, it can be difficult to know whether the data we are accessing is actually the most up-to-date value. Managing data dependencies becomes a problem, so we need some robust mechanism for handling them. Additionally, we need a way to detect, resolve, and merge conflicts when they arise, and in distributed systems such conflicts will arise often. A final concern is that when designing a global object, we must ensure that it does not reference other objects that exist only on one local machine; a global object's references must be meaningful everywhere.

To address these concerns, we proposed a couple of different solutions. The first is to implement something like a version control system (e.g., git) in the context of distributed systems: abstract the distributed system as a repository and each machine as a user, so that version history can easily be tied to the contributing machine. Another option is to store the task tree and lineage for each task as part of its metadata and replicate this data with every event tied to a specific task. If we do this, it should be simple to rebuild task lineages from that metadata and determine exactly which machines contributed.
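To make the leaky-abstraction point concrete, here is a minimal sketch (our illustration, not from the paper) of a grow-only counter, one of the simplest CRDTs. Even this object exposes distribution through its interface: it must know its replica identity, its reads can be stale until replicas merge, and something outside the object has to arrange for that merging.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id          # leak: the object must know where it lives
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.replica_id] += 1

    def value(self):
        # Leak: this read may be stale until merge() has seen the other replicas.
        return sum(self.counts)

    def merge(self, other):
        # Leak: someone (application or runtime) must arrange for state exchange.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas accept increments independently, then reconcile.
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
print(a.value(), b.value())   # 2 1  -- divergent local views
a.merge(b); b.merge(a)
print(a.value(), b.value())   # 3 3  -- converged after merging state
```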