11 25 MapReduce Simplified Data Processing on Large Clusters - lkuper/CSE232-2024-09 GitHub Wiki
Discussion for: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004)
Paper overview
MapReduce is a programming model designed to simplify large-scale data processing by abstracting away the complexities of distributed systems. It breaks computations into two key phases: `map`, which processes input data into intermediate key/value pairs, and `reduce`, which aggregates those intermediate results. The model automates parallelization, fault tolerance, and resource management, making it accessible even to programmers without expertise in distributed systems. By leveraging clusters of commodity hardware, MapReduce achieves high scalability and handles failures gracefully through task re-execution.
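The programming model described above can be illustrated with the paper's canonical word-count example. The sketch below is a toy, single-process simulation of the model, not the paper's actual C++ API; the function names and driver loop are illustrative.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit an intermediate (word, 1) pair for every word.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce phase: aggregate all counts emitted for the same word.
    return sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # "Shuffle" step: group intermediate values by key.
    groups = defaultdict(list)
    for doc in inputs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Invoke reduce once per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

counts = run_mapreduce(["to be or not to be"], map_fn, reduce_fn)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In the real system, the map and reduce invocations run on different machines, and the grouping step is performed by partitioning and sorting intermediate data across the network; the user still writes only the two small functions.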
Discussion questions
Q1
Section 3.3 of the paper discusses how MapReduce handles crash faults. When discussing master failure, the authors claim that "given that there is only a single master, its failure is unlikely". Why does this make sense? Doesn't it contradict what we've said about fault tolerance up until now in the course?
Discussion summary for Q1
Having a single dedicated master node can make sense in the following case: with only one master, the probability that this particular node fails can be lower than the probability that some worker fails, simply because workers are far more numerous. This is especially true if we can control the conditions of the test environment / data center so that failure of the more important node is less likely than failure of a less important one.
However, in a realistic, less controlled environment, each node, worker or master, has roughly the same probability of failure: the machines have similar hardware and are indistinguishable apart from the roles they play. In that scenario it would not make sense to rely on a single dedicated master, since it is just as likely to fail as any worker node.
Having a single master node does seem to contradict the understanding of fault tolerance we have built so far, particularly from chain replication and Paxos, where we saw that having multiple replicas improves fault tolerance. A single master is a single point of failure and could be a significant weakness in the implementation, although this is mitigated by the fact that re-running an aborted MapReduce computation after a master failure is relatively cheap.
Q2
(question contributed by Inje Kim)
Why does MapReduce re-execute failed tasks rather than checkpointing intermediate results?
Discussion summary for Q2
Re-execution of failed tasks would be better than checkpointing for the following reasons:
- Re-execution is really only needed for map tasks, since map workers write their output to local disk rather than to the global file system; completed reduce tasks have already written their output to GFS and need not be redone.
- It is simpler to re-execute a failed task from scratch than to resume from a checkpoint stored at the master or elsewhere, since resuming would require additional mechanisms to reload the checkpoint and restart execution from it.
- Map workers store their results on local memory or disk. If a machine fails, that storage becomes inaccessible to the rest of the network, including the master, so checkpoints stored there would be lost along with the machine. This could be avoided by writing intermediate map output to global storage (as the reduce phase does), but that would require additional code and protocol changes, and since every intermediate write would then go over the network, the whole process would also be slower.
- Loading checkpoints and resuming execution from them would also add complexity and increase the chance of bugs in the MapReduce implementation.
- Re-execution also uses less storage, since workers do not need to keep checkpoints either locally or globally.
- The master knows which input split each map task was assigned, and those splits are stored in GFS, so rescheduling a failed task on another worker is as easy as sending out another message.
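The bookkeeping behind these points can be sketched as follows. This is a hypothetical toy model of the master's task table, built on the assumptions above: completed map tasks must be reset when their worker dies (their output lived on that machine's local disk), while completed reduce tasks stay done (their output is safely in GFS). The dictionary layout and state names are illustrative, not the paper's actual data structures.

```python
def handle_worker_failure(tasks, failed_worker):
    # tasks: {task_id: {"kind": "map"|"reduce", "state": str, "worker": str|None}}
    for task in tasks.values():
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            # Map output on the dead machine is unreachable: redo the task
            # even if it had completed.
            task["state"], task["worker"] = "idle", None
        elif task["state"] == "in_progress":
            # An in-progress reduce task is rescheduled; a completed one is
            # not, because its output already lives in the global file system.
            task["state"], task["worker"] = "idle", None
    return tasks

tasks = {
    1: {"kind": "map", "state": "completed", "worker": "A"},
    2: {"kind": "reduce", "state": "completed", "worker": "A"},
    3: {"kind": "reduce", "state": "in_progress", "worker": "A"},
}
handle_worker_failure(tasks, "A")
# Task 1 is redone, task 2 stays completed, task 3 is rescheduled.
```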
Q3
Sorting with MapReduce seems to require very little code. The paper explains that a MapReduce sort implementation was able to beat a well-known benchmark for sorting a terabyte of data, using a very short map function and a reduce function that is merely the identity function. How is this possible, and by what mechanism in MapReduce does the sorting actually get done?
Discussion summary for Q3
MapReduce is able to sort large amounts of data efficiently with very little code thanks to two mechanisms:
- Mapping keys: the three-line map function extracts a 10-byte sorting key from a text line and emits the key together with the original line as the intermediate key/value pair.
- Shuffling: this step groups the intermediate key/value pairs by key and routes each pair to a reduce partition. Much like the distribution step of a bucket sort, the shuffle moves the pairs into the correct, ordered "buckets", and within each bucket the framework sorts by key before invoking reduce.
In conclusion, MapReduce splits the sorting work into smaller tasks distributed among the reduce workers. Since each reduce worker is responsible for only a small, contiguous range of keys, the partitions can be sorted in parallel, and concatenating the reduce outputs yields the globally sorted result.
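The mechanism above can be sketched in a few lines. This is an illustrative toy, not the paper's benchmark code: the 11-character records, the single partition boundary, and the helper names are all made up for the example. The point is that the user-supplied map function only extracts the key, reduce is the identity, and the range partitioner plus the framework's per-partition sort do the actual sorting.

```python
def map_fn(line):
    # Emit the record's first 10 bytes as the sort key (cf. Sec. 5.3).
    yield (line[:10], line)

def partition(key, boundaries):
    # Range partitioner: keys below boundaries[i] go to reduce task i, so
    # the reduce tasks' outputs are globally ordered when concatenated.
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

records = ["banana....x", "apple.....y", "cherry....z"]
buckets = [[], []]  # two reduce partitions, split at key "b"
for line in records:
    for key, value in map_fn(line):
        buckets[partition(key, ["b"])].append((key, value))

# The framework sorts each partition by key before calling reduce;
# since reduce is the identity, the sorted partitions *are* the output.
for bucket in buckets:
    bucket.sort()
output = [value for bucket in buckets for (key, value) in bucket]
# output == ["apple.....y", "banana....x", "cherry....z"]
```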
Q4 (optional; if you have time)
(two related questions, contributed by Prathamesh Khole and Jonathan Castello)
If there is only one map worker and one reduce worker along with a master, does MapReduce perform comparably to a sequential execution of the algorithm or is it much worse?
The distributed MapReduce system described in this paper requires a fair amount of orchestration in order to shuttle data around and execute tasks on the right machines. Qualitatively, how large of a dataset do you think you'd need to have before this system becomes worth using? What bottlenecks might exist on a single-node map/reduce implementation that the distributed system is able to overcome?
Discussion summary for Q4 (optional; if we get to it)
Sequential execution is more efficient than MapReduce with only one map worker and one reduce worker: MapReduce adds overhead to store the intermediate data, and the reduce worker must remotely read the intermediate output from the map worker's local disk. However, a one-map, one-reduce configuration could still be useful if it streamed results as they are computed.
For the MapReduce system to be worth using, the dataset needs to be larger than what a single machine can hold; if all the data fits on one machine, it would be silly to split it up across several. Each map and reduce worker needs large enough chunks of data to work on. In other words, avoid configuring a distributed system so that it is effectively a single-node system, since the complexity of the distributed system would not be worth the overhead.