12 03: MapReduce: Simplified Data Processing on Large Clusters

Discussion for: Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004)

Paper overview

This paper presents MapReduce, a programming model and associated implementation for processing large datasets. It addresses the problem of building applications that must process large amounts of data, typically requiring some form of computation over that data to gain insights. The paper's essential contribution is a simple interface (the MapReduce library) that enables automatic parallelization and distribution of large-scale computations. The computations run on clusters of ordinary commodity machines and require no specialized hardware.
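As a concrete illustration of the programming model, here is a minimal single-process Python sketch of the paper's word-count example. The `map_fn`, `reduce_fn`, and `run_mapreduce` names and the in-memory driver are our own simplification, not the actual library (which is a C++ implementation that runs map and reduce tasks in parallel across many machines).

```python
from collections import defaultdict

# Minimal, single-process sketch of the MapReduce interface, using the
# paper's word-count example. The real library distributes map and reduce
# tasks across a cluster; here we only simulate the data flow.

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: all counts emitted for that word
    yield sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: produce intermediate key/value pairs, grouped by key.
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    # Reduce phase: merge all values associated with each intermediate key.
    return {ik: list(reduce_fn(ik, ivs)) for ik, ivs in sorted(intermediate.items())}

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': [1], 'dog': [1], 'fox': [1], 'lazy': [1], 'quick': [1], 'the': [2]}
```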

Discussion questions

Q1

(Inspired by questions asked by multiple people)

Section 3.3 of the paper discusses how MapReduce handles crash faults. When discussing master failure, the authors claim that "given that there is only a single master, its failure is unlikely". Why does this make sense? Doesn't it contradict what we've said about fault tolerance up until now in the course?

Discussion summary for Q1

One way of looking at this statement is from a simple probability perspective: there is only a single master among N machines, so if one machine fails, the chance that it happens to be the master is just 1/N, while the chance that it is one of the workers is (N-1)/N. Equivalently, if each machine fails independently with some small probability over the course of a job, the master fails with only that small probability, whereas the chance that at least one of the N-1 workers fails approaches 1 as N grows. Note that N here is typically large, easily in the thousands of machines.
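A rough back-of-the-envelope check of this argument, in Python; the per-machine failure probability and cluster size below are made-up illustrative numbers, not figures from the paper:

```python
# Assume each machine independently fails with probability p during a job.
p = 0.001    # hypothetical per-machine failure probability for one job
N = 2000     # hypothetical cluster size

p_master_fails = p                              # only one machine is the master
p_some_worker_fails = 1 - (1 - p) ** (N - 1)    # at least one of N-1 workers fails

print(f"P(master fails)      = {p_master_fails:.4f}")        # 0.0010
print(f"P(some worker fails) = {p_some_worker_fails:.4f}")   # ~0.8647
```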

Nothing about MapReduce is decentralized. The design deliberately avoids master redundancy and the associated complexity of choosing one node amongst several, as seen in Paxos. What it does invest in is redundant execution of tasks: a failed or slow task simply gets retried or re-executed on another worker.

Q2

(Contributed by @nileshnegi)

Section 4.3 of the paper says that use of a combiner function "significantly speeds up certain classes of MapReduce operations". What are some examples of applications that would benefit from executing combine operations locally on the map workers in MapReduce?

Discussion summary for Q2

A combiner function is typically a local execution of the user-defined Reduce function on each Map worker before the shuffle. It can be anything that reduces the amount of intermediate data to be shuffled without changing the result of the overall computation. In general, the Reduce function can be reused directly as the combiner only when it is associative and commutative.

Examples include:

  • Counting the number of words in documents (see the sketch after this list)
  • Removing duplicate occurrences of words in documents
  • Finding the maximum or minimum in a large list of numbers
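For the first example above, a combiner can simply sum the per-word counts locally on each Map worker before anything is shuffled. The sketch below is a simplification of the idea (not the library's actual API): the local combine step is just the Reduce function (summation) applied to that worker's own output.

```python
from collections import Counter

# Word count with a local combiner: each Map worker sums its own (word, 1)
# pairs, so only one (word, local_count) pair per distinct word is shuffled
# instead of one pair per word occurrence.
def map_with_combiner(doc):
    counts = Counter(doc.split())      # local combine == local reduce (sum)
    return list(counts.items())

print(map_with_combiner("the cat sat on the mat the end"))
# [('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('end', 1)]
```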

An example of a Reduce function that cannot be used directly as a combiner is averaging, i.e., the sum of all elements divided by the number of elements. An easy fix is to propagate (sum, count) pairs instead of averages, and have the final Reduce step combine those.

With the naive combiner (i.e., the wrong one), if the first Map worker computes Avg(1, 2, 3) = 2 and the second Map worker computes Avg(4, 5) = 4.5, the Reduce worker receives the intermediate results 2 and 4.5 and computes Avg(2, 4.5) = 3.25. Without a combiner stage, the result would be Avg(1, 2, 3, 4, 5) = 3, which is not equal to 3.25.
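To make this concrete, here is a small Python sketch (our own illustration, not code from the paper) contrasting the incorrect averaging combiner with the (sum, count) fix:

```python
# Why averaging cannot be used directly as a combiner, and the (sum, count)
# fix. Each inner list stands for the values one Map worker sees for the
# same intermediate key.
partitions = [[1, 2, 3], [4, 5]]

# Wrong: combine by averaging locally, then average the local averages.
local_avgs = [sum(p) / len(p) for p in partitions]       # [2.0, 4.5]
wrong = sum(local_avgs) / len(local_avgs)                # 3.25

# Right: combine into (sum, count) pairs, then reduce those.
local_pairs = [(sum(p), len(p)) for p in partitions]     # [(6, 3), (9, 2)]
total_sum = sum(s for s, _ in local_pairs)
total_count = sum(c for _, c in local_pairs)
right = total_sum / total_count                          # 3.0

print(wrong, right)                                      # 3.25 3.0
```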

Q3

Sorting with MapReduce seems to require very little code. The paper explains that a MapReduce sort implementation was able to beat a well-known benchmark for sorting a terabyte of data, using a very short map function and a reduce function that is merely the identity function. How is this possible, and by what mechanism in MapReduce does the sorting actually get done?

Discussion summary for Q3

Two crucial points in Google's implementation of the sort benchmark (modeled after TeraSort) are the Map function, which "extracts a 10-byte sorting key from a text line" and emits it with the original line as the intermediate key/value pair, and the partitioning function, which "has built-in knowledge of the distribution of keys" and uses it to assign each intermediate key/value pair to the Reduce task responsible for its key range.

Simply put, the distribution of keys has to be known beforehand, or else estimated, for example, by a pre-pass MapReduce computation that samples the keys. The partitioning function then breaks the key range into R contiguous ranges (where R is the number of Reduce tasks), and each Reduce worker sorts only its own, much smaller range; concatenating the R output files in range order yields a globally sorted result.
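A tiny Python sketch of the idea (the keys and split points below are made up for illustration; the real implementation partitions 10-byte keys across thousands of reduce tasks):

```python
import bisect

# Sorting via range partitioning. Split points would normally come from a
# known or sampled key distribution; here they are chosen by hand.
keys = [42, 7, 99, 23, 64, 5, 88, 31]
split_points = [30, 60]        # defines R = 3 key ranges: <30, 30-59, >=60
R = len(split_points) + 1

# "Shuffle": each intermediate key goes to the Reduce task owning its range.
partitions = [[] for _ in range(R)]
for k in keys:
    partitions[bisect.bisect_right(split_points, k)].append(k)

# Each Reduce worker sorts only its own (small) range...
sorted_partitions = [sorted(p) for p in partitions]

# ...and concatenating the R outputs in range order gives a global sort.
print([k for p in sorted_partitions for k in p])
# [5, 7, 23, 31, 42, 64, 88, 99]
```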

Running list of typos / issues found in the paper:

  • N/A

Other

Any interesting points or questions from the group discussion that didn't fit above:

  • (to be filled in by scribes, if any)