MapReduce - nzsaurabh/hadoop_training GitHub Wiki
Why MapReduce?
- Allows data to be processed in parallel across a cluster of servers, so datasets too large for a single machine can be handled.
- Inherits Hadoop's built-in advantages: data replication, resiliency, and availability.
How does it work?
- Has two components - a Mapper and a Reducer
- Mapper:
- converts raw source data into key/value pairs
- Shuffle & Sort (done by the framework between the map and reduce stages): groups the key/value pairs by key and sorts them
- Reducer:
- applies the specified operation to the values of each key, e.g. summing them or padding a string
- output is sorted by the first field, i.e. the key; if key and value are interchanged into value:key, the output will be sorted by the former value
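The map -> shuffle/sort -> reduce flow above can be sketched in plain Python. This is a minimal word-count illustration of the data flow, not actual Hadoop or mrjob code; `mapper`, `reducer`, and `run_job` are illustrative names:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Convert raw source data into key/value pairs: one (word, 1) per word.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Apply the specified operation (here: sum) to the values of each key.
    yield key, sum(values)

def run_job(lines):
    # Map phase: turn raw lines into key/value pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle & sort: the framework groups pairs by key between map and
    # reduce; here simulated with a plain sort plus groupby.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: feed each key and its grouped values to the reducer.
    return [out
            for key, group in groupby(pairs, key=itemgetter(0))
            for out in reducer(key, (v for _, v in group))]

print(run_job(["big data big cluster", "big cluster"]))
# → [('big', 3), ('cluster', 2), ('data', 1)]
```

Note that the reducer sees the keys in sorted order, which is why the output is sorted by the first field.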
- Multiple map/reduce steps can be run in sequence; each step automatically consumes the output of the previous one. For an example, see TopMoviesChallenge_S2L14.py
- sorting is ascending, and keys are compared as strings
- numbers in the first column are therefore treated as character strings, so "10" sorts before "9"
- hence they need to be zero-padded, e.g. `yield str(sum(values)).zfill(5), key`
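A quick demonstration of why the zero-padding matters (plain Python, values chosen for illustration):

```python
# Plain string sort compares character by character, so "10" comes
# before "9" - lexicographic order, not numeric order.
counts = [9, 10, 2]
as_strings = sorted(str(n) for n in counts)
print(as_strings)  # → ['10', '2', '9']

# Zero-padding with zfill makes all strings the same width, so
# lexicographic order matches numeric order.
padded = sorted(str(n).zfill(5) for n in counts)
print(padded)      # → ['00002', '00009', '00010']
```

This is exactly what `str(sum(values)).zfill(5)` does in the reducer: it pads the count so the framework's string sort produces the intended numeric ordering.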