MapReduce - nzsaurabh/hadoop_training GitHub Wiki
Why MapReduce?
- Allows data to be processed in parallel across a cluster of servers, so datasets too large for a single machine can be handled.
- Inherits Hadoop's built-in advantages: data replication, resiliency, and availability.
How does it work?
- Has two components - a Mapper and a Reducer
- Mapper:
- converts raw source data into key/value pairs
- Shuffle & Sort (done by the framework between the map and reduce stages): groups the key/value pairs by key and sorts them
- Reducer:
- applies the specified operation to the values of each key, e.g. summing them or padding a string
- output is sorted by the first field, i.e. the key; if key and value are interchanged into value:key, the output will be sorted by the former value
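The map -> shuffle/sort -> reduce flow above can be sketched in plain Python. This is a minimal word-count illustration of the data flow, not actual Hadoop or mrjob code; `mapper`, `reducer`, and `run_job` are illustrative names:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Convert raw source data into key/value pairs: one (word, 1) per word.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Apply the specified operation (here: sum) to the values of each key.
    yield key, sum(values)

def run_job(lines):
    # Map phase: turn raw lines into key/value pairs.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle & sort: the framework groups pairs by key between map and
    # reduce; here simulated with a plain sort plus groupby.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: feed each key and its grouped values to the reducer.
    return [out
            for key, group in groupby(pairs, key=itemgetter(0))
            for out in reducer(key, (v for _, v in group))]

print(run_job(["big data big cluster", "big cluster"]))
# → [('big', 3), ('cluster', 2), ('data', 1)]
```

Note that the reducer sees the keys in sorted order, which is why the output is sorted by the first field.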
- Multiple map/reduce steps can be run in sequence; each step automatically consumes the output of the previous one. For an example, see TopMoviesChallenge_S2L14.py
- sorting is ascending, and keys are compared as strings
- numbers in the first column are therefore treated as character strings, so "10" sorts before "9"
- hence they need to be zero-padded, e.g. `yield str(sum(values)).zfill(5), key`
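A quick demonstration of why the zero-padding matters (plain Python, values chosen for illustration):

```python
# Plain string sort compares character by character, so "10" comes
# before "9" - lexicographic order, not numeric order.
counts = [9, 10, 2]
as_strings = sorted(str(n) for n in counts)
print(as_strings)  # → ['10', '2', '9']

# Zero-padding with zfill makes all strings the same width, so
# lexicographic order matches numeric order.
padded = sorted(str(n).zfill(5) for n in counts)
print(padded)      # → ['00002', '00009', '00010']
```

This is exactly what `str(sum(values)).zfill(5)` does in the reducer: it pads the count so the framework's string sort produces the intended numeric ordering.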