MapReduce - nzsaurabh/hadoop_training GitHub Wiki

Why MapReduce?

  • Allows data to be processed across a cluster of many servers, so datasets too large for a single machine (big data) can be handled.
  • Inherits Hadoop's built-in advantages: replication (backup), resiliency, and availability.

How does it work?

  • Has two user-supplied components - a Mapper and a Reducer
  • Mapper:
    • converts raw source data into key/value pairs
    • between the map and reduce phases, the framework shuffles (groups) and sorts the key/value pairs by key
  • Reducer:
    • applies the specified operation to the values grouped under each key, e.g. summing them, or padding a string
    • receives keys in sorted order, i.e. if the columns are interchanged into value:key, the output will be sorted by what was the value
    • Multiple map/reduce steps can be run in sequence; each step automatically takes the previous step's output as its input. For an example, see TopMoviesChallenge_S2L14.py
  • sorting is ascending, and keys are compared as strings
    • numbers in the key column are treated as character strings, not numeric values
    • hence they need to be left-padded with zeros, e.g. `yield str(sum(values)).zfill(5), key`
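The map → shuffle/sort → reduce flow above can be sketched in plain Python. This is a hypothetical word-count simulation for illustration only (no Hadoop or mrjob involved); the function names and sample data are made up:

```python
# Simulated MapReduce: map each line to (key, value) pairs, then
# shuffle/sort by key, then reduce the grouped values per key.
from itertools import groupby

def mapper(line):
    # Map phase: convert a raw line of text into (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce phase: apply an operation (here, sum) to one key's values.
    yield key, sum(values)

def run_mapreduce(lines):
    # Shuffle & sort: the framework groups and sorts pairs by key
    # between the map and reduce phases.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.extend(reducer(key, (v for _, v in group)))
    return results

data = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(data))
# → [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

On a real cluster the mapper and reducer run in parallel on many machines and the framework performs the shuffle; the logic per function is the same.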
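The zero-padding point is worth seeing concretely: when numbers are sorted as strings, "100" sorts before "99". A short demonstration with made-up counts, using the same `zfill` trick as the example above:

```python
# Why zero-padding is needed: string sort compares character by
# character, so numeric order is lost unless widths are equalized.
counts = [3, 25, 100, 9]

as_strings = sorted(str(n) for n in counts)
# String sort gives ['100', '25', '3', '9'] - numeric order is lost.

padded = sorted(str(n).zfill(5) for n in counts)
# Left-padding with zeros gives ['00003', '00009', '00025', '00100'],
# which restores numeric order.
print(as_strings)
print(padded)
```

This is why the reducer yields `str(sum(values)).zfill(5)` rather than the raw number when the sum is used as a sort key.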