Combiner - MatteoDJon/CloudProgrammingTonellotto GitHub Wiki
Code
job.setCombinerClass(KMeansReducer.class);
Explanation
The task of the combiner is exactly the same as the Reducer's: iterate over all the points associated with a cluster id and emit a single point whose components and count are the sums of all of them. The difference lies in the input the two work on. The Reducer receives as input all the points associated with a cluster id across the whole job; the combiner, on the other hand, receives only the points emitted by a single map task. If the initial file is large enough, it is in fact divided into n input splits, where n is roughly the result of the division (file size / HDFS block size); each of these input splits is processed by one map task, and that task's output is the input of the combiner. In other words, the combiner receives as input all the points assigned to a cluster within a single input split. The combiner then performs the Point "sum" for each cluster id and emits it; the Reducer then only has to sum the partial "sum Points" associated with each cluster id.
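To make the "sum Point" idea concrete, here is a minimal plain-Java sketch of the component-wise accumulation that both the combiner and the Reducer perform. It has no Hadoop dependencies, and the `Point` class with its `components` and `count` fields is an assumption for illustration, not the project's actual Writable class:

```java
// Hypothetical stand-in for the project's point Writable:
// it carries the component-wise sum and how many original
// points that partial sum represents.
class Point {
    double[] components;
    int count;

    Point(double[] components, int count) {
        this.components = components;
        this.count = count;
    }

    // Add another (partial) point into this one, component by component.
    void add(Point other) {
        for (int i = 0; i < components.length; i++) {
            components[i] += other.components[i];
        }
        count += other.count;
    }
}

public class CombinerSketch {
    public static void main(String[] args) {
        // Split 1: its map task emitted two points for the same cluster id;
        // the combiner sums them into one partial "sum Point".
        Point partial1 = new Point(new double[]{1.0, 2.0}, 1);
        partial1.add(new Point(new double[]{3.0, 4.0}, 1)); // now (4, 6), count 2

        // Split 2: only one point for that cluster id.
        Point partial2 = new Point(new double[]{5.0, 6.0}, 1);

        // The Reducer then sums the partial sums for that cluster id...
        partial1.add(partial2); // now (9, 12), count 3

        // ...and can derive the new centroid by dividing by the total count.
        double cx = partial1.components[0] / partial1.count;
        double cy = partial1.components[1] / partial1.count;
        System.out.println(cx + " " + cy); // prints "3.0 4.0"
    }
}
```

Because the same `add` logic serves both roles, a single class can be registered both as the reducer and as the combiner, which is what `job.setCombinerClass(KMeansReducer.class)` does above.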