reduceByKey vs groupBykey vs aggregateByKey vs combineByKey - vaquarkhan/Apache-Kafka-poc-and-notes GitHub Wiki

There is two different ways to compute counts:

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD .reduceByKey(_ + _) .collect()

​ val wordCountsWithGroup = wordPairsRDD .groupByKey() .map(t => (t._1, t._2.sum)) .collect()

reduceByKey will aggregate y key before shuffling, and groupByKey will shuffle all the value key pairs as the diagrams show. On large size data the difference is obvious.

**reduceByKey **

groupBykey

Here are more functions to prefer over groupByKey: