reduceByKey vs groupByKey vs aggregateByKey vs combineByKey
There are two different ways to compute the word counts:
```scala
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD
  .reduceByKey(_ + _)
  .collect()

val wordCountsWithGroup = wordPairsRDD
  .groupByKey()
  .map(t => (t._1, t._2.sum))
  .collect()
```
reduceByKey aggregates the values for each key on each partition before shuffling, while groupByKey shuffles every key-value pair across the network, as the diagrams below show. On large datasets the difference in the amount of shuffled data is substantial.
(Diagram: **reduceByKey** — values are combined on each partition before the shuffle.)

(Diagram: **groupByKey** — all key-value pairs are shuffled.)
Here are more functions to prefer over groupByKey:
- combineByKey can be used when you are combining elements, but your return type differs from your input value type (see the first sketch below this list).
- foldByKey merges the values for each key using an associative function and a neutral "zero value" (see the second sketch below).
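A minimal sketch of combineByKey, reusing wordPairsRDD from the first example: it computes a per-key average, where the combiner type (a (sum, count) pair) differs from the Int input values. With this toy data every value is 1, so the point is the shape of the three functions, not the result:

```scala
// Combiners are (sum, count) pairs, so the result type differs from the
// Int input values.
val avgByKey = wordPairsRDD
  .combineByKey(
    (v: Int) => (v, 1),                                          // createCombiner: first value seen for a key
    (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue: fold another value into the combiner
    (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: merge combiners across partitions
  )
  .mapValues { case (sum, count) => sum.toDouble / count }
  .collect()
```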
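And a minimal sketch of foldByKey on the same RDD: it behaves like reduceByKey but takes the neutral zero value explicitly (0 for addition), producing the same word counts:

```scala
// Equivalent to reduceByKey(_ + _), with the zero value made explicit.
// Like reduceByKey, values are combined on each partition before the shuffle.
val wordCountsWithFold = wordPairsRDD
  .foldByKey(0)(_ + _)
  .collect()
```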
More on aggregateByKey vs combineByKey:

- http://apache-spark-user-list.1001560.n3.nabble.com/aggregateByKey-vs-combineByKey-td15321.html
- http://stackoverflow.com/questions/24804619/how-does-spark-aggregate-function-aggregatebykey-work
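Both links discuss aggregateByKey, which the examples above don't show. As a minimal sketch on the same wordPairsRDD, it can express the same word count: like combineByKey, the result type may differ from the input value type, but the initial accumulator is passed as an explicit zero value rather than built by a createCombiner function:

```scala
// seqOp combines a value into the accumulator within a partition;
// combOp merges accumulators across partitions. Here both are addition,
// making this equivalent to reduceByKey(_ + _).
val wordCountsWithAggregate = wordPairsRDD
  .aggregateByKey(0)(_ + _, _ + _)
  .collect()
```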