RDD KeyValue Pairs
- A few special operations are only available on RDDs of key-value pairs.
- The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements by key.
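- These operations work on RDDs whose elements are two-element tuples of the form (key, value). For example, to count how often each number occurs: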
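    lines = sc.parallelize([1, 2, 2, 3, 3, 4, 4, 1])  # sc is an existing SparkContext
    pairs = lines.map(lambda d: (d, 1))               # turn each element into a (key, 1) pair
    counts = pairs.reduceByKey(lambda a, b: a + b)    # add up the 1s for each key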
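
To make the semantics of `reduceByKey` concrete, here is a minimal single-machine sketch in plain Python. The helper `reduce_by_key_local` is a hypothetical name introduced only for illustration and is not part of the Spark API. It also ignores partitioning: Spark combines values within each partition before shuffling, which is why the function passed to `reduceByKey` should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key_local(pairs, func):
    """Hypothetical local stand-in for reduceByKey (not a Spark API)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        # Collect all values that share a key; in Spark this regrouping
        # across machines is what the "shuffle" does.
        grouped[key].append(value)
    # Fold each key's values into a single result with the supplied function.
    return {key: reduce(func, values) for key, values in grouped.items()}


data = [1, 2, 2, 3, 3, 4, 4, 1]
pairs = [(d, 1) for d in data]                      # same (key, 1) pairs as above
counts = reduce_by_key_local(pairs, lambda a, b: a + b)
print(sorted(counts.items()))                       # [(1, 2), (2, 2), (3, 2), (4, 2)]
```

In Spark itself, `counts.collect()` would return the same (key, count) pairs, though the order of the results is not guaranteed.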