RDD KeyValue Pairs - awantik/spark GitHub Wiki

  • A few special operations are only available on RDDs of key-value pairs.

  • The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

  • These operations work on RDDs whose elements are two-element tuples of the form (key, value).

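    # Assumes `sc` is an existing SparkContext, as in the pyspark shell.
    lines = sc.parallelize([1, 2, 2, 3, 3, 4, 4, 1])
    pairs = lines.map(lambda d: (d, 1))             # pair each element with a count of 1
    counts = pairs.reduceByKey(lambda a, b: a + b)  # add up the counts for each key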
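
To make the result of the example above concrete without a running cluster, the following is a minimal plain-Python sketch of what reduceByKey computes. The helper `reduce_by_key` is a hypothetical local function written for illustration, not part of the Spark API, and it ignores partitioning and shuffling entirely. It only shows the per-key fold.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey (illustrative only, not Spark code).

    Folds together the values of all (key, value) tuples that share a key,
    using the two-argument function `func`.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            # Key seen before: merge the new value into the running result.
            result[key] = func(result[key], value)
        else:
            # First occurrence of this key: its value is the starting point.
            result[key] = value
    return result


data = [1, 2, 2, 3, 3, 4, 4, 1]
pairs = [(d, 1) for d in data]                      # same shape as the mapped RDD
counts = reduce_by_key(pairs, lambda a, b: a + b)   # sum the 1s per key
print(sorted(counts.items()))  # [(1, 2), (2, 2), (3, 2), (4, 2)]
```

In Spark itself the function passed to reduceByKey should be associative and commutative, because partial results are merged within and across partitions in no guaranteed order. The sequential loop here does not have that constraint, which is one way the sketch is simpler than the real operation.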