Set Operations - ignacio-alorre/Spark GitHub Wiki
RDDs aren't distinct, so they may content duplicated elements, so they may differ from mathematical set operations in how they handle duplicates. For example, union
merely combines its arguments, so the result of union will always have the size of both RDDs combined. intersection
and subtract
are defined similarly to set counterparts, but since RDDs may have duplicates, results may be unexpected. For example subtract
removes all of the elements in the first RDD that have a key present in the second RDD. So it is possible that by subtracting, the result will be smaller than the size of the first RDD minus the size of the second, breaking one of the laws of set theory.
intersection
co-groups the argument RDDs using their values as keys and filters out those elements that don't appear in both. The result of RDD intersection contains no duplicates.
val a = Array(1, 2, 3, 4, 4, 4, 4)
val b = Array(3, 4)
val rddA = sc.parallelize(a)
val rddB = sc.parallelize(b)
val intersection = rddA.intersection(rddB)
val subtraction= rddA.subtraction(rddB)
val union = rddA.union(rddB)