Iterator to Iterator Transformations with mapPartitions - ignacio-alorre/Spark GitHub Wiki
- The RDD `mapPartitions` function takes as its argument a function from an iterator of records (representing the records on one partition) to another iterator of records (representing the output partition).
- The `mapPartitions` transformation is one of the most powerful in Spark, since it lets the user define an arbitrary routine on one partition of data.
- The `mapPartitions` transformation can be used for very simple data transformations like string parsing, but it can also be used for complex, expensive data-processing work to solve problems such as secondary sort or highly custom aggregations.
- To give Spark the flexibility to spill some records to disk, it is important to write the functions inside `mapPartitions` in such a way that they do not force loading the entire partition in memory (e.g., by implicitly converting the iterator to a list).
- When a transformation directly takes and returns an iterator without forcing it through another collection, we call it an iterator-to-iterator transformation.
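As a minimal sketch, here is the shape such a function might take (the parsing logic and names are illustrative assumptions, not from the original):

```scala
// A function from Iterator[String] to Iterator[Int]: it never materializes
// the partition, so Spark remains free to stream and spill records.
// (Hypothetical parsing logic, for illustration only.)
def parseLengths(records: Iterator[String]): Iterator[Int] =
  records
    .filter(_.nonEmpty)   // lazily skip blank records
    .map(_.trim.length)   // lazily compute one value per record

// In a Spark job this would be applied per partition, e.g.:
//   rdd.mapPartitions(parseLengths)
// Here we exercise it on a plain Scala iterator:
val out = parseLengths(Iterator("spark", "", "  rdd ")).toList
```

Because `filter` and `map` both return lazy iterators, no step forces the whole partition into memory.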
What is an Iterator-to-Iterator Transformation?
- A Scala iterator object is not actually a collection, but a function that defines a process of accessing the elements in a collection one by one.
- Not only are iterators immutable, but the same element in an iterator can only be accessed one time (an iterator can only be traversed once).
- They have many of the same methods defined on them as other immutable Scala collections, such as mappings (`map` and `flatMap`), additions (`++`), folds (`foldLeft`, `reduceRight`, `reduce`), element conditions (`forall` and `exists`), and traversals (`next` and `foreach`).
- Any method that requires looking at all the elements in the iterator will leave the original iterator empty. It is easy to accidentally consume an iterator by calling a method that traverses it, such as `size`. Converting an iterator into another collection type also requires accessing each of the elements, so after it has been converted to a new collection type, the iterator will be empty.
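A short sketch of how an iterator gets accidentally consumed:

```scala
// Calling size traverses the iterator and consumes it;
// a second traversal then sees nothing.
val it = Iterator(1, 2, 3)
val n = it.size          // traverses all three elements
val leftover = it.toList // the iterator is now empty
```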
- Like an RDD, an iterator is actually a set of evaluation instructions rather than a stored state.
- Some iterator methods, like `next`, `size`, and `foreach`, traverse the iterator and evaluate it, much like RDD actions. Others, like `map` and `flatMap`, return a new iterator, which is really another set of evaluation instructions, much like RDD transformations.
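The laziness of `map` can be observed directly; in this sketch a side-effecting counter (illustrative only) shows that nothing is evaluated until the result is traversed:

```scala
// map on an iterator only queues evaluation instructions, mirroring
// RDD transformations; traversal (like an RDD action) forces them.
var evaluated = 0
val source = Iterator(1, 2, 3)
val mapped = source.map { x => evaluated += 1; x * 2 } // nothing runs yet
val before = evaluated     // still 0: map is lazy
val result = mapped.toList // traversal forces the evaluation
```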
- Much like RDD transformations, iterator transformations are executed linearly, one element at a time, rather than in parallel. This makes iterators slower but much easier to use than if they could be executed in parallel.
- One-to-one functions are also not chained together in iterator operations, so using three `map` calls still requires looking at each element in the iterator three times.
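A counter (an illustrative device, not from the original) makes the three-passes-per-element cost visible:

```scala
// Three separate map calls: when the result is traversed, each of the
// three elements passes through three function applications.
var touches = 0
val result = Iterator(1, 2, 3)
  .map { x => touches += 1; x + 1 }
  .map { x => touches += 1; x * 2 }
  .map { x => touches += 1; x - 1 }
  .toList
// touches ends up at 9: 3 elements looked at 3 times each
```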
-
By iterator-to-iterator transformations we mean using one of these iterator transformations to return a new iterator, rather than:
- converting the iterator to a different collection
- evaluating the iterator with one of the iterator actions and building a new collection
- Note: Using a while loop to traverse the elements of an iterator and build a new collection (even a new iterator) does not qualify as an "iterator-to-iterator" transformation.
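To make the distinction concrete, here is a hedged sketch contrasting the two approaches (the doubling logic is illustrative):

```scala
// NOT iterator-to-iterator: a while loop drains the iterator into a
// buffer, materializing every output record in memory at once.
def doubleEager(it: Iterator[Int]): Iterator[Int] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[Int]
  while (it.hasNext) buf += it.next() * 2
  buf.iterator // an iterator over a fully realized collection
}

// Iterator-to-iterator: map returns a lazy iterator, so records are
// produced one at a time as the consumer asks for them.
def doubleLazy(it: Iterator[Int]): Iterator[Int] =
  it.map(_ * 2)
```

Both return an `Iterator[Int]`, but only the second leaves Spark free to spill individual records rather than holding the whole output in memory.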
Space and Time Advantages
- The primary advantage of using "iterator-to-iterator" transformations in Spark routines is that they allow Spark to selectively spill data to disk.
- Spark can apply the evaluation of elements one at a time to batches of records, rather than reading an entire partition into memory or creating a collection with all of the output records in memory and then returning it. Consequently, "iterator-to-iterator" transformations allow Spark to manipulate partitions that are too large to fit in memory on a single executor without out-of-memory errors.
- Keeping the partition as an iterator allows Spark to use disk space more selectively. Rather than spilling an entire partition when it doesn't fit in memory, the "iterator-to-iterator" transformation allows Spark to spill only those records that do not fit in memory, saving disk I/O and the cost of recomputation.
- Using methods defined on the iterator avoids defining intermediary data structures. Reducing the number of large intermediate data structures is a way to avoid unnecessary object creation, which can slow down garbage collection.
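As a sketch of the intermediate-structure point (filter/length logic is illustrative): chaining iterator methods allocates no per-stage collections, whereas converting to a list first would.

```scala
// Chained iterator methods: elements flow through filter and map on
// demand; no intermediate collection is ever built.
def pipeline(records: Iterator[String]): Iterator[Int] =
  records.filter(_.nonEmpty).map(_.length)

// By contrast, records.toList.filter(...).map(...) would allocate a
// full list per stage, creating garbage proportional to the partition.
val lengths = pipeline(Iterator("a", "", "abc")).toList
```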