Minimizing Object Creation
"Garbage collection" is the process of freeing up the memory allocated for an object once that object is no longer needed. Garbage collection can quickly become an expensive part of our Spark job. We can minimize the GC cost by reducing the number of objects and the size of those objects. We can achieve this by reusing existing objects and by using data structures (such as primitive types) that take up less space in memory.
Reusing Existing Objects
Some RDD transformations allow us to modify the parameters in the lambda expression rather than returning a new object. For example, in the sequence function of aggregateByKey and aggregate, we can modify the original accumulator argument, and we can define the combine function so that the combination is created by modifying the first of the two accumulators. A common and effective paradigm for complicated aggregations is to define a Scala class with sequence and combine operations that return the existing object, using the this.type annotation.
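As a minimal sketch of this pattern (the SumAndCount class and the averaging logic are hypothetical, not from the original text), both the sequence and combine operations mutate and return the existing accumulator instead of allocating a new one per record:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical accumulator tracking a running sum and count for averaging.
// sequence and combine mutate the existing object and return it (note the
// this.type return type) rather than allocating a new accumulator per record.
class SumAndCount(var sum: Double = 0.0, var count: Long = 0L) extends Serializable {
  def sequence(value: Double): this.type = {
    sum += value
    count += 1
    this
  }
  def combine(other: SumAndCount): this.type = {
    sum += other.sum
    count += other.count
    this
  }
  def mean: Double = if (count == 0) 0.0 else sum / count
}

object ReuseAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ReuseAccumulator").master("local[*]").getOrCreate()
    val data = spark.sparkContext.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))

    // Both functions mutate and return their first argument, so the same
    // accumulator object can be reused for a key instead of creating a new
    // object on every merge.
    val means = data
      .aggregateByKey(new SumAndCount())(
        (acc, v) => acc.sequence(v),
        (acc1, acc2) => acc1.combine(acc2))
      .mapValues(_.mean)

    means.collect().foreach(println)
    spark.stop()
  }
}
```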
Using Smaller Data Structures
An important way to optimize Spark jobs for both time and space is to stick to primitive types rather than custom classes. Although it may make the code less readable, using arrays rather than case classes or tuples can reduce GC overhead.
Scala arrays are the most memory-efficient of the Scala collection types. Scala tuples are objects, so in some instances it might be better to use a two- or three-element array rather than a tuple for expensive operations. The Scala collection types in general incur higher GC overhead than arrays.
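As a rough sketch (the statsAsTuple and statsAsArray names are hypothetical), the same per-key values could be carried either in a Tuple3, whose elements end up boxed, or in a primitive array:

```scala
import org.apache.spark.rdd.RDD

// Minimal sketch of the trade-off: Tuple3 is not @specialized, so its three
// doubles are stored as boxed references, while Array[Double] keeps them as
// unboxed primitives inside a single object.
object SmallerStructures {
  def statsAsTuple(rdd: RDD[(String, Double)]): RDD[(String, (Double, Double, Double))] =
    rdd.mapValues(v => (v, v, v))        // (min, max, sum)

  def statsAsArray(rdd: RDD[(String, Double)]): RDD[(String, Array[Double])] =
    rdd.mapValues(v => Array(v, v, v))   // positions: 0 = min, 1 = max, 2 = sum
}
```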
Note: Beyond reducing the objects that are directly allocated, Scala's implicit conversions can sometimes cause additional allocations in the process of converting.
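For example, a user-defined implicit conversion like the hypothetical one below allocates a wrapper object every time it fires, even though no constructor call appears at the use site:

```scala
object ImplicitAllocationExample {
  import scala.language.implicitConversions

  case class Celsius(value: Double)

  // Each time this conversion fires, a new Celsius instance is allocated.
  implicit def doubleToCelsius(d: Double): Celsius = Celsius(d)

  def describe(t: Celsius): String = s"${t.value} degrees C"

  def main(args: Array[String]): Unit = {
    // No `new` appears here, but every call allocates a Celsius wrapper
    // through the implicit conversion; inside a per-record Spark closure
    // that adds one extra object per record.
    val labels = (1 to 1000).map(i => describe(i.toDouble))
    println(labels.head)
  }
}
```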