The importance of caching - derlin/bda-lsa-project GitHub Wiki
About caching
There are two ways to cache a Spark collection (RDD, Dataset, DataFrame): `cache` and `persist`.
The difference between `cache` and `persist` is purely syntactic: `cache` is a synonym for `persist()`, i.e. `persist` called with the default storage level, `MEMORY_ONLY` (for RDDs; Datasets and DataFrames default to `MEMORY_AND_DISK`).
Just because you can cache an RDD in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in recomputing it, recomputation can be faster than the price paid by the increased memory pressure.
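This tradeoff can be illustrated outside Spark with a plain-Scala analogy (no Spark required): a `def` is recomputed on every access, while a `lazy val`, like a cached RDD, is computed once and then held in memory. The object and names below are hypothetical, purely for illustration.

```scala
// Plain-Scala analogy for caching (no Spark involved): `everyTime` is
// recomputed on each access, `once` is computed a single time and then
// kept in memory -- trading recomputation cost for memory pressure.
object Recompute {
  var runs = 0  // counts how often the "expensive" computation executes

  private def compute(): Int = { runs += 1; (1 to 100).sum }

  def everyTime: Int = compute()  // like an uncached RDD: re-run per action
  lazy val once: Int = compute()  // like a cached RDD: computed only once
}
```

Accessing `everyTime` twice runs the computation twice; accessing `once` twice runs it only on the first access. Whether holding the result in memory is worth it is exactly the judgment call described above.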
In a nutshell, caching is useful when the lineage (the DAG) is not linear. From this stackoverflow answer:
Only when an action (e.g. `count`) is called on an RDD is the chain of transformations, called the lineage, executed. That is, the data, broken down into partitions, is loaded by the Spark cluster's executors, the transformations are applied and the result is calculated.
On a linear lineage, `cache()` is not needed. It is only useful when the lineage of the RDD branches out. Here is a branching example:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each `filter` triggers a reload of the data. Adding an explicit `cache` statement ensures that the processing done previously (in this case the `flatMap`) is preserved and reused.
In summary, `cache` is said to 'break the lineage', as it creates a checkpoint that can be reused for further processing. Rule of thumb: use `cache` when the lineage of your RDD branches out, or when an RDD is used multiple times, e.g. in a loop.
A live example of the benefits of caching
To preprocess the whole Wikipedia dump, we used the DAPLAB cluster. In one of our first attempts, we didn't really care about caching and got the following:
`CountVectorizer` and `IDF` both work on the same RDD, `filtered`, which contains the term vectors for each document. The problem is, we created this RDD through a long chain of transformations and forgot to call `filtered.cache`. The result: the whole chain of transformations is computed three times, each run taking more than 4 hours.
After adding the single line `filtered.cache()`, we get:
As you can see, the first stage still takes some time, but the two remaining stages then finish in minutes. Viva el caching!! Note: the other differences come from unrelated changes in the code; they are not significant for this example.