Lazy Evaluation - ayushmathur94/Spark GitHub Wiki

Lazy Evaluation

Lazy evaluation means that Spark does not evaluate each transformation as it is called. Instead, it queues the transformations together and evaluates them all at once when an action is called.

All transformations in Spark are lazy, i.e. they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file [metadata]). The transformations are only computed when an action requires a result to be returned to the driver program.
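The deferral described above can be sketched with a toy model in plain Python. This is not the Spark API: `LazyDataset`, `double`, and `calls` are hypothetical names used only for illustration, and a plain list stands in for the base dataset.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self.source = source          # base dataset (here: a plain list)
        self.steps = tuple(steps)     # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: only remember the request and return a new dataset.
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: likewise recorded, not executed.
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def collect(self):
        # Action: now every recorded step is applied, in order.
        data = list(self.source)
        for kind, fn in self.steps:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records each time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3]).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = ds.collect()              # the action triggers the computation
```

In this sketch, `double` is never invoked while the chain is being built. It runs only once `collect()` plays the role of the action.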

This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
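A rough analogy in plain Python, assuming nothing about Spark itself: the built-in `map` is also lazy, so feeding it into `functools.reduce` produces a single value without ever building the full mapped collection. The names `line_length`, `lines`, and `evaluated` are made up for this sketch.

```python
from functools import reduce

evaluated = []  # tracks which elements the mapped function has touched


def line_length(line):
    evaluated.append(line)
    return len(line)


lines = ["spark", "is", "lazy"]

# Python's built-in map() is itself lazy: this builds an iterator, no work yet.
lengths = map(line_length, lines)
touched_before_reduce = len(evaluated)

# The reduce pulls values through the map one at a time and keeps only a
# running total, so the full mapped dataset is never materialized.
total_length = reduce(lambda a, b: a + b, lengths)
```

Only `total_length`, a single integer, reaches the caller here, which mirrors the idea of returning just the reduce result to the driver.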

By default, each transformed RDD may be recomputed each time we run an action on it. However, we may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time we query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
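The difference between recomputing and persisting can be illustrated with another toy model. `ToyRDD`, its `runs` counter, and the `total()` action are hypothetical and exist only in this sketch. Spark's real persist and cache methods work on distributed data and support several storage levels, which are not modelled here.

```python
class ToyRDD:
    """Toy model of recomputation versus persistence (not the Spark API)."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn
        self.runs = 0             # how many times the data was computed
        self._persist = False
        self._cached = None

    def persist(self):
        # Only marks the dataset; the data is stored on the first action.
        self._persist = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached   # reuse the stored elements
        self.runs += 1
        data = [self.fn(x) for x in self.source]
        if self._persist:
            self._cached = data
        return data

    def count(self):              # action
        return len(self._compute())

    def total(self):              # action
        return sum(self._compute())


plain = ToyRDD([1, 2, 3], lambda x: x * x)
plain.count()
plain.total()                     # second action recomputes everything

kept = ToyRDD([1, 2, 3], lambda x: x * x).persist()
kept.count()
kept_total = kept.total()         # second action reuses the stored data
```

After both actions, `plain.runs` is 2 while `kept.runs` is 1, which is the saving that persistence is meant to provide.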

Lazy evaluation means that when we call a transformation on an RDD (e.g. calling map()), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Only an action triggers the actual computation.

Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data, which we build up through transformations.
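One way to picture "an RDD as instructions" is a chain of small objects, each of which knows its parent and how to derive its own data from it. The sketch below is a plain-Python illustration, not how Spark is implemented. `Recipe`, `from_list`, and `lineage` are hypothetical names, and the extra `name` argument exists only to make the printed plan readable.

```python
class Recipe:
    """Toy dataset that holds instructions for producing data, not the data."""

    def __init__(self, description, parent=None, step=None):
        self.description = description
        self.parent = parent      # the dataset this one is derived from
        self.step = step          # how to derive this data from the parent's

    @staticmethod
    def from_list(items):
        return Recipe(f"source({len(items)} items)", step=lambda _: list(items))

    def map(self, fn, name):
        return Recipe(f"map({name})", self, lambda data: [fn(x) for x in data])

    def filter(self, fn, name):
        return Recipe(f"filter({name})", self,
                      lambda data: [x for x in data if fn(x)])

    def lineage(self):
        # Walk back to the base dataset; no element is computed here.
        node, steps = self, []
        while node is not None:
            steps.append(node.description)
            node = node.parent
        return list(reversed(steps))

    def collect(self):
        # Action: follow the instructions from the base dataset forward.
        data = self.parent.collect() if self.parent is not None else None
        return self.step(data)


base = Recipe.from_list([1, 2, 3, 4])
even_squares = (base.map(lambda x: x * x, "square")
                    .filter(lambda x: x % 2 == 0, "is_even"))

plan = even_squares.lineage()     # inspect the instructions without running them
values = even_squares.collect()   # run them
```

Here `plan` lists the recorded steps in order, and `base` itself is left unchanged by the transformations derived from it.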

For Understanding

Common Transformations and Actions

The two most common transformations are map() and filter().

The map() transformation takes in a function and applies it to each element in the RDD, with the result of the function being the new value of each element in the resulting RDD.

The filter() transformation takes in a function and returns an RDD that only has elements that pass the filter() function.
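The semantics of the two transformations can be tried out with Python's built-in `map` and `filter` on an ordinary list. This is a local analogy with made-up sample values, not Spark code. In PySpark the corresponding calls would be `rdd.map(...)` and `rdd.filter(...)`, followed by an action such as `collect()`.

```python
nums = [1, 2, 3, 4]

# map(): apply a function to every element; one output per input.
squared = list(map(lambda x: x * x, nums))

# filter(): keep only the elements for which the function returns True.
not_one = list(filter(lambda x: x != 1, nums))
```

With this input, `squared` holds one value per original element, while `not_one` holds only the elements that passed the test. The original `nums` list is unchanged in both cases.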

