# Shared Variables
## Broadcast Variables

- Broadcast variables give us a way to take a local value on the driver and distribute a read-only copy to each machine, rather than shipping a new copy with each task.
- The savings of sending only one copy per machine instead of one copy per task can make a huge difference, especially when the same broadcast variable is used in additional transformations.
- A broadcast variable is created by calling `broadcast` on the `SparkContext`. This distributes the value to the workers and gives us back a wrapper that allows us to access the value on the workers (see the sketch after this list).
- The value for a broadcast variable must be a local, serializable value: no RDDs or other distributed data structures.
- If a broadcast variable is no longer needed, you can explicitly remove it by calling `unpersist()` on the broadcast variable.
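A minimal sketch of this lifecycle, assuming a local `SparkSession`; the lookup table `countryCodes` and the sample records are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A local, serializable value on the driver (no RDDs allowed here).
    val countryCodes = Map("ES" -> "Spain", "IE" -> "Ireland", "FR" -> "France")

    // Ship one read-only copy to each machine instead of one per task.
    val bcCodes = sc.broadcast(countryCodes)

    val records = sc.parallelize(Seq("ES", "FR", "XX"))

    // Workers read the value through the wrapper's .value accessor.
    val resolved = records.map(code => bcCodes.value.getOrElse(code, "Unknown"))
    resolved.collect().foreach(println)

    // Explicitly remove the copies from the executors once no longer needed.
    bcCodes.unpersist()

    spark.stop()
  }
}
```

Because the copy on each machine is read-only, any modification made to `bcCodes.value` on a worker is not propagated back to the driver or to other workers.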
## Accumulators

- Accumulators allow us to collect by-product information from a transformation or action on the workers and then bring the result back to the driver.
- Spark adds values to accumulators only once the computation has been triggered (e.g. by an action). If the computation happens multiple times, for instance because a stage is recomputed after a failure, Spark will update the accumulator each time. This can be disastrous for data-related information such as counting the number of invalid records (see the sketch after this list).
- In their current state, accumulators are best used where counting the same update more than once is acceptable behavior; otherwise their results can be unpredictable (as of Spark 2.0).
- Accumulators are not intended for collecting large amounts of information.
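A minimal sketch of the invalid-record counting described above, assuming the `longAccumulator` API available since Spark 2.0; the input data is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AccumulatorExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A named accumulator also shows up in the Spark UI.
    val invalidRecords = sc.longAccumulator("invalidRecords")

    val lines = sc.parallelize(Seq("1", "2", "oops", "4"))

    // By-product information: count lines that fail to parse.
    val parsed = lines.flatMap { line =>
      try {
        Some(line.toInt)
      } catch {
        case _: NumberFormatException =>
          invalidRecords.add(1) // updated on the workers
          None
      }
    }

    // The accumulator is only updated once an action triggers the computation.
    println(s"Sum: ${parsed.sum()}")

    // The result is read back on the driver.
    println(s"Invalid records: ${invalidRecords.value}")

    spark.stop()
  }
}
```

Because the update here happens inside a transformation, it may be applied again if the stage is recomputed (for example after a task failure), so the count can over-report. Updates performed inside actions such as `foreach`, by contrast, are guaranteed by Spark to be applied exactly once per task.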