RDDs
Introduction
- Spark represents large datasets as RDDs (immutable, distributed collections of objects), which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system.
- The Spark execution engine itself distributes data across the executors for a computation.
- Rather than evaluating each transformation as soon as it is specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver), as the sketch below illustrates.
- Spark can keep an RDD loaded in memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations.
- RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the original one.
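To make the laziness concrete, below is a minimal, self-contained sketch (the object name, sample data and `local[*]` master are illustrative assumptions, not part of this wiki). The transformations only record lineage; nothing executes until the action at the end.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazyEvalExample")
      .master("local[*]")                        // local mode, just for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)   // nothing is computed yet
    val squares = numbers.map(n => n.toLong * n) // transformation: only builds lineage
    val evens   = squares.filter(_ % 2 == 0)     // still lazy

    val total = evens.count()                    // action: forces evaluation of the chain
    println(s"Number of even squares: $total")

    spark.stop()
  }
}
```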
Immutability and the RDD Interface
RDDs can be created in three ways:

- By transforming an existing RDD
- From a `SparkContext`, which is the API's gateway to Spark for your application, using the `makeRDD`/`parallelize` methods or by reading from stable storage
- By converting a `DataFrame` or `Dataset`

The `SparkContext` represents the connection between a Spark cluster and one running Spark application.
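The three creation paths might look like the following rough sketch (the input path and sample data are placeholders, and it assumes a `SparkSession` named `spark` already exists):

```scala
import spark.implicits._            // needed for toDS(); assumes an existing SparkSession `spark`
val sc = spark.sparkContext

// 1. From the SparkContext: parallelize a local collection (makeRDD is an alias)
//    or read from stable storage
val fromCollection = sc.parallelize(Seq("a", "b", "c"))
val fromStorage    = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path

// 2. By transforming an existing RDD
val transformed = fromCollection.map(_.toUpperCase)

// 3. By converting a Dataset (or DataFrame) back into an RDD
val ds          = Seq(1, 2, 3).toDS()
val fromDataset = ds.rdd
```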
Spark uses five main properties to represent an RDD:

- [Required] List of partition objects that make up the RDD
- [Required] Function for computing an iterator of each partition
- [Required] List of dependencies on other RDDs
- [Optional] Partitioner, for RDDs of rows of key/value pairs represented as Scala tuples
- [Optional] List of preferred locations

Most likely you will be using predefined RDD transformations rather than implementing these yourself.
These five properties correspond to the following five methods available to the end user (you):

- `partitions()`: Returns an array of the partition objects that make up the parts of the distributed dataset. In the case of an RDD with a partitioner, the value of the index of each partition will correspond to the value of the `getPartition()` function for each key in the data associated with that partition.
- `iterator(p, parentIters)`: Computes the elements of partition `p` given iterators for each of its parent partitions. This function is called in order to compute each of the partitions in this RDD. Note: this method is used by Spark when computing actions, not directly by the user.
- `dependencies()`: Returns a sequence of dependency objects. The dependencies let the scheduler know how this RDD depends on other RDDs. There are two kinds of dependencies:
  - Narrow dependencies, which represent partitions that depend on one or a small subset of partitions in the parent
  - Wide dependencies, which are used when a partition can only be computed by shuffling data in the cluster
- `partitioner()`: Returns a Scala `Option` type of a partitioner object if the RDD has a function between element and partition associated with it, such as a `HashPartitioner`. This function returns `None` for all RDDs that are not of type tuple (do not represent key/value data).
- `preferredLocations(p)`: Returns information about the data locality of a partition `p`. Specifically, this function returns a sequence of strings representing some information about each of the nodes where the split `p` is stored.
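These methods are all part of the public RDD API, so they can be called directly to inspect an RDD. A small sketch, assuming an existing `SparkContext` named `sc` (as in `spark-shell`); the sample data is illustrative:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
              .partitionBy(new HashPartitioner(4))

pairs.partitions.length                         // array of partition objects -> 4 here
pairs.partitioner                               // Some(HashPartitioner) for this key/value RDD
pairs.dependencies                              // how this RDD depends on its parent(s)
pairs.preferredLocations(pairs.partitions(0))   // locality info for the first partition

// iterator(...) is invoked internally by Spark when an action runs;
// users trigger it indirectly, e.g. via collect() or count().
```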
Types of RDDs
- The abstract class `RDD` contains not only the five core functions of RDDs, but also those transformations and actions that are available to all RDDs, such as `map` and `collect`.
- Functions defined only on RDDs of a particular type are defined in several RDD function classes, including `PairRDDFunctions`, `OrderedRDDFunctions` and `GroupedRDDFunctions`.
- You can find out the type of an RDD by using the `toDebugString` function, which is defined on all RDDs. This will tell you what kind of RDD you have and provide a list of its parent RDDs.
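For instance, a pair RDD picks up `reduceByKey` from `PairRDDFunctions` through an implicit conversion, and `toDebugString` prints the RDD type together with its lineage. A sketch, again assuming an existing `sc`; the exact output format varies between Spark versions:

```scala
val words  = sc.parallelize(Seq("spark", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey comes from PairRDDFunctions

println(counts.toDebugString)
// Prints something along the lines of:
// (2) ShuffledRDD[2] at reduceByKey ...
//  +-(2) MapPartitionsRDD[1] at map ...
//     |  ParallelCollectionRDD[0] at parallelize ...
```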
Functions on RDDs: Transformations vs Actions
- Actions: functions that return something that is not an RDD, including those performed only for their side effects
- Transformations: functions that return another RDD
- Each Spark program must contain an action, since actions either bring information back to the driver or write the data to stable storage. Actions are what force evaluation of a Spark program. Persist calls also force evaluation, but usually do not mark the end of the Spark job. Actions that bring data back to the driver include `collect`, `count`, `collectAsMap`, `sample`, `reduce` and `take`.
- Note: some of these actions do not scale well, since they can cause memory errors in the driver. In general, it is best to use actions like `take`, `count` and `reduce`, which bring back a fixed amount of data to the driver, rather than `collect` or `sample`.
- Actions that write to storage include `saveAsTextFile`, `saveAsSequenceFile` and `saveAsObjectFile`.
- Functions that return nothing, such as `foreach`, are also actions, since they force the execution of a Spark job.
- Most of the power of the Spark API is in its transformations. Spark transformations are coarse-grained transformations used to `sort`, `reduce`, `group`, `filter` and `map` distributed data.
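A brief sketch of these points (the input and output paths are placeholders, and an existing `sc` is assumed):

```scala
val logLines = sc.textFile("hdfs:///path/to/logs")      // placeholder path

val errors = logLines.filter(_.contains("ERROR"))       // transformation: returns another RDD

val howMany  = errors.count()                           // action: fixed-size result, safe for the driver
val firstTen = errors.take(10)                          // action: brings back at most 10 records
// val everything = errors.collect()                    // action: brings back *all* records; may OOM the driver

errors.saveAsTextFile("hdfs:///path/to/output")         // action that writes to stable storage
errors.foreach(line => println(line))                   // returns Unit, but still forces a Spark job
```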
Wide vs Narrow Dependencies
- Transformations fall into two categories: transformations with narrow dependencies and transformations with wide dependencies.
- Narrow transformations are those in which each partition in the child RDD has simple, finite dependencies on partitions in the parent RDD. Dependencies are only narrow if they can be determined at design time, irrespective of the value of the records in the parent partitions, and if each parent has at most one child partition.
- Specifically, partitions in narrow transformations can either depend on one parent (as in the `map` operator) or on a unique subset of the parent partitions that is known at design time (`coalesce`).
- Narrow transformations can be executed on an arbitrary subset of the data without any information about the other partitions.
- Wide transformations cannot be executed on arbitrary rows and instead require the data to be partitioned in a particular way, e.g. according to the value of their key. For example, records have to be partitioned so that keys in the same range are on the same partition. Transformations with wide dependencies include `sort`, `reduceByKey`, `groupByKey`, `join` and anything that calls the `repartition` function.
- When Spark already knows the data is partitioned in a certain way, operations with wide dependencies do not cause a shuffle, as the sketch at the end of this section shows.
- Below is a diagram illustrating narrow dependencies, in which each child partition depends on a known subset of parent partitions. The left side represents a dependency graph of narrow transformations (such as `map`, `filter`, `mapPartitions` and `flatMap`). On the upper right are the dependencies between partitions for `coalesce`.
- Note that a transformation can still qualify as narrow if the child partitions may depend on multiple parent partitions, so long as the set of parent partitions can be determined regardless of the values of the data in the partitions.
- The diagram below shows wide dependencies between partitions. In this case the child partitions depend on an arbitrary set of parent partitions. The wide dependencies cannot be known fully before the data is evaluated. In contrast to the `coalesce` operation, data is partitioned according to its value. The `join` function can have wide or narrow dependencies depending on how the two parent RDDs are partitioned.
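Tying these points together, here is a minimal sketch (assuming an existing `sc`; the sample data is illustrative) of a narrow dependency followed by a wide one, and of how pre-partitioning the data lets the wide operation reuse the existing partitioning instead of shuffling:

```scala
import org.apache.spark.HashPartitioner

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val doubled = pairs.mapValues(_ * 2)       // narrow: each child partition depends on one parent partition
val summed  = doubled.reduceByKey(_ + _)   // wide: records must be shuffled by key

println(summed.toDebugString)              // the ShuffledRDD marks the wide dependency

// If the data is already partitioned by key with a known partitioner,
// reduceByKey can reuse that partitioning and avoid a shuffle:
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))
val noShuffle      = prePartitioned.reduceByKey(_ + _)
```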