RDDs

Introduction

  • Spark represents large datasets as RDDs [immutable, distributed collections of objects] which are stored in the executors (or slave nodes). The objects that comprise RDDs are called partitions and may be (but do not need to be) computed on different nodes of a distributed system.

  • The Spark execution engine itself distributes data across the executors for a computation.

  • Rather than evaluating each transformation as soon as specified by the driver program, Spark evaluates RDDs lazily, computing RDD transformations only when the final RDD data needs to be computed (often by writing out to storage or collecting an aggregate to the driver).

  • Spark can keep an RDD loaded in-memory on the executor nodes throughout the life of a Spark application for faster access in repeated computations.

  • RDDs are immutable, so transforming an RDD returns a new RDD rather than modifying the original one
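
A minimal sketch of these points (the application name, master URL and numbers are illustrative only): transforming an RDD leaves the original untouched, nothing runs until an action is called, and cache() asks Spark to keep the result in executor memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
val sc   = new SparkContext(conf)

val numbers = sc.parallelize(1 to 1000)   // original RDD; it is never modified
val doubled = numbers.map(_ * 2)          // transformation: returns a *new* RDD, evaluated lazily
doubled.cache()                           // ask Spark to keep this RDD in executor memory for reuse

// Nothing has been computed yet; the actions below force evaluation
println(doubled.count())                  // 1000
println(numbers.first())                  // 1 -- the original RDD is unchanged
```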

Immutability and the RDD Interface

RDDs can be created in three ways (sketched in code after this list):

  • By transforming an existing RDD
  • From a SparkContext, which is the API's gateway to Spark for your application, either with the makeRDD/parallelize methods or by reading from stable storage
  • Converting a DataFrame or Dataset

The SparkContext represents the connection between a Spark cluster and one running Spark application.
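
A short sketch of the three creation paths, assuming a local SparkSession and an illustrative file path (data.txt) that would have to exist in practice:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-creation").master("local[*]").getOrCreate()
val sc    = spark.sparkContext

// 1) From the SparkContext: parallelize an in-memory collection, or read from stable storage
val fromCollection = sc.parallelize(Seq("a", "b", "c"))
val fromStorage    = sc.textFile("data.txt")            // illustrative path

// 2) By transforming an existing RDD
val transformed = fromCollection.map(_.toUpperCase)

// 3) By converting a DataFrame or Dataset
val df            = spark.range(10).toDF("id")
val fromDataFrame = df.rdd
```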

Spark uses five main properties to represent an RDD.

  • [Required] List of partition objects that make up the RDD
  • [Required] Function for computing an iterator of each partition
  • [Required] List of dependencies on other RDDs
  • [Optional] Partitioner, for RDDs of rows of key/value pairs represented as Scala tuples
  • [Optional] List of preferred locations

Most likely you will be using predefined RDD transformations.

These five properties correspond to the following five methods available to the end user (you); the sketch after this list shows how to query them:

  • partitions(): Returns an array of the partition objects that make up the parts of the distributed dataset. For an RDD with a partitioner, the index of each partition corresponds to the value the getPartition() function returns for the keys in the data associated with that partition

  • iterator(p, parentIters): Computes the elements of partition p given iterators for each of its parent partitions. This function is called in order to compute each of the partitions in this RDD. Note: This method is used by Spark when computing actions, not directly by the user.

  • dependencies(): Returns a sequence of dependency objects. The dependencies let the scheduler know how this RDD depends on other RDDs. There are two kinds of dependencies:

    • narrow dependencies, which represent partitions that depend on one or a small subset of partitions in the parent
    • wide dependencies, which are used when a partition can only be computed by shuffling data in the cluster
  • partitioner(): Returns a Scala Option type of a partitioner object if the RDD has a function associating elements with partitions, such as a HashPartitioner. This function returns None for all RDDs that are not of type tuple (do not represent key/value data)

  • preferredLocations(p): Returns information about the data locality of a partition p. Specifically, this function returns a sequence of strings representing some information about each of the nodes where the split p is stored.
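
A quick sketch of querying these properties on a small pair RDD (the data and partition count are illustrative only; iterator() is omitted because, as noted above, it is used by Spark internally rather than by the user):

```scala
import org.apache.spark.HashPartitioner

// Assumes a SparkContext `sc`, e.g. the one created in the earlier sketch
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
             .partitionBy(new HashPartitioner(2))

println(pairs.partitions.length)   // the partition objects that make up the RDD -> 2
println(pairs.dependencies)        // sequence of Dependency objects (here a ShuffleDependency)
println(pairs.partitioner)         // Some(HashPartitioner), since this is a keyed RDD
pairs.partitions.foreach(p => println(pairs.preferredLocations(p)))  // locality info; often empty when running locally
```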

Types of RDDs

  • The abstract class RDD contains not only the five core functions of RDDs, but also those transformations and actions that are available to all RDDs, such as map and collect.

  • Functions defined only on RDDs of a particular type are defined in several RDD function classes, including PairRDDFunctions, OrderedRDDFunctions and GroupedRDDFunctions.

  • You can find out the type of an RDD by using the toDebugString function, which is defined on all RDDs. This will tell you what kind of RDD you have and provide a list of its parent RDDs
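
For example, assuming a SparkContext `sc` (such as the one spark-shell provides), a shuffled word count prints a lineage like the one sketched below; the exact RDD ids and call sites will vary:

```scala
val words  = sc.parallelize(Seq("spark", "rdd", "spark"))
val counts = words.map((_, 1)).reduceByKey(_ + _)

println(counts.toDebugString)
// Typical (abbreviated) output -- a ShuffledRDD whose parents are the map and parallelize steps:
// (2) ShuffledRDD[3] at reduceByKey ...
//  +-(2) MapPartitionsRDD[2] at map ...
//     |  ParallelCollectionRDD[1] at parallelize ...
```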

Functions on RDDs: Transformations vs Actions

  • Actions: functions that return something other than an RDD, including functions executed only for their side effects

  • Transformations: functions that return another RDD (see the sketch after this list)

  • Each Spark program must contain an action, since actions either bring information back to the driver or write the data to stable storage. Actions are what force evaluation of a Spark program. Persist calls also force evaluation, but usually do not mark the end of the Spark job. Actions that bring data back to the driver include collect, count, collectAsMap, sample, reduce and take

  • Note: Some of these actions do not scale well, since they can cause memory errors in the driver. In general, it is best to use actions like take, count and reduce, which bring back a fixed amount of data to the driver, rather than collect or sample

  • Actions that write to storage include saveAsTextFile, saveAsSequenceFile and saveAsObjectFile.

  • Functions that return nothing, such as foreach, are also actions, since they force the execution of a Spark job.

  • Most of the power of the Spark API is in its transformations. Spark transformations are coarse-grained transformations used to sort, reduce, group, filter and map distributed data
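
A sketch contrasting the two kinds of functions, again assuming a SparkContext `sc` and purely illustrative input/output paths:

```scala
// "data.txt" and "out" are illustrative paths
val lines    = sc.textFile("data.txt")
val longOnes = lines.filter(_.length > 10)    // transformation: returns an RDD, nothing runs yet
val upper    = longOnes.map(_.toUpperCase)    // transformation: still lazy

val n   = upper.count()                       // action: forces evaluation, returns a count to the driver
val few = upper.take(5)                       // action: brings back a small, fixed amount of data
upper.saveAsTextFile("out")                   // action: writes the RDD to stable storage
upper.foreach(println)                        // action: returns nothing, but still runs a Spark job
```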

Wide vs Narrow Dependencies

  • Transformations fall into two categories: transformations with narrow dependencies and transformations with wide dependencies

  • Narrow transformations are those in which each partition in the child RDD has simple, finite dependencies on partitions in the parent RDD. Dependencies are only narrow if they can be determined at design time, irrespective of the value of the records in the parent partitions, and if each parent has at most one child partition.

  • Specifically, partitions in narrow transformations can either depend on one parent (as in the map operator) or on a unique subset of the parent partitions that is known at design time (coalesce).

  • Narrow Transformations can be executed on an arbitrary subset of the data without any information about the other partitions.

  • Wide Transformations cannot be executed on arbitrary rows and instead require the data to be partitioned in a particular way, e.g. according to the value of their key. For example, records may have to be partitioned so that keys in the same range end up on the same partition. Transformations with wide dependencies include sort, reduceByKey, groupByKey, join and anything that calls the repartition function

  • When Spark already knows the data is partitioned in a certain way, operations with wide dependencies do not cause a shuffle.

  • Below we have a diagram which illustrates narrow dependencies, in which each child partition depends on a known subset of parent partitions. The left side represents a dependency graph of narrow transformations (such as map, filter, mapPartitions and flatMap). On the upper right are the dependencies between partitions for coalesce.

  • Note that a transformation can still qualify as narrow if the child partitions may depend on multiple parent partitions, so long as the set of parent partitions can be determined regardless of the values of the data in the partitions

  • The below diagram shows wide dependencies between partitions. In this case the child partitions depend on an arbitrary set of parent partitions. The wide dependencies cannot be known fully before the data is evaluated. In contrast to the coalesce operation, data is partitioned according to its value.
  • The join function can have wide or narrow dependencies depending on how the two parent RDDs are partitioned, as the sketch below illustrates
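
A sketch of the distinction, assuming a SparkContext `sc`: map and coalesce produce narrow dependencies, reduceByKey produces a wide one, and a join over two RDDs that already share the same partitioner avoids a further shuffle.

```scala
import org.apache.spark.HashPartitioner

val nums = sc.parallelize(1 to 100, 4)

val narrow1 = nums.map(_ + 1)             // narrow: each child partition depends on exactly one parent partition
val narrow2 = narrow1.coalesce(2)         // narrow: depends on a subset of parents known at design time

val pairs = nums.map(n => (n % 10, n))
val wide  = pairs.reduceByKey(_ + _)      // wide: equal keys must be shuffled onto the same partition

// join is wide in general, but narrow when both parents already share the same partitioner
val part   = new HashPartitioner(4)
val left   = pairs.partitionBy(part)                     // one shuffle to establish the partitioning
val right  = pairs.mapValues(_ * 2).partitionBy(part)    // same partitioner on the other side
val joined = left.join(right)                            // co-partitioned: no additional shuffle is needed
```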