Things which should be fit somewhere - ignacio-alorre/Spark GitHub Wiki

Main features of Spark

Open Source
Allows execution in parallel
Scalable
In-memory computing
Powerful caching
Can be used for real-time processing

Spark Eco-System

Why RDD?

In iterative distributed computing, where data is processed across multiple jobs, the same data has to be reused or shared between those jobs. Writing the data out between jobs involves replication and serialization, which makes the process much slower. In-memory data sharing avoids this.
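The reuse described above can be sketched with `cache()`. This is only an illustration: the dataset (squares of 1 to 1000), the application name, and the `local[2]` master are assumptions made for the example, and `SparkContext.getOrCreate` simply reuses the shell's context if one already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("rdd-reuse-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical dataset: the squares of 1..1000
val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n)

// Ask Spark to keep the computed partitions in memory once a job has produced them
squares.cache()

// Two separate jobs share the same RDD; the second can read the cached partitions
val total = squares.sum()
val evenCount = squares.filter(_ % 2 == 0).count()
```

Without the `cache()` call both jobs would still return the same values, but the second one would recompute `squares` from scratch.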

What is RDD?

An RDD (Resilient Distributed Dataset) represents a collection of items distributed across many compute nodes that can be manipulated in parallel. RDDs are Spark's main programming abstraction.

Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Actions and transformations can therefore run on the entire dataset in parallel, leaving Spark to take care of the distribution.

RDDs are resilient: they can recover from failure because each RDD records the lineage of transformations used to build it, so Spark can recompute a lost partition from its parent RDDs. Replicating the data across executor nodes is optional and happens only when a replicated storage level is requested.

RDDs are immutable: once created, an RDD cannot be modified.
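A small sketch of how partitions can be inspected, assuming a local context and a toy range of 12 numbers split into 4 partitions (both chosen only for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Split 1..12 into 4 logical partitions
val numbers = sc.parallelize(1 to 12, 4)
val partitionCount = numbers.getNumPartitions

// glom() turns each partition into an array so the layout can be looked at
val layout = numbers.glom().collect().map(_.toList).toList

// mapPartitions calls the function once per partition; partitions are processed independently
val partialSums = numbers.mapPartitions(it => Iterator(it.sum)).collect().toList
```

How the elements are spread over the partitions is decided by Spark, so the sketch only relies on the number of partitions and on the fact that together they cover the whole dataset.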
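Immutability can be observed directly: a transformation hands back a new RDD and the original keeps its contents. A minimal sketch, assuming a local context and a three-element toy dataset:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("immutability-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val original = sc.parallelize(Seq(1, 2, 3))

// map does not change `original`; it describes a new RDD derived from it
val doubled = original.map(_ * 2)

val originalValues = original.collect().toList
val doubledValues = doubled.collect().toList
```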

Features of RDD

Ways to create RDD

Parallelized Collections

// `sc` is the SparkContext that spark-shell provides
val data = Array(1, 2, 3, 4, 5)
// Parallelize the data using 5 partitions
val distData = sc.parallelize(data, 5)

From RDDs

val arr = Array(1, 2, 3, 4, 5)
val myRDD = sc.parallelize(arr)
// map is a transformation: it returns a new RDD built from myRDD
val newRDD = myRDD.map(element => element * 2)

External Data

// Each line of the file becomes one element of the RDD
val distFile = sc.textFile("data.txt")

RDD Operation

RDDs support two types of operations:

Transformations
Actions

RDD Transformation

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output.

Applying transformations builds an RDD lineage that contains all the parent RDDs of the final RDD(s). The RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD.

Transformations are lazy: they are executed only when an action is called.

After a transformation, the resulting RDD is always a different RDD from its parent. It can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).
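The lineage of an RDD can be printed with `toDebugString`, and its direct parents are available through `dependencies`. The sketch below assumes a local context and a tiny word list invented for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical three-step pipeline: words -> (word, 1) pairs -> counts per word
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)

// Textual view of the chain of parent RDDs behind `counts`
val lineage = counts.toDebugString
println(lineage)

// The direct parent of `counts` is the RDD it was derived from
val parentOfCounts = counts.dependencies.head.rdd
```

The exact text printed by `toDebugString` varies between Spark versions, but it lists one entry per RDD in the chain.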
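One way to see the laziness is to define a transformation that would fail on some element: nothing goes wrong until an action forces the computation. This is a sketch under assumed local settings; the failing division is deliberate and Spark will log the task error when the action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("laziness-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Defining the transformation does not evaluate 10 / 0, so no error is raised here
val risky = sc.parallelize(Seq(1, 2, 0)).map(n => 10 / n)

// The action runs the map; the division by zero only surfaces now
val failedOnAction = Try(risky.collect()).isFailure
```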
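The size changes can be checked on a toy dataset. In this sketch `count()` is used only as the action that measures each result; the five-element input and the local context are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("size-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val base = sc.parallelize(Seq(1, 2, 2, 3, 4))

val filteredSize = base.filter(_ % 2 == 0).count()      // smaller: only the even values remain
val distinctSize = base.distinct().count()              // smaller: the duplicate 2 is dropped
val flatMappedSize = base.flatMap(n => Seq(n, n)).count() // bigger: every element is emitted twice
val unionSize = base.union(base).count()                // bigger: union keeps duplicates
val mappedSize = base.map(_ + 1).count()                // same size: one output per input
```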

There are two types of transformations:

Narrow transformation - All the elements required to compute the records in a single partition live in a single partition of the parent RDD, so only a limited subset of partitions is used to calculate the result (e.g. map, filter).

Wide transformation - The elements required to compute the records in a single partition may live in many partitions of the parent RDD, so the data has to be shuffled between partitions (e.g. groupByKey, reduceByKey).
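The difference shows up in the dependency objects Spark attaches to each RDD. The sketch below compares `mapValues` (narrow) with `reduceByKey` (wide) on a small key-value dataset; the data, the partition count, and the local context are assumptions made for illustration.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

// Assumed local setup for the sketch; in spark-shell the existing context is reused
val conf = new SparkConf().setAppName("dependency-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

// Narrow: each output partition is computed from exactly one parent partition
val scaled = pairs.mapValues(_ * 10)

// Wide: the values of one key may sit in any parent partition, so a shuffle is needed
val summed = pairs.reduceByKey(_ + _)

val scaledIsNarrow = scaled.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])
val summedNeedsShuffle = summed.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])
val sums = summed.collect().toMap
```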