Things which should be fit somewhere - ignacio-alorre/Spark GitHub Wiki
Main features of Spark
- Open Source
- Allows execution in Parallel
- Scalable
- In-memory computing
- Powerful caching
- Can be used for Real-time
Spark Eco-System
Why RDD?
In iterative distributed computing, that is processing the data over multiple jobs, we need to reuse or share the data among multiple jobs. The replications and serializations make the process much slower. This is avoid with in-memory data sharing
What is RDD?
RDD represent a collection of items distributed across many compute nodes that can be manipulated in parallel. They are Spark's main programming abstraction.
Each dataset presented in an RDD is devided into logical partitions which may be computed in different nodes inside the cluster. So it is possible to compute actions or transformation on the entire dataset parallely, leaving Spark to take care of the distribution.
RDD are resilient, and can recover from failure, sicne same data is replicated among multiple executor nodes.
RDD are immutable, that is once created can not be modified
Features of RDD
Ways to create RDD
- Parallelized Collections
val data = Array(1, 2, 3, 4, 5)
// Parallelize the data using 5 partitions
val distData = sc.parallelize(data,5)
- From RDDs
val arr = Array(1,2,3,4,5)
val myRDD = sc.parallelize(arr)
val newRDD = myRDD .map(element => (element*2))
- External Data
val distFile = sc.textFile("data.txt")
RDD Operation
RDD support two type of operations:
- Transformations
- Actions
RDD Transformation
Spark Transformation is a function that produces new RDD from the existing RDDs. It takes RDD as input and produces one or more RDD as output.
Applying transformation built an RDD lineage, with the entire parent RDDs of the final RDD(s). RDD lineage, also known as RDD operator graph or RDD dependency graph. It is a logical execution plan i.e., it is Directed Acyclic Graph (DAG) of the entire parent RDDs of RDD.
Transformations are lazy in nature, they get executed when we call an action.
After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. flatMap, union, cartesian) or same size (e.g. map)
There are two types of transformations:
- Narrow transformation - All the elements that are required to compute the records in single partition live in the singlepartition of the parent RDD. A limited subset of partition is used to calculate the result.
- Wide transformation - In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD.