Apache Spark Architecture
This page explains the Spark architecture.
Spark is based on two main abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
What is RDD?
RDDs are the building blocks of any Spark application. RDD stands for:
- Resilient: Fault tolerant and capable of rebuilding data on failure
- Distributed: Data is distributed among multiple nodes in a cluster
- Dataset: Collection of partitioned data with values
An RDD is split into chunks (partitions), often based on a key. RDDs are highly resilient, i.e., they can recover quickly from failures: Spark tracks how each partition was derived (its lineage), so even if one executor node fails, another executor can recompute that chunk and still process the data. This lets you run your functional computations against the dataset very quickly by harnessing the power of multiple nodes.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
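As a minimal sketch of both creation paths in the Scala shell (the collection contents and the file path data.txt are placeholders):

```scala
// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4) // split into 4 partitions

// 2. Reference a dataset in external storage (local file, HDFS, etc.)
val lines = sc.textFile("data.txt") // hypothetical path

println(numbers.getNumPartitions) // how many chunks the data was split into
```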
Main Architecture
The edge node submits the job to the master node. In the driver program, the first thing we need to create is the SparkContext (SC). The SparkContext works with the cluster manager and takes care of execution within the cluster. An RDD created in the SparkContext can be distributed across various nodes and cached there.
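A minimal sketch of creating the SparkContext in a driver program; the application name and master URL below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configuration describing the application and which cluster manager to use
val conf = new SparkConf()
  .setAppName("MyApp")              // placeholder application name
  .setMaster("spark://master:7077") // placeholder master URL; use "local[*]" for local runs

// The SparkContext negotiates resources with the cluster manager
val sc = new SparkContext(conf)

// RDDs created through this context are distributed across worker nodes and can be cached there
val rdd = sc.parallelize(1 to 100).cache()
```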
What is RDD?
An RDD is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. Iterative and interactive operations can be performed faster than in MapReduce because the data can be kept in memory.
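The speed-up for iterative and interactive workloads comes mainly from keeping the working set in memory. A rough sketch, where logs.txt and the ERROR/WARN filters are made-up examples:

```scala
// Load once and cache in executor memory, then reuse across several actions
val logRDD = sc.textFile("logs.txt").cache() // hypothetical file

// Each of these would re-read from disk in MapReduce;
// with the cached RDD they are served from memory
val errorCount   = logRDD.filter(_.contains("ERROR")).count()
val warningCount = logRDD.filter(_.contains("WARN")).count()
```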
Spark execution
What happens when we execute spark-submit? (A small sketch follows the list below.)
- The driver program converts transformations and actions into a logical DAG (lineage)
- The logical DAG is converted into a physical plan and stages are created
- The stages are divided into small execution tasks based on partitions
- The SparkContext talks to the cluster manager, which allocates resources (worker nodes)
- If an executor on a worker node fails, the worker node informs the driver program and the lost RDD partitions are rebuilt from the lineage
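As a rough illustration of these steps, here is a tiny job whose lineage the driver turns into a DAG (the word-count logic and input file name are made up). The flatMap and map steps are pipelined into one stage, while reduceByKey introduces a shuffle boundary and hence a second stage in the physical plan:

```scala
val lines  = sc.textFile("input.txt")    // hypothetical input
val counts = lines
  .flatMap(_.split("\\s+"))              // narrow: stays in the same stage
  .map(word => (word, 1))                // narrow: pipelined with flatMap
  .reduceByKey(_ + _)                    // wide: shuffle => new stage

counts.collect()                         // action: triggers DAG scheduling and task execution
```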
What is the key difference between map and flatMap? Both apply a function to each element of an RDD and return the result as a new RDD. In map, the developer can define custom business logic, and the same logic is applied to every element of the RDD, producing exactly one output element per input element. flatMap expresses a one-to-many transformation: it transforms each element into 0 or more elements.
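A small sketch of the difference (the sentence data is made up):

```scala
val sentences = sc.parallelize(Seq("hello world", "spark is fast"))

// map: one output element per input element (each sentence -> its word array)
val wordArrays = sentences.map(_.split(" "))   // RDD[Array[String]] with 2 elements

// flatMap: 0..n output elements per input element (each sentence -> its individual words)
val words = sentences.flatMap(_.split(" "))    // RDD[String] with 5 elements
```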
What is the difference between reduce and reduceByKey in Spark? reduce is an action, whereas reduceByKey is a transformation.
reduce must pull the entire dataset down to a single location because it reduces everything to one final value. reduceByKey, on the other hand, produces one value per key. Since the per-key combining can run locally on each machine first, the result remains an RDD and further transformations can be applied to it.
    val distFile = sc.textFile("data.txt")
    distFile.map(s => s.length).reduce((a, b) => a + b)
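For contrast, a small sketch of reduceByKey, which keeps one value per key and returns an RDD that can be transformed further (the pair data is made up):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey is a transformation: values are combined per key on each node first,
// then shuffled, and the result is still an RDD
val sums = pairs.reduceByKey(_ + _)        // RDD[(String, Int)]: ("a", 4), ("b", 2)

// reduce is an action: it pulls a single final value back to the driver
val total = pairs.map(_._2).reduce(_ + _)  // 6
```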
Driver Responsibilities
- Maintaining information about the Spark application
- Distributing work to executors
- Responding to the user's program or input
What is SparkSession?
SparkSession is the entry point through which Spark executes user-defined manipulations across the cluster. In the spark-shell, a SparkSession is already available as the variable spark.
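A minimal sketch of obtaining a SparkSession in a standalone application (the app name and master are placeholders; in the spark-shell this object already exists as spark):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")     // placeholder name
  .master("local[*]")   // placeholder master; usually omitted when using spark-submit
  .getOrCreate()

// The lower-level SparkContext is still reachable through the session
val sc = spark.sparkContext
```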
What is a DataFrame in Spark?
A DataFrame is a structured API in Spark that represents data as a table with rows and columns.
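A small sketch of creating a DataFrame from a local collection (the column names and rows are made up, and a SparkSession named spark is assumed, as above):

```scala
import spark.implicits._   // assumes a SparkSession named `spark`

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

df.show()          // prints the rows as a table of named columns
df.printSchema()   // shows the column names and types
```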
What are the two types of transformations? A transformation converts one RDD into another RDD by applying some logic.
- Narrow transformation - each input partition is responsible for one output partition
- Wide transformation - one input partition is responsible for multiple output partitions
During a wide transformation, a shuffle occurs and Spark writes data back to disk, whereas narrow transformations happen in memory because of pipelining.
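One rough way to see the stage boundary that a shuffle introduces is to compare the lineage strings (toDebugString) of a narrow-only pipeline and one containing a wide transformation; the data below is made up:

```scala
val nums  = sc.parallelize(1 to 1000)
val pairs = nums.map(n => (n % 10, n))

// Narrow only: map and filter are pipelined in memory within one stage
val narrowOnly = pairs.filter(_._2 > 100)
println(narrowOnly.toDebugString)   // no shuffle in the lineage

// Wide: groupByKey forces a shuffle, and shuffle data is written to disk
val wide = pairs.groupByKey()
println(wide.toDebugString)         // shows a ShuffledRDD, i.e. a new stage
```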