Apache Spark Architecture
This page explains the Spark architecture.
Spark is based on two main abstractions:
- Resilient Distributed Dataset (RDD)
- Directed Acyclic Graph (DAG)
What is RDD?
RDDs are the building blocks of any Spark application. RDD stands for:
- Resilient: Fault tolerant and capable of rebuilding data on failure
- Distributed: Data is distributed among multiple nodes in a cluster
- Dataset: Collection of partitioned data with values
An RDD is split into chunks (partitions), often based on a key. RDDs are highly resilient, i.e., they can recover quickly from failures: Spark tracks how each partition was derived (its lineage), so even if one executor node fails, another executor can recompute that chunk and still process the data. This lets you run your functional computations against the dataset very quickly by harnessing the power of multiple nodes.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
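As a minimal sketch of both creation paths in the Scala shell (the collection contents and the file path data.txt are placeholders):

```scala
// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 4) // split into 4 partitions

// 2. Reference a dataset in external storage (local file, HDFS, etc.)
val lines = sc.textFile("data.txt") // hypothetical path

println(numbers.getNumPartitions) // how many chunks the data was split into
```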
Main Architecture
The edge node submits the job to the master node. In the driver program, the first thing we need to create is the SparkContext (SC). The SparkContext works with the cluster manager and takes care of execution within the cluster. An RDD created in the SparkContext can be distributed across various nodes and cached there.
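A minimal sketch of creating the SparkContext in a driver program; the application name and master URL below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configuration describing the application and which cluster manager to use
val conf = new SparkConf()
  .setAppName("MyApp")              // placeholder application name
  .setMaster("spark://master:7077") // placeholder master URL; use "local[*]" for local runs

// The SparkContext negotiates resources with the cluster manager
val sc = new SparkContext(conf)

// RDDs created through this context are distributed across worker nodes and can be cached there
val rdd = sc.parallelize(1 to 100).cache()
```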
What is RDD?
An RDD is an immutable, distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat. Iterative and interactive operations can be performed faster than in MapReduce because the data can be kept in memory.
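The speed-up for iterative and interactive workloads comes mainly from keeping the working set in memory. A rough sketch, where logs.txt and the ERROR/WARN filters are made-up examples:

```scala
// Load once and cache in executor memory, then reuse across several actions
val logRDD = sc.textFile("logs.txt").cache() // hypothetical file

// Each of these would re-read from disk in MapReduce;
// with the cached RDD they are served from memory
val errorCount   = logRDD.filter(_.contains("ERROR")).count()
val warningCount = logRDD.filter(_.contains("WARN")).count()
```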
Spark execution
What happens when we execute spark-submit? (A small sketch follows the list below.)
- The driver program converts transformations and actions into a logical DAG (lineage)
- The logical DAG is converted into a physical plan and stages are created
- The stages are divided into small execution tasks based on partitions
- The SparkContext talks to the cluster manager, which allocates resources (worker nodes)
- If an executor on a worker node fails, the worker node informs the driver program and the lost RDD partitions are rebuilt from the lineage
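As a rough illustration of these steps, here is a tiny job whose lineage the driver turns into a DAG (the word-count logic and input file name are made up). The flatMap and map steps are pipelined into one stage, while reduceByKey introduces a shuffle boundary and hence a second stage in the physical plan:

```scala
val lines  = sc.textFile("input.txt")    // hypothetical input
val counts = lines
  .flatMap(_.split("\\s+"))              // narrow: stays in the same stage
  .map(word => (word, 1))                // narrow: pipelined with flatMap
  .reduceByKey(_ + _)                    // wide: shuffle => new stage

counts.collect()                         // action: triggers DAG scheduling and task execution
```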
What is the key difference between map and flatMap? Both apply a function to each element of an RDD and return the result as a new RDD. In map, the developer can define custom business logic, and the same logic is applied to every element of the RDD, producing exactly one output element per input element. flatMap expresses a one-to-many transformation: it transforms each element into 0 or more elements.
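A small sketch of the difference (the sentence data is made up):

```scala
val sentences = sc.parallelize(Seq("hello world", "spark is fast"))

// map: one output element per input element (each sentence -> its word array)
val wordArrays = sentences.map(_.split(" "))   // RDD[Array[String]] with 2 elements

// flatMap: 0..n output elements per input element (each sentence -> its individual words)
val words = sentences.flatMap(_.split(" "))    // RDD[String] with 5 elements
```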
What is the difference between reduce and reduceByKey in Spark? reduce is an action, whereas reduceByKey is a transformation.
reduce must pull the entire dataset down to a single location because it reduces everything to one final value. reduceByKey, on the other hand, produces one value per key. Since the per-key combining can run locally on each machine first, the result remains an RDD and further transformations can be applied to it.
    val distFile = sc.textFile("data.txt")
    distFile.map(s => s.length).reduce((a, b) => a + b)
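For contrast, a small sketch of reduceByKey, which keeps one value per key and returns an RDD that can be transformed further (the pair data is made up):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey is a transformation: values are combined per key on each node first,
// then shuffled, and the result is still an RDD
val sums = pairs.reduceByKey(_ + _)        // RDD[(String, Int)]: ("a", 4), ("b", 2)

// reduce is an action: it pulls a single final value back to the driver
val total = pairs.map(_._2).reduce(_ + _)  // 6
```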
Driver Responsibilities
- Maintaining information about the Spark application
- Distributing work to executors
- Responding to the user's program or input
What is SparkSession?
SparkSession is the entry point through which Spark executes user-defined manipulations across the cluster. In the spark-shell, a SparkSession is already available as the variable spark.
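A minimal sketch of obtaining a SparkSession in a standalone application (the app name and master are placeholders; in the spark-shell this object already exists as spark):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MyApp")     // placeholder name
  .master("local[*]")   // placeholder master; usually omitted when using spark-submit
  .getOrCreate()

// The lower-level SparkContext is still reachable through the session
val sc = spark.sparkContext
```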
What is a DataFrame in Spark?
A DataFrame is a structured API in Spark that represents data as a table with rows and columns.
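A small sketch of creating a DataFrame from a local collection (the column names and rows are made up, and a SparkSession named spark is assumed, as above):

```scala
import spark.implicits._   // assumes a SparkSession named `spark`

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

df.show()          // prints the rows as a table of named columns
df.printSchema()   // shows the column names and types
```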
What are the two types of transformations? A transformation converts one RDD into another RDD by applying some logic.
- Narrow transformation - each input partition is responsible for one output partition
- Wide transformation - one input partition is responsible for multiple output partitions
During a wide transformation, a shuffle occurs and Spark writes data back to disk, whereas narrow transformations happen in memory because of pipelining.
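One rough way to see the stage boundary that a shuffle introduces is to compare the lineage strings (toDebugString) of a narrow-only pipeline and one containing a wide transformation; the data below is made up:

```scala
val nums  = sc.parallelize(1 to 1000)
val pairs = nums.map(n => (n % 10, n))

// Narrow only: map and filter are pipelined in memory within one stage
val narrowOnly = pairs.filter(_._2 > 100)
println(narrowOnly.toDebugString)   // no shuffle in the lineage

// Wide: groupByKey forces a shuffle, and shuffle data is written to disk
val wide = pairs.groupByKey()
println(wide.toDebugString)         // shows a ShuffledRDD, i.e. a new stage
```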