Spark Interview Questions II

1- What are the features of RDD, that makes RDD an important abstraction of Spark?

Answer

RDD (Resilient Distributed Dataset) is the basic abstraction in Apache Spark. A Spark RDD is an immutable, partitioned collection of elements distributed across the cluster which can be operated on in parallel.

Each RDD is characterized by five main properties.

The first three properties describe the RDD's lineage:

  • A list or set of partitions
  • A list of dependencies on other (parent) RDDs
  • A function to compute each partition

The last two properties are used for optimization during execution (a simplified sketch of how all five appear in Spark's RDD class follows this list):

  • An optional list of preferred locations (e.g. the block locations of an HDFS file); this is about data locality
  • An optional partitioner (e.g. a hash partitioner for key/value pairs), which determines how the data travels when it is shuffled
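For reference, below is a simplified sketch (not the actual Spark source) of how these five properties map onto members of Spark's abstract RDD class; the names exist in Spark, but the signatures are abbreviated here.

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the five properties as they appear in Spark's abstract RDD class
abstract class SketchRDD[T] {
  protected def getPartitions: Array[Partition]                               // list/set of partitions
  protected def getDependencies: Seq[Dependency[_]]                           // dependencies on parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]            // how to compute each partition
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil    // optional: data locality hints
  val partitioner: Option[Partitioner] = None                                 // optional: how key/value data is partitioned
}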

2- List out the ways of creating RDD in Apache Spark

Answer

There are three ways to create an RDD:

  • By Parallelizing collections in the driver program
  • By loading an external dataset
  • Creating RDD from already existing RDDs

Create an RDD by parallelizing a collection

val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

Create by loading an external Dataset

csv

import org.apache.spark.sql.SparkSession

val spark =  SparkSession.builder.appName("AvgAnsTime").master("local").getOrCreate()
val dataRDD = spark.read.csv("path/of/csv/file").rdd

json

val dataRDD = spark.read.json("path/of/json/file").rdd

textFile

val dataRDD = spark.read.textFile("path/of/text/file").rdd

Creating RDD from existing RDD

A transformation produces a new RDD from an existing one (it does not mutate the original, since RDDs are immutable), so applying a transformation is the way to create an RDD from an already existing RDD.

val words=spark.sparkContext.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
 "dog"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)

3- Explain Transformation in RDD. How is lazy evaluation helpful in reducing the complexity of the System?

Answer

Transformations are lazily evaluated operations on RDDs that create one or more new RDDs, e.g. map, filter, reduceByKey, join, cogroup, randomSplit. A transformation takes an RDD as input and produces one or more RDDs as output. It never changes the input RDD (RDDs are immutable and hence cannot be modified); it always produces new RDDs by applying computations to the input. By applying transformations you incrementally build up the RDD lineage, containing all the ancestor RDDs of the final RDD(s).

Transformations are lazy, i.e. they are not executed immediately; they only execute when an action is called. The resulting RDD(s) will always be different from their ancestor RDDs and can be smaller (e.g. filter, distinct, sample), bigger (e.g. flatMap, union, cartesian) or the same size (e.g. map).

RDDs allow you to create dependencies between RDDs. These dependencies are the steps for producing the results of a program. Each RDD in the lineage chain (the string of dependencies) has a function for computing its data and a pointer (dependency) to its ancestor RDD. Spark divides RDD dependencies into stages and tasks and then sends those to the workers for execution.
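A minimal sketch of this, assuming an existing SparkContext sc (e.g. in spark-shell): the transformations only record the lineage, and nothing runs until the action is called.

val nums   = sc.parallelize(1 to 100)
val result = nums.filter(_ % 2 == 0).map(_ * 10)   // no job is launched here
println(result.toDebugString)                      // prints the chain of parent RDDs (the lineage)
result.count()                                     // action: only now is the lineage executed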

4- What are the types of Transformation in Spark RDD Operations?

Answer

Narrow transformations:

  • Each partition of the parent RDD is used by at most one partition of the child RDD.

  • No shuffling of data across the nodes in the cluster

Transformations with narrow dependencies:

  • map
  • mapValues
  • flatMap
  • filter
  • mapPartitions
  • mapPartitionsWithIndex

In the image below we have the parent RDD (RDD1), which has 2 partitions stored on 2 different nodes. After applying a narrow transformation to RDD1 we obtain RDD2, which also has 2 partitions on the same 2 nodes.

[image: narrow dependency, each RDD1 partition feeding exactly one RDD2 partition on the same node]

Wide Transformations

  • Each partition of the child RDD may depend on multiple partitions of the parent RDD

  • These parent partitions may live on different nodes, resulting in shuffling of data across the nodes of the Spark cluster

Transformations with wide dependencies

  • cogroup
  • groupWith
  • join
  • leftOuterJoin
  • rightOuterJoin
  • groupByKey
  • reduceByKey
  • combineByKey
  • distinct
  • intersection
  • repartition
  • coalesce (with shuffle = true)

[image: wide dependency, each child partition depending on multiple parent partitions across nodes]

The scheduler examines the type of dependencies and groups narrow-dependency RDDs into a unit of processing called a stage. Wide dependencies span consecutive stages within the execution and require the number of partitions of the child RDD to be explicitly specified.

[image: the DAG split into stages at wide-dependency boundaries]

A typical execution sequence is as follows:

  • 1 An RDD is created from an external data source (e.g. HDFS, a local file, etc.)
  • 2 The RDD undergoes a sequence of TRANSFORMATIONS (e.g. map, flatMap, filter, groupBy, join), each producing a new RDD that feeds into the next transformation.
  • 3 Finally, the last step is an ACTION (e.g. count, collect, save, take), which converts the last RDD into an output written to an external data source. This sequence of processing is called a lineage (the outcome of a topological sort of the DAG).

Note: Each RDD produced within the lineage is immutable. In fact, unless it is cached, it is used only once to feed the next transformation and, eventually, to produce the action output.

Fault Resilience in Spark

Fault resiliency in Spark takes a different approach. First of all, Spark is meant to be a large-scale compute cluster, not a large-scale data cluster. Spark makes two assumptions about its workload:

  • The processing time is finite (although the longer it takes, the higher the cost of recovery after a fault)
  • Data persistence is the responsibility of external data sources, which keeps the data stable within the duration of processing.

Spark has made a tradeoff decision: in case of any data lost during execution, it will re-execute the previous steps to recover the lost data. However, this doesn't mean everything done so far is discarded and we need to start again from scratch. We just need to re-execute the partitions of the parent RDD that are responsible for generating the lost partitions (in the case of narrow dependencies, this resolves to the same machine).

Notice that the re-execution of a lost partition works exactly like the lazy evaluation of the DAG: it starts from the leaf node of the DAG, traces back the dependencies to determine which parent RDDs are needed, and eventually tracks all the way back to the source node. Recomputing a lost partition is done in the same way, but with the partition as an extra piece of information to determine which parent RDD partitions are needed.

However, re-execution across wide dependencies can touch a lot of parent RDD partitions across multiple machines and may cause re-execution of everything. To mitigate this, Spark persists the intermediate data output from a map phase before shuffling it to the machines that execute the reduce phase. In case of a machine crash, the re-execution (from another surviving machine) just needs to fetch the intermediate data from the corresponding partition of the mapper's persisted output. Spark also provides a checkpoint API to explicitly persist intermediate RDDs, so re-execution (after a crash) doesn't need to trace all the way back to the beginning. In the future, Spark may perform checkpointing automatically, by figuring out a good balance between the latency of recovery and the overhead of checkpointing based on statistical results.

5- What is the reason behind Transformation being lazy operation in Apache Spark RDD? How is it useful?

Answer

Whenever a transformation operation is performed in Apache Spark, it is lazily evaluated: it won't execute until an action is performed. Apache Spark just adds an entry for the transformation to the DAG (Directed Acyclic Graph) of computation, which is a directed finite graph with no cycles. In this DAG, all the operations are classified into different stages, with no shuffling of data within a single stage.

This way, Spark can optimize the execution by looking at the DAG in its entirety, and return the appropriate result to the driver program.

For example, consider a 1 TB log file in HDFS containing errors, warnings, and other information. Below are the operations performed in the driver program to fetch the first error message:

  • Creating an RDD from this log file
  • Performing a flatMap() operation on this RDD to split the data in the log file on the tab delimiter
  • Performing a filter() operation to extract data containing only error messages
  • Performing a first() operation to fetch only the first error message (see the sketch below)
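A rough sketch of that driver program, assuming an existing SparkContext sc and a hypothetical HDFS path:

val logRDD   = sc.textFile("hdfs:///logs/app.log")        // the 1 TB log file
val fields   = logRDD.flatMap(line => line.split("\t"))   // split on the tab delimiter
val errors   = fields.filter(_.contains("ERROR"))         // keep only error messages
val firstErr = errors.first()                             // action: only now is any data actually read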

If all the transformations in the above driver program are eagerly evaluated, then:

  • The whole log file would be loaded into memory
  • All of the data within the file would be split on the tab delimiter; Spark would then have to either write the output of flatMap() somewhere or keep it in memory, blocking those resources until the next operation is performed
  • Apart from this, for each and every operation Spark would need to scan all the records, i.e. process all the records in flatMap() and then process them all again in the filter operation

On the other hand, if all the transformations are lazily evaluated, Spark looks at the DAG as a whole and prepares an execution plan for the application; this plan is optimized, the operations are combined/merged into stages, and only then does execution start. The optimized plan created by Spark improves the job's efficiency and overall throughput.

Thanks to lazy evaluation, the number of round trips between the driver program and the cluster is also reduced, saving time and memory resources and increasing the speed of computation.

6- What is RDD lineage graph? How is it useful in achieving Fault Tolerance?

Answer

The RDD lineage graph (or RDD operator graph) is the graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to an RDD and is used to create a logical execution plan.

The RDDs in Apache Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so whenever a partition of a persisted RDD is lost, the lost data can be recovered using the lineage graph information.

7- Explain the various Transformation on Apache Spark RDD like distinct(), union(), intersection(), and subtract().

Answer

distinct() Used when you want only the unique elements of an RDD.

val d1 = sc.parallelize(List("c","c","p","m","t"))
val result = d1.distinct()
result.foreach{println}

// Output:
// p
// t
// m
// c

union() Outputs an RDD which contains the data from both sources. If duplicates are present in the input RDDs, the output of the union() transformation will also contain duplicates, which can be removed using distinct().

val u1 = sc.parallelize(List("c","c","p","m","t"))
val u2 = sc.parallelize(List("c","m","k"))
val result = u1.union(u2)
result.foreach{println}

// Output
// c
// c
// p
// m
// t
// c
// m
// k

intersection() Returns the elements which are present in both RDDs and removes all duplicates, including duplicates within a single RDD.

val is1 = sc.parallelize(List("c","c","p","m","t"))
val is2 = sc.parallelize(List("c","m","k"))
val result = is1.intersection(is2)
result.foreach{println}

// Output:
// m
// c

subtract() Returns an RDD that contains only the values present in the first RDD and not in the second RDD.

val s1 = sc.parallelize(List("c","c","p","m","t"))
val s2 = sc.parallelize(List("c","m","k"))
val result = s1.subtract(s2)
result.foreach{println}

// Output
// t
// p

8- What is the FlatMap Transformation in Apache Spark RDD?

Answer

flatMap() is a transformation operation in Apache Spark that creates a new RDD from an existing RDD. It takes one element from an RDD and can produce 0, 1 or many outputs based on business logic. So if we perform a map() operation on an RDD of length N, the output RDD will also be of length N; but for a flatMap() operation the output RDD can have a different length, depending on the business logic.

We could also say flatMap() transforms an RDD of length N into a collection of N collections, then flattens them into a single RDD of results.
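A minimal sketch contrasting map() and flatMap(), assuming an existing SparkContext sc:

val lines = sc.parallelize(Seq("the quick brown fox", "jumps over"))

val mapped    = lines.map(_.split(" "))       // RDD[Array[String]] with 2 elements (same length as the input)
val flattened = lines.flatMap(_.split(" "))   // RDD[String] with 6 elements (flattened)

mapped.collect()      // Array(Array(the, quick, brown, fox), Array(jumps, over))
flattened.collect()   // Array(the, quick, brown, fox, jumps, over)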

9- Explain first() operation in Apache Spark RDD

Answer

first() is an action that returns the first element of the RDD.
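A minimal sketch, assuming an existing SparkContext sc:

val nums = sc.parallelize(Seq(10, 20, 30))
println(nums.first())   // 10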

10- Describe join() operation. How is outer join supported?

Answer

join() returns an RDD containing all pairs of elements with matching keys in left_RDD (this) and right_RDD (other). Each result is a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other.

It will perform a hash join across the cluster

val myrdd1 = sc.parallelize(Seq((1,2),(3,4),(3,6)))
// myrdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0]
val myrdd2 = sc.parallelize(Seq((3,9)))
// myrdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[1]
val myjoinedrdd = myrdd1.join(myrdd2)
myjoinedrdd.collect

// Output
// Array[(Int, (Int, Int))] = Array((3,(4,9)), (3,(6,9)))

Note that in this (myrdd1) there are two elements with k=3, but in other (myrdd2) there is only one. For this reason 9 appears twice in the result.
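As for outer joins, pair RDDs also provide leftOuterJoin(), rightOuterJoin() and fullOuterJoin(), where the side that may be missing is wrapped in an Option. A minimal sketch, assuming an existing SparkContext sc:

val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x")))

left.leftOuterJoin(right).collect()
// Array((1,(a,Some(x))), (2,(b,None)))      (element order may vary)

left.fullOuterJoin(right).collect()
// Array((1,(Some(a),Some(x))), (2,(Some(b),None)))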

11- Describe coalesce() operation. When can you coalesce to a larger number of partitions?

Answer

The coalesce() operation changes the number of partitions where data is stored. It returns a new RDD that is reduced into numPartitions partitions. It is useful for running operations more efficiently after filtering down a large dataset.

This results in a narrow dependency: if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions.

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you would like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This adds a shuffle step but means the current upstream partitions will execute in parallel (per whatever the current partitioning is).

Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
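A minimal sketch of both directions, assuming an existing SparkContext sc; the partition counts are illustrative:

val rdd = sc.parallelize(1 to 1000000, 100)          // 100 partitions

val narrowed = rdd.coalesce(10)                      // narrow: no shuffle, partitions are merged
val widened  = rdd.coalesce(1000, shuffle = true)    // with shuffle: the partition count can grow

println(narrowed.getNumPartitions)   // 10
println(widened.getNumPartitions)    // 1000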

12- repartition() vs coalesce()

Answer

coalesce() avoids a full shuffle. If it is known that the number of partitions is decreasing, the executors can safely keep the data on the minimum number of partitions, only moving data off the extra nodes onto the nodes that are kept.

Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12

When we coalesce down to 2 partitions, we could end up with something like:

Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)

Notice that Node 1 and Node 3 did not require their original data to move.

repartition() does a full shuffle and creates new partitions with data that's distributed evenly. Let's create a DataFrame with the numbers from 1 to 12.

val x = (1 to 12).toList
val numbersDf = x.toDF("number")

Let's assume that numbersDf contains 4 partitions:

numbersDf.rdd.partitions.size 
// Output
// 4

Below is how data could be divided on the different partitions:

Partition 00000: 1, 2, 3
Partition 00001: 4, 5, 6
Partition 00002: 7, 8, 9
Partition 00003: 10, 11, 12

Now let's do a full shuffle with the repartition method:

val numbersDfR = numbersDf.repartition(2)

This is how the partitions could look now:

Partition A: 1, 3, 4, 6, 10, 12
Partition B: 2, 5, 7, 8, 9, 11

The repartition method makes new partitions and evenly distributes the data in the new partitions (Note the data distribution is more even for larger data sets).

Difference between coalesce and repartition

  • By default coalesce() uses existing partitions to minimize the amount of data that's shuffled. repartition() creates new partitions and does a full shuffle.

However, it is possible to use coalesce() to increase the number of partitions by setting the shuffle parameter to true: repartition(n) is the same as coalesce(n, shuffle = true).

  • coalesce() results in partitions with different amounts of data (sometimes partitions of very different sizes), while repartition() results in roughly equal-sized partitions.

Is coalesce or repartition faster?

coalesce() may run faster than repartition(), but unequal sized partitions are generally slower to work with than equal sized partitions. You'll usually need to repartition datasets after filtering a large data set. I've found repartition to be faster overall because Spark is built to work with equal sized partitions.

13- What is the key difference between textFile and wholeTextFile method?

Answer

textFile()

  • def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
  • Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings
  • For example, sc.textFile("/home/hdadmin/wc-data.txt") will create an RDD in which each individual line is an element.

wholeTextFiles()

  • def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
  • Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
  • Rather than creating a basic RDD, wholeTextFiles() returns a pair RDD.
  • For example, if you have a few files in a directory, using the wholeTextFiles() method creates a pair RDD with the file name + path as the key and the whole content of each file, as a string, as the value (see the sketch below)
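A minimal sketch with hypothetical HDFS paths, assuming an existing SparkContext sc:

val lines = sc.textFile("hdfs:///data/logs/app.log")   // RDD[String]: one element per line
val files = sc.wholeTextFiles("hdfs:///data/logs/")    // RDD[(String, String)]: (file path, whole file content)

files.map { case (path, content) => (path, content.length) }.collect()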

14- What are the ways in which one can know that the given operation is Transformation or Action?

Answer

In order to identify the operation, one needs to look at the return type of an operation.

  • If the operation returns a new RDD, in that case, an operation is ‘Transformation’
  • If the operation returns any other type than RDD, in that case, an operation is ‘Action’

Hence, a Transformation constructs a new RDD from an existing (previous) one, while an Action computes the result based on the applied transformations and either returns the result to the driver program or saves it to external storage.
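A minimal sketch, assuming an existing SparkContext sc, where the return types make the distinction visible:

val rdd     = sc.parallelize(1 to 5)
val doubled = rdd.map(_ * 2)   // returns RDD[Int] -> transformation
val total   = rdd.count()      // returns Long     -> action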

15- Describe Partition and Partitioner in Apache Spark.

Answer

A partition in Spark is a logical division of data stored on a node in the cluster. They are the basic units of parallelism in Apache Spark. RDDs are a collection of partitions. When some actions are executed, a task is launched per partition.

By default, partitions are automatically created by the framework. However, the number of partitions in Spark is configurable to suit the application's needs. If spark.default.parallelism is set, the number of partitions is taken from SparkContext's defaultParallelism; otherwise it is the maximum number of partitions among the upstream RDDs. Using the partition count of the largest upstream RDD is the option least likely to cause out-of-memory errors.

A Partitioner is an object that defines how the elements of a key-value pair RDD are partitioned by key, mapping each key to a partition ID from 0 to numPartitions - 1. It captures the data distribution at the output. With the help of the partitioner, the scheduler can optimize future operations. The contract of a partitioner ensures that all records for a given key reside on a single partition.

We should choose a partitioner for cogroup-like operations. If either of the RDDs already has a partitioner, we should choose that one; otherwise, a default HashPartitioner is used.

There are three types of partitioners in Spark :

  • Hash Partitioner
  • Range Partitioner
  • Custom Partitioner

Hash Partitioner: hash partitioning attempts to spread the data evenly across the partitions, based on the key.

Range Partitioner: with range partitioning, tuples having keys within the same range will appear on the same partition.

RDDs can be created with a specific partitioning in two ways:

i) Providing explicit partitioner by calling partitionBy method on an RDD

ii) Applying transformations that return RDDs with specific partitioners.
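A minimal sketch of option i), assuming an existing SparkContext sc:

import org.apache.spark.HashPartitioner

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val byHash = pairs.partitionBy(new HashPartitioner(4))   // explicit partitioner, 4 partitions
println(byHash.partitioner)                              // Some(org.apache.spark.HashPartitioner@...)

val reduced = byHash.reduceByKey(_ + _)                  // keeps the existing partitioner, so no extra shuffle
println(reduced.partitioner)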

16- Name the two types of shared variable available in Apache Spark.

Answer

There are two types of shared variables in Apache Spark

  • Accumulators: used to aggregate information across the executors
  • Broadcast variables: used to efficiently distribute large values

When we pass a function to Spark, say to filter(), that function can use variables defined outside of the function but within the driver program. However, when we submit the task to the cluster, each worker node gets a new copy of these variables, and updates to these copies are not propagated back to the driver program.

Accumulators and broadcast variables are used to address this drawback (i.e. we can get updated values back to our driver program, or efficiently share read-only data with the executors).
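A minimal broadcast-variable sketch, assuming an existing SparkContext sc: the lookup map is shipped to each executor once, instead of once per task.

val severity   = Map("ERROR" -> 3, "WARN" -> 2, "INFO" -> 1)
val severityBc = sc.broadcast(severity)

val levels = sc.parallelize(Seq("INFO", "ERROR", "WARN", "ERROR"))
val scored = levels.map(level => (level, severityBc.value.getOrElse(level, 0)))
scored.collect().foreach(println)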

17- What are accumulators in Apache Spark?

Answer

An accumulator is a shared variable in Apache Spark, used to aggregate information across the workers and later send it back to the driver program.

Why Accumulator :

  • When we use a function inside an operation like map() or filter(), that function can use variables defined outside its scope, in the driver program.
  • When we submit the task to the cluster, each task running on the cluster gets a new copy of these variables, and updates to these copies do not propagate back to the driver program.
  • Accumulators lift this restriction.

Use Cases :

  • One of the most common uses of accumulators is counting events that occur during job execution, for debugging purposes.
  • For example: counting the number of blank lines in an input file, counting the number of bad packets in a network session, or, in an Olympic data analysis, counting the records where age is 'NA'; in short, finding bad/corrupted records (see the sketch below).
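A minimal sketch of the blank-line use case, assuming an existing SparkContext sc:

val blankLines = sc.longAccumulator("blankLines")

val lines = sc.parallelize(Seq("first line", "", "third line", ""))
lines.foreach(line => if (line.trim.isEmpty) blankLines.add(1))

println(s"Blank lines: ${blankLines.value}")   // read on the driver, only after the action has run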

18- Explain SparkContext and SparkSession in Apache Spark [TODO]

19- Spark Architecture

Answer

The components of the Spark application are:

  • Driver
  • Application Master
  • Spark Context
  • Executors
  • Cluster Resource Manager (Cluster Manager)

Driver

The Driver (aka the application's driver process) is responsible for converting a user application into smaller execution units called tasks and for scheduling those tasks to run on the executors. The Driver is also responsible for the execution of the Spark application and for returning the status/results to the user.

Spark Driver contains various components:

  • DAGScheduler
  • TaskScheduler
  • BackendScheduler
  • BlockManager

They are responsible for the translation of user code into actual spark jobs executed on the cluster.

Other Driver properties:

  • Can run in an independent process, or on one of the worker nodes for High Availability (HA)
  • Stores the metadata about all the Resilient Distributed Datasets and their partitions
  • Is created once the user submits the Spark application to the cluster manager (YARN in our case)
  • Runs in its own JVM
  • Optimizes the logical DAG of transformations and combines them into stages where possible
  • Brings up the Spark WebUI with application details

Application Master

It is a framework-specific entity and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the application tasks.

The Application Master is created at the same time as the Driver, on the same node (in cluster mode), when the user submits the Spark application using spark-submit.

The Driver informs the Application Master of the application's executor requirements, and the Application Master negotiates resources with the Resource Manager to host these executors.

In a standalone mode, the Spark Master plays the role of a Cluster manager.

Spark Context

The Spark Context is the main entry point for Spark functionality and so the heart of any Spark application. It allows Spark Driver to access the cluster through a Cluster Resource Manager and it can be used to create RDDs, accumulators and broadcast variables on the cluster. Spark Context also keeps track of live executors by sending heartbeat messages regularly.

The Spark Context is created by the Spark Driver for each Spark application when it is first submitted by the user. It exists throughout the entire life of a spark application.

The Spark Context terminates once the spark application completes. Only one Spark Context can be active per JVM. You must stop() the active Spark Context before creating a new one.

Cluster Resource Manager

The Cluster Manager in a distributed Spark application is the process that monitors, governs and reserves resources, in the form of containers, on the cluster worker nodes. These containers are reserved upon request by the Application Master and are allocated to it when released or available.

Once the Cluster Manager allocates the containers, the Application Master passes the containers' resources on to the Spark Driver, and the Spark Driver is responsible for executing the various stages and tasks of the Spark application.

Executors

Executors are processes on the worker nodes whose job is to execute the assigned tasks. These tasks are executed on the partitioned RDDs on the worker nodes, and the results are then returned to the Spark Driver.

Executors are launched once at the beginning of the Spark application and then run for its entire lifetime; this is known as "Static Allocation of Executors". However, users can also opt for dynamic allocation of executors, where Spark executors are added or removed dynamically to match the overall workload. Even if a Spark executor fails, the Spark application can continue.

Executors provide in-memory storage for the RDD partitions that are cached locally in Spark applications (via the BlockManager).

Other executor properties:

  • Stores the data in the cache in JVM heap or on HDDs
  • Reads data from external sources
  • Writes data to external sources
  • Performs all the data processing

Spark Application running steps

Let's create and break down one of the simplest spark applications.

    from pyspark.sql import SparkSession

    # initialization of the Spark session (master and other settings come from spark-submit)
    spark = SparkSession\
            .builder\
            .appName("PythonWordCount")\
            .getOrCreate()

    # read data from HDFS; as a result we get an RDD of lines
    linesRDD = spark.sparkContext.textFile("hdfs://...")

    # from the RDD of lines create an RDD of words
    wordsRDD = linesRDD.flatMap(lambda line: line.split(" "))

    # from the RDD of words make an RDD of (word, counter) tuples,
    # where the counter starts at 1
    wordCountRDD = wordsRDD.map(lambda word: (word, 1))

    # combine elements with the same word value
    resultRDD = wordCountRDD.reduceByKey(lambda a, b: a + b)

    # write it back to HDFS
    resultRDD.saveAsTextFile("hdfs://...")
    spark.stop()

Steps:

  • 1- When submitting a Spark application via the cluster mode, spark-submit utility will interact with the Cluster Resource Manager to start the Application Master.

  • 2- The Resource Manager is responsible for allocating the required container, in which the Application Master will be launched; it then launches the Application Master.

  • 3- The Application Master registers itself on Resource Manager. Registration allows the client program to ask the Resource Manager for specific information which allows it to directly communicate with its own Application Master.

  • 4- Next, the Spark Driver runs in the Application Master container (in cluster mode).

  • 5- The driver implicitly converts user code that contains transformations and actions into a logical plan called DAG of RDDs. All RDDs are created on the Driver and do not do anything until action is called. On this step, Driver also performs optimizations such as pipelining transformations.

  • 6- After that, it converts the DAG into a physical execution plan and creates physical execution units, called tasks, under each stage.

  • 7- Now the Driver talks to the Cluster Manager and negotiates the resources. The Cluster Manager then allocates containers, launches executors on all the allocated containers and assigns them tasks to run on behalf of the Driver. When the executors start, they register themselves with the Driver, so the Driver has a complete view of the executors that are executing the tasks.

  • 8- At this point, the Driver sends the tasks to the executors based on data placement. The Cluster Manager is responsible for the scheduling and allocation of resources across the worker machines forming the cluster.

  • 9- Upon successful receipt of the containers, the Application Master launches the container by providing the Node Manager with a container configuration.

  • 10- Inside the container, the user application code starts. It provides information (stage of execution, status) to Application Master.

  • 11- So, on this step, we will start executing our code. Our first RDD will be created by reading the data from HDFS into different partitions on different nodes in parallel. So each node will have a subset of the data.

  • 12- After reading the data we have two map transformations that will be executed in parallel on each partition.

  • 13- Then we have a reduceByKey transformation. It is not a standard, pipelined operation like map, so it creates an additional stage: it combines the records with the same key, moving data between nodes and partitions (a shuffle) to bring records with the same key together.

  • 14- Then we have the action operation, writing back to HDFS, which triggers the whole DAG execution.

  • 15- During the user application execution, the client communicates with the Application Master to obtain the status of the application.

  • 16- When the application has completed execution and all the necessary work has been finished, the Application Master deregisters from Resource Manager and shuts down, freeing its container for other purposes.

20- Describe the run-time architecture of Spark.

Answer

There are 3 important components in the runtime architecture of Apache Spark, as described below:

  • Client process
  • Driver
  • Executor

Responsibilities of the client process component

The client process starts the driver program.

For example, the client process can be a spark-submit script for running applications, a spark-shell script, or a custom application using Spark API. The client process prepares the classpath and all configuration options for the Spark application. It also passes application arguments, if any, to the application running on the driver.

Responsibilities of the driver component

The driver orchestrates and monitors the execution of a Spark application. There’s always one driver per Spark application.

The driver is like a wrapper around the application. The driver and its subcomponents (the Spark context and the schedulers) are responsible for:

  • requesting memory and CPU resources from cluster managers
  • breaking application logic into stages and tasks
  • sending tasks to executors
  • collecting the results

Responsibilities of the executors

Executors, which are JVM processes, accept tasks from the driver, execute those tasks, and return the results to the driver. Each executor has several task slots (or CPU cores) for running tasks in parallel.

21- Describe Spark SQL

Answer

Spark SQL is the Spark interface for working with structured and semi-structured data (data that has defined fields, i.e. tables). It provides abstraction layers called DataFrame and Dataset, through which we can work with the data easily.

One can say that a DataFrame is like a table in a relational database. Spark SQL can read and write data in a variety of structured and semi-structured formats like Parquet, JSON and Hive tables. This lets us load data and query it with SQL, and we can also combine it with "regular" program code in Python, Java or Scala.

22- Explain createOrReplaceTempView

Answer

It creates a temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

val df = spark.read.csv("/home/hdadmin/titanic_data.txt")
df.createOrReplaceTempView("titanicdata")
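The view can then be queried with Spark SQL through the same SparkSession:

val firstRows = spark.sql("SELECT * FROM titanicdata LIMIT 5")
firstRows.show()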

23- What are the various advantages of DataFrame over RDD in Apache Spark?

Answer

DataFrames are distributed collections of data organized into named columns, conceptually similar to a table in a relational database.

As with RDDs, DataFrames are evaluated lazily (computation only happens when an action is required). Their main advantages over RDDs come from the schema information: it allows Spark SQL's Catalyst optimizer to optimize queries and lets Spark store the data in a compact, efficient binary form, which generally gives better performance and lower memory usage than working with raw RDDs of objects.

24- What is a DataSet? What are its advantages over DataFrame and RDD?

Answer

In Apache Spark, Datasets are an extension of the DataFrame API. They offer an object-oriented programming interface and, through Spark SQL, take advantage of Spark's Catalyst optimizer by exposing the data fields to the query planner.

In Spark SQL, a Dataset is a strongly typed data structure that maps to a relational schema. It also represents structured queries with encoders.

An encoder provides on-demand access to individual attributes without having to deserialize an entire object. Spark SQL uses this SerDe framework to make input/output time- and space-efficient. Because the encoder knows the schema of the records, serialization as well as deserialization become possible.

A Spark Dataset is a structured, lazily evaluated query expression that is triggered by an action. Internally, a Dataset represents a logical plan, which describes the computation needed to produce the data. The logical plan is a base Catalyst query plan built from logical operators; once it is analyzed and resolved, a physical query plan can be formed.

As the Dataset was introduced after RDD and DataFrame, it clubs together the features of both. It offers the following features:

  1. The convenience of RDD.
  2. Performance optimization of DataFrame.
  3. Static type-safety of Scala.

Hence, we have observed that Datasets provide a more functional programming interface to work with structured data.

25- On what all basis can you differentiate RDD, DataFrame, and DataSet? [Not very complete]

Answer

DataFrame: a data abstraction and domain-specific language (DSL) applicable to structured and semi-structured data. DataFrames are distributed collections of data in the form of named columns and rows, so in some sense they can be seen as a table in a relational database, but with richer optimizations. Each column may have a different type (numeric, logical, factor, or character).

RDD: the representation of a set of records, an immutable collection of objects for distributed computing. An RDD is a large collection of data, or an array of references to partitioned objects.

Dataset: in Apache Spark, Datasets are an extension of the DataFrame API. They offer an object-oriented programming interface and, through Spark SQL, take advantage of Spark's Catalyst optimizer by exposing the data fields to the query planner.

A more complete comparison: https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/

26- What is Apache Spark Streaming? How is the processing of streaming data achieved in Apache Spark?

Answer

A data stream is data arriving continuously, in an unbounded sequence. With streaming, the continuously flowing input data is divided into discrete units for further processing. Stream processing makes it possible to analyze this streaming data, and to do so with low latency.

Spark Streaming offers scalable, high-throughput and fault-tolerant stream processing of live data streams. Data can be ingested from many sources, for example Apache Flume, Kafka, Amazon Kinesis or TCP sockets, and processed using complex algorithms expressed with high-level functions such as map, reduce, join and window. Afterwards, the processed data can be pushed out to live dashboards, filesystems and databases.

Spark Streaming's key abstraction is the Discretized Stream, also known as a DStream. It represents a stream of data divided into small batches. DStreams are built on Spark's core data abstraction, RDDs.

27- What is the basic abstraction of Spark Streaming?

Answer

A Discretized Stream (DStream). It is a continuous sequence of RDDs representing a continuous stream of data. DStreams can be created either from live data (such as data from HDFS, Kafka or Flume) or by transforming existing DStreams using operations such as map(), window() and reduceByKeyAndWindow().

Internally, a DStream is characterized by a few basic properties (see the sketch after this list):

  • A list of other DStreams that the DStream depends on
  • A time interval at which the DStream generates an RDD
  • A function that is used to generate an RDD after each time interval
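A minimal DStream sketch (hypothetical host and port): a stream is created from a live TCP source, and a new RDD of lines is generated every batch interval.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(5))        // a new RDD is generated every 5 seconds

val lines  = ssc.socketTextStream("localhost", 9999)     // DStream[String] from a live TCP source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()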

28- Explain what are the various types of Transformation on DStream?

Answer

There are two types of transformation on DStream.

  • Stateless transformations: the processing of each batch does not depend on the data of previous batches.

  • Stateful transformations: use data or intermediate results from previous batches to compute the result of the current batch (see the sketch below).
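A minimal sketch, reusing the lines DStream and imports from the sketch in question 27: the same aggregation done statelessly (one batch at a time) and statefully (over a sliding window spanning several batches).

val perBatch = lines.map((_, 1)).reduceByKey(_ + _)                          // stateless: each batch on its own
val windowed = lines.map((_, 1))
                    .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))   // stateful: 30s window, sliding every 10s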

29- Explain the level of parallelism in Spark Streaming. Also, describe its need.

Answer

In order to reduce the processing time, one needs to increase the parallelism. In Spark Streaming, there are three ways to increase the parallelism:

  • Increase the number of receivers: if there are too many records for a single receiver (a single machine) to read in and distribute, that receiver becomes a bottleneck, so we can increase the number of receivers depending on the scenario.
  • Re-partition the received data: if it is not possible to increase the number of receivers, redistribute the incoming data by re-partitioning it.
  • Increase the parallelism in aggregation: operations such as reduceByKey() and reduceByKeyAndWindow() accept an optional number-of-partitions argument that controls the parallelism of the aggregation (see the sketch below).
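A minimal sketch of the last two points, reusing the lines DStream from question 27; the partition counts are illustrative.

val redistributed = lines.repartition(8)                  // spread the received data over more partitions
val counts = redistributed
  .map((_, 1))
  .reduceByKey(_ + _, 8)                                  // explicitly set the parallelism of the aggregation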

30- Discuss Write Ahead logging (WAL) in Apache Spark Streaming.

Answer

There are two types of failures in any Apache Spark job – Either the driver failure or the worker failure.

When a worker node fails, the executor processes running on that node are killed, the tasks which were scheduled on that node are automatically moved to other running worker nodes, and those tasks complete there.

When the driver or master node fails, all of the associated worker nodes running the executors are killed, which releases the data stored in each executor's memory. In the case of files being read from reliable and fault-tolerant file systems like HDFS, zero data loss is guaranteed, as the data can be re-read from the file system at any time. However, this is not always the case, so zero data loss is not always guaranteed: for receiver-based sources, the data is buffered in the executors' memory until it gets processed.

Checkpointing can be used to ensure fault tolerance in Spark, by periodically saving the application data in specific intervals.

To overcome this data loss scenario, Write Ahead Logging (WAL) was introduced. With WAL enabled, the intention of the operation is first written to a log file, so that if the driver fails and is restarted, the operations noted in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers receive the data and store it in the executors' memory. With WAL enabled, this received data is also stored in the log files.

WAL can be enabled as follows:

  • Setting the checkpoint directory, using streamingContext.checkpoint(path)
  • Enabling WAL logging, by setting spark.streaming.receiver.writeAheadLog.enable to true

31- What do you mean by Speculative execution in Apache Spark?

Answer

It is a health-check process which spots tasks running slower than the median of the successfully completed tasks in the task set. Such tasks are submitted to another worker. The new copy of the task runs in parallel, without shutting down the slow one.
