Spark SQL Interview Questions - ignacio-alorre/Spark GitHub Wiki
Answer
Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.
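A minimal sketch of this with Spark 2.x APIs, assuming an existing SparkSession `spark` and SparkContext `sc`; the data and table name are illustrative:

```scala
import spark.implicits._

val peopleRDD = sc.parallelize(Seq(("Alice", 30), ("Bob", 25)))

val peopleDF = peopleRDD.toDF("name", "age")        // RDD -> DataFrame
peopleDF.createOrReplaceTempView("people")          // register as a SQL "table"
spark.sql("SELECT name FROM people WHERE age > 26").show()
```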
Answer
SparkSession is the unified entry point of a Spark application since Spark 2.0. It provides a way to interact with Spark's various functionality through a smaller number of constructs. Instead of having a separate `SparkContext`, `HiveContext` and `SQLContext`, all of it is now encapsulated in a SparkSession.
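A minimal sketch of building the single entry point; the app name and master are illustrative, and `enableHiveSupport()` assumes the Hive dependencies are on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")              // normally supplied via spark-submit instead
  .enableHiveSupport()             // replaces the old HiveContext
  .getOrCreate()

spark.sql("SELECT 1 AS id").show() // what SQLContext used to provide
val sc = spark.sparkContext        // the underlying SparkContext is still available
```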
Answer
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where:
- 3 is the size of the vector.
- [0, 2] are the indices of the non-zero entries
- [1.0, 3.0] are the values at those indices
A sparse vector stores only the non-zero entries in order to save space. As explained above, besides the dimension, it has two parallel arrays:
- One for indices
- The other for values
```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

// Convert a sparse vector into a dense vector
val sv = Vectors.sparse(5, Array(0, 3), Array(1.5, -1.5))
sv.toDense

// Convert a dense vector into a sparse vector
dv.toSparse
```
Sparse vectors are especially useful when many of the values in the original vector are zero. Also, some ML algorithms cannot handle zero values.
Answer
- Transformations: Functions executed on demand to produce a new RDD. Transformations are lazy and are only executed when an action follows them. Some examples of transformations are `map`, `filter` and `reduceByKey`.
- Actions: The results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions are `reduce`, `collect`, `first` and `take`.
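As a minimal sketch of the difference, assuming an existing SparkContext `sc`:

```scala
val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)   // transformation: only defines a new RDD, runs nothing
val total = evens.reduce(_ + _)       // action: triggers the actual computation (30)
```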
Answer
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using Broadcast Variables: broadcast variables enhance the efficiency of joins between small and large RDDs.
- Using Accumulators: accumulators help update the values of variables in parallel while executing.
- The most common way is to avoid `*ByKey` operations, `repartition` or any other operations which trigger shuffles.
Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
During computations, a single task operates on a single partition. Thus, to organize all the data for a single `reduceByKey` reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Operations which can cause a shuffle include repartition operations like `repartition` and `coalesce`, `*ByKey` operations (except for counting) like `groupByKey` and `reduceByKey`, and join operations like `cogroup` and `join`.
Some common operations and the type of partitioner they use:
- `join`: hash partition
- `leftOuterJoin`: hash partition
- `rightOuterJoin`: hash partition
- `groupByKey`: hash partition
- `reduceByKey`: hash partition
- `combineByKey`: hash partition
- `sortByKey`: range partition
- `intersection`: hash partition
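As a small hedged illustration, the partitioner an operation produces can be inspected through the RDD's `partitioner` field (assuming an existing SparkContext `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).partitioner   // Some(HashPartitioner)
pairs.sortByKey().partitioner          // Some(RangePartitioner)
```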
Answer
These are read-only variables, kept in an in-memory cache on every machine. When working with Spark, the usage of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances the retrieval efficiency when compared to an RDD `lookup()`.
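A minimal sketch of a broadcast lookup table, assuming an existing SparkContext `sc`; the map contents are illustrative:

```scala
val countryByCode = Map("ES" -> "Spain", "FR" -> "France")
val bCountries    = sc.broadcast(countryByCode)        // shipped once per executor, not per task

val codes     = sc.parallelize(Seq("ES", "FR", "ES"))
val countries = codes.map(c => bCountries.value.getOrElse(c, "unknown"))
countries.collect()                                    // Array(Spain, France, Spain)
```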
Answer
The RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.
Answer
In networking, a sliding window controls the transmission of data packets between computer networks. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that particular window are combined and operated upon to produce new RDDs of the windowed DStream.
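A hedged sketch of a windowed computation, assuming a `DStream[String]` named `lines` (for example obtained from `socketTextStream`); the window and slide durations are illustrative:

```scala
import org.apache.spark.streaming.Seconds

val windowedWordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // 30s window, sliding every 10s
```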
Answer
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS and Apache Flume. DStreams have two kinds of operations:
- Transformations that produce a new DStream
- Output operations that write data to an external system.
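A minimal sketch of a DStream with one transformation and one output operation, assuming a local master; the host, port and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setMaster("local[2]").setAppName("DStreamExample")
val ssc   = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // source DStream
lines.map(_.toUpperCase).print()                      // transformation + output operation

ssc.start()
ssc.awaitTermination()
```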
Answer
Parquet is a columnar format file that helps to:
- Limit I/O operations
- Consume less space
- Fetch only the required columns
Answer
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a `reduceByKey()` method that collects data based on each key, and a `join()` method that combines different RDDs together based on the elements having the same key.
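A minimal sketch of both pair-RDD methods, assuming an existing SparkContext `sc`; the data is illustrative:

```scala
val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 1)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

val totals = sales.reduceByKey(_ + _)   // ("apple", 4), ("pear", 2)
val joined = totals.join(prices)        // ("apple", (4, 0.5)), ("pear", (2, 0.8))
joined.collect()
```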
Answer
- Stateless Transformations: Processing of a batch does not depend on the output of the previous batch. Examples: `map()`, `reduceByKey()`, `filter()`.
- Stateful Transformations: Processing of a batch depends on the intermediary results of the previous batch. Examples: transformations that depend on sliding windows.
Answer
No. Spark works well only for simple machine learning algorithms like clustering, regression and classification.
Answer
`persist()` allows the user to specify the storage level, whereas `cache()` uses the default storage level (MEMORY_ONLY).
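A minimal sketch of the difference, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 1000)
a.cache()                                     // equivalent to a.persist(StorageLevel.MEMORY_ONLY)

val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel.MEMORY_AND_DISK_SER)   // explicitly chosen storage level
```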
Answer
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the `persist()` method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels. The various storage/persistence levels in Spark are:
- DISK_ONLY - Stores the RDD partitions only on the disk.
- MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
- MEMORY_ONLY_SER - Stores RDD as serialized Java objects (one byte array per partition).
- MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
- MEMORY_ONLY - The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
- OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
Answer
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
Answer
- Driver: The process that runs the `main()` method of the program to create RDDs and perform transformations and actions on them.
- Executor: The worker processes that run the individual tasks of a Spark job.
- Cluster Manager: A pluggable component in Spark, used to launch Executors and Drivers. The cluster manager allows Spark to run on top of external managers like Apache Mesos or YARN.
Answer
Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark doesn't have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
Answer
- Enhanced Speed: MapReduce makes use of persistent storage for carrying out any of its data processing tasks. In contrast, Spark uses in-memory processing that offers about 10 to 100 times faster processing than Hadoop MapReduce.
- Multitasking: Hadoop only supports batch processing via its built-in libraries. Apache Spark, on the other hand, comes with built-in libraries for performing multiple tasks from the same core, including batch processing, interactive SQL queries, machine learning, and streaming.
- No Disk Dependency: While Hadoop MapReduce is highly disk-dependent, Spark mostly uses caching and in-memory data storage.
- Iterative Computation: Performing computations several times on the same dataset is termed iterative computation. Spark is capable of iterative computation, while Hadoop MapReduce isn't.
Answer
If the user does not explicitly specify the number of partitions, it is taken from the default level of parallelism in Apache Spark (`spark.default.parallelism`).
Answer
- The foremost step in a Spark program involves creating input RDDs from external data.
- Use various RDD transformations like `filter()` to create new transformed RDDs based on the business logic.
- `persist()` any intermediate RDDs which might have to be reused in the future.
- Launch various RDD actions like `first()` and `count()` to begin parallel computation, which will then be optimized and executed by Spark.
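A minimal sketch of this workflow, assuming an existing SparkContext `sc`; the input path and filter condition are hypothetical:

```scala
val lines  = sc.textFile("data/input.txt")        // 1. create an input RDD from external data
val errors = lines.filter(_.contains("ERROR"))    // 2. transform it according to business logic
errors.persist()                                  // 3. persist an intermediate RDD that is reused
val total  = errors.count()                       // 4. actions launch the parallel computation
val first  = errors.first()
```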
22- Explain the concept of RDD (Resilient Distributed Dataset). Also, state how you can create RDDs in Apache Spark
Answer
An RDD, or Resilient Distributed Dataset, is a fault-tolerant collection of operational elements that are capable of running in parallel. Any partitioned data in an RDD is distributed and immutable.
Fundamentally, RDDs are portions of data that are stored in memory distributed over many nodes. RDDs are lazily evaluated in Spark, which is the main factor contributing to the higher speed achieved by Apache Spark. RDDs are of two types:
- Hadoop Datasets: Perform functions on each file record in HDFS (Hadoop Distributed File System) or other types of storage systems
- Parallelized Collections: Existing RDDs running in parallel with one another
There are two ways of creating an RDD in Apache Spark:
- By parallelizing a collection in the Driver program, using SparkContext's `parallelize()` method. For instance:

```scala
val DataArray = Array(22, 24, 46, 81, 101)
val DataRDD = sc.parallelize(DataArray)
```

- By loading an external dataset from external storage, such as HBase, HDFS or a shared file system
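A small hedged sketch of the second approach, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
// Create an RDD from an external dataset stored in HDFS
val externalRDD = sc.textFile("hdfs:///data/events/part-*")
```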
Answer
Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read and write operations. Columnar storage has the following advantages:
- Fetches only the specific columns that need to be accessed
- Consumes less space
- Follows type-specific encoding
- Limits I/O operations
- Offers better summarized data
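A minimal sketch of reading and writing Parquet through Spark SQL, assuming an existing SparkSession `spark`; the paths and column name are hypothetical:

```scala
val df = spark.read.json("data/people.json")        // any DataFrame source
df.write.parquet("data/people.parquet")             // persist in columnar Parquet format

val parquetDF = spark.read.parquet("data/people.parquet")
parquetDF.select("name").show()                     // only the requested column is read
```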
Answer
Minimizing data transfers and avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways of minimizing data transfers while working with Apache Spark are:
- Avoiding `*ByKey` operations, `repartition`, and other operations responsible for triggering shuffles
- Using Accumulators: accumulators provide a way of updating the values of variables while executing them in parallel
- Using Broadcast Variables: a broadcast variable helps in enhancing the efficiency of joins between small and large RDDs
Answer
- It doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefitting from a file management system
- Higher latency and, consequently, lower throughput
- No support for true real-time data stream processing. The live data stream is partitioned into micro-batches in Apache Spark, processed, and the results are again produced as batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
- Fewer algorithms available
- Spark Streaming doesn't support record-based window criteria
- The work needs to be distributed over multiple clusters instead of running everything on a single node
- While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck
Answer
Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the `spark.executor.memory` property or the `--executor-memory` flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
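A hedged sketch of setting the executor heap programmatically; the values are illustrative and, in practice, they are usually passed to spark-submit via the `--executor-memory` flag:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExecutorMemoryExample")
  .config("spark.executor.memory", "4g")   // per-executor heap size
  .config("spark.executor.cores", "2")     // fixed number of cores per executor
  .getOrCreate()
```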
Answer
As the name suggests, a partition is a smaller and logical division of data, similar to a 'split' in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations. Everything in Spark is a partitioned RDD.
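A minimal sketch of controlling and inspecting partitions, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)   // request 4 partitions explicitly
rdd.getNumPartitions                                // 4
val repartitioned = rdd.repartition(8)              // reshuffles the data into 8 partitions
```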
Answer
Transformations are functions applied on an RDD, resulting in another RDD. They are not executed until an action occurs. `map()` and `filter()` are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while `filter()` creates a new RDD by selecting the elements from the current RDD that pass the function argument.
```scala
val rawData = sc.textFile("movies.txt")
val moviesData = rawData.map(x => x.split(" "))
```
As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.
Answer
An action helps in bringing back the data from an RDD to the local machine. An action's execution is the result of all previously created transformations. Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations and return the final results to the Driver program or write them out to the file system. `reduce()` is an action that applies the function passed to it again and again until only one value is left. `take(n)` is an action that returns the first n elements of the RDD to the local node.
```scala
moviesData.saveAsTextFile("MoviesData.txt")
```
As we can see here, moviesData RDD is saved into a text file called MoviesData.txt.
Answer
Spark does not support data replication in memory and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.
Answer
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkSession, connected to a given Spark Master. The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
Answer
When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Answer
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
A worker node is basically the slave node. The master node assigns work, and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.
Answer
Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
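A minimal sketch of a named long accumulator, assuming an existing SparkContext `sc`; the data is illustrative:

```scala
val errorCount = sc.longAccumulator("errorCount")   // named accumulator, shown in the Spark UI

sc.parallelize(Seq("ok", "ERROR", "ok", "ERROR"))
  .foreach(line => if (line == "ERROR") errorCount.add(1))

errorCount.value                                    // 2, read on the driver only
```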
Answer
Checkpoints are similar to checkpoints in gaming. They allow an application to run 24/7 and make it resilient to failures unrelated to the application logic.
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing (in addition to the REPLICATE flag of `persist`). However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
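A minimal sketch of RDD checkpointing, assuming an existing SparkContext `sc`; the checkpoint directory is hypothetical:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()     // marks the RDD for checkpointing, truncating its lineage
rdd.count()          // the checkpoint is materialized when an action runs
```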
Answer
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless asked for the final result. When a transformation like `map()` is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
Answer
A `SchemaRDD` is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on the SparkSQL core module. The idea boils down to describing the data structures inside the RDD using a formal description similar to a relational database schema. On top of all the basic functions provided by the common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
It has since been officially renamed to the DataFrame API in Spark's more recent releases.
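A minimal sketch of the DataFrame API that replaced SchemaRDD, assuming an existing SparkSession `spark`; the data is illustrative:

```scala
import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
df.printSchema()                   // per-column schema information
df.filter($"age" > 26).show()      // relational-style query, handled by Spark SQL
```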