Spark SQL Interview Questions - ignacio-alorre/Spark GitHub Wiki
Answer
Not directly, but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.
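A minimal sketch of this with Spark 2.x APIs, assuming an existing SparkSession `spark` and SparkContext `sc`; the data and table name are illustrative:

```scala
import spark.implicits._

val peopleRDD = sc.parallelize(Seq(("Alice", 30), ("Bob", 25)))

val peopleDF = peopleRDD.toDF("name", "age")        // RDD -> DataFrame
peopleDF.createOrReplaceTempView("people")          // register as a SQL "table"
spark.sql("SELECT name FROM people WHERE age > 26").show()
```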
Answer
SparkSession is the unified entry point of a Spark application since Spark 2.0. It provides a way to interact with Spark's various functionality through a smaller number of constructs. Instead of having a separate `SparkContext`, `HiveContext` and `SQLContext`, all of it is now encapsulated in a SparkSession.
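A minimal sketch of building the single entry point; the app name and master are illustrative, and `enableHiveSupport()` assumes the Hive dependencies are on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSessionExample")
  .master("local[*]")              // normally supplied via spark-submit instead
  .enableHiveSupport()             // replaces the old HiveContext
  .getOrCreate()

spark.sql("SELECT 1 AS id").show() // what SQLContext used to provide
val sc = spark.sparkContext        // the underlying SparkContext is still available
```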
Answer
A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, a vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where:
- 3 is the size of the vector.
- [0, 2] are the indices of the non-zero entries
- [1.0, 3.0] are the values at those indices
A sparse vector stores only the non-zero entries in order to save space. As explained above, besides the dimension, it has two parallel arrays:
- One for indices
- The other for values
```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

// Convert a sparse vector into a dense vector
val sv = Vectors.sparse(5, Array(0, 3), Array(1.5, -1.5))
sv.toDense

// Convert a dense vector into a sparse vector
dv.toSparse
```
Sparse vectors are especially useful when many of the values in the original vector are zero. Also, some ML algorithms cannot handle zero values.
Answer
- Transformations: Functions executed on demand to produce a new RDD. Transformations are lazy and are only executed when an action follows them. Some examples of transformations are `map`, `filter` and `reduceByKey`.
- Actions: The results of RDD computations or transformations. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions are `reduce`, `collect`, `first` and `take`.
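As a minimal sketch of the difference, assuming an existing SparkContext `sc`:

```scala
val nums  = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)   // transformation: only defines a new RDD, runs nothing
val total = evens.reduce(_ + _)       // action: triggers the actual computation (30)
```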
Answer
Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using Broadcast Variables: broadcast variables enhance the efficiency of joins between small and large RDDs.
- Using Accumulators: accumulators help update the values of variables in parallel while executing.
- The most common way is to avoid `*ByKey` operations, `repartition` or any other operations which trigger shuffles.
Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.
During computations, a single task operates on a single partition. Thus, to organize all the data for a single `reduceByKey` reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Operations which can cause a shuffle include repartition operations like `repartition` and `coalesce`, `*ByKey` operations (except for counting) like `groupByKey` and `reduceByKey`, and join operations like `cogroup` and `join`.
Some common operations and the type of partitioner they use:
- `join`: hash partition
- `leftOuterJoin`: hash partition
- `rightOuterJoin`: hash partition
- `groupByKey`: hash partition
- `reduceByKey`: hash partition
- `combineByKey`: hash partition
- `sortByKey`: range partition
- `intersection`: hash partition
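As a small hedged illustration, the partitioner an operation produces can be inspected through the RDD's `partitioner` field (assuming an existing SparkContext `sc`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).partitioner   // Some(HashPartitioner)
pairs.sortByKey().partitioner          // Some(RangePartitioner)
```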
Answer
These are read-only variables, kept in an in-memory cache on every machine. When working with Spark, the usage of broadcast variables eliminates the necessity to ship copies of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory, which enhances the retrieval efficiency when compared to an RDD `lookup()`.
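A minimal sketch of a broadcast lookup table, assuming an existing SparkContext `sc`; the map contents are illustrative:

```scala
val countryByCode = Map("ES" -> "Spain", "FR" -> "France")
val bCountries    = sc.broadcast(countryByCode)        // shipped once per executor, not per task

val codes     = sc.parallelize(Seq("ES", "FR", "ES"))
val countries = codes.map(c => bCountries.value.getOrElse(c, "unknown"))
countries.collect()                                    // Array(Spain, France, Spain)
```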
Answer
The RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.
Answer
In networking, a sliding window controls the transmission of data packets between computer networks. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that particular window are combined and operated upon to produce new RDDs of the windowed DStream.
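A hedged sketch of a windowed computation, assuming a `DStream[String]` named `lines` (for example obtained from `socketTextStream`); the window and slide durations are illustrative:

```scala
import org.apache.spark.streaming.Seconds

val windowedWordCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)) // 30s window, sliding every 10s
```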
Answer
A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS and Apache Flume. DStreams have two kinds of operations:
- Transformations that produce a new DStream
- Output operations that write data to an external system.
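A minimal sketch of a DStream with one transformation and one output operation, assuming a local master; the host, port and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setMaster("local[2]").setAppName("DStreamExample")
val ssc   = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)   // source DStream
lines.map(_.toUpperCase).print()                      // transformation + output operation

ssc.start()
ssc.awaitTermination()
```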
Answer
Parquet is a columnar format file that helps to:
- Limit I/O operations
- Consume less space
- Fetch only the required columns
Answer
Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a `reduceByKey()` method that collects data based on each key, and a `join()` method that combines different RDDs together based on the elements having the same key.
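A minimal sketch of both pair-RDD methods, assuming an existing SparkContext `sc`; the data is illustrative:

```scala
val sales  = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 1)))
val prices = sc.parallelize(Seq(("apple", 0.5), ("pear", 0.8)))

val totals = sales.reduceByKey(_ + _)   // ("apple", 4), ("pear", 2)
val joined = totals.join(prices)        // ("apple", (4, 0.5)), ("pear", (2, 0.8))
joined.collect()
```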
Answer
- Stateless Transformations: Processing of a batch does not depend on the output of the previous batch. Examples: `map()`, `reduceByKey()`, `filter()`.
- Stateful Transformations: Processing of a batch depends on the intermediary results of the previous batch. Examples: transformations that depend on sliding windows.
Answer
No. Spark works well only for simple machine learning algorithms like clustering, regression and classification.
Answer
`persist()` allows the user to specify the storage level, whereas `cache()` uses the default storage level (MEMORY_ONLY).
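A minimal sketch of the difference, assuming an existing SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 1000)
a.cache()                                     // equivalent to a.persist(StorageLevel.MEMORY_ONLY)

val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel.MEMORY_AND_DISK_SER)   // explicitly chosen storage level
```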
Answer
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the `persist()` method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels. The various storage/persistence levels in Spark are:
- DISK_ONLY - Stores the RDD partitions only on the disk.
- MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
- MEMORY_ONLY_SER - Stores RDD as serialized Java objects (one byte array per partition).
- MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
- MEMORY_ONLY - The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
- OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
Answer
The data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. An RDD always has the information on how to build itself from other datasets. If any partition of an RDD is lost due to failure, lineage helps rebuild only that particular lost partition.
Answer
- Driver: The process that runs the `main()` method of the program to create RDDs and perform transformations and actions on them.
- Executor: The worker processes that run the individual tasks of a Spark job.
- Cluster Manager: A pluggable component in Spark, used to launch Executors and Drivers. The cluster manager allows Spark to run on top of external managers like Apache Mesos or YARN.
Answer
Apache Spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Apache Spark's in-memory capability at times becomes a major roadblock for cost-efficient processing of big data. Also, Spark doesn't have its own file management system and hence needs to be integrated with other cloud-based data platforms or Apache Hadoop.
Answer
- Enhanced Speed: MapReduce makes use of persistent storage for carrying out any of its data processing tasks. In contrast, Spark uses in-memory processing that offers about 10 to 100 times faster processing than Hadoop MapReduce.
- Multitasking: Hadoop only supports batch processing via its built-in libraries. Apache Spark, on the other hand, comes with built-in libraries for performing multiple tasks from the same core, including batch processing, interactive SQL queries, machine learning, and streaming.
- No Disk Dependency: While Hadoop MapReduce is highly disk-dependent, Spark mostly uses caching and in-memory data storage.
- Iterative Computation: Performing computations several times on the same dataset is termed iterative computation. Spark is capable of iterative computation, while Hadoop MapReduce isn't.
Answer
If the user does not explicitly specify the number of partitions, it is taken from the default level of parallelism in Apache Spark (`spark.default.parallelism`).
Answer
- The foremost step in a Spark program involves creating input RDDs from external data.
- Use various RDD transformations like `filter()` to create new transformed RDDs based on the business logic.
- `persist()` any intermediate RDDs which might have to be reused in the future.
- Launch various RDD actions like `first()` and `count()` to begin parallel computation, which will then be optimized and executed by Spark.
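A minimal sketch of this workflow, assuming an existing SparkContext `sc`; the input path and filter condition are hypothetical:

```scala
val lines  = sc.textFile("data/input.txt")        // 1. create an input RDD from external data
val errors = lines.filter(_.contains("ERROR"))    // 2. transform it according to business logic
errors.persist()                                  // 3. persist an intermediate RDD that is reused
val total  = errors.count()                       // 4. actions launch the parallel computation
val first  = errors.first()
```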
22- Explain the concept of RDD (Resilient Distributed Dataset). Also, state how you can create RDDs in Apache Spark
Answer
An RDD, or Resilient Distributed Dataset, is a fault-tolerant collection of operational elements that are capable of running in parallel. Any partitioned data in an RDD is distributed and immutable.
Fundamentally, RDDs are portions of data that are stored in memory distributed over many nodes. RDDs are lazily evaluated in Spark, which is the main factor contributing to the higher speed achieved by Apache Spark. RDDs are of two types:
- Hadoop Datasets: Perform functions on each file record in HDFS (Hadoop Distributed File System) or other types of storage systems
- Parallelized Collections: Existing RDDs running in parallel with one another
There are two ways of creating an RDD in Apache Spark:
- By parallelizing a collection in the Driver program, using SparkContext's `parallelize()` method. For instance:

```scala
val DataArray = Array(22, 24, 46, 81, 101)
val DataRDD = sc.parallelize(DataArray)
```

- By loading an external dataset from external storage, such as HBase, HDFS or a shared file system
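A small hedged sketch of the second approach, assuming an existing SparkContext `sc`; the HDFS path is hypothetical:

```scala
// Create an RDD from an external dataset stored in HDFS
val externalRDD = sc.textFile("hdfs:///data/events/part-*")
```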
Answer
Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read and write operations. Columnar storage has the following advantages:
- Fetches only the specific columns that need to be accessed
- Consumes less space
- Follows type-specific encoding
- Limits I/O operations
- Offers better summarized data
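A minimal sketch of reading and writing Parquet through Spark SQL, assuming an existing SparkSession `spark`; the paths and column name are hypothetical:

```scala
val df = spark.read.json("data/people.json")        // any DataFrame source
df.write.parquet("data/people.parquet")             // persist in columnar Parquet format

val parquetDF = spark.read.parquet("data/people.parquet")
parquetDF.select("name").show()                     // only the requested column is read
```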
Answer
Minimizing data transfers and avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways of minimizing data transfers while working with Apache Spark are:
- Avoiding `*ByKey` operations, `repartition`, and other operations responsible for triggering shuffles
- Using Accumulators: accumulators provide a way of updating the values of variables while executing them in parallel
- Using Broadcast Variables: a broadcast variable helps in enhancing the efficiency of joins between small and large RDDs
Answer
- It doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefitting from a file management system
- Higher latency and, consequently, lower throughput
- No support for true real-time data stream processing. The live data stream is partitioned into micro-batches in Apache Spark, processed, and the results are again produced as batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
- Fewer algorithms available
- Spark Streaming doesn't support record-based window criteria
- The work needs to be distributed over multiple clusters instead of running everything on a single node
- While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck
Answer
Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the `spark.executor.memory` property or the `--executor-memory` flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
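A hedged sketch of setting the executor heap programmatically; the values are illustrative and, in practice, they are usually passed to spark-submit via the `--executor-memory` flag:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExecutorMemoryExample")
  .config("spark.executor.memory", "4g")   // per-executor heap size
  .config("spark.executor.cores", "2")     // fixed number of cores per executor
  .getOrCreate()
```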
Answer
As the name suggests, a partition is a smaller and logical division of data, similar to a 'split' in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks in order to optimize transformation operations. Everything in Spark is a partitioned RDD.
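A minimal sketch of controlling and inspecting partitions, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 100, numSlices = 4)   // request 4 partitions explicitly
rdd.getNumPartitions                                // 4
val repartitioned = rdd.repartition(8)              // reshuffles the data into 8 partitions
```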
Answer
Transformations are functions applied on an RDD, resulting in another RDD. They are not executed until an action occurs. `map()` and `filter()` are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while `filter()` creates a new RDD by selecting the elements from the current RDD that pass the function argument.
```scala
val rawData = sc.textFile("movies.txt")
val moviesData = rawData.map(x => x.split(" "))
```
As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.
Answer
An action helps in bringing back the data from an RDD to the local machine. An action's execution is the result of all previously created transformations. Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations and return the final results to the Driver program or write them out to the file system. `reduce()` is an action that applies the function passed to it again and again until only one value is left. `take(n)` is an action that returns the first n elements of the RDD to the local node.
```scala
moviesData.saveAsTextFile("MoviesData.txt")
```
As we can see here, moviesData RDD is saved into a text file called MoviesData.txt.
Answer
Spark does not support data replication in memory and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.
Answer
Spark Driver is the program that runs on the master node of the machine and declares transformations and actions on data RDDs. In simple terms, a driver in Spark creates SparkSession, connected to a given Spark Master. The driver also delivers the RDD graphs to Master, where the standalone cluster manager runs.
Answer
When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
Answer
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
A worker node is basically the slave node. The master node assigns work, and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.
Answer
Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
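A minimal sketch of a named long accumulator, assuming an existing SparkContext `sc`; the data is illustrative:

```scala
val errorCount = sc.longAccumulator("errorCount")   // named accumulator, shown in the Spark UI

sc.parallelize(Seq("ok", "ERROR", "ok", "ERROR"))
  .foreach(line => if (line == "ERROR") errorCount.add(1))

errorCount.value                                    // 2, read on the driver only
```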
Answer
Checkpoints are similar to checkpoints in gaming. They allow an application to run 24/7 and make it resilient to failures unrelated to the application logic.
Lineage graphs are always useful to recover RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing (in addition to the REPLICATE flag of `persist`). However, the decision on which data to checkpoint is left to the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
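A minimal sketch of RDD checkpointing, assuming an existing SparkContext `sc`; the checkpoint directory is hypothetical:

```scala
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()     // marks the RDD for checkpointing, truncating its lineage
rdd.count()          // the checkpoint is materialized when an action runs
```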
Answer
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless asked for the final result. When a transformation like `map()` is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
Answer
A `SchemaRDD` is an RDD that consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on the SparkSQL core module. The idea boils down to describing the data structures inside the RDD using a formal description similar to a relational database schema. On top of all the basic functions provided by the common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
It has since been officially renamed to the DataFrame API in Spark's more recent releases.
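A minimal sketch of the DataFrame API that replaced SchemaRDD, assuming an existing SparkSession `spark`; the data is illustrative:

```scala
import spark.implicits._

val df = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
df.printSchema()                   // per-column schema information
df.filter($"age" > 26).show()      // relational-style query, handled by Spark SQL
```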