Spark Interview Questions - ayushmathur94/Spark GitHub Wiki
Topics to prepare:

- Why Spark over MapReduce? / How does Spark give better performance than MapReduce?
- Broadcast variables
- Executors vs. executor cores
- Spark optimization techniques
- Executor memory vs. driver memory
- Spark SQL
- Transformations
- Hive internal working
- Partitioning and bucketing (both in Spark and Hive)
- HBase
- SQL window functions: how to use LEAD and LAG
- SQL and DataFrames
- Differences between Spark RDD, DataFrame, and Dataset
- Applying joins on DataFrames
- Reading from HDD
- HDFS
- Consistency, availability, partition tolerance (CAP) in HBase or MongoDB
**Question: Can you explain how to minimize data transfers while working with Spark?**

Answer: Minimizing data transfers and avoiding shuffles helps in writing Spark programs that run reliably and fast. Ways to minimize data transfers while working with Apache Spark include:

- Avoiding ByKey operations, repartition, and other operations that trigger shuffles
- Using accumulators, which provide a way to update the values of variables while executing in parallel
- Using broadcast variables, which improve the efficiency of joins between small and large RDDs

**Question: What are broadcast variables in Apache Spark? Why do we need them?**

Answer: Rather than shipping a copy of a variable with every task, a broadcast variable keeps a read-only cached copy of the variable on each machine.
Broadcast variables are also used to give every node a copy of a large input dataset. Apache Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.

Using broadcast variables eliminates the need to ship a copy of the variable with each task, so data can be processed more quickly. Compared to an RDD lookup(), broadcast variables keep the lookup table in memory, which improves retrieval efficiency.
**Question: Please provide an explanation of DStream in Spark.**

Answer: DStream is short for Discretized Stream. It is the basic abstraction offered by Spark Streaming: a continuous stream of data. A DStream is created either directly from a data source or by transforming another DStream.

A DStream is represented by a continuous series of RDDs, where each RDD contains the data from a certain interval. Applying an operation to a DStream is analogous to applying the same operation to each underlying RDD. A DStream supports two kinds of operations:

- Transformations, which produce a new DStream
- Output operations, which write data to an external system

DStreams can be created from various sources, including Apache Kafka, Apache Flume, and HDFS, and Spark Streaming provides support for many DStream transformations.
**Question: Does Apache Spark provide checkpoints?**

Answer: Yes, Apache Spark provides checkpoints. They allow a program to run around the clock and make it resilient to failures unrelated to application logic. Without checkpoints, lineage graphs are used to recover RDDs after a failure.

Apache Spark provides an API for adding and managing checkpoints; the user decides which data to checkpoint. Checkpoints are preferred over lineage graphs when the lineage is long and has wide dependencies.
**Question: What are the different levels of persistence in Spark?**

Answer: Although the intermediate data from shuffle operations is persisted automatically, it is recommended to call the persist() method on an RDD whose data will be reused.

Apache Spark offers several persistence levels for storing RDDs on disk, in memory, or a combination of the two, with different replication levels:

- MEMORY_ONLY - The default level. Stores the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions are not cached and are recomputed on the fly each time they are needed.
- MEMORY_AND_DISK - Stores the RDD as deserialized Java objects in the JVM. Partitions that do not fit in memory are stored on disk and read from there whenever they are needed.
- MEMORY_ONLY_SER - Stores the RDD as serialized Java objects, one byte array per partition.
- MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER, except that partitions that do not fit in memory are spilled to disk instead of being recomputed on the fly when required.
- DISK_ONLY - Stores the RDD partitions only on disk.
- OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory.

**Question: Can you list the limitations of using Apache Spark?**

Answer:
- It does not have a built-in file management system, so it needs to be integrated with other platforms such as Hadoop to benefit from one.
- Higher latency and, consequently, lower throughput.
- No support for true real-time stream processing: the live data stream is partitioned into batches, processed, and emitted again as batches. Spark Streaming is therefore micro-batch processing, not true real-time processing.
- A smaller number of algorithms available.
- Spark Streaming does not support record-based window criteria.
- The work needs to be distributed over multiple clusters instead of running everything on a single node.
- When using Apache Spark for cost-efficient processing of big data, its in-memory design can become a bottleneck, since memory is expensive.