2020_interview_Data_engineer - RatneshKumarSrivastava/Ratnesh GitHub Wiki

  1. What's your cluster size? Production: 590 TB, Test: 200 TB, Analyst: 75 TB, ADS: 100 TB.

  2. How much data do you deal with on a daily basis? Transaction: 75 million records, Dimension: 218 million records.

  3. What is your role in your big data project?

  4. Are you using an on-premise setup, or is it in the cloud?

  5. Which big data distribution are you using?

  6. What's the most challenging thing that you have faced in your big data project? How did you overcome it?

  7. What's the configuration of each node in your cluster?

  8. Did you ever face any performance challenges with your Spark job? How did you optimize it?

  9. What is the difference between standalone mode, YARN mode and local mode in Spark?

  10. As we know, transformations are lazily evaluated in Spark, but when we write a transformation in spark-shell, how does it show output in the shell before an action is applied on the RDD?

  11. How do we process images, videos and other unstructured formats like GIFs using Spark? We say that big data is meant for storing and processing structured, unstructured and semi-structured data, but we generally store and process/analyze only structured and semi-structured formats like CSV, Parquet and ORC, not videos and images. Why?

  12. What is the difference between cache and persist(MEMORY_ONLY)?
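For question 12, a minimal sketch based on the public Spark API (file paths are made up): for an RDD, cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY), while for a DataFrame/Dataset cache() defaults to persist with MEMORY_AND_DISK.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("cache-vs-persist").getOrCreate()

// For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY),
// so these two calls use the same storage level.
val rdd = spark.sparkContext.textFile("/data/orders.txt")   // hypothetical path
rdd.cache()
// rdd.persist(StorageLevel.MEMORY_ONLY)                    // equivalent to cache()

// For DataFrames/Datasets, cache() defaults to MEMORY_AND_DISK,
// so use an explicit persist() if you really want MEMORY_ONLY.
val df = spark.read.json("/data/orders.json")               // hypothetical path
df.persist(StorageLevel.MEMORY_ONLY)
```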

  1. How will you go about choosing among RDD, Data Frame and Data Set? Please explain your thought process.

If it's structured data, I will go with Data Frame as it gives more flexibility in handling the data, i.e. since we have a schema we have more control over the data and can take advantage of the optimizations provided in the Data Frame API. If the data is unstructured/semi-structured, I will go with RDD.
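A hedged sketch of this thought process (paths and schema are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("structured-vs-unstructured").getOrCreate()
import spark.implicits._

// Structured data: declare the schema and use the DataFrame API,
// so Catalyst can optimize the query plan.
val ordersSchema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("order_date", TimestampType),
  StructField("order_customer_id", IntegerType),
  StructField("order_status", StringType)
))
val ordersDf = spark.read.schema(ordersSchema).csv("/data/orders")   // hypothetical path
val closedOrders = ordersDf.filter($"order_status" === "CLOSED")

// Unstructured free-form text: fall back to the low-level RDD API.
val logsRdd = spark.sparkContext.textFile("/data/raw_logs")          // hypothetical path
val errorLines = logsRdd.filter(_.contains("ERROR"))
```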

  2. Do we still have to use RDDs? Or can we avoid them completely?

We can't avoid RDDs completely. If we get unstructured/semi-structured data, we can use RDDs to process it; moreover, the Data Frame and Dataset APIs are abstractions on top of RDDs, so internally they are using RDDs anyway.
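A tiny sketch of that last point - any DataFrame exposes the RDD it is built on:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("df-backed-by-rdd").getOrCreate()

// Every DataFrame is backed by an RDD; .rdd exposes it, which is one reason
// RDDs can never fully go away - DataFrames and Datasets are built on top of them.
val df = spark.range(0, 100).toDF("id")
val rows: RDD[Row] = df.rdd
println(rows.getNumPartitions)
```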

  3. What do you prefer between Data Frames and Data Sets? Explain why. - I didn't work with the Dataset API, so I can't comment on this.

############################################################################################ Apache Spark Join Optimization - 2 Large Tables

Our intent should be to minimize shuffling and maximize parallelism.

Minimize Shuffling - Try filtering the data before the shuffle. Cut down the size as early as possible, and push any aggregation before the shuffle where you can, as sketched below.
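A minimal sketch of filtering and pre-aggregating before a join of two large tables (table paths and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("filter-before-shuffle").getOrCreate()

val orders    = spark.read.parquet("/warehouse/orders")       // hypothetical paths
val lineItems = spark.read.parquet("/warehouse/line_items")

// Push the filter below the join so far less data gets shuffled.
val closedOrders = orders.filter(col("order_status") === "CLOSED")

// Pre-aggregate the detail table before the shuffle-heavy join.
val itemTotals = lineItems
  .groupBy("order_id")
  .agg(sum("item_subtotal").as("order_total"))

val joined = closedOrders.join(itemTotals, Seq("order_id"))
```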

Maximize Parallelism -

  1. Procure the right number of CPU cores. If you request 20 executors with 5 cores each, then at most we can have 100 tasks running in parallel.

  2. Set the right number of shuffle partitions. If you have set it to 50, then you can have at most 50 parallel tasks.

  3. Cardinality should be high. If you have only 20 distinct keys, then at most 20 tasks will be doing work in parallel.

So if you get 20 containers with 5 cores each, your shuffle partitions are set to 50, and there are 20 distinct keys, then you can have only 20 tasks running in parallel.

Max Parallelism = Min(total_CPU_Cores, shuffle_Partitions, Cardinality)

Also, avoid partition skew - consider 20 tasks where 1 of them is overloaded with work; the completion of the job depends on the slowest-performing task.
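A hedged sketch of lining up those three numbers, using the figures from the example above (the submit flags and memory value are illustrative):

```scala
// spark-submit side: procure the cores (20 executors x 5 cores = 100 task slots).
// spark-submit \
//   --num-executors 20 \
//   --executor-cores 5 \
//   --executor-memory 8g \
//   ...

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallelism-tuning").getOrCreate()

// Shuffle partitions cap the number of parallel tasks after a shuffle (50 here).
spark.conf.set("spark.sql.shuffle.partitions", "50")

// If the join/group-by key has only 20 distinct values, at most 20 of those
// partitions receive data, so effective parallelism = min(100, 50, 20) = 20.
```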

################################################################# Spark SQL Internals

Consider you are given the below 2 Spark SQL queries, which look almost identical.

Edited the question - removed the order by clause.

query 1 - select order_customer_id , date_format(order_date, 'MMMM') orderdt, count(1) count, first(cast(date_format(order_date, 'M')as int)) monthnum from orders group by order_customer_id, orderdt

query 2 - select order_customer_id , date_format(order_date, 'MMMM') orderdt, count(1) count, first(date_format(order_date, 'M')) monthnum from orders group by order_customer_id, orderdt

Which query is more performant? Please explain!

ANSWER: I think the first one will give the results faster, because in the first case we are working on an integer (the month number is cast to int) while in the other case it is a string, and processing an integer is comparatively faster than processing a string, so the first one will be fast.

########################################################### Memory Management in Apache Spark.

Consider you have an 8.5 GB file stored in HDFS.

You load this file into a Spark RDD.

You then cache this RDD using cache().

How much memory should it ideally occupy for caching? Is it < 8.5 GB, = 8.5 GB, or > 8.5 GB? Please explain!

Also, what if we do not have sufficient memory to cache it? Then what will be the behavior?

ANSWER: cache() on a DataFrame is by default MEMORY_AND_DISK, so if there is not sufficient memory it will spill to disk. When the 8.5 GB file is cached, the memory it occupies is always slightly more than 8.5 GB (when enough memory is available). I believe the reason behind this is the metadata that the RDD maintains, and it will vary based on the number of partitions. In the case of an RDD the default is MEMORY_ONLY, but in the case of a DataFrame it is both memory and disk.

In this case the size can spike to nearly double, as the data is stored in deserialized form, which also includes object overhead and metadata; so in our case it may take around 16 GB of memory. If there is not enough memory, only part of the RDD will be cached; the remaining partitions will be created on the fly (re-computation will take place).

It takes the exact size of the data based on the data type. For example, if you have an RDD[String] then you need to calculate the size based on that type, i.e. roughly 48 bytes for a one-character String. You can use SizeEstimator from Spark utils to verify the size.
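A small sketch of checking this yourself with SizeEstimator (the file path is made up; the Storage tab of the Spark UI shows the final cached size):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SizeEstimator

val spark = SparkSession.builder().appName("cache-size-check").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.textFile("/data/big_file")      // hypothetical 8.5 GB file
rdd.cache()                                  // MEMORY_ONLY for RDDs
rdd.count()                                  // action to materialize the cache

// Estimate the in-memory (deserialized) size of a single sample object.
println(SizeEstimator.estimate("CLOSED"))    // bytes occupied by one small String

// The Storage tab of the Spark web UI then shows how many partitions were
// actually cached and how much memory they occupy.
```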

############################################## Apache Spark Internals - Sort Aggregate vs Hash Aggregate

Consider we have a file with orders data - size 2.6 GB

A few sample records:

order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT

We created a Spark Dataframe out of it.

Now, we grabbed 11 executors each with 1 GB RAM & 1 CPU core.

query-1 (3.9 minutes)

select order_customer_id, date_format(order_date, 'MMMM') orderdt, count(1) cnt, first(date_format(order_date,'M')) monthnum from orders group by order_customer_id, orderdt

query-2 (1.2 minutes)

select order_customer_id, date_format(order_date, 'MMMM') orderdt, count(1) cnt, first(cast(date_format(order_date,'M') as int)) monthnum from orders group by order_customer_id, orderdt

As you can see, query 2 is much more performant.

The reason is that the first query uses Sort Aggregate and the second query uses Hash Aggregate.

Note: Hash Aggregate is much faster than Sort Aggregate.
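One way to verify this yourself (in spark-shell, where spark is already available, and assuming the orders data has been registered as a temp view) is to print the physical plan of each query and look for SortAggregate vs HashAggregate nodes:

```scala
// Assumes something like: ordersDf.createOrReplaceTempView("orders")

// Physical plan for query-1: look for SortAggregate in the output.
spark.sql(
  """select order_customer_id, date_format(order_date, 'MMMM') orderdt, count(1) cnt,
    |first(date_format(order_date,'M')) monthnum
    |from orders group by order_customer_id, orderdt""".stripMargin).explain()

// Physical plan for query-2: look for HashAggregate in the output.
spark.sql(
  """select order_customer_id, date_format(order_date, 'MMMM') orderdt, count(1) cnt,
    |first(cast(date_format(order_date,'M') as int)) monthnum
    |from orders group by order_customer_id, orderdt""".stripMargin).explain()
```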

Now try answering this: why did the 1st query use Sort Aggregate and the 2nd one use Hash Aggregate?

##################################################################### Performance Optimization in Apache Spark

In the Big Data world, one of the biggest pain points is the shuffling of data.

As good developers, we always intend to avoid/minimize shuffling.

Let's think about this from a join operation perspective. We have 2 scenarios:

  1. Both datasets are big (a single executor cannot handle them)

In this case we cannot avoid shuffling. However, we intend to minimize it.

Possible ways - a) filter the data as early as possible; b) prepare in advance by bucketing both datasets, which makes sure we do the shuffling only once.

  2. One table is big and the other is small

We can completely avoid shuffling by broadcasting the smaller dataset. This makes sure we work on the principle of data locality, as sketched below.
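A minimal sketch of the broadcast approach, assuming hypothetical table paths (Spark can also choose this automatically based on spark.sql.autoBroadcastJoinThreshold):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

val bigOrders   = spark.read.parquet("/warehouse/orders")   // hypothetical large fact table
val smallStates = spark.read.parquet("/warehouse/states")   // small dimension table

// broadcast() ships the small table to every executor, so the big table is
// joined locally and does not need to be shuffled at all.
val joined = bigOrders.join(broadcast(smallStates), Seq("state_id"))
```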

What are other ways to avoid/minimize shuffling?

#######################################################################

  1. The ORC file format works very well with Apache Hive, and Parquet works extremely well with Apache Spark. Please explain why.

  2. When we have Spark SQL these days, why do we still require Hive? Can Spark SQL replace Hive completely? ANSWER: Hive has the special ability of switching between execution engines. Spark works better than Hive for (near) real-time data scenarios, and Hive works better for batch workloads.

  3. Spark Datasets vs DataFrames: we know that Datasets provide compile-time safety but DataFrames do not. Please demonstrate this with an example (a sketch is included after this list).

  4. Spark DataFrame reader API: is read a transformation or an action? Please explain.
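For question 3, a minimal sketch of the compile-time vs run-time difference (the Order case class and field names are made up):

```scala
import org.apache.spark.sql.SparkSession

case class Order(order_id: Int, order_status: String)

val spark = SparkSession.builder().appName("ds-vs-df").getOrCreate()
import spark.implicits._

val ds = Seq(Order(1, "CLOSED"), Order(2, "PENDING_PAYMENT")).toDS()
val df = ds.toDF()

// Dataset: referring to a non-existent field fails at compile time.
// ds.map(_.order_statuss)                // does not compile - caught by the compiler

// DataFrame: the same mistake compiles fine and only fails at run time
// with an AnalysisException.
// df.select("order_statuss").show()

ds.filter(_.order_status == "CLOSED").show()   // type-checked at compile time
```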

########################################################################

Consider you have a huge text file in HDFS.

Step 1: You load that file into an RDD.

Step 2: We use a map to create a pair RDD.

Step 3: We then use reduceByKey.

Step 4: We then do a filter.

Step 5: We do a map.

Step 6: Then we use collect.

Step 7: Finally, we use count on the result of step 5 to count the number of records.

Question 1: How many jobs will you see on the Spark web UI?

Question 2: How does the DAG look for each job?

Question 3: How many stages will be created?
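A rough sketch of the steps above in code (the file path and the pairing key are made up, not part of the original exercise); the comments mark which steps are transformations and which are actions:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jobs-and-stages").getOrCreate()
val sc: SparkContext = spark.sparkContext

val lines    = sc.textFile("/data/huge_text_file")                // step 1: load into an RDD
val pairs    = lines.map(line => (line.split(",")(0), 1))         // step 2: pair RDD (illustrative key)
val reduced  = pairs.reduceByKey(_ + _)                           // step 3: wide transformation (shuffle)
val filtered = reduced.filter { case (_, cnt) => cnt > 10 }       // step 4: filter
val mapped   = filtered.map { case (k, cnt) => s"$k:$cnt" }       // step 5: map
val collected = mapped.collect()                                  // step 6: action
val total     = mapped.count()                                    // step 7: action on the result of step 5
```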

######################################################################### Top 10 Big Data interview questions:

  1. Kindly explain your project architecture?

  2. What optimisation techniques have you used in your project?

  3. What is your role in your project?

  4. What is the most challenging problem you have solved in your big data project?

  5. Can you explain what happens internally when we run a Spark job using Spark-Submit?

  6. What is a catalyst optimiser?

  7. What is the size of data you deal with on daily basis?

  8. What is the size of your Hadoop cluster and the configuration of each node?

  9. How do you tune a Spark job? Please explain the techniques we can try.

  10. When would you prefer to use Hive, and when would you prefer Spark SQL?

################################################################################

Be prepared for all kinds of surprises - the first question itself can be: tell me about your project.

Most of the time, candidates are not prepared for the big data managerial round.

The below questions can be a nightmare if you are not well prepared.

How much data do you deal with on a daily basis?

What is the size of your Hadoop cluster?

What is the configuration of each node?

What is your day-to-day activity in your project?

What is the most challenging thing you faced in your project?

What is your role in the project?

###########################################################################

SerDe is a combination of two things: Serialization + Deserialization.

Serialization - converting data to a form that can be transferred over the network and can be stored.

Deserialization - converting the data back to a form which can be easily understood.

############################################################################

https://www.linkedin.com/posts/bigdatabysumit_sumitteaches-bigdata-apachespark-activity-6697398116166975488-ycv0/