Hive and Spark Interview Questions - rohith-nallam/Bigdata GitHub Wiki

Interview questions

hive

  1. What is the difference between UDF, UDAF, and UDTF?
    Answer
    A UDF operates on a single row and returns a single value (e.g. trim), a UDAF aggregates many rows into one result (e.g. sum, count), and a UDTF takes one row and returns multiple rows (e.g. explode).
  2. What are the differences between sort by, order by, cluster by, and distribute by?
    Answer
    Order by guarantees a total ordering of the output, but it sends all the data to a single reducer, so it doesn't work well with large datasets.
    Sort by sorts the data within each reducer only, so the overall output is not globally ordered.
    Distribute by decides which reducer each row goes to (rows with the same distribute-by value go to the same reducer) but does not sort.
    Cluster by is shorthand for distribute by plus sort by on the same columns, so to get all the rows for a key on one reducer and sorted, use distribute by together with sort by, or simply cluster by, as in the sketch below.
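A minimal HiveQL sketch of the four clauses, assuming a hypothetical table sales(product STRING, amount INT):

    -- ORDER BY: total ordering, all rows go through a single reducer
    SELECT product, amount FROM sales ORDER BY amount DESC;
    -- SORT BY: sorts within each reducer only, no global order
    SELECT product, amount FROM sales SORT BY amount DESC;
    -- DISTRIBUTE BY: routes rows with the same product to the same reducer, without sorting
    SELECT product, amount FROM sales DISTRIBUTE BY product;
    -- DISTRIBUTE BY + SORT BY: same-key rows on one reducer, sorted within it
    SELECT product, amount FROM sales DISTRIBUTE BY product SORT BY product, amount DESC;
    -- CLUSTER BY product is shorthand for DISTRIBUTE BY product SORT BY product
    SELECT product, amount FROM sales CLUSTER BY product;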
3. What is explode and lateral view?
The explode function returns each element of a collection data type (array or map) as a separate row.

CREATE TABLE Products (id INT, ProductName STRING, ProductColorOptions ARRAY<STRING>);

hive> select * from products;

1 Watches [“Red”,”Green”]

2 Clothes [“Blue”,”Green”]

3 Books [“Blue”,”Green”,”Red”]

hive> select explode(ProductColorOptions) from products;

Red

Green

But if we want to select other columns such as id along with the exploded column, that leads to an error.

In that case, we need to use a lateral view. A lateral view creates a virtual table for the exploded column and joins it with the base table.

We do not need to worry about the virtual table, as Hive handles it internally.

SELECT p.id,p.productname,colors.colorselection FROM default.products P LATERAL VIEW EXPLODE(p.productcoloroptions) colors as colorselection;

1 Watches Red

1 Watches Green

2 Clothes Blue

  1. Explain about partitioning in Hive and when do we use it?
  2. Explain about bucketing in Hive and when do we use it?
  3. How do you run Hive jobs in real time?
  4. How do you load a JSON or XML file into Hive?
  5. How do you load fixed width delimited files into Hive?
  6. What is multi insert in Hive?
  7. How do you perform updates and deletes in Hive?
  8. How do you perform incremental Loads in Hive?
  9. How do you implement SCD type 2 in Hive?
  10. What is percentile function in Hive?
  11. What is the Function to calculate median in Hive?
  12. What is the difference between an external and internal/managed table in Hive?
    If we create a table as an external table, dropping it removes only the metadata and the underlying data is preserved, whereas for an internal (managed) table the data is stored in the Hive warehouse default location and both the data and the metadata are dropped when the table is dropped.
  13. Can you explain to me about an Hive job you developed from start to end?
  14. Do you know what are the virtual columns in Hive?
  15. Do you know the concept of vectorization in Hive?
  16. How do we set the number of reducers while loading into Hive tables?
  17. Which version of Hive you are using?
  18. What does Rank function do in Hive?
  19. What is SerDe in Hive?
  20. What is SMB Join?

A. A Sort Merge Bucket (SMB) join is used when the tables are bucketed and sorted on the join key. To enable an SMB join we have to meet the criteria below:

  • All tables must be bucketed (and sorted) on the join key
  • The number of buckets in one table should be a multiple of the number of buckets in the other table
  • The bucketed/sorted column and the join column should be the same
  • Enable the properties shown in the sketch below
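A hedged HiveQL sketch of the properties and table layout typically used for an SMB join; the table names and bucket counts are illustrative:

    -- session properties commonly enabled for a sort-merge-bucket join
    set hive.enforce.bucketing=true;   -- needed on older Hive versions when loading the bucketed tables
    set hive.enforce.sorting=true;
    set hive.auto.convert.sortmerge.join=true;
    set hive.optimize.bucketmapjoin=true;
    set hive.optimize.bucketmapjoin.sortedmerge=true;
    -- both tables bucketed and sorted on the join key, bucket counts divisible:
    -- CREATE TABLE orders (id INT, amount DOUBLE) CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;
    -- CREATE TABLE customers (id INT, name STRING) CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;
    SELECT o.id, c.name FROM orders o JOIN customers c ON o.id = c.id;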
  1. Which one is faster in Hive, distinct or group by?
  2. How can you decide on the number of buckets in Hive?
  3. What is map Join?
  • When we use the map join hint or the auto map join property, before launching the original MapReduce job Hadoop serializes the small table into a hash table and places it in the distributed cache of the mappers, so it is readily available for the join (see the sketch below)
  • We can't use a right outer join or a full outer join as a map join
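A minimal HiveQL sketch of a map join; the fact and dim table names are hypothetical:

    -- let Hive convert joins with small tables into map joins automatically
    set hive.auto.convert.join=true;
    set hive.mapjoin.smalltable.filesize=25000000;   -- size threshold for the small table (bytes)
    -- or ask for it explicitly with a hint: dim is loaded into a hash table on every mapper
    SELECT /*+ MAPJOIN(d) */ f.id, d.name
    FROM fact f JOIN dim d ON f.dim_id = d.id;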
  1. How to do insert, update, and delete in Hive?
  • We have to enable the ACID properties in hive-site.xml or set them for the Hive session
  • We have to bucket the table and the file format should be ORC (see the sketch below)
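A hedged HiveQL sketch, assuming a Hive version with ACID support; the table and column names are illustrative:

    -- session/site properties needed for ACID tables
    set hive.support.concurrency=true;
    set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    set hive.enforce.bucketing=true;                 -- older Hive versions
    set hive.exec.dynamic.partition.mode=nonstrict;
    -- the table must be bucketed, stored as ORC, and marked transactional
    CREATE TABLE employees (id INT, name STRING, salary DOUBLE)
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');
    INSERT INTO employees VALUES (1, 'Ana', 100.0);
    UPDATE employees SET salary = 120.0 WHERE id = 1;
    DELETE FROM employees WHERE id = 1;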
  1. How to transpose/pivot data in Hive?
  1. What do the collect_set and collect_list functions do in Hive? Differences Link
  2. Explain about the important functions used in Hive?
  3. How to combine the results of two tables in Hive?
    Using union (or union all) in Hive
  4. Can we alter data type and rename a column in Hive?
    alter table table_name change old_col_name new_col_name new_col_type;
  5. How do you pass multiple parameters in Hive?
    https://stackoverflow.com/questions/46654725/how-to-pass-multiple-parameter-in-hive-script

spark

  1. What is the difference between MapReduce and spark?

Apache Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action, so Spark should outperform Hadoop MapReduce.
Spark provides APIs in multiple languages and is easier to write programs in.
Spark evaluates lazily which helps in better performance.
Spark provides caching and persistence.

  1. What is the difference between takeOrdered and sortByKey?
    Itversity Video

Prefer takeOrdered over sortByKey when we only need the first n elements: takeOrdered avoids a full sort and shuffle of the whole dataset, whereas sortByKey sorts everything before we take the top rows.
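A minimal PySpark sketch of the two approaches (sc is an existing SparkContext):

    rdd = sc.parallelize([5, 3, 9, 1, 7])
    # takeOrdered: returns the n smallest elements to the driver without fully sorting the RDD
    print(rdd.takeOrdered(3))                    # [1, 3, 5]
    print(rdd.takeOrdered(3, key=lambda x: -x))  # [9, 7, 5]
    # sortByKey: sorts the whole (key, value) RDD across partitions (full shuffle)
    pairs = rdd.map(lambda x: (x, None))
    print(pairs.sortByKey().keys().take(3))      # [1, 3, 5]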

  1. What does glom function do in Spark?

Return an RDD created by coalescing all elements within each partition into a list.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> sorted(rdd.glom().collect())
    [[1, 2], [3, 4]]

  1. What is the difference between running a job in Cluster mode vs Client mode?
    https://stackoverflow.com/questions/37027732/apache-spark-differences-between-client-and-cluster-deploy-modes
  2. What is dynamic resource allocation in Spark?

Dynamic resource allocation lets Spark add and remove executors at runtime based on the workload. We have to turn on the external shuffle service to enable dynamic resource allocation safely, so shuffle files remain available when executors are removed.
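A hedged PySpark sketch of the relevant settings; the executor counts are illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-allocation-demo")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.shuffle.service.enabled", "true")   # external shuffle service
             .config("spark.dynamicAllocation.minExecutors", "1")
             .config("spark.dynamicAllocation.maxExecutors", "20")
             .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
             .getOrCreate())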

  1. What is the difference between Transformation and Action and what are the types of transformations?

A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Every transformation creates a new RDD; the input RDDs cannot be changed, since RDDs are immutable in nature.
There are two types of transformations: narrow and wide.
Narrow transformation: all the elements required to compute the records in a single partition live in a single partition of the parent RDD.
Examples: map, filter, flatMap, union, sample
Wide transformation: the elements required to compute the records in a single partition may live in many partitions of the parent RDD, so a shuffle is needed.
Examples: reduceByKey, groupByKey, join, repartition (coalesce is usually narrow, since it merges partitions without a full shuffle)
Actions are Spark RDD operations that return non-RDD values. The results of an action are returned to the driver or written to an external storage system; an action is what sets the lazy chain of transformations in motion.
Examples: collect, countByValue, fold, reduce, foreach
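A minimal PySpark word-count sketch showing narrow and wide transformations and an action (sc is an existing SparkContext):

    rdd = sc.parallelize(["a b", "b c", "a c"], 2)
    # narrow transformations: each output partition depends on a single parent partition
    words = rdd.flatMap(lambda line: line.split())   # narrow
    pairs = words.map(lambda w: (w, 1))              # narrow
    # wide transformation: requires a shuffle across partitions
    counts = pairs.reduceByKey(lambda a, b: a + b)   # wide
    # action: triggers execution and returns a non-RDD value to the driver
    print(counts.collect())                          # e.g. [('a', 2), ('b', 2), ('c', 2)]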

  1. What is the difference between repartition and coalesce?

Suppose we have a large dataset and after filtering it we want to decrease the number of partitions: we use coalesce. If we want to either decrease or increase the number of partitions, we use repartition. coalesce doesn't do a full network shuffle, whereas repartition performs an extensive network shuffle and takes longer to run. coalesce can create unequal-sized partitions, whereas repartition produces roughly equal-sized partitions.
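A minimal sketch in PySpark (spark is an existing SparkSession):

    df = spark.range(0, 1000000, 1, 16)   # 16 initial partitions
    # coalesce: narrow, merges existing partitions, avoids a full shuffle (used to decrease)
    fewer = df.coalesce(4)
    # repartition: full shuffle, can increase or decrease, gives evenly sized partitions
    more = df.repartition(64)
    print(fewer.rdd.getNumPartitions(), more.rdd.getNumPartitions())   # 4 64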

  1. What is the difference between RDD, Dataframe, and Dataset?

RDD
Type-safe
The developer has to take care of optimizations
Not as good as Datasets in performance
Not memory efficient
Data frame
Not type-safe
Auto optimization using the Catalyst optimizer
Does not perform as well as Datasets
Not memory efficient

Serialization is not memory efficient for DataFrames and RDDs, and DataFrames do not give compile-time type checking for .map, .filter and other functional operations, so Spark came up with Datasets.

Datasets use off-heap memory (direct buffers) to store records and use encoders instead of standard Spark/Java serialization, which makes them more memory efficient than DataFrames and RDDs.

  1. When to select between Dataframe and Dataset?

We can think of a DataFrame as an immutable, distributed collection of generic Row objects. DataFrames are not type-safe: if you reference a column that doesn't exist in the DataFrame, you only see the error at runtime.

Datasets are strongly typed, which means you get an error at compile time if the selected column is not in the Dataset.
When to use a Dataset?
If you want to use functional programming constructs (typed map/filter with lambdas)
If you want Catalyst optimization and Tungsten's efficient code generation
When to use a DataFrame?
You use Python or R
You don't need type-safety and most of the operations will be expressed with Spark SQL

  1. How do you query information about the tables in Spark or Hive?

We should use the Spark catalog to query the tables.
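A minimal PySpark sketch using the catalog API (spark is an existing SparkSession, ideally with Hive support enabled):

    # list databases and tables registered in the metastore / session catalog
    for db in spark.catalog.listDatabases():
        print(db.name)
    for t in spark.catalog.listTables("default"):
        print(t.name, t.tableType)
    # equivalent SQL
    spark.sql("SHOW TABLES IN default").show()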

  1. What is Catalyst Optimizer?

The Catalyst optimizer is the core of the Spark SQL module.
It is written in Scala and uses advanced Scala features such as pattern matching and quasi-quotes.
The motivation is the same as in a traditional database: in an Oracle DB, if we have a table with millions of rows and we only need a small subset, we want the filter or aggregation applied on the database side instead of on the job-execution side, which saves disk/network I/O.
The Catalyst library is based on a TreeNode data structure; the Catalyst framework converts the Spark code into a tree known as an abstract syntax tree.
Rules: four phases in query execution

  1. Analysis (rule-based optimization): uses the catalog to resolve references, map attributes to their data types and validate them; a unique ID is assigned to each attribute and used for accessing it
  2. Logical optimization (rule-based optimization): constant folding (a compilation technique that evaluates constant expressions once instead of once per row), predicate pushdown (a predicate is a where clause in a SQL query or a filter in the DataFrame API; it is passed directly to the data source), and projection pruning (select only the required columns)
  3. Physical planning (cost-based optimization): takes the optimized logical plan, generates several candidate physical plans, and decides which plan is the best; the DataFrame is finally converted into RDD operations, so all the actual RDD operators are defined here
  4. Code generation: converts parts of the query into Java bytecode, using quasi-quotes to generate the code efficiently; the tree is an immutable Scala object manipulated with pattern matching
  5. What is the difference between stage and task?

Once we have an execution plan, Spark divides the job into stages. Stages are formed from chunks of processing that can run in parallel without shuffling, and each stage is split into tasks that are distributed to individual nodes in the cluster.
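To see the plans Catalyst produces, and the shuffle (exchange) boundaries that become stage boundaries at runtime, the plans can be printed; a minimal PySpark sketch with an illustrative DataFrame:

    df = spark.range(0, 1000).withColumnRenamed("id", "key")
    agg = df.filter("key > 10").groupBy("key").count()
    # prints the parsed, analyzed and optimized logical plans plus the physical plan
    agg.explain(True)
    # each Exchange in the physical plan corresponds to a shuffle, i.e. a stage boundary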

  1. What is DAG?
    In Spark, a DAG (Directed Acyclic Graph) of consecutive computation stages is formed. In this way, we optimize the execution plan, e.g. to minimize shuffling data around.

  2. What is RDD lineage?

  3. What is accumulator?
    An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on a cluster can then add to it using the add method or the += operator; however, they cannot read its value. Only the driver program can read the accumulator's value, using its value method. If accumulators are updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action.
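A minimal PySpark sketch that counts malformed records with an accumulator (sc is an existing SparkContext):

    bad_records = sc.accumulator(0)

    def parse(line):
        try:
            return [int(line)]
        except ValueError:
            bad_records.add(1)   # tasks may only add; they cannot read the value
            return []

    data = sc.parallelize(["1", "2", "oops", "4"])
    parsed = data.flatMap(parse)
    parsed.count()               # an action forces the accumulator update
    print(bad_records.value)     # only the driver can read it -> 1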

  4. What is meant by a broadcast variable?
    Explicitly creating broadcast variables is only beneficial when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
    A broadcast variable (such as the broadcast dictionary in the sketch below) is sent to each node only once; the value can be accessed by calling .value on the broadcast variable.
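A minimal PySpark sketch, assuming a small hypothetical lookup dictionary:

    country_codes = {"US": "United States", "IN": "India", "DE": "Germany"}
    broadcast_dict = sc.broadcast(country_codes)   # shipped to every executor once

    rdd = sc.parallelize(["US", "DE", "US"])
    # executors read the shared copy via .value instead of shipping the dict with every task
    names = rdd.map(lambda code: broadcast_dict.value.get(code, "unknown"))
    print(names.collect())   # ['United States', 'Germany', 'United States']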

  5. Can you explain about the Spark Job execution?

  6. What is the difference between Spark session and Spark Context?
    In Spark 1.x we needed to create separate contexts (SQLContext, HiveContext) in addition to the SparkContext, which was the single entry point for the driver to connect to the cluster. In Spark 2.x, SparkSession is the unified entry point: it wraps the SparkContext, comes with Hive support, and multiple users can work through their own sessions on the same SparkContext, each with their own session-scoped configuration and local temp views.

  7. What is the difference between groupByKey and reduceByKey?
    In reduceByKey there is an internal combiner that performs partial aggregation within each partition before the shuffle, whereas groupByKey shuffles every (key, value) pair across the network and only then performs the aggregation, causing far more data movement.
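A small PySpark comparison (sc is an existing SparkContext):

    pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("a", 1)], 2)
    # reduceByKey: combines values per key inside each partition first (map-side combine),
    # so only partial sums are shuffled
    print(pairs.reduceByKey(lambda a, b: a + b).collect())   # [('a', 3), ('b', 1)] (order may vary)
    # groupByKey: shuffles every (key, value) pair, then aggregates
    print(pairs.groupByKey().mapValues(sum).collect())       # same result, much more shuffle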

  8. What is the difference between Map and FlatMap?
    map: returns a new RDD by applying a function to each element of the RDD; the function in map can return only one item per input element.
    flatMap: similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened; the function in flatMap can return a list of elements (0 or more).
    Map vs FlatMap
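A short PySpark illustration (sc is an existing SparkContext):

    lines = sc.parallelize(["hello world", "spark"])
    # map: one output element per input element (here, a list per line)
    print(lines.map(lambda l: l.split(" ")).collect())       # [['hello', 'world'], ['spark']]
    # flatMap: 0..n output elements per input element, flattened into a single RDD
    print(lines.flatMap(lambda l: l.split(" ")).collect())   # ['hello', 'world', 'spark']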

  9. What is the difference between Impala and Hive?

  10. How do you handle schema mismatches in Spark?
    Use the DROPMALFORMED read mode and follow the log. If malformed records are present they are dropped and dumped to the log, up to the limit set by the maxMalformedLogPerPartition option.

    spark.read.format("csv")
      .schema(schema)
      .option("header", false)
      .option("mode", "DROPMALFORMED")
      .option("maxMalformedLogPerPartition", 128)
      .load(inputCsvPath)
23. What is the difference between broadcast and cache in Spark?

hdfs

what is HDFS? The Hadoop Distributed File System is used for storing and retrieving big data distributed across several nodes within a Hadoop cluster
What is Namenode and data node?
The NameNode is the master node which maintains the metadata of the data blocks stored in the DataNodes; the DataNodes are where the actual data is stored in the form of blocks.
what is Hadoop?
Hadoop is a framework that allows you to store Big Data in a distributed environment so that you can process it in parallel. There are basically two components in Hadoop: HDFS and YARN. Hadoop has a master-slave architecture.
what are the key advantages of Hadoop?
Hadoop helps to create fault-tolerant, parallel-processing, distributed systems.
fault-tolerance
Fault tolerance in HDFS refers to the working strength of a system in unfavorable conditions and how the system handles such situations. HDFS is highly fault tolerant: it handles faults through replica creation. Replicas of the user's data are created on different machines in the HDFS cluster, so whenever any machine in the cluster goes down, the data can be accessed from another machine that holds the same copy of the data.
HDFS also helps with horizontal scaling, and we can access data faster due to data locality.
YARN
YARN contains the ResourceManager and the NodeManagers.
The ResourceManager is again a master node: it receives the processing requests and passes the parts of each request to the corresponding NodeManagers, where the actual processing takes place. A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.
what happens when you execute a command in HDFS?
How do you read parquet and avro file formats?
parquet-tools for Parquet files (and avro-tools for Avro files)

sqoop

how do you load a table from RDBMS to hdfs/hive if there is no primary key?
use --split-by and specify the column to split on (see the sketch below)
how do you do incremental imports in sqoop?
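A hedged sqoop sketch covering both questions above; the connection string, table, and column names are hypothetical:

    # import a table that has no primary key: give sqoop a column to split on
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 4 \
      --target-dir /data/orders

    # incremental import: only rows with order_id greater than the last imported value
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --incremental append \
      --check-column order_id \
      --last-value 1000 \
      --target-dir /data/orders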
how do you set the --num-mappers or number of mappers in sqoop?

  1. Database type.
  2. Hardware that is used for your database server, and the impact on other requests that your database needs to serve.

how many reduce operations are there in sqoop?
    zero; sqoop is map-only
  3. Good source for sqoop interview questions?
    sqoop questions