Hive Interview Questions - ignacio-alorre/Hive GitHub Wiki
Answer
No. Hive is not suitable for OLTP systems, since it does not offer insert and update at the row level.
Big Data systems such as Hive are built for OLAP workloads.
OLTP workloads are typically handled by an RDBMS, like Oracle, where you can query and update a row in the same operation.
Note: OLTP => Online Transaction Processing. The term transaction denotes an atomic change of state. OLTP systems follow ACID principles:
In database systems, ACID (Atomicity, Consistency, Isolation, Durability) refers to a standard set of properties that guarantee database transactions are processed reliably. ACID is especially concerned with how a database recovers from any failure that might occur while processing a transaction.
Answer
In an HDFS directory. By default, Hive table data is stored under the warehouse directory. This location can be changed through the ```hive.metastore.warehouse.dir``` configuration parameter in hive-site.xml
Answer
Basically, to store the metadata information in Hive we use the Metastore.
Answer
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. Basically, we use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
Answer
Managed Table: If you drop a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.
External Table: Hive just deletes the metadata information regarding the table, but the table data stored in HDFS remains untouched.
Answer
We should use SORT BY when we have to sort huge datasets, because the SORT BY clause sorts the data using multiple reducers. ORDER BY sorts all of the data together using a single reducer, so ORDER BY takes a long time to execute on a large number of inputs.
Answer
It is a way of dividing a table into related parts based on the values of the selected **partition columns**, such as date, city, or department. Using a partition, it is easy to query a portion of the data.
Tables or partitions can be further sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time. The following example shows how to partition a file and its data:
/tab1/employeedata/filecombined
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
The above data is partitioned into two files using year:
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
Summary:
- In non-partitioned tables all queries have to scan all files in the directory.
- Partitions are actually horizontal slices of data that allow larger sets of data to be separated into more manageable chunks.
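The partition layout above can be declared as follows; a minimal sketch using the table and column names from the example (they are illustrative, not a real schema):

```sql
-- yoj becomes a partition column: it moves out of the regular column
-- list, and each distinct value gets its own HDFS subdirectory.
CREATE TABLE tab1 (id INT, name STRING, dept STRING)
PARTITIONED BY (yoj STRING);
```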
Dynamic and Static Partitioning in Hive
Static Partitioning in Hive
In the static partitioning mode, you can insert or input the data files individually into a partition table. You can create new partitions as needed, and define the new partitions using the ADD PARTITION clause.
While loading data, you need to specify which partition to store the data in, meaning that with each load you need to specify the partition column value. Optionally, you can also specify the directory location.
ALTER TABLE accounts
ADD PARTITION (accounts_date='2016-30-02')
LOCATION '/accounts/y2016';
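Loading data into a specific static partition can then look like this (the input path is illustrative):

```sql
LOAD DATA INPATH '/tmp/accounts_2016.csv'
INTO TABLE accounts
PARTITION (accounts_date='2016-30-02');
```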
Dynamic Partitioning in Hive
With dynamic partitioning in Hive, partitions get created automatically at load time. New partitions can be created dynamically from existing data.
Partitions are automatically created based on the value of the last column of the SELECT. If the partition does not already exist, it will be created. If the partition does exist, it will be overwritten by the OVERWRITE keyword, as shown in the example below.
INSERT OVERWRITE TABLE accounts_by_state
PARTITION (state)
SELECT cust_id, fname, lname, address, city
FROM accounts;
As you can see in the example, a partition is being overwritten. Dynamic partitioning is suitable when you have a large amount of data stored in a table.
Note: By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. Enable the following settings to use dynamic partitioning:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Partition Commands
1- SHOW the partitions of a partitioned table:
SHOW PARTITIONS accounts;
2- DROP a partition of a partitioned table:
ALTER TABLE accounts
DROP IF EXISTS PARTITION (accounts_date='2016-30-02');
3- RENAME a partition of a partitioned table:
ALTER TABLE employee PARTITION (year='1203')
RENAME TO PARTITION (yoj='1203');
When to use Partitioning?
- Reading the entire data set takes too long
- Queries almost always filter on the partition columns
- There are a reasonable number of different values for partition columns
When should you avoid partitioning?
- Avoid partitioning on columns that have too many unique values
- Be cautious while creating a dynamic partition as it can lead to a high number of partitions
- Try to limit partition to less than 20K
Answer
It is recommended to segregate Hive table data into multiple files only when there is a limited number of partitions. However, there may be instances where partitioning the tables results in a large number of partitions. This is where the concept of bucketing comes in. Bucketing is an optimization technique similar to partitioning. You can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

What do buckets do?
They distribute the data into a user-defined number of buckets by calculating the hash code of the key mentioned in the query.
CREATE TABLE page_views(user_id INT, session_id BIGINT, url STRING)
PARTITIONED BY (day INT)
CLUSTERED BY (user_id) INTO 100 BUCKETS;
As per the syntax, the data would be distributed into 100 buckets depending on the hash number of user_id.
The processor will first calculate the hash number of the user_id in the query and will look only in that bucket.
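One payoff of bucketing is cheap sampling; a sketch using the page_views table defined above:

```sql
-- Scan only the first of the 100 buckets, i.e. roughly 1% of the rows.
SELECT user_id, url
FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 100 ON user_id);
```

In older Hive versions, ```SET hive.enforce.bucketing = true;``` had to be set before inserting, so that the data was actually clustered into the declared number of buckets.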
Answer
Hive has the following categories of built-in functions:
- Mathematical: For mathematical operations, you can use round, floor, and so on.
- Collection: For collections, you can use size, map_keys, and so on.
- Type conversion: For data type conversions, you can use cast.
- Date: For dates, you can use year, datediff, and so on.
- Conditional: For conditional logic, use if, case, and coalesce.
- String: For strings, use length, reverse, and so on.
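A few of these categories combined in one query (the literal values are illustrative):

```sql
SELECT round(3.7),                            -- Mathematical: 4
       cast('42' AS INT),                     -- Type conversion: 42
       datediff('2023-02-10', '2023-02-01'),  -- Date: 9
       coalesce(NULL, 'fallback'),            -- Conditional: 'fallback'
       length('hive');                        -- String: 4
```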
HiveQL clauses
- WHERE
- GROUP BY
- SORT BY
- JOIN
Answer
``` set hive.mapred.mode = strict ```
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause. This prevents very large jobs from running for a long time.
Answer
This can be done with the following query: ```SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')```
12- Write a query to insert a new column(new_col INT) into a hive table(htab) at a position before an existing column(x_col)
Answer
Hive's ```ALTER TABLE ... CHANGE COLUMN``` only supports the positions FIRST and AFTER col_name, so the column is added first and then moved after the column that precedes x_col (or with FIRST, if x_col is the first column):
```SQL ALTER TABLE htab ADD COLUMNS (new_col INT); ALTER TABLE htab CHANGE COLUMN new_col new_col INT AFTER <column_preceding_x_col>; ```
Note that this only changes the column order in the metadata, not the layout of the existing data files.
Answer
No. It only reduces the number of files, which becomes easier for the NameNode to manage.
Answer
**Local Metastore**
The metastore service runs in the same JVM in which the Hive service is running, and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
**Remote Metastore**
In this configuration, the metastore service runs in its own separate JVM, not in the Hive service JVM.
Answer
Yes, by using the clause ```LOCATION hdfs_path``` we can change the default location of a managed table.
Answer
In a Hive table, partitioning provides granularity. Hence, by scanning only the relevant partitions instead of the whole dataset, it reduces the query latency.
Answer
In dynamic partitioning, the values of the partition columns are known only at runtime, in other words during loading of the data into a Hive table. Usage:
- While loading data from an existing non-partitioned table, in order to improve sampling and thus decrease the query latency.
- When we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
Answer
There are two main reasons for performing bucketing on a partition:
- A map-side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time and makes the sampling process more efficient.
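For the map-side join case, Hive has to be told to exploit the bucketing; a sketch with the classic settings (availability and defaults vary by Hive version):

```sql
SET hive.optimize.bucketmapjoin = true;
-- If both tables are also sorted on the join key, a sort-merge
-- bucket (SMB) join can be enabled on top of it:
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```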
Answer
Hive determines the bucket number for a row using the formula: ```hash_function (bucketing_column's value) modulo (num_of_buckets)```. Basically, hash_function depends on the column data type. For an integer data type, the hash function is: ```hash_function (int_type_column) = value of int_type_column```
Answer
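As a worked example: for an INT column the hash is the value itself, so with 100 buckets a row with user_id = 205 lands in bucket 205 mod 100 = 5. The same arithmetic can be checked with Hive's pmod function:

```sql
SELECT pmod(205, 100);  -- returns 5
```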
Basically, to configure the metastore in Hive, the hive-site.xml file has to be configured with the ```hive.metastore.uris``` property, using the IP address (or hostname) and port of the metastore host: ```thrift://node1 (or IP Address):9083```
Answer
If the data is already present in HDFS, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/. Using the keyword EXTERNAL, which creates the table definition in the Hive metastore without moving the data, the user just has to define the table:
CREATE EXTERNAL TABLE table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Answer
There are several components in the Hive architecture:
- User Interface: Basically, it calls the execute interface of the driver. Further, the driver creates a session handle for the query, then sends the query to the compiler to generate an execution plan for it.
- Metastore: It sends the metadata to the compiler, on receiving the sendMetaData request, for the execution of the query.
- Compiler: It generates the execution plan. Specifically, that is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine: Basically, the execution engine submits each of these stages to the relevant components, managing the dependencies between them.
Answer
**Features of Hive**
- The best feature is that it offers data summarization, query, and analysis in a much easier manner.
- Hive supports external tables, which make it possible to process data without actually storing it in HDFS.
- Moreover, it fits the low-level interface requirement of Hadoop perfectly.
Limitations of Hive
- We cannot perform real-time queries with Hive.
- Hive does not offer acceptable latency for interactive data browsing.
- Hive is not the right choice for online transaction processing (it does not offer row-level updates).
Answer
SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO. Hence, it handles both serialization and deserialization in Hive, and interprets the results of serialization as individual fields for processing.
Answer
It prevents queries on partitioned tables from running without a WHERE clause. Hence, it prevents very large jobs from running for a long time.
Answer
```SQL SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value') ```
Some Hive performance optimization techniques are:
- Tez Execution Engine in Hive
- Usage of Suitable File Format in Hive
- Hive Partitioning
- Bucketing in Hive
- Vectorization In Hive
- Cost-Based Optimization in Hive (CBO)
- Hive Indexing
29- Can we change the data type of a column in a Hive table?
Answer
By using the REPLACE COLUMNS option we can change the data type of a column in a Hive table:
ALTER TABLE table_name REPLACE COLUMNS ……
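A minimal sketch, assuming a hypothetical table htab with columns id and name, where id should become BIGINT. Note that REPLACE COLUMNS rewrites the entire column list, so every column must be restated:

```sql
-- Redeclare all columns, changing id from INT to BIGINT.
ALTER TABLE htab REPLACE COLUMNS (id BIGINT, name STRING);
```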
30- How to add a partition to an existing table that was not created as partitioned?
Answer
Basically, we cannot add/create a partition in an existing table that was not partitioned at creation time. However, if the table was created with a PARTITIONED BY clause, then by using the ALTER TABLE command you are allowed to add partitions.
So, here are the create and alter commands:
CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);
ALTER TABLE tab02 ADD PARTITION (mon='10') location '/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv5.txt';
Answer
Basically, there are 3 ways in which Hive organizes data:
- Tables
- Partitions
- Buckets
Answer
Clustering in Hive decomposes table data sets into more manageable parts. To be more specific, the table is divided into a number of partitions, and these partitions can be further subdivided into more manageable parts known as buckets/clusters. In addition, the "clustered by" clause is used to divide the table into buckets.
Answer
It is also known as an internal table. When we create a table in Hive, it manages the data by default. This means that Hive moves the data into its warehouse directory. Usage:
- We want Hive to completely manage the lifecycle of the data and table.
- Data is temporary.
Hive External Tables
We can also create an external table. It tells Hive to refer to data that is at an existing location outside the warehouse directory. Usage:
- Data is used outside of Hive. For example, the data files are read and processed by an existing program that does not lock the files.
- We are not creating a table based on an existing table.
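The difference shows up on DROP; a sketch with illustrative table names and location:

```sql
-- Managed: Hive owns the data under its warehouse directory.
CREATE TABLE logs_managed (line STRING);

-- External: Hive only records metadata; the files stay where they are.
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/data/raw/logs';

DROP TABLE logs_managed;   -- metadata AND data files are deleted
DROP TABLE logs_external;  -- only metadata is deleted; files remain
```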
Answer
So, the difference is:
Sort by
SELECT E.EMP_ID FROM Employee E SORT BY E.EMP_ID;
- It may use multiple reducers for the final output.
- Ordering of rows is only guaranteed within each reducer.
- It gives a partially ordered result.
Order by
SELECT E.EMP_ID FROM Employee E ORDER BY E.EMP_ID;
- It uses a single reducer to guarantee total order in the output.
- LIMIT can be used to minimize the sort time.
Answer
The driver manages the life cycle of HiveQL queries. It receives queries from the UI and the JDBC/ODBC interfaces to process them. Also, it creates a separate session handle for each query.
Answer
**Query Process**
- Hive: A Hive query has the problem of a "cold start", since a new job must be started for every query.
- Impala: Impala uses daemons which are started at boot time and are always ready for executing queries.
Intermediate Results
- Hive: Hive materializes all intermediate results, enabling better scalability and fault tolerance. However, that has the adverse effect of slowing down data processing.
- Impala: Impala streams intermediate results between executors, although that trades off some scalability.
During the Runtime
- Hive: At compile time, Hive generates query expressions.
- Impala: At runtime, Impala generates code for "big loops".
Answer
- Collection Functions
- Hive Date Functions
- Mathematical Functions
- Conditional Functions
- Hive String Functions
Answer
There are several types of Hive DDL commands we commonly use, such as:
- Create Database Statement
- Hive Show Database
- Drop database
- Creating Hive Tables
- Browse the table
- Altering and Dropping Tables
- Hive Select Data from Table
- Hive Load Data