Hive Interview Questions - ignacio-alorre/Hive GitHub Wiki
Answer
No. Hive is not suitable for OLTP systems, since it does not offer insert and update at the row level.
Big Data systems such as Hive are built for OLAP workloads.
OLTP workloads are typically handled by an RDBMS, like Oracle, where you can query and update a row in the same operation.
Note: OLTP => Online Transaction Processing. The term transaction denotes an atomic change of state. OLTP systems follow ACID principles:
In database systems, ACID (Atomicity, Consistency, Isolation, Durability) refers to a standard set of properties that guarantee database transactions are processed reliably. ACID is especially concerned with how a database recovers from any failure that might occur while processing a transaction.
Answer
In an HDFS directory. By default, Hive table data is stored under the warehouse directory. This location can be changed through the ```hive.metastore.warehouse.dir``` configuration parameter in hive-site.xml
Answer
Basically, to store the metadata information in Hive we use the Metastore.
Answer
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. Basically, we use an RDBMS to achieve low latency, because HDFS read/write operations are time-consuming.
Answer
Managed Table: If you drop a managed table, the metadata information along with the table data is deleted from the Hive warehouse directory.
External Table: Hive just deletes the metadata information regarding the table, but the table data stored in HDFS remains untouched.
Answer
We should use SORT BY when we have to sort huge datasets, because the SORT BY clause sorts the data using multiple reducers. ORDER BY sorts all of the data together using a single reducer, so ORDER BY takes a long time to execute on a large number of inputs.
Answer
It is a way of dividing a table into related parts based on the values of the selected **partition columns**, such as date, city, or department. Using a partition, it is easy to query a portion of the data.
Tables or partitions can be further sub-divided into buckets, to provide extra structure to the data that may be used for more efficient querying. Bucketing works based on the value of a hash function of some column of the table.
For example, a table named Tab1 contains employee data such as id, name, dept, and yoj (i.e., year of joining). Suppose you need to retrieve the details of all employees who joined in 2012. A query searches the whole table for the required information. However, if you partition the employee data with the year and store it in a separate file, it reduces the query processing time. The following example shows how to partition a file and its data:
/tab1/employeedata/filecombined
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
The above data is partitioned into two files using year:
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
Summary:
- In non-partitioned tables all queries have to scan all files in the directory.
- Partitions are actually horizontal slices of data that allow larger sets of data to be separated into more manageable chunks.
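The partition layout above can be declared as follows; a minimal sketch using the table and column names from the example (they are illustrative, not a real schema):

```sql
-- yoj becomes a partition column: it moves out of the regular column
-- list, and each distinct value gets its own HDFS subdirectory.
CREATE TABLE tab1 (id INT, name STRING, dept STRING)
PARTITIONED BY (yoj STRING);
```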
Dynamic and Static Partitioning in Hive
Static Partitioning in Hive
In the static partitioning mode, you can insert or input the data files individually into a partition table. You can create new partitions as needed, and define the new partitions using the ADD PARTITION clause.
While loading data, you need to specify which partition to store the data in, meaning that with each load you need to specify the partition column value. Optionally, you can also specify the directory location.
ALTER TABLE accounts
ADD PARTITION (accounts_date='2016-30-02')
LOCATION '/accounts/y2016';
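Loading data into a specific static partition can then look like this (the input path is illustrative):

```sql
LOAD DATA INPATH '/tmp/accounts_2016.csv'
INTO TABLE accounts
PARTITION (accounts_date='2016-30-02');
```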
Dynamic Partitioning in Hive
With dynamic partitioning in Hive, partitions get created automatically at load time. New partitions can be created dynamically from existing data.
Partitions are automatically created based on the value of the last column of the SELECT. If the partition does not already exist, it will be created. If the partition does exist, it will be overwritten by the OVERWRITE keyword, as shown in the example below.
INSERT OVERWRITE TABLE accounts_by_state
PARTITION (state)
SELECT cust_id, fname, lname, address, city
FROM accounts;
As you can see in the example, a partition is being overwritten. Dynamic partitioning is suitable when you have a large amount of data stored in a table.
Note: By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. Enable the following settings to use dynamic partitioning:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
Partition Commands
1- SHOW the partitions of a partitioned table:
SHOW PARTITIONS accounts;
2- DROP a partition of a partitioned table:
ALTER TABLE accounts
DROP IF EXISTS PARTITION (accounts_date='2016-30-02');
3- RENAME a partition of a partitioned table:
ALTER TABLE employee PARTITION (year='1203')
RENAME TO PARTITION (yoj='1203');
When to use Partitioning?
- Reading the entire data set takes too long
- Queries almost always filter on the partition columns
- There are a reasonable number of different values for partition columns
When should you avoid partitioning?
- Avoid partitioning on columns that have too many unique values
- Be cautious while creating a dynamic partition as it can lead to a high number of partitions
- Try to limit partition to less than 20K
Answer
It is recommended to segregate Hive table data into multiple files only when there is a limited number of partitions. However, there may be instances where partitioning the tables results in a large number of partitions. This is where the concept of bucketing comes in. Bucketing is an optimization technique similar to partitioning. You can use bucketing if you need to run queries on columns that have huge data, which makes it difficult to create partitions.

What do buckets do?
They distribute the data into a user-defined number of buckets by calculating the hash code of the key mentioned in the query.
CREATE TABLE page_views(user_id INT, session_id BIGINT, url STRING)
PARTITIONED BY (day INT)
CLUSTERED BY (user_id) INTO 100 BUCKETS;
As per the syntax, the data would be distributed into 100 buckets depending on the hash number of user_id.
The processor will first calculate the hash number of the user_id in the query and will look only in that bucket.
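One payoff of bucketing is cheap sampling; a sketch using the page_views table defined above:

```sql
-- Scan only the first of the 100 buckets, i.e. roughly 1% of the rows.
SELECT user_id, url
FROM page_views TABLESAMPLE (BUCKET 1 OUT OF 100 ON user_id);
```

In older Hive versions, ```SET hive.enforce.bucketing = true;``` had to be set before inserting, so that the data was actually clustered into the declared number of buckets.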
Answer
Hive has the following categories of built-in functions:
- Mathematical: For mathematical operations, you can use round, floor, and so on.
- Collection: For collections, you can use size, map_keys, and so on.
- Type conversion: For data type conversions, you can use cast.
- Date: For dates, you can use year, datediff, and so on.
- Conditional: For conditional logic, use if, case, and coalesce.
- String: For strings, use length, reverse, and so on.
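A few of these categories combined in one query (the literal values are illustrative):

```sql
SELECT round(3.7),                            -- Mathematical: 4
       cast('42' AS INT),                     -- Type conversion: 42
       datediff('2023-02-10', '2023-02-01'),  -- Date: 9
       coalesce(NULL, 'fallback'),            -- Conditional: 'fallback'
       length('hive');                        -- String: 4
```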
HiveQL clauses
- WHERE
- GROUP BY
- SORT BY
- JOIN
Answer
``` set hive.mapred.mode = strict ```
It sets the MapReduce jobs to strict mode, in which queries on partitioned tables cannot run without a WHERE clause. This prevents very large jobs from running for a long time.
Answer
This can be done with the following query: ```SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value')```
12- Write a query to insert a new column(new_col INT) into a hive table(htab) at a position before an existing column(x_col)
Answer
Hive's ```ALTER TABLE ... CHANGE COLUMN``` only supports the positions FIRST and AFTER col_name, so the column is added first and then moved after the column that precedes x_col (or with FIRST, if x_col is the first column):
```SQL ALTER TABLE htab ADD COLUMNS (new_col INT); ALTER TABLE htab CHANGE COLUMN new_col new_col INT AFTER <column_preceding_x_col>; ```
Note that this only changes the column order in the metadata, not the layout of the existing data files.
Answer
No. It only reduces the number of files, which becomes easier for the NameNode to manage.
Answer
**Local Metastore**
The metastore service runs in the same JVM in which the Hive service is running, and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
**Remote Metastore**
In this configuration, the metastore service runs in its own separate JVM, not in the Hive service JVM.
Answer
Yes, by using the clause ```LOCATION hdfs_path``` we can change the default location of a managed table.
Answer
In a Hive table, partitioning provides granularity. Hence, by scanning only the relevant partitions instead of the whole dataset, it reduces the query latency.
Answer
In dynamic partitioning, the values of the partition columns are known only at runtime, in other words during loading of the data into a Hive table. Usage:
- While loading data from an existing non-partitioned table, in order to improve sampling and thus decrease the query latency.
- When we do not know all the values of the partitions beforehand, since finding these partition values manually in a huge dataset is a tedious task.
Answer
There are two main reasons for performing bucketing on a partition:
- A map-side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time and makes the sampling process more efficient.
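For the map-side join case, Hive has to be told to exploit the bucketing; a sketch with the classic settings (availability and defaults vary by Hive version):

```sql
SET hive.optimize.bucketmapjoin = true;
-- If both tables are also sorted on the join key, a sort-merge
-- bucket (SMB) join can be enabled on top of it:
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```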
Answer
Hive determines the bucket number for a row using the formula: ```hash_function (bucketing_column's value) modulo (num_of_buckets)```. Basically, hash_function depends on the column data type. For an integer data type, the hash function is: ```hash_function (int_type_column) = value of int_type_column```
Answer
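As a worked example: for an INT column the hash is the value itself, so with 100 buckets a row with user_id = 205 lands in bucket 205 mod 100 = 5. The same arithmetic can be checked with Hive's pmod function:

```sql
SELECT pmod(205, 100);  -- returns 5
```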
Basically, to configure the metastore in Hive, the hive-site.xml file has to be configured with the ```hive.metastore.uris``` property, using the IP address (or hostname) and port of the metastore host: ```thrift://node1 (or IP Address):9083```
Answer
If the data is already present in HDFS, the user need not use LOAD DATA, which moves the files to /user/hive/warehouse/. Using the keyword EXTERNAL, which creates the table definition in the Hive metastore without moving the data, the user just has to define the table:
CREATE EXTERNAL TABLE table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Answer
There are several components in the Hive architecture:
- User Interface: Basically, it calls the execute interface of the driver. Further, the driver creates a session handle for the query, then sends the query to the compiler to generate an execution plan for it.
- Metastore: It sends the metadata to the compiler, on receiving the sendMetaData request, for the execution of the query.
- Compiler: It generates the execution plan. Specifically, that is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine: Basically, the execution engine submits each of these stages to the relevant components, managing the dependencies between them.
Answer
**Features of Hive**
- The best feature is that it offers data summarization, query, and analysis in a much easier manner.
- Hive supports external tables, which make it possible to process data without actually storing it in HDFS.
- Moreover, it fits the low-level interface requirement of Hadoop perfectly.
Limitations of Hive
- We cannot perform real-time queries with Hive.
- Hive does not offer acceptable latency for interactive data browsing.
- Hive is not the right choice for online transaction processing (it does not offer row-level updates).
Answer
SerDe is an acronym for Serializer/Deserializer. Hive uses the SerDe interface for IO. Hence, it handles both serialization and deserialization in Hive, and interprets the results of serialization as individual fields for processing.
Answer
It prevents queries on partitioned tables from running without a WHERE clause. Hence, it prevents very large jobs from running for a long time.
Answer
```SQL SHOW PARTITIONS table_name PARTITION(partitioned_column='partition_value') ```
Some Hive performance optimization techniques are:
- Tez Execution Engine in Hive
- Usage of Suitable File Format in Hive
- Hive Partitioning
- Bucketing in Hive
- Vectorization In Hive
- Cost-Based Optimization in Hive (CBO)
- Hive Indexing
29- Can we change the data type of a column in a Hive table?
Answer
By using the REPLACE COLUMNS option we can change the data type of a column in a Hive table:
ALTER TABLE table_name REPLACE COLUMNS ……
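A minimal sketch, assuming a hypothetical table htab with columns id and name, where id should become BIGINT. Note that REPLACE COLUMNS rewrites the entire column list, so every column must be restated:

```sql
-- Redeclare all columns, changing id from INT to BIGINT.
ALTER TABLE htab REPLACE COLUMNS (id BIGINT, name STRING);
```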
30- How to add a partition to an existing table that was not created as partitioned?
Answer
Basically, we cannot add/create a partition in an existing table that was not partitioned at creation time. However, if the table was created with a PARTITIONED BY clause, then by using the ALTER TABLE command you are allowed to add partitions.
So, here are the create and alter commands:
CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);
ALTER TABLE tab02 ADD PARTITION (mon='10') location '/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv5.txt';
Answer
Basically, there are 3 ways in which Hive organizes data:
- Tables
- Partitions
- Buckets
Answer
Clustering in Hive decomposes table data sets into more manageable parts. To be more specific, the table is divided into a number of partitions, and these partitions can be further subdivided into more manageable parts known as buckets/clusters. In addition, the "clustered by" clause is used to divide the table into buckets.
Answer
It is also known as an internal table. When we create a table in Hive, it manages the data by default. This means that Hive moves the data into its warehouse directory. Usage:
- We want Hive to completely manage the lifecycle of the data and table.
- Data is temporary.
Hive External Tables
We can also create an external table. It tells Hive to refer to data that is at an existing location outside the warehouse directory. Usage:
- Data is used outside of Hive. For example, the data files are read and processed by an existing program that does not lock the files.
- We are not creating a table based on an existing table.
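The difference shows up on DROP; a sketch with illustrative table names and location:

```sql
-- Managed: Hive owns the data under its warehouse directory.
CREATE TABLE logs_managed (line STRING);

-- External: Hive only records metadata; the files stay where they are.
CREATE EXTERNAL TABLE logs_external (line STRING)
LOCATION '/data/raw/logs';

DROP TABLE logs_managed;   -- metadata AND data files are deleted
DROP TABLE logs_external;  -- only metadata is deleted; files remain
```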
Answer
So, the difference is:
Sort by
SELECT E.EMP_ID FROM Employee E SORT BY E.EMP_ID;
- It may use multiple reducers for the final output.
- Ordering of rows is only guaranteed within each reducer.
- It gives a partially ordered result.
Order by
SELECT E.EMP_ID FROM Employee E ORDER BY E.EMP_ID;
- It uses a single reducer to guarantee total order in the output.
- LIMIT can be used to minimize the sort time.
Answer
The driver manages the life cycle of HiveQL queries. It receives queries from the UI and the JDBC/ODBC interfaces to process them. Also, it creates a separate session handle for each query.
Answer
**Query Process**
- Hive: A Hive query has the problem of a "cold start", since a new job must be started for every query.
- Impala: Impala uses daemons which are started at boot time and are always ready for executing queries.
Intermediate Results
- Hive: Hive materializes all intermediate results, enabling better scalability and fault tolerance. However, that has the adverse effect of slowing down data processing.
- Impala: Impala streams intermediate results between executors, although that trades off some scalability.
During the Runtime
- Hive: At compile time, Hive generates query expressions.
- Impala: At runtime, Impala generates code for "big loops".
Answer
- Collection Functions
- Hive Date Functions
- Mathematical Functions
- Conditional Functions
- Hive String Functions
Answer
There are several types of Hive DDL commands we commonly use, such as:
- Create Database Statement
- Hive Show Database
- Drop database
- Creating Hive Tables
- Browse the table
- Altering and Dropping Tables
- Hive Select Data from Table
- Hive Load Data