Top Hive Notes - dtoinagn/flyingbird.github.io GitHub Wiki

Hive vs SQL

| SQL | Hive |
|---|---|
| ETL (Extract, Transform, Load) | ELT (Extract, Load, Transform) |
| Schema enforced on write | Schema enforced on read |
| No serialized data support | Good serialized data support |
| Good for small data | Good for large data |
| Supports structured data | Supports structured and semi-structured data |
1. Serialization and deserialization formats are known as SerDes. A SerDe lets Hive read and write data in a particular format: it parses the structured or semi-structured bytes stored in HDFS according to the schema definition of the Hive table. Hive provides a set of built-in SerDes and also allows users to create custom SerDes for their own data formats. The built-in SerDes include:
* LazySimpleSerDe (Hive default SerDes format)
* RegexSerDe
* AvroSerDe
* OrcSerDe
* ParquetHiveSerDe
* JSONSerDe
* CSVSerDe
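For example, a table backed by JSON files can name its SerDe explicitly in the DDL (the table name, columns, and location below are illustrative):

```sql
-- Hypothetical table using the built-in JSON SerDe shipped with HCatalog
CREATE EXTERNAL TABLE events_json (
  id INT,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/cloudera/events_json';
```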

What is the Hive default metastore?

By default, Hive uses an embedded Derby database as its metastore, the central repository of Apache Hive metadata. The metastore keeps metadata for Hive tables (such as their schema and location) and partitions in a relational database, and exposes this information to clients through the metastore service API.

Why Hive does not store metadata information in hdfs?

Hive stores metadata in the metastore using an RDBMS instead of HDFS. The main reason for choosing an RDBMS is to achieve low latency: metadata operations are small, frequent random reads and writes, whereas HDFS read/write operations are comparatively slow and optimized for large sequential access.

What are the default types of tables in Hive?

Hive supports two types of tables:

  1. Managed table - when the table is dropped, the backing directory associated with it is deleted as well. Use it for staging purposes.
  2. External table - when the table is dropped, the backing directory still exists; only the metadata is removed. Use it for target systems.
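A minimal sketch of the two table types (table names and the path are illustrative):

```sql
-- Managed table: Hive owns the data; DROP TABLE deletes the files too
CREATE TABLE staging_customers (
  id INT,
  name STRING
);

-- External table: Hive only tracks the metadata; DROP TABLE leaves the
-- files under LOCATION in place
CREATE EXTERNAL TABLE target_customers (
  id INT,
  name STRING
)
LOCATION '/user/cloudera/target_customers';
```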

How do you load data in Hive?

  • LOAD DATA LOCAL INPATH '/home/cloudera/datasets/customers.csv' INTO TABLE customers; -- local file path
  • LOAD DATA INPATH '/user/cloudera/datasets/customers.csv' INTO TABLE customers; -- HDFS path (the file is moved into the table's directory)

How do you skip header rows from a table in Hive?

CREATE EXTERNAL TABLE employee (
name STRING,
job STRING,
dob STRING,
id INT,
salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '_' STORED AS TEXTFILE
LOCATION '/user/data'
TBLPROPERTIES("skip.header.line.count"="2");

What is the difference between Sort By and Order By?

We should use 'sort by' instead of 'order by' when sorting huge datasets, because 'sort by' sorts the data using multiple reducers (each reducer's output is sorted, but the overall output is not totally ordered), whereas 'order by' sorts all of the data together in a single reducer. Therefore, using 'order by' against a large volume of input takes a long time to execute.
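A sketch of the two clauses (table and column names are illustrative):

```sql
-- Total order, single reducer: can be very slow on large inputs
SELECT id, amount FROM transactions ORDER BY amount DESC;

-- Per-reducer order, multiple reducers: faster, but the output is only
-- sorted within each reducer's output file, not globally
SELECT id, amount FROM transactions SORT BY amount DESC;
```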

Why do we perform partitioning in Hive?

Partitioning divides a Hive table into finer-grained parts. By scanning only the relevant partitions instead of the whole dataset, queries that filter on the partition key run with much lower latency.
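For example, given a table partitioned by country (hypothetical names), a filter on the partition column lets Hive read only one partition directory:

```sql
-- Only the files under .../country=IND are scanned (partition pruning);
-- all other country=... directories are skipped entirely
SELECT state, capital
FROM partitioned_table
WHERE country = 'IND';
```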

What are the different types of partitions available in Hive?

We have two types of partitions - Static Partition and Dynamic Partition.

Static load partition: we name the partition into which the data must be loaded, i.e. we provide the partition manually:

CREATE EXTERNAL TABLE partitioned_table(
 state varchar(255),
 capital varchar(255),
 language varchar(255)
)
partitioned by (country varchar(100))
row format delimited fields terminated by ','
stored as textfile
location '/user/cloudera/partitioned_dir';
load data local inpath '/home/cloudera/country/India' into table partitioned_table
partition(country='IND');
load data local inpath '/home/cloudera/country/America' into table partitioned_table
partition(country='US');

Static Insert Partition: Inserting the data from a non-partitioned table to a partitioned table.

  1. Create a non-partitioned table to load this data into. Since the final insert selects the language column and filters on country, the staging table must carry both columns:
CREATE EXTERNAL TABLE non_partitioned_table (
 state varchar(255),
 capital varchar(255),
 language varchar(255),
 country varchar(255)
)
row format delimited fields terminated by ','
stored as textfile
location '/user/cloudera/non_partitioned_dir1';
  2. Load the file into this table: LOAD DATA LOCAL INPATH '/home/cloudera/country/countries' INTO TABLE non_partitioned_table;
  3. Create a partitioned table:
CREATE EXTERNAL TABLE partitioned_table_country(
 state varchar(255),
 capital varchar(255),
 language varchar(255)
)
partitioned by (country varchar(100))
row format delimited fields terminated by ',' stored as textfile
location '/user/cloudera/partitioned_dir_country1';
  4. Insert from the non-partitioned table into the partitioned table:
insert into partitioned_table_country partition(country='IND')
select state, capital, language
from non_partitioned_table where country = 'IND';

Dynamic Load Partition

The data is first loaded into a non-partitioned table and then inserted into the partitioned table dynamically. The only difference from a static insert is that we no longer name the target partition; the partitions are created automatically from the query result.

-- non-partitioned table
CREATE EXTERNAL TABLE source_country(
 state varchar(255),
 capital varchar(255),
 language varchar(255),
 country varchar(255)
)
row format delimited fields terminated by ','
stored as textfile
location '/user/cloudera/source_country';

-- Load the files into this location
LOAD DATA LOCAL INPATH '/home/cloudera/country/countries' INTO TABLE source_country;

-- Create partitioned table:
CREATE EXTERNAL TABLE target_country (
 state varchar(255),
 capital varchar(255),
 language varchar(255)
)
partitioned by (country varchar(100))
row format delimited fields terminated by ','
stored as textfile
location '/user/cloudera/target_country';

-- Dynamic partitioning must be enabled first:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Insert data from the non-partitioned table into the partitioned table;
-- the partition column must come last in the SELECT list:
insert into target_country partition(country)
select state, capital, language, country
from source_country;

What is the difference between partitioning and bucketing in Hive?

See also: Tips for optimizing Hive performance.

Partitioning is a way of dividing a large table into smaller, more manageable parts based on the values of one or more columns. The partition key is used to create a directory hierarchy that represents the partitioning scheme, and each partition is stored as a separate directory with its own set of files. Bucketing divides the data within a partition (or within an unpartitioned table) into a fixed number of files called buckets, based on the hash value of one or more columns.

For example, let's say we have a table with a column named id, with values ranging from 1 to 100. Using bucketing, we can divide this data into 5 files (buckets): each row is assigned to a bucket by taking the hash of the id column modulo 5.

Partitioning is useful when we need to filter data based on the partition key, as Hive can quickly skip over partitions that don't match the filter condition. While bucketing is useful when we need to join or aggregate data on the specific column(s), as it can help reduce the number of data files that need to be scanned.
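A sketch of a bucketed table for the id example above (table and column names are illustrative):

```sql
-- Rows are assigned to one of 5 files by hash(id) mod 5
CREATE TABLE bucketed_ids (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 5 BUCKETS
STORED AS ORC;
```

Note that on older Hive versions you may need `set hive.enforce.bucketing=true;` before inserting, so that the insert actually produces one file per bucket.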

Given a table named "country", you are asked to create the table. It has 4 columns, which may be dropped or added. How would you design the table?

Since the schema is going to change, you can use the Avro file format, which supports schema evolution: when columns are added or dropped you update the Avro schema file instead of recreating the table.

CREATE EXTERNAL TABLE country_stg1
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
LOCATION '/user/cloudera/country_stg1'
TBLPROPERTIES('avro.schema.url'='/user/cloudera/country_stg1/country_stg1.avsc');
-- avro.schema.url must point to an Avro schema (.avsc) file, not a data directory

How to execute the Hive commands using a Unix Shell?

You can place all your statements in a file, e.g. hive.hql, and execute it with hive -f hive.hql. A single statement can also be run inline with hive -e '<statement>'.
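A sketch of the batch workflow (the script path and statements are illustrative; running the last commands requires a Hive installation, so they are shown commented out):

```shell
# Write the Hive statements to a script file
cat > /tmp/hive_demo.hql <<'EOF'
SHOW DATABASES;
SELECT COUNT(*) FROM customers;
EOF

# Execute the whole script in batch mode:
# hive -f /tmp/hive_demo.hql

# Or run a single statement inline:
# hive -e 'SHOW DATABASES;'

echo "wrote $(wc -l < /tmp/hive_demo.hql) statements to /tmp/hive_demo.hql"
```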

How does MAPJOIN optimize your query?

  • Map join is a Hive feature used to speed up queries. It lets the smaller table be loaded into memory so that the join can be performed entirely within the mappers, without a shuffle/reduce step.
  • When automatic map-join conversion (hive.auto.convert.join) is enabled and a table smaller than 25 MB (hive.mapjoin.smalltable.filesize) is found, the join is converted to a map join.
  • When three or more tables are involved in the join, Hive can generate several map-side joins and combine them into a single map-side join, provided the combined size of the n-1 smaller tables is below 10 MB (hive.auto.convert.join.noconditionaltask.size).
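A sketch (table names are illustrative); the hint form forces a map join even when automatic conversion is off:

```sql
-- Let Hive convert small-table joins into map joins automatically
set hive.auto.convert.join=true;

-- Or force it explicitly with a hint: dim_country is loaded into memory
-- and the join runs inside the mappers, with no reduce step
SELECT /*+ MAPJOIN(d) */ f.id, d.country_name
FROM fact_sales f
JOIN dim_country d ON f.country_id = d.id;
```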

I have data in a hdfs location. I created a Hive table on top of this location. The table is partitioned. When I query my table, I see no data. What could be the reason?

When partitions are added through Hive, they are registered in the Hive metastore. When a partitioned table is created on top of existing data, however, the partition directories already present in HDFS are not automatically registered in the metastore. We need to recover them by running the following command:

MSCK REPAIR TABLE table_name;
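After the repair, the recovered partitions can be verified (table name as above):

```sql
-- Lists all partitions now registered in the metastore
SHOW PARTITIONS table_name;
```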

I have an application that runs every month. It does two things:

  1. It loads the data into hdfs subdirectory by first creating the current month folder and then loading data into that month.
  2. It creates a partition for the Hive table

For the month of June, the application ingested the data into the June folder but failed to create the partition in Hive. Querying the June partition therefore returns no data. How do I fix it?

-- Run the following HQL
ALTER TABLE partitioned_transaction ADD PARTITION(month='June') LOCATION '/user/hive/June';