Hive CLI vs Beeline: Indexing hive - prabhu914/Hadoop-Interview-Question GitHub Wiki

http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/

http://toodey.com/2015/07/25/what-is-beeline-in-latest-hiveserver2-difference-between-hiveserver1-and-hiveserver2/

Hive CLI

Limitation – supports only a single user at a time.

No authentication support provided.

The Hive CLI is a simple and widely used interface, still in production use.

HiveServer1

HiveServer1 is a client-server model service.

It allows users to connect using the Hive CLI interface or a Thrift client.

It supports remote client connections, but only one client can connect at a time.

No session management support.

No concurrency control, a limitation of the Thrift API it uses.

HiveServer2

HiveServer2 also follows a client-server model.

It allows many different kinds of clients to connect, such as Thrift clients.

HiveServer2 provides multi-client support: many clients can connect at the same time.

Authentication is much better, with Kerberos support.

Support for JDBC and ODBC driver connections.

The Beeline CLI is used for connecting to HiveServer2.

Beeline CLI

Beeline is a command line interface for HiveServer2

This is based on SQLLine CLI.

It gives better support for JDBC/ODBC.

It is not compatible with the old HiveServer1.

To use Beeline against HiveServer2, you need some extra configuration.
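As an illustration, a typical Beeline session connects to HiveServer2 through its JDBC URL. The hostname, port (10000 is HiveServer2's default), database, and credentials below are placeholders for your own deployment:

```shell
# Connect to HiveServer2 directly from the command line
# (hs2-host, hiveuser, and hivepassword are placeholders)
beeline -u jdbc:hive2://hs2-host:10000/default -n hiveuser -p hivepassword

# Or start Beeline first and connect interactively:
beeline
# beeline> !connect jdbc:hive2://hs2-host:10000/default
```

With Kerberos enabled, the JDBC URL additionally carries the server principal, e.g. `jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@REALM`.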

Indexing

The goal of Hive indexing is to improve the speed of query lookup on certain columns of a table. Without an index, queries with predicates like 'WHERE tab1.col1 = 10' load the entire table or partition and process all the rows. But if an index exists for col1, then only a portion of the file needs to be loaded and processed.

The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index.

When we perform a query on a table that has an index, the query does not need to scan all the rows in the table. Instead, it checks the index first and then reads only the relevant portion of the data to perform the operation.

So if we maintain indexes, the Hive query can look into the indexes first and then perform the needed operations in less time.

Indexes are maintained in a separate table in Hive, so they do not affect the data inside the table that holds the actual data. Another major advantage of indexing in Hive is that indexes can also be partitioned, depending on the size of the data we have.
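To illustrate the separate index table, Hive lets you name it explicitly with the IN TABLE clause; `'COMPACT'` is the shorthand for the compact index handler class. The table, column, and index names below are placeholders:

```sql
-- Create a compact index and store it in an explicitly named index table
-- (sales, region, and sales_region_idx are placeholder names)
CREATE INDEX region_idx
ON TABLE sales (region)
AS 'COMPACT'
WITH DEFERRED REBUILD
IN TABLE sales_region_idx;
```

If no IN TABLE clause is given, Hive generates a name for the index table automatically.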

When to use Indexing?

Indexing can be used under the following circumstances:

- If the dataset is very large.
- If query execution takes more time than you expected.
- If speedy query execution is required.
- When building a data model.

Differences between Compact and Bitmap Indexing

The main difference is how the mapped values of the rows are stored across the different blocks. When the data inside a Hive table is stored in HDFS (the default), it is distributed across the nodes in a cluster. Block indexing provides a proper identification of this data: it records which row is present in which block, so that when a query is triggered it can go directly to that block. So, while performing a query, Hive first checks the index and then goes directly to the right block.

Compact indexing stores pairs of the indexed column's value and its block id.

Bitmap indexing stores the combination of indexed column value and list of rows as a bitmap.
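For comparison, a bitmap index is created the same way, only with a different handler (`'BITMAP'` is the shorthand for the bitmap index handler class); bitmap indexes generally suit columns with few distinct values. The table and column names are placeholders:

```sql
-- Bitmap index: suits low-cardinality columns such as a status flag
-- (orders and status are placeholder names)
CREATE INDEX status_bitmap_idx
ON TABLE orders (status)
AS 'BITMAP'
WITH DEFERRED REBUILD;
```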

Example:

CREATE INDEX index_name
ON TABLE table_name (columns, ...)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

The 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' line specifies that the built-in CompactIndexHandler will act on the created index, which means we are creating a compact index for the table.

The WITH DEFERRED REBUILD clause should be present in the created index because we need to build the index in a later stage using an ALTER statement.

This syntax will create an index for our table, but to complete the creation we need to run the REBUILD statement. For this to happen, we need to add one more ALTER statement. A MapReduce job will be launched, and the index creation is then completed.

ALTER INDEX index_name ON table_name REBUILD;

This ALTER statement completes the rebuild of the index for the table.
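To round things off, the index can be inspected and removed as follows (index and table names are placeholders):

```sql
-- List indexes defined on the table
SHOW FORMATTED INDEX ON table_name;

-- Drop the index when it is no longer needed
DROP INDEX IF EXISTS index_name ON table_name;
```

Note that Hive indexing was removed in Hive 3.0; on recent versions, materialized views or columnar formats such as ORC with built-in indexes serve a similar purpose.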