Bucketing in Hive - ignacio-alorre/Hive GitHub Wiki

[TODO] bring more information about this from other pages

############ Reshape this ############################

What are the factors to be considered while deciding the number of buckets?

One factor could be the block size, since each bucket is a separate file in HDFS and each file should be at least as large as the block size. Another factor could be the volume of data.

At the time of table creation, the data volume may not be known. In that case, what is the recommended number of buckets? And as more data gets ingested, the number of buckets chosen at table creation may have to be increased. Is it possible to increase the number of buckets periodically so that the data stays balanced?

Overall, what is the best practice?

It depends on what you are trying to do:

  • One thing buckets are used for is to increase load performance

Essentially, when you load data you often do not want one load per mapper (especially for partitioned loads, because this results in small files). Buckets are a good way to define the number of reducers running. So if your cluster has 40 task slots and you want the fastest ORC creation performance possible, you would want 40 buckets (or DISTRIBUTE BY the bucketing column and set the number of reducers to 40).
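As a minimal sketch of the second option, assuming a hypothetical staging table sales_staging being loaded into an ORC table sales_orc (the table and column names are illustrative, not from the original post):

```sql
-- Force the load through 40 reducers so the partition is written
-- as 40 reasonably sized files instead of many small mapper outputs.
SET mapred.reduce.tasks=40;

INSERT OVERWRITE TABLE sales_orc PARTITION (load_date='2020-01-01')
SELECT customer_id, amount
FROM sales_staging
WHERE load_date = '2020-01-01'
DISTRIBUTE BY customer_id;   -- spreads rows across the 40 reducers
```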

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

  • SELECT performance ( predicate pushdown )

Buckets can help with predicate pushdown, since every row with a given value of the bucketing column ends up in the same bucket. So if you bucket by 31 days and filter for one day, Hive will be able to more or less disregard 30 buckets. Obviously this is not always desirable, since you often WANT parallel execution, for example for aggregations. So whether this helps depends on your query. It might be better to sort by day and bucket by something like customer id if you need buckets for one of the other reasons.
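A sketch of that last suggestion, with hypothetical table and column names (bucket on the join-friendly customer_id, but keep each bucket sorted by day so day filters still skip data within ORC files):

```sql
-- Hypothetical table: bucketed by customer_id for joins,
-- sorted by day inside each bucket for day-based filtering.
CREATE TABLE transactions (
  customer_id BIGINT,
  day         STRING,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) SORTED BY (day ASC) INTO 32 BUCKETS
STORED AS ORC;
```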

  • Join Performance ( bucket join )

Buckets can lead to efficient joins if both joined tables are bucketed on the join key, since Hive then only needs to join bucket with bucket. This was a big deal in the old days, but it is not that applicable anymore with cost-based optimization in newer Hive versions (the optimizer is already very good at choosing map-side vs. shuffle joins, and a bucket join can actually stop it from using the better one).
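When both tables are bucketed (and sorted) on the join key into compatible bucket counts, a bucket map join can be enabled explicitly. A sketch, with hypothetical orders/customers tables:

```sql
-- Both tables assumed CLUSTERED BY the join key into the same
-- number of buckets; these settings let Hive join bucket-to-bucket.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```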

  • Sampling performance

Some sample operations can get faster with buckets.
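For example, the TABLESAMPLE clause can restrict a query to a subset of buckets. Assuming a table bucketed into 5 buckets on an ID column (like the Employee table built later in this page), reading one bucket scans roughly one fifth of the data:

```sql
-- Read only bucket 1 of the 5 buckets, instead of the whole table.
SELECT *
FROM Employee TABLESAMPLE (BUCKET 1 OUT OF 5 ON ID);
```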

So to summarize: buckets are a bit of an older concept, and I wouldn't use them unless I had a clear case for it. The join argument is not that applicable anymore; the increased load performance is also not always relevant, since you normally load single partitions, where a map-only load is often best. SELECT pushdown can be enhanced but also hindered depending on how you do it, and a SORT BY during load is normally better (see document). And I think sampling is a bit of a niche case.

So all in all, avoid buckets if you don't know well what you are doing. And regarding size: 1-10 blocks is a pretty good size for Hadoop files, and buckets should not be much smaller (unless you want very fast answers on small amounts of data; no rule without exceptions).

================================

Bucketing is a method to evenly distribute the data across many files. We create multiple buckets and then place each record into one of them based on some logic, usually a hashing algorithm.

The bucketing feature of Hive can be used to distribute/organize the table/partition data into multiple files such that similar records are present in the same file. While creating a Hive table, the user needs to specify the columns to be used for bucketing and the number of buckets to store the data in. Which bucket a record goes to is decided by the hash value of the bucketing columns:

[Hash(column(s))] MOD [Number of buckets]

The hash value is calculated differently for different column types. For int columns, the hash value is simply the value of the int itself. For String columns, the hash value is computed from the characters of the String.
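We can preview the bucket assignments with Hive's built-in hash and pmod functions (for an int ID, hash(ID) is just ID, so this reduces to ID mod 5 for a 5-bucket table):

```sql
-- Show which of 5 buckets each row would land in.
SELECT ID, pmod(hash(ID), 5) AS bucket_number
FROM Employee_old;
```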

Data for each bucket is stored in a separate HDFS file under the table directory on HDFS. Inside each bucket, we can define the arrangement of data by providing the SORT BY column while creating the table.

Let's see an example

Creating a Hive table using bucketing

For creating a bucketed table, we need to use CLUSTERED BY clause to define the columns for bucketing and provide the number of buckets. Following query creates a table Employee bucketed using the ID column into 5 buckets.

CREATE TABLE Employee(
  ID BIGINT,
  NAME STRING,
  AGE INT,
  SALARY BIGINT,
  DEPARTMENT STRING
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

Inserting data into a bucketed table

We have following data in Employee_old table.

0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id  | employee_old.name  | employee_old.age  | employee_old.salary  | employee_old.department  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1                | Sudip              | 34                | 62000                | HR                       |
| 2                | Suresh             | 45                | 76000                | FINANCE                  |
| 3                | Aarti              | 25                | 37000                | BIGDATA                  |
| 4                | Neha               | 27                | 39000                | FINANCE                  |
| 5                | Rajesh             | 29                | 59000                | BIGDATA                  |
| 6                | Suman              | 37                | 63000                | HR                       |
| 7                | Paresh             | 42                | 71000                | BIGDATA                  |
| 8                | Rami               | 33                | 56000                | HR                       |
| 9                | Arpit              | 41                | 46000                | HR                       |
| 10               | Sanjeev            | 51                | 99000                | FINANCE                  |
| 11               | Sanjay             | 32                | 67000                | FINANCE                  |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+

We will select data from the table Employee_old and insert it into our bucketed table Employee.

We need to set the property hive.enforce.bucketing to true while inserting data into a bucketed table. This enforces bucketing during the insert, so Hive derives the number of reducers from the number of buckets. (In Hive 2.x this property was removed and bucketing is always enforced.)

Set the property

set hive.enforce.bucketing=true;

Insert data into Bucketed table employee

INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;

Verify the Data in Buckets

Once we execute the INSERT query, we can verify that 5 files were created under the Employee table directory on HDFS.

Name       Type
000000_0   file
000001_0   file
000002_0   file
000003_0   file
000004_0   file

Each file represents a bucket. Let us see the contents of these files.

Content of 000000_0

All records with Hash(ID) mod 5 == 0 go into this file.

5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE

Content of 000001_0

All records with Hash(ID) mod 5 == 1 go into this file.

1,Sudip,34,62000,HR
6,Suman,37,63000,HR
11,Sanjay,32,67000,FINANCE

Content of 000002_0

All records with Hash(ID) mod 5 == 2 go into this file.

2,Suresh,45,76000,FINANCE
7,Paresh,42,71000,BIGDATA

Content of 000003_0

All records with Hash(ID) mod 5 == 3 go into this file.

3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR

Content of 000004_0

All records with Hash(ID) mod 5 == 4 go into this file.

4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR

#####################################################

http://hadooptutorial.info/bucketing-in-hive/

Usually, partitioning in Hive offers a way of segregating Hive table data into multiple files/directories. But partitioning gives effective results only when:

• There are limited number of partitions
• Comparatively equal sized partitions

But this may not be possible in all scenarios. For example, when partitioning tables by a geographic attribute such as country, a few big countries will create large partitions (4-5 countries alone may contribute 70-80% of the total data), whereas the data of small countries will create many small partitions (all the remaining countries in the world may together contribute just 20-30% of the total data). In these cases, partitioning alone is not ideal.

To overcome this problem of over-partitioning, Hive provides the bucketing concept, another technique for decomposing table data sets into more manageable parts.

• Bucketing is based on (hash function of the bucketed column) mod (total number of buckets). The hash function depends on the type of the bucketing column.
• Records with the same bucketed column value will always be stored in the same bucket.
• We use the CLUSTERED BY clause to divide the table into buckets.
• Physically, each bucket is just a file in the table directory, and bucket numbering is 1-based.
• Bucketing can be done along with partitioning on Hive tables, and even without partitioning.
• Bucketed tables will create almost equally distributed data file parts.

Advantages

• Bucketed tables offer more efficient sampling than non-bucketed tables. With sampling, we can try out queries on a fraction of the data for testing and debugging purposes when the original data sets are very huge.
• As the data files are equal-sized parts, map-side joins will be faster on bucketed tables than on non-bucketed tables. In a map-side join, a mapper processing a bucket of the left table knows that the matching rows of the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
• Similar to partitioning, bucketed tables can provide faster query responses than non-bucketed tables.
• Bucketing also provides the flexibility to keep the records in each bucket sorted by one or more columns. This makes map-side joins even more efficient, since the join of each bucket becomes an efficient merge-sort.

Limitations

• Specifying bucketing doesn't ensure that the table is properly populated. Loading data into the buckets correctly needs to be handled by ourselves.

Example Use Case

Let's explore the remaining features of bucketing in Hive with an example use case, by creating buckets for the sample user records provided in the previous post on partitioning –> UserRecords

Let us create the table partitioned by country and bucketed by state and sorted in ascending order of cities.

Creation of Bucketed Tables

We can create bucketed tables with the help of the CLUSTERED BY clause and the optional SORTED BY clause in the CREATE TABLE statement. With HiveQL like the following, we can create a bucketed_user table matching the above requirement.
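The original post's statement is not reproduced here; a sketch of what it might look like follows. The column list and bucket count are assumptions (the actual UserRecords schema comes from the previous post), but the PARTITIONED BY / CLUSTERED BY / SORTED BY structure matches the stated requirement: partitioned by country, bucketed by state, sorted by city.

```sql
-- Hypothetical sketch of bucketed_user: column names and the
-- 32-bucket count are assumptions, not from the original post.
CREATE TABLE bucketed_user (
  firstname STRING,
  lastname  STRING,
  address   STRING,
  city      STRING,
  state     STRING,
  post      STRING,
  phone     STRING,
  email     STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (state) SORTED BY (city ASC) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```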