Hive Upserts (Insert and Update) - Gowthamsb12/BigData-Blogs GitHub Wiki

Hive Upserts (Insert and Update).

How to handle Upserts in the hive?

Hive supports ACID But doing updates directly in Row-level causes performance issue in hive.

Type1 Create an intermediate table with a partition to store all the recent records and then do a join with the main table and overwrite the partition in the main table (Insert overwrite). Or the same can be done by MERGE command in the hive

Type2 Use Hive HBase integration – HBase can handle the updates effectively but doing aggregation in HBase via Hive will slowdowns the performance because we don’t have partition and bucket in hive-HBase table.

Or after HBase updates load the final data to a standard hive table so that you can do all the joins and aggregations effectively in the hive (but this a kind of long process)

Type3 Hudi /HOODIE is a uber’s opensource framework which handles the hive upserts.

Type4 We can use JDBC, Spark to make a program level upserts in the hive, for example, Join 2 data frames in spark and perform an inner join then load these output data as Updates in hive and then again do a left join between the same 2 Data frame then load these output data as Inserts. (This approach Is okay when your dataset size is small for the give cluster resource), because direct updates are costly or you can still use the other types specified above.

Note: The above types can be chosen based on your data load and the batch time-frequency