Tips for Optimizing Hive Performance - dtoinagn/flyingbird.github.io GitHub Wiki

Optimizing Hive queries is crucial for achieving better performance and scalability in a data warehouse environment. Here are some tips and best practices for optimizing Hive queries:

Partitioning

Partitioning your data can significantly improve query performance by reducing the amount of data scanned during query execution.
Partition your data based on commonly filtered columns, such as date or category.
Use static partitioning for columns with a limited number of distinct values and dynamic partitioning for columns with high cardinality.
Consider using partitioned tables for time-series data to improve query performance on date-range queries.

Bucketing

Bucketing distributes data into a fixed number of buckets based on the hash value of one or more columns.
Use bucketing to distribute data across files and improve data locality evenly.
Choose the number of buckets wisely based on the size of your data and the available resources.
Bucketing is particularly useful for optimizing join operations and aggregations.

Optimizing Join Operations:

Use map-side joins for small tables that can fit into memory to avoid shuffling data across the network.
Use broadcast joins for joining a small table with a large table, broadcasting the small table to all nodes to avoid data shuffling.
Avoid cross joins (cartesian products) as they can result in a significant increase in data volume and degrade performance.
Optimize join order and join conditions to minimize the amount of data shuffled during join operations.

Some tips to improve inner join: Improve filtering to prevent partition pruning (Partition pruning refers to the mechanism where a query can skip reading the data files corresponding to one or more partitions). If you can arrange for queries to prune large number of unnecessary partitions from the query execution plan, the queries use fewer resources and are thus proportionally faster and more scalable.

Column Pruning:

Avoid using SELECT * and explicitly specify only the columns needed for the query results.
Column pruning reduces the amount of data read from disk and improves query performance.

Optimizing File Formats:

Choose appropriate file formats such as ORC or Parquet, which are optimized for query performance and storage efficiency.
These file formats support compression and predicate pushdown, which can further improve query performance.

Statistics Collection:

Collect table and column statistics using the ANALYZE TABLE command to help the query optimizer make better decisions.
Update statistics regularly, especially after data loading or significant data changes.

Tuning Hive Configuration:

Adjust Hive configuration parameters such as memory allocation, parallelism settings, and query execution parameters based on the characteristics of your workload and cluster resources.
Monitor query performance and resource utilization to identify bottlenecks and fine-tune configuration settings accordingly.

Vectorization: By default, Hive will process 1 row at a time, i.e. 1 mapper for 1 row. If you enable Vectorization, Hive mapper will process 1024 rows at a time. The disadvantage is that it will occupy a lot of memory

-- To enable vectorization
SET hive.vectorized.execution.enable=true;

Hive Parallel execution: If you have queries that are independent of each other, you can make them run in parallel by enabling the Hive parallel execution:

SET hive.exec.parallel=true;