Performance tunning - puneet3663/databricks GitHub Wiki
Optimize & Z-order
OPTIMIZE compacts the files to get a file size of up to 1GB, which is configurable
This command basically attempts to size the files to the size that you have configured (or 1GB by default if not configured).
You can also combine the OPTIMIZE command with the ZORDER, which physically sorts or co-locates data by chosen column(s).
In Databricks, the OPTIMIZE command is used to improve the layout of data files in Delta Lake tables by compacting small files into larger ones and optionally reorganizing data to improve query performance. It primarily addresses the problem of many small files resulting from continuous data writes or streaming data ingestion, which can degrade performance.
Key aspects of the OPTIMIZE command in Databricks are:
-
It performs bin-packing optimization to create evenly balanced data files with respect to their size on disk, which can reduce the number of files and improve data skipping efficiency. This bin-packing is idempotent, meaning running OPTIMIZE multiple times on unchanged data has no additional effect.
-
Z-Ordering can be applied optionally within OPTIMIZE by specifying columns. This sorts and clusters data based on those columns, improving query speed by co-locating similar values and enhancing data skipping. Z-Ordering is incremental and non-idempotent.
-
OPTIMIZE can be run on the entire table or a subset of data by using a WHERE predicate to optimize specific partitions.
-
It does not change the data content but improves file layout, so query results remain consistent before and after optimization.
-
The maximum output file size can be controlled via configurations (default is 1 GB).
-
Auto compaction can be enabled to automatically run OPTIMIZE after writes to keep files optimally sized.
The syntax for OPTIMIZE is:
OPTIMIZE table_name [FULL] [WHERE predicate] [ZORDER BY (col1, col2, ...)] Example: To optimize a Delta table and apply Z-Ordering by a column:
OPTIMIZE sales_table ZORDER BY (customer_id) OPTIMIZE improves query performance by reducing file count, improving data locality, enabling better data skipping, and helping analytical queries run faster on Delta Lake tables in Databricks.
Therefore, it is best to rely on auto-compaction for continuous, small write optimization but run OPTIMIZE manually or on a schedule for comprehensive file compaction, especially for large tables, and to leverage advanced optimizations like Z-Ordering