Open Table Format: Iceberg
Apache Iceberg is an open-source table format designed for high-performance data lake tables. It provides a robust and flexible foundation for managing large datasets in distributed storage systems such as Hadoop and cloud object stores.
A table format helps to organize the data files for each table. It consists of information about the data files, such as schema details, file create and update times, the number of records in each file, and record-level operation types (addition and deletion). Table formats give data lakes ACID features, support for updates and deletes, and data-skipping capabilities, thus improving query performance.
Three popular open table formats for implementing a lakehouse architecture are Apache Iceberg, Apache Hudi, and Delta Lake.
Iceberg
Iceberg is supported by various vendor platforms, including Dremio, Snowflake, and Tabular. It has a rich feature set and the ability to support the Parquet, ORC, and Avro file formats.
Iceberg consists of two major parts: a metadata layer and a data layer.
The directory structure of Iceberg is as follows:
- Iceberg catalog - provides the table location and points to the latest metadata file.
- Metadata layer - consists of multiple elements:
  - Like the Hive metastore, Iceberg stores schema details and partition information, but it keeps them in a metadata file.
  - The metadata file contains a section of snapshots. Each snapshot points to a file known as a "manifest list." For every new transaction, a new snapshot is added to the metadata file.
  - A manifest list points to all the manifest files that belong to a specific snapshot.
  - These manifest files store the list of data files along with column statistics (the metadata tables queried below expose this hierarchy).
- Data layer
  - Iceberg supports the Parquet, ORC, and Avro file formats for writing the data files.
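
To see these layers concretely, Iceberg exposes metadata tables that can be queried through Spark SQL. A minimal sketch, assuming a Spark session already configured with an Iceberg catalog (as shown later on this page); catalog_name, db_name, and table_name are placeholders:

```python
# Query Iceberg's metadata tables to inspect the hierarchy described above.

# Snapshots recorded in the metadata file (one per committed transaction)
spark.sql("SELECT snapshot_id, committed_at, operation "
          "FROM catalog_name.db_name.table_name.snapshots").show()

# Manifest files referenced by the current snapshot's manifest list
spark.sql("SELECT path, added_data_files_count "
          "FROM catalog_name.db_name.table_name.manifests").show()

# Data files and per-file statistics tracked by the manifests
spark.sql("SELECT file_path, record_count, file_size_in_bytes "
          "FROM catalog_name.db_name.table_name.files").show()
```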
Partitioning strategies and configuration
Iceberg supports several partitioning strategies, including identity, range, and bucketing. When selecting a partitioning strategy, consider your query patterns and access requirements. For example, if your queries often filter on specific columns, you may benefit from partitioning your data on those columns using an identity or range strategy. If your queries are more focused on evenly distributing data across partitions, you may consider using bucketing.
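
For example, in Spark SQL an Iceberg table declares its partitioning with transforms at creation time. A sketch, reusing the catalog_name/db_name placeholders from this page: country uses the identity transform, days(ts) gives range-like pruning on a timestamp column, and bucket(16, id) hashes ids into 16 buckets.

```python
# Create an Iceberg table combining identity, time-based, and bucket partitioning.
spark.sql("""
    CREATE TABLE catalog_name.db_name.events (
        id      BIGINT,
        ts      TIMESTAMP,
        country STRING)
    USING iceberg
    PARTITIONED BY (country, days(ts), bucket(16, id))
""")
```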
You cannot read or write Iceberg tables directly with pandas, but there are two common routes from Python: PySpark (the Python API for Apache Spark) and the native pyiceberg library shown at the end of this page. Here is an example of how to read data from an Iceberg table into a Spark DataFrame, and then convert that to a pandas DataFrame:
```python
from pyspark.sql import SparkSession

# Initialise Spark with the Iceberg catalog configured at session creation.
# The Iceberg Spark runtime jar must be available on the classpath.
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.sql.catalog.catalog_name",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.catalog_name.type", "hadoop")
    .config("spark.sql.catalog.catalog_name.warehouse", "/path/to/warehouse")
    .getOrCreate()
)

# Read an Iceberg table into a Spark DataFrame
df = spark.sql("SELECT * FROM catalog_name.db_name.table_name")

# Convert the Spark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()

# Now you can use pandas operations on the pandas_df object
```
In this example, replace catalog_name, db_name, and table_name with your Iceberg catalog name, database name, and table name, respectively (hyphens are not valid in Spark SQL identifiers unless quoted with backticks, so prefer underscores). Additionally, replace "/path/to/warehouse" with the path to your Hadoop warehouse.
Alternatively, you can use the pyiceberg library to work with Iceberg tables directly, including tables stored on S3, without a Spark cluster.
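
A minimal pyiceberg sketch (pip install "pyiceberg[s3fs]"); the catalog name, REST catalog URI, S3 endpoint, credentials, and the country column in the filter are assumptions to adapt to your setup:

```python
from pyiceberg.catalog import load_catalog

# Load a catalog; the URI and S3 properties below are placeholder values.
catalog = load_catalog(
    "my_catalog",
    **{
        "uri": "http://localhost:8181",          # REST catalog endpoint (assumed)
        "s3.endpoint": "http://localhost:9000",  # S3-compatible storage (assumed)
        "s3.access-key-id": "my-access-key",
        "s3.secret-access-key": "my-secret-key",
    },
)

table = catalog.load_table("db_name.table_name")

# Scan the table (optionally pushing down a row filter) into a pandas DataFrame;
# "country" is a hypothetical column used only to illustrate filtering.
pandas_df = table.scan(row_filter="country = 'PL'").to_pandas()
```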