Iceberg & Spark

Diagram:

(diagram image)

Objective:

To understand Apache Iceberg concepts using Spark as the query engine.

Pre-Steps:

  1. Get Spark working locally (the steps below assume a Spark 3.3.1 binary distribution, e.g. spark-3.3.1-bin-hadoop3).

Steps:

  1. Start a Spark shell with the Iceberg runtime on the classpath (a quick sanity check inside the shell is sketched after the output below). Note: the runtime artifact used here targets Spark 3.2; for Spark 3.3.x there is also a matching org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.1.0.
  • With Python (pyspark):
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
  • With Scala (spark-shell):
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0

You should see:

$ ./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
:: resolution report :: resolve 45ms :: artifacts dl 4ms
	:: modules in use:
	org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e
	confs: [default]
	0 artifacts copied, 1 already retrieved (0kB/3ms)
22/12/02 11:00:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://192.168.1.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1670000456741).
SparkSession available as 'spark'.
>>>
And with spark-shell:

$ ./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/1.1.0/iceberg-spark-runtime-3.2_2.12-1.1.0.jar ...
	[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0!iceberg-spark-runtime-3.2_2.12.jar (2604ms)
:: resolution report :: resolve 464ms :: artifacts dl 2606ms
	:: modules in use:
	org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699
	confs: [default]
	1 artifacts copied, 0 already retrieved (26171kB/31ms)
22/12/02 10:55:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1670000132839).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/
         
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 18)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
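
Once either shell is up, a quick sanity check can confirm the session and the resolved package. A minimal sketch from the pyspark prompt (the spark-shell equivalent is analogous), assuming the shell was started with --packages as above:

# Sanity check inside the pyspark shell started above.
# 'spark' and 'sc' are created by the shell itself.
print(spark.version)                          # e.g. 3.3.1
print(spark.conf.get("spark.jars.packages"))  # the Iceberg runtime coordinate passed via --packages
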
  2. Add catalogs:
  • Catalogs are what let Spark SQL commands create and manage Iceberg tables (a programmatic PySpark equivalent is sketched after the spark-sql command below).
./spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0 \
    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
    --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
    --conf spark.sql.catalog.spark_catalog.type=hive \
    --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.local.type=hadoop \
    --conf spark.sql.catalog.local.warehouse=$PWD/warehouse
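
The same catalog configuration can also be set programmatically instead of via command-line flags. A minimal PySpark sketch, assuming the Iceberg runtime jar is pulled in via spark.jars.packages and configuring only the file-based local catalog (the Hive-backed spark_catalog is left out here):

# Sketch: building a SparkSession with the Iceberg extensions and a
# "hadoop" (file-based) catalog named 'local', mirroring the --conf flags above.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-local")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "warehouse")  # directory for table data and metadata
    .getOrCreate()
)
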
  3. Create a table from the spark-sql session started above:
spark-sql> CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
Time taken: 0.844 seconds
  4. Insert data into the table (a PySpark version of the create/insert/select steps is sketched after the query output below):
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');

You should see:

spark-sql> select * from local.db.table;
1	a
2	b
3	c
Time taken: 0.28 seconds, Fetched 3 row(s)
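
The same create/insert/select flow can be run from the pyspark shell instead of spark-sql. A minimal sketch, assuming a SparkSession configured with the Iceberg extensions and the 'local' catalog as above:

# Sketch: exercising the same Iceberg table from PySpark via spark.sql().
spark.sql("CREATE TABLE IF NOT EXISTS local.db.table (id bigint, data string) USING iceberg")
spark.sql("INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
spark.sql("SELECT * FROM local.db.table").show()  # prints the three rows inserted above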

Conclusion:

Spark is the engine and Iceberg is the table format: the engine loads the Iceberg runtime and catalog configuration, then reads and writes the table's data and metadata in the Iceberg format.