Iceberg & Spark - cniackz/public GitHub Wiki
Notes for understanding Iceberg concepts using Spark.
- Get Spark working:
- Start a Spark shell with Iceberg:
- With Python:
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
- With Scala:
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
You should see:
$ ./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e;1.0
confs: [default]
found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
:: resolution report :: resolve 45ms :: artifacts dl 4ms
:: modules in use:
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
| default | 1 | 0 | 0 | 0 || 1 | 0 |
:: retrieving :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/3ms)
22/12/02 11:00:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.1
Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1670000456741).
SparkSession available as 'spark'.
$ ./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699;1.0
confs: [default]
found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
downloading ...
[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0!iceberg-spark-runtime-3.2_2.12.jar (2604ms)
:: resolution report :: resolve 464ms :: artifacts dl 2606ms
:: modules in use:
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
| default | 1 | 1 | 1 | 0 || 1 | 1 |
:: retrieving :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699
confs: [default]
1 artifacts copied, 0 already retrieved (26171kB/31ms)
22/12/02 10:55:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1670000132839).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 18)
Type in expressions to have them evaluated.
Type :help for more information.
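The `--packages` value passed to the shells above is a Maven coordinate whose artifact name encodes a Spark minor version and a Scala binary version (`iceberg-spark-runtime-<spark>_<scala>`). A minimal sketch of that naming scheme follows; the helper name `iceberg_runtime_coordinate` is hypothetical, and it only formats a string, so whether a given artifact is actually published still has to be checked on Maven Central.

```python
import re


def iceberg_runtime_coordinate(spark_version: str, scala_version: str, iceberg_version: str) -> str:
    """Format a --packages coordinate (hypothetical helper, not part of Spark or Iceberg).

    Only the major.minor part of the Spark and Scala versions is used,
    e.g. "3.3.1" -> "3.3" and "2.12.15" -> "2.12".
    """
    parts = []
    for label, version in (("Spark", spark_version), ("Scala", scala_version)):
        match = re.match(r"^(\d+)\.(\d+)", version)
        if match is None:
            raise ValueError(f"unrecognized {label} version: {version!r}")
        parts.append(f"{match.group(1)}.{match.group(2)}")
    spark_minor, scala_binary = parts
    return f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala_binary}:{iceberg_version}"


# Coordinate used on this page (built for Spark 3.2, Scala 2.12).
print(iceberg_runtime_coordinate("3.2", "2.12", "1.1.0"))

# Coordinate derived from the versions reported in the shell banners above.
print(iceberg_runtime_coordinate("3.3.1", "2.12.15", "1.1.0"))
```

Note that the session above runs Spark 3.3.1 with the runtime built for Spark 3.2. It loaded without errors here, but using the runtime that matches the Spark minor version is generally the safer choice.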
- Adding catalogs
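- Catalogs let you create and manage Iceberg tables with SQL commands. The command below configures two: spark_catalog, backed by Hive, and local, a Hadoop-type catalog whose warehouse directory is $PWD/warehouse.
./spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0 \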
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
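The catalog settings in the command above are plain key/value pairs, so they can be kept in one place and rendered into a command line. The sketch below does that with the standard library only; the function names `iceberg_catalog_confs` and `spark_sql_command` are hypothetical, while the keys and values are copied from the command above.

```python
import os
import shlex

# Package coordinate exactly as used on this page.
ICEBERG_PACKAGE = "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0"


def iceberg_catalog_confs(warehouse: str) -> dict:
    """Return the Spark settings from the spark-sql command above as a dict."""
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Session catalog wrapper, backed by Hive.
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        # Catalog named "local", stored as files under the warehouse directory.
        "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.local.type": "hadoop",
        "spark.sql.catalog.local.warehouse": warehouse,
    }


def spark_sql_command(warehouse: str, package: str = ICEBERG_PACKAGE) -> list:
    """Build the argument list for ./spark-sql with one --conf per setting."""
    args = ["./spark-sql", "--packages", package]
    for key, value in iceberg_catalog_confs(warehouse).items():
        args += ["--conf", f"{key}={value}"]
    return args


# Print a shell-quoted command equivalent to the one above ($PWD/warehouse).
print(shlex.join(spark_sql_command(os.path.join(os.getcwd(), "warehouse"))))
```

The same pairs could presumably be applied in a PySpark script through `SparkSession.builder.config(key, value)` instead of `--conf` flags; that variant is not tested on this page.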
- Create a table in the local catalog configured above:
spark-sql> CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
Time taken: 0.844 seconds
- Insert Data in the Table:
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
Query the table; you should see:
spark-sql> select * from local.db.table;
1 a
2 b
3 c
Time taken: 0.28 seconds, Fetched 3 row(s)
Spark is the engine; Iceberg is only the table format. The engine reads tables stored in that format and writes its output in that format.
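Because Iceberg is a format rather than a service, the table created above can be inspected without Spark. The sketch below assumes the layout a Hadoop-type catalog normally produces: the table lives at `<warehouse>/db/table`, and its `metadata/` directory holds a `version-hint.text` file plus uncompressed `v<N>.metadata.json` files. That layout is an assumption not shown on this page, and all function names are hypothetical.

```python
import json
from pathlib import Path


def table_dir(warehouse, identifier: str) -> Path:
    """Map an identifier such as "local.db.table" to its directory.

    Assumes the Hadoop-catalog convention <warehouse>/<namespace...>/<table>;
    the leading catalog name ("local") is not part of the path.
    """
    _catalog, *namespace, name = identifier.split(".")
    return Path(warehouse).joinpath(*namespace, name)


def current_metadata(table_path) -> dict:
    """Load the current table metadata JSON (assumed Hadoop-catalog layout)."""
    metadata_dir = Path(table_path) / "metadata"
    # version-hint.text is assumed to hold the current metadata version number.
    version = int((metadata_dir / "version-hint.text").read_text().strip())
    return json.loads((metadata_dir / f"v{version}.metadata.json").read_text())


def snapshot_operations(metadata: dict) -> list:
    """Return (snapshot-id, operation) pairs from a table metadata document."""
    return [
        (snap["snapshot-id"], snap.get("summary", {}).get("operation"))
        for snap in metadata.get("snapshots", [])
    ]


# Directory where local.db.table would live under a ./warehouse directory.
print(table_dir("warehouse", "local.db.table"))
```

Running `snapshot_operations(current_metadata(table_dir("warehouse", "local.db.table")))` from the directory where spark-sql was started should list the snapshot written by the INSERT, provided the layout assumption holds.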