Iceberg & Spark - cniackz/public GitHub Wiki
Objective:
Understand Iceberg concepts by using Iceberg tables from Spark.
Pages:
- https://github.com/cniackz/public/wiki/Spark
- https://iceberg.apache.org/docs/latest/getting-started/
Pre-Steps:
- Get Spark working (see the Spark wiki page listed under Pages).
Steps:
- Start a Spark shell with Iceberg:
- With Python:
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
- With Scala:
cd /Users/cniackz/spark/spark-3.3.1-bin-hadoop3/bin
./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
You should see output like the following (pyspark first, then spark-shell):
$ ./pyspark --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e;1.0
confs: [default]
found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
:: resolution report :: resolve 45ms :: artifacts dl 4ms
:: modules in use:
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 1 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-e0067d20-6680-4ed4-8e24-56343ff7b73e
confs: [default]
0 artifacts copied, 1 already retrieved (0kB/3ms)
22/12/02 11:00:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Python version 3.10.8 (main, Oct 13 2022 09:48:40)
Spark context Web UI available at http://192.168.1.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1670000456741).
SparkSession available as 'spark'.
>>>
$ ./spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0
:: loading settings :: url = jar:file:/Users/cniackz/spark/spark-3.3.1-bin-hadoop3/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/cniackz/.ivy2/cache
The jars for the packages stored in: /Users/cniackz/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699;1.0
confs: [default]
found org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 in central
downloading https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/1.1.0/iceberg-spark-runtime-3.2_2.12-1.1.0.jar ...
[SUCCESSFUL ] org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0!iceberg-spark-runtime-3.2_2.12.jar (2604ms)
:: resolution report :: resolve 464ms :: artifacts dl 2606ms
:: modules in use:
org.apache.iceberg#iceberg-spark-runtime-3.2_2.12;1.1.0 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 1 | 1 | 0 || 1 | 1 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c63357e7-5a4b-4d9e-88e1-3484451fa699
confs: [default]
1 artifacts copied, 0 already retrieved (26171kB/31ms)
22/12/02 10:55:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1670000132839).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15 (Java HotSpot(TM) 64-Bit Server VM, Java 18)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
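The commands above pull the Iceberg runtime built for Spark 3.2, while the logs show Spark 3.3.1 with Scala 2.12.15. Iceberg publishes one runtime artifact per Spark minor version, so a coordinate that matches the installed Spark is probably the safer choice. A minimal sketch of how the coordinate is put together; the helper name iceberg_runtime_coordinate is hypothetical, not part of Spark or Iceberg:

```python
def iceberg_runtime_coordinate(spark_version: str, scala_version: str, iceberg_version: str) -> str:
    """Build the Maven coordinate of the Iceberg Spark runtime jar."""
    # The artifact name carries only the Spark major.minor version,
    # e.g. "3.3" for Spark 3.3.1.
    major, minor = spark_version.split(".")[:2]
    artifact = f"iceberg-spark-runtime-{major}.{minor}_{scala_version}"
    return f"org.apache.iceberg:{artifact}:{iceberg_version}"


# Coordinate used in the commands above (runtime built for Spark 3.2):
print(iceberg_runtime_coordinate("3.2.0", "2.12", "1.1.0"))
# Coordinate matching the Spark 3.3.1 / Scala 2.12 install shown in the logs:
print(iceberg_runtime_coordinate("3.3.1", "2.12", "1.1.0"))
```

The second value can be passed to --packages in place of the 3.2 coordinate.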
- Add catalogs:
- Catalogs let you create and manage Iceberg tables with SQL commands.
./spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.1.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=$PWD/warehouse
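The same catalog settings can also be applied from a Python script instead of the spark-sql command line. The sketch below is an assumption about how one might do that, not part of the original steps: iceberg_catalog_conf and build_session are hypothetical helper names, and build_session needs the pyspark package installed.

```python
def iceberg_catalog_conf(warehouse: str) -> dict:
    """Return the --conf settings from the spark-sql command as a dict."""
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.local.type": "hadoop",
        "spark.sql.catalog.local.warehouse": warehouse,
    }


def build_session(package: str, conf: dict):
    """Create a SparkSession with the Iceberg runtime and the given settings."""
    # Imported here so the conf helper works even without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("iceberg-demo")
    # Equivalent of --packages on the command line.
    builder = builder.config("spark.jars.packages", package)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Print the settings in command-line form to compare with the command above.
# "/tmp/warehouse" is only an example path.
for key, value in iceberg_catalog_conf("/tmp/warehouse").items():
    print(f"--conf {key}={value}")
```

Calling build_session with the --packages coordinate and this dict should give a session in which the local catalog is available, so tables can be addressed as local.db.table.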
- Create a table at the spark-sql prompt started above:
spark-sql> CREATE TABLE local.db.table (id bigint, data string) USING iceberg;
Time taken: 0.844 seconds
- Insert data into the table:
INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c');
- Query the table. You should see:
spark-sql> select * from local.db.table;
1 a
2 b
3 c
Time taken: 0.28 seconds, Fetched 3 row(s)
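The create, insert, and select steps can also be run from pyspark through spark.sql. A minimal sketch, assuming spark is a session started with the Iceberg package and the local catalog settings shown earlier. The function name load_and_read is hypothetical, and IF NOT EXISTS and ORDER BY are additions that are not in the original statements.

```python
STATEMENTS = [
    # IF NOT EXISTS lets the script run again after the table has been created.
    "CREATE TABLE IF NOT EXISTS local.db.table (id bigint, data string) USING iceberg",
    # Each run appends these three rows again.
    "INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c')",
]


def load_and_read(spark):
    """Run the create and insert statements, then return the rows as tuples."""
    for statement in STATEMENTS:
        spark.sql(statement)
    rows = spark.sql("SELECT * FROM local.db.table ORDER BY id").collect()
    # A pyspark Row behaves like a tuple, so tuple(row) gives plain values.
    return [tuple(row) for row in rows]
```

In the pyspark shell, load_and_read(spark) on an empty warehouse should return the same three rows that spark-sql printed.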
Conclusion:
Spark is the engine and Iceberg is the table format. Spark reads tables stored in Iceberg format and writes its results back in that format.
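One way to see the format on disk: with the hadoop catalog used here, the table is expected to live under the warehouse directory at db/table, with metadata files and data files in separate subdirectories. The sketch below only counts files per subdirectory using the Python standard library. The function name summarize_table_dir is hypothetical, and the exact file names Iceberg writes are not assumed.

```python
from collections import defaultdict
from pathlib import Path


def summarize_table_dir(table_dir: str) -> dict:
    """Count files under each top-level entry of an Iceberg table directory."""
    root = Path(table_dir)
    counts = defaultdict(int)
    for path in root.rglob("*"):
        if path.is_file():
            # First path component below the table directory,
            # e.g. "metadata" or "data".
            top = path.relative_to(root).parts[0]
            counts[top] += 1
    return dict(counts)
```

For the table created above, summarize_table_dir("warehouse/db/table"), run from the directory where spark-sql was started, should report entries for metadata and data.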