Spark Session

1- Creating a Spark Session

SparkSession is the unified entry point of a Spark application since Spark 2.0. It provides a way to interact with Spark's various functionalities using fewer constructs. Instead of having a SparkContext, a SQLContext, a HiveContext and so on, everything is now encapsulated in a SparkSession.

Before Spark 2.0, SparkContext was the entry point of any Spark application: it was used to access all Spark features and required a SparkConf holding all the cluster configuration parameters. With a SparkContext we could primarily create RDDs, and we had to create a specific context for every other kind of interaction: a SQLContext for SQL, a HiveContext for Hive, a StreamingContext for streaming applications.

// Before Spark 2.0: one context per type of interaction
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster(master).setAppName(appName)
val sc = new SparkContext(conf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val ssContext = new StreamingContext(sc, Seconds(1)) // StreamingContext also needs a batch interval

In a nutshell, SparkSession is a combination of these three contexts. Internally, SparkSession creates a SparkContext for all the operations, and all the above-mentioned contexts can be accessed through the SparkSession object.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Dummy App")
  .config("spark.master", "local")
  .enableHiveSupport()
  .getOrCreate()

The SparkSession builder will return an existing SparkSession if one has already been created; otherwise it creates a new one and assigns it as the global default.
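As a minimal sketch of that behavior (the second application name below is chosen purely for illustration), calling the builder again from the same JVM returns the very same session object:

// Second call to getOrCreate(): no new session is built,
// the active SparkSession created above is returned instead
val spark2 = SparkSession
  .builder()
  .appName("Another Dummy App") // effectively ignored, a session already exists
  .getOrCreate()

println(spark eq spark2) // true: both references point to the same SparkSession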

In the snippet above, enableHiveSupport() is similar to creating a HiveContext: it enables access to the Hive metastore, Hive SerDes, Hive UDFs, etc.
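For instance (a sketch only: the database and table names below are made up, and it assumes a Hive metastore is actually reachable), once Hive support is enabled the session can query tables registered in the metastore through plain SQL and inspect them via the catalog API:

// Illustrative only: some_hive_db / some_table are hypothetical metastore objects
spark.sql("SHOW DATABASES").show()

val df = spark.sql("SELECT * FROM some_hive_db.some_table LIMIT 10")
df.show()

// The catalog API reflects what is registered in the metastore as well
spark.catalog.listTables("some_hive_db").show()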

Note: We don't have to create a SparkSession object when using spark-shell; it is already created for us as the variable spark.

We can later access the different contexts using the spark session object.

spark.sparkContext // the underlying SparkContext
spark.sqlContext   // the SQLContext backed by this session

We can also access Spark's configuration through the SparkSession in the same way:

val mConf = spark.conf

2- Why do I need Spark session when I already have Spark context?

  • The session unifies all the different contexts in Spark, so the developer no longer has to worry about which context to use.

  • When multiple users access the same notebook environment, they share a single SparkContext, but each user needs an isolated environment. Before Spark 2.0, the only solution was to create a separate SparkContext per isolated environment, which is an expensive operation. SparkSession addresses this issue: many sessions can share one SparkContext while keeping tables and configuration isolated, as shown in the sketch after this list. Note: We can still have multiple SparkContexts by setting spark.driver.allowMultipleContexts to true, but this is discouraged.
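A minimal sketch of that pattern (the user names and the config values are invented for illustration): each user gets a child session derived from the same SparkContext, with its own runtime configuration:

// Illustrative only: one JVM-wide SparkContext, one isolated session per user
val sessionAlice = spark.newSession()
val sessionBob = spark.newSession()

// Both sessions are backed by the very same SparkContext...
assert(sessionAlice.sparkContext eq sessionBob.sparkContext)

// ...but their runtime configuration is independent
sessionAlice.conf.set("spark.sql.shuffle.partitions", "50")
sessionBob.conf.set("spark.sql.shuffle.partitions", "400")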

3- How do I create multiple sessions?

It is possible to create a new SparkSession as follows:

val session2 = spark.newSession()

If we check the hashes of spark and session2 we can see that both are different, but the underlying SparkContext is the same:

scala> val session2 = spark.newSession()
session2: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@691fffb9

scala> spark
res22: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@506bc254

scala> spark.sparkContext
res26: org.apache.spark.SparkContext = org.apache.spark.SparkContext@715fceaf

scala> session2.sparkContext
res27: org.apache.spark.SparkContext = org.apache.spark.SparkContext@715fceaf

Also, we can verify that the SparkSession gives a unified view of all the contexts while keeping configuration and environment isolated per session.

We can query directly, without creating a SQLContext as we used to, and run queries in the same way. Let's say we register a table called people_session1 in the session spark. This table will be visible only in that session.

Now let's imagine we create a new session, session2. The table won't be visible when we try to access it from session2, and we can even create another table with the same name without affecting the one in the spark session.
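As a sketch of this (the table name matches the example above, while the sample data is invented), the same isolation can be reproduced with temporary views:

// Register a temp view in the default session (illustrative data)
spark.range(5).toDF("id").createOrReplaceTempView("people_session1")
spark.sql("SELECT * FROM people_session1").show() // works: rows 0..4

// The view is scoped to `spark`; querying it from session2 would throw
// an AnalysisException ("Table or view not found"):
// session2.sql("SELECT * FROM people_session1").show()

// session2 can register its own view under the same name, fully independently
session2.range(100, 103).toDF("id").createOrReplaceTempView("people_session1")
session2.sql("SELECT * FROM people_session1").show() // rows 100..102
spark.sql("SELECT * FROM people_session1").show()    // still rows 0..4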

This isolation applies to the configuration as well: each session can have its own configs.

scala> spark.conf.get("spark.sql.crossJoin.enabled")
res21: String = true

scala> session2.conf.get("spark.sql.crossJoin.enabled")
res25: String = false

4- Get/Set Configurations

It is possible to modify the session configuration as follows:

scala> spark.conf.get("spark.sql.crossJoin.enabled")
res4: String = false

scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")

scala> spark.conf.get("spark.sql.crossJoin.enabled")
res6: String = true
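A couple of related calls are worth knowing (a sketch; the fallback value below is just an example): a config can be read with a default instead of throwing when unset, the whole runtime configuration can be inspected, and an option can be unset again:

// Read a config with a fallback value instead of throwing if it is not set
val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Inspect everything currently set on this session
spark.conf.getAll.foreach { case (k, v) => println(s"$k = $v") }

// Unset a previously set option to fall back to the default
spark.conf.unset("spark.sql.crossJoin.enabled")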
