# Spark Context and Spark Session
Apache Spark is a powerful big data processing framework that provides two essential entry points for interacting with the Spark cluster: Spark Context and Spark Session. In this documentation, we'll explore both Spark Context and Spark Session, highlighting their characteristics, use cases, and differences. We'll also provide Java examples to illustrate how to create and use them.
## Spark Context

### Introduction
- **Description:** Spark Context (also known as `SparkContext`) is the entry point to the core functionality of Apache Spark. It represents the connection to a Spark cluster and is responsible for coordinating tasks and distributing data across the cluster.
- **Use Cases:** Spark Context is primarily used for creating Resilient Distributed Datasets (RDDs), which are the fundamental data structures in Spark. It is suitable for Spark's standalone mode as well as for cluster deployments on Apache Mesos or Hadoop YARN.
### Creating a Spark Context (Java Example)
```java
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

public class SparkContextExample {
    public static void main(String[] args) {
        // Create a SparkConf object to configure Spark
        SparkConf conf = new SparkConf().setAppName("SparkContextExample").setMaster("local");

        // Create a SparkContext using the SparkConf object
        SparkContext sc = new SparkContext(conf);

        // Your Spark operations go here

        // Stop the SparkContext when done
        sc.stop();
    }
}
```
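In Java, RDD work usually goes through `JavaSparkContext`, the Java-friendly wrapper around `SparkContext`. The following is a minimal sketch, not part of the original example: it assumes local mode and uses made-up inline data to parallelize a small list, transform it, and reduce it to a single value.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        // Local-mode configuration; the app name and master are illustrative choices.
        SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Distribute a small in-memory list as an RDD.
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> numbers = jsc.parallelize(data);

        // Classic RDD-style transformation and action: square each element, then sum.
        int sumOfSquares = numbers.map(x -> x * x).reduce(Integer::sum);
        System.out.println("Sum of squares: " + sumOfSquares);

        jsc.stop();
    }
}
```

The same pattern applies to real data sources; `parallelize` is used here only to keep the sketch self-contained.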
## Spark Session

### Introduction
- **Description:** Spark Session (also known as `SparkSession`) was introduced in Spark 2.0 as a higher-level API that provides a unified entry point for various Spark features, including DataFrames, Datasets, and SQL operations. It is designed to simplify the development process.
- **Use Cases:** Spark Session is the recommended entry point for all Spark functionality, especially when working with structured data, SQL queries, and machine learning libraries such as Spark MLlib. It provides optimizations for querying and managing structured data.
### Creating a Spark Session (Java Example)
```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SparkSessionExample {
    public static void main(String[] args) {
        // Create a SparkConf object to configure Spark
        SparkConf conf = new SparkConf().setAppName("SparkSessionExample").setMaster("local");

        // Create a SparkSession using the SparkConf object
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Your Spark operations go here

        // Stop the SparkSession when done
        spark.stop();
    }
}
```
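To show the structured-data side, the sketch below builds a small DataFrame from in-memory rows, registers it as a temporary view, and queries it with SQL. The column names and values are invented for the example; it is an illustration of the API rather than part of the original wiki code.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DataFrameSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameSqlExample")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical in-memory data: (name, age) pairs.
        List<Row> rows = Arrays.asList(
                RowFactory.create("Alice", 34),
                RowFactory.create("Bob", 45),
                RowFactory.create("Cara", 29));

        StructType schema = new StructType()
                .add("name", DataTypes.StringType)
                .add("age", DataTypes.IntegerType);

        Dataset<Row> people = spark.createDataFrame(rows, schema);

        // Register the DataFrame as a temporary view and query it with SQL.
        people.createOrReplaceTempView("people");
        Dataset<Row> over30 = spark.sql("SELECT name, age FROM people WHERE age > 30");
        over30.show();

        spark.stop();
    }
}
```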
## Comparing Spark Context and Spark Session
### 1. Purpose

- **Spark Context:** Mainly used for RDD-based operations and lower-level Spark functionality.
- **Spark Session:** Designed for higher-level abstractions such as DataFrames, Datasets, SQL, and MLlib; provides a unified entry point for Spark features.
### 2. Entry Point

- **Spark Context:** The traditional entry point to Spark, used before Spark 2.0.
- **Spark Session:** The recommended entry point for Spark 2.0 and later, simplifying development.
### 3. RDDs vs. DataFrames

- **Spark Context:** Works well with RDDs (Resilient Distributed Datasets).
- **Spark Session:** Primarily used with DataFrames and Datasets, which provide a more structured and optimized way to work with data (see the interop sketch below).
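The two abstractions also interoperate: a `SparkSession` exposes its underlying `SparkContext`, an RDD of rows can be lifted into a DataFrame by attaching a schema, and a DataFrame can be dropped back down to an RDD. A rough sketch, assuming the same local-mode setup as the earlier examples and made-up data:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RddDataFrameInterop {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddDataFrameInterop")
                .master("local[*]")
                .getOrCreate();

        // Wrap the session's SparkContext for Java-friendly RDD operations.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Start from an RDD of rows (hypothetical data)...
        JavaRDD<Row> rowRdd = jsc.parallelize(Arrays.asList(
                RowFactory.create("orders", 120L),
                RowFactory.create("returns", 8L)));

        // ...lift it into a DataFrame by attaching a schema...
        StructType schema = new StructType()
                .add("metric", DataTypes.StringType)
                .add("count", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(rowRdd, schema);
        df.show();

        // ...and drop back down to an RDD when low-level control is needed.
        JavaRDD<Row> backToRdd = df.javaRDD();
        System.out.println("Partitions: " + backToRdd.getNumPartitions());

        spark.stop();
    }
}
```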
### 4. Use Cases

- **Spark Context:** Suitable for low-level, custom data processing tasks and RDD transformations.
- **Spark Session:** Ideal for structured data processing, SQL queries, and machine learning tasks.
### 5. Configuration

- **Spark Context:** Configured using a `SparkConf` object.
- **Spark Session:** Configured using a `SparkConf` object and created through the `SparkSession.builder()` method (see the builder sketch below).
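In practice, a `SparkSession` is often configured through the builder alone, without constructing a `SparkConf` first. A minimal sketch; the app name, master, and the `spark.sql.shuffle.partitions` setting are illustrative choices, not requirements:

```java
import org.apache.spark.sql.SparkSession;

public class BuilderConfigExample {
    public static void main(String[] args) {
        // All configuration is passed through the builder; no explicit SparkConf is needed.
        SparkSession spark = SparkSession.builder()
                .appName("BuilderConfigExample")
                .master("local[*]")
                .config("spark.sql.shuffle.partitions", "4") // illustrative tuning option
                .getOrCreate();

        // Read the setting back through the session's runtime configuration.
        System.out.println("spark.sql.shuffle.partitions = "
                + spark.conf().get("spark.sql.shuffle.partitions"));

        spark.stop();
    }
}
```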
### 6. Spark UI

- **Spark Context:** Each Spark Context launches its own Spark UI for monitoring the application.
- **Spark Session:** Wraps a Spark Context under the hood, so it shares that Spark UI rather than starting a separate one.
### 7. Example Use Cases

- **Spark Context:** Legacy code or specialized RDD-based processing tasks.
- **Spark Session:** Modern, structured data processing tasks, SQL queries, and machine learning pipelines.
## Conclusion
In summary, Spark Context and Spark Session serve as the entry points to Apache Spark, catering to different use cases and levels of abstraction. While Spark Context is suitable for lower-level RDD operations, Spark Session is recommended for structured data processing and provides a unified API for various Spark features. Choose the appropriate entry point based on your specific needs and the version of Spark you are using.