Evaluation of Spark

Author: Yuran Yan

Overview

This page gives a summary of the Apache Spark basics that have been discussed in this GitHub Issue.

Spark Cluster Architecture

There are 3 major components in a typical Spark cluster:

Driver Program

This is the machine from which the user launches the entire program; it is also known as the entry point of the application.

Cluster Manager

This is the machine used for managing the cluster. It keeps track of the states of the worker machines and tells the Driver which machines it can connect to and send tasks to. This machine is also known as the "Master node" of the cluster.

Worker Node

These are the machines that actually do the computation for all Spark applications. Each worker node has to connect to the Cluster Manager to become part of the Spark cluster, and each worker node receives different tasks over the course of the application. The Worker Node is also known as the "Slave node" of the cluster.

There are also some important smaller components within these major components:

SparkContext

This is the Spark object that holds the Spark application; creating it is the beginning of the code of a Spark program. Each Spark application has exactly one SparkContext object.
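As a minimal sketch (not Texera-specific), a SparkContext can be created as follows; the application name and the local master URL are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextExample {
  def main(args: Array[String]): Unit = {
    // Illustrative configuration; "local[*]" runs Spark on the local machine.
    val conf = new SparkConf()
      .setAppName("spark-evaluation-demo")
      .setMaster("local[*]")
    val sc = new SparkContext(conf) // exactly one SparkContext per application

    // ... build RDDs and run operations with sc here ...

    sc.stop() // release the context when the application finishes
  }
}
```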

Executors

Executors are the actual abstraction units that handle Spark computation. The Worker node, which is a machine, is just a host for executor(s). A Worker node can have one or more executors, and each executor takes control of part of the Worker node's CPU cores and memory. Tasks are sent to executors for execution.
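As a rough sketch of how executor resources are requested, the standard Spark configuration keys below set the executor count, cores, and memory; the values are placeholders, and spark.executor.instances only applies on cluster managers that support it (e.g., YARN):

```scala
import org.apache.spark.SparkConf

// Placeholder values for illustration only.
val conf = new SparkConf()
  .setAppName("executor-sizing-demo")
  .set("spark.executor.instances", "4") // number of executors (e.g., on YARN)
  .set("spark.executor.cores", "2")     // CPU cores each executor controls
  .set("spark.executor.memory", "2g")   // memory each executor controls
```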

Tasks

A task is the smallest unit of work in a Spark program. For now, you can regard tasks as fragments of code or commands that the executors need to perform. The details of this concept are discussed in the "Physical plan" part.

For more references about the architecture of Spark, read these links:

http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
https://spark.apache.org/docs/latest/cluster-overview.html

Spark Programming Concepts

Spark has its own way of processing data: all input data has to be transformed into one of the data abstractions used by Spark. There are two of them.

Resilient Distributed Dataset (RDD)

RDD is the core and basic data abstraction used by Apache Spark. It is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel (see the sketch after the list below). Three basic features of an RDD:

  1. Resilient: able to recompute missing or damaged partitions caused by failures, with the help of the RDD lineage graph.
  2. Distributed: partitioned and processed across multiple nodes.
  3. Dataset: can hold values of different types.
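A minimal sketch of building an RDD and inspecting its partitioning, assuming sc is the application's SparkContext:

```scala
// The RDD's elements are split into 4 partitions, which can live on
// different worker nodes and be processed in parallel.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)
println(numbers.getNumPartitions) // => 4
val doubled = numbers.map(_ * 2)  // applied partition-by-partition, in parallel
```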

Dataset (DataFrame)

This newer basic data abstraction became the primary one starting with Spark 2.0. According to the Spark official site, it is strongly typed like an RDD, but with richer optimizations. However, its actual performance compared to RDDs still needs to be explored. Most properties that apply to RDDs also apply to Datasets. See: https://spark.apache.org/docs/latest/quick-start.html
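A minimal sketch of the Dataset API, using a SparkSession (the Spark 2.0 entry point) and a hypothetical Record case class; the names and values are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

case class Record(id: Int, value: String)

val spark = SparkSession.builder()
  .appName("dataset-demo")
  .master("local[*]") // illustrative local master
  .getOrCreate()
import spark.implicits._ // provides toDS() and the encoder for Record

val ds = Seq(Record(1, "a"), Record(2, "b")).toDS() // strongly typed Dataset[Record]
val filtered = ds.filter(_.id > 1)                  // checked against Record's fields at compile time
filtered.show()
```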

RDD operations and their laziness

To prevent the system from doing unnecessary computation, Spark evaluates RDD operations lazily and only performs the computation when the result is explicitly requested by the Driver program. To understand what this means, you first have to understand the two categories of RDD operations.

First of all, all RDDs are immutable: no RDD can be changed, and you can only build new RDDs based on old ones. The data of an old RDD remains until it is erased by the garbage collector, unless it is cached. There are two types of RDD operations: Transformations and Actions.

Transformations ONLY generate new RDDs from other RDDs. Both the input and the output are RDDs.

Actions ONLY generate output whose type is not an RDD. For all Spark programs, data collections that are not RDDs are pushed back to the Driver program; this is where the Driver program explicitly asks for the computation results.

Following these definitions, operations like map() or filter() are considered Transformations, while operations like collect() or count() are considered Actions. The operation getNumPartitions() belongs to neither category. Spark never performs any Transformation until it is asked to perform an Action. So, in an example like the one sketched below, if the last line with collect() is removed, no map() or filter() will be executed when you run the program.
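A minimal sketch of such a program (an illustrative example, not the original one from the issue), assuming sc is the SparkContext:

```scala
val nums    = sc.parallelize(1 to 100)
val squared = nums.map(n => n * n)        // Transformation: nothing runs yet
val evens   = squared.filter(_ % 2 == 0)  // Transformation: still nothing runs
val result  = evens.collect()             // Action: map() and filter() execute now
// Remove the collect() line and neither map() nor filter() is ever executed.
```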

Be wary: if a Spark program has multiple Actions that refer to RDDs computed before the previous Actions, those RDDs will be recomputed unless they have been persist()ed or cache()d.
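A minimal sketch of how cache() avoids this recomputation, again assuming sc is the SparkContext:

```scala
val words = sc.parallelize(Seq("spark", "texera", "spark"))
val upper = words.map(_.toUpperCase).cache() // mark the result to be kept in memory
println(upper.count())                       // first Action: runs map() and caches the partitions
println(upper.collect().mkString(", "))      // second Action: reuses the cached partitions
// Without cache()/persist(), map() would run again for the second Action.
```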