Evaluation of Spark
Author: Yuran Yan
Overview
This page gives a summary of the Apache Spark basics that have been discussed in this GitHub Issue.
Spark Cluster Architecture
There are three major components in a typical Spark cluster:
Driver Program
This is the machine from which the user initiates the whole program; it is also known as the entry point of the application.
Cluster Manager
This is the machine used for managing the cluster. It keeps track of the states of the worker machines and tells the Driver which machines it can connect to and send tasks to. This machine is also known as the "Master node" of the cluster.
Worker Node
These are the machines that actually carry out the computation of all Spark applications. Each worker node has to connect to the Cluster Manager to become part of the Spark cluster, and each worker node will receive different tasks throughout the application. The worker node is also known as the "Slave node" of the cluster.
There are also some important smaller components within each major component:
SparkContext
This is the Spark object that holds the Spark application and marks the beginning of a Spark program's code. Each Spark application has exactly one SparkContext object.
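As a minimal sketch, creating a SparkContext in Scala might look like the following (the application name and the local master URL are arbitrary placeholders, not values used by Texera):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Describe the application; "local[*]" runs Spark locally using all cores.
// In a real cluster, the master URL would instead point to the cluster manager.
val conf = new SparkConf()
  .setAppName("SparkContextExample")   // placeholder application name
  .setMaster("local[*]")

// Exactly one SparkContext exists per Spark application.
val sc = new SparkContext(conf)

// ... build and operate on RDDs through sc ...

sc.stop()
```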
Executors
Executors are the abstraction units that actually handle Spark computation. The worker node, which is a machine, is just a host for executor(s). A worker node can have one or more executors, each of which takes control of part of the worker node's CPU cores and memory. Tasks are sent to executors for execution.
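A rough sketch of how executor resources can be requested through standard Spark configuration keys (the values here are placeholders; the actual number of executors launched depends on the cluster manager and available resources):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ExecutorConfigExample")      // placeholder application name
  .set("spark.executor.cores", "2")         // CPU cores controlled by each executor
  .set("spark.executor.memory", "2g")       // memory controlled by each executor

// The master URL is typically supplied when the application is submitted to a cluster.
val sc = new SparkContext(conf)
```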
Tasks
A task is the smallest unit of work in a Spark program. For now, you can regard tasks as fragments of code or commands that the executors need to perform. The concept is discussed in more detail in the "Physical plan" part.
For more references about the architecture of Spark, see these links:
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
https://spark.apache.org/docs/latest/cluster-overview.html
Spark Programming Concept
Spark has its own way of processing data: all data inputs have to be converted into the data abstractions used by Spark. There are two of them.
Resilient Distributed Dataset (RDD)
RDD is the core and basic data abstraction used by Apache Spark. It is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel (see the sketch after this list). RDDs have three basic features:
- Resilient: able to recompute missing or damaged partitions caused by failures, with the help of the RDD lineage graph.
- Distributed: the data is partitioned and distributed across multiple nodes.
- Dataset: can hold values of different types.
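A small sketch of creating and partitioning an RDD, assuming the sc SparkContext from the earlier snippet:

```scala
// Distribute a local collection across 4 partitions (the number is arbitrary).
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Each partition can be processed by a different executor in parallel.
println(numbers.getNumPartitions)   // prints 4

// If a partition is lost, Spark can recompute it from the lineage graph,
// i.e. the chain of operations that produced it (here: parallelize).
```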
Dataset (DataFrame)
This newer basic data abstraction was introduced in Spark 2.0. According to the official Spark site, it is strongly typed like an RDD, but it comes with richer optimizations. However, its actual performance compared to the RDD still needs to be explored. Most properties that apply to RDDs also apply to Datasets. https://spark.apache.org/docs/latest/quick-start.html
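For illustration, a minimal sketch of creating a Dataset through SparkSession, the Spark 2.0+ entry point (the application name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession wraps the SparkContext and is the entry point for the Dataset API.
val spark = SparkSession.builder()
  .appName("DatasetExample")   // placeholder application name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._        // enables .toDS() / .toDF() on Scala collections

// A strongly typed Dataset[Int] ...
val ds = Seq(1, 2, 3, 4).toDS()
println(ds.filter(_ % 2 == 0).count())   // prints 2

// ... and its untyped DataFrame view (Dataset[Row]).
val df = ds.toDF("value")
df.show()
```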
RDD operations and their laziness
To prevent the system from doing unnecessary computation, Spark evaluates RDD operations lazily and only performs the computation when the result is explicitly requested by the Driver program. To understand what this means, you first have to understand the categories of RDD operations.
First of all, all RDDs are immutable, which means no RDD can be changed; you can only build new RDDs based on old ones. The data of an old RDD remains until it is erased by the garbage collector, unless it is cached. There are two types of RDD operations: transformations and actions.
A transformation ONLY generates new RDDs from other RDDs; both the input and the output are RDDs.
An action ONLY generates output whose type is not an RDD. In any Spark program, data collections that are not RDDs are pushed back to the Driver program; this is where the Driver program explicitly asks for the computation results.
Following these definitions, operations like map() or filter() are considered transformations, while operations like collect() or count() are considered actions. The operation getNumPartitions() belongs to neither of them.
Spark will never perform any transformation until it is asked to perform an action. So in the example given, if the last line with collect() is removed, no map() or filter() will be executed when you run the program.
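The original example lives in the linked GitHub Issue; the following is only an illustrative sketch of the same idea, again assuming the sc SparkContext from the earlier snippets:

```scala
// Transformations: nothing is computed yet, Spark only records the lineage.
val numbers = sc.parallelize(1 to 1000)
val doubled = numbers.map(_ * 2)          // transformation: RDD -> RDD
val small   = doubled.filter(_ < 100)     // transformation: RDD -> RDD

// Action: the result is not an RDD, so it is sent back to the Driver,
// and only now are map() and filter() actually executed.
val result: Array[Int] = small.collect()

// If the collect() line is removed, neither map() nor filter() ever runs.
```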
Be wary: if a Spark program has multiple actions that refer to RDDs computed before earlier actions, those RDDs will be recomputed unless they have been persisted with persist() or cache().
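A small sketch of that pitfall, once more assuming the sc SparkContext from above:

```scala
val words = sc.parallelize(Seq("spark", "texera", "wiki"))
val upper = words.map(_.toUpperCase)

// Without cache(), the map() above would run twice: once per action below.
upper.cache()                 // or upper.persist() for finer control over storage

println(upper.count())                    // first action: computes and caches the RDD
println(upper.collect().mkString(", "))   // second action: reuses the cached data
```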