Spark Job Scheduling
A Spark application consists of a driver process, where the high-level Spark logic is written, and a series of executor processes that can be scattered across the nodes of a cluster.
The Spark program runs on the driver node and sends instructions to the executors. One Spark cluster can run several Spark applications concurrently.
Spark applications can run multiple concurrent jobs; each job corresponds to one action called on an RDD in a given application.
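For example, in the minimal sketch below (assuming an existing `SparkContext` named `sc`, such as the one spark-shell provides), each of the two actions launches its own job:

```scala
// Assumes an existing SparkContext `sc` (e.g. the one provided by spark-shell).
val rdd = sc.parallelize(1 to 1000)             // transformation only: no job is launched yet

val total = rdd.reduce(_ + _)                   // action -> job 1
val evenCount = rdd.filter(_ % 2 == 0).count()  // action -> job 2
```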
Resource Allocation Across Applications

- Static Allocation: Each application is allotted a finite maximum of resources on the cluster and reserves them for the duration of the application (as long as the SparkContext is running).
- Dynamic Allocation: Executors are added to and removed from a Spark application as needed, based on a set of heuristics for estimated resource requirements (see the configuration sketch after this list).
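As an illustration, the sketch below shows how the two modes are commonly configured; the application names, executor counts, and memory values are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf

// Static allocation sketch: reserve a fixed set of executors for the whole application.
// Equivalent to the spark-submit flags --num-executors / --executor-cores / --executor-memory on YARN.
val staticConf = new SparkConf()
  .setAppName("static-allocation-example")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "4g")

// Dynamic allocation sketch: let Spark grow and shrink the executor pool between the bounds.
// Requires the external shuffle service so shuffle files survive executor removal.
val dynamicConf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
```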
The Spark Application
It is a set of Spark jobs defined by one `SparkContext` in the driver program. A Spark application begins when a `SparkContext` is started.
When the `SparkContext` is started, a driver and a series of executors are started on the worker nodes of the cluster. Each executor is its own JVM. An executor cannot span multiple nodes, although one node may contain several executors.
The `SparkContext` determines how many resources are allotted to each executor. In this way, we can think of one `SparkContext` as one set of configuration parameters for running Spark jobs. These parameters are exposed in the `SparkConf` object.
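A minimal sketch of this relationship, with placeholder resource values: the `SparkConf` fixes the per-executor resources, and the `SparkContext` built from it defines the application.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The SparkConf holds the configuration parameters; the values here are placeholders.
val conf = new SparkConf()
  .setAppName("spark-application-example")
  .set("spark.executor.memory", "2g")   // memory per executor JVM
  .set("spark.executor.cores", "2")     // cores per executor

// Starting the SparkContext starts the application: the driver contacts the
// cluster manager, which launches executors with the resources configured above.
val sc = new SparkContext(conf)

// ... define RDDs and call actions here; they all belong to this one application ...

sc.stop()  // ends the application and releases its executors
```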
Note: `RDD`s cannot be shared between applications. Thus transformations that use more than one `RDD`, such as join, must operate on RDDs defined by the same `SparkContext`.
In the image below we can see what happens when we start a `SparkContext`:
- Driver program pings the cluster manager.
- Cluster manager launches a number of Spark executors (JVMs shown as green boxes in the diagram) on the worker nodes of the cluster (blue circles).
- One node can have multiple Spark executors, but an executor cannot span multiple nodes.
- An RDD will be evaluated across the executors in partitions (shown as red rectangles).
- Each executor can have multiple partitions, but a partition cannot be spread across multiple executors (a short sketch follows this list).
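To make the partition/executor relationship concrete, the sketch below (again assuming an existing `sc`) creates an RDD with an explicit number of partitions; Spark distributes those partitions across whatever executors the application currently has:

```scala
// Assumes an existing SparkContext `sc`.
// Request 8 partitions explicitly; the number of executors does not change this.
val data = sc.parallelize(1 to 10000, numSlices = 8)

println(data.getNumPartitions)  // 8, regardless of how many executors hold them

// Each partition is processed as a single task on exactly one executor;
// one executor typically processes many partitions over the course of a job.
val partitionSizes = data.mapPartitions(iter => Iterator(iter.size)).collect()
println(partitionSizes.mkString(", "))  // eight per-partition element counts
```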
Default Spark Scheduler
By default, Spark schedules jobs on a "first in, first out" (FIFO) basis. However, Spark also offers a fair scheduler, which assigns tasks to concurrent jobs in round-robin fashion, i.e. parcelling out a few tasks for each job until the jobs are all complete. The fair scheduler ensures that jobs get a more even share of cluster resources. The Spark application launches jobs in the order in which their corresponding actions were called on the `SparkContext`.
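A sketch of how the fair scheduler is typically enabled: set `spark.scheduler.mode` to `FAIR` when building the context, and optionally assign jobs to a named pool from the thread that submits them (the pool name below is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default FIFO scheduler to the fair scheduler.
val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")

val sc = new SparkContext(conf)

// Jobs submitted from this thread go to the (placeholder) pool "reporting";
// concurrent jobs in different pools then share the cluster in round-robin fashion.
sc.setLocalProperty("spark.scheduler.pool", "reporting")

val count = sc.parallelize(1 to 1000).count()  // this action's job runs in the "reporting" pool

// Clear the property so later jobs from this thread fall back to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)

sc.stop()
```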