Architecture and Features - ignacio-alorre/Spark GitHub Wiki
Key Points
- Cluster Manager: Allocates resources and instructs workers to execute the job. Tracks submitted jobs and reports their status back
- Worker Node: Processes the data stored on the node at the request of the Cluster Manager (Master)
- Spark Driver: Runs the `main()` function and creates the `SparkContext`. Also hosts processes like the DAGScheduler and the TaskScheduler
- Spark Executor: Hosts the running tasks. These tasks operate on a subset of the RDDs present on that node
- When you invoke an action on an RDD, a Job is created. Jobs are work submitted to Spark
- Jobs are divided into Stages at shuffle boundaries
- Each Stage is further divided into Tasks based on the number of partitions in the RDD. Tasks are the units of execution that the Spark driver schedules and ships to the Spark Executors on the worker nodes. Often multiple tasks run in parallel on the same executor, each processing its partition of the dataset in memory
Cluster Manager
Allocates resources and instructs workers to execute the job. It tracks submitted jobs and reports their status back.
Worker Node
Nodes which run the application code in the cluster. The master node assigns work, and the worker nodes actually perform the assigned tasks. A worker node processes the data stored on it and reports its resources to the master.
Spark Driver
The Driver is the process that runs the `main()` function of the application and also creates the `SparkContext`. A driver is a separate JVM process, and it is the driver's responsibility to analyze, distribute, schedule and monitor work across the worker nodes.
Each application launched or submitted on a cluster has its own separate Driver, and even if multiple applications run simultaneously on a cluster, their Drivers do not talk to each other in any way. The Driver program also hosts a number of processes which are part of the application, such as:
- SparkEnv
- DAGScheduler
- TaskScheduler
- SparkUI
Spark Executor
The Driver program launches the tasks which run on individual worker nodes. These tasks operate on a subset of the RDDs present on that node. The programs running on the worker nodes are called executors, and the code written in your application is what these executors execute.
After starting, the Driver program interacts with the Cluster Manager (YARN, Mesos, or the default standalone manager) to spin up resources on the worker nodes and then assigns the tasks to the executors. Tasks are the basic units of execution.
SparkSession and SparkContext
The SparkContext is the main entry point to Spark Core. It allows us to access further Spark functionality, establishes the connection to the Spark execution environment, and provides access to the Spark cluster through a resource manager. It offers functions such as:
- Getting the current status of Spark application
- Cancelling the Job
- Cancelling the Stage
- Running Job Synchronously
- Running Job Asynchronously
- Accessing persistent RDD
- Un-persisting RDD
- Programmable dynamic allocation
Spark Application
A Spark Application is a self-contained computation that runs user-supplied code to compute a result. Even when no job is running, a Spark Application can have processes running on its behalf.
- When you invoke an action on an RDD, a Job is created. Jobs are work submitted to Spark.
- Jobs are divided into Stages at shuffle boundaries.
- Each Stage is further divided into Tasks based on the number of partitions in the RDD. Tasks are the units of execution that the Spark driver schedules and ships to the Spark Executors on the worker nodes. Often multiple tasks run in parallel on the same executor, each processing its partition of the dataset in memory.
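The job/stage/task breakdown above can be sketched with a toy model in plain Scala. This is not the Spark scheduler — `Op` and its shuffle flag are illustrative assumptions — but it captures the rule: stages are split at shuffle boundaries, and each stage launches one task per partition.

```scala
// Toy model (not Spark's scheduler): each operation either keeps narrow
// dependencies or forces a shuffle. Stages are the shuffle-separated runs,
// and each stage launches one task per partition of the RDD.
final case class Op(name: String, shufflesData: Boolean)

def countStages(ops: List[Op]): Int =
  1 + ops.count(_.shufflesData)

def countTasks(ops: List[Op], numPartitions: Int): Int =
  countStages(ops) * numPartitions

// e.g. map -> filter -> reduceByKey -> map over an RDD with 8 partitions
val pipeline = List(
  Op("map", shufflesData = false),
  Op("filter", shufflesData = false),
  Op("reduceByKey", shufflesData = true),
  Op("map", shufflesData = false)
)

val stages = countStages(pipeline)              // 2: split at reduceByKey
val tasks  = countTasks(pipeline, numPartitions = 8)  // 2 stages * 8 partitions
```

In the real scheduler the shuffle boundary is determined by wide dependencies (e.g. `reduceByKey`, `join`), but the counting logic is the same.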


- Other systems for in-memory storage are based on "fine-grained" updates to mutable objects, i.e., calls to a particular cell in a table, storing intermediate results.
- Evaluation of RDDs, however, is completely lazy. Spark does not begin computing the partitions until an action is called.
- An action is a Spark operation that returns something other than an RDD, triggering evaluation of the partitions and possibly returning output to a non-Spark system (outside the Spark executors); for example, bringing data back to the driver or writing data to an external storage system.
- Actions trigger the scheduler, which builds a directed acyclic graph (the DAG) based on the dependencies between RDD transformations.
- Spark evaluates an action by working backward to define the series of steps it has to take to produce each object in the final distributed dataset (each partition).
- Using this series of steps, called the execution plan, the scheduler computes the missing partitions for each stage until it computes the result.
Note: Not all transformations are 100% lazy. `sortByKey` needs to evaluate the RDD to determine the range of data, so it involves both a transformation and an action.
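The "record now, compute at the action" behavior can be seen in miniature with plain Scala collection views — an analogy, not Spark itself, with the side-effect counter added purely so we can observe when work happens:

```scala
// Analogy in plain Scala (not the Spark API): a view defers work the way an
// RDD defers transformations; forcing the view plays the role of an action.
var evaluations = 0

val data = (1 to 5).view.map { n =>
  evaluations += 1   // side effect lets us observe when work actually happens
  n * n
}

val before = evaluations   // still 0: the map has only been recorded
val result = data.toList   // the "action": forces evaluation of every element
val after  = evaluations   // now 5: one evaluation per element
```

Just as with RDDs, defining `data` costs nothing; only forcing it (`toList` here, `collect`/`count`/`saveAsTextFile` in Spark) does the work.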
Performance and usability advantages of lazy evaluation
- Lazy evaluation allows Spark to combine operations that don't require communication with the driver (transformations with one-to-one dependencies) to avoid doing multiple passes through the data.
- Suppose a Spark program calls a `map` and a `filter` function on the same RDD. Spark can send the instructions for both the `map` and the `filter` to each executor.
- Then Spark can perform both the `map` and the `filter` on each partition, which requires accessing the records only once, rather than sending two sets of instructions and accessing each partition twice, therefore reducing the computational cost.
- The lazy evaluation paradigm is also easier to work with than MapReduce, where the developer needs to consolidate the mapping operations manually. Spark lets us chain together operations with narrow dependencies in a few lines and lets the Spark evaluation engine do the work of consolidating them.
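The one-pass fusion of `map` and `filter` can be demonstrated with a plain Scala `Iterator` — again an analogy for what happens inside a stage, not Spark code; the access counter is an illustrative addition:

```scala
// Plain-Scala analogy: composing map and filter on an Iterator touches each
// element once, just as Spark fuses narrow transformations within a stage.
var accesses = 0

def tracked(xs: Seq[Int]): Iterator[Int] =
  xs.iterator.map { n => accesses += 1; n }  // count each record access

val out = tracked(1 to 6)
  .map(_ * 10)          // both instructions are "shipped" together...
  .filter(_ % 20 == 0)  // ...and applied in a single pass over the data
  .toList

// accesses is 6, not 12: one pass despite two chained operations
```

An eager collection (`List`) would materialize the intermediate `map` result and traverse the data twice; the lazy pipeline never builds that intermediate dataset.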
Lazy evaluation and debugging
- Lazy evaluation has important consequences for debugging, since it means that a Spark program will fail only at the point of an action.
- Note: Because of lazy evaluation, stack traces from failed Spark jobs (especially when embedded in larger systems) will often appear to fail consistently at the point of the action, even if the problem in the logic occurs in a transformation much earlier in the program.
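The failure-at-the-action behavior can be reproduced with the same plain-Scala view analogy: the bug (a divide by zero) lives in the "transformation", but nothing blows up until the view is forced.

```scala
import scala.util.Try

// Plain-Scala analogy (not Spark): the divide-by-zero below is recorded
// lazily, so the failure only surfaces when the view is forced -- the
// "action" -- even though the bug lives in the earlier "transformation".
val divided = List(4, 0, 2).view.map(100 / _)  // no exception thrown here

val outcome = Try(divided.toList)  // forcing it is what fails
val failed  = outcome.isFailure    // true: ArithmeticException at the "action"
```

This is exactly why a Spark stack trace points at `collect` or `save` while the real bug sits in a `map` defined many lines earlier.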
- Spark's greatest performance advantage over MapReduce occurs when repeated computations are required. Much of this performance increase is due to Spark's use of in-memory persistence.
- Rather than writing to disk between each pass through the data, Spark has the option of keeping the data on the executors loaded into memory.
- That way, the data in each partition is available in memory each time it needs to be accessed.
Spark offers three options for memory management:
- In memory as deserialized Java objects: Store the objects in the RDD as the original deserialized Java objects defined by the driver program. This form of in-memory storage is the fastest, since it avoids serialization time. However, it may not be the most memory-efficient, since it requires the data to be stored as full objects.
- As serialized data: Spark objects are converted into streams of bytes, as they are when moved around the network. This approach may be slower, since serialized data is more CPU-intensive to read than deserialized data; however, it is often more memory-efficient, since it allows the user to choose a more compact representation. Kryo serialization can be even more space-efficient.
- On disk: RDDs whose partitions are too large to be stored in RAM on the executors can be written to disk. This is slower for repeated computations, but more fault-tolerant for long sequences of transformations, and may be the only feasible option for enormous computations.
The `persist()` function on an RDD lets the user control how the RDD is stored. By default, `persist()` stores an RDD as deserialized objects in memory, but the user can pass one of numerous storage options to `persist()` to control how the RDD is stored.
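Why `persist()` pays off can be sketched in plain Scala (this is a memoization analogy, not the Spark API; `runLineage` stands in for recomputing a partition from its lineage):

```scala
// Sketch of the persist() payoff (plain Scala, not Spark): without caching,
// every "action" re-runs the lineage; with caching, each partition is
// materialized once and later reads come from memory.
var lineageRuns = 0

def runLineage(p: Int): Seq[Int] = {
  lineageRuns += 1               // stands in for recomputing the partition
  (0 until 4).map(_ + p * 4)
}

// Unpersisted: two "actions" on partition 0 -> two full recomputations
val a1 = runLineage(0)
val a2 = runLineage(0)
val unpersistedRuns = lineageRuns                    // 2

// "Persisted": cache the materialized partition; later actions reuse it
val cache = scala.collection.mutable.Map.empty[Int, Seq[Int]]
def persisted(p: Int): Seq[Int] = cache.getOrElseUpdate(p, runLineage(p))

val b1 = persisted(1)
val b2 = persisted(1)
val persistedRuns = lineageRuns - unpersistedRuns    // 1
```

In real Spark the equivalent is `rdd.persist(StorageLevel.MEMORY_ONLY)` (or one of the serialized/disk levels above) before running several actions over the same RDD.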
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each executor operates on a partition at any point in time. (In practice, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions in the RDD.)
By "operating", what is really happening is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph of operations.
Now, suppose a particular node crashes in the middle of an operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out the node is dead and tries to assign another node to continue processing. This node will be told to operate on the particular partition of the RDD and given the series of operations X -> Y -> Z (called the lineage) that it has to execute, by passing in the Scala closures created from the application code. The new node can then happily continue processing, and there is effectively no data loss.
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation you perform, such as calling a database inside a Spark action block, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
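The recovery story above can be modeled in a few lines of plain Scala (a toy model, not Spark's internals): a lost partition is recovered by replaying the lineage of pure functions over the original input.

```scala
// Toy model of lineage-based recovery (plain Scala, not Spark internals):
// a partition is recovered by replaying the pure transformations
// X -> Y -> Z over the source input, so losing a computed partition
// loses no data.
type Partition = Vector[Int]

val x: Int => Int = _ + 1                   // transformation X
val y: Int => Int = _ * 2                   // transformation Y
val z: Int => Int = _ - 3                   // transformation Z
val lineage: List[Int => Int] = List(x, y, z)

def replay(input: Partition, ops: List[Int => Int]): Partition =
  ops.foldLeft(input)((part, f) => part.map(f))

val input    = Vector(1, 2, 3)              // partition as read from the source
val original = replay(input, lineage)       // computed on the node that crashes

// A replacement node is handed the same input partition and the same lineage:
val recovered = replay(input, lineage)

// Because every step is a pure function, the recomputed partition is identical.
```

This purity is precisely why side effects inside transformations break the exactly-once guarantee: replaying the lineage would replay the side effects too.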
Key Points
- MapReduce persists the full dataset after running each job. Spark can pass the output of one operation directly as the input of another, without persisting
- Spark offers an in-memory caching abstraction. Ideal for workloads where multiple operations access the same input data
- MapReduce starts a new JVM for each task, which may take seconds. Spark keeps an executor JVM running on each node, so launching a task takes a few milliseconds
- Spark's use of lazy evaluation and a DAG of consecutive stages produces an optimized execution plan which minimizes shuffling
Spark performs from 10x to 100x faster than MapReduce. Here are some of the reasons:
- One of the main limitations of MapReduce is that it persists the full dataset to HDFS after running each job. This is very expensive, because it incurs three times the size of the dataset in disk I/O (due to replication) and a similar amount of network I/O. Spark takes a more holistic view of a pipeline of operations. When the output of one operation needs to be fed into another, Spark passes the data directly without writing to persistent storage. This is an innovation over MapReduce that came from Microsoft's Dryad paper, and is not original to Spark.
- The main innovation of Spark was to introduce an in-memory caching abstraction. This makes Spark ideal for workloads where multiple operations access the same input data. Users can instruct Spark to cache input datasets in memory, so they don't need to be read from disk for each operation.
- What about Spark jobs that would boil down to a single MapReduce job? In many cases these also run faster on Spark than on MapReduce. The primary advantage Spark has here is that it can launch tasks much faster. MapReduce starts a new JVM for each task, which can take seconds with loading JARs, JITing, parsing configuration XML, etc. Spark keeps an executor JVM running on each node, so launching a task is simply a matter of making an RPC to it and passing a Runnable to a thread pool, which takes single-digit milliseconds.
- Spark uses lazy evaluation to form a directed acyclic graph (DAG) of consecutive computation stages. In this way, the execution plan can be optimized, e.g. to minimize shuffling data around. In MapReduce, by contrast, this has to be done manually by tuning each MR step.
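The "Runnable to a thread pool" launch model can be sketched with ordinary JVM concurrency primitives — this is an illustration of why submitting to a live worker is cheap, not Spark's actual RPC layer:

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

// Sketch of the executor launch model (plain JVM code, not Spark's RPC layer):
// a long-lived thread pool stands in for the executor JVM, and launching a
// "task" is just submitting work to it -- no process start, no JAR loading.
val pool = Executors.newFixedThreadPool(2)

val futures = (1 to 4).map { taskId =>
  pool.submit(new Callable[Int] {
    def call(): Int = taskId * taskId   // the task body, run on a pool thread
  })
}

val results = futures.map(_.get())      // collect each task's result
val stopped = { pool.shutdown(); pool.awaitTermination(1, TimeUnit.SECONDS) }
```

Starting a fresh JVM per task would add seconds of overhead before `call()` ever ran; submission to an already-warm pool is what keeps Spark's task launch in the millisecond range.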
Lastly, a common misconception probably worth mentioning is that Spark somehow runs entirely in memory while MapReduce does not. This is simply not the case. Spark's shuffle implementation works very similarly to MapReduce's: each record is serialized and written out to disk on the map side and then fetched and deserialized on the reduce side.
When to use MapReduce
- Linear processing of large datasets
- No intermediate results required
When to use Apache Spark
- Fast and interactive data processing
- Joining Datasets
- Graph processing
- Iterative jobs
- Real-time processing
- Machine Learning