Spark Intro - ayushmathur94/Spark GitHub Wiki
An iterative method is a procedure that uses an initial value to generate a sequence of improving approximate solutions for a class of problems, in which the n-th approximation is derived from the previous ones.
• Spark is a fast, general-purpose cluster computing system.
• Spark provides high-level APIs in Java, Scala, Python and R.
• Spark uses in-memory computation (via Random Access Memory) to achieve high performance.
• It provides an optimized engine that supports general execution graphs.
• It is designed to perform both batch processing and stream processing.
• Spark supports a rich set of high-level tools such as Spark SQL for SQL and structured data processing, GraphX for graph processing, MLlib for machine learning and Spark Streaming for stream processing.
• Spark provides fast iterative access to datasets.
• It can run on a cluster manager (such as Hadoop YARN or Apache Mesos) or in standalone mode.
• Spark can access data from any Hadoop data source, such as HBase, Cassandra, Hive or HDFS.
• Spark does not have its own storage system; it relies on HDFS or other file storage systems for storing data.
• Spark can "run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk."
1.) Speed
> The main feature of Apache Spark is its in-memory cluster computing, which increases the processing speed of an application. Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. It also achieves this speed through controlled partitioning.

2.) Powerful Caching
> A simple programming layer provides powerful caching and disk persistence capabilities.

3.) Deployment
> It can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager.

4.) Real Time
> It offers real-time computation and low latency because of in-memory computation.

5.) Polyglot
> Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. It also provides shells in Scala and Python.
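The caching point can be illustrated with a minimal plain-Python sketch (this is an analogy, not Spark's actual API): without caching, every action re-runs the whole transformation chain, while a cached (persisted) result is computed once and reused.

```python
# Pure-Python analogy for Spark's caching/persistence (illustrative only,
# not Spark code). compute_count tracks how often the "expensive"
# transformation chain is actually executed.
compute_count = 0

def expensive_transform(data):
    global compute_count
    compute_count += 1           # one full recomputation of the chain
    return [x * x for x in data]

data = range(5)

# Without caching: each "action" re-runs the whole transformation.
total = sum(expensive_transform(data))   # action 1 -> recompute
count = len(expensive_transform(data))   # action 2 -> recompute again
assert compute_count == 2

# With "caching": compute once, keep the result in memory, reuse it.
cached = expensive_transform(data)       # persisted in memory
total2 = sum(cached)                     # action 1 reuses cache
count2 = len(cached)                     # action 2 reuses cache
assert compute_count == 3                # only one extra computation
```

In real Spark this is what `rdd.cache()` / `rdd.persist()` achieve: subsequent actions read the materialized partitions instead of recomputing the lineage.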

The Driver Program in the Apache Spark architecture runs the main program of an application and creates the SparkContext, which provides access to all of Spark's basic functionality. The Spark Driver contains various other components such as the DAG Scheduler, Task Scheduler, Backend Scheduler and Block Manager, which are responsible for translating the user-written code into jobs that are actually executed on the cluster.
The Spark Driver and SparkContext collectively watch over the job execution within the cluster. The Spark Driver works with the Cluster Manager to manage various other jobs; the Cluster Manager handles resource allocation. The job is then split into multiple smaller tasks, which are distributed to the worker nodes.
Whenever an RDD is created in the SparkContext, it can be distributed across many worker nodes and can also be cached there.
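As a loose illustration (plain Python, not Spark's actual API), an RDD's data can be thought of as a collection split into partitions, each of which can live on a different worker node:

```python
# Hypothetical sketch of partitioning a dataset across workers.
# In Spark this happens automatically when an RDD is created;
# here we just round-robin items into lists.
def partition(data, num_partitions):
    """Round-robin split of a dataset into num_partitions partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, item in enumerate(data):
        parts[i % num_partitions].append(item)
    return parts

data = list(range(10))
partitions = partition(data, 3)   # e.g. one partition per worker node
print(partitions)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Each partition is the unit that a worker node can process, and cache, independently.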
Worker nodes execute the tasks assigned by the Cluster Manager and return the results to the SparkContext.
An executor is responsible for the execution of these tasks. The lifetime of executors is the same as that of the Spark Application. If we want to increase the performance of the system, we can increase the number of workers so that the jobs can be divided into more logical portions.
STEP 1: The client submits the Spark user application code. When an application code is submitted, the driver implicitly converts the user code, which contains transformations and actions, into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
STEP 2: After that, it converts the logical graph (DAG) into a physical execution plan with many stages. After converting into a physical execution plan, it creates physical execution units called tasks under each stage. The tasks are then bundled and sent to the cluster.
STEP 3: Now the driver talks to the cluster manager and negotiates resources. The Cluster Manager launches executors on worker nodes on behalf of the driver. At this point, the driver sends the tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so the driver has a complete view of all executors that are executing tasks.
STEP 4: During the course of execution of the tasks, the driver program monitors the set of executors that run them. The driver node also schedules future tasks based on data placement.
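The steps above can be sketched loosely in plain Python (an analogy only, not Spark code): the "driver" turns a job into one task per data partition, dispatches the tasks to "executors" (here, a thread pool), and combines the results.

```python
# Toy driver/executor flow: split a job into per-partition tasks,
# run them in parallel "executors", and collect the results.
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # One physical execution unit: process a single data partition.
    return sum(x * x for x in partition)

# Data already split into partitions across "worker nodes".
partitions = [[1, 2], [3, 4], [5, 6]]

# The thread pool stands in for executors launched on worker nodes.
with ThreadPoolExecutor(max_workers=3) as executors:
    partial_results = list(executors.map(task, partitions))

result = sum(partial_results)  # the driver combines executor results
print(result)  # 91
```

In real Spark, scheduling additionally takes data placement into account, sending each task to an executor close to its partition.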
Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
Spark Core > Spark SQL > Spark Streaming > MLlib > GraphX
> Spark Standalone
> Hadoop YARN
> Mesos
Spark is a distributed platform for executing complex multi-stage applications, like machine learning algorithms and interactive ad hoc queries. Spark provides an efficient abstraction for in-memory cluster computing called the Resilient Distributed Dataset (RDD).
If you have large amounts of data that requires low latency processing that a typical MapReduce program cannot provide, Spark is a viable alternative.
Access any data type across any data source.
Huge demand for storage and data processing.
Spark runs locally as well as in clusters, on premises or in the cloud. It runs on top of Hadoop YARN, Apache Mesos, standalone, or in the cloud (Amazon EC2 or IBM Bluemix). It is Spark's goal to be a general-purpose computing platform with various specialized application frameworks on top of a single unified engine.
Spark Processing: At a high level, any Spark application creates RDDs out of some input, runs (lazy) transformations of these RDDs into some other form, and finally performs actions to collect or store the data.
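That processing model can be mimicked with a toy pure-Python class (illustrative only; the real RDD API differs): transformations are merely recorded, and nothing runs until an action such as collect() executes the whole chain.

```python
# Toy imitation of Spark's lazy evaluation model (not the real RDD API).
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded (lazy) transformations

    def map(self, fn):                # transformation: only records fn
        return ToyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):           # transformation: only records pred
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):                # action: actually runs the chain
        out = list(self.data)
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.collect())  # [6, 8, 10]
```

Because transformations are only recorded, the engine can inspect the whole chain before running it, which is what lets Spark build a DAG and pipeline transformations.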
There was no powerful engine in the industry that could process data both in real time and in batch mode. There was also a requirement for an engine that could respond in sub-second latency and perform in-memory computation.