Apache Spark Note

Spark Basics

What is Spark? Spark is a fast and general engine for large-scale data processing.

Driver Program (SparkContext) --> Cluster Manager (Spark standalone, YARN) --> multiple executors (cache and tasks)

DAG (directed acyclic graph) engine optimizes workflows, so it is very fast ("100 times faster than MapReduce")

Built around one main concept: the Resilient Distributed Dataset (RDD)
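As a rough sketch (assuming PySpark; the app name and input path are made up), the driver program creates a SparkContext, points it at a cluster manager via the master setting, and builds RDDs from it:

# Minimal sketch: SparkContext in the driver; "local[*]" runs locally,
# but the master could instead point at a cluster manager such as YARN.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("RDDBasics")  # placeholder names
sc = SparkContext(conf=conf)

lines = sc.textFile("file:///tmp/some-input.txt")   # RDD from a (hypothetical) text file
numbers = sc.parallelize([1, 2, 3, 4])              # RDD from a Python list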

Lazy evaluation: Nothing actually happens in your driver program until an action is called!
(Actions: collect, count, reduce, etc.)

Actions return plain Python data types instead of RDDs. If you want to keep working with an RDD, you have to stay in the transformation domain. For example, use the "hard way" to compute count by value:

wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
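
A slightly fuller sketch of the same idea, assuming the SparkContext sc from the sketch above and a made-up sample sentence: the map/reduceByKey transformations only build up the DAG, and nothing runs until the collect() action is called.

words = sc.parallelize("the quick brown fox jumps over the lazy dog the fox".split())
wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)   # still an RDD, nothing computed yet
print(wordCounts.collect())   # collect() is the action that actually triggers the job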

SparkSQL and DataFrames

You can run SQL commands across the entire cluster to query massive data sets that might be too large for a traditional vertically scaled database (see the sketch after the DataFrames list below).

DataFrames:

  • Contain Row objects
  • Can run SQL queries
  • Can have a schema (leading to more efficient storage)
  • Read and write JSON, Hive, Parquet, CSV...
  • Communicate with JDBC/ODBC and Tableau
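
A short sketch of that workflow, assuming PySpark 2.x or later (the JSON path and view name are placeholders): read a DataFrame of Row objects, register it as a temp view, and query it with SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

people = spark.read.json("file:///tmp/people.json")    # DataFrame of Row objects (placeholder path)
people.printSchema()                                   # schema inferred from the JSON
people.createOrReplaceTempView("people")               # expose the DataFrame to SQL
spark.sql("SELECT name, age FROM people WHERE age >= 21").show()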

Spark on a cluster

Hadoop YARN: Hadoop's cluster manager
It is easier to use the AWS console to manage the cluster
EMR (Elastic MapReduce): runs Spark on top of Hadoop YARN
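
As a rough sketch, here is a job script you might copy to the master node and launch with spark-submit (on EMR the cluster manager is typically YARN by default); the script name and S3 paths are placeholders.

# my_emr_job.py -- hypothetical script; run on the master node with: spark-submit my_emr_job.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EMRWordCount").getOrCreate()
lines = spark.sparkContext.textFile("s3://my-bucket/input/*.txt")   # placeholder input path
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("s3://my-bucket/output/")                     # placeholder output path
spark.stop()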

Steps:

  • Create a key pair on EC2 and download it (.pem file)
  • Use puttygen.exe to load the .pem file and save it as a .ppk file that PuTTY can use (set it under the SSH -> Auth section in PuTTY)
  • AWS console -> EMR -> create cluster
    • choose Spark as the application; 3 instances of m3.xlarge; the EC2 key pair option
  • After the cluster is created, click on the Master public DNS -> use PuTTY to connect to it
  • Open port 22 (for SSH) in the master's security group (edit the inbound rules)