SPARK Basics

SPARK:

spark-shell --queue=mm2

Modes:

  1. Standalone
  2. Mesos
  3. YARN

Spark can connect to many storage systems and formats: HDFS, CSV, Hive, JSON, Parquet, etc.

Spark processes data in memory.

Contexts up to Spark 2.0:

  1. SparkContext
  2. HiveContext
  3. SQLContext

From Spark 2.0, the single entry point is SparkSession, which combines all three contexts. The RDD is the fundamental low-level API in the framework; even DataFrames and Datasets are turned into RDDs internally.
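As a minimal sketch (the app name and file paths are illustrative assumptions, not from these notes), creating a SparkSession and reading a couple of the storage formats listed above might look like:

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) the single entry point; the app name is illustrative.
val spark = SparkSession.builder()
  .appName("my-notes-example")
  .getOrCreate()

// The pre-2.0 contexts are all reachable through the session.
val sc = spark.sparkContext

// Reading different storage formats (paths are assumptions for illustration).
val parquetDf = spark.read.parquet("hdfs:///data/events.parquet")
val jsonDf    = spark.read.json("hdfs:///data/events.json")
val csvDf     = spark.read.option("header", "true").csv("hdfs:///data/events.csv")
```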

An RDD is just a plan; it does not materialize anything at first. Only when the RDD is executed does it compute its data in memory. To keep the computed data around for reuse, use the rdd.persist method.
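A hedged sketch of persisting an RDD so that repeated actions reuse the computed data (the input path is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")  // illustrative path
  .filter(_.nonEmpty)

lines.persist(StorageLevel.MEMORY_ONLY)  // equivalent to lines.cache()

println(lines.count())  // first action: computes the RDD and caches it
println(lines.count())  // second action: served from the cache
```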

RDD operations:

  1. Transformations, e.g. map
  2. Actions, e.g. count

Because evaluation is lazy, Spark builds a DAG (Directed Acyclic Graph) as the execution plan.
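One way to see that plan is to print an RDD's lineage; a small sketch:

```scala
// Chain of transformations; nothing executes yet, Spark only records the plan.
val plan = sc.parallelize(1 to 100)
  .map(_ + 1)
  .filter(_ % 2 == 0)

// toDebugString prints the recorded lineage (the DAG Spark will execute).
println(plan.toDebugString)
```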

Spark is an in-memory computation framework. There are three ways to create an RDD:

  1. From a file
  2. From data in memory
  3. From another RDD

val myList = List(1, 2, 3, 4)
val rdd1 = sc.parallelize(myList)
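That covers creation from data in memory. Hedged sketches of the other two routes (the file path is an assumption for illustration):

```scala
// 1. From a file
val fromFile = sc.textFile("hdfs:///data/input.txt")

// 3. From another RDD, by applying a transformation
val rdd2 = rdd1.map(_ * 2)
```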

Transformation: the result is another RDD. A transformation alone does not pull data into memory, because evaluation is lazy.

Action: when an action is called, Spark finds all the transformations the action depends on and pulls records through them. Transformations declared before the action that are not relevant to it are never executed, so no data is pulled for them (see the sketch below).
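A small sketch of that behaviour: only the lineage the action depends on is computed.

```scala
val nums = sc.parallelize(1 to 10)

val evens   = nums.filter(_ % 2 == 0)  // transformation: nothing runs yet
val squares = nums.map(n => n * n)     // transformation: also not run

// Action: executes only the filter lineage; the map above is never computed.
println(evens.count())
```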

Commonly used transformations: map and filter (both shown in the sketch above).

Run RuleEngine:

cdts10hdbe01d:mm2dusr:/development/mm2/apps/ProjectEagle/RulesEngine/current/scripts> ./startRuleEngine.sh