SPARK Basics - adarshatm/My-Notes GitHub Wiki
SPARK:

`spark-shell --queue=mm2`
Modes:
- Standalone
- Mesos
- YARN

Spark can connect to any storage: HDFS, CSV, Hive, JSON, Parquet, etc.

Spark processes data in memory.
Contexts until Spark 2.0:
- SparkContext
- HiveContext
- SQLContext

From Spark 2.0 the entry point is SparkSession, which combines all three contexts.

RDD: the fundamental low-level API in the framework. Even DataFrames and Datasets are turned into RDDs internally.
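A minimal sketch of the Spark 2.0 entry point (in spark-shell the session is already predefined as `spark`; in a standalone app you build it yourself — the app name and `local[*]` master here are arbitrary example values):

```scala
import org.apache.spark.sql.SparkSession

// From Spark 2.0, SparkSession is the single entry point.
val spark = SparkSession.builder()
  .appName("BasicsDemo")   // arbitrary application name
  .master("local[*]")      // local sketch; on a cluster this comes from spark-submit
  .getOrCreate()

// The older contexts' functionality is reachable through the session:
val sc = spark.sparkContext        // the SparkContext
val sqlDf = spark.sql("SELECT 1")  // SQL (and Hive, if enabled) via spark.sql
```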
An RDD is just an execution plan; it stores nothing at first. When the RDD is executed, the data is computed in memory. To keep the computed data around for reuse, call rdd.persist (or rdd.cache).
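A small persist sketch (assumes a spark-shell session where `sc` is already defined): the first action computes and caches the data, the second reuses the cached partitions instead of recomputing.

```scala
// Just a plan so far: nothing is computed yet.
val rdd = sc.parallelize(1 to 100).map(_ * 2)

rdd.persist()          // mark the RDD to be kept in memory once computed
val total = rdd.sum()  // first action: computes the data and caches it (20200.0... no: 10100.0)
val n = rdd.count()    // second action: served from the cache, no recomputation

rdd.unpersist()        // release the cached partitions when done
```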
RDD operations:
- Transformations, e.g. map
- Actions, e.g. count
Because evaluation is lazy, Spark builds a DAG (Directed Acyclic Graph) as the execution plan.
Spark is an in-memory computation framework. There are three ways to create an RDD:
- From a file
- From data in memory
- From another RDD
```scala
val myList = List(1, 2, 3, 4)
val rdd1 = sc.parallelize(myList)
```
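The three creation paths above can be sketched in one spark-shell session (assumes `sc` is predefined; the file path is a hypothetical example — textFile is lazy, so nothing is read until an action runs):

```scala
// 1. From data in memory
val rddMem = sc.parallelize(List(1, 2, 3, 4))

// 2. From a file (hypothetical example path)
val rddFile = sc.textFile("/tmp/input.txt")

// 3. From another RDD, via a transformation
val rddDerived = rddMem.map(_ * 10)
```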
Transformation: the result is another RDD. A transformation does not pull data into memory, because evaluation is lazy.
Action: when an action is called, Spark finds all the transformations the action depends on and pulls records through them. Transformations declared before the action that are not relevant to it are not executed, so no data is pulled through them.
Commonly used transformations: map and filter.
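A minimal map/filter example (assumes a spark-shell session where `sc` is predefined); the two transformations only build the plan, and nothing runs until the action at the end:

```scala
val nums = sc.parallelize(List(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)       // transformation: builds the plan, no data moves
val bigOnes = doubled.filter(_ > 4) // transformation: still lazy
val howMany = bigOnes.count()       // action: executes the DAG
```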
Run RuleEngine:

`cdts10hdbe01d:mm2dusr:/development/mm2/apps/ProjectEagle/RulesEngine/current/scripts> ./startRuleEngine.sh`