Spark - salmanbaig8/imp GitHub Wiki

URL:https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/courseware/50e2f47dec3341ab984fb0505c202b99/7f3e68eea7e7416e9481ec7e69f212b4/

•Explain the purpose of Spark: Spark is a computing platform designed to be fast, general purpose, and easy to use.
 -Speed: in-memory computation; faster than MapReduce (which it extends) for complex applications on disk
 -Generality: covers a wide range of workloads on one system: batch applications (MapReduce-style), iterative algorithms, interactive queries, and streaming
 -Ease of use: APIs for Scala, Python, and Java; libraries for SQL, machine learning, streaming, and graph processing; runs on Hadoop clusters or standalone

Why Spark? Parallel distributed processing, fault tolerance on commodity hardware, scalability, in-memory computing, and high-level APIs.

•List and describe the components of the Spark unified stack: Spark SQL, Spark Streaming (real-time processing), MLlib (machine learning), GraphX (graph processing), and Spark Core, which runs on the Standalone Scheduler, YARN, or Mesos.

•Understand the basics of Resilient Distributed Dataset (RDD): Spark's primary abstraction; a distributed collection of elements parallelized across the cluster, with fault tolerance and caching. Two types of RDD operations:
 -Transformations: build up a DAG; lazily evaluated; each returns a new RDD rather than computed data
 -Actions: trigger execution of the transformations that precede them and return a value to the driver
Example RDD flow: Hadoop RDD -> Filtered RDD -> Mapped RDD -> Reduced RDD
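The transformation/action split above can be mimicked with lazy iterators in plain Python (a conceptual sketch only; no Spark involved, and the sample log lines are made up for illustration):

```python
# Analogy to the RDD flow above, using plain Python.
# filter() and map() build lazy pipelines (like transformations):
# nothing is computed until sum() consumes them (like an action).
lines = ["ERROR disk full", "INFO ok", "ERROR timeout"]  # stand-in for a Hadoop RDD's data

filtered = filter(lambda l: l.startswith("ERROR"), lines)  # "Filtered RDD": not evaluated yet
mapped = map(lambda l: 1, filtered)                        # "Mapped RDD": still lazy
total = sum(mapped)                                        # the "action" forces evaluation

print(total)  # 2 (two ERROR lines counted)
```

In real Spark the same flow would read `sc.textFile(...).filter(...).map(...).reduce(...)`, with the added benefits of cluster parallelism, fault tolerance via lineage, and optional caching.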

•Downloading and installing Spark standalone
•Scala and Python overview
•Launch and use Spark's Scala and Python shells
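The install-and-launch steps above look roughly like this for a standalone setup (the release version shown is an illustrative assumption; pick a pre-built package from https://spark.apache.org/downloads.html):

```shell
# Unpack a pre-built Spark release (version number is illustrative)
tar xzf spark-3.5.0-bin-hadoop3.tgz
cd spark-3.5.0-bin-hadoop3

./bin/spark-shell   # launch the interactive Scala shell
./bin/pyspark       # launch the interactive Python shell
```

Inside either shell a SparkContext is pre-defined as `sc`, so you can try the RDD flow directly, e.g. in pyspark: `sc.parallelize(range(10)).filter(lambda x: x % 2 == 0).count()`.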