BD: Spark - dudycooly/1235 GitHub Wiki
Analytics - The process of extracting meaningful insight from the data you have (Are my sales going to grow? Why are we not meeting the target?)
Batch Processing - Processing data collected over a period of time to answer non-time-critical queries
Real Time Processing - Processing data from the field as it arrives to answer time-critical queries (Spark is not true real time like Storm; it does micro-batching)
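The difference between one batch pass and micro-batching can be sketched in plain Python. This is only a toy illustration, not Spark code: the `micro_batches` helper, the sample `events`, and the 1-second interval are all made up for the example.

```python
from collections import defaultdict

def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for ts, value in events:
        # Every event falls into the window that starts at or before it.
        window_start = int(ts // interval_s) * interval_s
        batches[window_start].append(value)
    # Return windows in time order, the order a micro-batch engine would use.
    return sorted(batches.items())

# Hypothetical sensor readings: (seconds since start, reading)
events = [(0.2, 10), (0.9, 12), (1.1, 7), (2.5, 3), (2.7, 4)]

# Batch processing: one answer over everything collected so far.
batch_total = sum(value for _, value in events)

# Micro-batching: one small answer per 1-second window.
window_totals = [(start, sum(values)) for start, values in micro_batches(events, 1)]
```

The micro-batch results arrive window by window, so the latency is bounded by the interval rather than by the size of the whole data set, but it is never lower than the interval itself.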
Apache Spark: High Performance Distributed Cluster Computing System meant for both Batch and Real Time processing
Alternatives:
- Hadoop + MapReduce (Batch Processing)
- Hadoop + Apache Storm (Real Time Processing)
Applications: May not be a good fit for small-scale data (data that fits on a single server)
Building Blocks:
- Resilient Distributed Dataset (RDD) - Manages data distribution, computation and coordination
- Directed Acyclic Graph (DAG) or Lineage - Execution flow in the form of a graph whose nodes are RDDs and whose edges are the transformations between them
- SparkContext - Orchestration; entry point for a Spark application
- Transformation - Lazily derives a new RDD, e.g. CSV to RDD, RDD to RDD (filter)
- Action - Triggers execution of the pending transformations and returns a result, e.g. collect
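How lazy transformations, lineage and actions fit together can be sketched with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in written for these notes, not Spark's API; in PySpark the corresponding real calls would be `RDD.map`, `RDD.filter` and `RDD.collect` on an RDD created through a `SparkContext`.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original input data
        self._lineage = lineage    # tuple of (step name, function) pairs

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    # Action: replay the recorded lineage over the source and return a list.
    def collect(self):
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def describe(self):
        return " -> ".join(["source"] + [kind for kind, _ in self._lineage])


parsed = []  # records which lines have actually been parsed

def parse(line):
    parsed.append(line)
    return int(line)

# Build the lineage: nothing runs yet because these are transformations.
evens = ToyRDD(["1", "2", "3", "4"]).map(parse).filter(lambda n: n % 2 == 0)
parsed_before_action = list(parsed)   # still empty at this point

# The action walks the lineage and produces the result.
result = evens.collect()
```

Because each `ToyRDD` keeps its source and the steps that produced it, a lost result can be rebuilt by replaying the lineage, which is the idea behind the "resilient" in RDD.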
Problem addressing: Originally Hadoop
- was only meant for batch processing
- was slow in processing (How slow?), since MapReduce writes intermediate results to disk between stages
- had a complicated programming model (MapReduce)
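To make the complexity point concrete, here is a toy word count written as explicit map, shuffle and reduce phases in plain Python. It only imitates the shape of the model; the function names and sample lines are invented for the example and are not Hadoop's API.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["spark is fast", "hadoop is batch"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

Every job has to be forced into this map/shuffle/reduce shape, and multi-step jobs chain several such rounds. Spark's RDD API expresses the same word count as a short chain of `flatMap`, `map` and `reduceByKey`.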
Good Read: https://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html